
Category Escapes

A category escape matches a character from a set specified by a property or using a block:

    \p    indicates match any character in the set.
    \P    indicates match any character not in the set.

Categories and Properties

Any character can be matched by its properties using a category escape consisting of a Category code followed by an optional Property code:

    \p{L}     Any Letter
    \p{Lu}    Any Upper-case Letter
    \p{Ll}    Any Lower-case Letter
    \p{Lt}    Any Title-case Letter
    \p{Lm}    Any Letter Modifier
    \p{Lo}    Any “Other” Letter
    \p{M}     Any Mark
    \p{Mn}    Any Non-Spacing Mark
    \p{Mc}    Any Spacing Combining Mark
    \p{Me}    Any Enclosing Mark
    \p{N}     Any Number
    \p{Nd}    Any Decimal Digit
    \p{Nl}    Any Letter Number
    \p{No}    Any “Other” Number
    \p{P}     Any Punctuation Character
    \p{Pc}    Any Connector Character
    \p{Pd}    Any Dash Character
    \p{Ps}    Any Open Character
    \p{Pe}    Any Close Character
    \p{Pi}    Any Initial Quote Character
    \p{Pf}    Any Final Quote Character
    \p{Po}    Any “Other” Punctuation
    \p{Z}     Any Separator Character
    \p{Zs}    Any Space Separator
    \p{Zl}    Any Line Separator
    \p{Zp}    Any Paragraph Separator
    \p{S}     Any Symbol Character
    \p{Sm}    Any Math Symbol
    \p{Sc}    Any Currency Symbol
    \p{Sk}    Any Modifier Symbol
    \p{So}    Any “Other” Symbol
    \p{C}     Any “Other” Character
    \p{Cc}    Any Control Character
    \p{Cf}    Any Format Character
    \p{Co}    Any Private Use Character
    \p{Cn}    Any “Not Assigned” Character

Character Blocks

Any character within a Unicode character block can be matched using a category escape consisting of “Is” followed by the block's name. For example: \p{IsBasicLatin}

    Block  Block  Block
    Start  End    Name
    0000   007F   BasicLatin
    0080   00FF   Latin-1Supplement
    0100   017F   LatinExtended-A
    0180   024F   LatinExtended-B
    0250   02AF   IPAExtensions
    02B0   02FF   SpacingModifierLetters
    0300   036F   CombiningDiacriticalMarks
    0370   03FF   Greek
    0400   04FF   Cyrillic
    0530   058F   Armenian
    0590   05FF   Hebrew
    0600   06FF   Arabic
    0700   074F   Syriac
    0780   07BF   Thaana
    0900   097F   Devanagari
    0980   09FF   Bengali
    0A00   0A7F   Gurmukhi
    0A80   0AFF   Gujarati
    0B00   0B7F   Oriya
    0B80   0BFF   Tamil
    0C00   0C7F   Telugu
    0C80   0CFF   Kannada
    0D00   0D7F   Malayalam
    0D80   0DFF   Sinhala
    0E00   0E7F   Thai
    0E80   0EFF   Lao
    0F00   0FFF   Tibetan
    1000   109F   Myanmar
    10A0   10FF   Georgian
    1100   11FF   HangulJamo
    1200   137F   Ethiopic
    13A0   13FF   Cherokee
    1400   167F   UnifiedCanadianAboriginalSyllabics
    1680   169F   Ogham
    16A0   16FF   Runic
    1780   17FF   Khmer
    1800   18AF   Mongolian
    1E00   1EFF   LatinExtendedAdditional
    1F00   1FFF   GreekExtended
    2000   206F   GeneralPunctuation
    2070   209F   SuperscriptsandSubscripts
    20A0   20CF   CurrencySymbols
    20D0   20FF   CombiningMarksforSymbols
    2100   214F   LetterlikeSymbols
    2150   218F   NumberForms
    2190   21FF   Arrows
    2200   22FF   MathematicalOperators
    2300   23FF   MiscellaneousTechnical
    2400   243F   ControlPictures
    2440   245F   OpticalCharacterRecognition
    2460   24FF   EnclosedAlphanumerics
    2500   257F   BoxDrawing
    2580   259F   BlockElements
    25A0   25FF   GeometricShapes
    2600   26FF   MiscellaneousSymbols
    2700   27BF   Dingbats
    2800   28FF   BraillePatterns
    2E80   2EFF   CJKRadicalsSupplement
    2F00   2FDF   KangxiRadicals
    2FF0   2FFF   IdeographicDescriptionCharacters
    3000   303F   CJKSymbolsandPunctuation
    3040   309F   Hiragana
    30A0   30FF   Katakana
    3100   312F   Bopomofo
    3130   318F   HangulCompatibilityJamo
    3190   319F   Kanbun
    31A0   31BF   BopomofoExtended
    3200   32FF   EnclosedCJKLettersandMonths
    3300   33FF   CJKCompatibility
    3400   4DB5   CJKUnifiedIdeographsExtensionA
    4E00   9FFF   CJKUnifiedIdeographs
    A000   A48F   YiSyllables
    A490   A4CF   YiRadicals
    AC00   D7A3   HangulSyllables
    E000   F8FF   PrivateUse
    F900   FAFF   CJKCompatibilityIdeographs
    FB00   FB4F   AlphabeticPresentationForms
    FB50   FDFF   ArabicPresentationForms-A
    FE20   FE2F   CombiningHalfMarks
    FE30   FE4F   CJKCompatibilityForms
    FE50   FE6F   SmallFormVariants
    FE70   FEFE   ArabicPresentationForms-B
    FEFF   FEFF   Specials
    FF00   FFEF   HalfwidthandFullwidthForms
    FFF0   FFFD   Specials

Regular Expression Examples

    ^[A-Za-z]
An ASCII letter at the start of a string or line.

    ^\p{Lu}
An upper-case Unicode letter at the start of a string or line.

    \.$
A period at the end of a string or line.

    \p{IsGreek}+
One or more characters from the Greek block.

    \p{IsGreek}{1,}
One or more characters from the Greek block (same as above).

    .*?;
Up to and including the next semicolon.

    .*;
Up to and including the last semicolon.

    ^\c+$
Match only if the string consists entirely of XML name characters.

    [ -~-[\[\]]]+
Any ASCII printable character except the square brackets.

    \w+
A "word".

    [^\s]+
Non-white-space characters.

    \S+
Non-white-space characters.

    (['"])(.*?)\1
A string delimited by single or double quotes. $2 or regex-group(2) will return the unquoted substring. (\1 is the quote character used.)

    \s*(\i\c*)\s*=\s*(["'])(.*?)\2
An XML-attribute-like name, equal sign and quoted value (with optional leading and intervening white space). $1 is the name and $3 is the value.

    \((\d+|\p{L}+)\)
A parenthesized sequence either of digits or of letters (but not a mixture of both).

    \p{Sc}(\d+(\.\d*)?|\.\d+)
A decimal number with a leading currency symbol.

XSLT 2.0: http://www.w3.org/TR/xslt20/
XQuery 1.0: http://www.w3.org/TR/xquery/
XPath 2.0: http://www.w3.org/TR/xpath20/
Unicode: http://www.unicode.org
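
The category codes and block ranges listed on this card can be explored outside an XSLT or XQuery processor. What follows is a minimal Python sketch and not part of the original card. Python's standard re module does not support \p{...}, so the sketch uses unicodedata.category() for category codes and a plain code-point range check for blocks. The helper names (category, in_category, block_of) and the four-row BLOCKS excerpt are illustrative assumptions, and Python's Unicode database may follow a different Unicode version than a given processor.

```python
import unicodedata

# A few rows copied from the Character Blocks table: (start, end, name).
# This is only an excerpt chosen for illustration.
BLOCKS = [
    (0x0000, 0x007F, "BasicLatin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x20A0, 0x20CF, "CurrencySymbols"),
]


def category(ch):
    """Return the two-letter Unicode general category of ch, e.g. 'Lu'."""
    return unicodedata.category(ch)


def in_category(ch, code):
    """Approximate \\p{code}: a one-letter code such as 'L' covers the
    whole family (Lu, Ll, Lt, Lm, Lo); a two-letter code must match exactly."""
    return unicodedata.category(ch).startswith(code)


def block_of(ch):
    """Return the block name for ch from the BLOCKS excerpt, or None."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return None
```

With these helpers, in_category("a", "L") is true while in_category("a", "Lu") is false, which mirrors the difference between \p{L} and \p{Lu}; block_of("λ") names the Greek block, the set matched by \p{IsGreek}.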
Regular Expressions in XSLT 2.0, XQuery 1.0 and XPath 2.0

Sam Wilmott
sam@wilmott.ca
http://www.wilmott.ca

Mulberry Technologies, Inc.
17 West Jefferson Street, Suite 207
Rockville, MD 20850 USA
Phone: +1 301/315-9631
Fax: +1 301/315-8285
info@mulberrytech.com
http://www.mulberrytech.com

© 2007-2008 Sam Wilmott and Mulberry Technologies, Inc.
2008-07-21

Escaping Characters

Characters that have special meaning in regular expressions need to be escaped with a preceding backslash if they are to be represented “as is”. These characters are:

    \ | . ? * + ( ) { } [ ] - ^ $

In addition, the following escapes represent single characters:

    \n    newline or line-feed character (&#xA;)
    \r    carriage return character (&#xD;)
    \t    tab character (&#x9;)

Multi-Character Escapes

    . (dot)   Any Non-Line-End Character
    \s        Any Space Character
    \i        Any Initial Name Character (including '_' and ':')
    \c        Any Name Character (including '.', '-', '_' and ':')
    \d        Any Decimal Digit
    \w        Any “Word” Character (anything other than Punctuation, Separator or “Other”)

An upper-case multi-character escape matches any character not described by the lower-case escape. The upper-case escapes are:

    \S \I \C \D \W

Character Class Expressions

A character class expression matches a single character. It's wrapped in square brackets and consists of three parts:

1. an optional negation indicator, ^.
2. one or more characters or ranges, and
3. an optional character class subtraction.

If the negation indicator is used, the single character matched is any character that is neither listed after it nor within one of the given ranges.

A character range consists of two characters separated by a dash, as in:

    [-a-zA-Z0-9_]

A leading dash (-) is a dash, not a range.

A character class subtraction consists of a dash followed by a character, category escape or nested character class expression, as in:

    [a-z-[aeiou]]

i.e. Match lower-case letters but not the vowels.

XPath 2.0 and XQuery 1.0 Functions That Use Regular Expressions

    matches(xs:string?, xs:string) as xs:boolean
    matches(xs:string?, xs:string, xs:string) as xs:boolean
    replace(xs:string?, xs:string, xs:string) as xs:string
    replace(xs:string?, xs:string, xs:string, xs:string) as xs:string
    tokenize(xs:string?, xs:string) as xs:string*
    tokenize(xs:string?, xs:string, xs:string) as xs:string*

XSLT 2.0 Instructions That Use Regular Expressions

    <xsl:analyze-string select = expression
                        regex = { string }
                        flags = { string }>
      <xsl:matching-substring>
        sequence-constructor
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        sequence-constructor
      </xsl:non-matching-substring>
      xsl:fallback*
    </xsl:analyze-string>

One but not both of xsl:matching-substring and xsl:non-matching-substring can be omitted.

Inside xsl:matching-substring, the regex-group(N) function returns the Nth group captured by the regular expression.

Regular Expression Matching Flags

Flags are letters used to indicate how Regular Expression matching is to be done:

    s    Dot (.) matches any character, line-end characters included.
    m    ^ and $ match at the start and end of all lines, not just the
         start and end of the selected string as a whole.
    i    Match case insensitive.
    x    Remove white-space (space, tab and line-end) characters from
         the regular expression before using it.

Zero or more flags are specified as a string using the optional flags= attribute of xsl:analyze-string or the optional last argument of the matches, replace and tokenize functions.

Regular Expression Basics

A regular expression is:

    oneThing | anotherThing | yetAnother
Match one thing or another or another (one or more things).

    oneThing anotherThing yetAnother
Match one thing followed by another etc. (one or more things).

    atom quantifier
Match atom the number of times indicated by quantifier; once if quantifier is omitted.

Where atom is any of:

    an unescaped character,
    an escaped character,
    a parenthesized regular expression, or
    a character class expression.

Where quantifier is any of:

    ?        zero or one times (i.e. optional)
    *        zero or more times
    +        one or more times
    {N}      exactly N times
    {N,}     N or more times
    {N,M}    between N and M times inclusive.

An extra trailing ?, as in ??, +? or {N,M}?, means match the shortest possible number of repetitions rather than the (default) longest.

Line Starts and Ends

A regular expression can be anchored at the start and/or end of a string using ^ (the start) and $ (the end). If a regular expression is used with the m flag, ^ and $ match at the start and end of each line.

In the absence of ^ or $, a regular expression matches unanchored: anywhere within the string.

Subexpressions and Back References

Each parenthesized group in a regular expression is assigned a group number by counting unescaped left parentheses starting from the left.

Group numbers can be used in three ways:

1. Within a regular expression, to match what was matched by a previous subexpression. A previously matched group is identified by backslash and a number: \1, \2 etc.
2. Within the replacement string of replace, to insert what was matched by a subexpression. A group is identified by a numeric name: $1, $2 etc. As well, $0 identifies the whole matched substring.
3. Within XSLT, as regex-group(N), to access the matched subexpression.
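
The quantifier rules (longest match by default, shortest with a trailing ?) and the uses of group numbers can be tried in Python's re module, which shares this part of the syntax. This is a hedged sketch rather than XPath itself: Python writes a group reference in a replacement as \1 where replace() uses $1, and it does not support \i, \c, \p{...} or character class subtraction. The sample strings are invented for illustration.

```python
import re

text = "a=1; b=2; c=3;"

# Default (greedy) quantifier: .* runs up to and including the LAST semicolon.
greedy = re.match(r".*;", text).group(0)

# Trailing ? makes it lazy: .*? stops at the NEXT semicolon.
lazy = re.match(r".*?;", text).group(0)

# {N,M} takes as many repetitions as it can; {N,M}? takes as few as it can.
longest = re.match(r"\d{2,4}", "12345").group(0)
shortest = re.match(r"\d{2,4}?", "12345").group(0)

# Back reference inside the pattern: \1 must repeat the opening quote,
# so group 2 is the unquoted substring (what regex-group(2) would return).
quoted = re.search(r"""(['"])(.*?)\1""", "say 'hello' now")
unquoted = quoted.group(2)

# Group reference in a replacement: XPath's $2=$1 is written \2=\1 in Python.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")
```

Here greedy is the whole input string and lazy is "a=1;", matching the card's descriptions of .*; and .*?; in the examples.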

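The matching flags and the matches and tokenize functions have rough counterparts in Python, which the sketch below uses to illustrate their behaviour. The mapping of flag letters to re constants and the wrapper names (to_re_flags, matches, tokenize) are assumptions made for this example, not definitions from the XPath specification. The dialects also differ in details: for instance, Python's re.VERBOSE additionally treats # as a comment marker, and re.split returns captured groups and a single empty string for empty input.

```python
import re

# Assumed mapping from the card's flag letters to Python's re flags.
FLAGS = {
    "s": re.DOTALL,      # dot also matches line-end characters
    "m": re.MULTILINE,   # ^ and $ match at every line start and end
    "i": re.IGNORECASE,  # case-insensitive matching
    "x": re.VERBOSE,     # white space in the pattern is ignored
}


def to_re_flags(flags=""):
    """Combine a flag string such as "mi" into one Python re flag value."""
    value = 0
    for letter in flags:
        value |= FLAGS[letter]
    return value


def matches(text, pattern, flags=""):
    """Rough analogue of matches(): unanchored unless ^ or $ is used."""
    return re.search(pattern, text, to_re_flags(flags)) is not None


def tokenize(text, pattern, flags=""):
    """Rough analogue of tokenize(): split text at each match of pattern."""
    return re.split(pattern, text, flags=to_re_flags(flags))
```

For example, matches("one\ntwo", "^two$") is false but becomes true with the "m" flag, and tokenize("a, b,c", r",\s*") yields the three tokens a, b and c.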