C#Today - Your Just-In-Time Resource for C# Code and Techniques

February 1, 2002

Introducing .NET Regular Expressions with C#


by Tony Loton

CATEGORY: Other Technologies


ARTICLE TYPE: Cutting Edge

ABSTRACT
If you’re building any kind of application that involves looking for patterns in text, picking out segments
of text according to certain criteria, or transforming the text itself, you can save a lot of time and
energy by becoming familiar with regular expressions. In this article, Tony Loton demonstrates the
Regular Expression language itself, and also introduces the main classes of the .NET
System.Text.RegularExpressions namespace that allow you to harness the power of regular
expressions from within your C#, Visual Basic or C++ programs.

ARTICLE

Editor's Note: This article's code has been updated to work with the final release of the .NET Framework.

If you're building any kind of application that involves looking for patterns in text, picking out segments of text
according to certain criteria, or transforming the text itself, you can save a lot of time and energy by becoming
familiar with regular expressions.

In this article, I'll demonstrate the Regular Expression language itself, which is not unique to the .NET framework.
That discussion may well be of interest if you're thinking of using the RE engine of the JDK 1.4, or if you're
working with one of the older languages - like AWK or Perl - that support regular expressions. Then I'll introduce
the main classes of the .NET System.Text.RegularExpressions namespace that allow you to harness the
power of regular expressions from within your C#, Visual Basic or C++ programs.

Introduction

Consider the following problem: how many lines of code do you think you would need in order to transform all the
$ (dollar) amounts in the following sentence from the form "$10" to the form "10 US dollars"?

Each item is priced at $10 but you can purchase ten for only $80 which
is much cheaper than 10 * $10 = $100.

One solution would be to loop through the string using strtok (for C) or StringTokenizer (for Java), read
ahead from each $ symbol that you found up to the next word boundary, and meanwhile build a second version of
the string using concatenation.

Or, you could use the .NET Regular Expression engine and this single line of code:

http://www.csharptoday.com/content/articles/20020201.asp?WROXE... 2002-07-10

String newText = Regex.Replace(sourceText, @"\$(?<amount>\d*)",
    "${amount} US dollars");

Which would give:

Each item is priced at 10 US dollars but you can purchase ten for only 80 US dollars which
is much cheaper than 10 * 10 US dollars = 100 US dollars.
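
If you'd like to try that one-liner for yourself, here is a minimal sketch of a complete console program built around it. The DollarRewriter class and its Rewrite method are names I've made up for illustration; only Regex.Replace comes from the .NET class library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DollarRewriter
{
    // Rewrites every "$<digits>" in the text as "<digits> US dollars".
    public static string Rewrite(string sourceText)
    {
        // \$ is a literal dollar sign; (?<amount>\d*) captures the digits after it.
        // ${amount} in the replacement string inserts the captured digits.
        return Regex.Replace(sourceText, @"\$(?<amount>\d*)", "${amount} US dollars");
    }

    public static void Main()
    {
        string sourceText =
            "Each item is priced at $10 but you can purchase ten for only $80 "
            + "which is much cheaper than 10 * $10 = $100.";

        Console.WriteLine(Rewrite(sourceText));
    }
}
```

Run as it stands, the program should print the transformed sentence shown above.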

Other practical uses include parsing (of HTML / XML content, or even natural language), feature extraction (to pick
out names and addresses from documents), and even the humble search-and-replace operation of a text editor.

On the subject of HTML and XML, in a previous article entitled "Working with Web Data in
C#" (http://www.csharptoday.com/content/articles/20020128.asp) I hinted at a practical use for regular
expressions to extract data from web pages. In the following screenshot, the SELECT and MATCHES clauses both
contain regular expressions.

The text " .html#0.table#1.tr#0.td#1.table#1.tr#1.td#0.table#0.tr#\d+.td#0$ " is a regular expression that
matches certain HTML elements - in this case book titles from the Wrox web site - according to their positions
on the page.

The text " \w*.NET\w* " is a regular expression that picks out only those titles that mention .NET.

Later you'll see that the RE language elements used in those examples are:

\d (as in .tr#\d+ above) - a character class that matches any decimal digit.

+ (as in .tr#\d+ above) - a quantifier that matches one or more occurrences of the preceding element.

$ (at the end of the SELECT string) - an atomic zero-width assertion that ensures a match up to the end of the
string.

\w (as in \w*.NET\w*) - a character class that matches any word character.

* (as in \w*.NET\w*) - a quantifier that matches zero or more occurrences of the preceding element.
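
To see \w and * at work, here is a small sketch that tries the title pattern against three made-up titles (the titles, the DotNetTitleFilter class and both method names are my own assumptions, not taken from the screenshot). It also shows something worth knowing about that pattern: an unescaped dot matches any character, so a stricter version would escape it as \. to accept only a literal full stop.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNetTitleFilter
{
    // Unescaped dot, as in the MATCHES clause: "." accepts any single character.
    public static bool LooseMatch(string title)
    {
        return Regex.IsMatch(title, @"\w*.NET\w*");
    }

    // Escaped dot: only a literal "." is accepted in front of "NET".
    public static bool StrictMatch(string title)
    {
        return Regex.IsMatch(title, @"\w*\.NET\w*");
    }

    public static void Main()
    {
        // Hypothetical titles, chosen only to exercise the two patterns.
        string[] titles = { "Professional ASP.NET", "Beginning C#", "INTERNET Basics" };

        foreach (string title in titles)
        {
            Console.WriteLine(title + ": loose=" + LooseMatch(title)
                + ", strict=" + StrictMatch(title));
        }
    }
}
```

The third title is there to show the difference: the loose pattern accepts it because "R" followed by "NET" satisfies the unescaped dot, whereas the strict pattern rejects it.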

Regular Expressions Language Elements

We'll now look a little deeper at the elements that comprise the regular expressions language, which the .NET
documentation splits up into these categories:

• Character Classes and Character Escapes
• Substitutions
• Atomic Zero-Width Assertions
• Quantifiers
• Grouping Constructs, Backreference Constructs and Alternation Constructs

The aim is not to provide an exhaustive account of each language element, but to demonstrate the use of each
kind of element via a simple concrete example. For consistency, I'll make use of a single test sentence
throughout, which I introduced earlier as:

Each item is priced at $10 but you can purchase ten for only $80 which
is much cheaper than 10 * $10 = $100.

After explaining the regular expression syntax, I'll show you the C# code that I used to drive the examples.

Character Escapes

The characters . $ ^ { [ ( | ) * + ? \ have special meanings as operators in regular expressions. You will see
later how $ is an Atomic Zero-Width Assertion and how * is a Quantifier. This leaves us with a problem if we
want to use a regular expression to discover, for example, the occurrences of the literal $ character in a given
text. We solve the problem by placing a backslash in front of any operator that we wish to match literally.

Consider our test sentence:

Each item is priced at $10 but you can purchase ten for only $80 which
is much cheaper than 10 * $10 = $100.

A regular expression "\$" (ignore the quotes, and notice the backslash) will match four times, corresponding with
the four occurrences of the $ sign, whereas a regular expression of "$" (no backslash) will match only once as the
special end-of-string assertion. More about end-of-string later.

Similarly, the regular expression "\*" will match once, whereas "*" will trigger an exception in C# / .NET with the
following description (because the "*" has been treated as a quantifier rather than a literal character).

An unhandled exception of type 'System.ArgumentException' occurred in system.dll. Additional information: Parsing
"*" - Quantifier {x,y} following nothing.

So, any operator character will be treated literally when escaped, that is, when preceded by a backslash.
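
The escaping rules above can be checked with a short sketch like the following. EscapeDemo and its Count helper are illustrative names of my own; Regex.Matches and Regex.Escape are part of the .NET class library, the latter adding the backslashes for you when the text to be matched arrives at run time.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // The test sentence used throughout the article.
    public const string Sentence =
        "Each item is priced at $10 but you can purchase ten for only $80 "
        + "which is much cheaper than 10 * $10 = $100.";

    // Counts how many times the pattern matches in the test sentence.
    public static int Count(string pattern)
    {
        return Regex.Matches(Sentence, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count(@"\$"));              // literal dollar signs
        Console.WriteLine(Count("$"));                // the end-of-string position
        Console.WriteLine(Count(@"\*"));              // the literal asterisk
        Console.WriteLine(Count(Regex.Escape("$")));  // Regex.Escape produces \$ for us

        try
        {
            Count("*");  // a quantifier with nothing in front of it
        }
        catch (ArgumentException e)
        {
            Console.WriteLine("Invalid pattern: " + e.Message);
        }
    }
}
```

The first four lines of output should be 4, 1, 1 and 4, followed by the parsing error for the bare asterisk.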

For a complete list of Character Escapes, look in the "Character Escapes" section of the ".NET Framework General
Reference".

Character Classes

We may wish to treat certain groups of characters as being of the same type, or class, of characters, and thus
form a regular expression that matches any character of a particular class. For example, we might wish to treat
characters 0 to 9 as belonging to the "decimal digit" class.

The general syntax for character classes uses the square brackets [ and ] to show that certain characters belong
to the same class. For decimal digits we could therefore use the regular expression "[0123456789]" which would
match 11 times in our test sentence for each of the 11 decimal digits.

For convenience we could shorten this to the range "[0-9]" (or a more limited range of say "[0-1]" for binary
digits) and we could reverse the sense in each case - that is, find non-digits - by adding the "^" character
immediately after the opening bracket to give [^0123456789] or [^0-9].

We could instead take advantage of the special escape sequences "\d" (for decimal digits) and "\D" (for
non-digits). Note that in .NET "\d" matches any Unicode decimal digit, not just 0 to 9, unless the
RegexOptions.ECMAScript option is set.
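
As a rough sketch, the different ways of writing the "decimal digit" class can be compared side by side. CharacterClassDemo and Count are names I've invented for the purpose.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // The test sentence used throughout the article.
    public const string Sentence =
        "Each item is priced at $10 but you can purchase ten for only $80 "
        + "which is much cheaper than 10 * $10 = $100.";

    // Counts how many times the pattern matches in the test sentence.
    public static int Count(string pattern)
    {
        return Regex.Matches(Sentence, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("[0123456789]"));  // explicit list of digits
        Console.WriteLine(Count("[0-9]"));         // the same class written as a range
        Console.WriteLine(Count(@"\d"));           // the shorthand escape
        Console.WriteLine(Count("[^0-9]"));        // negated class: everything else
        Console.WriteLine(Count(@"\D"));           // shorthand for non-digits
    }
}
```

The first three counts should all be 11, and the last two should both equal the length of the sentence minus 11, because every character is either a digit or a non-digit.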

For a complete list of Character Classes, look in the "Character Classes" section of the ".NET Framework General
Reference".

Atomic Zero-Width Assertions

Atomic Zero-Width Assertions cause a match to succeed or fail depending on the current position in the string.
We've already met one of these, the $ character that matches the end-of-line or end-of-string position providing it
is not preceded by a backslash. Remember that the unescaped "$" matched exactly once in our test sentence.

Technically that result was correct, because there is exactly one end-of-string position in our sample text.

Now suppose we wanted to count up the number of words in the input text. We can use the assertion \b to ensure
that a regular expression matches at a word boundary, like this:

\b[A-z]+\b

We're saying that each run of one or more characters in the range A-z - i.e. [A-z]+ - that has a word boundary \b
before and after counts as a word. Note that there are only 17 such words (comprising only letters) in the input
text. Strictly speaking, the range A-z also takes in the six punctuation characters that sit between Z and a in the
ASCII table, so [A-Za-z] is the safer way to say "letters only"; for our test sentence the result is the same.

Note also the use of the + symbol to specify "one or more", which will lead us neatly onto quantifiers in the next
section.
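
Here is a minimal sketch of that word count. WordCounter and CountWords are made-up names, and I've written the class as [A-Za-z] on the assumption that plain letters are what is wanted; a range written as A-z would also accept the punctuation characters that lie between Z and a.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Counts runs of letters that have a word boundary on both sides.
    public static int CountWords(string text)
    {
        // \b asserts a position between a word character and a non-word character,
        // so "a1" does not count: there is no boundary between the letter and the digit.
        return Regex.Matches(text, @"\b[A-Za-z]+\b").Count;
    }

    public static void Main()
    {
        string sentence =
            "Each item is priced at $10 but you can purchase ten for only $80 "
            + "which is much cheaper than 10 * $10 = $100.";

        Console.WriteLine(CountWords(sentence));  // 17
    }
}
```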

For a complete list of Atomic Zero-Width Assertions, look in the "Atomic Zero-Width Assertions" section of the
".NET Framework General Reference".

Quantifiers

Quantifiers act like multiplicities in database or object modeling, allowing you to specify "one or more", "zero or
more", "exactly 2", and so on.

To find any dollar amounts in our input text that are round hundreds (from $100 and $200 up to $900) we
could use the regular expression \$\d0{2}\b that matches a literal $ sign followed by a single digit, followed by
exactly 2 zeros and a word boundary.

So we've matched $100, but not $10 or $80, and the word boundary is a necessary inclusion to avoid matching
values of the form $1001.
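
The effect of the {2} quantifier and the trailing \b can be sketched as follows. RoundHundreds and Find are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RoundHundreds
{
    // Returns every "$" + one digit + exactly two zeros that ends at a word boundary.
    public static string[] Find(string text)
    {
        MatchCollection matches = Regex.Matches(text, @"\$\d0{2}\b");
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            found[i] = matches[i].Value;
        }
        return found;
    }

    public static void Main()
    {
        string sentence =
            "Each item is priced at $10 but you can purchase ten for only $80 "
            + "which is much cheaper than 10 * $10 = $100.";

        foreach (string amount in Find(sentence))
        {
            Console.WriteLine(amount);  // $100
        }

        // Without the \b, "$100" would be found inside "$1001".
        Console.WriteLine(Find("$1001").Length);  // 0
    }
}
```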

For a complete list of Quantifiers, look in the "Quantifiers" section of the ".NET Framework General Reference".

Grouping Constructs and Substitutions

Suppose we wanted to reformat the dollar amounts in our test sentence to replace the prefixed symbol "$" with
postfixed text "US dollars". The .NET API allows us to do that with a piece of code like this:

String newText = Regex.Replace(sourceText, @"\$(?<amount>\d*)",
    "${amount} US dollars");

That single line of code transforms our test sentence as follows.

Each item is priced at $10 but you can purchase ten for only $80 which
is much cheaper than 10 * $10 = $100.

The sentence now becomes:

Each item is priced at 10 US dollars but you can purchase ten for only 80 US dollars which
is much cheaper than 10 * 10 US dollars = 100 US dollars.

The regular expression we're searching for is "\$(?<amount>\d*)". In words it means "match a literal dollar sign
followed by a group of characters (named "amount") comprising zero or more decimal digits". Think of amount as
a variable that will contain the decimal amount (without the $ sign) for each match. Because \d* also permits zero
digits, a lone $ sign would match as well; writing \d+ instead would insist on at least one digit.

We could write some .NET code to step through the matches one-by-one, and for each one we could pick out the
value of the amount grouping. Or, as we have done in the code above, we could simply specify a replacement
string to substitute the matched text with our new format. That simple substitution should be quite self-
explanatory.
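
For completeness, here is a sketch of the other approach, stepping through the matches one by one and reading the amount group from each. AmountLister and SumAmounts are names of my own, and I've used \d+ rather than \d* so that every captured amount is guaranteed to contain at least one digit before it is parsed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmountLister
{
    // Prints each dollar amount found in the text and returns their total.
    public static int SumAmounts(string text)
    {
        Regex amountRegex = new Regex(@"\$(?<amount>\d+)");
        int total = 0;

        for (Match m = amountRegex.Match(text); m.Success; m = m.NextMatch())
        {
            // Groups["amount"] holds the digits only; m.Value holds the whole match.
            string amount = m.Groups["amount"].Value;
            Console.WriteLine(m.Value + " -> amount = " + amount);
            total += int.Parse(amount);
        }

        return total;
    }

    public static void Main()
    {
        string sentence =
            "Each item is priced at $10 but you can purchase ten for only $80 "
            + "which is much cheaper than 10 * $10 = $100.";

        Console.WriteLine("Total: " + SumAmounts(sentence));  // Total: 200
    }
}
```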

As an aside, you might be wondering why the regular expression string was preceded by the @ symbol in our
method call. That's how we tell the C# compiler not to interpret backslashes as its own escape characters, and to
preserve them to be interpreted by the Regex engine. This is not necessary if you're using Visual Basic.
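
To make the point about the @ symbol concrete, this small sketch shows the same pattern written both ways. VerbatimStringDemo and its two constants are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimStringDemo
{
    // Verbatim string: backslashes reach the Regex engine untouched.
    public const string Verbatim = @"\$(?<amount>\d*)";

    // Regular string: each backslash must itself be escaped for the C# compiler.
    // Writing "\$" here would not compile, because \$ is not a C# escape sequence.
    public const string Escaped = "\\$(?<amount>\\d*)";

    public static void Main()
    {
        Console.WriteLine(Verbatim == Escaped);              // True
        Console.WriteLine(Regex.IsMatch("$10", Verbatim));   // True
    }
}
```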

For a complete list of Grouping Constructs and Substitutions, look in the "Grouping Constructs" and
"Substitutions" sections of the ".NET Framework General Reference".

Backreference Constructs

The grouping constructs mentioned above allow us to mark out sections of text with variable names, so that we
can refer to them in our code or for use in substitutions as we've seen. We can also make use of groups by
referring back to them within the same regular expression.

A classic example is to find instances of double letters in words with a regular expression like
(?<letter>[A-z])\k<letter>. When run against our test sentence we get no matches because there are no words with
double letters.

But, if we run it against the substituted version (having $ symbols replaced by "US dollars" text, see above) we
get four matches, corresponding with the four occurrences of the word "dollars".

In our regular expression we defined a group named "letter" comprising any character A-z, and we referred back
to it with the backreference \k<letter> as the next matching character. The \k backreference succeeds if the
subsequent character is the same as that matched by the prior grouping construct of the same name, in this case
"<letter>".

For your information, we could have used a simpler version ([A-z])\1 instead, which backreferences the 1st
unnamed group.
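
A minimal sketch of the double-letter search follows. DoubleLetterFinder and CountDoubleLetters are invented names, and the letter class is written as [A-Za-z] on the assumption that only letters should qualify.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DoubleLetterFinder
{
    // Counts places where a letter is immediately followed by the same letter.
    public static int CountDoubleLetters(string text)
    {
        // (?<letter>...) captures one letter; \k<letter> must then match that same text again.
        return Regex.Matches(text, @"(?<letter>[A-Za-z])\k<letter>").Count;
    }

    public static void Main()
    {
        string original =
            "Each item is priced at $10 but you can purchase ten for only $80 "
            + "which is much cheaper than 10 * $10 = $100.";
        string substituted =
            Regex.Replace(original, @"\$(?<amount>\d*)", "${amount} US dollars");

        Console.WriteLine(CountDoubleLetters(original));     // 0
        Console.WriteLine(CountDoubleLetters(substituted));  // 4
    }
}
```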

For a more practical purpose, backreferences could be used to pick out content from between matching tags in an
HTML or XML document using a regular expression like this one.

"<(?<tag>\w+)>(?<content>(.|\n)*?)</\k<tag>>"

For a complete list of Backreference Constructs, look in the "Backreference Constructs" section of the ".NET
Framework General Reference".

Alternation Constructs

To understand the regular expression that I've just shown you, you really need to understand the meaning of the
pipe "|" symbol. It's an alternation construct meaning "OR".

As a simpler example, consider the number of times the number 10 appears in our test sentence. I count four, if
we include the spelled-out word "ten" in the calculation.

A regular expression such as (ten|10)\b looks for occurrences of the text "ten" or "10" followed by a word boundary,
and it matches four times in our test sentence.
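
As a quick sketch of alternation in code, the following counts "ten" or "10" at a word boundary. TenCounter and CountTens are hypothetical names, and the pattern (ten|10)\b is my reading of the description above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TenCounter
{
    // Counts "ten" or "10" where it is immediately followed by a word boundary.
    public static int CountTens(string text)
    {
        // The | offers two alternatives; the parentheses limit its scope so that
        // the \b applies to whichever alternative matched.
        return Regex.Matches(text, @"(ten|10)\b").Count;
    }

    public static void Main()
    {
        string sentence =
            "Each item is priced at $10 but you can purchase ten for only $80 "
            + "which is much cheaper than 10 * $10 = $100.";

        Console.WriteLine(CountTens(sentence));  // 4
    }
}
```

The "10" at the start of "$100" is not counted, because it is followed by another digit rather than a word boundary.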

For a complete list of Alternation Constructs, look in the "Alternation Constructs" section of the ".NET Framework
General Reference".

.NET Regular Expression Classes

Once you understand how to build regular expressions, you'll need a mechanism for driving them from
your .NET program, so we'll look briefly at these .NET classes included in the
System.Text.RegularExpressions namespace:

• System.Text.RegularExpressions.Regex
• System.Text.RegularExpressions.Match
• System.Text.RegularExpressions.Group
• System.Text.RegularExpressions.RegexOptions

The Regex class encapsulates a regular expression string and a set of options. You can look for occurrences
(returned as matches) of the regular expression pattern within an input text using instance or static methods of
the class.

RegexOptions provides enumeration values that may be combined to affect the matching operations, for example
to run in case-sensitive or case-insensitive mode.

Match class instances represent occurrences of the regular expression pattern within the input text and provide
access to text segments via group names or numbers (demonstrated in Sample Application #2 later).

Group class instances represent text segments from the input text that were mapped to group names or numbers
in the regular expression. Each match may have several groups associated with it, as demonstrated in Sample
Application #2 later.

The first sample application, which I'll list next and which was used to drive the previous examples, makes use of
the Regex, Match and RegexOptions classes from that namespace.

Sample Application #1 - RegexCounter

To demonstrate the various regular expressions language elements, I used a simple C# / .NET program that took
a regular expression as input, and which produced as output a message box reporting how many times the
expression appears in the test sentence.

You might like to try out the examples for yourself and experiment with some of your own, for which you'll need
the following program listing:

using System;
using System.Text.RegularExpressions;
using System.Windows.Forms;

namespace RegularExpressions
{
    public class RegexCounter
    {
        public static void Main(String[] args)
        {
            // The test sentence used throughout the article.
            String sourceText = "Each item is priced at $10 but you can purchase ten for only $80 which "
                + "is much cheaper than 10 * $10 = $100.";

            // Count the round-hundred dollar amounts, ignoring case.
            RegexCounter.countRegex(sourceText, @"\$\d0{2}\b", true);
        }

That Main() method should be easy enough to understand. All the interesting work is done in the next method:

        public static void countRegex(String sourceText, String regexText,
            bool ignoreCase)
        {

First of all we can set some RegexOptions, for example to make the matches case-sensitive or case-insensitive,
before creating the Regex instance:

            RegexOptions options = RegexOptions.None;
            if (ignoreCase) options = RegexOptions.IgnoreCase;

            Regex countRegex = new Regex(regexText, options);

Next we create a Match instance to match occurrences of the regular expression in our source text:

            Match countMatch = countRegex.Match(sourceText);

We step through each successful match, incrementing the count as we go:

            int count = 0;
            for ( ; countMatch.Success; countMatch = countMatch.NextMatch()) count++;

Finally we display the result.

            MessageBox.Show("The regular expression '" + regexText
                + "' appears " + count + " times.");
        }
    }
}

That code may be found in the RegexCounter.cs source file provided, and you can drive it using the
RegexTester.cs file also provided.

You can find out more about the classes used by looking in the "System.Text.RegularExpressions Namespace"
section of the ".NET Framework Class Library".
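
If you don't have Windows Forms to hand, the same counting logic can be sketched as a console program. This is a variation of my own rather than one of the supplied source files: ConsoleRegexCounter and CountRegex are made-up names, and the loop over NextMatch() has been swapped for the Count property of the MatchCollection returned by Regex.Matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConsoleRegexCounter
{
    // Returns the number of times regexText matches within sourceText.
    public static int CountRegex(string sourceText, string regexText, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        Regex countRegex = new Regex(regexText, options);

        // Matches() finds every non-overlapping match; Count reports how many there are.
        return countRegex.Matches(sourceText).Count;
    }

    public static void Main()
    {
        string sourceText =
            "Each item is priced at $10 but you can purchase ten for only $80 "
            + "which is much cheaper than 10 * $10 = $100.";
        string regexText = @"\$\d0{2}\b";

        int count = CountRegex(sourceText, regexText, true);
        Console.WriteLine("The regular expression '" + regexText
            + "' appears " + count + " times.");
    }
}
```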

Sample Application #2 - TagParser

The previous sample application that counts up the number of occurrences of a given regular expression was
sufficient for the basic demonstration. In any practical application we'd want to know a little more about each
matching instance of the regular expression, and in particular we might want to gain access to the text segments
enclosed within named groups.

Remember this regular expression that I showed earlier?

"<(?<tag>\w+)>(?<content>(.|\n)*?)</\k<tag>>"

It's used by the next class to pick out matching tag pairs from an input string and to print out the tags with
appropriate indents to show the levels of embedding. A method call as follows:

TagParser.showTags("<HTML><HEAD><TITLE>Sample HTML Table</TITLE></HEAD>"
    + "<BODY><TABLE><TR><TD>row0,col0</TD><TD>row0,col1</TD></TR>"
    + "<TR><TD>row1,col0</TD><TD>row1,col1</TD></TR></TABLE></BODY></HTML>", "");

Will result in console output of:

<HTML>
 <HEAD>
  <TITLE>
  </TITLE>
 </HEAD>
 <BODY>
  <TABLE>
   <TR>
    <TD>
    </TD>
    <TD>
    </TD>
   </TR>
   <TR>
    <TD>
    </TD>
    <TD>
    </TD>
   </TR>
  </TABLE>
 </BODY>
</HTML>

Hopefully that provides a more realistic example of something you might do with regular expressions, and the
code is as follows:

using System;
using System.Text.RegularExpressions;

namespace RegularExpressions
{
    public class TagParser
    {
        public static void showTags(String text, String indent)
        {

There we defined a TagParser class within a RegularExpressions namespace, having a single showTags()
method. Next we create an instance of our regular expression:

            Regex tagRegex =
                new Regex(@"<(?<tag>\w+)>(?<content>(.|\n)*?)</\k<tag>>");

We're looking for:

• A group (called tag) of one or more (+) word characters (\w) within angle brackets < and >, followed by
• A group (called content) of zero or more (*?) occurrences of any character or newline (.|\n), followed by
• A / character and the backreferenced (\k) content of the tag group within angle brackets < and >.

In a nutshell, we want to find matching <anytag> and </anytag> combinations with enclosed content. If you're
wondering about the question mark (?) character that follows the asterisk (*) in our regular expression, I'll explain
that shortly.

Now we match the regular expression against our input text and loop through the matches:

            Match tagMatch = tagRegex.Match(text);

            for ( ; tagMatch.Success; tagMatch = tagMatch.NextMatch())
            {

In the following line, take note of the tagMatch.Groups["tag"] indexer, which extracts the text that
matches our regular expression's <tag> group:

                System.Console.WriteLine(indent + "<" + tagMatch.Groups["tag"] + ">");

Because the content enclosed between two matching tags may itself contain other pairs of matching tags, we're
now doing some clever recursion by re-entering the showTags() method with the text of the <content> group.

                showTags("" + tagMatch.Groups["content"], indent + " ");

And when we wind back out of the recursion we print the closing tag at this level.

                System.Console.WriteLine(indent + "</" + tagMatch.Groups["tag"] + ">");
            }

Now close off the method, class, and namespace:

        }
    }
}

That recursive approach is not the most performant solution, and for deeply nested input the call stack will grow
with every level of nesting, but it's the neatest way to demonstrate the principle. The code may be found in the
TagParser.cs source file provided, and you can drive it using the RegexTester.cs file also provided.

Lazy Quantifiers and the Unexplained Question Mark

That second example illustrates the importance of lazy quantifiers, like *?, that match the minimum number of
repetitions.

Assuming TR as the value matched by the <tag> group in our regular expression, consider this fragment of the input:

<TR><TD>row0,col0</TD><TD>row0,col1</TD></TR><TR><TD>row1,col0</TD><TD>row1,col1</TD></TR>

The <content> group would first match the text enclosed within the first <TR> </TR> combination:

<TD>row0,col0</TD><TD>row0,col1</TD>

And the next match would be:

<TD>row1,col0</TD><TD>row1,col1</TD>

That's exactly what is needed to give our desired end result, but if we omitted the question mark to leave (.|\n)*
in the regular expression, the first match for the <content> group would be the whole of:

<TD>row0,col0</TD><TD>row0,col1</TD></TR><TR><TD>row1,col0</TD><TD>row1,col1</TD>

Technically that is correct because it is content enclosed within matching <TR> and </TR> tags, but not at all
what we wanted, and the final output in that case would be as follows. Look carefully at how the <TR> and <TD>
tags are indented.

<HTML>
 <HEAD>
  <TITLE>
  </TITLE>
 </HEAD>
 <BODY>
  <TABLE>
   <TR>
    <TD>
     <TD>
     </TD>
    </TD>
   </TR>
  </TABLE>
 </BODY>
</HTML>

That was a practical demonstration of the subtle difference between a lazy quantifier (*?) and a greedy quantifier
(*). The former matches the shortest text segment that meets the criteria, and the latter matches the longest text
segment that matches the criteria.
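
The difference can be seen in isolation with a sketch like this one, which runs the tag pattern over the two table rows with each kind of quantifier. LazyVersusGreedy and FirstContent are names I've chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVersusGreedy
{
    // Returns the content group of the first matching tag pair in the text,
    // using either the lazy (*?) or the greedy (*) quantifier.
    public static string FirstContent(string text, bool lazy)
    {
        string quantifier = lazy ? "*?" : "*";
        Regex tagRegex = new Regex(
            @"<(?<tag>\w+)>(?<content>(.|\n)" + quantifier + @")</\k<tag>>");

        return tagRegex.Match(text).Groups["content"].Value;
    }

    public static void Main()
    {
        string rows =
            "<TR><TD>row0,col0</TD><TD>row0,col1</TD></TR>"
            + "<TR><TD>row1,col0</TD><TD>row1,col1</TD></TR>";

        // Lazy: stops at the first </TR>, giving the content of the first row only.
        Console.WriteLine(FirstContent(rows, true));

        // Greedy: runs on to the last </TR>, swallowing the boundary between the rows.
        Console.WriteLine(FirstContent(rows, false));
    }
}
```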

Conclusion

Over the years, many programming and scripting languages - like AWK, Perl and Python - have included regular
expressions as part of their fabric.

Many other languages and runtime environments - like the Java SDK 1.4 and C#/VB/C++/.NET - have been
furnished with a regular expressions capability, and with good reason. Regular expressions provide a very
powerful mechanism for parsing, transforming, and otherwise manipulating text with little coding effort.

For those who are new to regular expressions, I've tried to demystify them a little via simple, concrete
examples of the various RE language elements. For those familiar with regular expressions, but unfamiliar with
the .NET supporting classes, I've introduced the main classes from the System.Text.RegularExpressions
namespace.

If this article has whetted your appetite you will find plenty of documentation and some more examples in the
official .NET documentation, and you might like to revisit my previous "Working with Web Data in
C#" (http://www.csharptoday.com/content/articles/20020128.asp) article to try out a few more regular
expressions in a real-life scenario.

Article Information
Author: Tony Loton
Technical Editor: Adam Ryland
Author Agent: Charlotte Smith
Project Manager: Helen Cuthill
Reviewers: John Boyd Nolan, Phil Sidari

If you have any questions or comments about this article, please contact the technical editor.

C#Today is brought to you by Wrox Press (www.wrox.com).
Copyright © 2002 Wrox Press.