Advanced Regular Expression Tips and Techniques

Advanced Regular Expression Tips and Techniques | Nettuts+ http://net.tutsplus.com/tutorials/php/advanced-regular-expression-tips-a...
Burak Guzel on Jan 27th 2011 with 82 comments
Tutorial Details
Technology: Regular Expressions
Difficulty: Advanced
Twice a month, we revisit some of our readers' favorite posts from throughout the history of Nettuts+.
Regular Expressions are the Swiss Army knife for searching through information for certain patterns. They have
a wide arsenal of tools, some of which often go undiscovered or underutilized. Today I will show you some
advanced tips for working with regular expressions.
Sometimes, regular expressions can become complex and unreadable. A regular expression you write today may
seem too obscure to you tomorrow even though it was your own work. Much like programming in general, it is a
good idea to add comments to improve the readability of regular expressions.
For example, here is something we might use to check for US phone numbers.
preg_match("/^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$/", $number);
It can become much more readable with comments and some extra spacing.
preg_match("/^

    (1[-\s.])?  # optional '1-', '1.' or '1 '
    ( \( )?     # optional opening parenthesis
    \d{3}       # the area code
    (?(2) \) )  # if there was an opening parenthesis, close it
    [-\s.]?     # followed by '-' or '.' or space
    \d{3}       # first 3 digits
    [-\s.]?     # followed by '-' or '.' or space
    \d{4}       # last 4 digits

$/x", $number);
Let's put it within a code segment.
$numbers = array(
    "123 555 6789",
    "1-(123)-555-6789",
    "(123-555-6789",
    "(123).555.6789",
    "123 55 6789");

foreach ($numbers as $number) {
    echo "$number is ";

    if (preg_match("/^

        (1[-\s.])?  # optional '1-', '1.' or '1 '
        ( \( )?     # optional opening parenthesis
        \d{3}       # the area code
        (?(2) \) )  # if there was an opening parenthesis, close it
        [-\s.]?     # followed by '-' or '.' or space
        \d{3}       # first 3 digits
        [-\s.]?     # followed by '-' or '.' or space
        \d{4}       # last 4 digits

    $/x", $number)) {

        echo "valid\n";
    } else {
        echo "invalid\n";
    }
}

/* prints

123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is invalid
(123).555.6789 is valid
123 55 6789 is invalid

*/
The trick is to use the x modifier at the end of the regular expression. It causes whitespace in the pattern
to be ignored unless it is escaped (for example, with \s) or appears inside a character class. This makes it
easy to add comments. Comments start with # and end at a newline, so a literal # in the pattern has to be
escaped as well.
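As a minimal sketch of those two rules, the patterns below are made up for illustration and are not part of the phone number example: one escapes a literal # so it is not read as a comment, and the other puts a space inside a character class so it survives the x modifier.

```php
<?php
// Hypothetical pattern: a hex color such as "#ff8800".
$hexColor = '/^
    \#              # a literal hash sign; unescaped, it would start a comment
    [0-9a-f]{6}     # exactly six hex digits
$/xi';

// Whitespace inside a character class is still significant under x.
$twoWords = '/^ \w+ [ ] \w+ $/x';

$colorOk  = preg_match($hexColor, '#FF8800');      // 1
$colorBad = preg_match($hexColor, 'FF8800');       // 0
$wordsOk  = preg_match($twoWords, 'hello world');  // 1
$wordsBad = preg_match($twoWords, 'helloworld');   // 0
```

One thing to watch for: the comment text is still part of the pattern string, so it must not contain the pattern delimiter (/ in these examples).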
In PHP, preg_replace_callback() can be used to add callback functionality to regular expression replacements.
Sometimes you need to do multiple replacements. If you call preg_replace() or str_replace() for each pattern, the
string will be parsed over and over again.
Let's look at this example, where we have an e-mail template.
$template = "Hello [first_name] [last_name],

Thank you for purchasing [product_name] from [store_name].

The total cost of your purchase was [product_price] plus [ship_price] for shipping.

You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days.

Sincerely,
[store_manager_name]";

// assume $data array has all the replacement data
// such as $data['first_name'] $data['product_price'] etc...

$template = str_replace("[first_name]",$data['first_name'],$template);
$template = str_replace("[last_name]",$data['last_name'],$template);
$template = str_replace("[store_name]",$data['store_name'],$template);
$template = str_replace("[product_name]",$data['product_name'],$template);
$template = str_replace("[product_price]",$data['product_price'],$template);
$template = str_replace("[ship_price]",$data['ship_price'],$template);
$template = str_replace("[ship_days_min]",$data['ship_days_min'],$template);
$template = str_replace("[ship_days_max]",$data['ship_days_max'],$template);
$template = str_replace("[store_manager_name]",$data['store_manager_name'],$template);

// this could be done in a loop too,
// but I wanted to emphasize how many replacements were made
Notice that each replacement has something in common. They are always strings enclosed within square
brackets. We can catch them all with a single regular expression, and handle the replacements in a callback
function.
So here is the better way of doing this with callbacks:
// ...

// this will call my_callback() every time it sees brackets;
// the lazy .*? stops at the first closing bracket
$template = preg_replace_callback('/\[(.*?)\]/','my_callback',$template);

function my_callback($matches) {
    // $data is defined outside the function, so bring it into scope
    global $data;

    // $matches[1] now contains the string between the brackets
    if (isset($data[$matches[1]])) {
        // return the replacement string
        return $data[$matches[1]];
    } else {
        // unknown placeholder: leave it unchanged
        return $matches[0];
    }
}
Now the string in $template is only parsed by the regular expression once.
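For a version you can run on its own, here is a small sketch of the same idea using an anonymous function. The $data values and the shortened template are invented for illustration, and the closure is just one possible way to give the callback access to $data.

```php
<?php
// Hypothetical replacement data, for illustration only.
$data = array(
    'first_name' => 'Jane',
    'store_name' => 'Example Store',
);

$template = 'Hello [first_name], welcome to [store_name]. Ref: [unknown_key]';

// The closure imports $data with "use", so no global variable is needed.
// [^\]]+ matches everything up to the next closing bracket.
$result = preg_replace_callback('/\[([^\]]+)\]/', function ($matches) use ($data) {
    $key = $matches[1];
    // Known placeholders are replaced; unknown ones are left as they are.
    return isset($data[$key]) ? $data[$key] : $matches[0];
}, $template);

echo $result, "\n"; // Hello Jane, welcome to Example Store. Ref: [unknown_key]
```

Leaving unknown placeholders untouched makes a missing key easy to spot in the output, although returning an empty string would be an equally valid choice.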
Before I explain greedy and ungreedy matching, I would like to show an example first. Let's say we are looking
for anchor tags in some HTML text:
$html = 'Hello <a href="http://net.tutsplus.com/world">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {

    print_r($matches);

}
The result will be as expected:
/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="http://net.tutsplus.com/world">World!</a>
        )

)
*/
Let's change the input and add a second anchor tag:
$html = '<a href="http://net.tutsplus.com/hello">Hello</a>
<a href="http://net.tutsplus.com/world">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {

    print_r($matches);

}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="http://net.tutsplus.com/hello">Hello</a>
            [1] => <a href="http://net.tutsplus.com/world">World!</a>

        )

)
*/
Again, it seems to be fine so far. But don't let this trick you. The only reason it works is that the anchor tags
are on separate lines, and by default the dot (.) does not match a newline character (more info on: s modifier).
If we encounter two anchor tags on the same line, it will no longer work as expected:
$html = '<a href="http://net.tutsplus.com/hello">Hello</a> <a href="http://net.tutsplus.com/world">World!</a>';

if (preg_match_all('/<a.*>.*<\/a>/',$html,$matches)) {

    print_r($matches);

}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="http://net.tutsplus.com/hello">Hello</a> <a href="http://net.tutsplus.com/world">World!</a>

        )

)
*/
This time the pattern matches the first opening tag, the last closing tag, and everything in between as a single
match, instead of making two separate matches. This is because quantifiers are greedy by default.
When greedy, the quantifiers (such as * or +) match as many characters as possible.
If you add a question mark after the quantifier (.*?) it becomes ungreedy:
$html = '<a href="http://net.tutsplus.com/hello">Hello</a> <a href="http://net.tutsplus.com/world">World!</a>';

// note the ?'s after the *'s
if (preg_match_all('/<a.*?>.*?<\/a>/',$html,$matches)) {

    print_r($matches);

}

/* output:
Array
(
    [0] => Array
        (
            [0] => <a href="http://net.tutsplus.com/hello">Hello</a>
            [1] => <a href="http://net.tutsplus.com/world">World!</a>

        )

)
*/
Now the result is correct. Another way to trigger the ungreedy behavior is to use the U pattern modifier.
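To see how the U modifier relates to the question mark, here is a short sketch on a made-up input string. It suggests that U flips the default for the whole pattern, which also means a ? after a quantifier then makes it greedy again.

```php
<?php
$html = '<b>one</b> <b>two</b>';

preg_match_all('/<b>.*<\/b>/',  $html, $greedy);   // default: greedy
preg_match_all('/<b>.*?<\/b>/', $html, $lazy);     // ? makes * ungreedy
preg_match_all('/<b>.*<\/b>/U', $html, $flipped);  // U makes * ungreedy by default

// Under U, adding ? makes the quantifier greedy again.
preg_match_all('/<b>.*?<\/b>/U', $html, $flippedBack);

print_r($greedy[0]);   // one match: the whole string
print_r($flipped[0]);  // two matches: <b>one</b> and <b>two</b>
```

Because U silently changes the meaning of every quantifier in the pattern, the explicit ? form is usually easier for the next reader to follow.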
A lookahead assertion checks for a pattern that follows the current match, without making it part of the match.
This is easier to explain through an example.
The following pattern first matches foo, and then checks whether it is followed by bar:
$pattern = '/foo(?=bar)/';

preg_match($pattern,'Hello foo'); // 0 (no match)
preg_match($pattern,'Hello foobar'); // 1 (match)
It may not seem very useful, as we could have simply checked for foobar instead. However, it is also possible
to use lookaheads for making negative assertions. The following example matches foo, only if it is NOT
followed by bar.
$pattern = '/foo(?!bar)/';

preg_match($pattern,'Hello foo'); // 1 (match)
preg_match($pattern,'Hello foobar'); // 0 (no match)
preg_match($pattern,'Hello foobaz'); // 1 (match)
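For a slightly more practical use of lookaheads, here is a sketch that inserts thousands separators into a string of digits. The function name add_thousands_separators() is made up for this example. Since a lookahead consumes no characters, the pattern matches empty positions, and the replacement simply drops a comma into each one.

```php
<?php
// Hypothetical helper: inserts thousands separators with a lookahead.
function add_thousands_separators($digits)
{
    // \B       : not at a word boundary, so never before the first digit
    // (?=...)  : what follows must be one or more groups of three digits
    // (?!\d)   : with no further digit after those groups
    return preg_replace('/\B(?=(\d{3})+(?!\d))/', ',', $digits);
}

echo add_thousands_separators('1234567'), "\n"; // 1,234,567
echo add_thousands_separators('999'), "\n";     // 999
```

In real code, number_format() already does this job for numeric values; the point here is only to show a lookahead doing something a plain pattern cannot.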
Lookbehind assertions work similarly, but they look for patterns before the current match. You may use (?<= for
positive assertions, and (?<! for negative assertions.
The following pattern matches bar only if it does not follow foo.
$pattern = '/(?<!foo)bar/';

preg_match($pattern,'Hello bar'); // 1 (match)
preg_match($pattern,'Hello foobar'); // 0 (no match)
preg_match($pattern,'Hello bazbar'); // 1 (match)
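As a sketch of both lookbehind forms on an invented sentence, the first pattern below collects numbers that directly follow a dollar sign, and the second collects numbers that do not. The negative version also excludes a preceding digit, so that it cannot match the tail end of a price.

```php
<?php
$text = 'Items: $25, $300 and 40 units';

// Positive lookbehind: digits that come right after a dollar sign.
preg_match_all('/(?<=\$)\d+/', $text, $prices);

// Negative lookbehind: digits not preceded by a dollar sign or another digit.
preg_match_all('/(?<![$\d])\d+/', $text, $others);

print_r($prices[0]); // 25 and 300
print_r($others[0]); // 40
```

Note that the dollar sign itself is not part of the matches, because assertions only check the surrounding text without consuming it.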
Regular expressions provide the functionality for checking certain conditions. The format is as follows:
(?(condition)true-pattern|false-pattern)

or

(?(condition)true-pattern)
The condition can be a number, in which case it refers to a previously captured subpattern.
For example we can use this to check for opening and closing angle brackets:
$pattern = '/^(<)?[a-z]+(?(1)>)$/';

preg_match($pattern, '<test>'); // 1 (match)
preg_match($pattern, '<foo'); // 0 (no match)
preg_match($pattern, 'bar>'); // 0 (no match)
preg_match($pattern, 'hello'); // 1 (match)
In the example above, 1 refers to the subpattern (<), which is optional since it is followed by a question
mark. The pattern requires a closing bracket only if that subpattern matched.
The condition can also be an assertion:
// if it begins with 'q', it must begin with 'qu'
// else it must begin with 'f'
$pattern = '/^(?(?=q)qu|f)/';

preg_match($pattern, 'quake'); // 1 (match)
preg_match($pattern, 'qwerty'); // 0 (no match)
preg_match($pattern, 'foo'); // 1 (match)
preg_match($pattern, 'bar'); // 0 (no match)
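A conditional can also be combined with a back reference. The sketch below, which is an illustration rather than anything from the examples above, accepts a lowercase word that is either unquoted or wrapped in a matching pair of quotes.

```php
<?php
// If group 1 captured an opening quote, require the same quote at the end.
$pattern = '/^(["\'])?[a-z]+(?(1)\1)$/';

$plain      = preg_match($pattern, 'hello');    // 1: no quotes at all
$doubled    = preg_match($pattern, '"hello"');  // 1: matching double quotes
$singled    = preg_match($pattern, "'hello'");  // 1: matching single quotes
$unclosed   = preg_match($pattern, '"hello');   // 0: the closing quote is missing
$mismatched = preg_match($pattern, "\"hello'"); // 0: the quotes do not match
```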
There are various reasons for input filtering when developing web applications. We filter data before inserting it
into a database, or outputting it to the browser. Similarly, it is necessary to filter any arbitrary string before
including it in a regular expression. PHP provides a function named preg_quote to do the job.
In the following example we use a string that contains a special character (*).
$word = '*world*';

$text = 'Hello *world*!';

preg_match('/'.$word.'/', $text); // causes a warning
// the second argument tells preg_quote() to escape the delimiter as well
preg_match('/'.preg_quote($word, '/').'/', $text); // 1 (match)
The same thing can also be accomplished by enclosing the string between \Q and \E. Any special character after \Q
is treated as a literal until \E.
$word = '*world*';

$text = 'Hello *world*!';

preg_match('/\Q'.$word.'\E/', $text); // 1 (match)
However, this second method is not 100% safe, as the string itself can contain \E.
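Here is a small sketch of preg_quote() handling the awkward cases. The helper name highlight_word() and the sample strings are assumptions made up for this example. It passes the delimiter as the second argument so that a slash in the input cannot end the pattern early.

```php
<?php
// Hypothetical helper: wraps every occurrence of $word in asterisks.
function highlight_word($word, $text)
{
    // The second argument also escapes the delimiter used by the pattern.
    $safe = preg_quote($word, '/');
    return preg_replace('/' . $safe . '/', '*$0*', $text);
}

echo highlight_word('a/b', 'ratio a/b'), "\n";      // ratio *a/b*
echo highlight_word('1+1', '1+1 is not 11'), "\n";  // *1+1* is not 11

// A string containing \E stays literal too, because the backslash is escaped.
$tricky = 'a\E.b';
$literalHit  = preg_match('/' . preg_quote($tricky, '/') . '/', 'x a\E.b y'); // 1
$literalMiss = preg_match('/' . preg_quote($tricky, '/') . '/', 'axb');       // 0
```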
Subpatterns, enclosed by parentheses, get captured into an array so that we can use them later if needed. But
there is also a way to NOT capture them.
Let's start with a very simple example:
preg_match('/(f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => foo'
echo "b* => " . $matches[2]; // prints 'b* => bar'
Now let's make a small change by adding another subpattern (H.*) to the front:
preg_match('/(H.*) (f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => Hello'
echo "b* => " . $matches[2]; // prints 'b* => foo'
The $matches array has changed, which could cause the script to stop working properly, depending on what we
do with those variables in the code. Now we have to find every occurrence of the $matches array in the code and
adjust the index numbers accordingly.
If we are not really interested in the contents of the new subpattern we just added, we can make it
non-capturing like this:
preg_match('/(?:H.*) (f.*)(b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches[1]; // prints 'f* => foo'
echo "b* => " . $matches[2]; // prints 'b* => bar'
By adding ?: at the beginning of the subpattern, we no longer capture it in the $matches array, so the other
array values do not get shifted.
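Non-capturing groups are especially handy when parentheses are needed only for grouping alternatives. In this sketch, with a made-up file name, the extension list is grouped but only the base name is captured.

```php
<?php
// Group the alternatives without creating an extra capture.
$pattern = '/^(\w+)\.(?:jpe?g|png|gif)$/i';

preg_match($pattern, 'photo.JPG', $matches);

print_r($matches);
// $matches[0] is the whole match and $matches[1] is the base name;
// there is no $matches[2] for the extension
```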
There is another method for preventing pitfalls like the one in the previous example. We can give a name to each
subpattern, so that we can reference it later using that name instead of an array index number. This is the
format: (?P<name>pattern)
We could rewrite the first example in the previous section like this:
preg_match('/(?P<fstar>f.*)(?P<bstar>b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches['fstar']; // prints 'f* => foo'
echo "b* => " . $matches['bstar']; // prints 'b* => bar'
Now we can add another subpattern, without disturbing the existing matches in the $matches array:
preg_match('/(?P<hi>H.*) (?P<fstar>f.*)(?P<bstar>b.*)/', 'Hello foobar', $matches);

echo "f* => " . $matches['fstar']; // prints 'f* => foo'
echo "b* => " . $matches['bstar']; // prints 'b* => bar'

echo "h* => " . $matches['hi']; // prints 'h* => Hello'
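Named subpatterns pay off most when a pattern has several parts that are easy to mix up. As an illustration with an invented date string, the sketch below pulls apart a date and reorders its pieces by name.

```php
<?php
$pattern = '/^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$/';

if (preg_match($pattern, '2011-01-27', $m)) {
    // Reorder the parts by name instead of by position.
    $reordered = $m['day'] . '/' . $m['month'] . '/' . $m['year'];
    echo $reordered, "\n"; // 27/01/2011
}

// Named groups are still numbered as well, so $m[1] equals $m['year'].
```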
Perhaps it is most important to know when NOT to use regular expressions. There are many situations where you
can find existing utilities that you can use instead.
A poster at Stack Overflow has a brilliant explanation of why we should not use regular expressions to parse
[X]HTML.
dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed
humanity to an eternity of dread torture and security holes using regex as a tool to process HTML
establishes a breach between this world and the dread realm of corrupt entities
Joking aside, it is a good idea to take some time and figure out what kind of XML or HTML parsers are
available, and how they work. For example, PHP offers multiple extensions related to XML (and HTML).
Example: Getting the second link url in an HTML page
// call loadHTML() on an instance; static calls to it raise an error in PHP 8
$doc = new DOMDocument();
$doc->loadHTML('
<html>
<body>Test
<a href="http://www.nettuts.com">First link</a>
<a href="http://net.tutsplus.com">Second link</a>
</body>
</html>
');

echo $doc->getElementsByTagName('a')
    ->item(1)
    ->getAttribute('href');

// prints: http://net.tutsplus.com
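The same approach extends naturally to collecting every link on a page. The following is a minimal sketch that assumes the DOM extension is available; the variable names are made up for this example.

```php
<?php
$html = '<html><body>
    <a href="http://www.nettuts.com">First link</a>
    <a href="http://net.tutsplus.com">Second link</a>
</body></html>';

// Collect parser warnings quietly; real-world HTML is often not well formed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Map each link text to its URL.
$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $links[$anchor->textContent] = $anchor->getAttribute('href');
}

print_r($links);
```

Unlike a regular expression, this keeps working when the attributes appear in a different order or the tags span several lines.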
Again, you can use existing functions to validate user inputs, such as form submissions.
if (!filter_var($_POST['email'], FILTER_VALIDATE_EMAIL)) {

    $errors[] = "Please enter a valid e-mail.";
}

// get supported filters

print_r(filter_list());

/* output (the exact list depends on the PHP version)
Array
(
    [0] => int
    [1] => boolean
    [2] => float
    [3] => validate_regexp
    [4] => validate_url
    [5] => validate_email
    [6] => validate_ip
    [7] => string
    [8] => stripped
    [9] => encoded
    [10] => special_chars
    [11] => unsafe_raw
    [12] => email
    [13] => url
    [14] => number_int
    [15] => number_float
    [16] => magic_quotes
    [17] => callback
)
*/
More info: PHP Data Filtering
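To give a feel for a few of these filters, here is a short sketch with made-up input values. Validation filters return the (possibly converted) value on success and false on failure, so a strict comparison against false is the safest check.

```php
<?php
// Hypothetical range limits, passed through the "options" array.
$ageOptions = array('options' => array('min_range' => 18, 'max_range' => 120));

$age      = filter_var('42', FILTER_VALIDATE_INT, $ageOptions); // int 42
$tooYoung = filter_var('7', FILTER_VALIDATE_INT, $ageOptions);  // false

$url    = filter_var('http://net.tutsplus.com', FILTER_VALIDATE_URL); // the URL string
$notUrl = filter_var('not a url', FILTER_VALIDATE_URL);               // false

if ($tooYoung === false) {
    echo "Age is out of range\n";
}
```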
Here are some other utilities to keep in mind before using regular expressions:
strtotime() for parsing dates.
Use regular string functions for matches and replacements if you are searching for a fixed string rather than
a pattern.
Examples: str_replace() vs. preg_replace(), explode() vs. preg_split(), strpos() vs. preg_match()
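As a quick sketch of those alternatives, with sample strings invented for illustration, each of the calls below does a job that might otherwise tempt you into writing a pattern.

```php
<?php
$line = 'name=Burak;lang=PHP';

// strpos() instead of preg_match() for a fixed substring.
$hasLang = strpos($line, 'lang=') !== false;

// explode() instead of preg_split() for a fixed separator.
$pairs = explode(';', $line);

// str_replace() instead of preg_replace() for a fixed replacement.
$lower = str_replace('PHP', 'php', $line);

// strtotime() instead of a hand-written date pattern.
$timestamp = strtotime('2011-01-27 00:00:00 UTC');
echo gmdate('D, d M Y', $timestamp), "\n"; // Thu, 27 Jan 2011
```

Note the strict !== false comparison with strpos(): a match at position 0 is a valid result but would otherwise be treated as false.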
Thanks so much for reading!
By Burak Guzel
Burak Guzel is a full-time PHP Web Developer living in Arizona,
originally from Istanbul, Turkey. He has a bachelor's degree in
Computer Science and Engineering from The Ohio State
University. He has over 8 years of experience with PHP and
MySQL. You can read more of his articles on his website at
PHPandStuff.com and follow him on Twitter here.