
Java - Regular Expressions

Java provides the java.util.regex package for pattern matching with regular expressions.

Java regular expressions are very similar to those of the Perl programming language and are easy to learn. A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions can be used to search, edit, or manipulate text and data. The java.util.regex package primarily consists of the following three classes:

Pattern Class: A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a Pattern object, you must first invoke one of its public static compile methods, which will then return a Pattern object. These methods accept a regular expression as the first argument.

Matcher Class: A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by invoking the matcher method on a Pattern object.

PatternSyntaxException: A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.
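The way the three classes fit together can be sketched as follows. This is only an illustration: the class name RegexBasics, its helper methods and the sample inputs are assumptions made for this example and are not part of the java.util.regex API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexBasics {

    // Compiles the regex and reports whether it occurs anywhere in the input.
    static boolean containsMatch(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);  // static factory, no public constructor
        Matcher matcher = pattern.matcher(input);  // a Matcher is obtained from the Pattern
        return matcher.find();
    }

    // Returns true if the regex compiles, false if it has a syntax error.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("\\d+", "order 42"));   // true
        System.out.println(containsMatch("\\d+", "no digits"));  // false
        System.out.println(isValidRegex("(unclosed"));           // false
    }
}
```

Because PatternSyntaxException is unchecked, the try/catch is optional; it is mainly useful when the expression comes from user input.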

Capturing Groups: Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o", and "g". Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)

To find out how many groups are present in the expression, call the groupCount method on a matcher object. The groupCount method returns an int showing the number of capturing groups present in the matcher's pattern. There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount.
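The numbering rule can be checked with a small sketch. The class GroupNumberingDemo and its groups helper are hypothetical names used only for this illustration; groupCount and group are the Matcher methods described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {

    // Returns groups 0..groupCount() of the first match, or an empty list if nothing matches.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.find()) {
            // groupCount() excludes group 0, hence the <= in the loop condition
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Group 0 is the whole match, groups 1 to 4 follow the opening parentheses.
        System.out.println(groups("((A)(B(C)))", "ABC")); // [ABC, ABC, A, BC, C]
    }
}
```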

Example: The following example illustrates how to find a digit string in a given alphanumeric string:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexMatches {
        public static void main(String[] args) {
            // String to be scanned to find the pattern.
            String line = "This order was places for QT3000! OK?";
            String pattern = "(.*)(\\d+)(.*)";

            // Create a Pattern object
            Pattern r = Pattern.compile(pattern);

            // Now create matcher object.
            Matcher m = r.matcher(line);
            if (m.find()) {
                System.out.println("Found value: " + m.group(0));
                System.out.println("Found value: " + m.group(1));
                System.out.println("Found value: " + m.group(2));
            } else {
                System.out.println("NO MATCH");
            }
        }
    }

This would produce the following result:

    Found value: This order was places for QT3000! OK?
    Found value: This order was places for QT300
    Found value: 0

Regular Expression Syntax: The following table lists the regular expression metacharacter syntax available in Java:

Subexpression    Matches
^                Beginning of line.
$                End of line.
.                Any single character except a line terminator. With the DOTALL flag (Pattern.DOTALL or (?s)) it matches line terminators as well.
[...]            Any single character in brackets. Example: [abc] matches the letter a, b, or c; [abc][vz] matches a, b, or c followed by either v or z.
[^...]           Any single character not in brackets. A "^" as the first character inside [] negates the class. Example: [^abc] matches any character except a, b, or c.
[a-d1-7]         Ranges: a letter between a and d or a digit from 1 to 7. It matches a single character, so it does not match "d1".
\A               Beginning of the entire string.
\z               End of the entire string.
\Z               End of the entire string, or just before a final line terminator if one exists.
re*              0 or more occurrences of the preceding expression.
re+              1 or more occurrences of the preceding expression.
re?              0 or 1 occurrence of the preceding expression.
re{n}            Exactly n occurrences of the preceding expression.
re{n,}           n or more occurrences of the preceding expression.
re{n,m}          At least n and at most m occurrences of the preceding expression.
a|b              Either a or b. Example: X|Y finds X or Y, whereas XY finds X directly followed by Y.
(re)             Groups regular expressions and remembers the matched text.
(?:re)           Groups regular expressions without remembering the matched text.
(?>re)           Independent pattern, matched without backtracking.
\w               Word characters. Equivalent to [a-zA-Z_0-9].
\W               Nonword characters. Equivalent to [^\w].
\s               Whitespace. Equivalent to [ \t\n\x0B\f\r].
\S               Nonwhitespace. Equivalent to [^\s].
\S+              One or more nonwhitespace characters.
\d               Digits. Equivalent to [0-9].
\D               Nondigits. Equivalent to [^0-9].
\G               Point where the last match finished.
\1, \2, ...      Back-reference to capturing group number 1, 2, ...
\b               Word boundary. Unlike in Perl, \b inside brackets does not denote a backspace in Java; use \x08 for that.
\B               Nonword boundary.
\n, \t, etc.     Newlines, tabs, carriage returns, etc.
\Q               Escapes (quotes) all characters up to \E.
\E               Ends quoting begun with \Q.

Quantifiers describe how often the preceding element may occur:

Regular Expression    Description
*                     Occurs zero or more times; short for {0,}. Examples: X* finds zero or more letters X; .* matches any character sequence.
+                     Occurs one or more times; short for {1,}. Example: X+ finds one or several letters X.
?                     Occurs zero or one time; short for {0,1}. Example: X? finds no or exactly one letter X.
{X}                   Occurs exactly X times; {} gives the number of repetitions of the preceding element. Examples: \d{3} matches three digits; .{10} matches any character sequence of length 10.
{X,Y}                 Occurs between X and Y times. Example: \d{1,4} means \d must occur at least once and at most four times.
*?                    A ? after a quantifier makes it a "reluctant quantifier": it tries to find the smallest match.
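The difference between a greedy quantifier and its reluctant counterpart is easiest to see side by side. The following is a minimal sketch; the class QuantifierDemo, the firstMatch helper and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    // Returns the text of the first match, or null if the regex does not occur in the input.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>one</b> and <b>two</b>";
        // Greedy: .* grabs as much as possible, up to the last </b>
        System.out.println(firstMatch("<b>.*</b>", input));   // <b>one</b> and <b>two</b>
        // Reluctant: .*? stops at the first </b>
        System.out.println(firstMatch("<b>.*?</b>", input));  // <b>one</b>
        // {1,4} takes at most four digits
        System.out.println(firstMatch("\\d{1,4}", "123456")); // 1234
    }
}
```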

Methods of the Matcher Class: Here is a list of useful instance methods.

Index Methods: Index methods provide useful index values that show precisely where the match was found in the input string:

public int start()
    Returns the start index of the previous match.

public int start(int group)
    Returns the start index of the subsequence captured by the given group during the previous match operation.

public int end()
    Returns the offset after the last character matched.

public int end(int group)
    Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.
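As a small sketch of the index methods, the following hypothetical helper (MatchIndexDemo.span is not a library method) returns the start and end offsets of one group. The end offset is exclusive, so the pair can be passed directly to String.substring.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchIndexDemo {

    // Returns {start, end} of the given group in the first match, or null if nothing matches.
    static int[] span(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(group), m.end(group) };
    }

    public static void main(String[] args) {
        String input = "  key=value";
        int[] whole = span("(\\w+)=(\\w+)", input, 0); // group 0 is the whole match
        int[] value = span("(\\w+)=(\\w+)", input, 2);
        System.out.println(whole[0] + ".." + whole[1]);          // 2..11
        System.out.println(value[0] + ".." + value[1]);          // 6..11
        System.out.println(input.substring(value[0], value[1])); // value
    }
}
```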

Study Methods:

public boolean lookingAt()
    Attempts to match the input sequence, starting at the beginning of the region, against the pattern.

public boolean find()
    Attempts to find the next subsequence of the input sequence that matches the pattern.

public boolean find(int start)
    Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.

public boolean matches()
    Attempts to match the entire region against the pattern.
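A short sketch of find(int start), which resets the matcher and searches from the given index. The class FindFromIndexDemo and the helper indexOfMatch are assumed names for this illustration. Note that find(int) throws an IndexOutOfBoundsException if the index is negative or greater than the input length.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromIndexDemo {

    // Returns the start index of the first match at or after 'from', or -1 if there is none.
    static int indexOfMatch(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        String input = "cat cat cat";
        System.out.println(indexOfMatch("cat", input, 0)); // 0
        System.out.println(indexOfMatch("cat", input, 1)); // 4
        System.out.println(indexOfMatch("cat", input, 9)); // -1
    }
}
```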

Replacement Methods: Replacement methods are useful methods for replacing text in an input string:

public Matcher appendReplacement(StringBuffer sb, String replacement)
    Implements a non-terminal append-and-replace step.

public StringBuffer appendTail(StringBuffer sb)
    Implements a terminal append-and-replace step.

public String replaceAll(String replacement)
    Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.

public String replaceFirst(String replacement)
    Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.

public static String quoteReplacement(String s)
    Returns a literal replacement String for the specified String. This method produces a String that will work as a literal replacement s in the appendReplacement method of the Matcher class.
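The need for quoteReplacement shows up as soon as the replacement text contains a dollar sign or a backslash, since both have a special meaning in replacement strings. The sketch below assumes a hypothetical helper named replaceLiterally and sample strings chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralReplacementDemo {

    // Replaces every match of regex with the replacement text taken literally.
    static String replaceLiterally(String input, String regex, String replacement) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Unquoted, "$1.50" would be read as a reference to group 1, which this pattern lacks.
        System.out.println(replaceLiterally("Total: PRICE", "PRICE", "$1.50")); // Total: $1.50
    }
}
```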

The start and end Methods: The following example counts the number of times the word "cat" appears in the input string:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexMatches {
        private static final String PATTERN = "\\bcat\\b";
        private static final String INPUT = "cat cat cat cattie cat";

        public static void main(String[] args) {
            Pattern p = Pattern.compile(PATTERN);
            Matcher m = p.matcher(INPUT); // get a matcher object
            int count = 0;

            while (m.find()) {
                count++;
                System.out.println("Match number " + count);
                System.out.println("start(): " + m.start());
                System.out.println("end(): " + m.end());
            }
        }
    }

This would produce the following result:

    Match number 1
    start(): 0
    end(): 3
    Match number 2
    start(): 4
    end(): 7
    Match number 3
    start(): 8
    end(): 11
    Match number 4
    start(): 19
    end(): 22

You can see that this example uses word boundaries to ensure that the letters "c" "a" "t" are not merely a substring in a longer word. It also gives some useful information about where in the input string the match has occurred. The start method returns the start index of the previous match, and the end method returns the index of the last character matched, plus one.

The matches and lookingAt Methods: The matches and lookingAt methods both attempt to match an input sequence against a pattern. The difference, however, is that matches requires the entire input sequence to be matched, while lookingAt does not. Both methods always start at the beginning of the input string. Here is an example explaining the functionality:

    import java.util.regex.Matcher;

    import java.util.regex.Pattern;

    public class RegexMatches {
        private static final String REGEX = "foo";
        private static final String INPUT = "fooooooooooooooooo";
        private static Pattern pattern;
        private static Matcher matcher;

        public static void main(String[] args) {
            pattern = Pattern.compile(REGEX);
            matcher = pattern.matcher(INPUT);

            System.out.println("Current REGEX is: " + REGEX);
            System.out.println("Current INPUT is: " + INPUT);
            System.out.println("lookingAt(): " + matcher.lookingAt());
            System.out.println("matches(): " + matcher.matches());
        }
    }

This would produce the following result:

    Current REGEX is: foo
    Current INPUT is: fooooooooooooooooo
    lookingAt(): true
    matches(): false

The replaceFirst and replaceAll Methods: The replaceFirst and replaceAll methods replace text that matches a given regular expression. As their names indicate, replaceFirst replaces the first occurrence, and replaceAll replaces all occurrences. Here is an example explaining the functionality:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexMatches {

        private static String REGEX = "dog";
        private static String INPUT = "The dog says meow. " + "All dogs say meow.";
        private static String REPLACE = "cat";

        public static void main(String[] args) {
            Pattern p = Pattern.compile(REGEX);
            // get a matcher object
            Matcher m = p.matcher(INPUT);
            INPUT = m.replaceAll(REPLACE);
            System.out.println(INPUT);
        }
    }

This would produce the following result:

    The cat says meow. All cats say meow.

The appendReplacement and appendTail Methods: The Matcher class also provides appendReplacement and appendTail methods for text replacement. Here is an example explaining the functionality:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexMatches {
        private static String REGEX = "a*b";
        private static String INPUT = "aabfooaabfooabfoob";
        private static String REPLACE = "-";

        public static void main(String[] args) {
            Pattern p = Pattern.compile(REGEX);
            // get a matcher object
            Matcher m = p.matcher(INPUT);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                // append the text before the match, then the replacement
                m.appendReplacement(sb, REPLACE);
            }
            // append whatever follows the last match
            m.appendTail(sb);
            System.out.println(sb.toString());
        }
    }

This would produce the following result:

    -foo-foo-foo-

PatternSyntaxException Class Methods: A PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression pattern. The PatternSyntaxException class provides the following methods to help you determine what went wrong:

public String getDescription()
    Retrieves the description of the error.

public int getIndex()
    Retrieves the error index.

public String getPattern()
    Retrieves the erroneous regular expression pattern.

public String getMessage()
    Returns a multi-line string containing the description of the syntax error and its index, the erroneous regular expression pattern, and a visual indication of the error index within the pattern.
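A minimal sketch of how these accessors might be used to report a bad expression. The class SyntaxErrorDemo and the describeError helper are assumptions for this example; the exact description text comes from the JDK and may differ between versions, so no expected output is shown.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {

    // Returns a one-line summary of the syntax error, or null if the regex compiles.
    static String describeError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription() + " at index " + e.getIndex() + " in " + e.getPattern();
        }
    }

    public static void main(String[] args) {
        // [a-Z] is an invalid range because 'a' comes after 'Z' in Unicode order
        System.out.println(describeError("[a-Z]"));
        System.out.println(describeError("a*b")); // null, the expression is valid
    }
}
```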

Grouping and Backreference

You can group parts of your regular expression. In your pattern you group elements with round brackets, e.g. "()". This allows you to assign a repetition operator to a complete group. In addition, these groups create a backreference to that part of the regular expression; this is called capturing the group. A backreference stores the part of the String which matched the group, which allows you to use this part in the replacement. You refer to a group via $: $1 is the first group, $2 the second, etc.

For example, assume you want to remove all whitespace between a word character and a following period or comma. The period or comma has to be part of the pattern, but it should still appear in the result.

    // Removes whitespace between a word character and . or ,
    String pattern = "(\\w)(\\s+)([\\.,])";
    System.out.println(EXAMPLE_TEST.replaceAll(pattern, "$1$3"));

This example extracts the text between the tags of a title element.

    // Extract the text between the two title elements
    pattern = "(?i)(<title.*?>)(.+?)(</title>)";
    String updated = EXAMPLE_TEST.replaceAll(pattern, "$2");
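To make the $n references concrete, here is a self-contained sketch. The class BackreferenceDemo, both helper methods and the sample strings are hypothetical and only illustrate the technique.

```java
class BackreferenceDemo {

    // Drops the whitespace (group 2) and keeps the word character (group 1) and punctuation (group 3).
    static String tidyPunctuation(String text) {
        return text.replaceAll("(\\w)(\\s+)([.,])", "$1$3");
    }

    // Turns "Last, First" into "First Last" by swapping the two captured groups.
    static String swapNames(String name) {
        return name.replaceAll("(\\w+),\\s*(\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(tidyPunctuation("Hello , world .")); // Hello, world.
        System.out.println(swapNames("Lovelace, Ada"));         // Ada Lovelace
    }
}
```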

Backslashes in Java

The backslash is an escape character in Java strings, i.e. the backslash has a predefined meaning in Java. You have to use "\\" to define a single backslash. If you want to define \w, you must write "\\w" in your regex. If you want to match a literal backslash, you have to type "\\\\", because \ is also an escape character in regular expressions.

The following String methods accept a regular expression:

Method                                  Description
s.matches("regex")                      Evaluates if "regex" matches s. Returns true only if the WHOLE string can be matched.
s.split("regex")                        Creates an array with substrings of s divided at each occurrence of "regex". "regex" is not included in the result.
s.replaceAll("regex", "replacement")    Replaces every occurrence of "regex" with "replacement". (String.replace, in contrast, treats its first argument as literal text.)
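The double escaping of backslashes can be tried out with a short sketch. The class BackslashDemo, its helpers and the sample path are made up for illustration.

```java
import java.util.Arrays;

class BackslashDemo {

    // "\\\\" in Java source is the two-character regex \\ which matches one literal backslash.
    static String[] splitOnBackslash(String path) {
        return path.split("\\\\");
    }

    // "\\w+" in Java source is the regex \w+ (one or more word characters).
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        // The Java literal "C:\\temp\\file.txt" is the text C:\temp\file.txt
        System.out.println(Arrays.toString(splitOnBackslash("C:\\temp\\file.txt"))); // [C:, temp, file.txt]
        System.out.println(isWord("hello_1"));  // true
        System.out.println(isWord("hi there")); // false
    }
}
```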

    package de.vogella.regex.test;

    public class RegexTestStrings {
        public static final String EXAMPLE_TEST = "This is my small example "
                + "string which I'm going to " + "use for pattern matching.";

        public static void main(String[] args) {
            System.out.println(EXAMPLE_TEST.matches("\\w.*"));
            String[] splitString = (EXAMPLE_TEST.split("\\s+"));
            System.out.println(splitString.length); // Should be 14
            for (String string : splitString) {
                System.out.println(string);
            }
            // Replace all whitespace with tabs
            System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
        }
    }

    package de.vogella.regex.string;

    public class StringMatcher {
        // Returns true if the string matches exactly "true"
        public boolean isTrue(String s) {
            return s.matches("true");
        }

        // Returns true if the string matches exactly "true" or "True"
        public boolean isTrueVersion2(String s) {
            return s.matches("[tT]rue");
        }

        // Returns true if the string matches exactly "true" or "True"
        // or "yes" or "Yes"
        public boolean isTrueOrYes(String s) {
            return s.matches("[tT]rue|[yY]es");
        }

        // Returns true if the string contains exactly "true"
        public boolean containsTrue(String s) {
            return s.matches(".*true.*");
        }

        // Returns true if the string consists of three letters
        public boolean isThreeLetters(String s) {
            return s.matches("[a-zA-Z]{3}");
            // Shorter form of:
            // return s.matches("[a-zA-Z][a-zA-Z][a-zA-Z]");
        }

        // Returns true if the string does not have a number at the beginning
        public boolean isNoNumberAtBeginning(String s) {
            return s.matches("^[^\\d].*");
        }

        // Returns true if the string consists of an arbitrary number of
        // word characters except b
        public boolean isIntersection(String s) {
            return s.matches("([\\w&&[^b]])*");
        }

        // Returns true if the string contains a number less than 300
        public boolean isLessThenThreeHundret(String s) {
            return s.matches("[^0-9]*[12]?[0-9]{1,2}[^0-9]*");
        }
    }

And a small JUnit test to validate the examples.


    package de.vogella.regex.string;

    import org.junit.Before;
    import org.junit.Test;

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    public class StringMatcherTest {
        private StringMatcher m;

        @Before
        public void setup() {
            m = new StringMatcher();
        }

        @Test
        public void testIsTrue() {
            assertTrue(m.isTrue("true"));
            assertFalse(m.isTrue("true2"));
            assertFalse(m.isTrue("True"));
        }

        @Test
        public void testIsTrueVersion2() {
            assertTrue(m.isTrueVersion2("true"));
            assertFalse(m.isTrueVersion2("true2"));
            assertTrue(m.isTrueVersion2("True"));
        }

        @Test
        public void testIsTrueOrYes() {
            assertTrue(m.isTrueOrYes("true"));
            assertTrue(m.isTrueOrYes("yes"));
            assertTrue(m.isTrueOrYes("Yes"));
            assertFalse(m.isTrueOrYes("no"));
        }

        @Test
        public void testContainsTrue() {
            assertTrue(m.containsTrue("thetruewithin"));
        }

        @Test
        public void testIsThreeLetters() {
            assertTrue(m.isThreeLetters("abc"));
            assertFalse(m.isThreeLetters("abcd"));
        }

        @Test
        public void testisNoNumberAtBeginning() {
            assertTrue(m.isNoNumberAtBeginning("abc"));
            assertFalse(m.isNoNumberAtBeginning("1abcd"));
            assertTrue(m.isNoNumberAtBeginning("a1bcd"));
            assertTrue(m.isNoNumberAtBeginning("asdfdsf"));
        }

        @Test
        public void testisIntersection() {
            assertTrue(m.isIntersection("1"));
            assertFalse(m.isIntersection("abcksdfkdskfsdfdsf"));
            assertTrue(m.isIntersection("skdskfjsmcnxmvjwque484242"));
        }

        @Test
        public void testLessThenThreeHundret() {
            assertTrue(m.isLessThenThreeHundret("288"));
            assertFalse(m.isLessThenThreeHundret("3288"));
            assertFalse(m.isLessThenThreeHundret("328 8"));
            assertTrue(m.isLessThenThreeHundret("1"));
            assertTrue(m.isLessThenThreeHundret("99"));
            assertFalse(m.isLessThenThreeHundret("300"));
        }
    }

    package de.vogella.regex.test;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexTestPatternMatcher {
        public static final String EXAMPLE_TEST =
                "This is my small example string which I'm going to use for pattern matching.";

        public static void main(String[] args) {
            Pattern pattern = Pattern.compile("\\w+");
            // In case you would like to ignore case sensitivity you could use this
            // statement
            // Pattern pattern = Pattern.compile("\\s+", Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(EXAMPLE_TEST);
            // Check all occurrences
            while (matcher.find()) {
                System.out.print("Start index: " + matcher.start());
                System.out.print(" End index: " + matcher.end() + " ");
                System.out.println(matcher.group());
            }
            // Now create a new pattern and matcher to replace whitespace with tabs
            Pattern replace = Pattern.compile("\\s+");
            Matcher matcher2 = replace.matcher(EXAMPLE_TEST);
            System.out.println(matcher2.replaceAll("\t"));
        }
    }

Write a regular expression which matches a text line if this text line contains either the word "joe" or the word "jim" or both. Create a project de.vogella.regex.eitheror and the following class.

    package de.vogella.regex.eitheror;

    import org.junit.Test;

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    public class EitherOrCheck {
        @Test
        public void testSimpleTrue() {
            String s = "humbapumpa jim";
            assertTrue(s.matches(".*(jim|joe).*"));
            s = "humbapumpa jom";
            assertFalse(s.matches(".*(jim|joe).*"));
            s = "humbaPumpa joe";
            assertTrue(s.matches(".*(jim|joe).*"));
            s = "humbapumpa joe jim";
            assertTrue(s.matches(".*(jim|joe).*"));
        }
    }
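The pattern above is case sensitive and also accepts the names inside longer words. One possible variation, sketched here with the hypothetical class EitherOrDemo, uses Pattern.CASE_INSENSITIVE and word boundaries, and calls find() so that the leading and trailing .* are not needed.

```java
import java.util.regex.Pattern;

class EitherOrDemo {

    // \b ensures whole words only; CASE_INSENSITIVE accepts "Jim", "JOE", etc.
    private static final Pattern NAMES =
            Pattern.compile("\\b(jim|joe)\\b", Pattern.CASE_INSENSITIVE);

    static boolean mentionsJimOrJoe(String line) {
        return NAMES.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(mentionsJimOrJoe("humbapumpa Jim")); // true
        System.out.println(mentionsJimOrJoe("humbapumpa jom")); // false
        System.out.println(mentionsJimOrJoe("joey was here"));  // false
    }
}
```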

Write a regular expression which matches any phone number. A phone number in this example consists of either 7 digits in a row, or 3 digits, a whitespace character or a dash, and then 4 digits.

    package de.vogella.regex.phonenumber;

    import org.junit.Test;

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    public class CheckPhone {

        @Test
        public void testSimpleTrue() {
            // [-\\s] accepts a dash or a whitespace character as separator
            String pattern = "\\d\\d\\d([-\\s])?\\d\\d\\d\\d";
            String s = "1233323322";
            assertFalse(s.matches(pattern));
            s = "1233323";
            assertTrue(s.matches(pattern));
            s = "123 3323";
            assertTrue(s.matches(pattern));
            s = "123-3323";
            assertTrue(s.matches(pattern));
        }
    }
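The same rule can be written more compactly with the {n} quantifier instead of repeating \d. The sketch below is one possible formulation; the class PhoneNumberDemo and the method isPhoneNumber are names assumed for this example.

```java
import java.util.regex.Pattern;

class PhoneNumberDemo {

    // Three digits, an optional dash or whitespace character, then four digits.
    private static final Pattern PHONE = Pattern.compile("\\d{3}[-\\s]?\\d{4}");

    static boolean isPhoneNumber(String s) {
        // matches() requires the whole string to fit the pattern
        return PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPhoneNumber("1233323"));    // true
        System.out.println(isPhoneNumber("123-3323"));   // true
        System.out.println(isPhoneNumber("1233323322")); // false
    }
}
```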

Check for a certain number range

The following example will check if a text contains a number with 3 digits. Create the Java project "de.vogella.regex.numbermatch" and the following class.
    package de.vogella.regex.numbermatch;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.junit.Test;

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    public class CheckNumber {

        @Test
        public void testSimpleTrue() {
            String s = "1233";
            assertTrue(test(s));
            s = "0";
            assertFalse(test(s));
            s = "29 Kasdkf 2300 Kdsdf";
            assertTrue(test(s));
            s = "99900234";
            assertTrue(test(s));
        }

        // Returns true if s contains at least three consecutive digits
        public static boolean test(String s) {
            Pattern pattern = Pattern.compile("\\d{3}");
            Matcher matcher = pattern.matcher(s);
            return matcher.find();
        }
    }
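The check above also succeeds for longer digit runs such as "2300", because \d{3} only asks for three consecutive digits. If the intent is a standalone number with exactly three digits, word boundaries can be added. The following sketch assumes that stricter reading; the class ThreeDigitNumberDemo is hypothetical.

```java
import java.util.regex.Pattern;

class ThreeDigitNumberDemo {

    // \b on both sides rejects digit runs that are part of a longer number or word.
    private static final Pattern THREE_DIGITS = Pattern.compile("\\b\\d{3}\\b");

    static boolean containsThreeDigitNumber(String s) {
        return THREE_DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsThreeDigitNumber("room 101 is free"));     // true
        System.out.println(containsThreeDigitNumber("29 Kasdkf 2300 Kdsdf")); // false
        System.out.println(containsThreeDigitNumber("1233"));                 // false
    }
}
```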

Building a link checker

The following example allows you to extract all valid links from a webpage. It does not consider links which start with "javascript:" or "mailto:". Create the Java project de.vogella.regex.weblinks and the following class:
    package de.vogella.regex.weblinks;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkGetter {
        private Pattern htmltag;
        private Pattern link;
        private final String root;

        public LinkGetter(String root) {
            this.root = root;
            // Matches a complete anchor element that has an href attribute
            htmltag = Pattern.compile("<a\\b[^>]*href=\"[^>]*>(.*?)</a>");
            // Matches the href attribute when it is the last attribute of the tag
            link = Pattern.compile("href=\"[^>]*\">");
        }

        public List<String> getLinks(String url) {
            List<String> links = new ArrayList<String>();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader bufferedReader = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String s;
                StringBuilder builder = new StringBuilder();
                while ((s = bufferedReader.readLine()) != null) {
                    builder.append(s);
                }

                Matcher tagmatch = htmltag.matcher(builder.toString());
                while (tagmatch.find()) {
                    Matcher matcher = link.matcher(tagmatch.group());
                    // Skip anchors whose href is not the last attribute;
                    // calling group() without a successful find() would throw
                    if (!matcher.find()) {
                        continue;
                    }
                    String href = matcher.group().replaceFirst("href=\"", "")
                            .replaceFirst("\">", "");
                    if (valid(href)) {
                        links.add(makeAbsolute(url, href));
                    }
                }
            } catch (MalformedURLException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return links;
        }

        private boolean valid(String s) {
            return !s.matches("javascript:.*|mailto:.*");
        }

        private String makeAbsolute(String url, String link) {
            // Already absolute (http or https)
            if (link.matches("https?://.*")) {
                return link;
            }
            // Relative link, url without trailing slash: insert a slash
            if (link.matches("[^/].*") && url.matches(".*[^/]")) {
                return url + "/" + link;
            }
            // Relative link, url with trailing slash
            if (link.matches("[^/].*") && url.matches(".*[/]")) {
                return url + link;
            }
            if (link.matches("/.*") && url.matches(".*[/]")) {
                return url + link;
            }
            if (link.matches("/.*") && url.matches(".*[^/]")) {
                return url + link;
            }
            throw new RuntimeException("Cannot make the link absolute. Url: " + url
                    + " Link " + link);
        }
    }
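The href value can also be pulled out directly with a capturing group instead of matching the attribute and stripping its prefix and suffix afterwards. The following is a sketch of that alternative on an in-memory HTML string, so it needs no network access. The class HrefExtractor and the sample markup are assumptions for this example, and a regular expression like this only handles simple, well-formed anchors with double-quoted attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {

    // Group 1 captures the text between the quotes of the href attribute.
    private static final Pattern HREF =
            Pattern.compile("<a\\b[^>]*?\\bhref=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns all href values except javascript: and mailto: links.
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            if (!href.matches("(?i)(javascript|mailto):.*")) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs\" class=\"x\">Docs</a> "
                + "<a href=\"mailto:me@example.com\">Mail</a> "
                + "<a class=\"y\" href=\"http://example.com/\">Home</a>";
        System.out.println(hrefs(html)); // [/docs, http://example.com/]
    }
}
```

Unlike the two-pattern approach, this version does not require href to be the last attribute of the tag.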
