Вы находитесь на странице: 1из 35

Demystifying The Regular

Expression That Checks If A


Number Is Prime
SEPTEMBER 08, 2016
 Reading time ~24 minutes

Introduction
A while back I was researching the most efficient way to check if a number is prime. This lead me to find the
following piece of code:

public static boolean isPrime(int n) {


return !new String(new char[n]).matches(".?|(..+?)\\1+");
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
}

I was intrigued. While this might not be the most efficient way, it’s certainly one of the less obvious ones, so
my curiosity kicked in. How on Earth could a match for the .?|(..+?)\1+ regular expression tell that a
number is not prime (once it’s converted to its unary representation)?

If you’re interested, read on, I’ll try to dissect this regular expression and explain what’s really going on. The
explanation will be programming language agnostic, I will, however, provide Python , JavaScript and
Perl versions of the Java code above and explain why they are slightly different.

I will explain how the regular expression ^.?$|^(..+?)\1+$ can filter out any prime numbers. Why this one
and not .?|(..+?)\1+ (the one used in Java code example above)? Well, this has to do with the way
String.matches() works, which I’ll explain later.

While there are some blog posts on this topic, I found them to not go deep enough and give just a high level
overview, not explaining some of the important details well enough. Here, I’ll try to lay it out with enough
detail so that anyone can follow and understand. The goal is to make it simple to understand for any one -
whether you are a regular expression guru or this is the first time you’ve heard about them, anyone should
be able to follow along.

1. Prime Numbers and Regular Expressions - The


open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
Theory
Let’s start at a higher level. But wait, first, let’s get every one on the same page and begin with some
definitions. If you know know what a prime number is and are familiar with regular expression, feel free to
skip this section. I will try to explain how every bit of the regular expression works, so that even people who
are new or unfamiliar with them can follow along.

Prime Numbers

First, a prime number is any natural number greater than 1 that is only divisible by 1 and the number
itself, without leaving a remainder. Here’s a list of the fist 8 prime numbers:
2, 3, 5, 7, 11, 13, 17, 19 . For example, 5 is prime because you can only divide it by 1 and 5

without leaving a remainder. Sure we can divide it by 2 , but that would leave a remainder of 1 , since 5

= 2 * 2 + 1 . The number 4 , on the other hand, is not prime, since we can divide it by 1 , 2 and 4

without leaving a remainder.

Regular Expressions

Okay, now let’s get to the regular expression (A.K.A. regex) syntax. Now, there are quite a few regex flavors,
I’m not going to focus on any specific one, since that is not the point of this post. The concepts described
here work in a similar manner in all of the most common flavors, so don’t worry about it. If you want to learn
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
more about regular expressions, check out Regular-Expressions.info, it’s a great resource to learn regex and
later use it as a reference.

Here’s a cheatsheet with the concepts that will be needed for the explanation that follows:

^ - matches the position before the first character in the string

$ - matches the position right after the last character in the string

. - matches any character, except line break characters (for example, it does not match \n )

| - matches everything that’s either to the left or the right of it. You can think of it as an or operator.

( and ) delimit a capturing group. By placing a part of a regular expression between parentheses,
you’re grouping that part of the regular expression together. This allows you to apply quantifiers (like
+ ) to the entire group or restrict alternation (i.e. “or”: | ) to part of the regular expression. Besides
that, parentheses also create a numbered capturing group, which you can refer to later with
backreferences (more on that below)

\<number_here> - backreferences match the same text as previously matched by a capturing group.
The <number_here> is the group number (remember the discussion above? The one that says that
parentheses create a numbered capturing group? That’s where it comes in). I’ll give an example to
clarify things in a little bit, so if you’re confused, hang on!

+ - matches the preceding token (for example, it can be a character or a group of characters, if the
preceding token is a capturing group) one or more times

* - matches the preceding token zero or more times


open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
* - matches the preceding token

if ? is used after + or * quantifiers, it makes that quantifier non-greedy (more on that below)

Capturing Groups and Backreferences

As promised, let’s clarify how capturing groups and backreferences work together.

As I mentioned, parentheses create numbered capturing groups. What do I mean by that? Well, that means
that when you use parentheses, you create a group that matches some characters and you can refer to
those matched characters later on. The numbers are given to the groups in the order they appear in the
regular expression, beginning with 1 . For example, let’s say you have the following regular expression:
^aa(bb)cc(dd)$ . Note, that in this case, we have 2 groups. They are numbered as follows:

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
This means that we can refer to the characters matched by them later using backreferences. If we want to
refer to what is matched by (bb) , we use \1 (we use 1 because we’re referring to the capturing group
#1). To refer to the characters matched by (dd) we use \2 . Putting that together, the the regular
expression ^aa(bb)cc(dd)\1$ matches the string aabbccddbb . Note how we used \1 to refer to the last
bb . \1 refers to what was matched by the group (bb) , which in this case, was the sting bb .

Now note that I emphasize on what was matched. I really mean the characters that were matched and not
ones that can be matched. This means, that the regular expression ^aa(.+)cc(dd)\1$ does match the sting
aaHELLOccddHELLO , but does not match the sting aaHELLOccddGOODBYE , since it cannot find what was

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
matched by the group #1 (in this case it’s the character sequence HELLO ) after the character sequence dd

(it finds GOODBYE there).

Greedy and Non-Greedy Quantifiers

If you remember correctly, in the cheatseheet above, I mentioned that ? can be used to make the
preceding quantifier non-greedy. Well, okay, but what does that actually mean? + is greedy quantifier, this
means that it will try to repeat the preceding token as many times as possible, i.e. it will try to consume as
much input as it can. The same is true for the * quantifier.

For example, let’s say we have the string <p>The Documentary</p> (2005) and the regular expression <.+> .
Now, you might think that it will match <p> , but that’s not true. The matched string will actually be
<p>The Documentary</p> . Why is that? Well, that has to do with the fact mentioned above: the + will try
to consume as much input as it can, so that means that it will not stop at the first > , but rather at the last
one.

Now how do we go about making a quantifier non-greedy? Well, you might be already tired of hearing that
(since I’ve already mentioned it twice), but in order to make a greedy quantifier non-greedy, you put a
question mark (?) in front of it. It’s really as simple as that. In case you’re still confused, don’t worry, let’s see
an example.

Suppose we have the same string: <p>The Documentary</p> (2005) , but this time, we only want to match
what is between the first < and > . How would we go about that? Well, all we have to do is add ? in

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
front of the + . This will lead us to the <.+?> regex. “Uhhh, okay…”, you might wonder, “But what does
that actually do?”. Well, it will make the + quantifier non-greedy. This means that it will make the
quantifier consume as little input as possible. Well, in our case, the “as little as possible” is <p> , which is
exactly what we want! To be precise, it will match both of the p ’s: <p> and </p> , but we can easily get
what we want by asking for the fist match ( <p> ).

A Little Note On ^ and $

Since we’re on it, I’ll take a moment to quickly explain what the ^ and $ actually do. If you remember
correctly, ^ matches the position right before the first character in the string and $ matches the position
right after the last character in the string. Note how in both of the regular expressions above ( <.+> and
<.+?> ) we did not use them. What does that mean? Well, that means that the match does not have to
begin at the start of the string and end at the end of the string. Taking the second, non-greedy, regex
( <.+?> ) and the sting The Game - <p>The Documentary</p> (2005) , we would still obtain our expected
matches ( <p> and </p> ), since we’re not forcing it to begin at the beginning of the string and end at the
end of the string.

2. The Regular Expression That Tells If A Number Is


Prime
Phew, so we’re finally done with the theoretical introduction and now, since we’ve already have everything
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
Phew, so we’re finally done with the theoretical introduction and now, since we’ve already have everything
we need under the belt, we’re ready to dive into the analysis of how the ^.?$|^(..+?)\1+$ regular
expression can match non-prime numbers (in their unary form).

You can ignore the ? in the regular expression, it’s there for performance reasons (explained below) - it
makes the + non-greedy. If it confuses you, just ignore it and consider that the regex is actually
^.?$|^(..+)\1+$ , it works as well, but it’s slower (with some exceptions, like when the number is prime,
where the ? makes no difference whatsoever). After explaining how this regular expression works, I’ll also
explain what that ? does there, you shouldn’t have any trouble understanding it after you understand the
inner workings of this regex.

All of the discussion below assumes that we have the number represented in its unary form (or base-1, if
you prefer). It doesn’t actually have to be represented as a sequence of 1 s, it can be a sequence of any
characters that are matched by . . This means that 5 does not have to be represented as 11111 , it might
as well be represented as fffff or BBBBB . As long as there are five characters, we’re good to go. Please
note, that the characters have to be the same, no mixtures of characters are allowed, this means that we
cannot represent 5 as ffffB , since here we have a mixture of two different characters.

High Level Overview

Let’s begin with a high level overview and then dive into the details. Our ^.?$|^(..+?)\1+$ regular
expression consists of two parts: ^.?$ and ^(..+?)\1+$ .

As a heads-up, I just want to say that I’m lying a little in the explanation in the paragraph about the
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
^(..+?)\1+$ regex. The lie has to do with the order in which the regex engine checks for multiples, it
actually starts with the highest number and goes to the lowest, and not how I explain it here. But feel free to
ignore that distinction here, since the regular expression still matches the same thing, it just does it in more
steps (so I’ll actually be explaining how ^.?$|^(..+?)\1+?$ works: notice the extra ? after the + .

I’m doing this because I believe this explanation is less verbose and easier to understand. And don’t worry, I
explain how I lied and reveal the shocking truth later on, so keep on reading. Well, maybe it’s not really that
shocking, but I wanna keep you engaged, so I’ll stick to that naming.

The regex engine will first try to match ^.?$ , then, if it fails, it will try to match ^(..+?)\1+$ . Note that the
number of characters matched corresponds to the matched number, i.e. if 3 characters are matched, that
means that number 3 was matched, if 26 characters are matched, that means that the number 26 was
matched.

^.?$ matches strings with zero or one characters (corresponds to the numbers 0 and 1 , respectively).

^(..+?)\1+$ first tries to match 2 characters (corresponds to the number 2), then 4 characters
(corresponds to the number 4), then 6 characters, then 8 characters and so on. Basically it will try to match
multiples of 2. If that fails, it will try to first match 3 characters (corresponds to the number 3), then 6
characters (corresponds to the number 6), then 9 characters, then 12 characters and so on. This means that
it will try to match multiples of 3. If that fails, it proceeds to try match multiples of 4, then if that fails it will
try to match multiples of 5 and so on, until the number whose multiple it tries to match is the length of the
string (failure case) or there is a successful match (success case).
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
Diving Deeper

Note, that both of parts of the regular expression begin with a ^ symbol and end with a $ symbol, this
forces to what’s in between those symbols ( .? in the first case and (..+)\1+ in the second case) to start
at the beginning of the string and end at the end of the string. In our case that string is the unary
representation of the number. Both of the parts are separated separated by an alternation operator, this
means that either only one of them will be matched or neither will. If the number is prime, a match will not
occur. If the number is not prime a match will occur. To summarize, we concluded that:

either ^.?$ or ^(..+?)\1+$ will be matched

the match has to be on the whole string, i.e. start at the beginning of the string and end at the end of
the string

Okay, but what does each one those parts matches? Keep in mind that if a match occurs, it means that the
number is not prime.

How The ^.?$ Regular Expression Works

^.?$ will match 0 or 1 characters. This match will be successful if:

the string contains only 1 character - this means that we’re dealing with number 1 and, by
definition, 1 is not prime.

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
the string contains 0 characters - this means that we’re dealing with number 0 , and 0 is certainly
not prime, since we can divide 0 by anything we want, except for 0 itself, of course.

If we’re given the sting 1 , ^.?$ will match it, since we have only one character in our string ( 1 ). The
match will also occur if we provide an empty string, since, as explained before, ^.?$ will match either an
empty string (0 characters) or a string with only 1 character.

Okay, so far so so good, we certainly want our regex to recognize 0 and 1 as non-primes. But that’s not
enough, since there are numbers other than 0 and 1 that are not prime. This is where the second part of
the regular expression comes in.

How The ^(..+?)\1+$ Regular Expression Works

^(..+?)\1+$ will first try to match multiples of 2, then multiples of 3, then multiples of 4, then multiples 5,
then multiples of 6 and so on, until the multiple of the number it tries to match is the length of the string or
there is a successful match.

But how does it actually work? Well, let’s dissect it!

Let’s focus on the parentheses now, here we have (..+?) (remember, ? just makes this expression non-
greedy). Notice that we have a + here, which means “one or more of the preceding token”. This regex will
first try to match (..) (2 characters), then (...) (3 characters), then (....) (4 characters), and so on,
until the length of the string we’re matching against is reached or there is a successful match.

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
After matching for some number of characters (let’s call that number x , the regular expression will try to
see if the string’s length is multiple of x . How does it do that? Well, there’s a backreference. This takes us
to the second part of the regex: \1+ . Now, as explained before this will try to repeat the match in
capturing group #1 one or more times (actually it’s more “more or one times, I’m lying a little bit”) This
means that first, it will try to match x * 2 characters in the string, then x * 3 , then x * 4 , and so on. If
it succeeds in any of those matches, it returns it (and this means that the number is not prime). If it fails (it
will fail when x * <number> exceeds the length of the string we’re matching against), it will try the same
thing, but with x+1 characters, i.e, first (x+1) * 2 , then (x+1) * 3 , then (x+1) * 4 and so on (because
now the \1+ backreference refers to x+1 characters). If the number of characters matched by (..+?)

reaches the length of the string we’re matching against, the regex matching process will stop and return a
failure. If there is a successful match, it will be returned.

Example Time

Now, I’ll sketch some examples to make sure you got everything. I will provide one example where a regular
expression succeeds to match and one where it fails to match. Again, I’m lying in the order of sub-steps (the
nested ones, i.e the ones that have a . , like 2.1 , 3.2 , etc), just a little.

As an example of where a match succeeds, let’s consider the string 111111 . The length of the string we’re
matching against is 6 . Now, 6 is not a prime number, so we expect the regex to succeed with the match.
Let’s see a sketch of how it will work:

1. It will try to match ^.?$ . No luck. The left side of | returns a failure 2. It try to match ^(..+?)\1+$

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
(the right side of | ). It begins with (..+?) matching 11 :

2.1 The backreference \1+ will try to match 11 twice (i.e 1111 ). No luck.

2.2 The backreference \1+ will try to match 11 trice (i.e 111111 ). Success!. Right side of |

returns success

Woah, that was fast! Since the right side of | succeeded, our regular expression succeeds with the match,
which means our number is not prime.

As an example of where a match fails, let’s consider the string 11111 . The length of the string we’re
matching against is 5 . Now, 5 is a prime number, so we expect the regex to fail to match anything. Let’s see
a sketch of how it will work:

1. It will try to match ^.?$ . No luck. The left side of | returns a failure 2. It try to match ^(..+?)\1+$

(the right side of | ). It begins with (..+?) matching 11 :

2.1 The backreference \1+ will try to match 11 twice (i.e 1111 ). No luck.

2.2 The backreference \1+ will try to match 11 trice (i.e 111111 ). No luck. Length of string
exceeded (6 > 5). Backreference returns a failure.

3. (..+?) now matches 111 :

3.1 The backreference \1+ will try to match 111 twice (i.e 111111 ). No luck. Length of string
exceeded (6 > 5). Backreference returns a failure.
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
4. (..+?) now matches 1111 :

4.1 The backreference \1+ will try to match 1111 twice (i.e 11111111 ). No luck. Length of string
exceeded (8 > 5). Backreference returns a failure.

5. (..+?) now matches 11111 :

5.1 The backreference \1+ will try to match 11111 twice (i.e 1111111111 ). No luck. Length of string
exceeded (10 > 5). Backreference returns a failure.

5. (..+?) will try to match 1111111 . No luck. Length of string exceeded (6 > 5). (..+?) returns a failure.
The right side of | returns a failure

Now since both sides of | failed to match anything, the regular expression fails to match anything, which
means our number is prime.

What About The ?

Well, I mentioned that you can ignore the ? symbol in the regular expression, since it’s there only for
performance reasons, and that’s true, but there is no need to keep its purpose a mystery, so I’ll explain
what it actually does there.

As mentioned before, ? makes the preceding + non-greedy. What does it mean in practice? Let’s say our
string is 111111111111111 (corresponds to the number 15). Let’s call L the length of the string. In our case,
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
L=15 .

With the ? present there, + will try to match its preceding token (in this case . ) as few times as
possible. This means that first (..+?) will try to match .. , then ... , then .... and then ..... , after
which our whole regex ( ^.?$|^(..+?)\1+$ ) would succeed. So first, we’ll be testing the divisibility by 2,
then by 3, then by 4 and then by 5, after which we would have a match. Notice that the number of steps in
(..+?) was 4 (first it matches 2, then 3, then 4 and then 5).

If we omitted the ? , i.e if we had (..+) , then it would go the other way around: first it would try to match
............... (the number 15, which is our L ), then .............. (the number 14, i.e L-1 ), and so
on until ..... , after which the whole regex would succeed. Notice that even though the result was the
same as in (..+?) , in (..+) the number of steps was 11 instead of 4. By definition, any divisor of L must
be no greater than L/2, so that means that means that 8 steps were absolutely wasted computation, since
first we tested the divisibility by 15, then 14, then 13, and so on until 5 (we could only hope for a match from
number 7 and downwards, since L/2 = 15/2 = 7.5 and the first integer smaller than 7.5 is 7 ).

The Shocking Lie

As I mentioned before, I actually lied in the explanation of how the multiples of a number are matched. Let’s
say we have the string 111111111111111 (number 15).

The way I explained it before was that the regular expression would begin to test for divisibility by 2 . It
would do so by first trying to match 2*2 characters, then 2*3 , then 2*4 , then 2*5 , then 2*6 , then
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
2*7 , after which it would fail to match 2*8 , so it would try its luck with testing for divisibility by 3 , by
first trying to match for 3*2 characters, then for 3*3 characters, then for 3*4 and then for 3*5 , where
it would succeed. This is actually what would happen if the regular expression was ^.?$|^(..+?)\1+?$

(notice the ? at the end), i.e., if the + following the backreference was non-greedy.

What actually happens is the opposite. It would still try to test for the divisibility by 2 , first, but instead of
trying to match for 2*2 characters, it would begin with trying to match for 2*7 , then for 2*6 , then for
2*5 , then for 2*4 , then for 2*3 and then for 2*2 , after which it would fail and, once again, try its luck
with divisibility by 3 , by first trying to match for 3*5 characters, where it would succeed right away.

Notice, that in the second case, which is what happens in reality, less steps are required: 11 in the first case
vs 7 in the second (in reality, both of the cases would require more steps than presented here, the goal of
this explanation is not count them all, but to transmit the idea of what’s happening in both cases, it’s just a
sketch of what’s going on under the hood). While both versions are equivalent, the one explained in this
blog post, is more efficient.

3. The Java Case


Here’s the piece of Java code that started all of this:

public static boolean isPrime(int n) {


return !new String(new char[n]).matches(".?|(..+?)\\1+");

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
}

If you remember correctly, I said that due to the peculiarities of the way String.matches works in Java, the
regular expression that matches non-prime numbers is not the one in the code example above
( .?|(..+?)\1+ ), but it’s actually ^.?$|^(..+?)\1+$ . Why? Well, turns out String.matches() matches on
the whole string, not on any substring of the string. Basically, it “automatically inserts” all of the ^ and $

present in the regex I explained in this post.

If you’re looking for a way not to force the match on the whole string in Java, you can use Pattern, Matcher
and Matcher.find() method.

Other than that, it’s pretty much self explanatory: if the match succeeds, then the number is not prime. In
case of a successful match, String.matches() returns true (number is not prime), otherwise, it return
false (number is prime), so to obtain the desired functionality we negate what the method returns.

new String(new char[n]) returns a String of n null characters (the . in our regex matches them).

4. Code Examples
Now, as promised, it’s time for some code examples!

Java
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
Although I already presented this code example twice in this post, I’ll do it here again, just to keep it
organized.

public static boolean isPrime(int n) {


return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

Python

I’ve expressed my sympathy for Python before, so of course I have to include this one here.

def is_prime(n):
return not re.match(r'^.?$|^(..+?)\1+$', '1'*n)

JavaScript
Java is to JavaScript as car is to carpet.

That’s a joke I like. I didn’t come up with it and I don’t really know its first source, so I don’t know whom to
credit. Anyways, I’m actually going to give you two versions here, one which works in ES6 and one that
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
works in previous versions.

First, the ECMAScript 6 version:

function isPrime(n) {
var re = /^.?$|^(..+?)\1+$/;
return !re.test('1'.repeat(n));
}

The feature that’s only available in ECMAScript 6 is the String.prototype.repeat() method.

If you gotta use previous versions of ES, you can always fall back to Array.prototype.join(). Note, however,
that we’re passing n+1 to join() , since it actually places those characters in between array elements. So
if we have, let’s say, 10 array elements, there are only 9 “in-betweens”. Here’s the version that will work
in versions prior to ECMAScript 6:

function isPrime(n) {
var re = /^.?$|^(..+?)\1+$/;
return !re.test(Array(n+1).join('1'));
}

Perl
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
Last, but not least, it’s time for Perl. I’m including this here because the regular expression we’ve been
exploring in this blog post has been popularized by Perl. I’m talking about the one-liner
perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' <number> (replace <number> with an
actual number).

Also, since I haven’t played around with Perl before, this seemed like a good opportunity to do so. So here
we go:

sub is_prime {
return !((1x$_[0]) =~ /^.?$|^(..+?)\1+$/);
}

Since Perl isn’t the most popular language right now, it might happen that you’re not familiar with its syntax.
Now, I’ve had about 15 mins with it, so I’m pretty much an expert, so I’ll take the liberty to briefly explain the
syntax above:

sub - defines a new subroutine (function)

$_[0] - we’re accessing the first parameter passed in to our subroutine

1x<number> - here we’re using the repetition operator x , this will basically repeat the number 1

<number> of times and return the result as a string. This is similar to what '1'*<number> would do in
Python or '1'.repeat(<number>) in JavaScript.

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
=~ is the match test operator, it will return true if the regular expression (its right-hand side) has a
match on the string (its left-hand side).

! is the negation operator

I included this brief explanation, because, I myself, don’t like being left in mystery about what a certain
passage of code does and the explanation didn’t take up much space anyways.

Conclusion
That’s all folks! Hopefully, you’re now demystified about how a regular expression can check if a number is
prime. Keep in mind, that this is far from efficient, there are a lot more efficient algorithms for this task, but it
is, nonetheless, a fun and interesting thing.

I encourage you to go to a website like regex101 and play around, specially if you’re still not 100% clear
about how everything explained here works. One of the cool things about this website is that it includes an
explanation of the regular expression (column on the right), as well as the number of steps the regex engine
had to make (rectangle right above the modifiers box) - it’s a good way to see the performance differences
(through the number of steps taken) in the greedy and non-greedy cases.

If you have any questions or suggestions, feel free to post them in the comment section below or get in
touch with me via a different medium.

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
EDIT:

Thanks to joshuamy for pointing out a typo in Perl code

Thanks to Keen for pointing out a typo in the post

Thanks to Russel for submitting a Swift 2 code example

I didn’t want to get into the topic of regular/non-regular languages and related, since it’s theory that
isn’t crucial for the topic of this post, but as lanzaa pointed out, there is a difference between “regex”
and “regular expression”. What was covered in this blog post wasn’t a regular expression, but rather a
regex. In the “real world”, however (outside of academia), those terms are used interchangeably

REGULAR EXPRESSIONS EXPLANATION

 LIKE  TWEET  +1

 Follow @iluxonchik

24 Comments iluxonchik 
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
Sort by Be
 Recommend 3 ⤤ Share

Join the discussion…

Kivi Shapiro • 21 hours ago


I like to say Javascript is to Java as eggplant is to egg. They're both food, and one of them is kind of named after th
other based on external similarities, but after that they're really nothing alike.
3△ ▽ • Reply • Share ›

ILUXONCHIK Mod > Kivi Shapiro • 8 hours ago


Haha.

"one of them is kind of named after the other based on external similarities"

- Actually, if I'm not wrong JavaScript was named after Java just because of Java's popularity, so that
JavaScript would get some fame off the "Java" in its name :)
3△ ▽ • Reply • Share ›

Tiger-222 • 21 minutes ago


Awesome post, thanks! A little typo into the first section "1. Prime Numbers and Regular Expressions - The Theor
"know know" ;)
△ ▽ • Reply • Share ›

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
Hey,
thanks for explanation,
I have a question:
If 2*7 don't match why he test for 2*6 because we have ^$ ?

Sorry for my english


△ ▽ • Reply • Share ›

Xavier Noria • 4 hours ago


The original author of this regexp is Perl hacker Abigail. It has many years.

This regexp is totally brilliant. It solves this problem in an unexpected way, because you think numbers, but the
solution thinks strings. Not only that, the paradox, or its beauty, is that it goes to the root of the number abstraction
counting sticks.

I explored this technique some years later to solve a bunch of other problems, coprimality, prime factorization,
Euler's phi, continued fractions, etc. See https://github.com/fxn/math-wi....
△ ▽ • Reply • Share ›

joelonsql • 14 hours ago


Why is PostgreSQL so slow?

explain analyze select x.prime from (select * from generate_series(1,500) as prime) as x where not
repeat('1',x.prime) ~ '^.?$|^(..+?)\1+$';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series prime (cost=0.00..15.00 rows=848 width=4) (actual time=0.079..2341.630

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
rows=95 loops=1)
Filter: (repeat('1'::text, prime) !~ '^.?$|^(..+?)\1+$'::text)
Total runtime: 2341.671 ms
(3 rows)

The Perl version for 1..500 only takes 13 ms, compared to 2341 ms for the PostgreSQL version.
△ ▽ • Reply • Share ›

ILUXONCHIK Mod > joelonsql • 8 hours ago


Well, one thing is that DBs are optimized for fast retrieval/storage of data, not code execution. Another one
that PostgreSQL has a lot more to it than Perl does. Now, I'm not familiar with the way PostgreSQL works
internally, so I'm speculating here a bit, but when you execute a query in PostgreSQL, it goes through man
steps such as transmiting the query to the DB backend, parsing the query, executing it, sending it back to
client.

Another thing is that generate_query(), most probably does more things than something like range() in Pyth
would do. Does it create a table?
△ ▽ • Reply • Share ›

Kadir • 15 hours ago


Nice article, thanks :)
△ ▽ • Reply • Share ›

ILUXONCHIK Mod > Kadir • 8 hours ago


You're welcome :)
△ ▽ • Reply • Share ›

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
viebel • 17 hours ago
Wonderful post.

Did you know that you could make the Javascript code snippets interactive in your article with the KLIPSE plugin?

https://github.com/viebel/klip...
△ ▽ • Reply • Share ›

ILUXONCHIK Mod > viebel • 8 hours ago


Thank you and thanks for the link!
△ ▽ • Reply • Share ›

Russell • 20 hours ago


Nice post.

Here it is in Swift 2 (not so pretty, I tried to keep it short):

func isPrime(n: Int) -> Bool {


do {
return try NSRegularExpression(
pattern: "^.?$|^(..+?)\\1+$",
options: []).matchesInString(
String(count: n, repeatedValue: Character("1")),
options: [],
range: NSMakeRange(0, n)
).count == 0
} catch {
return false
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
}
}

or in one ugly line:

func isPrime(n:Int)->Bool{do{return try NSRegularExpression(pattern:"^.?


$|^(..+?)\\1+$",options:[]).matchesInString(String(count:n,repeatedValue:Character("1")),options:[],range:NSMakeR
false}}

I would be interested in any improvements.


△ ▽ • Reply • Share ›

ILUXONCHIK Mod > Russell • 8 hours ago


Thank you!

I will include a link to your example in the end of the post.


1△ ▽ • Reply • Share ›

Keen • 20 hours ago


Thanks for the great explanation! I really enjoyed reading it.

There's a slight typo:


"Well, all we have to do is add ? in front of the +. This will lead us to the <.+?> regex."
The '?' is added *after* the '+'.
△ ▽ • Reply • Share ›

ILUXONCHIK Mod > Keen • 8 hours ago


You're welcome and thank you. Someone also pointed it out on HN, but this is actually exactly what I wante
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
to write. by "in front" I mean, "in front" in human terms. I guess you interpreted it in the FIFO queue context
△ ▽ • Reply • Share ›

Rajesh Rao • a day ago


This is amazing!! Could you also expand on where this method stands in terms of efficiency? is this better than
dividing each integer from 1 to sqrt(n)? it might not be more efficient than regular primality testing but is it better fo
checking larger primes?
△ ▽ • Reply • Share ›

ILUXONCHIK Mod > Rajesh Rao • 8 hours ago


I'm glad you liked it! Well, to give you a precise mathematical comparison, I'd have to take a look at how a
particular regex engine is implemented, but it's almost certainly not more efficient than diving each integer
from 1 to sqrt(n).

Note, that in regex case you're actually doing the same thing but with more work. On top of all of the implic
divisions you have the sting manipulation part and that is a lot slower than division. Computers are relative
fast at doing divisions, while regex would involve allocation of memory (for strings).

String allocation is just one part of the story. Notice that there is also some trial and error involved (in regex
while a division is just, well, it's just a division - it does not involve any trial and error.

Also note that in the "square root algorithm" you only check up to the square root of the number, while in
regex's case we go further than that. This plus the fact that regex involves strings will make a huge differen
once the numbers become large.

In fact, the larger the number gets, the slower the regex version is compared to the "square root algorithm
one".

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
I was actually thinking of doing some speed comparison in another blog post. This is probably a good idea
since you're not the first one to speculate on this topic :)
△ ▽ • Reply • Share ›

Feilong Ma • a day ago


Cool!
The first example has a tiny typo, there are actually 7 1's in the sentence:
"As an example of where a match succeeds, let’s consider the string 1111111."
△ ▽ • Reply • Share ›

ILUXONCHIK Mod > Feilong Ma • 8 hours ago


Woah, great catch! Thank you for pointing out, it'll be fixed :)
△ ▽ • Reply • Share ›

Java Developer • a day ago


Nice read. You did a really good job keeping this simple, however, I still struggled a lot on this. May need to read it
more times. Any recommendation on learning regex - I have used Regex in past but I am beginner and know very
little about how it works. Mostly used expressions suggested in stackoverflow site so it will be great if I can know h
I can create my own regex pattern.

I am far from understanding the performance aspect of it so I skipped through most of it.

Thanks.
△ ▽ • Reply • Share ›

ILUXONCHIK Mod > Java Developer • a day ago


Thanks for the feedback! Here's a general learning tip (backed up by science): don't just re-read it, try
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
reflecting, quizzing yourself on the content. And don't worry, struggle is part of learning :)

Well, I'm not really sure from where I learned regex, there wasn't only one source. I've encountered it in so
books and in "the wild", so I went Googling around to understand it. I've also dealt with regex in some of my
courses (mainly Discrete Mathematics and Compilers).

I would probably suggest you reading a structured guide, that begins with the basics and then builds up on
them. There are many good resources out there, try looking up on YouTube, Google some books, really
anything to get you started. Even if you don't like reading books, just read a few chapters so that you get th
basics, then you can always Google for the rest of the stuff for yourself. There are also some interactive
tutorials out there.

This webisite definitely helped me a lot: https://regex101.com/


It allows you to visualize what's being matched by what.
I also used this webisite as a reference quite a few times: http://www.regular-expressions.... As you can se
it includes works as a tutorial, so read through it :)

Here's another idea: start working on some side-project that involves regex. Something that you like.

Oh and if you have any questions regarding the blog post (or regex, for that matter), feel free to ask. I'll be
glad to help ;)
△ ▽ • Reply • Share ›

Java Developer > ILUXONCHIK • a day ago


Thanks for your quick response. I am starting with some YouTube videos and then will try books.

Its mainly me being lazy about it. Most of the time my work is done with finding a pattern so I surviv
However, I must learn it sometime.

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
Procrastination.

In fact I did blog about (in 2014) some tools for people like me here - http://bit.ly/2cHgiMW

Though none of the tools could teach me basics. I always ended up copy/pasting.

I just noticed that your blog url is difficult to remember, however I keep stumbling back to your articl
from variety of sources and comment thinking its first time. I just noticed my comment way back in
design patterns article.

Thanks again. Keep in touch.


△ ▽ • Reply • Share ›

ILUXONCHIK Mod > Java Developer • 9 hours ago


Oh wow, that's an extensive list :)

Oh, that's nice :) Well, I suppose the difficulty of the url comes from the "iluxonchik" part :P
That's like a play on my first name, like a nickname, it comes from Russian. In Russian, the
letter "X" sounds like "kh", I decided to keep the letter unchanged (technically it's a different
letter, but it's similar).

You're welcome, and sure thing, let's keep in touch :)


△ ▽ • Reply • Share ›

Java Developer > ILUXONCHIK • 5 hours ago


Nice to know about your blog name background.

Inspired by you and your tuts, I watched a YouTube video from Derek Banas and got
some basics now. I actually made a playlist too - for my learning -
open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com

However, I just noticed that the syntax convention is little different in each language.
see more

△ ▽ • Reply • Share ›

ALSO ON ILUXONCHIK

Why You Should Learn Python Design Patterns Notes - An Overview Of Design
57 comments • 18 days ago• Patterns
pvg — One of the things that makes python particularly 2 comments • 4 months ago•
useful for various scripting tasks is that it comes with Flux — This is perhaps the clearest & most practical
'batteries included'. The … synopsis of design patterns which I've come
to mention the excellent …

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
✉ Subscribe d Add Disqus to your site Privacy

Read More

Why You Should Learn Python


Why Python is awesome and you should at least give it a try. Continue reading

Design Patterns Notes - An Overview Of


Design Patterns
Published on September 03, 2015

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com
© 2016 iluxonchik. Powered by Jekyll using the HPSTR Theme.

open in browser PRO version Are you a developer? Try out the HTML to PDF API pdfcrowd.com

Вам также может понравиться