
Demystifying The Regular Expression That Checks If A Number Is Prime
SEPTEMBER 08, 2016
Reading time ~24 minutes

Introduction
A while back I was researching the most efficient way to check if a number is prime. This led me to find the following piece of code:

public static boolean isPrime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

I was intrigued. While this might not be the most efficient way, it's certainly one of the less obvious ones, so my curiosity kicked in. How on Earth could a match for the .?|(..+?)\1+ regular expression tell that a number is not prime (once it's converted to its unary representation)?
If you're interested, read on. I'll try to dissect this regular expression and explain what's really going on. The explanation will be programming language agnostic. I will, however, provide Python, JavaScript and Perl versions of the Java code above and explain why they are slightly different.
I will explain how the regular expression ^.?$|^(..+?)\1+$ can filter out any prime numbers. Why this one and not .?|(..+?)\1+ (the one used in the Java code example above)? Well, this has to do with the way String.matches() works, which I'll explain later.

While there are some blog posts on this topic, I found that they don't go deep enough: they give just a high level overview and don't explain some of the important details well enough. Here, I'll try to lay it out with enough detail so that anyone can follow and understand. The goal is to make it simple to understand for anyone - whether you are a regular expression guru or this is the first time you've heard about them, you should be able to follow along.

1. Prime Numbers and Regular Expressions - The Theory


Let's start at a higher level. But wait, first, let's get everyone on the same page and begin with some definitions. If you know what a prime number is and are familiar with regular expressions, feel free to skip this section. I will try to explain how every bit of the regular expression works, so that even people who are new to or unfamiliar with them can follow along.

Prime Numbers
First, a prime number is any natural number greater than 1 that is only divisible by 1 and the number itself, without leaving a remainder. Here's a list of the first 8 prime numbers: 2, 3, 5, 7, 11, 13, 17, 19. For example, the number 5 is prime because you can only divide it by 1 and 5 without leaving a remainder. Sure, we can divide it by 2, but that would leave a remainder of 1, since 5 = 2*2 + 1. The number 4, on the other hand, is not prime, since we can divide it by 2 without leaving a remainder.

Regular Expressions
Okay, now let's get to the regular expression (A.K.A. regex) syntax. Now, there are quite a few regex flavors. I'm not going to focus on any specific one, since that is not the point of this post. The concepts described here work in a similar manner in all of the most common flavors, so don't worry about it. If you want to learn more about regular expressions, check out Regular-Expressions.info, it's a great resource to learn regex and later use it as a reference.
Here's a cheatsheet with the concepts that will be needed for the explanation that follows:
^ - matches the position before the first character in the string
$ - matches the position right after the last character in the string
. - matches any character, except line break characters (for example, it does not match \n)
| - matches everything that's either to the left or the right of it. You can think of it as an "or" operator.
( and ) - delimit a capturing group. By placing a part of a regular expression between parentheses, you're grouping that part of the regular expression together. This allows you to apply quantifiers (like +) to the entire group or restrict alternation (i.e. "or": |) to part of the regular expression. Besides that, parentheses also create a numbered capturing group, which you can refer to later with backreferences (more on that below)
\<number_here> - backreferences match the same text as previously matched by a capturing group. The <number_here> is the group number (remember the discussion above? The one that says that parentheses create a numbered capturing group? That's where it comes in). I'll give an example to clarify things in a little bit, so if you're confused, hang on!
+ - matches the preceding token (for example, it can be a character or a group of characters, if the preceding token is a capturing group) one or more times
* - matches the preceding token zero or more times
? - if it is used after the + or * quantifiers, it makes that quantifier non-greedy (more on that below)

Capturing Groups and Backreferences


As promised, let's clarify how capturing groups and backreferences work together.
As I mentioned, parentheses create numbered capturing groups. What do I mean by that? Well, that means that when you use parentheses, you create a group that matches some characters and you can refer to those matched characters later on. The numbers are given to the groups in the order they appear in the regular expression, beginning with 1. For example, let's say you have the following regular expression: ^aa(bb)cc(dd)$. Note that in this case we have 2 groups. They are numbered as follows: (bb) is group #1 and (dd) is group #2.

This means that we can refer to the characters matched by them later using backreferences. If we want to refer to what is matched by (bb), we use \1 (we use 1 because we're referring to the capturing group #1). To refer to the characters matched by (dd) we use \2. Putting that together, the regular expression ^aa(bb)cc(dd)\1$ matches the string aabbccddbb. Note how we used \1 to refer to the last bb. \1 refers to what was matched by the group (bb), which, in this case, was the string bb.

Now note that I put the emphasis on what was matched. I really mean the characters that were matched and not the ones that can be matched. This means that the regular expression ^aa(.+)cc(dd)\1$ does match the string aaHELLOccddHELLO, but does not match the string aaHELLOccddGOODBYE, since it cannot find what was matched by the group #1 (in this case it's the character sequence HELLO) after the character sequence dd (it finds GOODBYE there).
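If you'd like to see the "what was matched" rule in action, here's a minimal sketch using Python's built-in re module. The variable names are mine, the pattern and both strings come straight from the paragraph above.

```python
import re

# Group #1 is (.+), group #2 is (dd); \1 must repeat whatever group #1 captured.
pattern = re.compile(r'^aa(.+)cc(dd)\1$')

hello = pattern.match('aaHELLOccddHELLO')
goodbye = pattern.match('aaHELLOccddGOODBYE')

print(hello.group(1))  # HELLO
print(hello.group(2))  # dd
print(goodbye)         # None, because GOODBYE is not what group #1 captured
```

The second string fails even though GOODBYE could be matched by .+ on its own: the backreference only accepts the exact text captured earlier.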

Greedy and Non-Greedy Quantifiers


If you remember correctly, in the cheatsheet above, I mentioned that ? can be used to make the preceding quantifier non-greedy. Well, okay, but what does that actually mean? + is a greedy quantifier. This means that it will try to repeat the preceding token as many times as possible, i.e. it will try to consume as much input as it can. The same is true for the * quantifier.
For example, let's say we have the string <p>TheDocumentary</p>(2005) and the regular expression <.+>. Now, you might think that it will match <p>, but that's not true. The matched string will actually be <p>TheDocumentary</p>. Why is that? Well, that has to do with the fact mentioned above: the + will try to consume as much input as it can, so that means that it will not stop at the first >, but rather at the last one.

Now how do we go about making a quantifier non-greedy? Well, you might be already tired of hearing that (since I've already mentioned it twice), but in order to make a greedy quantifier non-greedy, you put a question mark (?) right after it. It's really as simple as that. In case you're still confused, don't worry, let's see an example.
Suppose we have the same string: <p>TheDocumentary</p>(2005), but this time, we only want to match what is between the first < and >. How would we go about that? Well, all we have to do is add ? right after the +. This will lead us to the <.+?> regex. "Uhhh, okay", you might wonder, "But what does that actually do?". Well, it will make the + quantifier non-greedy. This means that it will make the quantifier consume as little input as possible. Well, in our case, the "as little as possible" is <p>, which is exactly what we want! To be precise, it will match both of the <p>s: <p> and </p>, but we can easily get what we want by asking for the first match (<p>).
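Here's a small sketch of the greedy and non-greedy behavior, assuming Python's re module (any common flavor should behave the same way for these two patterns). The variable names are just for illustration.

```python
import re

text = '<p>TheDocumentary</p>(2005)'

# Greedy: .+ grabs as much as it can, so the match ends at the LAST '>'.
greedy = re.search(r'<.+>', text).group(0)

# Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST '>'.
lazy_first = re.search(r'<.+?>', text).group(0)

# Asking for every non-greedy match gives both tags.
lazy_all = re.findall(r'<.+?>', text)

print(greedy)      # <p>TheDocumentary</p>
print(lazy_first)  # <p>
print(lazy_all)    # ['<p>', '</p>']
```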

A Little Note On ^ and $


Since we're on it, I'll take a moment to quickly explain what the ^ and $ actually do. If you remember correctly, ^ matches the position right before the first character in the string and $ matches the position right after the last character in the string. Note how in both of the regular expressions above (<.+> and <.+?>) we did not use them. What does that mean? Well, that means that the match does not have to begin at the start of the string and end at the end of the string. Taking the second, non-greedy, regex (<.+?>) and the string TheGame<p>TheDocumentary</p>(2005), we would still obtain our expected matches (<p> and </p>), since we're not forcing it to begin at the beginning of the string and end at the end of the string.
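To see what the anchors change, here's a quick sketch in Python. The anchored pattern ^<.+?>$ is my own addition for contrast, it does not appear in the text above.

```python
import re

text = 'TheGame<p>TheDocumentary</p>(2005)'

# Without anchors, matches may start and end anywhere in the string.
unanchored = re.findall(r'<.+?>', text)

# With ^ and $, the match must span the whole string, which starts with 'T'.
anchored = re.search(r'^<.+?>$', text)

# On a string that does start with '<' and end with '>', the anchors force
# even the non-greedy .+? to stretch all the way to the end.
stretched = re.search(r'^<.+?>$', '<p>TheDocumentary</p>').group(0)

print(unanchored)  # ['<p>', '</p>']
print(anchored)    # None
print(stretched)   # <p>TheDocumentary</p>
```

The last line is worth a second look: non-greedy means "as little as possible while still letting the overall match succeed", not "always the shortest piece".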

2. The Regular Expression That Tells If A Number Is Prime
Phew, so we're finally done with the theoretical introduction and now, since we already have everything we need under the belt, we're ready to dive into the analysis of how the ^.?$|^(..+?)\1+$ regular expression can match non-prime numbers (in their unary form).

You can ignore the ? in the regular expression, it's there for performance reasons (explained below) - it makes the + non-greedy. If it confuses you, just ignore it and consider that the regex is actually ^.?$|^(..+)\1+$, it works as well, but it's slower (with some exceptions, like when the number is prime, where the ? makes no difference whatsoever). After explaining how this regular expression works, I'll also explain what that ? does there. You shouldn't have any trouble understanding it after you understand the inner workings of this regex.


All of the discussion below assumes that we have the number represented in its unary form (or base-1, if you prefer). It doesn't actually have to be represented as a sequence of 1s, it can be a sequence of any characters that are matched by the . token. This means that 5 does not have to be represented as 11111, it might as well be represented as fffff or BBBBB. As long as there are five characters, we're good to go. Please note that the characters have to be the same, no mixtures of characters are allowed. This means that we cannot represent 5 as ffffB, since here we have a mixture of two different characters.
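Here's a minimal sketch of why the characters have to be the same, assuming Python's re module. The helper name regex_matches is hypothetical, and I'm using 4 (a non-prime) for the mixed string, because that's where a mixture visibly breaks things.

```python
import re

NOT_PRIME = re.compile(r'^.?$|^(..+?)\1+$')

def regex_matches(unary):
    """True when the regex matches, i.e. the length is reported as NOT prime."""
    return NOT_PRIME.match(unary) is not None

print(regex_matches('1111'))   # True:  4 is not prime
print(regex_matches('ffff'))   # True:  same number, different character
print(regex_matches('ffBB'))   # False: \1 captured 'ff' and cannot match 'BB'
print(regex_matches('fffff'))  # False: 5 is prime
```

With a mixed string the backreference has nothing identical to repeat, so 4 would be wrongly reported as prime.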

High Level Overview


Let's begin with a high level overview and then dive into the details. Our ^.?$|^(..+?)\1+$ regular expression consists of two parts: ^.?$ and ^(..+?)\1+$

As a heads-up, I just want to say that I'm lying a little in the explanation in the paragraph about the ^(..+?)\1+$ regex. The lie has to do with the order in which the regex engine checks for multiples: it actually starts with the highest number and goes to the lowest, and not how I explain it here. But feel free to ignore that distinction here, since the regular expression still matches the same thing, the version I describe just does it in more steps (so I'll actually be explaining how ^.?$|^(..+?)\1+?$ works: notice the extra ? after the last +). I'm doing this because I believe this explanation is less verbose and easier to understand. And don't worry, I explain how I lied and reveal the shocking truth later on, so keep on reading. Well, maybe it's not really that shocking, but I wanna keep you engaged, so I'll stick to that naming.
The regex engine will first try to match ^.?$, then, if it fails, it will try to match ^(..+?)\1+$. Note that the number of characters matched corresponds to the matched number, i.e. if 3 characters are matched, that means that the number 3 was matched, if 26 characters are matched, that means that the number 26 was matched.

^.?$ matches strings with zero or one characters (corresponds to the numbers 0 and 1 respectively).
^(..+?)\1+$ first tries to match 2 characters (corresponds to the number 2), then 4 characters (corresponds to the number 4), then 6 characters, then 8 characters and so on. Basically it will try to match multiples of 2. If that fails, it will first try to match 3 characters (corresponds to the number 3), then 6 characters (corresponds to the number 6), then 9 characters, then 12 characters and so on. This means that it will try to match multiples of 3. If that fails, it proceeds to try to match multiples of 4, then if that fails it will try to match multiples of 5 and so on, until the number whose multiple it tries to match is the length of the string (failure case) or there is a successful match (success case).
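Put differently, the regex behaves like plain trial division. Here's a sketch that checks this claim for small numbers, assuming Python's re module. The function names are mine, and the trial-division loop is a stand-in I wrote for comparison, not something the regex engine literally runs.

```python
import re

NOT_PRIME = re.compile(r'^.?$|^(..+?)\1+$')

def is_prime_regex(n):
    """Prime when the regex fails to match the unary form of n."""
    return NOT_PRIME.match('1' * n) is None

def is_prime_trial(n):
    """Same question, asked the boring way: try every divisor from 2 to n-1."""
    if n < 2:          # 0 and 1 are what ^.?$ takes care of
        return False
    for x in range(2, n):
        if n % x == 0:  # n is some multiple of x, like (..+?)\1+ finding a fit
            return False
    return True

# Both approaches agree on every number in this (small) range.
print(all(is_prime_regex(n) == is_prime_trial(n) for n in range(200)))  # True
```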

Diving Deeper

Note that both parts of the regular expression begin with a ^ symbol and end with a $ symbol. This forces what's in between those symbols (.? in the first case and (..+?)\1+ in the second case) to start at the beginning of the string and end at the end of the string. In our case that string is the unary representation of the number. Both of the parts are separated by an alternation operator, which means that either only one of them will be matched or neither will. If the number is prime, a match will not occur. If the number is not prime, a match will occur. To summarize, we concluded that:
either ^.?$ or ^(..+?)\1+$ will be matched
the match has to be on the whole string, i.e. start at the beginning of the string and end at the end of the string
Okay, but what does each one of those parts match? Keep in mind that if a match occurs, it means that the number is not prime.

How The ^.?$ Regular Expression Works


^.?$ will match 0 or 1 characters. This match will be successful if:
the string contains only 1 character - this means that we're dealing with the number 1 and, by definition, 1 is not prime.
the string contains 0 characters - this means that we're dealing with the number 0, and 0 is certainly not prime, since we can divide 0 by anything we want, except for 0 itself, of course.
If we're given the string 1, ^.?$ will match it, since we have only one character in our string (1). The match will also occur if we provide an empty string, since, as explained before, ^.?$ will match either an empty string (0 characters) or a string with only 1 character.

Okay, so far so good, we certainly want our regex to recognize 0 and 1 as non-primes. But that's not enough, since there are numbers other than 0 and 1 that are not prime. This is where the second part of the regular expression comes in.
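A quick check of the first part on its own, sketched with Python's re module (the variable names are only for illustration):

```python
import re

left_side = re.compile(r'^.?$')

zero = left_side.match('')     # unary 0: empty string
one = left_side.match('1')     # unary 1: a single character
two = left_side.match('11')    # unary 2: too long for .?

print(zero is not None)  # True
print(one is not None)   # True
print(two is not None)   # False
```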

How The ^(..+?)\1+$ Regular Expression Works


^(..+?)\1+$ will first try to match multiples of 2, then multiples of 3, then multiples of 4, then multiples of 5, then multiples of 6 and so on, until the number whose multiples it tries to match is the length of the string or there is a successful match.
But how does it actually work? Well, let's dissect it!
Let's focus on the parentheses now. Here we have (..+?) (remember, ? just makes this expression non-greedy). Notice that we have a + here, which means "one or more of the preceding token". This regex will first try to match (..) (2 characters), then (...) (3 characters), then (....) (4 characters), and so on, until the length of the string we're matching against is reached or there is a successful match.

After matching for some number of characters (let's call that number x), the regular expression will try to see if the string's length is a multiple of x. How does it do that? Well, there's a backreference. This takes us to the second part of the regex: \1+. Now, as explained before, this will try to repeat the match in capturing group #1 one or more times (actually it's "more or one times", I'm lying a little bit). This means that first, it will try to match x*2 characters in the string, then x*3, then x*4, and so on. If it succeeds in any of those matches, it returns it (and this means that the number is not prime). If it fails (it will fail when x*<number> exceeds the length of the string we're matching against), it will try the same thing, but with x+1 characters, i.e. first (x+1)*2, then (x+1)*3, then (x+1)*4 and so on (because now the \1+ backreference refers to x+1 characters). If the number of characters matched by (..+?) reaches the length of the string we're matching against, the regex matching process will stop and return a failure. If there is a successful match, it will be returned.
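One way to peek at x is to look at what capturing group #1 holds after a successful match. Here's a sketch, assuming Python's re module. The helper names are hypothetical, and smallest_divisor is just a plain loop I wrote for comparison.

```python
import re

NOT_PRIME = re.compile(r'^.?$|^(..+?)\1+$')

def captured_x(n):
    """Length of group #1 after matching the unary form of n, or None."""
    m = NOT_PRIME.match('1' * n)
    if m is None or m.group(1) is None:  # no match, or ^.?$ did the matching
        return None
    return len(m.group(1))

def smallest_divisor(n):
    """Smallest divisor of n in the range 2..n-1, or None if there is none."""
    for x in range(2, n):
        if n % x == 0:
            return x
    return None

print(captured_x(6))   # 2
print(captured_x(15))  # 3
print(captured_x(49))  # 7
print(captured_x(5))   # None (5 is prime, nothing matched)
```

Since the group grows from 2 characters upwards, the x it ends up with is the smallest divisor of the number.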

Example Time
Now, I'll sketch some examples to make sure you got everything. I will provide one example where a regular expression succeeds to match and one where it fails to match. Again, I'm lying in the order of sub-steps (the nested ones, i.e. the ones that have a ., like 2.1, 3.2, etc), just a little.
As an example of where a match succeeds, let's consider the string 111111. The length of the string we're matching against is 6. Now, 6 is not a prime number, so we expect the regex to succeed with the match. Let's see a sketch of how it will work:
1. It will try to match ^.?$. No luck. The left side of | returns a failure
2. It will try to match ^(..+?)\1+$ (the right side of |). It begins with (..+?) matching 11:
2.1 The backreference \1+ will try to match 11 twice (i.e. 1111). No luck.
2.2 The backreference \1+ will try to match 11 thrice (i.e. 111111). Success! Right side of | returns success

Woah, that was fast! Since the right side of | succeeded, our regular expression succeeds with the match, which means our number is not prime.


As an example of where a match fails, let's consider the string 11111. The length of the string we're matching against is 5. Now, 5 is a prime number, so we expect the regex to fail to match anything. Let's see a sketch of how it will work:

1. It will try to match ^.?$. No luck. The left side of | returns a failure
2. It will try to match ^(..+?)\1+$ (the right side of |). It begins with (..+?) matching 11:
2.1 The backreference \1+ will try to match 11 twice (i.e. 1111). No luck.
2.2 The backreference \1+ will try to match 11 thrice (i.e. 111111). No luck. Length of string exceeded (6 > 5). Backreference returns a failure.
3. (..+?) now matches 111
3.1 The backreference \1+ will try to match 111 twice (i.e. 111111). No luck. Length of string exceeded (6 > 5). Backreference returns a failure.
4. (..+?) now matches 1111
4.1 The backreference \1+ will try to match 1111 twice (i.e. 11111111). No luck. Length of string exceeded (8 > 5). Backreference returns a failure.
5. (..+?) now matches 11111
5.1 The backreference \1+ will try to match 11111 twice (i.e. 1111111111). No luck. Length of string exceeded (10 > 5). Backreference returns a failure.
6. (..+?) will try to match 111111. No luck. Length of string exceeded (6 > 5). (..+?) returns a failure. The right side of | returns a failure

Now since both sides of | failed to match anything, the regular expression fails to match anything, which means our number is prime.
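The two walkthroughs above can be replayed with a few lines of Python. Keep in mind this is a sketch of the simplified model used in this section (including the little lie about the order of sub-steps), not a trace of what a real regex engine does. The function name sketch_trace and the outcome labels are made up for this example.

```python
def sketch_trace(n):
    """Replay the simplified walkthrough for the number n.

    Returns (matched, steps), where each step is a tuple
    (group_length, copies, outcome).
    """
    steps = []
    if n <= 1:
        return True, steps  # the left side, ^.?$, matches right away
    for x in range(2, n + 1):       # (..+?) matches 2, 3, 4, ... characters
        copies = 2                  # the group itself plus one \1
        while x * copies <= n:
            if x * copies == n:
                steps.append((x, copies, 'match'))
                return True, steps
            steps.append((x, copies, 'no luck'))
            copies += 1
        steps.append((x, copies, 'too long'))  # length of string exceeded
    return False, steps

print(sketch_trace(6))
# (True, [(2, 2, 'no luck'), (2, 3, 'match')])
print(sketch_trace(5))
# (False, [(2, 2, 'no luck'), (2, 3, 'too long'), (3, 2, 'too long'),
#          (4, 2, 'too long'), (5, 2, 'too long')])
```

The tuples line up with the sub-steps 2.1, 2.2, 3.1, 4.1 and 5.1 from the examples.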

What About The ?


Well, I mentioned that you can ignore the ? symbol in the regular expression, since it's there only for performance reasons, and that's true, but there is no need to keep its purpose a mystery, so I'll explain what it actually does there.
As mentioned before, ? makes the preceding + non-greedy. What does it mean in practice? Let's say our string is 111111111111111 (corresponds to the number 15). Let's call L the length of the string. In our case, L=15.

With the ? present there, + will try to match its preceding token (in this case .) as few times as possible. This means that first (..+?) will try to match .. and then ..., after which our whole regex (^.?$|^(..+?)\1+$) would succeed, since 15 = 3*5. So first, we'll be testing the divisibility by 2 and then by 3, after which we would have a match. Notice that the number of steps in (..+?) was 2 (first it matches 2, then 3).
If we omitted the ?, i.e. if we had (..+), then it would go the other way around: first it would try to match ............... (the number 15, which is our L), then .............. (the number 14, i.e. L-1), and so on until ....., after which the whole regex would succeed. Notice that even though the result was the same as in (..+?), in (..+) the number of steps was 11 instead of 2. By definition, any divisor of L other than L itself must be no greater than L/2, so that means that 8 steps were absolutely wasted computation, since first we tested the divisibility by 15, then 14, then 13, and so on until 5 (we could only hope for a match from number 7 and downwards, since L/2=15/2=7.5 and the first integer smaller than 7.5 is 7).
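Here's a sketch comparing the two variants for L=15, assuming Python's re module. The first half asks the real regex engine what group #1 ended up capturing. The second half, group_sizes_tried, is a hypothetical helper that only models the order in which group sizes are attempted, it is not an engine step counter.

```python
import re

unary = '1' * 15

# What group #1 captures with and without the ?.
lazy_group = re.match(r'^(..+?)\1+$', unary).group(1)
greedy_group = re.match(r'^(..+)\1+$', unary).group(1)

print(len(lazy_group))    # 3 (smallest divisor, found going upwards from 2)
print(len(greedy_group))  # 5 (largest proper divisor, found going downwards from 15)

def group_sizes_tried(n, lazy):
    """Group sizes attempted, in order, until one divides n at least twice."""
    sizes = range(2, n + 1) if lazy else range(n, 1, -1)
    tried = []
    for x in sizes:
        tried.append(x)
        if n % x == 0 and n // x >= 2:
            break
    return tried

print(group_sizes_tried(15, lazy=True))        # [2, 3]
print(len(group_sizes_tried(15, lazy=False)))  # 11 (sizes 15 down to 5)
```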

The Shocking Lie


As I mentioned before, I actually lied in the explanation of how the multiples of a number are matched. Let's say we have the string 111111111111111 (number 15).

The way I explained it before was that the regular expression would begin to test for divisibility by 2. It would do so by first trying to match 2*2 characters, then 2*3, then 2*4, then 2*5, then 2*6, then 2*7, after which it would fail to match 2*8, so it would try its luck with testing for divisibility by 3, by first trying to match for 3*2 characters, then for 3*3 characters, then for 3*4 characters and then for 3*5, where it would succeed. This is actually what would happen if the regular expression was ^.?$|^(..+?)\1+?$ (notice the ? at the end), i.e., if the + following the backreference was non-greedy.

What actually happens is the opposite. It would still try to test for the divisibility by 2 first, but instead of trying to match for 2*2 characters, it would begin with trying to match for 2*7, then for 2*6, then for 2*5, then for 2*4, then for 2*3 and then for 2*2, after which it would fail and, once again, try its luck with divisibility by 3, by first trying to match for 3*5 characters, where it would succeed right away.

Notice that in the second case, which is what happens in reality, fewer steps are required: 11 in the first case vs 7 in the second (in reality, both of the cases would require more steps than presented here. The goal of this explanation is not to count them all, but to transmit the idea of what's happening in both cases, it's just a sketch of what's going on under the hood). While both versions are equivalent, the one this blog post is about (with the greedy + after the backreference) is more efficient.
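The 11 vs 7 count can be reproduced with a small simulation. To be clear, this follows the simplified counting used in this section, it does not count real regex engine steps, and sketch_steps is a name I made up for this sketch.

```python
def sketch_steps(n, lazy_backref):
    """Count the attempts x*k made in the simplified model for the number n.

    lazy_backref=True  models ^(..+?)\\1+?$ (multiples tried going upwards)
    lazy_backref=False models ^(..+?)\\1+$  (multiples tried going downwards)
    """
    steps = 0
    for x in range(2, n + 1):
        if lazy_backref:
            total = 2 * x
            while True:
                steps += 1            # try to match `total` characters
                if total == n:
                    return steps
                if total > n:         # overshoot, e.g. 2*8 for n = 15
                    break
                total += x
        else:
            total = (n // x) * x      # largest multiple of x that fits
            while total >= 2 * x:
                steps += 1
                if total == n:
                    return steps
                total -= x
    return steps

print(sketch_steps(15, lazy_backref=True))   # 11
print(sketch_steps(15, lazy_backref=False))  # 7
```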

3. The Java Case


Here's the piece of Java code that started all of this:

public static boolean isPrime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

If you remember correctly, I said that due to the peculiarities of the way String.matches() works in Java, the regular expression that matches non-prime numbers is not the one in the code example above (.?|(..+?)\1+), but it's actually ^.?$|^(..+?)\1+$. Why? Well, turns out String.matches() matches on the whole string, not on any substring of the string. Basically, it "automatically inserts" all of the ^ and $ present in the regex I explained in this post.

If you're looking for a way not to force the match on the whole string in Java, you can use Pattern, Matcher and the Matcher.find() method.
Other than that, it's pretty much self explanatory: if the match succeeds, then the number is not prime. In case of a successful match, String.matches() returns true (number is not prime), otherwise it returns false (number is prime), so to obtain the desired functionality we negate what the method returns.

new String(new char[n]) returns a String of n null characters (the . in our regex matches them).
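Python happens to have a close analogue of this whole-string behavior, re.fullmatch(), so the same point can be sketched without Java. This is only an analogy on my part: it illustrates the idea of whole-string matching, not how String.matches() is implemented.

```python
import re

def not_prime_fullmatch(n):
    # No ^ or $ in the pattern: fullmatch() itself insists on the whole string.
    return re.fullmatch(r'.?|(..+?)\1+', '1' * n) is not None

def not_prime_anchored(n):
    # Explicit anchors with the ordinary match().
    return re.match(r'^.?$|^(..+?)\1+$', '1' * n) is not None

def not_prime_unanchored_search(n):
    # No anchors and no whole-string requirement: .? happily matches anywhere.
    return re.search(r'.?|(..+?)\1+', '1' * n) is not None

print(all(not_prime_fullmatch(n) == not_prime_anchored(n) for n in range(60)))  # True
print(not_prime_unanchored_search(7))  # True, which would wrongly call 7 non-prime
```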

4. Code Examples
Now, as promised, it's time for some code examples!

Java
Although I already presented this code example twice in this post, I'll do it here again, just to keep it organized.

public static boolean isPrime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}

Python
I've expressed my sympathy for Python before, so of course I have to include this one here.

import re

def is_prime(n):
    return not re.match(r'^.?$|^(..+?)\1+$', '1' * n)

JavaScript

Java is to JavaScript as car is to carpet.
That's a joke I like. I didn't come up with it and I don't really know its first source, so I don't know whom to credit. Anyways, I'm actually going to give you two versions here, one which works in ES6 and one that works in previous versions.
First, the ECMAScript 6 version:

function isPrime(n) {
    var re = /^.?$|^(..+?)\1+$/;
    return !re.test('1'.repeat(n));
}

The feature that's only available in ECMAScript 6 is the String.prototype.repeat() method.

If you gotta use previous versions of ES, you can always fall back to Array.prototype.join(). Note, however, that we're creating an array of n+1 elements, since join() actually places those characters in between array elements. So if we have, let's say, 10 array elements, there are only 9 "in-betweens". Here's the version that will work in versions prior to ECMAScript 6:

function isPrime(n) {
    var re = /^.?$|^(..+?)\1+$/;
    return !re.test(Array(n + 1).join('1'));
}
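The "in-betweens" arithmetic is easy to check. Here's the same trick sketched in Python, where str.join() also places the separator only between elements. The function name unary_via_join is hypothetical.

```python
def unary_via_join(n):
    # n + 1 empty strings have exactly n gaps between them,
    # and join() puts one '1' into each gap.
    return '1'.join([''] * (n + 1))

print(unary_via_join(9))       # 111111111 (10 elements, 9 in-betweens)
print(len(unary_via_join(9)))  # 9
```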

Perl
Last, but not least, it's time for Perl. I'm including this here because the regular expression we've been exploring in this blog post has been popularized by Perl. I'm talking about the one-liner
perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' <number>
(replace <number> with an actual number).
Also, since I haven't played around with Perl before, this seemed like a good opportunity to do so. So here we go:

sub is_prime {
    return !((1 x $_[0]) =~ /^.?$|^(..+?)\1+$/);
}

Since Perl isn't the most popular language right now, it might happen that you're not familiar with its syntax. Now, I've had about 15 mins with it, so I'm pretty much an expert, so I'll take the liberty to briefly explain the syntax above:

sub - defines a new subroutine (function)
$_[0] - we're accessing the first parameter passed in to our subroutine
1 x <number> - here we're using the repetition operator x, this will basically repeat the number 1 <number> of times and return the result as a string. This is similar to what '1' * <number> would do in Python or '1'.repeat(<number>) in JavaScript.
=~ is the match test operator, it will return true if the regular expression (its right-hand side) has a match on the string (its left-hand side).
! is the negation operator

I included this brief explanation because I myself don't like being left in mystery about what a certain passage of code does, and the explanation didn't take up much space anyways.

Conclusion
That's all folks! Hopefully, the way a regular expression can check if a number is prime is now demystified for you. Keep in mind that this is far from efficient, there are a lot more efficient algorithms for this task, but it is, nonetheless, a fun and interesting thing.
I encourage you to go to a website like regex101 and play around, especially if you're still not 100% clear about how everything explained here works. One of the cool things about this website is that it includes an explanation of the regular expression (column on the right), as well as the number of steps the regex engine had to make (rectangle right above the modifiers box) - it's a good way to see the performance differences (through the number of steps taken) in the greedy and non-greedy cases.
If you have any questions or suggestions, feel free to post them in the comment section below or get in touch with me via a different medium.
EDIT:
Thanks to joshuamy for pointing out a typo in Perl code
Thanks to Keen for pointing out a typo in the post
Thanks to Russel for submitting a Swift 2 code example
I didn't want to get into the topic of regular/non-regular languages and related, since it's theory that isn't crucial for the topic of this post, but as lanzaa pointed out, there is a difference between regex and regular expression. What was covered in this blog post wasn't a regular expression, but rather a regex. In the real world, however (outside of academia), those terms are used interchangeably


© 2016 iluxonchik.