Introduction
A while back I was researching the most efficient way to check if a number is prime. This led me to find the following piece of code:
public static boolean isPrime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}
I was intrigued. While this might not be the most efficient way, it's certainly one of the less obvious ones, so my curiosity kicked in. How on Earth could a match for the .?|(..+?)\1+ regular expression tell that a number is not prime (once it's converted to its unary representation)?
If you're interested, read on. I'll try to dissect this regular expression and explain what's really going on. The explanation will be programming language agnostic. I will, however, provide Python, JavaScript and Perl versions of the Java code above and explain why they are slightly different.
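I will explain how the regular expression ^.?$|^(..+?)\1+$ works. Why this one and not .?|(..+?)\1+, the one used in the Java code above? That has to do with the way String.matches() works in Java, which I cover later on.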
While there are some blog posts on this topic, I found that they don't go deep enough: they give just a high-level overview and don't explain some of the important details well enough. Here, I'll try to lay it out with enough detail so that anyone can follow and understand, whether you are a regular expression guru or this is the first time you've heard about them.
Prime Numbers
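First, a prime number is any natural number greater than 1 that is divisible only by 1 and by itself. For example, 2 is prime, since it can only be divided by 1 and 2 without leaving a remainder. The number 4, on the other hand, is not prime, since it can be divided by 1, 2 and 4. Here are the first few prime numbers: 2, 3, 5, 7, 11 and 13.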
Regular Expressions
Okay, now let's get to the regular expression (a.k.a. regex) syntax. There are quite a few regex flavors, and I'm not going to focus on any specific one, since that is not the point of this post. The concepts described here work in a similar manner in all of the most common flavors, so don't worry about it. If you want to learn more about regular expressions, check out Regular-Expressions.info, it's a great resource to learn regex and later use as a reference.
Here's a cheatsheet with the concepts that will be needed for the explanation that follows:
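^ - matches the position right before the first character in the string
$ - matches the position right after the last character in the string
. - matches any character, except line break characters (for example, it does not match \n)
| - matches everything that's either to the left or the right of it. You can think of it as an "or" operator.
( and ) - by placing part of a regular expression inside parentheses, you're grouping that part of the regular expression together. This allows you to apply quantifiers (like +) to part of the regular expression. Besides that, parentheses also create a numbered capturing group, which you can refer to later with backreferences (more on that below)
\<number_here> - backreference, matches the same text that was matched by a capturing group. The <number_here> is the group number (remember the part that says that parentheses create a numbered capturing group? That's where it comes in). I'll give an example to clarify things in a little bit, so if you're confused, hang on!
+ - matches the preceding token (for example, it can be a character or a group of characters, if the preceding token is a capturing group) one or more times
* - matches the preceding token zero or more times
? - on its own, matches the preceding token zero or one time; if ? is used after the + or * quantifiers, it makes that quantifier non-greedy (more on that below)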
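As promised, let's clarify capturing groups and backreferences with an example. In the regular expression ^aa(bb)cc(dd)$ the parentheses create two capturing groups: (bb) is group #1 and (dd) is group #2.
This means that we can refer to the characters matched by them later using backreferences. If we want to refer to what is matched by (bb), we use \1 (we use \1 because (bb) is capturing group #1). Similarly, to refer to what is matched by (dd) we use \2. Putting that together, the regular expression ^aa(bb)cc(dd)\1$ matches the string aabbccddbb. Note how \1 stands for the bb matched by (bb), which is why the string has to end in bb.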
Now note that I emphasize what was matched. I really mean the characters that were matched and not the ones that can be matched. This means that the regular expression ^aa(.+)cc(dd)\1$ does match the string aaHELLOccddHELLO, but it does not match the string aaHELLOccddGOODBYE, since it cannot find what was matched by group #1 (in this case it's the character sequence HELLO) after the character sequence dd (it finds GOODBYE there).
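To see the "what was matched" rule in action, here is a minimal Python sketch using the standard re module. The variable name pattern is just an illustrative choice; the regex and the two strings are the ones from the paragraph above.

```python
import re

# \1 must repeat the exact text captured by group #1,
# not merely something that the (.+) pattern could match.
pattern = re.compile(r"^aa(.+)cc(dd)\1$")

hello_match = pattern.match("aaHELLOccddHELLO")
goodbye_match = pattern.match("aaHELLOccddGOODBYE")

print(hello_match is not None)    # True
print(hello_match.group(1))       # HELLO
print(goodbye_match is not None)  # False
```

The second string fails even though GOODBYE would be a perfectly fine match for .+ on its own, because by that point group #1 already holds HELLO.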
Earlier I mentioned that a ? placed after a quantifier makes the preceding quantifier non-greedy. Well, okay, but what does that actually mean? + is a greedy quantifier, this means that it will try to repeat the preceding token as many times as possible, i.e. it will try to consume as much input as it can. The same is true for the * quantifier.
For example, let's say we have the string <p>TheDocumentary</p>(2005) and the regular expression <.+>. You might expect the match to be <p>, but it will actually be <p>TheDocumentary</p>. The reason is the one given above: the + will try to consume as much input as it can, so that means that it will not stop at the first >, but at the last one.
Now how do we go about making a quantifier non-greedy? Well, you might already be tired of hearing that (since I've already mentioned it twice), but in order to make a greedy quantifier non-greedy, you put a question mark (?) right after it. It's really as simple as that. In case you're still confused, don't worry, let's see an example.
Suppose we have the same string: <p>TheDocumentary</p>(2005), and we only want to match what sits between the first < and >. To do that, we add ? right after the +, which gives us the regular expression <.+?>.
But what does that actually do? Well, it will make the + quantifier non-greedy, i.e. it will make the quantifier consume as little input as possible. In our case, "as little as possible" gives two matches, <p> and </p>, but we can easily get what we want by asking for the first match (<p>).
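The difference between the greedy and the non-greedy quantifier is easy to check for yourself. Below is a small Python sketch with the re module; the variable names are my own, the string and both regexes come from the text above.

```python
import re

text = "<p>TheDocumentary</p>(2005)"

# Greedy: .+ consumes as much as it can, so the match ends at the LAST ">".
greedy = re.search(r"<.+>", text).group(0)

# Non-greedy: .+? consumes as little as it can, so each match ends at the
# first ">" that follows its "<".
lazy_first = re.search(r"<.+?>", text).group(0)
lazy_all = re.findall(r"<.+?>", text)

print(greedy)      # <p>TheDocumentary</p>
print(lazy_first)  # <p>
print(lazy_all)    # ['<p>', '</p>']
```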
One more note on ^ and $: ^ matches the position right before the first character in the string and $ matches the position right after the last character in the string. Note how in both of the regular expressions above (<.+> and <.+?>) we did not use them. What does that mean? Well, that means that the match does not have to begin at the start of the string and end at the end of the string. Taking the second, non-greedy, regex (<.+?>) and the string TheGame<p>TheDocumentary</p>(2005), we still get the expected matches (<p> and </p>), since we're not forcing the match to begin at the beginning of the string and end at the end of the string.
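Here is a short Python sketch of what the anchors change, assuming the same example string. The names unanchored, anchored_on_text and anchored_on_tag are only for illustration.

```python
import re

text = "TheGame<p>TheDocumentary</p>(2005)"

# Without ^ and $ the matches may sit anywhere inside the string.
unanchored = re.findall(r"<.+?>", text)

# With ^ and $ the match has to span the whole string, so it fails here:
# the string does not even start with "<".
anchored_on_text = re.search(r"^<.+?>$", text)

# The anchored regex only succeeds when the entire string is one tag.
anchored_on_tag = re.search(r"^<.+?>$", "<p>")

print(unanchored)                    # ['<p>', '</p>']
print(anchored_on_text is None)      # True
print(anchored_on_tag is not None)   # True
```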
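Now that the basics are covered, let's see how the ^.?$|^(..+?)\1+$ regular expression can match non-prime numbers (in their unary form).
You may have noticed the ? inside (..+?): it is there to make the + non-greedy. If it confuses you, just ignore it and consider that the regex is ^.?$|^(..+)\1+$, it works as well, but it's slower (with some exceptions).
As for the unary form: a number is represented by a string with that many characters. To represent the number 5, the string doesn't have to be made of 1s, it can be 11111, fffff or BBBBB. As long as there are five characters, we're good to go. Please note that the characters have to be the same, no mixtures of characters are allowed. This means that we cannot represent 5 as ffffB, since here two different characters are mixed.
The regular expression has two parts, ^.?$ and ^(..+?)\1+$, joined by the alternation operator into the ^.?$|^(..+?)\1+$ regular expression.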
As a heads-up, I just want to say that I'm lying a little in the explanation in the paragraph about the ^(..+?)\1+$ regex. The lie has to do with the order in which the regex engine checks for multiples: it actually starts with the highest number and goes to the lowest, and not how I explain it here. But feel free to ignore that distinction here, since the regular expression still matches the same thing, it just does it in more steps (so I'll actually be explaining how ^.?$|^(..+?)\1+?$ works, notice the extra ? after the last +).
I'm doing this because I believe this explanation is less verbose and easier to understand. And don't worry, I explain how I lied and reveal the shocking truth later on, so keep on reading. Well, maybe it's not really that shocking, but I wanna keep you engaged, so I'll stick to that naming.
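The regex engine will first try to match ^.?$, and if that fails, it will try to match ^(..+?)\1+$. Note that the number of characters matched corresponds to the matched number, i.e. if 3 characters are matched, that means that number 3 was matched, and if 26 characters are matched, that means that the number 26 was matched.
^.?$ matches strings with zero or one characters (these correspond to the numbers 0 and 1 respectively).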
^(..+?)\1+$ first captures 2 characters and tries to match 4 characters (corresponds to the number 4), then 6 characters, then 8 characters and so on. Basically it will try to match multiples of 2 (other than 2 itself, since the captured group has to be repeated at least once). If that fails, it will capture 3 characters and try to match 6 characters (corresponds to the number 6), then 9 characters, then 12 characters and so on. This means that it will try to match multiples of 3. If that fails, it proceeds to try to match multiples of 4, then if that fails it will try to match multiples of 5 and so on, until the number whose multiple it tries to match is the length of the string (failure case) or there is a successful match (success case).
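If it helps, the description above can be written down as a plain loop. This is only a sketch of the idea, not how a regex engine is implemented: is_non_prime_loop is a hypothetical helper of mine, and the assertion simply checks that it agrees with the regex for small numbers.

```python
import re

NON_PRIME = re.compile(r"^.?$|^(..+?)\1+$")


def is_non_prime_regex(n):
    """Run the regex from this post against the unary form of n."""
    return NON_PRIME.match("1" * n) is not None


def is_non_prime_loop(n):
    """Hypothetical loop that mirrors what the regex checks."""
    if n < 2:                      # mirrors ^.?$ (the numbers 0 and 1)
        return True
    for group_len in range(2, n):  # mirrors (..+?): 2, 3, 4, ... characters
        if n % group_len == 0:     # mirrors \1+$: repeats fill the string exactly
            return True
    return False


# Both approaches agree on every number we try here.
for n in range(60):
    assert is_non_prime_regex(n) == is_non_prime_loop(n)

print([n for n in range(30) if not is_non_prime_loop(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```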
Diving Deeper
You've probably noticed the ^ and $ in both parts of the regular expression. They force the match (.? in the first case, (..+?)\1+ in the second case) to start at the beginning of the string and end at the end of the string. In our case that string is the unary representation of the number. Both of the parts are separated by an alternation operator, this means that either only one of them will be matched or neither will. If the number is prime, a match will not occur. If the number is not prime, a match will occur. To summarize, we concluded that:
- either ^.?$ or ^(..+?)\1+$ will be matched
- the match has to be on the whole string, i.e. start at the beginning of the string and end at the end of the string
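Okay, but what does each one of those parts match? Keep in mind that if a match occurs, it means that the number is not prime.
^.?$ matches a string of zero or one characters, so a match occurs in one of two cases:
- the string contains 1 character - this means that we're dealing with number 1 and, by definition, 1 is not prime.
- the string contains 0 characters - this means that we're dealing with number 0, and 0 is certainly not prime, since we can divide 0 by anything other than 0 itself, of course.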
If we're given the string 1, ^.?$ will match it, since we have only one character in our string (1). The match will also occur if we provide an empty string, since, as explained before, ^.?$ will match either an empty string or a string with a single character.
So far we have correctly identified 0 and 1 as non-primes. But what about the rest of the numbers? That's the job of ^(..+?)\1+$: it will first try to match multiples of 2, then multiples of 3, then multiples of 4, then multiples of 5, then multiples of 6 and so on, until the multiple of the number it tries to match is the length of the string or there is a successful match.
But how does it actually work? Well, let's dissect it!
Let's focus on the parentheses now, here we have (..+?) (remember, the ? just makes the + non-greedy). This means that it will first match (..) (2 characters), then (...) (3 characters), then (....) (4 characters), and so on, until the length of the string we're matching against is reached.
\1+ will try to repeat the match in capturing group #1 one or more times (actually it's more "more or one times", I'm lying a little bit). Say that (..+?) matched x characters. This means that first, it will try to match x*2 characters in the string, then x*3, then x*4, and so on. If any of those tries succeeds, a match is returned (this means that the number is not prime). If it fails (it will fail when x*<number> exceeds the length of the string we're matching against), it will try the same thing, but with x+1 characters, i.e. first (x+1)*2, then (x+1)*3, then (x+1)*4 and so on, since (..+?) now matches x+1 characters and the \1+ backreference refers to those x+1 characters. If the number of characters matched by (..+?) reaches the length of the string we're matching against, the regex matching process will stop and return a failure. If there is a successful match, it will be returned.
Example Time
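Now, I'll sketch some examples to make sure you got everything. I will provide one example where the regular expression succeeds to match and one where it fails to match. Again, I'm lying in the order of sub-steps (the nested ones, i.e. the ones that have a . in their number, like 2.1 or 3.2).
First, let's take the string 111111 (number 6). Since 6 is not prime, the regular expression should succeed with the match. Let's see a sketch of how it will work:
1. It will try to match ^.?$. No luck. Left side of | returns a failure
2. It will try to match ^(..+?)\1+$ (the right side of |). It begins with (..+?) matching 11:
2.1 \1+ tries to match 11 twice (i.e. 1111). No luck.
2.2 \1+ tries to match 11 thrice (i.e. 111111). Success! Right side of | returns success
Since the right side of | succeeded, the whole regular expression matches, so 6 is not prime.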
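Now let's take the string 11111 (number 5). Since 5 is prime, the regular expression should fail to match. Here's a sketch:
1. It will try to match ^.?$. No luck. Left side of | returns a failure
2. It will try to match ^(..+?)\1+$ (the right side of |). It begins with (..+?) matching 11:
2.1 \1+ tries to match 11 twice (i.e. 1111). No luck.
2.2 \1+ tries to match 11 thrice (i.e. 111111). No luck. Length of string exceeded.
3. (..+?) now matches 111:
3.1 \1+ tries to match 111 twice (i.e. 111111). No luck. Length of string exceeded.
4. (..+?) now matches 1111:
4.1 \1+ tries to match 1111 twice (i.e. 11111111). No luck. Length of string exceeded.
5. (..+?) now matches 11111:
5.1 \1+ tries to match 11111 twice (i.e. 1111111111). No luck. Length of string exceeded.
6. (..+?) cannot match more characters than the string contains, so (..+?) returns a failure. Right side of | returns a failure
Since both sides of | failed, the whole regular expression returns a failure, so 5 is prime.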
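The two walkthroughs can be reproduced with a tiny simulation. To be clear, this does not hook into a real regex engine: trace_attempts is a hypothetical helper that follows the simplified order of sub-steps used in the examples, recording the group length, the number of repeats and the total number of characters for each try.

```python
def trace_attempts(n):
    """Simulate the right side of | in the simplified order of the walkthrough.

    Returns (attempts, matched), where each attempt is a tuple
    (group_len, repeats, total_chars).
    """
    attempts = []
    for group_len in range(2, n + 1):      # (..+?) grows: 2, 3, 4, ...
        repeats = 2                        # the group plus at least one \1
        while True:
            total = group_len * repeats
            attempts.append((group_len, repeats, total))
            if total == n:                 # the repeats fill the string exactly
                return attempts, True
            if total > n:                  # length of string exceeded
                break
            repeats += 1
    return attempts, False


print(trace_attempts(6))
# ([(2, 2, 4), (2, 3, 6)], True)
print(trace_attempts(5))
# ([(2, 2, 4), (2, 3, 6), (3, 2, 6), (4, 2, 8), (5, 2, 10)], False)
```

The attempts for 6 line up with sub-steps 2.1 and 2.2, and the attempts for 5 line up with sub-steps 2.1, 2.2, 3.1, 4.1 and 5.1.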
Earlier I mentioned that the ? in (..+?) is there for performance reasons, and that's true, but there is no need to keep its purpose a mystery, so I'll explain what it actually does there.
As mentioned before, ? makes the preceding quantifier non-greedy. Let's say our string is 111111111111111 (L=15, L being the length of the string). With the ? present there, the + will try to match its preceding token (the .) as few times as possible. So in ^.?$|^(..+?)\1+$, (..+?) will first try to match .., then ..., after which the whole regex would succeed, since 15 = 3*5. In other words, we test divisibility by 2 and then by 3, after which we have a match. Notice that the number of steps in (..+?) was just 2.
If we omitted the ?, i.e. if we had (..+), the group would first try to match ............... (15 characters, i.e. L), then .............. (14 characters, i.e. L-1), and so on until ....., after which the whole regex would succeed. Notice that even though the result is the same, the number of steps in (..+) was 11, versus 2 in (..+?). By definition, any divisor of L other than L itself must be no greater than L/2, so that means that 8 steps were absolutely wasted computation, since first we tested the divisibility by 15, then 14, then 13, and so on until 5 (we could only hope for a match from number 7 and downwards, since L/2=15/2=7.5 and the first integer smaller than 7.5 is 7).
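To put numbers on the effect of the ? in the group, here is a rough Python simulation. It counts group lengths only, which is a simplification of what a backtracking engine really does, and group_lengths_tried is a made-up helper name. For 15 the non-greedy group stops at length 3 (15 = 3*5), while the greedy group walks down from 15 to 5.

```python
def group_lengths_tried(n, lazy=True):
    """List the group lengths tried before the right side of | succeeds.

    lazy=True  models (..+?): 2, 3, 4, ... characters
    lazy=False models (..+):  n, n-1, n-2, ... characters
    """
    candidates = range(2, n + 1) if lazy else range(n, 1, -1)
    tried = []
    for group_len in candidates:
        tried.append(group_len)
        # Success needs the group to divide n AND to be repeated at least once.
        if n % group_len == 0 and n // group_len >= 2:
            break
    return tried


print(group_lengths_tried(15, lazy=True))
# [2, 3]
print(group_lengths_tried(15, lazy=False))
# [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5]
```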
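Now, about that lie. Let's consider the string 111111111111111 (number 15).
The way I explained it before was that the regular expression would begin to test for divisibility by 2, trying to match 2*2 characters, then 2*3, then 2*4, then 2*5, then 2*6, then 2*7, then 2*8 (which no longer fits in the string). After that, it would test for divisibility by 3, trying to match 3*2 characters, then 3*3, then 3*4, then 3*5, where it would succeed. This is actually what would happen if the regular expression was ^.?$|^(..+?)\1+?$ (notice the ? following the last +).
In reality, \1+ is greedy, so when testing for divisibility by 2 the engine does not test for 2*2 first, but for 2*7, then for 2*6, then for 2*5, then for 2*4, then for 2*3 and then for 2*2. Only then does it move on to 3, where it immediately tries 3*5 characters, and succeeds.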
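The two orders can be listed side by side with another small simulation. As before, this is a sketch under simplifying assumptions (it only records tries that fit in the string, so the overshooting 2*8 try is left out), and totals_tried is a hypothetical helper, not something a regex library exposes.

```python
def totals_tried(n, greedy_repeat):
    """List the group_len*repeats products tried until the match succeeds.

    greedy_repeat=True  models \\1+  (most repeats first)
    greedy_repeat=False models \\1+? (fewest repeats first)
    """
    tried = []
    for group_len in range(2, n + 1):           # (..+?) is non-greedy in both regexes
        max_repeats = n // group_len            # the most copies that still fit
        if greedy_repeat:
            repeats_order = range(max_repeats, 1, -1)
        else:
            repeats_order = range(2, max_repeats + 1)
        for repeats in repeats_order:
            tried.append(f"{group_len}*{repeats}")
            if group_len * repeats == n:
                return tried
    return tried


print(totals_tried(15, greedy_repeat=False))
# ['2*2', '2*3', '2*4', '2*5', '2*6', '2*7', '3*2', '3*3', '3*4', '3*5']
print(totals_tried(15, greedy_repeat=True))
# ['2*7', '2*6', '2*5', '2*4', '2*3', '2*2', '3*5']
```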
public static boolean isPrime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}
If you remember correctly, I said that due to the peculiarities of the way String.matches works in Java, the regular expression that matches non-prime numbers is not the one in the code example above (.?|(..+?)\1+), but rather ^.?$|^(..+?)\1+$. The reason is that String.matches() matches on the whole string, not on any substring of the string. Basically, it automatically inserts all of the ^ and $ for you.
If you're looking for a way not to force the match on the whole string in Java, you can use Pattern, Matcher and the Matcher.find() method.
Other than that, it's pretty much self explanatory: if the match succeeds, then the number is not prime. In case of a successful match, String.matches() returns true, otherwise, it returns false. new String(new char[n]) returns a String of n characters (null characters, since we never assign any of them).
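The same whole-string behaviour can be sketched in Python, where re.fullmatch plays a role similar to Java's String.matches() and re.search behaves more like Matcher.find(). The function names below are my own, and this is an analogy rather than a statement about Java internals.

```python
import re

UNANCHORED = r".?|(..+?)\1+"   # the regex as written in the Java code


def is_prime_fullmatch(n):
    """Whole-string matching: no explicit ^ and $ are needed."""
    return re.fullmatch(UNANCHORED, "1" * n) is None


def is_prime_search(n):
    """Substring matching: broken on purpose, .? matches inside any string."""
    return re.search(UNANCHORED, "1" * n) is None


print([n for n in range(20) if is_prime_fullmatch(n)])
# [2, 3, 5, 7, 11, 13, 17, 19]
print([n for n in range(20) if is_prime_search(n)])
# []
```

Without whole-string matching the unanchored regex always finds something to match, so every number would be reported as non-prime.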
Code Examples
Now, as promised, it's time for some code examples!
Java
Although I already presented this code example twice in this post, I'll do it here again, just to keep it organized.
public static boolean isPrime(int n) {
    return !new String(new char[n]).matches(".?|(..+?)\\1+");
}
Python
I've expressed my sympathy for Python before, so of course I have to include this one here.
import re

def is_prime(n):
    return not re.match(r'^.?$|^(..+?)\1+$', '1'*n)
JavaScript
Java is to JavaScript as car is to carpet.
That's a joke I like. I didn't come up with it and I don't really know its original source, so I don't know whom to credit. Anyways, I'm actually going to give you two versions here, one which works in ES6 and one that works in previous versions.
First, the ECMAScript 6 version:
function isPrime(n) {
    var re = /^.?$|^(..+?)\1+$/;
    return !re.test('1'.repeat(n));
}
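The version for previous ECMAScript versions builds the string with Array(n+1).join('1') instead of repeat(). An array of n+1 empty elements is passed to join(), which places a 1 in each of the in-betweens. For example, an array of 11 empty elements has 10 in-betweens, so joining it produces a string of 10 characters. Here's the code: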
function isPrime(n) {
    var re = /^.?$|^(..+?)\1+$/;
    return !re.test(Array(n+1).join('1'));
}
Perl
Last, but not least, it's time for Perl. I'm including this here because the regular expression we've been exploring in this blog post has been popularized by Perl. I'm talking about the one-liner
perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' <number>
(replace <number> with an actual number).
Also, since I haven't played around with Perl before, this seemed like a good opportunity to do so. So here we go:
sub is_prime {
    return !((1 x $_[0]) =~ /^.?$|^(..+?)\1+$/);
}
Since Perl isn't the most popular language right now, it might happen that you're not familiar with its syntax. Now, I've had about 15 mins with it, so I'm pretty much an expert, so I'll take the liberty of briefly explaining it:
$_[0] - the first argument passed to the subroutine
1x<number> - repeats the character 1 <number> times, just like '1'*<number> would do in Python or '1'.repeat(<number>) in JavaScript
=~ - the match test operator, it will return true if the regular expression (its right-hand side) matches the string (its left-hand side)
I included this brief explanation because I myself don't like being left in mystery about what a certain passage of code does, and the explanation didn't take up much space anyways.
Conclusion
That's all folks! Hopefully, the way a regular expression can check if a number is prime is now demystified for you. Keep in mind that this is far from efficient, there are a lot more efficient algorithms for this task, but it is, nonetheless, a fun and interesting thing.
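For comparison, here is a sketch of one of those more efficient approaches, plain trial division up to the square root. It is my own illustration and was not part of the original code examples; is_prime_trial_division is a name I made up, and the regex version is included only to check that both agree on small numbers.

```python
import math
import re


def is_prime_trial_division(n):
    """Trial division: only divisors up to sqrt(n) need to be checked."""
    if n < 2:
        return False
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False
    return True


def is_prime_regex(n):
    """The regex approach from this post, using whole-string matching."""
    return re.fullmatch(r".?|(..+?)\1+", "1" * n) is None


# Both give the same answers; the loop just gets there in far fewer steps.
assert all(is_prime_trial_division(n) == is_prime_regex(n) for n in range(200))
print(is_prime_trial_division(97))  # True
```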
I encourage you to go to a website like regex101 and play around, especially if you're still not 100% clear about how everything explained here works. One of the cool things about this website is that it includes an explanation of the regular expression (column on the right), as well as the number of steps the regex engine had to make (rectangle right above the modifiers box) - it's a good way to see the performance differences (through the number of steps taken) in the greedy and non-greedy cases.
If you have any questions or suggestions, feel free to post them in the comment section below or get in touch with me via a different medium.
EDIT:
- Thanks to joshuamy for pointing out a typo in the Perl code
- Thanks to Keen for pointing out a typo in the post
- Thanks to Russel for submitting a Swift 2 code example
- I didn't want to get into the topic of regular/non-regular languages and related matters, since it's theory that isn't crucial for the topic of this post, but as lanzaa pointed out, there is a difference between regex and regular expression. What was covered in this blog post wasn't a regular expression, but rather a regex. In the real world, however (outside of academia), those terms are used interchangeably