Regular expressions are one of the most daunting tools when you are starting out, but they are quite powerful once you master them. They look scary and hard to read at first, and then you get used to them and see how much they can do. In this article, I will try to guide you through them so they won’t seem complicated, especially if you are just starting out.
So what are regular expressions? A regular expression (Reg Exp) is a sequence of characters used for finding or changing text in a string or a file. Reg Exps use two types of symbols:
- special characters, which have a special meaning. For example, . means “any character” (except a newline)
- literals such as a, b, 1, 2, etc.
In Python, there is a special module called re. You have to import it before writing any regular expression code, like this: import re. The module has many functions. Most of them are used to find something in a given string, split it into substrings, or change a specific part of it. The two we will use most in this article are re.search and re.findall. We will get more into them very soon.
The very idea of Reg Exp is searching for a specific pattern in text. The best way to learn Reg Exp is to actually use them, so I suggest you follow along while reading the article and try each Reg Exp yourself.
Let’s go ahead and open the Python shell. The first thing we should do is import the library. After that, the general form of a search call looks like this:
>>> import re
>>> object = re.search(pattern, text)
The first argument is pattern, which is the text we want to search for. The second argument is text, which is whatever text we have to process. What it returns is a match object, which tells us a lot about the pattern found in the text. Let’s run some examples:
>>> import re
>>> object = re.search("ssion", "regular expression")
>>> object
<_sre.SRE_Match object; span=(13, 18), match='ssion'>
It tells us that ssion starts at index 13 and ends just before index 18 of our string (counting from zero). So what if we want to return the matching text itself? We can simply do:
>>> object.group()
'ssion'
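Here is a minimal sketch of what else the match object can tell us, using the same pattern and text as the session above. The variable names match, text, start and end are mine, chosen only for illustration.

```python
import re

text = "regular expression"
match = re.search("ssion", text)

print(match.group())  # the matched text
print(match.start())  # index where the match begins
print(match.end())    # index just past the last matched character
print(match.span())   # (start, end) as a tuple

# Slicing the original text with the span gives the matched text back.
start, end = match.span()
print(text[start:end])
```

Because the end index is exclusive, it works directly as a slice boundary.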
It’s pretty self-explanatory, but just to leave no one confused: the re.search function scans the given text and finds the first substring that fits our pattern. This match was found successfully, so let’s try an unsuccessful one.
>>> import re
>>> object = re.search("python", "regular expressions")  # it is None
>>> object.group()  # calling group() on None is quite a common mistake
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
It doesn’t work because re.search returns None when nothing matches, and None has no group() method. Let’s create a small function that we can call while testing regular expressions.
def findMatch(pattern, text):
    # Print the first match of pattern in text, or a message if there is none
    object = re.search(pattern, text)
    if object:
        print(object.group())
    else:
        print("Sorry, not found")
Now we can try special characters like the dot (.), which matches any character. For example, let’s find a substring of the given string that has 4 characters and ends with t.
>>> findMatch("...t", "I want to study Python")
want
The whole pattern must match. You can’t find something that matches only a little bit, or even almost matches. For example:
>>> findMatch("...ons", "I want to study Python")
Sorry, not found
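To see the “all or nothing” rule without the helper function, here is a small sketch that calls re.search directly on the same text. The names text, found and missing are only for illustration.

```python
import re

text = "I want to study Python"

# Three arbitrary characters followed by "t": the leftmost fit is "want".
found = re.search("...t", text)
print(found.group(), found.span())

# Nothing in the text ends with "ons", so there is no partial result,
# only None.
missing = re.search("...ons", text)
print(missing)
```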
Reg Exp scans the text from left to right. If an earlier position doesn’t match the pattern completely, it moves on and returns the first substring that matches 100%. Keep in mind that the pattern knows nothing about words, so a match can be part of a longer word.
>>> findMatch("...e", "He codes well, but you can code better!")
code
The text has two candidates, codes and code. The match returned here is the first four characters of codes: c, o and d are the three “any” characters, followed by e. The search stops at this leftmost match and never reaches the standalone word code later in the text. Maybe you are wondering right now, “What if we want to find a substring that has a . character inside?”. This is a very good question! We can always put a backslash in front of the dot:
>>> findMatch("c\.t", "This one is cat and other one is c.t")
c.t
The backslash inhibits the specialness of the dot character, so we are looking for a literal . and not “any character”. However, writing backslashes in a normal string is not very convenient, because Python itself also treats the backslash as an escape character. Python has raw strings, which you create by typing r before the opening quote. It says, “Don’t do any special processing with backslashes”. This is especially useful when writing regular expressions, because we don’t have to worry about how Python interprets the backslashes before the Reg Exp engine sees them. Raw strings also make it easy to use another cool feature, \w, which stands for any word character (a letter, a digit, or an underscore). Let’s take a look at the example:
>>> findMatch(r"-\w\w\w\w\w\w", "What is it? -python -123456")
-python
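Here is a small sketch of what the r prefix changes, what escaping the dot changes, and which characters \w accepts. The test string "a_1-b" and the variable names are my own, chosen only for illustration.

```python
import re

# In a normal string, "\b" is the single backspace character.
# In a raw string, it stays as two characters: a backslash and "b".
print(len("\b"), len(r"\b"))

text = "This one is cat and other one is c.t"

# Unescaped dot: "any character", so "cat" is the leftmost match.
any_char = re.search(r"c.t", text).group()
# Escaped dot: only a literal "." is accepted between "c" and "t".
literal_dot = re.search(r"c\.t", text).group()
print(any_char, literal_dot)

# \w matches letters, digits and the underscore, but not "-".
word_chars = re.findall(r"\w", "a_1-b")
print(word_chars)
```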
There is a similar feature for digits, \d:
>>> findMatch(r"-\d\d\d\d\d\d", "What is it? -python -123456")
-123456
What if we want to retrieve some information that is separated by spaces? There is a feature \s that matches a whitespace character. For example:
>>> findMatch(r"\w\s\w\s\d", "P y 1")
P y 1
Another question that you may be having right now is, “What if we have more spaces?”. That’s an excellent question. Reg Exp has the special characters + meaning 1 or more and * meaning 0 or more of whatever comes right before them. So let’s try to retrieve the information when there are multiple spaces:
>>> findMatch(r"\w\s+\w\s+\d", "P   y   1")
P   y   1
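To make the difference between a bare \s, \s+ and \s* concrete, here is a minimal sketch that tries all three on texts with zero, one and three spaces. The dictionary keys and the test strings are hypothetical names of mine.

```python
import re

patterns = {
    "one": r"\w\s\w",    # exactly one whitespace character
    "plus": r"\w\s+\w",  # one or more whitespace characters
    "star": r"\w\s*\w",  # zero or more whitespace characters
}

results = {}
for text in ("Py", "P y", "P   y"):
    # Record which of the three patterns finds a match in this text.
    results[text] = [name for name, p in patterns.items() if re.search(p, text)]
    print(repr(text), results[text])
```

Only the * version accepts the text with no space at all, and only the + and * versions cope with several spaces in a row.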
By now, you probably get an idea why many novice programmers hate regular expressions. Isn’t it funny how this dense code r"\w\s+\w\s+\d" means so much? The order of the characters is significant to the output.
For instance, say we have some text that is useless to us. However, there is a substring in it (we don’t know its length) that starts with -. How do we find it?
>>> findMatch(r"-\w+","this text doesn't make -python any sense")
-python
Note that the space isn’t a word character, and that’s why it stops after -python.
There is another feature, \S, which matches any non-whitespace character. Why would we use it? Well, let’s say we want to retrieve a substring that contains not only word characters but also symbols and other junk, but we know that the substring is whole and doesn’t have whitespace inside. How do we do that?
>>> findMatch(r"-\S+","this text doesn't make [email protected]#=-=+ any sense")
[email protected]#=-=+
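Here is a small sketch comparing \w+ and \S+ side by side. The text below is a made-up stand-in of mine and is not the string from the session above.

```python
import re

# Hypothetical text: the token after "-" mixes word characters and symbols.
text = "this text doesn't make -py_th0n!#% any sense"

# \w+ stops at the first character that is not a letter, digit or underscore.
word_only = re.search(r"-\w+", text).group()
# \S+ keeps going until it reaches whitespace.
non_space = re.search(r"-\S+", text).group()

print(word_only)
print(non_space)
```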
That should be enough for basic examples.
Imagine we have a bunch of data, and we know for sure that there is an email address somewhere inside. How do we get that info? Probably the most obvious Reg Exp we can think of is \w+@\w+, right? It’s not quite correct. Let’s check it:
>>> findMatch(r"\w+@\w+", "bunch of sentences first.lastname@email.com")
lastname@email
This is not what we are expecting to get, right? The dot is not a word character, so \w+ stops at it on both sides of @. We can surely change our Reg Exp to \S+@\S+, and it will fix our basic problem. However, I would like you to think harder about the previous example and stick to \w. We can use square brackets [] to show that we are expecting either a word character or a dot, both before @ and after it:
>>> findMatch(r"[\w.]+@[\w.]+", "bunch of sentences first.lastname@email.com")
first.lastname@email.com
When we write [\w.], the . inside the brackets means a literal dot, not “any character”. So what our Reg Exp states is: scan the text until we find one or more word characters or dots, then @, then one or more word characters or dots again. It also works if we try a simple email such as [email protected]. It still works because a dot is allowed inside the brackets but not required.
Now, let’s think about what we should use if we want a check: the first symbol of the email must be a letter. What Reg Exp should we use? The easiest one would be r"\w[\w.]*@[\w.]+". It means there must be at least one word character at the start, otherwise the email isn’t valid. Keep in mind that \w also matches digits and the underscore, so to insist on a letter we would write [a-zA-Z] in place of the first \w. Pretty logical and easy, right?
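Here is a minimal sketch of that check. The names loose and strict and the example addresses are hypothetical, and using re.fullmatch for validation is my suggestion, not something the pattern requires.

```python
import re

loose = r"\w[\w.]*@[\w.]+"         # the pattern from the text
strict = r"[a-zA-Z][\w.]*@[\w.]+"  # first character must be an ASCII letter

address = "1st.user@example.com"   # hypothetical address starting with a digit

# \w accepts the leading digit, so the loose pattern matches the whole address.
loose_hit = re.search(loose, address).group()
print(loose_hit)

# re.search is free to start later in the text, so the strict pattern
# still finds a match that simply skips the digit.
strict_hit = re.search(strict, address).group()
print(strict_hit)

# re.fullmatch requires the entire string to fit the pattern.
rejected = re.fullmatch(strict, address)
accepted = re.fullmatch(strict, "first.user@example.com")
print(rejected, accepted.group())
```

So when the goal is to validate a whole string instead of finding something inside a longer text, re.fullmatch is the safer call.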
What happens if we want to retrieve the username and the server separately? What I mean is we want to get first.lastname and email.com separately. How do we do that?
>>> object = re.search(r"([\w.]+)@([\w.]+)", "bunch of sentences first.lastname@email.com")
>>> object
<_sre.SRE_Match object; span=(19, 43), match='first.lastname@email.com'>
>>> object.group()  # that's not what we want, right? But what if...
'first.lastname@email.com'
>>> object.group(1)  # here we go, that's the username
'first.lastname'
>>> object.group(2)  # here we go, that's the host
'email.com'
So when we put parentheses around a part of the pattern, it means we care about that part and want it back as a separate group. Note that the parentheses don’t change what the pattern matches. They only split the match into groups, and groups can be nested.
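Here is a short sketch of two other ways to read groups: groups() returns all of them at once, and named groups let us use labels in place of numbers. The group names user and host are my own choice, and the address is the one implied by the outputs in the session above.

```python
import re

text = "bunch of sentences first.lastname@email.com"

# Numbered groups: groups() returns every group as one tuple.
match = re.search(r"([\w.]+)@([\w.]+)", text)
username, host = match.groups()
print(username, host)

# Named groups: (?P<name>...) attaches a label to each group.
named = re.search(r"(?P<user>[\w.]+)@(?P<host>[\w.]+)", text)
print(named.group("user"), named.group("host"))
print(named.groupdict())
```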
Now, there is an important thing to remember: re.search scans the text and stops once it finds the first match (if there is one). The function re.findall scans the whole text and collects every match. So the difference is that the second function may give you many results. It returns a list of all non-overlapping matches. If the pattern contains groups, each item in the list is a tuple of those groups. For example:
>>> object = re.findall(r"([\w.]+)@([\w.]+)", "bunch of sentences first.lastname@email.com and one@two.com")
>>> object  # list of tuples, one tuple of groups per match
[('first.lastname', 'email.com'), ('one', 'two.com')]
>>> object = re.findall(r"[\w.]+@[\w.]+", "bunch of sentences first.lastname@email.com and one@two.com")
>>> object  # list of two strings
['first.lastname@email.com', 'one@two.com']
It’s quite handy, and you can do all sorts of stuff here. Every time you get data from a file or as a string, it’s quite easy to process it with Reg Exp.
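If we need the position of every match and not only its text, re.finditer is a close relative of re.findall that yields match objects. Here is a minimal sketch, assuming the same kind of text as above. The list name found is only for illustration.

```python
import re

text = "bunch of sentences first.lastname@email.com and one@two.com"
pattern = r"([\w.]+)@([\w.]+)"

found = []
for match in re.finditer(pattern, text):
    # Each item is a full match object, so group(), span() and groups() all work.
    found.append((match.group(0), match.span(), match.groups()))
    print(match.group(0), match.span(), match.groups())
```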
Python has many Reg Exp patterns, and I will show you briefly what each of them means.
- ^ and $ match the beginning and end of a line, respectively (in Python, of the whole string unless the re.MULTILINE flag is set)
- [...] and [^...] match any single character in and not in the brackets, respectively
- (?#...) is a comment
- (?=re) specifies a position using a pattern (a lookahead)
- \w matches word characters
- \W matches nonword characters
- \A matches the beginning of the string
- \Z matches the end of the string
- \G matches the position where the last match finished (supported by some other Reg Exp engines, but not by Python’s re module)
- [aeiou] matches any one lowercase vowel
- [0-9] matches any digit
- [a-z] matches any lowercase ASCII letter
- [A-Z] matches any uppercase ASCII letter
- [a-zA-Z0-9] matches any ASCII letter and any digit
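Here is a short sketch that exercises a few entries from the list above: the anchors with and without re.MULTILINE, a negated character class, and a lookahead. All test strings and variable names are hypothetical ones of mine.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ only matches at the very start of the string.
start_default = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after every newline.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)
print(start_default, start_multiline)

# [^...] matches any character NOT listed: here, no vowels and no whitespace.
consonants = re.findall(r"[^aeiou\s]", "regex")
print(consonants)

# (?=@) checks that "@" comes next without making it part of the match.
before_at = re.findall(r"\w+(?=@)", "one@two.com")
print(before_at)
```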
We have learnt a couple of things about regular expressions today:
- When we use them, matching usually goes from left to right. In other words, it scans the text and returns the first substring that fits the desired pattern completely (except when using re.findall, which returns all of them)
- There are a lot of special characters, but there is no need to memorize all of them. It’s better to understand the concept and learn the patterns as you need them
- Reg Exp can be daunting at first (especially its syntax), but once you get used to it, it becomes a powerful tool
- Sometimes, when we write a Reg Exp, we get a different output from what we expect. The reason is usually that we are missing something. Don’t worry much about it: the code is so dense that each character is significant, so you just need to be very careful with it
- The best way to learn Reg Exp is actually writing some of your own
As I have already mentioned, the best way to learn Reg Exp is to actually write code using them. Here are some exercises for you, along with examples of how they can be solved.
- Return the first and the last word from the given string
Example of how it can be done can be seen below:
>>> import re
>>> first = re.findall(r"^\w+", "First word isn't last")
>>> first
['First']
>>> last = re.findall(r"\w+$", "First word isn't last")
>>> last
['last']
- Return the first 2 chars of each word in the given string
>>> import re
>>> answer = re.findall(r"\b\w.", "This is my testing sentence")
>>> print(answer)
['Th', 'is', 'my', 'te', 'se']
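One thing to watch in this solution: the dot matches any character, so a one-letter word drags the following space into the result. Here is a small sketch of that edge case with a possible alternative, \w{1,2}, which is my suggestion and not part of the original solution. The sentence is made up.

```python
import re

text = "I am here"

# The dot happily matches the space after the one-letter word "I".
with_dot = re.findall(r"\b\w.", text)
# \w{1,2} takes one or two word characters and never a space.
with_range = re.findall(r"\b\w{1,2}", text)

print(with_dot)
print(with_range)
```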
- Return the list of all email domains
>>> import re
>>> answer = re.findall(r"@\w+\.\w+", "[email protected], [email protected], [email protected], [email protected], [email protected]")
>>> print(answer)
['@gmail.com', '@mail.com', '@coder.com', '@love.com', '@python.org']
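Here is a runnable sketch of the same exercise with made-up addresses (the usernames alice, bob and carol are hypothetical). It also shows why the dot in the pattern should be escaped, and how a group can drop the @ from the results.

```python
import re

emails = "alice@gmail.com, bob@mail.com, carol@python.org"

# Escaped dot: "@", word characters, a literal ".", more word characters.
domains_with_at = re.findall(r"@\w+\.\w+", emails)
print(domains_with_at)

# A group around the part after "@" makes findall return only that part.
domains = re.findall(r"@(\w+\.\w+)", emails)
print(domains)

# With an unescaped dot, any character is accepted, even a space.
sloppy = re.findall(r"@\w+.\w+", "x@abc def")
careful = re.findall(r"@\w+\.\w+", "x@abc def")
print(sloppy, careful)
```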
- Retrieve information from HTML file
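import re
from urllib.request import urlopen  # Python 3 replacement for urllib2

response = urlopen('')
html = response.read().decode()  # bytes to str, assuming the page is UTF-8
result = re.findall(r"\w+\s(\w+)\s(\w+)", html)  # search the page text, not the str type
print(result)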
Example of html file and data which we want to retrieve
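Since the page from the exercise is not available here, the sketch below uses a tiny made-up HTML table stored in a string. The markup, the names in it and the pattern are all assumptions of mine, meant only to show the idea of pulling several fields out of each row with groups.

```python
import re

# Hypothetical HTML: each row holds a rank and two names.
html = """
<table>
  <tr><td>1</td><td>Noah</td><td>Emma</td></tr>
  <tr><td>2</td><td>Liam</td><td>Olivia</td></tr>
</table>
"""

# One group per cell, so findall returns one tuple per row.
rows = re.findall(r"<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>", html)
for rank, first_name, second_name in rows:
    print(rank, first_name, second_name)
```

This works for small, regular snippets like this one. For real-world pages with nested or irregular markup, a proper HTML parser is usually the more robust choice.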
You can download the full source code of this example here: regular-expressions.zip