
Python Regular Expression Tutorial

Regular expressions are one of the most daunting tools when you are starting out, but they are also one of the most powerful once you master them. They look scary and hard to read at first, yet with a little practice the syntax becomes familiar. In this article I will guide you through them step by step, so they won’t seem complicated, especially if you are a beginner.

1. Introduction

So what are regular expressions? A regular expression (Reg Exp) is a sequence of characters that describes a pattern, used for finding or changing text in a string or in a file. A pattern is built from two types of symbols:

    • special characters, which carry a special meaning. For example, . means “any character” (except a newline)
    • literals such as a, b, 1, 2, etc.

In Python, regular expressions live in the re module of the standard library. You have to import it first with import re before writing any code that uses it. The module offers many functions. Most of them find something in a given string, split it into substrings, or replace a specific part of it. Let’s take a look at the most common ones:

    • re.match()
    • re.search()
    • re.findall()
    • re.split()
    • re.sub()
    • re.compile()

We will get more into that very soon.
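As a quick preview, here is a minimal sketch of what each of these six functions returns. The sample string and the variable names are made up for illustration only.

```python
import re

text = "cat, dog; cat"  # sample data, made up for this sketch

# re.match only tries to match at the very start of the string
at_start = re.match(r"cat", text)        # a match: text starts with "cat"
not_at_start = re.match(r"dog", text)    # None: "dog" is not at position 0

# re.search scans the whole string and returns the first match
first_dog = re.search(r"dog", text)

# re.findall returns every non-overlapping match as a list of strings
all_cats = re.findall(r"cat", text)

# re.split cuts the string wherever the pattern matches
parts = re.split(r"[,;]\s*", text)

# re.sub replaces every match with the given replacement
replaced = re.sub(r"cat", "bird", text)

# re.compile builds a reusable pattern object that offers the same methods
word = re.compile(r"\w+")
words = word.findall(text)

print(at_start.group(), not_at_start, first_dog.span())
print(all_cats, parts, replaced, words)
```

The rest of the article concentrates on re.search and re.findall; the other functions follow the same pattern syntax.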

2. Examples

The whole idea of Reg Exp is searching for a specific pattern in text, and the best way to learn them is to actually use them. I suggest you follow along in your own Python shell while reading the article and try every example yourself.

2.1. Basic examples

Let’s go ahead and open the Python shell. The first thing we should do is import the re library. The basic call looks like this, where pattern and text stand for any two strings:

Python shell

>>> import re
>>> object = re.search(pattern, text)

The first argument, pattern, is what we want to search for; the second argument, text, is the string we want to process. The call returns a match object that tells us a lot about the pattern found in the text, or None if nothing was found. Let’s run some examples:

Python shell

>>> import re
>>> object = re.search("ssion", "regular expression")
>>> object
<_sre.SRE_Match object; span=(13, 18), match='ssion'>

It tells us that ssion starts at index 13 and ends just before index 18 of our string (counting from zero). Recent Python versions print the same object as <re.Match object; ...>. So what if we want to return the matching text itself? We can simply do:

Python shell

>>> object.group()
'ssion'
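The span shown in the match object can also be read programmatically. The sketch below uses the same search as above; only the variable names text and m are my own.

```python
import re

text = "regular expression"
m = re.search("ssion", text)

# start() is the index of the first matched character,
# end() is one past the last matched character
print(m.start(), m.end(), m.span())

# slicing the original string with those indices gives back the match
print(text[m.start():m.end()] == m.group())
```

This is handy when you need to know where the match is, not just what it is.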

Just to leave no one confused: the re.search function scans the given text and returns the first substring that fits our pattern. This search was successful, so let’s try an unsuccessful one.

Python shell

>>> import re
>>> object = re.search("python", "regular expressions") # no match, so object is None
>>> object.group() # a very common mistake: calling group() on None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

It doesn’t work because re.search returned None, and None has no group() method. To avoid repeating this check every time, let’s write a small helper function that we can call while testing regular expressions.

function.py

import re

def findMatch(pattern, text):
    # re.search returns a match object, or None when nothing matches
    object = re.search(pattern, text)
    if object:
        print(object.group())
    else:
        print("Sorry, not found")

Now we can try special characters such as the dot. For example, let’s find a substring of the given string that has 4 characters and ends with t.

Python shell

>>> findMatch("...t", "I want to study Python")
want

The whole pattern must match. You can’t get back something that matches only partially, or almost matches. For example:

Python shell

>>> findMatch("...ons", "I want to study Python")
Sorry, not found

The engine scans the text from left to right and returns the first place where the whole pattern matches. That place does not have to line up with whole words.

Python shell

>>> findMatch("...e", "He codes well, but you can code better!")
code

It is tempting to think that the engine skipped codes and returned the later word code, but that is not what happens. The result is the first four characters of codes: three arbitrary characters (c, o, d) followed by e. Since re.search stops at the first complete match, it never even reaches the second code. Maybe you are wondering right now, “What if we want to find a substring that has the . character inside?” This is a very good question! We can always put a backslash \ in front of it:

Python shell

>>> findMatch("c\.t", "This one is cat and other one is c.t")
c.t

The backslash removes the special meaning of the dot, so we are looking for a literal . rather than “any character”. However, writing backslashes inside an ordinary string is not very convenient, because Python itself also treats the backslash as an escape character, and recent Python versions warn about sequences such as "\." in a normal string. Python has raw strings, written with an r prefix, which tell Python, “Don’t do any special processing of backslashes”. They are especially handy for regular expressions because we don’t have to worry about how many backslashes we need. Another useful feature is \w, which stands for any word character (a letter, a digit, or an underscore). Let’s take a look at an example:

Python shell

>>> findMatch(r"-\w\w\w\w\w\w", "What is it? -python -123456")
-python
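To see what the r prefix and the escaping backslash actually change, here is a small sketch. It reuses the cat / c.t sentence from above; re.escape is a standard helper from the re module that the article has not introduced, and the variable names are my own.

```python
import re

# In a normal string "\n" is a single newline character;
# in a raw string r"\n" is a backslash followed by the letter n
print(len("\n"), len(r"\n"))

# Without the r prefix every backslash has to be doubled
print("\\w\\d" == r"\w\d")

text = "This one is cat and other one is c.t"
loose = re.search(r"c.t", text).group()    # the dot matches any character
strict = re.search(r"c\.t", text).group()  # the dot matches only "."

# re.escape inserts the backslashes for text that should be taken literally
escaped = re.escape("c.t")
print(loose, strict, escaped)
```

A reasonable habit is to write every pattern as a raw string, even when it contains no backslashes yet.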

There is another feature for digits – \d:

Python shell

>>> findMatch(r"-\d\d\d\d\d\d", "What is it? -python -123456")
-123456

What if we want to retrieve information that is separated by spaces? There is a feature \s that matches a whitespace character. For example:

Python shell

>>> findMatch(r"\w\s\w\s\d", "P y 1")
P y 1

Another question you may have right now is, “What if we have more spaces?” That’s an excellent question. Regular expressions have the special characters + meaning 1 or more and * meaning 0 or more of the preceding item. So let’s try to retrieve the information when there are multiple spaces:

Python shell

>>> findMatch(r"\w\s+\w\s+\d", "P      y                           1")
P      y                           1
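Besides + and *, a count in curly braces repeats the preceding item an exact number of times, which shortens the earlier \w\w\w\w\w\w patterns. The following is just a sketch of that idea, with variable names of my own choosing.

```python
import re

text = "What is it? -python -123456"

# {6} means "exactly six of the preceding item"
six_word_chars = re.search(r"-\w{6}", text).group()
six_digits = re.search(r"-\d{6}", text).group()

# * allows zero occurrences, + requires at least one
zero_or_more = re.search(r"P\s*y", "Py")   # matches: zero spaces is fine
one_or_more = re.search(r"P\s+y", "Py")    # None: at least one space needed

print(six_word_chars, six_digits, zero_or_more.group(), one_or_more)
```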

By now you probably see why many novice programmers hate regular expressions. Isn’t it funny how the dense code r"\w\s+\w\s+\d" means so much? Every character and its position affect the result.
Now suppose we have some text that is mostly useless to us, but it contains a substring of unknown length that starts with -. How do we find it?

Python shell

>>> findMatch(r"-\w+","this text doesn't make -python any sense")
-python

Note that the space isn’t a word character, and that’s why the match stops after n.
There is another feature, \S, which matches any non-whitespace character. Why would we use it? Let’s say we want to retrieve a substring that contains not only word characters but also punctuation and other junk, and we know that it has no whitespace inside. How do we do that?

Python shell

>>> findMatch(r"-\S+","this text doesn't make -python123!@#=-=+ any sense")
-python123!@#=-=+

That should be enough for basic examples.

2.2. Advanced examples

Imagine we have a bunch of data and we know for sure that there is an email address somewhere inside. How do we get it? Probably the most obvious Reg Exp we can think of is \w+@\w+, right? It’s not quite correct. Let’s check it:

Python shell

>>> findMatch(r"\w+@\w+", "bunch of sentences first.lastname@email.com")
lastname@email

This is not what we expected, right? The dot is not a word character, so \w+ stops at it on both sides of @. We could change our Reg Exp to \S+@\S+, and it would fix this basic problem. However, I would like you to think harder about the previous example and stick with \w+@\w+.
We can use square brackets [] to say that we accept either a word character or a dot . before @ and after @.

Python shell

>>> findMatch(r"[\w.]+@[\w.]+", "bunch of sentences first.lastname@email.com")
first.lastname@email.com

Inside square brackets, . means a literal dot, not “any character”. So [\w.]+ reads as “one or more characters, each of which is a word character or a dot”. The pattern also works for a simple email such as email@email.com: a dot is allowed in the username but not required.
Now, let’s think about what to write if we want a check: the first symbol of the email must be a word character rather than a dot. What Reg Exp should we use? The easiest one would be r"\w[\w.]*@[\w.]+". It requires at least one word character at the start, followed by zero or more word characters or dots. Keep in mind that \w also accepts digits and the underscore, so if the first symbol must strictly be a letter, use [a-zA-Z] in its place.
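A short sketch of the stricter “first symbol is a letter” check. The sample addresses are made up, and this is only an illustration of the pattern, not a real email validator. Note that re.search happily starts the match later in the string, so re.fullmatch is used when the whole string has to fit.

```python
import re

pattern = r"[a-zA-Z][\w.]*@[\w.]+"  # first character must be an ASCII letter

# re.search skips the leading digit and matches from the first letter on
found = re.search(pattern, "1abc@email.com").group()

# re.fullmatch succeeds only if the entire string fits the pattern
valid = re.fullmatch(pattern, "abc@email.com")
invalid = re.fullmatch(pattern, "1abc@email.com")

print(found, valid is not None, invalid)
```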
What if we want to retrieve the username and the server separately? In other words, we want first.lastname on its own and email.com on its own. How do we do that?

Python shell

>>> object = re.search(r"([\w.]+)@([\w.]+)", "bunch of sentences first.lastname@email.com")
>>> object
<_sre.SRE_Match object; span=(19, 43), match='first.lastname@email.com'>
>>> object.group() # the whole match, which is not what we want. But what if...
'first.lastname@email.com'
>>> object.group(1) # here we go, that's the username
'first.lastname'
>>> object.group(2) # and that's the host
'email.com'

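So when we put parentheses around a part of the pattern, we say that we care about that part and want it captured as a separate group. Note that the parentheses above don’t change what the pattern matches; they only mark the groups, which are numbered from 1 in the order of their opening parentheses and can also be nested.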
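Groups can also be given names, which is often easier to read than numbers. The sketch below uses the same email text; the group names user and host are my own choice.

```python
import re

text = "bunch of sentences first.lastname@email.com"

# (?P<name>...) is a capturing group that can also be read by name
m = re.search(r"(?P<user>[\w.]+)@(?P<host>[\w.]+)", text)

print(m.groups())        # all captured groups as a tuple
print(m.group("user"))   # same value as m.group(1)
print(m.group("host"))   # same value as m.group(2)
print(m.groupdict())     # the named groups as a dictionary
```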
Now, there is an important thing to remember: re.search scans the text and stops as soon as it finds the first match (if there is one). The function re.findall keeps scanning until the end of the text and returns a list of all non-overlapping matches, so you may get many results. If the pattern contains groups, each element of the list is a tuple of the groups. For example:

Python shell

>>> object = re.findall(r"([\w.]+)@([\w.]+)", "bunch of sentences first.lastname@emal.com and one@two.com")
>>> object # with groups: a list of tuples
[('first.lastname', 'emal.com'), ('one', 'two.com')]
>>> object = re.findall(r"[\w.]+@[\w.]+", "bunch of sentences first.lastname@emal.com and one@two.com")
>>> object # without groups: a list of two strings
['first.lastname@emal.com', 'one@two.com']

It’s quite handy, and you can do all sorts of stuff here. Every time you get data from a file or as a string, it’s quite easy to process it with Reg Exp.
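re.findall gives back only strings or tuples. If you also need the position of every match, re.finditer from the same module yields full match objects instead. A minimal sketch with the same sample text (the variable names are mine):

```python
import re

text = "bunch of sentences first.lastname@emal.com and one@two.com"
pattern = r"([\w.]+)@([\w.]+)"

# each item produced by finditer is a match object, like the one from re.search
for m in re.finditer(pattern, text):
    print(m.span(), m.group(1), m.group(2))

spans = [m.span() for m in re.finditer(pattern, text)]
```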

3. Patterns

Here is a brief list of common Reg Exp patterns and what each of them means.

  • ^ and $ match the beginning and end of the string respectively, or of each line when the re.MULTILINE flag is set ($ also matches just before a trailing newline)
  • [...] and [^...] match any single character in and not in the brackets respectively
  • (?#...) is a comment
  • (?=re) is a lookahead: it matches a position that is followed by the pattern re
  • \w matches word characters
  • \W matches nonword characters
  • \s matches whitespace
  • \S matches nonwhitespace
  • \d matches digits
  • \D matches nondigits
  • \A matches beginning of string
  • \Z matches end of string
  • \G matches the position where the last match finished in some other regex flavors; Python’s re module does not support it
  • [aeiou] matches any one lowercase vowel
  • [0-9] matches any digit
  • [a-z] matches any lowercase ASCII letter
  • [A-Z] matches any uppercase ASCII letter
  • [a-zA-Z0-9] matches any ASCII letter and any digit
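A few of the patterns from this list in action. This is only a sketch; the sample strings and variable names are made up.

```python
import re

text = "first line\nsecond line"

# By default ^ matches only at the start of the whole string
default_starts = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after each newline
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# \A always means the start of the string, whatever the flags are
string_start = re.findall(r"\A\w+", text, re.MULTILINE)

# (?=...) is a lookahead: it checks what follows without consuming it
before_at = re.findall(r"\w+(?=@)", "one@two.com, three@four.org")

# [^...] matches any character that is NOT listed in the brackets
consonants = re.findall(r"[^aeiou]", "regular")

print(default_starts, line_starts, string_start, before_at, consonants)
```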

4. Summary

Today we have learnt a couple of things about regular expressions:

  1. Matching goes from left to right. In other words, the engine scans the text and returns the first substring that fits the pattern completely (re.findall returns all of them)
  2. There are a lot of special characters, but there is no need to memorize all of them. It’s better to understand the concept and learn the common patterns
  3. Reg Exp can be daunting at first (especially its syntax), but once you get used to it, it becomes a powerful tool
  4. Sometimes a Reg Exp gives output we did not expect. Usually the reason is that we missed something in the pattern. Don’t worry much about it: the code is so dense that every character is significant, so you just need to be very careful with it
  5. The best way to learn Reg Exp is to write some of your own

5. Homework

As I have already mentioned, the best way to learn Reg Exp is to actually write code using them. Here are some exercises for you, each with an example of how it can be solved.

    • Return the first and the last word from the given string

An example solution can be seen below:

homework.py

>>> import re
>>> first = re.findall(r"^\w+", "First word isn't last")
>>> first
['First']
>>> last = re.findall(r"\w+$", "First word isn't last")
>>> last
['last']
    • Return the first 2 chars of each word in the given string

homework.py

>>> import re
>>> answer = re.findall(r"\b\w.", "This is my testing sentence")
>>> print(answer)
['Th', 'is', 'my', 'te', 'se']
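One thing to watch in this solution: \b marks a word boundary, but the dot after \w accepts any character, including a space. The sketch below (with a made-up sentence containing a one-letter word) compares it with \w{2}, which insists on two word characters.

```python
import re

# "." after \w may grab the space that follows a one-letter word
two_any = re.findall(r"\b\w.", "I am ok")

# \w{2} requires two word characters, so one-letter words are skipped
two_word = re.findall(r"\b\w{2}", "I am ok")

# on the sentence from the exercise both patterns agree
same = re.findall(r"\b\w{2}", "This is my testing sentence")

print(two_any, two_word, same)
```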
    • Return the list of all email domains

homework.py

>>> import re
>>> answer = re.findall(r"@\w+\.\w+", "123@gmail.com, 234@mail.com, python@coder.com, coding@love.com, 111@python.org")
>>> print(answer)
['@gmail.com', '@mail.com', '@coder.com', '@love.com', '@python.org']
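If you want the domains without the leading @, a capture group does the job, since re.findall returns only the group when the pattern has one. The sketch also shows why the dot before the top-level domain should be escaped. The short test string "a@b c" is made up for the comparison.

```python
import re

emails = ("123@gmail.com, 234@mail.com, python@coder.com, "
          "coding@love.com, 111@python.org")

# the parentheses select what findall returns; "@" stays outside the group
domains = re.findall(r"@(\w+\.\w+)", emails)

# an unescaped dot accepts any character, even a space
loose = re.findall(r"@\w+.\w+", "a@b c")
strict = re.findall(r"@\w+\.\w+", "a@b c")

print(domains, loose, strict)
```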
    • Retrieve information from HTML file

homework.py

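import re
from urllib.request import urlopen  # Python 3 replacement for urllib2

response = urlopen('')  # put the address of the page here
html = response.read().decode('utf-8')  # assumes the page is UTF-8 encoded
# skip the leading number, capture the two words that follow it
result = re.findall(r"\w+\s(\w+)\s(\w+)", html)
print(result)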

Example of the HTML file data which we want to retrieve (a number followed by two names):

1 Noah Emma
2 Liam Olivia
3 Mason Sophia
4 Jacob Isabella
5 William Ava
6 Ethan Mia
7 Michael Emily
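The pattern from this exercise can be tried without any network access. The sketch below assumes that the page text, once the HTML tags are gone, is a number followed by two names separated by whitespace; the real page may be laid out differently, and the sample string is typed in by hand.

```python
import re

# Assumed shape of the data: number, name, name, separated by whitespace
sample = "1 Noah Emma\n2 Liam Olivia\n3 Mason Sophia"

# the leading \w+ consumes the number; the two groups capture the names
result = re.findall(r"\w+\s(\w+)\s(\w+)", sample)
print(result)
```

If the numbers and names are wrapped in tags, the pattern has to describe those tags as well, or the tags have to be removed first.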

6. Download the Source Code

Download
You can download the full source code of this example here: Python Regular Expression Tutorial

Aleksandr Krasnov

Aleksandr is passionate about teaching programming. His main interests are Neural Networks, Python and Web development. Hobbies are game development and translating. For the past year, he has been involved in different international projects as SEO and IT architect.