How to Split a String by Whitespace in Python: Brute Force and split()

Once again, I’m back with another look at some ways to solve a common Python problem. This time, we’ll be looking at how to split a string by whitespace (and other separators) in Python.

If you’re in a rush, here’s the key takeaway. You could write your own whitespace-splitting function, but homemade versions are often slow and lack robustness. Instead, you should probably opt for Python’s builtin split() method. It works on any string as follows: "What a Wonderful World".split(). If done correctly, you’ll get a nice list of substrings without all that whitespace (e.g. ["What", "a", "Wonderful", "World"]).

In the remainder of this article, we’ll look at the solution described above in more detail. In addition, we’ll try writing our own solution. Then, we’ll compare them all by performance. At the end, I’ll ask you to tackle a little challenge.

Let’s get started!

Problem Description

When we talk about splitting a string, what we’re really talking about is the process of breaking a string up into parts. As it turns out, there are a lot of ways to split a string. For the purposes of this article, we’ll just be looking at splitting a string by whitespace.

Of course, what does it mean to split a string by whitespace? Well, let’s look at an example:

"How are you?"

Here, the only two whitespace characters are the two spaces. As a result, splitting this string by whitespace would result in a list of three strings:

["How", "are", "you?"]

Of course, there are a ton of different types of whitespace characters. Unfortunately, which characters count as whitespace depends entirely on the character set being used. As a result, we’ll simplify this problem by only concerning ourselves with Unicode characters (as of the publish date).

In the Unicode character set, there are 17 “separator, space” characters. In addition, there are another 8 whitespace characters which include things like line separators. As a result, the following string is a bit more interesting:

"Hi, Ben!\nHow are you?"

With the addition of the line break, we would expect that splitting by whitespace would result in the following list:

["Hi,", "Ben!", "How", "are", "you?"]
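To get a feel for the two groups of whitespace characters mentioned above, here is a minimal sketch that uses the standard unicodedata module to collect the “separator, space” (Zs) characters. The variable name space_separators is just an illustration, and the exact contents depend on the Unicode version bundled with your Python:

```python
import sys
import unicodedata

# Collect every code point whose Unicode category is "Zs" (separator, space).
space_separators = [
    chr(code_point)
    for code_point in range(sys.maxunicode + 1)
    if unicodedata.category(chr(code_point)) == "Zs"
]

print(" " in space_separators)   # True: the ordinary space is a Zs character
print("\n" in space_separators)  # False: the line break is a control character
print("\n".isspace())            # True: Python still treats it as whitespace
```

In other words, a line break is not a “separator, space” character, but it is still whitespace, which is why we expect it to split the string above.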

In this article, we’ll take a look at a few ways to actually write some code that will split a string by whitespace and store the result in a list.

Solutions

As always, there are a lot of different ways to split a string by whitespace. To kick things off, we’ll try to write our own solution. Then, we’ll look at a few more practical solutions.

Split a String by Whitespace Using Brute Force

If I were given the problem description above and asked to solve it without using any libraries, here’s what I would do:

items = []
my_string = "Hi, how are you?"
whitespace_chars = [" ", ..., "\n"]
start_index = 0
end_index = 0
for character in my_string:
  if character in whitespace_chars:
    items.append(my_string[start_index: end_index])
    start_index = end_index + 1
  end_index += 1
# Once the loop is done, store the last word
items.append(my_string[start_index: end_index])

Here, I decided to build up a few variables. First, we need to track the end result which is items in this case. Then, we need some sort of string to work with (e.g. my_string).

To perform the splitting, we’ll need to track a couple indices: one for the front of each substring (e.g. start_index) and one for the back of the substring (e.g. end_index).

On top of all that, we need some way to verify that a character is in fact whitespace. To do that, we created a list of whitespace characters called whitespace_chars. Rather than listing all of the whitespace characters, I cheated and showed two examples with a little ellipsis. Make sure to replace the ellipsis with the remaining characters before relying on this code. That said, Python gives those three dots a meaning of their own (the builtin Ellipsis object), so leaving them in won’t raise an error; the list will just contain one item that never matches a character.
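If the three dots seem mysterious, this short sketch shows what Python actually stores in the list and why the membership test keeps working:

```python
whitespace_chars = [" ", ..., "\n"]

# The middle item is Python's builtin Ellipsis object, not a string.
print(whitespace_chars[1] is Ellipsis)  # True

# Membership tests compare each character against every item in the list.
# No character ever equals Ellipsis, so the placeholder is simply skipped.
print("a" in whitespace_chars)   # False
print(" " in whitespace_chars)   # True
print("\t" in whitespace_chars)  # False: tab was never actually listed
```

The last line is the real cost of leaving the placeholder in: any whitespace character hiding behind the ellipsis is silently treated as part of a word.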

Using these variables, we’re able to loop over our string and construct our substrings. We do that by checking if each character is a whitespace. If it is, we know we need to construct a substring and update start_index to begin tracking the next word. Then, when we’re done, we can grab the last word and store it.

Now, there’s a lot of messiness here. To make life a bit easier, I decided to move the code into a function which we could modify as we go along:

def split_string(my_string: str):
  items = []
  whitespace_chars = [" ", ..., "\n"]
  start_index = 0
  end_index = 0
  for character in my_string:
    if character in whitespace_chars:
      items.append(my_string[start_index: end_index])
      start_index = end_index + 1
    end_index += 1
  items.append(my_string[start_index: end_index])
  return items

Now, this solution is extremely error prone. To prove that, try running this function as follows:

split_string("Hello  World")  # returns ['Hello', '', 'World']

Notice how having two spaces in a row causes us to store empty strings? Yeah, that’s not ideal. In the next section, we’ll look at a way to improve this code.
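Before moving on, it’s worth noting that one quick patch is to skip any substring that would be empty. The sketch below is not the approach used in the rest of this article; split_skip_empty is a hypothetical name, and the three-character default for whitespace_chars is an assumption made to keep the example short:

```python
def split_skip_empty(my_string: str, whitespace_chars=(" ", "\t", "\n")):
    items = []
    start_index = 0
    for end_index, character in enumerate(my_string):
        if character in whitespace_chars:
            # Only store the substring if it contains at least one character.
            if start_index < end_index:
                items.append(my_string[start_index:end_index])
            start_index = end_index + 1
    # Store the last word, unless the string ended with whitespace.
    if start_index < len(my_string):
        items.append(my_string[start_index:])
    return items

print(split_skip_empty("Hello  World"))  # ['Hello', 'World']
```

This gets rid of the empty strings, but it still hard-codes what counts as a separator.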

Split a String by Whitespace Using State

Now, I borrowed this solution from a method that we ask students to write for a lab in one of the courses I teach. The method is called “nextWordOrSeparator”, and its skeleton looks like this:

/**
  * Returns the first "word" (maximal length string of characters not in
  * {@code separators}) or "separator string" (maximal length string of
  * characters in {@code separators}) in the given {@code text} starting at
  * the given {@code position}.
  */
private static String nextWordOrSeparator(String text, int position,
            Set<Character> separators) {
        assert text != null : "Violation of: text is not null";
        assert separators != null : "Violation of: separators is not null";
        assert 0 <= position : "Violation of: 0 <= position";
        assert position < text.length() : "Violation of: position < |text|";
 
        // TODO - fill in body
 
        /*
         * This line added just to make the program compilable. Should be
         * replaced with appropriate return statement.
         */
        return "";
}

One way to implement this method is to check whether or not the first character is a separator. If it is, loop until it’s not. If it’s not, loop until it is.

Typically, this is done by writing two separate loops. One loop continually checks characters until a character is in the separator set. Meanwhile, the other loop does the opposite.
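For comparison, here is roughly what that two-loop approach might look like in Python. This is a sketch of the idea rather than the lab’s reference solution, and the function name is made up for illustration:

```python
def next_word_or_separator_two_loops(text: str, position: int, separators: list):
    end_index = position
    if text[position] in separators:
        # We started on a separator, so consume separators until a word begins.
        while end_index < len(text) and text[end_index] in separators:
            end_index += 1
    else:
        # We started on a word, so consume characters until a separator appears.
        while end_index < len(text) and text[end_index] not in separators:
            end_index += 1
    return text[position:end_index]

print(next_word_or_separator_two_loops("Hello,    World", 0, [" "]))  # Hello,
```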

Of course, I think that’s a little redundant, so I wrote my solution using a single loop (this time in Python):

def next_word_or_separator(text: str, position: int, separators: list):
  end_index = position
  is_separator = text[position] in separators
  while end_index < len(text) and is_separator == (text[end_index] in separators):
    end_index += 1
  return text[position: end_index]

Here, we track a couple variables. First, we need an end_index, so we know where to split our string. In addition, we need to determine if we’re dealing with a word or separator. To do that, we check if the character at the current position in text is in separators. Then, we store the result in is_separator.

With is_separator, all there is left to do is loop over the string until we find a character that is different. To do that, we repeatedly run the same computation we ran for is_separator. To make that more obvious, I’ve stored that expression in a lambda function:

def next_word_or_separator(text: str, position: int, separators: list):
  test_separator = lambda x: text[x] in separators
  end_index = position
  is_separator = test_separator(position)
  while end_index < len(text) and is_separator == test_separator(end_index):
    end_index += 1
  return text[position: end_index]

At any rate, this loop will run until either we run out of string or our test_separator function gives us a value that differs from is_separator. For example, if is_separator is True then we won’t break until test_separator is False.
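To make that behavior concrete, here are a few calls against the same string. The function is the one from this section; only the sample positions are new:

```python
def next_word_or_separator(text: str, position: int, separators: list):
    test_separator = lambda x: text[x] in separators
    end_index = position
    is_separator = test_separator(position)
    while end_index < len(text) and is_separator == test_separator(end_index):
        end_index += 1
    return text[position: end_index]

text = "Hello,    World"

# Starting on a word: the loop stops at the first space.
print(repr(next_word_or_separator(text, 0, [" "])))   # 'Hello,'

# Starting on a separator: the loop stops at the first non-space.
print(repr(next_word_or_separator(text, 6, [" "])))   # '    '

# Starting in the middle of the separator run only returns what is left of it.
print(repr(next_word_or_separator(text, 8, [" "])))   # '  '

# Starting on the last word: the loop stops when we run out of string.
print(repr(next_word_or_separator(text, 10, [" "])))  # 'World'
```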

Now, we can use this function to make our first solution a bit more robust:

def split_string(my_string: str):
  items = []
  whitespace_chars = [" ", ..., "\n"]
  i = 0
  while i < len(my_string):
    sub = next_word_or_separator(my_string, i, whitespace_chars)
    items.append(sub)
    i += len(sub)
  return items
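To see what this version actually returns, here is a self-contained run. For the sake of a runnable example, I’ve assumed a whitespace list of just a space and a line break in place of the elided one:

```python
def next_word_or_separator(text: str, position: int, separators: list):
    test_separator = lambda x: text[x] in separators
    end_index = position
    is_separator = test_separator(position)
    while end_index < len(text) and is_separator == test_separator(end_index):
        end_index += 1
    return text[position: end_index]

def split_string(my_string: str):
    items = []
    whitespace_chars = [" ", "\n"]  # assumption: shortened list for this demo
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, whitespace_chars)
        items.append(sub)  # stores words and separator runs alike
        i += len(sub)
    return items

print(split_string("Hello  World"))  # ['Hello', '  ', 'World']
```

The empty string from the brute force solution is gone, but the run of spaces now shows up as its own item.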

Unfortunately, this code is still wrong because we don’t bother to check if what is returned is a word or a separator. To do that, we’ll need to run a quick test:

def split_string(my_string: str):
  items = []
  whitespace_chars = [" ", ..., "\n"]
  i = 0
  while i < len(my_string):
    sub = next_word_or_separator(my_string, i, whitespace_chars)
    if sub[0] not in whitespace_chars:
      items.append(sub)
    i += len(sub)
  return items

Now, we have a solution that is slightly more robust! Also, it gets the job done for anything we consider separators; they don’t even have to be whitespace. Let’s go ahead and adapt this one last time to let the user enter any separators they like:

def split_string(my_string: str, seps: list):
  items = []
  i = 0
  while i < len(my_string):
    sub = next_word_or_separator(my_string, i, seps)
    if sub[0] not in seps:
      items.append(sub)
    i += len(sub)
  return items

Then, when we run this, we’ll see that we can split by whatever we like:

>>> split_string("Hello,    World", [" "])
['Hello,', 'World']
>>> split_string("Hello,    World", ["l"])
['He', 'o,    Wor', 'd']
>>> split_string("Hello,    World", ["l", "o"])
['He', ',    W', 'r', 'd']
>>> split_string("Hello,    World", ["l", "o", " "])
['He', ',', 'W', 'r', 'd']
>>> split_string("Hello,    World", [",", " "])
['Hello', 'World']

How cool is that?! In the next section, we’ll look at some builtin tools that do exactly this.

Split a String by Whitespace Using split()

While we spent all this time trying to write our own split method, Python had one built in all along. It’s called split(), and we can call it on strings directly:

my_string = "Hello, World!"
my_string.split()  # returns ["Hello,", "World!"]

In addition, we can provide our own separators to split the string:

my_string = "Hello, World!"
my_string.split(",")  # returns ['Hello', ' World!']

However, this method doesn’t work quite like the method we provided. If we input multiple separators, the method will only match the combined string:

my_string = "Hello, World!"
my_string.split("el")  # returns ['H', 'lo, World!']

In the documentation, this is described as a “different algorithm” from the default behavior. In other words, the whitespace algorithm will treat consecutive whitespace characters as a single entity. Meanwhile, if a separator is provided, the method splits at every occurrence of that separator:

my_string = "Hello, World!"
my_string.split("l")  # returns ['He', '', 'o, Wor', 'd!']
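The difference between the two algorithms is easiest to see by splitting the same strings both ways. The sample strings here are just for illustration:

```python
text = "a  b"  # two spaces between the letters

# Default whitespace algorithm: the run of spaces counts as one separator.
print(text.split())     # ['a', 'b']

# Explicit separator: every single space splits, leaving an empty string.
print(text.split(" "))  # ['a', '', 'b']

padded = "  a b  "

# The default algorithm also ignores leading and trailing whitespace.
print(padded.split())     # ['a', 'b']

# An explicit separator keeps an empty string for each extra space.
print(padded.split(" "))  # ['', '', 'a', 'b', '', '']
```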

But, that’s not all! This method can also limit the number of splits using an additional parameter, maxsplit:

my_string = "Hello, World! Nice to meet you."
my_string.split(maxsplit=2)  # returns ['Hello,', 'World!', 'Nice to meet you.']
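As an aside, if you want the “split on any of these characters” behavior of our homemade split_string(), one option outside of str.split() is re.split() from the standard library. The sketch below is only one way to do it; split_on_any is a hypothetical helper, and it assumes that seps is a non-empty list of single characters:

```python
import re

def split_on_any(text: str, seps: list):
    # Build a character class such as "[, ]+" that matches a run of separators.
    pattern = "[" + "".join(re.escape(sep) for sep in seps) + "]+"
    # re.split() leaves empty strings at the edges when the text starts or
    # ends with a separator, so filter those out.
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on_any("Hello,    World", [",", " "]))       # ['Hello', 'World']
print(split_on_any("Hello,    World", ["l", "o", " "]))  # ['He', ',', 'W', 'r', 'd']
```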

How cool is that? In the next section, we’ll see how this solution stacks up against the solutions we wrote ourselves.

Performance

To test performance, we’ll be using the timeit library. Essentially, it allows us to compute the runtime of our code snippets for comparison. If you’d like to learn more about this process, I’ve documented my approach in an article on performance testing in Python.

Otherwise, let’s go ahead and convert our solutions into strings:

setup = """
zero_spaces = 'Jeremy'
one_space = 'Hello, World!'
many_spaces = 'I need to get many times stronger than everyone else!'
first_space = '    Well, what do we have here?'
last_space = 'Is this the Krusty Krab?    '
long_string = 'Spread love everywhere you go: first of all in your own house. Give love to your children, to your wife or husband, to a next door neighbor. Let no one ever come to you without leaving better and happier. Be the living expression of God’s kindness; kindness in your face, kindness in your eyes, kindness in your smile, kindness in your warm greeting.'
 
def split_string_bug(my_string: str):
  items = []
  whitespace_chars = [' ']
  start_index = 0
  end_index = 0
  for character in my_string:
    if character in whitespace_chars:
      items.append(my_string[start_index: end_index])
      start_index = end_index + 1
    end_index += 1
  items.append(my_string[start_index: end_index])
  return items
 
def next_word_or_separator(text: str, position: int, separators: list):
  test_separator = lambda x: text[x] in separators
  end_index = position
  is_separator = test_separator(position)
  while end_index < len(text) and is_separator == test_separator(end_index):
    end_index += 1
  return text[position: end_index]
 
def split_string(my_string: str, seps: list):
  items = []
  i = 0
  while i < len(my_string):
    sub = next_word_or_separator(my_string, i, seps)
    if sub[0] not in seps:
      items.append(sub)
    i += len(sub)
  return items
"""
 
split_string_bug = """
split_string_bug(zero_spaces)
"""
 
split_string = """
split_string(zero_spaces, [" "])
"""
 
split_python = """
zero_spaces.split()
"""

For this first set of tests, I decided to start with a string that has no spaces:

>>> import timeit
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
0.7218914000000041
>>> min(timeit.repeat(setup=setup, stmt=split_string))
2.867278899999974
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.0969244999998864

Looks like our next_word_or_separator() solution is very slow. Meanwhile, the builtin split() is extremely fast. Let’s see if that trend continues. Here are the results when we look at one space:

>>> split_string_bug = """
split_string_bug(one_space)
"""
>>> split_string = """
split_string(one_space, [" "])
"""
>>> split_python = """
one_space.split()
"""
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
1.4134186999999656
>>> min(timeit.repeat(setup=setup, stmt=split_string))
6.758952300000146
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.1601205999998001

Again, Python’s split() method is pretty quick. Meanwhile, our robust method is terribly slow. I can’t imagine how much worse our performance is going to get with a larger string. Let’s try the many_spaces string next:

>>> split_string_bug = """
split_string_bug(many_spaces)
"""
>>> split_string = """
split_string(many_spaces, [" "])
"""
>>> split_python = """
many_spaces.split()
"""
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
5.328358900000012
>>> min(timeit.repeat(setup=setup, stmt=split_string))
34.19867759999988
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.4214780000002065

This very quickly became painful to wait out. I’m a bit afraid to try the long_string test to be honest. At any rate, let’s check out the performance for the first_space string (and recall that the bugged solution doesn’t work as expected):

>>> split_string_bug = """
split_string_bug(first_space)
"""
>>> split_string = """
split_string(first_space, [" "])
"""
>>> split_python = """
first_space.split()
"""
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
3.8263317999999344
>>> min(timeit.repeat(setup=setup, stmt=split_string))
20.963715100000172
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.2931996000002073

At this point, I’m not seeing much difference in the results, so I figured I’d spare you the data dump and instead provide a table of the results:

Test        | split_string_bug   | split_string       | split_python
no_spaces   | 0.7218914000000041 | 2.867278899999974  | 0.0969244999998864
one_space   | 1.4134186999999656 | 6.758952300000146  | 0.1601205999998001
many_spaces | 5.328358900000012  | 34.19867759999988  | 0.4214780000002065
first_space | 3.8263317999999344 | 20.963715100000172 | 0.2931996000002073
last_space  | 3.560071500000049  | 17.976437099999657 | 0.2646626999999171
long_string | 35.38718729999982  | 233.59029310000005 | 3.002933099999609

Performance timings (in seconds, as reported by timeit) using the timeit library for three separate split solutions.

Clearly, the builtin method should be the go-to method for splitting strings.
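If you’d like to reproduce a comparison like this without the long wait, keep in mind that timeit.repeat() runs each statement 1,000,000 times per repetition by default. Here is a scaled-down sketch that lowers that count with the number parameter; the sample string and the counts are arbitrary choices:

```python
import timeit

setup = "text = 'Hello, World!'"

# Run the statement 100,000 times per repetition, for 3 repetitions.
results = timeit.repeat(
    stmt="text.split()",
    setup=setup,
    repeat=3,
    number=100_000,
)

# Each result is the total time in seconds for one repetition.
best = min(results)
print(f"best of 3: {best:.4f} s for 100,000 calls")
print(f"scaled to 1,000,000 calls: {best * 10:.4f} s")
```

Taking the minimum, as in the tests above, gives the run that was least disturbed by whatever else the machine was doing.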

Challenge

At this point, we’ve covered just about everything I want to talk about today. As a result, I’ll leave you with this challenge.

We’ve written a function which can be used to split any string we like by any separator. How could we go about writing something similar for numbers? For example, what if I wanted to split a number every time the number 256 appears?

This could be a cool way to create a fun coding scheme where ASCII codes could be embedded in a large number:

secret_key = 72256101256108256108256111

We could then delineate each code by some separator code—in this case 256 because it’s outside of ASCII range. Using our method, we could split our coded string by the separator and then make sense of the result using chr():

arr = split_nums(secret_key, 256)  # [72, 101, 108, 108, 111]
print("".join([chr(x) for x in arr]))

If you read my article on obfuscation, you already know why this might be desirable. We could essentially write up an enormous number and use it to generate strings of text. Anyone trying to reverse engineer our solution would have to make sense of our coded string.

Also, I think something like this is a fun thought experiment; I don’t expect it to be entirely useful. That said, feel free to share your solutions in the comments.

A Little Recap

And with that, we’re done! As always, here are all the solutions from this article in one convenient location:

my_string = "Hi, fam!"

# Split that only works when there are no consecutive separators
def split_string_bug(my_string: str):
  items = []
  whitespace_chars = [" ", ..., "\n"]
  start_index = 0
  end_index = 0
  for character in my_string:
    if character in whitespace_chars:
      items.append(my_string[start_index: end_index])
      start_index = end_index + 1
    end_index += 1
  items.append(my_string[start_index: end_index])
  return items

split_string_bug(my_string)  # ["Hi,", "fam!"]

# A more robust, albeit much slower, implementation of split
def next_word_or_separator(text: str, position: int, separators: list):
  test_separator = lambda x: text[x] in separators
  end_index = position
  is_separator = test_separator(position)
  while end_index < len(text) and is_separator == test_separator(end_index):
    end_index += 1
  return text[position: end_index]

def split_string(my_string: str, seps: list):
  items = []
  i = 0
  while i < len(my_string):
    sub = next_word_or_separator(my_string, i, seps)
    if sub[0] not in seps:
      items.append(sub)
    i += len(sub)
  return items

split_string(my_string, [" "])  # ["Hi,", "fam!"]

# The builtin split solution **preferred**
my_string.split()  # ["Hi,", "fam!"]

If you’d like to go the extra mile, check out my article on ways you can help grow The Renegade Coder. This list includes ways to get involved like hopping on my mailing list or joining me on Patreon.

Once again, thanks for stopping by. Hopefully, you found value in this article and you’ll swing by again later! I’d appreciate it.

Published on Web Code Geeks with permission by Jeremy Grifski, partner at our WCG program. See the original article here: How to Split a String by Whitespace in Python: Brute Force and split()

Opinions expressed by Web Code Geeks contributors are their own.

Jeremy Grifski

Jeremy is the founder of The Renegade Coder, a software curriculum website launched in 2017. In addition, he is a PhD student with an interest in education and data visualization.