# How to Split a String by Whitespace in Python: Brute Force and split()

Once again, I’m back with another look at some ways to solve a common Python problem. This time, we’ll be looking at how to split a string by whitespace (and other separators) in Python.

If you’re in a rush, here’s the key takeaway. You could write your own whitespace-splitting function, but hand-rolled versions are often slow and fragile. Instead, you should probably opt for Python’s builtin `split()` method. It works on any string as follows: `"What a Wonderful World".split()`. You’ll get a nice list of substrings without all that whitespace (e.g. `["What", "a", "Wonderful", "World"]`).

In the remainder of this article, we’ll look at the solution described above in more detail. In addition, we’ll try writing our own solution. Then, we’ll compare them all by performance. At the end, I’ll ask you to tackle a little challenge.

Let’s get started!

## Problem Description

When we talk about splitting a string, what we’re really talking about is the process of breaking a string up into parts. As it turns out, there are a lot of ways to split a string. For the purposes of this article, we’ll just be looking at splitting a string by whitespace.

Of course, what does it mean to split a string by whitespace? Well, let’s look at an example:

```python
"How are you?"
```

Here, the only two whitespace characters are the two spaces. As a result, splitting this string by whitespace would result in a list of three strings:

```python
["How", "are", "you?"]
```

Of course, there are a ton of different types of whitespace characters. Unfortunately, which characters count as whitespace depends entirely on the character set being used. As a result, we’ll simplify this problem by only concerning ourselves with Unicode characters (as of the publish date).

In the Unicode character set, there are 17 “separator, space” characters. In addition, there are another 8 whitespace characters which include things like line separators. As a result, the following string is a bit more interesting:

```python
"Hi, Ben!\nHow are you?"
```

With the addition of the line break, we would expect that splitting by whitespace would result in the following list:

```python
["Hi,", "Ben!", "How", "are", "you?"]
```
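To get a feel for the variety of whitespace characters, here is a small sketch that inspects a handful of them with the standard `unicodedata` module. The sample characters are my own picks for illustration, and Python’s `str.isspace()` uses its own definition of whitespace, so its view may differ slightly from the counts above.

```python
import unicodedata

# A few characters Python treats as whitespace (hand-picked for illustration).
samples = [" ", "\t", "\n", "\u00a0", "\u2003", "\u2028"]

for ch in samples:
    # unicodedata.name() raises for unnamed control characters, so pass a default.
    name = unicodedata.name(ch, "<control character>")
    # Print the character, its general category, whether str.isspace() accepts it,
    # and its Unicode name.
    print(repr(ch), unicodedata.category(ch), ch.isspace(), name)
```

The "separator, space" characters show up with the category `Zs`, while characters like the line separator fall into other categories.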

In this article, we’ll take a look at a few ways to actually write some code that will split a string by whitespace and store the result in a list.

## Solutions

As always, there are a lot of different ways to split a string by whitespace. To kick things off, we’ll try to write our own solution. Then, we’ll look at a few more practical solutions.

### Split a String by Whitespace Using Brute Force

If I were given the problem description above and asked to solve it without using any libraries, here’s what I would do:

```python
items = []
my_string = "Hi, how are you?"
whitespace_chars = [" ", ..., "\n"]
start_index = 0
end_index = 0
for character in my_string:
    if character in whitespace_chars:
        # Close out the current word and start tracking the next one.
        items.append(my_string[start_index:end_index])
        start_index = end_index + 1
    end_index += 1
# Grab the last word once the loop is done.
items.append(my_string[start_index:end_index])
```

Here, I decided to build up a few variables. First, we need to track the end result which is `items` in this case. Then, we need some sort of string to work with (e.g. `my_string`).

To perform the splitting, we’ll need to track a couple indices: one for the front of each substring (e.g. `start_index`) and one for the back of the substring (e.g. `end_index`).

On top of all that, we need some way to verify that a character is in fact whitespace. To do that, we created a list of whitespace characters called `whitespace_chars`. Rather than listing all of the whitespace characters, I cheated and showed two examples with a little ellipsis. Make sure to replace the ellipsis with the remaining characters before relying on this code. As it happens, Python gives those three dots meaning (they are the builtin `Ellipsis` object), so the code won’t actually error out, and since no character is ever equal to `Ellipsis`, leaving it in won’t cause any harm either.
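If the ellipsis seems suspicious, a quick check like the following shows what Python does with it. This is only a small demonstration, not part of the solution itself.

```python
whitespace_chars = [" ", ..., "\n"]

# The three dots are the built-in Ellipsis singleton, not a syntax error.
print(whitespace_chars[1] is Ellipsis)

# Membership tests still behave, because no character compares equal to Ellipsis.
print(" " in whitespace_chars)
print("a" in whitespace_chars)
```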

Using these variables, we’re able to loop over our string and construct our substrings. We do that by checking if each character is a whitespace. If it is, we know we need to construct a substring and update `start_index` to begin tracking the next word. Then, when we’re done, we can grab the last word and store it.

Now, there’s a lot of messiness here. To make life a bit easier, I decided to move the code into a function which we could modify as we go along:

```python
def split_string(my_string: str):
    items = []
    whitespace_chars = [" ", ..., "\n"]
    start_index = 0
    end_index = 0
    for character in my_string:
        if character in whitespace_chars:
            items.append(my_string[start_index:end_index])
            start_index = end_index + 1
        end_index += 1
    items.append(my_string[start_index:end_index])
    return items
```

Now, this solution is extremely error prone. To prove that, try running this function as follows:

```python
split_string("Hello  World")  # returns ['Hello', '', 'World']
```

Notice how having two spaces in a row causes us to store empty strings? Yeah, that’s not ideal. In the next section, we’ll look at a way to improve this code.
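As an aside, one small patch (not the route this article takes) would be to skip empty slices as we go. Here is a minimal sketch of that idea; the name `split_string_skip_empty` and the default tuple of whitespace characters are assumptions made for illustration.

```python
def split_string_skip_empty(my_string: str, whitespace_chars=(" ", "\t", "\n")):
    """Brute-force split that drops empty substrings (hypothetical variant)."""
    items = []
    start_index = 0
    for end_index, character in enumerate(my_string):
        if character in whitespace_chars:
            # Only keep the slice if it contains at least one character.
            if start_index < end_index:
                items.append(my_string[start_index:end_index])
            start_index = end_index + 1
    # Grab the final word, if there is one.
    if start_index < len(my_string):
        items.append(my_string[start_index:])
    return items


print(split_string_skip_empty("Hello  World"))  # ['Hello', 'World']
```

This handles consecutive, leading, and trailing whitespace, but it still only knows about the characters we list by hand.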

### Split a String by Whitespace Using State

Now, I borrowed this solution from a method that we ask students to write for a lab in one of the courses I teach. The method is called `nextWordOrSeparator`, and its skeleton looks like this:

```java
/**
 * Returns the first "word" (maximal length string of characters not in
 * {@code separators}) or "separator string" (maximal length string of
 * characters in {@code separators}) in the given {@code text} starting at
 * the given {@code position}.
 */
private static String nextWordOrSeparator(String text, int position,
        Set separators) {
    assert text != null : "Violation of: text is not null";
    assert separators != null : "Violation of: separators is not null";
    assert 0 <= position : "Violation of: 0 <= position";
    assert position < text.length() : "Violation of: position < |text|";

    // TODO - fill in body

    /*
     * This line added just to make the program compilable. Should be
     * replaced with appropriate return statement.
     */
    return "";
}
```

One way to implement this method is to check whether or not the first character is a separator. If it is, loop until it’s not. If it’s not, loop until it is.

Typically, this is done by writing two separate loops. One loop continually checks characters until a character is in the separator set. Meanwhile, the other loop does the opposite.
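For reference, here is roughly what that two-loop approach could look like in Python. This is a sketch under my own naming (`next_word_or_separator_two_loops` is made up for illustration), not the solution students are expected to submit.

```python
def next_word_or_separator_two_loops(text: str, position: int, separators: list) -> str:
    """Two-loop sketch: one loop for separator runs, one for word runs."""
    end_index = position
    if text[position] in separators:
        # Starting on a separator: advance until we hit a non-separator.
        while end_index < len(text) and text[end_index] in separators:
            end_index += 1
    else:
        # Starting on a word: advance until we hit a separator.
        while end_index < len(text) and text[end_index] not in separators:
            end_index += 1
    return text[position:end_index]


print(next_word_or_separator_two_loops("Hello,    World", 0, [" "]))  # Hello,
```

The two loops differ only by a `not`, which is where the redundancy comes from.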

Of course, I think that’s a little redundant, so I wrote my solution using a single loop (this time in Python):

```python
def next_word_or_separator(text: str, position: int, separators: list):
    end_index = position
    is_separator = text[position] in separators
    while end_index < len(text) and is_separator == (text[end_index] in separators):
        end_index += 1
    return text[position:end_index]
```

Here, we track a couple variables. First, we need an `end_index`, so we know where to split our string. In addition, we need to determine if we’re dealing with a word or separator. To do that, we check if the character at the current `position` in `text` is in `separators`. Then, we store the result in `is_separator`.

With `is_separator`, all there is left to do is loop over the string until we find a character that is different. To do that, we repeatedly run the same computation we ran for `is_separator`. To make that more obvious, I’ve stored that expression in a lambda function:

```python
def next_word_or_separator(text: str, position: int, separators: list):
    test_separator = lambda x: text[x] in separators
    end_index = position
    is_separator = test_separator(position)
    while end_index < len(text) and is_separator == test_separator(end_index):
        end_index += 1
    return text[position:end_index]
```

At any rate, this loop will run until either we run out of string or our `test_separator` function gives us a value that differs from `is_separator`. For example, if `is_separator` is `True` then we won’t break until `test_separator` is `False`.
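To see the function in action, the following sketch walks a short string from left to right and collects every chunk it returns. The sample text is made up for illustration; the function body matches the single-loop version above.

```python
def next_word_or_separator(text: str, position: int, separators: list) -> str:
    end_index = position
    is_separator = text[position] in separators
    while end_index < len(text) and is_separator == (text[end_index] in separators):
        end_index += 1
    return text[position:end_index]


text = "Hi,  Ben!"
position = 0
chunks = []
while position < len(text):
    # Each call returns either a whole word or a whole run of separators.
    chunk = next_word_or_separator(text, position, [" "])
    chunks.append(chunk)
    position += len(chunk)

print(chunks)  # ['Hi,', '  ', 'Ben!']
```

Notice that words and separator runs alternate, and that the two spaces come back as a single chunk.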

Now, we can use this function to make our first solution a bit more robust:

```python
def split_string(my_string: str):
    items = []
    whitespace_chars = [" ", ..., "\n"]
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, whitespace_chars)
        items.append(sub)
        i += len(sub)
    return items
```

Unfortunately, this code is still wrong because we don’t bother to check if what is returned is a word or a separator. To do that, we’ll need to run a quick test:

```python
def split_string(my_string: str):
    items = []
    whitespace_chars = [" ", ..., "\n"]
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, whitespace_chars)
        if sub[0] not in whitespace_chars:
            items.append(sub)
        i += len(sub)
    return items
```

Now, we have a solution that is slightly more robust! Also, it gets the job done for anything we consider separators; they don’t even have to be whitespace. Let’s go ahead and adapt this one last time to let the user enter any separators they like:

```python
def split_string(my_string: str, seps: list):
    items = []
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, seps)
        if sub[0] not in seps:
            items.append(sub)
        i += len(sub)
    return items
```

Then, when we run this, we’ll see that we can split by whatever we like:

```python
>>> split_string("Hello,    World", [" "])
['Hello,', 'World']
>>> split_string("Hello,    World", ["l"])
['He', 'o,    Wor', 'd']
>>> split_string("Hello,    World", ["l", "o"])
['He', ',    W', 'r', 'd']
>>> split_string("Hello,    World", ["l", "o", " "])
['He', ',', 'W', 'r', 'd']
>>> split_string("Hello,    World", [",", " "])
['Hello', 'World']
```

How cool is that?! In the next section, we’ll look at some builtin tools that do exactly this.

### Split a String by Whitespace Using `split()`

While we spent all this time trying to write our own split method, Python had one built in all along. It’s called `split()`, and we can call it on strings directly:

```python
my_string = "Hello, World!"
my_string.split()  # returns ["Hello,", "World!"]
```

In addition, we can provide our own separators to split the string:

```python
my_string = "Hello, World!"
my_string.split(",")  # returns ['Hello', ' World!']
```

However, this method doesn’t work quite like the one we wrote. If we pass a separator made of several characters, the method matches it only as one combined string, not as a set of individual characters:

```python
my_string = "Hello, World!"
my_string.split("el")  # returns ['H', 'lo, World!']
```

The documentation describes the default, no-argument behavior as a “different splitting algorithm”: runs of consecutive whitespace are treated as a single separator. Meanwhile, if a separator is provided, the method splits at every occurrence of that separator, which can produce empty strings:

```python
my_string = "Hello, World!"
my_string.split("l")  # returns ['He', '', 'o, Wor', 'd!']
```
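The difference between the two algorithms is easiest to see with plain spaces. The strings below are made up for illustration:

```python
text = "Hello  World"  # two spaces between the words

# No argument: runs of whitespace collapse into one separator.
print(text.split())     # ['Hello', 'World']

# Explicit separator: every single space is a split point.
print(text.split(" "))  # ['Hello', '', 'World']

# Leading and trailing whitespace follow the same pattern.
print("  hi  ".split())     # ['hi']
print("  hi  ".split(" "))  # ['', '', 'hi', '', '']
```

In other words, `split(" ")` behaves a lot like our first brute force attempt, empty strings and all.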

But, that’s not all! This method can also limit the number of splits using an additional parameter, `maxsplit`:

```python
my_string = "Hello, World! Nice to meet you."
my_string.split(maxsplit=2)  # returns ['Hello,', 'World!', 'Nice to meet you.']
```
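If you want the "split on any of these separators" behavior of our hand-written `split_string`, the standard `re` module is one option. The sketch below is an assumption about how you might wire that up; `split_on_any` is a hypothetical helper name, not something Python provides.

```python
import re


def split_on_any(text: str, seps: list) -> list:
    """Split on runs of any separator in seps (hypothetical helper)."""
    if not seps:
        return [text] if text else []
    # Escape each separator so characters like "." or "|" are matched literally.
    pattern = "(?:" + "|".join(re.escape(sep) for sep in seps) + ")+"
    # Leading or trailing separators produce empty strings, so filter them out.
    return [piece for piece in re.split(pattern, text) if piece]


print(split_on_any("Hello,    World", [",", " "]))  # ['Hello', 'World']
```

Like our own function, this treats a run of separators as a single break and never returns empty strings.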

How cool is that? In the next section, we’ll see how this solution stacks up against the solutions we wrote ourselves.

## Performance

To test performance, we’ll be using the `timeit` library. Essentially, it allows us to compute the runtime of our code snippets for comparison. If you’d like to learn more about this process, I’ve documented my approach in an article on performance testing in Python.
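As a quick refresher, here is a minimal sketch of how `timeit.repeat` is used. The `repeat` and `number` values are deliberately small so it finishes quickly; the tests in this article rely on the defaults instead.

```python
import timeit

setup = "text = 'Hello, World!'"
stmt = "text.split()"

# repeat() runs the whole measurement several times; each entry is the total
# time in seconds for `number` executions of stmt.
times = timeit.repeat(stmt=stmt, setup=setup, repeat=3, number=10_000)

# The minimum is usually the least noisy estimate.
print(min(times))
```

Both the setup code and the statement under test are passed as strings, which is why the solutions get wrapped in strings next.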

Otherwise, let’s go ahead and convert our solutions into strings:

```python
setup = """
zero_spaces = 'Jeremy'
one_space = 'Hello, World!'
many_spaces = 'I need to get many times stronger than everyone else!'
first_space = '    Well, what do we have here?'
last_space = 'Is this the Krusty Krab?    '
long_string = 'Spread love everywhere you go: first of all in your own house. Give love to your children, to your wife or husband, to a next door neighbor. Let no one ever come to you without leaving better and happier. Be the living expression of God’s kindness; kindness in your face, kindness in your eyes, kindness in your smile, kindness in your warm greeting.'

def split_string_bug(my_string: str):
    items = []
    whitespace_chars = [' ']
    start_index = 0
    end_index = 0
    for character in my_string:
        if character in whitespace_chars:
            items.append(my_string[start_index:end_index])
            start_index = end_index + 1
        end_index += 1
    items.append(my_string[start_index:end_index])
    return items

def next_word_or_separator(text: str, position: int, separators: list):
    test_separator = lambda x: text[x] in separators
    end_index = position
    is_separator = test_separator(position)
    while end_index < len(text) and is_separator == test_separator(end_index):
        end_index += 1
    return text[position:end_index]

def split_string(my_string: str, seps: list):
    items = []
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, seps)
        if sub[0] not in seps:
            items.append(sub)
        i += len(sub)
    return items
"""

split_string_bug = """
split_string_bug(zero_spaces)
"""

split_string = """
split_string(zero_spaces, [" "])
"""

split_python = """
zero_spaces.split()
"""
```

For this first set of tests, I decided to start with a string that has no spaces:

```python
>>> import timeit
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
0.7218914000000041
>>> min(timeit.repeat(setup=setup, stmt=split_string))
2.867278899999974
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.0969244999998864
```

Looks like our `next_word_or_separator()` solution is very slow. Meanwhile, the builtin `split()` is extremely fast. Let’s see if that trend continues. Here are the results when we look at one space:

```python
>>> split_string_bug = """
split_string_bug(one_space)
"""
>>> split_string = """
split_string(one_space, [" "])
"""
>>> split_python = """
one_space.split()
"""
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
1.4134186999999656
>>> min(timeit.repeat(setup=setup, stmt=split_string))
6.758952300000146
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.1601205999998001
```

Again, Python’s `split()` method is pretty quick. Meanwhile, our robust method is terribly slow. I can’t imagine how much worse our performance is going to get with a larger string. Let’s try the `many_spaces` string next:

```python
>>> split_string_bug = """
split_string_bug(many_spaces)
"""
>>> split_string = """
split_string(many_spaces, [" "])
"""
>>> split_python = """
many_spaces.split()
"""
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
5.328358900000012
>>> min(timeit.repeat(setup=setup, stmt=split_string))
34.19867759999988
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.4214780000002065
```

This very quickly became painful to wait out. I’m a bit afraid to try the `long_string` test to be honest. At any rate, let’s check out the performance for the `first_space` string (and recall that the bugged solution doesn’t work as expected):

```python
>>> split_string_bug = """
split_string_bug(first_space)
"""
>>> split_string = """
split_string(first_space, [" "])
"""
>>> split_python = """
first_space.split()
"""
>>> min(timeit.repeat(setup=setup, stmt=split_string_bug))
3.8263317999999344
>>> min(timeit.repeat(setup=setup, stmt=split_string))
20.963715100000172
>>> min(timeit.repeat(setup=setup, stmt=split_python))
0.2931996000002073
```
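As a reminder of why the bugged solution’s numbers should be taken with a grain of salt, here is a small check of what it actually returns for the `first_space` string compared to the builtin. The function is copied from the setup string above.

```python
def split_string_bug(my_string: str):
    items = []
    whitespace_chars = [" "]
    start_index = 0
    end_index = 0
    for character in my_string:
        if character in whitespace_chars:
            items.append(my_string[start_index:end_index])
            start_index = end_index + 1
        end_index += 1
    items.append(my_string[start_index:end_index])
    return items


first_space = "    Well, what do we have here?"

# Each of the four leading spaces contributes an empty string.
print(split_string_bug(first_space))
# ['', '', '', '', 'Well,', 'what', 'do', 'we', 'have', 'here?']

print(first_space.split())
# ['Well,', 'what', 'do', 'we', 'have', 'here?']
```

So the bugged version is faster than our robust one partly because it does less work, and it gives the wrong answer here.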

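At this point, I’m not seeing much difference in the trend, so I figured I’d spare you the rest of the data dump. Here are the results shown above in table form (best time in seconds, using the `timeit` default of one million executions per repeat):

| Test string | `split_string_bug` | `split_string` | `split()` |
| --- | --- | --- | --- |
| `zero_spaces` | 0.722 | 2.867 | 0.097 |
| `one_space` | 1.413 | 6.759 | 0.160 |
| `many_spaces` | 5.328 | 34.199 | 0.421 |
| `first_space` | 3.826 | 20.964 | 0.293 |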

Clearly, the builtin method should be the go-to method for splitting strings.

## Challenge

At this point, we’ve covered just about everything I want to talk about today. As a result, I’ll leave you with this challenge.

We’ve written a function which can be used to split any string we like by any separator. How could we go about writing something similar for numbers? For example, what if I wanted to split a number every time the number 256 appears?

This could be a cool way to create a fun coding scheme where ASCII codes could be embedded in a large number:

```python
secret_key = 72256101256108256108256111
```

We could then delineate each code by some separator code, in this case 256 because it’s outside of the ASCII range. Using our method, we could split our coded string by the separator and then make sense of the result using `chr()`:

```python
arr = split_nums(secret_key, 256)  # [72, 101, 108, 108, 111]
print("".join([chr(x) for x in arr]))
```

If you read my article on obfuscation, you already know why this might be desirable. We could essentially write up an enormous number and use it to generate strings of text. Anyone trying to reverse engineer our solution would have to make sense of our coded string.

Also, I think something like this is a fun thought experiment; I don’t expect it to be entirely useful. That said, feel free to share your solutions in the comments.
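If you’d like a starting point, here is one possible sketch that leans on the builtin `split()` by treating the number as a string of digits. It is only one way to read the challenge: `split_nums` is the hypothetical name used above, and this version assumes a non-negative number whose codes never combine with their neighbors to spell out the separator’s digits.

```python
def split_nums(number: int, separator: int) -> list:
    """Split the decimal digits of number on the digits of separator."""
    pieces = str(number).split(str(separator))
    # Skip empty pieces, which show up if the separator repeats or sits at an edge.
    return [int(piece) for piece in pieces if piece]


secret_key = 72256101256108256108256111
arr = split_nums(secret_key, 256)
print(arr)                           # [72, 101, 108, 108, 111]
print("".join(chr(x) for x in arr))  # Hello
```

A purely arithmetic solution (no string conversion) is left as the real challenge.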

## A Little Recap

And with that, we’re done! As always, here are all the solutions from this article in one convenient location:

```python
my_string = "Hi, fam!"

# Split that only works when there are no consecutive separators
def split_string_bug(my_string: str):
    items = []
    whitespace_chars = [" ", ..., "\n"]
    start_index = 0
    end_index = 0
    for character in my_string:
        if character in whitespace_chars:
            items.append(my_string[start_index:end_index])
            start_index = end_index + 1
        end_index += 1
    items.append(my_string[start_index:end_index])
    return items

split_string_bug(my_string)  # ["Hi,", "fam!"]

# A more robust, albeit much slower, implementation of split
def next_word_or_separator(text: str, position: int, separators: list):
    test_separator = lambda x: text[x] in separators
    end_index = position
    is_separator = test_separator(position)
    while end_index < len(text) and is_separator == test_separator(end_index):
        end_index += 1
    return text[position:end_index]

def split_string(my_string: str, seps: list):
    items = []
    i = 0
    while i < len(my_string):
        sub = next_word_or_separator(my_string, i, seps)
        if sub[0] not in seps:
            items.append(sub)
        i += len(sub)
    return items

split_string(my_string, [" "])  # ["Hi,", "fam!"]

# The builtin split solution **preferred**
my_string.split()  # ["Hi,", "fam!"]
```


If you’d like to go the extra mile, check out my article on ways you can help grow The Renegade Coder. This list includes ways to get involved like hopping on my mailing list or joining me on Patreon.

Once again, thanks for stopping by. Hopefully, you found value in this article and you’ll swing by again later! I’d appreciate it.

 Published on Web Code Geeks with permission by Jeremy Grifski, partner at our WCG program. See the original article here: How to Split a String by Whitespace in Python: Brute Force and split() Opinions expressed by Web Code Geeks contributors are their own.