Python: Extract Substrings using Index, Slice, or Regular Expression patterns


We will go over how to use the string class’s built-in features to manipulate strings, along with some of the more popular techniques for extracting substrings: indexing, slicing, and regular expression patterns.

This can be useful when you’re working with data that contains a lot of text, such as website content or product descriptions. Keep reading to learn how to use Python to extract substrings!

There are many ways to extract substrings from strings in Python. We will go over four of them: indexing, slicing, the split() method, and regular expression patterns.

Extract string by slicing

# [start:end] extracts the characters from 'start' up to, but not including, 'end'.
# If 'start' is omitted, the range is from the beginning, and if 'end' is omitted, it is the range to the end.
s = 'xyzabc123'
print(s[1:3])       # yz
print(s[:3])        # xyz
print(s[1:])        # yzabc123
# As with indices, negative values are allowed.
print(s[-4:-2])     # c1
print(s[:-2])       # xyzabc1
print(s[-4:])       # c123
# Also, specifying a range that does not exist does not result in an error. Out of range is ignored.
print(s[-100:100])  # xyzabc123
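The difference between slicing and single-position indexing is easy to miss, so here is a minimal sketch of it. The variable names are only illustrative; the behavior shown (slices are clamped, a single out-of-range index raises IndexError) is standard Python.

```python
s = 'xyzabc123'

# Slicing clamps out-of-range bounds instead of raising an error.
clamped = s[5:100]
print(clamped)        # c123

# A slice that lies entirely outside the string is simply empty.
empty = s[50:100]
print(repr(empty))    # ''

# Indexing a single position is stricter and raises IndexError.
try:
    s[100]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)  # string index out of range
```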

Let’s look at an example string: ‘Extracting substrings is fun!’. If we wanted to extract the word ‘substrings’ from this string, we could do so with a slice like this:

s = 'Extracting substrings is fun!'
result = s[11:21]
print(result)  # substrings

As you can see, we were able to extract the word ‘substrings’ by specifying the index of its first character (11) and the index just past its last character (21). The start index is inclusive and the end index is exclusive.
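Counting characters by hand is error-prone, so one option is to let Python compute the indices. The sketch below uses the built-in str.find() method; the helper name extract_word is hypothetical and only there for illustration.

```python
def extract_word(text, word):
    """Hypothetical helper: slice `word` out of `text`, or return None if absent."""
    start = text.find(word)   # index of the first occurrence, or -1 if not found
    if start == -1:
        return None
    end = start + len(word)   # the end index of a slice is exclusive
    return text[start:end]

sentence = 'Extracting substrings is fun!'
print(sentence.find('substrings'))           # 11
print(extract_word(sentence, 'substrings'))  # substrings
print(extract_word(sentence, 'boring'))      # None
```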

In addition to the ‘start’ and ‘stop’ positions, you can specify an increment, ‘step’, by writing [start:stop:step]. A step larger than 1 skips characters, so you can extract every second or third character. If you specify a negative value for ‘step’, the characters are extracted in order from the back.

s = 'xyzabc123'
print(s[1:4:2])  # ya
print(s[::2])    # xzb13
print(s[::3])    # xa1
print(s[::-1])   # 321cbazyx
print(s[::-2])   # 31bzx
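The [start:stop:step] notation can also be stored in a variable with the built-in slice() object, which may help when the same slice is reused. This is a small sketch; is_palindrome is a hypothetical helper added only to show a common use of the negative step.

```python
s = 'xyzabc123'

# slice(start, stop, step) describes the same range as [start:stop:step].
every_other = slice(None, None, 2)   # equivalent to [::2]
backwards = slice(None, None, -1)    # equivalent to [::-1]

print(s[every_other])  # xzb13
print(s[backwards])    # 321cbazyx

def is_palindrome(word):
    """Hypothetical helper: True if `word` reads the same in both directions."""
    return word == word[::-1]

print(is_palindrome('level'))  # True
print(is_palindrome('slice'))  # False
```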

We could also have extracted ‘substrings’ by using a negative end index, counted from the end of the string, like this:

s = 'Extracting substrings is fun!'
result = s[11:-8]
print(result)  # substrings

As you can see, slicing is very similar to indexing except that it takes start and end indices as well as a step size (which is optional). The step size specifies how many characters to move over when extracting substrings. The default step size is 1, which is what we used in our previous example.

If we had wanted to extract every other character of the word ‘substrings’, we could have done so like this:

s = 'Extracting substrings is fun!'
print(s[11:-8:2])  # sbtig

This extracts every other character of the slice that starts at index 11 and stops just before index -8 (the eighth character from the end).

The last way we will discuss for extracting substrings is by using the split() method like this:

s = 'Extracting substrings is fun!'
print(s.split(' ')[2])  # is

The split() method splits a string into a list of substrings based on a specified delimiter (in our case, a space). We then index into that list to get our desired substring (‘is’).
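The choice of delimiter matters when the text contains repeated spaces. The sketch below uses a made-up variant of the example string with a double space to compare split(' ') with split() called without arguments, and also shows the optional maxsplit parameter.

```python
line = 'Extracting  substrings is fun!'  # note the double space

# An explicit ' ' delimiter produces an empty string between the two spaces.
by_space = line.split(' ')
print(by_space)       # ['Extracting', '', 'substrings', 'is', 'fun!']

# With no argument, runs of whitespace count as a single separator.
by_whitespace = line.split()
print(by_whitespace)  # ['Extracting', 'substrings', 'is', 'fun!']

# maxsplit limits the number of splits; the rest stays in one piece.
first_and_rest = line.split(maxsplit=1)
print(first_and_rest) # ['Extracting', 'substrings is fun!']
```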

Extract substring using regular expression: re.search(), re.findall()

re.search(): extract the first part of a string that matches a regular expression pattern

import re
s = '333-777-555'
result = re.search(r'\d+', s)
print(result)          # <re.Match object; span=(0, 3), match='333'>
print(result.group())  # 333

\d matches a digit and + means one or more occurrences, so \d+ matches one or more consecutive digits. Other common metacharacters:

\ drops the special meaning of the character following it (discussed below)
[] represents a character class
^ matches the beginning
$ matches the end
. matches any character except newline
? matches zero or one occurrence
| means OR (matches any of the patterns separated by it)
* matches any number of occurrences (including 0 occurrences)
+ matches one or more occurrences
{} indicates the number of occurrences of the preceding RE to match
() encloses a group of REs
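One detail worth guarding against: re.search() returns None when nothing matches, so calling .group() directly on the result can raise AttributeError. A minimal sketch of a safer pattern follows; first_number is a hypothetical helper name.

```python
import re

def first_number(text):
    """Hypothetical helper: return the first run of digits in `text`, or None."""
    match = re.search(r'\d+', text)
    # re.search() returns None when there is no match, so check before .group().
    return match.group() if match else None

print(first_number('333-777-555'))  # 333
print(first_number('no digits'))    # None
```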

re.findall(): return all matching parts as a list of strings

It returns a list of all non-overlapping matches of the pattern in the string. The string is scanned from left to right, and matches are returned in the order in which they were found.

import re
s = '333-777-555'
print(re.findall(r'\d+', s))  # ['333', '777', '555']
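If you also need to know where each match sits in the string, re.finditer() yields match objects instead of plain strings. The following is a small sketch of that variant on the same sample data.

```python
import re

s = '333-777-555'

# Each match object exposes the matched text and its (start, end) span.
matches = [(m.group(), m.span()) for m in re.finditer(r'\d+', s)]
print(matches)  # [('333', (0, 3)), ('777', (4, 7)), ('555', (8, 11))]

# The spans can be fed straight back into slicing.
start, end = matches[1][1]
print(s[start:end])  # 777
```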

Patterns that use regular expressions

Wildcard Pattern Matching

. matches any single character except newline.
* matches 0 or more repetitions of the previous pattern.

import re
print(re.findall('x.*y', 'x123abcy'))  # ['x123abcy']
print(re.findall('x.*y', 'x---y'))     # ['x---y']
print(re.findall('x.*y', 'xy'))        # ['xy']

+ matches one or more repetitions of the previous pattern. For example, x.+y does not match xy.

import re
print(re.findall('x.+y', 'xy'))     # []
print(re.findall('x.+y', 'xay'))    # ['xay']
print(re.findall('x.+y', 'xabcy'))  # ['xabcy']

? matches the previous pattern either 0 or 1 time.

For example, x.?y matches xy, and it also matches when exactly one character sits between x and y.

import re
print(re.findall('x.?y', 'xy'))    # ['xy']
print(re.findall('x.?y', 'xay'))   # ['xay']
print(re.findall('x.?y', 'xaay'))  # []

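Minimal (non-greedy) match: match as short a text as possible

By default *, + and ? are greedy and match as much text as they can. Adding a ? after them (*?, +?, ??) makes them match as little as possible.

import re
s = 'xay-xaaaaaaay'
print(re.findall('x.*y', s))   # ['xay-xaaaaaaay']
print(re.findall('x.*?y', s))  # ['xay', 'xaaaaaaay']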
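A typical place where the greedy/non-greedy difference shows up is text with repeated delimiters. The sketch below uses a made-up HTML-like string; the negated character class in the last line is an alternative technique, not something the examples above rely on.

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # made-up sample text

# Greedy: .* runs to the last '>' in the string.
greedy = re.findall(r'<.*>', html)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first '>' it can.
lazy = re.findall(r'<.*?>', html)
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']

# Alternative: forbid '>' inside the tag instead of using a lazy quantifier.
negated = re.findall(r'<[^>]*>', html)
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```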

Extract Part of the Pattern with Parentheses

If you enclose part of the regular expression pattern string in parentheses (), you can extract that part of the string.

import re
print(re.findall('x(.*)y', 'xabcdy'))  # ['abcd']

If the target string itself contains parentheses, escape them in the pattern as \( and \) to match them literally. To extract only the text between them, wrap that part of the pattern in a second, unescaped pair of parentheses.

import re
print(re.findall(r'\(.+\)', 'abc(xyz)ghi'))    # ['(xyz)']
print(re.findall(r'\((.+)\)', 'abc(xyz)ghi'))  # ['xyz']
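When a pattern contains more than one group, re.findall() returns a list of tuples, one item per group. The sketch below uses a made-up date string to show this, along with named groups via (?P<name>...) and the non-capturing form (?:...).

```python
import re

s = '2023-01-15 and 2024-12-31'  # made-up sample text

# Several groups: each match becomes a tuple of the captured parts.
dates = re.findall(r'(\d{4})-(\d{2})-(\d{2})', s)
print(dates)  # [('2023', '01', '15'), ('2024', '12', '31')]

# Named groups make the parts of a single match easier to read.
m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', s)
print(m.group('year'), m.group('month'))  # 2023 01

# (?:...) groups without capturing, so only the month is returned.
months = re.findall(r'(?:\d{4})-(\d{2})', s)
print(months)  # ['01', '12']
```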

Match any single character from a set

Enclose characters in [] to match any one of them.

[a-z] represents any single lowercase alphabetic character.

import re
print(re.findall('[xyz]a', 'xa-ya-za'))     # ['xa', 'ya', 'za']
print(re.findall('[xyz]+', 'xyz-yzx-zxy'))  # ['xyz', 'yzx', 'zxy']
print(re.findall('[a-z]+', 'abc-xyz'))      # ['abc', 'xyz']

Extract by character type

import re
s = 'xyz-012-vfv'
print(re.findall('[0-9]+', s))  # ['012']
print(re.findall('[a-z]+', s))  # ['xyz', 'vfv']
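Character classes can also be combined or negated. As a brief sketch on the same sample string: a leading ^ inside [] inverts the class, and \d is the shorthand for a digit used earlier in this article.

```python
import re

s = 'xyz-012-vfv'

# \d is a shorthand for a digit (for str patterns it also covers non-ASCII digits).
digits = re.findall(r'\d+', s)
print(digits)        # ['012']

# Ranges can be combined inside one class.
alphanumeric = re.findall(r'[a-z0-9]+', s)
print(alphanumeric)  # ['xyz', '012', 'vfv']

# A leading ^ negates the class: everything except the hyphen.
not_hyphen = re.findall(r'[^-]+', s)
print(not_hyphen)    # ['xyz', '012', 'vfv']
```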

Extract from beginning/end

If you only want to extract strings that start at the beginning or end at the end, use the metacharacters ^ (match at the beginning) and $ (match at the end).

import re
s = 'abc-xyz-vfv'
print(re.findall('[a-z]+', s))   # ['abc', 'xyz', 'vfv']
print(re.findall('^[a-z]+', s))  # ['abc']
print(re.findall('[a-z]+$', s))  # ['vfv']
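By default ^ and $ refer to the start and end of the whole string. For multi-line text, the re.MULTILINE flag makes them match at the start and end of every line. Here is a short sketch with a made-up two-line string.

```python
import re

text = 'abc-xyz\nvfv-def'  # made-up two-line sample

# Without the flag, ^ only matches at the very start of the string.
first_only = re.findall(r'^[a-z]+', text)
print(first_only)   # ['abc']

# With re.MULTILINE, ^ matches at the start of each line...
line_starts = re.findall(r'^[a-z]+', text, flags=re.MULTILINE)
print(line_starts)  # ['abc', 'vfv']

# ...and $ matches at the end of each line.
line_ends = re.findall(r'[a-z]+$', text, flags=re.MULTILINE)
print(line_ends)    # ['xyz', 'def']
```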

Extract with multiple patterns

Use | if you want to extract parts that match any of the multiple patterns.

To match either regular expression pattern A or pattern B, write A|B.

import re
s = 'xabcy-777'
print(re.findall('x.*y', s))        # ['xabcy']
print(re.findall(r'\d+', s))        # ['777']
print(re.findall(r'x.*y|\d+', s))   # ['xabcy', '777']
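Note that | has low precedence: it splits the whole pattern unless you limit it with a group. The sketch below, on a made-up string, contrasts a grouped alternation with an ungrouped one.

```python
import re

s = 'gray grey groy'  # made-up sample text

# The group restricts the alternation to the single letter a or e.
grouped = re.findall(r'gr(?:a|e)y', s)
print(grouped)    # ['gray', 'grey']

# Without a group, the alternatives are the whole left side and the whole right side.
ungrouped = re.findall(r'gra|ey', s)
print(ungrouped)  # ['gra', 'ey']
```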

Extract case-insensitively

The re module is case-sensitive by default. Pass re.IGNORECASE as the flags argument to match case-insensitively.

import re
s = 'xyz-Xyz-XYZ'
print(re.findall('[a-z]+', s))                       # ['xyz', 'yz']
print(re.findall('[A-Z]+', s))                       # ['X', 'XYZ']
print(re.findall('[a-z]+', s, flags=re.IGNORECASE))  # ['xyz', 'Xyz', 'XYZ']
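There are other ways to request case-insensitive matching. As a sketch: the flag can be attached once to a compiled pattern with re.compile(), or written inline as (?i) at the start of the pattern.

```python
import re

s = 'xyz-Xyz-XYZ'

# Compile once with the flag, then reuse the pattern object.
pattern = re.compile(r'[a-z]+', re.IGNORECASE)
compiled_result = pattern.findall(s)
print(compiled_result)  # ['xyz', 'Xyz', 'XYZ']

# The inline flag (?i) at the start of the pattern has the same effect.
inline_result = re.findall(r'(?i)xyz', s)
print(inline_result)    # ['xyz', 'Xyz', 'XYZ']
```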

Andy Avery
