REGEXEXTRACT Google Sheets function is part of the suite of REGEX functions available to many users.
The function is extremely handy when you are trying to dig out a certain string that is part of a larger one.
The only prerequisite to using this function is a firm grasp of regular expressions.
In this tutorial, we will discuss the REGEXEXTRACT function in detail, along with some examples of how you can apply them in some common spreadsheet tasks.
This Article Covers:
What does the REGEXEXTRACT Function Do?
The REGEXEXTRACT function mainly uses regular expressions to extract matching substrings from a string. It takes in a string and a regular expression and returns the part of the string that matches the pattern in the regular expression.
Syntax for the REGEXEXTRACT Function
The syntax for the REGEXEXTRACT function is as follows:
REGEXEXTRACT(text, reg_exp)
Here,
- text is the text or string from which you want to extract a substring
- reg_exp is a regular expression. The expression matches the part of the text that you want to extract. The regular expression parameter should be provided in double-quotes.
Note: The function always returns the first part of the text that matches the pattern in reg_exp.
Applications of the REGEXEXTRACT Function
The REGEXEXTRACT function can be quite helpful when you want to extract valuable information from a set of strings that are not exactly ‘homogenous’ or consistent in format.
Here are some useful applications of the REGEXEXTRACT function. You can use it to:
- Extract the first or last few characters from a string
- Extract numbers from a string
- Extract whole words based on a partial match
- Extract one of a list of words
- Extract contents between certain characters
- Extract different parts of a URL
- Extract different parts of email addresses
Let us see how REGEXREPLACE can be used in each of the above applications.
Using the REGEXEXTRACT Function to Extract the First or Last few Characters from a String
Let’s first see how you can use REGEXREPLACE to extract the first or last few characters or words from a string.
Let us say you have the following list of book titles in column A:
If you want to extract just the first, say 3 characters from each cell, you can use the single dot symbol (.). A single dot in a regular expression is used to represent a single character. So if you want to extract 3 characters from a string, you need to provide 3 dots in the regular expression parameter, as follows:
=REGEXEXTRACT(A2,"...")
Here’s the result you will get:
Similarly, if you want to extract the last 3 characters from each cell, you can use 3 dots followed by a $ metacharacter, since the dollar symbol represents the end of a string.
So your formula would be:
=REGEXEXTRACT(A2,"...$")
Here’s the result you will get:
The dot symbol represents any character, including space or any other symbol. So using “.+” in the regular expression will simply extract the entire text in the cell.
If you want to make sure you only extract alphanumeric characters, then instead of the dot symbol, you could use the \w metacharacter, which represents a single alphanumeric character (a digit, a letter, or an underscore).
So, if you want to extract the whole first word, you would need to use the combination “\w+”. This will ensure that any characters before the first space get extracted, as follows:
=REGEXEXTRACT(A2,"\w+")
Similarly, to extract the last word, the formula would be:
=REGEXEXTRACT(A2,"\w+$")
Here’s the result you will get:
Using the REGEXEXTRACT Function to Extract Numbers from a String
The \d metacharacter represents a numeric digit. As such, if you want to extract the first number from a string, you can use the expression “\d+” as follows:
=REGEXEXTRACT(A2,"\d+")
For the following list of strings, here are the results you would get:
Using the REGEXEXTRACT Function to Extract Whole words Based on Partial Match
Let’s say you have the following list of strings and want to extract all the license plate numbers that start with the symbols ‘L-’:
You could then use the required string pattern, followed by “\w+” as follows:
=REGEXEXTRACT(A2,"L-\w+")
This will give you the following result:
This kind of regular expression can also be used if you want to extract words that follow a particular pattern, for example, let’s say you have the following list of sentences:
If you want to extract the first word in each string that starts with ‘bo’ and ends with a ‘d’ then you can use the REGEXEXTRACT function as follows:
=REGEXEXTRACT(A2,"bo\w+d")
This will give you the following result:
Notice in the last example, the function extracted only the first word in the string that followed the pattern ‘bo…d’. If instead you want it to extract the last word, then you would need to add the dollar metacharacter ($) in the end:
=REGEXEXTRACT(A2,"bo\w+d$")
This will give you the following result:
Using the REGEXEXTRACT Function to Extract One of a List of Words
The metacharacter ‘|’ represents an Or operation. So, if you want to extract one word from a list of words or characters, then you can use this character in the REGEXMATCH function.
For example, let us say you have the following list of strings:
If you want to extract the first occurrence of any of the words red, blue, green or yellow in the cell A2, then you can use the REGEXEXTRACT function as follows:
=REGEXEXTRACT(A2,"red|blue|green|yellow")
This will give you the following result:
Using the REGEXEXTRACT Function to Extract Contents between Certain Characters
A common use of the REGEXREPLACE function is to extract contents between certain characters. For example, say you have copied some markup text from a website and need to extract only the text part of it, removing the HTML tags:
Now you might think that simply using the regular expression ‘>.+<’ would be enough to extract all the contents in between the ‘>’ and ‘<’ symbols:
=REGEXEXTRACT(A2,">.+<")
However, this would also extract the symbols along with the text in between, as shown below:
To extract only the text in between and exclude the demarcating symbols, you would need to enclose the “.+” metacharacter in group brackets ‘()’. This will ensure that only the contents inside the group brackets get extracted:
=REGEXEXTRACT(A2,">(.+)<")
This will give you the following result:
Using the REGEXEXTRACT Function to Extract Different Parts of a URL
If you want to extract the domain name of a URL, you can use REGEXEXTRACT as follows:
=REGEXEXTRACT(A2,"http.+\ / \ /(.+) \ /")
This will extract all the contents in between the pattern HTTP:// (or HTTPS://) and the ‘/’ symbol.
The above formula will give you the following result:
If you further want to remove everything and only extract the main domain name (without any of the subdomains or extensions, then your REGEXEXTRACT function can be refined as follows:
=REGEXEXTRACT(A27,"http.+\ / \ / \ w+\.(.+)\.[org|com]")
Here we made sure that any words before the dot and after the dot are removed. Since the word following the dot can be any one of the words ‘org’ or ‘com’, we specified these within square brackets.
This will give you the following result:
Using the REGEXEXTRACT Function to Extract Different Parts of an Email Address
Similar to the previous example, we can also use REGEXEXTRACT to extract parts of an email address. For example, say you have the following list of email addresses:
If you want to extract only the username part of the email addresses, then you can use the REGEXEXTRACT function to extract everything that comes before the ‘@’ symbol as follows:
=REGEXEXTRACT(A33,"(.+)@")
This will give you the following result:
We used the dot metacharacter instead of \w because we want the expression to consider any symbol in the username, including dots, hyphens, or underscores (as in the third example).
If instead of the username you are more interested in extracting the domain name part of the email address, then you can use the REGEXEXTRACT function as follows:
=REGEXEXTRACT(A33,"@(.+)")
This will give you the following result:
Using the REGEXEXTRACT Function to Extract a Particular Pattern of Characters
Let us say you have the following list of strings and want to extract the phone numbers from each cell:
Since the US phone numbers all follow the same pattern, you can use the REGEXEXTRACT function as follows:
=REGEXEXTRACT(A40,"\(...\)...-....")
Here, each dot represents one character. However, instead of putting so many dots, you could shorten the regular expression by following the dot with the number of characters enclosed in curly brackets.
So instead of “…” you can use “.{3}” in your expression. This means the above formula can also be written as:
=REGEXEXTRACT(A40,"\(.{3}\).{3}-.{4}")
You have 3 numbers between rounded brackets, followed by three more numbers, followed by a hyphen and 4 more numbers.
This will give you the following result:
Now here’s a cool trick. If you further want to separate parts of the phone number into separate columns for area code, exchange code and subscriber number, you can enclose in rounded brackets each part that you want in a single column.
So to separate the result of the above formula into three different columns, your REGEXEXTRACT function can be written as:
=REGEXEXTRACT(A40,"\((.{3})\)(.{3})-(.{4})")
This will give you the following result:
These were a few simple examples of how the REGEXEXTRACT function can effectively help you extract exactly what you need from a string.
REGEXEXTRACT Google Sheets Function Tip:
Here are a few important tips that you need to remember when using the REGEXEXTRACT function.
- This function only works with text input. It does not work with numbers
- If you want to use numbers as input (for example telephone numbers), you need to first convert it to text, using the TEXT function.
- The REGEXEXTRACT function is case-sensitive. Therefore, you will need to specify the correct case inside the regular expression or convert the entire input string to upper or lower case using the UPPER or LOWER functions.
The REGEXEXTRACT function can have numerous applications, once you learn how to use it effectively. Strong knowledge of regular expressions helps, and the best way to get a good hold over it is to practice.
Play around with different regular expressions and see what results you get. You will be surprised by how useful the REGEXEXTRACT function can be once you start using it on your day-to-day spreadsheet data.