How Do You Perform SQL LIKE Queries for Pattern Matching?

How Do You Perform SQL LIKE Queries for Pattern Matching


The LIKE SQL query allows you to perform a pattern-matching search of your text data. Learn how to use it and add wildcards for increased flexibility.

In today's article, we will tackle the question of using the SQL operator LIKE to match patterns in text data.

We will talk about why pattern matching is important and how it’s done in SQL. You will learn the LIKE operator’s syntax and different variations in patterns. After that, we’ll show you how to perform simple pattern matching, but also some complex pattern matching examples.

Importance of Pattern Matching in SQL Queries

Being able to match patterns gives you flexibility when querying databases and makes it more customizable.

Importance of Pattern Matching in SQL Like Queries

1. Increased Efficiency

Without pattern matching, you would have to know how exactly data in each column is stored to output precisely that data. Imagine being limited to retrieving only exact matches. If you want to output all the customers whose surname starts with the letter ‘B’, you would have to go through the whole database (thousands, tens of thousands, millions of rows?) and include each and every exact last name that starts with ‘B’. Talking about efficiency, ha? Pattern matching allows you to write only this one criterion – find the last names beginning with B – in your query, and it’ll output what you want.

2. Data Exploration

When you first explore data, you’ll be unfamiliar with it. Of course, that means you might not know search terms in advance. Pattern matching is one of the methods that allows you to get to know your data efficiently, as it allows ‘approximate’ querying. Once you know your data that way, you’ll be able to query it using exact matches.

3. Handling Incomplete Data

Even if you’re familiar with data, you might have to query data without the complete information. For instance, you’re looking for a customer, but you only have the last four digits of their phone number. If you look for a phone number that ends with, say, ‘8947’, you could find the customer or, at least, narrow down the search to several customers.

4. Data Cleaning

Data is often full of inconsistencies, especially when it comes from various data sources. Pattern matching is essential for data cleaning, as it allows you to search for data that doesn’t match a specific pattern, e.g., the initial of a middle name not followed by a dot, and correct these inconsistencies or plain mistakes.

5. Data Retrieval Customization

Customizing data searches has a wide application in extracting information from large text datasets, which are used in text mining, natural language processing, and log and transaction records analysis. Such customization requires the use of advanced pattern-matching techniques, which often involve the use of regular expressions (regex).

What is SQL LIKE Operator?

The LIKE operator is one of the numerous logical operators in SQL.

LIKE is commonly used in the WHERE clause to return data that matches a pattern specified in the LIKE operator. However, it can also be used in the HAVING clause or CASE WHEN statements.

It is commonly used with another logical operator, namely NOT. This does exactly the opposite of LIKE, as NOT LIKE returns all the data that doesn’t match a specified pattern.

Syntax of SQL LIKE Operator

As LIKE is most often used in the WHERE clause, it can be a part of SELECT, UPDATE, or DELETE statements.

The syntax of the LIKE operator itself is always the same. Here it is in the context of SELECT.

SELECTFROM table_name
WHERE column_name LIKE 'pattern';


The pattern in LIKE can include SQL wildcards, which is our next topic.

Differences Between %, _, and [ ] in Patterns

All three characters mentioned in the subheading above are SQL wildcards. They are special characters that substitute one or more characters in the pattern definition to be matched by a query.

The Percent Sign (%)

The percent sign (%) wildcard represents zero, one, or more characters, i.e., any character sequence. Where you put it in a pattern determines the positions of that character sequence.

For instance, the query below looks for the employees whose last name starts with ‘B’.

SELECT first_name,
       last_name
FROM employees
WHERE last_name LIKE 'B%';

To look for the employees whose last name ends with ‘b’, you would write this.

SELECT first_name,
       last_name
FROM employees
WHERE last_name LIKE '%B';

If you want to look for a pattern anywhere in a string, put a percent sign on either side of a pattern. The example below looks for employees whose last name contains ‘han’ anywhere in their last name. It would, for instance, return the Johanson last name.

SELECT first_name,
       last_name
FROM employees
WHERE last_name LIKE '%han%';

The Underscore Sign (_)

This wildcard replaces only one sign at a specific position in a string.

For example, the code below will return all customers named Shawn or Shaun.

SELECT first_name,
       last_name
FROM employees
WHERE first_name LIKE ‘Sha_n%’;

The Square Brackets ([ ])

This wildcard is used when looking for data that matches any single character from a data range or data set specified in the square brackets.

In this example, the query looks for values in the grade column, which contains single-character grades. It will output all the students with grades A, B, or C.

SELECT first_name,
       last_name 
FROM students 
WHERE grade LIKE '[ABC]';

Case Sensitivity and Collation

Collation in the database refers to the rules for sorting and comparing data. This is important for pattern matching, as collation can be case-insensitive or case-sensitive, depending on the database you use.

If your database performs a case-sensitive data search, then the pattern looking for Shaun or Shawn has to be written like below, i.e., with the capital ‘S’.

SELECT first_name,
last_name
FROM employees
WHERE first_name LIKE 'Sha_n%';

If the search is case-insensitive, then you could even write the query with the lowercase ‘s’; it would nevertheless return Shaun or Shawn.

SELECT first_name,
       last_name
FROM employees
WHERE first_name LIKE 'sha_n%';

Here’s an overview of the most popular databases and their default search setup.

SQL Like Queries for Database Case Sensitivity

In most examples, we will use the PostgreSQL syntax, so it’s important to say it has the operator ILIKE. It’s the same as LIKE, with the same syntax, only case-insensitive. We’ll use both, depending on the suitability.

So, remember: LIKE for case-sensitive searches, ILIKE for case-insensitive searches.

Basic Usage of SQL LIKE

Now that we have the fundamentals in place let's learn how to use LIKE in practice.

Simple Pattern Matching

How to perform a simple pattern matching using LIKE is demonstrated in this Shopify interview question.


Tables: shopify_orders, shopify_carriers

Link to the question: https://platform.stratascratch.com/coding/2116-shipped-by-speedy-express

The question asks you to find the number of orders shipped by Speedy Express.

There are two tables to work with. The first table is shopify_orders.

Table: shopify_orders
order_idshop_iduser_idorder_amounttotal_itemspayment_methodcreated_atresp_employee_idcarrier_id
1537462242cash2017-03-13 12:36:56313485
292925901cash2017-03-03 17:38:52315483
3448611441cash2017-03-14 04:23:56312485
4189351561credit_card2017-03-26 12:43:37313
5188831561credit_card2017-03-01 04:35:11314483

The second is shopify_carriers.

Table: shopify_carriers
idname
482Jet Delivery
483Road Runner
484Speedy Express
485Express It
486Global Pack Ship

The solution is simple. It joins the tables and uses COUNT(*) to count the orders. To do that only for the Speedy Express carrier, we need to look in the column name for this pattern defined in LIKE.

Also, since we’re using the case-insensitive search (LIKE), we need to spell the carrier name correctly with a capital ‘S’ and a capital ‘E’.

SELECT COUNT(*) AS n_shipped
FROM shopify_orders
JOIN shopify_carriers ON carrier_id = id
WHERE name LIKE 'Speedy Express';

The output shows there are seven orders shipped by Speedy Express.

All required columns and the first 5 rows of the solution are shown

n_shipped
7


Of course, using LIKE in that manner actually looks for an exact string, so you could get the same result using the equals operator (=) instead.

SELECT COUNT(*) AS n_shipped
FROM shopify_orders
JOIN shopify_carriers ON carrier_id = id
WHERE name = 'Speedy Express';

Pattern Matching Using the % Wildcard

I’ll show you two examples of using the percentage wildcard. The first example uses the single % wildcard, while the second uses two.

Single % Wildcard

Here’s an interesting example by Google.


Tables: google_gmail_emails, google_gmail_labels

Link to the question: https://platform.stratascratch.com/coding/10120-number-of-custom-email-labels

The question requires us to find the number of occurrences of custom email labels for each user receiving an email.

We need to query two tables, the first one being google_gmail_emails.

Table: google_gmail_emails
idfrom_userto_userday
06edf0be4b2267df1fa75d295377a46f8323610
16edf0be4b2267df1fa32ded68d89443e8086
26edf0be4b2267df1fa55e60cfcc9dc49c17e10
36edf0be4b2267df1fae0e0defbb9ec47f6f76
46edf0be4b2267df1fa47be2887786891367e1

The second table is named google_gmail_labels.

Table: google_gmail_labels
email_idlabel
0Shopping
1Custom_3
2Social
3Promotion
4Social

The solution selects the required columns and uses COUNT(*) to find the number of email label occurrences.

However, we only need to output emails with custom labels. Scrolling through the table google_gmail_labels reveals multiple custom labels, e.g., ‘Custom_3’ or ‘Custom_2’.

So, we basically need to look for all the labels that begin with ‘custom’. That’s exactly what we do by specifying putting the % wildcard at the end of the ‘custom%’ pattern in ILIKE. It will case-insensitively look for the labels that begin with ‘custom’, regardless of what follows it, if anything.

SELECT to_user AS user_id,
       label,
       COUNT(*) AS n_occurences
FROM google_gmail_emails e 
INNER JOIN google_gmail_labels l
ON e.id = l.email_id 
WHERE l.label ILIKE 'custom%'
GROUP BY to_user, label;

The output shows the user ID, the email labels, and the number of occurrences.

All required columns and the first 5 rows of the solution are shown

user_idlabeln_occurences
8bba390b53976da0cdCustom_32
5dc768b2f067c56f77Custom_24
114bafadff2d882864Custom_31
32ded68d89443e808Custom_24
a84065b7933ad01019Custom_11

Double % Wildcard

As we already mentioned, using LIKE is not limited to WHERE. You can use it in the conditional expression CASE WHEN like this Airbnb interview question requires.


Table: airbnb_search_details

Link to the question: https://platform.stratascratch.com/coding/9638-total-searches-for-rooms

You need to find the total number of searches for each room type using the table airbnb_search_details.

Table: airbnb_search_details
idpriceproperty_typeroom_typeamenitiesaccommodatesbathroomsbed_typecancellation_policycleaning_feecityhost_identity_verifiedhost_response_ratehost_sinceneighbourhoodnumber_of_reviewsreview_scores_ratingzipcodebedroomsbeds
12513361555.68ApartmentEntire home/apt{TV,"Wireless Internet","Air conditioning","Smoke detector","Carbon monoxide detector",Essentials,"Lock on bedroom door",Hangers,Iron}21Real BedflexibleFALSENYCt89%2015-11-18East Harlem3871002901
7196412366.36CabinPrivate room{"Wireless Internet",Kitchen,Washer,Dryer,"Smoke detector","First aid kit","Fire extinguisher",Essentials,"Hair dryer","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}23Real BedmoderateFALSELAf100%2016-09-10Valley Glen14919160611
16333776482.83HousePrivate room{TV,"Cable TV",Internet,"Wireless Internet",Kitchen,"Free parking on premises","Pets live on this property",Dog(s),"Indoor fireplace","Buzzer/wireless intercom",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Safety card","Fire extinguisher",Essentials,Shampoo,"24-hour check-in",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50","Self Check-In",Lockbox}21Real BedstrictTRUESFt100%2013-12-26Richmond District117969411811
1786412448.86ApartmentPrivate room{"Wireless Internet","Air conditioning",Kitchen,Heating,"Suitable for events","Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials,Shampoo,"Lock on bedroom door",Hangers,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}21Real BedstrictTRUENYCt93%2010-05-11Williamsburg8861121111
14575777506.89VillaPrivate room{TV,Internet,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Essentials,Shampoo,"translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}62Real BedstrictTRUELAt70%2015-10-2221009070333

The room types are apartment, private, and shared. We select the city and then count the searches for each room type separately using COUNT(), CASE WHEN, and LIKE.

The LIKE operator is used to look for the keywords ‘apt’, ‘private’, and ‘shared’ anywhere in the string, which we achieve by putting the % wildcard on both sides of the string. In other words, the room type string can, for example, start with ‘apt’, end with it, or contain it anywhere in the name.

The CASE WHEN statement says that whenever the data matches the pattern stated in LIKE, it will output the column id, and COUNT() will count it as one search. If the data doesn’t match the pattern in LIKE, CASE WHEN will output NULL, which can’t be counted by COUNT().

SELECT city,
       COUNT(CASE
                 WHEN room_type ILIKE '%apt%' THEN id
                 ELSE NULL
             END) apt_count,
       COUNT(CASE
                 WHEN room_type ILIKE '%private%' THEN id
                 ELSE NULL
             END) private_count,
       COUNT(CASE
                 WHEN room_type ILIKE '%shared%' THEN id
                 ELSE NULL
             END) shared_count
FROM airbnb_search_details
GROUP BY 1;

Here’s the required output.

All required columns and the first 5 rows of the solution are shown

cityapt_countprivate_countshared_count
DC010
Chicago310
LA34254
NYC43353
SF340

Advanced Pattern Matching Techniques

Now, we will show several examples of advanced pattern matching.

Using the Underscore (_) Wildcard for Single Character Matching

We’ll use the question by Google and Netflix and change its requirements so it requires using the underscore wildcard.


Table: fraud_score

Link to the question: https://platform.stratascratch.com/coding/10303-top-percentile-fraud

There’s the fraud_score dataset.

Table: fraud_score
policy_numstateclaim_costfraud_score
ABCD1001CA41130.613
ABCD1002CA39460.156
ABCD1003CA43350.014
ABCD1004CA39670.142
ABCD1005CA15990.889

We will use it to output the policy, claim cost, and fraud score and rank them by the fraud score ascendingly. We will do that only for the policies whose numerical part of the name is above 1149 but below 1160.

The data in the column policy_num is text type, not numerical, so we can’t use the numerical comparison operator. How do we then solve this problem? We can, of course, use LIKE with the underscore wildcard.

We list the relevant columns and use DENSE_RANK() to rank policies by the fraud score. To do that only for the policies satisfying the above criteria, we simply write the policy name as ‘ABCD115_’. This will return all the policies from ABCD1150 to  ABCD1159.

SELECT policy_num,
       claim_cost,
       fraud_score,
       DENSE_RANK() OVER (ORDER BY fraud_score) AS rank_fraud_score
FROM fraud_score
WHERE policy_num ILIKE 'ABCD115_';

Here’s the list of these ten policies.

All required columns and the first 5 rows of the solution are shown

policy_numclaim_costfraud_scorerank_fraud_score
ABCD115947940.0161
ABCD115019970.0312
ABCD115612380.253
ABCD115829580.2764
ABCD115521390.4455
ABCD115429560.546
ABCD115214940.6437
ABCD115740170.6458
ABCD115313850.7939
ABCD115128740.95910


Using the Square Bracket ([ ]) Wildcard for Character Range Matching

I’ll use this example from the City of Los Angeles, the City of San Francisco, and Tripadvisor.


Table: los_angeles_restaurant_health_inspections

Link to the question: https://platform.stratascratch.com/coding/10180-find-the-lowest-score-for-each-facility-in-hollywood-boulevard

We’ll work with the table los_angeles_restaurant_health_inspections.

Table: los_angeles_restaurant_health_inspections
serial_numberactivity_datefacility_namescoregradeservice_codeservice_descriptionemployee_idfacility_addressfacility_cityfacility_idfacility_statefacility_zipowner_idowner_namepe_descriptionprogram_element_peprogram_nameprogram_statusrecord_id
DAQHRSETQ2017-06-08MARGARITAS CAFE93A1ROUTINE INSPECTIONEE00000065026 S CRENSHAW BLVDLOS ANGELESFA0023656CA90043OW0004133BAZAN, ASCENCIONRESTAURANT (61-150) SEATS HIGH RISK1638MARGARITAS CAFEACTIVEPR0011718
DA2GQRJOS2017-03-07LAS MOLENDERAS97A1ROUTINE INSPECTIONEE00009972635 WHITTIER BLVDLOS ANGELESFA0160416CA90023OW0125379MARISOL FEREGRINORESTAURANT (0-30) SEATS HIGH RISK1632LAS MOLENDERASINACTIVEPR0148504
DAMQTA46T2016-03-22SANDRA'S TAMALES93A1ROUTINE INSPECTIONEE00010495390 WHITTIER BLVDLOS ANGELESFA0171769CA90022-4032OW0178828SANDRA'S TAMALES INC.RESTAURANT (0-30) SEATS MODERATE RISK1631SANDRA'S TAMALESACTIVEPR0164225
DAXMBTIRZ2018-02-12CAFE GRATITUDE97A1ROUTINE INSPECTIONEE0000828639 N LARCHMONT BLVD STE #102LOS ANGELESFA0058921CA90004OW0005704CAFE GRATITUDE LARCHMONT LLCRESTAURANT (61-150) SEATS HIGH RISK1638CAFE GRATITUDEACTIVEPR0019854
DAK8TBMS02015-09-10THE WAFFLE90A1ROUTINE INSPECTIONEE00007096255 W SUNSET BLVD STE #105LOS ANGELESFA0051830CA90028OW0035796THE WAFFLE, LLCRESTAURANT (61-150) SEATS HIGH RISK1638THE WAFFLEACTIVEPR0010922

We’ll ignore the question requirements and make our own. Let’s write a query that will show the grades and the highest and the lowest scores by grade.

We heard that the grades go from A to F, but we want only the three best grades, i.e., A, B, and C.

We use the MIN() and MAX() aggregate functions in the query to find the highest and lowest scores. To output only the desired grades, we can use the square brackets wildcard to specify the value range. In this case, it’s A-C, i.e., it will include grade A, grade C, and anything in between, which is only grade B, in this case.

SELECT grade,
       MIN(score) AS min_score,
       MAX(score) AS max_score
FROM los_angeles_restaurant_health_inspections
WHERE grade LIKE '[A-C]'
GROUP BY grade
ORDER BY grade;

Unfortunately, this code will work only in SQL Server, but not in PostgreSQL. Why? Because it doesn’t support character ranges within square brackets. One of the solutions for this problem is to use the tilde operator (~) instead of LIKE. It is the operator for regular expression matching.

So, the code rewritten for PostgreSQL looks like this.

SELECT grade,
       MIN(score) AS min_score,
       MAX(score) AS max_score
FROM    
    los_angeles_restaurant_health_inspections
WHERE grade ~ '[A-C]'
GROUP BY grade
ORDER BY grade;

In both cases, the output is as shown below.

All required columns and the first 5 rows of the solution are shown

grademin_scoremax_score
A90100
B8088
C7079

Combining %, _, and [ ] for Complex Patterns

Now that we learned how to use each wildcard separately, let’s combine them in one LIKE operator and search for complex patterns.

Here’s the interview question by Google.


Table: google_gmail_emails

Link to the question: https://platform.stratascratch.com/coding/10351-activity-rank

We’ll not solve it, we’ll only use its dataset, which consists of one table named google_gmail_emails.

Table: google_gmail_emails
idfrom_userto_userday
06edf0be4b2267df1fa75d295377a46f8323610
16edf0be4b2267df1fa32ded68d89443e8086
26edf0be4b2267df1fa55e60cfcc9dc49c17e10
36edf0be4b2267df1fae0e0defbb9ec47f6f76
46edf0be4b2267df1fa47be2887786891367e1

We want to output the list of all email senders whose code starts with the letters from C to E and the third character is ‘e’.

We will have to use the [ ] wildcard, so let’s write the code in SQL Server, since PostgreSQL doesn’t allow that wildcard with LIKE.

The query selects a distinct combination of the email sender, receiver and the sending day.

To output the list only for the specific senders, we need to use LIKE in WHERE. The pattern in LIKE first specifies the range of first letters, i.e., C-E. We want the sender code to start with one of those letters, so this range is followed by % wildcard. We also want that the third character is ‘e’, and the second character can be anything; we don’t care about it. That’s why we add the underscore wildcard to replace the second character, which should be then followed by the letter ‘e’.

SELECT DISTINCT from_user,
       to_user,
       day
FROM google_gmail_emails
WHERE from_user LIKE '[C-E]_e%';

Look at the output, and you’ll see only senders matching the pattern in LIKE. Actually, there’s only one such email sender.

All required columns and the first 5 rows of the solution are shown

from_userto_userday
e0e0defbb9ec47f6f7157e3e9278e32aba3e7
e0e0defbb9ec47f6f72813e59cf6c1ff698e4
e0e0defbb9ec47f6f732ded68d89443e8086
e0e0defbb9ec47f6f747be2887786891367e8
e0e0defbb9ec47f6f755e60cfcc9dc49c17e5

Performance Considerations

While using pattern matching with LIKE is necessary to achieve querying flexibility, it can have a detrimental influence on query performance.

This primarily applies to using LIKE with a leading wildcard – e.g., ‘%son’ – which hinders the use of indexes. They depend on starting characters for quickly locating data, which is negated by using a leading wildcard, i.e., you want the data to start with any character.

Mitigating Performance Issues

One approach is to use the full-text search. Here’s an example for PostgreSQL that looks in the column name of the table employees to return all the employees whose last name contains ‘son’.

SELECT * 
FROM employees 
WHERE to_tsvector(last_name) @@ to_tsquery('son');

The full-text indexes are optimized for searching within text data. The to_tsvector function converts the column last_name into a text search vector. The to_tsquery function converts the search term into a text search query.

The @@ operator checks if the text search vector on the left matches the text search query on the right.

Another approach is reverse indexing. This requires creating an additional column with the reversed data from the column you want to search and indexing it.

ALTER TABLE employees ADD COLUMN reversed_name TEXT;
UPDATE employees SET reversed_last_name = reverse(last_name);
CREATE INDEX idx_reversed_last_name ON employees(reversed_last_name);

Then you can show all the last names ending with ‘son’ like this.

SELECT * 
FROM employees
WHERE reversed_last_name LIKE reverse('son%');

The third approach is using trigram indexes. They are useful for approximate string matching and can improve the performance of wildcard searches.

First, you need to enable the pg_trgm extension, which provides functions and operators for determining the similarity of text based on trigram matching. Next, you create a trigram index.

CREATE EXTENSION pg_trgm;
CREATE INDEX idx_last_name_trgm ON employees USING gin (last_name gin_trgm_ops);

Then you can search for the pattern like this.

SELECT * 
FROM employees 
WHERE last_name LIKE '%son%';

Conclusion

The LIKE operator allows you to look for patterns in text data. Whether it’s case-sensitive or case-insensitive search, depends on the database you use. In PostgreSQL, which is case-sensitive, applying the case-insensitive search is easy: simply use the ILIKE operator instead of LIKE, with the syntax remaining the same.

Using wildcards adds to the search flexibility, increasing your search and data exploration efficiency and allowing you to handle incomplete data, clean it, and customize data retrieval.

Learning how to use LIKE and wildcards effectively requires a lot of practice in actual code writing. You’ll find plenty of examples in our coding questions section, where you can apply LIKE and wildcards in solving real-world problems.

You can find many useful articles in the blog section, such as this one, explaining different SQL query interview questions.

How Do You Perform SQL LIKE Queries for Pattern Matching


Become a data expert. Subscribe to our newsletter.