Practicing String Manipulation in SQL

Last Updated: February 17, 2023

Categories:

Written by:
Nathan Rosidi

A detailed walkthrough of the solution for a Google interview question to practice SQL String Manipulation.

With the wealth of data being captured by companies, not all of them will be structured and numerical. So today, our focus is to hone your skill in manipulating strings in SQL by introducing several advanced functions.

Interview Question Example to Practice SQL String Manipulation

Let’s dive into an example question from an interview at Google to practice SQL string manipulation. The question is entitled ‘File Contents Shuffle’. It asks:

File Contents Shuffle

Last Updated: February 2019

HardID 9818

Sort the words alphabetically in 'final.txt' and make a new file named 'wacky.txt'. Output the file contents in one column and the filename 'wacky.txt' in another column. Lowercase all the words. To simplify the question, there is no need to remove the punctuation marks.

If coding in python, the file contents should be contained in a list.

Go to the Question

Link to the question: https://platform.stratascratch.com/coding/9818-file-contents-shuffle

Video Solution:

To understand the question a bit better, let’s have a look at the dataset we’re working with.

1. Exploring the Dataset

Table: google_file_store

filename	contents
draft1.txt	The stock exchange predicts a bull market which would make many investors happy.
draft2.txt	The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market.
final.txt	The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices.

Loading Dataset

The table google_file_store provides a list of text files with the filename as one column and its contents on the other. Both columns are string data.

The question asks us specifically to look at the record where the filename is final.txt’. Notice that there are punctuation marks and duplication of some words like ‘the’, ‘and’, and ‘a’.

When dealing with strings, always remember that data may not be ‘clean’. Watch out for punctuation marks, numbers, a mix of upper and lower cases, double spaces, and duplication of words. State how you’d like to deal with these scenarios or clarify this with your interviewer. For today, we will ignore them first.

The contents of the ‘final.txt’ need to be sorted alphabetically, returned in lowercase with a new filename called ‘wacky.txt’.

2. Writing Out the Approach

Once you’ve fully understood the requirements of the question, formulate a plan of how you’ll build the solution. Oftentimes, you already have an idea of what this is but I strongly suggest writing this out step-by-step. This forces you to identify any gaps in your thinking or errors that you may have missed otherwise.

From the instructions alone, you could easily write out these steps:

Filter the table where the filename is ‘final.txt’
Sort its contents alphabetically
Convert the words into lowercase
Return the contents with ‘wacky.txt’ as the filename column

While this sounds simple at the start, there are several important steps missed. To avoid this, I would also encourage you to think about the input and output at each step.

For example, the output of Step 1 is:

SELECT * FROM google_file_store
WHERE filename = 'final.txt'

filename	contents
wacky.txt	a a all always an analysts and and and are as awaiting bear best bull but exchange fact follow future game happy, in instincts investors investors is make many market market market. much of of optimism possibility practices. predicting predicts should stock that the the their too uncertain warn we which would

The contents are in string format and are encoded in one row only so we cannot immediately sort out the words alphabetically. If, instead, each word has its own row, we can do the usual sort through the ORDER BY() clause.

So we need to prepare the data first so we can manipulate them more easily later on. Let’s call this the data preparation step with the aim of having a string convert to the column of words. This is how we will do it:

Data preparation:

Convert the string into an array by splitting the text using a space to identify the individual words
Explode the array column-wise so that each element in the array becomes its own row

This will allow us to proceed to Step 2 where we can sort the new column alphabetically and turn it into lower case.

Then, we would like to return the result as a string like this:

filename	contents
wacky.txt	a a all always an analysts and and and are as awaiting bear best bull but exchange fact follow future game happy, in instincts investors investors is make many market market market. much of of optimism possibility practices. predicting predicts should stock that the the their too uncertain warn we which would

And we cannot do that directly with the current format so another data transformation is required. This time, it has to be the reverse of Step 2 where the aim is to summarize the contents of a column into an array and stitch these elements together in a string format.

Data reformatting:

Aggregate column into an array
Combine elements of the array using a space, returning this as a string

Therefore, our full approach follows:

Filter the table where the filename is ‘final.txt’
Data preparation:
1. Convert the string into an array using space as a delimiter
2. Explode the array column-wise
Sort its contents alphabetically
Convert the words into lowercase
Data reformatting:
1. Aggregate column into an array
2. Combine elements of the array using a space, returning this as a string
Return the contents with ‘wacky.txt’ as the filename column

Don’t you feel more confident about tackling the question now that you have the steps written out? This will also provide you a good reference point if you ever feel stuck in the interview.

3. Coding the solution

Let’s code up the query.

1). Filter the table

First, let’s only look at the file ‘final.txt’. We can do this through using an equality condition in the WHERE() clause since we know the exact filename we are looking for.

SELECT * FROM google_file_store
WHERE filename = 'final.txt'

However, if we only knew it started with ‘final’, we can use the LIKE() or ILIKE() function. These two functions are used to match strings based on a given pattern. The only difference is that LIKE() is case sensitive and ILIKE() is not.

Here, we can use the ILIKE()function with the wildcard operator, %, representing zero or more characters. This allows us to retrieve the records where the filename starts with ‘final’.

SELECT * FROM google_file_store
WHERE filename ILIKE 'final%'

Missing or invalid data

Tables: google_file_store

2). Data preparation

Next, let’s prepare the data for manipulation. We will use the STRING_TO_ARRAY() function which takes in a string and converts this to an array (or a list). The elements in this array are based on the delimiter we specify. So if we use a space as a delimiter, it creates an individual element whenever it sees a space. Essentially, it will break up our text into words like this:

SELECT STRING_TO_ARRAY(contents, ' ') AS word
FROM google_file_store
WHERE filename ILIKE 'final%'

word
The
stock
exchange
predicts
a

As you can see, arrays provide a lot of information at one go but we cannot access or analyze its contents easily so a common manipulation done on arrays is ‘exploding’ them. We can do this with the UNNEST() function, which will take an array as an input and output a column where each array element becomes accessible as a separate row. Imagine this as a row-to-column transformation.

SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word
FROM google_file_store
WHERE filename ILIKE 'final%'

word
The
stock
exchange
predicts
a

3). Sort the contents alphabetically

Having transformed our data earlier makes the sorting straightforward using the ORDER BY() function.

SELECT UNNEST (STRING_TO_ARRAY(CONTENTS, ' ')) AS word
FROM google_file_store
WHERE filename ILIKE 'final%'
ORDER BY word

word
a
a
all
always
an

4). Convert the words into lowercase using LOWER()

SELECT LOWER(word) AS CONTENTS
FROM
  (SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word
      FROM google_file_store
      WHERE filename ILIKE 'final%' 
   ORDER BY word) base

contents
a
a
all
always
an

5). Data reformatting

Finally, to return the contents in a string format, we’ll do the reverse of the steps earlier.

First, we will aggregate the rows of the contents column into an array using the ARRAY_AGG()function. ARRAY_AGG() is an aggregate function so like your SUM() and AVG(), it will take a column and output a single row summarizing the set of values. But here, instead of performing a calculation, it will return an array listing all the values of the column.

SELECT ARRAY_AGG(LOWER(word)) AS contents
FROM
  (SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word
      FROM google_file_store
      WHERE filename ILIKE 'final%' 
   ORDER BY word) base

contents
["a","a","all","always","an","analysts","and","and","and","are","as","awaiting","bear","best","bull","but","exchange","fact","follow","future","game","happy,","in","instincts","investors","investors","is","make","many","market","market","market.","much","of","of","optimism","possibility","practices.","predicting","predicts","should","stock","that","the","the","their","too","uncertain","warn","we","which","would"]

["a","a","all","always","an","analysts","and","and","and","are","as","awaiting","bear","best","bull","but","exchange","fact","follow","future","game","happy,","in","instincts","investors","investors","is","make","many","market","market","market.","much","of","of","optimism","possibility","practices.","predicting","predicts","should","stock","that","the","the","their","too","uncertain","warn","we","which","would"]

Then, we can return this as a text by combining these individual words. The ARRAY_TO_STRING() takes in an array, combines the individual elements using a specified delimiter like a space, and returns the output as a string.

In the same query, we’ll hardcode the filename as ‘wacky.txt’ so our final solution looks like:

SELECT 'wacky.txt' AS filename,
       ARRAY_TO_STRING(ARRAY_AGG(LOWER(word)), ' ') AS contents
FROM
  (SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word
      FROM google_file_store
      WHERE filename ILIKE 'final%' 
   ORDER BY word) base

Go to the question on the platformTables: google_file_store

Bonus

For more advanced users of SQL, you may be familiar with the REGEX_SPLIT_TO_TABLE() function which gives the same output as the UNNEST(STRING_TO_ARRAY()) combination we used earlier.

REGEX_SPLIT_TO_TABLE() will take in a string, separate these by a delimiter and return a table with each element in a separate row.

This is helpful for more complex manipulations where the use of regex is required. In this example, however, the delimiter is simply a space so the code is:

SELECT regexp_split_to_table(contents, ' ') AS word 
FROM google_file_store 
WHERE filename ILIKE 'final%'

word
The
stock
exchange
predicts
a

And this gives us the same result as we had in Step 2!

Conclusion

This was an interesting example to level up your SQL string manipulation skills and I hope you learned something new today.

If you ever find yourself stuck doing SQL string manipulation, remember you can transform the data in another format first if that makes the next steps easier. Converting strings to arrays is now one of the tricks up your sleeve to impress your interviewer.

Practice more SQL interview questions and test your new skills on our coding platform where you can look specifically for string-related questions.