Practicing String Manipulation in SQL
Categories
A detailed walkthrough of the solution for a Google interview question to practice SQL String Manipulation.
With the wealth of data being captured by companies, not all of them will be structured and numerical. So today, our focus is to hone your skill in manipulating strings in SQL by introducing several advanced functions.
Interview Question Example to Practice SQL String Manipulation
Let’s dive into an example question from an interview at Google to practice SQL string manipulation. The question is entitled ‘File Contents Shuffle’. It asks:
Interview Question Date: February 2019
Sort the words alphabetically in 'final.txt' and make a new file named 'wacky.txt'. Output the file contents in one column and the filename 'wacky.txt' in another column. Lowercase all the words. To simplify the question, there is no need to remove the punctuation marks.
- If coding in python, the file contents should be contained in a list.
Link to the question: https://platform.stratascratch.com/coding/9818-file-contents-shuffle
Video Solution:
To understand the question a bit better, let’s have a look at the dataset we’re working with.
1. Exploring the Dataset
filename | contents |
---|---|
draft1.txt | The stock exchange predicts a bull market which would make many investors happy. |
draft2.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. |
final.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices. |
The table google_file_store provides a list of text files with the filename as one column and its contents on the other. Both columns are string data.
The question asks us specifically to look at the record where the filename is final.txt’. Notice that there are punctuation marks and duplication of some words like ‘the’, ‘and’, and ‘a’.
When dealing with strings, always remember that data may not be ‘clean’. Watch out for punctuation marks, numbers, a mix of upper and lower cases, double spaces, and duplication of words. State how you’d like to deal with these scenarios or clarify this with your interviewer. For today, we will ignore them first.
The contents of the ‘final.txt’ need to be sorted alphabetically, returned in lowercase with a new filename called ‘wacky.txt’.
2. Writing Out the Approach
Once you’ve fully understood the requirements of the question, formulate a plan of how you’ll build the solution. Oftentimes, you already have an idea of what this is but I strongly suggest writing this out step-by-step. This forces you to identify any gaps in your thinking or errors that you may have missed otherwise.
From the instructions alone, you could easily write out these steps:
- Filter the table where the filename is ‘final.txt’
- Sort its contents alphabetically
- Convert the words into lowercase
- Return the contents with ‘wacky.txt’ as the filename column
While this sounds simple at the start, there are several important steps missed. To avoid this, I would also encourage you to think about the input and output at each step.
For example, the output of Step 1 is:
SELECT * FROM google_file_store
WHERE filename = 'final.txt'
filename | contents |
---|---|
wacky.txt | a a all always an analysts and and and are as awaiting bear best bull but exchange fact follow future game happy, in instincts investors investors is make many market market market. much of of optimism possibility practices. predicting predicts should stock that the the their too uncertain warn we which would |
The contents are in string format and are encoded in one row only so we cannot immediately sort out the words alphabetically. If, instead, each word has its own row, we can do the usual sort through the ORDER BY() clause.
So we need to prepare the data first so we can manipulate them more easily later on. Let’s call this the data preparation step with the aim of having a string convert to the column of words. This is how we will do it:
Data preparation:
- Convert the string into an array by splitting the text using a space to identify the individual words
- Explode the array column-wise so that each element in the array becomes its own row
This will allow us to proceed to Step 2 where we can sort the new column alphabetically and turn it into lower case.
Then, we would like to return the result as a string like this:
filename | contents |
---|---|
wacky.txt | a a all always an analysts and and and are as awaiting bear best bull but exchange fact follow future game happy, in instincts investors investors is make many market market market. much of of optimism possibility practices. predicting predicts should stock that the the their too uncertain warn we which would |
And we cannot do that directly with the current format so another data transformation is required. This time, it has to be the reverse of Step 2 where the aim is to summarize the contents of a column into an array and stitch these elements together in a string format.
Data reformatting:
- Aggregate column into an array
- Combine elements of the array using a space, returning this as a string
Therefore, our full approach follows:
- Filter the table where the filename is ‘final.txt’
- Data preparation:
- Convert the string into an array using space as a delimiter
- Explode the array column-wise
- Sort its contents alphabetically
- Convert the words into lowercase
- Data reformatting:
- Aggregate column into an array
- Combine elements of the array using a space, returning this as a string
- Return the contents with ‘wacky.txt’ as the filename column
Don’t you feel more confident about tackling the question now that you have the steps written out? This will also provide you a good reference point if you ever feel stuck in the interview.
3. Coding the solution
Let’s code up the query.
1). Filter the table
First, let’s only look at the file ‘final.txt’. We can do this through using an equality condition in the WHERE() clause since we know the exact filename we are looking for.
SELECT * FROM google_file_store
WHERE filename = 'final.txt'
However, if we only knew it started with ‘final’, we can use the LIKE() or ILIKE() function. These two functions are used to match strings based on a given pattern. The only difference is that LIKE() is case sensitive and ILIKE() is not.
Here, we can use the ILIKE()function with the wildcard operator, %, representing zero or more characters. This allows us to retrieve the records where the filename starts with ‘final’.
SELECT * FROM google_file_store
WHERE filename ILIKE 'final%'
filename | contents |
---|---|
final.txt | The stock exchange predicts a bull market which would make many investors happy, but analysts warn of possibility of too much optimism and that in fact we are awaiting a bear market. As always predicting the future market is an uncertain game and all investors should follow their instincts and best practices. |
2). Data preparation
Next, let’s prepare the data for manipulation. We will use the STRING_TO_ARRAY() function which takes in a string and converts this to an array (or a list). The elements in this array are based on the delimiter we specify. So if we use a space as a delimiter, it creates an individual element whenever it sees a space. Essentially, it will break up our text into words like this:
SELECT STRING_TO_ARRAY(contents, ' ') AS word
FROM google_file_store
WHERE filename ILIKE 'final%'
word |
---|
["The","stock","exchange","predicts","a","bull","market","which","would","make","many","investors","happy,","but","analysts","warn","of","possibility","of","too","much","optimism","and","that","in","fact","we","are","awaiting","a","bear","market.","As","always","predicting","the","future","market","is","an","uncertain","game","and","all","investors","should","follow","their","instincts","and","best","practices."] |
As you can see, arrays provide a lot of information at one go but we cannot access or analyze its contents easily so a common manipulation done on arrays is ‘exploding’ them. We can do this with the UNNEST() function, which will take an array as an input and output a column where each array element becomes accessible as a separate row. Imagine this as a row-to-column transformation.
SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word
FROM google_file_store
WHERE filename ILIKE 'final%'
word |
---|
The |
stock |
exchange |
predicts |
a |
3). Sort the contents alphabetically
Having transformed our data earlier makes the sorting straightforward using the ORDER BY() function.
SELECT UNNEST (STRING_TO_ARRAY(CONTENTS, ' ')) AS word
FROM google_file_store
WHERE filename ILIKE 'final%'
ORDER BY word
word |
---|
a |
a |
all |
always |
an |
4). Convert the words into lowercase using LOWER()
SELECT LOWER(word) AS CONTENTS
FROM
(SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word
FROM google_file_store
WHERE filename ILIKE 'final%'
ORDER BY word) base
contents |
---|
a |
a |
all |
always |
an |
5). Data reformatting
Finally, to return the contents in a string format, we’ll do the reverse of the steps earlier.
First, we will aggregate the rows of the contents column into an array using the ARRAY_AGG()function. ARRAY_AGG() is an aggregate function so like your SUM() and AVG(), it will take a column and output a single row summarizing the set of values. But here, instead of performing a calculation, it will return an array listing all the values of the column.
SELECT ARRAY_AGG(LOWER(word)) AS contents
FROM
(SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word
FROM google_file_store
WHERE filename ILIKE 'final%'
ORDER BY word) base
contents |
---|
["a","a","all","always","an","analysts","and","and","and","are","as","awaiting","bear","best","bull","but","exchange","fact","follow","future","game","happy,","in","instincts","investors","investors","is","make","many","market","market","market.","much","of","of","optimism","possibility","practices.","predicting","predicts","should","stock","that","the","the","their","too","uncertain","warn","we","which","would"] |
Then, we can return this as a text by combining these individual words. The ARRAY_TO_STRING() takes in an array, combines the individual elements using a specified delimiter like a space, and returns the output as a string.
In the same query, we’ll hardcode the filename as ‘wacky.txt’ so our final solution looks like:
SELECT 'wacky.txt' AS filename,
ARRAY_TO_STRING(ARRAY_AGG(LOWER(word)), ' ') AS contents
FROM
(SELECT UNNEST (STRING_TO_ARRAY(contents, ' ')) AS word
FROM google_file_store
WHERE filename ILIKE 'final%'
ORDER BY word) base
Bonus
For more advanced users of SQL, you may be familiar with the REGEX_SPLIT_TO_TABLE() function which gives the same output as the UNNEST(STRING_TO_ARRAY()) combination we used earlier.
REGEX_SPLIT_TO_TABLE() will take in a string, separate these by a delimiter and return a table with each element in a separate row.
This is helpful for more complex manipulations where the use of regex is required. In this example, however, the delimiter is simply a space so the code is:
SELECT regexp_split_to_table(contents, ' ') AS word
FROM google_file_store
WHERE filename ILIKE 'final%'
word |
---|
The |
stock |
exchange |
predicts |
a |
And this gives us the same result as we had in Step 2!
Conclusion
This was an interesting example to level up your SQL string manipulation skills and I hope you learned something new today.
If you ever find yourself stuck doing SQL string manipulation, remember you can transform the data in another format first if that makes the next steps easier. Converting strings to arrays is now one of the tricks up your sleeve to impress your interviewer.
Practice more SQL interview questions and test your new skills on our coding platform where you can look specifically for string-related questions.