Why SQL is THE Language to Learn for Data Science
Outdated, non-cool, and basic – is that what you think of it? SQL is none of that! Here’s why you can’t avoid it in data science, and it’s here to stay.
Which one will it be – R, Python, Rust, Julia? Are you always on the hunt for the latest, best, sexiest programming language for data science?
Along the way, you’re overlooking an old, trusted, and humble friend – SQL.
Why Do We Need SQL?
If you’re in data science, and you are, you are working with relational databases. And if you’re working with relational databases, you’re working with SQL.
SQL is the only way to query and manage relational databases well. While all other languages have their place in data science, SQL is a foundational language that no data scientist should ignore.
Despite all the fancy new programming languages, SQL is still one of the two most popular languages among data scientists.
But aren’t relational databases outdated?
Sure, there is a NoSQL movement. (Or there was.) NoSQL database management systems have their purpose in storing semi-structured and unstructured data. I’m talking here about tools like Cassandra, MongoDB, Cloudant, Redis, Amazon DynamoDB, or Neo4j, to name a few.
This is exactly the problem – too many tools with no standard NoSQL query language.
Surely Python can do more things than SQL? Yes, you can use it to build models, which SQL can’t do. But you can use SAS or R to do the same thing. So what? That is not SQL’s purpose.
But you can also query and clean data with Python, right? You can if you must, but SQL is better for that. The reason is simple – SQL is specifically designed for communicating with and querying databases.
Data Cleaning With SQL
Data cleaning is certainly a big part of a data scientist's job. SQL allows you to filter, sort, and aggregate data efficiently and on the database itself. This means one change to the database will impact everyone, which is usually a good thing.
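To make filtering, sorting, and aggregating on the database itself concrete, here is a minimal sketch. It assumes a hypothetical `orders` table with made-up rows, and it uses Python's built-in `sqlite3` module only as a convenient way to run the SQL. Any relational database would accept an equivalent query.

```python
import sqlite3

# Hypothetical sample data; the table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.0), ("alice", 35.0), ("carol", 12.5), ("bob", None)],
)

# Filter (WHERE), aggregate (GROUP BY + SUM), and sort (ORDER BY) in one statement.
# The database engine does the work; Python only receives the finished result.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount IS NOT NULL
    GROUP BY customer
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
print(totals)  # [('alice', 55.0), ('carol', 12.5), ('bob', 5.0)]
```

The row with the missing amount is dropped by the `WHERE` clause before anything is summed, so the cleaning step and the aggregation live in the same query.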
Yes, you can clean and process data with other languages. Python is also popular for data cleaning and processing tasks. Some data scientists use it on databases in an attempt to avoid SQL.
But that’s only making your life complicated. Why?
SQL offers unique advantages for certain aspects of data cleaning and processing. In particular, it usually takes less code to clean data.
For example, consider Python code that uses a for loop to manipulate data. Compared to similar data manipulation tasks in SQL, using Python might require writing more lines of code, dealing with loops and conditions, and installing external libraries.
After you have manipulated the data, you need to apply the change to the database itself. To do that with Python, you'll need to connect to the database using Python and still write some SQL code to make that change.
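The snippet this passage originally pointed to is not reproduced here, so what follows is only a rough sketch of the kind of contrast meant. It assumes hypothetical `(city, price)` rows and a hypothetical `listings` table, and it performs the same cleanup twice: once with a Python loop and once with a single SQL query.

```python
import sqlite3

# Hypothetical raw rows of (city, price); names and values are illustrative only.
rows = [(" Paris", 10.0), ("paris ", 15.0), ("ROME", None), ("Rome", 8.0)]

# Plain-Python version: an explicit loop, a condition, and manual bookkeeping.
totals_py = {}
for city, price in rows:
    if price is None:
        continue  # skip rows with a missing price
    key = city.strip().lower()  # normalize whitespace and letter case
    totals_py[key] = totals_py.get(key, 0.0) + price

# SQL version: the same cleaning and aggregation as one declarative query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (city TEXT, price REAL)")
conn.executemany("INSERT INTO listings VALUES (?, ?)", rows)
totals_sql = dict(
    conn.execute(
        """
        SELECT LOWER(TRIM(city)) AS city, SUM(price) AS total
        FROM listings
        WHERE price IS NOT NULL
        GROUP BY LOWER(TRIM(city))
        """
    ).fetchall()
)

print(totals_py == totals_sql)  # True
```

Both versions give the same totals. The difference is that the SQL version states what result is wanted and leaves the looping to the database.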
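A small sketch of that round trip, assuming a hypothetical `users` table and using an in-memory SQLite database as a stand-in for a real connection: the cleaning happens in Python, yet reading the rows and saving the result both still go through SQL statements.

```python
import sqlite3

# In-memory SQLite as a stand-in for a real database; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(" Ann@Example.com ",), ("bob@example.com",)],
)

# Clean the values in Python (reading them already requires a SELECT)...
fetched = conn.execute("SELECT id, email FROM users").fetchall()
cleaned = [(email.strip().lower(), user_id) for user_id, email in fetched]

# ...and persisting the result still takes an SQL UPDATE.
with conn:  # commits on success, rolls back on error
    conn.executemany("UPDATE users SET email = ? WHERE id = ?", cleaned)

# For comparison, the same change written purely in SQL would be one statement:
#   UPDATE users SET email = LOWER(TRIM(email));

emails = [row[0] for row in conn.execute("SELECT email FROM users ORDER BY id")]
print(emails)  # ['ann@example.com', 'bob@example.com']
```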
So, why not just exclusively use SQL?
SQL Integration With Other Data Science Languages
Knowing only SQL won’t get you far in data science. But SQL integrates well with other popular data science languages like R, Python, Julia, Rust, or SAS.
That way, you get other languages’ benefits for analysis, data visualization, and machine learning while still retaining SQL's strength for data manipulation.
This possibility is especially powerful when you think about the whole of the data science process.
You can use SQL to pre-process and clean data directly within databases. Then, you can move on to Python, R, Julia, or Rust to perform more advanced data transformations or feature engineering to build machine learning models, like in this DoorDash project.
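The hand-off from SQL to another language might look roughly like the sketch below. It is not taken from the project mentioned above. The `deliveries` table, its columns, and the z-score feature are all assumptions chosen for illustration, and only Python's standard library is used.

```python
import sqlite3
import statistics

# Hypothetical delivery data; table, columns, and values are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (courier TEXT, minutes REAL)")
conn.executemany(
    "INSERT INTO deliveries VALUES (?, ?)",
    [("a", 30.0), ("a", 50.0), ("b", 20.0), ("b", None), ("c", 40.0), ("c", 60.0)],
)

# Step 1 (SQL): drop missing values and aggregate inside the database.
per_courier = conn.execute(
    """
    SELECT courier, AVG(minutes) AS avg_minutes, COUNT(minutes) AS n
    FROM deliveries
    WHERE minutes IS NOT NULL
    GROUP BY courier
    ORDER BY courier
    """
).fetchall()

# Step 2 (Python): turn the aggregates into a model-ready feature,
# here a z-score of each courier's average delivery time.
averages = [avg for _, avg, _ in per_courier]
mean = statistics.mean(averages)
spread = statistics.pstdev(averages)  # population standard deviation
features = {courier: (avg - mean) / spread for courier, avg, _ in per_courier}

print(per_courier)  # [('a', 40.0, 2), ('b', 20.0, 1), ('c', 50.0, 2)]
```

The database hands over three tidy rows instead of six raw ones, and Python takes over from there for whatever modeling comes next.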
You can even use Tableau to connect to your database and create visualizations.
SQL works with anything and everything!
SQL in Data Science Interviews
There’s a reason a big portion of the coding questions in data science interviews is about SQL.
Almost every data science job requires at least a basic knowledge of SQL. Just a quick look at Glassdoor shows that it’s easier to count data science jobs that don’t require SQL.
Most of the job ads have requirements similar to this one.
Conclusion
The really cool thing about SQL is that it’s the ultimate transferable tool. One job may prefer Python. Another job or a startup might require Rust or R due to personal preference or legacy infrastructure.
But no matter where you go or what you do as a data scientist, it's SQL or bust. It’s not surprising! After all, data science revolves around grabbing and analyzing data stored in databases. And as long as we rely on databases, SQL will remain the most essential language for data science.