Have you ever been sent a file and asked to find important information buried within it? Your coworkers would be very impressed if you could query the files in a quick and efficient manner. But… how exactly are you going to achieve such a feat?
As you probably know, SQL allows you to modify database data quickly and easily. When trying to work with data files, developers usually load data into a database and manage it via SQL. In a perfect world, you could just query the database to get the information your company needs. But in real life, there’s a catch: data loading is often not straightforward.
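To make the detour concrete, here is a minimal sketch of that conventional workflow using only Python's standard library. The file contents and schema are hypothetical; the point is that you must create a table and load every row before a single SQL statement can run:

```python
import csv
import io
import sqlite3

# Simulate a small CSV file (in practice this would be open("data.csv")).
csv_text = "id,name,value\n1,Jack,5\n2,John,11\n"
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # skip the header row

# The usual detour: create a table and load every row before any SQL can run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, name TEXT, value INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", list(reader))

# Only now can we actually query the data.
print(conn.execute("SELECT COUNT(*) FROM data").fetchone()[0])  # 2
```

The tools below exist to hide exactly this boilerplate: you hand them a file and a query, and the loading happens behind the scenes.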
It would be extremely convenient if SQL queries could be run directly on files, skipping the database setup step. It turns out, other people have noticed this fact and have built tools to query your files directly with SQL. In this article, we'll look at a few tools and list some different SQL file query scenarios that they can handle. We'll also compare the tools and share some general considerations. Let’s begin.
TextQL
TextQL might be a suitable fit for you if you're looking for a simple tool to query .csv or .tsv files. TextQL allows you to execute SQL on structured text with ease, and it supports quote-escaped delimiters. When running a query, TextQL can automatically detect numeric and datetime data in appropriate formats, which makes your work easier. TextQL lets you list as many files and folders as you want: you load the files or directories you're working with by listing them at the end of the command.
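The automatic type detection mentioned above can be pictured as a per-value guess. The sketch below is a rough Python illustration of the idea, not TextQL's actual implementation:

```python
def guess_type(value: str) -> str:
    """Roughly classify a cell the way a type-detecting CSV loader might."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "real"
    except ValueError:
        return "text"

print(guess_type("1643004723"))  # integer
print(guess_type("-3.5"))        # real
print(guess_type("Jack"))        # text
```

A loader that runs a check like this over a sample of each column can assign the column a proper type, so comparisons like `value > 10` behave numerically instead of lexicographically.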
Installation
Let's look at a few ways to install TextQL. One of the simplest methods you can use to get started is to use Homebrew. Just run the command below:
brew install textql
If you prefer Docker, you can build a TextQL image from a clone of the project repository (the Dockerfile lives in the repo root):
docker build -t textql .
Lastly, if you're using Arch Linux, TextQL is available in the AUR, so installation with an AUR helper such as yaourt is also straightforward:
yaourt textql-git
Now that we've gotten TextQL set up and working, let's see what it can do.
Example
We're going to examine an example to understand how TextQL works. Assume we have the following data.
cat sample_data.csv
Id,name,value,timestamp
1,Jack,5,1643004723
2,John,11,1643114723
3,James,-3,1645596723
Now, excluding the header, let's count the number of data items we have.
textql -header -sql "select count(*) from sample_data" sample_data.csv
#output
3
You should see an output of "3," which is exactly what we'd expect. We can also use TextQL to find the maximum value of this data. We can do that by running the command below:
textql -header -sql "select max(value) from sample_data" sample_data.csv
#output
11
Again, we see the expected value of 11. Very nice!
q
You might consider using q to query your files if you are working with .csv or .tsv files and need something faster than TextQL. q aims to bring SQL's expressive power to the Linux command line by offering direct access to multi-file SQLite3 databases and simple access to text as actual data.
q uses the SQLite engine. It allows you to run SQL statements directly on tabular text data, and the data can be auto-cached to speed up subsequent queries on the same file. Using q, SQL statements can also be run directly on multi-file SQLite3 databases without requiring them to be merged or loaded into memory.
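The auto-caching idea boils down to: parse the file only the first time it is requested, then answer every later query from the already-loaded table. A toy Python illustration of that pattern follows; q's real cache is persisted to disk next to the source file, whereas here it is just an in-memory dict, and the table layout is made up for the example:

```python
import sqlite3

_cache = {}  # filename -> already-loaded SQLite connection


def load_once(name, rows):
    """Parse and load the data only the first time it is requested."""
    if name not in _cache:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE t (c1 TEXT, c2 INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
        _cache[name] = conn
    return _cache[name]


rows = [("Jack", 5), ("John", 11)]
conn = load_once("file.csv", rows)  # first call: parses and loads
conn = load_once("file.csv", rows)  # second call: served from cache
print(conn.execute("SELECT MAX(c2) FROM t").fetchone()[0])  # 11
```

Because parsing large text files usually dominates query time, skipping the reload on repeat queries is where most of q's speed advantage on warm data comes from.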
Installation
Like TextQL, it's easy to install q via Homebrew. Simply run the command below:
brew install harelba/q/q
q also has a standalone executable, which you can download from this link. If you want to use the Windows installer, you can run the executable and follow the prompts displayed on the screen.
You can also install q as an .rpm package. You can download the package from this link, then install it via the following commands:
rpm -ivh <package-filename>
Or
rpm -U <package-filename>
Example
To get an idea of how q works, let's run a query on a headerless file, where q names the columns c1, c2, ... cN:
q -d , "select c1,c5 from file.csv"
Now, we'll count DISTINCT values in a specific field. In this case, let's count how many distinct UUIDs the file contains with the following command:
q -H -d , "SELECT COUNT(DISTINCT uuid) FROM ./file.csv"
The output of this command will be a numeric value, equal to the number of UUIDs in the file.
OctoSQL
OctoSQL is primarily a command-line application that allows you to query a variety of databases and file types using SQL in a single interface, and even perform JOINs between them. OctoSQL is an extensible, fully featured dataflow engine that can also provide a SQL interface for your applications. It validates and optimizes queries based on the source types involved, can process massive volumes of data, and can return partial results before the full query completes. Out of the box, OctoSQL supports only .csv and JSON files; you'll need to install a plugin to query other file formats, such as Excel or Parquet files.
Installation
You can run the following command to install OctoSQL with Homebrew.
brew install cube2222/octosql/octosql
Installation with Go is also supported, as shown below:
go install github.com/cube2222/octosql@latest
As mentioned, OctoSQL has a robust plugin capability. You can install plugins using commands like the one shown below:
octosql plugin install postgres
Example
Let's have a look at a simple example of invoice data. Suppose we have a .csv file of invoice data in the format shown below:
cust_id,cust_name,value
121,Jack,599.00
122,John,1100.00
123,James,400.50
We can select everything in the file with this command:
octosql "SELECT * FROM ./invoice2.csv"
Suppose we want to calculate a sum of the invoice values. In that case, we could use the following command to find the answer:
#Sum
octosql "SELECT id, SUM(value) as value_sum FROM ./invoices2.csv GROUP BY id ORDER BY value sum DESC”
Similarly, we can count all the entries in a table, even one that lives in a database accessed through a plugin:
#Count
octosql "SELECT COUNT(*) FROM mydb.customer"
We can also perform more advanced operations like joins. The command below will get us what we want:
octosql "SELECT inv_id, value, email FROM ./invoices.csv JOIN mydb.customers ON customer_id = customer_id”
OctoSQL may be the ideal fit for you if you require a tool that can work with a variety of file formats. One drawback worth noting is that OctoSQL uses a custom engine instead of SQLite, which means it may lack some features SQLite users take for granted.
DSQ
DSQ is a command-line tool for running SQL queries against data files. It supports a wide range of file formats, including .csv, JSON, .tsv, Excel, Parquet, and .ods.
Installation
To install DSQ on macOS or Linux, set $VERSION to the release you want (see the project's GitHub releases page), then download the archive with the following command, unzip it, and put the dsq binary on your PATH:
$ curl -LO "https://github.com/multiprocessio/dsq/releases/download/$VERSION/dsq-$(uname -s | awk '{ print tolower($0) }')-x64-$VERSION.zip"
If you're using Windows, the workflow is slightly different: download the latest Windows release, unzip it, and add dsq to your PATH.
You can also install DSQ in Go using the following command:
$ go install github.com/multiprocessio/dsq@latest
Example
DSQ allows you to either stream data or provide a file name to work with. The example below shows how you can use either a .json or .ndjson format, depending on your requirements.
$ dsq testdata.json "SELECT * FROM {} WHERE x > 10"
#or
$ dsq testdata.ndjson "SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC"
You can also use DSQ to work with data from different origin types. The example below connects a .csv dataset and a .json dataset.
$ dsq testdata/join/users.csv testdata/join/ages.json \
"select {0}.name, {1}.age from {0} join {1} on {0}.id = {1}.id"
If you need a tool that can handle a wide range of file formats and uses SQLite, this is the tool for you.
Cases for SQL file queries
To see these tools in action, let's look at a few SQL file query examples. We'll be using .txt and .csv files for these examples, since almost all of the tools we discussed above support these formats.
Filtering via SQL
Filtering allows you to view only the information you wish to see. Filters are useful for displaying only the desired records from a form or report. Using a filter, you can limit the data in a view without changing the structure of the underlying object.
For example, let's start by selecting records from a text file that contains the information below.
Class | Name | English | History | Math |
---|---|---|---|---|
1 | James Kim | 78 | 65 | 67 |
1 | John White | 87 | 61 | 79 |
2 | Paige Davis | 77 | 82 | 94 |
3 | Edwin Henderson | 65 | 78 | 94 |
Suppose we want to choose the students in class 1 from the file above. The file's first row provides the column names, while the remaining rows hold the records we can query. We can run the following command to get the information we want:
$select * from E:/txt/students_file.txt where Class = 1
In SQL, you may sort data in ascending or descending order by one or more columns. To try this out, let's sort the table of student scores by class in ascending order and total score in descending order. We can do that with the following command:
$select * from E:/txt/students_file.txt order by Class, English+History+Math desc
You can combine results using aggregation by grouping records depending on their values. Grouping can also be used to calculate the sum of many values in a group, and grouped aggregates give a summary of a set of rows. For example, you can use the command below to find, for each class, the lowest English score, the highest History score, and the total Math score.
$select Class, min(English), max(History), sum(Math) from E:/txt/students_file.txt group by Class
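If you want to sanity-check a grouped aggregation like this without any of the tools installed, the same query can be reproduced with Python's built-in sqlite3 module. The snippet below mirrors the student table from this section:

```python
import sqlite3

# The student records from the table above.
students = [
    (1, "James Kim", 78, 65, 67),
    (1, "John White", 87, 61, 79),
    (2, "Paige Davis", 77, 82, 94),
    (3, "Edwin Henderson", 65, 78, 94),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(class INTEGER, name TEXT, english INTEGER, history INTEGER, math INTEGER)"
)
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", students)

# Lowest English score, highest History score, and total Math score per class.
for row in conn.execute(
    "SELECT class, MIN(english), MAX(history), SUM(math) "
    "FROM students GROUP BY class ORDER BY class"
):
    print(row)
# (1, 78, 65, 146)
# (2, 77, 82, 94)
# (3, 65, 78, 94)
```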
As you can see, the ability to perform SQL queries directly on files can help us answer important questions from our data.
Conclusion
In conclusion, let's use the chart below to take a quick look at all of the tools we mentioned before.
Name | Speed | Supported Files | Engine |
---|---|---|---|
q | Fast | .csv, .tsv | SQLite |
TextQL | Ok | .csv, .tsv | SQLite |
OctoSQL | Slow | .csv, .json, Excel, Parquet | Custom |
DSQ | Ok | .csv, .tsv, .json, Parquet, Excel, logs | SQLite |
Without having to go through a database, you can use the tools above to execute SQL on your files for instant insights. Each of these tools has different pros and cons, so be sure to choose the option that best suits your needs. Happy querying.
Top comments (1)
Great article. You should also consider SPyQL (github.com/dcmoura/spyql), which allows you to execute Python code inside SQL queries and does not load the full dataset into an in-memory database. (Disclaimer: I am the author of SPyQL.)