Cover Photo by Christina Morillo from Pexels.
Databases are the containers/repository of data. They hold the data in the form of a spreadsheet(not the same but for the sake of imagination). There are many databases viz. Relational and Non-Relational. Let us not dwell on the latter part but stick with the relational databases. Watch this video to know more about databases.
This post will focus primarily on writing SQL queries. SQL queries are common across all the databases but there are some flavors and they may vary according to different databases. There are various vendors like IBM, MySQL, PostgreSQL, Oracle, etc. Every vendor has its own style of SQL to communicate with its database. The one I would suggest is PostgreSQL which is open source and has got wider adoption.
The primary way to communicate with a relational database is through SQL. Structured Query Language (SQL) is a language to communicate with the database. Without further ado, let us jump into learning SQL.
Image Source: The Data Labs
A table looks like the above spreadsheet with rows and columns. Columns are the headers for the category of the values that we input. A column that is defined as a name will have only names in it and a column defined as a currency would certainly have numbers in the entire column. So, a column defines what kind of values it should comprise. Rows are the list of entries made in different columns. Rows define the number of entries in a spreadsheet.
Now, to get all the columns from the customer table, we use the keyword SELECT
and specify the column names that we are interested in. We can use the *
symbol to fetch all the columns from the table. To connect to a table, the keyword FROM
is used which comes after the SELECT
keyword.
SELECT * FROM customer;
To get specific columns from the customer table.
SELECT first_name, last_name FROM customer;
In spreadsheets, we would have used filters similar to that in SQL there are filters to query. For instance, if you want to look only for emails that have the first name as 'Sai', the keyword that should be used is WHERE
. WHERE
should be used after the FROM
command.
SELECT email FROM customer WHERE first_name='Sai';
To get all the emails of the customer whose first name is 'Sai' and the last name is 'Kumar', we use a keyword called AND
which is a logical operator similar to OR
, and NOT
. You can learn more about logical operators in SQL.
SELECT email FROM customer WHERE first_name='Sai' AND last_name='Kumar';
To get all the emails of the customer whose first name is 'Sai' or the last name is 'Kumar'.
SELECT email FROM customer WHERE first_name='Sai' OR last_name='Kumar';
To arrange the data in ascending or descending order use the ORDER BY
keyword followed by the column name based on which the entire table would be ordered. To arrange is ascending use ASC
and for descending DESC
. By default if you don't specify the order, it will sort the column in ascending order. ORDER BY
always comes at the tail end of the query.
To get the list of customer ids and arrange them in ascending order of their transaction date.
SELECT customer_id FROM customer ORDER BY transaction_date ASC;
To get the list of customer ids and arrange them in descending order of their transaction date.
SELECT customer_id FROM customer ORDER BY transaction_date DESC;
We can also limit the number of records displayed by using LIMIT
with a positive integer number to specify the limit.
To get the list of the first 10 customer ids.
SELECT customer_id FROM customer ORDER BY transaction_date ASC LIMIT 10;
COUNT
is a type of function that is used in columns and it should come after the SELECT
. COUNT
returns the number of data in the specified column. In general, passing *
as an argument to the COUNT
will also return the same result as the column name.
To get the count of some customers whose age is not more than 50.
SELECT COUNT(*) FROM customer WHERE age<=50;
The use of regular expressions is ubiquitous in all programming languages similarly we use regex in SQL to match the strings. LIKE
is used in conjunction with WHERE
and followed by the regex.
-
%String
- grabs all the words that end withString
-
String%
- grabs all the words that start withString
-
%String%
- grabs all the words that comprise theString
in between them Note:LIKE
is case sensitive and looks only for the string that is mentioned.ILIKE
is an alternative that is case insensitive and can match across various cases of the string.
To get all the names of the customer whose first letter in their first name starts with 'P'
SELECT first_name FROM customer WHERE first_name LIKE 'P%';
To get all the customers whose last name has 'siva' in between and should be case insensitive.
SELECT * FROM customer WHERE last_name ILIKE '%siva%';
At times the entries in the table might have duplicate data. To filter out unique data, DISTINCT
is used. DISTINCT
applies only to the column and comes next to SELECT
and even as an argument to the COUNT
function.
To get the count of distinct areas where the customers are from.
SELECT COUNT(DISTINCT area) FROM customer;
To get distinct areas where the customers are from.
SELECT DISTINCT area FROM customer;
IN
is used to specify the list of items against which a column is compared. For example, we can check if a column consists of (Orange, Mango, Banana) and the function filters and display only the entries with orange, mango, and banana.
To get the count of customers who are located in the area with Pincode '82' between the age group 60 to 70.
SELECT COUNT(*) FROM customer WHERE pincode IN(82) and age BETWEEN 60 AND 70;
AGGREGATION
Aggregation means a cluster of things that are brought together. Likewise aggregation in SQL clusters the data into a homogeneous category and groups it. For example,
To get the total number of transactions made by all customers.
SELECT customer_id, COUNT(transaction_date) FROM customer GROUP BY customer_id;
Here COUNT
is the aggregation function that has been performed on the transaction_date column concerning the customer_id. So a group is formed for every customer_id where you will get a matching count of the number of transactions. A normal COUNT
would have resulted in the total number of transactions whereas in aggregation with customer_id by using GROUP BY
, it produces a table with individual customer ids and their corresponding count of transactions.
GROUP BY
should always have the column names that are mentioned in the SELECT
criteria except for the aggregation function.
To get an average amount that a customer has spent on purchases.
SELECT customer_id, AVG(amount) from customer GROUP BY customer_id;
To get a rounded value of the average calculated.
SELECT customer_id, round(AVG(amount),2) from customer GROUP BY customer_id;
To arrange the SUM(amount) in descending order and display only 10 rows.
SELECT customer_id, SUM(amount) FROM customer GROUP BY customer_id ORDER BY SUM(amount) DESC LIMIT 10;
It is not possible to use the WHERE
condition on the aggregate function and that is where we use HAVING
after the GROUP BY
and specify the condition.
To display the customers who have spent more than 100 INR.
SELECT customer_id, COUNT(amount) FROM customer GROUP BY customer_id HAVING COUNT(amount)>=100;
To display the customers who were attended by the staff with id '2' and their total spending is more than 1000 INR.
SELECT customer_id, staff_id, SUM(amount) FROM customer WHERE staff_id=2 GROUP BY customer_id, staff_id HAVING SUM(amount)>1000;
JOINS
JOINS
play an important role while analyzing data by combining multiple tables. There are four types of JOINS
- INNER JOIN
- LEFT OUTER JOIN
- RIGHT OUTER JOIN
- FULL OUTER JOIN
INNER JOIN
- Compares between two tables on the specified columns and returns the data that matches in both the columns.
LEFT OUTER JOIN
- Compares between two tables on the specified columns and returns the data that matches in the left table. If no value is present in the right table matching the left, then it returns the data as null
.
RIGHT OUTER JOIN
- Compares between two tables on the specified columns and returns the data that matches in the right table. If no value is present in the left table matching the right, then it returns the data as null
.
FULL OUTER JOIN
- Compares between two tables on the specified columns and returns the data that matches in both the columns along with mismatches with a value as null
.
You can read more about JOINS here
INNER JOIN
to merge two tables according to matching values of customer_id.
SELECT payment_id, customer.customer_id, first_name, last_name FROM customer INNER JOIN payment ON customer.customer_id = payment.customer_id;
FULL OUTER JOIN
to merge two tables according to matching values of customer_id.
SELECT payment_id, customer.customer_id, first_name, last_name FROM customer FULL OUTER JOIN payment ON customer.customer_id = payment.customer_id WHERE payment.customer_id IS null OR customer.customer_id IS null;
LEFT OUTER JOIN
to merge two tables according to matching values of customer_id based on the left table.
SELECT payment_id, customer.customer_id FROM payment LEFT OUTER JOIN customer on payment.customer_id = customer.customer_id WHERE payment.customer_id IS NULL;
RIGHT OUTER JOIN
to merge two tables according to matching values of customer_id based on the right table.
SELECT payment.customer_id,customer_id FROM customer RIGHT OUTER JOIN payment on customer.customer_id = payment.customer_id WHERE payment.customer_id IS NULL;
You can also use nested joins to query from multiple tables.
SELECT customer_id, first_name, last_name FROM customer INNER JOIN payment ON customer.customer_id = payment.customer_id INNER JOIN billing ON payment.customer_id = billing.customer_id WHERE first_name='Sai' AND last_name='Kumar';
Other SQL Commands
To display the timezone of your current location.
SHOW TIMEZONE;
To get today's timestamp.
SELECT NOW();
To get the current time of the day.
SELECT TIMEOFDAY();
To extract month, day, year from the timestamp use the function EXTRACT
which will return a numeric month, date, or year.
SELECT EXTRACT(MONTH FROM transaction_date) FROM customer;
To convert a timestamp into a specific string format like 'DD/MM/YYYY', the TO_CHAR function is used where you specify the timestamp column name as the first argument and followed by the format to be returned.
SELECT DISTINCT (TO_CHAR(payment_date,'DD/MM/YYYY')) FROM payment;
A sub query returns either a single data or a list of data which can be asserted with its parent query.
-- SUB QUERY
SELECT first_name || ' ' || last_name AS Full_Name FROM customer WHERE address_id IN (SELECT address_id FROM address WHERE district = 'Chennai');
Similar to conditionals in programming languages, SQL has a CASE
statement that runs a check against the selected column and performs specified operations on it.
SELECT
SUM(CASE citizen
WHEN 'INDIAN' THEN 1
ELSE 0
END)as IN,
SUM(CASE rating
WHEN 'NRI' THEN 1
ELSE 0
END)as NRI
FROM customer;
That will be all and there are few more concepts like views, procedures etc. which is much more advanced and I will cover those is my future posts.
Top comments (2)
can you expand on joins? the sample and explanation sounds confusing. consider the snippet from your example,
customer RIGHT OUTER JOIN payment on customer.customer_id = payment.customer_id WHERE payment.customer_id IS NULL;
why such addition in
where
clause? we are joining using customer_id, then why should we haveIS NULL
In
RIGHT OUTER JOIN
, all rows from the right table are included, unmatched rows from the left are replaced withNULL
values. So to get those unmatched data explicitly,WHERE
keyword with a conditionIS NULL
is used. As you said you can simply eliminate the WHERE condition and see for yourself for a better understanding.