DEV Community

Cover image for SQL JOIN explained
Leandro Proença
Leandro Proença

Posted on • Updated on

SQL JOIN explained

In this post we're going to see how the SQL JOIN works, guided by a practical example while covering SQL and Set theory basics.

Even though you're not familiar with SQL, you can follow along as this post covers the very basics.


A little disclaimer

As you may already know, every time I want to explain something, it's my modus operandi to take a step back and deep dive into the fundamentals.

But if you are comfortable with set theory and SQL basics such as CREATE TABLE, INSERT INTO, generate_series, SELECT and FROM, you may want to jump to the "JOIN to the world" section.

So fasten your seat belts and let's go into the journey of understanding SQL JOIN.


First things first

The examples described here were tested using PostgreSQL 15.

Despite SQL being a standard DSL that should work on any RDBMS, please review the examples before running on other RDBMS but PostgreSQL.

Without further ado, let's start the journey to understanding SQL JOINs.


🗝️ Back to the basics

In SQL, tables are related to set theory in mathematics. Usually, a set is a collection of different things, for instance:

a set of different things

Expressing the set above as a tuple:

(Ana, Star Wars, Star Trek, Kate, John)
Enter fullscreen mode Exit fullscreen mode

However, we can also represent users and movies as two distinct sets:

two distinct sets

Which can be expressed as tuples as well:

users = (Ana, Kate, John)
movies = (Star Wars, Star Trek)
Enter fullscreen mode Exit fullscreen mode

Using sets, we can perform some mathematical operations such as:

  • UNION: users ∪ movies
  • INTERSECTION: users ∩ movies
  • DIFFERENCE: users - movies
  • CARTESIAN PRODUCT: users × movies

set operations

We'll take a closer look at set operations in the upcoming sections as we relate them to SQL.


👉 SQL basics

In SQL, tables are relations that are physically persisted. An SQL table can be seen as a set in set theory.

But how do structure a set in SQL without persisting physical tables?

🔵 SELECT

With the SELECT keyword, we can structure a set as follows:

postgres=#  SELECT 'John';

?column?
----------
 John
(1 row)
Enter fullscreen mode Exit fullscreen mode

💡 Note
In PostgreSQL, we can omit the FROM clause from the SELECT keyword

But that's a one-element set. We should add more items to the collection.

PostgreSQL provides a set-returning function called generate_series:

postgres=# SELECT generate_series(1, 3);

 generate_series
-----------------
               1
               2
               3
(3 rows)
Enter fullscreen mode Exit fullscreen mode

We could add more and more information to the SELECT:

SELECT 
    generate_series(1, 3),
    generate_series(1, 3);

 generate_series | generate_series
-----------------+-----------------
               1 |               1
               2 |               2
               3 |               3

(3 rows)
Enter fullscreen mode Exit fullscreen mode

However we're not yet satisfied. The SELECT should be capable of "projecting fields" coming from a structure that is already a set.

Meet SQL FROM.

🔵 FROM

SQL FROM is a keyword used to specify a set of elements and, combined with the SELECT, we could define our first set:

SELECT 
    *
FROM 
    generate_series(1, 3) AS id;

 id
----
  1
  2
  3
(3 rows)
Enter fullscreen mode Exit fullscreen mode

Using a single FROM we can project multiple columns with SELECT:

SELECT 
    id,
    'User-' || id AS name
FROM 
    generate_series(1, 3) AS id;
Enter fullscreen mode Exit fullscreen mode

Which should print the output:

 id | name
----+-------
  1 | User-1
  2 | User-2
  3 | User-3
(3 rows)
Enter fullscreen mode Exit fullscreen mode

Take for instance that we have two sets:

  • the first set containing (User-1, User-2, User-3)
  • and a second set containing (User-2, User-3, User-4, User-5)

Now, time to explore some operations using the two sets.

🔵 UNION and UNION ALL

The SQL UNION operator combines the result of two or more sets, i.e SELECT statements:

SELECT id, 'User-' || id AS name
FROM generate_series(1, 3) AS id

UNION

SELECT id, 'User-' || id AS name
FROM generate_series(2, 5) AS id;
Enter fullscreen mode Exit fullscreen mode

Which outputs:

 id | name
----+-------
  2 | User-2
  3 | User-3
  1 | User-1
  5 | User-5
  4 | User-4
(5 rows)
Enter fullscreen mode Exit fullscreen mode

The UNION operator removes all duplicates in the result. But in case we want to also return duplicates, we can use UNION ALL:

SELECT id, 'User-' || id AS name
FROM generate_series(1, 3) AS id

UNION ALL

SELECT id, 'User-' || id AS name
FROM generate_series(2, 5) AS id;
Enter fullscreen mode Exit fullscreen mode

Output:

 id | name
----+-------
  1 | User-1
  2 | User-2
  3 | User-3
  2 | User-2
  3 | User-3
  4 | User-4
  5 | User-5
(7 rows)
Enter fullscreen mode Exit fullscreen mode

The result outputs 7 rows: 3 from the first SELECT and 4 from the second SELECT, including the duplicates User-2 and User-3.

Also, keep in mind some rules about SQL UNION:

  • every set must have the same number of columns
  • the columns in both sets must be in the same order

An illustrated example:

union

🔵 INTERSECT

The SQL INTERSECT operator also combines the result of two or more sets, but only return rows from the first set (first SELECT) that are identical to a row in the second set (second SELECT).

SELECT id, 'User-' || id AS name
FROM generate_series(1, 3) AS id

INTERSECT

SELECT id, 'User-' || id AS name
FROM generate_series(2, 5) AS id;
Enter fullscreen mode Exit fullscreen mode

Output:

 id |  name
----+--------
  3 | User-3
  2 | User-2
(2 rows)
Enter fullscreen mode Exit fullscreen mode

Here, the query returned only the identical rows found in both sets. That's exactly the "intersection" between the two sets.

An illustrated example:

intersection

🔵 EXCEPT

Another important operator is the SQL EXCEPT, which in the set theory is the DIFFERENCE (A - B).

This operator also combines the result of two or more sets, but only return rows of the first set (first SELECT) that are not present in the second set (second SELECT).

SELECT id, 'User-' || id AS name
FROM generate_series(1, 3) AS id

EXCEPT

SELECT id, 'User-' || id AS name
FROM generate_series(2, 5) AS id;
Enter fullscreen mode Exit fullscreen mode

Output:

 id |  name
----+--------
  1 | User-1
(1 row)
Enter fullscreen mode Exit fullscreen mode

Just 1 row. The only one that is present in the first set (left) but not in the second (right). We can also invert the sets and get the differences in reverse:

SELECT id, 'User-' || id AS name
FROM generate_series(2, 5) AS id

EXCEPT

SELECT id, 'User-' || id AS name
FROM generate_series(1, 3) AS id;
Enter fullscreen mode Exit fullscreen mode

Which are the User-5 and User-4 respectively:

 id |  name
----+--------
  5 | User-5
  4 | User-4
(2 rows)
Enter fullscreen mode Exit fullscreen mode

An illustrated example:

except

Once we understand some SQL basics and how those relate to set theory in mathematics, it's time to go beyond and explore a more sophisticated example.


💡 A more sophisticated example

So far, we've been experimenting SQL operators with sets on the fly by using the generate_series PostgreSQL function.

Did you notice that we haven't create any table? That's a good alternative to experiment and learn ad hoc.

But for a more sophisticated example we'll start persisting data into pyshical tables.

🔵 CREATE TABLE

Using CREATE TABLE, we can define a structure for a particular set of elements.

Such table structure is represented by a matrix of rows and columns, where rows are essentially a collection of values whereas columns define the type of values.

Let's create a table called users:

CREATE TABLE users (id INTEGER, name VARCHAR(250));
Enter fullscreen mode Exit fullscreen mode
  • id is the name of the first column. It only allows values of type integer
  • name is the name of the second column. It only allows values of type "character varying up to 250 bytes" (VARCHAR(250))

Now, we can query the table using FROM and SELECT:

SELECT
    id,
    name
FROM 
    users;
Enter fullscreen mode Exit fullscreen mode

Output:

 id | name
----+------
(0 rows)
Enter fullscreen mode Exit fullscreen mode

Of course, there's no data inserted into the table. Let's insert some data.

🔵 INSERT INTO

The command INSERT INTO allows to insert a collection of values into an existing table.

INSERT INTO users (id, name) 
VALUES (1, 'John'), (2, 'Ana'), (3, 'Kate');
Enter fullscreen mode Exit fullscreen mode

Note that the VALUES clause accepts a collection of tuples. Each tuple matches the columns (id, name) respectively.

Now, if we execute the query again, we get the output:

 id | name
----+------
  1 | John
  2 | Ana
  3 | Kate
Enter fullscreen mode Exit fullscreen mode

Yay! Our first table was created!

Now let's create another table called movies, which will contain the same number of columns as users, id and name respectively.

Afterwards, we follow inserting some data into the movies table.

CREATE TABLE movies (id INTEGER, name VARCHAR(250));

INSERT INTO movies (id, name) 
VALUES (1, 'Star Wars'), (2, 'Star Trek');
Enter fullscreen mode Exit fullscreen mode

💡 Tip
If we want to SELECT all the columns from a table, we can use the symbol *

SELECT * FROM movies;

 id |   name
----+-----------
  1 | Star Wars
  2 | Star Trek
(2 rows)
Enter fullscreen mode Exit fullscreen mode

Until now, we learned some basic operations that are used to combine data between two simple sets:

  • UNION
  • INTERSECT
  • EXCEPT

Next, it's time to explore some other operations between users and movies.


👉 A Cartesian Product

In set theory, a cartesian product is a combination of every element from one set to every element of another set.

As for our example, it's like combining every pair user-movie:

users = (John, Ana, Kate)
movies = (Star Wars, Star Trek)

### The cartesian product
(John, Star Wars)
(John, Star Trek)
(Ana, Star Wars)
(Ana, Star Trek)
(Kate, Star Wars)
(Kate, Star Trek)
Enter fullscreen mode Exit fullscreen mode

In SQL, it's the operator CROSS JOIN.

🔵 CROSS JOIN

The SQL CROSS JOIN keyword is used to build a cartesian product, a.k.a the cartesian join.

SELECT * FROM users
CROSS JOIN movies;
Enter fullscreen mode Exit fullscreen mode

It outputs every combination between the two tables:

 id | name | id |   name
----+------+----+-----------
  1 | John |  1 | Star Wars
  1 | John |  2 | Star Trek
  2 | Ana  |  1 | Star Wars
  2 | Ana  |  2 | Star Trek
  3 | Kate |  1 | Star Wars
  3 | Kate |  2 | Star Trek
(6 rows)
Enter fullscreen mode Exit fullscreen mode

cartesian product

🥁 It's all good, man! 🥁


💡 An even more sophisticated example

Suppose that users can give a rating (not mandatory) for every movie they watch.

Let's create another table called movies_ratings that associates:

  • a user
  • a movie
  • a rating

In SQL, it's a good practice using relationships, thus avoiding redundancy and stale data.

Such relationships are defined using ID's as "foreign keys". So, in the movies_rating table, a user is represented by its ID and a movie is represented by its ID.

CREATE TABLE movies_ratings 
(user_id INTEGER, movie_id INTEGER, rating INTEGER);
Enter fullscreen mode Exit fullscreen mode

And insert some data into it:

INSERT INTO
    movies_ratings (user_id, movie_id, rating)
VALUES
    (1, 1, 7),
    (1, 2, 8),
    (2, 1, 6),
    (2, 2, 10);
Enter fullscreen mode Exit fullscreen mode

When we query this table, we get the following output:

 user_id | movie_id | rating
---------+----------+--------
       1 |        1 |      7
       1 |        2 |      8
       2 |        1 |      6
       2 |        2 |     10
(4 rows)
Enter fullscreen mode Exit fullscreen mode

The schema should look like the following:

schema

For reporting, we'd like to combine the data of movies_ratings with users and movies so we can display their names, something like that:

 name |   movie    | rating
------+-----------+--------
 Ana  | Star Trek |     10
 John | Star Trek |      8
 John | Star Wars |      7
 Ana  | Star Wars |      6
(4 rows)
Enter fullscreen mode Exit fullscreen mode

Once we got a third table into the game - the movies_ratings table -, and because its structure differs form the other two tables, we can no longer use basic set operators such as UNION, INTERSECT or EXCEPT to combine data between different tables.

We need something like a cartesian product, but not doing every combination, only under certain conditions, based on a related column.

That's what we're going to explore in the next section: a world of JOINs.


🥁 JOIN to the world

With SQL JOIN, we can combine rows across different tables but applying a specific condition.

The clause may have different types of JOIN depending on the situation.

In case we want to combine all data from the first table (left) with the second table (right), but discarding the rows that do not appear in both tables, we use INNER JOIN.

🔵 INNER JOIN

Should we combine all the values from users with movies_ratings, discarding rows that don't appear in both tables:

SELECT * FROM users
INNER JOIN movies_ratings ON movies_ratings.user_id = users.id;
Enter fullscreen mode Exit fullscreen mode

Output:

 id | name | user_id | movie_id | rating
----+------+---------+----------+--------
  1 | John |       1 |        1 |      7
  1 | John |       1 |        2 |      8
  2 | Ana  |       2 |        1 |      6
  2 | Ana  |       2 |        2 |     10
(4 rows)
Enter fullscreen mode Exit fullscreen mode

Note that:

  • John voted two times
  • Ana too
  • Kate haven't registered any vote, so Kate was discarded in the INNER JOIN

inner join

💡 Note
In PostgreSQL, we can use INNER JOIN or simply JOIN. By default, without specifying the JOIN type, PostgreSQL will perform an INNER JOIN

But what if we wanted to display even Kate in the JOIN but with null values? In other words, how to keep every row in both tables even if the condition wasn't satisfied?

🔵 FULL OUTER JOIN

The SQL FULL OUTER JOIN will keep every row from both tables even if the condition wasn't satisfied.

SELECT * FROM users
FULL OUTER JOIN movies_ratings ON movies_ratings.user_id = users.id;
Enter fullscreen mode Exit fullscreen mode

Which produces the output:

 id | name | user_id | movie_id | rating
----+------+---------+----------+--------
  1 | John |       1 |        1 |      7
  1 | John |       1 |        2 |      8
  2 | Ana  |       2 |        1 |      6
  2 | Ana  |       2 |        2 |     10
  3 | Kate |         |          |
(5 rows)
Enter fullscreen mode Exit fullscreen mode

full join

💡 Note
In PostgreSQL, we can use FULL OUTER JOIN or simply FULL JOIN

Also, SQL provides another type of JOIN, which combines two tables (left and right) but keeps all rows from the left table only, even if they do not appear in the right table.

Yes, we are talking about the LEFT OUTER JOIN.

🔵 LEFT OUTER JOIN

The SQL LEFT OUTER JOIN will keep every row from the left table even if they do not appear in the right table.

SELECT * FROM users
LEFT OUTER JOIN movies_ratings ON movies_ratings.user_id = users.id;
Enter fullscreen mode Exit fullscreen mode

Which produces the output:

 id | name | user_id | movie_id | rating
----+------+---------+----------+--------
  1 | John |       1 |        1 |      7
  1 | John |       1 |        2 |      8
  2 | Ana  |       2 |        1 |      6
  2 | Ana  |       2 |        2 |     10
  3 | Kate |         |          |
(5 rows)
Enter fullscreen mode Exit fullscreen mode

left join

💡 Note
In PostgreSQL, we can use LEFT OUTER JOIN or simply LEFT JOIN

On the other hand, there's also the opposite direction: RIGHT OUTER JOIN.

🔵 RIGHT OUTER JOIN

The SQL RIGHT OUTER JOIN will keep every row from the right table even if they do not appear in the left table.

Please note that we'll invert the tables, movies_ratings will be the left table whereas users will be the right table:

SELECT * FROM movies_ratings
RIGHT OUTER JOIN users ON movies_ratings.user_id = users.id;
Enter fullscreen mode Exit fullscreen mode

Which produces the output:

 user_id | movie_id | rating | id | name
---------+----------+--------+----+------
       1 |        1 |      7 |  1 | John
       1 |        2 |      8 |  1 | John
       2 |        1 |      6 |  2 | Ana
       2 |        2 |     10 |  2 | Ana
         |          |        |  3 | Kate
(5 rows)
Enter fullscreen mode Exit fullscreen mode

With the tables inverted, in case we applied LEFT JOIN, Kate would be discarded.

But as we applied RIGHT JOIN (users are in the right-side), Kate was kept in the JOIN even not appearing in the left table movies_ratings.

💡 Note
In PostgreSQL, we can use RIGHT OUTER JOIN or simply RIGHT JOIN


Putting all together

Now that we understand about different JOIN strategies in SQL, let's build a query where we display:

  • the user name
  • the movie title
  • the rating

...keeping users who didn't vote.

Let's write the query in baby steps.

First, we project the fields in SELECT:

SELECT
    users.name,
    movies.name,
    movies_ratings.rating
FROM users
Enter fullscreen mode Exit fullscreen mode

As such, we should JOIN movies and movies_ratings tables.

Next, we'll perform a LEFT JOIN with the movies_ratings, because we want to keep all users even those who didn’t vote.

SELECT
    users.name,
    movies.name,
    movies_ratings.rating
FROM users
LEFT JOIN movies_ratings ON users.id = movies_ratings.user_id
Enter fullscreen mode Exit fullscreen mode

The following step consists of joining the movies table as well.

But keep in mind that at this moment, the left table is the result of the first join and the right table will be the movies table.

So if we apply JOIN, Kate will be discarded as she doesn't appear in the movies table. We then should use LEFT JOIN as well:

SELECT
    users.name,
    movies.name,
    movies_ratings.rating
FROM users
LEFT JOIN movies_ratings ON users.id = movies_ratings.user_id
LEFT JOIN movies ON movies.id = movies_ratings.movie_id
Enter fullscreen mode Exit fullscreen mode

Lastly, let's order the results by rating in descending order.

Also, in the ORDER clause, we'll perform the NULLS LAST, so Kate will appear at the bottom of the results, and not at the top:

SELECT
    users.name AS user,
    movies.name AS movie,
    movies_ratings.rating
FROM users
LEFT JOIN movies_ratings ON users.id = movies_ratings.user_id
LEFT JOIN movies ON movies.id = movies_ratings.movie_id
ORDER BY movies_ratings.rating DESC NULLS LAST;
Enter fullscreen mode Exit fullscreen mode

Which outputs:

 user |   movie   | rating
------+-----------+--------
 Ana  | Star Trek |     10
 John | Star Trek |      8
 John | Star Wars |      7
 Ana  | Star Wars |      6
 Kate |           |
(5 rows)
Enter fullscreen mode Exit fullscreen mode

Such a big Yay!

☢️ Wait!
What if I wanted to discard users who didn't vote in the results?

No problem, you just need to perform an INNER JOIN, remember?

...
JOIN movies_ratings ON users.id = movies_ratings.user_id
JOIN movies ON movies.id = movies_ratings.movie_id
...
Enter fullscreen mode Exit fullscreen mode

➕ More

Moreover, we want to display a query which only brings the number of votes for each user. Then it should display as follows:

 user | votes
------+-------
 Ana  |     2
 John |     2
 Kate |     0
(3 rows)
Enter fullscreen mode Exit fullscreen mode

First, we return all users JOINed with movies_ratings, projecting the users.name field:

SELECT
    users.name
FROM users
LEFT JOIN movies_ratings ON users.id = movies_ratings.user_id

 name
------
 John
 John
 Ana
 Ana
 Kate
(5 rows)
Enter fullscreen mode Exit fullscreen mode

Now, we want to group users by their names. In SQL, we can use the clause GROUP BY.

SELECT
    users.name
FROM users
LEFT JOIN movies_ratings ON users.id = movies_ratings.user_id
GROUP BY users.name

 name
------
 Kate
 Ana
 John
(3 rows)
Enter fullscreen mode Exit fullscreen mode

Next, we want to project another field in the SELECT, which are the count of votes of each user.

PostgreSQL provides a function called COUNT, which does the job: you can count on an arbitrary field that belongs to a selected projection.

SELECT
    users.name,
    COUNT(movies_ratings.rating) AS votes
FROM users
LEFT JOIN movies_ratings ON users.id = movies_ratings.user_id
GROUP BY users.name
Enter fullscreen mode Exit fullscreen mode

Last but not least, let's order by the counter of votes, so Kate with 0 votes will appear at the bottom:

SELECT
    users.name,
    COUNT(movies_ratings.rating) AS votes
FROM users
LEFT JOIN movies_ratings ON users.id = movies_ratings.user_id
GROUP BY users.name
ORDER BY COUNT(movies_ratings.rating) DESC;
Enter fullscreen mode Exit fullscreen mode

And, voilà:

 name | votes
------+-------
 Ana  |     2
 John |     2
 Kate |     0
(3 rows)
Enter fullscreen mode Exit fullscreen mode

Wrapping Up

This post was a try to explain the basics of SQL and set theory, along with a practical example towards an explanation of different JOIN strategies.

I hope you could learn a bit more about SQL JOIN and SQL in general.

Cheers!

Top comments (4)

Collapse
 
linuxguist profile image
Nathan S.R.

Thanks for this very useful post. I just wanted to add that, there is a very easy way now, to test all the SQLs described here, using the free & portable tools, mentioned in my latest post here : dev.to/linuxguist/learn-sql-quickl...

Collapse
 
sachajw profile image
Sacha Wharton

This is brilliantly put together, thank you!

Collapse
 
mahadev_k profile image
Mahadev K

Good write up !!

Collapse
 
mishu36 profile image
Mukibul Hasan

Verry nice one.