Mark Dsouza

Posted on Jul 10, 2020

Oracle SQL Analytic Functions - Part 1

#database #oracle #sql #tutorial

Do you have a bunch of raw data stored in your Oracle Database and do you need to derive something meaningful out of the raw data?

Depending on your DB structure, you might only need basic Joins and Filters. For more complex data sets and reporting requirements, you will probably need to use Analytic functions to convert your data into something more meaningful.
There are a huge bunch of these functions and using them can become a little bit tricky if it's the first time you are working on some sort of DB data manipulation.

In this series of posts, I'm going to try and slowly expose you to the basic analytic functions(at least the commonly used ones), describe them and break down how you can really configure them to suit your exact needs.
By the end of this, I hope to empower you with the knowledge to convert Data set A(your source data) into Data set B(your target data)

Part 1 will look at the 3 easier and commonly used functions: rank(), row_number() and dense_rank()

Let's take this sample Data set to explain these functions

We have a Table EMPLOYEE with 7 rows of data.
Let's run a basic select statement to see what data is present in this table

select EMP.* from EMPLOYEE EMP

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY
100	MICHAEL	9000
101	PAM	6000
102	KEVIN	6000
103	DWIGHT	7500
104	ANGELA	6000
105	JIM	7500
106	ANDY	4000

Here EMPLOYEE_ID is the primary key of the table. We have 7 employes with their EMPLOYEE_NAME and SALARY stored.
All 3 functions will assign a number to each row based on the criteria we give it, but how it gets assigned for edge cases is what is different about each. Let's take a look at them one by one.

rank()

The syntax for applying this would be

select EMP.*, 
rank() over(order by SALARY desc) as SAL_RANK 
from EMPLOYEE EMP

Now let's take a look at what we just wrote - we called the rank() function but we also mentioned some additional text over (order by SALARY).
So why do we need this and what are we doing here?
We basically need to specify how to do the ranking so Oracle knows how rank() needs to be applied.
over is basically telling Oracle - Over what dataset and how to apply the rank. This is going to be used for all analaytic functions to add configuration.
order by SALARY tells Oracle to sort the data in ascending order. By default all sorts are ascending.
The 3 ways of writing this are
rank() over (order by SALARY)
rank() over (order by SALARY ASC)
rank() over (order by SALARY DESC)
Since the default sort is Ascending, Line 1 and Line 2 would give you the same result.

Now that we have gone over the syntax. Let's look at what the target dataset looks like after calling this and what really happens.

select EMP.*, 
rank() over(order by SALARY desc) as SAL_RANK 
from EMPLOYEE EMP

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	SAL_RANK
100	MICHAEL	9000	1
101	PAM	6000	4
102	KEVIN	6000	4
103	DWIGHT	7500	2
104	ANGELA	6000	4
105	JIM	7500	2
106	ANDY	4000	7

Let me order the dataset as per the Rank just to understand the data better.

select * from 
    (select EMP.*, 
    rank() over(order by SALARY desc) as SAL_RANK 
    from EMPLOYEE EMP)
order by SAL_RANK

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	SAL_RANK
100	MICHAEL	9000	1
105	JIM	7500	2
103	DWIGHT	7500	2
102	KEVIN	6000	4
104	ANGELA	6000	4
101	PAM	6000	4
106	ANDY	4000	7

As mentioned, this function assigns a number to each and every row.
We also see that if there are rows in our table with the same value for Salary, they get assigned the same rank!
Lets quickly look at our data. We have 1 employee MICHAEL who gets paid 9000 and is assigned Rank 1. We have 2 Employees JIM and DWIGHT who are both assigned Rank 2 because they are both earning 7500. Now, since we have both of them assigned to 2, the rank function skips assigning a value 3 and next assigns 4 to whoever is next in the data after sorting. We see there are 3 people with salary 6000. All 3 are assigned the Rank 4. Finally Rank 7 is assigned to the person with 4000.
Say in this case ANDY was not a part of your dataset, the maximum rank would be 4. Hence it is not necessary that the lowest rank equals the total number of rows in the dataset.

dense_rank()

The syntax for applying dense_rank() would be similar to rank()

select EMP.*, 
rank() over(order by SALARY desc) as SAL_RANK,
dense_rank() over(order by SALARY desc) as SAL_DENSE_RANK
from EMPLOYEE EMP

Dense Rank is very similar to Rank. The only difference is that it does not skip any numbers when assigning a value to a row. Hence, for our example, even though we have 7 employees, our dense_rank value will be 1,2,3 and 4.

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	SAL_RANK	SAL_DENSE_RANK
100	MICHAEL	9000	1	1
105	JIM	7500	2	2
103	DWIGHT	7500	2	2
102	KEVIN	6000	4	3
104	ANGELA	6000	4	3
101	PAM	6000	4	3
106	ANDY	4000	7	4

I personally have never used dense_rank but it is a very common basic Oracle interview question. So it surely is good to keep in mind.

So where can dense_rank() be used?
Say we want to find the 3rd highest salary being paid (NOT the person with the 3rd highest salary. Just the 3rd Highest salary as number)

select * from (
    select EMP.*, 
    dense_rank() over (order by SALARY desc) as SAL_DENSE_RANK 
    from EMPLOYEE EMP
) where SAL_DENSE_RANK = 3

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY SAL_DENSE_RANK
102	KEVIN	6000
104	ANGELA	6000
101	PAM	6000

We would get all 3 rows. The 3rd highest salary is 6000 (After 9k to MICHAEL and 7.5k to JIM and DWIGHT)

In case we want to get the employee who is paid the 3rd Highest amount in the employee table, we could write the query

select * from 
    (select EMP.*, 
    rank() over(order by SALARY desc) as SAL_RANK 
    from EMPLOYEE EMP) 
WHERE SAL_RANK = 3;

Now, this will return 0 rows.
Why? You guessed it. It's because no row is actually assigned the value 3 for its rank even though we have 7 employees.

If we try this with dense_rank,

select * from 
    (select EMP.*,
    dense_rank() over(order by SALARY desc) as DENSE_SAL_RANK
    from EMPLOYEE EMP)
WHERE DENSE_SAL_RANK = 3

We would find a value 3, but we would get 2 rows but these are employees with salary 6000. And by looking at the dataset we know 3 is either Jim or Dwight.
This is the perfect use case of the last function we are looking at.

row_number()

Let's see it in action and how it assigns data differently for the exact same dataset

select EMP.*,
rank() over(order by SALARY desc) as SAL_RANK,
dense_rank() over(order by SALARY desc) as SAL_DENSE_RANK,
row_number() over(order by SALARY desc) as SAL_ROW_NUMBER
from EMPLOYEE EMP

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	SAL_RANK	SAL_DENSE_RANK	SAL_ROW_NUMBER
100	MICHAEL	9000	1	1	1
105	JIM	7500	2	2	2
103	DWIGHT	7500	2	2	3
102	KEVIN	6000	4	3	5
104	ANGELA	6000	4	3	4
101	PAM	6000	4	3	6
106	ANDY	4000	7	4	7

So what happens when 2 rows have the same value? row_number randomly assigns a value to each of these conflicting rows. This means the next time you run your query, you could be seeing JIM with 3 and DWIGHT with 2.

So for the use case we were talking about - finding the person with the 3rd Highest Salary in the company, the below query would work perfectly.

select * from 
    (select EMP.*,
    row_number() over(order by SALARY desc) as SAL_ROW_NUMBER 
    from EMPLOYEE EMP) 
WHERE SAL_ROW_NUMBER = 3;

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	SAL_ROW_NUMBER
103	DWIGHT	7500	3

But(yes there is a but), this can lead to some issues when it comes to reports.
Just for the above case, sometimes you get JIM, sometimes DWIGHT. And that is surely not ideal.

Using Multiple Sort Conditions

Since we do not get a consistent match, we could set an additional sort condition. Let's assume EMPLOYEE_ID is assigned when an employee joins the organization. So a lesser employee_Id means they have been the firm longer. We could add an additional SORT based this so in case the Salary the same, seniority is the next sort.
Note : It is not ideal to use the EMPLOYEE_ID in the sort condition. If there was a column of joining date, that would be more appropriate to be sorted on.

select * from 
    (select EMP.*, 
    row_number() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_ROW_NUMBER 
    from EMPLOYEE EMP) 
WHERE SAL_ROW_NUMBER = 3;

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	SAL_ROW_NUMBER
105	JIM	7500	2

Note: it is always safe to sort your data such that the dataset never changes each time the query runs. This eliminates any randomness in behavior which can eventually lead to issues later on.
By Doing so, something interesting happens. We see that rank() = row_number() = dense_rank()

Running the same query with an additional sort condition

select EMP.*, 
rank() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_RANK,
dense_rank() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_DENSE_RANK,
row_number() over(order by SALARY desc, EMPLOYEE_ID asc) as SAL_ROW_NUMBER  
from EMPLOYEE EMP

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	SAL_RANK	SAL_DENSE_RANK	SAL_ROW_NUMBER
100	MICHAEL	9000	1	1	1
103	DWIGHT	7500	2	2	2
105	JIM	7500	3	3	3
101	PAM	6000	4	4	4
102	KEVIN	6000	5	5	5
104	ANGELA	6000	6	6	6
106	ANDY	4000	7	7	7

Well will you look at that ! Just by adding additional sort conditions, we are able to uniquely assign a value to each and every row. And because of this we do not see any difference in the value assigned by rank(), row_number() and dense_rank() since there are no conflicts.
The real difference between these functions come into play ONLY when 2 or more rows have the same value for the row that is being sorted on.

Using partition by

Phew. Now that that's out of the way, let's keep it up a notch shall we?
So far all we have done is rank the entire Table. But Oracle allows us to assign ranks for smaller subsections as well.
Let's add another column to the base dataset and change some numbers that we had to see the use of partitioning.

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	TEAM
100	MICHAEL	9000	LEADERSHIP
101	PAM	6000	SALES
102	KEVIN	5000	ACCOUNTING
103	DWIGHT	7500	SALES
104	ANGELA	6000	ACCOUNTING
105	JIM	7500	SALES
106	ANDY	4000	SALES

We have 3 Teams in the Table - Leadership, Sales and Accounting.
Now, what Oracle lets us do is basically apply all the above 3 functions only for each team.
Let's look at how rank() would work with this dataset

select * from (
    select EMP.*, 
    rank() over (order by SALARY desc) as SAL_RANK,  
    rank() over (partition by TEAM order by SALARY desc) as TEAM_SAL_RANK
    from EMPLOYEE EMP
) order by TEAM

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	TEAM	SAL_RANK	TEAM_SAL_RANK
104	ANGELA	6000	ACCOUNTING	4	1
102	KEVIN	5000	ACCOUNTING	6	2
100	MICHAEL	9000	LEADERSHIP	1	1
103	DWIGHT	7500	SALES	2	1
105	JIM	7500	SALES	2	1
101	PAM	6000	SALES	4	3
106	ANDY	4000	SALES	7	4

I have ordered the dataset by the TEAM name. If you see, for each team I have my own set of rankS. Accounting has a rank 1 and 2. LEADERSHIP is a single employee so there is only 1 rank. There are bunch of people in the SALES team, and there are 2 with the same highest salary, so both get assigned 1.

Seeing what row_number would do in the same situation

select * from (
    select EMP.*, 
    row_number() over (order by SALARY desc) as SAL_ROW_NUMBER,  
    row_number() over (partition by TEAM order by SALARY desc) as TEAM_SAL_ROW_NUMBER
    from EMPLOYEE EMP
) order by TEAM

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	TEAM	SAL_ROW_NUMBER	TEAM_SAL_ROW_NUMBER
104	ANGELA	6000	ACCOUNTING	4	1
102	KEVIN	5000	ACCOUNTING	6	2
100	MICHAEL	9000	LEADERSHIP	1	1
103	DWIGHT	7500	SALES	2	1
105	JIM	7500	SALES	3	2
101	PAM	6000	SALES	5	3
106	ANDY	4000	SALES	7	4

So what if we wanted the details of the highest salary per Team? If you didn't have analytic functions you would probably need to know what are the various teams and then write a bunch of the below statements.

select EMP.* 
from EMPLOYEE 
WHERE TEAM = "SALES" 
AND SALARY = 
    (select MAX(SALARY) 
    from EMPLOYEE 
    WHERE TEAM = "SALES");

Doing this without analytic functions gets exponentially harder when you want say the 2nd or 3rd highest Salary.

Using analytic functions, irrespective of the number of teams, I can do the below

select * from (
    select EMP.*, 
    row_number() over (partition by TEAM order by SALARY desc) as TEAM_SAL_ROW_NUMBER
    from EMPLOYEE EMP
) WHERE  TEAM_SAL_ROW_NUMBER = 1;

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	TEAM	TEAM_SAL_ROW_NUMBER
104	ANGELA	6000	ACCOUNTING	1
100	MICHAEL	9000	LEADERSHIP	1
103	DWIGHT	7500	SALES	1

Handling NULL

Another common edge case that can lead to random or unexpected behaviour is when we are trying to sort on a column, and some rows have the value NULL. In such cases, it is advisable to add an NVL condition to the sort statement so as to correctly order the data. In the above example of salary, in case due to an error in the system 1 row has a value NULL, and you sorting the salarys decending, you could use the below work around.

select EMP.*, 
rank() over(order by nvl(SALARY,0) desc) as SAL_RANK 
from EMPLOYEE EMP

If any value is NULL, it will substitue the value with 0 only for the sorting. This will handle the situation where any row does not have salary populated.

Questions

Below are a few examples for you to try out to see if you have got the hang of it.

Source Data set

EMPLOYEE_ID	EMPLOYEE_NAME	SALARY	TEAM
100	MICHAEL	9000	LEADERSHIP
101	PAM	6000	SALES
102	KEVIN	5000	ACCOUNTING
103	DWIGHT	7500	SALES
104	ANGELA	6000	ACCOUNTING
105	JIM	7500	SALES
106	ANDY	4000	SALES

Question 1 :

Write a query using an Analytic Function to get the details of the employee with the 2nd lowest Salary

Question 2 :

Write a query using an Analytic Function to get the details of the employee with the highest salary from the group of employees who earn less than or equal to 7000

select *, 
rank() over(order by SALARY desc) as SAL_RANK,
dense_rank() over(order by SALARY desc) as SAL_DENSE_RANK,
row_number() over(order by SALARY desc) as SAL_ROW_NUMBER
from EMPLOYEE
WHERE SAL_RANK= 1

Question 3

Is the above syntax correct?

I would love comments, feedback on this article and suggestions for future ones. I hope this helps someone somewhere!

DEV Community

Oracle SQL Analytic Functions - Part 1

rank()

dense_rank()

row_number()

Using Multiple Sort Conditions

Using partition by

Handling NULL

Questions

Question 1 :

Question 2 :

Question 3

Top comments (0)

Read next

Machine Learning Basics: Building Your First Predictive Model in R

LIBRARY MANAGEMENT SYSTEM USING JAVA AND SQL

GO:lack of synchronization

Unlock the Power of Custom Formatting in Go: A Deep Dive into the Formatter Interface