Everything about Analytic Functions in Advanced SQL queries
The Analytic Functions have been implemented to leverage SQL queries and make them more useful and powerful. These functions are calculated over sets of rows and then return the result in the current row. These functions could be an aggregate function like COUNT, MIN, MAX, AVG, SUM, or a pure one like DENSE_RANK.
OVER
The inseparable part of the analytic function is OVER. This is a mandatory part of using these function if they get () (open close parenthesis) in front of OVER means the analytic function indicates the widest possible scope.
select
department_id,
sum(salary) over() "Total"
from hr.employees
where department_id in (10,20,30)
The total SUM of departments 10,20,30 is returned next to each department.
Note the SUM function is used here without the GROUP BY clause. Also, it is possible to use KEEP clause to fetch out more columns.
select
department_id,
salary,
min(last_name) keep (dense_rank first order by salary) over(),
min(salary) over() "Total"
from hr.employees
where department_id in (10,20,30)
PARTITION BY
This part of the Analytic function acts as same as GROUP BY clause to group rows.
select
last_name,
salary,
department_id,
salary,
max(salary) over (partition by department_id),
JOB_ID,
max(salary) over (partition by job_id)
from hr.employees
where department_id in (10,20,30)
Raphaely is the PU_MAN of department 30 and he has the best paid in this department.
LEAD, LAG, and ROW_NUMBER
These analytic functions are the member of the ranking family and in order to use them note, do not forget using ORDER BY:
select
last_name,
hire_date,
row_number() over(order by hire_date),
lag(last_name,1,'-') over(order by hire_date) ,
lead(last_name,1,'-') over(order by hire_date)
from hr.employees
where department_id in (10,20,30,40,50)
ROW_NUMBER returns a row number in the specific order. LAG returns the previous nth row based on what number you passed as the second input argument, and LEAD returns next nth rows based on what ever you passed as the second input argument. here for both case we passed 1 as the second argument, then in the first row LAG returns null but if you determined the third argument that what you need instead of null and it will return the third argument, here '-' will return, and LEAD returns next value for LAST_NAME in the next record in the current record and so on and so forth.
select
last_name,
salary,
lag(last_name,1,'-') over(order by salary) ,
lag(last_name,2,'-') over(order by salary) ,
lag(last_name,3,'-') over(order by salary)
from hr.employees
where department_id in (10,20,30,40)
RANK and DENSE_RANK
select
last_name,
department_id,
row_number() over(order by department_id) "Row_Number" ,
rank() over(order by department_id) "Rank" ,
dense_rank() over(order by department_id) "Dense_Rank"
from hr.employees
where department_id in (10,20,30,40)
The function ROW_NUMBER always returns unique numbers based on OVER() clause. Both RANK and DENSE_RANK return duplicate numbers for employees with the same DEPARTMENT_ID. the difference between them is DENSE_RANK that it does not skip any numbers.
Note that you can not use Analytic Functions in where clause, only in the SELECT and ORDER BY clause.
select
last_name,
department_id,
row_number() over(partition by department_id order by department_id) "Row_Number" ,
rank() over(partition by department_id order by department_id) "Rank" ,
dense_rank() over(partition by department_id order by department_id) "Dense_Rank"
from hr.employees
where department_id in (10,20,30,40)
order by department_id,"Dense_Rank"
WINDOW
When you use ROWS BETWEEN clause in an analytic function, indeed you make a window on the query, it is one of the most powerful features in the SQL queries:
select
last_name,
hire_date,
salary,
max(salary) over(order by hire_date,last_name rows between
unbounded preceding and 1 preceding) max_before,
max(salary) over(order by hire_date,last_name rows between 1
following and unbounded following) max_after
from hr.employees
where department_id in (10,20,30,40)
order by hire_date,last_name
On each row, the maximum salary before the current row and the maximum salary after the current row are demonstrated.
When SUM is used with ROWS BETWEEN we can produce cumulative totals.
select
last_name,
salary,
sum(salary) over(order by last_name rows between unbounded preceding
and current row) Cum_sum
from hr.employees
where department_id in (10,20,30,40)
The lowest possible bound is UNBOUNDED PRECEDING, the current row is CURRENT ROW and the highest possible row is UNBOUNDED FOLLOWING.
Rows preceding and following the current row are retrieved with "n" PRECEDING and "n" FOLLOWING where "n" is the relative position of the current row.
Note when not specifying BETWEEN, the window implicitly ends at the CURRENT ROW.
select
last_name,
salary,
sum(salary) over(order by Salary rows 1 preceding) Cum_sum
from hr.employees
where department_id in (10,20,30,40)
The SUM is calculated for rows between the previous and the current row. The window start at the position immediately preceding the current row and the current row.
RANGE is similar to ROWS but the intervals are not a number of rows. They are either numeric or DATE values.
select
last_name,
salary,
salary*.9 LOW,
salary*1.1 HIGH,
count(*) over(order by Salary range between salary*.1 preceding
and salary*.1 following) cnt
from hr.employees
where department_id in (10,20,30,40)
On the first row; SALARY is 2500, low bound is 2250 and upper bound is 2750 then COUNT should calculate how many records exist between this range so also value 2600 in the next row could be in this range then the result is 2, for next row SALARY is 2600 and low bound is 2340 and upper bound is 2860, how many records are exist in this range; there are values 2500,2600,2800 so the COUNT returns 3, so on and so forth.
It is very important to understand that if two rows get the same value, they are both either included or excluded from the window:
select
ename,
sal,
sum(sal) over(order by sal rows unbounded preceding) sum_rows,
sum(sal) over(order by sal range unbounded preceding) sum_ranges
from scott.emp
Until the third row, everything is going equally in both column SUM_ROWS and SUM_RANGES but in the fourth row, SALARY is 1250 and also in fifth-row SALARY is 1250 then RANGES include both of them in the window and calculate 5350 but ROWS calculate 4100.
ROWS CURRENT ROW points to exactly one row; RANGE CURRENT ROW points to all row where the sort key is equal to the current row.
With dates and TIMESTAMPs, the interval could be a number of days, a day-to-seconds interval or a year-to-month interval.
select
last_name,
hire_date,
salary,
avg(salary) over(order by trunc(hire_date,'MM')
range between interval '1' month preceding
and
interval '1' month preceding) "previous",
avg(salary) over(order by trunc(hire_date,'MM')
range current row) "current",
avg(salary) over(order by trunc(hire_date,'MM')
range between interval '1' month following
and
interval '1' month following) "next",
avg(salary) over(order by trunc(hire_date,'MM')
range between interval '1' month preceding
and
interval '1' month following) "3months"
from hr.employees
order by hire_date
The sort key is the month of hire date, The previous column evaluates the average salary for the employees hired in the month before the current employee was hired, the current column includes the average of employees hired in the same month of the current row and the next column relates to the employees hired in the month after the hire date of the current employee.
FIRST_VALUE and LAST_VALUE
The discrete bounds of the current window are returned by the FIRST_VALUE and LAST_VALUE:
select
last_name,
hire_date,
first_value(last_name || '(' || hire_date || ')')
over (order by hire_date
range between 30 preceding and 30 following) "first",
last_value(last_name || '(' || hire_date || ')')
over (order by hire_date
range between 30 preceding and 30 following) "last"
from hr.employees
order by hire_date
The window starts 30 days before the current hire date and ends 30 days after the current hire date, and FIRST_VALUE returns the first value in this window and LAST_VALUE returns the last value.
in 10g and later the IGNORE NULLS clause returns the first and last non-null values.
select
ename,
sal,
comm,
first_value(comm ignore nulls)
over (order by sal
rows between 1 following and unbounded following) "next",
last_value(comm ignore nulls)
over (order by sal
rows between unbounded preceding and 1 preceding) "previous"
from scott.emp
Using Analytic Functions with GROUP BY clause
Analytics can be used with aggregation, When combining analytics with aggregation, all expression in the ORDER BY clause and column expression have to be either part of the GROUP BY expressions or aggregate functions.
select
row_number() over(order by deptno) "Row_Number",
deptno,
sum(sal),
to_char(100*ratio_to_report(sum(sal)) over(),'990.00L','NLS_CURRENCY=%') pct
from scott.emp
group by deptno
RATIO_TO_REPORT returns the percentage of the total salary of the department to the overall total.