Gaps & Islands: Number of Consecutive Days in SQL
In my latest post about Gaps & Islands problem, I promised to provide a real-world example that involves finding the number of consecutive login days in a dataset.
Problem Description:
Given a table CUSTOMER_LOGINS with the following columns:
The task is to identify and group consecutive login dates for each customer, and for each such group, provide:
Step I: Create a Test Table
First, let's create a test table CUSTOMER_LOGINS.
-- Create the CUSTOMER_LOGINS table
CREATE TABLE CUSTOMER_LOGINS (
customer_id INT,
login_date DATE
);
-- Insert some sample data
INSERT INTO CUSTOMER_LOGINS (customer_id, login_date)
VALUES
(1, '2022-01-09'),
(1, '2022-01-10'),
(1, '2022-01-11'),
(1, '2022-01-18'),
(1, '2022-01-19'),
(1, '2022-01-20'),
(1, '2022-01-21');
SELECT * FROM CUSTOMER_LOGINS;
+-------------+------------+
| customer_id | login_date |
+-------------+------------+
| 1 | 2022-01-09 |
| 1 | 2022-01-10 |
| 1 | 2022-01-11 |
| 1 | 2022-01-18 |
| 1 | 2022-01-19 |
| 1 | 2022-01-20 |
| 1 | 2022-01-21 |
+-------------+------------+
Step II: Rank Dates
At first sight, it's not easy to guess how should we approach this problem. But if we think about the intermediary output, we probably need a column to group consecutive dates. Let's rank the date column and see what it gives us.
Recommended by LinkedIn
SELECT
customer_id
, login_date
, RANK() OVER (
PARTITION BY customer_id
ORDER BY login_date
) AS rnk
FROM CUSTOMER_LOGINS
-- Intermediary Result
+-------------+------------+-------+
| customer_id | login_date | rnk |
+-------------+------------+-------+
| 1 | 2022-01-09 | 1 |
| 1 | 2022-01-10 | 2 |
| 1 | 2022-01-11 | 3 |
| 1 | 2022-01-18 | 4 |
| 1 | 2022-01-19 | 5 |
| 1 | 2022-01-20 | 6 |
| 1 | 2022-01-21 | 7 |
+-------------+------------+-------+
Step III: Subtract to Get Groups
If you think for a minute, you will notice that by subtracting rnk from login_date we get a unique group for consecutive dates.
WITH ranked_logins AS (
SELECT
customer_id
, login_date
, DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY login_date
) AS rnk
FROM
CUSTOMER_LOGINS
)
SELECT
customer_id
, login_date
-- Subract `rnk` from `login_date`
, login_date - INTERVAL '1' DAY * rnk AS grp_date
FROM
ranked_logins;
-- Intermediary Result
+-------------+------------+-------+------------+
| customer_id | login_date | rnk | grp_date |
+-------------+------------+-------+------------+
| 1 | 2022-01-09 | 1 | 2022-01-08 |
| 1 | 2022-01-10 | 2 | 2022-01-08 |
| 1 | 2022-01-11 | 3 | 2022-01-08 |
| 1 | 2022-01-18 | 4 | 2022-01-14 |
| 1 | 2022-01-19 | 5 | 2022-01-14 |
| 1 | 2022-01-20 | 6 | 2022-01-14 |
| 1 | 2022-01-21 | 7 | 2022-01-14 |
+-------------+------------+-------+------------+
Step IV: MIN and MAX
The next step is to find the minimum and maximum dates for each grp_date, it will give us the interval_start_date and interval_end_date.
WITH ranked_logins AS (
SELECT
customer_id
, login_date
, DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY login_date
) AS rnk
FROM
CUSTOMER_LOGINS
),
date_groups AS (
SELECT
customer_id
, login_date
, login_date - INTERVAL '1' DAY * rnk AS grp_date
FROM
ranked_logins
)
SELECT
customer_id
, MIN(login_date) AS interval_start_date
, MAX(login_date) AS interval_end_date
FROM date_groups
GROUP BY customer_id, grp_date
ORDER BY customer_id, interval_start_date
-- Intermediary Result
+-------------+---------------------+-------------------+
| customer_id | interval_start_date | interval_end_date |
+-------------+---------------------+-------------------+
| 1 | 2022-01-09 | 2022-01-11 |
| 1 | 2022-01-18 | 2022-01-21 |
+-------------+---------------------+-------------------+
Step (Final) V: Number of Consecutive Days
It is now obvious, we just need to count the days between interval_start_date and interval_end_date.
WITH ranked_logins AS (
SELECT
customer_id
, login_date
, DENSE_RANK() OVER (
PARTITION BY customer_id
ORDER BY login_date
) AS rnk
FROM
CUSTOMER_LOGINS
),
date_groups AS (
SELECT
customer_id
, login_date
, login_date - INTERVAL '1' DAY * rnk AS grp_date
FROM
ranked_logins
)
SELECT
customer_id
, MIN(login_date) AS interval_start_date
, MAX(login_date) AS interval_end_date
, 1 + DATEDIFF('day',
MIN(login_date), MAX(login_date)
) AS consecutive_days
FROM date_groups
GROUP BY customer_id, grp_date
ORDER BY customer_id, interval_start_date
-- Final Result
+---------+---------------------+-------------------+------------------+
| cust_id | interval_start_date | interval_end_date | consecutive_days |
+---------+---------------------+-------------------+------------------+
| 1 | 2022-01-09 | 2022-01-11 | 3 |
| 1 | 2022-01-18 | 2022-01-21 | 4 |
+---------+---------------------+-------------------+------------------+
It is not always obvious how should you approach the problem. 💡 Advice? Learn the basic structures of SQL and be creative!
🚀 Feel free to ask any questions and remember to 𝗞𝗲𝗲𝗽 𝗞𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗙𝗹𝗼𝘄𝗶𝗻𝗴! 🔔 Engage, Share, & Follow | Lasha Dolenjashvili
Business Intelligence Engineer | Analytics Engineer | Tableau, SQL, Python, Automation, Predictive Analytics | Building End-to-End Solutions
1moSame here, I love how you broke it down into five steps, made it super simple to grasp! Thank you
Quality Assurance | Software Testing | Incident/Problem Management | Data Governance
9moThis is very helpful! I was doing online exercises, and encountered something very similar to this, and I was stuck. Your post got me to the answer