需求描述:登录表 tb_user_logs(user_id int, login_time datetime),求3天内连续登录的用户ID。
user_id | dt |
1001 | 2023-06-10 |
1001 | 2023-06-11 |
1001 | 2023-06-15 |
1001 | 2023-06-16 |
1001 | 2023-06-17 |
步骤1: 给数据进行排序
user_id | dt | NO |
1001 | 2023-06-10 | 1 |
1001 | 2023-06-11 | 2 |
1001 | 2023-06-15 | 3 |
1001 | 2023-06-16 | 4 |
1001 | 2023-06-17 | 5 |
步骤2: 用 “dt-NO”
user_id | dt | NO | td-NO |
1001 | 2023-06-10 | 1 | 2023-06-09 |
1001 | 2023-06-11 | 2 | 2023-06-09 |
1001 | 2023-06-15 | 3 | 2023-06-12 |
1001 | 2023-06-16 | 4 | 2023-06-12 |
1001 | 2023-06-17 | 5 | 2023-06-12 |
步骤3:对 “dt-NO” 进行分组,同一组的日期一定是连续的。
整理思路:对每个用户,按照日期从小到大排序,对数据进行连续编号,计算日期与编号的差值,按照差值进行分组操作,同一组内的日期一定是连续的,只要组内数据个数N>=3即可。
编号:窗口函数row_number()支持对行进行编号,编号可以是全局的,也可以分组编号,但一定要提供排序的规则。
语法:
row_number() over (partition by user_id order by dt)
解释:对分组进行编号,artition by user_id:对 user_id 进行分组,order by dt:对 dt 进行排序
实战:
步骤1:ID去重(一个用户1天只统计1次)
select
distinct
user_id,date(login_time) from
tb_user_logs;
步骤2:分组编号
select
*,row_number
() over
(partition
by user_id order
by dt)
form (select
distinct
user_id,date(login_time) from
tb_user_logs) a;
步骤3:计算差值
select
*, date_sub
(dt, INTERVAL
xrank DAY) as
delta
form (
select
*,row_number
() over
(partition
by user_id order
by dt)
form (
select
distinct
user_id,date(login_time) from
tb_user_logs
) a
)
步骤4:找出>=3的组
select
user_id,min
(dt) as start_date,count
(*) as days
form
(
select
*, date_sub
(dt, INTERVAL
xrank DAY) as
delta
form
(
select
*,row_number
() over
(partition
by user_id order
by dt)
form
(
select
distinct
user_id,date(login_time) from
tb_user_logs
) a
) b
) c
group
by
user_id, delta
having
count
(*) >=3;