访问咨询主题看板

学习目标

了解访问咨询主题看板需求

能够提取出需求中的指标和维度

了解访问客户量指标的分层结构

了解常见的数据格式和压缩格式

理解全量和增量

掌握Hive静态分区和动态分区的用法

了解如何配置HDFS副本数

了解Yarn的基础优化配置

了解Hive的基础优化配置

了解MR与Hive的压缩配置

能够使用Sqoop抽取全量数据

了解Hive常用的时间函数

了解Hive常用的字符串截取函数

能够编写访问量指标的清洗转换SQL

能够编写访问量指标的统计分析SQL

能够使用Sqoop导出全量数据到Mysql

掌握Shell的date命令

掌握Shell的变量替换、命令替换与数学运算语法

掌握Shell的串行与并行

能够编写sqoop导入的Shell脚本

能够使用oozie定时调度sqoop任务

理解增量清洗转换和全量的区别

理解增量统计分析和全量的区别

理解增量Sqoop导出和全量的区别

能够使用shell脚本删除mysql数据

主题需求

客户访问和咨询主题,顾名思义,分析的数据主要是客户的访问数据和咨询数据。但是经过需求调研,这里的访问数据,实际指的是访问的客户量,而不是客户访问量。原始数据来源于咨询系统的mysql业务数据库。

用户关注的核心指标有:1、总访问客户量、2、地区独立访客热力图、3、访客咨询率趋势、4、客户访问量和访客咨询率双轴趋势、5、时间段访问客户量趋势、6、来源渠道访问量占比、7、搜索来源访问量占比、8、活跃页面排行榜。


总访问客户量

说明:统计指定时间段内,访问客户的总数量。能够下钻到小时数据。

展现:线状图

指标:访问客户量

维度:年、季度、月

粒度:天

条件:年、季度、月

数据来源:咨询系统的web_chat_ems_2019_12等月表

按年:显示指定年范围内每天的客户访问量

按季度:显示指定季度范围内每天的客户访问量

按月:显示指定月份范围内每天的客户访问量

SQL:

SELECT  
count( DISTINCT ( wce.sid ) ) '总数', DATE_FORMAT( wce.create_time, '%Y-%m-%d' ) '时间'
FROM
web_chat_ems_2019_12 wce
GROUP BY
DATE_FORMAT( wce.create_time, '%Y-%m-%d' )
ORDER BY
DATE_FORMAT( wce.create_time, '%Y-%m-%d' ) ASC

地区独立访客热力图

说明:统计指定时间段内,访问客户中各区域人数热力图。能够下钻到小时数据。

展现:地图热力图

指标:按照地区聚合访问的客户数量

维度:年、季度、月

粒度:天

条件:年、季度、月

数据来源:咨询系统的web_chat_ems_2019_12等月表

按年:显示指定年范围内每天的客户访问量

按季度:显示指定季度范围内每天的客户访问量

按月:显示指定月份范围内每天的客户访问量

SQL:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT  
wce.area '区域',
count(DISTINCT(wce.sid)) '总数',
wce.country '国家',
wce.province '省份',
wce.city '城市',
DATE_FORMAT(wce.create_time,'%Y-%m-%d') '时间'
FROM
web_chat_ems_2019_12 wce
GROUP BY
DATE_FORMAT(wce.create_time,'%Y-%m-%d'),
wce.area
ORDER BY
DATE_FORMAT(wce.create_time,'%Y-%m-%d') ASC, count(DISTINCT(wce.sid)) DESC;

访客咨询率趋势

说明:统计指定时间段内,不同地区(省、市)访问的客户中发起咨询的人数占比;

咨询率=发起咨询的人数/访问客户量;客户只要与网咨客服说过至少一句话,即视为有效咨询。

展现:线状图

指标:访客咨询率

维度:年、月、城市

粒度:天

条件:年、季度、月、省、市

数据来源:咨询系统的web_chat_ems_2019_12等月表

按年:显示指定年范围内每天的客户访问量

按季度:显示指定季度范围内每天的客户访问量

按月:显示指定月份范围内每天的客户访问量

SQL:

SELECT  
CONCAT(msgNumber.num / totalNumber.num * 100, '%')
FROM
(
SELECT
count( DISTINCT ( sid ) ) num
FROM
web_chat_ems_2019_12
WHERE
msg_count >= 1
) msgNumber,
(
SELECT
count( DISTINCT ( sid ) ) num
FROM
web_chat_ems_2019_12
) totalNumber

客户访问量和访客咨询率双轴趋势

说明:统计指定时间段内,每日客户访问量/咨询率双轴趋势图。能够下钻到小时数据。

每日客户访问量可以复用指标2数据;

咨询率可以复用指标3的数据。

按年:显示指定年范围内每天的客户访问量

按季度:显示指定季度范围内每天的客户访问量

按月:显示指定月份范围内每天的客户访问量

时间段访问客户量趋势

说明:统计指定时间段内,1-24h之间,每个时间段的访问客户量。

横轴:1-24h,间隔为一小时,纵轴:指定时间段内同一小时内的总访问客户量。

展现:线状图、柱状图、饼状图

指标:某小时的总访问客户量

维度:天

粒度:区间内小时段

条件:天

数据来源:咨询系统的web_chat_ems_2019_12等月表

SQL:

SELECT  
DATE_FORMAT(wce.create_time,'%H') '时间',
count(DISTINCT(wce.sid)) '总数'
FROM
web_chat_ems_2019_12 wce
GROUP BY
DATE_FORMAT(wce.create_time,'%H')
ORDER BY
DATE_FORMAT(wce.create_time,'%H');

来源渠道访问量占比

说明:统计指定时间段内,不同来源渠道的访问客户量占比。能够下钻到小时数据。

展现:饼状图

指标:比值

维度:年、季度、月

粒度:天

条件:年、季度、月

数据来源:咨询系统的web_chat_ems_2019_12等月表

按年:显示指定年范围内每天的客户访问量

按季度:显示指定季度范围内每天的客户访问量

按月:显示指定月份范围内每天的客户访问量

SQL:

SELECT  
count( DISTINCT ( wce.sid ) ) '总数',
wce.origin_channel
FROM
web_chat_ems_2019_12 wce
GROUP BY
wce.origin_channel;

搜索来源访问量占比

说明:统计指定时间段内,不同搜索来源的访问客户量占比。能够下钻到小时数据。

展现:饼状图

指标:比值

维度:年、季度、月

粒度:天

条件:年、季度、月

数据来源:咨询系统的web_chat_ems_2019_12等月表

按年:显示指定年范围内每天的客户访问量

按季度:显示指定季度范围内每天的客户访问量

按月:显示指定月份范围内每天的客户访问量

SQL:

SELECT  
count( DISTINCT ( wce.sid ) ) '总数',
wce.seo_source
FROM
web_chat_ems_2019_12 wce
GROUP BY
wce.seo_source;

活跃页面排行榜

说明:统计指定时间段内,产生访问客户量最多的页面排行榜TOPN。能够下钻到小时数据。

展现:柱状图

指标:访问客户量

维度:页面、年、季度、月

粒度:天

条件:年、季度、月、Top数量

数据来源:咨询系统的 web_chat_text_ems_2019_11等月表

按年:显示指定年范围内每天的客户访问量

按季度:显示指定季度范围内每天的客户访问量

按月:显示指定月份范围内每天的客户访问量

SQL:

SELECT  
count( 1 ),
wcte.from_url
FROM
web_chat_text_ems_2019_11 wcte
GROUP BY
wcte.from_url
ORDER BY
count( 1 ) DESC
LIMIT 20;

原始数据结构(数据源)

访问客户量的数据来源于咨询系统的访问会话月表web_chat_ems,表名的格式为web_chat_ems_年_月,其中年份为4位数字,月份为两位数字,不足两位时前面用0补全,比如web_chat_ems_2019_07。

web_chat_text_ems表是访问附属月表,表名的格式和web_chat_ems相同。web_chat_ems和web_chat_text_ems是一一对应的,通过主键id进行关联。


建库:

create database nev default character set utf8mb4 collate utf8mb4_unicode_ci;

web_chat_ems表结构:

create table web_chat_ems_2019_07
(
id int auto_increment comment '主键' primary key,
create_date_time timestamp null comment '数据创建时间',
session_id varchar(48) default '' not null comment '会话系统sessionId',
sid varchar(48) collate utf8_bin default '' not null comment '访客id',
create_time datetime null comment '会话创建时间',
seo_source varchar(255) collate utf8_bin default '' null comment '搜索来源',
seo_keywords varchar(512) collate utf8_bin default '' null comment '关键字',
ip varchar(48) collate utf8_bin default '' null comment 'IP地址',
area varchar(255) collate utf8_bin default '' null comment '地域',
country varchar(16) collate utf8_bin default '' null comment '所在国家',
province varchar(16) collate utf8_bin default '' null comment '省',
city varchar(255) collate utf8_bin default '' null comment '城市',
origin_channel varchar(32) collate utf8_bin default '' null comment '来源渠道(广告)',
user varchar(255) collate utf8_bin default '' null comment '所属坐席',
manual_time datetime null comment '人工开始时间',
begin_time datetime null comment '坐席领取时间 ',
end_time datetime null comment '会话结束时间',
last_customer_msg_time_stamp datetime null comment '客户最后一条消息的时间',
last_agent_msg_time_stamp datetime null comment '坐席最后一下回复的时间',
reply_msg_count int(12) default 0 null comment '客服回复消息数',
msg_count int(12) default 0 null comment '客户发送消息数',
browser_name varchar(255) collate utf8_bin default '' null comment '浏览器名称',
os_info varchar(255) collate utf8_bin default '' null comment '系统名称'
);

web_chat_text_ems表结构:

create table web_chat_text_ems_2019_07
(
id int not null comment '主键' primary key,
referrer text collate utf8_bin null comment '上级来源页面',
from_url text collate utf8_bin null comment '会话来源页面',
landing_page_url text collate utf8_bin null comment '访客着陆页面',
url_title text collate utf8_bin null comment '咨询页面title',
platform_description text collate utf8_bin null comment '客户平台信息',
other_params text collate utf8_bin null comment '扩展字段中数据',
history text collate utf8_bin null comment '历史访问记录'
);

测试数据

MySQL测试数据可以通过导入已准备好的sql文件进行创建:【Home\讲义\完整原始数据\nev.sql】。

可以上传到Linux后执行,也可以在Windows下执行(需要将MySQL加入环境变量,如C:\Program Files\MySQL\MySQL Server 5.7\bin):

mysql -h 192.168.52.150 -P 3306 -uroot -p
source G:\知行教育大数据平台\讲义\完整原始数据\nev.sql


建模分析

提取指标维度

根据主题的需求,我们可以看出,包含的指标有一些是可以提取合并的:

  1. 地区独立访客热力图、总访问客户量、时间段访问客户量趋势、来源渠道访问量占比、搜索来源访问量占比、活跃页面排行榜的指标都可以合并为一个:访问客户量。

  2. 合并后的访问客户量指标,维度不同,而且数据来源也不同。

  3. 访客咨询率趋势、客户访问量和访客咨询率双轴趋势,都包含了访客咨询率指标。

  4. 访客咨询率=发起咨询的人数/访问客户量,分母访问客户量,可以复用前面的指标。因此只需要计算出分子:咨询客户量。

由此我们可以推断出,指标有两个:访问客户量和咨询客户量。

访问客户量

虽然访问客户量的时间维度只有年、季度、月,但是展示粒度要具体到天,所以统计时也要包含日维度。同时要求能够下钻到小时数据,所以维度中也要包含小时。

来源渠道访客量占比,虽然最终要的是占比的比值,但是这个比值可以通过具体的访客量计算出来,所以我们只需要提供不同来源渠道的访客量数据,前端图表就能够自动计算比值。因此这里的指标也归类于访问客户量,维度为来源渠道。

活跃页面排行榜,字面看是和页面相关的指标,实际统计的却是每个页面的访问客户量,然后再进行排序后得出的排行榜。所以这里的指标也归类于访问客户量,统计的维度是具体的页面。但是要注意这里的数据来源有变化。

维度包括:年、季度、月、天、小时(天区间内小时段)、地区、来源渠道、搜索来源、会话来源页面、总访问量。

咨询客户量

访客咨询率趋势统计中,访客咨询率=发起咨询的人数/访问客户量,分母访问客户量我们可以复制上面的指标数据,而分子咨询客户量则是我们要统计的新指标。

维度包括:年、季度、月、日、地区、来源渠道。

分层设计


访问客户量

抽取咨询原始数据到ODS层以后,如何分析出对应维度的数据呢?

我们可以采取结果导向的方式来进行倒推:

1.最终的数据维度:年、季度、月、日、小时、天区间内小时、地区、来源渠道、搜索来源、会话来源页面、总访问量;

2.首先要有DWD层对ODS原始数据进行清洗和转换,作为我们的明细数据;

3.维度我们可以分为两类:时间维度(年、季度、月、日、小时)和产品属性维度(地区、来源渠道、搜索来源、会话来源页面、总访问量);

4.我们可以将产品属性维度和最小的时间粒度(小时)来统计,作为共享表,放置在DWM层;

5.在DWM层小时数据的基础上,进行上卷sum统计(年、季度、月、日、小时),即可得到DWS层的数据集市;

6.注意:最终的数据要求在统计之前,要先根据客户进行去重,这也对我们的中间层设计形成了限制:不能简单地先按天去重count,然后再按月和年sum,因为不同天的客户可能存在重复,直接sum会导致结果不正确。例如客户A在7月1日和7月2日各访问一次,按天去重后各计1人,sum得2,但7月的真实去重客户量是1;

7.所以DWM层不能进行count。那么,如果将DWD按客户去重后的数据直接保存为中间表呢?也会存在问题:如果DWM层对全表客户去重,小时、天等更细粒度的数据就会丢失,DWS层无法再基于它统计;

8.DWS层直接根据DWD的数据进行统计,得出数据集市;

9.将宽表数据导出到mysql,由FineBI灵活选择APP数据字段进行展示。

10.因此访问客户量的分层为:ODS → DWD → DWS。

咨询客户量

1.最终的数据维度:年、季度、月、日、地区、来源渠道;

2.统计的数据和访问客户量指标相似,唯一的不同点是,多了一个条件:和客服有聊天信息;

3.因为咨询客户量的数据来源和访问客户量相同,所以ODS层可以复用;

4.DWD层对ODS原始数据进行清洗和转换,可以复用;

5.DWM层先去重,再在DWS中sum的结果是不正确的;所以跳过DWM层;

6.DWS层直接在DWD层的基础上,加上聊天信息的条件后,按照维度进行统计。

访问客户量实现

数据格式和压缩格式

数据格式

列式存储和行式存储


行存储的特点:查询满足条件的一整行(所有列)数据时,列存储需要到每个聚集存储的字段中分别找到各列的值,而行存储只需要找到其中一个值,其余的值都在相邻位置,所以此时行存储查询的速度更快。

列存储的特点:因为每个字段的数据聚集存储,在查询只需要少数几个字段的时候,能大大减少读取的数据量;每个字段的数据类型一定是相同的,列式存储可以针对性地设计更好的压缩算法。

TEXTFILE

默认格式,行式存储。可结合Gzip、Bzip2使用(系统自动检查,执行查询时自动解压),但使用这种方式,hive不会对数据进行切分,从而无法对数据进行并行操作。并且反序列化过程中,必须逐个字符判断是不是分隔符和行结束符,性能较差。

ORCFILE

使用ORC文件格式可以提高hive读、写和处理数据的能力。ORCFile是RCFile的升级版。

在ORC格式的hive表中,数据按行分块,每块按列存储。结合了行存储和列存储的优点。记录首先会被横向的切分为多个stripes,然后在每一个stripe内数据以列为单位进行存储,所有列的内容都保存在同一个文件中。

每个stripe的默认大小为256MB,相对于RCFile每个4MB的stripe而言,更大的stripe使ORC可以支持索引,数据读取更加高效。


存储和压缩结合

zlib压缩

优点:压缩率比较高;hadoop本身支持,在应用中处理gzip格式的文件就和直接处理文本一样。

缺点:压缩性能一般。

snappy压缩

优点:高速压缩速度和合理的压缩率。

缺点:压缩率比zlib要低;hadoop本身不支持,需要安装(CDH版本已自动支持,可忽略)。

系统采用的格式

因为ORCFILE的压缩快、存取快,而且拥有特有的查询优化机制,所以系统采用ORCFILE存储格式(RCFILE升级版),压缩算法采用orc支持的ZLIB和SNAPPY。

在ODS数据源层,因为数据量较大,可以采用orcfile+ZLIB的方式,以节省磁盘空间;

而在计算的过程中(DWD、DWM、DWS、APP),为了不影响执行的速度,可以浪费一点磁盘空间,采用orcfile+SNAPPY的方式,提升hive的执行速度。

存储空间足够的情况下,推荐采用SNAPPY压缩。

全量和增量

开发步骤共包含两大过程:全量过程和增量过程。

全量过程

全量过程是在首次建库时,需要对OLTP应用中的全量数据进行采集、清洗和统计计算。历史数据量可能会非常大,远远超出了增量过程。在执行时需要进行针对性的优化配置并采用分批执行。

增量过程

增量过程是在全量过程之后进行的,大多采用的是T+1模式。

全量执行完毕后,对OLTP每天的新增数据和更新数据要进行同步,如果还是对全量数据进行分析,效率会非常低下。增量数据只有一天的量,采集、清洗和统计的效率相对于全量过程会有很大提升。

什么是T+1?

这种说法来源于股票交易:

T+0,是国际上普遍使用的一种证券(或期货)交易制度。凡在证券(或期货)成交日当天办理好证券(或期货)和价款清算交割手续的交易制度,就称为T+0交易。通俗地说,就是当天买入的证券(或期货)当天就可以卖出。

T+1是一种股票交易制度,即当日买进的股票,要到下一个交易日才能卖出。“T”指交易登记日,“T+1”指登记日的次日。

Hive的分区

我们知道传统的OLTP数据库一般都具有索引和表分区的功能,通过表分区能够在特定的区域检索数据,减少扫描成本,在一定程度上提高查询效率,我们还可以通过建立索引进一步提升查询效率。在Hive数仓中也有索引和分区的概念。

为了对表进行合理的管理以及提高查询效率,Hive可以将表组织成“分区”。

分区是表的部分列的集合,可以为频繁使用的数据建立分区,这样查找分区中的数据时就不需要扫描全表,这对于提高查找效率很有帮助。

分区是一种根据“分区列”(partition column)的值对表进行粗略划分的机制。Hive中每个分区对应表下的一个子目录,所有的数据按照分区列的值放入不同的子目录中。

为什么要分区

庞大的数据集可能需要耗费大量的时间去处理。在许多场景下,可以通过分区的方法减少每一次扫描总数据量,这种做法可以显著地改善性能。

数据会依照单个或多个列进行分区,通常按照时间、地域或者是商业维度进行分区。

比如电影表,分区的依据可以是电影的种类和评级,另外,按照拍摄时间划分可能会得到均匀的结果。

为了达到性能表现的一致性,对不同列的划分应该让数据尽可能均匀分布。最好的情况下,分区的划分条件总是能够对应where语句的部分查询条件,这样才能充分利用分区带来的性能优势。


Hive的分区使用HDFS的子目录功能实现。每一个子目录包含了分区对应的列名和每一列的值。但是由于HDFS并不支持大量的子目录,这也给分区的使用带来了限制。我们有必要对表中的分区数量进行预估,从而避免因为分区数量过大带来一系列问题。

Hive查询通常使用分区的列作为查询条件。这样的做法可以指定MapReduce任务在HDFS中指定的子目录下完成扫描的工作。HDFS的文件目录结构可以像索引一样高效利用。
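比如下面这个查询(示意,其中logs为假设的按dt分区的表),where条件命中分区列,Hive只需扫描dt='2019-07-25'对应的子目录,而不必全表扫描:

-- logs为假设的按dt分区的示例表
select count(*) from logs where dt = '2019-07-25';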

Hive(Inceptor)分区包括静态分区和动态分区。

静态分区

根据插入时是否需要手动指定分区,可以分为:
静态分区:导入数据时需要手动指定分区;
动态分区:导入数据时,系统可以动态判断目标分区。

1. 创建静态分区

直接在 PARTITIONED BY 后面跟上分区键、类型即可。(分区键不能和任何列重名)

语法:

CREATE [EXTERNAL] TABLE <table_name>
(<col_name> <data_type> [, <col_name> <data_type> ...])
-- 指定分区键和数据类型
PARTITIONED BY (<partition_key> <data_type>, ...)
[ROW FORMAT <row_format>]
[STORED AS TEXTFILE|ORC|CSVFILE]
[LOCATION '<file_path>']
[TBLPROPERTIES ('<property_name>'='<property_value>', ...)];

栗子:

--分区字段主要是时间,按年分区
CREATE TABLE device_open (
deviceid varchar(50),
...
)
PARTITIONED BY (year varchar(50))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2. 写入数据

语法:

-- 覆盖写入
INSERT OVERWRITE TABLE <table_name>
PARTITION (<partition_key>=<partition_value>[, <partition_key>=<partition_value>, ...])
SELECT <select_statement>;

-- 追加写入
INSERT INTO TABLE <table_name>
PARTITION (<partition_key>=<partition_value>[, <partition_key>=<partition_value>, ...])
SELECT <select_statement>;

栗子:

insert overwrite table device_open partition(year='2020')
select
...,
original_device_open.month as month,
original_device_open.day as day,
original_device_open.hour as hour
FROM original_device_open
动态分区

1. 创建

创建方式与静态分区表完全一样。

语法:

--分区字段主要是时间,分为年,月,日,时
CREATE TABLE device_open (
deviceid varchar(50),
...
)
PARTITIONED BY (year varchar(50), month varchar(50), day varchar(50), hour varchar(50))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2. 写入

动态分区只需要给出分区键名称。

语法:

-- 开启动态分区支持,并开启非严格模式
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table device_open partition(year,month,day,hour)
select
...,
original_device_open.year as year,
original_device_open.month as month,
original_device_open.day as day,
original_device_open.hour as hour
FROM original_device_open

set hive.exec.dynamic.partition=true; 是开启动态分区

set hive.exec.dynamic.partition.mode=nonstrict; 这个属性默认值是strict,就是要求分区字段必须有一个是静态的分区值。全部动态分区插入,需要设置为nonstrict非严格模式。

代码中的 partition(year,month,day,hour) 就是要动态插入的分区。对于大批量数据的插入分区,动态分区相当方便。

静态分区和动态分区混用

一张表可同时被静态和动态分区键分区,只是动态分区键需要放在静态分区键的后面(因为HDFS上的动态分区目录下不能包含静态分区的子目录)。

静态分区键要用 = 指定分区值;动态分区键只需要给出分区键名称。

spk 即静态分区static partition key, dpk 即动态分区dynamic partition key。

比如:

insert overwrite table device_open partition(year='2017',month='05',day,hour)
select
...,
original_device_open.day as day,
original_device_open.hour as hour
FROM original_device_open
where original_device_open.year='2017' and original_device_open.month='05'

partition(year='2017', month='05', day, hour)中,year和month是静态分区字段,day和hour是动态分区字段,这里指将2017年5月份的数据插入分区表,对应底层的物理操作就是将2017年5月份的数据load到hdfs上对应2017年5月份下的所有day和hour目录中去。

注意混用的情况下,静态分区的上层必须也是静态分区,如果partition(year, month, day='05', hour='08'),则会报错:FAILED: SemanticException [Error 10094]: Line 1:50 Dynamic partition cannot be the parent of a static partition 'day'。

有序动态分区

注意,如果个人电脑性能不好,出现因为动态分区而导致的内存溢出问题,可以设置hive.optimize.sort.dynamic.partition进行避免:


设置为true后,当启用动态分区时,reducer仅随时保持一个记录写入程序,从而降低对 reducer产生的内存压力。但同时也会使查询性能变慢。
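在Hive会话中可以按如下方式开启该优化(示例):

set hive.optimize.sort.dynamic.partition=true;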

动态分区的其他相关属性(如hive.exec.max.dynamic.partitions、hive.exec.max.dynamic.partitions.pernode、hive.exec.max.created.files等)也可以根据服务器配置调整,后文清洗转换部分会用到。

建模

指标和维度

指标:访问客户量,即单位时间内访问网站的去重客户数量,以天为粒度展示。

维度:

  • 时间维度:年、季度、月、天、小时

  • 业务属性维度:地区、来源渠道、搜索来源、会话来源页面、总访问量。

事实表和维度表

事实表的数据就是指标数据,访问客户量指标的事实表就是我们的客户访问表。而维度数据都包含在事实表中,没有需要额外关联的维度表。

分层

数据库命名统一加上前缀itcast,在实际场景中,将此前缀替换为系统的简称即可。比如:edu_ods、edu_dwd、edu_dws等。

ODS层是原始数据,一般不允许修改,所以使用外部表保证数据的安全性,避免误删除;DW和APP层是统计数据,为了使覆盖插入等操作更方便,满足业务需求的同时,提高开发和测试效率,推荐使用内部表。


ODS层

从咨询系统OLTP数据库的web_chat_ems_20XX_XX等月表中抽取的原始数据;

离线数仓大多数的场景都是T+1,为了便于后续的DW层清洗数据时,快速获取昨天的数据,ODS模型要在原始mysql表的基础之上增加starts_time抽取日期字段,并且可以使用starts_time字段分区以提升查询的性能。

建库
CREATE DATABASE IF NOT EXISTS `itcast_ods`;
建表web_chat_ems

建表时,要注意字段名不要采用关键字,比如原始mysql表中有一个user字段,我们需要将它修改为user_match。

注意,设置ORC压缩格式前一定要先设置hive.exec.orc.compression.strategy,否则压缩不生效:

--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_ems (
id INT comment '主键',
create_date_time STRING comment '数据创建时间',
session_id STRING comment '七陌sessionId',
sid STRING comment '访客id',
create_time STRING comment '会话创建时间',
seo_source STRING comment '搜索来源',
seo_keywords STRING comment '关键字',
ip STRING comment 'IP地址',
area STRING comment '地域',
country STRING comment '所在国家',
province STRING comment '省',
city STRING comment '城市',
origin_channel STRING comment '投放渠道',
user_match STRING comment '所属坐席',
manual_time STRING comment '人工开始时间',
begin_time STRING comment '坐席领取时间 ',
end_time STRING comment '会话结束时间',
last_customer_msg_time_stamp STRING comment '客户最后一条消息的时间',
last_agent_msg_time_stamp STRING comment '坐席最后一下回复的时间',
reply_msg_count INT comment '客服回复消息数',
msg_count INT comment '客户发送消息数',
browser_name STRING comment '浏览器名称',
os_info STRING comment '系统名称')
comment '访问会话信息表'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');
建表web_chat_text_ems
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_text_ems (
id INT COMMENT '主键来自MySQL',
referrer STRING comment '上级来源页面',
from_url STRING comment '会话来源页面',
landing_page_url STRING comment '访客着陆页面',
url_title STRING comment '咨询页面title',
platform_description STRING comment '客户平台信息',
other_params STRING comment '扩展字段中数据',
history STRING comment '历史访问记录'
) comment 'EMS-PV测试表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_text_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');
DWD层

维度:

  • 时间维度:年、季度、月、天、小时

  • 业务属性维度:地区、来源渠道、搜索来源、会话来源页面、总访问量。

建库:

CREATE DATABASE IF NOT EXISTS `itcast_dwd`
WITH DBPROPERTIES ( 'creator' = 'jiale', 'create_date' = '2020-05-05');

将ODS层数据,进行清洗转换,并且将web_chat_ems主表和web_chat_text_ems附表的内容根据id合并在一起。数据粒度保持不变。

数据清洗:空数据、不满足业务需求的数据处理。

数据转换:数据格式和数据形式的转换,比如时间类型可以转换为同样的展现形式“yyyy-MM-dd HH:mm:ss”或者时间戳类型,金钱类型的数据可以统一转换为以元为单位或以分为单位的数值。

create table if not exists itcast_dwd.visit_consult_dwd(
session_id STRING comment '七陌sessionId',
sid STRING comment '访客id',
create_time bigint comment '会话创建时间',
seo_source STRING comment '搜索来源',
ip STRING comment 'IP地址',
area STRING comment '地域',
msg_count int comment '客户发送消息数',
origin_channel STRING COMMENT '来源渠道',
referrer STRING comment '上级来源页面',
from_url STRING comment '会话来源页面',
landing_page_url STRING comment '访客着陆页面',
url_title STRING comment '咨询页面title',
platform_description STRING comment '客户平台信息',
other_params STRING comment '扩展字段中数据',
history STRING comment '历史访问记录',
hourinfo string comment '小时',
quarterinfo string comment '季度'
)
comment '访问咨询DWD表'
partitioned by(yearinfo String, monthinfo String, dayinfo string)
row format delimited fields terminated by '\t'
stored as orc
location '/user/hive/warehouse/itcast_dwd.db/visit_consult_dwd'
tblproperties ('orc.compress'='SNAPPY');
DWS层

在DWD层的基础上,按照业务的要求进行统计分析;维度分为时间和业务属性两类,可以在模型中增加对应的属性标识:

  • 时间维度:1.年、2.季度、3.月、4.天、5.小时

  • 业务属性维度:1.地区、2.来源渠道、3.搜索来源、4.会话来源页面、5.总访问量

建库:

CREATE DATABASE IF NOT EXISTS `itcast_dws`
WITH DBPROPERTIES ( 'creator' = 'kongshuai', 'create_date' = '2020-05-05');
CREATE TABLE IF NOT EXISTS itcast_dws.visit_dws (
sid_total INT COMMENT '根据sid去重求count',
sessionid_total INT COMMENT '根据sessionid去重求count',
ip_total INT COMMENT '根据IP去重求count',
area STRING COMMENT '区域信息',
seo_source STRING COMMENT '搜索来源',
origin_channel STRING COMMENT '来源渠道',
hourinfo STRING COMMENT '创建时间,统计至小时',
quarterinfo STRING COMMENT '季度',
time_str STRING COMMENT '时间明细',
from_url STRING comment '会话来源页面',
groupType STRING COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.总访问量',
time_type STRING COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;')
comment 'EMS访客日志dws表'
PARTITIONED BY(yearinfo STRING,monthinfo STRING,dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_dws.db/visit_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');
APP层

如果用户需要具体的报表展示,可以针对不同的报表页面设计APP层结构,然后导出至OLAP系统的mysql中。此系统使用FineBI,需要通过宽表来进行灵活的展现。因此APP层不再进行细化。直接将DWS层导出至mysql即可。

Hive参数优化(基础)

此课程中关于Hive的优化,皆是基于Hive2.x的版本,对于Hive1.x旧版本的优化机制不再复述(新版本已改善或变更)。另外新版本中默认为开启状态的优化配置项,在工作中无需修改,也不再复述。

HDFS副本数

dfs.replication(HDFS)

文件副本数,通常设为3,不推荐修改。

如果测试环境只有两台虚拟机(2个datanode节点),此值要修改为2。
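对于已经写入HDFS的文件,也可以用hdfs命令单独调整副本数(示例,目录路径为假设值):

# -w 表示等待副本调整完成后再返回
hdfs dfs -setrep -w 2 /user/hive/warehouse/itcast_ods.db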


Yarn基础配置

NodeManager配置
CPU配置

配置项:yarn.nodemanager.resource.cpu-vcores

表示该节点服务器上yarn可以使用的虚拟CPU个数,默认值是8,推荐将该值配置为与物理CPU线程数相同;如果节点CPU核心不足8个,要调小这个值,yarn不会智能地去检测物理核心数。


查看 CPU 线程数:

grep 'processor' /proc/cpuinfo | sort -u | wc -l


内存配置

配置项:yarn.nodemanager.resource.memory-mb

设置该nodemanager节点上可以为容器分配的总内存,默认为8G,如果节点内存资源不足8G,要减少这个值,yarn不会智能的去检测内存资源,一般按照服务器剩余可用内存资源进行配置。生产上根据经验一般要预留15-20%的内存,那么可用内存就是实际内存*0.8,比如实际内存是64G,那么64*0.8=51.2G,我们设置成50G就可以了(固定经验值)。


通过CM所有主机查看剩余内存:

可以看到第一台剩余内存为31.3-4.1=27.2G。


注意,要同时设置yarn.scheduler.maximum-allocation-mb为一样的值,yarn.app.mapreduce.am.command-opts(JVM内存)的值要同步修改为略小的值(-Xmx1024m)。

本地目录

yarn.nodemanager.local-dirs(Yarn)

NodeManager 存储中间数据文件的本地文件系统中的目录列表。

如果单台服务器上有多个磁盘挂载,则配置的值应当是分布在各个磁盘上目录,这样可以充分利用节点的IO读写能力。


MapReduce内存配置

当MR内存溢出时,可以根据服务器配置进行调整。

mapreduce.map.memory.mb

为作业的每个 Map 任务分配的物理内存量(MiB),默认为0,自动判断大小。

mapreduce.reduce.memory.mb

为作业的每个 Reduce 任务分配的物理内存量(MiB),默认为0,自动判断大小。

mapreduce.map.java.opts、mapreduce.reduce.java.opts

Map和Reduce的JVM配置选项。

注意:

mapreduce.map.java.opts一定要小于mapreduce.map.memory.mb;

mapreduce.reduce.java.opts一定要小于mapreduce.reduce.memory.mb,格式-Xmx4096m。

注意:

此部分所有配置均不能大于Yarn的NodeManager内存配置。
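例如按4GB容器配置时,可以参考如下会话级设置(示例值,需按集群实际资源调整;java.opts按惯例取memory.mb的80%左右):

set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;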


Hive基础配置

HiveServer2 的 Java 堆栈

现象:HiveServer2异常退出,导致连接失败。


解决方法:修改HiveServer2 的 Java 堆栈大小。


动态生成分区的线程数

hive.load.dynamic.partitions.thread

用于加载动态生成的分区的线程数。加载时需要将文件重命名为它的最终位置,并更新关于新分区的一些元数据。默认值为15。

当有大量动态生成的分区时,增加这个值可以提高性能。根据服务器配置修改。


监听输入文件线程数

hive.exec.input.listing.max.threads

Hive用来监听输入文件的最大线程数。默认值:15。

当需要读取大量分区时,增加这个值可以提高性能。根据服务器配置进行调整。
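这两个线程数也可以在Hive会话中按需调大(示例值,需根据服务器核数调整):

set hive.load.dynamic.partitions.thread=30;
set hive.exec.input.listing.max.threads=30;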


压缩配置

Map输出压缩

除了建表时指定数据存储的压缩格式外,在查询分析过程中,Map的输出也可以进行压缩。由于map任务的输出需要写到磁盘并通过网络传输到reducer节点,所以使用LZO、LZ4或者Snappy这样的快速压缩方式,可以减少需要传输的数据量,从而获得性能提升。

MapReduce配置项:

• mapreduce.map.output.compress

设置是否启动map输出压缩,默认为false。在需要减少网络传输的时候,可以设置为true。

• mapreduce.map.output.compress.codec

设置map输出压缩编码解码器,默认为org.apache.hadoop.io.compress.DefaultCodec,推荐使用SnappyCodec:org.apache.hadoop.io.compress.SnappyCodec。
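在Hive会话中开启map输出压缩的示例:

set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;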


Reduce结果压缩

是否对任务输出结果压缩,默认值false。对传输数据进行压缩,既可以减少文件的存储空间,又可以加快数据在网络不同节点之间的传输速度。

配置项:

1. mapreduce.output.fileoutputformat.compress

是否启用 MapReduce 作业输出压缩。

2. mapreduce.output.fileoutputformat.compress.codec

指定要使用的压缩编码解码器,推荐SnappyCodec。

3. mapreduce.output.fileoutputformat.compress.type

指定MapReduce作业输出的压缩方式,默认值RECORD,可配置值有:NONE、RECORD、BLOCK。推荐使用BLOCK,即针对一组记录进行批量压缩,压缩效率更高。
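按推荐值设置的示例:

set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;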


Hive执行过程通用压缩设置

主要包括压缩/解码器设置和压缩方式设置:

• mapreduce.output.fileoutputformat.compress.codec(Yarn)

作业输出所用的压缩编码解码器,默认为org.apache.hadoop.io.compress.DefaultCodec;推荐使用SnappyCodec:org.apache.hadoop.io.compress.SnappyCodec。

• mapreduce.output.fileoutputformat.compress.type

输出任务数据的压缩方式,默认值RECORD,可配置值有:NONE、RECORD、BLOCK。推荐使用BLOCK,即针对一组记录进行批量压缩,压缩效率更高。


Hive多个Map-Reduce中间数据压缩

控制 Hive 在多个map-reduce作业之间生成的中间文件是否被压缩。压缩编解码器和其他选项由上面Hive通用压缩mapreduce.output.fileoutputformat.compress.*确定。

set hive.exec.compress.intermediate=true;

Hive最终结果压缩

控制是否压缩查询的最终输出(到 local/hdfs 文件或 Hive table)。压缩编解码器和其他选项由 上面Hive通用压缩mapreduce.output.fileoutputformat.compress.*确定。

set hive.exec.compress.output=true;

其他

JVM重用(不再支持)

随着Hadoop版本的升级,已自动优化了JVM重用选项,MRv2开始不再支持JVM重用。(旧版本配置项:mapred.job.reuse.jvm.num.tasks、mapreduce.job.jvm.numtasks)

Hive执行引擎(了解)

CDH支持的引擎包括MapReduce和Spark两种,可自由选择,Spark不一定比MR快,Hive2.x和Hadoop3.x经过多次优化,Hive-MR引擎的性能已经大幅提升。

配置项:hive.execution.engine
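切换执行引擎的示例(可选值为mr或spark):

set hive.execution.engine=spark;
-- 或
set hive.execution.engine=mr;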


CDH默认不支持Tez引擎:https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_620_unsupported_features.html


全量流程

OLTP原始数据(MySQL)→ 数据采集(ODS)→ 清洗转换(DWD)→ 统计分析(DWS)→ 导出至OLAP(MySQL)。


数据采集

web_chat_ems表
SQL:
select id,
create_date_time,
session_id,
sid,
create_time,
seo_source,
seo_keywords,
ip,
area,
country,
province,
city,
origin_channel,
user as user_match,
manual_time,
begin_time,
end_time,
last_customer_msg_time_stamp,
last_agent_msg_time_stamp,
reply_msg_count,
msg_count,
browser_name,
os_info,
"2019-07-01" as starts_time
from web_chat_ems_2019_07;
Sqoop:
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'select id, create_date_time, session_id, sid, create_time, seo_source, seo_keywords, ip, area, country, province, city, origin_channel, user as user_match, manual_time, begin_time, end_time, last_customer_msg_time_stamp, last_agent_msg_time_stamp, reply_msg_count, msg_count, browser_name, os_info, "2019-07-01" as starts_time from web_chat_ems_2019_07 where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
-m 100 \
--split-by id

-m 100,指的是使用100个MapReduce任务并行处理;

而split-by参数,是指以哪个字段为基础进行分割。
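Sqoop会先对split-by字段执行一个边界查询,取得最小值和最大值后,再把区间均分给各个map任务,大致相当于:

select min(id), max(id) from web_chat_ems_2019_07;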

web_chat_text_ems表
SQL
select id,
referrer,
from_url,
landing_page_url,
url_title,
platform_description,
other_params,
history,
"2019-07-01" as start_time
from web_chat_text_ems_2019_07;
Sqoop
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'select id,referrer,from_url,landing_page_url,url_title,platform_description,other_params,history, "2019-07-01" as start_time from web_chat_text_ems_2019_07 where $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
-m 100 \
--split-by id

数据清洗转换

时间函数
unix_timestamp()日期转为时间戳

1.unix_timestamp() 获取当前时间戳

例如:

select unix_timestamp()   --1565858389

2.unix_timestamp(string timestamp) 输入的时间字符串格式必须为'yyyy-MM-dd HH:mm:ss',如不符合则返回null

例如:

select unix_timestamp('2019-08-15 16:40:00')   --1565858400
select unix_timestamp('2019-08-15') --null

3.unix_timestamp(string date, string pattern) 按照指定的pattern将时间字符串转化成unix时间戳,如不符合则返回null

例如:

select unix_timestamp('2019-08-15','yyyy-MM-dd')   --1565798400
select unix_timestamp('2019-08-15 16:40:00','yyyy-MM-dd HH:mm:ss') --1565858400
select unix_timestamp('2019-08-15','yyyy-MM-dd HH:mm:ss') --null
from_unixtime()时间戳转为日期
  1. from_unixtime(bigint unixtime, string format) 将秒级时间戳转化为时间字符串,可通过format规定的时间格式指定输出格式。其中unixtime是10位的秒级时间戳值,13位的毫秒级时间戳不能直接使用。

例如:

select from_unixtime(1565858389,'yyyy-MM-dd HH:mm:ss')  --2019-08-15 16:39:49
select from_unixtime(1565858389,'yyyy-MM-dd') --2019-08-15
  2. 如果unixtime为13位的毫秒值,需要先转成10位
select from_unixtime(cast(1553184000488/1000 as int),'yyyy-MM-dd HH:mm:ss')   --2019-03-22 00:00:00
select from_unixtime(cast(substr(1553184000488,1,10) as int),'yyyy-MM-dd HH:mm:ss') --2019-03-22 00:00:00
当前时间
select from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss')   -- 2019-08-15 17:18:55
获取时间所在的季度
select quarter('2015-04-08')    --2

2015-04-08所在的季度为当年的第二个季度。

字符串截取函数

大多数数据库中都有substr和substring两种字符串截取函数。但与其他的关系型数据库不同,在hive中,substr与substring函数的使用方式是完全一致的,属于同一个函数。

两个参数

语法:substr(string A, int start),substring(string A, int start)

返回值: string

说明:返回字符串A从start位置到结尾的字符串

栗子:

select substr('2020-06-06', 6), substring('2020-06-06', 6);     --06-06  06-06
三个参数

语法:substr(string A, int start, int len),substring(string A, int start, int len)

返回值: string

说明:返回字符串A从start位置开始,长度为len的字符串。

栗子:

select substr('2020-06-06', 6,2), substring('2020-06-06', 6,2);     --06  06
分析

从ODS层到DWD层,数据粒度是一致的,并且要保证数据的质量。主要做两件事:

  1. 数据清洗:空数据、不满足业务需求的数据处理

对于访问客户量指标,已知的原始数据是经过咨询业务系统严格清洗过的数据,所以此处可以省略清洗过程。

  2. 数据转换:数据格式和数据形式的转换,比如时间类型可以转换为同样的展现形式“yyyy-MM-dd HH:mm:ss”或者时间戳类型,金钱类型的数据可以统一转换为以元为单位或以分为单位的数值。
原始字段 | 清洗转换 | 目标字段
msg_count(STRING) | 类型由STRING转为INT;如果值为空,则转换为0 | msg_count(INT)
create_time(STRING) | 转为时间戳类型 | create_time(BIGINT)
create_time(STRING) | 切割年份 | yearinfo(STRING)
create_time(STRING) | 切割月份 | monthinfo(STRING)
create_time(STRING) | 切割日 | dayinfo(STRING)
create_time(STRING) | 切割小时 | hourinfo(STRING)
create_time(STRING) | 获取季度 | quarterinfo(STRING)
代码
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;


insert into table itcast_dwd.visit_consult_dwd partition (yearinfo, monthinfo, dayinfo)
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time, 'yyyy-MM-dd HH:mm:ss.SSS') as create_time,
wce.seo_source,
wce.ip,
wce.area,
cast(if(wce.msg_count is null, 0, wce.msg_count) as int) as msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
substr(wce.create_time, 12, 2) as hourinfo,
quarter(wce.create_time) as quarterinfo,
substr(wce.create_time, 1, 4) as yearinfo,
substr(wce.create_time, 6, 2) as monthinfo,
substr(wce.create_time, 9, 2) as dayinfo
from itcast_ods.web_chat_ems wce inner join itcast_ods.web_chat_text_ems wcte
on wce.id = wcte.id;
问题

过多的动态分区会导致如下错误

Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100

解决:

set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;

Hive动态分区创建文件数过多错误:

[Fatal Error] total number of created files now is 100385, which exceeds 100000. Killing the job.

解决:

set hive.exec.max.created.files=150000;

问题2:


Execution failed with exit status: 137

出现原因:

可能是由于服务器内存不足引起的:在进行mapjoin时,内存不足以放下小表中的数据。

解决方案:不让其进行mapjoin优化操作。

-- 关闭自动装载(尤其是对于内存比较小的机器)
set hive.auto.convert.join= false;

统计分析

分析

DWD层之后是DWM中间层和DWS业务层。回顾建模分析阶段,我们已经得到了指标相关的维度:年、季度、月、天、小时、地区、来源渠道、页面。分两大类:

• 时间维度:年、季度、月、天、小时

• 业务属性维度:地区、来源渠道、页面、总访问量。

在DWS层按照不同维度使用count+distinct来统计指标,形成宽表。

空值处理

事实表中的维度关联键不能存在空值,关联的维度信息必须用代理键(-1)而不是空值表示未知的条件。
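例如在统计SQL中,可以用nvl对可能为空的维度字段做兜底(示意):

select nvl(area, '-1') as area, count(distinct sid) as sid_total
from itcast_dwd.visit_consult_dwd
group by nvl(area, '-1');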

代码

我们的维度一共有两大类:时间维度和产品属性维度,在DWS层我们可以产出一个宽表,将所有维度的数据都生成出来,供APP层和OLAP应用来使用。

地区分组

统计地区维度时,需要设置产品属性类型groupType为1(地区),同时将其他产品属性设置为-1(搜索来源、来源渠道、会话来源页面),便于团队理解,减少自己和团队出错率的同时也降低了沟通成本。

在insert sql中,尽量为查询出的字段加上别名,特别是字段多的表,便于识别。

小时维度:

--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by area, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

天维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by area, yearinfo, quarterinfo, monthinfo, dayinfo;

月维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by area, yearinfo, quarterinfo, monthinfo;

季度维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by area, yearinfo, quarterinfo;

年维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
wce.area as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'1' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
group by wce.area,wce.yearinfo;
搜索来源分组

小时维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by seo_source, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

天维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by seo_source, yearinfo, quarterinfo, monthinfo, dayinfo;

月维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by seo_source, yearinfo, quarterinfo, monthinfo;

季度维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by seo_source, yearinfo, quarterinfo;

年维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'2' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
group by wce.seo_source,wce.yearinfo;
来源渠道分组

小时维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by origin_channel, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

天维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by origin_channel, yearinfo, quarterinfo, monthinfo, dayinfo;

月维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by origin_channel, yearinfo, quarterinfo, monthinfo;

季度维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by origin_channel, yearinfo, quarterinfo;

年维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'3' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
group by wce.origin_channel,wce.yearinfo;
会话来源页面分组

小时维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
from_url,
'4' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by from_url, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

天维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
from_url,
'4' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by from_url, yearinfo, quarterinfo, monthinfo, dayinfo;

月维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
from_url,
'4' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by from_url, yearinfo, quarterinfo, monthinfo;

季度维度:

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
from_url,
'4' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by from_url, yearinfo, quarterinfo;

年维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
from_url,
'4' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
group by wce.from_url,wce.yearinfo;
总访问量

小时(小时段区间的基础数据)

因为小时段数据可以直接sum求和,因此OLAP应用可以在小时数据基础上,进行简单的sum操作以获取到区间小时段数据。
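例如要取某天8-12点区间的访问客户量,可以在导出后的MySQL宽表上直接对小时数据求和(示意,itcast_visit为后文导出章节创建的表):

select sum(sid_total)
from itcast_visit
where groupType = '5' and time_type = '1'
and yearinfo = '2019' and monthinfo = '07' and dayinfo = '01'
and hourinfo between '08' and '12';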

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

天
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo, quarterinfo, monthinfo, dayinfo;

月
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo, quarterinfo, monthinfo;

季度

insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo, quarterinfo;

年
INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'5' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
group by wce.yearinfo;

导出数据

创建mysql表
create database scrm_bi default character set utf8mb4 collate utf8mb4_general_ci;

CREATE TABLE `itcast_visit` (
sid_total int(11) COMMENT '根据sid去重求count',
sessionid_total int(11) COMMENT '根据sessionid去重求count',
ip_total int(11) COMMENT '根据IP去重求count',
area varchar(32) COMMENT '区域信息',
seo_source varchar(32) COMMENT '搜索来源',
origin_channel varchar(32) COMMENT '来源渠道',
hourinfo varchar(32) COMMENT '小时信息',
quarterinfo varchar(32) COMMENT '季度',
time_str varchar(32) COMMENT '时间明细',
from_url varchar(32) comment '会话来源页面',
groupType varchar(32) COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.总访问量',
time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
yearinfo varchar(32) COMMENT '年信息',
monthinfo varchar(32) COMMENT '月信息',
dayinfo varchar(32) COMMENT '日信息'
);
执行sqoop导出脚本
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password '123456' \
--table itcast_visit \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
-m 100
MR错误日志

执行错误:


在hue作业中找到application_1591389362937_0085:


查看具体错误信息:


原因是from_url字段长度不够,修改后再次执行:

drop table itcast_visit;

CREATE TABLE `itcast_visit` (
sid_total int(11) COMMENT '根据sid去重求count',
sessionid_total int(11) COMMENT '根据sessionid去重求count',
ip_total int(11) COMMENT '根据IP去重求count',
area varchar(32) COMMENT '区域信息',
seo_source varchar(32) COMMENT '搜索来源',
origin_channel varchar(32) COMMENT '来源渠道',
hourinfo varchar(32) COMMENT '小时信息',
quarterinfo varchar(32) COMMENT '季度',
time_str varchar(32) COMMENT '时间明细',
from_url varchar(2083) comment '会话来源页面',
groupType varchar(32) COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.总访问量',
time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
yearinfo varchar(32) COMMENT '年信息',
monthinfo varchar(32) COMMENT '月信息',
dayinfo varchar(32) COMMENT '日信息'
)ENGINE=InnoDB AUTO_INCREMENT=22 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

再次执行sqoop脚本,执行成功。

增量流程

数据采集

增量更新和全量更新区别是,增量更新采用的是T+1模式,分析的数据只有一天的量。

  1. Sql中需要增加where条件,只查询昨天一天的数据,而不是所有表数据。

  2. Sqoop指定分区:增量更新每次只更新一天的数据,也可以使用静态分区方式导入。通过指定hive-partition-key和hive-partition-value两个参数实现。hive-partition-key用来指定分区字段名,hive-partition-value用来指定分区的值,也就是昨天的日期(今天0点采集的数据属于昨天)。

造数据

在mysql中执行:

# 备份表
create table web_chat_ems_2019_07_back
select *
from web_chat_ems_2019_07;

# 将2019-07-01变更为2019-07-25,模拟新数据
update web_chat_ems_2019_07
set create_time = concat('2019-07-25 ', substr(create_time, 12))
where create_time between '2019-07-01 00:00:00' and '2019-07-02 00:00:00';
SQL脚本
web_chat_ems表
select id,
create_date_time,
session_id,
sid,
create_time,
seo_source,
seo_keywords,
ip,
area,
country,
province,
city,
origin_channel,
user as user_match,
manual_time,
begin_time,
end_time,
last_customer_msg_time_stamp,
last_agent_msg_time_stamp,
reply_msg_count,
msg_count,
browser_name,
os_info,
"2019-07-25" as starts_time
from web_chat_ems_2019_07
where create_time between "2019-07-25 00:00:00" and "2019-07-25 23:59:59" and $CONDITIONS;
web_chat_text_ems表

副表没有创建日期字段,通过关联主表,使用主表的日期字段来进行判断。

select id,
referrer,
from_url,
landing_page_url,
url_title,
platform_description,
other_params,
history,
"2019-07-25" as start_time
from web_chat_text_ems_2019_07 wcte,
(select id as wce_id, create_time
from web_chat_ems_2019_07
where create_time between "2019-07-25 00:00:00" and "2019-07-25 23:59:59") wce
where wcte.id = wce.wce_id
and $CONDITIONS;
sqoop脚本
web_chat_ems表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query 'select id,
create_date_time,
session_id,
sid,
create_time,
seo_source,
seo_keywords,
ip,
area,
country,
province,
city,
origin_channel,
user as user_match,
manual_time,
begin_time,
end_time,
last_customer_msg_time_stamp,
last_agent_msg_time_stamp,
reply_msg_count,
msg_count,
browser_name,
os_info,
"2019-07-25" as starts_time
from web_chat_ems_2019_07
where create_time between "2019-07-25 00:00:00" and "2019-07-25 23:59:59" and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
--hive-partition-key starts_time \
--hive-partition-value 2019-07-25 \
-m 100 \
--split-by id
web_chat_text_ems表
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query '
select id,
referrer,
from_url,
landing_page_url,
url_title,
platform_description,
other_params,
history,
"2019-07-25" as start_time
from web_chat_text_ems_2019_07 wcte,
(select id as wce_id, create_time
from web_chat_ems_2019_07
where create_time between "2019-07-25 00:00:00" and "2019-07-25 23:59:59") wce
where wcte.id = wce.wce_id
and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
--hive-partition-key start_time \
--hive-partition-value 2019-07-25 \
-m 100 \
--split-by id
shell脚本
date命令

1.获取今天的日期

date
date +%Y%m%d

2.指定日期获取内容 -d 或 --date=

#获取指定日期的年月日格式输出
date -d "2014-11-12" +%Y%m%d
#获取指定日期的星期(周几)格式输出
date --date="2014-11-23" +%w

3.日期加减 -d 或 --date=

#获取上周日期(day,month,year,hour)
date -d "-1 week" +%Y%m%d
#获取昨天日期
date -d '-1 day' "+%Y-%m-%d"
date --date="-24 hour" +%Y%m%d
日期格式化规则:
%a 当前域的星期缩写 (Sun..Sat)
%A 当前域的星期全写 (Sunday..Saturday)
%b 当前域的月份缩写(Jan..Dec)
%B 当前域的月份全称 (January..December)
%d 两位的天 (01..31)
%D 短时间格式 (mm/dd/yy)
%e 短格式天 ( 1..31)
%F 文件时间格式 same as %Y-%m-%d
%h same as %b
%H 24小时制的小时 (00..23)
%I 12小时制的小时 (01..12)
%j 一年中的第几天 (001..366)
%k 短格式24小时制的小时 ( 0..23)
%m 双位月份 (01..12)
%M 双位分钟 (00..59)
%r 12小时制的时间表示(时:分:秒,双位) time, 12-hour (hh:mm:ss [AP]M)
%R 24小时制的时间表示 (时:分,双位)time, 24-hour (hh:mm)
%s 自基础时间 1970-01-01 00:00:00 到当前时刻的秒数(a GNU extension)
%T 24小时制时间表示(hh:mm:ss)
%u 数字表示的星期(从星期一开始 1-7)
%x 本地日期格式 (mm/dd/yy)
%X 本地时间格式 (%H:%M:%S)
%y 两位的年(00..99)
%Y 年 (1970…)
${}、$()、``、(())、$(())

1.变量替换

在 bash shell 中,${}是用来作变量替换的。

一般情况下,$var 与 ${var} 是没有区别的,但是用 ${ } 会比较精确地界定变量名称的范围。

A=Linux
echo $AB #表示变量AB
echo ${A}B #表示变量A后连接着B

2.命令替换

在 bash shell 中,$( ) 与 `` (反引号)都是用来作命令替换的。

命令替换与变量替换差不多,区别是一个为执行变量,一个为执行命令,先完成引号里的命令行,然后将其结果替换出来,再重组成新的命令行。

echo $(date "+%Y-%m-%d")    #2020-07-01
echo today is `date "+%Y-%m-%d"` #today is 2020-07-01

3.数学运算

3.1 (( ))

双小括号命令是用来执行数学表达式的,可以在其中进行各种逻辑运算、数学运算,也支持更多的运算符(如++、–等)。

echo $(((5 * 2)))        #10

在(( )) 中的变量名称,可于其前面加 $ 符号来执行替换,也可以不用。

i=5
echo $(((i=$i*2))) #10
echo $(((i=i*2))) #20

3.2 $(( ))

$(( )) 用于直接取得数学表达式的运算结果;表达式外层多套一层括号(写成 $((( ))))效果相同。

echo $(((i*2)))          #40
echo $((i*2)) #40
串行与并行

shell脚本默认是按顺序串行执行的,使用&可以将一个命令放在后台运行,从而使shell脚本能够继续往后执行:

sleep 5 &
echo "done"

上面的脚本执行后会立即打印出"done",sleep命令被扔给后台执行,不会阻塞脚本执行。

如果想要在进入下个循环前,必须等待上个后台命令执行完毕,可以使用wait命令:

sleep 5 &
wait
echo "done"

这样,需要等待5s后才能在屏幕上看到"done"。
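把后台执行和wait结合起来,就可以让多个任务并行执行、最后统一汇合(示意,task仅为占位函数名):

#! /bin/bash
task() {
  echo "run $1"
  sleep 2
}
# 三个任务并行执行
for t in a b c; do
  task "$t" &
done
# 等待所有后台任务结束后再继续
wait
echo "all done"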

增量采集Shell
#! /bin/bash
#SQOOP_HOME=/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/bin/sqoop
SQOOP_HOME=/usr/bin/sqoop
if [[ $1 == "" ]];then
TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
else
TD_DATE=$1
fi

V_YEAR=$(date --date="${TD_DATE}" +%Y)
V_MONTH=$(date --date="${TD_DATE}" +%m)
V_TABLE_web_chat_ems="web_chat_ems_${V_YEAR}_${V_MONTH}"
V_TABLE_web_chat_text_ems="web_chat_text_ems_${V_YEAR}_${V_MONTH}"

${SQOOP_HOME} import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query "select id,
create_date_time,
session_id,
sid,
create_time,
seo_source,
seo_keywords,
ip,
area,
country,
province,
city,
origin_channel,
user as user_match,
manual_time,
begin_time,
end_time,
last_customer_msg_time_stamp,
last_agent_msg_time_stamp,
reply_msg_count,
msg_count,
browser_name,
os_info,
'${TD_DATE}' as starts_time
from ${V_TABLE_web_chat_ems}
where create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59' and \$CONDITIONS" \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
--hive-partition-key starts_time \
--hive-partition-value ${TD_DATE} \
-m 100 \
--split-by id

#如果想要确保等待上个后台命令执行完毕再执行后续任务,可以使用wait命令:
wait

${SQOOP_HOME} import \
--connect jdbc:mysql://192.168.52.150:3306/nev \
--username root \
--password 123456 \
--query "
select id,
referrer,
from_url,
landing_page_url,
url_title,
platform_description,
other_params,
history,
'${TD_DATE}' as start_time
from ${V_TABLE_web_chat_text_ems} wcte,
(select id as wce_id, create_time
from ${V_TABLE_web_chat_ems}
where create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') wce
where wcte.id = wce.wce_id
and \$CONDITIONS" \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
--hive-partition-key start_time \
--hive-partition-value ${TD_DATE} \
-m 100 \
--split-by id
Oozie调度


时区设置

1. ClouderaManager中,hue配置修改time_zone的值为:Asia/Shanghai


2. ClouderaManager中,oozie配置修改oozie-site.xml,添加配置项:

Name: oozie.processing.timezone
Value: GMT+0800
Description: oozie时区设置为东八区区时


创建workflow

上传shell脚本到HDFS:


设置workflow


测试workflow


创建调度计划

创建计划,注意时区要设置正确。


保存,并提交


查看提交列表


查看触发记录


执行成功


数据清洗转换

增量清洗转换时,如果同一天的分区已有旧数据,需要覆盖掉,则可以使用Insert overwrite覆盖写入,或者通过sql语句删除。否则会出现数据重复,导致后续计算出错。(注意内部表不能直接删除HDFS分区)

需要增加where条件,指向ODS层的采集日期start_time字段,只清洗转换昨天的数据,旧数据已经计算过,不需要重复计算。

SQL
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

insert into table itcast_dwd.visit_consult_dwd partition (yearinfo, monthinfo, dayinfo)
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time, 'yyyy-MM-dd HH:mm:ss.SSS') as create_time,
wce.seo_source,
wce.ip,
wce.area,
cast(if(wce.msg_count is null, 0, wce.msg_count) as int) as msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
substr(wce.create_time, 12, 2) as hourinfo,
quarter(wce.create_time) as quarterinfo,
substr(wce.create_time, 1, 4) as yearinfo,
substr(wce.create_time, 6, 2) as monthinfo,
substr(wce.create_time, 9, 2) as dayinfo
from itcast_ods.web_chat_ems wce inner join itcast_ods.web_chat_text_ems wcte
on wce.id = wcte.id
where wce.starts_time='2019-07-25';
Shell脚本
#! /bin/bash
HIVE_HOME=/usr/bin/hive
if [[ $1 == "" ]];then
TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
else
TD_DATE=$1
fi

${HIVE_HOME} -S -e "
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

insert into table itcast_dwd.visit_consult_dwd partition (yearinfo, monthinfo, dayinfo)
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time, 'yyyy-MM-dd HH:mm:ss.SSS') as create_time,
wce.seo_source,
wce.ip,
wce.area,
cast(if(wce.msg_count is null, 0, wce.msg_count) as int) as msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
substr(wce.create_time, 12, 2) as hourinfo,
quarter(wce.create_time) as quarterinfo,
substr(wce.create_time, 1, 4) as yearinfo,
substr(wce.create_time, 6, 2) as monthinfo,
substr(wce.create_time, 9, 2) as dayinfo
from itcast_ods.web_chat_ems wce inner join itcast_ods.web_chat_text_ems wcte
on wce.id = wcte.id
where wce.starts_time='${TD_DATE}';
"

运行后报错:Failing Oozie Launcher, output.properties data exceeds its limit [2048]

此错误是因为Oozie捕获的任务输出超过了默认的2048字节上限,并不影响Hive执行结果的正确性。

image-20231004074631870

解决:

修改oozie配置oozie-site.xml,添加子配置项:

Name: oozie.action.max.output.data
Value: 2048000000

image-20231004074634991

重新运行oozie workflow。

统计分析

增量统计时,需要增加where条件,只统计昨天所在的各个区间的数据(年、季度、月、天、小时)。

注意DWS层维度较多,不同维度的数据会落在相同的分区目录中:小时数据和天数据都在同一个天分区(例如2019-07-01的天结果和2019-07-01 10:00:00的小时结果),季度数据和年数据都在同一个(yearinfo,'-1','-1')分区(例如2019-Q3和2019)。如果使用Insert overwrite覆盖写入,同一分区内的新数据会互相覆盖。

使用Insert Into比较安全。
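可以用下面的查询直观验证这种分区共存(示意查询,取某个天分区):同一个分区内会同时存在小时(time_type=1)和天(time_type=2)两类统计结果,overwrite写入任意一类都会把另一类一并清掉:

#! /bin/bash
# 示意:查看同一个天分区内共存的time_type,说明overwrite的覆盖风险
hive -S -e "
select time_type, count(*) as cnt
from itcast_dws.visit_dws
where yearinfo='2019' and monthinfo='07' and dayinfo='01'
group by time_type;
"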

分区内有旧数据的话,则要先通过sql删除旧数据:

-- 精确删除数据
--删除20190701的05点数据
delete from itcast_dws.visit_dws where yearinfo='2019' and monthinfo='07' and dayinfo='01' and hourinfo='05' and time_type=1;
--删除20190901天数据
delete from itcast_dws.visit_dws where yearinfo='2019' and monthinfo='09' and dayinfo='01' and time_type=2;
--删除201909月数据
delete from itcast_dws.visit_dws where yearinfo='2019' and monthinfo='09' and dayinfo='-1' and time_type=3;
--删除2019年3季度数据
delete from itcast_dws.visit_dws where yearinfo='2019' and quarterinfo='3' and time_type=4;
--删除2019年数据
delete from itcast_dws.visit_dws where yearinfo='2019' and time_type=5;

--删除分区下所有数据
--删除20190901天和小时数据(会把日期里面的小时数据一起删除掉,月和年尽量使用delete语句)
alter table itcast_dws.visit_dws drop partition (yearinfo='2019', monthinfo='09', dayinfo='01');
地区分组

小时维度:

--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

INSERT INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
wce.area as area,
'-1' as seo_source,
'-1' as origin_channel,
wce.hourinfo as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo,' ',wce.hourinfo) as time_str,
'-1' as from_url,
'1' as groupType,
'1' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.area,wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo,wce.hourinfo;

天维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
wce.area as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo) as time_str,
'-1' as from_url,
'1' as groupType,
'2' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.area,wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo;

月维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
wce.area as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo) as time_str,
'-1' as from_url,
'1' as groupType,
'3' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo)='${V_Month}'
group by wce.area,wce.yearinfo,wce.quarterinfo,wce.monthinfo;

季度维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
wce.area as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-Q',wce.quarterinfo) as time_str,
'-1' as from_url,
'1' as groupType,
'4' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.quarterinfo)='${V_QUARTER}'
group by wce.area,wce.yearinfo,wce.quarterinfo;

年维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
wce.area as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'1' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where wce.yearinfo='${V_Year}'
group by wce.area,wce.yearinfo;
搜索来源分组

小时维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo, monthinfo, dayinfo)
select COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
wce.seo_source as seo_source,
'-1' as origin_channel,
wce.hourinfo as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo,' ',wce.hourinfo) as time_str,
'-1' as from_url,
'2' as groupType,
'1' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.seo_source, wce.yearinfo, wce.quarterinfo, wce.monthinfo, wce.dayinfo, wce.hourinfo;

天维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
wce.seo_source as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo) as time_str,
'-1' as from_url,
'2' as groupType,
'2' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.seo_source,wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo;

月维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
wce.seo_source as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo) as time_str,
'-1' as from_url,
'2' as groupType,
'3' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo)='${V_Month}'
group by wce.seo_source,wce.yearinfo,wce.quarterinfo,wce.monthinfo;

季度维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
wce.seo_source as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-Q',wce.quarterinfo) as time_str,
'-1' as from_url,
'2' as groupType,
'4' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.quarterinfo)='${V_QUARTER}'
group by wce.seo_source,wce.yearinfo,wce.quarterinfo;

年维度:

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
wce.seo_source as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'2' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where wce.yearinfo='${V_Year}'
group by wce.seo_source,wce.yearinfo;
来源渠道分组

小时维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
wce.origin_channel as origin_channel,
wce.hourinfo as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo,' ',wce.hourinfo) as time_str,
'-1' as from_url,
'3' as groupType,
'1' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.origin_channel,wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo,wce.hourinfo;

天维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
wce.origin_channel as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo) as time_str,
'-1' as from_url,
'3' as groupType,
'2' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.origin_channel,wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo;

月维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
wce.origin_channel as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo) as time_str,
'-1' as from_url,
'3' as groupType,
'3' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo)='${V_Month}'
group by wce.origin_channel,wce.yearinfo,wce.quarterinfo,wce.monthinfo;

季度维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
wce.origin_channel as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-Q',wce.quarterinfo) as time_str,
'-1' as from_url,
'3' as groupType,
'4' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.quarterinfo)='${V_QUARTER}'
group by wce.origin_channel,wce.yearinfo,wce.quarterinfo;

年维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
wce.origin_channel as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'3' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where wce.yearinfo='${V_Year}'
group by wce.origin_channel,wce.yearinfo;
会话来源页面分组

小时维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
wce.hourinfo as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo,' ',wce.hourinfo) as time_str,
wce.from_url as from_url,
'4' as groupType,
'1' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.from_url,wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo,wce.hourinfo;

天维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo) as time_str,
wce.from_url as from_url,
'4' as groupType,
'2' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.from_url,wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo;

月维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo) as time_str,
wce.from_url as from_url,
'4' as groupType,
'3' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo)='${V_Month}'
group by wce.from_url,wce.yearinfo,wce.quarterinfo,wce.monthinfo;

季度维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-Q',wce.quarterinfo) as time_str,
wce.from_url as from_url,
'4' as groupType,
'4' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.quarterinfo)='${V_QUARTER}'
group by wce.from_url,wce.yearinfo,wce.quarterinfo;

年维度:

INSERT  INTO TABLE itcast_dws.`visit_dws` PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
wce.from_url as from_url,
'4' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where wce.yearinfo='${V_Year}'
group by wce.from_url,wce.yearinfo;
总访问量

小时(小时段基础数据)

因为小时段数据可以直接sum求和,因此OLAP应用可以在小时数据基础上,进行简单的sum操作以获取到区间小时段数据。
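例如(示意查询,grouptype、time_type的取值沿用本表约定):统计2019-07-01当天08点到12点区间的总访问客户量,只需对小时基础数据求和:

#! /bin/bash
# 示意:在小时段基础数据上sum,得到任意区间小时段的总访问客户量
hive -S -e "
select sum(sid_total) as sid_total
from itcast_dws.visit_dws
where yearinfo='2019' and monthinfo='07' and dayinfo='01'
and grouptype='5' and time_type='1'
and hourinfo between '08' and '12';
"

小时段基础数据的生成语句如下: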

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
wce.hourinfo as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo,' ',wce.hourinfo) as time_str,
'-1' as from_url,
'5' as groupType,
'1' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo,wce.hourinfo;

天

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo) as time_str,
'-1' as from_url,
'5' as groupType,
'2' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
wce.dayinfo as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo;

月

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo) as time_str,
'-1' as from_url,
'5' as groupType,
'3' as time_type,
wce.yearinfo as yearinfo,
wce.monthinfo as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.monthinfo)='${V_Month}'
group by wce.yearinfo,wce.quarterinfo,wce.monthinfo;

季度

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo as quarterinfo,
CONCAT(wce.yearinfo,'-Q',wce.quarterinfo) as time_str,
'-1' as from_url,
'5' as groupType,
'4' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where concat_ws('-',wce.yearinfo,wce.quarterinfo)='${V_QUARTER}'
group by wce.yearinfo,wce.quarterinfo;

年

INSERT  INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'5' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where wce.yearinfo='${V_Year}'
group by wce.yearinfo;
Oozie Shell

季度值可以通过公式获取,QUARTER=$((($MONTH-1)/3+1))

date +%-m和+%m的区别是:%-m会将前缀的0去掉,比如输出7而不是07。带前缀0的结果(如08、09)在Shell算术运算中会被当作八进制数而报错,去掉前缀0后才能作为int类型安全地参与运算。
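下面的小例子演示了这个差别,可直接在bash中运行:

#! /bin/bash
# 演示 +%m 与 +%-m 的差别:前缀0在Shell算术运算中会被当作八进制
MONTH_PADDED=$(date -d '2019-08-15' +%m)    # 输出 08
MONTH_PLAIN=$(date -d '2019-08-15' +%-m)    # 输出 8

echo "padded=$MONTH_PADDED plain=$MONTH_PLAIN"

# 下面一行若取消注释会报错:08被当作八进制数(value too great for base)
# echo $((($MONTH_PADDED-1)/3+1))

# 去掉前缀0后可正常运算,输出季度 3
echo $((($MONTH_PLAIN-1)/3+1))

完整的增量统计分析脚本如下: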

#! /bin/bash
HIVE_HOME=/usr/bin/hive
if [[ $1 == "" ]];then
TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
else
TD_DATE=$1
fi

TD_YEAR=$(date -d "$TD_DATE" +%Y)
TD_MONTH=$(date -d "$TD_DATE" +%m)
TD_DAY=$(date -d "$TD_DATE" +%d)
month_for_quarter=`date --date="$TD_DATE" +%-m`
TD_QUARTER=$((($month_for_quarter-1)/3+1))

${HIVE_HOME} -S -e "
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

---------地区分组
--小时
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by area, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

--天
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by area, yearinfo, quarterinfo, monthinfo, dayinfo;

--月
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH'
group by area, yearinfo, quarterinfo, monthinfo;

--季度
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'1' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and quarterinfo='$TD_QUARTER'
group by area, yearinfo, quarterinfo;
--年
INSERT INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
wce.area as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'1' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where yearinfo='$TD_YEAR'
group by wce.area,wce.yearinfo;


---------搜索来源分组
--小时
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by seo_source, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

--天
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by seo_source, yearinfo, quarterinfo, monthinfo, dayinfo;

--月
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH'
group by seo_source, yearinfo, quarterinfo, monthinfo;

--季度
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'2' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and quarterinfo='$TD_QUARTER'
group by seo_source, yearinfo, quarterinfo;
--年
INSERT INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'2' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where yearinfo='$TD_YEAR'
group by wce.seo_source,wce.yearinfo;

---------来源渠道分组
--小时
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by origin_channel, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

--天
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by origin_channel, yearinfo, quarterinfo, monthinfo, dayinfo;

--月
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH'
group by origin_channel, yearinfo, quarterinfo, monthinfo;

--季度
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'3' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and quarterinfo='$TD_QUARTER'
group by origin_channel, yearinfo, quarterinfo;
--年
INSERT INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'3' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where yearinfo='$TD_YEAR'
group by wce.origin_channel,wce.yearinfo;


---------会话来源页面分组
--小时
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
from_url,
'4' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by from_url, yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

--天
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
from_url,
'4' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by from_url, yearinfo, quarterinfo, monthinfo, dayinfo;

--月
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
from_url,
'4' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH'
group by from_url, yearinfo, quarterinfo, monthinfo;

--季度
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
from_url,
'4' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and quarterinfo='$TD_QUARTER'
group by from_url, yearinfo, quarterinfo;
--年
INSERT INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
from_url,
'4' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where yearinfo='$TD_YEAR'
group by wce.from_url,wce.yearinfo;

---------总访问量分组
--小时
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'1' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;

--天
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'2' as time_type,
yearinfo, monthinfo, dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo='$TD_DAY'
group by yearinfo, quarterinfo, monthinfo, dayinfo;

--月
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'3' as time_type,
yearinfo, monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH'
group by yearinfo, quarterinfo, monthinfo;

--季度
insert into itcast_dws.visit_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as session_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat(yearinfo,'-Q',quarterinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'4' as time_type,
yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where yearinfo='$TD_YEAR' and quarterinfo='$TD_QUARTER'
group by yearinfo, quarterinfo;
--年
INSERT INTO TABLE itcast_dws.visit_dws PARTITION (yearinfo,monthinfo,dayinfo)
select
COUNT(DISTINCT wce.sid) as sid_total,
COUNT(DISTINCT wce.session_id) as sessionid_total,
COUNT(DISTINCT wce.ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo as time_str,
'-1' as from_url,
'5' as groupType,
'5' as time_type,
wce.yearinfo as yearinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd wce
where yearinfo='$TD_YEAR'
group by wce.yearinfo;
"

导出数据

增量数据导出mysql,需要做成sh脚本,用oozie定时触发,分析流程完毕后执行导出。

因为每次统计最大的时间范围是年,所以每次导出都需要覆盖导出一年的数据。此时需要先将mysql中的旧数据删除,然后再将新数据导入。

#! /bin/bash
SQOOP_HOME=/usr/bin/sqoop
if [[ $1 == "" ]];then
TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
else
TD_DATE=$1
fi

TD_YEAR=$(date -d "$TD_DATE" +%Y)

mysql -uroot -p123456 -h192.168.52.150 -P3306 -e "delete from scrm_bi.itcast_visit where yearinfo='$TD_YEAR';"

wait

$SQOOP_HOME export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password '123456' \
--table itcast_visit \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values $TD_YEAR \
-m 100

咨询客户量实现

建模

指标和维度

指标:咨询客户量是单位时间内有效咨询客服的去重后客户数量,以天为单位显示咨询客户。客户与网咨有说一句话的称为有效咨询。

维度:

l 时间维度:年、季度、月、天

l 业务属性维度:地区、来源渠道

原始数据结构

咨询客户量的数据来源和访问客户量一致,都是咨询系统的访问会话信息月表web_chat_ems。表名的格式为web_chat_ems_年_月:年份为4位数字,月份为两位数字,不足两位时前面用0补全,比如web_chat_ems_2019_07。
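按照这个命名规则,可以用date命令直接拼出某个日期对应的月表名(示意):

#! /bin/bash
# 示意:根据日期拼出月表名,+%m会自动用0补全月份
TD_DATE=${1:-2019-07-25}
echo "web_chat_ems_$(date -d "$TD_DATE" +%Y_%m)"    # 输出 web_chat_ems_2019_07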

事实表和维度表

事实表就是我们的客户访问表,而维度数据都包含在事实表中,没有需要额外关联的维度表。

分层

ODS层是原始数据,一般不允许修改,所以使用外部表保证数据的安全性,避免误删除;DW和APP层是统计数据,为了使覆盖插入等操作更方便,满足业务需求的同时,提高开发和测试效率,推荐使用内部表。
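下面是一个对比示意(表名、路径均为举例,字段从略),说明外部表和内部表在drop时的差别:

#! /bin/bash
# 示意:ODS建外部表,drop时只删元数据、保留HDFS数据;DW/APP建内部表
hive -S -e "
create external table if not exists itcast_ods.demo_ods (id string)
location '/user/hive/warehouse/itcast_ods.db/demo_ods';

create table if not exists itcast_dws.demo_dws (id string);

-- drop外部表:HDFS上demo_ods目录下的原始数据仍然保留
drop table itcast_ods.demo_ods;
-- drop内部表:数据随表一起被删除
drop table itcast_dws.demo_dws;
"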

image-20231004074649499

ODS层

由于咨询客户量的原始数据和访问客户量一致,所以ODS层可以直接复用itcast_ods.web_chat_ems表,内容是咨询系统OLTP数据库的web_chat_ems_20XX_XX等月表中抽取的原始数据;

DWD层

可以复用访问客户量指标的DWD层:itcast_dwd.visit_consult_dwd。

DWS层

客户访问量指标中,DWS层我们增加了两个标识字段:时间和业务属性。在咨询客户量指标中,我们采用同样的方式来进行统计,最后在APP层或OLAP应用中再做进一步的分组取值。

建表语句如下(含写入时压缩设置):

set hive.exec.orc.compression.strategy=COMPRESSION;

CREATE TABLE IF NOT EXISTS itcast_dws.consult_dws
(
sid_total INT COMMENT '根据sid去重求count',
sessionid_total INT COMMENT '根据sessionid去重求count',
ip_total INT COMMENT '根据IP去重求count',
area STRING COMMENT '区域信息',
origin_channel STRING COMMENT '来源渠道',
hourinfo STRING COMMENT '创建时间,统计至小时',
quarterinfo STRING COMMENT '季度',
time_str STRING COMMENT '时间明细',
groupType STRING COMMENT '产品属性类型:1.地区;2.来源渠道',
time_type STRING COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;'
)
COMMENT '咨询量DWS宽表'
PARTITIONED BY (yearinfo string, monthinfo STRING, dayinfo string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC
LOCATION '/user/hive/warehouse/itcast_dws.db/consult_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');
APP层

如果客户需要具体的报表展示,可以针对不同的报表页面设计APP层结构,然后导出至OLAP系统的mysql中。本系统使用FineReport,通过宽表即可进行灵活的展现,因此APP层不再细化,直接将DWS层导出至mysql即可。

全量流程

数据采集

同访问客户量指标。

数据清洗转换

同访问客户量指标。

统计分析

地区分组

天维度:

--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

-------地区分组
--天
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
area,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo,dayinfo),
'1',
'2',
yearinfo,monthinfo,dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1
group by area, yearinfo, quarterinfo, monthinfo, dayinfo;

月维度:

insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
area,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo),
'1',
'3',
yearinfo,monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1
group by area, yearinfo, quarterinfo, monthinfo;

季度维度:

insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
area,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-Q',yearinfo,quarterinfo),
'1',
'4',
yearinfo,'-1' as monthinfo,'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1
group by area, yearinfo, quarterinfo;

年维度:

insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
area,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
yearinfo,
'1',
'5',
yearinfo,'-1' as monthinfo,'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1
group by area, yearinfo;
来源渠道分组

天维度:

insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
'-1' as area,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo,dayinfo),
'2',
'2',
yearinfo,monthinfo,dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1
group by origin_channel, yearinfo, quarterinfo, monthinfo, dayinfo;

月维度:

insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
'-1' as area,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo),
'2',
'3',
yearinfo,monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1
group by origin_channel, yearinfo, quarterinfo, monthinfo;

季度维度:

insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
'-1' as area,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-Q',yearinfo,quarterinfo),
'2',
'4',
yearinfo,'-1' as monthinfo,'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1
group by origin_channel, yearinfo, quarterinfo;

年维度:

insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
'-1' as area,
origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
yearinfo,
'2',
'5',
yearinfo,'-1' as monthinfo,'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1
group by origin_channel, yearinfo;

导出数据

创建mysql表
CREATE TABLE `itcast_consult` (
sid_total int(11) COMMENT '去重并聚合sid',
sessionid_total int(11) COMMENT '去重并聚合sessionid',
ip_total int(11) COMMENT '去重并聚合ip',
area varchar(32) COMMENT '区域信息',
origin_channel varchar(32) COMMENT '来源渠道',
hourinfo varchar(5) COMMENT '创建时间,统计至小时',
quarterinfo varchar(5) COMMENT '季度',
time_str varchar(32) COMMENT '时间明细',
groupType varchar(5) COMMENT '产品属性类型:1.地区;2.来源渠道',
time_type varchar(5) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
yearinfo varchar(5) COMMENT '创建时间,统计至年',
monthinfo varchar(5) COMMENT '创建时间,统计至月',
dayinfo varchar(5) COMMENT '创建时间,统计至天'
) COMMENT='客户咨询统计数据';
执行sqoop导出脚本
sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table itcast_consult \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
-m 100

增量流程

数据采集

同访问客户量指标。

数据清洗转换

同访问客户量指标。

统计分析

SQL
地区分组
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

-- 地区分组
-- 天维度:
INSERT INTO TABLE itcast_dws.consult_dws PARTITION (yearinfo,monthinfo,dayinfo)
SELECT
count(distinct wce.sid),
'-1',
'-1',
wce.area,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo),
'1',
'2',
wce.yearinfo,
wce.monthinfo,
wce.dayinfo
FROM itcast_dwd.visit_consult_dwd wce
where wce.msg_count >= 1
and concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo,wce.area;
-- 月维度:
INSERT INTO TABLE itcast_dws.consult_dws PARTITION (yearinfo,monthinfo,dayinfo)
SELECT
count(distinct wce.sid),
'-1',
'-1',
wce.area,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo),
'1',
'3',
wce.yearinfo,
wce.monthinfo,
'-1'
FROM itcast_dwd.visit_consult_dwd wce
where wce.msg_count >= 1
and concat_ws('-',wce.yearinfo,wce.monthinfo)='${V_Month}'
group by wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.area;
-- 季度维度:
INSERT INTO TABLE itcast_dws.consult_dws PARTITION (yearinfo,monthinfo,dayinfo)
SELECT
count(distinct wce.sid),
'-1',
'-1',
wce.area,
'-1' as origin_channel,
'-1' as hourinfo,
wce.quarterinfo,
CONCAT(wce.yearinfo,'-Q',wce.quarterinfo),
'1',
'4',
wce.yearinfo,
'-1',
'-1'
FROM itcast_dwd.visit_consult_dwd wce
where wce.msg_count >= 1
and concat_ws('-',wce.yearinfo,wce.quarterinfo)='${V_QUARTER}'
group by wce.yearinfo,wce.quarterinfo,wce.area;
-- 年维度:
INSERT INTO TABLE itcast_dws.consult_dws PARTITION (yearinfo,monthinfo,dayinfo)
SELECT
count(distinct wce.sid),
'-1',
'-1',
wce.area,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo,
'1',
'5',
wce.yearinfo,
'-1',
'-1'
FROM itcast_dwd.visit_consult_dwd wce
where wce.msg_count >= 1
and wce.yearinfo='${V_Year}'
group by wce.yearinfo,wce.area;
来源渠道分组
--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--来源渠道分组
--天维度:
INSERT INTO TABLE itcast_dws.consult_dws PARTITION (yearinfo,monthinfo,dayinfo)
SELECT
count(distinct wce.sid),
'-1',
'-1',
'-1' as area,
wce.origin_channel,
'-1' as hourinfo,
wce.quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo,'-',wce.dayinfo),
'2',
'2',
wce.yearinfo,
wce.monthinfo,
wce.dayinfo
FROM itcast_dwd.visit_consult_dwd wce
where wce.msg_count >= 1
and concat_ws('-',wce.yearinfo,wce.monthinfo,wce.dayinfo)='${TD_DATE}'
group by wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.dayinfo,wce.origin_channel;
--月维度:
INSERT INTO TABLE itcast_dws.consult_dws PARTITION (yearinfo,monthinfo,dayinfo)
SELECT
count(distinct wce.sid),
'-1',
'-1',
'-1' as area,
wce.origin_channel,
'-1' as hourinfo,
wce.quarterinfo,
CONCAT(wce.yearinfo,'-',wce.monthinfo),
'2',
'3',
wce.yearinfo,
wce.monthinfo,
'-1'
FROM itcast_dwd.visit_consult_dwd wce
where wce.msg_count >= 1
and concat_ws('-',wce.yearinfo,wce.monthinfo)='${V_Month}'
group by wce.yearinfo,wce.quarterinfo,wce.monthinfo,wce.origin_channel;
--季度维度:
INSERT INTO TABLE itcast_dws.consult_dws PARTITION (yearinfo,monthinfo,dayinfo)
SELECT
count(distinct wce.sid),
'-1',
'-1',
'-1' as area,
wce.origin_channel,
'-1' as hourinfo,
wce.quarterinfo,
CONCAT(wce.yearinfo,'-Q',wce.quarterinfo),
'2',
'4',
wce.yearinfo,
'-1',
'-1'
FROM itcast_dwd.visit_consult_dwd wce
where wce.msg_count >= 1
and concat_ws('-',wce.yearinfo,wce.quarterinfo)='${V_QUARTER}'
group by wce.yearinfo,wce.quarterinfo,wce.origin_channel;
--年维度:
INSERT INTO TABLE itcast_dws.consult_dws PARTITION (yearinfo,monthinfo,dayinfo)
SELECT
count(distinct wce.sid),
'-1',
'-1',
'-1' as area,
wce.origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
wce.yearinfo,
'2',
'5',
wce.yearinfo,
'-1',
'-1'
FROM itcast_dwd.visit_consult_dwd wce
where wce.msg_count >= 1
and wce.yearinfo='${V_Year}'
group by wce.yearinfo,wce.origin_channel;
Oozie Shell脚本示例
#! /bin/bash
HIVE_HOME=/usr/bin/hive
if [[ $1 == "" ]];then
TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
else
TD_DATE=$1
fi

TD_YEAR=$(date -d "$TD_DATE" +%Y)
TD_MONTH=$(date -d "$TD_DATE" +%m)
TD_DAY=$(date -d "$TD_DATE" +%d)
month_for_quarter=$(date -d "$TD_DATE" +%-m)
TD_QUARTER=$((($month_for_quarter-1)/3+1))

$HIVE_HOME -S -e "
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;

-------地区分组
--天
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
area,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo,dayinfo),
'1',
'2',
yearinfo,monthinfo,dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo = '$TD_DAY'
group by area, yearinfo, quarterinfo, monthinfo, dayinfo;

--月
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
area,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo),
'1',
'3',
yearinfo,monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH'
group by area, yearinfo, quarterinfo, monthinfo;

--季度
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
area,
'-1' as origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-Q',yearinfo,quarterinfo),
'1',
'4',
yearinfo,'-1' as monthinfo,'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and yearinfo='$TD_YEAR' and quarterinfo='$TD_QUARTER'
group by area, yearinfo, quarterinfo;

--年
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
area,
'-1' as origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
yearinfo,
'1',
'5',
yearinfo,'-1' as monthinfo,'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and yearinfo='$TD_YEAR'
group by area, yearinfo;


------来源渠道分组
--天
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
'-1' as area,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo,dayinfo),
'2',
'2',
yearinfo,monthinfo,dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH' and dayinfo = '$TD_DAY'
group by origin_channel, yearinfo, quarterinfo, monthinfo, dayinfo;

--月
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
'-1' as area,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-',yearinfo,monthinfo),
'2',
'3',
yearinfo,monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and yearinfo='$TD_YEAR' and monthinfo='$TD_MONTH'
group by origin_channel, yearinfo, quarterinfo, monthinfo;

--季度
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
'-1' as area,
origin_channel,
'-1' as hourinfo,
quarterinfo,
concat_ws('-Q',yearinfo,quarterinfo),
'2',
'4',
yearinfo,'-1' as monthinfo,'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and yearinfo='$TD_YEAR' and quarterinfo='$TD_QUARTER'
group by origin_channel, yearinfo, quarterinfo;

--年
insert into itcast_dws.consult_dws partition (yearinfo, monthinfo, dayinfo)
select
count(distinct sid),
count(distinct session_id),
count(distinct ip),
'-1' as area,
origin_channel,
'-1' as hourinfo,
'-1' as quarterinfo,
yearinfo,
'2',
'5',
yearinfo,'-1' as monthinfo,'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
where msg_count >= 1 and yearinfo='$TD_YEAR'
group by origin_channel, yearinfo;
"

导出数据

导出到mysql

增量数据导出mysql,需要做成sh脚本,用oozie定时触发,在整个流程分析完毕后执行。

因为每次统计最大的时间范围是年,所以每次导出都需要覆盖导出一年的数据。此时需要先将mysql中的旧数据删除,然后再将新数据导入。

#! /bin/bash
SQOOP_HOME=/usr/bin/sqoop
if [[ $1 == "" ]];then
TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
else
TD_DATE=$1
fi

TD_YEAR=$(date -d "$TD_DATE" +%Y)

mysql -uroot -p123456 -h192.168.52.150 -P3306 -e "
delete from scrm_bi.itcast_consult where yearinfo='$TD_YEAR';
"

sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 123456 \
--table itcast_consult \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values $TD_YEAR \
-m 100
oozie调度

略。