说明:同样数据的两个orc表,分别模拟一个月120亿wifi数据,平均每日4亿数据。
字段:id 唯一标记
mac mac地址
point 点位
x x坐标
y y坐标
hour 数据时间-小时
day 数据时间-日期
区别:
表1:wifi_orc 以day作为一级分区,hour作为二级分区
表2:wifi_orc_ex 只以day作为一级分区
测试sql: 选取6个不同时间段同范围的点位作为条件,进行6表之间的数据碰撞,之后对频次进行排序
SELECT * FROM
(SELECT tmp.mac, count(tmp.mac) c FROM (SELECT t1.mac FROM wifi_orc t1, wifi_orc t2, wifi_orc t3, wifi_orc t4, wifi_orc t5, wifi_orc t6 WHERE t1.day = '20170111' AND t1.hour >= '07' AND t1.hour <= '09' AND t1.point IN(1686,1199,1356,999,2500) AND t2.day = '20170112' AND t2.hour >= '11' AND t2.hour <= '13' AND t2.point IN(1686,1199,1356,999,2500) AND t3.day = '20170113' AND t3.hour >= '16' AND t3.hour <= '19' AND t2.point IN(1686,1199,1356,999,2500) AND t4.day = '20170114' AND t4.hour >= '20' AND t4.hour <= '22' AND t4.point IN(1686,1199,1356,999,2500) AND t5.day = '20170112' AND t5.hour >= '09' AND t5.hour <= '11' AND t5.point IN(1686,1199,1356,999,2500) AND t6.day = '20170115' AND t6.hour >= '00' AND t6.hour <= '05' AND t6.point IN(1686,1199,1356,999,2500) AND t1.mac = t2.mac AND t2.mac = t3.mac AND t3.mac = t4.mac AND t4.mac = t5.mac AND t5.mac = t6.mac ) tmpGROUP BY tmp.mac) retORDER BY ret.c DESC LIMIT 10
结果:
带二级分区的wifi_orc 执行时间: 2分42秒
只有一级分区的wifi_orc_ex 执行时间:4分42秒
上述执行引擎为mr方式执行,等上线spark引擎,可以进行比较.