Hadoop/Spark Is Too Heavy; esProc SPL Is Light
2023-02-25
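With the arrival of the big data era, data volumes keep growing. The traditional model of running a database on a minicomputer is hard and expensive to scale, so it struggles to support business growth. Many users have turned to distributed computing, building clusters from multiple inexpensive PC servers to handle big data computing tasks.
For larger data volumes, SPL provides a lightweight cluster computing capability. It is designed for clusters of a few to a dozen or so nodes, and it is implemented in a completely different way from Hadoop.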
with e1 as (
    -- step 1: each user's earliest eventtype1 inside the date range
    select gid, 1 as step1, min(etime) as t1
    from T
    where etime >= to_date('2021-01-10', 'yyyy-MM-dd') and etime < to_date('2021-01-25', 'yyyy-MM-dd')
        and eventtype = 'eventtype1' and …
    group by 1
),
e2 as (
    -- step 2: earliest eventtype2 after t1 and within 7 days of t1
    select e2.gid, 1 as step2, min(e1.t1) as t1, min(e2.etime) as t2
    from T as e2
    inner join e1 on e2.gid = e1.gid
    where e2.etime >= to_date('2021-01-10', 'yyyy-MM-dd') and e2.etime < to_date('2021-01-25', 'yyyy-MM-dd')
        and e2.etime > t1
        and e2.etime < t1 + 7
        and eventtype = 'eventtype2' and …
    group by 1
),
e3 as (
    -- step 3: earliest eventtype3 after t2 and still within 7 days of t1
    select e3.gid, 1 as step3, min(e2.t1) as t1, min(e3.etime) as t3
    from T as e3
    inner join e2 on e3.gid = e2.gid
    where e3.etime >= to_date('2021-01-10', 'yyyy-MM-dd') and e3.etime < to_date('2021-01-25', 'yyyy-MM-dd')
        and e3.etime > t2
        and e3.etime < t1 + 7
        and eventtype = 'eventtype3' and …
    group by 1
)
-- count how many users reached each step
select
    sum(step1) as step1,
    sum(step2) as step2,
    sum(step3) as step3
from
    e1
    left join e2 on e1.gid = e2.gid
    left join e3 on e2.gid = e3.gid
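The query above is a three-step funnel: for each user it finds the earliest event of each type, in order, inside a 7-day window that starts at the first step, and then counts how many users reached each step. To make that logic easier to follow, here is a minimal Python sketch of the same rule applied to each user's time-ordered events. It is an illustration only, not how SPL or any database evaluates the query. The function name `funnel_counts`, the sample `EVENTS` data, and the tuple layout are assumptions made for this example, and the extra filter conditions elided as `…` in the SQL are left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def funnel_counts(events, start, end, steps, window_days=7):
    """Count users reaching each funnel step (hypothetical helper).

    events: iterable of (gid, etime, eventtype) tuples.
    A step counts only if it occurs strictly after the previous step
    and strictly before t1 + window_days, where t1 is the time of step 1.
    """
    # Group the events inside [start, end) by user.
    by_user = defaultdict(list)
    for gid, etime, etype in events:
        if start <= etime < end:
            by_user[gid].append((etime, etype))

    counts = [0] * len(steps)
    for user_events in by_user.values():
        user_events.sort()          # time order within one user
        t1 = prev = None            # first-step time, latest matched step time
        reached = 0
        for etime, etype in user_events:
            if reached == len(steps):
                break
            if etype != steps[reached]:
                continue
            if reached == 0:
                t1 = prev = etime   # earliest step-1 event
                reached = 1
            elif prev < etime < t1 + timedelta(days=window_days):
                prev = etime        # earliest qualifying next-step event
                reached += 1
        for i in range(reached):
            counts[i] += 1
    return counts


# Made-up sample data, for illustration only.
EVENTS = [
    ("a", datetime(2021, 1, 10), "eventtype1"),
    ("a", datetime(2021, 1, 12), "eventtype2"),
    ("a", datetime(2021, 1, 14), "eventtype3"),
    ("b", datetime(2021, 1, 11), "eventtype1"),
    ("b", datetime(2021, 1, 19), "eventtype2"),   # more than 7 days after t1
    ("c", datetime(2021, 1, 12), "eventtype2"),   # never did step 1
    ("d", datetime(2021, 1, 12), "eventtype1"),
    ("d", datetime(2021, 1, 13), "eventtype2"),
    ("d", datetime(2021, 1, 20), "eventtype3"),   # outside the 7-day window
    ("e", datetime(2021, 1, 26), "eventtype1"),   # outside the date range
]

counts = funnel_counts(
    EVENTS,
    start=datetime(2021, 1, 10),
    end=datetime(2021, 1, 25),
    steps=["eventtype1", "eventtype2", "eventtype3"],
)
print(counts)  # [3, 2, 1]
```

Because each user's events are walked once in time order, adding a fourth step only means adding an entry to `steps`, whereas the SQL needs another subquery and another join.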
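The code for SPL cluster computing is also very simple. Take the order analysis mentioned earlier: the large orders table is stored in segments across 4 nodes, the small product table is loaded into memory on every node, and after the two tables are joined, order amounts are grouped and summed by product vendor. In SPL it looks roughly like this: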
  | A | B
1 | ["192.168.0.101:8281","192.168.0.102:8281",…, "192.168.0.104:8281"] |
2 | fork to(4);A1 | =file("product.ctx").open().import()
3 | | >env(PRODUCT,B2)
4 | =memory(A1,PRODUCT) |
5 | =file("orders.ctx":to(4),A1).open().cursor(p_id,quantity) |
6 | =A5.switch(p_id,A4) |
7 | =A6.groups(p_id.vendor;sum(p_id.price*quantity)) |
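When this code runs, the computing resources needed for task management (loading data into memory, splitting tasks, merging results, and so on) are far smaller than what the join and the grouped aggregation consume. Task management this lightweight can run on any node, or even in the IDE.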
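The pattern in the SPL script above (a segmented orders table, a product table held in memory on each node, a partial aggregation on each node, and a final merge) can be sketched in plain Python. This is a single-process stand-in under stated assumptions, not SPL's implementation: a thread pool plays the role of the 4 nodes, and `PRODUCTS`, `ORDERS`, `segment`, `node_task`, and `run` are hypothetical names with made-up data.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Small dimension table, assumed to fit in memory on every "node".
PRODUCTS = {
    1: {"vendor": "V1", "price": 10.0},
    2: {"vendor": "V2", "price": 4.0},
    3: {"vendor": "V1", "price": 2.5},
}

# Large fact table as (p_id, quantity) rows; made-up sample data.
ORDERS = [(1, 2), (2, 5), (3, 4), (1, 1), (2, 1), (3, 2), (1, 3), (2, 2)]


def segment(rows, i, n):
    """Return the i-th of n contiguous segments (i is 1-based)."""
    lo = (i - 1) * len(rows) // n
    hi = i * len(rows) // n
    return rows[lo:hi]


def node_task(i, n):
    """Work done by one node: scan its segment, join, aggregate partially."""
    partial = Counter()
    for p_id, quantity in segment(ORDERS, i, n):
        product = PRODUCTS[p_id]            # in-memory lookup replaces the join
        partial[product["vendor"]] += product["price"] * quantity
    return partial


def run(n_nodes=4):
    """Coordinator: split the work, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(lambda i: node_task(i, n_nodes),
                                 range(1, n_nodes + 1)))
    total = Counter()
    for partial in partials:
        total.update(partial)               # adds amounts vendor by vendor
    return dict(total)


result = run(4)
print(result)
```

The coordinator only hands out segment numbers and adds up a few small per-vendor totals, which is why the task-management cost is tiny compared with the scan, join, and aggregation done on each node.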