Isn't ES good enough? Why are big companies moving away from it to ClickHouse?
Architecture and design comparison

An Elasticsearch cluster distinguishes several node roles:

- Client Node: handles API calls and data access; it does not store or process data.
- Data Node: responsible for storing and indexing data.
- Master Node: management node, responsible for coordinating the nodes in the cluster; it does not store data.
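As a quick illustration of these roles, a running cluster can be inspected through the cat API of the official Python client. This is a minimal sketch assuming a local instance such as the single-node container used later in this article:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# List each node's name and role letters: "m" = master-eligible, "d" = data,
# "i" = ingest; a coordinating-only (client) node shows "-" for its roles.
for node in es.cat.nodes(format="json", h="name,node.role,master"):
    print(node)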


To support search, ClickHouse also supports Bloom filters.
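As a minimal sketch of what that looks like in practice (executed here through the clickhouse_driver Python client; the index names, false-positive rate, and granularity are illustrative assumptions, not part of the original benchmark), Bloom-filter skip indexes can be added to the syslog table defined later in this article:

from clickhouse_driver import Client

client = Client(host="localhost", port=9000)

# A plain bloom_filter index can speed up equality filters such as
# WHERE hostname = 'for.org'.
client.execute(
    "ALTER TABLE default.syslog "
    "ADD INDEX hostname_bf hostname TYPE bloom_filter(0.01) GRANULARITY 4"
)

# A token-based Bloom filter (tokenbf_v1) can help word lookups such as
# WHERE raw LIKE '%pretty%'.
client.execute(
    "ALTER TABLE default.syslog "
    "ADD INDEX raw_tokenbf raw TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4"
)

# Materialize the indexes for data that is already in the table.
client.execute("ALTER TABLE default.syslog MATERIALIZE INDEX hostname_bf")
client.execute("ALTER TABLE default.syslog MATERIALIZE INDEX raw_tokenbf")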
Hands-on query comparison

The test architecture consists of four parts:

- ES stack: composed of a single-node Elasticsearch container and a Kibana container. Elasticsearch is one of the systems under test, and Kibana serves as a verification and auxiliary tool. The deployment code is as follows:
version: '3.7'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - discovery.type=single-node
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
  kibana:
    container_name: kibana
    image: docker.elastic.co/kibana/kibana:7.4.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - 5601:5601
    depends_on:
      - elasticsearch

volumes:
  elasticsearch-data:
    driver: local
- Clickhouse stack: a single-node ClickHouse server container plus TabixUI as the ClickHouse client. The deployment code is as follows:
version: "3.7"
services:
  clickhouse:
    container_name: clickhouse
    image: yandex/clickhouse-server
    volumes:
      - ./data/config:/var/lib/clickhouse
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
      - "9004:9004"
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "localhost:8123/ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4096M
        reservations:
          memory: 4096M
  tabixui:
    container_name: tabixui
    image: spoonest/clickhouse-tabix-web-client
    environment:
      - CH_NAME=dev
      - CH_HOST=127.0.0.1:8123
      - CH_LOGIN=default
    ports:
      - "18080:80"
    depends_on:
      - clickhouse
    deploy:
      resources:
        limits:
          cpus: '0.1'
          memory: 128M
        reservations:
          memory: 128M
- Data import stack: data import uses vector, developed by Vector.dev; like fluentd, it enables flexible, pipeline-style data ingestion.
- Test control stack: test control is done from Jupyter, using the Python SDKs for ES and ClickHouse to run the query tests.
The syslog table is created in ClickHouse with the following schema:

CREATE TABLE default.syslog
(
    application String,
    hostname String,
    message String,
    mid String,
    pid String,
    priority Int16,
    raw String,
    timestamp DateTime('UTC'),
    version Int16
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY timestamp
TTL timestamp + toIntervalMonth(1);
The data-generation and import pipeline is defined in vector.toml as follows:

[sources.in]
type = "generator"
format = "syslog"
interval = 0.01
count = 100000

[transforms.clone_message]
type = "add_fields"
inputs = ["in"]
fields.raw = "{{ message }}"

[transforms.parser]
# General
type = "regex_parser"
inputs = ["clone_message"]
field = "message" # optional, default
patterns = ['^<(?P<priority>\d*)>(?P<version>\d) (?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z) (?P<hostname>\w+\.\w+) (?P<application>\w+) (?P<pid>\d+) (?P<mid>ID\d+) - (?P<message>.*)$']

[transforms.coercer]
type = "coercer"
inputs = ["parser"]
types.timestamp = "timestamp"
types.version = "int"
types.priority = "int"

[sinks.out_console]
# General
type = "console"
inputs = ["coercer"]
target = "stdout"

# Encoding
encoding.codec = "json"

[sinks.out_clickhouse]
host = "http://host.docker.internal:8123"
inputs = ["coercer"]
table = "syslog"
type = "clickhouse"
encoding.only_fields = ["application", "hostname", "message", "mid", "pid", "priority", "raw", "timestamp", "version"]
encoding.timestamp_format = "unix"

[sinks.out_es]
# General
type = "elasticsearch"
inputs = ["coercer"]
compression = "none"
endpoint = "http://host.docker.internal:9200"
index = "syslog-%F"

# Healthcheck
healthcheck.enabled = true
Here is a brief walkthrough of this pipeline:
- sources.in: generates simulated syslog data, 100,000 records at a 0.01-second interval.
- transforms.clone_message: copies the original message so that the extracted fields can be kept alongside the raw message.
- transforms.parser: uses a regular expression to extract the application, hostname, message, mid, pid, priority, timestamp, and version fields as defined by syslog.
- transforms.coercer: converts the data types.
- sinks.out_console: prints the generated data to the console for development and debugging.
- sinks.out_clickhouse: sends the generated data to ClickHouse.
- sinks.out_es: sends the generated data to ES.
Run the following Docker command to execute the pipeline:
docker run \
  -v $(mkfile_path)/vector.toml:/etc/vector/vector.toml:ro \
  -p 18383:8383 \
  timberio/vector:nightly-alpine
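Once the pipeline has run, a quick sanity check can confirm that both stores received the generated records. This is a rough sketch, assuming the ports exposed by the compose files above and the Python clients already mentioned:

from elasticsearch import Elasticsearch
from clickhouse_driver import Client

es = Elasticsearch("http://localhost:9200")
ch = Client(host="localhost", port=9000)

# Document count across the daily syslog-* indices on the ES side.
print("ES docs:", es.count(index="syslog-*")["count"])

# Row count in the ClickHouse table.
print("ClickHouse rows:", ch.execute("SELECT count() FROM default.syslog")[0][0])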
Return all records
# ES
{"query":{"match_all":{}}}

# Clickhouse
"SELECT * FROM syslog"
Match a single field
# ES
{"query":{"match":{"hostname":"for.org"}}}

# Clickhouse
"SELECT * FROM syslog WHERE hostname='for.org'"
Match multiple fields
# ES
{"query":{"multi_match":{"query":"up.com ahmadajmi","fields":["hostname","application"]}}}

# Clickhouse
"SELECT * FROM syslog WHERE hostname='up.com' OR application='ahmadajmi'"
Term lookup: find records whose field contains a specific word
# ES
{"query":{"term":{"message":"pretty"}}}

# Clickhouse
"SELECT * FROM syslog WHERE lowerUTF8(raw) LIKE '%pretty%'"
Range query: find records with version greater than or equal to 2
# ES
{"query":{"range":{"version":{"gte":2}}}}

# Clickhouse
"SELECT * FROM syslog WHERE version >= 2"
Existence query: find records where a given field exists
# ES
{"query":{"exists":{"field":"application"}}}

# Clickhouse
"SELECT * FROM syslog WHERE application is not NULL"
Regular expression query: find records matching a given regular expression
# ES
{"query":{"regexp":{"hostname":{"value":"up.*","flags":"ALL","max_determinized_states":10000,"rewrite":"constant_score"}}}}

# Clickhouse
"SELECT * FROM syslog WHERE match(hostname, 'up.*')"
Aggregation count: count the number of occurrences of a field
# ES
{"aggs":{"version_count":{"value_count":{"field":"version"}}}}

# Clickhouse
"SELECT count(version) FROM syslog"
Distinct aggregation: count the number of distinct values of a field
# ES
{"aggs":{"my-agg-name":{"cardinality":{"field":"priority"}}}}

# Clickhouse
"SELECT count(distinct(priority)) FROM syslog"
Using the Python SDKs, I ran each of the queries above 10 times on both stacks and collected the performance results.
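The timing harness looked roughly like the sketch below. It is not the original notebook, and the helper function and repetition constant are assumptions, but it follows the procedure described above (each query executed 10 times per stack):

import time
from statistics import mean

from elasticsearch import Elasticsearch
from clickhouse_driver import Client

es = Elasticsearch("http://localhost:9200")
ch = Client(host="localhost", port=9000)

REPEAT = 10  # each query is run 10 times per stack

def time_query(run):
    """Run a zero-argument query callable REPEAT times and return the latencies."""
    latencies = []
    for _ in range(REPEAT):
        start = time.perf_counter()
        run()
        latencies.append(time.perf_counter() - start)
    return latencies

# Example: the "match a single field" query from the comparison above.
es_latencies = time_query(lambda: es.search(
    index="syslog-*", body={"query": {"match": {"hostname": "for.org"}}}))
ch_latencies = time_query(lambda: ch.execute(
    "SELECT * FROM syslog WHERE hostname='for.org'"))

print(f"ES mean latency:         {mean(es_latencies):.4f}s")
print(f"ClickHouse mean latency: {mean(ch_latencies):.4f}s")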
We plot the distribution of response times for all queries:

The comparison of total query time is as follows:

The test data shows that ClickHouse clearly outperforms Elasticsearch on most of the queries. Even in scenarios common to search workloads, such as regex queries and term queries, it holds its own.
Summary
By testing a set of basic queries, this article compared the functionality and performance of ClickHouse and Elasticsearch. The results show that ClickHouse performs very well in these basic scenarios and outperforms ES, which helps explain why so many companies have switched from ES to ClickHouse.