DB engineering: ClickHouse: open-source, column oriented DB engine.
Disclaimer: Some technical details are outdated in this podcast, but the discussion is still very relevant.
- created by Yandex for equivalent of Google Analytics: OLAP use cases.
- columnar: compress data along columns. To add data: decompress, merge sort, compress. This for all columns.
Consequences of columnar design:
- does not like deletes: requires to decompress chunks, sometimes to copy big amount of data.
- inserting row by row is will have bad performance
- accessing single row is not effective
- batch load
- HTTP API, but tCP with typed column is better
- CLI (great)
- read from Kafka
- read from datalake
- read other DBs (MySQL, PostGre)
Fact table in ClickHouse OLAP, dimension table in SQL for easy update (OLTP).
- array management (nested columns)
- aggregation in real-time
- approximate unique (sketches, hyperloglog like)
- approximate joins
- SQL based ttl: allows to change storage type based on time
Easy thanks to schema evolution.
- partition key
- primary sort order
- load everything as string
- update types
- codex: compression
- low cardinality strings (lookup table for strings with few different values)
- store data in arrays
- materialize columns
- ETL to ELT paradigm
Easy to install, easy to manage (for a distributed OLAP DB).
- focus on infrastructure integration before performance
- bare metal » virtualized for performance
- works quite well on Kubernetes (requires some tweaking)
Security and access control and was not very mature in 2019.
Vision of the future:
OLAP and AI are two separate worlds: this will change. ML will be embedded into the DBs.
- keeps compute close to the data –> cost + performance
- same language for data gathering and processing –> less errors