Podcasting Pearl #4: Scale your analytics on the ClickHouse Data Warehouse.

CATHEU.TECH

Cyril de Catheu

Data Engineer

26 Apr 2021

ClickHouse: open-source, column-oriented DB engine.

Disclaimer: Some technical details are outdated in this podcast, but the discussion is still very relevant.

Principles:

created by Yandex for its equivalent of Google Analytics: OLAP use cases.
columnar: data is compressed along columns. To add data: decompress, merge sort, recompress. This is done for every column.
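
The decompress, merge sort, compress cycle can be sketched in a few lines of Python. This is only a toy model to show why each write touches every column, not how ClickHouse is implemented: the function names, the JSON serialization and the use of zlib are all assumptions made for the illustration.

```python
import heapq
import json
import zlib


def compress_column(values):
    """Serialize one column's values and compress them as a single chunk."""
    return zlib.compress(json.dumps(values).encode("utf-8"))


def decompress_column(chunk):
    """Inverse of compress_column."""
    return json.loads(zlib.decompress(chunk).decode("utf-8"))


def append_rows(table, new_rows, sort_key):
    """Add rows to a column-compressed table: decompress, merge sort, compress.

    `table` maps column name -> compressed chunk; rows are ordered by `sort_key`.
    """
    columns = list(table)
    # 1. Decompress every column and rebuild the existing rows.
    decoded = [decompress_column(table[c]) for c in columns]
    old_rows = [dict(zip(columns, vals)) for vals in zip(*decoded)]
    # 2. Merge two sorted runs (the existing rows are already sorted).
    incoming = sorted(new_rows, key=lambda r: r[sort_key])
    merged = list(heapq.merge(old_rows, incoming, key=lambda r: r[sort_key]))
    # 3. Recompress each column separately.
    return {c: compress_column([r[c] for r in merged]) for c in columns}


table = {
    "ts": compress_column([1, 3, 5]),
    "page": compress_column(["/home", "/home", "/docs"]),
}
table = append_rows(
    table, [{"ts": 4, "page": "/docs"}, {"ts": 2, "page": "/home"}], "ts"
)
print(decompress_column(table["ts"]))    # [1, 2, 3, 4, 5]
print(decompress_column(table["page"]))  # ['/home', '/home', '/home', '/docs', '/docs']
```

Even for two new rows, every column is decompressed and recompressed, which is what makes small writes expensive in this model.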

Consequences of columnar design:

does not like deletes: they require decompressing chunks and sometimes copying large amounts of data.
inserting row by row performs badly
accessing a single row is inefficient
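
Since row-by-row inserts are slow, a common workaround is to buffer rows on the client and write them in large batches. A minimal sketch, assuming a hypothetical `flush_fn` callable that performs one bulk insert (it is not part of any ClickHouse client API), with an arbitrary batch size:

```python
class BatchInserter:
    """Buffer rows in memory and hand them to `flush_fn` in large batches."""

    def __init__(self, flush_fn, batch_size=10_000):
        self.flush_fn = flush_fn      # hypothetical callable doing one bulk insert
        self.batch_size = batch_size  # arbitrary default, tune for the workload
        self.buffer = []

    def add(self, row):
        """Queue one row; flush automatically once the batch is full."""
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single batch."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


batches = []  # stands in for the database: each element is one bulk insert
inserter = BatchInserter(batches.append, batch_size=3)
for i in range(7):
    inserter.add({"id": i})
inserter.flush()  # do not forget the trailing partial batch
print([len(b) for b in batches])  # [3, 3, 1]
```

Seven rows become three writes instead of seven. A real setup would typically also flush on a timer so that rows do not wait indefinitely in the buffer.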

Inserting capabilities:

Fact table in ClickHouse OLAP, dimension table in SQL for easy update (OLTP).
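
The fact/dimension split can be illustrated with a small Python sketch. Here `sqlite3` stands in for the OLTP database and a plain list stands in for the ClickHouse fact table; the `product` table, its columns and the `revenue_by_product` helper are hypothetical names chosen for the example.

```python
import sqlite3
from collections import Counter

# Dimension table: small, mutable, kept in a row-oriented SQL store.
dims = sqlite3.connect(":memory:")
dims.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
dims.executemany("INSERT INTO product VALUES (?, ?)", [(1, "basic"), (2, "pro")])

# Fact table: append-only events that reference dimensions by id only.
facts = [
    {"product_id": 1, "amount": 10},
    {"product_id": 2, "amount": 30},
    {"product_id": 1, "amount": 5},
]


def revenue_by_product(facts, dims):
    """Aggregate the facts, then resolve ids to names from the dimension table."""
    totals = Counter()
    for fact in facts:
        totals[fact["product_id"]] += fact["amount"]
    names = dict(dims.execute("SELECT id, name FROM product"))
    return {names[pid]: total for pid, total in totals.items()}


print(revenue_by_product(facts, dims))  # {'basic': 15, 'pro': 30}

# Renaming a product is a one-row update; no fact row is rewritten.
dims.execute("UPDATE product SET name = 'starter' WHERE id = 1")
print(revenue_by_product(facts, dims))  # {'starter': 15, 'pro': 30}
```

The mutable attributes stay where updates are cheap, and the large fact table never needs to be rewritten.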

Nice features:

Modelling:
Easy thanks to schema evolution.

- partition key
- primary sort order
- load everything as string
- update types
- codecs: compression
- low cardinality strings (lookup table for strings with few different values)
- store data in arrays
- materialized columns
- ETL to ELT paradigm
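
The "low cardinality strings" item in the list above refers to a lookup table for strings with few distinct values. The idea is plain dictionary encoding, sketched below in Python; the function names are illustrative and this is not ClickHouse's actual storage format.

```python
def dictionary_encode(values):
    """Replace each string by a small integer index into a lookup table."""
    lookup = []     # index -> distinct string
    positions = {}  # distinct string -> index
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(lookup)
            lookup.append(value)
        codes.append(positions[value])
    return lookup, codes


def dictionary_decode(lookup, codes):
    """Rebuild the original strings from the lookup table and the codes."""
    return [lookup[code] for code in codes]


countries = ["FR", "US", "FR", "FR", "DE", "US"]
lookup, codes = dictionary_encode(countries)
print(lookup)  # ['FR', 'US', 'DE']
print(codes)   # [0, 1, 0, 0, 2, 1]
```

Each distinct string is stored once and the column itself only holds small integers, which is why the technique pays off when there are few distinct values.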

Installation:
Easy to install, easy to manage (for a distributed OLAP DB).

Security and access control were not very mature in 2019.

Vision of the future:
OLAP and AI are two separate worlds: this will change. ML will be embedded into the DBs.