Cyril de Catheu

Cyril de Catheu

Data Engineer

Newsletter

Podcasting Pearl #4: Scale your analytics on the Clickhouse Data Warehouse.

ClickHouse: open-source, column oriented DB engine.

Disclaimer: Some technical details are outdated in this podcast, but the discussion is still very relevant.

Principles:

  • created by Yandex for equivalent of Google Analytics: OLAP use cases.
  • columnar: compress data along columns. To add data: decompress, merge sort, compress. This for all columns.

Consequences of columnar design:

  • does not like deletes: requires to decompress chunks, sometimes to copy big amount of data.
  • inserting row by row is will have bad performance
  • accessing single row is not effective

Inserting capabilities:

  • batch load
  • HTTP API, but tCP with typed column is better
  • CLI (great)
  • read from Kafka
  • read from datalake
  • read other DBs (MySQL, PostGre)

Fact table in ClickHouse OLAP, dimension table in SQL for easy update (OLTP).

Nice features:

  • array management (nested columns)
  • aggregation in real-time
  • approximate unique (sketches, hyperloglog like)
  • approximate joins
  • SQL based ttl: allows to change storage type based on time

Modelling:
Easy thanks to schema evolution.

    • partition key
    • primary sort order
    • load everything as string
    • update types
    • codex: compression
    • low cardinality strings (lookup table for strings with few different values)
    • store data in arrays
    • materialize columns
    • ETL to ELT paradigm

Installation:
Easy to install, easy to manage (for a distributed OLAP DB).

  • focus on infrastructure integration before performance
  • bare metal » virtualized for performance
  • works quite well on Kubernetes (requires some tweaking)

Security and access control and was not very mature in 2019.

Vision of the future:
OLAP and AI are two separate worlds: this will change. ML will be embedded into the DBs.

  • keeps compute close to the data –> cost + performance
  • same language for data gathering and processing –> less errors
Subscribe to the newsletter