Real-Time Analytics with ClickHouse: Level 2
Course
About
Real-Time Analytics with ClickHouse – Level 2 is where you move from understanding how ClickHouse works to understanding how to use it well. Building on the architectural foundations of Level 1 — columnar storage, MergeTree, data parts, granules, and the sparse primary index — Level 2 focuses on four practical disciplines: data modeling, analytical query writing, joining data efficiently, and handling updates and deletes in an OLAP-first environment.
Across four modules (4–7), you will gain the skills to design production-quality schemas, write expressive sub-second analytical queries, choose the right join strategy, and manage mutable data in a system built for immutability. Passing the quizzes for all four modules earns you the ClickHouse Database Professional credential.
Module 4: Modeling Data
Every schema decision in ClickHouse is a performance decision that compounds across billions and trillions of rows. Module 4 gives you the vocabulary and judgment to model data the way ClickHouse is designed to store it.
The guiding principle is to always use the smallest, most precise type that accurately represents your data. ClickHouse provides some 66 native data types, each implemented directly in C++, and the right choice among them has a direct, measurable impact on storage, compression, and query speed at scale.
What You'll Learn:
- Use the smallest, most precise native data type for every column — this compounds significantly at scale
- Avoid Nullable unless the null/default distinction genuinely affects business logic
- Apply LowCardinality to string columns with fewer than ~10,000 unique values for dictionary encoding and storage savings
- Use Enum when values are known and stable; use LowCardinality when values are dynamic or unknown
- Prefer Decimal over Float for precision-sensitive numeric data
- Use the native JSON type for dynamic, schema-free document storage without sacrificing columnar performance
- Use MATERIALIZED columns to shift recurring computation from query time to insert time
- Understand that partitioning is for data lifecycle management, not query speed
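A hypothetical table sketching these modeling choices in one place (table, column, and value names are illustrative, not from the course):

```sql
-- Illustrative schema applying the Module 4 guidelines (names are hypothetical)
CREATE TABLE events
(
    event_time  DateTime,                               -- second precision suffices here
    event_date  Date MATERIALIZED toDate(event_time),   -- computed at insert time, not query time
    user_id     UInt64,
    country     LowCardinality(String),                 -- few unique values: dictionary-encoded
    status      Enum8('pending' = 1, 'shipped' = 2, 'delivered' = 3),  -- known, stable set
    price       Decimal(18, 2),                         -- exact arithmetic, not Float64
    note        String DEFAULT ''                       -- empty default instead of Nullable
)
ENGINE = MergeTree
ORDER BY (country, event_time)
PARTITION BY toYYYYMM(event_time);  -- partitioning for lifecycle management, not query speed
```

Note that every column gets the narrowest type that still represents the data exactly, and the one nullable-looking column (`note`) uses a default value instead of `Nullable`.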
Module 5: Analyzing Data
ClickHouse is standard SQL-compatible, so queries you've written for PostgreSQL will generally work without modification. But standard SQL is just the starting point. ClickHouse ships with over 1,700 built-in functions — growing at roughly 200 per year — and learning to use them is what will set you apart as a ClickHouse practitioner.
Functions fall into two categories: regular functions applied to every row, and aggregate functions that return one value per group. The date and time functions alone are worth mastering — bucketing and grouping by quarter, week, five-minute interval, or any other granularity is a single function call with no date arithmetic required.
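For example, bucketing a timestamp into five-minute intervals or calendar quarters is a single function call each (table and column names below are illustrative):

```sql
-- Five-minute buckets with no manual date arithmetic
SELECT
    toStartOfInterval(event_time, INTERVAL 5 MINUTE) AS bucket,
    count() AS events
FROM events
GROUP BY bucket
ORDER BY bucket;

-- Quarterly rollup via a different single-call bucketing function
SELECT
    toStartOfQuarter(event_time) AS quarter,
    count() AS events
FROM events
GROUP BY quarter;
```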
Beyond standard aggregates like sum() and avg(), ClickHouse offers powerful analytical functions like topK() for most-frequent values, approximate quantile() for fast percentile calculations, and argMax() / argMin() — which return the value of one column corresponding to the max or min of another, replacing correlated subqueries with a single parallelizable pass.
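A sketch of argMax() replacing a correlated "latest row per group" subquery, here finding each user's most recent page (hypothetical columns):

```sql
-- One parallelizable pass instead of a per-row correlated subquery
SELECT
    user_id,
    argMax(page, event_time) AS last_page,   -- page value at the maximum event_time
    max(event_time)          AS last_seen
FROM events
GROUP BY user_id;
```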
One of ClickHouse's most powerful and least-known features is aggregate function combinators: suffixes such as If, Array, and Map that can be appended to any of the 200+ aggregate functions to add conditional filtering or to handle complex column types. The resulting combinations aren't individually documented; you learn the system once and compose them freely.
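For instance, appending the If combinator turns any aggregate into a conditional one, with no CASE expressions required (illustrative columns):

```sql
-- sumIf / countIf / avgIf: conditional aggregation in a single pass
SELECT
    countIf(status = 'delivered')      AS delivered_orders,
    sumIf(price, country = 'US')       AS us_revenue,
    avgIf(price, status != 'pending')  AS avg_completed_price
FROM events;
```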
What You'll Learn:
- Write standard SQL queries in ClickHouse with confidence, and understand CTE behavior (CTEs re-execute for each reference — they are not memoized)
- Use date/time bucketing functions to simplify time-series analysis at any granularity
- Leverage argMax() / argMin() as fast, scalable alternatives to correlated subqueries
- Apply aggregate function combinators (If, Array, Map) to extend all 200+ aggregate functions with conditional and type-aware behavior
- Know when to use approximate vs. exact functions based on performance and precision requirements
- Create user-defined functions in SQL or external languages for custom analytical logic
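A SQL-defined UDF is a one-line lambda; this hypothetical example wraps a reusable pricing expression:

```sql
-- Lambda-style SQL UDF (name and logic are illustrative)
CREATE FUNCTION net_price AS (gross, tax_rate) -> gross / (1 + tax_rate);

-- Usable anywhere a regular function is
SELECT net_price(120.0, 0.20);
```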
Module 6: Joining Data
ClickHouse supports all standard SQL join types with identical syntax to other SQL databases. But at OLAP scale, the join algorithm matters far more than the syntax — it determines whether a query completes in milliseconds or runs out of memory entirely.
Join algorithms are controlled by the join_algorithm session setting and involve a trade-off between speed and memory. The default, parallel hash, is the fastest option for most joins but is memory-bound. Grace hash provides the same approach with disk spillover for a higher memory ceiling at some speed cost. Full sort merge eliminates hashing entirely and can be extremely fast when both tables are physically sorted by the join key. Partial merge is the most memory-conservative fallback for very large joins.
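Because the algorithm is a session setting, the same query can be retried under a different memory/speed trade-off without rewriting it (table names are illustrative):

```sql
-- Default: fastest, but the build-side hash table must fit in memory
SET join_algorithm = 'parallel_hash';

-- Retry a memory-heavy join with disk spillover instead
SET join_algorithm = 'grace_hash';

SELECT e.user_id, u.name
FROM events AS e
INNER JOIN users AS u ON e.user_id = u.id;
```

The other values mentioned above are 'full_sorting_merge' and 'partial_merge', selected the same way.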
For the fastest joins of all, dictionaries are ClickHouse's most powerful optimization. A dictionary is an in-memory key-value store that loads from a configured source: a ClickHouse table, PostgreSQL, MongoDB, an HTTP endpoint, and many others. It is replicated to every node in the cluster and refreshed automatically on a configurable interval. Lookups via dictGet() are essentially instantaneous, consistently outperforming hash-based joins. Dictionaries can also be used at insert time via MATERIALIZED columns to permanently embed enrichment values into stored rows, eliminating the join cost at query time entirely.
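A minimal dictionary sketch backed by a ClickHouse table (all names are hypothetical):

```sql
-- Dictionary sourced from a local table, refreshed every 5-6 minutes
CREATE DICTIONARY user_dim
(
    id   UInt64,
    name String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'users'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 360);

-- Enrichment lookup instead of a join
SELECT
    dictGet('user_dim', 'name', user_id) AS user_name,
    count() AS events
FROM events
GROUP BY user_name;
```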
What You'll Learn:
- Understand that join algorithm selection — not join syntax — is the critical performance decision in ClickHouse
- Choose between parallel hash, grace hash, and sort merge based on data size and memory constraints
- Build dictionaries to replace dimension-table joins with sub-millisecond key-value lookups using dictGet()
- Use dictionaries for both query-time and insert-time data enrichment via materialized columns
- Apply the Join table engine for static in-memory reference data that doesn't need periodic refresh
- Know when to rewrite joins as IN subqueries for better query planner optimization
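A sketch of the Join table engine for static in-memory reference data (names are hypothetical):

```sql
-- Right-hand side of ANY LEFT JOINs on `id`, held in memory and built once
CREATE TABLE currency_rates
(
    id   UInt64,
    rate Float64
)
ENGINE = Join(ANY, LEFT, id);

-- Direct keyed lookup without running a join at all
SELECT joinGet('currency_rates', 'rate', toUInt64(42));
```

Unlike a dictionary, a Join-engine table is populated by inserts and is not refreshed from a source on an interval, which is why it suits reference data that rarely changes.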
Module 7: Deleting and Updating Data
One of the most important realities of working with ClickHouse is that nothing happens immediately when you update or delete data. Because data parts are immutable, every update or delete is fundamentally a rewrite operation. Module 7 covers all available mechanisms and when to use each.
Mutations (ALTER TABLE ... UPDATE / DELETE) are the traditional approach. Each mutation is an asynchronous background operation that rewrites every part containing matching rows. Mutations are expensive, can take minutes to hours, and should be reserved for bulk operations aligned with table partitions rather than frequent row-level changes.
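A mutation sketch, scoped to the kind of bulk correction it is suited for (table and values are illustrative):

```sql
-- Asynchronous: returns immediately, rewrites affected parts in the background
ALTER TABLE events DELETE WHERE event_time < '2020-01-01';

ALTER TABLE events UPDATE status = 'shipped' WHERE status = 'pending';

-- Progress can be observed in the system table
SELECT command, is_done
FROM system.mutations
WHERE table = 'events';
```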
Lightweight deletes mark rows as deleted in a hidden bitmap, making them immediately invisible to queries without rewriting parts. Physical removal happens during future merges. This works well for low-volume corrections, but query overhead grows as the percentage of deleted rows increases.
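Lightweight deletes use standard DELETE syntax (illustrative names):

```sql
-- Rows become invisible to queries right away; data parts are not rewritten
DELETE FROM events WHERE user_id = 42;
```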
Lightweight updates (patch parts) are ClickHouse's most significant recent advancement for mutability — issued with standard UPDATE ... SET syntax, they create small patch parts containing only the changed columns and rows. Changes are immediately visible in queries and merged into the main data in the background. This is the go-to mechanism for frequent, targeted row corrections.
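Lightweight updates likewise use standard UPDATE syntax in recent ClickHouse versions (illustrative names):

```sql
-- Writes a small patch part containing only the touched rows and columns;
-- the change is visible immediately and merged into the main data later
UPDATE events
SET status = 'delivered'
WHERE user_id = 42 AND status = 'shipped';
```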
For workloads requiring continuous upserts — particularly ClickPipes CDC ingestion from PostgreSQL, MySQL, or MongoDB via PeerDB — ReplacingMergeTree is the standard engine. It deduplicates rows with the same sort key during background merges, keeping the latest version. Because deduplication is eventual, queries use the FINAL keyword or argMax() aggregation patterns to ensure correct results. CollapsingMergeTree offers an alternative sign-based approach (+1 / -1 rows) for workloads that need to update sort key values, at the cost of additional application-side complexity.
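An upsert sketch with ReplacingMergeTree, including both read-time patterns mentioned above (names are illustrative):

```sql
-- Rows sharing the same ORDER BY key are deduplicated during background merges;
-- the version column decides which row wins
CREATE TABLE users
(
    id         UInt64,
    name       String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id;

-- Deduplication is eventual, so force it at read time with FINAL...
SELECT * FROM users FINAL WHERE id = 42;

-- ...or use the argMax pattern, which often parallelizes better
SELECT id, argMax(name, updated_at) AS name
FROM users
GROUP BY id;
```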
What You'll Learn:
- Understand that all update and delete operations in ClickHouse are asynchronous — nothing physically changes data immediately
- Use mutations for bulk operations aligned with table partitions; avoid them for frequent row-level changes
- Apply lightweight deletes for low-volume row removal where immediate query visibility matters
- Use lightweight updates (patch parts) for frequent, targeted row corrections
- Build ReplacingMergeTree tables for CDC and upsert workloads; use FINAL or argMax() queries for correct, deduplicated results
- Understand CollapsingMergeTree sign-based row cancellation for state-tracking workloads that require sort key mutability
- Recognize that ClickPipes CDC connectors use ReplacingMergeTree as their destination engine
Passing the quizzes for all four Level 2 modules earns you the ClickHouse Database Professional credential.