Learning Path: Real-time Analytics with ClickHouse

Real-time Analytics with ClickHouse: Level 3



About

Real-Time Analytics with ClickHouse – Level 3 is where you move from writing fast queries to building systems that stay fast at any scale. Having mastered data modeling, analytical queries, joins, and updates in Levels 1 and 2, Level 3 focuses on the techniques and architectural patterns that power production ClickHouse deployments: pre-computing query results so dashboards respond in milliseconds regardless of data volume, scaling ClickHouse horizontally across clusters, and managing data lifecycle through compression and expiration policies.


Across three modules (8–10), you will gain the skills to build real-time dashboards backed by materialized views and projections, understand how ClickHouse scales across shards and replicas, and configure compression codecs and TTL rules to keep storage lean and queries fast over time. 


Completing the quizzes for all three modules makes you eligible to sit the ClickHouse Certified Developer exam.

Module 8: Query Acceleration Techniques

ClickHouse offers two fundamentally different types of materialized views. You will learn all about them in this module. 


Refreshable materialized views run a query against the entire source table on a configurable schedule and atomically replace the target table with the result — simple and effective when a small amount of staleness is acceptable. Incremental materialized views are insert triggers that execute only on newly inserted rows, making them the foundation of true real-time analytics. They never scan historical data and scale to trillions of rows without any increase in per-insert overhead. When combining incremental views with aggregations, the target table must use AggregatingMergeTree or SummingMergeTree to correctly merge partial states across inserts.
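As a minimal sketch of the incremental pattern described above (the table and column names — page_views, url, response_time — are illustrative, not from the course materials): the materialized view acts as an insert trigger that writes partial aggregation states into an AggregatingMergeTree target, and readers finalize them with the matching -Merge combinators.

```sql
-- Source table receiving raw inserts (hypothetical schema).
CREATE TABLE page_views
(
    url String,
    response_time UInt32,
    ts DateTime
)
ENGINE = MergeTree
ORDER BY (url, ts);

-- Target table holds partial aggregation states, merged in the background.
CREATE TABLE page_views_hourly
(
    url String,
    hour DateTime,
    views AggregateFunction(count),
    avg_rt AggregateFunction(avg, UInt32)
)
ENGINE = AggregatingMergeTree
ORDER BY (url, hour);

-- Insert trigger: runs only on newly inserted blocks, writing -State values.
CREATE MATERIALIZED VIEW page_views_mv TO page_views_hourly AS
SELECT
    url,
    toStartOfHour(ts) AS hour,
    countState() AS views,
    avgState(response_time) AS avg_rt
FROM page_views
GROUP BY url, hour;

-- Readers finalize partial states with the matching -Merge combinators.
SELECT url, hour, countMerge(views) AS views, avgMerge(avg_rt) AS avg_rt
FROM page_views_hourly
GROUP BY url, hour;
```

Because the view only ever sees the rows of the current insert, per-insert cost stays constant no matter how large page_views grows.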


Projections offer an alternative approach: rather than a separate table, a projection is stored inside each data part and automatically selected by the query analyzer at runtime — the query writer doesn't need to know it exists. They are ideal for re-sorting data by a different key or pre-aggregating results for a single table, but don't support joins or FINAL queries. Skipping indexes extend granule-skipping to non-primary-key columns using auxiliary structures like bloom filters and min/max ranges. The benefit depends entirely on selectivity — a skipping index is only worth its overhead if it can eliminate a meaningful percentage of granules for your most common query patterns.
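The two techniques above can be sketched on a hypothetical events table (the names events, url, and session_id are illustrative): a projection that re-sorts the same data by a different key, and a bloom-filter skipping index on a non-primary-key column.

```sql
-- Projection: the same rows stored inside each part, re-sorted by url.
-- The query analyzer picks it automatically when it helps.
ALTER TABLE events
    ADD PROJECTION by_url
    (
        SELECT * ORDER BY url
    );
ALTER TABLE events MATERIALIZE PROJECTION by_url;  -- build for existing parts

-- Skipping index: lets ClickHouse skip granules where session_id
-- cannot match, at a 1% false-positive rate.
ALTER TABLE events
    ADD INDEX idx_session session_id TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE events MATERIALIZE INDEX idx_session;
```

Whether the index earns its keep depends on selectivity: benchmark it against your most common filters before keeping it.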

What You'll Learn

  • Understand the difference between refreshable and incremental materialized views and when to use each
  • Build incremental materialized views as real-time insert triggers that scale to any data volume
  • Use AggregatingMergeTree and SummingMergeTree as targets for aggregation-based incremental views
  • Apply the State and Merge function combinators when writing to and reading from AggregatingMergeTree tables
  • Define projections to pre-sort or pre-aggregate data at the part level for transparent query acceleration
  • Add skipping indexes (min/max, bloom filter, set) to enable granule skipping on non-primary-key columns
  • Understand when to choose projections vs. materialized views based on query patterns and data update behavior

Module 9: Sharding and Replication

Sharding splits a table's data across multiple nodes when it exceeds the capacity of a single machine. Each shard operates independently, and a Distributed table acts as a proxy that routes queries and inserts across all shards. Replication provides redundancy and additional query throughput. Every MergeTree engine has a replicated variant, coordinated by ClickHouse Keeper, ClickHouse's own C++ implementation of the ZooKeeper coordination protocol. In self-managed deployments, each replicated table requires unique shard and replica names configured via XML macros, and the ON CLUSTER syntax propagates DDL across every node automatically.
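A minimal sketch of the self-managed setup described above, assuming a cluster named my_cluster defined in configuration and the standard {shard} and {replica} macros on each node (the table names and the default database are illustrative):

```sql
-- Replicated local table on every node; the Keeper path and replica name
-- are filled in per node from the XML macros.
CREATE TABLE events_local ON CLUSTER my_cluster
(
    id UInt64,
    ts DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY (id, ts);

-- Distributed proxy table: routes queries and inserts across all shards,
-- here spreading inserts randomly.
CREATE TABLE events ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());
```

Queries hit the Distributed table; each shard scans its own slice in parallel and the initiating node merges the results.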


In ClickHouse Cloud, both sharding and data-level replication are replaced by SharedMergeTree — a cloud-native engine that separates compute from storage entirely. All data lives in object storage (S3, GCS, or Azure Blob) as the single source of truth. Compute replicas are stateless nodes that cache parts locally and can be scaled up or down instantly without any data movement. This means no sharding complexity, no storage multiplication, and fully independent scaling of read and write capacity.

What You'll Learn

  • Understand the difference between shards (horizontal data partitioning) and replicas (redundancy and additional compute)
  • Know when sharding is necessary and how the Distributed table engine routes queries and inserts across shards
  • Use ReplicatedMergeTree and its variants for self-managed high-availability deployments
  • Understand how ClickHouse Keeper coordinates replication metadata and ensures eventual consistency
  • Configure shard and replica names via XML macros and issue cluster-wide DDL with ON CLUSTER
  • Recognize that ClickHouse Cloud's SharedMergeTree eliminates sharding and data-level replication through separation of compute and storage
  • Understand how SharedMergeTree enables independent scaling of compute and storage without data duplication

Module 10: Managing Data

ClickHouse uses LZ4 as its default compression algorithm (ClickHouse Cloud defaults to ZSTD), and you can go further by applying column-level codecs that exploit the statistical properties of specific data. Delta stores differences between successive values rather than the values themselves — ideal for monotonically increasing integers and timestamps. DoubleDelta extends this to the deltas of the deltas, which suits sequences with predictable step sizes, and Gorilla targets floating-point time series. Codecs can be stacked (e.g., CODEC(Delta, ZSTD(1))) for compounding gains.
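The codec choices above can be sketched in a single table definition (the metrics table and its columns are illustrative, chosen to match each codec's sweet spot):

```sql
CREATE TABLE metrics
(
    ts DateTime CODEC(Delta, ZSTD(1)),           -- small deltas between timestamps
    counter UInt64 CODEC(DoubleDelta, ZSTD(1)),  -- monotonic counter, steady step
    temperature Float64 CODEC(Gorilla, ZSTD(1)), -- slowly changing float series
    note String CODEC(ZSTD(3))                   -- general-purpose text
)
ENGINE = MergeTree
ORDER BY ts;
```

Inspecting system.columns (compressed vs. uncompressed bytes) before and after a codec change is the quickest way to verify the gain is real for your data.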


Time-to-Live (TTL) automates data lifecycle by defining what happens when a timestamp-based expression expires: rows are deleted, columns are dropped, or data is migrated to a different storage volume. TTL rules run on a background merge schedule and are not applied immediately. In self-managed deployments, TTL also powers hot/warm/cold storage tiering by automatically migrating aging parts to slower, cheaper volumes. In ClickHouse Cloud, tiering is not needed — all data is object-storage-backed from day one.
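All three TTL actions can be combined in one table definition. A sketch, assuming a self-managed storage policy named tiered that includes a volume named cold (the logs table and its columns are illustrative):

```sql
CREATE TABLE logs
(
    ts DateTime,
    message String,
    debug_info String TTL ts + INTERVAL 7 DAY   -- column TTL: dropped after a week
)
ENGINE = MergeTree
ORDER BY ts
TTL ts + INTERVAL 30 DAY TO VOLUME 'cold',      -- migrate aging parts (self-managed)
    ts + INTERVAL 90 DAY DELETE                 -- expire rows entirely
SETTINGS storage_policy = 'tiered',
         ttl_only_drop_parts = 1;               -- drop whole parts, avoid mutations
```

Remember that TTL is enforced during background merges, so expired data may linger briefly after its expression passes.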

What You'll Learn

  • Choose between ZSTD and LZ4 based on the compression-vs-speed trade-off for your workload
  • Apply column-level compression codecs (Delta, DoubleDelta, Gorilla) to exploit data structure for better compression
  • Understand that codec selection always warrants benchmarking — compression gains can impact query decompression speed
  • Define table-level TTL rules to automatically expire and delete data based on a timestamp expression
  • Use ttl_only_drop_parts = 1 to ensure TTL deletion drops entire parts rather than triggering expensive mutations
  • Apply column-level TTL to drop specific columns as data ages, reducing storage for rarely queried historical fields
  • Configure storage tiering with TTL TO VOLUME rules to automatically migrate aging data to cheaper storage in self-managed deployments


After finishing all three modules, you will have covered every topic in the ClickHouse Certified Developer exam objectives — from data ingestion and schema design through query optimization, joins, updates, cluster architecture, and data lifecycle management. The certification exam is a hands-on, performance-based assessment delivered via HackerRank, where you complete practical tasks in a real ClickHouse environment. 


A discount code for the exam fee is available — reach out to the ClickHouse training team to request one.