Learning Path: Data Warehousing

Data Warehousing with ClickHouse: Level 2



Level Overview

Level 2 builds on your foundational ClickHouse knowledge and takes you deep into production data engineering workflows. You'll learn how to move data efficiently into ClickHouse Cloud — whether through batch loads, asynchronous inserts, or continuous streaming pipelines — and how to keep that data accurate, deduplicated, and trustworthy over time. This level teaches you how to architect, maintain, and govern an enterprise-grade OLAP data platform built for demanding workloads.

 

Across four modules, you'll learn strategies for data ingestion and management, how to deduplicate data and handle updates and deletes, how to structure your data using medallion architecture, and how to ensure data quality and governance, all with ClickHouse.

 

Prerequisites: We recommend completing the Real-time Analytics with ClickHouse: Level 1 training first.

Module 4: Data Ingestion and Processing

ClickHouse is engineered for high-throughput data ingestion, and understanding how to move data into it efficiently is essential to unlocking sub-second query performance. This module covers the full spectrum of ingestion patterns available in ClickHouse — from ad hoc table functions and integration proxy table engines, to structured batch loading strategies and event-driven real-time pipelines. You'll learn when and why to move data from external sources into native ClickHouse storage, and how choosing the right ingestion method directly impacts query performance, storage efficiency, and operational overhead across sources like Amazon S3, Apache Kafka, MongoDB, MySQL, Azure Blob Storage, and Iceberg open table formats.

 

You'll also explore the power of materialized views as a core data transformation primitive in ClickHouse. Incremental materialized views act as insert-time triggers, continuously processing new data as it arrives for real-time analytics workloads. Refreshable materialized views enable full periodic recomputation — ideal for complex aggregations, joins, or datasets with frequent updates. Together with ClickPipes, ClickHouse's managed integration platform designed for the most demanding ingest workloads, these tools give you everything you need to build scalable, low-latency data pipelines directly in ClickHouse Cloud without external ETL tools.
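As a sketch of the two flavors described above (table and column names here are illustrative, not from the course):

```sql
-- Incremental MV: acts as an insert-time trigger on the source table.
-- Every block inserted into `events` is aggregated into `events_per_minute`.
CREATE MATERIALIZED VIEW events_per_minute_mv TO events_per_minute AS
SELECT toStartOfMinute(ts) AS minute, count() AS views
FROM events
GROUP BY minute;

-- Refreshable MV: fully recomputes its result on a schedule instead.
CREATE MATERIALIZED VIEW daily_uniques
REFRESH EVERY 1 HOUR
ENGINE = MergeTree ORDER BY day AS
SELECT toDate(ts) AS day, uniq(user_id) AS users
FROM events
GROUP BY day;
```

The incremental view only ever sees newly inserted blocks, which keeps it cheap; the refreshable view re-reads the source on each refresh, which makes it suitable for joins and late-arriving updates.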

 

What You'll Learn

  • The difference between table functions (ad hoc query-time access) and integration table engines (persistent proxy tables)
  • How to perform full batch reloads and incremental loads, and when each strategy is most appropriate
  • How asynchronous inserts work and when to use them
  • How to configure ClickPipes for continuous, managed data ingestion from Apache Kafka, Amazon S3, Redpanda, Postgres, MySQL, and more
  • The difference between regular views (query-time execution) and materialized views (insert-time execution), and the right use case for each
  • How to define, populate, and backfill incremental materialized views
  • How to configure refreshable materialized views with scheduled refresh intervals and monitor them
  • When to use incremental vs. refreshable materialized views based on update frequency, join complexity, and real-time requirements
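For instance, asynchronous inserts are enabled per query (or per user profile) via settings; a minimal sketch, with an illustrative table name:

```sql
-- With async_insert = 1 the server buffers many small inserts and flushes
-- them together as one part; wait_for_async_insert = 1 makes the client
-- block until the buffer is actually flushed and durable.
INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (now(), 'page_view', 42);
```

This trades a little insert latency for far fewer, larger parts, which is exactly what the MergeTree storage model favors.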

Module 5: Updating and Deleting Data

Because ClickHouse is a columnar database purpose-built for fast inserts, updates and deletes work differently than in row-oriented OLTP systems. This module explains why mutations in ClickHouse are asynchronous by design — rooted in the immutable, part-based storage model that makes ClickHouse so fast at insertion and compression — and how to choose between heavy ALTER TABLE mutations and lightweight operations based on your use case. Next, you'll learn how to deduplicate your data in ClickHouse, exploring the three specialized MergeTree table engines that power this pattern: ReplacingMergeTree, CollapsingMergeTree, and VersionedCollapsingMergeTree.
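A minimal sketch of the most common of these engines, ReplacingMergeTree (schema and names are illustrative):

```sql
-- ReplacingMergeTree keeps the row with the highest version column
-- (updated_at) per sorting key during background merges.
CREATE TABLE user_profiles
(
    user_id    UInt64,
    email      String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

-- FINAL forces deduplication at query time, before merges have caught up.
SELECT * FROM user_profiles FINAL WHERE user_id = 42;
```

Inserting a new row with an existing `user_id` is thus an "update": the older version disappears at merge time, or at query time under FINAL.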

 

What You'll Learn

  • Why mutations in ClickHouse are asynchronous and how to monitor them via the mutation queue
  • The difference between heavy ALTER TABLE mutations (ideal for large batch changes) and lightweight updates and deletes (preferred for most use cases)
  • How lightweight deletes use hidden row-existence masks and how lightweight updates use patch parts to make changes immediately visible at query time
  • When to use ReplacingMergeTree, CollapsingMergeTree, and VersionedCollapsingMergeTree for deduplication based on update frequency, delete volume, and concurrency requirements
  • Query strategies for deduplicated tables — including aggregation and max() alternatives to FINAL
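Two of the bullets above can be sketched concretely, assuming the illustrative `user_profiles` table keyed by `user_id` with an `updated_at` version column:

```sql
-- Lightweight delete: rows are masked immediately via a hidden
-- row-existence column and physically removed by later merges.
DELETE FROM user_profiles WHERE user_id = 42;

-- Aggregation alternative to FINAL: pick the latest value per key
-- instead of forcing a merge at read time.
SELECT user_id, argMax(email, updated_at) AS email
FROM user_profiles
GROUP BY user_id;
```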

 

Module 6: Medallion Architecture 

This module introduces medallion architecture as the recommended data modeling pattern for ClickHouse Cloud, explaining why denormalized, flat tables with intentional field duplication outperform the join-heavy approaches of anchor modeling or data vault. You'll learn how to structure bronze, silver, and gold layers and why ClickHouse's schema-on-write model makes upfront structural choices more consequential than in schema-on-read warehouses.

You'll also develop a working understanding of joins in ClickHouse — how they differ fundamentally from joins in row-oriented databases, and how to select the right join algorithm for your workload. ClickHouse supports several strategies, including hash joins, direct joins, and sort-merge joins.
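The algorithm can be selected per query or per settings profile; a small sketch:

```sql
-- Choose the join algorithm for this session or query.
-- Other values include 'hash', 'grace_hash', 'direct',
-- and 'full_sorting_merge' (the sort-merge strategy).
SET join_algorithm = 'parallel_hash';
```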

 

You'll then see how to chain materialized views across layers to build fully automated, end-to-end data pipelines — from raw ingestion in the bronze layer, through cleaning and deduplication in the silver layer, all the way to business-ready, pre-aggregated OLAP tables in the gold layer powering BI tools, dashboards, and AI-driven analytics.
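A minimal sketch of such a chain, with illustrative table names for each layer:

```sql
-- bronze -> silver: clean and normalize as data arrives
CREATE MATERIALIZED VIEW bronze_to_silver TO silver_events AS
SELECT lower(event_type) AS event_type, ts, user_id
FROM bronze_events
WHERE user_id != 0;

-- silver -> gold: pre-aggregate for dashboards and BI tools
CREATE MATERIALIZED VIEW silver_to_gold TO gold_daily_events AS
SELECT toDate(ts) AS day, event_type, count() AS events
FROM silver_events
GROUP BY day, event_type;
```

Because each materialized view fires on inserts into its source table, a single insert into the bronze layer propagates through the whole chain automatically.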

 

What You'll Learn

  • Why medallion architecture is the recommended data warehouse structure for ClickHouse Cloud
  • Why ClickHouse favors denormalized, wide flat tables over normalized, join-heavy schemas
  • How to design bronze layer tables optimized for fast bulk inserts, historical archiving, and downstream deduplication
  • How to populate silver layer tables using incremental materialized views for real-time cleaning, schema normalization, and deduplication
  • How to build gold layer tables using refreshable materialized views for complex joins, pre-aggregated metrics, and BI-ready datasets
  • The difference between hash, parallel hash, grace hash, direct, and sort-merge join algorithms — and when to use each
  • How to specify join algorithms
  • How materialized view chaining drives fully automated, multi-layer data pipelines in ClickHouse

Module 7: Data Quality Assurance and Governance

Reliable real-time analytics begins with trustworthy data, and this module equips you with ClickHouse-native tools to enforce data quality and control access at every layer of your pipeline. You'll learn how to implement schema validation through materialized views — catching and filtering invalid data at write time rather than query time. You'll also learn how to perform zero-downtime schema migrations and structural changes to production tables using the EXCHANGE TABLES command, enabling blue-green deployment patterns that protect downstream consumers and eliminate service interruptions.
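The blue-green swap can be sketched as follows (table names and schema are illustrative):

```sql
-- Build the new schema side by side with the live table.
CREATE TABLE events_v2
(
    ts         DateTime,
    event_type LowCardinality(String),
    user_id    UInt64
)
ENGINE = MergeTree ORDER BY (event_type, ts);

-- Backfill from the current table.
INSERT INTO events_v2 SELECT ts, event_type, user_id FROM events;

-- Swap the two tables atomically; readers never see a missing table.
EXCHANGE TABLES events AND events_v2;
```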

 

On the governance side, you'll implement role-based access control (RBAC) using ClickHouse's four core security entities: user accounts, roles, settings profiles, and quotas. From restricting raw data access in the bronze layer to applying column-level and row-level security policies on gold layer marts, you'll learn how to ensure the right data reaches the right people. The module also covers data masking and obfuscation techniques — including ClickHouse's built-in obfuscation tool and masking views — for protecting personally identifiable information (PII) and meeting compliance requirements in regulated industries.
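For example, a role confined to the gold layer and to one region's rows might look like this (names and the `region` condition are illustrative):

```sql
-- A role that can read only the gold layer.
CREATE ROLE analyst;
GRANT SELECT ON gold.* TO analyst;

-- Row-level security: analysts see only EMEA rows in gold.sales.
CREATE ROW POLICY emea_only ON gold.sales
    FOR SELECT USING region = 'EMEA' TO analyst;

-- A user who assumes that role by default.
CREATE USER alice IDENTIFIED WITH sha256_password BY 'change-me'
    DEFAULT ROLE analyst;
```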

 

What You'll Learn

  • How to enforce data quality at insert time using materialized views with validation filters and constraints
  • How to perform zero-downtime structural changes using the EXCHANGE TABLES command with blue-green deployment patterns
  • How to create and manage users, roles, settings profiles, and quotas to govern resource usage across your ClickHouse Cloud service
  • How to grant and restrict privileges at the database and table level across bronze, silver, and gold layers using RBAC
  • How to enforce column-level security by excluding sensitive columns from role-specific GRANT SELECT statements
  • How to enforce row-level security using ClickHouse row policies to restrict which rows different roles can read
  • How to implement data masking with masking views and role-conditional encryption, and how to use ClickHouse's built-in obfuscation tool to anonymize data while preserving statistical distributions for testing and compliance
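A masking view can be sketched like this (table, columns, and the redaction rule are illustrative):

```sql
-- Expose only redacted PII through a view, then grant access to the
-- view rather than to the underlying table.
CREATE VIEW customers_masked AS
SELECT
    customer_id,
    concat(substring(email, 1, 2), '***') AS email
FROM customers;

CREATE ROLE support;
GRANT SELECT ON customers_masked TO support;
```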