Interviews & Assessment Centres

HSBC UK Data Engineer Interview: Enterprise Data Warehouse Architecture Design


4 min read
Anonymous Candidate

2025 HSBC UK Data Engineer Interviewee

Summary

A system design interview experience from HSBC Data Engineering, featuring a modern cloud-native Lakehouse architecture to replace legacy Teradata systems.


Case Background

The system design question in HSBC's Data Engineering (DE) final-round interview was to design a modern data warehouse solution supporting bank-wide analytics and reporting. The background given by the interviewer: the existing data warehouse runs on a traditional Teradata system with an outdated technology stack, poor scalability, and high costs, and it can no longer keep up with growing business demands.


My solution was a cloud-native Lakehouse architecture.


Core Architecture Components

1. Data Lake

I would choose a cloud object storage service like AWS S3 or Azure Data Lake Storage (ADLS) as our unified data lake. All raw data from source systems (transaction systems, CRM, core banking) would be stored in the data lake in their original format—whether structured (CSV, Parquet), semi-structured (JSON, XML), or unstructured (PDF, audio). This ensures our 'Single Source of Truth.'

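To make the landing step concrete, here is a minimal sketch of copying one raw extract into the lake unchanged. The bucket name, key layout, and file name are illustrative assumptions, not details from the interview.

```python
# Minimal sketch: land a raw CRM extract in S3 exactly as received.
# Bucket name and key layout are illustrative assumptions.
from datetime import date

import boto3

s3 = boto3.client("s3")

source_file = "crm_customers_20250101.json"  # raw extract from the source system
landing_key = f"landing/crm/ingest_date={date.today():%Y-%m-%d}/{source_file}"

# Upload the file unchanged; no parsing or cleaning happens at this stage.
s3.upload_file(Filename=source_file, Bucket="bank-data-lake", Key=landing_key)
```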

2. Data Processing & Transformation (ETL/ELT)

I would use Apache Spark as our primary data processing engine. I'd adopt an ELT (Extract, Load, Transform) pattern instead of traditional ETL—first load raw data unchanged into the data lake, then perform large-scale parallel data transformations on the data lake.


  • Data Ingestion: Use tools like Apache NiFi or Airbyte to pull data from the various source systems, in real time or in batches.


  • Data Transformation: Use Spark SQL or PySpark for data cleaning, transformation, and aggregation. I'd use dbt (Data Build Tool) to manage and schedule these transformation tasks, ensuring version control and repeatability (see the sketch after this list).

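A minimal PySpark sketch of the ELT pattern, assuming illustrative lake paths and column names (raw_transactions, amount, txn_timestamp); in practice the transformation would live in a dbt model or an orchestrated job rather than a standalone script.

```python
# ELT sketch with PySpark: raw data is already loaded into the lake ("L"),
# and the transformation ("T") runs on top of it. Paths/columns are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-transform").getOrCreate()

# The load step already happened: raw JSON sits in the lake untouched.
raw = spark.read.json("s3://bank-data-lake/landing/transactions/")
raw.createOrReplaceTempView("raw_transactions")

# Transformation expressed in Spark SQL: cleaning plus a daily aggregation.
daily_totals = spark.sql("""
    SELECT customer_id,
           to_date(txn_timestamp) AS txn_date,
           SUM(amount)            AS total_amount
    FROM raw_transactions
    WHERE amount IS NOT NULL
    GROUP BY customer_id, to_date(txn_timestamp)
""")

daily_totals.write.mode("overwrite").parquet(
    "s3://bank-data-lake/curated/daily_transaction_totals/"
)
```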

3. Data Warehouse

I would choose a cloud-based data warehouse supporting Lakehouse architecture, like Databricks or Snowflake. These modern data warehouses can directly query data stored in the data lake (S3, ADLS), achieving compute-storage separation with high elasticity and cost-effectiveness.

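For example, with a Spark-based engine such as Databricks, files in the lake can be queried in place instead of being copied into a separate warehouse store. The S3 path below is an assumed example.

```python
# Sketch: querying lake files in place (compute/storage separation).
# The S3 path is an illustrative assumption.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-query").getOrCreate()

# Spark SQL can address Parquet files in object storage directly,
# so the engine reads the lake rather than a private warehouse copy.
report = spark.sql("""
    SELECT txn_date, SUM(total_amount) AS bank_wide_total
    FROM parquet.`s3://bank-data-lake/curated/daily_transaction_totals/`
    GROUP BY txn_date
    ORDER BY txn_date
""")
report.show()
```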

Data Layering (Bronze-Silver-Gold)

Within the data lake, I would organize data into three layers based on processing depth (a short sketch follows the list):


  • Bronze Layer: Stores the rawest, unprocessed data. This is our data 'landing zone.'


  • Silver Layer: Stores cleaned, deduplicated, and standardized data. For example, unifying all time fields to UTC format; standardizing all customer IDs to a single format.


  • Gold Layer: Stores highly aggregated and business-modeled data. This is the final data we provide to data analysts, data scientists, and business reports—typically Star Schema or Snowflake Schema dimensional and fact tables.

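A compressed PySpark sketch of the three layers, assuming a Delta Lake environment and illustrative paths, column names, and table names:

```python
# Bronze -> Silver -> Gold sketch (Delta Lake assumed; paths/columns illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()
lake = "s3://bank-data-lake"

# Bronze: raw landing data, stored exactly as received.
bronze = spark.read.json(f"{lake}/bronze/transactions/")

# Silver: cleaned, deduplicated, standardised (e.g. timestamps converted to UTC).
silver = (
    bronze.dropDuplicates(["txn_id"])
          .withColumn("txn_ts_utc", F.to_utc_timestamp("txn_timestamp", "Europe/London"))
          .filter(F.col("customer_id").isNotNull())
)
silver.write.format("delta").mode("overwrite").save(f"{lake}/silver/transactions/")

# Gold: business-modeled aggregate ready for reporting (a simple fact table).
gold = (
    silver.groupBy("customer_id", F.to_date("txn_ts_utc").alias("txn_date"))
          .agg(F.sum("amount").alias("total_amount"),
               F.count("*").alias("txn_count"))
)
gold.write.format("delta").mode("overwrite").save(f"{lake}/gold/fact_daily_customer_spend/")
```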

Data Governance

To ensure data quality, security, and compliance, I would also introduce a data governance framework:


  • Data Catalog: Use tools like Apache Atlas or Alation to automatically scan and index all our data assets, providing a searchable data map for the entire company.


  • Data Lineage: Track the complete transformation path from source systems to final reports. When report data has issues, we can quickly trace back to where the problem occurred.


  • Access Control: Role-Based Access Control (RBAC) ensures only authorized users can access sensitive data (see the sketch after this list).

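As one possible way to express RBAC, the sketch below grants read access on a Gold table to an analyst group, assuming a Databricks-style SQL GRANT model and hypothetical table and group names; the same idea maps onto Snowflake roles or AWS Lake Formation permissions.

```python
# Sketch: role-based access control expressed as SQL grants (Databricks-style).
# Schema, table, and group names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rbac-grants").getOrCreate()

# Analysts may read the curated Gold table...
spark.sql("GRANT SELECT ON TABLE gold.fact_daily_customer_spend TO `data_analysts`")

# ...while only the engineering role may modify it.
spark.sql("GRANT MODIFY ON TABLE gold.fact_daily_customer_spend TO `data_engineers`")
```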

Key Takeaways

Throughout the interview, I felt the HSBC DE role demands a very broad architectural vision. You need to be able to design a robust, scalable, and secure data platform that supports the entire bank's 'data-driven' transformation. You need to think like a 'city planner' designing the bank's entire 'data highway.'

