HSBC UK Data Scientist Interview: Real-Time Fraud Detection System Design

26 December 2024

Anonymous Candidate

2025 HSBC UK Data Scientist Interviewee

Summary

An account of a Data Scientist interview at HSBC UK, walking through a hybrid machine learning design for credit card fraud detection.

Case Background

The case in my HSBC DS final interview was to design a real-time credit card fraud detection system. The interviewer emphasized that the system must process millions of transactions per day, meet very strict requirements on both latency and accuracy, and adapt to constantly changing fraud patterns.

My approach was a hybrid design that puts a rule engine in front of two models, one supervised and one unsupervised:

Layer 1: Rule Engine

Before any model is involved, I would use a rule engine to filter out the most obvious fraud patterns, the ones that don't need machine learning at all. These rules are typically written by experienced fraud analysts, for example:

A transaction amount exceeds 10 times the card's average transaction amount over the past 3 months.
A card has transactions in two cities more than 1000 kilometers apart within 1 hour.
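
A minimal sketch of how these two rules could be expressed in Python. The `Txn` record, the function names, and the use of latitude/longitude with a great-circle (haversine) distance are assumptions made for illustration; the thresholds (10x, 1000 km, 1 hour) are the ones quoted above.

```python
import math
from dataclasses import dataclass


@dataclass
class Txn:
    # Hypothetical transaction record; field names are illustrative.
    card_id: str
    amount: float
    timestamp: float  # seconds since epoch
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def amount_rule(txn, avg_amount_3m, multiplier=10.0):
    """Rule 1: amount exceeds `multiplier` times the card's 3-month average."""
    return avg_amount_3m > 0 and txn.amount > multiplier * avg_amount_3m


def velocity_rule(txn, previous, max_km=1000.0, window_s=3600.0):
    """Rule 2: two transactions more than `max_km` apart within `window_s` seconds."""
    if previous is None:
        return False
    within_window = abs(txn.timestamp - previous.timestamp) <= window_s
    far_apart = haversine_km(txn.lat, txn.lon, previous.lat, previous.lon) > max_km
    return within_window and far_apart


def triggered_rules(txn, avg_amount_3m, previous):
    """Return the names of all rules the transaction triggers."""
    hits = []
    if amount_rule(txn, avg_amount_3m):
        hits.append("amount_10x_avg")
    if velocity_rule(txn, previous):
        hits.append("impossible_travel")
    return hits
```

Returning rule names rather than a single boolean keeps the reason for a block visible to the analysts who maintain the rules.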

Layer 2: Supervised Learning Model

For transactions that pass the rule engine, I would use a supervised learning model to predict the probability of fraud. I chose XGBoost because it typically performs well on tabular data.

Feature Engineering

I would construct two categories of features:

Transaction-level features: transaction amount, transaction time, Merchant Category Code (MCC), transaction location, etc.
User-level features: aggregates built from the user's transaction history, such as 'number of transactions in the past 24 hours', 'average transaction amount over the past 7 days', and 'most frequently used merchant category'. These help the model capture each user's 'normal' spending pattern.
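
The user-level aggregates above could be computed roughly as follows. This is a sketch under assumptions: the `(timestamp, amount, mcc)` tuple layout, the function name `user_features`, and the feature names are all hypothetical.

```python
from collections import Counter

DAY = 24 * 3600  # seconds in a day


def user_features(history, now):
    """Aggregate a user's past transactions into model features.

    `history` is a list of (timestamp_s, amount, mcc) tuples. Only
    transactions strictly before `now` are counted, so the transaction
    being scored never leaks into its own features.
    """
    past = [h for h in history if h[0] < now]
    last_24h = [h for h in past if h[0] >= now - DAY]
    last_7d = [h for h in past if h[0] >= now - 7 * DAY]
    mcc_counts = Counter(mcc for _, _, mcc in past)
    return {
        "txn_count_24h": len(last_24h),
        "avg_amount_7d": sum(a for _, a, _ in last_7d) / len(last_7d) if last_7d else 0.0,
        "top_mcc": mcc_counts.most_common(1)[0][0] if mcc_counts else None,
    }
```

In a real-time setting these values would more likely be maintained incrementally and read from a store at scoring time, rather than recomputed from the full history on every request.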

Handling Class Imbalance

In fraud detection, fraud samples (the positive class) are typically far fewer than normal samples (the negative class). To address this, I proposed oversampling the minority class with SMOTE (Synthetic Minority Over-sampling Technique).
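
To make the idea concrete, here is a simplified, standard-library sketch of SMOTE-style oversampling: each synthetic point is an interpolation between a minority sample and one of its k nearest minority neighbours. The function name `smote_like` and its parameters are assumptions for illustration; in practice one would more likely use an existing implementation such as the `SMOTE` class in the imbalanced-learn package.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic minority samples by interpolating between a
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Oversampling of this kind is normally applied to the training split only, so that validation and test data keep the real class ratio.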

Layer 3: Unsupervised Learning Model

A supervised model learns from labeled history, so it mainly catches 'known' fraud patterns. Against new fraud methods that never appear in the training data, it is largely helpless. I therefore added an unsupervised learning model as a supplement.

Model Choice: I chose the Isolation Forest algorithm. It needs no labels: it finds 'outlier' data points by randomly partitioning the feature space, and points that are isolated after only a few splits are treated as anomalous. In our scenario, these outliers are likely candidates for new types of fraud.
Application: I would periodically (e.g., daily) scan all transaction data with Isolation Forest. When a transaction's 'Anomaly Score' exceeds a threshold, I flag it for human analysts to investigate. If it is confirmed as a new type of fraud, we add it to the training set and update the supervised model.
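
The sketch below is a small, standard-library version of the Isolation Forest idea, meant only to show the mechanics: random splits, shorter average path length for outliers, and a score of 2^(-E[h]/c(n)). All function names, the default parameters, and the review threshold are assumptions; a production system would more likely rely on an existing implementation such as scikit-learn's `IsolationForest`.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(point, node, depth=0):
    """Depth at which `point` lands, adjusted for the size of the leaf."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)


def fit_forest(points, n_trees=100, sample_size=64, seed=0):
    """Build `n_trees` isolation trees, each on a random subsample (needs >= 2 points)."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(max(sample_size, 2)))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, sample_size


def anomaly_score(point, forest):
    """Score in (0, 1]; higher means easier to isolate, i.e. more anomalous."""
    trees, sample_size = forest
    mean_depth = sum(path_length(point, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(sample_size))


def flag_for_review(points, forest, threshold):
    """Return the points whose anomaly score exceeds `threshold` (an assumed cut-off)."""
    return [p for p in points if anomaly_score(p, forest) > threshold]
```

The threshold controls how many transactions reach the analysts each day, so in practice it would be tuned against review capacity rather than fixed in advance.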

Q&A: Model Deployment

The interviewer asked a question about deployment: 'Your three-layer model sounds complex. In a real production environment, how do you ensure it can complete a prediction within 100 milliseconds?'

My answer: I would package the rule engine and the XGBoost model into a single service, deployed next to a low-latency in-memory database (such as Redis) so that user-level features can be fetched quickly. The more expensive Isolation Forest scan can run offline in batch mode, since it has no real-time requirement.
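
A rough sketch of what that online path might look like. Everything here is a stand-in: a plain dictionary plays the role of the in-memory feature store, `model_score` is a hypothetical placeholder for the trained XGBoost model, and the decision labels, the 0.5 threshold, and the handling of unknown cards are assumptions.

```python
import time

# Stand-in for the in-memory store: card_id -> precomputed user-level features.
FEATURE_STORE = {
    "card-1": {"avg_amount_3m": 40.0, "txn_count_24h": 2},
}


def model_score(features):
    """Hypothetical placeholder for the trained model's fraud probability."""
    ratio = features["amount"] / max(features["avg_amount_3m"], 1.0)
    return min(1.0, 0.1 * ratio)


def score_transaction(card_id, amount, threshold=0.5, budget_ms=100.0):
    """Run the online path: feature lookup, Layer 1 rule, then Layer 2 model."""
    start = time.perf_counter()
    user = FEATURE_STORE.get(card_id)
    if user is None:
        decision = "review"  # assumption: no history means manual review
    elif amount > 10 * user["avg_amount_3m"]:
        decision = "block"  # rule hit, so the model call is skipped
    else:
        prob = model_score({"amount": amount, **user})
        decision = "block" if prob >= threshold else "approve"
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "decision": decision,
        "elapsed_ms": elapsed_ms,
        "within_budget": elapsed_ms <= budget_ms,
    }
```

Measuring elapsed time inside the service makes it possible to monitor how often a request exceeds the 100 ms budget; the Isolation Forest does not appear here because it runs in the offline batch job.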

Key Takeaways

Throughout the interview, I felt that HSBC's DS team really values whether you can apply machine learning to a real, complex business problem. You need to think like a 'system architect' about model performance, stability, and scalability.