
Transaction Fraud Detection System

A Project by Tanishqa Aggarwal and Shreevatsa Agnihotri

🎬 OPENING SCENE


The Challenge: Security vs. Sanity

The Numbers Don’t Lie (But Fraudsters Do)

Think about it: Sarah is at the grocery store. She tries to pay. DECLINED. A false fraud alert just froze her perfectly valid card. Now she's embarrassed. And angry. And hungry.

What are we doing in this notebook?

We are stepping into the shoes of a fraud analytics team at a digital payments company.
Our mission: Spot suspicious transactions fast enough that people like Sarah never lose a dollar.

Think of this notebook as a mix of:

Along the way, we will:


Project Roadmap

PHASE 1: THE EVIDENCE (Data Collection)

PHASE 2: SANITIZING THE DATA (Preprocessing)

PHASE 3: THE INVESTIGATION (Exploratory Analysis)

PHASE 4: THE DETECTIVES (Machine Learning)

PHASE 5: EVALUATING ON UNSEEN DATA

FUTURE WORK

REFERENCES & RESOURCES

DATA COLLECTION

The Data

Every investigation begins with data collection. For our detective work we need real payment-system data, which is normally locked away behind strict privacy constraints. Hence we use the IEEE-CIS Fraud Detection dataset from Kaggle.

Who created this data?

Why this dataset?

If you are unfamiliar with Kaggle: it is "GitHub for data scientists", a platform where companies post real datasets and challenges, and data scientists compete to build the best models. It's how professionals practice, learn, and showcase skills.

Think of it as:

The Data Architecture

To catch a thief, we need to look at the full picture. Our dataset tells two distinct stories, linked by a shared TransactionID.

1. The Paper Trail (Transaction Table)

This is the “what” and the “where.” It holds the details you’d expect on a receipt:

2. The Digital Fingerprint (Identity Table)

This is the “who” (or “what machine”). It captures the hidden network details:

The Split

We teach the model using a Training Set (where we already know who the fraudsters are) and then grade its performance on a Test Set (unseen data) to ensure it can handle the real world.
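The split described above can be sketched with scikit-learn's `train_test_split`. This is a toy illustration on synthetic data (the column names and numbers are ours, not the dataset's); the key detail is `stratify=y`, which preserves the rare fraud rate in both halves.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the transaction table: 1,000 rows, 3.5% "fraud"
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "TransactionAmt": rng.uniform(1, 500, size=1000),
    "isFraud": np.r_[np.ones(35, dtype=int), np.zeros(965, dtype=int)],
})

X = df.drop(columns="isFraud")
y = df["isFraud"]

# stratify=y keeps the fraud rate identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"train fraud rate: {y_train.mean():.3f}")  # 0.035
print(f"test fraud rate:  {y_test.mean():.3f}")   # 0.035
```

Without stratification, a random 80/20 split of a heavily imbalanced dataset can easily land noticeably more (or fewer) fraud cases in the test set than in training.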

1.1 Importing the Toolkit

First, we need to gather our tools. We’re importing the standard data science stack to help us wrangle the raw data and visualize the hidden patterns inside.

Meet our toolbox

Before we interrogate any suspicious transaction, we need our investigation gear:

Once the toolkit is ready, we can invite the data into the lab.

New to these libraries? Explore the official documentation for pandas, numpy, matplotlib, seaborn, and scikit-learn to dive deeper into each powerful tool.

import os
import gc
import math
import numpy as np
import pandas as pd
from google.colab import drive

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# stats/tests
from scipy import stats
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import (
    accuracy_score, roc_auc_score, classification_report, confusion_matrix,
    roc_curve, precision_recall_curve, auc,
    f1_score, precision_score, recall_score,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree
import xgboost as xgb

# settings
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 500)
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)
RANDOM_STATE = 42

drive.mount('/content/drive')

Mounted at /content/drive

1.2 Loading the Raw Logs

Time to bring in the evidence. We load the raw transaction and identity logs to begin looking for clues.

# load files
tx_train_path = "/content/drive/MyDrive/602-Project/train_transaction.csv"
id_train_path = "/content/drive/MyDrive/602-Project/train_identity.csv"

print("Files exist:", os.path.exists(tx_train_path), os.path.exists(id_train_path))
Files exist: True True

1.2.1 Transaction & Identity Data

df_tx_train = pd.read_csv(tx_train_path, low_memory=False)

df_tx_train.head(10)
(output: first 10 rows of the 394-column transaction table: TransactionID, isFraud, TransactionDT, TransactionAmt, ProductCD, card1-card6, addr1/addr2, dist1/dist2, P_emaildomain/R_emaildomain, C1-C14, D1-D15, M1-M9, and the anonymized V1-V339 block; many of the D and V columns are NaN for any given row)
df_id_train = pd.read_csv(id_train_path, low_memory=False)

df_id_train.head(10)
(output: first 10 rows of the 41-column identity table: TransactionID, id_01 through id_38, DeviceType, and DeviceInfo, with values such as OS names, browser versions, and screen resolutions, plus many NaNs)
print("Transaction shape:", df_tx_train.shape)
print("Identity shape:", df_id_train.shape)
Transaction shape: (590540, 394)
Identity shape: (144233, 41)
print(df_tx_train['isFraud'].value_counts())
isFraud
0    569877
1     20663
Name: count, dtype: int64
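Those counts reveal the core difficulty: fraud is rare. A quick back-of-the-envelope check on the numbers printed above:

```python
# Class balance from the value_counts() output above
legit, fraud = 569_877, 20_663
total = legit + fraud

print(f"Fraud rate:      {fraud / total:.2%}")      # 3.50%
print(f"Imbalance ratio: {legit / fraud:.1f} : 1")  # 27.6 : 1
```

A naive model that predicts "legitimate" for every transaction would already be about 96.5% accurate, which is why we will lean on ROC-AUC, precision, and recall rather than raw accuracy.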

First glimpse at the fraud universe

We have just loaded a huge ledger of transactions.
Each row is a tiny moment in someone’s financial life: a coffee, a hotel booking, an online game purchase, or… a stolen card being abused.

When you see columns like:

…you can imagine a security control room where each of these columns lights up when something feels off.

We will now start by asking a simple question:

“What does a typical transaction look like? And how does a fraudulent one differ?”

1.2.2 Creating the Master Dataset (Merge)

Context is everything in fraud detection. By merging the payment logs with device data, we link the action (the purchase) to the actor (the digital fingerprint) to reveal the full story.

# Merge with identity table
df_merged_data = df_tx_train.merge(df_id_train, how="left", on="TransactionID")
print("Merged shape:", df_merged_data.shape)
Merged shape: (590540, 434)
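A left join keeps every transaction and attaches identity details only where they exist. A minimal sketch on toy tables (the values are made up) shows how pandas' `indicator=True` can measure that coverage; on the real tables, roughly 144,233 of 590,540 transactions (about 24%) find a match.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity logs (values are made up)
tx = pd.DataFrame({"TransactionID": [1, 2, 3, 4],
                   "TransactionAmt": [68.5, 29.0, 59.0, 50.0]})
ident = pd.DataFrame({"TransactionID": [2, 4],
                      "DeviceType": ["mobile", "desktop"]})

# how="left" keeps all transactions; indicator=True records which rows matched
merged = tx.merge(ident, how="left", on="TransactionID", indicator=True)
coverage = (merged["_merge"] == "both").mean()
print(f"Transactions with identity data: {coverage:.0%}")  # 50%
```

The missing three-quarters of identity data in the real merge is not a bug; it simply means most transactions came through channels that did not log device details.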

DATA PREPROCESSING

Cleaning the crime scene

Real-world transaction data is messy:

If we took this raw data and threw it straight into a model, it would be like trying to solve a case in a dark room with half the evidence smudged.

In this section, we:

You can think of it as prepping the evidence before presenting it to a very picky judge: the classifier.

2.1 Consistency Check (Duplicates)

Reliable forensic analysis requires unique evidence. We scan the logs for duplicate entries to ensure that every TransactionID represents a single, distinct event in the real world.

print("Checking for Duplicates -")

# Checking the number of completely duplicated rows
duplicate_rows = df_merged_data.duplicated().sum()
print(f"Found {duplicate_rows} duplicated rows.")

if duplicate_rows > 0:
    df_merged_data = df_merged_data.drop_duplicates().reset_index(drop=True)

# Checking duplicated Transaction IDs (if any)
trx_id = df_merged_data['TransactionID'].nunique()
if trx_id != df_merged_data.shape[0]:
    print(f"Found {df_merged_data.shape[0] - trx_id} duplicated TransactionID.")
else:
    print("No duplicated TransactionID found.")
Checking for Duplicates -
Found 0 duplicated rows.
No duplicated TransactionID found.
print("Initial Data Types and Non-Null Counts:")
df_merged_data.info()
Initial Data Types and Non-Null Counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 434 entries, TransactionID to DeviceInfo
dtypes: float64(399), int64(4), object(31)
memory usage: 1.9+ GB

2.2 Parsing Data Types

Algorithms can mistake numerical IDs for quantities (e.g., treating card ID 5 as "greater than" card ID 1). We fix this by casting these columns to strings, ensuring downstream encoders treat them as identifiers, not math.

# (b) Parse - Fix Data Types
print("Parsing Categorical Columns -")

# All columns that are categorical, based on the data description
categorical_features = [
    'ProductCD', 'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain', 'DeviceType', 'DeviceInfo',
]
categorical_features += [f'card{i}' for i in range(1, 7)]  # card1 - card6
categorical_features += [f'M{i}' for i in range(1, 10)]   # M1 - M9
categorical_features += [f'id_{i}' for i in range(12, 39)] # id_12 - id_38

cols_converted = []
for col in categorical_features:
    if col in df_merged_data.columns:
        # Note: astype(str) turns NaN into the literal string 'nan' for these
        # columns; the numeric columns keep their real NaNs for the audit in 2.3
        df_merged_data[col] = df_merged_data[col].astype(str)
        cols_converted.append(col)

print(f"Converted {len(cols_converted)} columns to 'object' (string) type.")
print("\nCell 3 complete: Data types parsed.")
Parsing Categorical Columns -
Converted 49 columns to 'object' (string) type.

Cell 3 complete: Data types parsed.
df_merged_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 434 entries, TransactionID to DeviceInfo
dtypes: float64(382), int64(3), object(49)
memory usage: 1.9+ GB
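At 1.9 GB, the merged frame is heavy. One common mitigation, a sketch we do not actually apply in this notebook, is downcasting the float64 columns (382 of the 434) to float32, roughly halving their footprint:

```python
import numpy as np
import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every float64 column to float32, halving its memory."""
    out = df.copy()
    float_cols = out.select_dtypes(include="float64").columns
    out[float_cols] = out[float_cols].astype(np.float32)
    return out

# Demo on a small toy frame
demo = pd.DataFrame({"V1": np.zeros(100_000), "V2": np.ones(100_000)})
before = demo.memory_usage(deep=True).sum()
after = downcast_floats(demo).memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")  # 1.6 MB -> 0.8 MB
```

float32 carries about 7 significant digits, plenty of precision for the anonymized V features.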

2.3 Handling Missing Intelligence

With 434 features, we are dealing with a wide but potentially sparse dataset. Before we can fix the data, we need to measure the damage.

The Obstacle: Many columns contain a high percentage of NaN values. Imputing (filling in) this many blanks would introduce too much noise into our investigation.

The Strategy: We will perform a sparsity check. We are printing the top 50 worst offenders by missing percentage to decide where to draw the line between useful features and dead weight.

2.3.1 - Investigation

# (Part 1: Investigation)
print("Investigating Missing Values -")

# Calculate missing value counts and percentages
missing_values = df_merged_data.isnull().sum()
missing_percent = (missing_values / len(df_merged_data)) * 100

# Create a summary DataFrame
missing_summary = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percent': missing_percent
})

# Sort to see the worst columns
missing_summary.sort_values(by='Missing Percent', ascending=False, inplace=True)

# Display the top 50 columns with the most missing data
print("Top 50 columns with the most missing values:")
print(missing_summary.head(50))
Investigating Missing Values -
Top 50 columns with the most missing values:
       Missing Count  Missing Percent
id_07         585385        99.127070
id_08         585385        99.127070
dist2         552913        93.628374
D7            551623        93.409930
D13           528588        89.509263
D14           528353        89.469469
D12           525823        89.041047
id_03         524216        88.768923
id_04         524216        88.768923
D6            517353        87.606767
id_09         515614        87.312290
D9            515614        87.312290
D8            515614        87.312290
id_10         515614        87.312290
V138          508595        86.123717
V153          508595        86.123717
V148          508595        86.123717
V154          508595        86.123717
V163          508595        86.123717
V139          508595        86.123717
V149          508595        86.123717
V142          508595        86.123717
V140          508595        86.123717
V141          508595        86.123717
V146          508595        86.123717
V161          508595        86.123717
V158          508595        86.123717
V155          508595        86.123717
V156          508595        86.123717
V147          508595        86.123717
V162          508595        86.123717
V157          508595        86.123717
V144          508589        86.122701
V152          508589        86.122701
V145          508589        86.122701
V159          508589        86.122701
V151          508589        86.122701
V150          508589        86.122701
V165          508589        86.122701
V164          508589        86.122701
V160          508589        86.122701
V143          508589        86.122701
V166          508589        86.122701
V323          508189        86.054967
V322          508189        86.054967
V333          508189        86.054967
V334          508189        86.054967
V338          508189        86.054967
V339          508189        86.054967
V330          508189        86.054967
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Subplot 1: Missing data percentage by column
ax1 = axes[0]
missing_pct = (df_merged_data.isnull().sum() / len(df_merged_data) * 100).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0].head(20)
colors_missing = ['#d64545' if x > 50 else '#f4a261' if x > 20 else '#2b7d99' for x in missing_pct.values]
ax1.barh(range(len(missing_pct)), missing_pct.values, color=colors_missing, edgecolor='black')
ax1.set_yticks(range(len(missing_pct)))
ax1.set_yticklabels(missing_pct.index, fontsize=9)
ax1.set_xlabel('Missing Data (%)', fontsize=11, fontweight='bold')
ax1.set_title('Missing Data Pattern (Top 20 Features)', fontsize=12, fontweight='bold')
ax1.axvline(x=50, color='red', linestyle='--', linewidth=2, alpha=0.5, label='50% threshold')
ax1.grid(alpha=0.3, axis='x')
ax1.legend()

# Subplot 2: Missing data distribution by fraud status
ax2 = axes[1]
legit_missing = df_merged_data[df_merged_data['isFraud'] == 0].isnull().sum() / len(df_merged_data[df_merged_data['isFraud'] == 0]) * 100
fraud_missing = df_merged_data[df_merged_data['isFraud'] == 1].isnull().sum() / len(df_merged_data[df_merged_data['isFraud'] == 1]) * 100

missing_cols = (df_merged_data.isnull().sum() > 0)
cols_to_plot = missing_cols[missing_cols].index[:15]

x = np.arange(len(cols_to_plot))
width = 0.35
ax2.bar(x - width/2, legit_missing[cols_to_plot], width, label='Legitimate', color='#2b7d99', edgecolor='black')
ax2.bar(x + width/2, fraud_missing[cols_to_plot], width, label='Fraud', color='#d64545', edgecolor='black')
ax2.set_xlabel('Features', fontsize=11, fontweight='bold')
ax2.set_ylabel('Missing Data (%)', fontsize=11, fontweight='bold')
ax2.set_title('Missing Data: Fraud vs Legitimate (Top 15)', fontsize=12, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(cols_to_plot, rotation=45, ha='right', fontsize=8)
ax2.legend()
ax2.grid(alpha=0.3, axis='y')

plt.show()

png

2.3.2 Deletion Strategy

The audit flagged a significant amount of “dead weight”: many columns are more than 50%, and some more than 90%, empty.

The Decision: We cannot reliably impute (guess) data when the majority of the history is missing. Doing so introduces bias, not signal.

The Action: We are applying a strict 40% threshold. Any column missing more than 40% of its values will be dropped immediately to preserve the integrity of the dataset.

# (Part 2: Column Deletion)
print("Deleting Sparse Columns...")

# Any column with more than 40% missing data will be dropped.
threshold = 40.0

# Get the list of columns to drop
cols_to_drop = missing_summary[missing_summary['Missing Percent'] > threshold].index

print(f"Found {len(cols_to_drop)} columns with > {threshold}% missing values.")

# Drop the sparse columns
df_cleaned = df_merged_data.drop(columns=cols_to_drop)

print(f"Dropped columns. New shape of df_cleaned: {df_cleaned.shape}")
print(f"Original shape was: {df_merged_data.shape}")
Deleting Sparse Columns...
Found 194 columns with > 40.0% missing values.
Dropped columns. New shape of df_cleaned: (590540, 240)
Original shape was: (590540, 434)
df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 240 entries, TransactionID to DeviceInfo
dtypes: float64(188), int64(3), object(49)
memory usage: 1.1+ GB
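The `.info()` readout above shows the cleaned frame already weighs over a gigabyte, largely because all 188 numeric columns are stored as `float64`. As an optional aside (not part of the original pipeline), downcasting floats to `float32` roughly halves that footprint; a minimal sketch on a small stand-in frame:

```python
import numpy as np
import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Convert float64 columns to float32, roughly halving their memory."""
    out = df.copy()
    float_cols = out.select_dtypes(include='float64').columns
    out[float_cols] = out[float_cols].astype('float32')
    return out

# Tiny demo frame standing in for df_cleaned
demo = pd.DataFrame({'V1': np.random.rand(1000), 'V2': np.random.rand(1000)})
before = demo.memory_usage(deep=True).sum()
after = downcast_floats(demo).memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

The trade-off is a small loss of precision, which rarely matters for tree-based models.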

2.3.3 Cleaning Strategy - Categorical Data

Our categorical features are currently noisy and inconsistent.

The Glitch: Converting data types created the text string 'nan', which the model misinterprets as a real value. Additionally, inconsistent user inputs (e.g., gmail.com vs. googlemail.com) split identical groups, diluting the signal.

The Fix: We normalize text, merge aliases, and relabel 'nan' to ‘Missing’. This ensures the model treats missing data as a distinct pattern rather than a random string.

# Investigating Email Domains -

# Set pandas to display all rows
pd.set_option('display.max_rows', None)

print("P_emaildomain (All Unique Values) -")
print(df_cleaned['P_emaildomain'].value_counts())

print("\n\nR_emaildomain (All Unique Values) -")
print(df_cleaned['R_emaildomain'].value_counts())

# Reset display options to default
pd.set_option('display.max_rows', 100)
P_emaildomain (All Unique Values) -
P_emaildomain
gmail.com           228355
yahoo.com           100934
nan                  94456
hotmail.com          45250
anonymous.com        36998
aol.com              28289
comcast.net           7888
icloud.com            6267
outlook.com           5096
msn.com               4092
att.net               4033
live.com              3041
sbcglobal.net         2970
verizon.net           2705
ymail.com             2396
bellsouth.net         1909
yahoo.com.mx          1543
me.com                1522
cox.net               1393
optonline.net         1011
charter.net            816
live.com.mx            749
rocketmail.com         664
mail.com               559
earthlink.net          514
gmail                  496
outlook.es             438
mac.com                436
juno.com               322
aim.com                315
hotmail.es             305
roadrunner.com         305
windstream.net         305
hotmail.fr             295
frontier.com           280
embarqmail.com         260
web.de                 240
netzero.com            230
twc.com                230
prodigy.net.mx         207
centurylink.net        205
netzero.net            196
frontiernet.net        195
q.com                  189
suddenlink.net         175
cfl.rr.com             172
sc.rr.com              164
cableone.net           159
gmx.de                 149
yahoo.fr               143
yahoo.es               134
hotmail.co.uk          112
protonmail.com          76
yahoo.de                74
ptd.net                 68
live.fr                 56
yahoo.co.uk             49
hotmail.de              43
servicios-ta.com        35
yahoo.co.jp             32
Name: count, dtype: int64


R_emaildomain (All Unique Values) -
R_emaildomain
nan                 453249
gmail.com            57147
hotmail.com          27509
anonymous.com        20529
yahoo.com            11842
aol.com               3701
outlook.com           2507
comcast.net           1812
yahoo.com.mx          1508
icloud.com            1398
msn.com                852
live.com               762
live.com.mx            754
verizon.net            620
me.com                 556
sbcglobal.net          552
cox.net                459
outlook.es             433
att.net                430
bellsouth.net          422
hotmail.fr             293
hotmail.es             292
web.de                 237
mac.com                218
prodigy.net.mx         207
ymail.com              207
optonline.net          187
gmx.de                 147
yahoo.fr               137
charter.net            127
mail.com               122
hotmail.co.uk          105
gmail                   95
earthlink.net           79
yahoo.de                75
rocketmail.com          69
embarqmail.com          68
scranton.edu            63
yahoo.es                57
live.fr                 55
juno.com                53
roadrunner.com          53
frontier.com            52
windstream.net          47
hotmail.de              42
protonmail.com          41
yahoo.co.uk             39
cfl.rr.com              37
aim.com                 36
servicios-ta.com        35
yahoo.co.jp             33
twc.com                 29
ptd.net                 27
cableone.net            27
q.com                   25
suddenlink.net          25
frontiernet.net         14
netzero.com             14
centurylink.net         12
netzero.net              9
sc.rr.com                8
Name: count, dtype: int64

Observation & Strategy: The domain audit reveals that identical providers, such as gmail.com and googlemail.com, are treated as separate entities, and the string 'nan' creates gaps in the data. To resolve both issues, we merge the aliases to consolidate the groups and explicitly tag missing entries as 'Missing'. This ensures the model interprets each provider as one entity rather than keying on spelling differences.

print("Cleaning Categorical Inconsistencies -")
object_cols = df_cleaned.select_dtypes(include='object').columns

print(f"Cleaning {len(object_cols)} object/categorical columns...")
gmail_variations = ['gmail', 'googlemail.com']
yahoo_variations = [
    'yahoo.com.mx', 'ymail.com', 'yahoo.fr', 'yahoo.es',
    'yahoo.de', 'yahoo.co.uk', 'yahoo.co.jp', 'rocketmail.com'
]
hotmail_variations = ['hotmail.co.uk', 'hotmail.de', 'hotmail.es', 'hotmail.fr']
outlook_variations = ['outlook.es']
live_variations = ['live.com.mx', 'live.fr']
apple_variations = ['me.com', 'mac.com', 'icloud.com']
german_variations = ['web.de', 'gmx.de']
aol_variations = ['aim.com']
netzero_variations = ['netzero.net']
isp_variations = [
    'comcast.net', 'att.net', 'sbcglobal.net', 'verizon.net', 'bellsouth.net',
    'cox.net', 'optonline.net', 'charter.net', 'roadrunner.com', 'windstream.net',
    'frontier.com', 'embarqmail.com', 'twc.com', 'centurylink.net',
    'frontiernet.net', 'q.com', 'suddenlink.net', 'cfl.rr.com', 'sc.rr.com',
    'cableone.net', 'ptd.net'
]

for col in object_cols:
    df_cleaned[col] = df_cleaned[col].str.lower()
    df_cleaned[col] = df_cleaned[col].replace('nan', 'Missing')

    if col in ['P_emaildomain', 'R_emaildomain']:
        df_cleaned[col] = df_cleaned[col].replace(gmail_variations, 'gmail.com')
        df_cleaned[col] = df_cleaned[col].replace(yahoo_variations, 'yahoo.com')
        df_cleaned[col] = df_cleaned[col].replace(hotmail_variations, 'hotmail.com')
        df_cleaned[col] = df_cleaned[col].replace(outlook_variations, 'outlook.com')
        df_cleaned[col] = df_cleaned[col].replace(live_variations, 'live.com')
        df_cleaned[col] = df_cleaned[col].replace(apple_variations, 'apple.com')
        df_cleaned[col] = df_cleaned[col].replace(german_variations, 'german_mail')
        df_cleaned[col] = df_cleaned[col].replace(aol_variations, 'aol.com')
        df_cleaned[col] = df_cleaned[col].replace(netzero_variations, 'netzero.com')
        df_cleaned[col] = df_cleaned[col].replace(isp_variations, 'isp_mail.com')

print("Categorical column cleaning and grouping complete.")

print("\n--- 'P_emaildomain' (Top 30) AFTER Cleaning ---")
print(df_cleaned['P_emaildomain'].value_counts().head(30))

print("\n--- 'R_emaildomain' (Top 30) AFTER Cleaning ---")
print(df_cleaned['R_emaildomain'].value_counts().head(30))
Cleaning Categorical Inconsistencies -
Cleaning 49 object/categorical columns...
Categorical column cleaning and grouping complete.

--- 'P_emaildomain' (Top 30) AFTER Cleaning ---
P_emaildomain
gmail.com           228851
yahoo.com           105969
Missing              94456
hotmail.com          46005
anonymous.com        36998
aol.com              28604
isp_mail.com         25432
apple.com             8225
outlook.com           5534
msn.com               4092
live.com              3846
mail.com               559
earthlink.net          514
netzero.com            426
german_mail            389
juno.com               322
prodigy.net.mx         207
protonmail.com          76
servicios-ta.com        35
Name: count, dtype: int64

--- 'R_emaildomain' (Top 30) AFTER Cleaning ---
R_emaildomain
Missing             453249
gmail.com            57242
hotmail.com          28241
anonymous.com        20529
yahoo.com            13967
isp_mail.com          5033
aol.com               3737
outlook.com           2940
apple.com             2172
live.com              1571
msn.com                852
german_mail            384
prodigy.net.mx         207
mail.com               122
earthlink.net           79
scranton.edu            63
juno.com                53
protonmail.com          41
servicios-ta.com        35
netzero.com             23
Name: count, dtype: int64

2.3.4 Imputation Strategy

Here we can’t just guess the missing numbers; we need a strategy based on how much data is actually gone. This step helps us decide between a quick fix (median) or a smarter calculation (regression) to complete the picture.

print("Investigating and Imputing Remaining Missing Values (Numeric)- ")

# Get all numeric columns
numeric_cols = df_cleaned.select_dtypes(include=np.number).columns
print(f"Found {len(numeric_cols)} numeric columns.")

# Investigate: Find numeric columns that have missing values
missing_numeric_counts = df_cleaned[numeric_cols].isnull().sum()
missing_numeric_cols = missing_numeric_counts[missing_numeric_counts > 0]

# Print the percentage of nulls for those columns
if missing_numeric_cols.empty:
    print("No missing values found in any numeric columns.")
else:
    print("\nNumeric Columns with Missing Values (Before Imputation) -")
    missing_numeric_summary = pd.DataFrame({
        'Missing Count': missing_numeric_cols,
        'Missing Percent': (missing_numeric_cols / len(df_cleaned)) * 100
    })
    print(missing_numeric_summary.sort_values(by='Missing Percent', ascending=False))
Investigating and Imputing Remaining Missing Values (Numeric)- 
Found 191 numeric columns.

Numeric Columns with Missing Values (Before Imputation) -
      Missing Count  Missing Percent
V40          168969        28.612626
V41          168969        28.612626
V42          168969        28.612626
V43          168969        28.612626
V44          168969        28.612626
...             ...              ...
V317             12         0.002032
V318             12         0.002032
V319             12         0.002032
V320             12         0.002032
V321             12         0.002032

[173 rows x 2 columns]

Issue: Remaining numeric columns have NaN values.

Strategy: A tiered approach based on the percentage of missing data.

# Define the imputation threshold (a common rule of thumb: impute simply below ~20% missingness)
impute_threshold = 20.0

# Get list of columns for SIMPLE (Median) imputation
cols_to_impute_median = missing_numeric_summary[missing_numeric_summary['Missing Percent'] < impute_threshold].index

# Impute with MEDIAN
print(f"Imputing {len(cols_to_impute_median)} columns with < {impute_threshold}% missing data using MEDIAN...")
for col in cols_to_impute_median:
    median_val = df_cleaned[col].median()
    df_cleaned[col] = df_cleaned[col].fillna(median_val)

print("Median imputation complete for low-missingness columns.")
Imputing 154 columns with < 20.0% missing data using MEDIAN...
Median imputation complete for low-missingness columns.

Simple averages ignore the story the rest of the data is telling. This technique uses the remaining valid clues to intelligently infer what should be there, keeping the data’s internal logic intact.

# Get all numeric columns
numeric_cols = df_cleaned.select_dtypes(include=np.number).columns

# Get list of columns for ADVANCED (Regression) imputation
missing_numeric_counts = df_cleaned[numeric_cols].isnull().sum()
missing_numeric_summary = pd.DataFrame({
    'Missing Count': missing_numeric_counts[missing_numeric_counts > 0],
    'Missing Percent': (missing_numeric_counts[missing_numeric_counts > 0] / len(df_cleaned)) * 100
})

impute_threshold = 20.0
cols_to_impute_regression = missing_numeric_summary[missing_numeric_summary['Missing Percent'] >= impute_threshold].index

if len(cols_to_impute_regression) > 0:
    print(f"Found {len(cols_to_impute_regression)} columns for advanced imputation: {list(cols_to_impute_regression)}")

    # IterativeImputer is experimental in scikit-learn, so the enabling
    # import must run before the class can be imported
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    imputer = IterativeImputer(
        max_iter=5,
        verbose=2,
        random_state=0,
        n_nearest_features=20
    )

    df_cleaned[numeric_cols] = imputer.fit_transform(df_cleaned[numeric_cols])

    print("Advanced imputation complete.")
else:
    print("No columns required advanced imputation.")

# Final check
total_nans = df_cleaned.isnull().sum().sum()
print(f"\nTotal remaining NaN values in the entire dataset: {total_nans}")
Found 19 columns for advanced imputation: ['D4', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52']
[IterativeImputer] Completing matrix with shape (590540, 191)
[IterativeImputer] Ending imputation round 1/5, elapsed time 72.03
[IterativeImputer] Change: 402.3353873351911, scaled tolerance: 15811.131000000001 
[IterativeImputer] Early stopping criterion reached.
Advanced imputation complete.

Total remaining NaN values in the entire dataset: 0
Cute Shocked

EXPLORATORY DATA ANALYSIS

Profiling the suspects

Now that the data is somewhat cleaned, it is time to look at patterns, not just rows.

In a fraud story, EDA is where you:

We will:

As you look at each plot, try to answer:

“If I were a human fraud analyst, would this pattern make me raise an eyebrow?”

3.1: Descriptive Statistics & Frequency Analysis

The Goal: Answer the question: “Is the crime scene dominated by innocent bystanders?”

Why This Matters: Imagine a detective who sits at their desk and stamps “Innocent” on every single case file. In a safe city, they might be right 99% of the time—but they catch zero criminals.

We check the frequency counts (.value_counts()) now to ensure we don’t accidentally build a model that achieves 99% accuracy by being lazy.

print("Method 1: Analyzing Target Variable 'isFraud' -")

# Get the exact percentage for 'isFraud'
fraud_percentage = df_cleaned['isFraud'].value_counts(normalize=True) * 100
print(f"Fraud Percentage:\n{fraud_percentage}\n")

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Subplot 1: Overall class distribution
ax1 = axes[0]
fraud_counts = df_cleaned['isFraud'].value_counts()
colors_pie = ['#2b7d99', '#d64545']
wedges, texts, autotexts = ax1.pie(fraud_counts.values, labels=['Legitimate', 'Fraud'],
                                     autopct='%1.2f%%', colors=colors_pie, startangle=90,
                                     textprops={'fontsize': 11, 'fontweight': 'bold'})
ax1.set_title('Class Distribution (Imbalance)', fontsize=12, fontweight='bold')

# Subplot 2: Log scale bar chart
ax2 = axes[1]
ax2.bar(fraud_counts.index, fraud_counts.values, color=colors_pie, edgecolor='black', alpha=0.8)
ax2.set_yscale('log')
ax2.set_ylabel('Count (Log Scale)', fontsize=11, fontweight='bold')
ax2.set_xlabel('Transaction Type', fontsize=11, fontweight='bold')
ax2.set_title('Class Imbalance Severity (Log Scale)', fontsize=12, fontweight='bold')
ax2.set_xticks(fraud_counts.index)
ax2.set_xticklabels(['Legitimate', 'Fraud'])
ax2.grid(alpha=0.3, axis='y')

# Subplot 3: Ratio visualization
ax3 = axes[2]
ratio = fraud_counts[0] / fraud_counts[1]
ax3.text(0.5, 0.7, f'{ratio:.1f}:1', fontsize=60, fontweight='bold', ha='center',
         bbox=dict(boxstyle='round', facecolor='#d64545', alpha=0.3))
ax3.text(0.5, 0.3, 'Legitimate to Fraud Ratio', fontsize=14, fontweight='bold', ha='center')
ax3.set_xlim(0, 1)
ax3.set_ylim(0, 1)
ax3.axis('off')
ax3.set_title('Data Imbalance Ratio', fontsize=12, fontweight='bold')

plt.show()

Method 1: Analyzing Target Variable 'isFraud' -
Fraud Percentage:
isFraud
0.0    96.500999
1.0     3.499001
Name: proportion, dtype: float64

png

Conclusion 1: The dataset is severely imbalanced. Out of 590,540 transactions, approximately 96.5% are non-fraudulent (isFraud=0) and only 3.5% are fraudulent (isFraud=1).

This means that a model that simply guesses ‘No Fraud’ every time would be 96.5% accurate. Therefore, ‘accuracy’ is a poor metric. We must use other metrics like Precision-Recall or F1-Score for our analysis.
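To make the “lazy detective” failure mode concrete, here is a small illustrative sketch (synthetic labels with the same class split, not the actual dataset) showing how a model that always predicts ‘No Fraud’ scores high accuracy but zero recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic labels with roughly the same 96.5% / 3.5% split as the dataset
rng = np.random.default_rng(42)
y_true = (rng.random(100_000) < 0.035).astype(int)

# The "lazy detective": stamps 'No Fraud' on every case
y_pred = np.zeros_like(y_true)

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")  # high, yet useless
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")  # 0: catches no fraud
print(f"F1-Score:  {f1_score(y_true, y_pred):.3f}")
```

Recall and F1 collapse to zero because the baseline never flags a single fraudulent transaction, which is exactly why those metrics anchor the rest of this investigation.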

3.2: Correlation Analysis

The Goal: To connect the dots. We want to know which specific clues (features) show up every time a crime (Fraud) is committed.

Why This Matters: We have over 200 features (suspects). If we try to draw a “red string” between all of them on our detective wall, we’ll just end up with a tangled mess.

Instead, we filter for the Top 15 features that have the strongest relationship with Fraud. We ignore the noise and focus strictly on the smoking guns.

print("Method 2: Analyzing Feature Correlation with 'isFraud'-")

# Calculate the correlation matrix for all numeric columns
corr_matrix = df_cleaned.corr(numeric_only=True)

# Top 20 features most correlated with 'isFraud'
# We use .abs() to find strong positive OR negative correlations
top_corr_features = corr_matrix['isFraud'].abs().sort_values(ascending=False)[1:21].index
print(f"Top 20 most correlated features:\n{top_corr_features.values}\n")

numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
correlation_with_fraud = df_cleaned[numeric_cols].corr()['isFraud'].sort_values(ascending=False)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Subplot 1: Top Positive & Negative Correlations
ax1 = axes[0]
top_corr = pd.concat([correlation_with_fraud.head(15), correlation_with_fraud.tail(15)])
top_corr = top_corr[top_corr.index != 'isFraud']  # Exclude isFraud itself
colors_corr = ['#d64545' if x > 0 else '#2b7d99' for x in top_corr.values]
y_pos = np.arange(len(top_corr))
ax1.barh(y_pos, top_corr.values, color=colors_corr, edgecolor='black')
ax1.set_yticks(y_pos)
ax1.set_yticklabels(top_corr.index, fontsize=9)
ax1.set_xlabel('Correlation with Fraud', fontsize=11, fontweight='bold')
ax1.set_title('Top Features Correlated with Fraud', fontsize=12, fontweight='bold')
ax1.grid(alpha=0.3, axis='x')
ax1.axvline(x=0, color='black', linestyle='-', linewidth=0.5)

# Subplot 2: Correlation heatmap (top 15 features)
ax2 = axes[1]
top_features = correlation_with_fraud[correlation_with_fraud.index != 'isFraud'].head(15).index.tolist()
corr_matrix = df_cleaned[[*top_features, 'isFraud']].corr()
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0, ax=ax2,
            cbar_kws={'label': 'Correlation'}, fmt='.2f')
ax2.set_title('Correlation Heatmap (Top 15 Features)', fontsize=12, fontweight='bold')

plt.show()

Method 2: Analyzing Feature Correlation with 'isFraud'-
Top 20 most correlated features:
['V45' 'V86' 'V87' 'V44' 'V52' 'V51' 'V40' 'V39' 'V79' 'V43' 'V38' 'V94'
 'V42' 'V33' 'V17' 'V18' 'V81' 'V34' 'V74' 'V80']

png

Conclusion 2: Several ‘V’ features are moderately correlated with fraud. While no single feature has a very strong (e.g., > 0.8) correlation, the heatmap shows that features like V45, V86, V87, V44, and V52 have the strongest linear relationships.

This is important because it identifies a clear list of features that are predictive of fraud.

It also highlights that fraud is not explained by one or two simple variables, but likely by the interaction of many.

3.3: Hypothesis Testing & Outlier Analysis

The Goal: Follow the money. We need to determine if fraud leaves a distinct financial footprint compared to normal spending.

Why This Matters: Do thieves always spend big to maximize their payout? Or do they test the waters with small amounts? To answer this, we can’t just look at a chart; we need statistical proof.

We are running an Independent Two-Sample T-Test (Welch’s variant, which does not assume equal variances) to settle this formally: the null hypothesis (H0) is that fraudulent and legitimate transactions have the same mean amount; the alternative (H1) is that the means differ.

The Visualization: We will also use a Box Plot to spot the “Whales”—the extreme outliers that break the pattern and demand immediate attention.

print("Method 3: Hypothesis Test for Transaction Amount-")

# Create two groups for the t-test
fraud_transactions = df_cleaned[df_cleaned['isFraud'] == 1]['TransactionAmt']
non_fraud_transactions = df_cleaned[df_cleaned['isFraud'] == 0]['TransactionAmt']

# Performing the t-test
t_statistic, p_value = stats.ttest_ind(
    fraud_transactions,
    non_fraud_transactions,
    equal_var=False,
    nan_policy='omit'
)

print(f"T-Test Results -")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value}\n")


# Plot
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Subplot 1: Distribution of Transaction Amount
ax1 = axes[0, 0]
df_cleaned[df_cleaned['isFraud'] == 0]['TransactionAmt'].apply(np.log1p).hist(
    bins=50, ax=ax1, alpha=0.7, label='Legitimate', color='#2b7d99', edgecolor='black'
)
df_cleaned[df_cleaned['isFraud'] == 1]['TransactionAmt'].apply(np.log1p).hist(
    bins=50, ax=ax1, alpha=0.7, label='Fraud', color='#d64545', edgecolor='black'
)
ax1.set_xlabel('Transaction Amount (log scale)', fontsize=11, fontweight='bold')
ax1.set_ylabel('Frequency', fontsize=11, fontweight='bold')
ax1.set_title('Transaction Amount Distribution: Legit vs Fraud', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(alpha=0.3)

# Subplot 2: Fraud Rate by Transaction Amount Bins
ax2 = axes[0, 1]
df_temp_bins = df_cleaned[['TransactionAmt', 'isFraud']].copy()
df_temp_bins['amt_bin'] = pd.cut(df_temp_bins['TransactionAmt'], bins=10)
fraud_rate_by_amt = df_temp_bins.groupby('amt_bin', observed=False)['isFraud'].agg(['sum', 'count'])
fraud_rate_by_amt['fraud_rate'] = (fraud_rate_by_amt['sum'] / fraud_rate_by_amt['count'] * 100)
ax2.bar(range(len(fraud_rate_by_amt)), fraud_rate_by_amt['fraud_rate'].values,
        color=['#d64545' if x > fraud_rate_by_amt['fraud_rate'].mean() else '#2b7d99'
               for x in fraud_rate_by_amt['fraud_rate'].values], edgecolor='black')
ax2.axhline(y=df_cleaned['isFraud'].mean()*100, color='red', linestyle='--', linewidth=2, label='Overall Fraud Rate')
ax2.set_xlabel('Transaction Amount Bins', fontsize=11, fontweight='bold')
ax2.set_ylabel('Fraud Rate (%)', fontsize=11, fontweight='bold')
ax2.set_title('Fraud Rate by Transaction Amount', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3, axis='y')

# Subplot 3: Box plot of Transaction Amount
ax3 = axes[1, 0]
bp_data = [df_cleaned[df_cleaned['isFraud'] == 0]['TransactionAmt'],
           df_cleaned[df_cleaned['isFraud'] == 1]['TransactionAmt']]
bp = ax3.boxplot(bp_data, tick_labels=['Legitimate', 'Fraud'], patch_artist=True)
for patch, color in zip(bp['boxes'], ['#2b7d99', '#d64545']):
    patch.set_facecolor(color)
ax3.set_yscale('log')
ax3.set_ylabel('Transaction Amount ($)', fontsize=11, fontweight='bold')
ax3.set_title('Transaction Amount Outliers', fontsize=12, fontweight='bold')
ax3.grid(alpha=0.3, axis='y')

# Subplot 4: Cumulative distribution
ax4 = axes[1, 1]
legit_sorted = np.sort(df_cleaned[df_cleaned['isFraud'] == 0]['TransactionAmt'])
fraud_sorted = np.sort(df_cleaned[df_cleaned['isFraud'] == 1]['TransactionAmt'])
ax4.plot(legit_sorted, np.arange(1, len(legit_sorted)+1), label='Legitimate', linewidth=2.5, color='#2b7d99')
ax4.plot(fraud_sorted, np.arange(1, len(fraud_sorted)+1), label='Fraud', linewidth=2.5, color='#d64545')
ax4.set_xlabel('Transaction Amount ($)', fontsize=11, fontweight='bold')
ax4.set_ylabel('Cumulative Count', fontsize=11, fontweight='bold')
ax4.set_title('Cumulative Transaction Amount', fontsize=12, fontweight='bold')
ax4.legend()
ax4.grid(alpha=0.3)

plt.tight_layout()
plt.show()

Method 3: Hypothesis Test for Transaction Amount-
T-Test Results -
T-statistic: 8.9494
P-value: 3.846046075647657e-19

png

Conclusion 3: The Verdict

The evidence is overwhelming.

Transaction Amount Distribution (Log Scale) - The histogram on a log scale shows that both legitimate and fraudulent transactions follow a similar log-normal pattern. Fraudulent transactions peak at slightly lower amounts, but the two distributions still overlap heavily. This suggests that transaction amount by itself is not a reliable indicator of fraud: fraudsters typically blend in by keeping amounts within common low-to-mid ranges rather than consistently targeting large sums.

Fraud Rate by Transaction Amount Bins - The bar chart makes the dataset’s strong right skew obvious: most transactions fall into the smallest value ranges. The higher-value bins look empty because even if such transactions exist, they are extremely rare compared to the huge volume of small ones. This shows that equal-width bins aren’t ideal here and that the data is dominated by low-value transactions.

Transaction Amount Outliers (Box Plot) - The box plot shows how transaction amounts are spread out and makes clear that fraud is not simply linked to high amounts. In fact, fraudulent transactions have a lower median value than legitimate ones. Legitimate transactions also include the most extreme high-value outliers, meaning very large transactions are statistically more likely to be rare valid purchases than fraud.

Cumulative Transaction Amount - The cumulative plot rises sharply near zero, then levels off, forming an “elbow” shape. This means that almost all transactions—around 99%—occur at very small amounts for both classes. The long flat stretch to the right represents the tiny fraction of large transactions. This pattern reinforces that detecting fraud requires focusing on subtle signals in small transactions, not on assumptions about unusually high amounts.
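The equal-width-bin issue noted above has a standard workaround: quantile bins (`pd.qcut`) give every bin the same number of transactions, so the fraud rate per bin is estimated from comparable sample sizes. An illustrative sketch on synthetic right-skewed amounts (not the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts + rare fraud flag, mimicking the shapes above
rng = np.random.default_rng(0)
amt = rng.lognormal(mean=4, sigma=1, size=10_000)
is_fraud = (rng.random(10_000) < 0.035).astype(int)
df = pd.DataFrame({'TransactionAmt': amt, 'isFraud': is_fraud})

# Equal-width bins (pd.cut) leave most upper bins nearly empty...
wide = df.groupby(pd.cut(df['TransactionAmt'], bins=10), observed=False)['isFraud'].count()

# ...while quantile bins (pd.qcut) put ~1,000 rows in every bin
deep = df.groupby(pd.qcut(df['TransactionAmt'], q=10), observed=False)['isFraud'].agg(['mean', 'count'])

print(wide)
print(deep)
```

With equal-sized bins, the fraud rate in the highest-value decile is no longer drowned out by the sheer volume of small transactions.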

3.4: Temporal Analysis (Time Patterns)

The Goal: To pinpoint the “Time of Crime.” We are analyzing the timeline to see if fraudsters operate on a specific schedule—like striking in the dead of night when the world is asleep—or if they attack in coordinated waves over specific days.

Why This Matters: Context is everything. 100 fraudulent transactions at 2 PM might be normal because everyone is awake, but 100 fraudulent transactions at 4 AM is highly suspicious. We use these plots to distinguish between “busy hours” (high traffic) and “danger zones” (high fraud probability), helping us catch automated bots or international attacks that don’t follow the local time zone.

hour_series = (df_cleaned['TransactionDT'] // 3600) % 24
day_series = df_cleaned['TransactionDT'] // (24 * 3600)

fig, axes = plt.subplots(2, 2, figsize=(16, 10))

ax1 = axes[0, 0]
hourly_fraud = df_cleaned.groupby(hour_series, observed=False)['isFraud'].agg(['sum', 'count'])
hourly_fraud['rate'] = hourly_fraud['sum'] / hourly_fraud['count'] * 100
colors_hourly = ['#d64545' if x > df_cleaned['isFraud'].mean()*100 else '#2b7d99' for x in hourly_fraud['rate']]
ax1.bar(hourly_fraud.index, hourly_fraud['rate'], color=colors_hourly, edgecolor='black', alpha=0.8)
ax1.set_xlabel('Hour of Day', fontsize=11, fontweight='bold')
ax1.set_ylabel('Fraud Rate (%)', fontsize=11, fontweight='bold')
ax1.set_title('Fraud Activity by Hour', fontsize=12, fontweight='bold')
ax1.grid(alpha=0.3, axis='y')

ax2 = axes[0, 1]
ax2_twin = ax2.twinx()
ax2.bar(hourly_fraud.index, hourly_fraud['count'], color='#7dd9d4', alpha=0.6, label='Total Transactions', edgecolor='black')
ax2_twin.plot(hourly_fraud.index, hourly_fraud['sum'], color='#d64545', marker='o', linewidth=2.5, markersize=6, label='Frauds')
ax2.set_xlabel('Hour of Day', fontsize=11, fontweight='bold')
ax2.set_ylabel('Total Transactions', fontsize=11, fontweight='bold', color='#7dd9d4')
ax2_twin.set_ylabel('Fraud Count', fontsize=11, fontweight='bold', color='#d64545')
ax2.set_title('Transaction & Fraud Volume by Hour', fontsize=12, fontweight='bold')
ax2.grid(alpha=0.3, axis='y')
ax2.tick_params(axis='y', labelcolor='#7dd9d4')
ax2_twin.tick_params(axis='y', labelcolor='#d64545')

ax3 = axes[1, 0]
daily_fraud = df_cleaned.groupby(day_series, observed=False)['isFraud'].agg(['sum', 'count'])
daily_fraud['rate'] = daily_fraud['sum'] / daily_fraud['count'] * 100
ax3.plot(daily_fraud.index, daily_fraud['rate'], color='#d64545', linewidth=2, marker='o', markersize=4, label='Fraud Rate')
ax3.fill_between(daily_fraud.index, daily_fraud['rate'], alpha=0.3, color='#d64545')
ax3.set_xlabel('Day Number', fontsize=11, fontweight='bold')
ax3.set_ylabel('Fraud Rate (%)', fontsize=11, fontweight='bold')
ax3.set_title('Fraud Trend Over Days', fontsize=12, fontweight='bold')
ax3.grid(alpha=0.3)

ax4 = axes[1, 1]
heatmap_data = df_cleaned['isFraud'].groupby(hour_series, observed=False).mean() * 100
heatmap_df = pd.DataFrame(heatmap_data).T
sns.heatmap(heatmap_df, cmap='RdYlGn_r', annot=False, fmt='.1f', cbar_kws={'label': 'Fraud Rate (%)'}, ax=ax4)
ax4.set_xlabel('Hour of Day', fontsize=11, fontweight='bold')
ax4.set_title('Fraud Rate Heatmap by Hour', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

png

Our timeline analysis reveals that fraudsters prefer to operate in the shadows, similar to burglars striking when a neighborhood is quiet. While the sheer number of fraudulent transactions is highest in the evenings when everyone is shopping, the risk is actually highest in the early morning (5 AM – 10 AM). This “sleeping user” effect occurs because legitimate customers are offline, causing the fraudulent activity to stand out sharply. Additionally, the day-by-day view shows a clear “attack wave” starting around Day 30, where the fraudsters suddenly became more aggressive and unpredictable, indicating a distinct shift in their strategy.
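One way to carry this insight into modeling (a hedged sketch, not yet part of the pipeline) is to turn the high-risk 5 AM–10 AM window into an explicit binary feature, reusing the same hour derivation as above:

```python
import pandas as pd

# Stand-in rows; TransactionDT is seconds elapsed, as in the real data
demo = pd.DataFrame({'TransactionDT': [3600 * h for h in (2, 6, 9, 14, 23)]})

hour = (demo['TransactionDT'] // 3600) % 24
demo['is_high_risk_hour'] = hour.between(5, 10).astype(int)  # 1 inside 5-10 (inclusive)
print(demo)
```

Tree-based models can often learn this split on their own, but an explicit flag makes the “sleeping user” signal available to simpler models too.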

Cute Shocked

MACHINE LEARNING

Teaching Algorithms to Read Evidence

It’s time to bring our hero detective into the investigation—The Models. Machine Learning is the detective that never sleeps, trained on thousands of past cases to recognize the subtle fingerprints of fraud.

Unlike rule-based systems that follow rigid scripts, our models learn the statistical signatures of deception—how fraudsters deviate from normal behavior in ways too complex for human intuition. They process 237 features in milliseconds, scoring each transaction’s risk so human analysts can focus on the most suspicious cases.

In this investigation, we’ll test three elite detectives. Stay tuned to find out who they are!

4.1 Feature Encoding & Data Split

Digitizing the Clues: Forensic algorithms cannot process words. To make the evidence usable, we must convert categorical labels (like card type or email provider) into machine-readable numbers. We also discard administrative noise like TransactionID—the case number doesn’t make you guilty.

The Control Group: To ensure our investigation is unbiased, we split the evidence.

# We drop TransactionID, TransactionDT and isFraud as they are not features
cols_to_drop_model = ['isFraud', 'TransactionID', 'TransactionDT']

# Filter to ensure we only drop columns that actually exist
existing_drop_cols = [c for c in cols_to_drop_model if c in df_cleaned.columns]

X = df_cleaned.drop(columns=existing_drop_cols)
y = df_cleaned['isFraud']

# Encode Categorical Features
cat_cols = X.select_dtypes(include=['object']).columns
print(f"Encoding {len(cat_cols)} categorical columns..")

for col in cat_cols:
    le = LabelEncoder()
    X[col] = X[col].astype(str)
    X[col] = le.fit_transform(X[col])

# We take 5% out completely. This 'X_holdout' is our final "unseen" data.
X_main, X_holdout, y_main, y_holdout = train_test_split(X, y, test_size=0.05, random_state=42, stratify=y)

# 'stratify=y' ensures that the training and test sets keep the same percentage of fraud (~3.5%)
X_train, X_test, y_train, y_test = train_test_split(X_main, y_main, test_size=0.2, random_state=42, stratify=y_main)

print(f"Data Split Complete.")
print(f"Training Shape: {X_train.shape}")
print(f"Testing Shape:  {X_test.shape}")
print(f"Holdout Shape: {X_holdout.shape}")
Encoding 49 categorical columns..
Data Split Complete.
Training Shape: (448810, 237)
Testing Shape:  (112203, 237)
Holdout Shape: (29527, 237)

4.2 Building the Basic Detector : Logistic Regression

Every investigation starts with the obvious leads. We’re deploying a simple Logistic Regression model on the raw evidence to see if the case is easy to crack. If this fails, we know we need more advanced tactics.

The Technical Necessity: Scaling the Evidence. Logistic Regression is sensitive to the “magnitude” of the clues. It might blindly assume a $10,000 transaction is more important than a Device ID of 5 simply because the number is larger.

We apply StandardScaler, which standardizes every feature to mean 0 and standard deviation 1, so the model judges features on their predictive power, not their raw size.
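For intuition, the standardization formula z = (x − mean) / std can be checked in a couple of lines (toy numbers, not our dataset):

```python
import numpy as np

amounts = np.array([10.0, 100.0, 10000.0])        # wildly different magnitudes
z = (amounts - amounts.mean()) / amounts.std()    # the StandardScaler formula, per feature

print(z.round(3))
print(f"mean = {abs(z.mean()):.4f}, std = {z.std():.4f}")  # mean = 0.0000, std = 1.0000
```

Note the result is centred with unit variance, not squeezed into a fixed [-1, 1] interval: an extreme outlier can still produce a z-score far outside that range.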

print("Baseline Model Training...")

# Standard Scaling is required for Logistic Regression
baseline_model = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear', random_state=42, max_iter=1000))

baseline_model.fit(X_train, y_train)

# Get Baseline Predictions
y_prob_base = baseline_model.predict_proba(X_test)[:, 1]
y_pred_base = baseline_model.predict(X_test)

baseline_auc = roc_auc_score(y_test, y_prob_base)
print(f"Baseline ROC-AUC Score: {baseline_auc:.4f}")
print("Baseline Report:\n", classification_report(y_test, y_pred_base))
Baseline Model Training...
Baseline ROC-AUC Score: 0.8428
Baseline Report:
               precision    recall  f1-score   support

         0.0       0.97      1.00      0.98    108277
         1.0       0.78      0.17      0.29      3926

    accuracy                           0.97    112203
   macro avg       0.88      0.59      0.63    112203
weighted avg       0.96      0.97      0.96    112203

Performance Check: “Can we do better?”

The Verdict: Good Instincts, Bad Eyesight

This is a classic “High Precision, Low Recall” result.

Accuracy (97%) is a Mirage: In a dataset where 96.5% of transactions are legit, this score essentially means the model is lazy, not smart.

ROC-AUC (0.84) is Decent: We are beating random guesses, but we aren’t elite yet.

Precision (0.78) is Trustworthy: When the model does flag a crime, it’s usually right. False alarms are low.

Recall (0.17) is the Failure: This is the dealbreaker. We are catching only 17% of the fraudsters. The vast majority are walking away undetected.

Time to bring out the big guns

Our rookie detective (Logistic Regression) struggled with the complexity of the case. It’s time to bring in a full task force. The Random Forest algorithm doesn’t rely on just one decision, it builds a “forest” of hundreds of individual Decision Trees.

4.3 Random Forest Classifier

How it Works: Imagine 100 detectives examining the same evidence, but each asking different questions. One looks at the money, one looks at the location, and one looks at the device. They then vote on the final verdict. This "crowd wisdom" helps filter out noise and catches complex, non-linear schemes that a single linear model would miss.

print("Training Model 2 (Random Forest) -")

rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    n_jobs=-1,
    random_state=42,
    class_weight='balanced',
    verbose=0 # Set to 0 to keep output clean
)

rf_model.fit(X_train, y_train)

# Get predictions
y_pred_rf = rf_model.predict(X_test)
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]

print("\n--- Model 2 Results: Random Forest ---")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_prob_rf):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))

# --- VISUALIZATION DASHBOARD ---
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Confusion Matrix
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0], cbar=False,
            annot_kws={'size': 14, 'weight': 'bold'})
axes[0, 0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Predicted', fontsize=12)
axes[0, 0].set_ylabel('Actual', fontsize=12)
axes[0, 0].set_xticklabels(['Legitimate', 'Fraud'])
axes[0, 0].set_yticklabels(['Legitimate', 'Fraud'])

# 2. ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob_rf)
roc_auc = auc(fpr, tpr)
axes[0, 1].plot(fpr, tpr, color='#d64545', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
axes[0, 1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0, 1].set_xlim([0.0, 1.0])
axes[0, 1].set_ylim([0.0, 1.05])
axes[0, 1].set_xlabel('False Positive Rate', fontsize=12)
axes[0, 1].set_ylabel('True Positive Rate', fontsize=12)
axes[0, 1].set_title('ROC Curve', fontsize=14, fontweight='bold')
axes[0, 1].legend(loc="lower right")
axes[0, 1].grid(alpha=0.3)

# 3. Precision-Recall Curve (Best for Imbalanced Data)
precision, recall, _ = precision_recall_curve(y_test, y_prob_rf)
pr_auc = auc(recall, precision)
axes[1, 0].plot(recall, precision, color='#2b7d99', lw=2, label=f'PR Curve (AUC = {pr_auc:.2f})')
axes[1, 0].set_xlabel('Recall', fontsize=12)
axes[1, 0].set_ylabel('Precision', fontsize=12)
axes[1, 0].set_title('Precision-Recall Curve', fontsize=14, fontweight='bold')
axes[1, 0].legend(loc="lower left")
axes[1, 0].grid(alpha=0.3)

# 4. Feature Importances
feature_importances = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values(by='importance', ascending=False).head(10)

sns.barplot(x='importance', y='feature', data=feature_importances,
            hue='feature', palette='viridis', ax=axes[1, 1], legend=False)
axes[1, 1].set_title('Top 10 Feature Importances', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Importance Score', fontsize=12)
axes[1, 1].set_ylabel('Feature', fontsize=12)
axes[1, 1].grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.show()
Training Model 2 (Random Forest) -

--- Model 2 Results: Random Forest ---
Accuracy:  0.9586
ROC-AUC:   0.9557

Classification Report:
              precision    recall  f1-score   support

         0.0       0.99      0.97      0.98    108277
         1.0       0.45      0.78      0.57      3926

    accuracy                           0.96    112203
   macro avg       0.72      0.87      0.77    112203
weighted avg       0.97      0.96      0.96    112203

png

The Verdict:

Our switch to the Random Forest was a massive success: recall on fraud jumped from 0.17 to 0.78, so we now catch most fraudsters instead of missing them, at the cost of more false alarms (precision fell to 0.45).

Visualization - Decision Tree

To verify our model isn’t guessing, we visualize the first three levels of its decision-making process. The colors guide us: blue paths lead towards Fraud patterns, while orange paths identify safe, Non-Fraud behavior.

print("Visualizing a Single Decision Tree (Depth=3)...")

# Pick the first tree from the forest
one_tree = rf_model.estimators_[0]

plt.figure(figsize=(28, 12)) # Large size to accommodate the wide tree
plot_tree(
    one_tree,
    feature_names=X_train.columns,
    class_names=['Not Fraud', 'Fraud'],
    filled=True,
    rounded=True,
    max_depth=3,
    fontsize=8      # Smaller font to fit the text
)
plt.title("Decision Path of a Single Tree (Depth 3)", fontsize=16)
plt.show()
Visualizing a Single Decision Tree (Depth=3)...

png

The Pursuit of Excellence: From Democracy to Strategy

Random Forest delivered solid results—its task force of voting trees caught many fraudsters that slipped past linear models. But in fraud detection, “solid” isn’t enough when billions are at stake. We began to wonder: what if instead of letting 100 detectives vote independently, we could create a strategic hierarchy where each new investigator learns from the mistakes of the previous ones?

4.4 XGBoost: Extreme Gradient Boosting

This is the leap from parallel democracy to sequential mastery. Random Forest’s trees are trained in isolation, averaging their votes. While robust, this can be inefficient. Some trees waste effort on easy cases while others miss subtle patterns. XGBoost (Extreme Gradient Boosting) introduces intelligence through iteration: it builds trees one at a time, with each new tree explicitly correcting the residual errors left by its predecessors. This targeted focus often yields higher accuracy, faster training convergence, and better handling of the rare fraud signals buried in massive datasets.
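To make the “correcting residual errors” idea concrete, here is a minimal toy sketch of L2 boosting with hand-rolled decision stumps on synthetic 1-D data. This is illustrative only; real XGBoost adds regularization, second-order gradients, column subsampling, and much more:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, size=x.shape)   # noisy target to learn

def fit_stump(x, y):
    """Greedily find the single threshold split that minimizes squared error."""
    best = None
    for thr in x[1:]:
        left, right = y[x < thr].mean(), y[x >= thr].mean()
        err = np.sum((np.where(x < thr, left, right) - y) ** 2)
        if best is None or err < best[0]:
            best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda q: np.where(q < thr, left, right)

pred = np.zeros_like(y)
mse_before = np.mean((y - pred) ** 2)

for _ in range(20):                     # 20 boosting rounds
    stump = fit_stump(x, y - pred)      # each new stump fits the current *residuals*
    pred = pred + 0.5 * stump(x)        # shrink its contribution (learning rate)

mse_after = np.mean((y - pred) ** 2)
print(f"MSE before boosting: {mse_before:.3f} -> after 20 rounds: {mse_after:.3f}")
```

Each stump only has to explain what the ensemble so far got wrong, which is exactly the sequential error-correction that distinguishes boosting from Random Forest’s independent voting.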

print("Model 3: XGBoost...")

# 1. Handle Class Imbalance
weight_ratio = float(len(y_train[y_train == 0])) / len(y_train[y_train == 1])
print(f"Imbalance Ratio (scale_pos_weight): {weight_ratio:.2f}")

# 2. Initialize XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=9,            # Slightly deeper to capture complex fraud patterns
    learning_rate=0.1,
    scale_pos_weight=weight_ratio, # Critical for this imbalanced dataset
    eval_metric='auc',      # Optimize for ROC-AUC directly
    random_state=42,
    n_jobs=-1,
    tree_method='hist'      # 'hist' is much faster for large datasets
)

# 3. Train
print("Training XGBoost...")
xgb_model.fit(X_train, y_train)

# 4. Predict
y_pred_xgb = xgb_model.predict(X_test)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]

# 5. Evaluate
print("\n--- Model 3 Results: XGBoost ---")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_xgb):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_prob_xgb):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_xgb))

# --- VISUALIZATION DASHBOARD ---
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Confusion Matrix
cm = confusion_matrix(y_test, y_pred_xgb)
sns.heatmap(cm, annot=True, fmt='d', cmap='Oranges', ax=axes[0, 0], cbar=False,
            annot_kws={'size': 14, 'weight': 'bold'})
axes[0, 0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Predicted', fontsize=12)
axes[0, 0].set_ylabel('Actual', fontsize=12)
axes[0, 0].set_xticklabels(['Legitimate', 'Fraud'])
axes[0, 0].set_yticklabels(['Legitimate', 'Fraud'])

# 2. ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob_xgb)
roc_auc = auc(fpr, tpr)
axes[0, 1].plot(fpr, tpr, color='#d64545', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
axes[0, 1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0, 1].set_xlim([0.0, 1.0])
axes[0, 1].set_ylim([0.0, 1.05])
axes[0, 1].set_xlabel('False Positive Rate', fontsize=12)
axes[0, 1].set_ylabel('True Positive Rate', fontsize=12)
axes[0, 1].set_title('ROC Curve', fontsize=14, fontweight='bold')
axes[0, 1].legend(loc="lower right")
axes[0, 1].grid(alpha=0.3)

# 3. Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_prob_xgb)
pr_auc = auc(recall, precision)
axes[1, 0].plot(recall, precision, color='#2b7d99', lw=2, label=f'PR Curve (AUC = {pr_auc:.2f})')
axes[1, 0].set_xlabel('Recall', fontsize=12)
axes[1, 0].set_ylabel('Precision', fontsize=12)
axes[1, 0].set_title('Precision-Recall Curve', fontsize=14, fontweight='bold')
axes[1, 0].legend(loc="lower left")
axes[1, 0].grid(alpha=0.3)

# 4. Feature Importances
feature_importances_xgb = pd.DataFrame({
    'feature': X_train.columns,
    'importance': xgb_model.feature_importances_
}).sort_values(by='importance', ascending=False).head(10)

sns.barplot(x='importance', y='feature', data=feature_importances_xgb,
            hue='feature', palette='magma', ax=axes[1, 1], legend=False)
axes[1, 1].set_title('Top 10 Feature Importances (XGBoost)', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Importance Score', fontsize=12)
axes[1, 1].set_ylabel('Feature', fontsize=12)
axes[1, 1].grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.show()
Model 3: XGBoost...
Imbalance Ratio (scale_pos_weight): 27.58
Training XGBoost...

--- Model 3 Results: XGBoost ---
Accuracy:  0.9553
ROC-AUC:   0.9747

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.96      0.98    108277
         1.0       0.43      0.87      0.58      3926

    accuracy                           0.96    112203
   macro avg       0.71      0.91      0.78    112203
weighted avg       0.98      0.96      0.96    112203

png

Verdict:

The Champion Has Arrived: XGBoost lives up to the hype. By analyzing the errors of previous trees, it pushed our performance into the elite tier (ROC-AUC 0.97, recall 0.87).

4.5 Evaluating our fraud detector: Beyond plain accuracy

In fraud detection, accuracy alone is misleading.
We care about questions like:

- Of all the real frauds, how many did we actually catch? (recall)
- When we raise an alert, how often is it a real fraud? (precision)

The confusion matrix tells a very practical story:

- False positives are legitimate customers declined at checkout (the “Sarah” problem).
- False negatives are frauds that slip through and become chargeback losses.

A risk team may prefer:

- Higher recall at the cost of precision, because a missed fraud typically costs far more than the manual review of a false alarm.

We interpret our results in that spirit.
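As a toy refresher of how these metrics fall out of the confusion matrix (made-up labels, not our data):

```python
# Label 1 = fraud, 0 = legitimate. 10 transactions, 4 of them fraudulent.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # ground truth
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]   # the model flags 3 transactions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # frauds caught
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # frauds missed
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # correct passes

precision = tp / (tp + fp)   # of the alerts raised, how many were real?
recall    = tp / (tp + fn)   # of the real frauds, how many did we catch?
accuracy  = (tp + tn) / len(y_true)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Precision={precision:.2f}  Recall={recall:.2f}  Accuracy={accuracy:.2f}")
```

Here accuracy looks respectable (0.70) while recall is only 0.50: half the frauds walk free, which is the exact blind spot the sections below probe.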

print(" 4.5 The Verdict: ROC-AUC Comparison -")

# 1. Calculate Scores
# We assume you have run the previous cells and have these probability variables
score_lr  = roc_auc_score(y_test, y_prob_base)
score_rf  = roc_auc_score(y_test, y_prob_rf)
score_xgb = roc_auc_score(y_test, y_prob_xgb)

# 2. Create a Comparison DataFrame
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'ROC-AUC': [score_lr, score_rf, score_xgb]
})

# Sort by Score (Highest on top)
results = results.sort_values(by='ROC-AUC', ascending=False).reset_index(drop=True)

print("\nFinal Leaderboard:")
print(results)
print('\n')

# --- 1. GATHER PREDICTIONS ---
# (Models trained above: 'baseline_model', 'rf_model', 'xgb_model')

models = {
    'Logistic Regression': baseline_model,
    'Random Forest': rf_model,
    'XGBoost': xgb_model
}

# Create a dictionary to store per-model metrics (note: this reuses the 'results' name from the leaderboard above)
results = {}

for name, model in models.items():
    # Predict
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    # Store metrics
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_prob),
        'Recall': recall_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred),
        'y_prob': y_prob  # Save probas for plotting curves
    }

# Convert metrics to DataFrame for plotting
metrics_df = pd.DataFrame(results).T.drop(columns=['y_prob']) # Drop probs for the bar chart
metrics_df = metrics_df.reset_index().rename(columns={'index': 'Model'})
metrics_melted = metrics_df.melt(id_vars='Model', var_name='Metric', value_name='Score')

# --- 2. VISUALIZATION DASHBOARD ---
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(2, 2)

# Plot 1: Grouped Bar Chart (Top spanning both columns)
ax1 = fig.add_subplot(gs[0, :])
sns.barplot(x='Metric', y='Score', hue='Model', data=metrics_melted, palette=['#2b7d99', '#f4a261', '#d64545'], ax=ax1)
ax1.set_ylim(0, 1.1)
ax1.set_title('Model Performance Showdown: Key Metrics', fontsize=16, fontweight='bold')
ax1.set_ylabel('Score', fontsize=12)
ax1.set_xlabel('')
ax1.legend(loc='lower right', title='Model', fontsize=11)
ax1.grid(axis='y', alpha=0.3)

# Add value labels on top of bars
for p in ax1.patches:
    ax1.annotate(f'{p.get_height():.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                 textcoords='offset points')

# Plot 2: ROC Curve Overlay (Bottom Left)
ax2 = fig.add_subplot(gs[1, 0])
colors = {'Logistic Regression': '#2b7d99', 'Random Forest': '#f4a261', 'XGBoost': '#d64545'}

for name, model in models.items():
    y_prob = results[name]['y_prob']
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc_score = results[name]['ROC-AUC']
    ax2.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.2f})', color=colors[name], lw=2)

ax2.plot([0, 1], [0, 1], 'k--', lw=1) # Diagonal line
ax2.set_xlim([0.0, 1.0])
ax2.set_ylim([0.0, 1.05])
ax2.set_xlabel('False Positive Rate', fontsize=12)
ax2.set_ylabel('True Positive Rate', fontsize=12)
ax2.set_title('ROC Curve Comparison', fontsize=14, fontweight='bold')
ax2.legend(loc="lower right")
ax2.grid(alpha=0.3)

# Plot 3: Precision-Recall Curve Overlay (Bottom Right)
ax3 = fig.add_subplot(gs[1, 1])

for name, model in models.items():
    y_prob = results[name]['y_prob']
    precision, recall, _ = precision_recall_curve(y_test, y_prob)
    pr_auc = auc(recall, precision)
    ax3.plot(recall, precision, label=f'{name} (AUC = {pr_auc:.2f})', color=colors[name], lw=2)

ax3.set_xlabel('Recall', fontsize=12)
ax3.set_ylabel('Precision', fontsize=12)
ax3.set_title('Precision-Recall Curve Comparison', fontsize=14, fontweight='bold')
ax3.legend(loc="lower left")
ax3.grid(alpha=0.3)

plt.tight_layout()
plt.show()
 4.5 The Verdict: ROC-AUC Comparison -

Final Leaderboard:
                 Model   ROC-AUC
0              XGBoost  0.974681
1        Random Forest  0.955687
2  Logistic Regression  0.842800

png

Final Verdict: XGBoost - The Undisputed Victor

Why XGBoost wins:

  1. Highest Recall (Most Important for Fraud Detection)

    • Logistic Regression: Recall 0.17 → misses most fraud.
    • XGBoost: Recall 0.87 → detects the majority of fraudulent transactions.

  2. Best Precision–Recall Performance

    • XGBoost PR AUC: 0.83
    • Logistic Regression PR AUC: 0.36

    XGBoost handles class imbalance far more effectively.

  3. Acceptable Precision Trade-off

    • Logistic Regression Precision: 0.78
    • XGBoost Precision: 0.43

    Lower precision is acceptable since false negatives (missed fraud) are costlier than false positives.

  4. Overall Strength

    • ROC-AUC: 0.97 (highest among all models)

→ Final Choice: XGBoost. It provides the strongest fraud detection performance and minimizes missed fraudulent activity.
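To see why the precision trade-off pays, here is a back-of-the-envelope cost sketch. The per-incident dollar costs are assumptions, and the error counts are rounded figures implied by the reported recall/precision on roughly 3,900 test frauds, not exact model outputs:

```python
# Assumed, illustrative unit costs (not from the dataset):
COST_FN = 500   # a missed fraud (false negative): chargeback, fees, goods lost
COST_FP = 5     # a false alarm (false positive): manual review, customer friction

def expected_cost(fn, fp):
    """Total dollar cost of a model's mistakes under the assumed unit costs."""
    return fn * COST_FN + fp * COST_FP

# Rough error counts implied by the reported test-set recall/precision
lr_cost  = expected_cost(fn=3250, fp=190)    # Logistic Regression: recall 0.17
xgb_cost = expected_cost(fn=510,  fp=4500)   # XGBoost: recall 0.87

print(f"Logistic Regression: ${lr_cost:,}")   # $1,625,950
print(f"XGBoost:             ${xgb_cost:,}")  # $277,500
```

Even with more than twenty times the false alarms, XGBoost comes out far cheaper under these assumptions, because missed frauds dominate the bill.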

Cute Shocked

EVALUATING ON UNSEEN DATA

The Final Trial: Judging on Unseen Cases

To ensure the integrity of our investigation, we cannot rely solely on the evidence we have already analyzed. That is why, back in the data split, we sequestered 5% of the transaction data in a digital “vault,” creating a set of “cold cases” that our algorithms have never seen. By testing our final suspect profile against this untouched evidence, we prove that our model hasn’t just memorized the details of past crimes, but has genuinely learned to detect the fingerprints of fraud in the real world.

print("Final Check: Predictions on the Holdout Set...")

# 1. Make Predictions
y_pred_holdout = xgb_model.predict(X_holdout)

# 2. Retrieve TransactionIDs
holdout_ids = df_cleaned.loc[X_holdout.index, 'TransactionID']

# 3. Create a Results DataFrame
results_df = pd.DataFrame({
    'TransactionID': holdout_ids.values,
    'y_predicted': y_pred_holdout,
    'y_actual': y_holdout.values
})

# 4. Select 15 fraud and 15 non-fraud samples
sample_fraud = results_df[results_df['y_actual'] == 1].head(15)
sample_legit = results_df[results_df['y_actual'] == 0].head(15)

# 5. Combine into one table
final_display = pd.concat([sample_fraud, sample_legit])

holdout_accuracy = accuracy_score(y_holdout, y_pred_holdout)
holdout_auc = roc_auc_score(y_holdout, xgb_model.predict_proba(X_holdout)[:, 1])

print(f"Total Rows in Holdout: {len(y_holdout)}")
print(f"Holdout Accuracy:      {holdout_accuracy:.4f}")
print(f"Holdout ROC-AUC:       {holdout_auc:.4f}\n")

print(f"\nDisplaying {len(final_display)} sample rows from Holdout Set:")
print(final_display.to_string(index=False))
Final Check: Predictions on the Holdout Set...
Total Rows in Holdout: 29527
Holdout Accuracy:      0.9535
Holdout ROC-AUC:       0.9743


Displaying 30 sample rows from Holdout Set:
 TransactionID  y_predicted  y_actual
     3265024.0            1       1.0
     3338787.0            0       1.0
     3228012.0            1       1.0
     3288978.0            1       1.0
     3378820.0            1       1.0
     3494141.0            1       1.0
     3510134.0            1       1.0
     3409700.0            1       1.0
     3042258.0            1       1.0
     3475711.0            1       1.0
     3399748.0            1       1.0
     3320659.0            1       1.0
     3307507.0            1       1.0
     3335475.0            1       1.0
     3151082.0            1       1.0
     3495345.0            0       0.0
     3059907.0            0       0.0
     3568138.0            0       0.0
     3306214.0            0       0.0
     3395254.0            0       0.0
     3447490.0            0       0.0
     2999686.0            0       0.0
     3246309.0            0       0.0
     3076325.0            0       0.0
     3200308.0            0       0.0
     3222756.0            1       0.0
     3121241.0            0       0.0
     3118889.0            0       0.0
     3088558.0            0       0.0
     3222495.0            0       0.0

The Closure

The Unveiling of the Vault: Upon unlocking our 5% holdout set (evidence completely sequestered from the training process), our XGBoost model achieved a ROC-AUC of 0.9743. In forensic terms, our suspect profile is remarkably robust, maintaining elite performance even on “cold cases” it had never seen before.

Evidence Analysis: In the 30 sampled cases above, the model correctly flagged 14 of the 15 frauds (missing only TransactionID 3338787) and raised just one false alarm among the 15 legitimate transactions (TransactionID 3222756).

Impact: Implementing this system would likely save millions in chargeback losses while ensuring that the vast majority of legitimate customers, like Sarah, can check out without interruption.

Cute Shocked

🎬 CLOSING SCENE: Sarah’s new ending

png

Future Work: The Next Frontier

As we look ahead, the next stage is to make our fraud detection system faster, clearer, and more adaptable. Moving from batch processing to real-time scoring will allow transactions to be evaluated within milliseconds, which is essential in real-world payment environments. Adding model explainability through tools like SHAP will help analysts and regulators understand exactly why a transaction was flagged. Because fraud patterns constantly change, we’ll also need automated retraining pipelines that detect when the model’s performance drops and update it accordingly. In the future, we can explore more advanced methods such as graph neural networks to capture relationships between users and transactions, and semi-supervised learning to take advantage of large volumes of unlabeled data. Together, these improvements will make the system more reliable and better equipped to adapt to evolving fraud behavior.

References & Resources

Data Source

Methodologies & Libraries

Further Reading