Data Analyst · Business Intelligence · SQL & Python
I transform complex datasets into clear, actionable insights that drive revenue growth, reduce
operational costs, and accelerate strategic decision-making.
Each project follows a structured case-study format, from business problem through methodology to measurable impact.
Built an XGBoost classifier to identify at-risk telecom customers, segment them by churn probability, and recommend targeted retention actions, supported by SQL analytics and a Power BI dashboard.
A telecom provider faced rising customer attrition with no systematic way to predict which customers would leave or why. Retention campaigns were untargeted, wasting budget on low-risk customers while high-risk accounts churned silently.
Data Cleaning (missing values, type casting) → Feature Engineering (encoding, scaling) → XGBoost classifier with hyperparameter tuning (200 estimators, max_depth=6, lr=0.05) → Risk segmentation into Low / Medium / High tiers → SQL-based cohort analysis → Interactive Power BI dashboard.
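The modelling and segmentation steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (the same hyperparameter values would be passed to xgboost.XGBClassifier); the column names and tier cut-offs (0.3 / 0.7) are illustrative assumptions, not the project's actual ones.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned telecom dataset (illustrative columns).
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, n),
    "monthly_charges": rng.uniform(20, 120, n),
    "support_tickets": rng.poisson(2, n),
})
# Churn here is loosely tied to short tenure and high charges.
y = ((X["tenure_months"] < 12) & (X["monthly_charges"] > 70)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hyperparameters mirror the tuned settings (200 estimators, depth 6, lr 0.05).
model = GradientBoostingClassifier(
    n_estimators=200, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

# Segment customers into Low / Medium / High churn-risk tiers by probability.
proba = model.predict_proba(X_test)[:, 1]
tiers = pd.cut(proba, bins=[0, 0.3, 0.7, 1.0],
               labels=["Low", "Medium", "High"], include_lowest=True)
print(tiers.value_counts())
```

The tier labels are what downstream SQL cohort analysis and the Power BI dashboard would consume.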
Developed a multi-model regression pipeline to predict insurance charges and quantify the impact of smoking, BMI, and age on premiums, achieving 82% accuracy with Ridge + Polynomial features.
An insurance provider needed a data-driven approach to understand which policyholder characteristics drive premium costs, enabling more accurate pricing and risk segmentation instead of relying on broad actuarial tables.
Data Wrangling (replace invalid markers, impute missing values, type casting) → Correlation Analysis (identified smoker status as the strongest driver, r=0.789 with charges) → Progressive model comparison: Simple LR → Multi-feature LR → Polynomial Pipeline → Ridge Regression → Ridge + Polynomial (degree=2) with 80/20 train-test split.
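The final model in that comparison can be sketched as a scikit-learn pipeline. The data below is synthetic and the coefficients are invented for illustration; the real project's features and alpha may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the insurance data: age, BMI, smoker flag.
rng = np.random.default_rng(7)
n = 500
age = rng.integers(18, 65, n)
bmi = rng.normal(28, 5, n)
smoker = rng.integers(0, 2, n)
charges = 250 * age + 400 * bmi + 24000 * smoker + rng.normal(0, 2000, n)

X = np.column_stack([age, bmi, smoker])
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.2, random_state=0)

# Degree-2 polynomial features feeding a Ridge regressor, with scaling
# in between so regularisation treats all expanded features comparably.
model = make_pipeline(PolynomialFeatures(degree=2),
                      StandardScaler(),
                      Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(f"Test R^2: {r2_score(y_test, model.predict(X_test)):.3f}")
```

Wrapping the steps in one pipeline keeps the 80/20 split honest: the polynomial expansion and scaling are fit on training data only.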
Engineered an end-to-end ELT pipeline that ingests 51K+ sales records from JSON, loads into SQLite, computes KPIs via SQL, and renders a real-time Flask web dashboard, fully automated.
The sales team relied on manual spreadsheet analysis to track KPIs, leading to delayed reporting, inconsistent metrics, and no single source of truth. Leadership needed a self-refreshing dashboard for real-time performance visibility.
Extract: Ingest JSON via Pandas → Load: Write to SQLite with schema auto-creation → Transform: SQL aggregate queries to compute 5 KPIs (Total Sales, Total Profit, Quantity Sold, Avg. Discount, Profit Margin) + category breakdowns and Top-10 products → Serve: Flask web app with HTML dashboard.
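The extract-load-transform core of such a pipeline fits in a few lines. This sketch uses a three-record in-memory stand-in for the JSON feed; the field names and an in-memory SQLite database are assumptions for illustration.

```python
import sqlite3
import pandas as pd

# Extract: a tiny stand-in for the JSON sales feed (illustrative fields).
records = [
    {"product": "Desk", "category": "Furniture",
     "quantity": 3, "sales": 450.0, "profit": 90.0},
    {"product": "Pen", "category": "Office",
     "quantity": 10, "sales": 25.0, "profit": 5.0},
    {"product": "Lamp", "category": "Furniture",
     "quantity": 2, "sales": 120.0, "profit": 30.0},
]
df = pd.DataFrame(records)

# Load: write to SQLite, letting pandas create the table schema automatically.
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False, if_exists="replace")

# Transform: compute headline KPIs with a single SQL aggregate query.
kpis = pd.read_sql_query(
    """
    SELECT SUM(sales)               AS total_sales,
           SUM(profit)              AS total_profit,
           SUM(quantity)            AS quantity_sold,
           SUM(profit) / SUM(sales) AS profit_margin
    FROM sales
    """,
    conn,
)
print(kpis.to_dict("records")[0])
```

The serve step would hand the resulting KPI rows to a Flask route that renders them into the HTML dashboard, re-running the query on each request so the numbers stay current.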
Conducted geospatial EDA on 17.7K NYC air quality records and built a Linear Regression model to forecast PM2.5 trends, revealing a 13-year declining pollution trajectory and neighbourhood-level disparities.
Public health agencies needed to quantify pollution trends across NYC neighbourhoods to allocate monitoring resources, prioritise interventions, and forecast future air quality under current regulatory policies.
Date parsing → IQR-based outlier removal (Q1=8.9, Q3=26.9) → Missing value handling → Filter PM2.5 subset → Feature engineering (extract year) → 70/30 train-test split → Linear Regression → Temporal trend analysis + geographic heatmaps.
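The outlier-removal and trend-fitting steps can be sketched like this. The series below is synthetic (a gently declining 13-year trend with noise), so its quartiles differ from the project's reported Q1=8.9 / Q3=26.9; the 1.5×IQR fence and slope-as-annual-change reading are the standard techniques the pipeline names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the PM2.5 readings: a declining 13-year series.
rng = np.random.default_rng(1)
years = np.repeat(np.arange(2009, 2022), 100)
pm25 = 20 - 0.5 * (years - 2009) + rng.normal(0, 2, len(years))
df = pd.DataFrame({"year": years, "pm25": pm25})

# IQR-based outlier removal with the usual 1.5 * IQR fences.
q1, q3 = df["pm25"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["pm25"] >= q1 - 1.5 * iqr) & (df["pm25"] <= q3 + 1.5 * iqr)]

# Fit a linear trend on year alone; the slope is the estimated annual change.
model = LinearRegression().fit(df[["year"]], df["pm25"])
print(f"Estimated annual change: {model.coef_[0]:.2f} ug/m3 per year")
```

A negative slope here is what the "13-year declining trajectory" finding corresponds to; extrapolating the fitted line gives the forecast under current policy.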
Trained an XGBoost regressor on 1.14M+ Connecticut real estate transactions (2001โ2023) to predict sale prices, identifying location and assessed value as the dominant pricing signals.
Real estate stakeholders (agents, investors, and assessors) needed a data-driven pricing model to validate property valuations, identify underpriced opportunities, and reduce reliance on subjective comparable market analysis.
Date parsing → Drop rows with missing critical values → Feature engineering (Years_Since_List) → OrdinalEncoder for categorical features → XGBoost Regressor (hist method, max_depth=3, 200 estimators, lr=0.1, subsample=0.8) → Evaluation with R², MAE, RMSE.
A two-part EDA investigation, covering urban air quality patterns across NYC neighbourhoods and property-market valuation gaps between assessed and actual sale prices, demonstrating rigorous data cleaning and visual storytelling.
Before any predictive model can succeed, the underlying data must be understood. These EDAs answer foundational questions: Which neighbourhoods face the worst pollution? How well do government property assessments match actual market prices?
Data Ingestion (CSV/Excel via Pandas) → Null handling & type conversion → Univariate analysis (histograms, KDE) → Bivariate analysis (scatter, box plots) → Correlation heatmaps → Statistical summarisation → Publication-ready Matplotlib/Seaborn charts.
Built a CNN-based binary classifier to detect AI-generated facial images in real-time, deployed via a Streamlit web app with confidence scoring, supported by a published research paper.
With the proliferation of AI-generated imagery, media organisations and security teams need automated tools to flag potentially manipulated facial images before they're published or used for fraud.
Image preprocessing (resize, normalise, RGBA→RGB) → Custom 3-layer CNN (32→64→128 filters, MaxPooling, Dropout=0.5) → Binary classification (sigmoid output) → Trained with Adam optimiser (lr=0.001) + Binary Crossentropy → Evaluated with confusion matrix & classification report → Deployed via Streamlit with real-time upload & prediction.
Full-stack engineering and AI tooling that complement my analytics work.
Full-stack React + Spring Boot application with MySQL persistence, dynamic pricing tiers, smart validation rules, and a glassmorphism UI with light/dark theme.
MERN-stack AI-powered code review tool using Deepseek R1 via Groq API for ultra-fast inference, generating intelligent code suggestions and content.
Flask-based API server exposing the Maincoder-1B code generation model with configurable parameters, GPU-accelerated inference on Colab, and public URL via ngrok.
I approach every dataset with one question: what decision does this need to inform? My work centres on translating raw numbers into business narratives that stakeholders can act on, whether that means identifying a $1.4M churn-risk segment, optimising insurance pricing models, or building automated KPI dashboards that replace hours of manual reporting.
I combine statistical rigour with strong stakeholder communication: writing SQL that answers real business questions, building predictive models that quantify risk, and designing visualisations that make complex findings immediately accessible to non-technical audiences.
My toolkit spans the full analytics pipeline: from data wrangling and feature engineering in Python, through machine-learning modelling with XGBoost and Scikit-learn, to interactive dashboarding in Power BI and Flask.
Every analysis follows a clear framework: business question → data audit → methodology → insight → recommendation.
Converting model outputs into dollar-value impact metrics that resonate with leadership and inform strategy.
Building ELT pipelines, automated dashboards, and reproducible analysis workflows that scale beyond one-off reports.
Whether you have a data challenge, a collaboration idea, or just want to say hello, I'd love to hear from you.