Open to Opportunities

Hi, I'm Yash Karle

Data Analyst · Business Intelligence · SQL & Python
I transform complex datasets into clear, actionable insights that drive revenue growth, reduce operational costs, and accelerate strategic decision-making.

Python · SQL · Power BI · Pandas · Scikit-learn · XGBoost · TensorFlow · Flask · Matplotlib · Seaborn

Featured Projects

Each project follows a structured case-study format — from business problem through methodology to measurable impact.

Machine Learning

Customer Churn Prediction & Retention Strategy

Built an XGBoost classifier to identify at-risk telecom customers, segment them by churn probability, and recommend targeted retention actions — supported by SQL analytics and a Power BI dashboard.

Python · XGBoost · SQL · Power BI · Pandas · Scikit-learn
Model Accuracy
78.5%
ROC AUC
0.826

📌 Business Problem

A telecom provider faced rising customer attrition with no systematic way to predict which customers would leave or why. Retention campaigns were untargeted, wasting budget on low-risk customers while high-risk accounts churned silently.

📂 Data

  • Source: Internal CRM — 1,993 unique customer records
  • Features: Demographics, service details, financial metrics, usage patterns, security features
  • Target: Customer status (Churned / Stayed / Joined)

🔬 Methodology

Data Cleaning (missing values, type casting) → Feature Engineering (encoding, scaling) → XGBoost classifier with hyperparameter tuning (200 estimators, max_depth=6, lr=0.05) → Risk segmentation into Low / Medium / High tiers → SQL-based cohort analysis → Interactive Power BI dashboard.
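The pipeline above can be sketched in a few lines. The data here is synthetic, and scikit-learn's GradientBoostingClassifier stands in for XGBoost (same n_estimators=200, max_depth=6, learning_rate=0.05) so the sketch runs without the xgboost package; column names are illustrative, not the real CRM schema:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CRM extract (the real dataset has 1,993 customers)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, n),
    "monthly_charge": rng.uniform(20, 120, n),
    "contract": rng.choice(["Month-to-Month", "One Year", "Two Year"], n),
})
df["churned"] = (
    (df["contract"] == "Month-to-Month") & (df["monthly_charge"] > 70)
).astype(int)

# Feature engineering: one-hot encode the categorical contract column
X = pd.get_dummies(df.drop(columns="churned"), columns=["contract"])
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient-boosted classifier (stand-in for XGBoost) with the tuned hyperparameters
model = GradientBoostingClassifier(n_estimators=200, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

# Segment customers into Low / Medium / High churn-risk tiers by predicted probability
proba = model.predict_proba(X_test)[:, 1]
tiers = pd.cut(proba, bins=[0, 0.3, 0.6, 1.0],
               labels=["Low", "Medium", "High"], include_lowest=True)
print(tiers.value_counts())
```

The tier cut-offs (0.3 / 0.6) are illustrative; in practice they would be chosen from the business cost of a missed churner versus a wasted retention offer.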

💡 Key Insights

  • 611 customers (30.7%) flagged as high churn risk — enabling proactive outreach
  • Month-to-month contracts are the #1 churn driver; two-year contracts slash churn significantly
  • Value Deal 5 is the strongest retention lever (18.4% feature importance)
  • Customers without online security churn at materially higher rates
  • Top churn reasons: competitor offers, network reliability, pricing

📈 Business Impact

  • Identified ~$500K+ monthly revenue at risk from the high-churn segment
  • Enabled targeted retention campaigns, potentially reducing churn by 15–20%
  • Automated risk scoring replaced manual quarterly reviews, saving ~40 analyst-hours/quarter
  • Power BI dashboard provided leadership with real-time churn visibility for strategic planning

Machine Learning

Insurance Cost Prediction & Risk Modelling

Developed a multi-model regression pipeline to predict insurance charges and quantify the impact of smoking, BMI, and age on premiums — reaching an R² of 0.82 on unseen data with Ridge regression on polynomial features.

Python · Scikit-learn · Ridge Regression · Pandas · Matplotlib
Best R² Score
0.821
Records
2,771

📌 Business Problem

An insurance provider needed a data-driven approach to understand which policyholder characteristics drive premium costs, enabling more accurate pricing and risk segmentation instead of relying on broad actuarial tables.

📂 Data

  • Source: Insurance claims dataset — 2,771 records × 7 features
  • Features: Age, gender, BMI, number of children, smoker status, region, charges
  • Target: Insurance charges (continuous)

🔬 Methodology

Data Wrangling (replace invalid markers, impute missing values, type casting) → Correlation Analysis (identified smoker status as the strongest predictor, r=0.789 with charges) → Progressive model comparison: Simple LR → Multi-feature LR → Polynomial Pipeline → Ridge Regression → Ridge + Polynomial (degree=2) with an 80/20 train-test split.
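A minimal sketch of the final Ridge + polynomial step. The charges are synthetic, built with a smoker × BMI interaction to mimic the non-linear effects described, and alpha=1.0 is an illustrative choice rather than the tuned value:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the 2,771-record claims dataset
rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 65, n)
bmi = rng.normal(28, 5, n)
smoker = rng.integers(0, 2, n)
# Charges with a smoker x BMI interaction term plus noise
charges = (250 * age + 400 * bmi + 20000 * smoker
           + 600 * smoker * (bmi - 25) + rng.normal(0, 2000, n))

X = np.column_stack([age, bmi, smoker])
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.2, random_state=0)

# Ridge on degree-2 polynomial features, as in the final model of the comparison
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(f"Test R2: {model.score(X_test, y_test):.3f}")
```

The degree-2 expansion is what lets a linear model pick up the smoker × BMI interaction that a plain multi-feature regression misses.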

💡 Key Insights

  • Smoking status is the dominant cost driver (r=0.789), explaining 62% of variance alone
  • Polynomial features boosted in-sample explanatory power from 75% to 85%
  • Ridge regularisation delivered a robust R²=0.821 on unseen data
  • BMI and age contribute additive, non-linear effects on premium costs

📈 Business Impact

  • Enabled risk-adjusted premium pricing with 82% prediction accuracy
  • Smoking-based segmentation alone could improve pricing precision by ~20%
  • Model supports actuarial teams in identifying high-cost customer profiles before policy issuance
  • Reduced underwriting processing time through automated risk scoring

Analytics & BI

KPI Dashboard & ELT Pipeline

Engineered an end-to-end ELT pipeline that ingests 51K+ sales records from JSON, loads them into SQLite, computes KPIs via SQL, and renders a real-time Flask web dashboard — fully automated.

Python · SQL · Flask · SQLite · Pandas · HTML/CSS
Total Sales
$12.6M
Profit Margin
11.61%

📌 Business Problem

The sales team relied on manual spreadsheet analysis to track KPIs, leading to delayed reporting, inconsistent metrics, and no single source of truth. Leadership needed a self-refreshing dashboard for real-time performance visibility.

📂 Data

  • Source: StoreSales.json — 51,291 sales records
  • Features: 24 columns covering order details, customer data, location, product info, and financial metrics
  • Database: SQLite with single normalised sales table (13.3 MB)

🔬 Methodology

Extract: Ingest JSON via Pandas → Load: Write to SQLite with schema auto-creation → Transform: SQL aggregate queries to compute 5 KPIs (Total Sales, Total Profit, Quantity Sold, Avg. Discount, Profit Margin) plus category breakdowns and Top-10 products → Serve: Flask web app with HTML dashboard.
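The Extract → Load → Transform steps can be sketched with an in-memory SQLite database; the four-row DataFrame is a made-up stand-in for StoreSales.json (column names are illustrative), and the Flask serving layer is omitted:

```python
import sqlite3
import pandas as pd

# Tiny stand-in for StoreSales.json (the real pipeline ingests 51,291 records)
sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "category": ["Technology", "Office Supplies", "Technology", "Furniture"],
    "quantity": [2, 10, 1, 3],
    "sales": [2000.0, 150.0, 1200.0, 600.0],
    "profit": [400.0, 30.0, 180.0, -50.0],
    "discount": [0.1, 0.2, 0.0, 0.3],
})

# Load: write the DataFrame into SQLite; the table schema is created automatically
conn = sqlite3.connect(":memory:")
sales.to_sql("sales", conn, index=False, if_exists="replace")

# Transform: compute the five dashboard KPIs with a single aggregate query
kpis = pd.read_sql_query(
    """
    SELECT SUM(sales)                       AS total_sales,
           SUM(profit)                      AS total_profit,
           SUM(quantity)                    AS quantity_sold,
           AVG(discount)                    AS avg_discount,
           100.0 * SUM(profit) / SUM(sales) AS profit_margin_pct
    FROM sales
    """,
    conn,
)
print(kpis.round(2))
```

Because the KPIs live in SQL rather than in spreadsheet formulas, every refresh of the source file reproduces the same metrics — which is the "single source of truth" property the dashboard is after.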

💡 Key Insights

  • $12.6M total sales with $1.47M profit across 178K units
  • Average discount of 14.29% across all transactions
  • Technology category leads revenue; Office Supplies leads volume
  • Identified Top-10 products contributing disproportionately to profit

📈 Business Impact

  • Dashboard replaced ~8 hours/week of manual Excel reporting
  • Provided leadership with real-time KPI visibility for quarterly planning
  • Automated ELT pipeline ensures consistent, reproducible metrics
  • Scalable architecture supports additional data sources and KPIs

Machine Learning

Air Quality Index Prediction

Conducted geospatial EDA on 17.7K NYC air quality records and built a Linear Regression model to forecast PM2.5 trends — revealing a 13-year declining pollution trajectory and neighbourhood-level disparities.

Python · Scikit-learn · Pandas · Matplotlib · Seaborn
Records
17,743
Time Span
2009–2022

📌 Business Problem

Public health agencies needed to quantify pollution trends across NYC neighbourhoods to allocate monitoring resources, prioritise interventions, and forecast future air quality under current regulatory policies.

📂 Data

  • Source: NYC Dept. of Health — 17,743 records × 12 features
  • Indicators: PM2.5, NO₂, O₃, boiler emissions, vehicle miles, health impacts
  • Coverage: 42 UHF neighbourhoods across NYC, 2009–2022

🔬 Methodology

Date parsing → IQR-based outlier removal (Q1=8.9, Q3=26.9) → Missing value handling → Filter PM2.5 subset → Feature engineering (extract year) → 70/30 train-test split → Linear Regression → Temporal trend analysis + geographic heatmaps.
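The same steps — IQR filtering, a 70/30 split, and a linear trend fit — look roughly like this on synthetic PM2.5 readings (the Q1/Q3 values quoted above come from the real data; here they are recomputed from the toy sample):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic PM2.5 readings with a gentle downward trend, standing in for the NYC data
rng = np.random.default_rng(7)
years = rng.integers(2009, 2023, 800)
pm25 = 30 - 0.8 * (years - 2009) + rng.normal(0, 2, 800)
df = pd.DataFrame({"year": years, "pm25": pm25})

# IQR-based outlier removal (1.5 * IQR fences)
q1, q3 = df["pm25"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["pm25"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 70/30 split, then fit a linear trend of PM2.5 on year
X_train, X_test, y_train, y_test = train_test_split(
    df[["year"]], df["pm25"], test_size=0.3, random_state=7
)
model = LinearRegression().fit(X_train, y_train)
print(f"Trend: {model.coef_[0]:.2f} units of PM2.5 per year")
```

A negative coefficient on `year` is the numeric counterpart of the declining-trend chart; the fitted line can then be extrapolated for the forecasting step.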

💡 Key Insights

  • PM2.5 levels show a consistent declining trend from 2009 to 2022
  • Bronx & Harlem neighbourhoods (High Bridge, Hunts Point) have the highest mean AQI values — 39.3 vs. the city average of 21.6
  • Seasonal patterns: Ozone spikes significantly in summer months
  • Data completeness is excellent โ€” 99.95% non-null records

📈 Business Impact

  • Identified priority neighbourhoods for additional monitoring station deployment
  • Forecasting model supports regulatory compliance planning and EPA reporting
  • Geographic disparity analysis informs equitable public health resource allocation
  • Trend projections help estimate long-term healthcare cost savings from declining pollution

Machine Learning

Real Estate Price Prediction

Trained an XGBoost regressor on 1.14M+ Connecticut real estate transactions (2001–2023) to predict sale prices, identifying location and assessed value as the dominant pricing signals.

Python · XGBoost · Scikit-learn · Pandas · Matplotlib
R² Score
0.616
Records
1.14M

📌 Business Problem

Real estate stakeholders — agents, investors, and assessors — needed a data-driven pricing model to validate property valuations, identify underpriced opportunities, and reduce reliance on subjective comparable market analysis.

📂 Data

  • Source: Connecticut Real Estate Sales (data.gov) — 1,141,722 records
  • Features: Town, residential type, assessed value, sales ratio, date listed
  • After cleaning: 738,541 valid records (80/20 split → 590K train / 148K test)

🔬 Methodology

Date parsing → Drop rows with missing critical values → Feature engineering (Years_Since_List) → OrdinalEncoder for categorical features → XGBoost Regressor (hist method, max_depth=3, 200 estimators, lr=0.1, subsample=0.8) → Evaluation with R², MAE, RMSE.

💡 Key Insights

  • Town (37.6%) and Assessed Value (33.3%) drive 71% of prediction power — location is king
  • Sales Ratio (16.7%) captures market-to-assessment discrepancies
  • MAE of $35,357 — usable for initial screening and bulk valuation
  • Model performs strongest on mid-range properties; high-end outliers increase RMSE

📈 Business Impact

  • Automated bulk property valuation for 700K+ transactions
  • Supports investors in identifying undervalued properties where sale price < model prediction
  • Reduces time-to-valuation from hours to seconds per property
  • Assessed-vs-predicted gap analysis informs tax assessment appeals

Analytics & BI

Exploratory Data Analysis Portfolio

A two-part EDA investigation — urban air quality patterns across NYC neighbourhoods and property market valuation gaps between assessed and actual sale prices — demonstrating rigorous data cleaning and visual storytelling.

Python · Pandas · NumPy · Matplotlib · Seaborn
Datasets
2
Combined Records
11K+

📌 Business Problem

Before any predictive model can succeed, the underlying data must be understood. These EDAs answer foundational questions: Which neighbourhoods face the worst pollution? How well do government property assessments match actual market prices?

📂 Data

  • Air Quality: ~1,000 environmental monitoring records — pollutant indicators, geographic locations, time periods
  • Property Sales: ~10,000 real estate transactions — assessed values, sale amounts, sales ratios, property types

🔬 Methodology

Data Ingestion (CSV/Excel via Pandas) → Null handling & type conversion → Univariate analysis (histograms, KDE) → Bivariate analysis (scatter, box plots) → Correlation heatmaps → Statistical summarisation → Publication-ready Matplotlib/Seaborn charts.
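A condensed version of that workflow on a toy property-sales table — fabricated numbers, with the plotting steps replaced by their numeric equivalents (describe(), corr(), and an IQR outlier check):

```python
import numpy as np
import pandas as pd

# Toy property-sales table standing in for the ~10,000-record dataset
rng = np.random.default_rng(3)
n = 300
assessed = rng.uniform(100_000, 500_000, n)
sale = assessed * rng.uniform(1.0, 1.6, n)  # sales systematically above assessment
df = pd.DataFrame({
    "assessed_value": assessed,
    "sale_amount": sale,
    "sales_ratio": assessed / sale,
})

# Statistical summarisation and correlation matrix (numeric precursor to the heatmap)
summary = df.describe()
corr = df.corr()
print(corr.round(2))

# IQR-based outlier flagging: the same idea a box plot shows visually
q1, q3 = df["sale_amount"].quantile([0.25, 0.75])
outliers = df[df["sale_amount"] > q3 + 1.5 * (q3 - q1)]
print(f"Extreme sales flagged: {len(outliers)}")
```

In the real EDA the corr() and quantile outputs feed Seaborn heatmaps and box plots; the numbers are the substance, the charts are the storytelling layer.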

💡 Key Insights

  • Certain NYC neighbourhoods have 2× the pollution levels of city-wide averages
  • Property assessments systematically undervalue properties in high-demand towns
  • Box plots revealed extreme outlier sales that summary statistics missed entirely
  • Seasonal air quality variation follows predictable, actionable patterns

📈 Business Impact

  • EDA findings directly informed feature selection for subsequent ML models
  • Property analysis revealed systemic valuation bias — actionable for government assessors
  • Demonstrated a reproducible EDA framework applicable to any new dataset

Deep Learning

Deepfake Image Detector

Built a CNN-based binary classifier to detect AI-generated facial images in real time, deployed via a Streamlit web app with confidence scoring — supported by a published research paper.

TensorFlow · Keras · Streamlit · NumPy · Scikit-learn · PIL
Architecture
3-Layer CNN
Inference
<1 sec

📌 Business Problem

With the proliferation of AI-generated imagery, media organisations and security teams need automated tools to flag potentially manipulated facial images before they're published or used for fraud.

📂 Data

  • Source: Curated dataset of real and AI-generated (deepfake) facial images
  • Preprocessing: Resized to 128×128, normalised pixel values (0–1), 80/20 train-test split
  • Model size: ~38 MB

🔬 Methodology

Image preprocessing (resize, normalise, RGBA→RGB) → Custom 3-layer CNN (32→64→128 filters, MaxPooling, Dropout=0.5) → Binary classification (sigmoid output) → Trained with Adam optimiser (lr=0.001) + Binary Crossentropy → Evaluated with confusion matrix & classification report → Deployed via Streamlit with real-time upload & prediction.
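Assuming "3-layer CNN" means three Conv2D/MaxPooling stages, the architecture might be sketched in Keras like this; the filter counts, dropout rate, optimiser, and loss follow the description above, while the kernel size and layer ordering are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the described 3-layer CNN for 128x128 RGB inputs
model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                       # regularisation against overfitting
    layers.Dense(1, activation="sigmoid"),     # binary real-vs-fake probability
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])

# One normalised dummy batch standing in for preprocessed face images
batch = np.random.rand(4, 128, 128, 3).astype("float32")
probs = model.predict(batch, verbose=0)
print(probs.shape)
```

The single sigmoid output is what powers the confidence scoring in the Streamlit app: the raw probability is shown rather than a hard real/fake label.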

💡 Key Insights

  • Sub-second inference enables real-time screening of uploaded images
  • Confidence scoring provides probability-based risk assessment rather than binary yes/no
  • Dropout regularisation (0.5) prevents overfitting on limited training data
  • Research paper documents methodology for academic reproducibility

📈 Business Impact

  • Deployable tool for media verification workflows and content moderation
  • Streamlit interface enables non-technical users to perform deepfake audits independently
  • Research contribution advances the deepfake detection knowledge base
  • Architecture is extensible to video detection and ensemble methods

Other Projects

Full-stack engineering and AI tooling that complement my analytics work.

🎫 Seat Booking System

Full-stack React + Spring Boot application with MySQL persistence, dynamic pricing tiers, smart validation rules, and a glassmorphism UI with light/dark theme.

React · Spring Boot · MySQL · REST API · CSS

🤖 AI Code Reviewer

MERN-stack AI-powered code review tool using Deepseek R1 via Groq API for ultra-fast inference, generating intelligent code suggestions and content.

React · Node.js · Express · MongoDB · Groq API

🔌 LLM REST API

Flask-based API server exposing the Maincoder-1B code generation model with configurable parameters, GPU-accelerated inference on Colab, and public URL via ngrok.

Flask · Transformers · ngrok · Google Colab · REST API

Data-Driven Problem Solver

I approach every dataset with one question: what decision does this need to inform? My work centres on translating raw numbers into business narratives that stakeholders can act on — whether that means identifying a $1.4M churn-risk segment, optimising insurance pricing models, or building automated KPI dashboards that replace hours of manual reporting.

I combine statistical rigour with strong stakeholder communication — writing SQL that answers real business questions, building predictive models that quantify risk, and designing visualisations that make complex findings immediately accessible to non-technical audiences.

My toolkit spans the full analytics pipeline: from data wrangling and feature engineering in Python, through machine-learning modelling with XGBoost and Scikit-learn, to interactive dashboarding in Power BI and Flask.

🎯

Structured Problem-Solving

Every analysis follows a clear framework — business question → data audit → methodology → insight → recommendation.

📊

Business Translation

Converting model outputs into dollar-value impact metrics that resonate with leadership and inform strategy.

⚙️

Automation & Pipelines

Building ELT pipelines, automated dashboards, and reproducible analysis workflows that scale beyond one-off reports.

Certifications & Achievements

2025 — Data Analysis with Python — IBM (Model Development & Evaluation)
2024 — Supervised Machine Learning: Regression & Classification — DeepLearning.AI (Stanford)
2024 — Python Data Mastery: Fundamentals to Machine Learning — LPU
2023 — Metvy Data Analytics Program (Live Cohort)
🏆 Finalist (Top 5) — College Hackathon for AI-based mental health assistant
📄 Research Paper — CNN-based Deepfake Image Detection, covering preprocessing pipelines and model evaluation

Let's Connect

Whether you have a data challenge, a collaboration idea, or just want to say hello — I'd love to hear from you.