Open to Opportunities

Hi, I'm Yash Karle

Data Analyst · Business Intelligence · SQL & Python
I transform complex datasets into clear, actionable insights that drive revenue growth, reduce operational costs, and accelerate strategic decision-making.

Python · SQL · Power BI · Pandas · Scikit-learn · XGBoost · TensorFlow · Flask · Matplotlib · Seaborn

Featured Projects

Each project follows a structured case-study format — from business problem through methodology to measurable impact.

Machine Learning

Customer Churn Prediction & Retention Strategy

Built an XGBoost classifier to identify at-risk telecom customers, segment them by churn probability, and recommend targeted retention actions — supported by SQL analytics and a Power BI dashboard.

Python · XGBoost · SQL · Power BI · Pandas · Scikit-learn
Model Accuracy
78.5%
ROC AUC
0.826

📌 Business Problem

A telecom provider faced rising customer attrition with no systematic way to predict which customers would leave or why. Retention campaigns were untargeted, wasting budget on low-risk customers while high-risk accounts churned silently.

📂 Data

  • Source: Internal CRM — 1,993 unique customer records
  • Features: Demographics, service details, financial metrics, usage patterns, security features
  • Target: Customer status (Churned / Stayed / Joined)

🔬 Methodology

Data Cleaning (missing values, type casting) → Feature Engineering (encoding, scaling) → XGBoost classifier with hyperparameter tuning (200 estimators, max_depth=6, lr=0.05) → Risk segmentation into Low / Medium / High tiers → SQL-based cohort analysis → Interactive Power BI dashboard.
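The pipeline above can be sketched in a few lines. The data here is synthetic, and scikit-learn's GradientBoostingClassifier stands in for XGBoost (same n_estimators=200, max_depth=6, learning_rate=0.05) so the sketch runs without the xgboost package; column names are illustrative, not the real CRM schema:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CRM extract (the real dataset has 1,993 customers)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, n),
    "monthly_charge": rng.uniform(20, 120, n),
    "contract": rng.choice(["Month-to-Month", "One Year", "Two Year"], n),
})
df["churned"] = (
    (df["contract"] == "Month-to-Month") & (df["monthly_charge"] > 70)
).astype(int)

# Feature engineering: one-hot encode the categorical contract column
X = pd.get_dummies(df.drop(columns="churned"), columns=["contract"])
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient-boosted classifier (stand-in for XGBoost) with the tuned hyperparameters
model = GradientBoostingClassifier(n_estimators=200, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

# Segment customers into Low / Medium / High churn-risk tiers by predicted probability
proba = model.predict_proba(X_test)[:, 1]
tiers = pd.cut(proba, bins=[0, 0.3, 0.6, 1.0],
               labels=["Low", "Medium", "High"], include_lowest=True)
print(tiers.value_counts())
```

The tier cut-offs (0.3 / 0.6) are illustrative; in practice they would be chosen from the business cost of a missed churner versus a wasted retention offer.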

💡 Key Insights

  • 611 customers (30.7%) flagged as high churn risk — enabling proactive outreach
  • Month-to-month contracts are the #1 churn driver; two-year contracts slash churn significantly
  • Value Deal 5 is the strongest retention lever (18.4% feature importance)
  • Customers without online security churn at materially higher rates
  • Top churn reasons: competitor offers, network reliability, pricing

📈 Business Impact

  • Identified ~$500K+ monthly revenue at risk from the high-churn segment
  • Enabled targeted retention campaigns, potentially reducing churn by 15–20%
  • Automated risk scoring replaced manual quarterly reviews, saving ~40 analyst-hours/quarter
  • Power BI dashboard provided leadership with real-time churn visibility for strategic planning

Machine Learning

Insurance Cost Prediction & Risk Modelling

Developed a multi-model regression pipeline to predict insurance charges and quantify the impact of smoking, BMI, and age on premiums — reaching an R² of 0.82 on unseen data with Ridge regression on polynomial features.

Python · Scikit-learn · Ridge Regression · Pandas · Matplotlib
Best R² Score
0.821
Records
2,771

📌 Business Problem

An insurance provider needed a data-driven approach to understand which policyholder characteristics drive premium costs, enabling more accurate pricing and risk segmentation instead of relying on broad actuarial tables.

📂 Data

  • Source: Insurance claims dataset — 2,771 records × 7 features
  • Features: Age, gender, BMI, number of children, smoker status, region, charges
  • Target: Insurance charges (continuous)

🔬 Methodology

Data Wrangling (replace invalid markers, impute missing values, type casting) → Correlation Analysis (identified smoker status as the strongest predictor, r=0.789 with charges) → Progressive model comparison: Simple LR → Multi-feature LR → Polynomial Pipeline → Ridge Regression → Ridge + Polynomial (degree=2) with an 80/20 train-test split.
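A minimal sketch of the final Ridge + polynomial step. The charges are synthetic, built with a smoker × BMI interaction to mimic the non-linear effects described, and alpha=1.0 is an illustrative choice rather than the tuned value:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the 2,771-record claims dataset
rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 65, n)
bmi = rng.normal(28, 5, n)
smoker = rng.integers(0, 2, n)
# Charges with a smoker x BMI interaction term plus noise
charges = (250 * age + 400 * bmi + 20000 * smoker
           + 600 * smoker * (bmi - 25) + rng.normal(0, 2000, n))

X = np.column_stack([age, bmi, smoker])
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.2, random_state=0)

# Ridge on degree-2 polynomial features, as in the final model of the comparison
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(f"Test R2: {model.score(X_test, y_test):.3f}")
```

The degree-2 expansion is what lets a linear model pick up the smoker × BMI interaction that a plain multi-feature regression misses.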

💡 Key Insights

  • Smoking status is the dominant cost driver (r=0.789), explaining 62% of variance alone
  • Polynomial features boosted in-sample explanatory power from 75% to 85%
  • Ridge regularisation delivered a robust R²=0.821 on unseen data
  • BMI and age contribute additive, non-linear effects on premium costs

📈 Business Impact

  • Enabled risk-adjusted premium pricing with 82% prediction accuracy
  • Smoking-based segmentation alone could improve pricing precision by ~20%
  • Model supports actuarial teams in identifying high-cost customer profiles before policy issuance
  • Reduced underwriting processing time through automated risk scoring

Analytics & BI

KPI Dashboard & ELT Pipeline

Engineered an end-to-end ELT pipeline that ingests 51K+ sales records from JSON, loads them into SQLite, computes KPIs via SQL, and renders a real-time Flask web dashboard — fully automated.

Python · SQL · Flask · SQLite · Pandas · HTML/CSS
Total Sales
$12.6M
Profit Margin
11.61%

📌 Business Problem

The sales team relied on manual spreadsheet analysis to track KPIs, leading to delayed reporting, inconsistent metrics, and no single source of truth. Leadership needed a self-refreshing dashboard for real-time performance visibility.

📂 Data

  • Source: StoreSales.json — 51,291 sales records
  • Features: 24 columns covering order details, customer data, location, product info, and financial metrics
  • Database: SQLite with single normalised sales table (13.3 MB)

🔬 Methodology

Extract: Ingest JSON via Pandas → Load: Write to SQLite with schema auto-creation → Transform: SQL aggregate queries to compute 5 KPIs (Total Sales, Total Profit, Quantity Sold, Avg. Discount, Profit Margin) plus category breakdowns and Top-10 products → Serve: Flask web app with HTML dashboard.
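The Extract → Load → Transform steps can be sketched with an in-memory SQLite database; the four-row DataFrame is a made-up stand-in for StoreSales.json (column names are illustrative), and the Flask serving layer is omitted:

```python
import sqlite3
import pandas as pd

# Tiny stand-in for StoreSales.json (the real pipeline ingests 51,291 records)
sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "category": ["Technology", "Office Supplies", "Technology", "Furniture"],
    "quantity": [2, 10, 1, 3],
    "sales": [2000.0, 150.0, 1200.0, 600.0],
    "profit": [400.0, 30.0, 180.0, -50.0],
    "discount": [0.1, 0.2, 0.0, 0.3],
})

# Load: write the DataFrame into SQLite; the table schema is created automatically
conn = sqlite3.connect(":memory:")
sales.to_sql("sales", conn, index=False, if_exists="replace")

# Transform: compute the five dashboard KPIs with a single aggregate query
kpis = pd.read_sql_query(
    """
    SELECT SUM(sales)                       AS total_sales,
           SUM(profit)                      AS total_profit,
           SUM(quantity)                    AS quantity_sold,
           AVG(discount)                    AS avg_discount,
           100.0 * SUM(profit) / SUM(sales) AS profit_margin_pct
    FROM sales
    """,
    conn,
)
print(kpis.round(2))
```

Because the KPIs live in SQL rather than in spreadsheet formulas, every refresh of the source file reproduces the same metrics — which is the "single source of truth" property the dashboard is after.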

💡 Key Insights

  • $12.6M total sales with $1.47M profit across 178K units
  • Average discount of 14.29% across all transactions
  • Technology category leads revenue; Office Supplies leads volume
  • Identified Top-10 products contributing disproportionately to profit

📈 Business Impact

  • Dashboard replaced ~8 hours/week of manual Excel reporting
  • Provided leadership with real-time KPI visibility for quarterly planning
  • Automated ELT pipeline ensures consistent, reproducible metrics
  • Scalable architecture supports additional data sources and KPIs

Machine Learning

Air Quality Index Prediction

Conducted geospatial EDA on 17.7K NYC air quality records and built a Linear Regression model to forecast PM2.5 trends — revealing a 13-year declining pollution trajectory and neighbourhood-level disparities.

Python · Scikit-learn · Pandas · Matplotlib · Seaborn
Records
17,743
Time Span
2009–2022

📌 Business Problem

Public health agencies needed to quantify pollution trends across NYC neighbourhoods to allocate monitoring resources, prioritise interventions, and forecast future air quality under current regulatory policies.

📂 Data

  • Source: NYC Dept. of Health — 17,743 records × 12 features
  • Indicators: PM2.5, NO₂, O₃, boiler emissions, vehicle miles, health impacts
  • Coverage: 42 UHF neighbourhoods across NYC, 2009–2022

🔬 Methodology

Date parsing → IQR-based outlier removal (Q1=8.9, Q3=26.9) → Missing value handling → Filter PM2.5 subset → Feature engineering (extract year) → 70/30 train-test split → Linear Regression → Temporal trend analysis + geographic heatmaps.
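The same steps — IQR filtering, a 70/30 split, and a linear trend fit — look roughly like this on synthetic PM2.5 readings (the Q1/Q3 values quoted above come from the real data; here they are recomputed from the toy sample):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic PM2.5 readings with a gentle downward trend, standing in for the NYC data
rng = np.random.default_rng(7)
years = rng.integers(2009, 2023, 800)
pm25 = 30 - 0.8 * (years - 2009) + rng.normal(0, 2, 800)
df = pd.DataFrame({"year": years, "pm25": pm25})

# IQR-based outlier removal (1.5 * IQR fences)
q1, q3 = df["pm25"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["pm25"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 70/30 split, then fit a linear trend of PM2.5 on year
X_train, X_test, y_train, y_test = train_test_split(
    df[["year"]], df["pm25"], test_size=0.3, random_state=7
)
model = LinearRegression().fit(X_train, y_train)
print(f"Trend: {model.coef_[0]:.2f} units of PM2.5 per year")
```

A negative coefficient on `year` is the numeric counterpart of the declining-trend chart; the fitted line can then be extrapolated for the forecasting step.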

💡 Key Insights

  • PM2.5 levels show a consistent declining trend from 2009 to 2022
  • Bronx & Harlem neighbourhoods (High Bridge, Hunts Point) have the highest mean AQI values — 39.3 vs. the city average of 21.6
  • Seasonal patterns: Ozone spikes significantly in summer months
  • Data completeness is excellent โ€” 99.95% non-null records

📈 Business Impact

  • Identified priority neighbourhoods for additional monitoring station deployment
  • Forecasting model supports regulatory compliance planning and EPA reporting
  • Geographic disparity analysis informs equitable public health resource allocation
  • Trend projections help estimate long-term healthcare cost savings from declining pollution

Machine Learning

Real Estate Price Prediction

Trained an XGBoost regressor on 1.14M+ Connecticut real estate transactions (2001–2023) to predict sale prices, identifying location and assessed value as the dominant pricing signals.

Python · XGBoost · Scikit-learn · Pandas · Matplotlib
R² Score
0.616
Records
1.14M

📌 Business Problem

Real estate stakeholders — agents, investors, and assessors — needed a data-driven pricing model to validate property valuations, identify underpriced opportunities, and reduce reliance on subjective comparable market analysis.

📂 Data

  • Source: Connecticut Real Estate Sales (data.gov) — 1,141,722 records
  • Features: Town, residential type, assessed value, sales ratio, date listed
  • After cleaning: 738,541 valid records (80/20 split → 590K train / 148K test)

🔬 Methodology

Date parsing → Drop rows with missing critical values → Feature engineering (Years_Since_List) → OrdinalEncoder for categorical features → XGBoost Regressor (hist method, max_depth=3, 200 estimators, lr=0.1, subsample=0.8) → Evaluation with R², MAE, RMSE.

💡 Key Insights

  • Town (37.6%) and Assessed Value (33.3%) drive 71% of prediction power — location is king
  • Sales Ratio (16.7%) captures market-to-assessment discrepancies
  • MAE of $35,357 — usable for initial screening and bulk valuation
  • Model performs strongest on mid-range properties; high-end outliers increase RMSE

📈 Business Impact

  • Automated bulk property valuation for 700K+ transactions
  • Supports investors in identifying undervalued properties where sale price < model prediction
  • Reduces time-to-valuation from hours to seconds per property
  • Assessed-vs-predicted gap analysis informs tax assessment appeals

Analytics & BI

Exploratory Data Analysis Portfolio

A two-part EDA investigation — urban air quality patterns across NYC neighbourhoods and property market valuation gaps between assessed and actual sale prices — demonstrating rigorous data cleaning and visual storytelling.

Python · Pandas · NumPy · Matplotlib · Seaborn
Datasets
2
Combined Records
11K+

📌 Business Problem

Before any predictive model can succeed, the underlying data must be understood. These EDAs answer foundational questions: Which neighbourhoods face the worst pollution? How well do government property assessments match actual market prices?

📂 Data

  • Air Quality: ~1,000 environmental monitoring records — pollutant indicators, geographic locations, time periods
  • Property Sales: ~10,000 real estate transactions — assessed values, sale amounts, sales ratios, property types

🔬 Methodology

Data Ingestion (CSV/Excel via Pandas) → Null handling & type conversion → Univariate analysis (histograms, KDE) → Bivariate analysis (scatter, box plots) → Correlation heatmaps → Statistical summarisation → Publication-ready Matplotlib/Seaborn charts.
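A condensed version of that workflow on a toy property-sales table — fabricated numbers, with the plotting steps replaced by their numeric equivalents (describe(), corr(), and an IQR outlier check):

```python
import numpy as np
import pandas as pd

# Toy property-sales table standing in for the ~10,000-record dataset
rng = np.random.default_rng(3)
n = 300
assessed = rng.uniform(100_000, 500_000, n)
sale = assessed * rng.uniform(1.0, 1.6, n)  # sales systematically above assessment
df = pd.DataFrame({
    "assessed_value": assessed,
    "sale_amount": sale,
    "sales_ratio": assessed / sale,
})

# Statistical summarisation and correlation matrix (numeric precursor to the heatmap)
summary = df.describe()
corr = df.corr()
print(corr.round(2))

# IQR-based outlier flagging: the same idea a box plot shows visually
q1, q3 = df["sale_amount"].quantile([0.25, 0.75])
outliers = df[df["sale_amount"] > q3 + 1.5 * (q3 - q1)]
print(f"Extreme sales flagged: {len(outliers)}")
```

In the real EDA the corr() and quantile outputs feed Seaborn heatmaps and box plots; the numbers are the substance, the charts are the storytelling layer.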

💡 Key Insights

  • Certain NYC neighbourhoods have 2× the pollution levels of city-wide averages
  • Property assessments systematically undervalue properties in high-demand towns
  • Box plots revealed extreme outlier sales that summary statistics missed entirely
  • Seasonal air quality variation follows predictable, actionable patterns

📈 Business Impact

  • EDA findings directly informed feature selection for subsequent ML models
  • Property analysis revealed systemic valuation bias — actionable for government assessors
  • Demonstrated a reproducible EDA framework applicable to any new dataset

Deep Learning

Deepfake Image Detector

Built a CNN-based binary classifier to detect AI-generated facial images in real time, deployed via a Streamlit web app with confidence scoring — supported by a published research paper.

TensorFlow · Keras · Streamlit · NumPy · Scikit-learn · PIL
Architecture
3-Layer CNN
Inference
<1 sec

📌 Business Problem

With the proliferation of AI-generated imagery, media organisations and security teams need automated tools to flag potentially manipulated facial images before they're published or used for fraud.

📂 Data

  • Source: Curated dataset of real and AI-generated (deepfake) facial images
  • Preprocessing: Resized to 128×128, normalised pixel values (0–1), 80/20 train-test split
  • Model size: ~38 MB

🔬 Methodology

Image preprocessing (resize, normalise, RGBA→RGB) → Custom 3-layer CNN (32→64→128 filters, MaxPooling, Dropout=0.5) → Binary classification (sigmoid output) → Trained with Adam optimiser (lr=0.001) + Binary Crossentropy → Evaluated with confusion matrix & classification report → Deployed via Streamlit with real-time upload & prediction.
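Assuming "3-layer CNN" means three Conv2D/MaxPooling stages, the architecture might be sketched in Keras like this; the filter counts, dropout rate, optimiser, and loss follow the description above, while the kernel size and layer ordering are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the described 3-layer CNN for 128x128 RGB inputs
model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                       # regularisation against overfitting
    layers.Dense(1, activation="sigmoid"),     # binary real-vs-fake probability
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])

# One normalised dummy batch standing in for preprocessed face images
batch = np.random.rand(4, 128, 128, 3).astype("float32")
probs = model.predict(batch, verbose=0)
print(probs.shape)
```

The single sigmoid output is what powers the confidence scoring in the Streamlit app: the raw probability is shown rather than a hard real/fake label.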

💡 Key Insights

  • Sub-second inference enables real-time screening of uploaded images
  • Confidence scoring provides probability-based risk assessment rather than binary yes/no
  • Dropout regularisation (0.5) prevents overfitting on limited training data
  • Research paper documents methodology for academic reproducibility

📈 Business Impact

  • Deployable tool for media verification workflows and content moderation
  • Streamlit interface enables non-technical users to perform deepfake audits independently
  • Research contribution advances the deepfake detection knowledge base
  • Architecture is extensible to video detection and ensemble methods

Other Projects

Full-stack engineering and AI tooling that complement my analytics work.

🎫 Seat Booking System

Full-stack React + Spring Boot application with MySQL persistence, dynamic pricing tiers, smart validation rules, and a glassmorphism UI with light/dark theme.

React · Spring Boot · MySQL · REST API · CSS

🤖 AI Code Reviewer

MERN-stack AI-powered code review tool using Deepseek R1 via Groq API for ultra-fast inference, generating intelligent code suggestions and content.

React · Node.js · Express · MongoDB · Groq API

🔌 LLM REST API

Flask-based API server exposing the Maincoder-1B code generation model with configurable parameters, GPU-accelerated inference on Colab, and public URL via ngrok.

Flask · Transformers · ngrok · Google Colab · REST API

Data-Driven Problem Solver

I approach every dataset with one question: what decision does this need to inform? My work centres on translating raw numbers into business narratives that stakeholders can act on — whether that means identifying a $1.4M churn-risk segment, optimising insurance pricing models, or building automated KPI dashboards that replace hours of manual reporting.

I combine statistical rigour with strong stakeholder communication — writing SQL that answers real business questions, building predictive models that quantify risk, and designing visualisations that make complex findings immediately accessible to non-technical audiences.

My toolkit spans the full analytics pipeline: from data wrangling and feature engineering in Python, through machine-learning modelling with XGBoost and Scikit-learn, to interactive dashboarding in Power BI and Flask.

🎯

Structured Problem-Solving

Every analysis follows a clear framework — business question → data audit → methodology → insight → recommendation.

📊

Business Translation

Converting model outputs into dollar-value impact metrics that resonate with leadership and inform strategy.

⚙️

Automation & Pipelines

Building ELT pipelines, automated dashboards, and reproducible analysis workflows that scale beyond one-off reports.

Certifications & Achievements

2025 — Data Analysis with Python — IBM (Model Development & Evaluation)
2024 — Supervised Machine Learning: Regression & Classification — DeepLearning.AI (Stanford)
2024 — Python Data Mastery: Fundamentals to Machine Learning — LPU
2023 — Metvy Data Analytics Program (Live Cohort)
🏆 Finalist (Top 5) — College Hackathon for AI-based mental health assistant
📄 Research Paper — CNN-based Deepfake Image Detection, covering preprocessing pipelines and model evaluation

Let's Connect

Whether you have a data challenge, a collaboration idea, or just want to say hello — I'd love to hear from you.