Enterprise ML & Data Platform — 100 Jobs, 631 Tables, NLP at Scale
Overview
Client: Global manufacturing company (B2B and B2C, 35+ markets)
Timeline: 2023–2026
Scope: Full-stack data and ML platform — pipelines, models, NLP, AI agents, analytics dashboards
Delivery: Solo technical lead — architecture, implementation, and operations
The client had data spread across more than 20 systems with no unified analytics layer and no ML capability in production. I built both from the ground up.
Scale
| What | Numbers |
|---|---|
| Databricks jobs (owned and operated) | 100 |
| Notebooks in workspace | 201 |
| Unity Catalog schemas | 20 |
| Curated tables | 631 |
| MLflow experiments | 13 |
| Product records (STEP masterdata) | 1,456,280 |
| Product attribute rows | 71,016,677 |
| Support cases analyzed | 1,460,000+ |
| B2B customer records (churn model) | 7,399 |
| B2C customers (churn model) | 968,000 |
| Dealer portal sessions | 5,000,000+ |
| B2B orders | 1,300,000+ |
Architecture
Medallion Lakehouse — Azure Databricks
All data follows Bronze → Silver → Gold with daily incremental updates. The 20 schemas are organized by business domain: marketingmanagement, salesmanagement, supportmanagement, customermanagement, masterdatamanagement, and more.
Sources ingested (20+): BigQuery, GA4, Google Ads, Meta Ads, DV360, Google Search Console, Microsoft Clarity, Dynamics 365, ServiceNow, Salesforce, CosmosDB, SFCC, TestFreaks, UserTesting, and custom internal systems.
Governance through Unity Catalog: column-level lineage, access controls, and documentation across 270+ tagged tables.
Pipeline Operations
100 scheduled Databricks jobs across five domains — all owned, maintained, and monitored solo:
- Marketing ingest: BigQuery, paid media (Google Ads, Meta Ads, DV360), organic (GA4, GSC)
- Support management: D365 and ServiceNow case ingestion, NLP enrichment, AI analysis
- ML/AI models: churn scoring, demand forecasting, cross-sell, RFM clustering
- Dealer Portal: 5M+ session analytics, order intelligence, anomaly detection
- Masterdata: STEP product catalog (1.46M products, 71M attribute rows, daily source check)
Support Intelligence Platform
The most complex component. Built a complete NLP and AI layer on top of 1.46M+ B2B support cases from Dynamics 365 and ServiceNow.
Topic Modeling — BERTopic
Ran BERTopic across the full support case corpus and identified 447 topic clusters. The outlier fraction was 28.3%, so about 71.7% of cases mapped to a named, actionable topic.
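As a quick check on the coverage figure, the share of cases assigned to a topic is simply the complement of the outlier fraction. The snippet below is only that arithmetic. The corpus size is the rounded 1.46M figure, so the absolute count is approximate.

```python
# Coverage implied by the BERTopic outlier fraction reported above.
outlier_fraction = 0.283     # share of cases left unassigned (topic -1 in BERTopic)
total_cases = 1_460_000      # rounded corpus size; the true count is slightly higher

coverage = 1.0 - outlier_fraction
assigned_cases = round(total_cases * coverage)

print(f"coverage: {coverage:.1%}")             # coverage: 71.7%
print(f"assigned cases: ~{assigned_cases:,}")  # assigned cases: ~1,046,820
```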
Support Routing Model
Trained a classification model to route incoming cases to the correct team automatically:
- 93.6% accuracy on the test set
- 187,000 training samples
- 129 topic categories
Deflection Knowledge Base
Built a knowledge base of 6,001 entries from resolved cases, indexed by topic cluster. New incoming cases are matched against the KB before routing — enabling self-service deflection at intake.
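The match-before-route step can be sketched roughly as follows. This is an illustration only: the matching method used in the platform is not described here, so plain token overlap (Jaccard similarity) stands in for it. The `KBEntry` fields, the `0.3` threshold, and the sample entries are all invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KBEntry:
    topic_id: int      # topic cluster the entry is indexed under (hypothetical field)
    title: str
    resolution: str


def tokens(text: str) -> set[str]:
    """Lowercase word set; a stand-in for whatever text representation is really used."""
    cleaned = (w.strip(".,!?:;()") for w in text.lower().split())
    return {w for w in cleaned if w}


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_case(case_text: str, kb: list[KBEntry], threshold: float = 0.3) -> Optional[KBEntry]:
    """Return the best KB entry if it clears the assumed threshold, else None."""
    case_tokens = tokens(case_text)
    best, best_score = None, 0.0
    for entry in kb:
        score = jaccard(case_tokens, tokens(entry.title))
        if score > best_score:
            best, best_score = entry, score
    return best if best_score >= threshold else None


# Invented KB entries and cases, for illustration only.
kb = [
    KBEntry(12, "battery does not charge", "Check charger contacts and firmware."),
    KBEntry(47, "invoice copy request", "Invoices are available in the portal."),
]

hit = match_case("Customer says the battery does not charge anymore", kb)
miss = match_case("Question about warranty terms in Norway", kb)
print("deflect" if hit else "route")    # deflect
print("deflect" if miss else "route")   # route
```

A case that clears the threshold is offered the stored resolution. Anything else falls through to the routing model.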
AI Agents on Databricks
Three AI agents run on scheduled Databricks jobs:
| Agent | Schedule | Purpose |
|---|---|---|
| AI Analyst | Daily | Surfaces anomalies, trend shifts, and KPI commentary |
| Content Editor | Weekly | Drafts product and support documentation from case data |
| Product Owner | Weekly | Generates backlog items from support pattern analysis |
ML Models in Production
All models tracked and versioned in MLflow. Weekly or daily refresh depending on model type.
Churn Models
B2B churn — trained on 7,399 enterprise customers:
- val_auc_roc = 95.1%
- val_auc_pr = 50.4%
- Churn rate baseline: 4.5%
- Runs weekly, output used in sales prioritization
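To put the AUC-PR figure in context: a classifier with no skill scores an AUC-PR close to the positive-class rate, so the 4.5% churn rate is the natural baseline. The arithmetic below uses only the numbers listed above. The churner count is an estimate derived from rounded inputs, not a reported figure.

```python
churn_rate = 0.045      # baseline churn rate (positive-class prevalence)
val_auc_pr = 0.504      # reported validation AUC-PR
n_customers = 7_399     # B2B customers in the training data

# A no-skill ranking scores an AUC-PR near the prevalence,
# so this ratio is the lift over random ordering.
lift = val_auc_pr / churn_rate
approx_churners = round(n_customers * churn_rate)

print(f"AUC-PR lift over baseline: {lift:.1f}x")        # AUC-PR lift over baseline: 11.2x
print(f"approximate churners in data: {approx_churners}")  # approximate churners in data: 333
```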
B2C churn — trained on 968,000 consumer records:
- val_auc = 1.0 (near-perfect separation on held-out set)
- n_train = 775k, n_val = 194k
- Runs on full refresh
Demand Forecasting
Prophet model with GDD (growing degree day) regressors. Weather-adjusted demand signals for seasonal product categories. Feeds into inventory planning.
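For readers unfamiliar with GDD, the standard daily formula is the mean of the minimum and maximum temperature minus a base temperature, floored at zero. A minimal sketch follows. The 10 °C base and the temperature readings are assumptions for illustration, not values from the project.

```python
def growing_degree_days(t_min: float, t_max: float, t_base: float = 10.0) -> float:
    """Daily GDD from min/max temperature in °C, floored at zero."""
    return max(0.0, (t_min + t_max) / 2.0 - t_base)


# Invented daily (t_min, t_max) readings for one market, in °C.
daily_temps = [(4.0, 12.0), (8.0, 18.0), (11.0, 23.0)]

gdd = [growing_degree_days(lo, hi) for lo, hi in daily_temps]

# Cumulative GDD is often the more useful seasonal signal.
cumulative = []
running = 0.0
for value in gdd:
    running += value
    cumulative.append(running)

print(gdd)         # [0.0, 3.0, 7.0]
print(cumulative)  # [0.0, 3.0, 10.0]
```

In Prophet, a column like this would typically be registered with `add_regressor` before fitting, and the same column has to be present in the future dataframe at prediction time.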
Additional Models
- Cross-sell recommender: trained on purchase history, surfaced in dealer portal
- RFM clustering: recency/frequency/monetary segmentation refreshed weekly
- Weather × sales correlation: quantified seasonal demand sensitivity by market
- Portal search intelligence: intent signals from search behavior in the B2B dealer portal
- Multi-touch attribution: 5 attribution models across 365-day windows (linear, time decay, position-based, Shapley, data-driven)
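The three rule-based attribution models in the last item can be sketched in a few lines. This is a simplified illustration, not the production code: the 7-day half-life and the 40/20/40 position split are common defaults assumed here, and the journey is invented. Shapley and data-driven attribution need conversion-path data and are left out.

```python
def linear(n: int) -> list[float]:
    """Equal credit to every touchpoint."""
    return [1.0 / n] * n


def time_decay(days_before_conversion: list[float], half_life: float = 7.0) -> list[float]:
    """Credit halves for every `half_life` days between touch and conversion."""
    raw = [0.5 ** (d / half_life) for d in days_before_conversion]
    total = sum(raw)
    return [r / total for r in raw]


def position_based(n: int, first: float = 0.4, last: float = 0.4) -> list[float]:
    """First and last touch get fixed shares; middle touches split the remainder."""
    if n == 1:
        return [1.0]
    if n == 2:
        total = first + last
        return [first / total, last / total]
    middle = (1.0 - first - last) / (n - 2)
    return [first] + [middle] * (n - 2) + [last]


# Invented journey: (channel, days before conversion).
journey = [("paid_search", 14.0), ("email", 7.0), ("organic", 0.0)]
days = [d for _, d in journey]

results = {
    "linear": linear(len(journey)),
    "time_decay": time_decay(days),
    "position_based": position_based(len(journey)),
}
for name, weights in results.items():
    print(name, [round(w, 3) for w in weights])
```

Each function returns weights that sum to 1, so multiplying them by the order value gives per-channel attributed revenue.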
Analytics — 17-Page Streamlit Dashboard
Built directly against Databricks SQL Warehouse. No intermediate BI tool, no export layer — queries run live.
B2C dashboards: Paid media ROI, revenue analysis (14 European markets), LTV and cohort analysis, out-of-stock revenue impact, multi-touch attribution, ML intelligence, live anomaly notification centre.
B2B dashboards: Dealer performance, distributor pipeline, market scorecard, B2B sales intelligence, NPS and feedback analysis, product matrix, support intelligence, ML churn scoring.
Dealer Portal: 5M+ sessions and 1.3M B2B orders with daily anomaly detection (z-score, with thresholds at |z| > 2 and |z| > 3).
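A minimal sketch of z-score anomaly detection, reading the thresholds as two tiers at |z| > 2 and |z| > 3. The "warning" and "critical" labels, the trailing-window approach, and the sample counts are assumptions for illustration.

```python
from statistics import mean, stdev


def classify_anomaly(history: list[float], today: float) -> str:
    """Label today's value by its z-score against the trailing history."""
    mu = mean(history)
    sigma = stdev(history)       # sample standard deviation
    if sigma == 0:
        return "normal"          # flat history: z-score undefined, do not alert
    z = (today - mu) / sigma
    if abs(z) > 3:
        return "critical"
    if abs(z) > 2:
        return "warning"
    return "normal"


# Invented daily order counts for illustration.
history = [100, 104, 98, 102, 96, 101, 99]
print(classify_anomaly(history, 101))   # normal
print(classify_anomaly(history, 106))   # warning
print(classify_anomaly(history, 120))   # critical
```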
Tooling Built Alongside the Platform
- Dealer visit form — n8n workflow + GitHub Pages app with authentication, feeding structured visit data into the lakehouse
- Product trade-in form — multi-step web form integrated with backend pipeline
- B2B mobile prototype — lightweight mobile interface for field sales
- Price data export — automated pricing data extraction across markets
Stack
Python · PySpark · SQL · Azure Databricks · Delta Lake · Unity Catalog · MLflow · Azure Data Factory · BERTopic · Prophet · XGBoost · Scikit-learn · GPT-4o · Streamlit · Plotly · BigQuery · GA4 · Dynamics 365 · ServiceNow · Salesforce · GitHub Actions
Want to discuss a similar engagement? Get in touch.