
Enterprise ML & Data Platform — 100 Jobs, 631 Tables, NLP at Scale

Overview

Client: Global manufacturing company (B2B and B2C, 35+ markets)
Timeline: 2023–2026
Scope: Full-stack data and ML platform — pipelines, models, NLP, AI agents, analytics dashboards
Delivery: Solo technical lead — architecture, implementation, and operations

The client had data spread across more than 20 systems with no unified analytics layer and no ML capability in production. I built both from the ground up.


Scale

| What | Numbers |
|---|---|
| Databricks jobs (owned and operated) | 100 |
| Notebooks in workspace | 201 |
| Unity Catalog schemas | 20 |
| Curated tables | 631 |
| MLflow experiments | 13 |
| Product records (STEP masterdata) | 1,456,280 |
| Product attribute rows | 71,016,677 |
| Support cases analyzed | 1,460,000+ |
| B2B customer records (churn model) | 7,399 |
| B2C customers (churn model) | 968,000 |
| Dealer portal sessions | 5,000,000+ |
| B2B orders | 1,300,000+ |

Architecture

Medallion Lakehouse — Azure Databricks

All data follows Bronze → Silver → Gold with daily incremental updates. 20 schemas organized by business domain: marketingmanagement, salesmanagement, supportmanagement, customermanagement, masterdatamanagement, and more.
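
The daily incremental flow can be sketched in plain Python (rather than PySpark, so it runs anywhere). This is a minimal illustration, not the production code: the record shape, the `updated_at` watermark field, and the function names `to_silver`, `upsert`, and `to_gold` are all assumptions. On Databricks the upsert step would typically be a Delta Lake MERGE.

```python
from datetime import date

def to_silver(bronze_rows, watermark):
    """Keep only rows newer than the last processed date and normalise them."""
    fresh = [r for r in bronze_rows if r["updated_at"] > watermark]
    return [
        {
            "order_id": r["order_id"],
            "market": r["market"].strip().upper(),
            "amount": float(r["amount"]),
            "updated_at": r["updated_at"],
        }
        for r in fresh
    ]

def upsert(silver_table, new_rows, key="order_id"):
    """Merge new rows into the table by key; later rows in the batch win."""
    merged = dict(silver_table)
    for row in new_rows:
        merged[row[key]] = row
    return merged

def to_gold(silver_table):
    """Aggregate Silver rows into revenue per market."""
    totals = {}
    for row in silver_table.values():
        totals[row["market"]] = totals.get(row["market"], 0.0) + row["amount"]
    return totals

# Hypothetical raw rows: untrimmed strings, amounts as text, one updated order.
bronze = [
    {"order_id": 1, "market": " se", "amount": "100.0", "updated_at": date(2024, 1, 1)},
    {"order_id": 2, "market": "de ", "amount": "50.5", "updated_at": date(2024, 1, 2)},
    {"order_id": 1, "market": "se", "amount": "120.0", "updated_at": date(2024, 1, 3)},
]

silver = upsert({}, to_silver(bronze, watermark=date(2023, 12, 31)))
gold = to_gold(silver)
print(gold)
```

Only rows past the watermark are reprocessed, and the keyed merge keeps a rerun of the same day from duplicating rows.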

Sources ingested (20+): BigQuery, GA4, Google Ads, Meta Ads, DV360, Google Search Console, Microsoft Clarity, Dynamics 365, ServiceNow, Salesforce, CosmosDB, SFCC, TestFreaks, UserTesting, and custom internal systems.

Governance through Unity Catalog: column-level lineage, access controls, and documentation across 270+ tagged tables.

Pipeline Operations

100 scheduled Databricks jobs across five domains — all owned, maintained, and monitored solo:

  • Marketing ingest: BigQuery, paid media (Google Ads, Meta Ads, DV360), organic (GA4, GSC)
  • Support management: D365 and ServiceNow case ingestion, NLP enrichment, AI analysis
  • ML/AI models: churn scoring, demand forecasting, cross-sell, RFM clustering
  • Dealer Portal: 5M+ session analytics, order intelligence, anomaly detection
  • Masterdata: STEP product catalog (1.46M products, 71M attribute rows, daily source check)

Support Intelligence Platform

The most complex component. Built a complete NLP and AI layer on top of 1.46M+ B2B support cases from Dynamics 365 and ServiceNow.

Topic Modeling — BERTopic

Ran BERTopic across the full support case corpus. It identified 447 topic clusters with an outlier fraction of 28.3%, meaning 71.7% of cases (roughly 72%) mapped to a named, actionable topic.
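
BERTopic labels documents it cannot place in any cluster with the topic id -1, so the outlier fraction and the coverage figure both follow from the list of topic assignments. A minimal sketch, assuming a toy list of topic ids in place of the real `topics` output of `fit_transform`; the helper name `topic_coverage` is hypothetical:

```python
from collections import Counter

def topic_coverage(topic_ids, outlier_id=-1):
    """Summarise cluster count, outlier fraction, and coverage for topic assignments."""
    counts = Counter(topic_ids)
    total = len(topic_ids)
    outliers = counts.get(outlier_id, 0)
    n_clusters = sum(1 for t in counts if t != outlier_id)
    outlier_fraction = outliers / total
    return {
        "n_clusters": n_clusters,
        "outlier_fraction": outlier_fraction,
        "coverage": 1.0 - outlier_fraction,
    }

# Toy assignments; -1 marks an outlier, as in BERTopic's output.
topics = [0, 0, 1, -1, 2, 1, -1, 0, 2, 2]
stats = topic_coverage(topics)
print(stats)
```

Applied to the figures above, an outlier fraction of 0.283 gives a coverage of 1 - 0.283 = 0.717.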

Support Routing Model

Trained a classification model to route incoming cases to the correct team automatically:

  • 93.6% accuracy on the test set
  • 187,000 training samples
  • 129 topic categories

Deflection Knowledge Base

Built a knowledge base of 6,001 entries from resolved cases, indexed by topic cluster. New incoming cases are matched against the KB before routing — enabling self-service deflection at intake.
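
The match-before-route step can be illustrated with a deliberately simple stand-in. The sketch below uses token-overlap (Jaccard) similarity with a fixed threshold; this is an assumed technique chosen for brevity, not the matching method used in the platform, and it does not model the topic-cluster index. The KB entries, field names, and the 0.3 threshold are hypothetical.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_kb(case_text, kb_entries, threshold=0.3):
    """Return (entry, score) for the best KB match, or (None, score) if below threshold."""
    case_tokens = tokens(case_text)
    best, best_score = None, 0.0
    for entry in kb_entries:
        score = jaccard(case_tokens, tokens(entry["question"]))
        if score > best_score:
            best, best_score = entry, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score

# Hypothetical KB entries.
kb = [
    {"id": "kb-1", "topic": 12, "question": "How do I reset my dealer portal password"},
    {"id": "kb-2", "topic": 40, "question": "Where can I download the spare parts price list"},
]

hit, score = match_kb("reset password for dealer portal", kb)
miss, _ = match_kb("machine delivered with damaged housing", kb)
print(hit["id"], round(score, 3), miss)
```

A case with a match above the threshold can be answered from the KB at intake. A case with no match falls through to the routing model.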

AI Agents on Databricks

Three AI agents run on scheduled Databricks jobs:

| Agent | Schedule | Purpose |
|---|---|---|
| AI Analyst | Daily | Surfaces anomalies, trend shifts, and KPI commentary |
| Content Editor | Weekly | Drafts product and support documentation from case data |
| Product Owner | Weekly | Generates backlog items from support pattern analysis |

ML Models in Production

All models tracked and versioned in MLflow. Weekly or daily refresh depending on model type.

Churn Models

B2B churn — trained on 7,399 enterprise customers:

  • val_auc_roc = 0.951
  • val_auc_pr = 0.504
  • Churn rate baseline: 4.5%
  • Runs weekly, output used in sales prioritization
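
The two B2B metrics are easier to read side by side with the churn rate. For a random ranker, the expected area under the precision-recall curve is approximately the positive rate (4.5% here), so a PR score of about 0.50 is roughly an 11x lift, even though it looks modest next to the ROC figure. The sketch below computes both metrics on toy data. Average precision is used as one common estimator of AUC-PR, which is an assumption, since the estimator behind `val_auc_pr` is not stated.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values measured at each true positive, ranked by score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits

# Toy labels and scores (2 churners in 10 customers), not project data.
y_true = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.10, 0.30, 0.90, 0.20, 0.70, 0.05, 0.60, 0.15, 0.25, 0.40]

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
prevalence = sum(y_true) / len(y_true)   # random-ranker baseline for AUC-PR
lift_reported = 0.504 / 0.045            # lift implied by the figures above
print(auc, ap, prevalence, lift_reported)
```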

B2C churn — trained on 968,000 consumer records:

  • val_auc ≈ 1.0 (near-perfect separation on the held-out set)
  • n_train = 775k, n_val = 194k
  • Runs on full refresh

Demand Forecasting

Prophet model with GDD (growing degree day) regressors. Weather-adjusted demand signals for seasonal product categories. Feeds into inventory planning.
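
Growing degree days have a standard definition: the daily mean temperature minus a base temperature, floored at zero. A minimal sketch of the feature computation follows. The 10 °C base, the cumulative variant, and the function names are assumptions, since the write-up does not say which base or form was used.

```python
def growing_degree_days(t_min, t_max, t_base=10.0):
    """Daily GDD: mean of min and max temperature above the base, floored at zero."""
    return max(0.0, (t_min + t_max) / 2.0 - t_base)

def cumulative_gdd(daily_temps, t_base=10.0):
    """Running season total of daily GDD; one possible regressor form."""
    total, series = 0.0, []
    for t_min, t_max in daily_temps:
        total += growing_degree_days(t_min, t_max, t_base)
        series.append(total)
    return series

# Hypothetical (t_min, t_max) pairs in degrees Celsius for four consecutive days.
temps = [(2.0, 8.0), (6.0, 16.0), (10.0, 22.0), (12.0, 24.0)]
daily = [growing_degree_days(lo, hi) for lo, hi in temps]
season = cumulative_gdd(temps)
print(daily, season)
```

In Prophet, a column like this is registered with `add_regressor` before fitting. The same column must also be present in the future dataframe, so forecasts depend on a weather forecast or a climatological fill for the horizon.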

Additional Models

  • Cross-sell recommender: trained on purchase history, surfaced in dealer portal
  • RFM clustering: recency/frequency/monetary segmentation refreshed weekly
  • Weather × sales correlation: quantified seasonal demand sensitivity by market
  • Portal search intelligence: intent signals from search behavior in the B2B dealer portal
  • Multi-touch attribution: 5 attribution models across 365-day windows (linear, time decay, position-based, Shapley, data-driven)
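
Three of the five attribution models listed above are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the production logic: the 7-day half-life, the 40/20/40 position split, the channel names, and the path format are all hypothetical. Shapley and data-driven attribution are omitted because they need the full set of converting and non-converting paths.

```python
def _accumulate(touches, weights):
    """Sum per-touch weights into per-channel credit."""
    credit = {}
    for (channel, _), w in zip(touches, weights):
        credit[channel] = credit.get(channel, 0.0) + w
    return credit

def linear(touches):
    """Every touch gets an equal share of the conversion."""
    n = len(touches)
    return _accumulate(touches, [1.0 / n] * n)

def time_decay(touches, half_life_days=7.0):
    """Weight halves for every half_life_days between touch and conversion."""
    raw = [0.5 ** (days / half_life_days) for _, days in touches]
    total = sum(raw)
    return _accumulate(touches, [w / total for w in raw])

def position_based(touches, first=0.4, last=0.4):
    """First and last touches get fixed shares; the rest is split over the middle."""
    n = len(touches)
    if n == 1:
        weights = [1.0]
    elif n == 2:
        weights = [0.5, 0.5]
    else:
        middle = (1.0 - first - last) / (n - 2)
        weights = [first] + [middle] * (n - 2) + [last]
    return _accumulate(touches, weights)

# One converting path: (channel, days before conversion), oldest touch first.
path = [("paid_search", 14), ("email", 7), ("organic", 1), ("paid_search", 0)]

lin = linear(path)
td = time_decay(path)
pos = position_based(path)
print(lin, td, pos)
```

On this path, linear gives paid search half the credit and position-based gives it 80%, which is why comparing several models side by side is more informative than any one of them.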

Analytics — 17-Page Streamlit Dashboard

Built directly against Databricks SQL Warehouse. No intermediate BI tool, no export layer — queries run live.

B2C dashboards: Paid media ROI, revenue analysis (14 European markets), LTV and cohort analysis, out-of-stock revenue impact, multi-touch attribution, ML intelligence, live anomaly notification centre.

B2B dashboards: Dealer performance, distributor pipeline, market scorecard, B2B sales intelligence, NPS and feedback analysis, product matrix, support intelligence, ML churn scoring.

Dealer Portal: 5M+ sessions and 1.3M B2B orders with daily anomaly detection (z-score, with thresholds at |z| > 2 and |z| > 3).
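
A z-score check of this kind compares today's value with the mean and standard deviation of a trailing window. A minimal sketch, assuming the two thresholds act as escalating levels: the labels "warning" and "critical", the window contents, and the function name are hypothetical.

```python
from statistics import mean, stdev

def zscore_flag(history, today, warn=2.0, critical=3.0):
    """Return (z, level) for today's value against a trailing window."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # A flat history has no spread to measure against.
        return 0.0, "ok"
    z = (today - mu) / sigma
    if abs(z) > critical:
        level = "critical"
    elif abs(z) > warn:
        level = "warning"
    else:
        level = "ok"
    return z, level

# Hypothetical daily session counts for the trailing window.
sessions = [100, 102, 98, 101, 99, 100, 103, 97]

z_warn, level_warn = zscore_flag(sessions, 105)
z_crit, level_crit = zscore_flag(sessions, 107)
z_ok, level_ok = zscore_flag(sessions, 101)
print(level_warn, level_crit, level_ok)
```

Using the absolute value flags drops as well as spikes, which matters for session and order counts where an outage shows up as a low value.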


Tooling Built Alongside the Platform

  • Dealer visit form — n8n workflow + GitHub Pages app with authentication, feeding structured visit data into the lakehouse
  • Product trade-in form — multi-step web form integrated with backend pipeline
  • B2B mobile prototype — lightweight mobile interface for field sales
  • Price data export — automated pricing data extraction across markets

Stack

Python · PySpark · SQL · Azure Databricks · Delta Lake · Unity Catalog · MLflow · Azure Data Factory · BERTopic · Prophet · XGBoost · Scikit-learn · GPT-4o · Streamlit · Plotly · BigQuery · GA4 · Dynamics 365 · ServiceNow · Salesforce · GitHub Actions


Want to discuss a similar engagement? Get in touch.
