Candidate: Pranav Tanaji Ghadge
The most complex inference system I’ve managed was a real-time demand forecasting pipeline serving predictions over enterprise-scale transactional data under a sub-200ms latency SLA. The core model was a gradient-boosted ensemble trained on roughly 1.5TB of data (60M+ records).
The Stack: I used AWS SageMaker for model hosting, S3 as the primary feature store, and Airflow to orchestrate micro-batch feature updates. This allowed us to maintain prediction freshness without the overhead of heavy real-time feature engineering.
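To make the orchestration concrete, here is a minimal Airflow 2.x sketch of a micro-batch feature refresh. The task names, S3 paths, and 15-minute cadence are illustrative assumptions, not the production configuration.

```python
# Hypothetical micro-batch feature refresh DAG in Airflow 2.x style.
# Task names, S3 paths, and the 15-minute cadence are illustrative, not the real config.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule_interval=timedelta(minutes=15),   # micro-batch cadence instead of per-request work
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=2)},
)
def feature_refresh():
    @task
    def extract_new_transactions() -> str:
        # Pull only records that landed since the last run and stage them in S3.
        return "s3://example-bucket/staging/new_transactions.parquet"  # placeholder path

    @task
    def build_feature_vectors(staging_path: str) -> str:
        # Recompute lag/rolling features for just the affected entities.
        return "s3://example-bucket/features/latest.parquet"  # placeholder path

    @task
    def publish_to_cache(feature_path: str) -> None:
        # Push the refreshed vectors to the low-latency cache the endpoint reads from.
        pass

    publish_to_cache(build_feature_vectors(extract_new_transactions()))


feature_refresh()
```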
Key Optimizations:
- Feature Pruning: By identifying and dropping redundant lag features, I reduced the inference payload and cut total inference time by ~22% (see the pruning sketch after this list).
- Pre-materialization: I decoupled feature freshness from request latency by pre-materializing feature vectors in a low-latency cache, so the model never had to wait on upstream data fetches at request time (see the cache-read sketch below).
- Environment Tuning: I diagnosed and fixed a container misconfiguration where the model was being reloaded into memory on every single request (see the load-once sketch below). Resolving this removed a 60-80ms bottleneck per call. It was a clear reminder that in production, the bottleneck is often the infrastructure surrounding the model, not the model's math.
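A minimal sketch of the pruning idea from the first bullet, using pairwise correlation between lag columns as one possible redundancy criterion; the 0.97 threshold and the column-naming convention are assumptions for illustration.

```python
# Sketch of correlation-based pruning of redundant lag features.
# The 0.97 threshold and the "_lag_" naming convention are assumptions for illustration.
import pandas as pd


def prune_redundant_lags(df: pd.DataFrame, threshold: float = 0.97) -> list[str]:
    """Return lag columns that are near-duplicates of an earlier, kept lag column."""
    lag_cols = [c for c in df.columns if "_lag_" in c]
    corr = df[lag_cols].corr().abs()
    to_drop: list[str] = []
    for i, col in enumerate(lag_cols):
        kept_earlier = [c for c in lag_cols[:i] if c not in to_drop]
        if any(corr.loc[col, kept] > threshold for kept in kept_earlier):
            to_drop.append(col)
    return to_drop


# Usage: features = features.drop(columns=prune_redundant_lags(features))
```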
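For the pre-materialization bullet, the request path reduces to a single cache lookup. The sketch below assumes Redis as the cache, an invented hostname, and a simple key scheme; the real store and serialization may differ.

```python
# Sketch of the request-time read path against a pre-materialized feature cache.
# Redis, the endpoint hostname, and the key scheme are assumptions for illustration.
import json
from typing import Optional

import redis

cache = redis.Redis(host="feature-cache.example.internal", port=6379)


def get_feature_vector(entity_id: str) -> Optional[dict]:
    """Fetch a pre-computed feature vector without touching upstream data sources."""
    raw = cache.get(f"features:{entity_id}")
    if raw is None:
        # Cache miss: return None (or a default vector) rather than blocking the
        # request on a slow upstream fetch and blowing the latency SLA.
        return None
    return json.loads(raw)
```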
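And for the environment-tuning bullet, the fix amounts to loading the model once per serving worker rather than once per request. The sketch uses the SageMaker framework-container convention (model_fn / predict_fn) as an illustration; the exact hook names depend on the serving stack.

```python
# Sketch of the fix: deserialize the model once per serving worker, not per request.
# Shown in the SageMaker framework-container style (model_fn / predict_fn); the exact
# hook names depend on the serving stack, so treat this as illustrative.
import os

import joblib


def model_fn(model_dir: str):
    # Called once when the worker starts; the returned object is reused for every request.
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def predict_fn(input_data, model):
    # The misconfigured version effectively re-loaded the artifact here on every call,
    # which is where the 60-80ms per-request penalty came from.
    return model.predict(input_data)
```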
My most significant production "pain" occurred when a pricing model began serving stale predictions to live customers for over four hours. There were no hard errors or 5xx status codes; the system just served plausible-looking but incorrect data. The root cause was an upstream partner feed providing malformed records that bypassed our schema validation but silently corrupted our feature distributions.
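To make that failure mode concrete: a record can be perfectly valid at the type level and still poison a feature distribution. A toy illustration with invented field names:

```python
# Toy illustration with invented field names: both records pass a type-level schema
# check, but the second one, arriving at scale, silently drags the price distribution
# toward zero and corrupts the features derived from it.
def passes_schema(rec: dict) -> bool:
    return (isinstance(rec["sku"], str)
            and isinstance(rec["unit_price"], float)
            and isinstance(rec["qty"], int))


good_record = {"sku": "A123", "unit_price": 24.99, "qty": 3}
bad_record = {"sku": "A124", "unit_price": 0.0, "qty": 3}  # partner feed's silent default

assert passes_schema(good_record) and passes_schema(bad_record)
```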
The Solution: To prevent this from recurring, I implemented a two-layer defense:
- Statistical Guardrails: I added ingestion-layer checks that compared incoming feature distributions against a 7-day rolling baseline. If the deviation from that baseline exceeded a set threshold, it triggered an immediate alert (see the guardrail sketch after this list).
- Shadow Validation: I deployed a lightweight reference model in parallel as a "sanity check," flagging any case where the primary model’s outputs drifted outside the statistically plausible range implied by the reference (see the shadow-check sketch below).
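A minimal sketch of the first layer, the ingestion guardrail. The two-sample KS test and the p-value cutoff are illustrative stand-ins for the actual thresholding logic.

```python
# Sketch of the ingestion-layer guardrail: compare an incoming feature batch against
# a 7-day rolling baseline. The two-sample KS test and the 0.01 p-value cutoff are
# illustrative stand-ins for the actual thresholding logic.
import numpy as np
from scipy import stats


def distribution_alert(incoming: np.ndarray, baseline_7d: np.ndarray,
                       p_cutoff: float = 0.01) -> bool:
    """Return True when the incoming batch no longer looks drawn from the baseline."""
    return stats.ks_2samp(incoming, baseline_7d).pvalue < p_cutoff


# Usage (hypothetical column and alerting hook):
# if distribution_alert(batch["unit_price"].to_numpy(), baseline["unit_price"].to_numpy()):
#     alert_on_call("unit_price distribution drifted from its 7-day baseline")
```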
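And a sketch of the second layer, the shadow comparison against the reference model. The relative-gap threshold and the 5% traffic alert level are assumptions for illustration.

```python
# Sketch of the shadow check: a lightweight reference model scores the same inputs,
# and predictions that disagree too much with it are flagged. The 25% relative-gap
# threshold and the 5% alert level are assumptions for illustration.
import numpy as np


def shadow_flags(primary_pred: np.ndarray, reference_pred: np.ndarray,
                 max_rel_gap: float = 0.25) -> np.ndarray:
    """Boolean mask of predictions that sit implausibly far from the reference model."""
    denom = np.maximum(np.abs(reference_pred), 1e-6)  # guard against divide-by-zero
    rel_gap = np.abs(primary_pred - reference_pred) / denom
    return rel_gap > max_rel_gap


# Usage (hypothetical models):
# flags = shadow_flags(primary.predict(X), reference.predict(X))
# if flags.mean() > 0.05:  # more than 5% of traffic flagged
#     raise_alert("primary model output drifting from reference")
```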
Result: These changes reduced our Mean Time to Detection (MTTD) from 4 hours to under 30 minutes. More importantly, they shifted our team culture from reactive "firefighting" to a proactive stance where we caught data-level failures before they impacted the customer experience.