The Oracle of Lake: Predicting Big Data Processing Time at Goldman Sachs
Data Lake is the firm's big data platform. It consists of a proprietary data store, services, and infrastructure components that are used to house and process petabytes of data every day. The Data Lake platform scales to run hundreds of thousands of ingestions and more than a million exports per day (and we are constantly growing!). Fast processing at this scale is a financial need - speed is everything! Each of these ingestions and exports is time-sensitive, and the workloads they process vary tremendously in volume. For instance, the same producer can send anywhere from zero to millions of rows depending on the number of trades executed in that hour.

The most difficult question to answer operationally has been: which particular ingestion or export is delayed, and which has the potential to delay downstream consumer deadlines? This question led us to start estimating the time taken for each ingestion and export using a machine learning model, and subsequently building an anomaly detection model on top of those predictions to find outliers. Lake Oracle, a machine learning model, automatically learns from dynamic workloads and sets a predicted threshold for each flow; flows are marked happy if they are running within their threshold. It also finds runtime anomalies and identifies priorities for performance improvements across the entire platform.

This is the story of Lake Oracle - how the modeling techniques were built and improved in collaboration with Data Engineers over the course of two years, the successful adoption of this model for building granular ingestion happiness SLOs across the Data Lake platform, and our lessons learned along the way. Join us for this remarkable journey of DataOps + Applied ML in collaboration with Data Engineers.
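The per-flow thresholding idea described above can be sketched in a few lines. This is a minimal illustration only, not Lake Oracle's actual model: it assumes a naive seconds-per-row estimate learned from each flow's history and a hypothetical tolerance multiplier; all names here are invented for the example.

```python
from statistics import mean

def fit_rate(history):
    """Estimate seconds-per-row for one flow from its (rows, seconds) history.

    A stand-in for the real learned model, which would account for far more
    than row count.
    """
    return mean(seconds / rows for rows, seconds in history if rows > 0)

def predict_runtime(rate, rows):
    """Predicted processing time for a workload of the given size."""
    return rate * rows

def is_happy(actual_seconds, predicted_seconds, tolerance=1.5):
    """A flow is 'happy' if its runtime stays within tolerance x prediction;
    anything beyond the threshold is flagged as a runtime anomaly."""
    return actual_seconds <= tolerance * predicted_seconds

# Hypothetical history for one flow: (rows ingested, seconds taken)
history = [(1000, 50.0), (2000, 98.0), (500, 26.0)]
rate = fit_rate(history)
predicted = predict_runtime(rate, 3000)   # threshold scales with workload size
print(is_happy(160.0, predicted))         # within threshold: happy
print(is_happy(400.0, predicted))         # far over threshold: anomaly
```

The key property, which the real system shares, is that the threshold is not a fixed SLA number but scales with the predicted workload, so a flow processing millions of rows is judged against a different bar than the same flow processing a handful.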
Jaimita Bansal, Vice President/Lead Data Scientist
Jaimita is a Lead Data Scientist at Goldman Sachs in Data Lake Engineering, building machine learning solutions to proactively predict data processing delays and improve the overall operational efficiency of the firm's vast data pipelines. She loves the data and applied machine learning space and is currently working to build bots that can take accurate actions on data delays. She is passionate and curious about life and tech, and explores both through nature and data/ML. Jaimita earned an MTech in Computer Science and Engineering from IIT Kanpur in 2016, specializing in the NLP domain.