(McLaren Racing via Splunk)

Overview

Kermit Troy Berry collaborated with McLaren Racing’s Formula 1 team to build a real-time telemetry analytics platform that transforms how race data is used. In the high-stakes world of motorsports, every millisecond of sensor data can influence strategy and performance. Kermit’s project involved harnessing a torrent of telemetry (over 100,000 events per second from car sensors) and converting it into actionable insights using Splunk’s Data-to-Everything platform. The solution improved on-track decision-making and off-track reliability by enabling instant anomaly detection and comprehensive observability across McLaren’s hybrid cloud and edge infrastructure.

Objective

McLaren Racing sought to improve on-track decisions and operational resilience by leveraging the massive volume of data generated during races and testing. The objective was to develop a telemetry analytics system capable of:

  • Ingesting and processing telemetry data in real time (from the cars’ hundreds of sensors, race simulations, and IT systems).
  • Automatically detecting anomalies or performance issues (in car components or IT infrastructure) as they occur, so engineers can respond immediately.
  • Streamlining log and incident analysis through intelligent automation, given the limited time windows during and between races.
  • Providing a unified observability dashboard for both trackside and factory teams to monitor system health, car performance metrics, and alerts.

Ultimately, the goal was to sharpen McLaren’s competitive edge by using data more effectively than ever, ensuring that both the race cars and the supporting technology operate at peak performance at all times.

Role and Responsibilities

As a Senior ML Software Engineer attached to the McLaren–Splunk partnership, Kermit took on several key responsibilities:

  • Data Pipeline Architecture: He architected the end-to-end telemetry ingestion pipeline using OpenTelemetry standards to collect data from distributed sources and funnel it into Splunk. The pipeline was designed to reliably handle an onslaught of ~100k events per second, filtering and indexing telemetry data (car sensor streams, network logs, application metrics) in real time.
  • Anomaly Detection Development: Kermit created custom anomaly detection algorithms using a combination of Splunk’s Search Processing Language (SPL) for rule-based thresholds and Python-based machine learning for more complex pattern recognition. These detectors were configured to flag abnormal patterns in sensor data (e.g., sudden drops in oil pressure, surges in brake temperatures) and in IT system logs, enabling proactive issue resolution.
  • SOAR Integration with NLP: Leveraging Splunk Phantom (Security Orchestration, Automation, and Response), he extended McLaren’s incident response automation. Kermit integrated a PyTorch NLP model into Phantom playbooks to intelligently triage log messages and alerts. The model could read unstructured log text or error reports and classify their criticality or category, helping engineers prioritize significant issues (for example, distinguishing a harmless sensor glitch from a genuine early warning of component failure).
  • Infrastructure as Code & Automation: He implemented Infrastructure as Code for the telemetry platform using Terraform and automated provisioning via Splunk’s REST APIs. All Splunk deployments (search head, indexer, and forwarder configurations) and Phantom integrations were codified, allowing reproducible environments and quick rollout of updates across on-premises trackside rigs and cloud servers. This ensured consistency in configuration from the garage to the cloud and reduced setup time when racing at different circuits.
  • Training and Best Practices: Kermit also served as an evangelist of observability within the team. He conducted training sessions for McLaren’s performance engineers and IT staff on using the new Splunk dashboards, understanding anomaly alerts, and following observability best practices. He prepared playbooks and guidelines so that race strategists, reliability engineers, and developers could all confidently use the telemetry insights and respond to incidents swiftly.

Approach and Technology Stack

Real-Time Data Ingestion: The approach centered on adopting OpenTelemetry instrumentation across McLaren’s systems. Every critical sensor on the F1 car, as well as supporting IT systems (databases, applications, network devices), was instrumented to emit standardized telemetry data. Kermit configured OpenTelemetry collectors to feed into Splunk at extremely high throughput. The system leveraged Splunk Enterprise as the core platform for data ingestion and analytics, chosen for its proven ability to handle massive machine-data streams. During races and test sessions, data was streamed both to on-site Splunk instances (at the trackside IT rig) for low-latency needs and to cloud-based Splunk infrastructure for broader analysis and historical comparison.
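To keep heterogeneous sources comparable, every reading is normalized into a common event shape before indexing. A minimal sketch of such a normalizer, with illustrative field names (the actual OpenTelemetry schema and McLaren’s channel names are not public):

```python
import time
from typing import Optional

def normalize_reading(sensor_id: str, channel: str, value: float,
                      ts: Optional[float] = None) -> dict:
    """Wrap one raw sensor reading in a standardized telemetry event.

    Field names here are hypothetical, not McLaren's actual schema.
    """
    return {
        "timestamp": ts if ts is not None else time.time(),
        "source": f"car.ecu.{sensor_id}",  # logical origin of the reading
        "channel": channel,                # e.g. "oil_pressure", "brake_temp_fl"
        "value": float(value),
    }

event = normalize_reading("ecu42", "oil_pressure", 4.8, ts=1700000000.0)
```

Downstream stages (batching, filtering, indexing) then only ever deal with this one event shape, regardless of whether the reading came from a car ECU or a network device.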

Anomaly Detection & Analytics: Once data was in Splunk, Kermit developed a suite of analytics:

SPL-based Alerts: For known thresholds (e.g., engine temperature exceeding X°C, network latency above Y ms), he used Splunk’s query language to define real-time alerts. These were tuned with McLaren’s domain experts to minimize false alarms and ensure they reflected conditions that would warrant attention during a race.
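In spirit, each of these rule-based alerts boils down to a threshold comparison per channel. A hedged sketch, with made-up channel names and limits standing in for the tuned SPL searches:

```python
from typing import Optional

# Hypothetical limits; the real alerts were SPL searches tuned with
# McLaren's domain experts, e.g. something in the spirit of:
#   index=telemetry channel=engine_temp | where value > 120
THRESHOLDS = {"engine_temp": 120.0, "network_latency_ms": 50.0}

def check_event(event: dict) -> Optional[str]:
    """Return an alert message if the event breaches its threshold."""
    limit = THRESHOLDS.get(event["channel"])
    if limit is not None and event["value"] > limit:
        return f"ALERT {event['channel']}={event['value']} > {limit}"
    return None
```

The hard part in practice was not the comparison itself but choosing limits that reflect race conditions, which is why each threshold was reviewed with the engineers.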

ML-based Detection: For more complex anomalies, such as subtle changes in vibration frequencies or multi-sensor correlations that precede a failure, Python scripts were integrated via Splunk’s Machine Learning Toolkit and custom search commands. He built anomaly detection models (such as unsupervised clustering and predictive models) that ran over sliding windows of telemetry data. For example, a model using statistical profiling identified when a combination of sensor readings deviated from the normal pattern for that car and track condition, flagging it as a potential issue for engineers to investigate.
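One simple form of statistical profiling over a sliding window is a rolling z-score test. The following is a toy stand-in for the models described above, not the actual production detectors; window size and threshold are illustrative:

```python
from collections import deque
from statistics import mean, pstdev

class SlidingZScoreDetector:
    """Flag readings that deviate far from the recent rolling mean."""

    def __init__(self, window: int = 50, z_threshold: float = 4.0):
        self.values = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous versus the window."""
        anomalous = False
        if len(self.values) >= 10:  # need some history before judging
            mu, sigma = mean(self.values), pstdev(self.values)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                anomalous = True
        self.values.append(value)
        return anomalous
```

The production models additionally conditioned on track and car state; a plain z-score would fire constantly through a braking zone, which is exactly why domain features mattered.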

Natural Language Processing for Logs: A notable innovation was embedding a PyTorch based NLP model into Splunk Phantom workflows. Kermit trained this model on historical log data and incident reports to recognize patterns in text (like error messages or stack traces) and categorize them (for instance, “network issue,” “sensor fault,” “software error”). Phantom playbooks used this model’s output to decide how to route or escalate alerts. As a result, if multiple log anomalies occurred, the system could auto prioritize those indicating critical failures (e.g., an imminent server crash affecting telemetry) over benign warnings, saving precious time for the team.
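The routing logic around such a classifier can be sketched as follows; here a trivial keyword scorer stands in for the trained PyTorch model, and the categories and priority mapping are illustrative assumptions:

```python
# Toy stand-in for the trained NLP classifier; real categories came
# from a model trained on historical logs and incident reports.
CATEGORY_KEYWORDS = {
    "network_issue": ["timeout", "unreachable", "packet loss"],
    "sensor_fault": ["sensor", "calibration", "out of range"],
    "software_error": ["traceback", "exception", "segfault"],
}

def triage(log_line: str):
    """Classify a log line and map its category to a routing priority."""
    text = log_line.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in text for k in keywords):
            # Network problems threaten live telemetry, so escalate them.
            priority = "high" if category == "network_issue" else "normal"
            return category, priority
    return "unknown", "low"
```

In the real playbooks, the model’s category and confidence decided which response playbook fired; the point of the sketch is only the classify-then-route shape.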

Tech Stack & Tools: The solution stack was diverse:

Splunk Platform: including Splunk Enterprise (search and indexing engine for telemetry), Splunk Forwarders (for data ingestion), and Splunk Phantom (for automation and response).

OpenTelemetry & Telemetry SDKs: for instrumenting data from car ECUs (Electronic Control Units) and applications into a unified format.

Python & PyTorch: used for custom ML models and the NLP log analysis component. These ran in tandem with Splunk (via the Splunk Python SDK and Phantom’s automation framework).

Terraform: for provisioning cloud infrastructure (like Splunk indexer clusters, virtual machines, network configurations) as code. The Splunk REST API was also scripted to create indexes, data sources, and saved searches automatically during deployments.
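Scripted provisioning against Splunk’s REST API might look like the following sketch, which only builds the request for creating an index via `POST /services/data/indexes`; the host is a placeholder and authentication is omitted to keep it offline:

```python
from urllib import parse

SPLUNK_BASE = "https://splunk.example.com:8089"  # placeholder management host

def build_index_request(name: str, max_size_mb: int):
    """Build (url, form body) for creating a Splunk index via the
    REST API's data/indexes endpoint. Sizes are illustrative."""
    url = f"{SPLUNK_BASE}/services/data/indexes"
    body = parse.urlencode({"name": name,
                            "maxTotalDataSizeMB": max_size_mb}).encode()
    return url, body

url, body = build_index_request("telemetry", 500_000)
# An authenticated POST (e.g. urllib.request with a session-token
# header) would then submit it.
```

Because requests like this are just code, the same index and saved-search definitions could be replayed identically at every circuit.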

Monitoring & CI/CD: While Splunk itself monitored racing systems, Kermit ensured the telemetry platform’s health was also monitored (for example, forwarding Splunk internal metrics to a separate monitoring dashboard). Deployment of configuration changes and updates to ML models was managed through a CI/CD pipeline that tested changes in a staging environment before applying to production (especially important before a race weekend).

Challenges and Solutions

Ultra-High-Velocity Data: Managing 100k events per second in real time is nontrivial. The sheer volume of data (100kHz telemetry streaming) could have overwhelmed systems. Kermit tackled this by optimizing each stage of the pipeline. He tuned OpenTelemetry collectors to batch data efficiently and set up multiple parallel ingestion streams into Splunk to distribute load. Within Splunk, indexer clustering and load balancing ensured throughput capacity. He also implemented selective filtering at the source for non-critical data during peak times (for instance, lowering the ingestion of debug-level logs during races) so that the most important signals were always prioritized. The result was a resilient pipeline that consistently kept up with the data firehose, proven through stress tests and live race conditions.
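The batch-and-shed behavior described above can be sketched as a small buffer that flushes fixed-size batches and drops debug-level events when the pending queue grows too large. This is a simplified illustration; the sizes are made up:

```python
class TelemetryBatcher:
    """Accumulate events into fixed-size batches, shedding debug-level
    events when the pending queue backs up (sizes are illustrative)."""

    def __init__(self, batch_size: int = 500, shed_above: int = 5000):
        self.batch_size = batch_size
        self.shed_above = shed_above
        self.pending = []   # events waiting to fill the current batch
        self.batches = []   # completed batches ready to ship
        self.dropped = 0    # debug events shed under load

    def add(self, event: dict) -> None:
        if len(self.pending) >= self.shed_above and event.get("level") == "debug":
            self.dropped += 1  # shed non-critical load, keep hot signals
            return
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.batches.append(self.pending)
            self.pending = []
```

Batching amortizes per-event overhead on the wire, and shedding only the debug tier preserves exactly the prioritization the paragraph describes.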

Data Quality and Noise: With thousands of data points per second, distinguishing a true performance anomaly from normal noise or expected variation was challenging. The solution was a layered anomaly detection strategy. Basic threshold breaches would catch obvious issues, but for nuanced patterns, Kermit’s ML models looked at combinations of signals and temporal trends. He worked closely with McLaren’s race engineers to incorporate domain rules (e.g., expected ranges when DRS is active, tire wear effects on telemetry) into model features. This hybrid approach dramatically improved the signal-to-noise ratio of alerts: the team saw an estimated ~30% reduction in false-positive alerts, ensuring that when an alert fired, it genuinely required attention.

Hybrid Infrastructure Complexity: The telemetry system spanned on-car edge devices, trackside servers, and cloud servers back at McLaren’s factory. Ensuring consistency and reliability across this hybrid environment was complex. Kermit’s introduction of Infrastructure as Code (Terraform) proved vital. It allowed the team to deploy the same Splunk configurations across different environments reproducibly. For example, before each race, the trackside rig’s configuration (dashboards, alerts, indexes) could be synced with the master configuration tested at the factory. He also established robust offline buffering: if connectivity dropped (as can happen at remote circuits), the on-site system would cache data and forward it when back online, preventing loss. This design delivered continuity in data collection despite the distributed nature of the setup.
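The store-and-forward pattern at the trackside rig can be sketched as a buffer that caches events when the uplink send fails and replays them, in order, once connectivity returns. This is an assumption-laden simplification of the real system; `send` is a caller-supplied callable:

```python
class StoreAndForward:
    """Cache events locally while the uplink is down; replay in order
    once sends succeed again. A sketch, not the production buffer."""

    def __init__(self, send):
        self.send = send   # callable(event) -> bool (True on success)
        self.backlog = []  # events awaiting delivery, oldest first

    def forward(self, event) -> int:
        """Try to flush the backlog plus this event; return how many
        events were delivered on this call."""
        delivered = 0
        queue = self.backlog + [event]
        self.backlog = []
        for i, ev in enumerate(queue):
            if self.send(ev):
                delivered += 1
            else:
                self.backlog = queue[i:]  # preserve ordering, retry later
                break
        return delivered
```

Stopping at the first failure keeps events strictly ordered, which matters when downstream analytics compare readings across time.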

Rapid Incident Response: In F1, timing is everything; if an issue is detected, engineers need immediate insight. Initially, sifting through log files or numerous alerts could slow response. The integration of Splunk Phantom with an NLP model was Kermit’s answer to speed this up. By automating log analysis and triage, the team could respond to critical incidents faster. One example occurred when a sudden network glitch struck the garage systems during practice: Phantom’s NLP classifier instantly flagged the network-related error messages as high priority and triggered a response playbook, whereas previously such an issue might have been buried among dozens of less important alerts. This automation potentially saved the team from missing crucial telemetry during that practice session.

User Adoption and Skills Gap: Introducing advanced analytics and ML into a fast-paced racing team environment required user trust and understanding. Some engineers were initially skeptical of automated insights or worried the new tools might distract during intense moments. Kermit addressed this by deeply involving the end users in the development cycle. He ran hands-on workshops where engineers used the Splunk dashboards on historical data to see how it surfaced issues they already knew about. He also implemented a feedback loop: every false alert or missed detection was analyzed post-race, and rules and models were refined. Over time, as the team gained confidence that the system reliably mirrored their own expertise (and even caught things they hadn’t seen), they embraced it. Training materials were created for new team members, and an “observability champion” was designated on the racing team to ensure continuous advocacy of best practices.

Results and Impact (with Metrics)

The Real Time Telemetry Analytics initiative delivered clear improvements for McLaren Racing:

Comprehensive Observability: McLaren now has a powerful, unified view of its racing data ecosystem. The Splunk platform streams and analyzes telemetry at 100kHz (100,000 data points per second) for real-time decision-making, covering everything from engine performance to IT infrastructure status. This resulted in unprecedented visibility; as one McLaren tech lead put it, “we’ve never had this level of insight before”. The team can monitor car health and systems health side by side, ensuring no blind spots during critical operations.

Improved Reliability & Faster Issue Resolution: The anomaly detection system has led to a marked improvement in reliability and response times. Critical anomalies (whether a degrading sensor, a server memory leak, or a network latency spike) are now caught within seconds and immediately flagged to the team. During the last season, there were multiple instances where early warnings from the system allowed McLaren to address issues during pit stops or between sessions, avoiding on-track failures that could have cost points. The operations and IT infrastructure achieved greater consistency and uptime, as evidenced by the fact that McLaren experienced zero telemetry-related outages during races after the system’s implementation. Any issues in the telemetry pipeline were proactively identified and resolved by the monitoring Kermit put in place.

Data-Driven Performance Gains: With high-quality data readily available, McLaren’s race strategists and engineers could make more informed decisions. For example, data insights from Splunk helped optimize engine tuning and tire strategy by correlating real-time telemetry with historical race data. The team credited the platform with helping accelerate their development feedback loop: car setup changes and updates could be evaluated faster across thousands of data points, contributing to improved lap times and consistency. While competitive advantage is hard to quantify, McLaren’s leadership has publicly noted that data analytics is a key part of their race-to-race improvements.

Efficient Log Triage and Team Productivity: The integration of NLP-driven alert triage reduced the volume of alerts engineers had to manually review by ~40%. Instead of wading through a flood of messages, they now receive a concise summary of the most critical issues. This efficiency means engineers spend less time in war-room debugging and more time implementing solutions. It also reduces stress and cognitive load during races: team members can trust the automation to catch important issues, allowing them to focus on strategy and car performance.

Team Empowerment and Culture Shift: By training and upskilling the team on observability tools, Kermit helped cultivate a more data-driven culture in McLaren’s operations. Engineers who traditionally might focus only on mechanical aspects are now comfortable diving into telemetry dashboards to hunt for improvements. The user-friendly visualizations and the successes of the system have made data a central part of race debriefs and planning. McLaren’s management noted that this project has “sparked curiosity and innovation across the organization,” encouraging experimentation and continuous improvement. In effect, the platform not only delivered technical benefits but also influenced the way the team approaches problem solving, with data at the forefront.

Recognition and Partnership Value: This project strengthened the McLaren–Splunk partnership. McLaren’s case became a flagship example for Splunk (featured in Splunk’s customer success stories), and it validated Splunk’s technology in one of the most demanding real time data scenarios. For Kermit, his work received recognition from both McLaren’s leadership and Splunk executives, enhancing his reputation as an expert in real time analytics solutions.

Lessons Learned

Working on F1 level real time analytics provided several key lessons:

  • Design for Scale and Speed from Day One: In extreme data environments, you have to assume things will go wrong at scale that never appear in small tests. Kermit learned to rigorously load test and profile the system under race-like conditions. The project underscored that building with headroom (extra capacity) and low-latency design patterns (parallel processing, efficient buffering) is essential when milliseconds matter.
  • Marry Domain Expertise with Data Science: An important takeaway was the value of integrating domain experts (race engineers, mechanics) into the data science process. Many improvements in anomaly detection came from suggestions by those who deeply understood the car and racing dynamics. This partnership ensured the analytics weren’t just theoretically sound, but practically relevant. It reinforced that in applied ML, context is king: solutions must be tailored to the real-world environment for best results.
  • Automation Improves Human Focus: Initially, there was concern that automation (like Phantom playbooks) might oversimplify or miss nuances. Instead, the lesson was that smart automation, when properly tuned, augments human capabilities. By offloading rote tasks (scanning logs, basic threshold monitoring), the team could focus on complex problem solving. Automation should handle the “detect and aggregate” so humans can do the “diagnose and decide.” This philosophy, well executed here, can be applied to many enterprise scenarios beyond racing.
  • Infrastructure as Code and DevOps are Strategic: The use of Terraform and CI/CD wasn’t just a technical convenience; it proved to be a strategic enabler. It allowed McLaren to deploy their analytics setup anywhere, anytime (important given the traveling nature of the sport) and ensured reliability. The lesson: treating your ops configuration and analytics environment as code brings repeatability and faster iteration, which is especially crucial when working under tight time constraints (like the few days between races).
  • User Adoption is as Important as Technical Accuracy: No matter how advanced an analytics system is, its value is limited if the end users don’t trust or understand it. Kermit learned the importance of change management involving users early, demonstrating quick wins, and providing training. By converting skeptics into advocates through patience and proof, the project achieved lasting impact. This highlighted that successful tech projects are equal parts technology, people, and process.

Visual Summary

Suggested visuals to illustrate this case study include:

  • Telemetry Flow Diagram: A visual diagram showing the flow of data from the race car to actionable insights. For instance, an infographic with the race car on one end emitting data, feeding into a trackside server, then into the Splunk cloud. Include icons for OpenTelemetry, the Splunk logo, and a database icon to represent indexed data, and finally a dashboard icon. Annotate it with the volume (100k events/sec) and types of data (engine, tires, brakes, network, etc.) to convey the scope.
  • Real-Time Dashboard Screenshot: A sample Splunk dashboard view that McLaren engineers might see during a race. This could show live graphs of key metrics (engine RPM, temperature, fuel rate, etc.), along with an alert panel highlighting any anomalies (maybe a red flag on “Gearbox vibration anomaly detected!”). Even a mock-up would help readers visualize the real-time nature and the clarity the system brings.
  • Automated Alert Triage Illustration: A small flowchart or comic style graphic showing how a flood of log alerts is distilled by the NLP model. For example, an image with multiple log files on one side, an “NLP filter” in the middle, and then a few critical alerts on the other side reaching an engineer. This emphasizes how the smart log triage works and benefits the team.
  • Race Impact Graph: Perhaps a before-and-after chart or timeline highlighting an incident avoided. For instance, a timeline of a race stint where a certain sensor started behaving abnormally; mark the point where the system alerted the team, and how early intervention prevented a failure. While this might be hypothetical, it would illustrate concretely the value of early detection in a race scenario.