This post is co-written with Curtis Maher and Anjali Thatte from Datadog.
This post walks you through Datadog’s new integration with AWS Neuron, which helps you monitor your AWS Trainium and AWS Inferentia instances by providing deep observability into resource utilization, model execution performance, latency, and real-time infrastructure health, enabling you to optimize machine learning (ML) workloads and achieve high-performance at scale.
Neuron is the SDK used to run deep learning workloads on Trainium and Inferentia based instances. AWS AI chips, Trainium and Inferentia, enable you to build and deploy generative AI models at higher performance and lower cost. With the increasing use of large models, requiring a large number of accelerated compute instances, observability plays a critical role in ML operations, empowering you to improve performance, diagnose and fix failures, and optimize resource utilization.
Datadog, an observability and security platform, provides real-time monitoring for cloud infrastructure and ML operations. Datadog is excited to launch its Neuron integration, which pulls metrics collected by the Neuron SDK’s Neuron Monitor tool into Datadog, enabling you to track the performance of your Trainium and Inferentia based instances. By providing real-time visibility into model performance and hardware usage, Datadog helps you achieve efficient training and inference, optimized resource utilization, and the prevention of service slowdowns.
Datadog’s integration with the Neuron SDK automatically collects metrics and logs from Trainium and Inferentia instances and sends them to the Datadog platform. Upon enabling the integration, users will find an out-of-the-box dashboard in Datadog, making it straightforward to start monitoring quickly. You can also modify preexisting dashboards and monitors, and add news ones tailored to your specific monitoring requirements.
The Datadog dashboard offers a detailed view of your AWS AI chip (Trainium or Inferentia) performance, such as the number of instances, availability, and AWS Region. Real-time metrics give an immediate snapshot of infrastructure health, with preconfigured monitors alerting teams to critical issues like latency, resource utilization, and execution errors. The following screenshot shows an example dashboard.

For instance, when latency spikes on a specific instance, a monitor in the monitor summary section of the dashboard will turn red and trigger alerts through Datadog or other paging mechanisms (like Slack or email). High latency may indicate high user demand or inefficient data pipelines, which can slow down response times. By identifying these signals early, teams can quickly respond in real time to maintain high-quality user experiences.
Datadog’s Neuron integration enables tracking of key performance aspects, providing crucial insights for troubleshooting and optimization:
By consolidating these metrics into one view, Datadog provides a powerful tool for maintaining efficient, high-performance Neuron workloads, helping teams identify issues in real time and optimize infrastructure as needed. Using the Neuron integration combined with Datadog’s LLM Observability capabilities, you can gain comprehensive visibility into your large language model (LLM) applications.
Datadog’s integration with Neuron provides real-time visibility into Trainium and Inferentia, helping you optimize resource utilization, troubleshoot issues, and achieve seamless performance at scale. To get started, see AWS Inferentia and AWS Trainium Monitoring.
To learn more about how Datadog integrates with Amazon ML services and Datadog LLM Observability, see Monitor Amazon Bedrock with Datadog and Monitoring Amazon SageMaker with Datadog.
If you don’t already have a Datadog account, you can sign up for a free 14-day trial today.
Curtis Maher is a Product Marketing Manager at Datadog, focused on the platform’s cloud and AI/ML integrations. Curtis works closely with Datadog’s product, marketing, and sales teams to coordinate product launches and help customers observe and secure their cloud infrastructure.
Anjali Thatte is a Product Manager at Datadog. She currently focuses on building technology to monitor AI infrastructure and ML tooling and helping customers gain visibility across their AI application tech stacks.
Jason Mimick is a Senior Partner Solutions Architect at AWS working closely with product, engineering, marketing, and sales teams daily.
Anuj Sharma is a Principal Solution Architect at Amazon Web Services. He specializes in application modernization with hands-on technologies such as serverless, containers, generative AI, and observability. With over 18 years of experience in application development, he currently leads co-building with containers and observability focused AWS Software Partners.
Manuel Rioux est fièrement propulsé par WordPress