About AnoSys

AnoSys is an AI observability company building the monitoring and analytics platform for the agentic AI era. Our platform gives engineering teams end-to-end visibility into AI agents, LLM pipelines, and ML-powered applications — from user interaction through model inference to downstream business outcomes. We help organizations detect silent regressions, trace non-deterministic behavior, and turn massive telemetry streams into actionable insights, all in real time. Backed by leading investors and built by a team of infrastructure and AI veterans, AnoSys is defining the category of AI-native observability.

Who We're Looking For

We're a small, high-impact team, and every person here shapes the product, the culture, and the trajectory of the company. We look for intellectually curious individuals who combine critical thinking with meticulous attention to detail — people who can identify problems early, reason through ambiguity, and solve challenges independently.

If you thrive in fast-paced, high-ownership environments — where your work directly shapes a category-defining product — we'd love to hear from you.

About the Role

AnoSys processes billions of telemetry events every day — traces, logs, metrics, and model evaluations generated by AI agents and LLM-powered applications running in production. As a Backend Engineer, you will design, build, and operate the core infrastructure that makes this possible.

You will work across the full backend stack: from high-throughput ingestion services that handle millions of events per second, to real-time stream processing pipelines that power anomaly detection and trace correlation, to the query engines and APIs that customers depend on for interactive exploration and alerting. You will tackle problems at the intersection of distributed systems, data-intensive computing, and applied machine learning infrastructure.

This is a foundational engineering role. The systems you build will directly determine the reliability, performance, and scalability of the AnoSys platform. You will have significant ownership and autonomy — from system design through production deployment — and your decisions will shape the technical trajectory of the company.

What You'll Do

Design and implement scalable, fault-tolerant APIs and microservices for telemetry ingestion, trace correlation, anomaly detection, and real-time alerting
Build and optimize high-performance data pipelines capable of processing millions of structured and semi-structured events per second with low latency
Architect distributed systems using patterns such as event streaming (Kafka, Pub/Sub), partitioned storage, write-ahead logs, and eventually consistent data stores
Optimize query performance across time-series, columnar, and graph-like data models for both real-time dashboards and deep analytical workloads
Contribute to platform reliability through infrastructure-as-code, CI/CD pipelines, observability instrumentation, capacity planning, and incident response
Collaborate with research scientists and product engineers to integrate ML models — including anomaly detectors and LLM evaluators — into production-grade serving infrastructure
Participate in system design reviews, code reviews, and technical planning to maintain a high bar for engineering quality across the team

What We're Looking For

3+ years of backend engineering experience building and operating production-grade distributed systems at scale
Strong proficiency in Python, Go, or Rust — with a willingness to learn new languages and frameworks as the technical landscape evolves
Hands-on experience with message queues and streaming systems (Apache Kafka, Google Pub/Sub, Amazon Kinesis), columnar or time-series stores (BigQuery, ClickHouse, Apache Druid), and cloud infrastructure (GCP, AWS, or Azure)
Solid understanding of distributed systems fundamentals — consistency models, partitioning strategies, replication, consensus, and failure modes
Experience with containerized deployments (Docker, Kubernetes) and modern DevOps practices including infrastructure-as-code (Terraform, Pulumi) and CI/CD automation
A detail-oriented mindset with a passion for writing clean, well-tested, and performant code — and a strong sense of ownership over the systems you build
Strong written and verbal communication skills, with the ability to articulate complex technical concepts to both engineering and non-engineering stakeholders

Nice to Have

Experience building or operating observability, monitoring, or APM infrastructure (e.g., at companies like Datadog, Honeycomb, Grafana Labs, or Chronosphere)
Familiarity with OpenTelemetry, distributed tracing specifications, or telemetry data modeling
Experience with ML serving infrastructure, feature stores, or model evaluation pipelines
Contributions to open-source infrastructure or data systems projects

Apply for This Role

To apply, email your resume to hiring@anosys.ai