Senior Software Development Engineer, Stores Incident Monitoring

Job ID: 2653380 | Services LLC


We’re hiring a Senior Software Development Engineer to help shape and drive Incident Monitoring tooling and engineering efforts as part of the incident response program for the worldwide Amazon retail websites.
We are re-imagining incident management & response for Amazon’s retail operations. Amazon is evolving faster than our incident management/response programs can keep up. It’s time to change that.
As an L6 Software Development Engineer on the Monitoring team, you will play a pivotal role in the design and implementation of a strategic monitoring platform for the central incident response team. When Amazon is under duress, every single minute matters, and your technical contributions will have a direct impact on the decisions made by Amazon executives and the teams that rely on our centralized control centers and outage management capabilities.
You will be required to dive deep into the intricacies of post-incident analysis, uncovering what went wrong, identifying opportunities for improvement, and ensuring that blind spots are addressed in the future. Amazon incidents are inherently complex, fast-paced, and highly nuanced, presenting a unique and challenging environment for technical problem-solving.

Key job responsibilities
As an L6 SDE on the Monitoring team, you will play a crucial role in defining, building, and integrating key performance indicators for various website experiences into our product. This will require you to navigate the complex architectural landscape of Amazon and work collaboratively with experience owners across the organization.
Your technical expertise and insightful architectural design instincts will be instrumental in developing simple, elegant, and scalable solutions that can support the monitoring of thousands of unique retail website experiences. You will be expected to take initiative and thrive in a relatively unstructured environment, leveraging your problem-solving skills to deliver innovative technical solutions.
A deep passion for understanding the retail business and providing real-time visibility into Amazon's operational health will be a key requirement for this role. You will need to enjoy working within the Amazon ecosystem, collaborating with sister teams and retail experience owners, and building foundational solutions that will empower the central response team.
Mentoring and supporting junior engineers will be a crucial aspect of your role, as you work to foster a culture of continuous learning and improvement within the team.
Maintaining a deep understanding of the broader incident management ecosystem and its interdependencies will be essential.

A day in the life
The challenges you will face will not be easy. The sheer scale of Amazon's operations and the semi-connected nature of its systems will present unique technical problems that require creative problem-solving and persistence. However, these are the types of big challenges that will have a substantial impact on the Central Reliability and Response organization, contributing to its ongoing efforts to improve operational resilience and responsiveness.
By embracing these challenges and leveraging your technical expertise, you will play a vital role in enhancing the monitoring capabilities that are crucial for safeguarding the seamless operation of Amazon's retail experiences.

About the team
The Incident Command Systems team at Amazon is responsible for envisioning and building programs that consistently improve remediation times for outages. This group consists of multiple 2-pizza teams (teams of 6-10 engineers), each of which owns software components for monitoring, for anomaly detection of website-degrading issues, and for the incident management software used during these outages.


Basic qualifications
- 5+ years of non-internship professional software development experience
- 5+ years of programming experience with at least one software programming language
- 5+ years of experience leading the design or architecture (design patterns, reliability, and scaling) of new and existing systems
- Experience as a mentor, tech lead, or leader of an engineering team


Preferred qualifications
- 5+ years of experience across the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Bachelor's degree in computer science or equivalent
- Experience contributing to the architecture and design (design patterns, reliability, and scaling) of new and current systems

Amazon is committed to a diverse and inclusive workplace. Amazon is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. For individuals with disabilities who would like to request an accommodation, please visit

Our compensation reflects the cost of labor across several US geographic markets. The base pay for this position ranges from $151,300/year in our lowest geographic market up to $261,500/year in our highest geographic market. Pay is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience. Amazon is a total compensation company. Dependent on the position offered, equity, sign-on payments, and other forms of compensation may be provided as part of a total compensation package, in addition to a full range of medical, financial, and/or other benefits. For more information, please visit This position will remain posted until filled. Applicants should apply via our internal or external career site.