Being On-Call in 2024; feat. Parny.io

Engin Can Höke
4 min read · Apr 20, 2024

Yeah, it’s still a thing: infrastructures still need us, real people, to keep them running. Who am I? I’m just a DevOps engineer who builds these infrastructures from scratch and makes them highly scalable, reliable, secure, performant, and cost-effective. On-call procedures are what protect these qualities against disasters and day-to-day issues.

The IT Crowd & Mr. Robot Scenes to Show How We’re Feeling

Defining on-call rotations and responsibilities can be challenging from day one. But it needs to happen quickly, both to keep the team motivated and to avoid serious consequences for the organization and its workload. Taking calls during and outside working hours to keep services reliable and available requires a clear procedure and an agreement with the team. On-call shifts should be distributed equally, with emergency coverage as a backup, which brings us to the concept of layers. Having multiple layers on call at the same time makes escalation easier and encourages it. Escalate early and often, so that risks are identified and prevented from turning into incidents.
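To make the layer idea concrete, here is a minimal Python sketch of an equally distributed weekly rotation with two layers. The names `build_rotation` and `escalation_chain`, the weekly shift length, and the sample team are all assumptions made for illustration; this is not how Parny or any other tool implements scheduling.

```python
from datetime import date, timedelta

def build_rotation(engineers, start, weeks):
    """Assign a primary and a secondary (escalation) engineer per week.

    The secondary is simply the next person in the list, so over a full
    cycle everyone serves equally often in both layers.
    """
    if len(engineers) < 2:
        raise ValueError("two layers need at least two engineers")
    n = len(engineers)
    schedule = []
    for week in range(weeks):
        schedule.append({
            "week_start": start + timedelta(weeks=week),
            "primary": engineers[week % n],
            "secondary": engineers[(week + 1) % n],
        })
    return schedule

def escalation_chain(shift):
    """Order in which people are paged: layer 1 first, then layer 2."""
    return [shift["primary"], shift["secondary"]]

# Hypothetical team and start date, used only for the example.
team = ["ayse", "bora", "cem"]
schedule = build_rotation(team, date(2024, 4, 22), weeks=6)
for shift in schedule:
    print(shift["week_start"], " -> ".join(escalation_chain(shift)))
```

Because the secondary is always on call alongside the primary, escalating is a matter of paging the next name in the chain rather than finding someone who happens to be available.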

Escalating an on-call issue to upper levels means seeking experience and knowledge. Beyond that, we often need the resources posted and discussed online, commonly called ‘Googling’, which can provide insights crucial for resolution. In 2024, on-call engineers increasingly use artificial intelligence (AI) to support their work. Some options specialize in DevOps, such as Amazon Q and DevOps GPT (in ChatGPT-4). These AI-powered assistants offer quick, relevant responses during emergencies, facilitate problem-solving, and take action by drawing on their vast reservoir of data and know-how.

Integrating an AI assistant directly into on-call management is an innovative way to help on-call engineers decide what to do when outcomes are at risk.

Parny (parny.io) does this with personas specialized in Development, DevOps, and Database, which offer focused insights from the AI engine without the hassle of explaining the context yourself:

Ask Parny AI (parny.io)

This matters because it streamlines communication and problem-solving: teams can quickly access relevant insights and solutions without extensive explanation or deep expertise on the specific issue.

I’m also On-Call :D

To take the edge off the constant calls, some best practices can help smooth things over. Here are a few of them according to Google, and how Parny addresses each:

Balanced On-Call

Balanced on-call in DevOps teams means considering both the quantity and the quality of on-call shifts. Quantity can be measured by the percentage of time engineers spend on call, while quality can be measured by the number of incidents that occur during those shifts. Because these engineers also work on engineering tasks, on-call time should be kept as low as possible to prevent burnout. Effective incident management protocols, clear escalation paths, and blameless postmortems help resolve incidents and turn mistakes into lessons that prevent recurrence. Smart On-Call Lists and automatic shifts help here.

Simply Create On-Call Schedule Lists and Never Miss Alerts
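The two balance measures described above, share of time spent on call and incidents per shift, can be computed from a simple shift log. The sketch below is only an illustration: the function names, the shape of the shift records, and especially the two limits are assumptions, so tune them to your own team rather than treating them as rules.

```python
from collections import defaultdict
from datetime import timedelta

# Illustrative limits only; pick values that fit your own team.
MAX_TIME_SHARE = 0.25
MAX_INCIDENTS_PER_SHIFT = 2.0

def on_call_balance(shifts, period):
    """Summarize quantity and quality of on-call per engineer.

    shifts: iterable of (engineer, shift_duration, incident_count)
    period: timedelta covering the whole reporting window
    """
    hours = defaultdict(float)
    incidents = defaultdict(int)
    shift_count = defaultdict(int)
    for engineer, duration, incident_count in shifts:
        hours[engineer] += duration.total_seconds() / 3600
        incidents[engineer] += incident_count
        shift_count[engineer] += 1
    period_hours = period.total_seconds() / 3600
    return {
        engineer: {
            "time_share": hours[engineer] / period_hours,
            "incidents_per_shift": incidents[engineer] / shift_count[engineer],
        }
        for engineer in hours
    }

def over_limit(report):
    """Names of engineers whose load exceeds either illustrative limit."""
    return sorted(
        name for name, stats in report.items()
        if stats["time_share"] > MAX_TIME_SHARE
        or stats["incidents_per_shift"] > MAX_INCIDENTS_PER_SHIFT
    )

# Hypothetical shift log for a four-day window.
shift_log = [
    ("ayse", timedelta(hours=12), 1),
    ("ayse", timedelta(hours=12), 4),
    ("bora", timedelta(hours=12), 0),
]
report = on_call_balance(shift_log, period=timedelta(days=4))
print(report)
print("over limit:", over_limit(report))
```

Tracking both numbers side by side is the point: an engineer can be within the time budget and still be overloaded if most incidents land on their shifts.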

Avoiding Inappropriate Operational Load

“Operational overload” is the opposite of a sustainable workload. Its measurable symptoms, such as a high number of tickets or paging events, help quantify goals for workload reduction. Misconfigured monitoring is a common cause of overload; aligning alerts with service-level objectives (SLOs) and grouping related alerts is a good first step to reducing noise. DevOps Research and Assessment (DORA) metrics help the organization acknowledge the situation and start remediation.

DORA Metrics to Measure & Improve
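As for grouping related alerts to reduce noise, one simple approach is to bundle alerts from the same service that arrive close together, so a burst produces one page instead of several. The sketch below assumes alerts are plain dictionaries with `service`, `time`, and `name` keys and uses a five-minute window; both are hypothetical choices for illustration, not the behavior of any particular monitoring tool.

```python
from datetime import datetime, timedelta

def group_alerts(alerts, window=timedelta(minutes=5)):
    """Bundle alerts per service when they arrive within `window`
    of the previous alert in that service's most recent group."""
    groups = []
    latest_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a["time"]):
        group = latest_group.get(alert["service"])
        if group and alert["time"] - group[-1]["time"] <= window:
            group.append(alert)
        else:
            # Too late to join the previous burst: start a new group.
            group = [alert]
            groups.append(group)
            latest_group[alert["service"]] = group
    return groups

# Hypothetical alert stream from a noisy night.
t0 = datetime(2024, 4, 20, 3, 0)
alerts = [
    {"service": "api", "time": t0, "name": "high latency"},
    {"service": "api", "time": t0 + timedelta(minutes=1), "name": "5xx rate"},
    {"service": "db", "time": t0 + timedelta(minutes=2), "name": "replication lag"},
    {"service": "api", "time": t0 + timedelta(minutes=4), "name": "pod restarts"},
    {"service": "api", "time": t0 + timedelta(minutes=30), "name": "high latency"},
]
groups = group_alerts(alerts)
print(len(alerts), "alerts ->", len(groups), "pages")
for group in groups:
    print(group[0]["service"], [a["name"] for a in group])
```

Grouping by service is the crudest possible key; in practice a team might group by SLO or by dependency instead, which is a design choice this sketch leaves open.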

Lastly, another innovation Parny brings is the perspective of SocialOps: social-media-like interaction features such as an interactive timeline, comments, emojis, mentions, likes, dislikes, and advanced alert details are built into Parny’s incident management platform to boost the capabilities of on-call engineering teams. They also bring some joy to the place.

Interactive On-Call Management & Alerting Service

Finally, it’s wise to compare Parny.io against other market leaders like PagerDuty, OpsGenie, or VictorOps, which might offer features better suited to specific requirements or budget constraints.

Ultimately, choosing an on-call management tool should be based on a thorough assessment of the organization’s needs and of how well the tool integrates into existing operations. Trying the tool in a pilot project before fully adopting it can provide valuable insight into its suitability for the environment.

References:

AWS Well-Architected Framework. OPS03-BP03: Escalation is encouraged.

Beyer, B. (2016). Chapter 11: Being On-Call. In Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media.

Parny. On-call management service: Ask Parny AI: Get Detailed Incident Recommendations.
