sound.ambient.moodaudio.url: 'media/mr_robot.mp3'
sound.ambient.moodaudio.description: 'Optional audio for mood: moderate urgency'
sound.ambient.coffee.url: 'media/coffee.mp3'
sound.ambient.coffee.description: 'Optional audio for mood: relaxed coffee break'
moveCount: 0
clueLoginLatencyAndErrors: false
clueReviewSift: false
clueLogEntries: false
clueBruteForceAttack: false
clueKubeCrashes: false
clueNoDeployments: false
clueAlertRunbook: false
clueReproduceError: false
clueOneRegion: false
clueIAM: false
taskAckAlerts: false
taskDeclareIncident: false
taskScaleUpDeployment: false
taskPageTeam: false
taskAddToPIR: false
taskMakeCoffee: false
// Page layout and behavior
config.style.page.theme.override: 'dark'
config.body.transition.name: 'crossfade'
config.body.transition.duration: '800ms'
// Typography
config.style.googleFont: '<link href="https://fonts.googleapis.com/css2?family=Inter:ital,opsz,wght@0,14..32,100..900;1,14..32,100..900&family=Oswald:wght@200..700&family=Source+Sans+3:ital,wght@0,200..900;1,200..900&display=swap" rel="stylesheet">'
config.style.page.font: 'Inter/sans-serif 18'
config.style.dark.page.color: 'rgb(204, 204, 220)'
// Footer contents
config.style.dark.page.footer.borderColor: 'rgb(180,180,180)'
config.footer.right: '{restart link}'
config.style.page.footer.link.font: ''
config.style.page.footer.link.color: 'rgb(110, 159, 255)'
config.style.page.footer.link.active.font: ''
config.style.dark.page.footer.link.active.color: 'white on rgba(61, 113, 217, 0.15)'
// Link style
config.style.page.link.font: ''
config.style.dark.page.link.lineColor: ''
config.style.dark.page.link.color: 'rgb(110, 159, 255)'
config.style.dark.page.link.active.color: 'white on rgba(61, 113, 217, 0.15)'
config.style.page.fork.divider.size: 0
--
Welcome!
On-call Chronicles is a choose-your-own-adventure experience where you step into the shoes of an on-call engineer,
hurriedly untangling a complex, multi-factor incident using Grafana Cloud.
Important:
* You're scored based on moves, not time, so you can take the time to read; but choose carefully. Or: have fun; it's just a game! (Or is it?? Yes, it is.)
* Once you start the game, there's ambient audio. Mute this browser tab now if you prefer.
You're relatively new to on-call at Sidewinder, Inc., since you only joined earlier this year. You've done some
rehearsal and shadowing, though; you'll be fine. Probably.
At the moment, you are fast asleep.
There's an irritating noise in the room, though, as
> [[your phone buzzes again.->Wakeup]]
{embed image: 'media/mobile_on_couch.jpg', alt: 'mobile with Grafana IRM launching'}moveCount: moveCount+1
--
{ambient sound: 'moodaudio'}
Your phone buzzes again and you silence it without looking. What... what time is it?
The room is brightly lit, but the night outside is still and dark; you hear the kitchen light humming faintly.
It feels like the middle of the night; you must have dozed off on the couch.
Your brain is slowly coming online, and your shoulders slump as you recall that it's a Tuesday night,
and you were hoping for a good night's sleep, to be sharp because—
Your eyes fly fully open and you freeze, mid-shuffle into the next room.
On call.
You're on-call, for just the second time, and your manager has reassured you that "it'll be fine"
and "there's a secondary who'll be alerted 30 minutes after your first alert, so worst case still isn't too bad!"
But surely it won't look good if you *completely fail* it.
You scrabble for your mobile just as it politely buzzes for the second time.
Third time?
Even as you recognize the Grafana IRM notifications, though, you feel a wave of fatigue wash over you,
and think back to yesterday, when the engineer on call was paged during an important meeting, and hurried out...
but then slipped back in 20 minutes later muttering about "that oversensitive SLO again".
Actually, that error on top looks kind of familiar.
> [[It's probably nothing; ACK the page and get some rest->ProbablyJustNoise]]
> [[No, safer to check->Wakeup2]]moveCount: moveCount+1
--
It's true, it could be nothing. Or: it could be a truly serious problem, and every second that you stand here, it's getting
worse.
You spin around to dash to your desk and nearly slam into the half-closed door.
Okay, deep breath. 30 seconds more won't matter.
You try to ignore the notifications, fill a glass with cool water, and bring it over to your desk at a more measured pace.
Your monitor bathes your face in a ghostly night-time LED glow as it flickers to life.
{embed image: 'media/zhyar-ibrahim-REu-jM6vQUw-unsplash.jpg', alt: 'laptop and monitor'}
You narrow your eyes and carefully push back the flutters of panic.
Where do you want to start?
> [[Review mobile app notifications->ReviewMobileApp]]
or jump straight to your computer:
> [[Open Grafana Cloud->GrafanaCloudHome]]
> [[Open Slack->StartInSlack]]moveCount: moveCount+1
--
Useful tips in here!
{embed image: 'media/runbook.png', alt: 'Sidewinder runbook'}
It points you to a few useful dashboards for various issues, plus has queries you can use to filter logs
when you're trying to debug the login service.
There are also steps to increase resources for the login service pods; that could be useful.
<div class="clue" >
<img src="media/clue_grot.svg">
<div class="text_container">
<span class="title">You found a clue!</span>
<span>Something similar has happened before. A colleague has created a runbook to help diagnose this kind of problem.</span>
</div>
</div>
[Javascript]
clueAlertRunbook = true;
[continued]
Now:
> [[Check the Asserts Network graph for your services->AssertsMyService]]
> [[Open the Grafana Cloud home->GrafanaCloudHome]]
> {back link}moveCount: moveCount+1
--
[TODO]
Need to know how to zoom in... Remember service name for filter?
Or get from runbook only?
[continue]
The dynamically-balanced graph visualization is a bit mesmerizing, but you're also lost.
{embed image: 'media/asserts_all_services.png', alt: 'Asserts all services'}
Choices:
> [[Filter by service->AssertsMyService]]
> [[back to Grafana Cloud home->GrafanaCloudHome]]moveCount: moveCount+1
--
[TODO]
magically we've gotten a filtered view
Set up clue(s)
can be the same clue we can get elsewhere
do we want to include workbench?
> [[TBD->AssertsWorkbench]]
[continue]
You filtered on the login service, and now it's very clear that your `login-service` is unhappy
...and that the shared `authentication` service may be to blame.
{embed image: 'media/asserts_my_service.png', alt: 'Asserts my service'}
[Javascript]
clueIAM = true;
[continued]
<div class="clue" >
<img src="media/clue_grot.svg">
<div class="text_container">
<span class="title">You found a clue!</span>
<span>Your upstream dependency on the authentication service is contributing to your latency problems.</span>
</div>
</div>
> [[return to Grafana Cloud home->GrafanaCloudHome]]moveCount: moveCount+1
--
[TODO]
What do we learn here?
can be a clue we can also get elsewhere
[continue]
{embed image: 'media/asserts_workbench.png', alt: 'Asserts workbench'}
Choices:
> {back link}
> [[back to Grafana Cloud home->GrafanaCloudHome]]moveCount: moveCount+1
--
You remember a big marketing event was scheduled for somewhere around now.
Could that explain the higher traffic?
You quickly check a few Slack channels, but there's no fast way to figure out "what kind of traffic do we
expect from this and when?"
On the other hand, it won't help that much to chase down these details.
Right now, you just need to do what you can to put out the fire.
> {back link}moveCount: moveCount+1
noClues: false
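// Time bookkeeping: each move represents 6 minutes of in-game time (10 moves = 1 hour).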
pastTheHour: moveCount % 10
hours: (moveCount - pastTheHour) / 10
minutes: pastTheHour * 6
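// Nudges and forced endings: push the player to ack alerts after 4 moves; end the game if no incident is declared by move 15.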
forceAckAlerts: !taskAckAlerts && moveCount >= 4
forceNoIncidentEnding: !taskDeclareIncident && moveCount>=15
--
{ambient sound: 'moodaudio', volume: 1.0}
[note]
This page has some endings: if the player hasn't created an incident yet after 15 moves (1.5 hours), the CTO calls and it's game over.
see endings/NoIncidentEnding.tw
if the overall game goes to over Y moves, it also ends. (let's nail down these details)
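One way to wire that up once the limit is decided (a sketch; "forceTimeoutEnding" and the ending passage are
placeholders): add a var like "forceTimeoutEnding: moveCount >= Y" next to forceNoIncidentEnding in the vars
section above, then branch on it in the body the same way forceNoIncidentEnding is handled, linking to a new
ending passage under endings/.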
[continue]
[if hours < 1]
You glance at the time — {minutes} minutes since you started...
[else]
You glance at the time — {hours} hour(s) and {minutes} minutes since you started...
[continue]
***
[if forceAckAlerts]
Your mobile is buzzing again! ...the same alerts are still firing. Did you [[acknowledge your alerts in OnCall?->AckAlerts]]
[continue]
[if forceNoIncidentEnding]
You've been procrastinating about creating an incident because you still don't understand what is going wrong.
You'd like to chase down another hypothesis, but you're jolted from your focus:
[[is that your phone?->NoIncidentEnding]]
[continue]
[if !forceAckAlerts && !forceNoIncidentEnding]
{embed passage: 'GrafanaCloudHome_Choices'}noClues: false
--
[TODO]
- fix link lists: either fill in flows or remove. Real flows we can do:
- research with a log query: before runbook, get tied up in writing LogQL. After runbook: run those queries.
- dashboards: check high-level dashboard (not enough info) but struggle with Explore queries
[continue]
[note]
Wrapped by GrafanaCloudHome because the page has a "break here" ending, and Chapbook can't nest conditionals.
Jump-off point -- list the actions to jump to.
Some of those (like 'run a custom query in the Explore view') won't work until e.g. they've found the runbook.
'create an incident' isn't wise before we've confirmed that users are impacted -- it's also a dead end UNTIL they
have enough clues.
[continue]
On your main Grafana tab, you think about where to go next.
All the features can be a bit overwhelming sometimes. You wish you had spent more time playing with it, back when
everything wasn't on fire.
**Can you reproduce the problem?**
> [[Try to reproduce errors locally->ReproLocally]]
> [[Sign into prod to see if you hit errors there->ReproOnProd]]
**Dig into service health and dependencies**
> [[Review Service Level Objectives->SLO]]
> [[Check the Asserts service map and workbench->AssertsAllServices]]
> [[Find relevant dashboards->ReviewDashboards]]
**Hunt for contributing factors**
<!--Run some queries to learn more->ExploreQueries-->
> [[Check recent deployments->RecentReleases]]
> [[Check if a big marketing event caused a surge->CheckMarketing]]
**Work with the escalation**
> [[Open OnCall->OnCallLanding]]
[if !taskDeclareIncident]
> [[Declare an Incident->DeclareIncident]]
[else]
> [[Go back to your Incident->GotoIncident]]
[continue]
**And maybe you're ready for some bigger steps**
> [[Page a team for help->IncidentPageTeam]]
> [[Add more CPU and memory to your pods->ScaleUpDeployment]]
Or — or. Your thoughts whirl uselessly for a moment, and you rub your prickly eyes — maybe you need a breather.
> [[Make some coffee->MakeCoffee]]
> [[Take a quick social media break->SocialMediaBreak]]
> [[Try a few stretches->DoStretches]]moveCount: moveCount+1
--
You open OnCall under "Alerts & IRM" in the nav. "Alerts" may be useful, too.
You're less familiar with Alerting, though, so maybe save that for later.
The main OnCall page has tons of alerts, but these are *everyone's*, not just yours.
{embed image: 'media/oncall.png', alt: 'OnCall main screen'}
There's definitely a way to filter down to just your own alerts... how did you do that before?
You puzzle over the filter dropdown for a minute; it's not "Acknowledged by", nor "Invitees Are" (what does that mean?),
and not "Involved Users Are"... this is simpler in the mobile app.
Ah right! The dropdown can scroll, and "Mine" is in there near the bottom.
You see the alerts you were paged for, and browse through again.
[Javascript]
clueLoginLatencyAndErrors = true;
[continued]
From here, you can:
> [[Acknowledge these alerts->AckAlerts]]
[if !taskDeclareIncident]
> [[Declare an Incident->DeclareIncident]]
[else]
> [[Go back to your Incident->GotoIncident]]
[continue]
> [[Go back to Grafana Cloud home->GrafanaCloudHome]]
Opening the top firing alert, you also notice:
> [[a runbook link for the alert->AlertRunbook]]
> [[results from a Sift investigation->Sift]]moveCount: moveCount+1
--
[TODO]
Where / how do we check releases? If there's a dashboard tracking rollouts, do you use it regularly?
Search? Is it locked behind the runbook clue?
Could be *faster* with a runbook link.
Or there are other ways to check... what's most likely? See also deployments in the k8s app
[continue]
You check over recent releases. The login service had new changes released 4 days ago. Probably not related to an issue that just started?
Though the traffic spike might be stressing badly unoptimized code that went out 4 days ago...
<div class="clue" >
<img src="media/clue_grot.svg">
<div class="text_container">
<span class="title">You found a clue!</span>
<span>Errors probably not caused by a recent release</span>
</div>
</div>
[Javascript]
clueNoDeployments = true;
[continued]
> {back link}moveCount: moveCount+2
--
[note]
Don't even bother tracking a clue for this; it's not really useful
[continue]
If you can just *see* the error happening, it'll be a lot easier to start fixing it.
You spin up your local environment, and—
Ah, wait... you stash in-progress code changes, switch back to the main branch, and start the build.
Actually, what version is deployed to prod right now? You can look it up, but you don't have much time. Eh, main is
close enough; any serious bug is very likely still in the codebase.
You spin up your local environment and log in as the admin user you normally test with.
It's near instantaneous.
> {back link}moveCount: moveCount+1
--
Well, if users are hitting high latency and errors on login, maybe you'll hit it, too?
You have test user login details stored in a password management tool; it takes you a minute to get through
the multi-factor authentication, but then you can sign in.
Huh. Login is quick, as usual.
What's going on?
[if clueOneRegion]
Actually... what region is this user hitting? Only prod-west is affected, so maybe this user is hitting prod-east.
You can look it up in the database, but that'd take even more time; the prod database is locked down tight.
[else]
You aren't sure at the moment.
[continue]
At least this does mean that **some** users are still getting normal service.
[Javascript]
clueReproduceError = true;
[continued]
> {back link}moveCount: moveCount+1
--
[TODO]
WIP WIP WIP
Many dashboards are focused on usage more than system health; and an SWE uses these more frequently.
What do they find here?
Can see traffic drop off in prod-west, not east; error rates?
If they've seen the runbook, that could link to more "service health" dashboards.
[continue]
You scan over the dashboards you know already — mostly focused on usage — and take a minute searching through the
dozens of dashboards set up by other teams, but at the moment you mostly just confirm what you already know.
Service isn't recovering on its own, that much is clear.
Some dashboards you find probably *do* include more relevant info, but you struggle to find the right one, and
there are quite a few that you can't confidently interpret. What's COGS? Turn-up? How do you interpret something like CPU
Throttling metrics?
Several dashboards with promising names don't seem to be working at all: maybe they're early drafts, or obsolete.
> {back link}moveCount: moveCount+1
--
You click the top IRM app notification and scan down the alerts.
Here's an example:
{embed image: 'media/mobile_irm_alert.jpg', alt: 'Mobile IRM alert: High SLO burn'}
[if clueLoginLatencyAndErrors]
They're the same alerts you found before. Nothing new here.
[else]
The alerts look to be focused on login failures - high latency and errors.
The latency one looks worse ("burn rate very high" sounds bad), but what are real users seeing?
If a lot of people are getting 5 second logins, that's slower than we *want* for a good experience, but
nothing to panic about.
...
<div class="clue" >
<img src="media/clue_grot.svg">
<div class="text_container">
<span class="title">You found a clue!</span>
<span>There is unusual latency and errors in the login service</span>
</div>
</div>
[continue]
[Javascript]
clueLoginLatencyAndErrors = true;
[continued]
> [[Acknowledge these alerts->AckAlerts]]
You take a deep breath and set your phone down, turning to your computer.
> [[Open Grafana Cloud->GrafanaCloudHome]]
> [[Open Slack->StartInSlack]]moveCount: moveCount+1
--
Thankful that your team has well-defined SLOs, you quickly realize that the Sidewinder Latency SLO has fallen off a cliff. You assume this means customers are actually suffering, but you'd like to know which ones are being impacted.
{embed image: 'media/SLO_performance.png', alt: 'SLO Performance'}
You decide to:
> [[Go back to Grafana home->GrafanaCloudHome]]
> [[Browse to SLO Details->SLODetails]]moveCount: moveCount+1
--
[Javascript]
clueOneRegion = true;
[continued]
Navigating to the SLO Detail Dashboard for the impacted SLO, you see that prod-west is suffering but prod-east isn't.
The HTTP Request Rate also skyrockets > 10x for prod-west for a few minutes. It's still elevated.
Interesting! Probably not something in the code, then: both regions run the same code.
{embed image: 'media/SLO_Details.png', alt: 'SLO Details'}
<div class="clue" >
<img src="media/clue_grot.svg">
<div class="text_container">
<span class="title">You found a clue!</span>
<span>The impact is contained to prod-west</span>
</div>
</div>
Now:
> [[Go back to Grafana Cloud->GrafanaCloudHome]]
> {back link}moveCount: moveCount+1
--
[TODO]
Make sure links are all working well; ...ooh, this page is multiple pages in one. Break out?
[continue]
[if clueReviewSift]
You return to the Sift investigation. It might be worth checking the results one more time.
[continue]
Sift is generally optimistic about the state of your environment, but it does highlight two *interesting* results:
* There have been some recent [[Kubernetes crashes->SiftKubeCrashes]]
* There are some [[new error patterns->SiftErrorPatternLogs]] in relevant logs
[Javascript]
clueReviewSift = true;
[continued]
> [[Go to Grafana home->GrafanaCloudHome]]
> [[Back to OnCall->OnCallLanding]]
> [[Back to Slack->StartInSlack]]
[if !clueKubeCrashes]
Sift has scoured adjacent logs from your environment and found that you have pods crashing - OOMKills!!
The load on the login service is causing pods to fall over.
[continue]
[Javascript]
clueKubeCrashes = true;
[continued]
<div class="clue" >
<img src="media/clue_grot.svg">
<div class="text_container">
<span class="title">You found a clue!</span>
<span>Sift has shown that you have Login pods crashing with Out Of Memory errors. Is this causing the rising errors seen for those endpoints?</span>
</div>
</div>
> [[Go to Grafana home->GrafanaCloudHome]]
> {back link}
[if !clueLogEntries]
Sift has done something remarkable: it has found logs from the shared "authentication" service that your team uses to validate login requests!
[else]
It's the same log entries you've already seen, but it's good confirmation that Sift also thinks there's something suspicious going on.
[continue]
The high-latency responses recorded in the authentication service's logs may be contributing to your own service's high latency.
[Javascript]
clueLogEntries = true;
[continued]
<div class="clue" >
<img src="media/clue_grot.svg">
<div class="text_container">
<span class="title">You found a clue!</span>
<span>High latency responses from the Authentication service are a contributing factor!</span>
</div>
</div>
> [[Go to Grafana home->GrafanaCloudHome]]
> {back link}moveCount: moveCount+1
--
[TODO]
Details to fill in -- and do we link to Sift from here?
[continue]
You check for messages waiting in Slack from Grafana IRM in #coreapp-team-alerts.
Here's the first one:
{embed image: 'media/slack_oncall_notif.png', alt: 'Slack: Post from OnCall'}
[if clueLoginLatencyAndErrors]
They're the same alerts you've already reviewed.
[else]
The alerts look to be focused on login failures - high latency and errors.
The latency one looks worse ("burn rate very high" sounds bad), but what are real users seeing?
If a lot of people are getting 5 second logins, that's slower than we *want* for a good experience, but
nothing to panic about.
...
<div class="clue" >
<img src="media/clue_grot.svg">
<div class="text_container">
<span class="title">You found a clue!</span>
<span>There is unusual latency and errors in the login service</span>
</div>
</div>
[continue]
[Javascript]
clueLoginLatencyAndErrors = true;
[continued]
In the alert itself, you notice a runbook link.
In the Slack thread, you find the results of an automatic Sift investigation.
You might:
> [[Acknowledge these alerts->AckAlerts]]
[if !taskDeclareIncident]
> [[Declare an Incident->DeclareIncident]]
[else]
> [[Go back to your Incident->GotoIncident]]
[continue]
or keep exploring from here:
> [[Open the alert runbook->AlertRunbook]]
> [[Review the Sift check results->Sift]]
> [[Open OnCall->OnCallLanding]]
> {back link}moveCount: moveCount+1
--
You're going to be nervous until you know everything's fully under control, but you
take a few minutes to check over your many open tabs.
You're just adding another note into the incident when Slack pings.
[if taskScaleUpDeployment]
`I've just scanned over the history here - nice work! Quick huddle? I see that you scaled up the deployment. I'll monitor the rollout of the authentication fix and scale back down when it's ready.`
[else]
`I've just scanned over the history here - nice work! Quick huddle? I can take over from here, but if you know why the login service isn't coming up that could help.`
You realize that you were so happy to find the problem with the authentication service that you never mitigated the login timeouts
your customers were seeing. Maybe you should have tried throwing some more CPU and memory at the problem. Oh well, live and learn.
[continue]
Choices:
> [[Open the Post Incident Review doc->AddToPIR]]
or just
> [[review your score->Score]]
[Javascript]
taskAddToPIR = true;
[continued]
You're glad that you were able to detect a customer issue and report it promptly to the authentication team, but... you wish it
could have happened during business hours.
You open the draft PIR document and add an action item for the authentication team - they should have an SLO that fires for
this so they don't rely on you next time.
<div class="clue" >
<img src="media/task_grot.svg">
<div class="text_container">
<span class="title">Key task complete!</span>
<span>You contributed an action item to the Post Incident Review - you're behaving like a real engineer!</span>
</div>
</div>
> [[review your score->Score]]moveCount: moveCount+1
--
It doesn't register for a few long moments that your phone is ringing, and you stare at it as if you'd
just spotted a frog in your coffee mug.
You don't have much time; the clock is ticking! But you peer at your phone; you don't recognize the number.
[after 4s]
"Uh, hel—
[after 5s; append]
hello?"
[after 6s]
The voice on the other end is scratchy — clearly they've not been awake long — but brisk and urgent, and you
recognize it immediately.
It's Sidewinder, Inc's CTO.
*Why is the CTO calling you?*
A wave of terrible possibilities floods through your mind and you completely miss the next few seconds.
"—to panic, just tell me what you've found up thus far, and I can—"
_Okay, so maybe it's not the huge disaster it feels like?_
"—to hand this off as of right now, and we'll need to figure out tomorrow how this could have—"
No, it's a huge disaster. Your ... boss's boss's boss quickly explains the unfolding damage.
You give halting answers to a few more rapid fire questions, and you both hang up.
You stare at the phone in your hand.
It seems that, because you were so focused on debugging by yourself, you never raised an incident or reached out to
anyone. Meanwhile, Sidewinder's biggest enterprise-level customer struggled with failing logins for a while before calling
in a dreaded `P0` escalation: "drop everything until the fire is out" priority.
**Lots** of people got woken up; they were upset that our SLOs and alerts had failed to catch the problem earlier...
and quickly found that the problem **was** caught. You just hadn't told anyone.
{embed image: 'media/oregon_trail_died.png', alt: 'Oregon Trail logo'}
[continue]
> [[Sigh. Let's check the score.->Score]]moveCount: moveCount+1
--
It's a pain that you were woken up, but this looks like just that same oversensitive SLO.
Let someone sort out the noisy alerting during working hours.
Opening the app quickly to acknowledge the alert, you decide to ask tomorrow about how to get this fixed.
It's not such a big deal when someone gets interrupted during the workday, but this...
You feel physically terrible, your sleep has been sabotaged, and all for nothing?
You head back to bed.
---
[after 7s]
4 hours later, you awaken to find a looong line of Slack notifications waiting for you.
Unfortunately, this time the disaster was real, and service was down for nearly half the company's users for hours.
{embed image: 'media/oregon_trail_died.png', alt: 'Oregon Trail logo'}
[continue]
> [[Walk towards the light->Score]]pastTheHour: moveCount % 10
hours: (moveCount - pastTheHour) / 10
minutes: pastTheHour * 6
clueCount: 0
taskCount: 0
finalScore: 0
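// clueCount, taskCount, and finalScore are tallied in the [Javascript] block below.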
sound.ambient.moodaudio.playing: false
--
[TODO]
Show what clues & tasks they found/completed, and time elapsed.
Calculate a score and named rating?
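One possible shape for the named rating (a sketch only; the variable name "rating", the tier names, and the
thresholds below are placeholders, not decided): initialize "rating: ''" in this passage's vars section, then
extend the [Javascript] block at the bottom with something like
rating = 'Survivor';
if (finalScore >= 60) rating = 'Incident Commander';
else if (finalScore >= 30) rating = 'Steady Responder';
and render "{rating}" next to the score in the scorecard header.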
Current list from Start.tw:
clueLoginLatencyAndErrors: false
clueOneRegion: false
clueReviewSift: false
clueLogEntries: false
clueBruteForceAttack: false
clueKubeCrashes: false
clueNoDeployments: false
clueAlertRunbook: false
clueReproduceError: false
clueIAM: false
taskAckAlerts: false
taskDeclareIncident: false
taskScaleUpDeployment: false
taskPageTeam: false
taskAddToPIR: false
taskMakeCoffee: false
[continue]
You've made it — for better or for worse.
You can get some rest, spend some time reviewing the experience, and make some tweaks before the next time, towards:
* a less fragile system
* more complete monitoring & less noisy alerts
* truly helpful & well-tested runbooks
---
[Javascript]
// Tally the clues the player found; 9 clue flags count toward the score.
clueLoginLatencyAndErrors && clueCount++;
clueOneRegion && clueCount++;
clueAlertRunbook && clueCount++;
clueIAM && clueCount++;
clueBruteForceAttack && clueCount++;
clueKubeCrashes && clueCount++;
clueNoDeployments && clueCount++;
clueReproduceError && clueCount++;
clueLogEntries && clueCount++;
// Tally the key tasks completed (taskMakeCoffee is tracked but not scored).
taskAckAlerts && taskCount++;
taskDeclareIncident && taskCount++;
taskPageTeam && taskCount++;
taskScaleUpDeployment && taskCount++;
taskAddToPIR && taskCount++;
// Clues are worth 10 points, key tasks 20; each move costs 1.
finalScore = clueCount * 10 + taskCount * 20 - moveCount;
// Bar widths (as percentages) for the scorecard charts below.
clueFill = Math.floor(clueCount/9 * 100);
clueEmpty = 100-clueFill;
taskFill = Math.floor(taskCount/5 * 100);
taskEmpty = 100-taskFill;
[continued]
<div class="scorecard">
<div class="header">
<span class="title">On-call Chronicles<br/>scorecard</span>
<div class="score">
<span class="value">{finalScore}</span>
<span class="label">score</span>
</div>
</div>
<div class="valueRow">
<span>Clues found</span>
<span class="actualValue">{clueCount}/9</span>
</div>
<div class="chartRow">
[Javascript]
write("<div class='fullPart' style='width:" + clueFill +"%'></div>");
write("<div class='emptyPart' style='width:" + clueEmpty +"%'></div>");
[continued]
</div>
<div class="valueRow">
<span>Key tasks complete</span>
<span class="actualValue">{taskCount}/5</span>
</div>
<div class="chartRow">
[Javascript]
write("<div class='fullPart' style='width:" + taskFill +"%'></div>");
write("<div class='emptyPart' style='width:" + taskEmpty +"%'></div>");
[continued]
</div>
<div class="valueRow">
<span>Time spent</span>
<span class="actualValue">{hours}h {minutes}m ({moveCount} moves)</span>
</div>
</div>
And you lived to be on-call another day.
Congratulations, or sorry, and we'll see you soon!moveCount: moveCount+1
--
Everyone is relieved to hear the problem is not on your end.
You send out an email to customers letting them know about the AWS outage, its impact on your services, and the expected recovery time.
Unfortunately, the expected recovery time passes but users are still reporting problems. You grimace and send out a follow-up email, explaining that your team is still investigating the situation.
[[Back to Grafana->_example_Introduction]]moveCount: moveCount+1
--
[TODO]
Accurate alert & context details
[continue]
The Grafana Cloud home page *also* looks different — styling and other parts of the UI, but also
there's a banner at the top saying **Jump from mobile?**
In smaller text, two links:
> [[Open Escalation 'High SLO burn login flow' in IRM->FutureIRM]]
> [[Start investigation for this escalation->FutureStartInvestigation]]
You can also explore other parts of Grafana Cloud from here.
*This is a stub; you can help fill it out, if you're interested.*moveCount: moveCount+1
--
[TODO]
Should reflect the status from the other part of the game, right?
Check tasks... did we open an incident? Is it linked to just your team, or both (yours + upstream auth)?
> Open Slack->FutureSlack ?
[continue]
You're feeling a little light-headed, and collapse in slow motion into your chair, still staring at your phone.
The same escalation you were looking at earlier is there... but now in the app you can see at a glance that:
* Your service is seriously affected — the relevant SLO is burning fast, and your error budget will be gone in 2 hours.
* One upstream service is also affected (health measures harmed) though their SLOs don't seem to be in trouble yet.
* There's no related active incident for either your service or the upstream one.
* Sift has isolated two contributing causes, one capacity-related and one security-related.
You set your phone down on the desk, torn between the urgency of the escalation and the urgency of *understanding if
the time-space continuum is crumbling*.
You choose wisely and turn to the computer.
> [[Open Grafana Cloud->FutureCloudHome]]moveCount: moveCount+1
--
A new investigation side panel slides open crisply.
Your escalation is already there, along with a tidy subset of the service graph, sparklines for SLOs, and related
Sift results.
`Drag & drop clues into this pane to build your case`
*This is a stub; you can help fill it out, if you're interested.*
{back link}moveCount: moveCount
--
[note]
moveCount unchanged
and intro Future Grafana!
[continue]
What was that flash? Did something happen to the lights?
Grimacing, you crouch to scoop up the broken stone, which seems to have split into two perfectly even halves
like a walnut shell.
But even as you shift your weight, you catch sight of the same stone, unbroken and still on the shelf, warmly lit
by the afternoon sunlight.
Afternoon? You look to the window, where a setting sun is just vanishing behind the neighboring building. A breath later,
stars fill the night sky.
You are standing in the kitchen, your coffee in— no, you are crouching by the shelf again, lifting up the two pieces of
the split stone, your face bathed in the fierce glow streaming from—
No, the stone is back on the shelf, unbroken. The photo stands next to it, undisturbed.
You stand in front of the bookshelf, just straightening up again from your stretch.
That— that was odd.
What just happened? You barely breathe for a long moment but everything just seems... normal.
You blink. You are standing alone in a quiet room. You decide you'll definitely need a good nap when this is wrapped up.
And before that you have a lot to do!
You've taken a single step back towards your desk when your phone makes a gentle but insistent hum, *a sound it
has never made before in the two years you have owned it*, and you drop into the Grafana IRM app before you even realize what
you're doing.
> [[It looks different.->FutureMobile]]moveCount: moveCount+1
--
[if taskAckAlerts]
You scan down the list of alerts: you already acknowledged these, it looks like, and that status hasn't expired yet.
[else]
Smart thinking: if you hadn't acknowledged these, the escalation would probably start paging other people.
<div class="clue" >
<img src="media/task_grot.svg">
<div class="text_container">
<span class="title">Key task complete!</span>
<span>You acknowledged the firing alerts. Coworkers who come online will see you're working this issue.</span>
</div>
</div>
[continue]
[JavaScript]
taskAckAlerts = true
[continued]
> {back link}moveCount: moveCount+1
--
[TODO]
Check first to see that they have enough clues!
Need: clueLoginLatencyAndErrors
Also (partly just in the passage text), the guideline is roughly:
- There is visible impact to customers
- Coordination is needed between multiple teams
- Issue is unresolved after an hour of analysis
- There's potential for financial impact
BUT it's better to declare and realize it wasn't needed, than NOT declare it.
Render the various clues from SLOs, Asserts, etc. here?
* It looks like most users in prod-west are affected, etc.
* There's a service failing (for unclear reasons) that you'll probably need help getting back online.
[continue]
You review the clues you've collected so far:
[if clueLoginLatencyAndErrors]
* The login service is suffering from high latency & rising errors
[continue]
[if clueOneRegion]
* Only one region, prod-west, is affected... not prod-east
[continue]
[if clueReproduceError]
* Customers seem to be affected, though you weren't able to reproduce the problem yourself.
[continue]
[if clueNoDeployments]
* There wasn't any clear connection to a recent deployment.
[continue]
[if clueIAM || clueLogEntries]
* The upstream authentication service seems to be part of the problem.
[continue]
You still have a lot of questions, but this isn't a false alarm — time to declare an incident!
{embed image: 'media/declare_incident.png', alt: 'Declare incident'}
You fill in the details and set yourself as Investigator to start with.
<div class="clue" >
<img src="media/task_grot.svg">
<div class="text_container">
<span class="title">Key task complete!</span>
<span>With an incident declared, you can keep a paper trail of the response. It's also a collaboration hub; you'll be able to share updates and clues with colleagues.</span>
</div>
</div>
[JavaScript]
taskDeclareIncident = true
[continued]
What next?
You can try getting the login service back up, and probably someone will need to reach out to customers.
[if clueIAM || clueLogEntries]
As for the struggling Authentication service... that's another team.
> [[Page @authentication-oncall?->GotoIncident]]
[continue]
> {back link}moveCount: moveCount-1
--
[TODO]
moveCount--
and: Easter egg to access future Grafana.
[continue]
You stand up straight and wince as your spine unkinks. Whoa, maybe you needed this more than you realized...
Closing your eyes and consciously relaxing your face, you massage your forehead and feel the threat of a headache
recede a little.
Maybe get more space? You step toward the middle of the room, bending to one side then the other,
looking up to the ceiling — somehow every muscle is stiff, even your neck!
You don't have much time to waste — you wish there were a more efficient way to untangle incidents! — so you try
stretching out a shoulder at the same time.
Overbalancing, you lurch heavily against a bookcase as the room spins suddenly.
A framed photo topples backwards, and something rolls along the shelf for a moment before tumbling off.
Horror on your face, you make a desperate swipe to catch it, miss, and it strikes the floor with a sharp crack.
Just as you recognize it — that odd jagged stone that "wild Auntie V" brought back from her Great Antarctic Expedition! —
there's a bright orange flash in the room...
and that's when [[things get weird.->TimeJump]]moveCount: moveCount+1
--
[TODO]
Not sure what else to do here, maybe options to add evidence to the incident and tag with robot emoji
[continue]
[JavaScript]
// Either auth-related clue justifies paging the team (matches the display condition below).
if (clueIAM || clueLogEntries) {
taskPageTeam = true
}
[continued]
[if clueIAM || clueLogEntries]
You already declared an incident when you realized your users were getting timeout errors on your login page,
but now that you know the authentication team is involved...
You jump into the incident slack channel and page `@auth-team-oncall` to let them know what's going on.
<div class="clue" >
<img src="media/task_grot.svg">
<div class="text_container">
<span class="title">Key task complete!</span>
<span>You paged the authentication team. Now that they're aware of the problem, they can jump in to help.</span>
</div>
</div>
{embed image: 'media/incident_help.png', alt: 'empty incident slack channel '}
What do you do now?
> [[Trust that the Auth team has it under control->HelpArrives]]
> [[Go back to Grafana and keep exploring->GrafanaCloudHome]]
[else]
You navigate to the autocreated Incident channel and see only the blinking cursor staring back at you. Taunting your
lack of knowledge but providing no useful criticism.
{embed image: 'media/incident_empty.png', alt: 'empty incident slack channel '}
[continue]
> {back link}moveCount: moveCount+1
--
[TODO]
Already declared incident? (We're paging from there)
Clues needed:
must already know which team to contact for help: is this for an upstream failure?
Q: is the Incident the right place to page someone from?
Other options are inside OnCall, or just from Slack (that could be a deadend, actually)
This is an ending path (no back button if they get here)
[continue]
[if !taskDeclareIncident]
Oh, right, you haven't declared an Incident yet.
It'll be hard to coordinate with anyone else at this stage.
> {back link}
[continue]
[if !taskPageTeam && taskDeclareIncident && (clueLogEntries || clueIAM)]
...
{embed image: 'media/incident_page_team.png', alt: 'Incident: page team'}
You open up the incident again.
Now that you know the authentication service is involved, you decide to update the incident's "paged participants" list.
<div class="clue" >
<img src="media/task_grot.svg">
<div class="text_container">
<span class="title">Key task complete!</span>
<span>You paged the authentication team. Now that they're aware of the problem, they can jump in to help.</span>
</div>
</div>
> [[Go to the Incident Slack channel to await acknowledgement from the authentication team->GotoIncident]]
[continue]
[if taskDeclareIncident && !(clueLogEntries || clueIAM)]
You stare at the Incident page for a minute. What team do you call, though?
You'll need to narrow down the problem a little before you can pick one.
{embed image: 'media/incident_page_team.png', alt: 'Incident: page team'}
> {back link}
[continue]
[if taskDeclareIncident && taskPageTeam]
You've already paged the authentication team.
> [[Go to the Incident Slack channel to await acknowledgement from the authentication team->GotoIncident]]
[continue]
[JavaScript]
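// The task only counts as complete when an incident exists and an auth-related clue justifies the page.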
taskPageTeam = taskDeclareIncident && (clueLogEntries || clueIAM)
[continued]moveCount: moveCount-1
visitCount: passage.visits
visitCount0indexed: visitCount - 1
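// Rotate through the tips below: tipIndex cycles 1..tipsAvailable as visits accumulate.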
tipsAvailable: 3
tipIndex: visitCount0indexed % tipsAvailable + 1
--
{ambient sound: 'coffee', volume: 0.3}
[note]
Making coffee isn't required! But this is where to find maybe-useful clues.
This page loops through available tips (see below).
Also: remove 1 from moveCount (so this doesn't actually cost you moves)
TO ADD MORE TIPS:
* increase the "tipsAvailable" variable above
* add a new [if tipIndex === X] and [continue] section below
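For example, a fourth tip would be (a sketch; the tip text is a placeholder): change "tipsAvailable: 3" to
"tipsAvailable: 4" in the vars section above, then add a new section alongside the existing tips below -- a line
reading "[if tipIndex === 4]", a line or two of tip text, and a closing "[continue]" line.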
[continue]
You stand up and close your eyes for a moment, taking a few slow breaths.
Massaging your lower back as you go, you shuffle out to the kitchen.
[if moveCount > 25]
The room was dark when this all started, but now there's early sunlight peeking through the window and
brightening the room.
Outside, the familiar noises of the world starting to wake up have replaced the eerie nighttime hush.
[continue]
[if visitCount === 1]
You absently prepare a coffee, puzzling over the clues you've seen so far.
[else]
You absently prepare coffee number {visitCount}, puzzling over the clues you've seen so far.
[continue]
---
[if tipIndex === 1]
You started out thinking about "the problem" and searching for a "root cause", but ... of course more than
one thing can happen at once; what if several changes, or several outside events, are combining to cause chaos?
[continue]
[if tipIndex === 2]
Priorities... what's your real goal? Definitely not "fix it, no matter what it takes," because for many issues,
it'll be far faster for some other team to do it.
Ah, and you can't go paging other teams at the first hint of a problem, either: what if you choose the wrong team?
What if the "issue" actually isn't affecting anyone?
[continue]
[if tipIndex === 3]
In the end, you think, it all comes down to how quickly you can see how serious the problem is, see which services
are involved, and either make a safe fix (if you can) or bring in the right people... maybe both.
Now that you think about it, you can probably make this runbook even better, tomorrow.
[continue]
---
You take an appreciative sip, then bring the coffee with you back over to your desk.
[JavaScript]
taskMakeCoffee = true
[continued]
> {back link}moveCount: moveCount+1
--
[if taskScaleUpDeployment]
You've already tried to scale up the deployment. It's not wise to burn more 💲💲💲 without a better theory about what's happening.
[else]
You decide that more CPU and memory for each pod might help, and you add a few extra replicas, hoping that will mitigate the customer impact while you continue to investigate.
<div class="clue" >
<img src="media/task_grot.svg">
<div class="text_container">
<span class="title">Key task complete!</span>
<span>You scaled up your deployment. It's not a permanent fix, but it will buy you more time to investigate.</span>
</div>
</div>
[continue]
[JavaScript]
taskScaleUpDeployment = true
[continued]
> {back link}moveCount: moveCount+3
--
[note]
Extra time vanishes... moveCount+3
[continue]
You drop into BlueSky on your mobile, just to relax your brain from forced focus for a minute.
You aren't disappointed; it sounds like that guy who shot a healthcare CEO turned up at a McDonalds in Pennsylvania,
the South Korean president is probably going to be forced to resign, and Elon Musk is up to something fishy.
Also, AI is destroying the world but also revolutionizing it, and you start writing a heated reply
in a thread on performance optimization — but then sigh and delete the draft.
You'd really better get back to the incident.
> {back link}