Glossary

This glossary walks you through the terminology used within Squadcast, along with other terms relevant to the incident management and SRE space.

Roles in Squadcast

Account Owner

An Account Owner is the root user of your organization - they have full access. The Account Owner can manage the account subscription settings and billing. Furthermore, Squadcast sends subscription and payment-related update e-mails to the account owner.

Squadcast assigns account ownership to the user who signs up for a Squadcast account. After that, the account owner can manage permissions for each added user.

An Account Owner has the ability to:

  1. Access all billing information

  2. Add new users/admins/stakeholders

  3. Delete users/admins/stakeholders

  4. Create/edit/delete on-call schedules

  5. Create/edit/delete escalation policies

  6. Create/edit/delete services

  7. Create/edit/delete status pages

  8. Add/edit/delete postmortem templates

  9. Change the account owner or delete the account

Users

Users are typically folks who go on-call and need to be notified every time an incident is triggered. Users can only access the configurations that they’re part of, and they can only access the incidents for teams that they’re a part of. Users can be granted additional permissions controlled by RBAC.

A user can:

  1. Create/edit/delete on-call schedules

  2. Create/edit/delete escalation policies

  3. Create/edit/delete services

  4. Create/edit/delete status pages

Stakeholders

Stakeholders are individuals or groups within the organization who take an interest in, and are impacted by, the outcomes of the incident management process.

Stakeholders have view-only access to all incidents. By default, they are not notified of any incidents created in Squadcast. They can also create incidents manually if they notice something wrong and want to notify the on-call team. They can add notes to an incident and act as an incident watcher.

A stakeholder can:

  1. Access the incident dashboard

  2. Create incidents from the dashboard and assign them to a user/admin/account owner

  3. Chat in the war room

  4. Watch Incidents

  5. Add Incident Notes

  6. View/update the status page

  7. Add tags to an incident

  8. View the analytics page

Teams

Teams are used to segregate data and have different environments for different functional units. By default, all the users are added to the default team.

Note: The default team cannot be deleted.

Squads

Squads are sub-groups of people who handle a specific functionality, service, or project within the team. Squads are handy when you need to notify the whole group together, for instance, when a coordinated response is required for high-urgency, high-complexity incidents, or at the end of an escalation policy when nobody has acknowledged the incident.

Examples:

  • Payment gateway Squad

  • Backend Squad

  • Frontend Squad

  • All Hands

My Profile

My Profile holds the contact information of a user. It also displays the squads, schedules, and escalation policies that the user is a part of, along with their MTTA and MTTR for that particular organization. Additionally, it displays the user's on-call shifts.

Note:

You can set your own customized notification rules, which define how you want to be notified and how long after the incident trigger.

However, you cannot set the notification rules for any other user.

Notification Rules

Notification Rules are rules that determine how an individual user is notified of an incident assigned to them. You can set up rules to notify you on any of the following notification channels:

  1. Phone Call

  2. SMS

  3. Email

  4. Push Notification from the Squadcast mobile app

Note:

You can set up a rule to be notified immediately after an incident trigger or at any time interval (in minutes). You can add as many rules as you want in the rule chain.
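As an illustration of how a rule chain like the one described above could be modelled, here is a minimal Python sketch. The `NotificationRule` class, the `due_rules` helper, and the channel names are hypothetical and are not part of any Squadcast API; the sketch only shows which rules would have fired a given number of minutes after the incident trigger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRule:
    delay_minutes: int  # minutes after the incident trigger; 0 means immediately
    channel: str        # assumed labels: "phone", "sms", "email", or "push"


def due_rules(rules, minutes_since_trigger):
    """Return the rules whose delay has already elapsed, ordered by delay."""
    fired = [r for r in rules if r.delay_minutes <= minutes_since_trigger]
    return sorted(fired, key=lambda r: r.delay_minutes)


# Hypothetical rule chain: push immediately, SMS after 5 minutes, phone call after 10.
my_rules = [
    NotificationRule(0, "push"),
    NotificationRule(5, "sms"),
    NotificationRule(10, "phone"),
]

channels_after_six_minutes = [r.channel for r in due_rules(my_rules, 6)]
```

With this chain, six minutes after the trigger the push and SMS rules have fired, while the phone call is still pending.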

Dashboard

The Dashboard is the first screen that appears when you log in to your Squadcast account. It has two sections:

  1. Summary Section: The top of the page holds the incident summary where you will be able to see the number of incidents distributed by their state.

  2. Incidents Section: The bottom of the page holds all the existing and incoming incidents with all the details associated with them.

The dashboard supports the following functionalities:

  1. Toggle function: You can use the toggle button to see incidents assigned to just you or to the entire organization. The toggle button can be found in the top left corner of the dashboard page.

  2. Incident states: You can see the number of incidents in each of the states - triggered, acknowledged, resolved and suppressed.

  3. User metrics: You can use the dashboard toggle function to see your MTTA & MTTR or that of your entire organization.

  4. Filter incident activity by time: You can filter to see incidents from last week, last month, last year or for a custom date range. This can be done using the Last Week, Last Month, Last Year and Custom Range buttons on the top right corner of the dashboard page.

  5. Incident Filter: In the incidents section, incidents can be filtered by impacted service (service names), incident source (alert source), and assignee (name of a user, squad, or escalation policy).

  6. Bulk Actions: You can acknowledge or resolve incidents in bulk using the Actions button in the incidents section of the dashboard page.

MTTA

Mean Time To Acknowledge is the average time taken to acknowledge incidents. You can use the toggle switch to view the MTTA for yourself or the organization.

Note:

The MTTA is calculated as a separate metric for every organization that you are a part of on Squadcast.

MTTR

Mean Time To Resolve is the average time taken to resolve incidents. You can use the toggle switch to view the MTTR for yourself or the organization.

Note:

The MTTR is calculated as a separate metric for every organization that you are a part of on Squadcast.
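The two averages above can be sketched in a few lines of Python. This is an illustrative calculation only: the dictionary field names are hypothetical, and measuring both metrics from the trigger time and skipping incidents that are still open are assumptions, not a statement of how Squadcast computes them.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time(incidents, end_field):
    """Average minutes from 'triggered' to end_field, skipping incidents that lack it."""
    durations = [
        (i[end_field] - i["triggered"]).total_seconds() / 60
        for i in incidents
        if i.get(end_field) is not None
    ]
    return mean(durations) if durations else None


t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    {"triggered": t0, "acknowledged": t0 + timedelta(minutes=2), "resolved": t0 + timedelta(minutes=30)},
    {"triggered": t0, "acknowledged": t0 + timedelta(minutes=4), "resolved": t0 + timedelta(minutes=50)},
    {"triggered": t0, "acknowledged": None, "resolved": None},  # still open, ignored
]

mtta = mean_time(incidents, "acknowledged")  # 3.0 minutes
mttr = mean_time(incidents, "resolved")      # 40.0 minutes
```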

Alert

An Alert is an incoming JSON payload sent to Squadcast from any alerting tool. Alerts are sent into Squadcast through alert source integrations that you can find here. Alerts can be of different types: informational, warning, or actionable. You can also send in alerts through our API or Email Integration.

Alert Forwarding

Alert Forwarding is used to forward one’s alerts to another on-call user for a period of time. It can be accessed from the Users page on the navigation sidebar.

Alert Forwarding is typically used if the on-call user is sick, on vacation, or has to step away due to an emergency and wants another user to fill in for their on-call shift. Alert Forwarding is also known as Vacation Mode.

Incident

An incident can be made up of multiple alerts or can be a standalone incident that is service- or customer-impacting. An incident is triggered within a service via the alert source integration. This then sets off the notification for the on-call user as per the service's escalation policy. When an incident is triggered, it stays in the Triggered state until the on-call user acknowledges it.

When an incident is triggered, it comes with the following information:

  1. Incident ID: The incident number on Squadcast. Incident numbers follow the order in which incidents are pushed into Squadcast.

  2. Incident Name: The name of an incident. This typically describes the nature of the incident.

  3. Incident Description: A short summary of the incident. Supporting links can be added here to give more context to the incident.

  4. Impact on: The name of the service for which the incident was triggered

  5. Created Via: Alert source through which the incident was created

  6. Assigned To: The name of the user/squad/escalation policy that the incident was assigned to

  7. Status: The incident state: triggered, acknowledged, resolved, or suppressed.

  8. Tags: Tags are added to an incident to classify it however best fits your incident management process (ex: sev:high or sev:critical; frontend or backend)

States of an Incident

Triggered

An incident is considered to be in the triggered state before any user responds to the incident notification. Once an incident is triggered it will notify the on-call user(s) based on their notification rules. This also means that the incident is in the open state.

Acknowledged

An incident is considered to be in the acknowledged state when a user has acknowledged an incident and is working on resolving it.

Resolved

An incident is considered resolved when the user has fixed the issue and they want the incident to be closed. Once an incident is resolved, no additional notifications will be sent and the incident cannot be opened again. This is one of the two final states on the Squadcast platform.

Suppressed

All incidents that match any of the suppression rules configured for a service automatically go into the suppressed state. This and resolved are the two final states on the Squadcast platform.
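The four states above can be pictured as a small state machine. The following Python sketch is illustrative only: the exact set of allowed transitions is an assumption made for the example (in particular, that a triggered incident can be resolved directly), while the two final states follow the descriptions above.

```python
from enum import Enum


class State(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    SUPPRESSED = "suppressed"


# Assumed transition table; final states have no outgoing transitions.
TRANSITIONS = {
    State.TRIGGERED: {State.ACKNOWLEDGED, State.RESOLVED},
    State.ACKNOWLEDGED: {State.RESOLVED},
    State.RESOLVED: set(),
    State.SUPPRESSED: set(),
}


def is_final(state):
    """A state is final when nothing can follow it."""
    return not TRANSITIONS[state]


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, `transition(State.TRIGGERED, State.ACKNOWLEDGED)` succeeds, while any attempt to leave `State.RESOLVED` raises a `ValueError`, mirroring the rule that a resolved incident cannot be opened again.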

Incident Page

The incident page holds all the details associated with an incident. Each incident will open into an incident page. The page has three main sections:

Incident Details

The Incident Details section gives you the context of the incident by showing details such as the name, description, impacted service, alert source, and tag(s) at the top of the page. The page also holds all the actions you can take on the incident:

  • Acknowledge

  • Resolve

  • More Actions

Tags

Incident Tags add context to your incidents and help classify them. You can add as many tags as you need and map them to relevant information.

You can configure tags from the Tagging Rules associated with a service. You can configure rules against the incident JSON to automatically add tags when incidents are triggered. To learn more about how to configure this, click here.

Incident Notes

Incident Notes enable you to add important notes for you and your team that can help mitigate an incident faster. You can @mention specific users or teams in the Notes section to collaborate with them. This also adds them as Incident Watchers.

Incident Timeline

You can access the Incident Timeline from the Incident Details page in the web app, where it is displayed on the right-hand side of the page. The Incident Timeline shows, in reverse chronological order, when the incident was first Triggered and Assigned, who Acknowledged or Re-assigned it, and who Resolved it and when. The incident timeline can be exported in PDF and MD formats.

Squadcast Actions

Squadcast Actions let you take actions directly from Squadcast in response to incidents by clicking the More Actions button on the incident page. Squadcast Actions are typically used to mitigate customer-impacting issues as soon as possible. In some cases, this resolves the issue; in others, there is a need for longer-term remediation. This is left to the user and team to decide and act on.

Today the platform has the following Actions:

  1. Manual Webhook Triggers

Some simple examples of actions are rebuilding your project, rolling back to the previous build, and rebooting a server. You can choose to build a repository of actions, including more complex ones, to run from Squadcast.

Schedules

Schedules define on-call rotations to ensure coverage at all times and to distribute load across your team members. Schedules can be set up with whatever coverage you need: 24x7, Monday to Friday, etc.

Note:

A Schedule must be added to an Escalation Policy for it to be active.

Rotation Types

Rotation types determine how schedules function. Rotation types can be set to have users on-call for a day at a time or a week at a time, or the rotation can be customized to any specified number of hours, days, or weeks.
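To make the idea of a rotation concrete, here is a minimal Python sketch of a round-robin rotation with a fixed shift length. The `on_call_user` function and the user names are hypothetical; the sketch assumes every user takes one shift in turn and ignores restrictions, overrides, and gaps.

```python
from datetime import datetime, timedelta


def on_call_user(users, rotation_start, shift_length, at):
    """Return the user on call at time `at` for a simple round-robin rotation."""
    if at < rotation_start:
        raise ValueError("rotation has not started yet")
    # Dividing two timedeltas with // gives the number of whole shifts elapsed.
    shifts_elapsed = (at - rotation_start) // shift_length
    return users[shifts_elapsed % len(users)]


users = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1)
weekly = timedelta(weeks=1)

second_week = on_call_user(users, start, weekly, datetime(2024, 1, 10))  # "bob"
fourth_week = on_call_user(users, start, weekly, datetime(2024, 1, 22))  # "alice"
```

Changing `shift_length` to `timedelta(days=1)` or `timedelta(hours=12)` gives a daily or custom rotation with the same logic.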

On-Call Restrictions

Restrictions on an on-call schedule determine what hours during the day, and which days, a user is on-call. You can restrict on-call shifts to daily or weekly for a period of time. No notifications will be triggered if an incident is triggered at any time outside of this restriction.

Gaps in Schedule

A gap in the schedule indicates that no one is on call for a certain amount of time. This is typically seen when custom rotations are created. In this case, a primary on-call rotation is used as a fallback layer for incidents that may occur during the schedule gaps. This means that if an incident occurs in a gap period, it is automatically sent to the user on the primary rotation layer.
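Finding such gaps is an interval-coverage problem. The sketch below is a generic Python illustration, not Squadcast's implementation: `find_gaps` is a hypothetical helper, and shifts are given as plain (start, end) hours of a single day for simplicity.

```python
def find_gaps(shifts, window_start, window_end):
    """Return the (start, end) ranges inside the window that no shift covers."""
    gaps = []
    cursor = window_start  # everything before the cursor is known to be covered
    for start, end in sorted(shifts):
        if start > cursor:
            gaps.append((cursor, min(start, window_end)))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Hypothetical custom rotation: one shift from 09:00 to 17:00, another from 20:00 to 24:00.
gaps = find_gaps([(9, 17), (20, 24)], 0, 24)  # [(0, 9), (17, 20)]
```

The hours reported as gaps are the periods a fallback rotation layer would need to cover.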

Rotation Layers

Rotation layers are used to create a fallback layer when there are schedule gaps.

Escalation Policies

An Escalation Policy is the chain of escalations that determines who should be notified first, second and so on when an incident is triggered. Escalation policies are attached to a specific service. The same escalation policy can be attached to multiple services.

An escalation policy can include individual users, multiple users, squads, and schedules.

You will need to add the below details to create an escalation policy:

  1. Policy Name: The name of the escalation policy (Ex: backend escalation)

  2. Policy Description: The description of the policy (Ex: Incidents triggered for all the backend services should be routed to the backend escalation policy)

  3. Rules

  4. User or Squad or Schedule: Add users, squads or schedules to the rule

  5. Escalation After: The time period after which an incident that is still in the triggered state is escalated to the next rule in the policy. This time period can be set to any amount of time (in minutes).

  6. Escalate if: Incident is not acknowledged. Currently, this is the only supported condition. Support for escalating when an incident is not resolved will be added soon.

Escalation Rule

Multiple escalation rules make up an escalation policy. Typically, each escalation rule represents a different level of on-call duty. The first rule in the policy determines who gets notified first about the triggered incident. This can be a user, multiple users, squads, or schedules.

If the incident is still in the triggered state after the notifications have gone out to the users/squad/schedule and the time period of the first escalation rule has elapsed, then the user/squad/schedule on the second rule of the escalation policy is notified, and so on.
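The escalation chain described above can be sketched as follows. This Python example is a simplified model under stated assumptions: `EscalationRule` and `active_rule_index` are hypothetical names, each rule's wait time is counted from the moment that rule becomes active, and the incident is assumed to stay unacknowledged throughout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    targets: tuple               # users, squads, or schedules notified by this rule
    escalate_after_minutes: int  # wait before moving on to the next rule


def active_rule_index(rules, minutes_since_trigger):
    """Index of the rule being notified while the incident stays in the triggered state."""
    elapsed = 0
    for index, rule in enumerate(rules):
        elapsed += rule.escalate_after_minutes
        if minutes_since_trigger < elapsed:
            return index
    return len(rules) - 1  # the chain is exhausted; stay on the last rule


# Hypothetical policy: on-call user first, then the backend squad, then everyone.
policy = [
    EscalationRule(("alice",), 5),
    EscalationRule(("backend-squad",), 10),
    EscalationRule(("all-hands",), 15),
]

level_at_start = active_rule_index(policy, 0)          # 0, alice is notified
level_after_5_minutes = active_rule_index(policy, 5)   # 1, escalated to the squad
level_after_20_minutes = active_rule_index(policy, 20) # 2, escalated to all hands
```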

Services

Services are at the core of Squadcast. A service represents a logical unit that maps to alert sources (aka monitoring tools) in your environment. When alerts are received from these sources, incidents are triggered and routed to users based on the attached escalation policy. You can also opt in to get notified on ChatOps tools such as Slack, MS Teams, and Google Hangouts.

Routing Rules

Alert Routing allows you to configure rules to ensure that alerts are routed to the right responder with the help of event tags attached to each alert. Routing is a part of the rules engine associated with each service.

Note: This rule will override the escalation policy attached to the service.

This is typically used in cases where severities are configured via tags and each severity type is to be handled by a different level of on-call user.
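A tag-based routing decision of this kind can be sketched in Python as below. The rule format, the `resolve_assignee` helper, and the tag and target names are all hypothetical; the sketch only illustrates the idea that the first matching rule overrides the service's default escalation policy.

```python
def resolve_assignee(event_tags, routing_rules, default_escalation_policy):
    """Return the target of the first rule whose tags all match, else the service default."""
    for required_tags, target in routing_rules:
        if all(event_tags.get(key) == value for key, value in required_tags.items()):
            return target
    return default_escalation_policy


# Hypothetical rules: route by severity tag to different on-call levels.
routing_rules = [
    ({"severity": "critical"}, "senior-on-call"),
    ({"severity": "low"}, "junior-on-call"),
]

critical_target = resolve_assignee(
    {"severity": "critical", "env": "prod"}, routing_rules, "backend escalation"
)  # "senior-on-call"
fallback_target = resolve_assignee(
    {"severity": "medium"}, routing_rules, "backend escalation"
)  # "backend escalation"
```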

Service Dependencies

Dependencies capture the dependent services for each service. They indicate which other services can be affected if the main service is impacted. You can add dependent services for each service from the services page. These defined dependencies are associated with the Status Page.
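One way to reason about dependencies is as a graph that is walked outward from the impacted service. The Python sketch below is illustrative only; the `dependents` mapping and the service names are hypothetical, and following dependencies transitively is an assumption of the example.

```python
from collections import deque


def affected_services(dependents, impacted):
    """Return every service that directly or indirectly depends on `impacted`."""
    seen = set()
    queue = deque([impacted])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen and service != impacted:
                seen.add(service)
                queue.append(service)
    return seen


# Hypothetical mapping: main service -> services that depend on it.
dependents = {
    "database": ["api"],
    "api": ["web", "mobile"],
}

affected = affected_services(dependents, "database")  # {"api", "web", "mobile"}
```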

Maintenance Mode

Maintenance Mode is used to temporarily disable notifications for a service for a set period of time. Incidents triggered when a service is in maintenance mode will automatically go into the suppressed state. You can add multiple maintenance windows or schedule recurring maintenance windows.

Note: No notifications will be sent when a service is under maintenance.
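The effect of a maintenance window on a new incident can be sketched as a simple time-range check. This Python example is a hypothetical illustration: the `initial_state` helper and the treatment of each window as a half-open interval (start included, end excluded) are assumptions.

```python
from datetime import datetime


def initial_state(triggered_at, maintenance_windows):
    """Return the state a new incident starts in, given the service's maintenance windows."""
    in_maintenance = any(start <= triggered_at < end for start, end in maintenance_windows)
    return "suppressed" if in_maintenance else "triggered"


# Hypothetical window: 02:00 to 04:00 on 1 January 2024.
windows = [(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 4, 0))]

during = initial_state(datetime(2024, 1, 1, 3, 0), windows)  # "suppressed"
after = initial_state(datetime(2024, 1, 1, 5, 0), windows)   # "triggered"
```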

Service Levels

Service Levels are attached to services and connect to the SLO dashboard. Service Levels are Service Level Objectives that you define for each service and can be configured for a service from the service page. Today, the configuration is made through our open-source SDK, DEX. DEX SDK is available in Golang and NodeJS.

You can add multiple Service Level Indicators that make up the Service Level Objective for a service. Today you can choose from Latency, Memory, and Status Codes for Service Level Indicators.

DEX SDK

Service Levels are configured with DEX, our open-source SDK. DEX is available in two languages:

  • DEX SDK for Golang: https://github.com/squadcastHQ/dex-go

  • DEX SDK for NodeJS: https://github.com/squadcastHQ/dex-node

Alert Source Integrations

An Alert Source Integration is used to integrate with any monitoring, logging, or tracing tool. Alert source integrations can be configured for services. You can also choose to send in alerts for a service using the API or email integrations. You can search through our documentation to find helpful alert source integration guides to walk you through any particular integration.

View our list of alert source integrations here.

Note:

If you don't see an integration you want, feel free to reach out to us via the Intercom Chat Widget in the bottom right corner of your screen or you can drop a line to our Support Team and someone will get back to you.

Extensions (Integrations)

Extensions are deeper integrations with external tools: actions taken from within Squadcast are reflected in the connected tool as well. Extensions can be found on the navigation sidebar.

Typically, extensions augment your incident management process by connecting with other tools where actions are required. ITSM, Communication, Web conferencing, Version Control, CI/CD, and SSO tools would typically act as extensions.

CircleCI Actions

Squadcast supports actions such as rebuilding CircleCI projects directly from the incident page. A link to the project status is added to the timeline, and you can click it to view the status from Squadcast.

JIRA Cloud & Server

The JIRA Cloud and JIRA Server Actions allow you to create tickets in JIRA with the incidents from Squadcast and sync status bidirectionally.

This is especially helpful if you have some tasks for the longer-term remediation of a particular incident or project.

Service Level Objective (SLO)

Service Level Objective is an agreement within an SLA about a specific metric over a certain period of time. It is expressed as a percentage or ratio over some time, for example, “99.95% availability over 24 hours”.

Service Level Agreement (SLA)

Service Level Agreement is an explicit or implicit agreement between a client and service provider stipulating the client’s reliability expectations and the service provider’s consequences for not meeting them.

Service Level Indicator (SLI)

Service Level Indicator measures compliance with an SLO (Service Level Objective). So, for example, if an SLA specifies that your system will be available 99.95% of the time, your SLO is likely 99.95% uptime and your SLI is the actual measurement of your uptime. Maybe it’s 99.90%, or maybe it’s 99.99%.

Error Budgets

Error Budgets are the single metric that can be utilized to determine whether the system can take on the additional risk of deploying a new feature or if the team should rather be focused on making the system more reliable. It is an indication of the amount of headroom you have before an SLA breach.

Error Budget = (1 - Availability) = Failed Requests / Total Number of Requests

So if an SLO for a service is indicated as an Availability of 99.5%, then this service has an error budget of 0.5% which specifies the amount of total downtime you are allowed.
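The arithmetic above can be worked through in a short Python sketch. The function names and the request and time figures are hypothetical, chosen only to illustrate the formula with the 99.5% availability SLO from the example.

```python
def error_budget(slo_availability):
    """Fraction of requests (or time) that may fail while still meeting the SLO."""
    return 1 - slo_availability


def budget_remaining(slo_availability, failed_requests, total_requests):
    """Number of further failed requests the service can absorb before breaching the SLO."""
    allowed_failures = error_budget(slo_availability) * total_requests
    return allowed_failures - failed_requests


slo = 0.995  # 99.5% availability

# Out of 1,000,000 requests, about 5,000 may fail; 1,200 already have.
remaining = budget_remaining(slo, failed_requests=1_200, total_requests=1_000_000)

# Expressed as time: about 216 minutes of downtime over a 30-day window.
allowed_downtime_minutes = error_budget(slo) * 30 * 24 * 60
```

Here roughly 3,800 failed requests remain in the budget. Floating-point arithmetic makes the results approximate, so compare them with a tolerance rather than for exact equality.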

Toil Budgets

Toil Budgets are a way for an engineer to understand how much time goes into work that holds no long-term value and could be automated. Toil is any work that tends to be manual, repetitive, automatable, tactical, and devoid of long-term value. Additionally, toil tends to scale linearly as the service grows.

Note:

Today, this feature is not available on the platform but we are working on bringing the capabilities to measure this.

SLO Summary

The SLO summary is a visual representation of your services’ health.

Note:

Only the services for which service levels are configured will show up on the SLO dashboard.

Squadcast Runbooks

Runbooks are a compilation of routine procedures and tasks, in the form of checklists, that are documented for reference while working on a critical incident.

Postmortem

Postmortems help teams document incident failures and their subsequent fixes to create a knowledge base of learnings that can be shared across the organization. All of the steps taken for incident resolution can be documented and shared in a Postmortem.

Note:

You can choose from several popular postmortem templates or have your admin/account owner create one for your organization.

Status Page

The Status Page helps you communicate critical updates about outages and scheduled maintenance to your customers and stakeholders.

Status Pages can be either public (accessible by everyone) or private (accessible only by your team on Squadcast).

Note:

You can also add a subscription option for your public status page so customers are automatically informed of any updates on the Status Page.

Webforms

Webforms give your stakeholders and customers a way to report issues through a publicly hosted form. Improve your customer support offering and manage critical customer-impacting issues as incidents within Squadcast.

Analytics

The Analytics dashboard helps analyze incident data using Organization-level analytics & Team-level analytics. Visualize with graphs: MTTA & MTTR, Alert noise reduction, Incident count by Name, Tags, Alert sources, and so on.

