Glossary
This glossary explains the terms used within Squadcast, along with others relevant to the incident management and SRE space.
Roles in Squadcast
Account Owner
An Account Owner is the root user of your organization and has full access. The Account Owner can manage the account subscription settings and billing. Squadcast also sends subscription and payment-related update emails to the Account Owner.
Squadcast assigns account ownership to the user who signs up for the Squadcast account. After that, the Account Owner can manage permissions for each added user.
An Account Owner has the ability to:
Access all billing information
Add new users/admins/stakeholders
Delete users/admins/stakeholders
Create/edit/delete on-call schedules
Create/edit/delete escalation policies
Create/edit/delete services
Create/edit/delete status pages
Add/edit/delete postmortem templates
Change the account owner or delete the account
Users
Users are typically folks who go on-call and need to be notified every time an incident is triggered. Users can only access the configurations that they’re part of, and they can only access the incidents for teams that they’re a part of. Users can be granted additional permissions controlled by RBAC.
A user can:
Create/edit/delete on-call schedules
Create/edit/delete escalation policies
Create/edit/delete services
Create/edit/update/delete status pages
Stakeholders
Stakeholders are individuals or groups within the organization who take an interest in, and are impacted by, the outcomes of the incident management process.
Stakeholders have view-only access to all incidents and are not notified by default about any incident created in Squadcast. They can also create manual incidents if they notice something wrong and want to notify the on-call team. They can add notes to an incident and act as an incident watcher.
A stakeholder can:
Access the incident dashboard
Create incidents from the dashboard and assign them to a user/admin/account owner
Chat in the war room
Watch Incidents
Add Incident Notes
View/update the status page
Add tags to an incident
View the analytics page
Teams
Squads
Examples:
Payment gateway Squad
Backend Squad
Frontend Squad
All Hands
My Profile
Notification Rules
Phone Call
SMS
Email
Push Notification from the Squadcast mobile app
Dashboard
Summary Section: The top of the page holds the incident summary where you will be able to see the number of incidents distributed by their state.
Incidents Section: The bottom of the page holds all the existing and incoming incidents with all the incident details associated with it.
The dashboard supports the following functionalities:
Toggle function: You can use the toggle button to see incidents assigned to just you or to the entire organization. The toggle button is in the top left corner of the dashboard page.
Incident states: You can see the number of incidents in each of the states - triggered, acknowledged, resolved and suppressed.
User metrics: You can use the dashboard toggle function to see your MTTA & MTTR or that of your entire organization.
Filter incident activity by time: You can filter to see incidents from last week, last month, last year or for a custom date range. This can be done using the Last Week, Last Month, Last Year and Custom Range buttons on the top right corner of the dashboard page.
Incident Filter: In the incident section, incidents can be filtered by impacted service (service names), incident source (alert source), or assignee (name of a user, squad, or escalation policy).
Bulk Actions: You can acknowledge or resolve incidents in bulk using the Actions button in the incident section of the dashboard page.
MTTA
Mean Time To Acknowledge is the average time taken to acknowledge incidents. You can use the toggle switch to view the MTTA for yourself or the organization.
MTTR
Mean Time To Resolve is the average time taken to resolve incidents. You can use the toggle switch to view the MTTR for yourself or the organization.
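As a rough illustration of how these two averages can be computed, here is a minimal sketch. The incident records and the `mean_minutes` helper are hypothetical, and the sketch assumes both metrics are measured from the moment the incident was triggered; it is not Squadcast's own calculation.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (triggered_at, acknowledged_at, resolved_at).
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 4), datetime(2024, 1, 1, 9, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 10), datetime(2024, 1, 2, 15, 0)),
]

def mean_minutes(pairs):
    """Average gap in minutes between each (start, end) pair."""
    return mean((end - start).total_seconds() / 60 for start, end in pairs)

# Assumption: both metrics are measured from the trigger time.
mtta = mean_minutes((triggered, acked) for triggered, acked, _ in incidents)
mttr = mean_minutes((triggered, resolved) for triggered, _, resolved in incidents)

print(f"MTTA: {mtta} min, MTTR: {mttr} min")  # MTTA: 7.0 min, MTTR: 45.0 min
```

Restricting the `incidents` list to those assigned to one user would give that user's figures instead of the organization's.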
Alert
Alert Forwarding
Alert Forwarding is typically used when the on-call user is sick, on vacation, or has to step away due to an emergency and wants another user to fill in for their on-call shift. Alert Forwarding is also known as Vacation Mode.
Incident
An incident can be made up of multiple alerts or can be a standalone incident that is service- or customer-impacting. An incident is triggered within a service via the alert source integration. This sets off notifications to the on-call user as per the service's escalation policy. When an incident is triggered, it stays in the Triggered state until the on-call user acknowledges it.
When an incident is triggered, it comes with the following information:
Incident ID: The incident number on Squadcast. Incident numbers follow the order in which incidents are pushed into Squadcast.
Incident Name: The name of the incident. This typically describes the nature of the incident.
Incident Description: A short summary of the incident. Supporting links can be added here to give more context.
Impact on: The name of the service for which the incident was triggered.
Created Via: The alert source through which the incident was created.
Assigned To: The name of the user/squad/escalation policy that the incident was assigned to.
Status: The incident state: triggered, acknowledged, resolved, or suppressed.
Tags: Tags are added to an incident to classify it however best fits your incident management process (e.g., sev:high or sev:critical; frontend or backend).
States of an Incident
Triggered
An incident is considered to be in the triggered state before any user responds to the incident notification. Once an incident is triggered it will notify the on-call user(s) based on their notification rules. This also means that the incident is in the open state.
Acknowledged
An incident is considered to be in the acknowledged state when a user has acknowledged an incident and is working on resolving it.
Resolved
An incident is considered resolved when the user has fixed the issue and they want the incident to be closed. Once an incident is resolved, no additional notifications will be sent and the incident cannot be opened again. This is one of the two final states on the Squadcast platform.
Suppressed
Any incident that matches one of the suppression rules configured for a service automatically goes into the suppressed state. Suppressed and resolved are the two final states on the Squadcast platform.
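The four states and their two final states can be modelled as a small state machine. This is a sketch only: the `State` enum, the `ALLOWED` table, and `transition` are hypothetical names, and the table assumes that a triggered incident may be resolved directly and that an incident enters the suppressed state only at creation time.

```python
from enum import Enum

class State(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    SUPPRESSED = "suppressed"

# Assumed transition table. Resolved and suppressed are final, so they
# have no outgoing transitions.
ALLOWED = {
    State.TRIGGERED: {State.ACKNOWLEDGED, State.RESOLVED},
    State.ACKNOWLEDGED: {State.RESOLVED},
    State.RESOLVED: set(),
    State.SUPPRESSED: set(),
}

def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target

state = transition(State.TRIGGERED, State.ACKNOWLEDGED)
state = transition(state, State.RESOLVED)
# transition(state, State.TRIGGERED) would raise: a resolved incident cannot be reopened.
```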
Incident Page
The incident page holds all the details associated with an incident. Each incident will open into an incident page. The page has three main sections:
Incident Details
Acknowledge
Resolve
More Actions
Tags
Incident Notes
Incident Timeline
Squadcast Actions
Squadcast Actions lets you respond to incidents directly from Squadcast by clicking the More Actions button on the incident page. Squadcast Actions are typically used to mitigate a customer-impacting issue as soon as possible. In some cases this resolves the issue, and in others longer-term remediation is needed. This is left to the user and team to decide and act on.
Today the platform has the following Actions:
Some simple examples of actions are rebuilding your project, rolling back to the previous build, and rebooting a server. You can build a repository of any actions, including more complex ones, and run them from Squadcast.
Schedules
Rotation Types
Rotation types determine how schedules function. Rotation types can be set to have users on-call for a day at a time or a week at a time, or the rotation can be customized to any specified number of hours, days, or weeks.
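To make the idea of a rotation length concrete, the sketch below picks the on-call user for a given moment in a simple round-robin rotation. The function name, user names, and dates are hypothetical, and a plain round-robin order is an assumption rather than a description of how Squadcast schedules shifts.

```python
from datetime import datetime, timedelta

def on_call_user(users, rotation_start, shift_length, at):
    """Return the user whose shift covers `at` in a round-robin rotation."""
    if at < rotation_start:
        raise ValueError("rotation has not started yet")
    # Dividing two timedeltas with // gives the number of whole shifts elapsed.
    shifts_elapsed = (at - rotation_start) // shift_length
    return users[shifts_elapsed % len(users)]

users = ["alice", "bob", "carol"]  # hypothetical users
start = datetime(2024, 1, 1)

# Weekly rotation: nine days in, the second user is on call.
weekly = on_call_user(users, start, timedelta(weeks=1), datetime(2024, 1, 10))

# Custom rotation of 12 hours: 30 hours in, two shifts have elapsed.
custom = on_call_user(users, start, timedelta(hours=12), datetime(2024, 1, 2, 6))
```

Changing `shift_length` is all it takes to move between daily, weekly, and custom rotations in this model.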
On-Call Restrictions
Restrictions on an on-call schedule determine which hours of the day, and which days, a user is on-call. You can restrict on-call shifts to a period of time on a daily or weekly basis. No notifications will be sent if an incident is triggered at any time outside of this restriction.
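A restriction can be thought of as a filter on the incident's trigger time. The sketch below checks a weekday-and-hours window; the function name and the example window (Monday to Friday, 09:00 to 17:00) are hypothetical, and it assumes the window does not cross midnight.

```python
from datetime import datetime, time

def within_restriction(at, weekdays, start, end):
    """True if `at` is on an allowed weekday and inside [start, end).

    `weekdays` uses Python's convention: Monday is 0, Sunday is 6.
    Assumes the window starts and ends on the same day.
    """
    return at.weekday() in weekdays and start <= at.time() < end

business_days = {0, 1, 2, 3, 4}
window_start, window_end = time(9, 0), time(17, 0)

# Wednesday 10:00 is inside the restriction, so the user would be notified.
inside = within_restriction(datetime(2024, 1, 3, 10, 0), business_days, window_start, window_end)

# Saturday 10:00 is outside it, so no notification would go out.
outside = within_restriction(datetime(2024, 1, 6, 10, 0), business_days, window_start, window_end)
```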
Gaps in Schedule
A gap in the schedule indicates that no one is on call for a certain amount of time. This is typically seen when custom rotations are created. In this case, a primary on-call rotation is used as a fallback layer for incidents that occur during the schedule gaps: if an incident occurs in a gap period, it is automatically sent to the user on the primary rotation layer.
Rotation Layers
Rotation layers are used to create a fallback layer when there are schedule gaps.
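The fallback behaviour for schedule gaps can be sketched as a lookup that tries the custom shifts first and falls back to the primary layer when none of them covers the incident time. The data layout (a list of `(start, end, user)` tuples) and all names are hypothetical.

```python
from datetime import datetime

def on_call_with_fallback(custom_shifts, primary_user, at):
    """Return the user on a custom shift covering `at`, else the primary user."""
    for start, end, user in custom_shifts:
        if start <= at < end:
            return user
    # No custom shift covers this time: it is a gap, so use the primary layer.
    return primary_user

custom_shifts = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 17), "alice"),
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 17), "bob"),
]

covered = on_call_with_fallback(custom_shifts, "carol", datetime(2024, 1, 1, 12))
in_gap = on_call_with_fallback(custom_shifts, "carol", datetime(2024, 1, 1, 22))
```

Here the 22:00 incident falls between the two custom shifts, so it goes to the primary-layer user.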
Escalation Policies
Escalation policies can include a single user, multiple users, squads, and schedules.
You will need to add the below details to create an escalation policy:
Policy Name: The name of the escalation policy (Ex: backend escalation)
Policy Description: The description of the policy (Ex: Incidents triggered for all the backend services should be routed to the backend escalation policy)
Rules
User or Squad or Schedule: Add users, squads or schedules to the rule
Escalation After: The time period after which an incident that is still in the triggered state is escalated to the next rule in the policy. This period can be set to any amount of time (in minutes).
Escalate if: The incident is not acknowledged. This is currently the only supported condition; support for escalating when an incident is not resolved is planned.
Escalation Rule
Multiple escalation rules make up an escalation policy. Typically, each escalation rule represents a different level of on-call duty. The first rule in the policy determines who gets notified first about the triggered incident. This can be a user, multiple users, squads, or schedules.
If the incident is still in the triggered state after the users/squad/schedule on the first rule have been notified and that rule's time period has elapsed, the users/squad/schedule on the second rule of the escalation policy are notified, and so on.
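The rule-by-rule escalation described above can be sketched as a walk through the policy. The function and target names are hypothetical, and the sketch assumes each rule's "Escalation After" period starts when that rule is notified, so the waits accumulate; it applies only while the incident stays in the triggered state.

```python
def notified_targets(rules, minutes_since_trigger):
    """Return the targets notified so far for an unacknowledged incident.

    `rules` is a list of (target, escalate_after_minutes) in policy order.
    Assumption: each rule's timer starts when that rule is notified.
    """
    notified = []
    threshold = 0  # minutes after trigger at which the current rule fires
    for target, escalate_after in rules:
        if minutes_since_trigger < threshold:
            break
        notified.append(target)
        threshold += escalate_after
    return notified

# Hypothetical policy: a user, then a squad, then a schedule.
policy = [("alice", 10), ("backend-squad", 15), ("managers-schedule", 30)]

print(notified_targets(policy, 5))   # ['alice']
print(notified_targets(policy, 10))  # ['alice', 'backend-squad']
print(notified_targets(policy, 25))  # ['alice', 'backend-squad', 'managers-schedule']
```

Once any notified user acknowledges the incident, it leaves the triggered state and no further rules would fire.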
Services
Routing Rules
Service Dependencies
Maintenance Mode
Service Levels
Service Levels are attached to services and connect to the SLO dashboard. Service Levels are Service Level Objectives that you define for each service and can be configured for a service from the service page. Today, the configuration is made through our open-source SDK, DEX. DEX SDK is available in Golang and NodeJS.
You can add multiple Service Level Indicators that make up the Service Level Objective for a service. Today you can choose from Latency, Memory, and Status Codes for Service Level Indicators.
DEX SDK
Service Levels are configured with DEX, our open-source SDK. DEX is available in two languages:
DEX SDK for Golang: https://github.com/squadcastHQ/dex-go
DEX SDK for NodeJS: https://github.com/squadcastHQ/dex-node
Alert Source Integrations
An Alert Source Integration is used to integrate with any monitoring, logging, or tracing tool. Alert source integrations can be configured for services. You can also choose to send in alerts for a service using the API or email integrations. You can search through our documentation to find helpful alert source integration guides to walk you through any particular integration.
Extensions (Integrations)
Typically, extensions augment your incident management process by connecting with other tools where actions are required. ITSM, Communication, Web conferencing, Version Control, CI/CD, and SSO tools would typically act as extensions.
Circle CI Actions
JIRA Cloud & Server
This is especially helpful if you have some tasks for the longer-term remediation of a particular incident or project.
Service Level Objective (SLO)
Service Level Agreement (SLA)
Service Level Indicator (SLI)
Error Budgets
For example, if the SLO for a service is an availability of 99.5%, the service has an error budget of 0.5%, which is the total amount of downtime allowed.
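To turn that percentage into something tangible, the arithmetic can be sketched as follows. The function name is hypothetical, and the 30-day measurement window is an assumption chosen for illustration; the source does not specify one.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Downtime allowed, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (100 - slo_percent) * window_minutes / 100

budget = error_budget_minutes(99.5)
print(budget)  # 216.0 minutes, i.e. 3 hours 36 minutes over 30 days
```

Under these assumptions, a 99.5% availability SLO leaves 0.5% of 43,200 minutes, or 216 minutes, of allowed downtime per 30-day window.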
Toil Budgets
Toil Budgets are a way for an engineer to understand how much time is going into work that holds no long-term value and could be automated. Toil is any work that tends to be manual, repetitive, automatable, tactical, and devoid of long-term value. Additionally, toil tends to scale linearly as the service grows.
SLO Summary
Squadcast Runbooks
Postmortem
Status Page
Status Pages can be either public (accessible by everyone) or private (accessible only by your team on Squadcast).
Webforms
Analytics