Example Questions for Discovery Sessions
The operators may not be able to readily spell out all of the content and context requirements for you. Here are some examples of questions to ask the operator to guide their thinking:
What sort of incidents have you seen previously and what sort of correlation did you have to do to figure out the impact? If the correlation is manual, you can see if it can be done programmatically via enrichment. This question also highlights potential external enrichment sources.
Are there any site-specific issues you would like to highlight? The answer exposes their current pain points.
What kind of manual correlation did you have to do in your head in the past few months? Can you show it to me and elaborate? The answer to this question will help you identify the enrichment requirements.
If you would like to identify issues by their location, is there a way to know the location from the alert payload? Does the hostname naming convention allow me to infer it? Can I look up a device's network address in an inventory and get the location from that? The answer to this question can help you determine whether you can parse the location from the incoming data or need to look it up in a CMDB.
Can we use a relationship lookup to aid clustering? For example, a job or application dependency relationship, a device type, or some other parent-child relationship. The answer tells you whether to use topology information for clustering and helps you identify further enrichment requirements.
How do you prioritize alerts apart from their severity? Are there any areas that carry higher importance, for example because they are SLA-bound or tied to a particular customer or location?
How many environments do you have? For example, UAT, DEV, and Production. How would you like them to be prioritized? This will help you identify the clustering requirements. If they have both a production and a DEV environment, alerts with the same attributes should most likely still be clustered separately when one comes from production and the other from DEV.
What alerts would you route to your team? How do you identify these alerts and what do they have in common? Common elements can include services, region, application, data center, and so on. Knowing the answer helps you determine the clustering and routing strategy.
What specific cause-effect type of incidents would you like Moogsoft Enterprise to identify? Some events are of little interest to the operators on their own, but in combination with a separate occurrence they can indicate a more critical issue.
If, for example, the context of a situation is the impacted business server, would you also want to break the clusters further into the impacted technological stack? If the answer is yes, you need to figure out how to source that information. Can you parse the source data to extract this? Or do you need to perform a lookup?
Would you like to cluster alerts across the technological stack as well?
Situations are evolving entities so it is acceptable for a situation to change its context throughout its lifecycle. What contexts should be merged together based on the alert overlap?
Should you be notified only when multiple things fail, or is even a single alert of concern? This helps you identify the alert threshold.
How long after the initial candidate cluster creation should alerts be considered in the scope of the cluster? The answer to this question directly impacts the choice of cook_for time and its extension.
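To make some of these answers concrete, here is a minimal Python sketch of how the location, environment, threshold, and time-window answers might fit together. It is an illustration only and is not Moogsoft Enterprise configuration or API: the hostname convention, the `INVENTORY` table, and the `enrich` and `cluster` functions are all hypothetical, and `window_seconds` is only loosely analogous to a cook_for time (the sketch uses a fixed window with no extension).

```python
import re
from collections import defaultdict

# Hypothetical naming convention: <site>-<env>-<role><nn>, e.g. "lon-prod-web01".
HOSTNAME_PATTERN = re.compile(
    r"^(?P<site>[a-z]{3})-(?P<env>prod|uat|dev)-(?P<role>[a-z]+)\d+$"
)

# Hypothetical stand-in for a CMDB or inventory export, keyed by hostname.
INVENTORY = {"db7": {"site": "nyc", "env": "prod"}}


def enrich(alert):
    """Add site and env by parsing the hostname, falling back to a lookup."""
    match = HOSTNAME_PATTERN.match(alert["host"])
    if match:
        fields = {"site": match["site"], "env": match["env"]}
    else:
        # The hostname does not follow the convention, so consult the inventory.
        fields = INVENTORY.get(alert["host"], {"site": "unknown", "env": "unknown"})
    return {**alert, **fields}


def cluster(alerts, window_seconds, min_alerts):
    """Group enriched alerts by (site, env) within a fixed time window.

    Alerts that arrive more than window_seconds after the first alert in
    their group are ignored in this sketch. Groups with fewer than
    min_alerts members are dropped (the alert threshold).
    """
    groups = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["time"]):
        bucket = groups[(alert["site"], alert["env"])]
        if not bucket or alert["time"] - bucket[0]["time"] <= window_seconds:
            bucket.append(alert)
    return {key: members for key, members in groups.items() if len(members) >= min_alerts}
```

Because the environment is part of the grouping key, production and DEV alerts from the same site never land in the same cluster, which matches the separation discussed in the environments question. Calling `cluster([enrich(a) for a in alerts], window_seconds=120, min_alerts=2)` on a list of dictionaries with `host` and `time` keys returns only the groups that meet the threshold.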