A collage of screenshots related to Smart SOAR's error handling capabilities

Smart SOAR’s Innovative Approach to Error-Handling Explained

Our commitment to innovation is rooted in the feedback we receive from those who use our Smart SOAR platform daily. By listening to our customers, we identified and addressed a crucial opportunity for improvement: our error-handling capabilities. The feedback was simple: customers needed to be notified of issues (such as playbook or ingestion failures) immediately, and the notification needed to be informative enough to indicate exactly where to troubleshoot. Meeting both needs would not only minimize downtime during troubleshooting but also strengthen resilience and reliability.

Now, with the solution in place, the result for our enterprise customers is what we believe to be the most robust and reliable SOAR platform in the industry. For our MSSP customers, the benefits extend further, because of the effects that delays in alert processing and response can have on SLAs and client relationships. With D3 today, MSSPs get a vendor-agnostic SOAR solution with the robustness needed to streamline client onboarding and minimize error-related risks.

Here’s how we’ve engineered the Smart SOAR platform to ensure precision and reliability in your SecOps automation efforts.

Handling Ingestion Errors with Smart SOAR’s Data Reacquire

Event ingestion failures can happen for a number of reasons, such as hitting a rate limit or unexpected changes in firewall settings. Smart SOAR’s FetchEvent command runs on a regular cadence, querying your detection tool(s) for new alerts. With the Data Reacquire option enabled, each time a scheduled fetch finishes, Smart SOAR automatically schedules a follow-up task that re-fetches the same data at a later time (e.g., 30 or 120 minutes later), ensuring data completeness. Smart SOAR will also alert SOC teams about potential ingestion issues once a given number of consecutive event fetches have failed, and it automatically restarts stuck ingestion jobs.
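The ingestion safeguards described above can be sketched in a few lines of Python. This is an illustrative model only, not Smart SOAR's implementation: the `IngestionMonitor` class, its fields, and the threshold of three failures are all hypothetical names and values chosen for the example.

```python
from datetime import datetime, timedelta


class IngestionMonitor:
    """Hypothetical tracker for scheduled fetches (not Smart SOAR's real API)."""

    def __init__(self, failure_threshold, reacquire_delay):
        self.failure_threshold = failure_threshold  # consecutive failures before alerting
        self.reacquire_delay = reacquire_delay      # e.g. 30 or 120 minutes
        self.consecutive_failures = 0
        self.alerts = []
        self.reacquire_times = []

    def record_fetch(self, finished_at, succeeded):
        # Every finished fetch queues a later re-fetch of the same data.
        self.reacquire_times.append(finished_at + self.reacquire_delay)
        if succeeded:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        # Alert once, at the moment the streak reaches the threshold.
        if self.consecutive_failures == self.failure_threshold:
            self.alerts.append(
                f"{self.consecutive_failures} consecutive fetch failures "
                f"as of {finished_at:%H:%M}"
            )


monitor = IngestionMonitor(failure_threshold=3, reacquire_delay=timedelta(minutes=30))
start = datetime(2024, 1, 1, 10, 0)
outcomes = [False, False, True, False, False, False]  # one fetch every 5 minutes
for i, ok in enumerate(outcomes):
    monitor.record_fetch(start + timedelta(minutes=5 * i), ok)
```

In this sketch the successful third fetch resets the counter, so only the final run of three failures produces an alert.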

Screenshot from Smart SOAR's View Data Source tab

Smart SOAR also has a mechanism to ensure that no alerts are missed during ingestion by utilizing a Tolerance Scope. This feature is like a safety net, acting as a time buffer that captures any events that may have been missed during data collection. If you have a Tolerance Scope set for 5 minutes and the data collection is scheduled for 10:00 AM, Smart SOAR will collect not just the events at the exact scheduled time but also those from 9:55 AM to 10:00 AM.
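As a rough illustration of the arithmetic in that example, the collection window can be computed by subtracting the tolerance from the scheduled time. The function names here are hypothetical and only mirror the 9:55 AM to 10:00 AM example; they are not Smart SOAR settings or API calls.

```python
from datetime import datetime, timedelta


def tolerance_window(scheduled_at, tolerance):
    """Return (start, end) of the collection window for one scheduled run."""
    return scheduled_at - tolerance, scheduled_at


def in_window(event_time, window):
    """True if an event timestamp falls inside the collection window."""
    start, end = window
    return start <= event_time <= end


window = tolerance_window(datetime(2024, 1, 1, 10, 0), timedelta(minutes=5))
late_event = datetime(2024, 1, 1, 9, 57)   # inside the 5-minute buffer
old_event = datetime(2024, 1, 1, 9, 54)    # outside the buffer
```

Here `in_window(late_event, window)` is true and `in_window(old_event, window)` is false, which matches the 5-minute buffer described above.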

Screenshot of Smart SOAR highlighting the tolerance scope setting

For a step-by-step guide on setting up and using the Tolerance Scope feature in your data ingestion process, please refer to our technical documentation on Data Ingestion.

Incident Playbooks: Auto-Retry on Error

There are a few ways Smart SOAR playbooks handle errors related to incidents. Each playbook task comes with these options for error handling:

  1. Trigger a workflow specifically designed to address errors.
  2. Stop the playbook.
  3. Retry the command at whatever interval you choose.

The “Auto-Retry on Error” feature in Smart SOAR’s Playbook Tasks allows for continued operations by retrying failed tasks automatically. This helps minimize manual re-runs and keeps workflows moving. You can set up to five automatic retries and specify intervals between retries in seconds, minutes, or hours, providing flexibility to handle errors efficiently.
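The retry behavior can be pictured with a short Python sketch. It assumes a cap of five retries (taken from the description above) and a fixed interval in seconds; `run_with_retry` and its parameters are hypothetical names for illustration, not part of Smart SOAR.

```python
import time

MAX_RETRIES = 5  # the cap described above


def run_with_retry(task, retries, interval_seconds, sleep=time.sleep):
    """Run task(); on failure, wait and retry up to `retries` more times.

    Returns (result, attempts). Re-raises the last error once retries run out.
    """
    if not 0 <= retries <= MAX_RETRIES:
        raise ValueError(f"retries must be between 0 and {MAX_RETRIES}")
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise  # retries exhausted: surface the error
            sleep(interval_seconds)


# Usage: a command that fails twice (e.g. a rate limit) and then succeeds.
calls = {"count": 0}


def flaky_command():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limit hit")
    return "ok"


waits = []  # record waits instead of actually sleeping
result, attempts = run_with_retry(flaky_command, retries=3, interval_seconds=60, sleep=waits.append)
```

Passing `sleep=waits.append` keeps the example instant; with the default `time.sleep`, the same call would pause 60 seconds between attempts.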

All errors are collected in the Investigation Dashboard as well, under the ‘Playbook Errors’ tab.

Screenshot: Where you can find playbook errors in Smart SOAR

‘On Playbook Task Error’ Trigger

Smart SOAR has also enhanced its error management capabilities with the introduction of the “On Playbook Task Error” trigger.

Screenshot: Smart SOAR playbook error trigger

This feature automatically initiates a follow-up workflow if an error occurs during the execution of playbook tasks such as commands, data formatters, conditionals, and REST API calls. The trigger is designed to alert an analyst immediately, by sending an email or Slack notification or by executing a predefined workflow, ensuring a quick response to incidents. To avoid creating infinite error loops, workflows started by this error trigger won’t retrigger themselves if they fail, which maintains system stability while still providing actionable error notifications.
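The loop-prevention rule is the subtle part of this design, so here is a minimal Python sketch of the idea. It is an assumption-laden model: `run_task`, the `from_error_trigger` flag, and the notifier functions are hypothetical and do not reflect how Smart SOAR implements the trigger internally.

```python
from functools import partial


def run_task(task, on_error, from_error_trigger=False):
    """Run one task; on failure, start the error workflow unless already inside one."""
    try:
        task()
        return "ok"
    except Exception as exc:
        if from_error_trigger:
            # A failing error-workflow task does not fire the trigger again.
            return "failed (error trigger suppressed)"
        for follow_up in on_error:
            run_task(partial(follow_up, exc), on_error, from_error_trigger=True)
        return "failed (error workflow started)"


notifications = []


def failing_command():
    raise RuntimeError("rate limit hit")


def broken_notifier(exc):
    # Simulates a notification step that itself fails.
    raise ConnectionError("mail server unreachable")


def notify_analyst(exc):
    notifications.append(f"task error: {exc}")


status = run_task(failing_command, on_error=[broken_notifier, notify_analyst])
```

Even though `broken_notifier` fails, it does not start another error workflow, and the analyst notification for the original failure is still recorded.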

Screenshot: Stop on Error feature in Smart SOAR

Stop on Error

Available in REST API, conditional, and data formatter tasks, the “Stop on Error” option, when enabled, halts the execution of a task sequence if an error occurs in the current task. This feature is crucial for error handling within automated workflows because it prevents subsequent tasks from executing on data that the error may have left corrupted or inaccurate. By halting the sequence, the system allows operators to investigate and resolve the issue without the risk of errors cascading through the rest of the automated process. This ensures that operators maintain control over the automation and can enforce quality checks at each step of the process.
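A small Python sketch shows why halting matters. The `run_sequence` helper and the task tuples are hypothetical, and the behavior when the option is disabled (skip the failed task and continue with unchanged data) is an assumption made for the example rather than documented Smart SOAR behavior.

```python
def run_sequence(tasks, data):
    """Run (name, func, stop_on_error) tasks in order.

    Returns (data, completed_names, halted_at). halted_at is None if the
    sequence ran to the end.
    """
    completed = []
    for name, func, stop_on_error in tasks:
        try:
            data = func(data)
            completed.append(name)
        except Exception:
            if stop_on_error:
                return data, completed, name  # halt: later tasks never run
            # Assumption: with the option off, continue with unchanged data.
    return data, completed, None


tasks = [
    ("parse", lambda d: {**d, "ip": d["raw"].split("=")[1]}, True),
    ("enrich", lambda d: {**d, "score": 90}, True),
]

good = run_sequence(tasks, {"raw": "src=10.0.0.5"})
bad = run_sequence(tasks, {"raw": "malformed"})
```

With the malformed input, the sequence halts at `parse`, so `enrich` never attaches a score to a record that has no IP address.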

Unmatched Reliability: Smart SOAR’s Advanced Error Management

Apart from these feature updates to our SOAR platform, we’ve also improved our operational processes around integration development and bolstered our white-glove customer support. Our technology team has updated our monitoring and notification systems to ensure a robust process for error identification and resolution. This includes log ingestion reviews by the support team and having a developer on call for any after-hours issues. Schedule a demo to see how these enhancements can streamline your SOC operations, improving response times and operational reliability.
