According to research on artificial intelligence (AI) adoption by the UK government, around 1 in 6 UK companies currently use AI in some way. Alongside this adoption, enterprises now have more telemetry data than ever, and both response speed and monitoring at scale have improved significantly. However, the reality of network visibility has become more complicated.
Too many dashboards and endless – often contradictory – AI insights mean that signal volume has grown faster than organisations’ capacity to understand and analyse it, making root-cause analysis and troubleshooting harder. As enterprises increasingly prioritise efficiency and resilience, poorly managed and opaque automation could become a significant stumbling block if left unaddressed.
“As an industry, we have successfully solved the data collection problem but worsened the context problem. We’ve created a context deficit,” says Dray Agha, senior manager of security operations at Huntress.
How automation created a monitoring overload
Several organisations are in the middle of a major automation wave, handing tasks to autonomous systems that run with as little human intervention as possible.
This shift is being driven by a range of systems and tools. Intent-based networking, for example, has turned network management from a device-by-device process into centralised, policy-driven automation.
Similarly, zero-touch provisioning means that manual onboarding and configuration steps for devices such as switches and routers can be largely bypassed. Artificial intelligence for IT operations (AIOps) has made it possible to correlate telemetry at scale, while self-healing systems automatically resolve common problems, for example by restarting failed services.
“Most enterprises do not have an observability shortage anymore – they have a context correlation problem,” says Ken Herron, co-founder of VCONify. “Telemetry volume has exploded across SD-WAN, hybrid cloud, edge infrastructure, SaaS platforms and security tooling. But operational context remains fragmented across tickets, chat systems, calls, dashboards and tribal knowledge. So organisations increasingly see more signals while understanding less causality.”
While these systems have improved operational efficiency by reducing the need for manual intervention, they have also made networks more opaque to the human operators who run them. These automation tools have created an avalanche of metrics, logs, flow data, traces and synthetic tests across multiple observability platforms.
“Visibility without governance is just expensive noise,” says Herman Errico, senior product manager of technical research at Vanta. “The core problem is that ingestion is being treated as the goal rather than detection.”
Enterprises are now juggling overlapping and inconsistent narratives, with no single version of network truth. This is worsened by multiple dashboards across several vendors and teams, along with constant and conflicting AI insights. These can all contribute heavily to alert fatigue and further obscure visibility.
“Tool fragmentation compounds the problem. Anyone who has sat through a major incident bridge call knows the first chunk of time is often spent reconciling conflicting views of the same environment rather than actually diagnosing the issue,” Errico adds.
This fragmentation can be a major hurdle for root-cause analysis of outages, leaving enterprises more exposed to costly network failures that take significant time and resources to resolve.
The observability paradox: how networks are becoming black boxes
The rise of automation and the growth of telemetry data in recent years have created an observability paradox. Despite highly sophisticated analysis tools, understanding the context behind network outages and other issues has often become more difficult. More advanced systems and automation have therefore not delivered the easier troubleshooting or faster root-cause analysis that enterprises may have hoped for.
Some AI-driven root-cause analysis systems may also be unreliable, because they rely heavily on correlation rather than exposing the underlying causal logic. This makes it harder for enterprises to verify their conclusions independently or to spot hallucinations.
“The result is correlation without meaning. Engineers are left seeing what changed but not always why the system chose that specific corrective path,” says Jacob Strauss, co-founder and CTO of ChaseLabs. “In large distributed environments, even small timing differences between data sources can distort interpretation and make behaviour appear inconsistent or misleading.”
A more deeply rooted issue, however, is the growing number of abstraction layers placed between the underlying infrastructure and human operators in an effort to make networks more self-managing. As a result, networks increasingly behave like black boxes.
By using tools such as software-defined networking (SDN), many teams now default to centralised decision-making, with administrators no longer needing to interact directly with devices. Similarly, intent-based networking takes high-level policies and translates them into decisions executed by machines, with far less traceability.
AI-driven remediation can complicate matters further, even though it reduces downtime and strengthens resilience in many cases. Several platforms can identify issues, reroute traffic and restart services automatically, but the reasoning behind these decisions may not be transparent or easily auditable. Human operators can then struggle to understand what really happened or which assumptions drove the response.
Dynamic enterprise conditions also mean that workloads constantly shift across edge and cloud environments, while policies and configurations are frequently updated. Because the network’s state keeps changing, understanding an outage after the fact becomes even more difficult.
Highly fragmented data ecosystems compound the problem, with the vast majority of enterprise data siloed in unstructured or disconnected systems such as email and legacy databases. AI models are therefore often fed incomplete or incorrect data, which makes unpredictable results more likely.
This means that failures are far more distributed and hidden across several layers, rather than originating from a single, easily identifiable source. Symptoms of unusual behaviour may be visible on one platform but caused by an event that only shows up on a different dashboard, or by a hidden automated update.
Teams end up spending far more time and resources deciphering and connecting signals across systems than fixing the underlying issues. As automation becomes a crucial component of enterprise infrastructure, improving understanding, context and transparency is no longer optional, and could matter even more than collecting more data.
What this means for organisations and human teams
Poor network observability can have significant operational and human consequences in the long term. Slower root-cause analysis, despite ever-growing telemetry data, can mean more downtime and lost productivity, with engineers spending more time decoding and correlating signals than putting fixes in place.
In many cases, the same issues recur again and again, especially when no clear trigger can be found, forcing teams to rely on temporary patches rather than permanent solutions. Longer outages also erode client trust through poor service reliability and product failures, and the resulting damage to an enterprise’s reputation is much harder to measure.
In complex, interconnected systems such as supply chains and power grids, hidden cascading failures are also a significant concern. These occur when a single, unseen issue triggers a chain reaction of secondary breakdowns that could lead to a system-wide outage. All of these factors place considerable strain on engineering and financial resources, which can directly increase operational costs while reducing revenue.
In heavily regulated sectors such as manufacturing and healthcare, slower investigations could even result in regulatory breaches and missed compliance deadlines. This could lead to serious business consequences such as high fines, product seizures and loss of operating licences.
On the human side, the cognitive burden of constantly switching between multiple tools, alert systems and dashboards, instead of inspecting systems directly, can leave teams more fatigued. Burnout and low morale from endlessly applying superficial, stop-gap fixes to the same recurring problems are also a significant concern.
Another key issue is that, as manual intervention declines, engineers at all levels are losing network and infrastructure intuition, along with low-level troubleshooting skills.
“AI-driven remediation feels safer because the system auto-heals. But every auto-heal nobody investigates is a future incident nobody will know how to debug. Every auto-heal should leave a question behind for a human to answer later, or it shouldn’t run at all,” says Viktoriia Moskalets, senior data analyst at Sigma Software Group.
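Moskalets’s rule can be expressed as a guard around any automated fix. The sketch below is a minimal, hypothetical Python illustration rather than a feature of any particular AIOps platform: `AutoHealer`, `FollowUp` and the in-memory `review_queue` are invented names, and a real deployment would presumably send the question to a ticketing or incident system instead.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class FollowUp:
    """A question an automated fix leaves behind for a human to answer later."""
    action: str
    question: str
    created_at: str


@dataclass
class AutoHealer:
    """Hypothetical wrapper: runs a fix only if it records a follow-up question."""
    review_queue: List[FollowUp] = field(default_factory=list)

    def run(self, action: str, fix: Callable[[], None], question: str) -> bool:
        # No question, no auto-heal: refuse to run the fix at all.
        if not question.strip():
            return False
        # Record the question first, so it survives even if the fix itself fails.
        self.review_queue.append(
            FollowUp(action, question, datetime.now(timezone.utc).isoformat())
        )
        fix()
        return True


# Illustrative usage with made-up service names.
healer = AutoHealer()
restarted: List[str] = []
healer.run("restart dns-cache", lambda: restarted.append("dns-cache"),
           "Why did dns-cache crash three times this week?")
healer.run("restart api-gateway", lambda: restarted.append("api-gateway"), "")
print(restarted)                 # ['dns-cache']
print(len(healer.review_queue))  # 1
```

Recording the question before the fix runs is a deliberate choice in this sketch: a fix that fails or is interrupted still leaves something for a human to investigate.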
Junior engineers who never see raw network behaviour or experience failures first-hand may miss key early opportunities to learn the fundamentals and build diagnostic intuition. Senior engineers, meanwhile, may come to rely mainly on platform outputs rather than device-level understanding.
“The problem is that automation silently erodes institutional knowledge, and nobody notices until something novel happens. If a team has only ever interacted with the intent layer, the underlying complexity doesn’t disappear; it just becomes inaccessible when it matters most,” Errico adds.
This creates a crucial skills gap during major outages and other incidents. It also makes institutional knowledge overly platform-dependent, which introduces another layer of risk. Network observability challenges have therefore moved beyond a purely technical issue to become a much more serious cognitive and organisational problem.
Can observability be rebuilt?
As more organisations build artificial intelligence into core infrastructure, network visibility has become more crucial than ever. One way to rebuild observability is to re-establish human-readable audit trails through “explainable AI”. This can help ensure that every automated action, along with its underlying assumptions and decision-making process, is traceable.
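There is no single standard for what such an audit trail should contain. As a minimal sketch, assuming a hypothetical schema, each automated action could be captured as one structured entry that stores what changed alongside the trigger, evidence and assumptions behind it, and that can be rendered in plain English for an operator. The field names and example values are illustrative only.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class AutomatedActionRecord:
    """One human-readable audit-trail entry for an automated action (hypothetical schema)."""
    system: str              # which automation acted
    action: str              # what it changed
    trigger: str             # the signal that prompted the change
    evidence: List[str]      # observations the decision relied on
    assumptions: List[str]   # what the system took for granted
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        """Serialise as a single JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)

    def explain(self) -> str:
        """Render a plain-English summary an operator can read during an incident."""
        return (
            f"{self.system} {self.action} because {self.trigger}. "
            f"Evidence: {'; '.join(self.evidence)}. "
            f"Assumed: {'; '.join(self.assumptions)}."
        )


# Illustrative entry; every value here is made up.
record = AutomatedActionRecord(
    system="remediation-bot",
    action="rerouted traffic from link A to link B",
    trigger="packet loss on link A stayed above 5% for three minutes",
    evidence=["link A loss 7.2%", "link B utilisation 40%"],
    assumptions=["link B can absorb link A's traffic for the next hour"],
)
print(record.explain())
```

Writing one JSON object per line keeps the trail append-only and searchable with ordinary log tooling, while `explain()` gives the human-readable view the audit trail is meant to provide.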
“The well-documented fix in the observability community is baselining: defining what ‘normal’ looks like in your specific environment and writing detection logic that only fires on meaningful deviations from those patterns,” Errico says.
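To illustrate the baselining idea, the sketch below learns ‘normal’ for a single metric from a sliding window of recent samples and flags a value only when it sits more than a chosen number of standard deviations from the window’s mean. The `Baseline` class, the window size, the three-sigma threshold and the latency figures are all assumptions for the example; real traffic often has daily and weekly cycles that a single rolling mean does not capture.

```python
from collections import deque
from statistics import mean, pstdev


class Baseline:
    """Hypothetical rolling baseline for one metric (e.g. round-trip time in ms)."""

    def __init__(self, window: int = 60, threshold: float = 3.0, min_samples: int = 10):
        self.history = deque(maxlen=window)  # keeps only the most recent samples
        self.threshold = threshold           # standard deviations that count as meaningful
        self.min_samples = min_samples       # stay silent until a baseline exists

    def observe(self, value: float) -> bool:
        """Return True if value deviates meaningfully, then fold it into the baseline."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma == 0:
                # A perfectly flat baseline: any change at all is a deviation.
                anomalous = value != mu
            else:
                anomalous = abs(value - mu) / sigma > self.threshold
        self.history.append(value)
        return anomalous


latency = Baseline(window=60, threshold=3.0, min_samples=10)
for ms in [20.0, 22.0] * 10:   # a steady link alternating between 20 and 22 ms
    latency.observe(ms)
print(latency.observe(22.0))   # False: within normal variation
print(latency.observe(40.0))   # True: well outside the learned baseline
```

In this sketch flagged values are still added to the window, so the baseline adapts if a change persists; excluding them instead would stop an ongoing incident from becoming the new normal. Which trade-off fits depends on the metric.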
Enterprises also need to review their dashboards regularly and eliminate redundant and unused tools. Moskalets adds: “Effective observability governance is far less about adding policies and far more about deleting things.
“Three anchors that work in practice: every alert maps to a specific decision someone has to make, and alerts with no owner are removed; every dashboard is traceable to a real question the operations team needs answered; AI insights are scored against ground truth on a regular cadence, not trusted by default.”
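Two of those anchors lend themselves to simple, automatable checks. The hypothetical sketch below removes alert rules that lack an owner or a decision, and scores AI-generated root-cause claims against the causes later confirmed in post-incident review. The data model, names and example incidents are invented for illustration, and a plain accuracy ratio is only one possible way to score insights against ground truth.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AlertRule:
    name: str
    owner: Optional[str]     # who acts on it; None means nobody does
    decision: Optional[str]  # the decision the alert forces someone to make


def prune_alerts(rules: List[AlertRule]) -> Tuple[List[AlertRule], List[AlertRule]]:
    """Split rules into (kept, removed): an alert with no owner or no decision goes."""
    kept = [r for r in rules if r.owner and r.decision]
    removed = [r for r in rules if not (r.owner and r.decision)]
    return kept, removed


def insight_accuracy(predicted: Dict[str, str], confirmed: Dict[str, str]) -> Optional[float]:
    """Share of AI root-cause claims that match the cause confirmed after review.

    Only incidents that have been reviewed are scored; returns None if there are none.
    """
    reviewed = [incident for incident in predicted if incident in confirmed]
    if not reviewed:
        return None
    hits = sum(predicted[i] == confirmed[i] for i in reviewed)
    return hits / len(reviewed)


rules = [
    AlertRule("bgp-session-down", "netops", "fail over to the backup peer?"),
    AlertRule("cpu-above-70", None, None),
    AlertRule("disk-latency-warning", "storage", None),
]
kept, removed = prune_alerts(rules)
print([r.name for r in removed])  # ['cpu-above-70', 'disk-latency-warning']

ai_claims = {"INC-101": "dns", "INC-102": "bgp", "INC-103": "mtu"}
post_incident = {"INC-101": "dns", "INC-102": "firewall"}
print(insight_accuracy(ai_claims, post_incident))  # 0.5
```

Running checks like these on a regular cadence, rather than once, is what turns them into the kind of deletion-focused governance Moskalets describes.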
Frequently tested human overrides for automation tools, along with guardrails such as explicit authorisation for remediation, can help restore network visibility too. Organisations also need to reduce “observability debt”, which builds up largely as different teams maintain overlapping monitoring tools, telemetry pipelines and conflicting versions of system health.
One way to do this is to establish unified telemetry pipelines and dependency graphs, which give network and security teams a consistent view of system behaviour across complex infrastructure layers.
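A dependency graph helps separate causes from symptoms. As a simplified, hypothetical sketch: if every component lists what it depends on, an alerting component with no alerting component anywhere upstream of it is a candidate root cause, while alerts further downstream are more likely symptoms. The function names, component names and topology are invented for illustration.

```python
from typing import Dict, List, Set


def upstream_of(node: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Every component `node` depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = list(depends_on.get(node, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    seen.discard(node)  # ignore dependency cycles that lead back to the node itself
    return seen


def root_cause_candidates(depends_on: Dict[str, List[str]], alerting: Set[str]) -> Set[str]:
    """Alerting components with no alerting component anywhere upstream of them.

    Their alerts cannot be explained as symptoms of another failure in the graph.
    """
    return {n for n in alerting if not (upstream_of(n, depends_on) & alerting)}


# Hypothetical topology: each component lists what it depends on.
depends_on = {
    "checkout-app": ["payments-api", "dns"],
    "payments-api": ["orders-db", "dns"],
    "orders-db": ["storage-array"],
}
alerting = {"checkout-app", "payments-api", "orders-db"}
print(root_cause_candidates(depends_on, alerting))  # {'orders-db'}
```

This graph-only heuristic ignores timing and recent configuration changes, which real correlation tools typically also weigh, so its output is best treated as a shortlist for investigation rather than a verdict.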
Towards more understandable networks
While automation can greatly enhance efficiency, systems should still be carefully monitored and audited to ensure maximum transparency and understandability. This matters all the more because processes that lack auditability can have long-term business, financial and regulatory consequences, as well as causing significant reputational damage.
“If your automation can’t explain what it did and why, it should not be trusted with high-impact decisions,” warns Errico. “The organisations getting this right treat observability as a product with a roadmap, a dedicated team, defined internal users and success metrics. The ones struggling are treating observability as a byproduct of buying tools, and that gap is widening.”
As such, the next networking era will depend far more on implementing explainable and accountable automation than collecting even more telemetry data.