There is a person on your team right now who knows something nobody else knows.
Maybe it is how a specific edge router was configured during an emergency two years ago, a workaround that was never documented and never properly fixed. Maybe it is which vendor’s escalation path actually works and which one burns three hours before reaching someone useful. Maybe it is why that particular BGP policy looks wrong but is actually right. Whatever it is, that knowledge lives in one human head. And when the network breaks at 2am, that human gets the call.
Every service provider and data center operator I have worked with has at least one person like this. Usually several. They are celebrated. Promoted. Counted on. And they represent one of the most significant operational risks in your business.
Not because they are bad engineers. Because they are great ones. And the organization has optimized around their greatness instead of building something more resilient.
That is the hero culture problem. It’s not a people problem. It’s a leadership problem.
Why Hero Culture Made Sense
Hero culture did not emerge from poor judgment. It emerged from the nature of the work.
In networking’s first and second eras, when the job was first to make connectivity exist and then to make it reliable at scale, individual expertise was genuinely the right tool. Networks were complex but finite. A small team of skilled engineers could hold the logical topology in their heads. Institutional knowledge lived in people because that was the most efficient place to put it. The hero was not a failure of process. The hero was the process.
For service providers and data center operators, this was especially true. Building and running network infrastructure at any meaningful scale required people who had done it before, who knew the vendor quirks, the protocol edge cases, the operational patterns that worked. Those people were scarce. Organizations that found them protected them, rewarded them, and built around them. This was rational.
The conditions that made hero culture rational have changed. Networks now demand more expertise, across more domains, with faster change and higher stakes attached to every failure. No individual can hold all of it anymore. Not because engineers have gotten worse. Because networks have gotten more complex than any one person can fully master and, more importantly, because the stakes of depending on individuals have grown too high.
The Network Is the Product
Here is where the leadership argument diverges sharply from the enterprise version of this conversation, and why the stakes are different for your organization.
For an enterprise IT shop, a network failure means employees can’t work. That’s bad. But the business still exists. Products still exist. Sales are temporarily disrupted. The network is a support function that has failed.
For a service provider or data center operator, a network failure means your product has failed. An outage doesn’t just disrupt operations. It triggers SLA penalties, damages customer trust, and in competitive markets, accelerates churn. For a colocation operator, it affects your customers’ customers. For a carrier, it may mean regulatory scrutiny on top of the commercial consequences. The stakes are different in kind, not just degree.
This is why leadership practices that might be discretionary for an enterprise are effectively mandatory for you. When the network is a cost center, hero culture is expensive but survivable. When the network is the product, hero culture is a business risk you are running whether or not you have named it.
The hero who knows your network is not your safety net. They are evidence that your safety net does not exist.
The NOC as Structural Enforcer
You cannot talk about leadership in SP and DC operations without talking about the NOC.
The NOC is where hero culture is structurally reinforced, often by design. 24/7 eyes on glass, reactive by mandate, staffed for firefighting. The metrics are MTTR, ticket closure, incident response time. The culture rewards the engineer who clears incidents fast, not the one who prevents them from recurring. Not because NOC teams are short-sighted, but because they are measured on the wrong things, and leadership has not changed the measurement.
The result is an organizational gravity that is hard to escape. Smart, capable engineers get faster at responding to the same categories of problems rather than eliminating them. The ones who are best at firefighting get recognized and promoted. The ones who want to do the structural work — documentation, automation, process improvement — often feel like they are fighting for permission. Or they leave.
This is not a critique of the people in the NOC. It is an observation about what the NOC, as commonly structured, does to an organization’s ability to transform. The reactive posture that keeps your network running today is the same posture that makes meaningful improvement nearly impossible to sustain.
Research from the DORA (DevOps Research and Assessment) team, detailed in Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim, shows that transformational leadership drives technology delivery performance not by directly controlling technical work, but by enabling teams to adopt better practices and ways of working. The lever is organizational and cultural. The outcome is operational. Leadership that doesn’t touch that lever doesn’t get the outcome, regardless of what tools it buys.
If your NOC is designed for reactive firefighting and your leadership model rewards it, no automation platform in the world will change the operating pattern. The platform will be deployed on top of the existing culture and inherit its limitations. That goes for AI-assisted operations as much as any other automation: a faster tool in a culture that rewards firefighting just produces faster firefighting.
Measuring What Actually Matters
The shift from hero culture to high performance starts with changing what you measure.
Traditional SP and DC operations measure activity: tickets closed, changes deployed, uptime percentage, MTTR. These are not useless. But they describe what happened inside your operations. They do not describe what your customers experienced or what the business achieved.
Outcome-focused leadership asks different questions. How long does it take to turn up a new customer circuit from signed contract to live traffic? What percentage of network changes achieve their intended result without causing an incident? When a new service is enabled, how quickly can a customer actually consume it?
These questions are uncomfortable for a reason. They reveal the gap between operational activity and business value. An organization that deploys 300 changes a month with strong uptime numbers can simultaneously have a 45-day average circuit provisioning time that is losing deals to a competitor who provisions in a week. The first two metrics look fine. The third is a competitive problem, and you cannot see it if you are not asking.
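To make the contrast concrete, here is a minimal sketch of how two of those outcome questions could be computed from records you probably already have. The record shapes (`Change`, `CircuitOrder`), their field names, and the sample data are assumptions for illustration only, not an export format from any particular ticketing or OSS platform.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class Change:
    change_id: str
    achieved_intent: bool   # did the change do what it was meant to do?
    caused_incident: bool   # did it trigger an incident?


@dataclass
class CircuitOrder:
    order_id: str
    contract_signed: date
    live_traffic: date


def change_success_rate(changes):
    """Share of changes that met their intent without causing an incident."""
    if not changes:
        return 0.0
    good = sum(1 for c in changes if c.achieved_intent and not c.caused_incident)
    return good / len(changes)


def mean_provisioning_days(orders):
    """Average days from signed contract to live traffic."""
    return mean((o.live_traffic - o.contract_signed).days for o in orders)


# Hypothetical sample data.
changes = [
    Change("CHG-1", True, False),
    Change("CHG-2", True, True),    # worked, but caused an incident
    Change("CHG-3", False, False),  # safe, but did not achieve its goal
    Change("CHG-4", True, False),
]
orders = [
    CircuitOrder("ORD-1", date(2024, 1, 2), date(2024, 2, 16)),
    CircuitOrder("ORD-2", date(2024, 1, 10), date(2024, 2, 19)),
]

print(f"change success rate: {change_success_rate(changes):.0%}")
print(f"mean provisioning time: {mean_provisioning_days(orders):.1f} days")
```

Note that a change counts as successful here only if it both achieved its intent and caused no incident. A raw count of changes deployed would score all four of the sample changes the same.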
The transformation in measurement does not require abandoning operational metrics. It requires adding the customer-impact layer and making it visible to leadership. For managed service providers and hosted infrastructure operators, this is partly contractual: the SLA structure should reflect customer outcomes, not just network performance statistics. That distinction changes where leadership attention goes, which changes where engineering attention goes.
Eliminating Waste
Lean management is often explained in manufacturing terms that don’t quite land for network operators. But the underlying principle does: every activity that doesn’t contribute to a customer outcome is waste.
Consider what a customer circuit turn-up actually looks like at most operators. The engineering work — provisioning, testing, validation — might take a few days. The surrounding process — order intake, handoffs between provisioning, NOC, and billing, approval queues, customer communication, final acceptance — can take weeks. Not because any individual step is badly designed. Because the overall process was never mapped end to end, and nobody owns the full cycle.
Tracing the complete path from customer request to delivered outcome almost always reveals the same pattern: the technical work is a small fraction of total cycle time. The majority of time is spent waiting. Waiting for approvals that exist because someone once made a configuration error that caused an incident. Waiting for handoffs between teams that have never established a clear process. Waiting because the systems involved don’t talk to each other and a human has to carry information between them.
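One way to put a number on that pattern is flow efficiency: hands-on work time divided by total cycle time. The sketch below assumes you can estimate work and wait hours per stage. The stage names and hours are invented for illustration, and `Stage`, `flow_efficiency`, and `longest_waits` are hypothetical helpers, not part of any lean tooling.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    work_hours: float   # time someone is actively working the order
    wait_hours: float   # time the order sits in a queue or awaits a handoff


def flow_efficiency(stages):
    """Fraction of total cycle time spent on actual work."""
    work = sum(s.work_hours for s in stages)
    total = work + sum(s.wait_hours for s in stages)
    return work / total if total else 0.0


def longest_waits(stages, top=3):
    """Stages with the most waiting time, largest first."""
    return sorted(stages, key=lambda s: s.wait_hours, reverse=True)[:top]


# Hypothetical circuit turn-up, mapped end to end.
turn_up = [
    Stage("order intake", 2, 46),
    Stage("design approval", 1, 119),
    Stage("provisioning", 16, 32),
    Stage("testing and validation", 8, 16),
    Stage("billing handoff", 1, 71),
    Stage("customer acceptance", 2, 46),
]

print(f"flow efficiency: {flow_efficiency(turn_up):.1%}")
for stage in longest_waits(turn_up, top=2):
    print(f"{stage.name}: {stage.wait_hours:.0f} hours waiting")
```

With these invented numbers, the order is being worked for 30 of the 360 hours it spends in the process. The list of longest waits is usually a better starting point for improvement than the engineering stages themselves.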
The lean management lens doesn’t ask you to move faster. It asks you to identify what’s slowing you down and whether it’s actually buying you anything. Not every approval gate is waste. Some exist for good reasons. But many exist because nobody has questioned them recently, and they have accumulated into a change management process that takes longer than the change itself.
Other waste patterns worth examining: incident responses where the first 30 minutes go toward reconstructing topology context that should already be in your monitoring system. Support escalations that burn three tiers before reaching someone with authority to act. Maintenance windows scheduled weeks out for changes that are low-risk and fully reversible. Each of these has a direct cost in engineer time, customer experience, and your team’s capacity for forward-looking work.
Fear Is a Leadership Choice
Network operations has historically punished failure visibly. Outages leave paper trails. In SP and DC environments, where a significant incident can mean SLA credits, regulatory notifications, and executive-level customer escalations, that visibility is even more acute.
The predictable result is conservatism. Teams conditioned to associate change with risk develop change-aversion. Improvements get deferred because the potential downside is concrete and near-term, while the benefit is diffuse and long-term. The hero who keeps the network running through personal expertise and heroic troubleshooting gets more organizational support than the team working to make heroics unnecessary.
Westrum’s organizational culture typology, and its application to technology operations in the DORA research, describes the failure mode as pathological: failure is concealed, messengers are shot, cross-functional cooperation is discouraged. The alternative is not a culture without accountability. It is a generative culture where failure is analyzed for learning rather than used to assign blame, and where the system is examined before the individual.
Creating that culture in an SP or DC environment requires deliberate choices from leadership, not declarations. It means separating the post-incident conversation about what broke from the conversation about who made a mistake. It means establishing rollback procedures and canary processes so that change risk is managed structurally rather than by avoiding change. It means defining what a well-managed failure looks like, where the right processes were followed and the team learned something, and treating that differently from a reckless one.
The practical reason this matters, beyond culture: if your engineers are afraid to make changes, your network cannot evolve. In a market where your competitors are provisioning faster, responding smarter, and scaling more efficiently, a fear-bound operations team is a competitive liability. Fear is a leadership choice, because the conditions that produce it are set by leadership.
Teams Over Heroes
The pillar that makes everything else sustainable is organizing around teams rather than individuals.
Team Topologies, by Matthew Skelton and Manuel Pais, offers a framework that maps well onto network operations. The core insight is that organizational structure determines how work flows. Structure around individual expertise and work flows to individuals. Structure around service outcomes and work flows to teams with end-to-end accountability.
Traditional SP and DC operations tend toward the first model: separate teams for IP networking, optical, security, NOC, and provisioning, each with their own tools, processes, and organizational incentives. Coordination happens through tickets and handoffs. This works for steady-state operations. It breaks down when you need to deliver something that crosses team boundaries quickly, or when the nature of the work is changing faster than the organizational boundaries can adapt.
Team-focused operations means giving cross-functional teams clear ownership of specific outcomes. Not “the NOC handles incidents” and “engineering handles changes” as two separate worlds, but teams with end-to-end accountability for the operational health of specific services, with the authority and tools to handle both. This structure makes the true cost of handoffs visible rather than hiding it inside functional silos. It puts the people with the most context on a problem closest to the decision.
Empowerment is the practical piece. Teams can only function with real accountability if they have the authority to act without escalating routine decisions. That means establishing clear guidelines about what a team can do independently and reserving escalation for decisions that genuinely require it. The question worth asking is whether your current approval structure reflects actual risk, or whether it reflects accumulated distrust of team judgment built up over years of hero-dependent operations.
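Guidelines like these are easier to question when they are written down explicitly. As a sketch only, the routing rule below ties escalation to risk signals (reversibility, blast radius, a tested rollback) rather than to who is asking. The fields, the `Decision` names, and the threshold of 50 customers are assumptions for illustration, not a recommended policy.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    TEAM_APPROVES = "team approves"
    PEER_REVIEW = "peer review within team"
    ESCALATE = "escalate"


@dataclass
class ChangeRequest:
    summary: str
    customers_affected: int
    reversible: bool
    tested_rollback: bool


def route(change, blast_radius_limit=50):
    """Decide who approves a change based on risk, not on habit."""
    # Irreversible or wide-impact changes are the ones that warrant escalation.
    if not change.reversible:
        return Decision.ESCALATE
    if change.customers_affected > blast_radius_limit:
        return Decision.ESCALATE
    # Reversible in principle, but the rollback has never been exercised.
    if not change.tested_rollback:
        return Decision.PEER_REVIEW
    return Decision.TEAM_APPROVES


# Hypothetical requests.
requests = [
    ChangeRequest("update interface description", 0, True, True),
    ChangeRequest("adjust prefix filter on one peer", 12, True, False),
    ChangeRequest("core router OS upgrade", 400, True, True),
]
for req in requests:
    print(f"{req.summary}: {route(req).value}")
```

The value is less in the code than in the conversation it forces: every branch is a rule someone has to be able to justify in terms of actual risk.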
The Thing Nobody Wants to Own
Everything above describes what to do. The harder truth is that none of it happens without a leadership decision to change what gets measured, what gets rewarded, and what gets recognized.
Hero culture is self-reinforcing because it produces real value in the short term. The engineer who stays until 3am to restore service is genuinely valuable in that moment. The problem is that celebrating that moment, without also asking why the 3am call happened, trains the organization to produce more 3am calls. It trains engineers to be available for emergencies rather than to eliminate them. It trains management to tolerate the underlying fragility because the hero is always there to absorb the consequences.
Breaking that cycle requires leadership to make the structural work visible and valued. The engineer who builds the automation that prevents a class of incidents from recurring is doing more durable work than the one who responds to that incident 40 times. In most organizations, only one of those two gets recognized. That is a leadership choice, and it has consequences.
The DORA research is clear about where the lever is. Transformational leadership — leaders who communicate a clear vision, challenge teams to improve, enable autonomy, and recognize contribution — is one of the strongest predictors of organizational performance in technology operations. Not tooling. Not process. Leadership.
Your network’s ability to operate with intelligence depends on this evolution, whether you are running a regional ISP or a hyperscale data center, a managed services operation or a national carrier. The tools have been available for years. What has been missing, in most of the organizations I have worked with, is the leadership decision to use them differently.
That decision is yours to make.