The data center operator I was working with had a real problem. It took me a bit to figure out what it actually was.

They were building a private cloud service on top of an existing physical network. The physical infrastructure was mature, well-run, and CLI-driven, with Cisco and Juniper throughout. The new cloud layer ran on VMware. Virtual networking inside it, the NSX overlay, was configured through vCloud Director. A GUI the VMware team knew well. A GUI the network team had never touched.

Nobody had answered the obvious question: when a customer needed a virtual network configured, who would do it?

The VMware team knew the tool. The network team knew networking. Tickets bounced. Configs got done twice, or not at all. Teams blamed each other in every post-incident review. The actual problem had nothing to do with technical skill. Nobody had defined who configures what, using what tool, following what process.

When people talk about network standardization, they usually mean configuration consistency: hostname conventions, IP schemes, VLAN numbering. That stuff matters, and most organizations stop there. But configuration standardization is one dimension of a problem with four. Operational standardization is who does what, with what tools. Tooling standardization is which tools belong to which jobs, and where the mandates stop. Intent-layer standardization is how you define what a service is supposed to be, regardless of vendor or platform. Get all four right and automation becomes possible. Get one wrong and you build on a foundation that cracks under load.

Operational Standardization

The VMware/NSX situation I described isn’t unusual. I’ve seen versions of it everywhere organizations are navigating ambiguous ownership boundaries: physical versus virtual, network versus compute, in-house versus managed service, acquired infrastructure versus existing operations. The question of who owns a given layer of the network can be genuinely hard, and M&A, platform migrations, and new service models keep generating new versions of the same question.

Most organizations answer it reactively. After the first incident.

It’s not just about what gets configured. It’s about who configures it, in what system, under what change process, with what rollback procedure. At service providers and data center operators, this plays out across scale and complexity most other organizations never approach.

Consider a PoP refresh at a regional ISP. The physical network team owns the router configs. A vendor controller manages overlay functions. Provisioning goes through an OSS with its own internal logic. The NOC has runbooks referencing CLI commands that haven’t been accurate since the last platform migration. Five teams can touch a single service. If the operational standard doesn’t specify who does what at each layer, you get duplication, gaps, and post-incident reviews that identify “process issues” without fixing anything.

Most organizations treat this as an org chart problem and hand it to management. It’s an architecture problem. Matthew Skelton and Manuel Pais make exactly this point in Team Topologies: unclear ownership boundaries and overlapping cognitive load between teams don’t produce collaboration, they produce friction and dropped handoffs. The solution isn’t a better RACI matrix. You have to design the operational model the same way you design the network: explicit interfaces, clear ownership, documented handoffs.

Tooling Standardization

The second dimension is where the impulse toward standardization becomes counterproductive.

In several data center environments I’ve worked in, server and infrastructure teams had already adopted Chef or Puppet for server builds and VM provisioning. When the network team started automating, leadership’s instinct was to mandate the same tools. Seems reasonable: one toolchain, consistent model, shared skills.

It usually didn’t work. Chef and Puppet were built for server configuration management. They don’t map cleanly to how networks operate: transactional changes to declarative state, multi-vendor device diversity, real-time validation before committing. Some teams made it work, sometimes at significant cost. More often, it produced shadow toolchains leadership didn’t know about, maintained by engineers who had to actually get things done.

Ansible has a similar story. Capable tool. “We use Ansible” as a blanket standard, without thinking carefully about fit, tends to end up in the same place. Any tool can be misused, or just a bad fit for a specific team at a specific time.

One tool for one job is the right principle. But define the job before you assign the tool, and be honest about where the company-wide mandate should stop and where team autonomy should begin. A network team and a server team have different jobs. Requiring the same tools because it looks tidier optimizes for the org chart, not for operational outcomes. Don’t do that.

Intent-Layer Standardization

This is the hardest dimension. It’s also the most important one for service providers of all kinds. And the most commonly skipped.

A mature SP network is never going to be homogeneous. Cisco in one region, Nokia in another, Juniper in the core, three generations of hardware from an acquisition still carrying live customer circuits. Multi-vendor isn’t a transition state most operators will get through. It’s permanent. Standardizing the hardware is a fantasy.

But you can standardize the intent.

One ISP I worked with had a specific problem: they offered a standard set of managed services, but every provisioner built those services differently. Same product sold to the customer. Different configuration underneath. Some added custom QoS policies. Others used different interface descriptions. Route targets weren’t consistent. The services worked, roughly. But automation was impossible. You couldn’t write a reliable script to modify, audit, or decommission a service because you couldn’t predict what you’d find. Each instance needed human inspection before anything could be done to it.

The fix wasn’t standardizing hardware or vendor commands. It was standardizing the intent: defining precisely what the configuration of that service was supposed to look like in abstract terms, then implementing that abstraction consistently regardless of the device underneath. What parameters define this service? What values are valid? What must be present in every instance? Answer those questions and you have an intent model. Once you have one, you can write automation against it. Without one, you’re back to one device at a time.
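
To make that concrete, here’s a minimal sketch of what an intent model can look like, written as a Python dataclass. The service type, field names, valid values, and the 65000: route-target convention are all illustrative assumptions, not that ISP’s actual product definition; the point is that every parameter the service needs has exactly one agreed set of valid answers.

```python
from dataclasses import dataclass

# Hypothetical intent model for a managed L3VPN service. Field names and
# constraints are illustrative assumptions, not any operator's real product.

VALID_BANDWIDTHS_MBPS = {10, 100, 1000}
VALID_QOS_PROFILES = {"standard", "premium"}

@dataclass(frozen=True)
class L3VPNIntent:
    customer_id: str               # must be present in every instance
    site_id: str
    bandwidth_mbps: int            # only the tiers the product actually sells
    route_target: str              # derived from a scheme, never hand-picked
    qos_profile: str = "standard"  # one named profile, not ad-hoc policies

    def validate(self) -> list[str]:
        """Return a list of violations; an empty list means the intent is well-formed."""
        errors = []
        if self.bandwidth_mbps not in VALID_BANDWIDTHS_MBPS:
            errors.append(f"bandwidth {self.bandwidth_mbps} Mbps is not a sold tier")
        if not self.route_target.startswith("65000:"):
            errors.append("route target must come from the assigned ASN range")
        if self.qos_profile not in VALID_QOS_PROFILES:
            errors.append(f"unknown QoS profile {self.qos_profile!r}")
        return errors
```

Once something like this exists and is agreed on, a provisioning script can refuse to create anything that doesn’t validate, and an audit script knows exactly what it should find on every instance.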

Most standardization efforts skip this layer because it’s the hardest. It requires the people who design services, the people who provision them, and the people who operate them to agree on a shared definition nobody currently owns. That conversation is uncomfortable. It usually reveals that your product definitions are less precise than anyone wanted to admit.

Design-Driven Automation

Standards matter because of what they enable. The end goal is a network you can reason about, validate against, and change through automation rather than through individual human judgment applied one device at a time.

Jeremy Schulman is a network automation pioneer and a founding advisory board member of the Network Automation Forum. He’s been working on this problem since the early days of Juniper automation, and he’s developed a framework he calls Network Automation by Design: declaratively represent the expected operational state of the network as a design, and that design becomes the single source of truth. Configuration generation, operational validation, documentation, compliance checking: everything flows from the design. He laid it out at AutoCon0 in 2023, and it remains the clearest articulation I’ve seen of why standards and automation have to be built together rather than sequentially.

The key insight: configuration is just implementation. The design is what matters. If the design is precise and machine-readable, configuration is a derivation. State what you intend, and the tooling produces the config. If the design is vague, which it is wherever standards haven’t been established, every configuration decision is a judgment call made by whoever happens to be on shift.
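
As a toy illustration of that derivation, and not a claim about Schulman’s actual tooling, the intent object from the earlier sketch can render into more than one vendor syntax from the same source of truth. The interface name and description format below are assumptions made up for the example.

```python
# Toy illustration: configuration as a derivation of the design. `intent` is
# an L3VPNIntent from the previous sketch; templates here are simplified
# assumptions, not production configuration.

def render_interface_description(intent, platform: str) -> str:
    desc = f"CUST:{intent.customer_id} SITE:{intent.site_id} BW:{intent.bandwidth_mbps}M"
    if platform == "cisco-ios":
        return f"description {desc}"
    if platform == "junos":
        return f'set interfaces ge-0/0/1 unit 0 description "{desc}"'
    raise ValueError(f"no template defined for platform {platform!r}")
```

Change the design and re-render, and every platform gets the same change; nobody interprets intent by hand during a maintenance window.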

That’s the real cost of non-standardization at SP scale. Not that changes take longer. It’s that every change requires human interpretation of an underdefined intent, at 2am, during a maintenance window, on a network carrying tens of thousands of customer circuits. That’s where mistakes happen.

The research backs this up. Forsgren, Humble, and Kim’s Accelerate established the empirical link between standardization, automation, and high organizational performance. The DORA metrics (deployment frequency, change failure rate, mean time to restore) don’t just apply to software teams. They apply to any operations organization where repeatable, automated processes are what separates high performers from everyone else. Network operations is no exception. A January 2024 Enterprise Management Associates survey of 354 IT professionals found only 18% rated their network automation a complete success, with 38% reporting failure or uncertainty. Network World’s coverage of that research included a quote from a network tools engineer that really says it all: “Standardization is the biggest obstacle. When the network is not standardized and the data is not standardized and you don’t have a standard way of generating inventories and a source of truth, it’s a big problem. You can’t automate at scale because you’re forced to automate one device at a time without standardization.”

That practitioner and Schulman’s framework are pointing at exactly the same thing. The design can only drive automation if it’s specific enough to reason from. That requires standards.

One more thing worth saying plainly here: AI doesn’t change this calculus; it only raises the stakes. The assumption that AI-driven automation will sort out non-standardized environments is exactly the kind of thinking that will get organizations burned at a new scale. An LLM-based automation agent operating on inconsistent configurations, inconsistent telemetry, and inconsistent service definitions isn’t solving the problem; it’s amplifying it. AI reasons from your data. If your data reflects a non-standardized network, you get non-standardized outcomes, faster. The standards work isn’t a prerequisite you can skip because the tooling got smarter. It’s the foundation the smarter tooling depends on.

SP and DC Operators Have It Harder

Everything I’ve described is harder at service provider and data center scale, and harder in ways that are structurally different.

Multi-vendor is the obvious one. But the less-discussed reason is acquisition. Most regional and national ISPs have grown through M&A, and every acquisition brings a different operational model, different tooling, and different tribal knowledge about why things are configured the way they are. The brownfield problem at an SP isn’t just decades of organic drift. It’s often two or three companies’ worth of conventions running simultaneously in the same network, with integration seams that may never have been fully resolved.

Then there’s the OSS/BSS layer. A network standard that doesn’t propagate into provisioning and billing is incomplete. I’ve seen operators build clean, consistent network configurations that their OSS couldn’t interpret correctly, because it was built against the old configuration model and nobody updated the integration when standards changed. Revenue leakage that stays invisible until someone runs a full audit. A standards problem wearing a billing problem’s clothes.

24/7 NOC operations add their own dimension. Rotating shifts require procedures that work for every engineer on every shift, not just the senior person who wrote the runbook. Non-standardized configurations mean non-standardized runbooks. Troubleshooting steps depend on which config variant you happen to encounter. That’s not a training problem. It puts a ceiling on NOC performance no matter how good your people are.

An Analysys Mason study on CSP automation found 75% of communications service providers struggle with single points of human failure in their automation: specific engineers whose institutional knowledge is the only thing keeping it running. They called the result “snowflake clouds” — environments where every element is slightly different and repeatable automation is structurally impossible. That’s what happens when you bolt automation onto an unstandardized foundation.

Progress Over Perfection, Governance Over Projects

Organizations that stall on standardization almost always make the same mistake: they treat it as a project with a completion date. Scope it, staff it, run it for a year, declare partial victory, move on. Eighteen months later the network has drifted back. Because the project ended and network operations didn’t.

Standardization is a governance problem. It needs ongoing enforcement: design reviews that check new deployments against defined standards, compliance tooling that surfaces drift before it compounds, change management that rejects changes violating the intent model. None of that is complicated. All of it requires sustained discipline over years.
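
Here’s what the drift-surfacing piece can look like at its most stripped down: render the intended lines from the design, collect the live lines from the device (collection omitted here), and diff the two. A real tool would parse and normalize configuration rather than compare raw text; this sketch, with made-up example values, only shows the shape of the check.

```python
# Minimal drift check: compare intended configuration lines (rendered from
# the design) against live lines (collected from the device). Deliberately
# naive; real tooling parses and normalizes rather than diffing raw text.

def find_drift(intended: set[str], live: set[str]) -> dict[str, set[str]]:
    return {
        "missing_from_device": intended - live,
        "unexpected_on_device": live - intended,
    }

# Hypothetical values for illustration only.
drift = find_drift(
    intended={"description CUST:ACME SITE:DAL01 BW:100M"},
    live={"description acme dallas temp-fix 2021"},
)
for kind, lines in drift.items():
    for line in sorted(lines):
        print(f"{kind}: {line}")
```

Run something like this on a schedule, route anything non-empty into the change queue, and drift surfaces before it compounds. That loop is the governance, not a one-off project.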

The starting point doesn’t need to be comprehensive. Sufficient standardization in a specific domain is enough to start automating reliably in that domain. Start with the service type that changes most often, or the infrastructure carrying the most customer traffic, or wherever non-standardization is causing the most pain right now. Build the standard, enforce it, automate against it, then expand.

What you’re building toward, in Schulman’s framing, is a network where the design is always current and the operational state always reflects it. The gap between design intent and operational reality, crossed every time someone makes a change without updating the source of truth, is where unreliability lives. Closing that gap is a standards problem before it’s an automation problem.

Strategic Discipline

Service providers and infrastructure operators who get this right don’t get there because they have better tools or bigger teams. They get there because leadership treated standardization as a strategic discipline, maintained governance over years, and kept connecting the work back to what actually mattered: faster provisioning, cleaner NOC operations, and automation that runs reliably at 2am without a senior engineer on standby.

That’s what’s on the other side. It’s worth doing right.


Ready to move from diagnosis to action? Khadga Consulting works with service providers, data centers, and infrastructure operators on the technical, operational, and organizational dimensions of network transformation: standards, automation, culture, and all the uncomfortable parts in between. Let’s talk.
