When I start a new engagement, one of the first things I ask for is a current network inventory. Not a diagram. Not a design doc. Just a list of active devices on the network.
More often than I would like to admit, the operator can’t give me one.
This is most common in regional ISPs and small-to-mid-sized data center operators, but the underlying problem appears at organizations of every size. When that inventory doesn’t exist, I spend early days doing archaeology instead of strategy, reconstructing what should already be known. And in every one of those environments, there’s a longer list of things that can’t happen because of it: security updates falling behind, support contracts covering devices that no longer exist or missing devices that do, and automation sitting on the shelf because you can’t automate what you don’t know you have.
Documentation isn’t just technical hygiene. For service providers and infrastructure operators, it’s the foundation that every other operational capability is built on. A bad foundation leaves everything above it compromised.
You Can’t Automate What You Don’t Understand
The foundational argument of intelligent networking is simple: you cannot effectively automate what you don’t understand, and you can’t fully understand what you haven’t documented.
This sounds obvious until you look at how most operators actually work. Traditional network engineering culture treats documentation as an afterthought, as something you do after the configuration is in place, if you do it at all. Sure, there is probably a new design doc for that new project. But did anyone go back and update how it was actually built, how it changed over time, how it now deviates from the original design? Be honest.
The result is a growing gap between what’s in the documentation and what’s on the network. Eventually the documentation is so stale that no one trusts it, so it stops being used, so it gets even more stale. I’ll bet that cycle is familiar to every infrastructure leader.
The transformation to intelligent networking inverts this relationship. Documentation stops being the byproduct of configuration and becomes the driver of it. That’s a significant cultural shift, and it’s one of the harder ones in our industry. But it’s also where a lot of the returns come from.
Why This Is Harder for SPs and DC Operators
Every network has a documentation problem. Service providers, data center operators, and infrastructure operators have a harder version of it.
The obvious reason is scale. Hundreds or thousands of devices across multiple points of presence, multi-vendor environments, physical plant complexity that most other organizations never approach. But scale alone doesn’t explain why it’s harder. The harder part is the layers.
SP and DC networks aren’t just L2/L3 environments with some routing on top. They span optical and DWDM infrastructure, IP/MPLS cores, BGP peering fabrics, segment routing, EVPN/VXLAN, customer-facing provisioning layers, and physical plant documentation covering fiber, cabling, rack, and power. Each layer has its own documentation requirements. And the relationships between layers are where things get genuinely painful.
When you don’t have cross-layer documentation, a change at one layer creates uncertainty at every other layer. Does that DWDM circuit change affect IP connectivity? How? For which customers? If you can’t answer those questions before you make changes, you’re accepting unnecessary risk every time you touch the network.
The customer-facing layer creates its own specific problem. I’ve seen it at operators of all sizes: billing gets turned off, but the service doesn’t. The customer stops paying, but the traffic keeps flowing. This isn’t a rare edge case. It’s a pattern, and it’s a direct consequence of operational and billing systems that aren’t grounded in a common, accurate source of network truth. When your documentation is a mess, that kind of revenue leakage stays invisible until someone does a full audit.
BGP peering documentation is another area where the stakes are specifically SP-sized. It’s not just knowing who your peers are. It’s knowing whether your traffic engineering policies are achieving the traffic flows you intend, whether you’re paying for transit you could replace with settlement-free peering, whether your routing policies match your commercial agreements. A peering audit is its own discipline, and it depends entirely on documentation accurate enough to reason from.
The Tribal Knowledge Trap
Another challenge I find in nearly every environment, regardless of size or sophistication: the people know things that aren’t written down anywhere.
A senior engineer knows that a particular core router has a quirk that was never documented because it was a workaround for a problem three years ago that nobody filed a ticket for. A NOC supervisor knows which vendor’s TAC is actually responsive and which takes three escalations to get anywhere useful. The lead architect carries the entire logical topology in their head because the diagram hasn’t been updated since a major migration two years ago.
Organizations tell themselves this is fine because those people are still there. And they’re right, until they’re not.
Someone goes on vacation. Someone leaves for a competitor. Someone gets promoted and stops doing hands-on work. At that point, troubleshooting times stretch. Not because the network got harder, but because the institutional knowledge that was filling the gaps in the documentation has walked out the door.
The dependence on tribal knowledge isn’t just a retention risk. It’s a daily operational cost that’s easy to overlook because it’s baked into how long things take. When any trained engineer can troubleshoot a problem using accurate documentation, you have a resilient operation. When only specific people can troubleshoot specific things because they’re the only ones who know how it actually works, you have a fragile one.
What to Document
Building a solid documentation foundation means getting comprehensive across four areas. Here’s what that looks like at a high level:
Device inventory is the starting point. Hostnames, IP addresses, make, model, location, access methods, but enriched beyond the basics. Lifecycle information like purchase dates, warranty status, and end-of-life dates drive maintenance and upgrade decisions. Patch levels and vulnerability status feed your security posture. Software and licensing details inform support contract rationalization. Change history ties the device record to the operational record. For SP and DC operators, this inventory needs to cover the entire stack from physical power and optical gear up to DNS resolvers, routers, and everything in between.
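To make this concrete, here’s a minimal sketch of what an enriched inventory record might look like as structured data. The field names and device details are illustrative, not a NetBox or Nautobot schema; the point is that lifecycle data becomes queryable the moment it’s structured:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Hypothetical schema -- field names are illustrative, not any tool's data model.
@dataclass
class DeviceRecord:
    hostname: str
    mgmt_ip: str
    make: str
    model: str
    site: str
    os_version: str = "unknown"
    eol_date: Optional[date] = None          # end-of-life drives upgrade planning
    warranty_expires: Optional[date] = None  # feeds support contract rationalization
    change_log: list = field(default_factory=list)  # ties device to operational record

    def is_past_eol(self, today: date) -> bool:
        """A structured record lets you answer lifecycle questions in code."""
        return self.eol_date is not None and today >= self.eol_date

# Illustrative usage: flag a core router that aged out.
core1 = DeviceRecord("core1.pop1", "10.0.0.1", "Cisco", "ASR-9903", "POP1",
                     eol_date=date(2024, 1, 1))
core1.is_past_eol(date(2025, 6, 1))  # True
```

Once every device is a record like this, “which devices are past end-of-life?” stops being a research project and becomes a one-line filter.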
Network topology needs to be documented at multiple layers and, critically, with the cross-layer relationships mapped. Physical connectivity including cables, ports, interfaces, fiber plant, and DWDM systems. L2/Ethernet topology including switching domains and loop prevention. L3/IP topology including routing domains, subnets, gateways, and protocols. BGP peering topology including neighbor relationships, routing policies, and traffic engineering intent. For data center operators, rack layout, physical connectivity, power, and cooling are part of this picture, which is where DCIM tooling earns its place. IP address management (IPAM) belongs here too. For operators managing large IP space across multiple customers and services, IPAM isn’t optional. It’s core infrastructure.
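Once those cross-layer relationships are captured as data, the question from earlier (“does this DWDM circuit change affect IP connectivity, and for which customers?”) becomes a lookup rather than an investigation. A toy sketch, with hypothetical circuit, link, and service names; a real source of truth would model these as relations, not nested dicts:

```python
# Illustrative cross-layer map: which IP links ride each optical circuit,
# and which customer services ride each IP link.
circuit_to_ip_links = {
    "dwdm-ckt-101": ["core1:et-0/0/0 <-> core2:et-0/0/0"],
}
ip_link_to_services = {
    "core1:et-0/0/0 <-> core2:et-0/0/0": ["cust-acme-l2vpn", "cust-beta-transit"],
}

def blast_radius(circuit: str) -> set:
    """Answer: if this circuit changes, which customer services are affected?"""
    impacted = set()
    for link in circuit_to_ip_links.get(circuit, []):
        impacted.update(ip_link_to_services.get(link, []))
    return impacted
```

With the mappings in place, `blast_radius("dwdm-ckt-101")` returns the affected services before anyone touches the network, which is exactly the pre-change certainty the documentation is supposed to provide.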
Device configurations need to be stored, versioned, and tracked – not just current state, but change history. This becomes your audit trail, your incident response foundation, and your automation source material.
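With configurations versioned, the change history is a diff away. A sketch using Python’s standard difflib on two illustrative snapshots, the way a version-controlled config store (Oxidized plus git, for example) would hold them:

```python
import difflib

# Two snapshots of the same device's config; contents are illustrative.
config_v1 = """hostname core1
ntp server 10.0.0.5
snmp-server community public RO
"""
config_v2 = """hostname core1
ntp server 10.0.0.5
ntp server 10.0.0.6
"""

def config_diff(old: str, new: str) -> list:
    """Unified diff between two stored config versions: your audit trail."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="v1", tofile="v2", lineterm=""))

for line in config_diff(config_v1, config_v2):
    print(line)
```

During an incident, “what changed on this device, and when?” is answered by this diff instead of by memory.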
Network and service designs document both current state and intent. This matters because the as-built state of a network almost always diverges from the design over time. Capturing both, and tracking the divergence, is how you eventually close that gap systematically. Design documentation also enables a more mature automation posture: design-driven automation, where changes are derived from and validated against documented designs rather than ad hoc configurations.
Building Your Source of Truth
Getting all of this into a usable state requires the right tooling, not heroic manual effort.
NetBox, Nautobot, and InfraHub are the leading network source of truth platforms. These are purpose-built data stores for network information that can serve as the authoritative record for your inventory, topology, configurations, and designs. All three support discovery-driven data ingestion (whether through native tooling, add-on apps, or integrations with dedicated discovery platforms) so that your source of truth can be built and maintained from actual network state rather than manual entry.
There are other ways to overcome the challenge of getting accurate data into these systems and keeping it current. Tools like Auvik, Forward Networks, Gluware, IP Fabric, Kentik, Itential, NetBrain, Slurp’it, and Stardust Systems cover different parts of the discovery and documentation automation problem. The most capable can deploy distributed discovery agents across complex multi-site environments, detect drift between what’s documented and what’s actually running, process and validate collected data, and generate visualizations directly from your data stores.
On the open source side, Nmap is a solid starting point for network discovery at no cost. Oxidized handles automated configuration backup and has been a workhorse in our industry for years.
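Nmap’s grepable output format folds easily into a discovery pipeline. A small sketch of pulling live hosts out of a ping sweep (something like nmap -sn -oG - against a management subnet); the sample output below is illustrative:

```python
# Sample grepable output from a ping sweep; addresses and hostnames are made up.
sample_output = """# Nmap 7.94 scan initiated
Host: 10.0.0.1 (core1.example.net)\tStatus: Up
Host: 10.0.0.9 ()\tStatus: Down
"""

def parse_nmap_grepable(output: str) -> list:
    """Extract the IPs of live hosts from `nmap -oG` output."""
    hosts = []
    for line in output.splitlines():
        if line.startswith("Host:") and "Status: Up" in line:
            hosts.append(line.split()[1])
    return hosts

parse_nmap_grepable(sample_output)  # ['10.0.0.1']
```

Feed that list into your source of truth and you have the seed of a discovery loop for nothing but an afternoon’s scripting.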
None of these tools eliminate the need for good process and clear ownership. But they do dramatically reduce the manual burden. More importantly, they enable something that most operators miss entirely: read-only automation as an entry point.
Read-Only Automation: The Entry Point You’re Missing
Most conversations about automation lead with the exciting capabilities: pushing configs, deploying services, responding to events autonomously. The unglamorous reality is that the most valuable first use of automation for most operators isn’t configuration management. It’s documentation.
Read-only automation means using automation to discover, capture, and continuously update your network documentation without touching any live configuration. You’re querying devices, not changing them. The risk profile is much lower than with write operations. And the return is immediate: you start building a source of truth that reflects the actual state of the network rather than what someone thought was true the last time they updated a spreadsheet.
This matters because it dissolves the sequencing problem that stops a lot of organizations cold. You don’t have to solve documentation before you can start automating. You can use automation to solve documentation. The two efforts reinforce each other from day one.
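In code, read-only drift detection can be as simple as comparing two sets: what the source of truth says exists versus what discovery actually found. Nothing is configured, nothing is changed; hostnames here are hypothetical:

```python
# Read-only drift check: no device is touched, only compared.
def inventory_drift(documented: set, discovered: set) -> dict:
    """Compare the documented inventory against discovered reality."""
    return {
        "undocumented": discovered - documented,  # on the wire, not in the record
        "ghosts": documented - discovered,        # in the record, not on the wire
    }

# Illustrative run: one undocumented edge device, one decommissioned ghost.
drift = inventory_drift(
    documented={"core1", "core2", "old-sw1"},
    discovered={"core1", "core2", "edge3"},
)
```

The “ghosts” set is where you find the decommissioned devices still carrying support contracts; the “undocumented” set is where the security surprises live.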
Once you have a reliable, continuously updated source of truth, you unlock a cascade of real capabilities. Troubleshooting gets faster because any engineer can reason about the network from accurate documentation rather than tracking down whoever was there when it was built. Security hygiene improves because you know what you have and can track patch status systematically across a known inventory. Support contracts get rationalized because you’re not paying for coverage on decommissioned devices or missing coverage on active ones. Config audits become possible, where you can compare running configurations against defined standards and find deviations at scale. (I’ll address that standardization challenge directly in a future article.) New service enablement becomes meaningfully faster because building on top of a known current state is categorically different from building on top of an assumed one.
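A config audit in its simplest form is also just set arithmetic: which required lines is each device missing? The “standard” below is illustrative, but the pattern scales to thousands of devices once configs are collected centrally:

```python
# Hypothetical hardening standard -- the lines every device must carry.
REQUIRED_LINES = {
    "service password-encryption",
    "no ip http server",
}

def audit(running_config: str) -> set:
    """Return the required lines absent from this device's config."""
    present = {line.strip() for line in running_config.splitlines()}
    return REQUIRED_LINES - present

# Illustrative run: this device is missing one hardening line.
audit("hostname edge1\nservice password-encryption\n")  # {'no ip http server'}
```

Run that across a collected config repository and deviations surface in seconds instead of surfacing during the next incident.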
And then there’s alerting and root cause analysis. Monitoring tools depend on topology context to do their jobs. They can only be as accurate as the topology data you feed them. Inaccurate documentation means degraded alerting, degraded RCA, and longer MTTR. That connection is direct, and the impact compounds every time something goes wrong on your network.
AI Inherits Your Documentation Debt
The pitch is that AI will figure out your network for you. That you can point it at a complex environment and it will develop the understanding that your documentation currently lacks. Some tools do remarkable things with discovery and analysis. But the fundamental requirement hasn’t changed, nor have the actual methods – there’s no magic here.
LLMs can only “reason” about what they can observe or what they’re told. If your inventory is incomplete, they’re reasoning about an incomplete network. If your topology documentation doesn’t include the optical layer, they can’t correlate optical events with IP impacts. If configurations haven’t been collected, they’re guessing at configuration state. Just like any other automation tools, you can use AI to help document your network, but you still want documentation – you don’t want an LLM guessing about your network every time you ask it a question – you want the ground truth established as context ahead of that query.
The same foundation that documentation-driven automation provides, AI-driven operations requires. Clean, comprehensive, current documentation isn’t something AI replaces. It’s what it needs to be useful. If you’re planning to adopt AI-driven network operations and your documentation is a mess, you’re not ready. The mess doesn’t disappear because you added an AI layer on top of it.
Why Documentation Initiatives Fail
Most operators I’ve worked with have attempted to improve their documentation at least once. Most of those efforts eventually stalled. The failure modes are consistent.
The most common one is treating documentation as a project rather than a process. A team spends weeks or months building out a source of truth, declares success, and moves on to the next priority. Six months later the documentation is stale. A year later it’s back to being an undependable mess. Documentation that isn’t maintained continuously isn’t really documentation. It’s a snapshot that decays.
The second failure mode is relying on engineer goodwill to keep things updated. Engineers are busy. Documentation competes with every other demand on their time and almost always loses unless it’s integrated into workflows in a way that makes skipping it harder than doing it. This is partly an automation problem. Automated configuration backup means configs get captured regardless of whether someone remembers to update them. Discovery tooling means topology drift gets detected without requiring manual checks. The goal is to make keeping documentation current the path of least resistance, not the path that demands the most discipline.
The third is lack of clear ownership. Documentation needs an owner. Not a committee, not a vague shared responsibility, but a person or team accountable for its accuracy and completeness. Without that, it slips.
And underpinning all of this: without executive sponsorship, documentation will always lose the prioritization battle. When the CTO or VP of Engineering treats accurate network documentation as an operational requirement rather than a nice-to-have, the culture follows. When they don’t, engineers take the cue.
The Foundation Comes First
I won’t pretend this is simple. Getting network documentation into shape across a real SP or DC environment takes real effort.
But the framing matters. Documentation isn’t a prerequisite you have to suffer through before the interesting work begins. It’s how the interesting work becomes possible. To unlock faster troubleshooting, security hygiene, automation at scale, AI-driven operations, and new service velocity, you must first build an accurate, current, comprehensive understanding of the network. Without that foundation, you’re building on sand.
Start with read-only automation and a data store. Instrument discovery. Get inventory and topology into a system you can actually trust, and build the process for keeping it current into your operational workflows before you build anything else on top of it.
You can’t automate what you don’t understand. You can’t understand what you haven’t documented. And you can’t afford to keep operating as if that’s someone else’s problem.
If you’re working through these challenges and want to compare notes, the Network Automation Forum is where those conversations are happening.
And if you’re ready to move from knowing the problem to fixing it: Khadga Consulting helps service providers and infrastructure operators build the operational foundation that automation and AI actually require. Documentation, “source of truth,” the processes that keep it current, and the organizational habits that make it stick. Let’s talk.