Repost: Mitigating Human Error in a Partially Automated Network and Reconciling Switch Config with the ERP System? (from Reddit.com)
Hi All,

tl;dr - automation is hard when there are still people in your network and you have to integrate with legacy ERP systems. How did you solve for issues related to that?

I was wondering if I could pick your collective super-brain about some growing pains I've hit and the next steps I should consider? I work for a data center company with a continental EVPN-VXLAN. It spans over 100 devices in 49 data centers across 12 cities. I've spent the last 18 months writing automation around PyEZ, primarily to provision VTEPs connecting customers to the major cloud service providers, to each other and to themselves.

The tool also handles the CSP circuit provisioning and can do other day-to-day operational stuff like JunOS upgrades, batch config, ESI LAG management, etc. So it's pretty cool. I've actually never been so proud of a piece of work in my 25-year career. It's in production. It's successfully provisioned hundreds of ports and EVCs. But I keep running into physical, people and process problems that degrade its all-around awesomeness.

What's your experience been like in the problem areas below and how have you solved them? Is there a tool out there today that could solve all this so I don't have to spend the rest of my life on it? Is this what Apstra or NocProject is for?

- Datacenter operations staff frequently cross-wire redundant connections - like often enough that I wonder if they think it's hilarious.
- Another favorite game is using the wrong SFP.
- The automation isn't managing the EVPN fabric yet and I think there's been a config issue every time a new device has been added. We don't add new devices that often, but velocity is increasing.
- Anytime a support case comes in - I'm gonna say *always* due to physical or customer-side issues - the network-support team can't resist fiddling with the automated, template-based, known-good config that is currently operational in hundreds of other ports and thousands of EVCs. I get it, honestly. Something's not working and you can't figure out why, so you try everything. But honestly guys, it's not the config!
- Sometimes there are legitimate reasons for config changes not yet supported in the automation: port moves or VLAN changes.
- In either of the previous two cases, there's no mechanism yet to sync config changes back up to the ERP system. And no matter how many times you ask, network engineers won't do business records apparently - because there's some other priority issue they need to go fix. But the ERP system serves as the source of truth for the automation, so unsynchronized changes mean the automation is broken the next time it needs to interact with a given port or EVC.
- Network Engineering maintains a list of Spines and Leaves in Confluence, which is what the automation uses for connectivity checks. But with alarming frequency, the IP is different on the ERP equipment record, or the equipment record doesn't exist at all.
- No one can seem to keep the interface records fresh in the ERP system. AE interfaces don't get added. Physical interfaces don't get set to in-use/available if [de]provisioned by a person and not the automation system.
- On the flip side to that, manual disconnects might get handled in the ERP system to close out the billing, but no one ever bothers to clear out the config on the switch.

I'd love to hear your thoughts. Essentially I think I can solve for people and process issues a couple of ways: with a synchronizing poller and with a custom interface with guard rails. But surely I don't still have to write that myself in 2022?

I think I can solve for physical issues with automated checks requiring customer-side config changes at port provisioning time, but that adds complexity and time to every new port order when there are only issues a small percentage of the time. And the whole point of this system is rapid provisioning on-demand.

Thanks in advance. Sorry that was super long.
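
To make the synchronizing-poller idea above a bit more concrete, here is a minimal Python sketch of the comparison step only. It is an illustration under assumptions, not the actual tool: `PortRecord`, `PortState` and `reconcile` are hypothetical names, and the data shapes are invented for the example. A real poller would fill `observed` from the switches (for instance through PyEZ) and `erp` from the ERP system's API; neither fetch is shown here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PortRecord:
    """One interface as the ERP system records it (hypothetical shape)."""
    device: str
    interface: str
    in_use: bool


@dataclass(frozen=True)
class PortState:
    """One interface as observed on the switch (hypothetical shape)."""
    device: str
    interface: str
    configured: bool  # True if the port carries customer config


def reconcile(erp: Iterable[PortRecord], observed: Iterable[PortState]) -> List[str]:
    """Return human-readable findings where the ERP and the switch disagree."""
    erp_by_key = {(r.device, r.interface): r for r in erp}
    obs_by_key = {(s.device, s.interface): s for s in observed}
    findings = []
    # Walk the union of both views so one-sided entries are reported too.
    for key in sorted(erp_by_key.keys() | obs_by_key.keys()):
        rec = erp_by_key.get(key)
        state = obs_by_key.get(key)
        name = f"{key[0]}:{key[1]}"
        if rec is None:
            findings.append(f"{name}: on switch but missing from ERP")
        elif state is None:
            findings.append(f"{name}: in ERP but not found on switch")
        elif rec.in_use and not state.configured:
            findings.append(f"{name}: ERP says in-use but switch has no config")
        elif not rec.in_use and state.configured:
            findings.append(f"{name}: ERP says available but stale config remains")
    return findings


if __name__ == "__main__":
    # Made-up sample data: an unrecorded AE interface, a manual disconnect
    # that was closed in the ERP only, and a record with no matching port.
    erp = [
        PortRecord("leaf1", "xe-0/0/1", True),
        PortRecord("leaf1", "xe-0/0/2", False),
        PortRecord("leaf1", "xe-0/0/3", True),
    ]
    observed = [
        PortState("leaf1", "xe-0/0/1", True),
        PortState("leaf1", "xe-0/0/2", True),
        PortState("leaf1", "ae0", True),
    ]
    for finding in reconcile(erp, observed):
        print(finding)
```

The sketch only reports drift rather than fixing it, on the assumption that a person should decide whether the switch or the ERP record is the one that's wrong before anything gets written back.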