Zero Touch Provisioning with DHCP (KEA)

In my last blog I wrote about building ZTP configurations for SONiC switches using the data in NetBox. This post will explain how those configurations are served to these devices without the need for any human intervention.

The key elements are:

  • DHCP Server (to receive requests from new devices and point to software/config locations)
  • KEA class definitions (unique for each vendor and platform)
  • HTTP Server (send firmware)
  • GIT Server (send configuration files)
  • ztp.json file that defines the final steps needed to fully configure a SONiC switch

The ZTP process is as follows:

  • New switch is plugged into the network and DHCP request is sent requesting software (ONIE)
  • DHCP server determines hardware vendor/type and sends the correct software to switch
  • Switch loaded with new software, reboots & requests configuration file from DHCP server
  • DHCP server responds with the correct configuration file (based on serial number)
  • Switch applies the configuration file and reboots before completing ZTP
  • Switch looks at ZTP configuration file for further configuration instructions and/or QA

KEA example of the various classes that define the different vendors and platforms. This is just a list of all the classes currently defined in KEA.

Drilling down into the class for Edgecore we can see the location of the software file and also the vendor and platform ID which determines the actual hardware type.

Class definition for Wistron. The unique HEX value is how the DHCP server can identify the correct hardware.
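
For readers who cannot see the screenshots, a vendor class of this kind can be sketched roughly as follows. This is an illustration under assumptions, not the actual lab config: the class name, platform string, and installer URL are made up. It relies on ONIE's documented behavior of sending `onie_vendor:<platform>` in DHCP option 60 and reading the installer location from option 114; depending on the KEA version, option 114 may first need its own option definition.

```python
import json


def vendor_hex(vendor_class: str) -> str:
    """Return the option 60 string as a KEA hex literal (0x...)."""
    return "0x" + vendor_class.encode("ascii").hex()


def onie_class(name: str, vendor_class: str, installer_url: str) -> dict:
    """Build one KEA client-class entry that matches a platform on option 60."""
    return {
        "name": name,
        # Compare the raw bytes of option 60 against the unique HEX value.
        "test": f"option[60].hex == {vendor_hex(vendor_class)}",
        # Option 114 carries the URL ONIE downloads the installer from.
        "option-data": [{"code": 114, "data": installer_url}],
    }


# Hypothetical platform string and URL, for illustration only.
example = onie_class(
    "EXAMPLE_PLATFORM",
    "onie_vendor:x86_64-example_platform-r0",
    "http://192.0.2.10/images/sonic-example.bin",
)
print(json.dumps({"client-classes": [example]}, indent=2))
```

Generating the hex literal from the vendor string avoids hand-typing it, and decoding it back is an easy sanity check before loading the class into KEA.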

Once the switch has the correct software applied and has rebooted, then the ZTP process can begin. All switches default to the name “sonic” after having the software upgraded so this is the key piece of info the DHCP server is looking for in this step of the process.
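
A class for this second stage might look something like the sketch below. It assumes the switch announces the name "sonic" in the DHCP host-name option (option 12) and that SONiC ZTP picks up the ztp.json URL from option 67 (boot-file-name), as the SONiC ZTP documentation describes; the class name and URL are placeholders.

```python
import json


def sonic_ztp_class(ztp_json_url: str) -> dict:
    """Build a KEA client-class that answers freshly imaged SONiC switches."""
    return {
        "name": "SONIC_ZTP",  # hypothetical class name
        # Match the default hostname sent in DHCP option 12.
        "test": "option[12].hex == 'sonic'",
        # Option 67 tells SONiC ZTP where to fetch ztp.json.
        "option-data": [{"name": "boot-file-name", "data": ztp_json_url}],
    }


kea_fragment = {
    "client-classes": [sonic_ztp_class("http://192.0.2.10/ztp/ztp.json")]
}
print(json.dumps(kea_fragment, indent=2))
```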

This is the ztp.json file that contains the instructions to configure the switch. The configs are stored on git and the serial number is the unique identifier to ensure the correct config ends up on the proper device. In addition to the main configuration file, there is also the routing portion of the config (FRR). Finally there is a basic connectivity check via ping.

You can add any instructions you might need for your infra. The first step in the ZTP process was a password change and setting a sleep timer.
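
As a rough sketch of what such a file can contain, the snippet below builds a ztp.json with the steps described above. Section names, URLs, and the ping target are invented for illustration. It assumes the SONiC ZTP conventions that sections run in sorted order of their names, that `dynamic-url` can build a per-device URL from the serial number, and that a `plugin` entry with a `url` downloads and runs a user script.

```python
import json

GIT_BASE = "http://git.example.com/ztp/"  # hypothetical config repo URL

ztp = {
    "ztp": {
        # Step 1: user script, e.g. password change and sleep timer.
        "01-initial-setup": {
            "plugin": {"url": GIT_BASE + "scripts/initial_setup.sh"}
        },
        # Step 2: main config, selected by the switch serial number.
        "02-configdb-json": {
            "dynamic-url": {
                "source": {
                    "prefix": GIT_BASE + "configs/",
                    "identifier": "serial-number",
                    "suffix": "_config_db.json",
                }
            }
        },
        # Step 3: routing (FRR) portion of the config via a user script.
        "03-frr-config": {
            "plugin": {"url": GIT_BASE + "scripts/install_frr.sh"}
        },
        # Step 4: basic connectivity check via ping.
        "04-connectivity-check": {"ping-hosts": ["192.0.2.1"]},
    }
}

print(json.dumps(ztp, indent=2))
```

Numbering the section names keeps the execution order obvious when more steps are added later.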

That’s all there is to Zero Touch Provisioning a SONiC switch using KEA DHCP server. I will write another blog post on KEA in the near future since it’s very helpful when managing infrastructure.


Infrastructure as Code: Using NetBox and Jinja2 to build Zero Touch Provisioning for SONiC switches

When I was hired at eBay one of the first tasks I was given was to find an easy way to swap switches within a rack to help facilitate testing of various SONiC platforms. I knew this as ZTP from my years as a network engineer but I was not sure how it was all going to work considering the Hardware Engineering team had no automation at the time with regards to infrastructure. They were still working off spreadsheets and wiki pages, all manually updated.

The first step to realize the goal of ZTP required a “Source of Truth” for all of the assets in the hardware lab. Luckily I had just watched a NANOG video that talked about the open source tool called NetBox. I set up NetBox in my first week and started the process of adding in all of the lab assets. At first we entered a lot of stuff by hand, but soon all of the info was collected via Python automation using Nornir and Netmiko. Nornir is a Python automation framework with plugins for NetBox that greatly simplifies the process of connecting to various devices within the infrastructure.

Zero Touch Provisioning is simply automating the configuration and deployment of an end device like a network switch. There are several steps in the process but one critical piece is building the configurations. If you have several of the same devices in your infrastructure, your configs will have a static portion that is the same across all devices and a dynamic portion that is unique to each device. I will be showing off how I built the dynamic portions of the configuration using Infrastructure as Code concepts achieved with open source tools like NetBox and the Jinja2 template language.

Dynamic “interface” section of the SONiC switch configuration using Jinja2

Final rendered dynamic interface configuration for the SONiC switch
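
Since the screenshots are not reproduced here, the following is a minimal sketch of the idea rather than the original template. It assumes the SONiC config_db.json layout in which each routed port appears in the `INTERFACE` table once by name and once as `name|prefix`; the interface names and addresses stand in for data that would come from NetBox.

```python
import json

from jinja2 import Template

INTERFACE_TEMPLATE = Template("""\
{
  "INTERFACE": {
{%- for intf in interfaces %}
    "{{ intf.name }}": {},
    "{{ intf.name }}|{{ intf.address }}": {}{{ ',' if not loop.last else '' }}
{%- endfor %}
  }
}
""")

# Hypothetical records; in practice these would be pulled from NetBox.
interfaces = [
    {"name": "Ethernet0", "address": "10.0.0.0/31"},
    {"name": "Ethernet4", "address": "10.0.0.2/31"},
]

rendered = INTERFACE_TEMPLATE.render(interfaces=interfaces)
config = json.loads(rendered)  # fails loudly if the template emits bad JSON
print(rendered)
```

Parsing the rendered text with `json.loads` is a cheap guard against a stray comma in the template before the config ever reaches a switch.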

Dynamic “port” section of the SONiC switch configuration using Jinja2

Final rendered dynamic port configuration for the SONiC switch 10G, 25G, 100G & 400G ports
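
A port section can be sketched in the same way. The example below is an assumption-laden illustration, not the original template: it maps NetBox interface type slugs to the speed value SONiC expects (in Mbps), uses an MTU of 9100 as a placeholder, and leaves out `lanes` and `alias` because those are platform specific.

```python
import json

from jinja2 import Template

# NetBox interface type slugs mapped to SONiC port speeds in Mbps.
SPEED_MBPS = {
    "10gbase-x-sfpp": 10000,
    "25gbase-x-sfp28": 25000,
    "100gbase-x-qsfp28": 100000,
    "400gbase-x-qsfpdd": 400000,
}

PORT_TEMPLATE = Template("""\
{
  "PORT": {
{%- for port in ports %}
    "{{ port.name }}": {
      "admin_status": "up",
      "description": "{{ port.description }}",
      "mtu": "9100",
      "speed": "{{ speeds[port.type] }}"
    }{{ ',' if not loop.last else '' }}
{%- endfor %}
  }
}
""")

# Hypothetical port records standing in for NetBox data.
ports = [
    {"name": "Ethernet0", "type": "400gbase-x-qsfpdd", "description": "uplink-1"},
    {"name": "Ethernet8", "type": "100gbase-x-qsfp28", "description": "server-1"},
    {"name": "Ethernet12", "type": "25gbase-x-sfp28", "description": "server-2"},
]

port_config = json.loads(PORT_TEMPLATE.render(ports=ports, speeds=SPEED_MBPS))
print(json.dumps(port_config, indent=2))
```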

Dynamic “VLAN” section of the SONiC switch configuration using Jinja2

Final rendered dynamic VLAN configuration for the SONiC switch 25G, 100G & 400G ports


NetBox Shell Examples with Jupyter Notebook

When I started working with the hardware team at eBay I was shocked to learn that they were tracking ALL of their assets with Wiki pages and spreadsheets. Everything was static: all of the IP addresses, the hardware locations, serial numbers, etc. Let me tell you, it was a nightmare to try and manage that infrastructure, or even use the lab for that matter.

One of the things I was most proud of during my time at eBay was how I was able to transform the entire lab with the help of NetBox and a whole lot of Python. In the end I was using NetBox to build real time ZTP configurations for SONiC switches using the data in NetBox along with Jinja2 templates for true Infrastructure as Code deployments.

Infrastructure as Code example using NetBox and Jinja2 to build ZTP SONiC configs

I brought that lab into the 21st century and transformed it in such a manner that we could offer Lab as a Service to other teams. If you have not used NetBox you would be AMAZED at what is possible with free and open source software.

I’ve posted this example on GitHub in case you have Jupyter Notebook and want to download it. I’m not going to cover the process of getting these tools working together in this post but it’s not too difficult. In this blog post I will just be using screenshots taken from the notebook I created.

List all of the NetBox Models that are available

Examine Shell Help for the IP Address Model

Import all the NetBox Models so we can start chaining results later

We will use the IP address to find out all kinds of other things using different models

We can use the IP address to get all the info about the interface it’s assigned to

We can also look at what device that IP is assigned to

Using the IP address we can see all of the details of the device it’s assigned to

Using the IP we can see all of the details of the device type by chaining the models
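
The chaining shown in these screenshots boils down to following attributes from one model to the next. In the real NetBox shell you would start from something like `IPAddress.objects.get(address="192.0.2.10/24")`; the sketch below instead uses small stand-in dataclasses (not the real Django models, and with `manufacturer` simplified to a string) so it can run anywhere. All names and values are invented.

```python
from dataclasses import dataclass


@dataclass
class DeviceType:
    manufacturer: str
    model: str


@dataclass
class Device:
    name: str
    serial: str
    device_type: DeviceType


@dataclass
class Interface:
    name: str
    device: Device


@dataclass
class IPAddress:
    address: str
    assigned_object: Interface  # NetBox links an IP to its interface here


# Hypothetical lab data.
switch = Device(
    name="lab-switch-01",
    serial="ABC123",
    device_type=DeviceType(manufacturer="ExampleVendor", model="Example-32x400G"),
)
ip = IPAddress("192.0.2.10/24", assigned_object=Interface("eth0", switch))

# IP -> interface -> device -> device type, one attribute at a time.
print(ip.assigned_object.name)
print(ip.assigned_object.device.name)
print(ip.assigned_object.device.device_type.model)
```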

Next we can dig into the help for the Device Type Model

You can work from the other direction and find IP from Device Name

You can define config contexts that the devices inherit based off user criteria

Digging into the Interface model various ways

Sync Data from Git to NetBox

Jinja2 Examples using NetBox data to render configs (Infrastructure as Code)


How I melted a data center switch and effected changes to an existing product line.

I was hired onto the Hardware Platform Engineering team at eBay just as they started getting ready to deploy their next-gen 400G network that was using SONiC, an open source network operating system. The team is in charge of validating all of the hardware that runs on the eBay infrastructure, so it’s pretty important to find issues BEFORE the gear gets deployed to Production.

This is the story about how a curious network engineer that knew very little about hardware was able to blow up a switch that was already in production and effect changes on the existing product line. Moreover, I was able to expose important operational issues that would have been a nightmare to support in the field.

Observability is critical when evaluating a system, be it a manual process, how hardware runs, or how software is supposed to work. You must have tests and feedback loops so that you can look for patterns or inconsistencies that can help you find potential landmines before they blow up in your infrastructure. In the examples below we used Prometheus to export the data off the switch and send it to Grafana, so we could see the data in real time.

Power discrepancy was immediately apparent between PSU1 and PSU2. I’m just a network engineer but I’ve dabbled in building some of my own hardware, so I’m already thinking something is not right here.

Jumping onto the switch and running the show commands verifies the huge delta between PSU1 / PSU2. This is NOT how power sharing is supposed to work! One important thing to note here is the model of the PSU1 & PSU2 is exactly the same but the serial numbers are different.
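
A feedback loop for this kind of problem can be as simple as comparing the two readings automatically. The sketch below is only an illustration: the wattages and the 20% threshold are hypothetical, and in the lab the readings came from Prometheus rather than hard-coded numbers.

```python
IMBALANCE_THRESHOLD = 0.20  # hypothetical: flag anything worse than a 60/40 split


def psu_imbalance(psu1_watts: float, psu2_watts: float) -> float:
    """Return the PSU power delta as a fraction of the combined output."""
    total = psu1_watts + psu2_watts
    if total <= 0:
        raise ValueError("combined PSU output must be positive")
    return abs(psu1_watts - psu2_watts) / total


def check_power_sharing(psu1_watts, psu2_watts, threshold=IMBALANCE_THRESHOLD):
    """Return (is_suspicious, imbalance) for a pair of PSU readings."""
    imbalance = psu_imbalance(psu1_watts, psu2_watts)
    return imbalance > threshold, imbalance


# Made-up readings: one PSU carrying three times the load of the other.
suspicious, imbalance = check_power_sharing(300.0, 100.0)
print(f"suspicious={suspicious} imbalance={imbalance:.0%}")
```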

After the switch blew up and as part of the root cause analysis, the switch vendor tells us to NEVER mix the 2 DIFFERENT power supply vendors, even though the part numbers are the same. Apparently, one of the manufacturers forgot to calibrate their power supplies after they retooled them to change the airflow to meet production demands!

Here is the picture of ONE of the chips that melted. In addition to the switch blowing up, we also lost 16 out of the 32 400G optics. Apparently the chip that handles the 12V to 3.3V conversion did not protect downstream devices from this power event; it effectively cascaded through the switch as it was in the process of melting down!

3 Chips EOS on main board and 1 chip EOS on the fan board

Another issue that was uncovered during these tests was fluctuating power temps, as can be seen here.

Apparently there were more quality control issues at the power supply factory: it turns out there was an empty weld causing the problem with the power temp readings.

The main takeaway here is that observability, pattern recognition, curiosity and attention to details are all very important skills to develop.

Nobody told me to mix the power supplies as part of my test regime but I thought that it was a worthwhile test since they had the EXACT same part number and should have been interchangeable!

As a result of these findings, eBay was able to make sure that we only received the GOOD power supplies, and word went out to watch for this potential issue. Can you imagine these things hitting production, someone doing a simple power supply swap, and then the Production switch blowing up?

Finally, as a result of these issues I helped to uncover, the switch has had its hardware and software upgraded to prevent this issue moving forward! They beefed up the power chips and added downstream protections to the optics that were fried in the process. Also the CEO of the company reached out to me to connect on LinkedIn 😉

Never stop asking questions or wondering why something is happening. Always build in feedback loops to ensure your systems are working as you expect them to. DON’T ASSUME THINGS, BE CURIOUS AND ASK QUESTIONS.


Snake Test: 4 Nodes @ 800Gbps = 25.6 Tbps

The Network team took our traffic generator so we needed to build our own using (4) servers with (8) 100G NICs. Here is the design I came up with to build our own “home brew” traffic generator used to saturate the switches while stress testing them in the thermal chamber.

8 x 100G Host NICs feeding the Snake Test: 48 Hours

2x400G Uplinks to Snake Test Switch: 48 Hours

Snake Test Switch 32x400G: 48 Hours

Snake Test Switch @ 25.6 Tbps: 48 Hours
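
The numbers in the title can be checked with a few lines of arithmetic. This sketch assumes each of the 4 nodes has 8 x 100G NICs (matching the 800Gbps per node in the title) and that the 25.6 Tbps figure counts traffic in both directions across all 32 x 400G switch ports; that second assumption is my reading of the figure, not something stated above.

```python
NODES = 4
NICS_PER_NODE = 8
NIC_GBPS = 100

SWITCH_PORTS = 32
PORT_GBPS = 400

# What each server, and the whole home-brew generator, can offer.
node_gbps = NICS_PER_NODE * NIC_GBPS
generator_gbps = NODES * node_gbps

# Snake test loops the traffic through every port on the switch.
switch_one_way_gbps = SWITCH_PORTS * PORT_GBPS
switch_both_ways_gbps = 2 * switch_one_way_gbps  # assumption: both directions

print(f"per node:        {node_gbps} Gbps")
print(f"generator total: {generator_gbps / 1000} Tbps")
print(f"switch one way:  {switch_one_way_gbps / 1000} Tbps")
print(f"switch both:     {switch_both_ways_gbps / 1000} Tbps")
```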


DNS over TLS over VPN and back

DNS over TLS is a big step in fixing a badly designed protocol, at least in terms of privacy, but it still leaves you having to trust the endpoint with all of that data that you are trying to keep private. While Cloudflare claims not to log your info or sell your data, you can never know for sure what’s going on at the far end.

To that point, I’ve decided to send all of my DNS traffic over my self-managed VPN that terminates on a server I rent in Canada. This is an extra step I take to break the direct connection from my home IP address. In fact there are ZERO DNS packets leaving my home network, port 53 or 853.

I’m running a DNS-TLS resolver on my pfSense firewall and connecting to 1.1.1.1 on port 853. I’m also explicitly blocking Google DNS 8.8.8.8/8.8.4.4 and QUIC protocol (UDP 443).
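
To make it concrete what the resolver sends on port 853, here is a minimal sketch of a DNS over TLS query using only the Python standard library. It is an illustration of the protocol, not what pfSense runs: a normal DNS message is prefixed with a two-byte length and sent over a TLS connection. The function names are my own, and `query_over_tls` talks to 1.1.1.1 directly rather than through a VPN.

```python
import socket
import ssl
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a plain DNS query (default: A record) with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def frame(message: bytes) -> bytes:
    """Prefix the message with its length, as DNS over TCP/TLS requires."""
    return struct.pack("!H", len(message)) + message


def _recv_exact(sock, count: int) -> bytes:
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed before full reply arrived")
        data += chunk
    return data


def query_over_tls(name: str, server: str = "1.1.1.1", port: int = 853,
                   timeout: float = 5.0) -> bytes:
    """Send one query over TLS and return the raw DNS reply bytes."""
    context = ssl.create_default_context()
    with socket.create_connection((server, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=server) as tls:
            tls.sendall(frame(build_query(name)))
            (length,) = struct.unpack("!H", _recv_exact(tls, 2))
            return _recv_exact(tls, length)


framed = frame(build_query("digg.com"))
print(len(framed), "bytes on the wire before TLS")
```

Calling `query_over_tls("digg.com")` needs network access, so it is left out of the example run above.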

While my DNS lookup times suffer greatly, the resolver caches entries so only the initial lookup is slow.

DNS Lookup to digg.com took 721 msec! (west coast to east coast and back)

The following lookup took 3 msec, much better!

Here are my pfSense settings where I’m sending the DNS over the VPN

More DNS Resolver settings

General DNS Settings:

DNS Traffic exiting my VPN Server in Canada

Blocking Google DNS and QUIC with pfSense

Read more about DoT and DoH at Cloudflare:

https://www.cloudflare.com/learning/dns/dns-over-tls/

https://blog.cloudflare.com/handshake-encryption-endgame-an-ech-update/


Prepping for changes with EVE-NG; SONiC, Arista, Juniper and Cisco

This weekend I got motivated to start building out my EVE-NG lab with SONiC since we are in the middle of testing out 400G in our hardware labs. The green links below need to be added to our real environment so I figured I would get the configurations vetted ahead of time. Also I’ll be using this lab to practice my network automation with Python, Nornir, Netmiko, NAPALM, etc.

There are a few links at the bottom that are helpful and should spell out all you need to do. The hardest part I had was finding a link to the virtual image so I included that direct link below for anyone else who may struggle to find that.

I may add some more details later but it’s already Sunday evening and I’ve been messing with this lab all day long and it’s time for bed.

Here you can see EVE-NG stats installed onto ESXi 7.0

TOP showing that Juniper uses 94% CPU and IOL is using 1%. I made some tweaks for KVM performance but it still is not where I’m expecting to see it. You can see the Arista using about 7-10% and SONiC using 13-16% CPU.

SONiC running in eve-ng, along with Arista, Juniper and Cisco.

SONiC Baking in the oven as we perform 400G optic testing.

Download sonic-vs.img.gz: https://sonic-jenkins.westus2.cloudapp.azure.com/job/vs/job/buildimage-vs-image-202012/lastStableBuild/artifact/target/

How-To: https://translate.google.com/translate?sl=auto&tl=en&u=https://moisio.fr/2021/01/11/sonic-sur-eve-ng/

How-To #2: http://www.networkhints.com/2021/01/microsoft-sonic-virtual-switch-on-eve-ng.html

EVE-NG Blog: https://jncie.eu/

Juniper Performance vs Lite Mode: https://www.juniper.net/documentation/us/en/software/vmx/vmx-getting-started/topics/task/vmx-chassis-flow-caching-enabling.html


Object-Oriented Programming vs. Functional Programming

Today I found a great article on Hacker News about Object-Oriented Programming. It makes some interesting arguments about code complexity contributing to the Volkswagen accelerator and Boeing MAX issues.

https://suzdalnitski.medium.com/oop-will-make-you-suffer-846d072b4dce

“None of the built-in OOP features help with preventing spaghetti code — encapsulation simply hides and scatters state across the program, which only makes things worse. Inheritance adds even more confusion. OOP polymorphism once again makes things even more confusing — there are no benefits in not knowing what exact execution path the program is going to take at runtime. Especially when multiple levels of inheritance are involved.”

These are very valid arguments based on my own experience with Java-based OOP. The complexity it adds seems to outweigh the “efficiency” of the code and runs the risk of morphing into a tangled mess of “Spaghetti Code”.

A great read and an interesting concept for people who do software development.


OSX 10.15 on 2009 Macbook Pro & ESXi 7.0

I updated my 2009 MacBook Pro this weekend from 10.11 (last supported OS) to 10.15. The best part about it all is that the computer runs FASTER with Catalina installed than it did with El Capitan! It’s possible that this is due to the fresh install but it’s nice to breathe new life into old hardware that is clearly still working perfectly fine as a daily machine. If I need some more horsepower there is always the ESXi 7.0 instance with any flavor of OS, including the new version of Catalina I virtualized last weekend.

The MacBook upgrade patch can be found at https://dosdude1.com, just make sure you have a decent thumb drive. I spent HOURS fidgeting with this issue before plugging in a proper EFI supported hard drive for the boot device.

The Mac OS on ESXi 7.0 tutorial can be found here. Everything works great with the 3.02 unlocker patch, which ESXi needs in order to create the macOS virtual machine.


Python Inventory Search

I wrote a Python script to query a website and check to see if an item is in stock. If it finds the product in question it will email me a report with a hyperlink so that I can click the link and place an order straight from my email. This script runs as a cron job (once every hour) and saves me the time of having to constantly check their website, wondering when they will get the next delivery.

It was a great exercise in using the BeautifulSoup4 Python library as well as using Selenium for the first time, which was needed to execute the JavaScript that creates the dynamic content.

from selenium import webdriver
from bs4 import BeautifulSoup as bs
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import NoSuchElementException
import lxml
import smtplib
import time

options = Options()
options.headless = True

mylist = []
not_found = ''

driver = webdriver.Firefox(options=options)
driver.get("https://www.website.com")

# Try and fix the random timing errors --> better way is with selenium waitfor
time.sleep(5)

# Look for the state "no product in stock" --> "0 matches, that stinks"
try:
    not_found = driver.find_element_by_class_name("css-1ctldcn.ew1p50q2")
    not_found_html = not_found.get_attribute('innerHTML')

# handle the exception of product actually being found.
except NoSuchElementException as e:
    print (str(e))

# print "not found message" and exit program
if(not_found):
    print (not_found_html)
    driver.close()
    exit()

try: 

    # They have stock; now find how many products they have at runtime.
    products = driver.find_element_by_class_name("css-hecap1.ettsl931")
    total_products = products.get_attribute('innerHTML')

# handle the exception of product elements not being found and exit program
except NoSuchElementException as e:
    print (str(e))
    driver.close()
    exit()

# Find all of the Grid Elements, or all of the products available - all products use the same grid ID
element = driver.find_element_by_class_name("css-19ofktj.e29d1tf2")
html = element.get_attribute('innerHTML')
soup = bs(html, "lxml")

print (total_products)

for a in soup.find_all('a', href=True):
    mylist.append("Found the URL: https://www.website.com" + a['href'])

# Python 3 only
print (*mylist, sep="\n")

# See the whole tree with price and description for each item
# prettyHTML = soup.prettify()
# print (prettyHTML)

port = 587  
sender_email = "SENDER@gmail.com"
receiver_email = "RECEIVER@email.com"

message = """\
Subject: new products have arrived!

{}. """ .format(total_products) + str(mylist)

server = smtplib.SMTP('smtp.gmail.com', 587)
server.ehlo()
server.starttls()
server.ehlo()
server.login(sender_email, "SENDERPASSWORD")
server.sendmail(sender_email, receiver_email, message)
server.quit()

driver.close()
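
One possible refinement, sketched here as a suggestion rather than part of the original script: building the report with the standard library's `email.message.EmailMessage` keeps the headers and body separate and puts one link per line instead of printing the Python list. The function name `build_report` and the sample values are hypothetical.

```python
from email.message import EmailMessage


def build_report(sender, receiver, total_products, links):
    """Return an EmailMessage with the product count and one link per line."""
    msg = EmailMessage()
    msg["Subject"] = "New products have arrived!"
    msg["From"] = sender
    msg["To"] = receiver
    msg.set_content(f"{total_products}\n\n" + "\n".join(links))
    return msg


# Sample values standing in for total_products and mylist from the script.
report = build_report(
    "SENDER@gmail.com",
    "RECEIVER@email.com",
    "2 matches",
    [
        "Found the URL: https://www.website.com/item-1",
        "Found the URL: https://www.website.com/item-2",
    ],
)
print(report)
```

With this in place, `server.send_message(report)` can replace the `server.sendmail(...)` call.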