Recently a problem came up at a single site of a multi-site network. This one site had high CPU utilization on their Cisco 4510R layer-3 switch. This switch is the core of this particular site, and there are other similarly configured locations with similar hardware.
Troubleshooting begins with all the usual suspects: show ver, show mod, show proc cpu, show log. Nothing stands out except that show proc cpu shows Cat4k Mgmt LoPri taking up 70-80% of the CPU. I am relatively new to the Catalyst 4500 Sup 6 card and spend most of my time on the 6500 platform. I quickly discover that this is a process with many threads. The details can be seen by executing a different command, show platform health. This command breaks down all the different threads that make up the management processes within a Catalyst 4500 switch. It also details target and actual CPU utilization statistics.
In this case, I find that K5CpuMan Review is taking up a lot more CPU than it is supposed to. Now I must say the name of this process surely means something to someone, but not to me. Apparently this is the name of the process that handles all CPU switched packets. (That was my first guess obviously!) Now packets can be CPU switched for many reasons. They might be destined for the CPU, there could be a misconfiguration, or they could be marked in a way that forces intermediate systems (routers and switches) to look at them.
The next step is to determine what traffic is being directed at the CPU and why. This step will probably involve some debug commands. The good news for those who shiver at the idea of using debugs in a live production network is that the commands we will use have no impact on the performance of the equipment. Essentially, the platform is doing this work already; the commands just make it log the information so it is visible. Oh, and these commands do NOT survive a reboot and don't show up when a show debug is executed. What I would like to know is why not just leave these features on all the time? If we want to see where packets destined for the CPU are coming from, we execute debug platform packet all count. After the debug has been running for a bit, we can display the results with show platform cpu packet statistics. Unfortunately, in this case nothing stands out as an obvious source of the problem.
For grins I decided to look at some other old standby commands: show interface stats and show ip traffic. show interface stats will normally show what traffic is switched by which mechanism. Unfortunately, process switched traffic is at 0 for all interfaces. We did learn something though: the traffic affecting the processor is destined for the processor. The next command, show ip traffic, is actually very telling. Below is an example capture of what this symptom might look like. Notice the non-zero value for packets with options. Out of the roughly 3.2 million packets received in this example, about 1.86 million have an IP option defined. Upon further examination, all of those 1.86 million packets have the router alert option defined!
LAB Router#show ip traffic
IP statistics:
  Rcvd:  3249411 total, 2367891 local destination
         0 format errors, 0 checksum errors, 200092 bad hop count
         0 unknown protocol, 0 not a gateway
         0 security failures, 0 bad options, 1857133 with options
  Opts:  0 end, 0 nop, 0 basic security, 0 loose source route
         0 timestamp, 0 extended security, 0 record route
         0 stream ID, 0 strict source route, 1857133 alert, 0 cipso, 0 ump
         0 other
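If you want to watch these counters over time rather than eyeball them, the output is easy to pick apart. What follows is only a minimal Python sketch: it assumes you have pasted just the IP statistics portion of show ip traffic into a string (the full command output repeats labels such as "total" in later sections, which this does not handle), and the function names are my own invention for illustration.

```python
import re

# The IP statistics section from the capture above, pasted as plain text.
SAMPLE = """\
IP statistics:
Rcvd: 3249411 total, 2367891 local destination
0 format errors, 0 checksum errors, 200092 bad hop count
0 unknown protocol, 0 not a gateway
0 security failures, 0 bad options, 1857133 with options
Opts: 0 end, 0 nop, 0 basic security, 0 loose source route
0 timestamp, 0 extended security, 0 record route
0 stream ID, 0 strict source route, 1857133 alert, 0 cipso, 0 ump
0 other
"""

def parse_counters(text):
    """Map each '<number> <label>' pair in the text to an integer counter."""
    counters = {}
    # A counter is digits, a space, then a label that runs to the next
    # comma or to the end of the line.
    pattern = r"(\d+) ([A-Za-z][A-Za-z ]*?)(?=,|$)"
    for value, label in re.findall(pattern, text, flags=re.M):
        counters[label.strip()] = int(value)
    return counters

def option_share(counters):
    """Fraction of received packets that carried at least one IP option."""
    total = counters["total"]
    return counters["with options"] / total if total else 0.0

c = parse_counters(SAMPLE)
print(f"with options: {c['with options']} of {c['total']} received "
      f"({option_share(c):.1%})")
print(f"router alert: {c['alert']}")
```

Run against two snapshots taken a minute apart, the same parse would show whether the alert counter is still climbing.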
I know what you are thinking, "What are IP options, and why is this important?" Well, IP options are optional fields in the IP header that let hosts and routers be informed about the special nature of a packet. RFC 2113 created the Router Alert option, IP option number 20 (type value 148 once the copy flag is included). This option tells a router to inspect any packet carrying it that passes through, even if the packet is not destined for that router. This inspection provides value when implementing RSVP and other similar protocols.
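For anyone wondering how option number 20 turns into the value 148 that a sniffer displays, the option type octet packs three fields together (RFC 791): a one-bit copied flag, a two-bit class, and a five-bit number. A small Python sketch of the arithmetic, with a helper name I made up for illustration:

```python
def decode_option_type(option_type):
    """Split an IPv4 option type octet into its three RFC 791 fields."""
    return {
        "copied": (option_type >> 7) & 0x1,  # 1 = copy option into every fragment
        "class": (option_type >> 5) & 0x3,   # 0 = control
        "number": option_type & 0x1F,        # the option number itself
    }

ROUTER_ALERT = 0x94  # 148 decimal, the first octet of the RFC 2113 option

fields = decode_option_type(ROUTER_ALERT)
print(fields)
```

So 148 is simply option number 20 with the copied flag set and class 0.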
So let's see what's causing these packets to be marked with the Router Alert option. Another platform-specific and non-impacting debug command can create a 1024-packet-deep circular sniffer trace buffer. To enable this feature we execute debug platform packet all buffer. To view the results, we execute show platform cpu packet buffered. Now that's odd! I know packets are going to the processor, but they are not getting logged by the debug command we just executed. I am going to blame that on the difference between the Sup-6e and the other supervisors. The Sup-6e doesn't always do what the others are capable of. However, other options exist! A sniffer hooked to the switch might just do the trick, but what interface do we monitor? Well, the CPU of course! The change we would make from the normal SPAN setup is in the source: monitor session 1 source cpu queue all rx. Now we just keep capturing until the options value in the show ip traffic output starts to increment...
Another thing to consider here is the possibility of a Denial of Service (DoS) attack. A few years ago Cisco released a field advisory that explains a vulnerability in IOS when a router is subjected to packets with certain IP options set. However, if this issue were a DoS attack, one would expect to see a spike in traffic coming from one port. As we have already seen, show platform cpu packet statistics showed no specific spike in traffic, meaning the traffic is coming from everywhere and is most likely normal traffic.
The sniffer is truly the tool to solve this particular issue. Once the packet capture came back, we found IGMP packets that correlated with the increase in the alert counter in show ip traffic. That fits, because IGMP membership reports (IGMPv2 and later) are sent with the Router Alert option set. Most of the IGMP packets were multicast joins to 224.0.0.251: mDNS. mDNS is very rare in production networks, typically isolated to HP printers that are set to their defaults. Apparently Mac OS X has an application called Bonjour that performs dynamic discovery on the network. One of the protocols in use by this application is mDNS. This application was discovered because the source MAC address in the sniffer trace ties to an Ethernet OUI registered to Apple.
The troubleshooting steps I have outlined here are a combination of personal experience and things TAC suggested. Afterwards, TAC emailed me a link to a document that contains a lot of these steps, some of which may be more applicable to other situations.
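If the capture is large, picking out the offending packets by hand gets old quickly. Below is a rough Python sketch of the check I was doing by eye: walk the IPv4 header options, look for Router Alert (type 0x94), and note whether the protocol is IGMP. The sample packet is hand-built for illustration, not taken from the real trace; its source address 192.0.2.10 is a documentation address, its checksums are zeroed, and the function names are my own.

```python
import socket

ROUTER_ALERT = 0x94  # IPv4 option type for Router Alert (RFC 2113)
IGMP = 2             # IP protocol number for IGMP

def ip_options(header):
    """Yield (type, data) for each option found in an IPv4 header."""
    i = 20  # options start after the fixed 20-byte header
    while i < len(header):
        opt = header[i]
        if opt == 0:      # End of Option List
            break
        if opt == 1:      # No-Operation is a single octet
            i += 1
            continue
        if i + 1 >= len(header):
            break         # truncated option
        length = header[i + 1]
        if length < 2 or i + length > len(header):
            break         # malformed length, stop parsing
        yield opt, header[i + 2:i + length]
        i += length

def summarize(packet):
    """Report addresses, IGMP or not, and Router Alert or not."""
    ihl = (packet[0] & 0x0F) * 4  # header length in bytes
    alert = any(t == ROUTER_ALERT for t, _ in ip_options(packet[:ihl]))
    return {
        "src": socket.inet_ntoa(packet[12:16]),
        "dst": socket.inet_ntoa(packet[16:20]),
        "igmp": packet[9] == IGMP,
        "router_alert": alert,
    }

# Hypothetical IGMPv2 membership report for 224.0.0.251 (illustrative only).
sample = bytes.fromhex(
    "46c00020" "00000000" "01020000"  # ver/IHL=6, len=32, TTL=1, proto=2
    "c000020a" "e00000fb"             # 192.0.2.10 -> 224.0.0.251
    "94040000"                        # Router Alert option
    "16000000" "e00000fb"             # IGMP type 0x16, group 224.0.0.251
)

info = summarize(sample)
print(info)
```

Feeding each captured IP packet through something like summarize and counting by source would point at the chatty hosts without scrolling through the whole trace.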
http://www.cisco.com/en/US/products/hw/switches/ps663/products_tech_note09186a00804cef15.shtml
