
Latency increases dramatically, fixed by reboot

Posted: Sat Apr 23, 2016 3:48 pm
by flameproof
I am noticing a problem that seems to happen at random on our switches. On occasion, our network monitoring system reports them as "down", but the switch is not actually down; its latency has just increased significantly. Here is a continuously running ping that spans a reboot. You can see the times go from erratic, high values to sub-5 ms, which is what we see across our entire network when it is operating correctly:


64 bytes from 172.16.255.151: icmp_seq=14 ttl=63 time=26.1 ms
64 bytes from 172.16.255.151: icmp_seq=15 ttl=63 time=139 ms
64 bytes from 172.16.255.151: icmp_seq=16 ttl=63 time=854 ms
64 bytes from 172.16.255.151: icmp_seq=17 ttl=63 time=853 ms
64 bytes from 172.16.255.151: icmp_seq=18 ttl=63 time=1003 ms
64 bytes from 172.16.255.151: icmp_seq=19 ttl=63 time=575 ms
64 bytes from 172.16.255.151: icmp_seq=20 ttl=63 time=343 ms
64 bytes from 172.16.255.151: icmp_seq=21 ttl=63 time=5.80 ms
64 bytes from 172.16.255.151: icmp_seq=22 ttl=63 time=23.6 ms
64 bytes from 172.16.255.151: icmp_seq=23 ttl=63 time=630 ms
64 bytes from 172.16.255.151: icmp_seq=24 ttl=63 time=590 ms
64 bytes from 172.16.255.151: icmp_seq=25 ttl=63 time=373 ms
64 bytes from 172.16.255.151: icmp_seq=26 ttl=63 time=193 ms
64 bytes from 172.16.255.151: icmp_seq=27 ttl=63 time=809 ms
64 bytes from 172.16.255.151: icmp_seq=28 ttl=63 time=309 ms
64 bytes from 172.16.255.151: icmp_seq=29 ttl=63 time=3.85 ms
64 bytes from 172.16.255.151: icmp_seq=30 ttl=63 time=523 ms
64 bytes from 172.16.255.151: icmp_seq=31 ttl=63 time=718 ms
64 bytes from 172.16.255.151: icmp_seq=32 ttl=63 time=12.2 ms
64 bytes from 172.16.255.151: icmp_seq=33 ttl=63 time=703 ms
64 bytes from 172.16.255.151: icmp_seq=34 ttl=63 time=16.7 ms
64 bytes from 172.16.255.151: icmp_seq=60 ttl=63 time=8.64 ms
64 bytes from 172.16.255.151: icmp_seq=61 ttl=63 time=3.91 ms <- AFTER REBOOT
64 bytes from 172.16.255.151: icmp_seq=62 ttl=63 time=2.66 ms
64 bytes from 172.16.255.151: icmp_seq=63 ttl=63 time=2.66 ms
64 bytes from 172.16.255.151: icmp_seq=64 ttl=63 time=4.26 ms
64 bytes from 172.16.255.151: icmp_seq=65 ttl=63 time=3.50 ms
64 bytes from 172.16.255.151: icmp_seq=66 ttl=63 time=3.20 ms
64 bytes from 172.16.255.151: icmp_seq=67 ttl=63 time=3.84 ms
64 bytes from 172.16.255.151: icmp_seq=68 ttl=63 time=3.88 ms
64 bytes from 172.16.255.151: icmp_seq=69 ttl=63 time=3.86 ms
64 bytes from 172.16.255.151: icmp_seq=70 ttl=63 time=2.22 ms
64 bytes from 172.16.255.151: icmp_seq=71 ttl=63 time=3.36 ms
64 bytes from 172.16.255.151: icmp_seq=72 ttl=63 time=2.99 ms
64 bytes from 172.16.255.151: icmp_seq=73 ttl=63 time=2.82 ms
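
For anyone wanting to quantify a capture like the one above rather than eyeball it, here is a minimal Python sketch. It assumes Linux-style ping output in the same format as the lines shown, and it assumes the largest jump in icmp_seq marks the reboot window, which is only a heuristic. The function names (parse_ping, split_at_gap, summarize) are made up for this example and are not part of any tool mentioned in this thread.

```python
import re
from statistics import mean

# Matches the "icmp_seq=N ttl=N time=X ms" part of a Linux-style ping reply line.
PING_LINE = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def parse_ping(text):
    """Return a list of (icmp_seq, rtt_ms) tuples found in the ping output."""
    return [(int(m.group(1)), float(m.group(2))) for m in PING_LINE.finditer(text)]


def split_at_gap(samples):
    """Split samples at the largest jump in icmp_seq.

    Assumption: the biggest run of missing replies is the reboot window.
    """
    gaps = [(samples[i + 1][0] - samples[i][0], i) for i in range(len(samples) - 1)]
    _, idx = max(gaps)
    return samples[: idx + 1], samples[idx + 1 :]


def summarize(samples):
    """Return count/min/avg/max of the round-trip times in milliseconds."""
    rtts = [rtt for _, rtt in samples]
    return {"count": len(rtts), "min": min(rtts), "avg": mean(rtts), "max": max(rtts)}


if __name__ == "__main__":
    import sys

    # Usage: ping 172.16.255.151 | tee ping.log, then: python this_script.py < ping.log
    before, after = split_at_gap(parse_ping(sys.stdin.read()))
    print("before gap:", summarize(before))
    print("after gap: ", summarize(after))
```

Comparing the two summaries gives a before/after figure that is easier to attach to a report than a raw ping log.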


Any ideas?

Re: Latency increases dramatically, fixed by reboot

Posted: Sat Apr 23, 2016 4:10 pm
by sirhc
It would help to know your model and firmware version, but even so, there is no known issue like that.

More than likely, there is traffic on your switch somewhere: when you reboot the switch you break the traffic/stream, and when the switch comes back up the traffic has stopped.

Maybe investigate each port and track down the offending traffic. You could disable the ports one at a time.

Re: Latency increases dramatically, fixed by reboot

Posted: Sat Apr 23, 2016 4:37 pm
by lligetfa
I assume you are pinging the switch itself? If so, it is not a good measure of network latency as the switch probably puts a very low priority on answering pings. I had a bunch of HP Procurve 2524 switches that got real lazy answering pings if the switch CPU went over 25%. I stopped using them in high traffic areas cuz my NMS would false alert on them.

Are you taxing the switch CPU with SNMP? What do you get if you ping a device beyond the switch, preferably a device that doesn't get crazy busy?

Re: Latency increases dramatically, fixed by reboot

Posted: Sat Apr 23, 2016 4:47 pm
by sirhc
I am guessing there is some sort of traffic on his network, and when he reboots the switch the stream stops and all is good again.

But still would help to know model and firmware.

I mean, all the switch models use the exact same switch core and CPU, just with more or fewer ports, but it is still nice to know when someone reports an issue.

Re: Latency increases dramatically, fixed by reboot

Posted: Mon Apr 25, 2016 2:29 pm
by flameproof
So this happens on either WS-12-250A (we have 5, and I've seen it happen on all of them), or our WS-24-400B. They all run FW 1.3.9.

As for traffic, our system is not live yet, so we have very little traffic; peaks are 1 Mbps. When I see this happening, the traffic levels on the switch are normal, with no peaks or high sustained rates.

I have SNMP enabled, but since I consider all SNMP monitoring platforms to be bloated, inefficient, or too aggressive on hardware and network resources, I built my own. It does a single ping per minute and alerts if the average latency over X minutes exceeds Y. It also connects via SSH/HTTPS and grabs JSON-format status info once every 10 minutes. See screenshot of my "dashboard"... (all PHP/HTML/JS based, road names removed for privacy)

Screenshot at Apr 25 20-23-13.png
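
To make the alerting rule described above concrete (one ping per minute, alert when the average over X minutes exceeds Y), here is a rough Python sketch. It is not the poster's actual PHP code; the class name LatencyAlert and its parameters window_minutes and threshold_ms are hypothetical, and lost pings or timeouts are not handled.

```python
from collections import deque


class LatencyAlert:
    """Rolling-average latency check over the last `window_minutes` samples."""

    def __init__(self, window_minutes, threshold_ms):
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window_minutes)
        self.threshold_ms = threshold_ms

    def add(self, rtt_ms):
        """Record one per-minute ping result; return True if the alert should fire."""
        self.samples.append(rtt_ms)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history to fill the window yet
        return sum(self.samples) / len(self.samples) > self.threshold_ms


if __name__ == "__main__":
    alert = LatencyAlert(window_minutes=3, threshold_ms=100)
    for rtt in [5, 854, 853, 3, 4, 2]:
        print(rtt, alert.add(rtt))
```

Averaging over a window means a single slow reply does not page anyone, but it also means the alert keeps firing for a few minutes after latency has recovered, until the high samples age out.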

Re: Latency increases dramatically, fixed by reboot

Posted: Mon Apr 25, 2016 2:32 pm
by sirhc
I hate to say this but there has to be something going on with your network.

There are over 12,000 switches in service and a bug like this would surely be upsetting a LOT of people.

Plus I have 25+ of these switches in service at my WISP and do not see this issue.

You could start by posting up all of your Config Tabs and explaining your network configuration.

Then post up your Switch Log and Device/Status Tab from just before you issue the reboot.

READ NEXT POST

Re: Latency increases dramatically, fixed by reboot

Posted: Mon Apr 25, 2016 2:33 pm
by sirhc
If you have a large flat network, you should upgrade to v1.4.0rc12 or disable UBNT Discovery, as there was an issue found with it that was fixed in v1.4.0rcX.

I would suggest upgrading to v1.4.0rc12 and see what happens.

Re: Latency increases dramatically, fixed by reboot

Posted: Mon Apr 25, 2016 2:40 pm
by flameproof
OK, will do. The config is a star topology, of one central site (WS-24) connected to five sub-nodes, each with a WS-12. There are 5 Ubiquiti NanoBeam ac on the WS-24, linking to the matching NanoBeam ac on each of the five WS-12. On the WS-24 an AirFiber connects back to the fiber local loop.

The NanoBeam ac ports on the WS-24 are on one VLAN, with ports not isolated.

On each WS-12, there are a number (between 3 and 8) of NanoStation M5, which provide backhaul to the access points. The access points have dual 5GHz/2.4GHz radios. The 5GHz side connects back to the nearest M5, and 2.4GHz provides user device access. On the WS-12, the M5 ports are on the same VLAN, but isolated from each other.

I don't have storm control enabled on any of the switches, but loop protection is enabled. Discovery is disabled on all switches.

Next time this happens I'll share status & config details of the affected switch.

After reading your suggestion: although UBNT Discovery is disabled, I'll update to 1.4.0rc12 and see what happens.

Re: Latency increases dramatically, fixed by reboot

Posted: Thu Apr 28, 2016 5:47 pm
by flameproof
So, I'm having this issue on a WS-12-250A right now. I have taken screenshots of all relevant tabs, and even the top command run over SSH. As you can see, there is almost NO traffic through the switch.

In addition, ping times to the NanoBeam feeding into the switch are the normal 2.5 ms average, but pings to anything across the switch, on the other side so to speak, are just as bad as pings to the switch itself.

I've read some comments from people having nasty issues with the suggested rc firmware. Is it safe to use in a production environment?

Screenshot at Apr 28 23-35-09.png


Screenshot at Apr 28 23-34-43.png


Screenshot at Apr 28 23-39-11.png


Screenshot at Apr 28 23-41-18.png

Re: Latency increases dramatically, fixed by reboot

Posted: Thu Apr 28, 2016 6:03 pm
by sirhc
I am running v1.4.0rc12 in production.

There is a small memory leak in rc12 with Discovery, but it is small and would take many, many days to cause an issue, which would eventually result in the switch rebooting.

Memory leak discussed in this thread: viewtopic.php?f=17&t=1672

You can simply not enable Discovery on the Device/Configuration Tab but it would take many days for the memory leak to cause a problem and we hope to have the next rc version released by Monday.