Hey Yahel,
Your post made us look at our KLUDGE FIX, which we are going to remove.
First off, there was a BUG where clearing the port stats could cause the KLUDGE FIX to falsely disable FC.
I removed v1.4.0rc19 from download and will post v1.4.0rc20 with the KLUDGE FIX REMOVED later today.
Dropping ports on new WS, what is wrong with my setup?
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
Support is handled on the Forums not in Emails and PMs.
Before you ask a question, use the Search function to see if it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
UPDATE TO ISSUE
So tonight I was upgrading all my towers to v1.4.0rc22 and I was able to see the Pause Frame Storm happen for the first time, but since my towers all have multiple paths in and out it recovered on its own, which may be why I have never seen it before; I just happened to catch this one. Keep in mind this switch was upgraded less than 30 minutes ago, so that many pause frames in a span of 30 minutes is not normal.
In this picture you can see the Pause Frame Storm occurred. It happened when another tower in the main ring was rebooting after the firmware upgrade: 1/3 to 1/2 of my network traffic would have suddenly been thrust upon the AF5X link when OSPF converged on this higher-cost link, with more packets and data bursts than the AF5X link could easily handle, as the bulk of my traffic normally travels across AF24 links.
CLICK IMAGE TO VIEW FULL SIZE
I saw the Tx Drops and thought about what Chuck had said about maybe the switch not Obeying the Tx Pause Frames from the AF, so we looked at the core to ensure the port is indeed set to Obey, and it was, as seen in this picture.
CLICK IMAGE TO VIEW FULL SIZE
In this picture you can see the AF5X that appears to have generated the Pause Frame Storm, taken after the traffic had been moved back to the primary AF24 links. As you can see it is a SOLID link.
CLICK IMAGE TO VIEW FULL SIZE
Now this picture is from another switch in a side ring that goes out 10 miles into a rural area; the traffic never really pushes that AF5X link to the MAX, and it is operating just fine and has never messed up.
CLICK IMAGE TO VIEW FULL SIZE
So what does this all prove.... nothing really, just another piece of the puzzle. But I think it may have something to do with an event where the AF link is saturated very quickly with a ton of packets and a high amount of traffic, possibly with data bursts larger than the link can easily handle. - ALL CONJECTURE AND NO FACTS
I do not know who is at fault, but we are all trying to figure it out; Chuck and his team over at UBNT are trying and we are trying over here as well.
For now I think everyone needs to turn OFF Flow Control either on the AF radio or the switch port facing the AF radio "if" the link can be saturated. Now I was talking to Yahel tonight, who found out about the KLUDGE FIX turning OFF his Flow Control with v1.4.0rc19, and he said he thinks he sees the Pause Frame Storm with an AC 500mm PTP link?
So maybe this is not limited to AF radios? Once again another piece of the puzzle to this mystery.
Support is handled on the Forums not in Emails and PMs.
Before you ask a question, use the Search function to see if it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
1st Addendum to ISSUE update
By the way, this is the last switch to be upgraded tonight as it is the primary head end switch. It is running v1.3.9 and has been up for almost 97 days, as you can see in this picture. This is also the switch on the tower where Sprint tore out our grounding system when they left the tower; when a storm rolled through it fried Port 1 in the switch and an AF24HD radio, and you can see the port has an X on it at the very top edge of the picture.
I had to move the AF24HD link from Port 1 to Port 4. So this switch took a surge hit months ago and is still in service as my MAIN switch.
CLICK IMAGE BELOW TO VIEW FULL SIZE
Now this is where my MAIN feed comes in from my fiber. It is an AF24HD link, but it does not link up solid above QAM64. It bounces around a lot; I've been meaning to go back and try to get a couple more dB out of the alignment.
CLICK IMAGE BELOW TO VIEW FULL SIZE
As you can see in this picture of the port stats, it does see a lot of Pause Frames, but my network has never gone down. I am guessing I never see a Pause Frame Storm, or at least a prolonged one, as the link is not pushed to the limit; it rarely sees 600 Mbps and only for short bursts - we are not that big of a WISP.
CLICK IMAGE BELOW TO VIEW FULL SIZE
I will be upgrading this switch shortly to v1.4.0rc22 *fingers crossed* as it is the main switch. If it fails, traffic will fail over to the MIMOSA B5 link and head into Hillcrest tower, making it the main switch for a while, so only this tower would be down, but it has a couple hundred people on it.
Support is handled on the Forums not in Emails and PMs.
Before you ask a question, use the Search function to see if it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
2nd Addendum to ISSUE update
OK, so you can see in this picture I did the upgrade and the switch is rebooting. You can see that pings to Quarry Road tower stopped, and pings to the internet missed 3 pings until OSPF converged on the backup link, which is the MIMOSA B5 shooting into Hillcrest tower.
CLICK IMAGE TO VIEW FULL SIZE
In this picture you can see the upgrade went fine and we are back to the switch login screen. I lost 15 pings to the tower during the reboot, OSPF converged back to Quarry Road Tower, and no pings to the internet were lost.
CLICK IMAGE TO VIEW FULL SIZE
In this picture you can verify that my main tower is now running v1.4.0rc22.
CLICK IMAGE TO VIEW FULL SIZE
In this picture you can see I upgraded ALL my towers to v1.4.0rc22. I run the same switches and firmware you guys do, and I always try it on my WISP first.
CLICK IMAGE TO VIEW FULL SIZE
Personally I think v1.4.0rc22 is our BEST firmware yet!!!
I am not going to lie that last one had my heart pumping!!!! - YOU KNOW WHAT I MEAN!
Support is handled on the Forums not in Emails and PMs.
Before you ask a question, use the Search function to see if it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.
-
tma - Experienced Member
- Posts: 122
- Joined: Tue Mar 03, 2015 4:07 pm
- Location: Oberursel, Germany
- Has thanked: 15 times
- Been thanked: 14 times
Re: Dropping ports on new WS, what is wrong with my setup?
Interesting. I'd say the link that took all the traffic and produced many pause frames was doing something similar to the mis-aimed link I had after deployment the other day -> lots of pause frames from the AF5X. I'm still trying to find a link in my network where I could try that without harming my customers too much. I'd try it like this then:
Given "natural" downstream traffic to the far end of, say, 40 Mbps, I would use the AF CLI command 'af set powerout X' (where X is EIRP power) in 1 dB steps downwards until the AF UI tells me the link is running at 1x (SISO). At that point the link would have less capacity than traffic, and I could look at the RX FC counter from the AF. This is better than mis-aiming for a test because one can do an 'af set powerout <original>' to recover quickly (and because it's done on the near end).
Then again, UBNT should be able to do just that with their traffic generator lab setup. "BASIS" ;-) and I have given them enough hints ...
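To make that stepping loop concrete, here is a rough Python sketch of what I mean. It is just an illustration, not a tested tool: the SSH address, power values, settle time, and the 'af get modrate' status query are placeholders I made up; only 'af set powerout X' is the real command mentioned above.
```python
#!/usr/bin/env python3
"""Sketch of the power-step test described above (illustrative only)."""
import subprocess
import time

AF_HOST = "admin@192.0.2.10"   # near-end AF radio (example address)
ORIGINAL_POWER = 23            # dBm EIRP to restore afterwards (example value)
MIN_POWER = 5                  # safety floor so we do not step down forever


def af_cmd(cmd: str) -> str:
    """Run a single AF CLI command over SSH and return its output."""
    result = subprocess.run(["ssh", AF_HOST, cmd],
                            capture_output=True, text=True, timeout=15)
    return result.stdout.strip()


try:
    for power in range(ORIGINAL_POWER, MIN_POWER - 1, -1):
        af_cmd(f"af set powerout {power}")
        time.sleep(30)                        # let the modulation settle
        status = af_cmd("af get modrate")     # placeholder status query
        print(f"{power} dBm -> {status}")
        if "1x" in status:
            # Link has dropped to 1x (SISO): capacity is now below the ~40 Mbps
            # of natural traffic, so go read the RX flow-control counter.
            break
finally:
    # Quick recovery from the near end, as noted above.
    af_cmd(f"af set powerout {ORIGINAL_POWER}")
```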
Given "natural" downstream traffic to the far end of say 40 Mbps, I would use the AF CLI command 'af set powerout X' (where X is EIRP power) in 1 dB steps downwards until the AF UI tells me the link is running at 1x (SISO). It would have less capacity than traffic at this point and I could look at the RX FC counter from the AF. This is better than mis-amaing for a test because one can do an 'af set powerout <original>' to recover quickly (and because it's done on the near end).
Then again, UBNT should be able to do just that with their traffic generator lab setup. "BASIS" ;-) and I have given them enough hints ...
--
Thomas Giger
-
yahel - Member
- Posts: 54
- Joined: Wed May 27, 2015 12:07 am
- Location: Berkeley, CA
- Has thanked: 14 times
- Been thanked: 11 times
Re: Dropping ports on new WS, what is wrong with my setup?
Here's another report which is certainly going to shed some new light on things...
First, some context...
Just like Chris's, all our towers have multiple paths (as all good WISPs should), so we never experienced loss of access to our towers (I prefer this terminology over the wrong "lock-down" term that was used in this thread). In other words, if one interface on a switch goes down due to an FC-Pause-frames-storm, we would still have access, as the network would converge and access would be possible via the alternate path.
Essentially, this probably means that the FC-Pause-frames-storm is nothing new and has probably been happening in our network for a long time, but we simply never knew about it (we still don't know the source - but we're learning... knowing that it does happen is the first step).
Here are some more data points...
Most of our network is routed (OSPF with BFD which is very fast to converge), but a part of our network is bridged, with multi-path managed through carefully set RSTP weights, which is also quite fast to converge (we should be moving away from bridged, but it works {mostly} and who has the energy to fix what's not broken).
Bottom line --- we would not have known about any of these issues if not for the Kludge-fix of RC19!
Which is why I like it (yes, it's not done well, and has quite a few bugs, but it gives us some visibility).
After last night's upgrade to RC19, many ports on many switches got the "Excessive flow control pause frames received" warning, and indeed got FC disabled.
We've since learned that this often gets triggered falsely (due to resets of counters, reboots, upgrades, etc.), but it appears it also gets triggered by the real thing - here's how I know:
I've since re-enabled FC on all ports of all switches (except one which would keep triggering the Kludge-fix -- and it's actually an AC500 radio, not an AirFiber).
It was all quiet all day, until high traffic started in the evening, and then, at 7:16pm... drumroll... I got an email from 11 switches (actually 7 switches and 11 ports) where the pause-frame-storm triggered the Kludge-fix... all at the same time!
These, of course, are the switches on my bridged ring (but not all of them - there are 5 other switches on this bridged domain where the Kludge-fix was not triggered).
This synced event could happen for one of two reasons:
1. Netonix switches and/or UBNT radios in bridge mode (mostly AirFibers) forward FC-Pause-Frames across interfaces.
This is of course horrible, if true, and must be checked and ruled-out.
This suspicion is the reason for this post -- we must double check, in a lab, if this indeed happens - which could explain the problem.
2. What is probably the real reason: the momentary shutdown of ports due to the Kludge-fix caused RSTP to converge, and high traffic volumes then passed via radios that triggered the storm (Perhaps that's normal? Yes - there are many Pause-frames shown in the port details, but we have lots of traffic during these hours { >> 500Mbps, which many AF links cannot handle if the other paths are down}).
Overall, no harm was noticed, and no packet loss registered in the Smokeping graphs (too coarse a granularity -- I'm sure there were a few good seconds of loss, but these graphs won't register that). It all converged back with FC disabled and has been quiet since.
Strangely, at least 3 switches where the Kludge-fix was triggered never actually disabled FC! Another Kludge-bug?
I therefore conclude that I like the Kludge-fix and I'm going to keep it and not upgrade for a while.
I wish it would not actually do anything but warn/email (and it's partly broken, so sometimes it does not do anything else anyway)...
I also wish it would not trigger on false positives, and perhaps only on very high rates of FC-Pause-Frames... But it does give visibility into a potentially very harmful issue, so I like it.
At least until we find out what's the source of all this...
I'm going to reset all ports and re-enable FC on them and wait for the next event...
Thoughts?
-
adairw - Associate
- Posts: 465
- Joined: Wed Nov 05, 2014 11:47 pm
- Location: Amarillo, TX
- Has thanked: 98 times
- Been thanked: 132 times
Re: Dropping ports on new WS, what is wrong with my setup?
I'm going to say this again. When we had the problem occur, the switch was inaccessible from any port and was not passing any traffic out any port.
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
adairw wrote:I'm going to say this again. When we had the problem occur, the switch was inaccessible from any port and was not passing any traffic out any port.
And that can happen Adair, we talked about this.
If, say, port 4 is getting hammered with a never-ending stream of Rx Pause Frames, then as the switch buffers fill up the switch starts sending Tx Pause Frames out its other ports because it has no place to deliver packets or to hold any more in buffers. Eventually all ports could be Paused if they are all sending out Tx Pause Frames. If you remove the cable from the port receiving all the Rx Pause Frames, the buffers will clear and the other ports will stop sending Tx Pause Frames, even if that cable is the only ingress/egress port, because the switch will clear that MAC from its table: any packets held in queue for that MAC are dropped, and any new packets entering the switch destined for that MAC are dropped. And as soon as the other devices that would normally go out that path lose their MAC entry for the exit, they stop sending packets to the switch as well, since there is no known destination.
In my topology and Yahel's, the ports connected to the AF radios are ingress/egress-only ports, so only packets trying to enter or leave the tower flow across them, and we have 2 or more of these doors, so when 1 door shuts the packets go to the other door.
If your switch/tower only has one door to enter and exit, then pretty much all packets need to use that door. If that door is locked, all packets on all other ports are destined for that door, so those ports also fill up their buffers and start sending out Tx Pause Frames, and the switch appears locked.
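If it helps to picture the cascade, here is a toy model in Python. It is purely illustrative: the buffer size, rates, and port names are made up, and a real switch ASIC is far more complex than this.
```python
"""Toy model of how a paused egress port can back-pressure a whole switch."""

BUFFER_LIMIT = 1000    # shared packet buffer (made-up size, in packets)
INGRESS_RATE = 120     # packets per tick arriving on each ingress port
DRAIN_RATE = 300       # packets per tick the egress port could send if not paused

# True = the switch has started sending Tx Pause Frames out that port
ingress_paused = {"port1": False, "port2": False, "port3": False}
egress_paused = True   # port 4 keeps receiving a stream of Rx Pause Frames
buffer_fill = 0

for tick in range(8):
    # Packets keep arriving from every ingress port not yet paused,
    # all destined for the paused egress port.
    buffer_fill += sum(INGRESS_RATE for paused in ingress_paused.values() if not paused)

    # The paused egress port cannot drain the buffer.
    if not egress_paused:
        buffer_fill = max(0, buffer_fill - DRAIN_RATE)

    # Once the shared buffer is full, the switch starts sending Tx Pause
    # Frames out its other ports -- the "locked up" state.
    if buffer_fill >= BUFFER_LIMIT:
        for port in ingress_paused:
            ingress_paused[port] = True

    paused_now = [p for p, v in ingress_paused.items() if v]
    print(f"tick {tick}: buffer={buffer_fill} paused ingress ports={paused_now}")

# Pulling the cable on port 4 corresponds to the egress no longer being paused
# plus flushing its queued packets: the buffer drains and the Tx Pause Frames stop.
```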
Support is handled on the Forums not in Emails and PMs.
Before you ask a question, use the Search function to see if it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
When Flow Control is active you can easily get many Pause Frames a second even on a perfect connection. You have to remember how small of an event Pause Frames are.
A PAUSE frame includes the period of pause time being requested, in the form of a two-byte unsigned integer (0 through 65535). This number is the requested duration of the pause. The pause time is measured in units of pause "quanta", where each unit is equal to 512 bit times.
A single quantum is so small it is hard to wrap your mind around. Pause Frames can occur on any Ethernet link that is functioning properly and has no issues; it comes down to timing and when something occurs. Seeing a Pause Frame on a 1G Ethernet link only passing a few Mbps of payload is not uncommon.
It all depends on what happens and when, its all in the timing.
People often mistakenly assume if a Pause Frame occurs on a connection that is not saturated that something is wrong and that is simply not true.
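To put some numbers on that, here is a quick back-of-the-envelope calculation using the figures above (512 bit times per quantum, up to 65535 quanta per frame); the line rates are just examples:
```python
"""How long can a single PAUSE frame stop a port? (1 quantum = 512 bit times)"""

for rate_gbps in (0.1, 1, 10):                 # 100M, 1G, 10G links as examples
    bit_time_ns = 1 / rate_gbps                # nanoseconds per bit at this line rate
    quantum_us = 512 * bit_time_ns / 1000      # one pause quantum, in microseconds
    max_pause_ms = 65535 * quantum_us / 1000   # largest pause one frame can request
    print(f"{rate_gbps:>4} Gbps: 1 quantum = {quantum_us:.3f} us, "
          f"max pause per frame = {max_pause_ms:.1f} ms")
```
So even the largest pause a single frame can request at 1 Gbps is only about 33 ms; it takes a continuous stream of them to actually choke a port, which is exactly what a Pause Frame Storm is.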
Support is handled on the Forums not in Emails and PMs.
Before you ask a question, use the Search function to see if it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
New Firmware coming out soon to combat this ISSUE
So apparently many high-end switches implement Pause Frame Flood Protection; this seems to be a common problem not limited to Netonix and UBNT devices.
http://h17007.www1.hp.com/docs/enterpri ... about.html
So we are putting Pause Frame Flood Protection back into the firmware; we should have a new RC version tomorrow.
We are improving on our previous attempt at Pause Frame Flood Protection, which was flawed (sorry). You will also have the ability to Disable it on the Device/Configuration Tab under the Storm Control section.
HP's and Cisco's solution is to disable the offending port, but we have found that simply disabling Obey Pause Frames on the port is enough.
We did not feel disabling the whole port would be acceptable, as the switch could be many miles away, so why not just disable Obey Pause Frames on the offending port and alert the admin via an SMTP alert.
The new Pause Frame protection will take into account that the port counters can be reset, and it will require multiple samples of the storm to trip; when it trips, it disables Obey Pause Frames on the port and alerts the admin of the action.
Currently we only allow you to Enable or Disable BOTH Obey and Generate Pause Frames per port, but the next firmware will allow you to enable them separately per port.
We are also changing the Flow Control indicator on the Status Tab, which currently only reports when both Obey and Generate are enabled and negotiated; it will now indicate whether one, the other, or both are active. - NICE ENHANCEMENT which could help diagnose issues.
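For the curious, here is a rough sketch of the kind of multi-sample check described above. This is only an illustration of the idea, not the actual firmware code: the threshold, sample count, and poll interval are made-up numbers, and the counter/port functions are placeholders.
```python
"""Sketch of a multi-sample pause-flood check (guessed logic, invented numbers)."""
import random
import time

PAUSE_RATE_THRESHOLD = 500   # Rx pause frames per second considered a storm
SAMPLES_TO_TRIP = 3          # consecutive bad samples required before acting
POLL_INTERVAL = 10           # seconds between samples


def read_rx_pause_counter(port: str) -> int:
    """Placeholder: a real switch would read the hardware Rx pause counter."""
    return random.randint(0, 100_000)


def disable_obey_pause(port: str) -> None:
    print(f"{port}: disabling 'Obey Pause Frames' (the port itself stays up)")


def send_smtp_alert(port: str) -> None:
    print(f"{port}: SMTP alert sent to the admin describing the action taken")


def monitor(port: str) -> None:
    last = read_rx_pause_counter(port)
    bad_samples = 0
    while True:
        time.sleep(POLL_INTERVAL)
        now = read_rx_pause_counter(port)
        delta = now - last
        last = now
        if delta < 0:
            # Counter went backwards: stats were cleared or the switch rebooted.
            # Ignore this sample rather than tripping falsely.
            bad_samples = 0
            continue
        if delta / POLL_INTERVAL > PAUSE_RATE_THRESHOLD:
            bad_samples += 1
        else:
            bad_samples = 0
        if bad_samples >= SAMPLES_TO_TRIP:
            disable_obey_pause(port)
            send_smtp_alert(port)
            return
```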
HP wrote:Pause flood protection
Ethernet switch interfaces use pause frame-based flow control mechanisms to control data flow. When a pause frame is received on a flow control enabled interface, the transmit operation is stopped for the pause duration specified in the pause frame. All other frames destined for this interface are queued up. If another pause frame is received before the previous pause timer expires, the pause timer is refreshed to the new pause duration value. If a steady stream of pause frames is received for extended periods of time, the transmit queue for that interface continues to grow until all queuing resources are exhausted. This condition severely impacts the switch operation on other interfaces. In addition, all protocol operations on the switch are impacted because of the inability to transmit protocol frames. Both port pause and priority-based pause frames can cause the same resource exhaustion condition.
HP Virtual Connect interconnects provide the ability to monitor server down link ports for pause flood conditions and take protective action by disabling the port. The default polling interval is 10 seconds and is not customer configurable. The SNMP agent supports trap generation when a pause flood condition is detected or cleared.
This feature operates at the physical port level. When a pause flood condition is detected on a Flex-10 physical port, all Flex-10 logical ports associated with physical ports are disabled. When the pause flood protection feature is enabled, this feature detects pause flood conditions on server down link ports and disables the port. The port remains disabled until an administrative action is taken. The administrative action involves the following steps:
Resolve the issue with the NIC on the server causing the continuous pause generation. This might include updating the NIC firmware and device drivers.
Rebooting the server might not clear the pause flood condition if the cause of the pause flood condition is in the NIC firmware. In this case, the server must be completely disconnected from the power source to reset the NIC firmware.
Support is handled on the Forums not in Emails and PMs.
Before you ask a question, use the Search function to see if it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.