I misspoke. I only have flow control turned off on the switch; it's still on in the radio(s).
It's been five days of uptime. So far so good.
Dropping ports on new WS, what is wrong with my setup?
-
zrob_12 - Member
- Posts: 59
- Joined: Wed Mar 25, 2015 1:51 pm
- Has thanked: 0 time
- Been thanked: 0 time
Re: Dropping ports on new WS, what is wrong with my setup?
I can confirm for now, that disabling Flow Control on all of my AirFiber units has resolved my issue. I did not disable Flow Control on my switches. I will report back if that changes...
-
adairw - Associate
- Posts: 465
- Joined: Wed Nov 05, 2014 11:47 pm
- Location: Amarillo, TX
- Has thanked: 98 times
- Been thanked: 132 times
Re: Dropping ports on new WS, what is wrong with my setup?
This is going to be lengthy and probably somewhat redundant. I'm only including for thoroughness.
Tonight, in a different part of the network we had another switch "lock up". What I learned is that this IS pause frames coming from the AF5X. And those frames were being broadcast out every port that had tagged VLAN's. Let me explain the site setup.
The tower has two WS12's running 1.3.9 and a Mikrotik CCR router. The primary back hauls are setup as mid-span's, untagged ports, these are unaffected by the pause frame storm. (so I'm calling it) Traffic was happily moving between those untagged ports to the router, no problem.
The AP subnets, mini pop back hauls and customer traffic are all on tagged VLANs originating from the router.
The two switches are cascaded together as well as each one connected directly to the router. Switch #2 was the one that locked up (no management from any port) tonight. Switch #1 was taking 8Mb/15K PPS from switch two but was unaffected by the pause storm, at least from a management perspective.
Via an EoIP tunnel I attempted to bridge myself into various ports on the router (a common thing we do) to put myself on the same layer 2 network as the switches (as if I was plugged in to the switch with a laptop, just remote) and no matter what I did I COULD NOT get switch #2 to respond to management traffic. It wasn't until I made it to the site and unplugged the AF5X feeding a pop that everything started working normally: the switch was responsive to management traffic again and started passing customer traffic.
Here are some things I noticed.
1) The switches being cascaded together is only for management and backup vlan's that can be configured should the router fail. The two ports that link them together only share UNtagged management traffic, no other tagged vlan's are on the port of either switch. - So why was the switch passing pause frames from the interface the AF5X was connected to (untagged vlan 10 and a bunch of tagged vlan's for customer traffic) to untagged vlan 1 (management)? They do not share any commonality.
2) Why did the interface from SW#2 not show ANY pause frame traffic on the port that connected directly to the router despite that port having ALL the tagged vlans for the AF5X and customer traffic??
In the images I'm including you can see from SW1 the stream of traffic coming from SW2 which was RX pause frames. On the left is the first shot I took and on the right another shot from a few minutes later showing the RX pause frames has gone WAY up.
The second image is from SW2 after I unplugged the AF5X and regained management of it locally. You can see the HUGE amount of RX pause frames on that port as well.
I've read a good number of the threads on pause frames but I don't understand exactly why they would be sent out ports that don't share common VLAN's or why the port towards the router didn't see them (flow control was off on that port, maybe that's why?). Likewise, why would pretty much every other port on the switch (as far as I can tell) have the pause storm traffic being sent out them? Why did it lock up SW2 until the traffic was gone but not SW1? Flow control was on SW1. Why did it not affect the mid-span back hauls? It seemed like if it was going to spill traffic out vlan 1 which wasn't shared either tagged or untagged with the AF5X that the traffic should have been on the mid-spans. Seems maybe the switch is handling the traffic inconsistently?
I also noticed that once I plugged my laptop in, the light on the switch was blinking fast even though my laptop was still booting up. I have to assume that the pause frames were hitting it, BUT when I looked at windows task manager and expected to see 8Mb of traffic hitting it, there was zero traffic. I can only assume that's because windows was taking in the traffic and not reporting it, maybe?
Lastly, why is this just showing up all of a sudden? Or does it just seem that way? This segment of the network has been up for months with the AF5X. It seems to me that a certain rate of traffic is triggering the pause flood. The leg that locked up is routed but only for management. The customer traffic is tagged vlan's between the main tower and the mini pop until I reconfigure the mini pop router. This mini pop also feeds one other smaller mini pop.
SO AGAIN we see the problem come up when there are multiple switches sharing layer 2 domains. Thinking further, the second down stream mini pop isn't even a Netonix switch, it's a MikroTik RouterBOARD that's acting like a switch and its back haul is a NanoStation Loco M5 link. It seems that it's TRANSIT traffic across the switches that triggers it and it doesn't seem to have anything to do with the fact that down stream devices are not AF5X or Netonix.
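As a sanity check on the "8Mb/15K PPS" figure above, here is a minimal sketch of the arithmetic, assuming the stream consists entirely of minimum-size (64-byte) Ethernet frames, which is what a padded PAUSE frame is. The function name and the overhead toggle are illustrative only; nothing here was measured on the switches in question.

```python
# Rough bandwidth arithmetic for a stream of minimum-size Ethernet frames.
MIN_FRAME_BYTES = 64       # minimum Ethernet frame including FCS (a PAUSE frame is padded to this)
WIRE_OVERHEAD_BYTES = 20   # preamble + SFD (8 bytes) and inter-frame gap (12 bytes)


def stream_mbps(frames_per_second, include_wire_overhead=False):
    """Return the bandwidth in Mb/s used by minimum-size frames at the given rate."""
    size = MIN_FRAME_BYTES + (WIRE_OVERHEAD_BYTES if include_wire_overhead else 0)
    return frames_per_second * size * 8 / 1_000_000


print(stream_mbps(15_000))        # 7.68  (frame bytes only)
print(stream_mbps(15_000, True))  # 10.08 (including preamble and inter-frame gap)
```

So 15K minimum-size frames per second works out to roughly 8 Mb/s, which is consistent with the counters described above being made up of small control frames rather than customer traffic.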
-
tma - Experienced Member
- Posts: 122
- Joined: Tue Mar 03, 2015 4:07 pm
- Location: Oberursel, Germany
- Has thanked: 15 times
- Been thanked: 14 times
Re: Dropping ports on new WS, what is wrong with my setup?
As discussed before, flow control pause frames should be consumed by the switch and not be forwarded directly to any other port. Chris then said that a FC storm on one port may cause other ports to generate pause frames, "indirectly" passing the FC frames on. This is maybe possible but it would require these other ports to each receive an appropriate amount of input traffic that should be told to pause (more than what your laptop generated while booting). So I guess, in practice, you may see an increase in FC activity but probably not the same amount of FC traffic on other ports. Therefore, frankly, I don't fully believe in Chris' explanation. And your other observations indicate that the switch might indeed be passing FC frames through.
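For background on why pause frames should be consumed rather than forwarded: an IEEE 802.3x PAUSE frame is a MAC Control frame (EtherType 0x8808, opcode 0x0001) normally addressed to the reserved multicast address 01-80-C2-00-00-01, and addresses in the 01-80-C2-00-00-00 to 0F range are not supposed to be forwarded by 802.1D bridges. The sketch below builds and parses such a frame; the helper names are made up for illustration and it says nothing about what the Netonix or AF firmware actually does.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Build a PAUSE frame without FCS, zero-padded to the 60-byte minimum."""
    header = PAUSE_DST + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, quanta
    )
    return header.ljust(60, b"\x00")


def parse_pause_frame(frame: bytes):
    """Return the requested pause time in quanta, or None if this is not a PAUSE frame."""
    if len(frame) < 18:
        return None
    ethertype, opcode, quanta = struct.unpack("!HHH", frame[12:18])
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PAUSE_OPCODE:
        return None
    return quanta


def is_bridge_filtered(dst_mac: bytes) -> bool:
    """True for 01-80-C2-00-00-00 through 0F, which 802.1D bridges should not forward."""
    return dst_mac[:5] == bytes.fromhex("0180c20000") and dst_mac[5] <= 0x0F


frame = build_pause_frame(bytes.fromhex("020000000001"), 0xFFFF)
print(parse_pause_frame(frame), is_bridge_filtered(frame[:6]))
```

If a capture on a downstream port shows frames like this carrying the original sender's source MAC, that would point to forwarding; if the source MAC is the switch's own, the switch generated them itself.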
If the problem ever arises again for you, please do me a favor and unplug/plug seemingly unrelated ports first before you unplug/plug the AF5x or whatever you think is the source of the storm. I'm asking this because I'm under the impression that a self fulfilling prophecy is developing here that AF5(X) is the bad guy. So if we could find one case that contradicts this theory we could again be more open to other theories. Your report is the first that is helpful in this way.
--
Thomas Giger
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
Look, this is what is happening as best as we can tell.
For whatever reason the AF radios start sending out thousands of Pause Frames per second at the switch port connected to the AF radio, which will show up as Rx Pause Frames on that port. Now at this point the switch will send Tx Pause Frames out the port that is sending packets to the port connected to the AF radio.
So if you look at the example below the switch would get the Rx Pause Frames on Switch Port 2 from the AF radio so the switch would then start issuing Tx Pause Frames on Switch Port 1 to the Router to tell it to slow down which the Router would see as Rx Pause Frames on Port A.
But now Switch Port 1 and 2 are basically paused into submission (nothing is moving). Next traffic from the APs on ports 7 and 8 can not get out so their buffers fill up and they start issuing pause frames. Eventually all buffers on the switch get filled up and all ports are issuing pause frames.
Obviously this is but 1 possible network configuration and everyone's network configuration is different.
Things we know:
The pause frames are coming from AF radios. <== I think this has been confirmed by multiple people?
There are a LOT of pause frames (Thousands per second)
Disabling Flow Control on the AF radio or the switch port connected to the AF radio prevents this from happening.
This all started with the recent AF firmware release that actually got FC to work on the AF radios.
This has been duplicated on other switches including Cisco, ToughSwitch, MikroTik.
Supposedly there is a bug being fixed in the AF firmware now that is responsible for the excessive CRC errors; maybe they are related?
Not every network configuration experiences the issue. Seems to be when there is a Flat Network segment with multiple hops with switches and not routed.
Keep in mind it is not a "crash". The switch never crashes; it simply receives so many Pause Frames that it basically shuts down the switch ports as the ports become indefinitely paused.
It has been reported to be replicated on several switch models including Netonix, Cisco, MikroTik, and ToughSwitch.
When the event occurs, rebooting the airFIBER or the switch will resolve the issue. Or if you power the AF radio with the POE brick and turn POE off on the switch, you can simply unplug the cable from the switch and plug it back in; the problem resolves because this breaks Ethernet communications and re-initializes the Ethernet link. Or if you have access to the switch via another port or the console cable (if it has a console port), you can disable and then re-enable the port, which also resolves it.
Apparently it is up to 15K pause frames per second that is sent to the switch port from the AF radio.
A workaround is to disable Flow Control on either the AF radio or the switch port that connects to it.
It does not affect all network configurations, it apparently occurs on Flat network segments where there are at least 2 switches involved such as below:
Router <AF LINK> Switch <AF LINK> Switch <AF LINK> Switch
Routed networks like ours shown below do not seem to exhibit the problem:
Router+Switch <AF LINK> Switch+Router <AF LINK> Switch+Router
Now our next firmware release will have a safety check in it where if it detects the Pause Frame Storm it will disable Flow Control on that switch port and at least prevent the need to roll a truck.
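To illustrate the kind of safety check described in the last item, here is a rough sketch of a rate-based detector fed from a port's Rx pause counter. This is not the Netonix implementation; the class name, the 1000 frames/s threshold and the three-sample requirement are all assumptions chosen for the example.

```python
class PauseStormDetector:
    """Flags a port when its Rx pause counter grows too fast for several polls in a row."""

    def __init__(self, threshold_pps=1000.0, sustained_samples=3):
        self.threshold_pps = threshold_pps          # assumed trigger rate, not a vendor value
        self.sustained_samples = sustained_samples  # consecutive polls above threshold
        self._last = None                           # (counter, timestamp) from the previous poll
        self._hits = 0

    def sample(self, rx_pause_count, timestamp):
        """Feed one counter reading; return True when flow control should be disabled."""
        previous, self._last = self._last, (rx_pause_count, timestamp)
        if previous is None:
            return False
        elapsed = timestamp - previous[1]
        if elapsed <= 0:
            return False
        rate = (rx_pause_count - previous[0]) / elapsed
        # A counter reset or wrap gives a negative rate, which also clears the streak.
        self._hits = self._hits + 1 if rate >= self.threshold_pps else 0
        return self._hits >= self.sustained_samples


detector = PauseStormDetector()
readings = [(0, 0.0), (15_000, 1.0), (30_000, 2.0), (45_000, 3.0)]
print([detector.sample(count, t) for count, t in readings])
# [False, False, False, True]
```

Requiring several consecutive polls above the threshold is one way to avoid tripping on a short, legitimate burst of flow control; the right numbers would depend on the hardware.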
Support is handled on the Forums not in Emails and PMs.
Before you ask a question use the Search function to see if it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.
-
tma - Experienced Member
- Posts: 122
- Joined: Tue Mar 03, 2015 4:07 pm
- Location: Oberursel, Germany
- Has thanked: 15 times
- Been thanked: 14 times
Re: Dropping ports on new WS, what is wrong with my setup?
sirhc wrote:Now our next firmware release will have a safety check in it where if it detects the Pause Frame Storm it will disable Flow Control on that switch port and at least prevent the need to roll a truck.
Please make sure the switch sends an email when it turns off FC so we know that this has happened.
--
Thomas Giger
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
tma wrote:sirhc wrote:Now our next firmware release will have a safety check in it where if it detects the Pause Frame Storm it will disable Flow Control on that switch port and at least prevent the need to roll a truck.
Please make sure the switch sends an email when it turns off FC so we know that this has happened.
It does send an alert when this happens if you have SMTP set up properly, and it also logs the action in the switch log.
We are open to other suggestions on this issue but the common thing seems to be AF radios with Flow Control and the packets appear to be coming from the AF radios to the switch in very large numbers?
We do know that disabling Flow Control on the AF radio or the switch port facing the AF radio prevents this behavior.
Our Flow Control has been working for 2+ years and only recently started acting up with AF radios once UBNT fixed FC in them?
Why it only seems to occur when there are multiple flat hops is yet to be determined.
The other reason we are reluctant to think it is us is the issue has been replicated on other brands of switches.
However getting 15K pause frames per second on a switch port is not a normal event, it simply should not happen?
And the switch is NOT "locked" up as you can pull the offending radio cable out and the switch becomes accessible again which also lends to the theory that the switch is simply paused framed into submission????
We are open to this being either UBNT or Netonix issue but so far I think the limited evidence points to the AF firmware?
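To put the 15K per second figure in perspective, a quick calculation, assuming standard 802.3x semantics where a PAUSE frame carries a 16-bit pause time measured in quanta of 512 bit times. The function names are illustrative, and the link speed is an assumption.

```python
MAX_QUANTA = 0xFFFF          # largest pause_time value a PAUSE frame can carry
BIT_TIMES_PER_QUANTUM = 512  # one pause quantum is 512 bit times


def max_pause_seconds(link_bps):
    """Longest pause a single PAUSE frame can request on a link of the given speed."""
    return MAX_QUANTA * BIT_TIMES_PER_QUANTUM / link_bps


def frames_needed_per_second(link_bps):
    """PAUSE frames per second needed to hold a link paused if each asks for the maximum."""
    return 1 / max_pause_seconds(link_bps)


gigabit = 1_000_000_000
print(round(max_pause_seconds(gigabit) * 1000, 2))  # about 33.55 ms per frame
print(round(frames_needed_per_second(gigabit), 1))  # about 29.8 frames per second
```

In other words, on a gigabit link roughly 30 maximum-length PAUSE frames per second would be enough to keep the peer paused continuously, so 15K per second is several hundred times more than any pause mechanism should need.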
-
tma - Experienced Member
- Posts: 122
- Joined: Tue Mar 03, 2015 4:07 pm
- Location: Oberursel, Germany
- Has thanked: 15 times
- Been thanked: 14 times
Re: Dropping ports on new WS, what is wrong with my setup?
sirhc wrote:But now Switch Port 1 and 2 are basically paused into submission (nothing is moving). Next traffic from the APs on ports 7 and 8 can not get out so their buffers fill up and they start issuing pause frames. Eventually all buffers on the switch get filled up and all ports are issuing pause frames.
I do understand what happens on ports 1 and 2. But I fail to understand why traffic cannot get out on ports 7 and 8. As pictured, ports 1 and 2 are what you called a "mid-span PoE" in your how-to video. If this doesn't prevent FC frames on port 6, why have this mid-span thing anyway?
FWIW, I'll ask at the UBNT forum whether the AF product line would be able to pass FC frames through to the remote side. Microwave gear like the Ceragon IP-10 can do FC between the local unit and the switch/router it is connected to but it can also be configured to forward FC frames to the remote side. UBNT Airmax devices connect the copper and wireless interfaces with a software bridge, which blocks FC frames, but AF gear has a direct pipe between the copper and wireless interfaces and thus may be able to forward FC frames.
--
Thomas Giger
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
If the APs need to get back over the AF back haul, and the packets coming from the client radios into ports 7 and 8 can only reach the switch and get no further, then they too would get jammed up, yes? Or maybe they get to the router at first and then the router buffers fill up and push back to the switch with pause frames.
Basically stuff gets jammed up with no place to go.
Keep in mind that buffers are SMALL and only meant to hold traffic for much less than 1 second but if the route out of the tower is paused how long do you think it takes for those buffers to fill up and start issuing pause frames?
Plus keep in mind switch buffers are SHARED memory so the AF port can take up all the buffers in the switch.
It is simply not a normal event to get 15K pause frames per second on a port, this should not occur....EVER
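A back-of-the-envelope sketch of how quickly a shared buffer fills once its only exit is paused. The 512 KiB buffer size and the ingress rates are hypothetical numbers picked for illustration, not specifications of any Netonix model.

```python
def fill_time_ms(buffer_bytes, ingress_bps):
    """Milliseconds until a buffer is full when traffic arrives but nothing can leave."""
    return buffer_bytes * 8 / ingress_bps * 1000


ASSUMED_BUFFER_BYTES = 512 * 1024  # hypothetical shared packet buffer size

for mbps in (10, 100, 500):
    ms = fill_time_ms(ASSUMED_BUFFER_BYTES, mbps * 1_000_000)
    print(f"{mbps} Mb/s ingress fills the buffer in about {ms:.1f} ms")
```

Even with generous assumptions the answer is tens to hundreds of milliseconds, which fits the point above that buffers hold much less than one second of traffic.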
-
tma - Experienced Member
- Posts: 122
- Joined: Tue Mar 03, 2015 4:07 pm
- Location: Oberursel, Germany
- Has thanked: 15 times
- Been thanked: 14 times
Re: Dropping ports on new WS, what is wrong with my setup?
sirhc wrote:We are open to other suggestions on this issue but the common thing seems to be AF radios with Flow Control and the packets appear to be coming from the AF radios to the switch in very large numbers?
The reason I'm questioning this explanation is that I have seen two Netonixes connected by a cable - nothing else connected and no AF in between - sending this 8 Mbps stream from one switch to the other (one way). And if the AF5 passes on FC frames from one side to the other (like a Ceragon IP-10 can do) then it is as if two Netonixes were connected by a cable. This would at least explain better why this thing is seen more often on cascaded links.
Please understand that I'm not trying to put the bug on your side. But I'm not convinced that there is proof that it can only be the AF and nothing else.
--
Thomas Giger