As said, we try to deploy a new AF5X and Netonix every day :-) Yesterday, I had an interesting experience which may help to narrow down the issue. The general setup is this:
Router(GbE) <--> Netonix1 <--LAG--> Netonix2 <--> AF5X ..... AF5X <--> Netonix3 <--> 100mbps-gear
This was on a leg of our network where we cannot provide a backup for topographic reasons. So while we switched over from the PB5 link we had before and the AF5X was coming up (with a terribly low signal for the first minute), traffic was certainly higher than the available capacity. As seen later, Netonix2 had received a very high number of FC pause frames from the AF5X it is connected to. This seems somewhat "normal" because the AF5X had too much traffic and not enough capacity to get rid of it.
At the same time (in this phase) I received these messages from Netonix1 and Netonix2:
LACP changed state to Active on port 24 (key 2) - from WLM-GS-S4 (10.x.x.21)
LACP changed state to Active on port 23 (key 2) - from WLM-GS-S4 (10.x.x.21)
Ports 23 and 24 are where the LAG connects Netonix1 to Netonix2. Yes, there were no messages saying a LACP port was down. The log contains this, with a comment of my own:
May 10 12:06:02 STP: set port 23 to discarding
May 10 12:06:02 STP: set port 24 to discarding
May 10 12:06:02 STP: set port 28 to learning # what port 28? is this a synonym for the LAG itself?
May 10 12:06:02 STP: set port 28 to forwarding
May 10 12:06:04 STP: set port 24 to learning
May 10 12:06:04 STP: set port 24 to forwarding
May 10 12:06:04 STP: set port 23 to learning
May 10 12:06:04 STP: set port 23 to forwarding
May 10 12:06:08 LACP: starting negotiation with partner 1C-BD-B9-DD-67-1A
May 10 12:06:08 LACP: LACP changed state to Active on port 24 (key 2)
May 10 12:06:08 LACP: LACP changed state to Active on port 23 (key 2)
May 10 12:06:09 STP: set port 27 to discarding
May 10 12:06:09 STP: set port 28 to discarding
May 10 12:06:09 STP: set port 23 to discarding
May 10 12:06:09 STP: set port 24 to discarding
May 10 12:06:09 STP: set port 27 to learning
May 10 12:06:09 STP: set port 27 to forwarding
May 10 12:06:10 STP: set port 27 to discarding
May 10 12:06:10 STP: set port 28 to learning
May 10 12:06:10 STP: set port 23 to learning
May 10 12:06:10 STP: set port 24 to learning
May 10 12:06:10 STP: set port 28 to forwarding
May 10 12:06:10 STP: set port 23 to forwarding
May 10 12:06:10 STP: set port 24 to forwarding
May 10 12:06:20 sSMTP[647]: Sent mail for wlm-gs-s4@xxx.xx (221 2.0.0 xxxxx.xxx.xx Service closing transmission channel) uid=0 username=xxx outbytes=459
May 10 12:06:20 sSMTP[651]: Sent mail for wlm-gs-s4@xxx.xx (221 2.0.0 xxxxx.xxx.xx Service closing transmission channel) uid=0 username=xxx outbytes=459
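To put a number on how long the LAG ports were actually blocked, the STP lines above can be tallied with a short script. This is only a sketch: the function name `blocked_seconds` and the `year` argument are made up for illustration (the log carries no year), and the timestamps have one-second resolution, so the result is approximate.

```python
import re
from collections import defaultdict
from datetime import datetime

# STP lines copied from the log above (LACP and sSMTP lines are skipped anyway).
LOG = """\
May 10 12:06:02 STP: set port 23 to discarding
May 10 12:06:02 STP: set port 24 to discarding
May 10 12:06:02 STP: set port 28 to learning # what port 28?
May 10 12:06:02 STP: set port 28 to forwarding
May 10 12:06:04 STP: set port 24 to learning
May 10 12:06:04 STP: set port 24 to forwarding
May 10 12:06:04 STP: set port 23 to learning
May 10 12:06:04 STP: set port 23 to forwarding
May 10 12:06:08 LACP: LACP changed state to Active on port 24 (key 2)
May 10 12:06:09 STP: set port 27 to discarding
May 10 12:06:09 STP: set port 28 to discarding
May 10 12:06:09 STP: set port 23 to discarding
May 10 12:06:09 STP: set port 24 to discarding
May 10 12:06:09 STP: set port 27 to learning
May 10 12:06:09 STP: set port 27 to forwarding
May 10 12:06:10 STP: set port 27 to discarding
May 10 12:06:10 STP: set port 28 to learning
May 10 12:06:10 STP: set port 23 to learning
May 10 12:06:10 STP: set port 24 to learning
May 10 12:06:10 STP: set port 28 to forwarding
May 10 12:06:10 STP: set port 23 to forwarding
May 10 12:06:10 STP: set port 24 to forwarding
"""

STP_LINE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) STP: set port (\d+) to (\w+)")


def blocked_seconds(log_text, year=2015):
    """Return {port: seconds between 'discarding' and the next 'forwarding'}.

    `year` is an assumption, needed only so the timestamps can be parsed.
    """
    since = {}                    # port -> time it entered discarding
    total = defaultdict(float)    # port -> accumulated blocked time
    for line in log_text.splitlines():
        m = STP_LINE.match(line.strip())
        if not m:
            continue              # not an STP state change
        stamp, port, state = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        port = int(port)
        if state == "discarding":
            since.setdefault(port, when)
        elif state == "forwarding" and port in since:
            total[port] += (when - since.pop(port)).total_seconds()
    return dict(total)


if __name__ == "__main__":
    for port, secs in sorted(blocked_seconds(LOG).items()):
        print(f"port {port}: about {secs:.0f} s in discarding")
```

On the excerpt above this gives roughly 3 seconds in total for each of ports 23 and 24 and 1 second for port 28. Port 27 is still in discarding when the excerpt ends, so its last interval is not counted.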
I should mention that nobody was at the site with Netonix1 and Netonix2 at that time (my guys were working at the Netonix3 side) - so the reason is certainly not that someone did something to the LAG cables. I assume the STP and LAG/LACP weirdness has been a side effect of the sudden burst of FC pause frames received from the AF5X on Netonix2 (WLM-GS-S4). Since we've aimed the link better and there's now (much) more capacity than traffic, the AF5X hasn't sent a single FC frame, though.
To some extent, this seems different from the FC frame flood reported here - mostly because it stopped by itself - but that may be because LACP acted up and helped as a kind of circuit breaker in my case, preventing further side effects that ultimately get the switch-AF pair into a vicious cycle. OTOH, it supports the theory that AF5/AF24 tend to send many pause frames - at least when load approaches capacity. This - load near capacity - may be a frequent situation for some of the links others have reported on. Or it could happen on AF24 links during rain fade, while capacity is reduced. Or it could happen while the AF switches modulation rates and there are some packets accumulating in the buffer.
BTW, UBNT-Chuck said that an AF would not deliver pause frames through the wireless link to the remote side.
Dropping ports on new WS, what is wrong with my setup?
-
tma - Experienced Member
- Posts: 122
- Joined: Tue Mar 03, 2015 4:07 pm
- Location: Oberursel, Germany
- Has thanked: 15 times
- Been thanked: 14 times
Re: Dropping ports on new WS, what is wrong with my setup?
Another day, next AF5X and Netonix deployment - more weirdness that may be related:
At first the AF5X link was badly mis-aimed and it reported a distance of 0 meters too - never seen that before:
Having more traffic than capacity and because I had some time to spare today, I looked at the switch ports that the AF5X was connected to. This is the far end:
Note that the AF5X on port had been sending a whopping 600 Mbps into the switch. And note the packet rate of 1.25 Mpps! Of course, there was no real traffic as the capacity was only ~ 16 Mbps. I was able to connect through the link into the far end switch, though, and customers were getting 16 Mbps just fine.
This one was taken at the distribution tower (near end). Again, the AF5X on port 17 was sending up to 550 Mbps, but fluctuating. The packet rate was up to 1 Mpps again.
Port statistics show that most of this traffic sent into the switch must have been pause frames.
This confirms yesterday's observation: An AF(5X) will send flow control pause frames to the switch if there's more traffic than capacity. I wouldn't have thought it could be that much, though. In fact, there must be something severely wrong, even if we consider one FC packet for each data packet received: for ~ 6 Mbps of traffic, we should get less than 6 Mbps of FC frames. Or put another way: It seems the AF answers each data packet not with a single FC pause frame, but with a staccato of them.
However, this subsides as soon as traffic is less than capacity. In particular, I haven't seen the AF firing 8 Mbps constantly for no reason. So while this confirms that the AF has a bug with FC (which I will post on the UBNT forum too), the switches didn't lock up for a second. In a way, that's nice but it doesn't help to determine what makes the switch lock up when it does.
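As a rough cross-check of the figures above, the bit rate and packet rate can be compared against the size of a pause frame. This is a sketch under assumptions: the AF5X port is taken to run at 1 Gbps, and the helper names are made up for illustration. The frame sizes are the standard Ethernet ones (64-byte minimum frame including the 4-byte FCS, plus 20 bytes of preamble and inter-frame gap on the wire).

```python
MIN_FRAME_BYTES = 64       # minimum Ethernet frame, including the 4-byte FCS
WIRE_OVERHEAD_BYTES = 20   # preamble + SFD (8) and inter-frame gap (12)


def mbps(frames_per_second, bytes_per_frame):
    """Bit rate in Mbps for a stream of equally sized frames."""
    return frames_per_second * bytes_per_frame * 8 / 1e6


def implied_frame_bytes(bits_per_second, frames_per_second):
    """Average frame size implied by a bit rate and a packet rate."""
    return bits_per_second / frames_per_second / 8


def max_frames_per_second(link_bps):
    """Line-rate limit for minimum-size frames on a link of the given speed."""
    return link_bps / ((MIN_FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8)


if __name__ == "__main__":
    # Far end: 600 Mbps at 1.25 Mpps, as read from the port statistics.
    print(implied_frame_bytes(600e6, 1.25e6))    # average bytes per frame
    print(mbps(1.25e6, MIN_FRAME_BYTES))         # same rate counted with FCS
    print(1.25e6 / max_frames_per_second(1e9))   # fraction of GbE line rate
```

The far-end numbers work out to 60 bytes per frame, which would fit minimum-size frames if the counter leaves out the FCS (an assumption about how the switch counts). At 1.25 Mpps that is about 84% of what a gigabit port can carry in minimum-size frames at all, so "wire speed" is not far off.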
--
Thomas Giger
-
adairw - Associate
- Posts: 465
- Joined: Wed Nov 05, 2014 11:47 pm
- Location: Amarillo, TX
- Has thanked: 98 times
- Been thanked: 132 times
Re: Dropping ports on new WS, what is wrong with my setup?
Good reporting Thomas.
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
Let's be clear, THE SWITCH IS NOT LOCKING UP.
When the event occurs if you unplug the AF radio the switch returns to normal.
Then if you plug the AF back in it works again until the next random event.
People need to stop calling it a lock-up; that implies the "switch" NEEDS a reboot, and in fact this is not the case. A reboot appears to fix it, but all that is needed is to unplug the AF radio and plug it right back in, which resets the Ethernet communications.
My guess is that if someone were consoled in to the switch, they could simply disable the Ethernet port facing the offending traffic (which appears to be coming from the AF radio at this time) and then re-enable it; that would do the same thing as unplugging the cable.
Also, for newcomers to the thread: disabling FC on the AF radio or on the switch port facing the AF radio prevents the event from happening, but this is a workaround for those not wanting to participate in finding more clues. In the end this is a kludge fix, as in the wireless industry you want Flow Control for best performance.
Support is handled on the Forums not in Emails and PMs.
Before you ask a question use the Search function to see it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
Also, we released rc19 last night, which has a KLUDGE that is supposed to detect 10K Pause Frames per second on a port, automatically disable Flow Control on that port, and record it in the log file. This is not meant to be a fix but a safety valve until there is a real fix from either us or UBNT. But I "think" it will be them, as it has been reported on other brands of switches, including Cisco?
Would be nice if someone can test that?
I am not sure 10K is a low enough trigger but that is what Eric wanted so that is what it is.
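For anyone wanting to reason about the trigger, here is a minimal sketch of the kind of check described: watch a port's cumulative received-pause counter and turn Flow Control off once the rate reaches 10K per second. This is not the rc19 code; the class name, the polling approach and the log wording are assumptions made up for illustration.

```python
PAUSE_RATE_LIMIT = 10_000   # pause frames per second, the trigger quoted above


class PauseFloodGuard:
    """Hypothetical per-port watchdog fed with periodic counter readings."""

    def __init__(self, port, limit=PAUSE_RATE_LIMIT):
        self.port = port
        self.limit = limit
        self.flow_control = True   # current Flow Control setting of the port
        self.log = []              # messages that would go to the log file
        self._last = None          # previous (time, counter) reading

    def sample(self, now, rx_pause_total):
        """Feed one reading; returns whether Flow Control is still enabled."""
        if self._last is not None and self.flow_control:
            then, before = self._last
            elapsed = now - then
            if elapsed > 0:
                rate = (rx_pause_total - before) / elapsed
                if rate >= self.limit:
                    self.flow_control = False
                    self.log.append(
                        f"port {self.port}: {rate:.0f} pause frames/s, "
                        "flow control disabled"
                    )
        self._last = (now, rx_pause_total)
        return self.flow_control


if __name__ == "__main__":
    guard = PauseFloodGuard(port=17)
    guard.sample(0.0, 0)
    guard.sample(1.0, 500)          # 500/s: below the limit
    guard.sample(2.0, 1_250_500)    # 1.25 million/s: trips the guard
    print(guard.flow_control, guard.log)
```

Whether 10K is low enough depends on the frame rate of the streams people actually see. If the 8 Mbps flow mentioned earlier in the thread consisted of minimum-size pause frames (an assumption), it would be somewhere around 16K frames per second and would still trip this threshold.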
-
adairw - Associate
- Posts: 465
- Joined: Wed Nov 05, 2014 11:47 pm
- Location: Amarillo, TX
- Has thanked: 98 times
- Been thanked: 132 times
Re: Dropping ports on new WS, what is wrong with my setup?
sirhc wrote:Let's be clear, THE SWITCH IS NOT LOCKING UP.
I disagree somewhat. In our case every time the "event" happens the switch is completely unresponsive until you remove the AF5X somehow.
If I can't access the switch by SSH or Web from ANY port without first unplugging a cable, that's locked up.
-
tma - Experienced Member
- Posts: 122
- Joined: Tue Mar 03, 2015 4:07 pm
- Location: Oberursel, Germany
- Has thanked: 15 times
- Been thanked: 14 times
Re: Dropping ports on new WS, what is wrong with my setup?
Sorry, I meant to say "(in)accessible". So yes, it remained accessible through the 600 Mbps flood. And I can confirm (forgot to mention):
The funny thing was that (again) the GUI showed no FC on port 17, although both sides were configured to use FC (see the picture). The AF may have taken that as permission to send FC frames and the Netonix ignored it and that is why the switches and the link remained accessible.
Next I turned off FC on the Netonix and the AF5X stopped sending FC frames (after re-negotiation). So yes, there's a workaround.
I'm just not sure it is the *same* bug. It is certainly a bug on the AF that it sends FC frames at wire speed whenever it has to transmit a few packets and capacity is not enough for that. However, the 8 Mbps flow observed when the switch becomes *inaccessible* just smells different, so you shouldn't take my report as proof that the AF is to blame.
Sorry if I keep repeating it: The 8 Mbps (inaccessible switch) problem has been seen by myself between two Netonixes and nothing else, and at least one person here has seen it without an AF.
--
Thomas Giger
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
adairw wrote:sirhc wrote:Let's be clear, THE SWITCH IS NOT LOCKING UP.
I disagree somewhat. In our case every time the "event" happens the switch is completely unresponsive until you remove the AF5X somehow.
If I can't access the switch by SSH or Web from ANY port without first unplugging a cable that's locked up.
If unplugging the AF radio allows the switch to return to normal that is not the switch being locked up. That is the switch being jammed full of Pause Frames THOUSANDS PER SECOND causing all the ports to be Paused into submission. This is an event that should never happen unless a piece of equipment is malfunctioning.
A lock-up denotes that the switch has to be rebooted, which is not the case: remove the offending never-ending stream of Pause Frames and the switch allows traffic to flow again.
If this is, as it appears at this time, a never-ending stream of Pause Frames, the switch is behaving as it is supposed to. Because this situation should never happen in the real world, the switch chip manufacturer never put in a safeguard to prevent it. And apparently other switches, including Cisco, act the exact same way when faced with this event.
However, with rc19 we did put in a KLUDGE to prevent it until it is resolved. The only questions are whether the 10K trigger point will be low enough and whether it actually works.
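For scale, a back-of-the-envelope sketch of how few pause frames it takes to keep a link partner silent. In IEEE 802.3x a pause frame carries a 16-bit pause time counted in quanta of 512 bit times. The 1 Gbps link speed and the function names below are assumptions for illustration, and the sketch says nothing about how a paused port backs up into the rest of a switch, which depends on its buffering.

```python
import math

QUANTUM_BITS = 512     # one pause quantum is 512 bit times (IEEE 802.3x)
MAX_QUANTA = 0xFFFF    # largest value of the 16-bit pause time field


def max_pause_seconds(link_bps):
    """Longest pause a single frame can request on a link of this speed."""
    return MAX_QUANTA * QUANTUM_BITS / link_bps


def frames_to_hold_pause(link_bps, quanta=MAX_QUANTA):
    """Pause frames per second needed so that the pause never expires."""
    return link_bps / (quanta * QUANTUM_BITS)


if __name__ == "__main__":
    gbe = 1e9  # assumed link speed
    print(f"max pause per frame: {max_pause_seconds(gbe) * 1000:.2f} ms")
    print(f"frames/s to stay paused: {math.ceil(frames_to_hold_pause(gbe))}")
```

On a gigabit link a single frame can ask for about 33.55 ms of silence, so around 30 frames per second with the maximum pause time would be enough to hold the receiving port's transmitter paused continuously. The 10K trigger is several hundred times above that figure.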
-
sirhc - Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: Dropping ports on new WS, what is wrong with my setup?
tma wrote:Sorry if I keep repeating it: The 8 Mbps (inaccessible switch) problem has been seen by myself between two Netonixes and nothing else, and at least one person here has seen it without an AF.
And we would love to investigate this "other" issue but apparently it is not easily repeatable?
I think the other guy who reported an issue (TheHox), though, also had a Netgear switch involved, and upon removing the Netgear the problem went away?
viewtopic.php?f=17&t=1654&p=12667&hilit=Netgear#p12667
If someone can provide a lab setup that exposes an issue, we will jump right on it.
-
tma - Experienced Member
- Posts: 122
- Joined: Tue Mar 03, 2015 4:07 pm
- Location: Oberursel, Germany
- Has thanked: 15 times
- Been thanked: 14 times
Re: Dropping ports on new WS, what is wrong with my setup?
sirhc wrote:If unplugging the AF radio allows the switch to return to normal that is not the switch being locked up. That is the switch being jammed full of Pause Frames THOUSANDS PER SECOND causing all the ports to be Paused into submission. This is an event that should never happen unless a piece of equipment is malfunctioning.
I doubt that an 8 Mbps FC stream will cause the switch to become inaccessible when it has taken 600 Mbps and one point two million pause frames per second today, remaining accessible all the time and even delivering traffic to the downstream customers. When it becomes inaccessible, something else must be going on than the AF freaking out in the way described.
--
Thomas Giger