Dropping ports on new WS, what is wrong with my setup?

tma
Experienced Member
 
Posts: 122
Joined: Tue Mar 03, 2015 4:07 pm
Location: Oberursel, Germany
Has thanked: 15 times
Been thanked: 14 times

Re: Dropping ports on new WS, what is wrong with my setup?

Wed May 11, 2016 3:00 am

As said, we try to deploy a new AF5X and Netonix every day :-) Yesterday, I had an interesting experience which may help to narrow down the issue. The general setup is this:

Router(GbE) <--> Netonix1 <--LAG--> Netonix2 <--> AF5X ..... AF5X <--> Netonix3 <--> 100mbps-gear

This was on a leg of our network where we cannot provide a backup for topographic reasons. So during the switchover from the PB5 link we had before, while the AF5X was coming up (with a terribly low signal for the first minute), traffic was certainly higher than the available capacity. As seen later, Netonix2 had received a very high number of FC pause frames from the AF5X it is connected to. This seems somewhat "normal" because the AF5X had too much traffic and not enough capacity to carry it.

At the same time (in this phase) I received these messages from Netonix1 and Netonix2:

LACP changed state to Active on port 24 (key 2) - from WLM-GS-S4 (10.x.x.21)
LACP changed state to Active on port 23 (key 2) - from WLM-GS-S4 (10.x.x.21)

Ports 23 and 24 are where the LAG connects Netonix1 to Netonix2. And yes, there were no messages saying a LACP port was down. The log contains the following, with a comment of my own:

May 10 12:06:02 STP: set port 23 to discarding
May 10 12:06:02 STP: set port 24 to discarding
May 10 12:06:02 STP: set port 28 to learning # what port 28? is this a synonym for the LAG itself?
May 10 12:06:02 STP: set port 28 to forwarding
May 10 12:06:04 STP: set port 24 to learning
May 10 12:06:04 STP: set port 24 to forwarding
May 10 12:06:04 STP: set port 23 to learning
May 10 12:06:04 STP: set port 23 to forwarding
May 10 12:06:08 LACP: starting negotiation with partner 1C-BD-B9-DD-67-1A
May 10 12:06:08 LACP: LACP changed state to Active on port 24 (key 2)
May 10 12:06:08 LACP: LACP changed state to Active on port 23 (key 2)
May 10 12:06:09 STP: set port 27 to discarding
May 10 12:06:09 STP: set port 28 to discarding
May 10 12:06:09 STP: set port 23 to discarding
May 10 12:06:09 STP: set port 24 to discarding
May 10 12:06:09 STP: set port 27 to learning
May 10 12:06:09 STP: set port 27 to forwarding
May 10 12:06:10 STP: set port 27 to discarding
May 10 12:06:10 STP: set port 28 to learning
May 10 12:06:10 STP: set port 23 to learning
May 10 12:06:10 STP: set port 24 to learning
May 10 12:06:10 STP: set port 28 to forwarding
May 10 12:06:10 STP: set port 23 to forwarding
May 10 12:06:10 STP: set port 24 to forwarding
May 10 12:06:20 sSMTP[647]: Sent mail for wlm-gs-s4@xxx.xx (221 2.0.0 xxxxx.xxx.xx Service closing transmission channel) uid=0 username=xxx outbytes=459
May 10 12:06:20 sSMTP[651]: Sent mail for wlm-gs-s4@xxx.xx (221 2.0.0 xxxxx.xxx.xx Service closing transmission channel) uid=0 username=xxx outbytes=459

I should mention that nobody was at the site with Netonix1 and Netonix2 at that time (my guys were working at the Netonix3 side) - so the reason is certainly not that someone did something to the LAG cables. I assume the STP and LAG/LACP weirdness has been a side effect of the sudden burst of FC pause frames received from the AF5X on Netonix2 (WLM-GS-S4). Since we've aimed the link better and there's now (much) more capacity than traffic, the AF5X hasn't sent a single FC frame, though.
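For anyone who wants to check their own logs for the same pattern, here is a minimal sketch that counts STP state changes per port. It assumes the log lines look exactly like the excerpt above; the function name and the sample are only illustrative, and nothing here is a Netonix tool.

```python
import re
from collections import Counter

# Matches lines such as "May 10 12:06:02 STP: set port 23 to discarding".
# The pattern is not anchored at the end, so trailing comments are tolerated.
STP_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+STP: set port (?P<port>\d+) to (?P<state>\w+)"
)

def count_stp_transitions(lines):
    """Return a Counter keyed by (port, state) for every STP line seen."""
    counts = Counter()
    for line in lines:
        match = STP_LINE.match(line.strip())
        if match:
            counts[(int(match.group("port")), match.group("state"))] += 1
    return counts

# A few lines copied from the log excerpt above.
SAMPLE = """\
May 10 12:06:02 STP: set port 23 to discarding
May 10 12:06:02 STP: set port 24 to discarding
May 10 12:06:02 STP: set port 28 to learning # what port 28?
May 10 12:06:08 LACP: starting negotiation with partner 1C-BD-B9-DD-67-1A
May 10 12:06:09 STP: set port 23 to discarding
""".splitlines()

counts = count_stp_transitions(SAMPLE)

# Collapse to "how often did each port fall back to discarding".
discards_per_port = Counter()
for (port, state), n in counts.items():
    if state == "discarding":
        discards_per_port[port] += n

for port, n in sorted(discards_per_port.items()):
    print(f"port {port}: set to discarding {n} time(s)")
```

A port that is set to discarding more than once within a few seconds, with no link-down message in between, is the kind of flapping described here.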

To some extent, this seems different from the FC frame flood reported here - mostly because it stopped by itself - but that may be because LACP acted up and helped as a kind of circuit breaker in my case, preventing further side effects that ultimately get the switch-AF pair into a vicious cycle. OTOH, it supports the theory that AF5/AF24 tend to send many pause frames - at least when load approaches capacity. This - load near capacity - may be a frequent situation for some links others have reported on. Or it could happen on AF24 links during rain fade, while capacity is reduced. Or it could happen while the AF switches modulation rates and there are some packets accumulating in the buffer.

BTW, UBNT-Chuck said that an AF would not deliver pause frames through the wireless link to the remote side.
--
Thomas Giger

tma

Wed May 11, 2016 9:26 am

Another day, next AF5X and Netonix deployment - more weirdness that may be related:

At first the AF5X link was badly mis-aimed and it reported a distance of 0 meters too - never seen that before:

002.PNG


Since there was more traffic than capacity and I had some time to spare today, I looked at the switch ports that the AF5X was connected to. This is the far end:

003.PNG


Note that the AF5X on port had been sending a whopping 600 Mbps into the switch. And note the packet rate of 1.25 Mpps! Of course, there was no real traffic as the capacity was only ~ 16 Mbps. I was able to connect through the link into the far end switch, though, and customers were getting 16 Mbps just fine.

004.PNG


This one has been taken on the distribution tower (near end). Again, the AF5X on port 17 has been sending up to 550 Mbps, but fluctuating. The packet rate is up to 1 Mpps again.

005.PNG


Port statistics show that most of this traffic sent into the switch must have been pause frames.

This confirms yesterday's observation: an AF(5X) will send flow control pause frames to the switch if there's more traffic than capacity. I wouldn't have thought it could be that much, though. In fact, there must be something severely wrong, even if we assume one FC packet for each data packet received: for ~ 6 Mbps of traffic, we should get less than 6 Mbps of FC frames. Or, to put it another way: it seems the AF answers each data packet not with a single FC pause frame but with a whole staccato of them.
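A rough sanity check of the numbers above, as a sketch only. The 600 Mbps, 1.25 Mpps and ~16 Mbps figures come from the screenshots described in this post; the Ethernet framing constants are standard; the 1500-byte data packet size and the idea that the port counters exclude the 4-byte FCS are assumptions of mine, not something confirmed for the Netonix.

```python
# Standard Ethernet framing constants (IEEE 802.3).
MIN_FRAME_BYTES = 64             # minimum frame size, including the 4-byte FCS
WIRE_OVERHEAD_BYTES = 20         # 8-byte preamble/SFD + 12-byte inter-frame gap
LINE_RATE_BPS = 1_000_000_000    # gigabit Ethernet

# Figures reported for the far-end port.
observed_bps = 600e6
observed_pps = 1.25e6
link_capacity_bps = 16e6

# ASSUMPTION: typical full-size data packets; real traffic will be a mix.
ASSUMED_DATA_PACKET_BYTES = 1500

# Average size of the frames the AF5X was sending into the switch.
observed_bytes = observed_bps / observed_pps / 8
print(f"average frame size: {observed_bytes:.0f} bytes")

# Highest possible rate of minimum-size frames on a gigabit port.
max_min_size_pps = LINE_RATE_BPS / ((MIN_FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8)
line_rate_share = observed_pps / max_min_size_pps
print(f"share of gigabit line rate: {line_rate_share:.0%}")

# How many frames came back per data packet the link could actually carry.
data_pps = link_capacity_bps / (ASSUMED_DATA_PACKET_BYTES * 8)
pause_per_data_packet = observed_pps / data_pps
print(f"frames sent per data packet: {pause_per_data_packet:.0f}")
```

An average of 60 bytes is what a minimum-size frame looks like if the counter leaves out the FCS, which fits the port statistics showing mostly pause frames. Under the packet size assumed here, that would be several hundred frames sent back per data packet, which is the "staccato" impression in numbers.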

However, this subsides as soon as traffic is less than capacity. In particular, I haven't seen the AF firing 8 Mbps constantly for no reason. So while this confirms that the AF has a bug with FC (which I will post on the UBNT forum too), the switches didn't lock up for a second. In a way, that's nice but it doesn't help to determine what makes the switch lock up when it does.

adairw
Associate
Posts: 465
Joined: Wed Nov 05, 2014 11:47 pm
Location: Amarillo, TX
Has thanked: 98 times
Been thanked: 132 times

Wed May 11, 2016 9:40 am

Good reporting Thomas.

sirhc
Employee
Posts: 7416
Joined: Tue Apr 08, 2014 3:48 pm
Location: Lancaster, PA
Has thanked: 1608 times
Been thanked: 1325 times

Wed May 11, 2016 9:44 am

Let's be clear, THE SWITCH IS NOT LOCKING UP.

When the event occurs, if you unplug the AF radio, the switch returns to normal.

Then, if you plug the AF back in, it works again until the next random event.

People need to stop calling it a lock-up; that implies the "switch" NEEDS a reboot, which is not the case. A reboot appears to fix it, but all that is needed is to unplug the AF radio and plug it right back in, which resets the Ethernet communications.

My guess is that someone consoled in to the switch could simply disable the Ethernet port facing the offending traffic (which at this time appears to be coming from the AF radio) and then re-enable it; that would do the same thing as unplugging the cable.

Also, for newcomers to the thread: disabling FC on the AF radio or on the switch port facing the AF radio prevents the event from happening, but this is a workaround for those not wanting to participate in finding more clues. In the end it is a kludge fix, as in the wireless industry you want Flow Control for best performance.
Support is handled on the Forums not in Emails and PMs.
Before you ask a question use the Search function to see it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.

sirhc

Wed May 11, 2016 9:49 am

Also, we released rc19 last night, which has a KLUDGE that is supposed to detect 10K Pause Frames per second on a port, automatically disable Flow Control on that port, and record it in the log file. This is not meant to be a fix but a safety valve until there is a real fix, either from us or from UBNT. But I "think" it will be them, as it has been reported on other brands of switches, apparently including Cisco.

It would be nice if someone could test that.

I am not sure 10K is a low enough trigger but that is what Eric wanted so that is what it is. :tounge:
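To illustrate the idea of such a trigger (and only the idea; this is NOT the rc19 code, and every name in it is hypothetical), a rate check on a cumulative pause-frame counter could look roughly like this:

```python
PAUSE_RATE_LIMIT = 10_000  # pause frames per second, the trigger value quoted above

class PauseGuard:
    """Hypothetical sketch: watch per-port RX pause counters, flag floods."""

    def __init__(self, limit=PAUSE_RATE_LIMIT):
        self.limit = limit
        self.last_sample = {}     # port -> (timestamp_seconds, cumulative_counter)
        self.fc_disabled = set()  # ports where flow control would be turned off
        self.log = []             # stand-in for the switch log file

    def sample(self, port, timestamp, rx_pause_total):
        """Feed one cumulative counter reading; return the rate or None."""
        previous = self.last_sample.get(port)
        self.last_sample[port] = (timestamp, rx_pause_total)
        if previous is None:
            return None  # first reading, no rate yet
        elapsed = timestamp - previous[0]
        if elapsed <= 0:
            return None  # ignore out-of-order or duplicate readings
        rate = (rx_pause_total - previous[1]) / elapsed
        if rate >= self.limit and port not in self.fc_disabled:
            self.fc_disabled.add(port)
            self.log.append(
                f"port {port}: {rate:.0f} pause frames/s, flow control disabled"
            )
        return rate

guard = PauseGuard()
# Port 17 receives 1.25 million pause frames within one second (a flood).
guard.sample(17, 0.0, 0)
guard.sample(17, 1.0, 1_250_000)
# Port 5 receives 500 pause frames within one second (ordinary back-pressure).
guard.sample(5, 0.0, 0)
guard.sample(5, 1.0, 500)
print(guard.log)
```

The open question about whether 10K is low enough maps directly onto the `limit` value in this sketch: a lower limit trips earlier but risks disabling flow control on links that are merely busy.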

adairw

Wed May 11, 2016 9:52 am

sirhc wrote: Let's be clear, THE SWITCH IS NOT LOCKING UP.


I disagree somewhat. In our case, every time the "event" happens the switch is completely unresponsive until you remove the AF5X somehow.
If I can't access the switch by SSH or Web from ANY port without first unplugging a cable, that's locked up.

tma

Wed May 11, 2016 10:07 am

Sorry, I meant to say "(in)accessible". So yes, it remained accessible through the 600 Mbps flood. And I can confirm (forgot to mention):

The funny thing was that (again) the GUI showed no FC on port 17, although both sides were configured to use FC (see the picture). The AF may have taken that as permission to send FC frames while the Netonix ignored them, which would explain why the switches and the link remained accessible.

Next I turned off FC on the Netonix and the AF5X stopped sending FC frames (after re-negotiation). So yes, there's a workaround.

I'm just not sure it is the *same* bug. It is certainly a bug on the AF that it sends FC frames at wire speed whenever it has to transmit a few packets and capacity is not enough for that. However, the 8 Mbps flow observed when the switch becomes *inaccessible* just smells different, so you shouldn't take my report as proof that the AF is to blame.

Sorry if I keep repeating it: the 8 Mbps (inaccessible switch) problem has been seen by myself between two Netonixes and nothing else, and at least one person here has seen it without an AF.

sirhc

Wed May 11, 2016 10:18 am

adairw wrote:
sirhc wrote: Let's be clear, THE SWITCH IS NOT LOCKING UP.


I disagree somewhat. In our case every time the "event" happens the switch is completely unresponsive until you remove the AF5X somehow.

If I can't access the switch by SSH or Web from ANY port without first unplugging a cable, that's locked up.


If unplugging the AF radio allows the switch to return to normal, that is not the switch being locked up. That is the switch being jammed full of Pause Frames, THOUSANDS PER SECOND, causing all the ports to be Paused into submission. This is an event that should never happen unless a piece of equipment is malfunctioning.

A lock-up denotes that the switch has to be rebooted, which is not the case: remove the offending never-ending stream of Pause Frames and the switch allows traffic to flow again.

If this is, as it appears at this time, a never-ending stream of Pause Frames, the switch is behaving as it is supposed to. Because this situation should never happen in the real world, the switch chip manufacturer never put in a safeguard to prevent it. And apparently other switches, including Cisco, act the exact same way when faced with this event.

However, with rc19 we did put in a KLUDGE to prevent it until it is resolved. The only questions are whether the 10K trigger point is low enough and whether it actually works.
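For readers new to the mechanism, a small sketch of what an 802.3x PAUSE frame contains and how long one frame can pause a gigabit port. The frame layout and the 512-bit-time quantum are from the Ethernet standard; the source MAC below is a made-up placeholder, and what pause time the AF actually puts in its frames is not known from this thread.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
QUANTUM_BITS = 512                         # one pause quantum = 512 bit times

def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Return a PAUSE frame without FCS (60 bytes; the MAC appends 4 more)."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    header = PAUSE_DST + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, quanta
    )
    return header.ljust(60, b"\x00")       # zero padding up to the minimum size

def pause_seconds(quanta: int, link_bps: int = 1_000_000_000) -> float:
    """How long a receiver is asked to stop sending for the given quanta."""
    return quanta * QUANTUM_BITS / link_bps

# Hypothetical, locally administered source MAC used only for illustration.
EXAMPLE_SRC = bytes.fromhex("020000000001")

frame = build_pause_frame(EXAMPLE_SRC, 0xFFFF)
longest_pause = pause_seconds(0xFFFF)
frames_to_hold_port = 1 / longest_pause

print(len(frame), "bytes before FCS")
print(f"longest pause per frame at 1 Gbps: {longest_pause * 1000:.2f} ms")
print(f"frames per second needed to keep a port paused: {frames_to_hold_port:.1f}")
```

If the frames carry the maximum pause time, a few dozen per second would be enough to keep the port facing the sender from transmitting, so thousands per second is far beyond anything flow control needs.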

sirhc

Wed May 11, 2016 10:25 am

tma wrote: Sorry if I keep repeating it: the 8 Mbps (inaccessible switch) problem has been seen by myself between two Netonixes and nothing else, and at least one person here has seen it without an AF.


And we would love to investigate this "other" issue, but apparently it is not easily repeatable?

I think the other guy who reported an issue (TheHox), though, also had a Netgear switch involved, and upon removing the Netgear the problem went away?

viewtopic.php?f=17&t=1654&p=12667&hilit=Netgear#p12667

If someone can provide a lab setup that exposes an issue, we will jump right on it.

tma

Wed May 11, 2016 10:25 am

sirhc wrote:If unplugging the AF radio allows the switch to return to normal that is not the switch being locked up. That is the switch being jammed full of Pause Frames THOUSANDS PER SECOND causing all the ports to be Paused into submission. This is an event that should never happen unless a piece of equipment is malfunctioning.


I doubt that an 8 Mbps FC stream would make the switch inaccessible when it took 600 Mbps and more than 1.2 million pause frames per second today while remaining accessible all the time and even delivering traffic to the downstream customers. When it becomes inaccessible, something other than the AF freaking out in the way described must be going on.
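To put the two cases side by side, a quick conversion of the 8 Mbps stream into frames per second. This is a sketch that assumes the 8 Mbps stream consists entirely of minimum-size (64-byte) pause frames, which nobody has confirmed; the 1.25 Mpps figure and the 10K trigger are the numbers quoted earlier in the thread.

```python
MIN_FRAME_BITS = 64 * 8        # minimum Ethernet frame, including FCS
RC19_TRIGGER_FPS = 10_000      # rc19 trigger value quoted in this thread

def frames_per_second(bits_per_second, frame_bits=MIN_FRAME_BITS):
    """Frame rate of a stream made up of fixed-size frames."""
    return bits_per_second / frame_bits

# ASSUMPTION: the whole 8 Mbps is pause frames of minimum size.
small_stream_fps = frames_per_second(8e6)
flood_fps = 1.25e6             # rate seen today while the switch stayed accessible

print(f"8 Mbps stream: {small_stream_fps:.0f} frames/s")
print(f"today's flood was {flood_fps / small_stream_fps:.0f}x that rate")
print("above rc19 trigger:", small_stream_fps > RC19_TRIGGER_FPS)
```

So under that assumption the 8 Mbps case is a much smaller frame rate than what the switch survived today, which is why it looks like a different problem, although it would still be above the rc19 trigger.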