Flow Control / Packet Locks overview
Posted: Sat Jan 13, 2018 5:51 am
I wanted to pull this out of the 1.4.9 thread since my issue likely has little to do with the version I'm running.
I want to go over my understanding of how and why the flow control/pause flood problem with Ubiquiti gear manifests, to make sure I've got it right, and then describe the config where I was seeing the issue.
So:
- Wifi is not like wires: link speed changes, packets drop, and the connection is otherwise unstable in many real-world situations
- Some favor using flow control and some don't, and there are arguments for and against depending on how your network is set up; the point is that variable link speed means that if you use FC, it WILL fire often
- Ubiquiti's implementation tends to have bugs, and these bugs can cause a UBNT device to send a flood of FC "pause" packets to the connected switch
- If that connected switch has FC enabled, the flood of pause packets will essentially shut the switch port down
- If that port is how you reach your switch for management, you'll lose management access and remote logging
- If that connected switch has a shared buffer for all ports, it's possible to "lock up" all ports on the switch
- If the Ubiquiti device stops sending a flood of pause packets, the switch will recover when the buffers empty
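For anyone following along who hasn't looked at what a pause frame actually is, here is a minimal Python sketch of the standard 802.3x PAUSE frame layout and how long one frame asks the link partner to stay quiet. The destination MAC, EtherType, opcode and the 512-bit-time quantum come from the standard; the function names `build_pause_frame` and `pause_seconds` are just mine for illustration, and nothing here reflects how Ubiquiti or Netonix firmware is actually written.

```python
import struct

# Values defined by IEEE 802.3x for MAC Control PAUSE frames.
PAUSE_DST_MAC = bytes.fromhex("0180c2000001")  # reserved multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
QUANTUM_BITS = 512  # one pause quantum = 512 bit times at the link speed


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Return a 60-byte PAUSE frame (the NIC appends the 4-byte FCS).

    Hypothetical helper for illustration only.
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be exactly 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    header = PAUSE_DST_MAC + src_mac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PAUSE_OPCODE, quanta
    )
    # Pad with zeros up to the 60-byte minimum (64 bytes including FCS).
    return header.ljust(60, b"\x00")


def pause_seconds(quanta: int, link_bps: float) -> float:
    """How long a single PAUSE frame asks the receiver to stop sending."""
    return quanta * QUANTUM_BITS / link_bps


if __name__ == "__main__":
    frame = build_pause_frame(bytes.fromhex("020000000001"), 0xFFFF)
    print(frame.hex())
    print(f"max pause at 1 Gbps: {pause_seconds(0xFFFF, 1e9) * 1e3:.2f} ms")
```

A pause time of 0 in the same frame format means "resume now", which is part of why a port should recover once the sender stops asking for pauses and the last timer expires.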
General Questions:
- Can anyone describe in more detail how the Ubiquiti bug manifests and why the hell it is sending up to millions of PPS in pause frames? (saw that number mentioned in the forums). Is that not obviously excessive and something that they could cap as being obviously out of bounds?
- Is there a current list of known devices/firmware versions of UBNT gear that have this issue?
- In the Netonix switches there is a global setting to enable pause frame storm control. With this turned on (it appears to default on), why can this pause frame UBNT bug still lock the ports up?
- Is it true that just one device sending a flood can lock all ports up by filling the shared buffer and if so, do any of the larger switches have more than one buffer?
- With the Netonix switches being used almost exclusively by WISPs, why not default Flow Control to off and allow the more adventurous/knowledgeable to enable it if they want to risk it?
- Is there currently any case where enabling FC with Ubiquiti gear involved actually works?
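To put the "millions of PPS" figure in perspective, here is some back-of-the-envelope arithmetic in Python. It assumes a gigabit link and standard Ethernet framing (64-byte minimum frame plus 8 bytes of preamble and 12 bytes of inter-frame gap); the function names are mine and purely illustrative. It only shows what the standard allows, not what the UBNT firmware actually does.

```python
QUANTUM_BITS = 512   # one pause quantum = 512 bit times
MAX_QUANTA = 0xFFFF  # largest pause time a single frame can request
# Minimum frame on the wire: 64-byte frame + 8-byte preamble + 12-byte gap.
WIRE_BITS_MIN_FRAME = (64 + 8 + 12) * 8


def max_min_frame_pps(link_bps: float) -> float:
    """Upper bound on minimum-size frames per second at line rate."""
    return link_bps / WIRE_BITS_MIN_FRAME


def pps_to_hold_pause(link_bps: float, quanta: int = MAX_QUANTA) -> float:
    """Frames per second needed to keep a port paused continuously,
    if each frame requests `quanta` and is sent just as the last expires."""
    return link_bps / (quanta * QUANTUM_BITS)


if __name__ == "__main__":
    gig = 1e9
    print(f"line-rate ceiling: {max_min_frame_pps(gig):,.0f} pps")
    print(f"needed to hold a pause: {pps_to_hold_pause(gig):.1f} pps")
```

So on gigabit, roughly 30 max-duration pause frames per second are enough to hold a port paused indefinitely, and about 1.49 million minimum-size frames per second is the most the wire can carry at all. Anything reported in the millions would be at or near line rate, which is why a cap seems like an obvious sanity check.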
And finally, a question on my own issue... I was simply not seeing the switch recover on its own.
At the site where I experience the "lockup", I have a UBNT PowerBeam fed from a nearby sector (UBNT ac), with three VLANs bridged across from a Cisco L3 switch connected to that sector; this is the site backhaul. Two customers in the building terminate on the switch, each coming in on their own tagged VLAN, which is then untagged at the customer's premises. I also have a small sector at this site (UBNT ac), which in turn has 2 customers connected via two more PowerBeams. Flow control was at default (on) on all ports. When I lost contact with the switch, even if I turned down the ethernet port on the PowerBeam feeding the switch, turned off flow control on the PowerBeam, and let it sit there for a bit, I did not get even a single ping back from the switch on re-enabling the port. A power cycle was the only way we had to regain remote access, and that would "unlock" the switch ports for minutes or hours. Replacing the switch with a new one did not fix anything. About 16 hours ago, I disabled flow control on all ports on the switch, and so far I have not seen it go unreachable.
Three recent changes to this site:
- Upgraded to 1.4.9 (downgraded to 1.4.8 today, just in case)
- Added a MetroLinq, which was running the first time this happened (and the PowerBeam was disabled as I did not want to screw with RSTP) but was subsequently disabled for troubleshooting
- Upgraded all UBNT gear to 8.4.3, from 8.3.mumble
All of those happened at least a week before the problem surfaced.
Anyone want to take a stab at why the problem just started this week, well after any of the above changes? Or why other sites that have a very similar config with identical equipment and firmware are not seeing this problem?