Twice now I've had two 24 port AC switches stop passing packets on LAGs.
I'm using the latest release firmware, and using LAGs (with VLANs) to two Mikrotik CCRs each, version 6.35.4 at one site and 6.36 at another. There are a few backhauls also attached to the switch, which is essentially acting as a mid-span.
Initially, with STP on the LAGs, all was good. Ran for several months. Twice now recently the switch seems to just stop passing packets — the backhauls become inaccessible, OSPF adjacencies go down, etc.
I was able to log in out of band and bounce the LAGs from the Mikrotik side. There were NO spanning tree or LAG alerts in the Netonix logs. The only other time I've seen something similar was one time where a LAG failed to come up on a DC 12-port, and I had to actually reboot the Netonix.
There were no pause frame floods, no AirFiber in the mix, just Mimosa backhauls.
For good measure, I disabled STP on the LAG, as it shouldn't be needed, and I also disabled loop prevention. I left pause frame floods on.
Has anyone else experienced incompatibilities on the Mikrotik side with LAGs? Since it happened simultaneously with multiple Mikrotik routers, I was assuming it was the common point of failure, the Netonix, exhibiting this behavior.
LAGs and Mikrotik issue?
- jermudgeon
- Member
- Posts: 14
- Joined: Sat Nov 14, 2015 5:08 pm
- Has thanked: 0 time
- Been thanked: 0 time
- sirhc
- Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: LAGs and Mikrotik issue?
I have both Static and LACP LAGs running at my WISP and have not experienced this behavior, but my routers are Cisco 2951 series.
Or it could be a Mikrotik issue. Is there a firmware upgrade for the router?
The fact that you bounce the ports on the router side, and do not touch the switch, to bring things back up could go either way, but might lean towards the router?
jermudgeon wrote: Has anyone else experienced incompatibilities on the Mikrotik side with LAGs? Since it happened simultaneously with multiple Mikrotik routers, I was assuming it was the common point of failure, the Netonix, exhibiting this behavior.
Support is handled on the Forums not in Emails and PMs.
Before you ask a question use the Search function to see it has been answered before.
To do an Advanced Search click the magnifying glass in the Search Box.
To upload pictures click the Upload attachment link below the BLUE SUBMIT BUTTON.
- jermudgeon
- Member
- Posts: 14
- Joined: Sat Nov 14, 2015 5:08 pm
- Has thanked: 0 time
- Been thanked: 0 time
Re: LAGs and Mikrotik issue?
OK, it's happened again (always happens at prime time!), and I have tried a few more things.
Mikrotik firmware is at 6.35.4 and 6.36, for a total of four CCRs connected to two 24-400-As, each with two 2-port LAGs, for a total of 8 LAGs (16 ports). This is helpful for establishing baselines.
—It's not loop protection, STP or pause frame related, AFAICT
—Bouncing the entire LAG (from the router side) restores connectivity, with the following caveat:
—On 1.4.0, there's definitely some sort of odd interaction or incompatibility where the Netonix gets "stuck" in a state where disabling half the LAG — from either the switch or router side — disables all traffic on the LAG.
—I've seen a similar issue twice in a single router/single switch setup, once on a RB3011 where a LAG refused to cooperate until I actually rebooted the switch — in this instance a smaller Netonix DC switch. Rebooting the router had had no effect.
—I updated one switch to 1.4.3rc7 (since we already had a service impacting outage). Time will tell whether this helps.
—Still no log entries on either side that are LAG-related when traffic stops passing. (From the router side, Rx counters go to 0 but Tx continues.)
—I had intended to continue to test 1.4.0 without rebooting by disabling half of two lags (one router set) while leaving the other router lags untouched. This didn't work (see above). When I do reboot it, it will be to update it to rc7 or later, so it won't be conclusive re: 1.4.0
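For anyone who wants to catch the "Rx counters go to 0 but Tx continues" symptom automatically, here is a minimal sketch of the check. It assumes you already poll the router-side interface byte counters somehow (SNMP, API, whatever you have); the `CounterSample` type, the `find_rx_stall` name and the three-poll threshold are all made up for illustration, and counter wrap is not handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CounterSample:
    """One poll of an interface's cumulative byte counters (hypothetical type)."""
    rx_bytes: int
    tx_bytes: int


def find_rx_stall(samples: List[CounterSample], min_polls: int = 3) -> Optional[int]:
    """Return the index of the sample at which Rx has stayed flat while Tx
    kept increasing for `min_polls` consecutive poll intervals, else None."""
    streak = 0
    for i in range(1, len(samples)):
        rx_delta = samples[i].rx_bytes - samples[i - 1].rx_bytes
        tx_delta = samples[i].tx_bytes - samples[i - 1].tx_bytes
        if rx_delta == 0 and tx_delta > 0:
            # We are still sending, but nothing is coming back.
            streak += 1
            if streak >= min_polls:
                return i
        else:
            # Either Rx moved or the link is simply idle; start over.
            streak = 0
    return None


# Made-up counter values: Rx freezes at 200 while Tx keeps climbing.
samples = [
    CounterSample(100, 100),
    CounterSample(200, 200),
    CounterSample(200, 300),
    CounterSample(200, 400),
    CounterSample(200, 500),
]
print(find_rx_stall(samples))  # 4
```

An idle link (both deltas zero) is deliberately not flagged, so this only fires on the one-way pattern described above.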
The owner is about ready to rip out the Netonixes entirely. It's a showstopper of a bug in these cases where the Netonix is a single point of failure.
To verify my LAG settings:
Netonix side: Dst MAC/Src MAC/IP = YES, Port = No, STP on LAG = no
Mikrotik side: Mode = 802.3ad, Link monitoring = mii, Transmit Hash Policy = layer 2 and 3, LACP rate = 30s, MII interval = 100 ms
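As a side note on what the hash settings above do: with a MAC/IP-based policy, each flow is pinned to one member port of the LAG. The sketch below is only an illustrative XOR hash and is an assumption on my part, not the actual algorithm used by RouterOS or the Netonix; the `pick_member` name and the example addresses are hypothetical.

```python
import ipaddress


def pick_member(src_mac: str, dst_mac: str, src_ip: str, dst_ip: str, n_links: int) -> int:
    """Map a flow to a LAG member index using a simple layer 2+3 style XOR hash.

    Illustrative only: real devices use their own hash functions.
    """
    mac_bits = int(src_mac.replace(":", ""), 16) ^ int(dst_mac.replace(":", ""), 16)
    ip_bits = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    h = (mac_bits ^ ip_bits) & 0xFFFFFFFF
    # Fold the upper bits down so they influence the low bits used by the modulo.
    h ^= h >> 16
    h ^= h >> 8
    return h % n_links


# The same MAC/IP pair always lands on the same member of a 2-port LAG.
member = pick_member("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "10.0.0.1", "10.0.0.2", 2)
print(member)
```

The practical point is that a handful of router-to-router flows (an OSPF adjacency, for instance) may all ride a single member port, so a problem on one member can look like the whole LAG is dead.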
- sirhc
- Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: LAGs and Mikrotik issue?
jermudgeon wrote:—I updated one switch to 1.4.3rc7 (since we already had a service impacting outage). Time will tell whether this helps.
—Still no log entries on either side that are LAG-related when traffic stops passing. (From the router side, Rx counters go to 0 but Tx continues.)
—I had intended to continue to test 1.4.0 without rebooting by disabling half of two lags (one router set) while leaving the other router lags untouched. This didn't work (see above). When I do reboot it, it will be to update it to rc7 or later, so it won't be conclusive re: 1.4.0
The owner is about ready to rip out the Netonixes entirely. It's a showstopper of a bug in these cases where the Netonix is a single point of failure.
1) If you encounter a potential issue such as this you should "always" upgrade to the latest firmware, which in this case is v1.4.3rc7.
v1.4.0 is pretty dang old firmware, with TONS of fixes completed since then. v1.4.0 was released May 20 of this year, and I assume you have not read the release notes, because they say in BIG letters that this firmware was pulled due to major bugs.
v1.4.0 - PULLED BECAUSE OF 2 BUGS - Released May 20th, 2016
viewtopic.php?f=17&t=240&start=70#p12912
2) You are assuming this is a Netonix issue yet all you have to do is bounce the port on the Router side and the LAG comes back up which would more indicate an issue with the router in my opinion. And since I personally run LAGs all over my WISP albeit with Cisco Routers I have ZERO issues with LAGs.
Now that you are on the latest version of our firmware, let's see what happens.
If it happens again you should have a game plan in place to determine what is going on, such as being ready to go on site, plug into the switch, access the UI and/or CLI, and see what the switch is telling you.
jermudgeon wrote:—It's not loop protection, SPT or pause frame related, AFAICT
—Bouncing the entire LAG (from the router side) restores connectivity
Not sure how you can rule ANY of the above out because all of those issues would be cleared from a port bounce.
Other things I would try are the following:
1) Either disable Flow Control on the ports facing the router, or make sure Pause Frame Storm Protection, which is a new feature, is enabled under the Device/Configuration Tab in the Storm Protection section.
2) I strongly suggest enabling RSTP when using LAGs, so make sure it is enabled and then disable Loop Protection.
3) Disable all features you do not NEED such as:
Discovery Protocols
Discovery Tabs
SMTP alerts
Look, we are always willing to help people and go out of our way to do so, but we cannot do this unless people follow our protocols, such as keeping firmware up to date and upgrading to the latest firmware in development at the time. We can NOT fix issues found in OLD firmware versions, as those are closed threads.
People should also use good diagnostic protocols such as those described above.
But if your only solution is "hey, I am running a 4 month old firmware that was pulled from active service because of major bugs, and I have an issue that may or may not be your fault as there are 2 manufacturers involved, so give us a magic command to fix it or we rip them all out"....well, there is not much I can do to help under those circumstances.
- jermudgeon
- Member
- Posts: 14
- Joined: Sat Nov 14, 2015 5:08 pm
- Has thanked: 0 time
- Been thanked: 0 time
Re: LAGs and Mikrotik issue?
sirhc wrote:jermudgeon wrote:—I updated one switch to 1.4.3rc7 (since we already had a service impacting outage). Time will tell whether this helps.
—Still no log entries on either side that are LAG-related when traffic stops passing. (From the router side, Rx counters go to 0 but Tx continues.)
—I had intended to continue to test 1.4.0 without rebooting by disabling half of two lags (one router set) while leaving the other router lags untouched. This didn't work (see above). When I do reboot it, it will be to update it to rc7 or later, so it won't be conclusive re: 1.4.0
The owner is about ready to rip out the Netonixes entirely. It's a showstopper of a bug in these cases where the Netonix is a single point of failure.
1) If you encounter a potential issue such as this you should "always" upgrade to the latest firmware and in this case is v1.4.3rc7
Wilco.
sirhc wrote:2) You are assuming this is a Netonix issue yet all you have to do is bounce the port on the Router side and the LAG comes back up which would more indicate an issue with the router in my opinion. And since I personally run LAGs all over my WISP albeit with Cisco Routers I have ZERO issues with LAGs.
I'm actually not assuming it is a Netonix issue. We have three additional very similar setups, the difference being the 12 port DC model rather than the 24 port AC. All CCRs as well, and none of them have exhibited the issue. Two RB3011 setups, WS-8-150-DC with that one reboot-fixed issue I mentioned.
Like you, I run LAGs elsewhere with heterogeneous hardware and have no issues.
It's a process of elimination, and I'm just trying to eliminate variables.
If it happens again you should have a game plan in place to determine what is going on such as be ready to go on site and plug into switch and access the UI and or CLI and see what the switch is telling you.
So far, we have not been locked out of the switches. Are you saying that the CLI or UI will report more troubleshooting info than is in syslog?
jermudgeon wrote:—It's not loop protection, SPT or pause frame related, AFAICT
—Bouncing the entire LAG (from the router side) restores connectivity
Not sure how you can rule ANY of the above out because all of those issues would be cleared from a port bounce.
It's not loop protection, STP or pause frame related because
a) STP was already disabled, in any flavor
b) I'd had loop protection on; disabled it after the August 19 event, and so it was off for the most recent event
c) I'd had flow control on; disabled after the August 19 event, and so it was off for the most recent event
You're right that in the moment, those issues would be cleared from a port bounce. I was trying to eliminate them as a factor in the problem's recurrence, should it recur. I did not guess, but intentionally toggled them and waited for the problem to recur. It did. They are now ruled out.
Other things I would try is the following:
1) Either disable Flow Control on the ports facing the router or make sure Pause Frame Storm Protection which is a new feature is enable under the Device/Configuration Tab under the Storm Protection section.
FC, see above; I'll try pause frame storm protection in the newer firmware. Thanks.
2) I strongly suggest when using LAGs to enable RSTP so make sure it is enabled and then disable Loop Protection
I will test with RSTP back on.
3) Disable all features you do not NEED such as:
Discovery Protocols
Discovery Tabs
SMTP alerts
Look we are always willing to help people and go out of our way to help but we can not do this unless people follow our protocols such as keeping firmware up to date and upgrading to latest firmware in development at the time as we can NOT fix any issues found in OLD firmware versions as they are closed threads.
People should also use good diagnostics protocols such as those described above.
But if your only solution is hey I am running a 4 month old firmware that was pulled from active service because of major bugs and I have an issue that may or may not be your fault as there are 2 manufacturers involved so give us a magic command to fix it or we rip them all out....well not much I can do to help anyway then under those circumstances.
I hear you, Chris; I'm not personally advocating ripping them out! I appreciate the troubleshooting protocols you have described. I try to be methodical and systematic about diagnosis as well. You've given some new ideas to try, and I thank you.
- sirhc
- Employee
- Posts: 7416
- Joined: Tue Apr 08, 2014 3:48 pm
- Location: Lancaster, PA
- Has thanked: 1608 times
- Been thanked: 1325 times
Re: LAGs and Mikrotik issue?
NP, that is what I am here for.....a guy that really cares about his fellow WISPs but comes across sounding like a condescending dick head.....it's a God given talent I think!
Really curious if v1.4.3rc7 helps.
- jermudgeon
- Member
- Posts: 14
- Joined: Sat Nov 14, 2015 5:08 pm
- Has thanked: 0 time
- Been thanked: 0 time
Re: LAGs and Mikrotik issue?
sirhc wrote:NP, that is what I am here for.....a guy that really cares about his fellow WISPs but comes across sounding like a condescending dick head.....it's a God given talent I think!
Really curious if v1.4.3rc7 helps.
I'm kicking myself a little for the misunderstanding: I wrote 1.4.0 and I meant 1.4.2. That's what I get for trying to trust my memory, which is like political platforms… full of holes.
It literally just happened again today to the 1.4.2 router/switch pair, so I'm updating it to 1.4.3rc7 right now.