Spontaneous Simultaneous Reboots
-
RebusCom - Experienced Member
- Posts: 111
- Joined: Sun Nov 30, 2014 9:42 pm
- Location: Washington
- Has thanked: 13 times
- Been thanked: 11 times
Spontaneous Simultaneous Reboots
Today we had a bizarre incident on our network. We have six AC switches on a particular subnet, dispersed over a large geographical area, and for some unknown reason they all rebooted at the same time, cycling POE and thereby rebooting all the AP and bridge radios as well. It wasn't a power issue: power is monitored, so we would have a separate indication of that, and the routers didn't reboot. Switches on other subnets were unaffected. I do have Netonix Manager running, but only on my computer, and nobody was doing any network maintenance (it's Sunday, after all). Things were just humming along and bam, all h*ll broke loose. The manager log shows no actions, just lots of messages about remote switches going offline/online. I've never seen anything like it. Any ideas what could have triggered a synchronized reboot? Firmware is v1.5.8 on all switches.
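For what it's worth, one way to confirm the switches really power-cycled at the same moment (rather than just dropping off the manager) is to compare SNMP sysUpTime across them right after an incident. A rough sketch in Python, assuming SNMP v2c is enabled on the switches and net-snmp's snmpget is installed; the hostnames and community string are placeholders:

    #!/usr/bin/env python3
    # Rough sketch: compare sysUpTime across switches after an incident.
    # A near-zero spread in uptimes means they rebooted at the same moment.
    import re
    import subprocess

    SWITCHES = ["sw1.example.net", "sw2.example.net"]  # placeholder hostnames
    COMMUNITY = "public"                               # placeholder community
    SYSUPTIME = "1.3.6.1.2.1.1.3.0"                    # standard MIB-II sysUpTime

    def uptime_ticks(host):
        # -Oqvt: quick output, value only, timeticks as a raw number
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqvt", host, SYSUPTIME],
            text=True)
        return int(re.search(r"\d+", out).group())     # hundredths of a second

    ticks = {host: uptime_ticks(host) for host in SWITCHES}
    for host, t in ticks.items():
        print(f"{host}: up {t / 100.0:.0f}s")
    spread = (max(ticks.values()) - min(ticks.values())) / 100.0
    print(f"uptime spread: {spread:.1f}s")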
-
RebusCom - Experienced Member
Re: Spontaneous Simultaneous Reboots
Today we were again bitten by the RSTP reboot bug. It seems to be triggered simply by an RSTP state change: a bridge link struggles due to environmental conditions (which 60GHz links are prone to), and that ends up causing Netonix switches to perform a cold POE-off reboot. Depending on where it initiates within the RSTP hierarchy, it can cascade to multiple switches on the network that are running RSTP. We had a 60GHz link struggle in a passing heavy rain shower and briefly fail over to the 5GHz backup, and then everything rebooted due to the Netonix-initiated reboot. We've had failovers before without incident, so it's unknown why the reboots sometimes occur and sometimes don't.
Edit: In the prior occurrence, it turns out that one of the switches involved in the simultaneous reboot was not running RSTP, but all the switches that rebooted were on the same subnet.
-
RebusCom - Experienced Member
Re: Spontaneous Simultaneous Reboots
This evening we had a third instance of a spontaneous POE-off reboot, this time on a different subnet. The switches involved in the prior instances did not reboot this time. The firmware is the same (1.5.8), but the model is different: WS-12-250B.
It may be worth noting that one switch in each subnet has LAG enabled.
-
tma - Experienced Member
- Posts: 122
- Joined: Tue Mar 03, 2015 4:07 pm
- Location: Oberursel, Germany
- Has thanked: 15 times
- Been thanked: 14 times
Re: Spontaneous Simultaneous Reboots
We are running 120+ Netonix switches in our network. Recently, because we wanted Zabbix to monitor all of them, we decided to upgrade from older firmware versions to 1.5.8. Many of the older switches were on 1.3.9, some on 1.4.2 and 1.4.7, which lack some SNMP OIDs we are interested in. The result is rather frustrating, because 1.5.8 turns out to be significantly less stable when it comes to RSTP, LAGs, and making simple configuration changes. Some details:
For purposes of redundancy, our bigger base stations use 4 switches in an S1=S2=S3=S4=S1 ring topology, where each = is a LAG over 2 cables. Obviously, STP is needed to avoid a loop in this setup, but STP never worked reliably (causing broadcast storms, sometimes seemingly out of the blue), which is why we decided to break the ring manually between S3 and S4 by configuring the LAGged ports down, one on each side. We've not had a BC storm since (of course).
Also, on firmware 1.3.9 and most (if not all) 1.4.x versions, a config change was accepted reliably. On 1.5.8 we find that even a config change that does not involve the LAGged + STPed ports - like a simple port label change - often takes around 20 seconds, during which most traffic is blocked, or the configuration change times out and is reverted (after 60 seconds, which is our revert timer setting). When this happens, the log shows that the monitor script restarted vtss_appl somewhere between 5 and 20 times before it finally settled on the new (or reverted) configuration.
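If the switch logs are shipped to a central syslog host, such restart loops are easy to spot by counting per-minute vtss_appl mentions. A quick sketch; the log file path and the match string are assumptions, as the exact log wording will vary:

    # Quick sketch: count per-minute vtss_appl mentions in collected syslog.
    # The path and match string are assumptions - adjust to your own setup.
    import re
    from collections import Counter

    LOGFILE = "/var/log/netonix-switches.log"   # hypothetical syslog file
    PATTERN = re.compile(r"vtss_appl")

    per_minute = Counter()
    with open(LOGFILE) as f:
        for line in f:
            if PATTERN.search(line):
                per_minute[line[:12]] += 1      # "Mmm dd hh:mm" syslog prefix

    for minute in sorted(per_minute):
        if per_minute[minute] >= 5:             # we see 5-20 restarts per incident
            print(f"{minute}: {per_minute[minute]} vtss_appl events")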
Whatever the firmware is, this is not happening on our smaller base stations, which have only one Netonix switch each, so there's no need for STP and LAGs to begin with.
That is to say: through all firmware versions, there has been and still is an issue with the CPU handling STP and LACP (for LAGs) PDUs. vtss_appl usually behaves well during normal operation, but when something changes - either a configuration change or something that causes a port to go down or up - there's a high risk that the CPU (i.e. the vtss_appl process) misses STP and/or LACP control packets and concludes that a LAG is broken or an STP topology change has happened. I'm not sure whether vtss_appl really crashes or just stops responding to the monitor script, but until it has been restarted and has re-established STP and LAGs, you may experience a BC storm or similar effects, which may propagate to the next station and its switches - unless a router sets up a "subnet shield" between them.
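To put numbers on how little headroom there is: with the protocol default timers, the peer declares failure after three missed PDUs, so even a short stall is enough. A back-of-the-envelope sketch (timer values are the 802.1D/802.1w and 802.1AX defaults; the stall duration is an assumption):

    # Default failure-detection windows for RSTP and LACP: three missed
    # PDUs and the peer declares the link / LAG member dead.
    RSTP_HELLO = 2.0     # seconds between BPDUs (802.1w default)
    LACP_FAST = 1.0      # seconds between LACPDUs, fast rate
    LACP_SLOW = 30.0     # seconds between LACPDUs, slow rate
    MISSED = 3           # missed PDUs before failure is assumed

    stall = 10.0         # assumed vtss_appl stall/restart time in seconds

    windows = {
        "RSTP": RSTP_HELLO * MISSED,            # 6s
        "LACP fast rate": LACP_FAST * MISSED,   # 3s
        "LACP slow rate": LACP_SLOW * MISSED,   # 90s
    }
    for proto, window in windows.items():
        verdict = "is exceeded" if stall > window else "survives"
        print(f"{proto}: {window:.0f}s window {verdict} by a {stall:.0f}s stall")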
We have never had a switch POE-reboot unexpectedly, but that may be because we do not rely on STP to prevent loops. We will also reconfigure all LACP LAGs as static LAGs to see whether that helps prevent vtss_appl crashes when making config changes. In your case I would recommend you:
- avoid STP: you may have it running, but make sure there's no possibility of a loop by manually configuring one interface down within the ring. If that ring is made of LAGs, you can configure it so that the gap can be closed from either side if needed. But of course I understand that there are topologies that simply require STP ...
- avoid LACP: static LAGs do not require the CPU to handle LACP PDUs.
- set broadcast/multicast storm limits.
- disable loop protection, as this is just another burden on the CPU (i.e. probably on vtss_appl).
--
Thomas Giger
-
RebusCom - Experienced Member
Re: Spontaneous Simultaneous Reboots
Thomas,
We employ both LAG (though from switch to router, not switch to switch) and RSTP, and have been experiencing all the things you mention in addition to the widespread reboots.
-
Re: Spontaneous Simultaneous Reboots
Running our Netonix switches on firmware v1.5.12 or v1.5.14, we have occasionally observed similar issues on RSTP topology changes caused by path outages.
Storm control is set for Broadcast, Multicast, and Unicast (i.e. 2/4/8k); Pause Frames are enabled; Loop Protection is disabled on all Netonix switches.
From some debugging, it appears that using RSTP on Netonix may cause device cold-reboots if an asymmetric path cost configuration causes an RSTP path enable/disable/enable/disable loop.
@RebusCom: Have you configured individual Path Cost values to prefer the 60GHz links over the slower backup links?
Since the switch has to clear the MAC table on every Topology Change (aging), this may also be the cause of the observed cold-boot issues if multiple RSTP events occur within seconds.
I wish it were possible to log and debug all STP-related TC events: viewtopic.php?f=17&t=7223
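If you want to bias RSTP firmly towards the 60GHz links, the 802.1t recommendation (path cost = 2*10^13 / link speed in bit/s) gives a consistent starting point. A small sketch; the link speeds are assumptions for illustration:

    # 802.1t-style recommended RSTP path costs: cost = 2e13 / speed in bit/s.
    # Link speeds below are assumptions; the point is to keep the backup's
    # cost clearly higher so RSTP only fails over when the primary is down.
    def rstp_path_cost(speed_bps: int) -> int:
        return int(2e13 // speed_bps)

    links = {
        "60GHz primary (assumed 1 Gbit/s)": 1_000_000_000,
        "5GHz backup (assumed 100 Mbit/s)": 100_000_000,
    }
    for name, speed in links.items():
        print(f"{name}: path cost {rstp_path_cost(speed)}")
    # -> 20000 for the primary, 200000 for the backup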