Power blip causes frozen web / ssh / snmp
Posted: Sat Sep 22, 2018 8:05 am
A couple of days ago we saw a suspected momentary power outage / brownout / spike across a big part of our network, where around a third of our UBNT radios showed fresh uptimes. We have 68 Netonix switches on the network and about 10 of them rebooted immediately. Another 30 went offline according to Netonix Manager and PRTG, but they were actually still passing traffic, albeit with a slow web interface and little or no SNMP.
Using SSH was hopeless, but in the end I managed to reboot most of the switches using the web GUI and then they were fine. It took about 4 hours of watching browser tabs doing very little before I could eventually log in and click the reboot button. Some switches showed CPU at 100% whilst I was struggling to access them; others failed to load pages fully, with missing style elements etc.
I had to 'truck roll' to three switches which I couldn't access via the web GUI due to constant timeouts. The last one I eventually rebooted remotely this morning, after a further three hours of using a browser and retrying the connection many times.
This is our model breakdown:
49 x WS-6-MINI
9 x WS-12-250-DC
4 x WS-8-150-DC
4 x WS-8-250-DC
1 x WS-12-250-AC
1 x WS-26-400-AC
Anyway I was just wondering if anything could be done to prevent this type of situation happening in future? I'm pretty sure the problem was power-related rather than network-related (packet storm?) due to the fresh uptimes. The switches are mostly running 1.4.9 but a few are on 1.5.0 and the problem affected them too. Maybe a future firmware release could make the CPU more tolerant to 'blips'? Might it be worth trying some experiments in the lab to test resilience against short power outages?
I'm not complaining, just sharing my experience. And if it helps make Netonix's already-great products even better then that's a bonus!
Keep up the good work,
Thanks
Glenn