WS-26-500-DC REBOOT ISSUE EXPLAINED
Posted: Sat Mar 24, 2018 1:21 pm
If you have a WS-26-500-DC that is rebooting randomly, this post may explain why and how to fix the issue.
We had several hundred units where the cable management was done wrong.
Basically, a person on the line decided it was a good idea for cable management to zip tie the fan, power, and I2C cables together.
Question: Why is this wrong, you might ask?
Answer: There are a couple of reasons, listed below.
1) Zip tying these cables has been shown to put strain on the I2C connector, which can cause a poor connection on one of the pins and intermittent I2C errors.
2) Tying these cables together also runs the I2C cable in parallel with the fan cables, which use a modulated pulse signal to control the fan speed. This pulse signal causes noise on the fan wires, so if they run too close in parallel to the I2C cable the noise can couple onto the I2C cable and cause intermittent I2C errors.
These intermittent I2C errors were causing the Linux I2C service, which collects telemetry data from the switch board and power supply sensors, to get backed up, because the timeout it used while waiting for a sensor response was too long. When the I2C service got too backed up, it starved another service called watchdog, running on the Linux shell, of the CPU time it needed to respond to the switch core watchdog requests. Basically, if the switch core does not get a response from the Linux shell for 1 second, it assumes the Linux shell is locked up and forces a cold reboot of the switch. Keep in mind that 1 second in computer time is almost an eternity.
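To put rough numbers on that chain of events, here is a minimal Python sketch. It is a deliberately simplified model, not the firmware's code: it assumes every failing read waits out the full timeout and that the watchdog responder cannot run during that time. The 1 second deadline comes from this post; the timeout values and read counts in the example are hypothetical, since the real ones are not stated here.

```python
WATCHDOG_DEADLINE_S = 1.0  # from the post: the switch core waits 1 second


def worst_case_stall(failing_reads: int, read_timeout_s: float) -> float:
    """Seconds one polling pass spends waiting if every failing read
    runs to its full timeout (simplifying assumption)."""
    return failing_reads * read_timeout_s


def misses_deadline(failing_reads: int, read_timeout_s: float,
                    deadline_s: float = WATCHDOG_DEADLINE_S) -> bool:
    """True if the stall alone is longer than the watchdog deadline."""
    return worst_case_stall(failing_reads, read_timeout_s) > deadline_s


# Hypothetical numbers: three failing reads in one pass.
print(misses_deadline(3, 0.5))   # long timeout: stall exceeds the deadline
print(misses_deadline(3, 0.05))  # short timeout: stall stays well under it
```

The point is only that the stall grows with the timeout, so a handful of bad reads with a long timeout can use up the whole second while the same reads with a short timeout cannot.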
The problem was not affecting all WS-26-500-DC units, so we were confused and looking in all the wrong areas for the cause of the reboot. At first we thought the CPU was simply over utilized, since the WS-26-500-DC has the highest CPU utilization because it has to monitor the most sensors. So we came out with v1.5.0rc1, in which our programmers optimized the code, reducing CPU utilization on all switches by as much as 40%. Sadly this was not the issue, but it was a good thing to do anyway as it gives us more room to add the future creature features you guys are always asking for. Remember, the CPU that runs the Linux shell is for the UI/CLI, stats collection, daemons, and so on. Its utilization has no direct correlation to the amount of data the switch is passing/forwarding, as the switch core has its own CPU for packet forwarding.
Now version 1.5.0rc1 did add a new feature that helped narrow this down. It now reports in the switch log if the reboot was caused by the watchdog. From the logs people posted we were able to confirm that this was indeed what was happening, but we still did not know why.
Intellipop's Log wrote:
Dec 31 19:00:06 netonix: 1.5.0rc1-201803191145 on WS-26-500-DC
Dec 31 19:00:11 system: Setting MAC address from flash configuration: EC:13:B2:06:09:3E
Dec 31 19:00:14 admin: adding lan (eth0) to firewall zone lan
Dec 31 19:00:15 admin: Unable to query power supply
Dec 31 19:00:27 STP: MSTI0: New root on port 2, root path cost is 20000, root bridge id is 32768.64-D1-54-D5-11-AB
Dec 31 19:00:47 UI: i2c error setting 0x47 12 110
Dec 31 19:01:08 UI: i2c error setting 0x47 14 122
Dec 31 19:01:12 dropbear[931]: Running in background
Dec 31 19:01:15 switch[974]: Detected cold (watchdog) boot
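If you have a long log to go through, a few lines of Python can pick out the entries that matter. This is only a sketch: the two search strings are copied from the log excerpt above, and other firmware versions may word these messages differently. The function name `summarize_log` is made up for this example.

```python
import re

# Both patterns are taken from the log excerpt above.
I2C_ERROR = re.compile(r"i2c error", re.IGNORECASE)
WATCHDOG_BOOT = "Detected cold (watchdog) boot"


def summarize_log(text: str) -> dict:
    """Collect I2C error lines and note whether a watchdog boot was logged."""
    lines = text.splitlines()
    return {
        "i2c_errors": [ln for ln in lines if I2C_ERROR.search(ln)],
        "watchdog_boot": any(WATCHDOG_BOOT in ln for ln in lines),
    }


sample = """\
Dec 31 19:00:15 admin: Unable to query power supply
Dec 31 19:00:47 UI: i2c error setting 0x47 12 110
Dec 31 19:01:08 UI: i2c error setting 0x47 14 122
Dec 31 19:01:15 switch[974]: Detected cold (watchdog) boot
"""

summary = summarize_log(sample)
print(len(summary["i2c_errors"]), summary["watchdog_boot"])
```

A log that shows both I2C errors and a watchdog boot matches the pattern described in this post.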
What happened next was pure luck, but it was good luck so we will take it. We had people basically staring at the Device/Status TABs on several switches we were running tests on, and we happened to notice that every so often we would get an intermittent ERROR on the Board and CPU Temp as shown in the picture below.
So we decided to open the chassis and check the I2C cable, and that is when we noticed the cables were tied together, which they should not have been.
So we carefully clipped the ties off and redid the cable management the way we had intended, without these ties, as shown below.
After making sure the connectors were all seated properly and the cables were run properly, we put the chassis back together and the intermittent I2C ERRORS went away.
But even with the ERRORS, this should not have caused a watchdog reboot, so our programmers and engineers went to work to find out why. It turns out a simple timeout was set too high. We reduced it, which prevents the I2C service from getting backed up and starving the watchdog responder service of the CPU time it needs to respond to the switch core and prevent a cold reboot. This software change was released in v1.5.0rc2.
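For anyone curious what that kind of fix looks like in general, here is a minimal Python sketch. It is not the firmware's actual code. It assumes each sensor is read through a function that accepts a timeout and raises `TimeoutError` when the sensor does not answer. The overall time budget per pass is an extra idea added for this example; the change described above was only a reduced timeout. All names here are hypothetical.

```python
import time
from typing import Callable, Dict


def poll_sensors(readers: Dict[str, Callable[[float], float]],
                 read_timeout_s: float,
                 budget_s: float,
                 clock: Callable[[], float] = time.monotonic) -> Dict[str, object]:
    """Read each sensor with a short timeout and stop waiting once the
    whole pass has used up its time budget."""
    start = clock()
    results: Dict[str, object] = {}
    for name, read in readers.items():
        remaining = budget_s - (clock() - start)
        if remaining <= 0:
            # Out of time for this pass: skip instead of blocking further.
            results[name] = "SKIPPED"
            continue
        try:
            # Never wait longer than what is left of the budget.
            results[name] = read(min(read_timeout_s, remaining))
        except TimeoutError:
            # Report the failure and move on to the next sensor.
            results[name] = "ERROR"
    return results
```

With a short timeout and a budget well under 1 second, a flaky sensor shows up as an ERROR reading instead of holding up everything else.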
Now even if you have an intermittent I2C ERROR showing on the Device/Status TAB, v1.5.0rc2 "should" prevent any reboots. I still advise you to check for the intermittent ERROR by simply sitting and watching the Device/Status TAB for up to 10 minutes after upgrading to v1.5.0rc2. If you see this ERROR on the CPU or Board Temp, I would schedule a time to fix it as described and pictured above. You are also welcome to open an RMA and send it to us to fix, but we are giving you permission to cut the warranty label if you see this error. Doing so will not void your warranty on any switch with a manufactured date prior to this post. We have since corrected this on the line, and we opened every single switch in the warehouse and fixed any that were done wrong, so this will not happen moving forward.
Also, not all switches with their cables tied will have this issue. It is random, depending on how the cables were tied, what position they were in when tied, and whether tying the cables put strain on the I2C connectors and caused a poor connection.
So to recap, here is what you should do if you have a WS-26-500-DC:
1) Upgrade firmware as soon as possible to v1.5.0rc2 or newer.
2) Check for an intermittent I2C ERROR on the Device/Status TAB for CPU and/or Board Temp.
3) If you see the ERROR, either cut the warranty label, open the chassis, and fix it, or get an RMA # for us to fix it.
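If you would rather not sit and watch the TAB for 10 minutes, a small loop could do the sampling for you. This Python sketch is only an outline: `read_status` is a hypothetical function you would have to supply yourself, since this post does not describe any interface for fetching these readings. It is assumed to return the sensor names and their displayed values, with a failed reading shown as ERROR.

```python
import time
from collections import Counter
from typing import Callable, Dict


def count_errors(read_status: Callable[[], Dict[str, str]],
                 samples: int,
                 interval_s: float,
                 sleep: Callable[[float], None] = time.sleep) -> Counter:
    """Sample the readings repeatedly and count ERROR values per sensor."""
    errors: Counter = Counter()
    for i in range(samples):
        for sensor, value in read_status().items():
            if value.strip().upper() == "ERROR":
                errors[sensor] += 1
        if i < samples - 1:
            # Wait between samples, but not after the last one.
            sleep(interval_s)
    return errors
```

For example, 120 samples at a 5 second interval covers roughly the 10 minutes suggested above. Because the ERROR is intermittent, a single clean sample proves nothing; any nonzero count for CPU or Board Temp is the sign to look for.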
We are still fine tuning the firmware on this issue, but we feel that v1.5.0rc2 should prevent any reboots caused by it. It also lowers the CPU utilization on all models by as much as 40%, which is a good thing.
You can download v1.5.0rc2 HERE.
Please make a post in this thread if you find the telemetry ERROR, and let us know whether fixing the cable management clears it. If you were having reboots, also let us know whether the firmware upgrade and cable fix made them go away.
Sorry for any problems this may have caused you.