Potential bugs in WS3 firmware causing strange behaviour

Garnet
Member
 
Posts: 71
Joined: Wed Jun 02, 2021 9:29 am
Has thanked: 2 times
Been thanked: 1 time

Potential bugs in WS3 firmware causing strange behaviour

Wed Jun 02, 2021 10:34 am

Hello,

I've been testing the new WS3-14-600-AC units over the past few days, and some of the behaviour I've been seeing indicates either a hardware defect or a firmware bug.

Important Info:
Model: WS3-14-600-AC
Firmware Version: 2.0.6rc4
Note: This unit has been previously RMA'd and the following is after the unit returned from RMA

Bench Test Setup
- Switch is directly connected to AC power via its power cord
- Front ground screw is grounded to earth ground.

Testing So Far
I started validating this switch by performing a bench test following the instructions posted on these forums: LINK INSTRUCTIONS
I noted any deviations from the standard procedure, as well as any abnormalities in the results (i.e. results not matching the expected result).

Bench Test:
Deviations from standard procedure:
- Used a 10' patch cable in place of a 12' patch cable
- Switch was only run for 20 hours instead of 24 hours

Test Results:
Note that as per official instructions the switch was factory defaulted and updated to latest firmware before performing the test.

Cable Diagnostic Results
- Port 1, 3, 4, 5, 6, 7, 8, 9, 10, 11
Pair 1, Length 3M, short
Pair 2, Length 3M, short
Pair 3, Length 3M, short
Pair 4, Length 3M, short

- Port 2
Pair 1, Length 3M, short
Pair 2, Length 0M, open
Pair 3, Length 0M, open
Pair 4, Length 3M, short

- Port 12
Pair 1, Length 3M, short
Pair 2, Length 3M, short
Pair 3, Length 0M, open
Pair 4, Length 3M, short
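To make the outliers easier to see, the readings above can be tabulated. The Python sketch below is a hypothetical helper (not part of any Netonix tool) that hard-codes the reported results, converts the 10' patch cable to metres (10 ft = 3.048 m, so the reported 3 m length roughly matches the cable), and flags every pair whose status or length differs from the most common reading.

```python
from collections import Counter

FEET_TO_METRES = 0.3048
CABLE_LENGTH_M = 10 * FEET_TO_METRES  # the 10' patch cable used for the test

# (status, reported length in metres) for pairs 1-4, as listed above.
SHORT_3M = ("short", 3)
OPEN_0M = ("open", 0)
RESULTS = {port: [SHORT_3M] * 4 for port in (1, 3, 4, 5, 6, 7, 8, 9, 10, 11)}
RESULTS[2] = [SHORT_3M, OPEN_0M, OPEN_0M, SHORT_3M]
RESULTS[12] = [SHORT_3M, SHORT_3M, OPEN_0M, SHORT_3M]


def find_outliers(results):
    """Return (port, pair, reading) for each pair that differs from the most common reading."""
    counts = Counter(reading for pairs in results.values() for reading in pairs)
    majority = counts.most_common(1)[0][0]
    return [
        (port, pair, reading)
        for port, readings in sorted(results.items())
        for pair, reading in enumerate(readings, start=1)
        if reading != majority
    ]


for port, pair, (status, length) in find_outliers(RESULTS):
    print(f"Port {port}, pair {pair}: {status} at {length} m "
          f"(cable is about {CABLE_LENGTH_M:.2f} m)")
```

This only separates ports 2 and 12 from the rest; it says nothing about whether the majority "short" reading is itself correct.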

PoE Testing:
Connected an AirFiber 5XHD and tested each port powering the AirFiber with both the 24V and 48V PoE options. The radio successfully powered up in every test, and every port negotiated a 1G link.

Reboot Testing:
The switch was left running overnight with a single device connected. Upon return, the uptime read 20 hours, which was correct.
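The overnight check works because a reboot resets the uptime counter: if the reported uptime roughly equals the wall-clock time elapsed since the test started, the switch did not restart in between. A minimal Python sketch of that comparison; the function name and the 5-minute tolerance are illustrative assumptions, not part of the official procedure.

```python
from datetime import timedelta


def rebooted_during_test(uptime: timedelta, elapsed: timedelta,
                         tolerance: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if the uptime is noticeably shorter than the elapsed test time.

    A reboot resets the uptime counter, so an uptime well below the elapsed
    wall-clock time means the switch restarted at some point during the test.
    The tolerance absorbs imprecise start/end timestamps.
    """
    return uptime < elapsed - tolerance


# Bench test above: uptime read 20 hours after roughly 20 hours of running.
print(rebooted_during_test(timedelta(hours=20), timedelta(hours=20)))  # False
# Hypothetical counter-example: only 3 hours of uptime after a 20-hour run.
print(rebooted_during_test(timedelta(hours=3), timedelta(hours=20)))   # True
```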

End of Bench Test

I also performed additional testing, as this switch had issues getting an IP before it was RMA'd.

I'd like to hear feedback on the abnormalities above before delving into my additional testing (notably why every single port showed the wrong cable diagnostic). The most relevant finding, however, is that the switch seems to have trouble switching between DHCP and static IP. Additionally, rebooting the switch via RS232 or via the DEF button causes it to lock up, and when the lockup clears it prints the following error message several times:

PHP Notice: fwrite(): send of 36 bytes failed with errno=32 Broken pipe in /www/util.php on line 37

This suggests to me that for certain types of reboot the PHP scripts are crashing and preventing the switch from responding with a command shell over RS232 (note that character echo-back remains functional while the switch is locked). At a glance, line 37 of util.php appears to be in the function that processes commands entered over RS232. Bugs in the RS232 command shell would also explain another issue I've been seeing, where config changes entered over RS232 don't appear to take effect on the switch (e.g. manually assigning a static IP via RS232).
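For context, errno 32 is EPIPE on Linux: a write to a pipe or socket whose reading end has already been closed. In other words, the PHP script was still trying to send output after whatever was on the other end had gone away. The Python sketch below reproduces the same errno with an ordinary pipe; it illustrates the error class only and makes no assumption about how the switch firmware is actually structured.

```python
import errno
import os


def write_to_closed_pipe(payload: bytes) -> int:
    """Write to a pipe whose read end is already closed; return the errno raised (0 if none)."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the reading side goes away before anything is written
    try:
        # CPython ignores SIGPIPE at startup, so the failed write raises
        # BrokenPipeError instead of terminating the interpreter.
        os.write(write_fd, payload)
    except BrokenPipeError as exc:
        return exc.errno
    finally:
        os.close(write_fd)
    return 0


code = write_to_closed_pipe(b"x" * 36)  # 36 bytes, matching the PHP notice
print(code, errno.errorcode.get(code, "no error"))  # on Linux: 32 EPIPE
```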

Please let me know if there is any more information I can provide to help aid in diagnosing this issue.

Garnet

Re: Potential bugs in WS3 firmware causing strange behaviour

Wed Jun 02, 2021 11:20 am

Update on this: While troubleshooting the switch, I attempted to set a static IP manually over RS232. This did not work (the switch reports an IP of 0.0.0.0). I then re-enabled DHCP via RS232, and the switch flooded me with a wall of text for a few minutes before a process crashed and the switch rebooted. I've attached a text file containing the RS232 output. Note that I can't post the entire wall, as the same message repeated so many times that the console history doesn't go back far enough.

Another update: When this switch was being validated by another person, they observed it retrieve two IPs over DHCP, and both IPs could access the management console. A photo is attached. Note that the IP reported by the switch differs from the IP in the browser's address bar. Also note that the MAC address has been intentionally removed from this image.
Attachments
WS3-Two-IP's.png
WS3 getting two IP's
WS3-RS232-Wall.txt
(127.42 KiB)

Garnet

Re: Potential bugs in WS3 firmware causing strange behaviour

Wed Jun 02, 2021 12:12 pm

Please note that the switch being troubleshot here: viewtopic.php?f=17&t=6948&p=35128#p35128 is the same switch.

Garnet

Re: Potential bugs in WS3 firmware causing strange behaviour

Wed Jun 02, 2021 12:23 pm

And another quick update: when the switch isn't locked up, it seems to be switching its IP between 39.0.0.0 and 0.0.0.0. Those are the IPs that show up in Discovery on the WS-12-25-AC that the WS3 is connected to.
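Neither address looks like a usable lease: 0.0.0.0 is the unspecified address (no address assigned), and 39.0.0.0 lies outside the 192.168.0.x and 192.168.1.x test nets mentioned in this thread. A quick way to classify what Discovery reports is Python's standard `ipaddress` module; the expected subnets and the sample address 192.168.1.20 below are assumptions for illustration.

```python
import ipaddress

# Assumed management subnets, based on the two test nets mentioned in this thread.
EXPECTED_NETS = [ipaddress.ip_network("192.168.0.0/24"),
                 ipaddress.ip_network("192.168.1.0/24")]


def classify(addr: str) -> str:
    """Describe an address reported in Discovery relative to the expected nets."""
    ip = ipaddress.ip_address(addr)
    if ip.is_unspecified:
        return "unspecified (no address assigned)"
    if any(ip in net for net in EXPECTED_NETS):
        return "on an expected test net"
    return "outside the expected test nets"


# 192.168.1.20 is a hypothetical well-behaved address, for comparison.
for reported in ("39.0.0.0", "0.0.0.0", "192.168.1.20"):
    print(reported, "->", classify(reported))
```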

Stephen
Employee
 
Posts: 1034
Joined: Sun Dec 24, 2017 8:56 pm
Has thanked: 85 times
Been thanked: 182 times

Re: Potential bugs in WS3 firmware causing strange behaviour

Wed Jun 02, 2021 3:53 pm

The Cable Diagnostic results are a known issue that is being worked on at present. The accuracy of the cable measurement may not be improvable, though; that is still under review.

I had tested the DHCP/static issues brought up in viewtopic.php?f=17&t=6948, but I was not able to replicate any of them. However, I have put it back in the queue to be reviewed.

Garnet wrote:Update on this: While troubleshooting the switch, I attempted to set a static IP manually over RS232. This did not work (the switch reports an IP of 0.0.0.0). I then re-enabled DHCP via RS232, and the switch flooded me with a wall of text for a few minutes before a process crashed and the switch rebooted. I've attached a text file containing the RS232 output. Note that I can't post the entire wall, as the same message repeated so many times that the console history doesn't go back far enough.

Another update: When this switch was being validated by another person, they observed it retrieve two IPs over DHCP, and both IPs could access the management console. A photo is attached. Note that the IP reported by the switch differs from the IP in the browser's address bar. Also note that the MAC address has been intentionally removed from this image.


I will probably try this first to see if I can reproduce these results. Definitely odd behavior.

Garnet

Re: Potential bugs in WS3 firmware causing strange behaviour

Wed Jun 02, 2021 4:08 pm

Hi Stephen,

We did find that the switch was exhibiting some DHCP weirdness on only one of two test nets; however, both nets had identical configurations, so the trail ran cold there. That DHCP weirdness is separate from the issues we are now seeing.

I would like to know whether other WS3 units are seeing issues, given the beta status of the product, or whether this switch has problems that its original RMA may have missed.

Also, another update: The same wall of text has now been triggered simply by disconnecting a patch cable with an active link from the switch. The switch also continues to lock up on every bootup, and it appears to be the same PHP error.

If I may suggest an area to look into: I think whatever software is supposed to run on top of the OpenWRT system is failing before it can take over. This would be consistent with some of the strange behaviours I've observed, such as the switch showing up with an IP address in Discovery, but only before the full OS boots.

Also, a point of clarification: when I mentioned that this is the same switch from viewtopic.php?f=17&t=6948, I meant that it is the exact physical unit from that thread.

Stephen

Re: Potential bugs in WS3 firmware causing strange behaviour

Wed Jun 02, 2021 6:39 pm

Garnet wrote:We did find that the switch was exhibiting some DHCP weirdness on only one of two test nets; however, both nets had identical configurations, so the trail ran cold there. That DHCP weirdness is separate from the issues we are now seeing.


Thanks for the clarification. Could you be more descriptive about the test nets that did or did not cause the issue? It may help in troubleshooting.

Garnet wrote:I would like to know whether other WS3 units are seeing issues, given the beta status of the product, or whether this switch has problems that its original RMA may have missed.


There have been other issues reported, but they are generally related to PoE and OCP, which are currently being worked on.
I haven't seen any other reports of these specific issues, or this many at once, though.
But the product is in beta, so we may see more instances later on.

Garnet wrote:Also, another update: The same wall of text has now been triggered simply by disconnecting a patch cable with an active link from the switch. The switch also continues to lock up on every bootup, and it appears to be the same PHP error.


Could you describe the lockup in more detail? When the switch first boots, it does take a few minutes for the CLI to become available, which is expected behavior.
Obviously, however, the crashing caused by disconnecting a patch cable is not expected behavior.

Garnet wrote:If I may suggest an area to look into: I think whatever software is supposed to run on top of the OpenWRT system is failing before it can take over. This would be consistent with some of the strange behaviours I've observed, such as the switch showing up with an IP address in Discovery, but only before the full OS boots.


I will look into the booting process. However, I should mention that the new WS3s no longer use OpenWRT. They run a proprietary OS developed for the WS3s by our switch core provider.

I suspect that a config file may have been corrupted during your tests, which could be causing some of the new issues you're seeing, such as the PHP errors and the extreme sensitivity that results in the switch crashing. This might have occurred when using the CLI. If that's the case, it might indicate an issue with the defaulting function: instead of bringing the switch into a clean state as it's supposed to, it leaves it in a vulnerable one.
That might explain some of the new behavior.

That's my current suspicion, and I will try to replicate the scenario and find a way out of it. If I do, I'll post here about what to do.

Stephen

Re: Potential bugs in WS3 firmware causing strange behaviour

Wed Jun 02, 2021 6:52 pm

Actually, if you can access it, would you mind sending me a config backup of your WS3 via PM?

Garnet

Re: Potential bugs in WS3 firmware causing strange behaviour

Thu Jun 03, 2021 9:45 am

Stephen wrote:Thanks for the clarification. Could you be more descriptive about the test nets that did or did not cause the issue? It may help in troubleshooting.


As I said, the two nets were identical apart from the IPs used (e.g. 192.168.1.x vs 192.168.0.x), and this was prior to any of the issues we are having now. The switch currently exhibits the same behaviour regardless of which net it is connected to.


Stephen wrote:Could you describe the lockup in more detail? When the switch first boots, it does take a few minutes for the CLI to become available, which is expected behaviour.
Obviously, however, the crashing caused by disconnecting a patch cable is not expected behavior.


On every boot now, the switch locks up for approximately 10 minutes. I call this a lockup because, unlike the normal boot process, the switch consistently prints errors after the lockup clears. The error is always the previously mentioned PHP error, repeated many times, after which the CLI presents as if nothing happened. This first occurred after I rebooted the switch via the RS232 interface.

Stephen wrote:I will look into the booting process. However, I should mention that the new WS3s no longer use OpenWRT. They run a proprietary OS developed for the WS3s by our switch core provider.


Good to know. The bootup process looked almost identical to that of previous switches, so I had assumed OpenWRT was still running.

Stephen wrote:I suspect that a config file may have been corrupted during your tests, which could be causing some of the new issues you're seeing, such as the PHP errors and the extreme sensitivity that results in the switch crashing. This might have occurred when using the CLI. If that's the case, it might indicate an issue with the defaulting function: instead of bringing the switch into a clean state as it's supposed to, it leaves it in a vulnerable one.
That might explain some of the new behavior.

That's my current suspicion, and I will try to replicate the scenario and find a way out of it. If I do, I'll post here about what to do.


I must be extraordinarily unlucky if my testing corrupted a config file. To help you determine the flow of events leading to these issues, see the following timeline:

1. Switch arrives from RMA
2. Switch boots normally and all ports negotiate 1G, but I am unable to access the management console while connected directly to the switch with an adapter that has a static IP on the same net as the switch.
3. Enabled DHCP; can now connect to the switch at its IP, but only from a Windows machine

4. Performed Bench test (see previous posts)
5. Noticed the switch would not acquire an address via DHCP
5.1. Manually set the address to a static IP via RS232. The switch did not take the IP and did not revert to its default static IP
5.2. Rebooted the switch from the CLI
5.3. Switch sat locked up for several minutes; however, character echo-back over RS232 was still functional
5.4. Switch finally booted; this was the first time the PHP error occurred
5.5. Switch still did not acquire a DHCP address or revert to its default static IP
5.6. Attempted a reboot via the DEF button (green circle); same issue as 5.3 to 5.4

We had also attempted a factory reset via the DEF button (red circle) to no avail.

With the timeline laid out, I hope it's clear why I would have to be extraordinarily unlucky to have corrupted something: the only configuration change I made was assigning a static IP via the Netonix CLI and then rebooting when the address didn't take. You may want to look into your error handlers or check that you don't have any mutex violations, as those are the only ways I can see this happening unless there are significant bugs elsewhere.
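Both failure modes raised in this thread (a config file corrupted mid-write, or concurrent writers without a mutex) are commonly guarded against by serializing writers with a lock and writing the new config to a temporary file before atomically renaming it over the old one, so an interrupted write leaves either the old or the new file, never a half-written one. The Python sketch below shows that generic pattern only; it is not based on any knowledge of how the WS3 firmware stores its configuration, and `save_config` and the JSON file name are hypothetical.

```python
import json
import os
import tempfile
import threading

_config_lock = threading.Lock()  # serializes writers within this process


def save_config(path: str, config: dict) -> None:
    """Atomically replace the (hypothetical) config file at `path` with `config` as JSON."""
    directory = os.path.dirname(os.path.abspath(path))
    with _config_lock:
        # Temp file in the same directory, so the rename stays on one filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as tmp:
                json.dump(config, tmp)
                tmp.flush()
                os.fsync(tmp.fileno())  # push the data to storage before renaming
            os.replace(tmp_path, path)  # atomic on POSIX: old or new file, never partial
        except BaseException:
            # Remove the leftover temp file if anything failed before the rename.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise


# Usage with a throwaway directory and a hypothetical file name.
with tempfile.TemporaryDirectory() as workdir:
    cfg_path = os.path.join(workdir, "switch_config.json")
    save_config(cfg_path, {"dhcp": True})
    with open(cfg_path) as f:
        print(json.load(f))
```

Strict durability across power loss would also require an fsync of the containing directory after the rename; that step is omitted here for brevity.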

Garnet

Re: Potential bugs in WS3 firmware causing strange behaviour

Thu Jun 03, 2021 9:46 am

Stephen wrote:Actually, if you can access it would you mind sending me a PM of a config backup for your WS3?


That may be difficult but I will try.
