webformix wrote: OK, we upgraded a bunch of Netonix Wispswitches last night.
We had been seeing what was once a rather subtle bug on previous firmware (1.4.9 and/or 1.5.0, hard to tell) and barely had enough data to confirm it was not just coincidence. Since the upgrade, we have caught the same bug rather unsubtly twice in one day on 1.5.1. As of today's testing, the hardware the bug has presented on includes a WS-24-400A and a WS-12-400-AC.
The bug presents as follows: the Wispswitch is configured with some vlans (never more than 4 in our case, and seen with as few as the management vlan plus one more). Layer 3 traffic to one or more IP addresses on a single vlan that is not the management vlan stops flowing. So far we've only caught this on one or more access ports at a time, that is to say the vlan is set to "U" for the port. Some layer 2 traffic must still be passing, because the mac address table on the Wispswitch (and on the router behind it) rebuilds for the problem IPs after you flush them. Needless to say, flushing the mac cache does not make the bug go away.
The start of the problem for IPs reachable through a given port appears to coincide with that port hiccuping, flapping, or renegotiating a different speed, per our device and SNMP logs.
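For anyone wanting to check the same correlation, here is a rough sketch of the idea rather than our actual tooling. It assumes you have already pulled port events and outage start times out of your own logs; the tuple layouts, the function name `outages_near_port_events`, the 60-second window, and all sample values are hypothetical.

```python
from datetime import datetime, timedelta

def outages_near_port_events(port_events, outage_starts, window_seconds=60):
    """Pair each outage start with events on the same port within the window.

    port_events:   list of (port, timestamp, kind)      -- hypothetical layout
    outage_starts: list of (ip, port, first_failed_at)  -- hypothetical layout
    Returns (ip, port, kind, seconds_after_event) tuples.
    """
    window = timedelta(seconds=window_seconds)
    matches = []
    for ip, port, started in outage_starts:
        for ev_port, ev_time, ev_kind in port_events:
            if ev_port == port and abs(started - ev_time) <= window:
                delay = (started - ev_time).total_seconds()
                matches.append((ip, port, ev_kind, delay))
    return matches

# Made-up sample data, for illustration only.
port_events = [
    (5, datetime(2000, 1, 1, 10, 0, 0), "link down/up"),
    (7, datetime(2000, 1, 1, 11, 30, 0), "speed change"),
]
outage_starts = [
    ("10.0.20.15", 5, datetime(2000, 1, 1, 10, 0, 20)),
    ("10.0.20.16", 7, datetime(2000, 1, 1, 14, 0, 0)),
]

for match in outages_near_port_events(port_events, outage_starts):
    print(match)
```

With the sample data only the first outage lines up with a port event; the second starts hours after the speed change on its port and is not reported.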
"Some" IPs on a given vlan become unreachable, but not usually all. Affected IPs may be in any subnet, and other IPs in the same subnet/vlan combo may continue to work just fine. In one instance this was on an AP and we had not yet diagnosed the problem as originating with the Wispswitch, so we tried rebooting the AP. At first only one client was affected, but when the AP came back online all clients were affected, while the AP management interface was still reachable. After a second reboot, even the AP management interface stopped being reachable.
Bear in mind "reachable" is as measured from the router behind any of these Wispswitches, or from the Wispswitch's management IP on a different vlan and a different subnet from the affected targets. In the latter case its traffic has to leave the switch to the router and loop back, making it no closer to the asset than the router itself.
So, our next diagnostic step is often to create a new IP in the same subnet as one of the troubled IPs and apply it to the Wispswitch as an extra IP for that specific vlan. The thought is "if the Wispswitch can ping the asset while things behind it in the same subnet/vlan cannot, perhaps our vlan forwarding settings are incorrect on either the switch or the router".
But the instant the IP address is added to the vlan where the bug is presenting (and then save/apply), the bug disappears and all affected clients can pass traffic once more.
Bear in mind that other settings that we save/apply do not appear to affect the bug at all.
Once the bug is gone, the new IP can be cleared out from the Wispswitch's vlan with no further harm.
Please let us know if there is any kind of support dump we can download from the hardware during the bug's presentation, or if you have any other suggestions for us?
Thank you.
- - Jesse Thompson
Webformix, Bend OR
Next time you see the bug, see if you can get the logs from the switch that the bug is presenting itself on.
Also, take a look at the Port Detail page for the ports that seem relevant to the situation. Look for errors, collisions, drops, etc., and grab a screenshot of a couple of them if you can find any correlation with this issue.
Is flow control enabled on any of these switches? That mechanism has been known to cause odd behavior - switch it off if applicable.
Is it possible you could get a packet capture of the traffic that makes it through when this issue presents itself? Getting more details on the specific traffic that makes it across might help isolate what's happening.
Since VLANs seem to be involved, maybe MTU should be checked?
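For reference on the MTU question, the arithmetic can be sketched as below. This is only standard header-size math, assuming IPv4 without options, plain ICMP echo, and a single 802.1Q tag; the function names are made up for illustration and nothing here is specific to the Wispswitch.

```python
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
VLAN_TAG = 4      # one 802.1Q tag
FCS = 4           # frame check sequence
IPV4_HEADER = 20  # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header

def max_ping_payload(ip_mtu=1500):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return ip_mtu - IPV4_HEADER - ICMP_HEADER

def frame_size(ip_mtu=1500, vlan_tags=0):
    """Ethernet frame size on the wire (FCS included) for a full-MTU packet."""
    return ETH_HEADER + VLAN_TAG * vlan_tags + ip_mtu + FCS

print(max_ping_payload())   # 1472
print(frame_size(1500, 0))  # 1518 (untagged)
print(frame_size(1500, 1))  # 1522 (one VLAN tag)
```

So a ping with a 1472-byte payload and the don't-fragment bit set is the usual probe for a 1500-byte path, and any tagged link in between needs to accept 1522-byte frames for it to get through.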