Loop Protection confusing next site switches

tma
Joined: Tue Mar 03, 2015 4:07 pm
Location: Oberursel, Germany

Mon May 16, 2016 8:01 am

This is not meant as a bug report, more as a warning, like saying "Careful With That Axe, Eugene", the axe being loop protection and the probe packets it sends around. Here's the story:

In April 2015, we set up 2 Ceragon links between 3 sites, using Netonix switches (with firmware 1.3.7) for the first time at one of the sites. The other 2 sites use DLINK DGS-1210-24 switches. In the meantime we've made Ceragon fix a slow but serious memory leak (and I have yet to see whether it's really fixed). But we also had a nagging problem with one of the DLINK switches at site 1 that came up every 4 to 8 weeks out of the blue. This is the general setup:

site1 w/DLINK - Ceragon1 - site2 w/Netonix - Ceragon2 - site3 w/DLINK

Each site has a switch pair, i.e. 2 DLINKs or 2 Netonix switches, for redundancy purposes. Until yesterday, it was always DLINK1 at SITE1 that became confused. While confused, it would somehow intermittently "block" ARP replies through the Ceragon link (thereby breaking this leg of the network) and to/from some other devices attached to it, not the same devices every time, until DLINK1 was cold started. Maybe it wasn't even actively blocking ARP replies but assigning MACs to the wrong ports or similar, which is hard to analyze for a layer 3 guy like me. We swapped the switch and upgraded its firmware, to no avail.
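One way to test the "MACs on wrong ports" guess is to compare a few dumps of a switch's MAC address table taken while the problem is active. The following is only a minimal sketch: it assumes each dump has already been parsed into a dict mapping MAC address to port number, and the function name `find_moving_macs` and the sample addresses are made up for illustration. How the table is exported depends on the switch and is not covered here.

```python
from collections import defaultdict


def find_moving_macs(snapshots):
    """Return {mac: sorted ports} for every MAC seen on more than one port.

    `snapshots` is a list of dicts (hypothetical format), one per dump of
    the MAC address table, each mapping a MAC string to a port number.
    """
    ports_seen = defaultdict(set)
    for table in snapshots:
        for mac, port in table.items():
            # Normalise case so "AA:BB:..." and "aa:bb:..." count as one MAC.
            ports_seen[mac.lower()].add(port)
    return {mac: sorted(ports)
            for mac, ports in ports_seen.items()
            if len(ports) > 1}


# Made-up sample data: the second MAC shows up on port 7, then on port 24.
dumps = [
    {"00:11:22:33:44:55": 1, "AA:BB:CC:DD:EE:FF": 7},
    {"00:11:22:33:44:55": 1, "aa:bb:cc:dd:ee:ff": 24},
]
moving = find_moving_macs(dumps)
print(moving)
```

A MAC that legitimately moved (a roaming client, a failover) will also show up here, so the output is a list of candidates to look at, not proof of a fault.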

Yesterday's event was different, though, in that DLINK2 at SITE3 was suddenly doing the same. And, at the same time, NETONIX2 at SITE2 was doing the same to some of its devices (the latter had been seen before, but was assumed to be a side effect of the Ceragon link being blocked). The different scenario made me question the assumption that this is somehow a DLINK-only fault, because DLINK2 at SITE3 had been running fine for 2+ years. It made me consider that the Netonix switches at SITE2 could be the common denominator.

I blocked communications between SITE2 and SITE3 on Ceragon2 and the ARP/MAC confusion immediately stopped on both sites. I unblocked the link and the problem came back. After some searching through configuration details (and trying several changes) on both sides, all to no avail, I finally arrived at Netonix "loop protection" and turned that off - and the problem stopped immediately.
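When toggling a link or a setting like this, it helps to have a quick before/after check for ARP entries that never resolved. The sketch below assumes a Linux host on the affected segment and parses text in the format of `/proc/net/arp`, where a flags value without the 0x2 bit set marks an incomplete entry. The function name `incomplete_arp_entries` and the sample addresses are made up for illustration; this is not necessarily how the problem was observed here.

```python
def incomplete_arp_entries(arp_text):
    """Return (ip, device) pairs whose ARP entry is not marked complete.

    `arp_text` is expected to look like the contents of /proc/net/arp:
    a header line, then one entry per line with the columns
    IP address, HW type, Flags, HW address, Mask, Device.
    """
    entries = []
    for line in arp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or malformed lines
        ip, _hw_type, flags, _mac, _mask, device = fields[:6]
        # Bit 0x2 is the "completed entry" flag; if it is clear, no reply
        # has been recorded for this address.
        if int(flags, 16) & 0x2 == 0:
            entries.append((ip, device))
    return entries


# Made-up sample in /proc/net/arp format: 10.0.0.7 never resolved.
sample = """IP address       HW type     Flags       HW address            Mask     Device
10.0.0.1         0x1         0x2         00:11:22:33:44:55     *        eth0
10.0.0.7         0x1         0x0         00:00:00:00:00:00     *        eth0
"""
unresolved = incomplete_arp_entries(sample)
print(unresolved)
```

On a real host the input would come from reading `/proc/net/arp`. Running the check once with the link or feature enabled and once with it disabled gives a rough comparison, keeping in mind that an entry can also be incomplete simply because the target device is down.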

So please take this only as a suggestion to try without loop protection if you experience weird intermittent ARP/MAC problems that start out of the blue. Although I was able to stop the problem by turning the loop protection feature off on the Netonix switches, I'm not saying Netonix's implementation is at fault - it could very well be that the DLINKs need to be fixed. I'm only saying to be careful with that loop protection axe, Eugene, especially as it seems to be a default-on setting.
--
Thomas Giger
