
The 8/15 issue, take 2

Posted: Thu Jun 09, 2016 8:35 pm
by tma
This post is meant to focus on a similarity in 3 other threads which (as I see it) has not been paid attention to. These 3 threads are:

1) http://forum.netonix.com/viewtopic.php?f=17&t=1022#p7949

Admittedly, this thread was flawed from the start because I had first seen the 8/15 issue occur between two Netonix switches connected by a LAG, and so I thought the LAG was the cause. The link refers to a subsequent posting where I report on a lab setup in which the 8/15 stream occurred between two Netonix switches connected by a single cable, with no Airfiber or other devices connected.

2) http://forum.netonix.com/viewtopic.php?f=17&t=1390

User swang2002 reports on an issue between "the 8 port AC model with a Mini 6 port". Admittedly, he says it is 15 Mbps (instead of 8), but this could be due to looking at total throughput or to swapping Mbps and Kpps. Other than that, his findings are quite similar to what I had seen, and so I left a comment in his thread. The thread stopped then, but it would be interesting to hear from swang2002 what he did to solve the issue long term.

3) http://forum.netonix.com/viewtopic.php?f=17&t=1654
and http://forum.netonix.com/viewtopic.php?f=17&t=1654&start=10#p12374

TheHox reported problems which, in his first post, weren't specifically pointing to the 8/15 issue. Actually, his post was about ports flapping. In his second posting, he mentioned the 8/15 thing as a second issue he had observed, and he showed in a screenshot what the 8/15 issue is about: a stream of 8 Mbps with 15 Kpps of something. At that point Wistech and Adairw became aware that they had seen the 8/15 issue too, and an Airfiber was mentioned for the first time.

Later, thread #3 zeroed in on the AF as the source of FC pause frames. It was confirmed that turning off FC will prevent the 8/15 issue, and (therefore) FC storm protection was added to Netonix firmware.

What I feel bad about is that, while thread #3 was growing, an important observation got lost: The 8/15 issue can happen without any AF, and it can happen between two Netonix switches (as reported in threads #1 and #2). If I remember correctly, TheHox emphasized once more that no AF was present in his setup, but that didn't help.

Ignoring this important fact has been a serious mistake - that's what I think and why I opened this new thread.

Furthermore, it was quite unfortunate that, at about the same time, I uncovered a bug on the AF5X that occurs if it is run in 1/4x SISO modulation: In that condition, the AF5X sends 1.25 million(!) pps of FC pause frames to a Netonix switch. This finding was then taken to be the ultimate proof that the AF is the cause behind the issues discussed in thread #3.

However, the AF5X bug is rather different: It is about 640 Mbps and 1.25 Mpps, not 8 Mbps and 15 Kpps. It also tells us that a Netonix will survive this maximum rate of pause frames (as demonstrated in my video) without becoming inaccessible, while still passing packets to the storm source. This raises the question of why an 8 Mbps / 15 Kpps stream of something (assumed to be pause frames) renders the switch inaccessible and blocks traffic going through it when a 1.25 Mpps true FC storm does not.
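As a quick arithmetic check on those two rate pairs, here is a minimal sketch. It assumes the switch graphs count 64 bytes per frame (a minimum-size Ethernet frame, without preamble or inter-frame gap); that counting convention is an assumption, not something confirmed in this thread. The function name is made up for illustration.

```python
# Assumption: each frame is counted as 64 bytes (minimum Ethernet frame size,
# which is also the size of an FC pause frame), excluding preamble and gap.
FRAME_BYTES = 64


def mbps_for(pps, frame_bytes=FRAME_BYTES):
    """Throughput in Mbps for a stream of fixed-size frames at `pps` packets/s."""
    return pps * frame_bytes * 8 / 1e6


print(mbps_for(15_000))     # the "8/15" stream -> 7.68
print(mbps_for(1_250_000))  # the AF5X 1/4x SISO bug -> 640.0
```

Both pairs work out to 64 bytes per frame, which is consistent with minimum-size frames such as pause frames. It does not prove that the 8/15 stream consists of pause frames, since any 64-byte frame would give the same ratio.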

There has been no answer to this question. Instead, as it appears to me, FC storm protection has become a self-fulfilling prophecy: It will trigger at 10 Kpps (over 5 seconds) and if it triggers for an AF24 in bad weather, the AF is blamed for abnormal behavior. But are we sure 10 Kpps of FC pause frames are abnormal in that situation?
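To make the "10 Kpps (over 5 seconds)" trigger concrete, here is a minimal sketch of one possible reading of it. How the Netonix firmware actually evaluates the window is not stated in this thread; this version assumes one sample per second and triggers only when all of the last 5 samples are at or above the threshold. The class and method names are hypothetical.

```python
from collections import deque

THRESHOLD_PPS = 10_000  # figure quoted in this post
WINDOW_SECONDS = 5      # figure quoted in this post


class PauseStormDetector:
    """Hypothetical detector: NOT the Netonix implementation."""

    def __init__(self, threshold_pps=THRESHOLD_PPS, window=WINDOW_SECONDS):
        self.threshold_pps = threshold_pps
        # Keeps only the most recent `window` one-second samples.
        self.samples = deque(maxlen=window)

    def add_sample(self, pause_pps):
        """Feed one per-second pause-frame rate; return True if a storm is flagged."""
        self.samples.append(pause_pps)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s >= self.threshold_pps for s in self.samples)
```

Under this reading, a steady 15 Kpps stream is flagged on the fifth sample, and a single quiet second resets the condition until the window fills with high samples again.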

In short: I'd like to see us step back and think through it again: We have evidence for the 8/15 issue without AF. Turning off FC will prevent that. A Netonix will survive a 1.25 Mpps FC storm, so how can there be a problem at 10 or 15 Kpps? And: Are we sure that 10 Kpps of FC pause frames are abnormal behavior?

Re: The 8/15 issue, take 2

Posted: Fri Jun 10, 2016 2:45 pm
by Eric Stern
If anyone is still having problems they will have to do some troubleshooting. Since the problem can cascade through the network, you need to trace back through the network until you find the problem device, that is, a device that is sending a pause storm but is not receiving a pause storm. Once the original source is found we can investigate further to figure out the root cause.
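The trace-back rule above can be sketched roughly as follows. This is only an illustration: the device names are made up, the per-device TX/RX pause rates are assumed to have been read off each device's port statistics by hand, and the 10 Kpps cutoff is borrowed from the storm-protection figure mentioned earlier in this thread.

```python
STORM_PPS = 10_000  # assumed cutoff for calling a pause rate a "storm"


def find_storm_origins(devices, storm_pps=STORM_PPS):
    """Return devices that send a pause storm without receiving one.

    `devices` maps a device name to (tx_pause_pps, rx_pause_pps).
    """
    return sorted(
        name
        for name, (tx_pps, rx_pps) in devices.items()
        if tx_pps >= storm_pps and rx_pps < storm_pps
    )


# Hypothetical readings: switch-b only relays the storm it receives.
readings = {
    "switch-a": (15_000, 0),
    "switch-b": (15_000, 15_000),
    "radio-c": (0, 15_000),
}
print(find_storm_origins(readings))  # ['switch-a']
```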

Re: The 8/15 issue, take 2

Posted: Wed Feb 01, 2017 5:39 pm
by tma
Finally there's a new report of what seems to be the 8/15 issue again, over here.

Re: The 8/15 issue, take 2

Posted: Fri Jun 09, 2017 8:03 pm
by tma
So I was hoping this had been fixed somewhere between versions and it would not come up again, but it did. It happened on a site with two backhauls, one being a Ceragon IP10 (on port 17, labelled M1) and the other being an AF5X (on port 1, labelled B1) ... so, aha, the AF5X again one would think, but not so:

All switch ports are set to FC=off on this switch except for the Ceragon IP10 port - I did it this way because the customer is critical and I didn't want to risk any FC issues. Only port 17 has FC=both because Ceragon IP10s have very small buffers and start dropping packets at 310 Mbps w/o FC, but can do the full 360 Mbps with FC.

So on a day with nice weather, the Ceragon link stopped working but the backup thru the AF5X saved our ass. This time I was able to take screenshots to document this. First note the graphs for port 17 (in the background) and see that the switch sends 8 Mbps / 15 Kpps towards the Ceragon. Screenshot 2 was taken a few seconds after screenshot 1 (as fast as I could manage):

frbs1-0815-1.PNG
note the counters on the TX side for comparing with screenshot 2


frbs1-0815-2.PNG


Note that on the TX side, no Unicasts or Broadcasts are sent - just Tx Pauses and Multicasts. The Ceragon didn't send a single byte towards the switch - it couldn't under this pressure of FC frames from the switch. When I set FC=off on port 17, everything went back to normal, and after setting FC=both again the storm did not start again.
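For anyone who wants to compare two counter snapshots like the ones in the screenshots, here is a minimal sketch of the delta calculation. The counter names, the values and the 10-second interval are all made up for illustration; they are not the numbers from the screenshots.

```python
def counter_rates(before, after, seconds):
    """Per-second rate of each counter between two snapshots taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {name: (after[name] - before[name]) / seconds for name in before}


# Hypothetical TX counters for one port, read 10 seconds apart.
snap1 = {"tx_pause": 1_000_000, "tx_unicast": 500, "tx_multicast": 40}
snap2 = {"tx_pause": 1_150_000, "tx_unicast": 500, "tx_multicast": 45}

print(counter_rates(snap1, snap2, seconds=10))
```

With these example values, only the pause counter moves at roughly 15 Kpps while unicast stays flat, which is the pattern described above.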

The point is that the switch was causing this FC storm on its own. I want to emphasize again that all other ports have FC=off and they were working fine, or I would have lost access to the switch. Also, setting FC=off and FC=both on port 17 worked to restore normal operation for port 17 - I didn't do anything else because I kind of expected this to fix the problem.

The switch has firmware 1.4.7rc7. The FC storm breaker feature is enabled on the switch but it didn't do anything because it expects others to be the source of FC storms, not the switch itself.

Re: The 8/15 issue, take 2

Posted: Fri Jun 09, 2017 8:20 pm
by tma
Just saw that I started this thread exactly one year ago. I agree this is hard to debug when there are (fortunately) only 3 or 4 events on some 50 Netonix switches that we use.

However, Eric, could you maybe extend the "Pause Frame" storm breaker feature to also watch the TX direction?

This time I was lucky because a backup link existed. But we've got leaf sites without backup. There's one with a combination of a WS12 and WS6-mini which did it twice already. We installed a GSM controlled power switch right after it happened for the first time, and it saved us when it happened the second time. But frankly, this is the first time after nearly 17 years of being a WISP that I had to install and use a GSM power switch ...