Stephen wrote:
Hello RTGLW,
Couple questions to see if we can figure out what's going on.
For the switches that had Discovery disabled and are still growing in memory: if you reboot them, does the memory growth stop?
For the same set of switches that continued increasing in memory usage after Discovery was disabled, is there anything else different you can tell us about their configuration? Even if it seems small, such as: were SFP ports plugged in, or were services or ports enabled/disabled that differ from the other switches? etc.
There were memory leaks found in SNMP and in the Watchdog functionality that were patched in 1.5.16. However, since the switch has so many different possible configurations, differing configs could expose other leaks that haven't otherwise been caught, or that wouldn't show up in an average deployment.
That being said, since you mentioned you saw that vtss_appl had large memory consumption in a few instances: for these switches, did they have any of the following attributes?
- SFP ports plugged in? If so, do you see any I2C errors in the switch log on these units?
- LAGs or LACP configured? Is there anything in the logs related to these services?
For the switches that continued showing memory loss where the offending process was not vtss_appl: can you tell us which process it was? ps aux is fine for this kind of check; on a production unit, that's the best that can be done, and knowing which process it is will definitely help narrow down the issue. A rough way to capture this over time is sketched below.
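If it helps, here's a minimal sketch for logging the top memory consumers periodically so you can spot the grower later. It assumes a procps-style ps on the unit (a BusyBox ps may format columns differently), and /tmp/mem.log is just a placeholder path:
Code: Select all
# Log the ten largest processes by RSS (ps aux column 6, in KiB)
# once per hour. /tmp/mem.log is a placeholder; use any writable path.
while true; do
    date >> /tmp/mem.log
    ps aux | sort -rnk 6 | head -n 10 >> /tmp/mem.log
    sleep 3600
done
Comparing snapshots a day or two apart should make the offending process obvious even if no single reading looks alarming.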
Just in case it's still SNMP that's causing problems: if you have a Linux box lying around anywhere, you can try running this in a loop to expose the problem more blatantly.
1. Install net-snmp; on Ubuntu, you can use snap to install it (sudo snap install net-snmp): https://snapcraft.io/install/net-snmp/ubuntu
- if on another distribution, I leave the details to you.
2. In a bash terminal (after net-snmp is installed) run the following:
- Code: Select all
while true; do snmpwalk -v 2c -c public <switch-ip>; done
This will continuously query the SNMP tree (replace <switch-ip> with the switch's management IP). I have tools that do similar things to try to expose issues. You can launch this in a few terminals to increase the effect; a small helper for launching several at once is sketched below.
If SNMP is the cause, the memory loss should increase dramatically while this is running. If you notice this, please let us know here, along with any other details such as the switch configuration, so we can try to replicate it. (I suppose this is obvious, but I'll say it anyway: please don't do this on switches in production.)
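For convenience, here's a minimal sketch that runs several walkers in parallel from one terminal. The target address is a placeholder you'd replace with your switch's management IP:
Code: Select all
# Hypothetical helper: run four concurrent full-tree walks to amplify
# any SNMP-related memory growth. Replace TARGET with the switch's IP.
TARGET=192.0.2.1
for i in 1 2 3 4; do
    ( while true; do snmpwalk -v 2c -c public "$TARGET" >/dev/null; done ) &
done
wait
Ctrl-C (or killing the terminal) stops all the background loops.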
Memory does go up over time during normal operation, but it should eventually level off and also clear itself. There is caching in the kernel, slightly different from an average kernel's, that is meant to keep frames moving smoothly across the interface between the CPU and the switch core. High traffic loads can grow this cache, but as traffic fluctuates up and down, it should be cleaned up as it goes along. Depending on the load, that might be all it is. One rough way to tell this cache apart from a real leak is sketched below.
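As a quick check (a sketch, assuming you have shell access to the unit and its kernel is new enough to report MemAvailable in /proc/meminfo): if the usage is reclaimable cache, MemAvailable should stay roughly flat even while MemFree falls; with a real leak, both fall together.
Code: Select all
# Sample memory counters every 10 minutes. With reclaimable cache,
# MemAvailable stays roughly flat while MemFree drops; with a leak,
# MemAvailable drops as well.
while true; do
    date
    grep -E '^(MemFree|MemAvailable|Cached):' /proc/meminfo
    sleep 600
done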
In a similar vein, if pause frames are enabled, try disabling them to see if that helps.
Hey Stephen, appreciate all the information provided on this. To answer some of your immediate questions:
For the switches that had Discovery disabled and are still growing in memory: if you reboot them, does the memory growth stop?
To keep things mostly relevant to the newest FW release, I've rebooted our 1.5.17rc2 host and will monitor over the coming week(s) to see if memory growth stops, as Discovery was only disabled after the FW upgrade and reboot. BUT, I can confirm that a 1.5.15rc3 host had to be rebooted after we disabled Discovery due to its memory being so low, and that host has NOT had its memory increase since, with 43 days of uptime.
For the same set of switches that continued increasing in memory usage after Discovery was disabled, is there anything else different you can tell us about their configuration? Even if it seems small, such as: were SFP ports plugged in, or were services or ports enabled/disabled that differ from the other switches? etc.
These switches all have an identical configuration (we provision them with a script), including which ports are in use. The only differences are that some have unused ports disabled while others do not, and their SNMP server location strings differ. No SFP, LACP, or LAGs in use on any of these.
That being said, since you mentioned you saw that vtss_appl had large memory consumption in a few instances: for these switches, did they have any of the following attributes? SFP ports plugged in? LAGs or LACP configured? Is there anything in the logs related to these services?
No SFP ports in use on those hosts, no LAGs or LACP configured either. The only events in their logs are DHCP lease renewals.
For the switches that continued showing memory loss where the offending process was not vtss_appl: can you tell us which process it was? ps aux is fine for this kind of check; on a production unit, that's the best that can be done, and knowing which process it is will definitely help narrow down the issue.
Units still showing memory loss don't yet seem far enough along for any other process to stand out as a blatant outlier. I'll continue monitoring and will report back if I observe abnormal growth on any of them.
Just in case it's still SNMP that's causing problems: if you have a Linux box lying around anywhere, you can try running this in a loop to expose the problem more blatantly.
I could set up our lab for this after we get some results on the 1.5.17rc2 host, as mentioned further up. Though I'd note that with the fixes introduced in 1.5.15~16 for SNMP, the overwhelming majority of our hosts on that FW (which have not had Discovery enabled since their last reboot) no longer show SNMP-related memory leak issues. (Another huge thanks for that one, btw.)