The idea for this post came from an activity that we engaged in with one of our premier customers who was just entering their Q4 peak. We offered them a proactive health check before they head into peak season.
Think of this as the equivalent of – “check your oil, tire inflation and head lights before you head into the hills” for your NSX footprint.
Here is a quick checklist of Health Check items. This list is not meant to be comprehensive, but lists a few common sense techniques that can be addressed prior to a forecasted peak or in a scheduled interval.
- NSX manager: Check if NTP, DNS and Syslog is setup properly.
- NSX manager: Regularly download tech support bundle so that you have a known good state handy.
- NSX manager: Check CPU, memory and storage usage on NSX manager. Temporary peaks for CPU and Memory are acceptable.
- NSX manager: Restore NSX manager backup on a test NSX manager appliance to check integrity of the file. This need not be a weekly or even monthly activity, but should be done as needed by the business. (Eg, before an expected peak)
- Check system events under NSX manager on the vSphere plugin and address any critical alerts.
- Login to ESG’s and run the “show highavailability” command – look for sync status between active and standby instances and rx/tx errors. A few errors are acceptable but continuously increasing error count can point to a larger issue. (HA will be turned off if ECMP is enabled.)
- LogInsight: login to the appliance and look for critical errors. Also check to see if all ESXi hosts are reporting syslogs to LogInsight. Admins may have forgotten to add logging capability to newly added hosts.
- VRNI: if you also have vRealize Network Insight, check the critical errors on vRNI. Periodically review system alerts and ensure they have appropriate email notifications set up. Remember to use the custom alert feature, this is easy to set up and can make a big difference in avoiding a crisis.
- VRNI System Alerts: vRNI provides 110 system alerts (and growing) that are already set up and only need to be toggled to enable with appropriate option for notifications. These are of varying severity but provide a proactive way to monitor critical infrastructure.
If there is anything else that comes to mind- please leave a comment below and I will add it to this list.