I have identified an issue in Log Insight 2.5 where alerts sent via email or forwarded to vROPS contain the following text in the message:
“Notification event – The worker node sending this alert was unable to contact the standalone node. You may receive duplicate notifications for this alert.”
I confirmed that both forward DNS resolution and reverse lookup work as expected. I was also able to reproduce the issue in a lab environment where DNS was working correctly.
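For anyone wanting to repeat the DNS check, here is a minimal sketch of a forward and reverse lookup comparison in Python. The function name `check_forward_reverse` and the hostname in the usage note are hypothetical and not taken from the environment described above. The two resolver arguments default to the standard library calls, so they can be swapped out for testing.

```python
import socket


def check_forward_reverse(hostname,
                          resolve=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, reverse-resolve that IP, and compare.

    Returns (ip, reverse_name, matches). The comparison ignores case
    and any trailing dot on either name.
    """
    ip = resolve(hostname)            # forward lookup (A record)
    reverse_name = reverse(ip)[0]     # reverse lookup (PTR record), primary name
    matches = (reverse_name.rstrip(".").lower()
               == hostname.rstrip(".").lower())
    return ip, reverse_name, matches
```

Calling `check_forward_reverse("loginsight-01.lab.local")` (a made-up node name) on each node of the cluster returns the resolved address, the name the reverse lookup gives back, and whether the two names agree. Note that it only compares against the primary name returned by the reverse lookup, so a node that is registered under an alias would show as a mismatch.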
While VMware vRealize Operations Manager makes use of a GemFire database and vRealize Hyperic makes use of vPostgres, VMware vRealize Log Insight makes use of Cassandra. You might wonder why knowing that even matters. Well, as I’ve seen again this week, the database engine that drives each of these products essentially dictates how their environments must be designed and deployed, and what their limitations are.
This week, we had a situation where our newly deployed Log Insight cluster wasn’t performing. In fact, it was so bad that it took 20 to 30 minutes simply to log into the admin interface. Yet the CPU and memory usage counters for each of the appliances weren’t even being tickled. It was a strange issue for sure, and by 5pm on Monday 31st of August, we were in the process of logging a P1 call with VMware support.
Following on from my previous blog post, where I mentioned that we had discovered a bug in the Hyperic 5.8.4 client (on both Windows and Linux), I think it’s only fair that I share our findings. It’s a bug that we discovered whilst deploying a very large vRealize Suite (two maximum-sized global clusters of vROPS, vRLI, Hyperic and vRA/vRO).
Whilst carrying out some testing in my lab on the impact of replacing SSL certificates in Hyperic, I noticed that if, for whatever reason, authentication between the Hyperic agent and the Hyperic server fails, the Hyperic agent drives CPU utilisation on the client machine it’s running on up to between 85% and 100%. At first I thought it was an anomaly, but I was then able to reproduce the symptoms a further three times while proving to VMware GSS that the issue really does exist. A long story short