The Issue
Recently, we had a very strange issue with a Dell server: virtual machines in VMware kept hanging. Unfortunately, the issue appeared at the same time as some faulty Windows updates, and we were somewhat delayed by what we now know was a red herring. As another bump in the road, the freezes did not affect all virtual machines. They mainly hit RDS hosts, which of course affected end users more than, say, a DC freezing for a few minutes.
As we looked deeper into the issue, we started to analyse the VMware logs and noticed that the freeze was occurring every 23 minutes. Some of these freezes lasted a few seconds and some a few minutes, and the ones that lasted a few minutes forced us to reboot the hosts.
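If you want to check your own logs for the same pattern, the idea can be sketched roughly as follows. This is a minimal illustration and not the tooling we used: it assumes you have already extracted the freeze times from the logs into a list of datetime values, and the function names, the one-minute tolerance and the 80% threshold are all hypothetical choices.

```python
from datetime import datetime, timedelta


def intervals_minutes(timestamps):
    """Return the gaps, in minutes, between consecutive timestamps."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]


def looks_periodic(timestamps, period_minutes=23, tolerance_minutes=1.0,
                   min_fraction=0.8):
    """Guess whether events recur on a fixed period.

    A gap counts as a hit if it is within tolerance_minutes of a whole
    multiple of the period, so a missed event does not hide the pattern.
    The thresholds are illustrative assumptions, not measured values.
    """
    gaps = intervals_minutes(timestamps)
    if not gaps:
        return False

    def distance_from_period(gap):
        multiple = max(1, round(gap / period_minutes))
        return abs(gap - multiple * period_minutes)

    hits = sum(1 for gap in gaps if distance_from_period(gap) <= tolerance_minutes)
    return hits / len(gaps) >= min_fraction


# Made-up sample data: one event every 23 minutes, with one event missing.
start = datetime(2021, 1, 1, 9, 0, 0)
sample = [start + timedelta(minutes=23 * i) for i in (0, 1, 2, 4, 5)]
periodic = looks_periodic(sample)
```

With timestamps pulled from real logs, a result of True for a period of 23 minutes would be a hint worth following up, not proof of this particular fault.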
The Fix
Where would we be without Google? Once we discovered that the freezes were occurring at this interval, we found this forum post on the VMware forums. And lo and behold, it worked.
The affected iDRAC version appears to be 3.23.23.23, released around the end of November 2018. What seems to occur is a probe to the hard drives every 23 minutes, which in turn causes the datastore to disconnect and delayed writes to show up in the Windows event logs.
In the post above, the user downgraded to a previous iDRAC version (the post is a few years old now). We instead upgraded our iDRAC to a late 2020 firmware version, and since then there have been no disconnects on our datastore.
Conclusion
We have since spoken to VMware and Dell, and can confirm this is an extremely rare occurrence. Still, we hope this helps anyone facing the same issue.