On, Off, On, Off, On…
I see this more often than I’d like, so count this as your PSA (public service announcement) on using VM-level backup software with SQL Server, which gets even worse when you add in AlwaysOn (FCI or AG). You might be familiar with the symptoms, which may include:
- Unresponsive VMs
- Lease Timeouts (Availability Groups)
- Healthcheck Failures (AG or FCI)
- Suspect Database Upon Restore of VM
- Application Timeouts
Know that you’re not alone in the endeavor to break a production environment running databases so important that some sort of high availability solution was implemented for them.
If One is Good, Six is Better
There are some very popular backup vendors and software solutions out there which will take backups of the whole VM. They might even do some extra fancy things like replication across data domains, compression, deduplication, and, apparently, causing you to involuntarily test your high availability implementation for SQL Server. Before we go any further, let’s agree that high availability for SQL Server databases is implemented because downtime should be kept to an absolute minimum, whether that downtime comes from software issues like patching or hardware issues such as bad memory. It also covers operating system environment (OSE) redundancy, and brings some secondary effects such as readable secondary replicas to help with read-only loads. While there are many potential problems with said implementations (such as having all of your VMs on the same host, storage, network, etc.), one of the items rarely, if ever, thought about is backups!
Let’s take a very popular solution, Veeam, which seems to be very good at what it was designed to do, which is backups at various levels in a virtualized environment. If we take the default setup and point it at our SQL Server VM, it will happily take a backup, and when it’s a standalone SQL Server you may or may not notice any issues. However, when you put high availability into play… well… you’re asking for a highly available service. Contrary to what most people think, the words “short” and “small” have quite a wide variety of meanings, and in the case of VM-level backups that freeze (or pause, or stun, whichever you prefer) the virtual machine, most people wouldn’t see an issue with it… but SQL Server does.
When taking the backup, there are multiple ways to do so. The default way is to pause the VM while certain underlying functions are run so that a point-in-time backup of the VM is created. Now this seems all well and fine, except there are already a few issues with it, namely the time it takes for this to happen, which is dependent upon many infrastructure factors, and the fact that it’s not a quality backup for SQL Server. Let’s talk about the second one first. In order to obtain a non-native quality backup of SQL Server (i.e. not using BACKUP DATABASE T-SQL) there is an API and infrastructure framework called VSS. SQL Server has a VSS Writer implementation which does the required work inside the instance to put the databases in a consistent state for a backup, leaving you with a good backup (assuming no other issues). Just taking a VM backup or snapshot that doesn’t integrate with VSS does not.
Now let’s get to the main point, which is how long the VM stays paused or stunned. Remember, this is a “small” or “short” amount of time, one might even say “trivial”. When it is kept so short that it really is “trivial”, as in less than a second, then all is good and you most likely won’t notice it except under very high workloads… but we should be running with VSS integration and not VM level, so it’s still incorrect, but hey. When this time is not short or trivial then GOOD things start to happen, most notably that high availability kicks in.
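If you want to put a number on “short” for your own environment, here is a minimal sketch in Python of one way to spot stalls from inside the guest: take a clock reading at a fixed interval and flag any gap much larger than that interval. The function names (find_stalls, sample_clock) and the tolerance value are mine, purely for illustration. It also assumes the guest’s wall clock is stepped forward after the VM resumes, which depends on your hypervisor’s time sync settings, and an ordinary clock correction can show up as a false stall.

```python
import time


def find_stalls(timestamps, expected_interval, tolerance):
    """Return (last_good_reading, extra_seconds) for every oversized gap.

    A gap counts as a stall when two consecutive readings are further apart
    than expected_interval + tolerance. extra_seconds is the time lost beyond
    the expected interval, which approximates how long the guest was stalled.
    """
    stalls = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = curr - prev
        if gap > expected_interval + tolerance:
            stalls.append((prev, gap - expected_interval))
    return stalls


def sample_clock(samples, interval, clock=time.time, sleep=time.sleep):
    """Collect samples + 1 clock readings, sleeping interval seconds between each.

    Assumption: clock() reflects time that passed while the VM was paused.
    Whether it does depends on hypervisor and guest time sync configuration.
    """
    readings = [clock()]
    for _ in range(samples):
        sleep(interval)
        readings.append(clock())
    return readings


if __name__ == "__main__":
    # Hypothetical settings: sample every 0.5 s for about a minute.
    readings = sample_clock(samples=120, interval=0.5)
    for started, lost in find_stalls(readings, expected_interval=0.5, tolerance=0.25):
        print(f"stall of roughly {lost:.2f}s after reading at {started:.3f}")
```

Run it inside the guest while a backup job is in flight and compare what it reports against the timeouts your HA configuration is using. Treat the output as a rough indicator, not as a replacement for the hypervisor’s and the cluster’s own logs.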
Wait, What? You Said Good?
Yes, that’s right. Generally with these issues everyone is upset that high availability for SQL Server kicked in, but in all honesty you should be happy about that; it means the solution worked. You asked it to be highly available, and when it noticed that it hadn’t been processing anything for too long, it failed as it should! Normally this is where people scoff at me and ask me why downtime is a good thing, but the frame of reference is all wrong. There are really two distinct issues here, but most people only see one. The issue everyone sees and has meeting upon meeting about later is that SQL Server failed (either to another replica or node) and that’s not good (even though it is). The second is that they are taking VM-level backups which are keeping the VM paused for a _long_ time.
The fact that the backup takes a long time is a problem. Sure, the fileserver you’re backing up isn’t complaining about anything and it never fails, but SQL Server HA will gladly show you where the infrastructure isn’t all that great, where the long pole in the tent is, so to speak. In these cases it’s an improper (or very slow) configuration, which, to reiterate, should be using VSS. In fact, Veeam (as one of many vendors) calls this out in their documentation:
“When you back up or replicate a running VM, you need to quiesce or ‘freeze’ the VM to bring its file system and application data to a consistent state. To create consistent backups and replicas for VMs that do not support Microsoft VSS (for example, Linux VMs), enable VMware Tools quiescence in job settings. In this case, Veeam Backup & Replication uses the VMware Tools to freeze the file system and application data on a VM before backup or replication.”
Also:
“To create transactionally consistent backups or replicas of VMs that run Microsoft Active Directory, Microsoft SQL Server, Microsoft SharePoint, Microsoft Exchange or Oracle, you must enable application-aware processing in job settings. Application-aware processing is the Veeam technology based on Microsoft VSS. Microsoft VSS is responsible for quiescing applications on the VM and creating a consistent view of application data on the VM guest OS. Use of Microsoft VSS ensures that there are no unfinished database transactions or incomplete application files when Veeam Backup & Replication triggers the VM snapshot and starts copying VM data to the target.”
So don’t just take my random blog’s word for it. Also note that using VSS is not the default behavior so make sure to double check you’re configuring these types of settings properly in your backup software.
This means that when I’m telling you that your backup software (again, Veeam is just a popular option, but many exist out there) is actually causing you issues (in more than one way), you may want to listen. It has ramifications for whether you can actually get a good database restore back, and for whether your instances and AGs keep failing.
Pedantic Semantic
Your HA solution wouldn’t be very good if it didn’t.. uhh.. try to keep itself highly available like say when the server is paused. This means the knee-jerk and generally heated reaction of, “Set the timeout for SQL higher” and, “Why is SQL so terrible it’s failing over on a backup” along with, “Just max out the wait time” is horribly, horribly wrong. The root cause should probably be looked into (why are we pausing the VM in the first place and why is it taking so long). Like anything else there are tradeoffs… so you *may* end up setting the lease timeout higher so that a failure event doesn’t occur but that just means you’re going to wait longer before triggering a helpful action, which means more downtime. Additionally, having this set too low, especially in a non-performant environment, means you’re having too many false positives and the environment isn’t even stable.
When we talk about things like the lease timeout (currently the default is set for 15 seconds), a timeout doesn’t mean that the VM was paused for 15 seconds or longer, but it does mean that it was paused long enough to fail at least 2 rounds of that health check. So, for example, say the lease mechanism is going to renew in 500 milliseconds and the VM is paused at this moment and stays that way for 5 seconds. You’ve now missed the first renew check (500 milliseconds in) and the second one 5 seconds later, resulting in a timeout. Obviously that didn’t take the whole 15 seconds, but the point is it doesn’t have to. It just needs to be long enough to cause a problem, and I’d definitely say a VM paused for 5 seconds is a problem.
I hope this PSA was helpful, because while I love telling people that things such as this aren’t SQL Server’s fault, I’m also quite perturbed that I’m seeing this as often as I am, so let’s not have to update our resumes because not only do we not have good backups, but it’s also causing downtime.