Backup job (cron) never finished, now stuck

I’ve had Vertical backing up a couple of VMs for a few weeks now, and it’s been pretty reliable. About 4 days ago there was some hiccup, and I guess the backup never finished.

But there seems to be no timeout, and now the daily cron job is failing, saying “BACKUP_BUSY Another backup job is in progress”.

I found these instructions for forcefully killing the backup (seems a bit hazardous; will this corrupt anything?):

But I’m hoping there is a safer, saner way to keep this from happening. Paid user here, thank you.

I think it is most likely that the previous backup was aborted but left the plock file behind. You can find this plock file in the hidden .verticalbackup folder and simply remove it.

Even if you run pkill vertical to kill an ongoing backup it won’t corrupt anything.

It wasn’t the plock file in this case. There were 7 instances of vertical running; I had to kill them manually, and then the backup was able to run again. Hopefully it was a random fluke. A timeout feature would be very nice to have, e.g. fail the backup if it has been running for more than 24 hours.
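Until a built-in timeout exists, a wrapper around the cron command can approximate one. The following is only a sketch under assumptions: `run_with_timeout` is a made-up helper name, the marker file path is arbitrary, and nothing here comes from Vertical Backup itself. It starts the command in the background, and a watchdog sends it SIGTERM if it is still alive after the given number of seconds.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]   (hypothetical helper, not part of Vertical Backup)
# Returns COMMAND's exit status, or 124 if the watchdog had to terminate it.
run_with_timeout() {
    limit=$1
    shift
    marker="${TMPDIR:-/tmp}/run_with_timeout.$$"   # arbitrary marker path (assumption)
    rm -f "$marker"

    "$@" &                      # start the real command in the background
    cmd_pid=$!

    # Watchdog: after the limit, record that a timeout happened, then send SIGTERM.
    # A process that ignores SIGTERM would need a follow-up "kill -KILL"; not done here.
    (
        sleep "$limit"
        : > "$marker"
        kill -TERM "$cmd_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    watchdog_pid=$!

    wait "$cmd_pid"
    status=$?

    # The command is gone; stop the watchdog if it is still waiting.
    kill "$watchdog_pid" 2>/dev/null

    if [ -e "$marker" ]; then
        rm -f "$marker"
        return 124
    fi
    return "$status"
}
```

In a cron script this might be called as `run_with_timeout 86400 sh /path/to/your-backup-script.sh` (the script path is a placeholder), so a run that exceeds 24 hours is terminated and the next day's job is not blocked by it.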

@gchen I’ve bought 2 more licenses and set up new hosts. Still having this problem quite often. It’s a real PITA. Any more thoughts about a --timeout option?

Did it happen to all three hosts? When it gets stuck again, run pgrep vertical to find the pid of the first Vertical Backup process, and then strace -fp pid to trace the system calls. The output of strace should indicate in which call it gets stuck.

It’s happened to 2 of the hosts so far, both using the B2 backend. I killed it already to get it moving again, but if it happens again I’ll run the strace. Thank you.

@gchen I had a backup get stuck again (ran >13 hrs) and I ran the strace commands. The output is large; is there a way I can send it to you via private message or email?

This 1.3.6 version should fix the bug: https://acrosync.com/esxi/vertical

You’ll need to download the binary on a desktop computer and scp it to the ESXi host, since wget there doesn’t accept the https protocol.

Thank you. I have v1.3.6 installed now on 3 hosts, fingers crossed. Will keep posted here on status.

@gchen Got my first successful cron backup from the troublesome host last night! So far so good. Thank you again :slight_smile:

@gchen I just PM’d you some more traces, had 2 hosts go zombie again even with fixed ver 1.3.6 – hope you can take a look, thank you

Can you try this 1.3.7 version: https://acrosync.com/esxi/vertical

I found a place that I missed last time: if there is an upload error near the end of a backup, then the main thread would enter an uninterruptible loop. So hopefully this version should work better than the previous one.

Will definitely try it and let you know! :slightly_smiling_face:

Had some new errors with 1.3.7. I uploaded the console output, @gchen; see output-v137.txt.

That looks like an out-of-memory error that killed the main thread while other threads kept running.

I’ll fix the error handling with an update, but you’ll still get the memory error. I think you can get around it by backing up one VM at a time. This may create too many emails, so you may want to put the backup commands in a script, one VM per line, and then send out a single email at the end of the script using a mail utility.
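A minimal sketch of such a one-VM-per-invocation script, under stated assumptions: `backup_each` and the list file are hypothetical names, and the real backup command is passed in as arguments because the exact command line depends on your setup. The only thing assumed about that command is that it takes the VM name as its last argument.

```shell
#!/bin/sh
# backup_each LIST_FILE COMMAND [ARGS...]   (hypothetical helper)
# Runs "COMMAND ARGS... <vm>" once for every VM name listed in LIST_FILE,
# one name per line. Blank lines and lines starting with '#' are skipped.
# Prints one summary line per VM; returns non-zero if any invocation failed.
backup_each() {
    list=$1
    shift
    failures=0
    while IFS= read -r vm || [ -n "$vm" ]; do
        case $vm in
            ''|'#'*) continue ;;
        esac
        # stdin is redirected so the command cannot consume the rest of the list
        if "$@" "$vm" </dev/null; then
            echo "OK      $vm"
        else
            echo "FAILED  $vm"
            failures=$((failures + 1))
        fi
    done < "$list"
    [ "$failures" -eq 0 ]
}
```

For example, `backup_each /path/to/vms.txt /path/to/vertical backup > /tmp/report.txt` (paths are placeholders) would start a fresh process for each VM and leave a report that can be mailed once at the end. Any output of the backup command itself ends up in the same file.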

I didn’t know VB backed up multiple VMs in parallel. I thought it processed them sequentially, one at a time (it appears that way from the log file at least).

How would I send mail at the end of the script? The mail utilities I am aware of (e.g. sendmail) don’t seem to exist on the ESXi host. I see VCSA has such things, but I am not running VCSA.

I found this page on how to send mail from ESXi using netcat - is that what you’re suggesting?

You can try to see if the netcat method works for you.
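For reference, the netcat approach boils down to piping a plain SMTP conversation into `nc`. Here is a rough sketch, not a tested recipe for ESXi. The function name `smtp_dialog`, the addresses, and the server name in the usage note are all placeholders. It also assumes a relay that accepts unauthenticated plain-text mail and tolerates the client sending every command without waiting for replies, which many servers will not.

```shell
#!/bin/sh
# smtp_dialog FROM TO SUBJECT BODY_FILE   (hypothetical helper)
# Prints a minimal SMTP conversation on stdout, with CRLF line endings,
# ready to be piped into a TCP connection to a mail relay.
smtp_dialog() {
    from=$1
    to=$2
    subject=$3
    body_file=$4

    printf 'HELO localhost\r\n'
    printf 'MAIL FROM:<%s>\r\n' "$from"
    printf 'RCPT TO:<%s>\r\n' "$to"
    printf 'DATA\r\n'
    printf 'From: %s\r\n' "$from"
    printf 'To: %s\r\n' "$to"
    printf 'Subject: %s\r\n' "$subject"
    printf '\r\n'
    while IFS= read -r line || [ -n "$line" ]; do
        # SMTP dot-stuffing: a body line starting with '.' gets an extra '.'
        case $line in
            .*) line=".$line" ;;
        esac
        printf '%s\r\n' "$line"
    done < "$body_file"
    printf '.\r\n'          # a lone dot ends the message
    printf 'QUIT\r\n'
}
```

Usage would look like `smtp_dialog backup@example.com admin@example.com "Backup report" /tmp/report.txt | nc mail.example.com 25`. The ESXi firewall may also need to allow the outbound connection.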

If not, I’ll add an option to the email command that will allow you to send out an email.

Sorry about the confusion – Vertical Backup backs up multiple VMs in sequence, not in parallel, when you specify them in one line. When you back up only one VM in one invocation, you always start afresh with each VM, thus hopefully minimizing the overall memory usage.

Do you think 10 threads is too many? That’s what I’m using currently. The host has 32 or 64GB of RAM.

You don’t need that many threads, since only a small percentage of chunks are actually uploaded. So 4 threads should be enough.

Sadly, no matter how much memory the host has, ESXi limits each process to less than 1GB.