Storage inconsistency

Hi,

I recently had to restore from backup and in connection with that noticed one of my VMs had a missing chunk in the latest backup. After restoring, I’ve been trying to make sure I have a good recent backup of everything, but now, I’ve run into a weird issue.

Basically, the most recent snapshot of a VM is missing a chunk. I try to remedy that by taking a fresh, complete backup and then deleting the snapshot with the missing chunk (to ensure it is not taken into account for future prunes, so I don’t end up with only a broken snapshot). Thing is, that doesn’t seem to work.

Here’s what I have done: take a new backup, which becomes snapshot 51. Run a check and see a chunk missing in snapshot 50. Manually prune snapshot 50 and run a new check. Now the same chunk is reported missing in snapshot 51.

It seems to me that if snapshot 51 needs the chunk, making the backup would ensure its existence. I would also expect that if the chunk is missing for snapshot 50 it would be missing for snapshot 51 as well during the check.

Navigating to the chunk store (on B2) reveals the chunk is indeed absent after the prune; here’s the chunks/89 folder on my storage:

It looks the same after making a new snapshot, so it seems backup doesn’t realize the chunk is missing.

Here’s what I do, starting from the last bits of backup of snapshot 51:

2020-02-29 15:02:55.369760 INFO BACKUP_DONE Backup Mail@stronger at revision 51 has been successfully completed
2020-02-29 15:02:55.370106 INFO BACKUP_STATS Total 10492 chunks, 262119.78M bytes; 96 new, 2300.77M bytes, 942.07M uploaded
2020-02-29 15:02:55.370231 INFO BACKUP_TIME Total backup time: 00:29:47
2020-02-29 15:02:55.373132 INFO SNAPSHOT_REMOVE Removing all snapshots of Mail
[root@stronger:/vmfs/volumes/5d4298be-b92d7a2e-87d0-78acc0faa2a0/verticalbackup] ./vertical --log check -a | tee -a backup.log
2020-02-29 15:03:37.289489 INFO PROGRAM_VERSION Vertical Backup 1.3.8
2020-02-29 15:03:39.073115 INFO STORAGE_CREATE Storage set to b2://stronger
2020-02-29 15:04:30.195346 INFO CHECK_LISTING Listing all chunks
2020-02-29 15:05:27.161666 ERROR CHECK_MISSING Backup Mail@stronger revision 50: chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 missing
[root@stronger:/vmfs/volumes/5d4298be-b92d7a2e-87d0-78acc0faa2a0/verticalbackup] ./vertical prune -r 50  Mail
Vertical Backup 1.3.8
Storage set to b2://stronger
Collection 1 is not eligible for deletion
Collection 2 has been saved
Backup Mail@stronger revision 50 has been deleted
[root@stronger:/vmfs/volumes/5d4298be-b92d7a2e-87d0-78acc0faa2a0/verticalbackup] ./vertical --log check -a | tee -a backup.log
2020-02-29 15:12:32.044653 INFO PROGRAM_VERSION Vertical Backup 1.3.8
2020-02-29 15:12:33.808153 INFO STORAGE_CREATE Storage set to b2://stronger
2020-02-29 15:13:25.663491 INFO CHECK_LISTING Listing all chunks
2020-02-29 15:14:23.935862 ERROR CHECK_MISSING Backup Mail@stronger revision 51: chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 missing
[root@stronger:/vmfs/volumes/5d4298be-b92d7a2e-87d0-78acc0faa2a0/verticalbackup] 

Only this machine is using the storage; it runs backups weekly and does a check and prune afterwards (with options for exclusive access). For a long time, the check failed due to spaces in VM names, so I don’t know exactly when the issue started. I did manage to restore the penultimate backup, though, so perhaps the problem started between those two backups?

I also tried removing all snapshots of the affected VM that had errors, using --exclusive to delete the latest one. After doing this, the same chunk is reported missing in snapshot 44, which is older than the snapshot I successfully restored last week. Checking again, the same chunk is now missing from another VM. I have a feeling it will be missing from most or all of my backups (a lot of them are based on customizations of the same base image).

I would, of course, like to fix the concrete problem, preferably without redoing all backups, but I am really more interested in eliminating whatever situation led to this data inconsistency.

Regards,
Michael

You can create a new VM from the same base image (or just use a copy of the vmdk file) and run a backup of the new VM to see if that chunk can be recreated.

Vertical Backup doesn’t keep prune logs the way Duplicacy does, so it may be better to use Duplicacy to run the prune jobs, especially the web GUI. No license is needed for running check, prune, etc.
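
For example, something along these lines should work from any machine with the Duplicacy CLI installed (the repository id "dummy" and the retention values are just placeholders for illustration):

# One-time setup: create an empty directory and point it at the same storage
# (init prompts for the B2 key id/application key; add -e if the storage is encrypted).
mkdir prune-runner && cd prune-runner
duplicacy init dummy b2://stronger

# Run prune and check from here; Duplicacy keeps its prune logs under .duplicacy/logs.
duplicacy prune -all -keep 30:180 -keep 7:30
duplicacy check -all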

Thanks for the suggestion. Unfortunately, it did not work. I cloned one VM and did a backup of the new clone followed by a check. The piece is also missing from the new clone:

[root@stronger:/vmfs/volumes/5d4298be-b92d7a2e-87d0-78acc0faa2a0/verticalbackup] ./vertical --log check Dummy | tee -a backup.log
2020-03-08 10:20:02.420127 INFO PROGRAM_VERSION Vertical Backup 1.3.8
2020-03-08 10:20:04.243237 INFO STORAGE_CREATE Storage set to b2://stronger
2020-03-08 10:20:07.675097 INFO CHECK_LISTING Listing all chunks
2020-03-08 10:20:55.043526 ERROR CHECK_MISSING Backup Dummy@stronger revision 1: chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 missing

I ended up deleting all snapshots referencing the missing piece, so now at least check works, and I’ll try making fresh backups.

I think it would be good if check had an option to delete snapshots with missing chunks. Vertical/Duplicacy refuse to do partial restores anyway, so such snapshots are useless, and I just spent 2 hours manually identifying and deleting broken snapshots.
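
For reference, here is roughly what I ended up doing by hand, as a shell sketch (the awk field positions assume the --log check output format shown above, so double-check them before uncommenting the prune line):

# Find every (VM, revision) pair that check reports with missing chunks and prune it.
# Assumed log format: "... CHECK_MISSING Backup <VM>@stronger revision <N>: chunk <hash> missing"
# (note: VM names containing spaces would shift the fields and need extra handling)
./vertical --log check -a | tee check.log
grep CHECK_MISSING check.log \
  | awk '{ gsub(/:$/, "", $8); split($6, a, "@"); print a[1], $8 }' \
  | sort -u \
  | while read vm rev; do
      echo "Broken snapshot: $vm revision $rev"
      # ./vertical prune -r "$rev" "$vm"   # uncomment to actually delete
    done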

It might be even better to have backup (maybe behind an option) validate that chunks actually exist in the storage instead of trusting previous snapshots. As it is, it is possible to make backups regularly yet end up without a valid snapshot for a long period, because they would all be missing the same chunk.
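
In the meantime, a workaround along these lines (just a sketch, reusing the commands from above) would at least flag the problem right after each backup:

# Back up one VM, then immediately check it and warn if the fresh check reports
# missing chunks. check.tmp is a scratch file so that older errors already in
# backup.log don't trigger false warnings.
./vertical --log backup Web | tee -a backup.log
./vertical --log check Web | tee check.tmp
cat check.tmp >> backup.log
if grep -q CHECK_MISSING check.tmp; then
  echo "WARNING: latest Web snapshot references missing chunks"
fi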

Even that didn’t resolve the issue. After manually cleaning up the snapshots, check was fine, but I then did a full backup followed by a check, and I’m back to getting errors about that same chunk:

2020-03-09 02:20:58.403277 INFO PROGRAM_VERSION Vertical Backup 1.3.8
2020-03-09 02:21:00.190118 INFO STORAGE_CREATE Storage set to b2://stronger
2020-03-09 02:24:44.027695 INFO CHECK_LISTING Listing all chunks
2020-03-09 02:26:44.346491 ERROR CHECK_MISSING Backup Monitoring@stronger revision 47: chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 missing

Can you remove the snapshot file Dummy@stronger/1 from the storage and back up Dummy again but with the -vv option:

./vertical -vv --log backup Dummy | tee -a backup.log

Then check if the chunk 89eb… is in the log.

If this chunk is still missing, remove Dummy@stronger/1 and run another initial backup with the --no-chunk-listing option:

./vertical -vv --log backup Dummy --no-chunk-listing | tee -a backup.log

Oh, good point about the --no-chunk-listing option! I tried that, and this just keeps getting weirder. The below is done with the new 1.4.0 release.

First, I ran a backup with -vv. It yielded no mention of the 89eb chunk in the log, but after backing up, check failed for the backed up machine:

Chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 referenced by snapshot Dummy@stronger at revision 1 does not exist
Some chunks referenced by snapshot Dummy@stronger at revision 1 are missing

So, that’s as expected. I removed the snapshot, and tried again with no-chunk-listing. After this, check succeeded! But it still failed for other VMs referring to the same chunk:

All chunks referenced by snapshot Dummy@stronger at revision 1 exist
Chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 referenced by snapshot Web@stronger at revision 47 does not exist
Some chunks referenced by snapshot Web@stronger at revision 47 are missing

The same check succeeded for the Dummy machine. I guess the chunk wasn’t needed for the machine this time around? That’s very weird, since the machine is not powered on, and the backups were done within an hour of each other.

I therefore tried a backup of the Web VM using no-chunk-listing, hoping it would at least produce a recent backup without issues, or at best upload the missing chunk. Alas, after this backup, the same chunk is now missing in the two latest snapshots:

Chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 referenced by snapshot Web@stronger at revision 47 does not exist
Some chunks referenced by snapshot Web@stronger at revision 47 are missing
Chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 referenced by snapshot Web@stronger at revision 48 does not exist
Some chunks referenced by snapshot Web@stronger at revision 48 are missing

There is still no mention of the chunk in the log:

[root@stronger:/vmfs/volumes/5d4298be-b92d7a2e-87d0-78acc0faa2a0/verticalbackup] grep 89 backup.log | grep eb84
2020-03-10 15:43:30.308189 ERROR CHECK_MISSING Backup Dummy@stronger revision 1: chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 missing
2020-03-10 16:45:36.858799 ERROR CHECK_MISSING Backup Web@stronger revision 47: chunk 89eb8415019cda9a27d5d95eeb06efcb7297ec4e404698223866f39a8e1f89c6 missing

(I do the grep in two pieces so it also matches the chunk URLs, where the hash is split by a slash); both mentions are from checks.
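
(An equivalent single pattern that matches both the bare hash and the URL form would be:)

# Matches both "89eb8415..." and the URL form "chunks/89/eb8415..."
grep -E '89/?eb8415019cda9a' backup.log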

I figured the chunking might be slightly non-deterministic, but that this chunk was very likely to be produced again (since I’ve had it fail on that chunk fairly consistently), so I removed the snapshot and tried another backup of the Dummy VM. This time something interesting happened: the check now reports errors for 3 entirely new chunks:

Chunk 24d8723e78a96e20fd6d50b283ebce79a10b487ad92e089bb249b587f4e7440b referenced by snapshot Dummy@stronger at revision 1 does not exist
Chunk 2befc09bd32f28915e5376441781661c59821eb95407ebf53889dd28afc6c3c7 referenced by snapshot Dummy@stronger at revision 1 does not exist
Chunk 961086c1058d165a1b21ab55e97c528858bb0af6492a5e9e603ec1fa741726d4 referenced by snapshot Dummy@stronger at revision 1 does not exist
Some chunks referenced by snapshot Dummy@stronger at revision 1 are missing

I manually checked the 2bef chunk, and indeed cannot see it in B2. I checked the log, and the missing chunks are mentioned there; the uploads look perfectly normal:

[root@stronger:/vmfs/volumes/5d4298be-b92d7a2e-87d0-78acc0faa2a0/verticalbackup] grep d8723e78a96e20f backup.log
2020-03-10 16:07:16.438307 DEBUG B2_CALL URL request https://f001.backblazeb2.com/file/stronger/chunks/24/d8723e78a96e20fd6d50b283ebce79a10b487ad92e089bb249b587f4e7440b returned status code 404
2020-03-10 16:07:17.736062 TRACE CHUNK_UPLOAD Uploaded chunk 24d8723e78a96e20fd6d50b283ebce79a10b487ad92e089bb249b587f4e7440b size 104881
2020-03-10 16:07:17.764992 DEBUG CHUNK_CACHE Saved chunk 24d8723e78a96e20fd6d50b283ebce79a10b487ad92e089bb249b587f4e7440b to the local cache
[root@stronger:/vmfs/volumes/5d4298be-b92d7a2e-87d0-78acc0faa2a0/verticalbackup] grep efc09bd32f28915e53 backup.log
2020-03-10 16:07:12.806573 DEBUG B2_CALL URL request https://f001.backblazeb2.com/file/stronger/chunks/2b/efc09bd32f28915e5376441781661c59821eb95407ebf53889dd28afc6c3c7 returned status code 404
2020-03-10 16:07:13.210015 TRACE CHUNK_UPLOAD Uploaded chunk 2befc09bd32f28915e5376441781661c59821eb95407ebf53889dd28afc6c3c7 size 1472
2020-03-10 16:07:13.335349 DEBUG CHUNK_CACHE Saved chunk 2befc09bd32f28915e5376441781661c59821eb95407ebf53889dd28afc6c3c7 to the local cache
[root@stronger:/vmfs/volumes/5d4298be-b92d7a2e-87d0-78acc0faa2a0/verticalbackup] grep 1086c1058d165a1 backup.log
2020-03-10 16:07:14.066779 DEBUG B2_CALL URL request https://f001.backblazeb2.com/file/stronger/chunks/96/1086c1058d165a1b21ab55e97c528858bb0af6492a5e9e603ec1fa741726d4 returned status code 404
2020-03-10 16:07:15.765799 TRACE CHUNK_UPLOAD Uploaded chunk 961086c1058d165a1b21ab55e97c528858bb0af6492a5e9e603ec1fa741726d4 size 702831
2020-03-10 16:07:15.820098 DEBUG CHUNK_CACHE Saved chunk 961086c1058d165a1b21ab55e97c528858bb0af6492a5e9e603ec1fa741726d4 to the local cache

I looked at the log, and these three chunks are the last ones uploaded before the backup finishes:

2020-03-10 16:07:08.137757 TRACE CHUNK_DEDUP Skipped duplicate chunk e41cafdd950226ff92eb1745b9f16efacabf00cafc445d6728325d8766bf687f in the cache (10478/10486)
2020-03-10 16:07:08.243092 TRACE CHUNK_DEDUP Skipped duplicate chunk e41cafdd950226ff92eb1745b9f16efacabf00cafc445d6728325d8766bf687f in the cache (10483/10486)
2020-03-10 16:07:08.454590 TRACE CHUNK_DEDUP Skipped duplicate chunk e41cafdd950226ff92eb1745b9f16efacabf00cafc445d6728325d8766bf687f in the cache (10484/10486)  
2020-03-10 16:07:09.437713 TRACE CHUNK_DEDUP Skipped duplicate chunk 42096f4749783683aa92e791f920d16325de4829ea197c93467bb446f1aafc21 in the storage (10485/10486)
2020-03-10 16:07:10.352848 INFO BACKUP_UPLOAD Uploaded file CentOS 7 Template-flat.vmdk 252.43MB/s 00:17:18                                                                                      
2020-03-10 16:07:10.603772 DEBUG B2_CALL URL request https://f001.backblazeb2.com/file/stronger/chunks/59/0ba3f01989ddf112ec6a1fefba7a795934146fa53ce8aff2dec963b8e15a63 returned status code 404
2020-03-10 16:07:12.142725 TRACE CHUNK_UPLOAD Uploaded chunk 590ba3f01989ddf112ec6a1fefba7a795934146fa53ce8aff2dec963b8e15a63 size 3819                                                          
2020-03-10 16:07:12.143151 INFO BACKUP_UPLOAD Uploaded file CentOS 7 Template.vmx                                                                                                                
2020-03-10 16:07:12.393295 TRACE CHUNK_DEDUP Skipped duplicate chunk 93f1159ad032dd8b059fa22d9494594ce77da21ec179e41d2a955307ad6b6521 in the storage                                             
2020-03-10 16:07:12.393617 INFO BACKUP_UPLOAD Uploaded file CentOS 7 Template.vmxf                                                                                                               
2020-03-10 16:07:12.642063 TRACE CHUNK_DEDUP Skipped duplicate chunk 5d059c1a461b8b6058366deb1118e69065a5660042603ce06efdf4a8b449c9a5 in the storage                                             
2020-03-10 16:07:12.642537 INFO BACKUP_UPLOAD Uploaded file CentOS 7 Template.nvram                                                                                                              
2020-03-10 16:07:12.806573 DEBUG B2_CALL URL request https://f001.backblazeb2.com/file/stronger/chunks/2b/efc09bd32f28915e5376441781661c59821eb95407ebf53889dd28afc6c3c7 returned status code 404
2020-03-10 16:07:13.210015 TRACE CHUNK_UPLOAD Uploaded chunk 2befc09bd32f28915e5376441781661c59821eb95407ebf53889dd28afc6c3c7 size 1472                                                          
2020-03-10 16:07:13.335349 DEBUG CHUNK_CACHE Saved chunk 2befc09bd32f28915e5376441781661c59821eb95407ebf53889dd28afc6c3c7 to the local cache                                                     
2020-03-10 16:07:14.066779 DEBUG B2_CALL URL request https://f001.backblazeb2.com/file/stronger/chunks/96/1086c1058d165a1b21ab55e97c528858bb0af6492a5e9e603ec1fa741726d4 returned status code 404
2020-03-10 16:07:15.765799 TRACE CHUNK_UPLOAD Uploaded chunk 961086c1058d165a1b21ab55e97c528858bb0af6492a5e9e603ec1fa741726d4 size 702831                                                        
2020-03-10 16:07:15.820098 DEBUG CHUNK_CACHE Saved chunk 961086c1058d165a1b21ab55e97c528858bb0af6492a5e9e603ec1fa741726d4 to the local cache                                                     
2020-03-10 16:07:16.438307 DEBUG B2_CALL URL request https://f001.backblazeb2.com/file/stronger/chunks/24/d8723e78a96e20fd6d50b283ebce79a10b487ad92e089bb249b587f4e7440b returned status code 404
2020-03-10 16:07:17.736062 TRACE CHUNK_UPLOAD Uploaded chunk 24d8723e78a96e20fd6d50b283ebce79a10b487ad92e089bb249b587f4e7440b size 104881                                                        
2020-03-10 16:07:17.764992 DEBUG CHUNK_CACHE Saved chunk 24d8723e78a96e20fd6d50b283ebce79a10b487ad92e089bb249b587f4e7440b to the local cache                                                     
2020-03-10 16:07:19.542225 INFO BACKUP_DONE Backup Dummy@stronger at revision 1 has been successfully completed                                                                                  
2020-03-10 16:07:19.542521 INFO BACKUP_STATS Total 10493 chunks, 262144.78M bytes; 82 new, 1925.78M bytes, 924.75M uploaded                                                                      
2020-03-10 16:07:19.542681 INFO BACKUP_TIME Total backup time: 00:17:29                                                                                                                          
2020-03-10 16:07:19.554111 TRACE CACHE_CLEAN Cleaning local cache                                                                                                                                
2020-03-10 16:07:19.633480 DEBUG CACHE_LIST Listing chunks referenced in backup Naming@stronger at revision 51 in the local cache                                                                
2020-03-10 16:07:19.676861 DEBUG CACHE_LIST Listing chunks referenced in backup DB@stronger at revision 65 in the local cache                                                                    
2020-03-10 16:07:19.721371 DEBUG CACHE_LIST Listing chunks referenced in backup Web@stronger at revision 48 in the local cache                                                                   

I notice that the log lines for the three failed chunks do not include a chunk number after the size, unlike other chunks earlier:

2020-03-10 16:06:50.639517 TRACE CHUNK_DEDUP Skipped duplicate chunk e41cafdd950226ff92eb1745b9f16efacabf00cafc445d6728325d8766bf687f in the cache (10184/10486)                                 
2020-03-10 16:06:50.739613 TRACE CHUNK_DEDUP Skipped duplicate chunk e41cafdd950226ff92eb1745b9f16efacabf00cafc445d6728325d8766bf687f in the cache (10185/10486)                                 
2020-03-10 16:06:50.755938 TRACE CHUNK_UPLOAD Uploaded chunk fe7c6f85ab0bc54af135230c073f6c21807e173c12d4d7276db5cd1e14f3972c size 26214400 (10005/10486)                                        
2020-03-10 16:06:50.767888 INFO UPLOAD_PROGRESS Uploading 249.92MB/s 00:00:29 97.1%                                                                                                              
2020-03-10 16:06:50.937367 TRACE CHUNK_DEDUP Skipped duplicate chunk e41cafdd950226ff92eb1745b9f16efacabf00cafc445d6728325d8766bf687f in the cache (10186/10486)                                 
2020-03-10 16:06:51.053563 TRACE CHUNK_DEDUP Skipped duplicate chunk e41cafdd950226ff92eb1745b9f16efacabf00cafc445d6728325d8766bf687f in the cache (10187/10486)                                 

I also notice that the sizes are different and the chunks are declared done immediately (while there’s around 6 seconds between the chunks surrounding 10005 and 10005 itself):

2020-03-10 16:06:31.891889 TRACE CHUNK_DEDUP Skipped duplicate chunk b49e888638cab1e9593f12129302fed612f368b25faa9be36b1c71cf9513bbdd in the storage (10003/10486)                               
2020-03-10 16:06:32.203188 TRACE CHUNK_DEDUP Skipped duplicate chunk 1ac667bf0253db09d14ae062f527a0f06db5d54ce63e48c5c0cd3c0272ff6a9a in the storage (10004/10486)                               
2020-03-10 16:06:32.659249 DEBUG B2_CALL URL request https://f001.backblazeb2.com/file/stronger/chunks/fe/7c6f85ab0bc54af135230c073f6c21807e173c12d4d7276db5cd1e14f3972c returned status code 404
2020-03-10 16:06:34.577721 TRACE CHUNK_UPLOAD Uploaded chunk 4007b4648395b5444c2e0ae48e33d8d2c2847807e09bf85c703bd890e6559590 size 26214400 (9982/10486)                                         
2020-03-10 16:06:34.588461 INFO UPLOAD_PROGRESS Uploading 249.39MB/s 00:00:48 95.4%                                                                                                              
2020-03-10 16:06:35.102507 TRACE CHUNK_DEDUP Skipped duplicate chunk fb3f6d3f3fd85df867ec3c87c0ee45f59c6a3b78dcb1d71cc122ebadd1893a1a in the storage (10006/10486)                               
2020-03-10 16:06:35.464666 TRACE CHUNK_DEDUP Skipped duplicate chunk e41cafdd950226ff92eb1745b9f16efacabf00cafc445d6728325d8766bf687f in the cache (10007/10486)                                 
2020-03-10 16:06:35.940118 TRACE CHUNK_DEDUP Skipped duplicate chunk 92ee23e40a5b6ddbb0f267731554db5b153fb6e5a227929789851822fe8e7d05 in the storage (10008/10486)                               

This last bit might give some hint as to why the 89eb chunk went missing in the first place. I do backups with 4 threads; maybe the 3 threads still busy uploading get terminated too soon, so B2 decides to discard the blocks, yet they are still recorded in the local cache?

I’ll try removing the local cache and doing a backup of Web with a single thread and no-chunk-listing to see if that resolves anything.

Removing the local cache and running a single-threaded backup with no-chunk-listing didn’t upload the missing chunk, but it did result in a snapshot that doesn’t depend on it. I’m now making single-threaded backups of all VMs with the cleaned local cache to see if that at least leads to working snapshots.

This log message:

2020-03-10 16:07:17.736062 TRACE CHUNK_UPLOAD Uploaded chunk 24d8723e78a96e20fd6d50b283ebce79a10b487ad92e089bb249b587f4e7440b size 104881

means that the preceding API call to b2 completed and b2 returned a 200 status code, so this chunk should already have been uploaded to b2. There may have been a delay that caused the chunk not to be visible in the b2 storage for a while. Can you check the chunks directory now to see if it is there (it should be under the subdirectory 24, with a filename starting with d872...)?
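
For example, with the standalone b2 command-line tool (assuming you have it installed and authorized for this account), something like this would show whether the file exists:

# The chunk should appear as chunks/24/d8723e78a96e20fd6d50b283ebce79a10b487ad92e089bb249b587f4e7440b
b2 ls stronger chunks/24 | grep d8723e78a96e20f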

The --no-chunk-listing option only instructs an initial backup to look up every chunk to be uploaded individually, instead of listing all chunks in the storage first (which is usually more efficient). For a subsequent backup (revision > 1) this option has no effect.

89eb... may be a snapshot chunk rather than a file chunk. To recreate it, you need to run an initial backup for that particular VM. Since b2 doesn’t support renaming directory, you may need to create a new bucket, copy over the config file, and then back up the VM to this new bucket. Or you can backup all snapshot files under snapshots/vmname and start from revision 1.