QRecall

Cody Frisch

Seeing as you clearly have a really good idea of which files are duplicated, and know how to manage metadata, would it be possible to create a utility/component for QRecall that deduplicates data and uses cloning in APFS?

Also, track the clone status of the files and back up the cloning metadata, so that if we restore two files that are cloned, only one gets restored and the other is cloned from it, with metadata set correctly to match the original.

James Bucanek

That's a great idea ... which is why it's already in the works.

However, we can't "track the clone status of files", which is why I haven't done this already.

Basically what we want to do is implement the same logic for cloned files that we do for hard links. The ideal solution is to detect when a file is a cloned copy of another file being captured. And when those two files are restored, restore the first file and then clone it to restore the second file.
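To make the hard-link analogy concrete, here is a minimal Python sketch of the restore side only. It assumes the capture had somehow recorded a clone group identifier for each file, which is exactly the information the filesystem does not hand over. The `FileRecord` type, the `clone_group` field, and `plan_restore` are hypothetical names for illustration, not anything in QRecall; on macOS the actual "clone" step would presumably be done with clonefile(2).

```python
from collections import namedtuple

# Hypothetical capture record: clone_group is None for ordinary files,
# or an identifier shared by all files known to be clones of each other.
FileRecord = namedtuple("FileRecord", ["path", "clone_group"])

def plan_restore(records):
    """Return restore steps: ("restore", path) or ("clone", source, path)."""
    first_in_group = {}  # clone_group -> path of the first file restored
    plan = []
    for rec in records:
        source = None
        if rec.clone_group is not None:
            source = first_in_group.get(rec.clone_group)
        if source is None:
            # First (or only) member of its group: restore the data normally.
            plan.append(("restore", rec.path))
            if rec.clone_group is not None:
                first_in_group[rec.clone_group] = rec.path
        else:
            # Later member: clone the already-restored file instead.
            plan.append(("clone", source, rec.path))
    return plan
```

This only covers full clones; a partial clone would still need its independent blocks restored on top of the cloned copy.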

The problem is that (unlike hard links) the filesystem provides no information whatsoever about which files are clones of other files. And even if it did, a cloned file can later be modified, whereupon it becomes a "partial" clone of that other file, with some blocks sharing storage with the original file and other blocks being independent of it. So the two files could still be different and must be treated as unique files.

What we need to do is to build a "cloned file recognition" engine that can detect when a second file is a full or partial clone of another file. But to do that thoroughly requires a massive amount of memory and processing time, since we'd literally have to compare the data allocation of every file with every other file.
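To show why the thorough approach gets expensive, here is a rough Python sketch of an exhaustive comparison. It assumes each file's data allocation is already available as a list of `(physical_offset, length)` extents; how those are obtained is left out, and `shared_bytes` and `find_clones` are hypothetical names, not QRecall code.

```python
from itertools import combinations

def shared_bytes(extents_a, extents_b):
    """Bytes of physical storage common to two extent lists.

    Each extent is a (physical_offset, length) pair; extents within a
    single file are assumed not to overlap one another.
    """
    total = 0
    for start_a, len_a in extents_a:
        for start_b, len_b in extents_b:
            overlap = min(start_a + len_a, start_b + len_b) - max(start_a, start_b)
            if overlap > 0:
                total += overlap
    return total

def find_clones(extent_map):
    """Compare every file with every other file (quadratic in file count).

    extent_map maps path -> extent list. Returns {(path_a, path_b): bytes}
    for each pair sharing storage; a full clone shares all of its bytes,
    a partial clone only some of them.
    """
    results = {}
    for (path_a, ext_a), (path_b, ext_b) in combinations(sorted(extent_map.items()), 2):
        common = shared_bytes(ext_a, ext_b)
        if common:
            results[(path_a, path_b)] = common
    return results
```

With n files this examines n(n-1)/2 pairs and holds every extent list in memory, so a volume with a million files means roughly half a trillion pair comparisons.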

So the idea on the workbench right now is to (a) limit cloned file recognition to relatively large (multi-megabyte) files and (b) make it probabilistic, so it matches cloned files imperfectly but is likely to recognize very large cloned files. This would, hopefully, be a reasonable trade-off between capture speed and catching a few large cloned files. But I haven't proven it will work yet.
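One possible reading of (a) and (b), purely as an illustration and not the design actually being worked on: skip small files, then fingerprint each large file by sampling a few of its extent start addresses and flag files whose samples collide. Everything here is an assumption, including the 8 MiB threshold, the sample count, the extent-list input, and the names `sample_keys` and `candidate_clones`.

```python
from collections import defaultdict

MIN_SIZE = 8 * 1024 * 1024  # hypothetical threshold: ignore files under 8 MiB
SAMPLES = 4                 # hypothetical number of extents sampled per file

def sample_keys(extents, samples=SAMPLES):
    """Pick a few evenly spaced extent start addresses as a cheap fingerprint."""
    if not extents:
        return []
    step = max(1, len(extents) // samples)
    return [extents[i][0] for i in range(0, len(extents), step)][:samples]

def candidate_clones(files):
    """files maps path -> (size, extents), extents being (offset, length) pairs.

    Returns a sorted list of path tuples that share a sampled address.
    Matches are candidates only: clones whose shared extents were not
    sampled are missed, which is the accepted imperfection.
    """
    seen = defaultdict(set)  # sampled physical address -> paths that hit it
    for path, (size, extents) in files.items():
        if size < MIN_SIZE:
            continue  # (a) only consider relatively large files
        for key in sample_keys(extents):
            seen[key].add(path)  # (b) a handful of samples, not a full comparison
    groups = {frozenset(paths) for paths in seen.values() if len(paths) > 1}
    return sorted(tuple(sorted(group)) for group in groups)
```

Memory and time now scale with the number of large files times the sample count, rather than with every pair of files on the volume.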