Gary K. Griffey wrote: Greetings James...
1) A new QRecall archive is created at site "A" that includes one or more of these virtual disks. Even with the best compression and highest shifted quanta options...this archive could easily reach 100 GB in size.
Aside: Shifted quanta detection rarely helps with virtual machine files (which are essentially disk images), because disk images are organized into blocks, so data can only "shift" to another block boundary. Shifted quanta detection looks for shifts of data at the byte level. I'm not saying that cranking up shifted quanta detection won't make any difference, but it will add a lot of overhead for very little gain. Now, back to our regularly scheduled program...
2) This archive is then copied to an external drive...that is physically relocated to site "B".
Now, the problem statement. When the archive at site "A" is subsequently updated with a recapture operation of the virtual disks...I need a way to "refresh" site B's copy of the archive...preferably via a network connection....just the delta data would be transmitted, of course...then the archive at site "B" would somehow be "patched", for lack of a better term, and thus be a mirror of site "A"'s archive.
I have used many diff/patch utilities in the past to mimic this functionality...but they were all geared toward single binary files...not a package file/database, as QRecall uses.
A package is just a collection of files. Synchronize all of the files in the archive's package, and you've sync'd the archive.
Gary, I do this using rsync. I have a couple of archives that I maintain off-site mirrors of, by running rsync once a day/week to mirror the changes made to one archive onto the other. Since QRecall adds only the blocks of data that changed, and rsync transmits only the blocks of data that have changed, the two are almost a perfect match. The end result is that rsync will transmit pretty much just the new data captured by QRecall and not much else.
To do this over a network requires (a) one system with rsync and a second system running an rsync server or ssh, (b) a fairly fast network connection, (c) a generous period of time in which neither system is updating its archive, and (d) more free disk space than the size of the archive.
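To make that concrete, here's a rough sketch of the kind of command involved; the archive name, user name, host, and paths are placeholders, not my actual setup:

    rsync -a --delete /Volumes/Backups/Projects.quanta backup@offsite.example.com:/Volumes/Mirror/

The -a flag preserves the package's structure and file attributes, --delete removes files on the mirror that no longer exist in the source package (merge and compact actions delete files inside the package), and rsync's delta-transfer algorithm takes care of sending only the changed portions of each file over the ssh connection.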
I schedule rsync (via cron) to run at 3AM every morning. It uploads an archive of my important projects (30GB) to my server and then downloads the running backup of my server (175GB) to a local drive. This process takes a little over an hour each day and typically ends up transferring about 1GB-1.5GB of data.
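For what it's worth, the scheduling side is just an ordinary crontab entry pointing at a small wrapper script. Everything below (the time aside) is a made-up illustration rather than my exact setup:

    # user crontab: run the mirror script at 3AM every morning
    0 3 * * * /path/to/mirror-archives.sh

    #!/bin/sh
    # mirror-archives.sh -- hypothetical wrapper holding the two transfers
    # upload the local projects archive to the server...
    rsync -a --delete /Volumes/Backups/Projects.quanta backup@server.example.com:/Backups/
    # ...then pull the server's running backup down to a local drive
    rsync -a --delete backup@server.example.com:/Backups/Server.quanta /Volumes/Mirror/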
One of the drawbacks to this scheme lies in how rsync synchronizes files. rsync first makes a copy of a file, patches the copy with the changes, and finally replaces the original with the updated version. For small files this isn't any big deal, but for the main repository.data file (which is 99% of your archive), this means the remote system will first duplicate the entire (100GB) data file. This requires a lot of time, I/O, and disk space, but it's really the only downside to this method.
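If you want to guard against running out of room, a couple of lines at the top of the wrapper script can refuse to sync when the mirror volume can't hold that temporary second copy. The 110GB threshold, the volume path, and the use of macOS's df -g are just assumptions for the sketch:

    # abort if the mirror volume has less than 110 GB free (df -g reports 1 GB blocks)
    FREE_GB=$(df -g /Volumes/Mirror | awk 'NR==2 {print $4}')
    if [ "$FREE_GB" -lt 110 ]; then
        echo "mirror volume too full to sync safely" >&2
        exit 1
    fi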
My tip for making this work efficiently is to minimize other changes to the active archive. Schedule your merge actions so they run only occasionally (weekly at most), and compact the archive only rarely. Merging creates lots of small changes throughout the archive, and compacting makes massive changes. The next rsync will be compelled to mirror those changes, which will require a lot of data to be transmitted.
I keep giving this problem a lot of thought, as there are more than a few individuals who want to do this. I have a "cascading" archive feature on the to-do list, which would accomplish exactly what rsync does (although a little more efficiently). But I still don't like the design restrictions, so I keep putting it off.