Glenn Henshaw wrote:I recently upgraded the hard drive in my laptop. The backups are now failing with an odd error. The Mac is up to date (10.6.4) using QRecall Version 1.2.0(8) beta (1.2.0.8). The upgraded HD was cloned from the old one. I opened the archive manually and tried to re-index it and the following logs were the result. Thoughts?
I can't tell you much beyond the fact that QRecall was trying to write a file and the operating system returned an I/O error (ioErr, -36):
2010-09-08 22:00:40.075 -0400 Details File: /Volumes/MBP/thraxisp.quanta/hash.index
2010-09-08 22:00:40.075 -0400 #debug# API: FSWriteFork
2010-09-08 22:00:40.075 -0400 #debug# OSErr: -36
What would cause your brand new drive to start returning I/O errors is a bit puzzling, although I'm not sure which drive is new: your internal startup drive, or the external drive at /Volumes/MBP? If it was the internal drive that was upgraded, it's equally bizarre that this could cause your external drive to start returning I/O errors. You also have a mysterious -50 (invalid parameter) error recorded in the log, which doesn't make much sense either, although I've seen both of these errors caused by corrupted volume structures.
----------------------------------------
Gary, You're getting a generic "permission denied" error on one of the archive's internal files:
2010-09-07 15:01:08.477 -0500 Details cannot open negative hash map
2010-09-07 15:01:08.477 -0500 #debug# IO exception
2010-09-07 15:01:08.477 -0500 #debug# API: open
2010-09-07 15:01:08.478 -0500 #debug# OSErr: 13

Error 13 is EACCES (Permission denied). It means the user running the compact action doesn't have permission to write to the negative hash map file inside the archive's package. This doesn't make a lot of sense, assuming that QRecall can access all of the other files in the package. Have you been updating this archive from other user accounts, or on systems that might have created files belonging to different users?

Anyway, there are two ways of fixing this. If the problem is just the negative map index file, simply delete it; the negative map is one of those index files that will automatically rebuild itself. In the Finder, right-click on the archive icon and choose Show Package Contents. Inside the archive package folder, find and trash the negative.index file.

The other solution is to fix the permissions problem. Again, open the package contents and explore the ownership and permissions of each file. Ideally, they should all be owned by the user account used to update the archive, or at least have permissions such that every user who needs to update the archive can read and write those files.
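If you'd rather survey the package from the command line, here's a minimal Python sketch that walks an archive package and prints the owner and permission bits of every file. The archive path is a hypothetical placeholder; point it at your own archive.

    #!/usr/bin/env python3
    # Survey ownership and permissions inside an archive package.
    # The archive path is a placeholder -- substitute your own.
    import os, stat, pwd

    archive = "/Volumes/Backups/MyArchive.quanta"  # hypothetical path

    for root, dirs, files in os.walk(archive):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            try:
                owner = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                owner = str(st.st_uid)   # uid with no local account
            mode = stat.filemode(st.st_mode)  # e.g. "-rw-r--r--"
            print("%s  %-8s  %s" % (mode, owner, os.path.relpath(path, archive)))

Any file owned by an account other than the one running your compact action is a candidate for the EACCES error.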
----------------------------------------
Gary K. Griffey wrote:I will continue testing...just to clarify...I am no longer restarting the laptop and using target disk mode for Capture per your advice...I am executing QRecall right from the laptop itself...so...I am hoping the file system events work as expected.
So do I! I'd be really disappointed if file system events stopped working.
Gary K. Griffey wrote:I am getting this error on one of my archives.
That's an odd error to get from a local drive. Send a diagnostic report and I'll look into the exact nature of the error and what it might mean.
----------------------------------------
Gary K. Griffey wrote:Any thoughts as to why?
Pure coincidence. The most innocuous things can cause file system events to record a change for a directory. Merely opening the folder that contains your VM file in the Finder could be enough to trigger a rescan of that item.
So...it would appear that file system changes, at least for certain files, are not "surviving" through a system restart...
They should. File system event records are stored in an invisible folder at the root level of each volume, and they happily live on between system restarts. But as I remember, you're still shutting down your laptop and connecting it to another system using target disk mode to perform the QRecall capture. As soon as you mount a volume on another system, the entire file system events history for that volume becomes invalid. And it gets scrambled again when you move it back. So under these circumstances, all bets are off.
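For the curious, the persistent event store is the hidden .fseventsd folder at the root of each volume, which the FSEvents service tags with a UUID. A quick way to peek at it (a hedged sketch; the volume path is a placeholder, and reading the folder may require administrator privileges):

    #!/usr/bin/env python3
    # Peek at the FSEvents store UUID for a volume. If the store is reset
    # (for example, after the volume was mounted on another system), any
    # saved event history for that volume is no longer usable.
    import os

    volume = "/Volumes/MBP"  # hypothetical volume
    uuid_file = os.path.join(volume, ".fseventsd", "fseventsd-uuid")

    try:
        with open(uuid_file) as f:
            print("FSEvents store UUID:", f.read().strip())
    except OSError as e:
        print("Could not read event store:", e)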
----------------------------------------
Gary K. Griffey wrote:I have basically ruled-out your supposition of a bad target disk drive/controller...
Excellent! Then this problem is actually interesting. To get started, you'll need to upload a diagnostic report (Help > Send Report...) from the computer you're using to capture the VM file. Could you also provide me with information about the volume that contains your archive? Specifically, its size, format (HFS Extended, journaled, ...?), and how you connect to it (Firewire, USB, network, ...).
One experiment I tried this morning...
The fact that one capture failed and the next one succeeded when QRAuditFileSystemHistoryDays was set to 0 may, or may not, be relevant. QRAuditFileSystemHistoryDays only changes which folders QRecall scans for changes. It doesn't change any of the logic that QRecall uses to determine which items to capture, or how they are captured. It may be relevant because this could be a QRecall bug, one related to the size of the file(s) being captured. Changing QRAuditFileSystemHistoryDays might have changed the folders that QRecall examined, which in turn could have changed which files it captured and how much data it added to the archive. I'd be more interested in a capture that fails, after which you delete the layer and then run the same capture (with the exact same settings) again.

(Besides the issue at hand, I would recommend that you leave QRAuditFileSystemHistoryDays set to something relatively low; certainly lower than the normal interval at which you recapture your laptop, since moving your laptop drive from one system to another invalidates the file system events information.)

To further diagnose this problem, I'd be most interested in getting a snapshot of your archive's structure when it was OK and again after the capture corrupts it. To do that, make a dump of the archive after a successful verify:

- Open the archive in QRecall
- Choose Archive > Dump (not File > Dump...)
- In the dump options sheet, choose the following:
- - In the Data section, check all options EXCEPT Data Packages
- - In the Layers section, check all options
- - Make sure all options in the Fill Map, Hash Table, Package Index, and Names Index sections are off
- Click Dump and save the file to a convenient location

The Dump command is a diagnostic tool that only exists in the beta version. It will lock up the QRecall application (you'll get the spinning Technicolor pizza of death cursor) until it finishes. It should take about the same amount of time as a verify, so be patient. When it's done, compress the dump file in the Finder and send it to me. If it's too big to e-mail, contact me and I'll provide you with alternative upload methods.

And, I'm going to ask again (just because I can't stop myself) that you use Disk Utility to verify the structure of the volume that contains the archive. I ask simply because I've spent days in the past trying to diagnose problems that turned out to be a corrupted volume.

Continue taking normal captures. When the verify following a capture fails, create another dump (same options) and send it to me along with another diagnostic report (Help > Send Report...). It would be awesome if you could capture the dump of the archive immediately before the failure, but I'll understand if you don't have the time to generate a dump file after every successful capture. I'm also very interested in the size and structure of the files in your VM, but that information will be in the dump file. This should give me enough information to create a simulation of your situation here.
----------------------------------------
Gary, It's really hard to tell what's going on without more information. Upload a diagnostic report (Help > Send Report...) and I'll see if there are any clues in the log.

One plausible explanation would be that the drive you're using to store the archive has at least one sector causing intermittent data loss, and the drive's controller is either failing to detect this or failing to spare the sector. The scenario would go something like this: a sector fails, resulting in a bad data record. The verify detects this. The repair erases the record and marks it as empty. The next capture/compact sees an empty record and writes a new record to the same location. The sector fails, corrupting the new record. The next verify detects this and... One clue would be to see if the data errors all occur at, or near, the same file position. That would indicate a failure associated with a particular region of the hard drive's surface.

One simple workaround would be to copy the archive (since it's only 40GB), rename the old one, and update all of your actions to use the new one. Then see if the problem recurs.

The fact that the error occurs in the same file isn't all that surprising. Even if the data corruption were completely random, VM files tend to be huge. I wouldn't be surprised if most of the data in your archive belongs to that one VM file. Throwing a dart, it would be hard not to hit it.
----------------------------------------
(Replying to my own post...)
James Bucanek wrote:If you restored your system using this layer, you wouldn't have a home folder.
It also occurred to me that this is one more way in which QRecall will be forced to recapture everything in your home folder. When you capture your entire volume and exclude your home folder, it's just as if your home folder doesn't exist. The next time you capture your home folder, QRecall sees a brand new folder with thousands of items in it! It has no history or metadata to compare with, so it performs a first-time capture of every file.
----------------------------------------
Adrian Chapman wrote:Would it be possible to have a preference so that the time when this occurs could be set.
I'll add that to the wish list.
It could be very inconvenient if QRecall embarks on an action that could last many hours just as you need to shut down your machine.
That's what the Stop and Reschedule action menu item (in the monitor window) was designed for. If it's time to shut down, stop and reschedule the action to run in a few minutes, or at least far enough in the future to let you shut down. When the system reboots, the action will pick right back up where it left off.
My present arrangement is to backup my entire system EXCEPT my home directory once per day, and to backup my home directory at intervals of 2 hours.
I would recommend that you NOT exclude your home folder. The Exclude items feature is designed to exclude (thus the name) items from ever being captured. QRecall treats excluded items as if they did not exist. When you capture your entire volume and exclude your home folder, you create a layer in the archive where your home folder does not exist. If you restored your system using this layer, you wouldn't have a home folder. Set one action to capture your entire volume, and a second action to periodically capture your home folder during the day. Now you can recall from any layer and you'll get your system files up to the last day, and your personal documents up to the last hour. I have plans for a new "Ignore" feature that will do what you're trying to use the Exclude filter for, but it hasn't been implemented yet.
This may well be what has been causing me problems because sometimes I have cancelled a backup. Would it avoid the full rescan if the incomplete layer is deleted prior to the next scheduled backup?
If that's what's causing the rescan, then yes.
This thought had passed through my mind, and the problem only seems to have occurred on the Mac Pro since I started playing around with Path Finder, but I can't believe it is tinkering with the file system in such a way that so many files need to be checked.
Hard to say. Since Path Finder (ironically, one of my first Apple ][ applications was a program called Pathfinder for ProDOS) is a file browser, it might (like the Finder) store information in invisible files (like .DS_Store) or use extended attributes. Any changes like this would cause QRecall to recapture items.
I love this application and I am sure my copy of QRecall and I can come to some sort of mutually acceptable way of working.

----------------------------------------
Greg, QRecall 1.2 requires OS X version 10.5 or later. Supporting 10.4 has become increasingly difficult, and my installed base of 10.4 systems is currently less than 7%. In all likelihood, future versions of QRecall will require 10.5 or later. Archives created with 1.1.4 can be used by all later versions of QRecall. QRecall 1.1.4 can still use archives created/updated by 1.2, within the limits outlined in the release notes. Most notably, 1.1.4 cannot open or recall from archives that are significantly larger than 2TB.
----------------------------------------
Adrian, All good questions. My short answer is that I don't know why the operating system is telling you that hundreds of folders have changed (beyond the obvious reason that hundreds of folders did, in fact, change). QRecall only recaptures items under certain circumstances (which I explain later), but it should not try to recapture an item that hasn't been touched in any way. Let me see if I can accurately explain exactly what QRecall does and why. Recapturing items can be largely divided into three phases: folder changes, metadata, and file data. In the first phase, QRecall requests the list of potentially modified folders from the FSEvents service. As you correctly observed, FSEvents only produces a list of folders that might contain changes. It's QRecall's job to examine the contents of those folders to find out what, if anything, has actually changed. While that sounds simple enough, these things never are, and there are a few caveats that you want to be aware of:
FSEvents is not infallible, and QRecall only trusts it for a limited amount of time. The default is to trust FSEvents change history for 13.9 days, which will cause QRecall to perform an exhaustive search of your entire folder tree about every two weeks. This setting can be changed using the QRAuditFileSystemHistoryDays advanced setting. So the fact that QRecall occasionally runs out and rescans everything isn't, by itself, surprising.
The history of FSEvents information is kept on a per-captured-item basis, and can't always be applied to future captures. This means that QRecall might scan more folders than you expect if you have mixed or overlapping capture actions defined. For example, if you have one capture action that captures your entire home folder and another that captures just your Documents folder, capturing your Documents folder doesn't help the history that's saved for your home folder. The next time you capture your home folder, it will rescan all of the historic changes in your Documents folder too, because it can only use history information that entirely encompasses the item that it's capturing. I admit that this is confusing, but it has to work that way or QRecall might miss changes.
QRecall will ignore the FSEvents information and perform an exhaustive scan of your folders if the previous layer was incomplete, or contains any information from layers that are incomplete or marked as damaged by a repair. Items marked as "-Damaged-" are always recaptured in their entirety.

So, now that QRecall has its marching orders, it's time to examine the individual items in each folder. This is the metadata phase. QRecall starts by comparing the metadata for each item with what's in the archive. "Metadata," for those new to this, is "information about information." In this case, the metadata of an item is things like its name, created date, last modified date, ownership, permissions, extended attributes, launch services attributes, custom icon, and so on. It's basically everything that the operating system knows about the item except its actual contents. If all of the metadata for an item is identical to what's in the archive, the item is skipped. If any changes are noted, the metadata for that item is recaptured. This doesn't mean the entire file is recaptured (that's later), just that the metadata about that file is recaptured so that QRecall always has the latest information about it. The reading, testing, and even recapturing of metadata is pretty fast, and most folders only require a fraction of a second to determine which items need to be recaptured.

If QRecall finds any changes in a file's metadata that might indicate its contents could have been modified (creation date, last modified date, attribute modified date, name, extended attributes, number of data forks, length of data forks), it proceeds to recapture the data for that item. This consists of reading every byte of the file and comparing it to what's already in the archive. This is the data phase of the capture, and the one that takes the most time. (A sketch of this decision logic appears below.)

If you believe that QRecall is recapturing file items that it should be breezing past, you can find out why with a little sleuthing. The advanced setting QRLogCaptureDecisions (see the Advanced QRecall Settings post) will log the reason that QRecall decided to capture each item. Note that it only logs the first reason; there could be more than one. This will tell you something about what triggered QRecall's decision to (re)capture the item. Warning: this setting logs a ridiculous amount of information to the log file, so don't leave it on once you've found the information you're looking for. If you find that all of these files really have been modified, then I would go hunting for some background or system process that is surreptitiously rummaging around your file system.
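To make the metadata phase concrete, here is a minimal Python sketch of the kind of comparison described above. The fields checked and the stored-record format are illustrative assumptions, not QRecall's actual implementation:

    #!/usr/bin/env python3
    # Illustrative sketch of a metadata-driven capture decision. The
    # fields compared here are stand-ins; QRecall's real record format
    # and comparison logic are internal to the application.
    import os

    def item_metadata(path):
        st = os.lstat(path)
        return {
            "name": os.path.basename(path),
            "size": st.st_size,    # length of the data fork
            "mtime": st.st_mtime,  # last modified date
            "ctime": st.st_ctime,  # attribute modified date
            "uid": st.st_uid,      # ownership
            "mode": st.st_mode,    # permissions
        }

    # Only some metadata changes suggest the file's *contents* changed.
    CONTENT_KEYS = ("name", "size", "mtime", "ctime")

    def needs_data_recapture(path, archived):
        """True if the (slow) data phase should re-read this file."""
        current = item_metadata(path)
        return any(current[k] != archived.get(k) for k in CONTENT_KEYS)

In this sketch, a change in ownership alone would update only the item's metadata, while a change in size or modification date would trigger the much slower byte-by-byte data phase.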
----------------------------------------
Gary K. Griffey wrote:I am getting a QRecall application crash in beta versions 1.2v6 and now in the just released 1.2v8. It occurs when the Re-Index operation is performed.
Thanks, Gary. This is a known bug. It occurs with the Repair command too. And sometimes the first time you choose either of these commands nothing happens, but the second time it works. There's also a related bug that can cause the QRecall application to crash when closing an archive window. These are caused by a change in OS X, around version 10.6.2, that altered the order of events that occur when opening and closing a window. While arguably an improvement, it trips QRecall up. I've been ignoring this issue because the QRecall user interface is being largely rewritten as I write this. The new code should replace the buggy code and (hopefully) won't have any of the same problems.
----------------------------------------
Greg Morin wrote:Does this mean it is possible for two different computers to write to the same archive (say on a shared mount point on a server) simultaneously?
No. Only one action (capture, merge, compact, delete) can modify the contents of an archive at a time. Any number of processes can read an archive simultaneously. So it's possible to browse, recall from, and verify an archive all at once.
If not, what would happen if User A tries to open the archive to write to it while User B is in the middle of an archiving process using that archive?
User A will automatically wait until user B is finished. If you're opening the archive to view its contents, QRecall will present a dialog that politely waits until the archive is available, or until you decide to give up and close the window. Similarly, if you have an archive window open and start a (local) action that wants to modify it, QRecall will automatically close the window so the action can proceed.
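This one-writer/many-readers pattern is a classic. As an illustration only (QRecall's actual coordination mechanism is internal to the application), here's what the semantics look like using POSIX advisory locks:

    #!/usr/bin/env python3
    # One-writer/many-readers semantics via POSIX advisory locks. This is
    # NOT how QRecall is implemented; it only demonstrates the behavior
    # described above: many concurrent readers, one exclusive writer.
    # Assumes the lock file already exists.
    import fcntl

    def open_for_reading(lockfile):
        f = open(lockfile, "r")
        fcntl.flock(f, fcntl.LOCK_SH)  # shared: many readers at once
        return f

    def open_for_writing(lockfile):
        f = open(lockfile, "r+")
        fcntl.flock(f, fcntl.LOCK_EX)  # exclusive: waits for everyone
        return f

    # A capture/merge/compact would take the exclusive lock; browse,
    # recall, and verify would take shared locks and can all proceed
    # concurrently.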
----------------------------------------
Gary K. Griffey wrote:So...just to summarize...(I promise  )...previously verified layers in the archive were indeed corrupted...but not via the last Capture operation...through some other, as of yet, unrecognized "event"...that is my final "takeaway" from your comments...correct?
Correct. And I should have let you write the reply; it was so much more succinct.
If so..this makes me feel completely confident in your product once again going forward...expect several $40 license fees from my clients in the near future...
Always good news.
----------------------------------------
(Yikes, this really should have been its own thread.)
Gary K. Griffey wrote:Thanks for the info...possibly, you could assist me in better understanding the available Repair options...it seemed to me that after executing the Repair operation...previously verified layers in the archive were indeed compromised by this subsequent re-Capture operation...which is very much contrary to what you have stated...possibly (very likely) a user error on my part when running the Repair.
None of the above; the previously verified data simply stopped verifying. A capture action only adds to an archive; it doesn't rewrite portions of the archive that have already been captured. If a previously captured and verified data record later fails its verification, then some external cause (magnetic media failure, a glitch in the drive controller, an OS bug, whatever) either lost, changed, or overwrote that data. The last thing to happen to an archive isn't necessarily the cause. The various repair options (which I'll explain in a moment) aren't to blame, nor can they pull valid data out of thin air.
As you recall...I ran the Repair operation against the archive after the verify failed...
That's absolutely the correct thing to do.
I ran it using only the first default option..."Use auto-repair information"...you further stated...from examining the log files that I attached...that the Repair was successful. However, after the Repair operation completed...I opened the archive...and saw 3 of the 4 layers now showing "Damaged Files"...which turned-out to be the large virtual machine package...(the most important reason for running the backup in the first place, by the way).
This is also correct. The archive is now "repaired" in that it is valid and internally consistent. However, data was damaged, and lost data can't be magically reinvented. QRecall examined the records and concluded that the data blocks belonging to two versions of your VM file were lost. So the two file records that contained the damaged data blocks were expunged from the archive, and the folders that contained those files were marked as "-Damaged-", indicating where the data loss occurred.
Thus it appeared to me that the VM package in layer 2 and 1...taken 3 and 4 weeks ago respectively...had been compromised...and thus my assumption that previously verified backup layers had now been corrupted.
That's correct. The data you captured a few weeks ago became damaged at some point. The verify alerted you to the issue, and the repair recovered the files and folders that weren't damaged. Looking at the log, only a single data record was corrupted in your archive. Most likely, this impacted a single data block belonging to different versions of the file in the two earliest layers. That data didn't belong to later versions of the file, so the subsequent layers were unscathed.
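(For anyone wondering how a single bad record can damage a file in two layers at once: QRecall de-duplicates data, so an unchanged block is stored once and shared by every captured version that contains it. A toy sketch of the idea, using a hash-keyed block store as a stand-in for the real archive format:)

    #!/usr/bin/env python3
    # Toy illustration of block-level de-duplication: two versions of a
    # file share their unchanged blocks, so losing one shared block
    # damages both versions. A stand-in for the real archive format.
    import hashlib

    store = {}  # digest -> block data

    def add_version(blocks):
        """Store one file version; return its list of block digests."""
        layout = []
        for block in blocks:
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # duplicates stored once
            layout.append(digest)
        return layout

    v1 = add_version([b"disk image header", b"user data A"])
    v2 = add_version([b"disk image header", b"user data B"])

    # Simulate the corruption of the one shared block...
    del store[hashlib.sha256(b"disk image header").hexdigest()]

    for name, layout in (("version 1", v1), ("version 2", v2)):
        ok = all(digest in store for digest in layout)
        print(name, "intact" if ok else "damaged")  # both report damaged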
Should I have run the Repair with different/additional options selected?
Not really, unless you're desperate. The repair options are:
Copy recovered content to new archive: Use this only if you do not want to touch the damaged archive in any way. It's useful for testing repair settings (since it doesn't change the original archive) or repairing an archive on read-only media.
Recover lost files: If directory records in the archive are destroyed, it may leave "orphaned" file records. That is, file records that aren't contained in any folder. This option scrapes the archive and assembles all orphaned files together in a group of special "recovered" folders. Note that this may resurrect files previously deleted by a merge action.
Recover incomplete files: If a data block belonging to a file is lost, the file is deleted. With this option turned on, the file is kept, marked as "damaged", and all of the remaining (valid) data blocks are assembled into a file. The file is still incomplete—the lost data is still lost—but the remaining data is recovered. The file probably isn't usable, but it might contain usable data.

This last option is probably the one you're most interested in, but it still can't recover the entire file. It can only recover the portion of the file data that wasn't lost.
Would this have preserved my 1st and 2nd layers?
No. Lost data is still lost data. You can't get it back once it's gone. Using the last option, you can get back everything in the file that wasn't damaged, but that's usually of limited value.
----------------------------------------
Gary K. Griffey wrote:By the way...I ran Disk Utility on the target laptop's hard drive...the one that experienced the corrupted archive...and there were no issues with the drive at all. I also checked disk permissions...no issues. So...I guess at this point..there is really no known explanation for what occurred with this particular incremental Capture that evidently corrupted the entire historic archive.
It's often hard to isolate the cause of a single random data failure, or even a series of them, without more information. The basic problem is that "stuff happens." Data storage and transfer is not nearly as perfect and repeatable as most people assume. Data in magnetic media—despite all of the clever tricks employed by modern drives to avoid it—gets lost from time to time. Data flying through USB, Firewire, SATA, and over WiFi doesn't always arrive the way it was sent. And consumer-grade dynamic RAM is susceptible to the occasional bit-flip and corruption by cosmic rays.

The "problem" with QRecall is that it is almost alone among backup solutions in that it attaches a 64-bit checksum to every record of data it creates, and verifies the integrity of that data every time it reads it. When QRecall reports damaged data, the initial impression is inevitably that QRecall is failing or is somehow failing to protect your data, when in fact it's most likely the media/controller/interface/RAM/CPU/OS that's damaging the data.

What is (or should be) frightening is that so many other so-called "reliable" backup solutions make no attempt whatsoever to protect against, detect, or report data loss. Solutions like Time Machine simply copy files and hope for the best. They couldn't tell you whether your files were successfully and accurately copied if you wanted them to.
I guess that is the issue that I see moving forward using QRecall for backups with my clients...it would appear that each and every incremental Capture basically puts the entire historic archive at risk of total loss...
Not true at all. QRecall is specifically designed to protect against partial loss of data in an archive. Damaging part of an archive in no way impacts the integrity, or recoverability, of the rest of the archive.
this is unlike many other imaging products that I have used in the past where incremental backups are always written to a new physical file...
QRecall doesn't write files; it writes data records. Each data record is small, self-contained, and independently verifiable. When the data in an archive is damaged, the repair process reads and verifies every record. It then reassembles the valid records into a usable archive again. It doesn't matter if a million records were written to a single file or a million individual files. (Writing to a single file is more efficient and ultimately safer, which is why QRecall does it that way.) Either each individual record is valid or it isn't. QRecall does not rely on the file system's directory structure to organize its information.
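To illustrate what "small, self-contained, and independently verifiable" means, here's a hedged sketch of a record stream where each record carries its own length and checksum (a 32-bit CRC standing in for QRecall's 64-bit checksum; the layout is an invention for illustration, not QRecall's on-disk format):

    #!/usr/bin/env python3
    # Toy record stream: every record carries its own length and checksum,
    # so each one can be verified independently of all the others. The
    # layout and checksum are illustrative, not QRecall's actual format.
    import struct, zlib

    def write_record(f, payload):
        f.write(struct.pack(">II", len(payload),
                            zlib.crc32(payload) & 0xFFFFFFFF))
        f.write(payload)

    def read_records(f):
        """Yield (payload, ok) pairs; ok is False for a corrupt record."""
        while True:
            header = f.read(8)
            if len(header) < 8:
                return
            length, checksum = struct.unpack(">II", header)
            payload = f.read(length)
            yield payload, zlib.crc32(payload) & 0xFFFFFFFF == checksum

A repair built on records like these can walk the stream, keep every record whose checksum still matches, and discard the rest, regardless of how the surrounding file or directory structure fared.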
and thereby do not jeopardize the integrity of the historic backups currently in existence.
The integrity of the archive's history is never in jeopardy. QRecall's layer records are specifically designed to resist corruption through partial data loss. It employs a method of "positive deltas" that re-records critical directory structure information in every layer. So the loss of data in an older layer won't impact the structure or integrity of subsequent layers.
I realize, of course, that I could simply make a copy of an existing archive each and every time before a subsequent Capture is performed...
No need, QRecall's already doing that behind the scenes. Most actions begin by duplicating key index files within the archive (if you peek inside an archive package during a capture or merge you'll often see temporary "_scribble" files appear). New data is appended to the primary data file. If anything goes wrong (power loss, OS crash, ...) the partially modified files are summarily discarded and the primary data file is truncated at the point before the action began. The result is an instant rewind to a valid archive state. You'll see "auto-repair" in the log when this happens.
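The rewind trick at the end is simple enough to sketch. Again, this is a hedged illustration of the general append-then-truncate technique, not QRecall's actual code:

    #!/usr/bin/env python3
    # Append-only writes with truncate-on-failure rollback: note the file
    # length before the action, append new data, and on failure truncate
    # back to the saved length, instantly rewinding to a valid state.
    import os

    def append_action(path, new_records):
        with open(path, "ab") as f:
            checkpoint = f.tell()        # valid length before the action
            try:
                for record in new_records:
                    f.write(record)
                f.flush()
                os.fsync(f.fileno())     # commit: data really on disk
            except Exception:
                f.truncate(checkpoint)   # rollback: discard partial writes
                raise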
Don't get me wrong...I think your product has some great features...and I do appreciate your thoughts and guidance.
I hope some of this technical information will help explain the extraordinary efforts that QRecall takes to preserve your data, and detect when that data has been lost or damaged.