QRecall

Bruce Giles wrote:This is on our G5 server, running OS X Server 10.4.10 and the latest release of QRecall. Every 10 seconds, the following two messages are added to the *console* log:

I forgot to add that our scheduled backups are occurring right on schedule, so unless this just started since the last scheduled backup at 2:00 this morning, the schedules aren't affected.

-- Bruce

This is on our G5 server, running OS X Server 10.4.10 and the latest release of QRecall. Every 10 seconds, the following two messages are added to the *console* log:

2007-11-16 11:39:16.511 QRecallScheduler[21833] CFLog (0): CFMessagePort: bootstrap_register(): failed 1100 (0x44c), port = 0x2703, name = 'Processes-0.2752513'
See /usr/include/servers/bootstrap_defs.h for the error codes.
2007-11-16 11:39:16.513 QRecallScheduler[21833] CFLog (99): CFMessagePortCreateLocal(): failed to name Mach port (Processes-0.2752513)
CFMessagePortCreateLocal failed (name = Processes-0.2752513 error = 0)

This happens starting at login, without even launching QRecall. There are no corresponding messages showing up in the QRecall log. In fact, there's nothing at all unusual in the QRecall log. I just noticed this today, but I don't know how long it's been going on, since the console log appears to be cleard at login.

Any idea what's going on?

-- Bruce

Charles Watts-Jones wrote:I'm afraid that the cure hasn't worked for me. The attachment shows the log after installing v49, then logging out/in and finally a restart. The good news is that the Activity Monitor continues to show that QRecall will back-up as scheduled. I have sent this month's log file to James separately.

v49 did fix my problem. Now I can log out and scheduled tasks are once again running. But I've got a new problem. Actually, I'm not sure how much of a problem it is. I wouldn't have even known about it had I not looked at the log.

Now, every time I launch QRecall, I get three new lines in the log file:

Problem connecting to scheduler
NSDistantObject access attempted from another thread
NSObjectInaccessibleException exception

But other than that, everything seems to be working. I'll send the log file to James.

-- Bruce

James Bucanek wrote:I've found out what the problem is.

A recent change to OS X now kills (SIGKILL -- essentially a force quit) any process belonging to the user when they log out. Worse, it kills any background process that tries to start up again.

This change makes it impossible for the scheduler to run while logged out.

I hope to have a solution soon.

I wonder if this explains the problem I've been having on my sever. For several weeks now, QRecall hasn't run at its scheduled time, which is 2:00 AM. Unfortunately, I've been too busy with other things to get too concerned about it, and whenever I do log in, it does the backup within a few minutes after that.

The log is full of entries like this, which repeat every 30 minutes:

2007-10-30 01:29:01.111 -0400 #debug# -[QRUserBSDSocketPath connectPort] /var/tmp/QRecall.501/QRecallScheduler [3.732979.4047.1]
2007-10-30 01:29:02.611 -0400 Failure Communications socket with existing scheduler is broken [3.732979.4047.2]
2007-10-30 01:29:02.612 -0400 Details [NSPortCoder sendBeforeTime:sendReplyPort:] timed out [3.732979.4047.2.1]
2007-10-30 01:29:02.612 -0400 Details NSPortTimeoutException exception [3.732979.4047.2.1.1]
2007-10-30 01:29:02.614 -0400 #debug# -[QRUserBSDSocketPath deleteSocket] /var/tmp/QRecall.501/QRecallScheduler [3.732979.4047.3]
2007-10-30 01:29:02.614 -0400 #debug# -[QRUserBSDSocketPath createPort] /var/tmp/QRecall.501/QRecallScheduler [3.732979.4047.4]

-- Bruce

James Bucanek wrote:You might want to check your NetInfo database. I suspect that a partial user account with a home folder path of "99" still exists.

I don't see anything like that. There's a user called "unknown" with a GID of 99, and also a group called "unknown" with a GID of 99, but those appear to be normal, since they're in the NetInfo database of every Mac I've looked at.

Anyway, as you noted, it's apparently not a QRecall problem. Thanks!

I'm gettings some strange error messages in the log. The nightly scheduled backup starts off like this:

2007-08-01 02:00:01.013 -0400 ------- Capture to Server HD1 [2.732889.25404]
2007-08-01 02:00:01.013 -0400 Details archive: /Volumes/Backup1/Backups/Server HD1.quanta [2.732889.25404.4]
2007-08-01 02:00:01.013 -0400 #debug# executing [2.732889.25404.5]
2007-08-01 02:00:09.671 -0400 #debug# reconstructing negative hash map from 321473 hash entries [2.732889.25404.6]
2007-08-01 02:00:11.054 -0400 #debug# negative hash rebuilt (536870912 bits, 321386 set) [2.732889.25404.7]
2007-08-01 02:00:12.804 -0400 #debug# -[HashFile anticipateHashGrowth:] total free=191981551616, practicalGrowth=7030574, maxGrowth=8067134, hashEntries=321473, hashSize=4194304 [2.732889.25404.8]
2007-08-01 02:00:12.840 -0400 Capture Macintosh HD [2.732889.25404.9.1]
2007-08-01 02:00:12.840 -0400 Details Macintosh HD [2.732889.25404.9.1.1]
2007-08-01 02:00:13.075 -0400 Curious Filter cannot make reference to item [2.732889.25404.10]
2007-08-01 02:00:13.075 -0400 Details path: /Users/diradmin [2.732889.25404.10.1]
2007-08-01 02:00:13.075 -0400 Details file reference creation failed [2.732889.25404.10.2]
2007-08-01 02:00:13.075 -0400 Details IO exception [2.732889.25404.10.2.1]
2007-08-01 02:00:13.076 -0400 #debug# API: FSPathMakeRef [2.732889.25404.10.2.2]
2007-08-01 02:00:13.076 -0400 #debug# OSErr: -43 [2.732889.25404.10.2.3]
2007-08-01 02:00:13.092 -0400 Curious Filter cannot make reference to item [2.732889.25404.11]
2007-08-01 02:00:13.092 -0400 Details path: 99 [2.732889.25404.11.1]
2007-08-01 02:00:13.092 -0400 Details file reference creation failed [2.732889.25404.11.2]
2007-08-01 02:00:13.092 -0400 Details IO exception [2.732889.25404.11.2.1]
2007-08-01 02:00:13.092 -0400 #debug# API: FSPathMakeRef [2.732889.25404.11.2.2]
2007-08-01 02:00:13.092 -0400 #debug# OSErr: -43 [2.732889.25404.11.2.3]
2007-08-01 02:00:13.093 -0400 Curious Filter cannot make reference to item [2.732889.25404.12]
2007-08-01 02:00:13.093 -0400 Details path: 99 [2.732889.25404.12.1]
2007-08-01 02:00:13.093 -0400 Details file reference creation failed [2.732889.25404.12.2]
2007-08-01 02:00:13.093 -0400 Details IO exception [2.732889.25404.12.2.1]
2007-08-01 02:00:13.094 -0400 #debug# API: FSPathMakeRef [2.732889.25404.12.2.2]
2007-08-01 02:00:13.094 -0400 #debug# OSErr: -43 [2.732889.25404.12.2.3]
2007-08-01 02:00:13.094 -0400 Curious Filter cannot make reference to item [2.732889.25404.13]
2007-08-01 02:00:13.094 -0400 Details path: 99 [2.732889.25404.13.1]
2007-08-01 02:00:13.094 -0400 Details file reference creation failed [2.732889.25404.13.2]
2007-08-01 02:00:13.094 -0400 Details IO exception [2.732889.25404.13.2.1]

And then similar errors continue for a long time. The last one in the batch looks like this:

2007-08-01 02:00:13.821 -0400 Curious Filter cannot make reference to item [2.732889.25404.101]
2007-08-01 02:00:13.821 -0400 Details path: 99 [2.732889.25404.101.1]
2007-08-01 02:00:13.821 -0400 Details file reference creation failed [2.732889.25404.101.2]
2007-08-01 02:00:13.821 -0400 Details IO exception [2.732889.25404.101.2.1]
2007-08-01 02:00:13.822 -0400 #debug# API: FSPathMakeRef [2.732889.25404.101.2.2]
2007-08-01 02:00:13.822 -0400 #debug# OSErr: -43 [2.732889.25404.101.2.3]

I'm really not sure what the "path: 99" stuff is about, but I note that when we were initially setting up accounts on the server, we got something wrong on one account with respect to the home folder or profile folder setting. When the user tried to log in, he ended up in a folder he couldn't write to. After he logged out, I found a folder called "99" at the top level of the server. The only thing inside it was a Library folder, which contained a "Preferences" folder, and there were a few pref files inside that. Since it was obviously an error, we deleted the entire "99" folder. Fixed the user's account info, and have had no further problems with that, but that may have been when these QRecall errors started.

Either I'm misunderstanding the purpose of the "Ignore if no archive" condition, or else it's broken. When I add that condition to an action, the action stops working, even though the archive drive is mounted and the archive is available. It doesn't matter whether the archive is open or not. I checked the log file to see what was going on, but there's no indication (even at the max detail setting) that the action even attempted to run. It doesn't run either manually or scheduled.

On the other hand, if I change the condition to "Hold while no archive", it performs exactly as I expected.

What I would like to do is have rotating backups between two or more external hard drives. One would be mounted at all times, and the others would be offsite. I'd have a set of scripts which were duplicates except that each one referred to an archive on a different backup drive. Each would be scheduled every day, with an "Ignore if no archive" condition. I'm assuming this should ensure that an archive is performed every day, regardless of which hard drive happened to be mounted.

If it works, it beats the heck out of Retrospect. In Retro, I can define two scripts with alternating weekly schedules, but if no one is there to swap drives on time, then the backup just hangs. Worse, one of the dialog actions is to mark the drive as lost, which is apparently easy to do, because our office manager has done it several times when holidays caused an exception to our normal schedule. Once Retro thinks a drive is lost, it's very difficult to get it to save any more backups to it.

James Bucanek wrote:

Bruce Giles wrote:I was playng around a little with the Capture Assistant today. In the time limit section, I selected "Keep 90 days".

I notice the dialog has a footnote, with an asterisk, but I don't see where the matching asterisk in the main text is. Anyway, that's a trivial point.

The asterisks in the button titles were lost. I just fixed that.

That should serve to make the footnote more clear, since I'll know exactly what it's referring to.

That's exactly what it's saying. The suggested schedule changes based on a guesstimate of how much data the archive can capture. The assistant calculates a ratio of the free space on the archive volume (after a hypothetical first capture) to the total size of the items you selected to capture. If the estimated free space on the archive volume is more than 150% the size of the items to capture, it suggests the 90 days and 30 day+3 years schedules. This is based on the assumption that approximately 4% of files will change each day and approximately 40% of the data in those files does not actually change. If the ratio is smaller, the assistant will suggest 60, 30, or 14 day schedules instead.

That's very cool. Apparently I just didn't test over a wide enough range of items to see the variable parts.

Which is exactly why the schedules are a little non-intuative. If you add additional actions to, say, capture your Documents folder every hour or use the Desktop Capture command to take a snapshot of a project folder, the archive will have many intermediate layers during the day. Keeping 3 (or 7) days keeps all of those "micro-backups" for a few days, allowing you to recover the project as it was at 9AM the day before yesterday. After a few days have elapsed, these micro-backups will be merged so that only the last daily version of each file is retained.

OK, that makes sense. The assistant does do some things differently than I would have expected. Not wrong (as far as I can tell), just different. For instance, having two compact actions with different conditions for the same archive. I hadn't thought of doing that until I saw the assistant do it.

I'm not sure if this is a bug, because of your disclaimer. (read on...)

I was playng around a little with the Capture Assistant today. In the time limit section, I selected "Keep 90 days".

I notice the dialog has a footnote, with an asterisk, but I don't see where the matching asterisk in the main text is. Anyway, that's a trivial point.

I'm not sure what the footnote is trying to say. The first sentence sounds as though the "Keep 90 days" and "Keep 30 days, then less frequently for two years" options might change to other options, depending on what I'm trying to capture, but if that's the case, I haven't seen it happen yet.

The second sentence seems to say that I might not actually get 90 days, or two years, worth of captures if my hard drive fills up before that. If that's what you wanted to say, I think the first sentence confused the issue.

When the assistant finished, I looked at the results. The rolling merge says:

Keep most recent 3 days
Followed by 57 day layers

I'm pretty sure that's 60 days, not 90.

Also, in this particular case, since the capture is done daily, could the same result be achieved by either:

Keep most recent 90 days

or

Keep most recent 1 days
Followed by 89 day layers?

(I realize they're not the same if you change the capture frequency.)

OK, here's something I don't understand. As I mentioned in another message, I did a complete restore of my server yesterday. My next backup was this morning. I was surprised to notice that QRecall clamed to have captured *everything* this time, even though the vast majority of the files had not changed since the previous restore.

To explain further, here's the size and the item count for the backup layers in this archive:

1. 4.25 GB -- 96,437 items (Initial backup)
2. 189.7 MB -- 2,921 items
3. 83.5 MB -- 964 items
4. 71.8 MB -- 624 items
5. 818.3 MB -- 4,981 items (last backup before restore)
(full restore occurred here)
6. 4.37 GB -- 97,308 items (first backup after restore)
7. 258.6 MB -- 1,958 items

Layer number 6 is the one that surprised me. I did notice, however, that the monitor window said "99% duplicate" as it ran.

I just checked the size of the archive file, and it's 4.24 GB. So I'm relieved to see that whatever QRecall did, it didn't actually capture every item in its entirety, or the archive size would have been about 8.5 GB.

By the way, my archive settings were shifted quanta to the max (far right on the slider) and all compression off (far left on the sliders).

James, Here's a story you'll like.

I was working on our OS X Server last night. I did a full backup with QRecall, then installed the 10.4.10 update, then made a number of configuration changes, and the last thing I did before I left was another QRecall backup. Since I'm not sure where OS X Server stores all hundreds (thousands?) of configuration files, I didn't want to try to guess what items to back up, so I always have QRecall capture the entire boot volume. The server doesn't have much on it yet, so it goes pretty quickly.

Today, I spent the entire day playing with the server. As you know, I'm new to OS X Server, and Apple's docs just aren't that good. Neither are any of the books I've read. Maybe they're OK if you're a server expert, but I don't speak that language yet. So sometimes, when you're trying to figure out how to do something on the server, you just have to experiment.

So I'm setting up shared folders, adding a few users and groups, configuring DHCP, DNS, Open Directory, etc. Every time I found something useful, especially a configuration change, I carefully documented it via an entry in my weblog I started on the server.

Sometime this afternoon, things started to go south. First, shared folders which used to work stopped working. Then the help viewer application started acting up. (Links did nothing when I clicked on them, which meant I couldn't get past the first page of help.) I rebooted the server -- didn't help. I knew the help problem could be indicative of a cache problem, so I employed a cache cleaner utility, and rebooted. Sure enough, it fixed the help problem. But now I had a bigger problem. Suddenly, I wasn't authorized to make ANY changes in Workgroup Manager. I could authenticate against the directory, and got no error messages, but everything was grayed out.

Well, if I can't make changes in Workgroup Manager, the server is no better than an oversized boat anchor. So I fired up QRecall and started restoring things that I thought might have gotten hosed. Got nowhere with that either, so obviously I wasn't restoring the right things.

After 30 minutes of going nowhere, I decided it was time to get serious. I saved out a copy of my weblog page which documented the useful work I had done today. Then I rebooted the server off my external bootable backup drive. I ran QRecall and did a complete restore of the server's boot drive from last night's last backup. It took less than 30 minutes, and restored completely with no errors. I rebooted back to the newly restored drive, and everything works again! Most importantly, Workgroup Manager is speaking to me again. Sure, I lost the work I did today, but it didn't take long to restore that with the documentation from my weblog. (And we haven't rolled out the server to the users yet, so I didn't have to worry about users losing any of their files.)

I still don't know what I did wrong to cause the problem, but it's safe to say that if I hadn't had a good backup, I'd have spent hours, or even days, trying to fix the problem, and even then I might not have succeeded.

-- Bruce

Bruce Giles wrote:Maybe you could show the first part of it in a column in the layers section of an archive window. Move your mouse over the comment and maybe a little tip window shows the full comment.

Or another possibility -- something like the status bar at the bottom of a Safari window. Maybe it should be just about the window split instead. When you hover the mouse over a layer, causing it to highlight, if there's a comment, it shows up in the status bar.

-- Bruce

I backed up our server last night, then the 10.4.10 update came out today. Since I had a good backup, I decided to go ahead and install the update this evening. If the update has problems, I can restore back to last night again.

But suppose I recapture every night, and I discover late next week that I need to restore all the way back to 10.4.9 for some reason. I'm probably not going to remember which day I installed the 10.4.10 update. This is why I need layer comments. I'd like some way to tag a layer with a comment, either as I do the capture, or maybe later, in the case of a timed action capture.

I don't know if you're familiar witih Microsoft Windows' System Restore checkpoints, but they have a feature like this, and I've always found it useful.

I'm not sure how the user interface ought to work. For sure, the comments should be optional -- you don't have to have one for every layer. There should probably be an option to add some set comment as part of an action. A comment wouldn't necessarilly have to be long, but at least 40 characters would be nice. Maybe you could show the first part of it in a column in the layers section of an archive window. Move your mouse over the comment and maybe a little tip window shows the full comment. What do you think -- possible?

-- Bruce

James Bucanek wrote:

Bruce Giles wrote:Is the "Restore To" command documented in the help? I couldn't find it, but, but maybe I wasn't searching the right way.

QRecall Help > Capture Basics > Restore Items. After the basic restore instructions, there a note on how to restore a volume.

Actually, the title of the page is "Recall Items", not "Restore Items". And that was part of my problem. I knew the command I wanted was "Restore To", so when I saw this page, the title caused me to think it was about the Recall command only, and I therefore didn't really look at it. Had I not known what the command name actually was, I would probably have read the "Recall Items" page more closely, and found the answer.

How were you searching? If I search the help for "restore volume" or "recall volume" the Recall Items page is the first hit.

First I tried "Restore To", including the double quotes. (Old Google habits die hard.) When that returned no hits, I tried it again without the quotes.

I finally noticed the "Show Owners & Volumes" command in the View menu.

Oh, that's embarrassing. The help really should mention that you must open the owners & volumes drawer in order to select an entire volume.

Yeah, that step 1 "Select a captured volume" was the hard part, if you don't know about the Owners & Volumes drawer.

Is the "Restore To" command documented in the help? I couldn't find it, but, but maybe I wasn't searching the right way.

Anyway, I had created an external bootable backup drive for our server. No backups on it yet, just the OS and QRecall. I wanted to clone it (using QRecall) to our second external backup drive. I successfully archived the first backup drive. Then it took me a good 20 minutes to figure out how to get the "Restore To" command to work. I kept selecting everything in the archive window, but when I'd hold down the Option key and click on the Archive menu, the Restore To command was there, but grayed out.

I finally noticed the "Show Owners & Volumes" command in the View menu. I had never used it before, but I tried it. Sure enough, the little drawer popped out, and then I was able to click on the volume name. Once I did that, then the Restore To command was enabled. I restored to the second backup drive, set it as the startup disk, and rebooted, and it came right up.

So, everything worked perfectly once I figured it out, but it was not obvious (at least not to me) how to get Restore To to work.

-- Bruce