backup failures over network
Forum Index » Beta Version
Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

I'm still having intermittent, but fairly frequent, problems when backing up my MBP over wifi to a backup drive mounted on my iMac. It seems to be a network problem of some kind, but my network connection otherwise seems strong and the problem didn't really start until I installed Qrecall 2.0. Can you tell anything about what's causing it from my logs, and do you have any suggestions about how I might control it?

I've sent a report showing recent problems. The repairs and subsequent good backups after the failures were done with the backup drive mounted directly on the MBP.

Ralph
James Bucanek



Joined: 14-Feb-07 10:05
Messages: 1548
Online

Ralph,

Yikes, that's a lot of errors.

Most of the errors do appear to be related to network communications. The captures/verifies/repairs that fail predominantly report POSIX error 60 or 6.

Error 60 is an "operation timed out" error, usually associated with a network communications socket or device channel. Error 6 is a "no such device" error; in this context it usually means the volume/drive being addressed is no longer connected.
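For reference, here is a small Python sketch that maps those two codes to their symbolic names. The numeric values are the ones from the macOS/BSD errno table (they differ on other systems; on Linux, for example, ETIMEDOUT is 110), so the table is hard-coded rather than read from whatever machine runs the script. The system's own message for error 6 is "Device not configured", which in this context amounts to the "no such device" reading above. `describe_posix_error` is just an illustrative helper name, not anything in QRecall.

```python
# Darwin (macOS/BSD) errno values for the two codes seen in the log.
# Hard-coded because the numbers are platform-specific.
DARWIN_ERRNO = {
    6: ("ENXIO", "Device not configured"),
    60: ("ETIMEDOUT", "Operation timed out"),
}


def describe_posix_error(code):
    """Return a readable description of a macOS POSIX error code."""
    name, text = DARWIN_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"POSIX error {code} ({name}): {text}"


if __name__ == "__main__":
    for code in (60, 6):
        print(describe_posix_error(code))
```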

A lot of your failures follow the pattern of getting "timed out" errors, later followed by "no such device" errors. I suspect you're having network communications or remote storage device problems where the device initially stops responding to requests and later appears to go offline. You can see this in events such as the one starting on May-12, where the first capture starts but dies with an error 60. Subsequent actions then fail because they can't access any archive files (error 6).

I also see that your archive's volume tends to get mounted and unmounted a lot. At first I thought this might be an indication of a problem, but the timing wasn't quite right. Instead, it seems to be by design; you apparently mount the volume and then manually start a capture action. (Just FYI: if the volume can be automatically mounted, QRecall will mount, and unmount, the volume for you.)

I did find one really suspicious sequence of events that I think led to all of the problems on May-9. The volume was mounted at 12:23 and the capture action was manually run a few seconds later. The capture action ran until 13:17, at which time it encountered network timeout errors (60) that prevented it from finishing. But before that, it appears that the system was put to sleep:



(Please note that there's another possible problem here. Starting somewhere around OS X 10.10, the kernel is letting background processes run in the so-called "power nap" mode, where the system is mostly asleep but some background processes are still running. Unfortunately, QRecall seems to be one of the processes that it's allowing to run, but not enough of the rest of the system is awake to function correctly and errors ensue.)

It's been my experience that network sockets don't like to be put to sleep, and can take quite a while to recover when the system wakes up again, which is probably the source of that particular failure.
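As a generic illustration of coping with a connection that needs time to recover (this is not a description of what QRecall does internally), a client can retry a timed-out operation with an increasing delay. A minimal Python sketch, where `retry_on_timeout` and its parameters are hypothetical names chosen for the example:

```python
import errno
import time


def retry_on_timeout(operation, attempts=4, first_delay=1.0, sleep=time.sleep):
    """Call operation(); on a timeout, wait and retry with a doubling delay."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as err:
            # Only timeouts are retried; any other error, or a timeout on
            # the final attempt, is re-raised to the caller.
            if err.errno != errno.ETIMEDOUT or attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

Errors other than a timeout (such as a "no such device" failure) are passed straight through, since waiting will not bring back a volume that has gone away.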

Without anything definitive to go on, I'd recommend trying to isolate the pieces one at a time and see if you can find some improvement.

First, is it possible to eliminate the network and server for a trial and connect the archive drive directly to the system, just to make sure it's not the drive or something else?

Then try to replace pieces one at a time to see if that makes any difference. Try a different network connection. If you're using WiFi, try to hook up a hard-wired ethernet cable. If you're using ethernet, see if you can use IP-over-Firewire or something. Can you move the archive drive to a different server?

I know I'm not being terribly helpful, but I hope it gives you some ideas to try.

- QRecall Development -
Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

This is an update on the intermittent "problem closing file" issue I've had with qrecall v2. It only happens when I back up my MBP over wifi to a backup drive mounted on an iMac, and it has occurred with two different backup drives, so it seemed to be a network problem.

I eliminated the wifi connection by shutting off wifi and connecting the two computers via firewire, which took care of the problem but makes backing up more cumbersome. Then a couple of weeks ago it occurred to me that when I connected the computers that way I was also waking up the iMac, which a wifi backup didn't always seem to do, so I decided to try wifi backups while keeping the iMac awake with the Caffeine utility. That seemed to work as well, with both backup drives, and while still inconvenient, is less trouble than moving the MBP into the same room with the iMac and physically changing the network connection.
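For anyone who would rather script the keep-awake step than use the Caffeine utility: macOS ships a built-in `caffeinate` command that does a similar job (this is a swapped-in tool, not the utility mentioned above). A rough Python sketch, meant to be run on the machine hosting the backup drive; the helper names are made up for the example:

```python
import subprocess


def caffeinate_command(seconds):
    """Build the argv for macOS's built-in caffeinate tool.

    -i prevents idle sleep; -t sets how long (in seconds) it stays in effect.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return ["caffeinate", "-i", "-t", str(int(seconds))]


def keep_awake(seconds):
    """Start caffeinate in the background (macOS only) and return the process."""
    return subprocess.Popen(caffeinate_command(seconds))
```

For example, `keep_awake(2 * 60 * 60)` would hold off idle sleep for two hours, which should cover a typical backup window.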

Today I decided to try wifi again and again got the error, so I followed up immediately with a backup over firewire. I then noticed that the earlier wifi backup had apparently completed, and added a layer to the archive even as it was complaining about a "problem closing file." (I've also noticed this occasionally in the past.)

I've been using qrecall since 2007 and this problem only cropped up with v2, so I'm guessing that something in v2 changed what happens with the wifi connection when the target computer is asleep.

Ralph
James Bucanek



Joined: 14-Feb-07 10:05
Messages: 1548
Online

Ralph Strauch wrote: Today I decided to try wifi again and again got the error, so I followed up immediately with a backup over firewire. I then noticed that the earlier wifi backup had apparently completed, and added a layer to the archive even as it was complaining about a "problem closing file." (I've also noticed this occasionally in the past.)

There are two places the "problem closing file" can occur. The first is when there's a permanent failure of some kind (say the network or drive gets disconnected) and QRecall can't close its open files while trying to clean up and terminate.

The other, which is what appears to have happened here, can occur when everything goes the way it should, but just as QRecall is finishing up and closing the completed files, the OS complains that something went wrong. All of the data was successfully written, but when QRecall went to close the very last file, the network hung for 8 minutes and then reported that the file couldn't be closed (POSIX error 60, "timeout"). In reality, all of the data was probably written, which is why the archive was intact and you had a complete layer.

I don't have any good theories as to why this might happen. It's possible that the server simply had a lot of unwritten data to write, and the network operation timed out before the server had finished writing all of its buffered data. But that would require either tens of GB of unwritten data on a really, really slow drive, or a server that was frightfully busy doing other things at the same time.
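To illustrate the general idea of a close that fails after the data has already gone out, here is a minimal Python sketch using the BSD-style file calls. It flushes explicitly before closing, so a failure at the close step can be told apart from a failure to write. It is only a sketch and not QRecall's code: `write_and_close` is a hypothetical helper, and how much `fsync` actually guarantees on a network volume depends on the file system and server.

```python
import os
import tempfile


def write_and_close(path, data):
    """Write data, flush it to storage, then close; report which stage failed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    stage = "write"
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        stage = "fsync"
        os.fsync(fd)  # ask the OS to push buffered data out to the device/server
    except OSError as err:
        try:
            os.close(fd)
        except OSError:
            pass
        return f"failed during {stage}: errno {err.errno}"
    try:
        os.close(fd)
    except OSError as err:
        # The flush already succeeded, so this failure is at the close step only.
        return f"data flushed, but close failed: errno {err.errno}"
    return "ok"


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "example.bin")
    print(write_and_close(target, b"archive data"))
```

A result of "data flushed, but close failed" would correspond to the second case: an error is reported at the very end, yet the contents written before it are likely intact.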

Ralph Strauch wrote: I've been using qrecall since 2007 and this problem only cropped up with v2, so I'm guessing that something in v2 changed what happens with the wifi connection when the target computer is asleep.

This is much more likely due to the change in filesystem API. QRecall 1.x uses the legacy Carbon API while QRecall 2.x uses the BSD (UNIX) API. Each API has its own idiosyncrasies and error handling, so there are bound to be some behavioral differences between the two.

- QRecall Development -