PS. We might be interested in getting a better estimate of how long a
quotacheck would take. From an old thread on the mailing list, we see
this suggestion:

    xfstests:src/bstat

We're a bit worried about running this on the live system, because it
might impact performance substantially. Is that an unfounded worry?
I presume it's a read-only operation, so it would be safe to kill it if
we see performance degradation?

rgds,
Harry + the team.

On 05/03/15 17:05, Harry wrote:
> Thanks for the reply Eric.
>
> One of our problems is that we're limited in terms of what
> manipulations we can apply to the live system, and so instead we've
> been running our experiments against the backup system, and you're
> quite right that DRBD may be introducing some weirdness of its own, so
> those experiments may not be safe to draw conclusions from.
>
> Here's what we know about the live system:
> -> it had an outage, equivalent to having its power cable yanked, or
>    doing an 'echo b > /proc/sysrq-trigger'
> -> when it came back, it decided to mount the drive without quotas.
> -> we saw a message in syslog saying "Failed to initialize disk quotas"
> -> last time we had to run a quotacheck (several months ago) it took
>    about 2 hours.
>
> We can repro the quotacheck issue on our test clusters, as follows:
> -> kick off a job that writes to the disk
> -> hard reboot with "echo b > /proc/sysrq-trigger"
> -> on next boot, see "Failed to initialize disk quotas" message, xfs
>    mounts without quotas
> -> soft reboot with "reboot"
> -> on next boot, see "Quotacheck needed: Please wait." message.
> -> Quotacheck completes some time later.
>
> So our best-case scenario is that, next time we reboot, we'll have an
> outage of about 2 hours. And our paranoid worst-case scenario,
> induced by our experiments with our drbd backup drives, is that the
> disk will actually turn out not to be mountable at all.
>
> Is that "quotacheck always required after hard reboot" behaviour that
> we're observing something you expected? You seemed to be saying that
> the fact that quotas are journaled should mean it's not needed?
>
> HP
>
> On 05/03/15 15:53, Eric Sandeen wrote:
>> On 3/5/15 7:15 AM, Harry wrote:
>>> Update -- so far, we've not managed to gain any confidence that we'll
>>> ever be able to re-mount that disk. The general consensus seems to be
>>> to fish all the data off the disk using rsync, and then move off XFS
>>> to ext4.
>>>
>>> Not a very helpful message for y'all to hear, I know. But if it's any
>>> help in prioritising your future work, I think the dealbreaker for us
>>> was the inescapable quotacheck on mount, which means that any time a
>>> fileserver goes down unexpectedly, we have an unavoidable,
>>> indeterminate-but-long period of downtime...
>>>
>>> hp
>> What you decide to use is up to you of course, and causes us no
>> heartbreak. :) But I think you fundamentally misunderstand the situation;
>> an unexpected fileserver failure should not result in a lengthy quotacheck
>> on xfs, because xfs quota is journaled, and will simply be replayed along with
>> the rest of the log.
>>
>> I honestly don't know what has led you to the conclusion that remounting
>> the filesystem will lead to any quotacheck at all, let alone a lengthy one.
>>
>>> * We're even a bit worried the disk might be in a broken state, such
>>> that the quotacheck won't actually complete successfully at all.
>> If your disk is broken, that's not a filesystem issue. It seems possible
>> that whatever drbd manipulation you're doing is causing an issue, but because
>> you haven't really explained it in detail, I don't know.
>>
>>> We take DRBD offline, so it's no longer writing, then we take
>>> snapshots of the drives, then remount those elsewhere so we can
>>> experiment without disturbing the live system.
>> Did you quiesce the filesystem first with e.g. xfs_freeze?
>>
>> So far this thread has been long on prose and speculation, and short
>> on actual analysis, log messages, etc. Feel free to use ext4 or whatever
>> suits you, but given that nothing in this thread has implicated misbehavior
>> by xfs, I don't think that switching filesystems will solve the perceived
>> problem.
>>
>> -Eric
>
> Rgds,
> Harry + the PythonAnywhere team.
>
> --
> Harry Percival
> Developer
> harry@pythonanywhere.com
>
> PythonAnywhere - a fully browser-based Python development and hosting environment
>
> PythonAnywhere LLP
> 17a Clerkenwell Road, London EC1M 5RD, UK
> VAT No.: GB 893 5643 79
> Registered in England and Wales as company number OC378414.
> Registered address: 28 Ely Place, 3rd Floor, London EC1N 6TD, UK
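
The repro sequence quoted above (hard reboot, then soft reboot) can be checked mechanically by looking for the two messages in each boot's log. What follows is only a minimal sketch, assuming the messages appear in the log text exactly as they are quoted in this thread; the function name `classify_boot`, the result labels, and the sample strings are hypothetical and not taken from any real log.

```python
# Message strings as quoted in this thread. That the log contains them
# verbatim is an assumption.
QUOTA_INIT_FAILED = "Failed to initialize disk quotas"
QUOTACHECK_NEEDED = "Quotacheck needed: Please wait."


def classify_boot(log_text: str) -> str:
    """Classify one boot's log text by which quota message it contains.

    The labels returned here are hypothetical names for illustration.
    """
    if QUOTACHECK_NEEDED in log_text:
        return "quotacheck"
    if QUOTA_INIT_FAILED in log_text:
        return "mounted-without-quotas"
    return "no-quota-message"


if __name__ == "__main__":
    # Hypothetical stand-ins for the log text of two consecutive boots.
    after_hard_reboot = "boot 1: " + QUOTA_INIT_FAILED
    after_soft_reboot = "boot 2: " + QUOTACHECK_NEEDED
    print(classify_boot(after_hard_reboot))
    print(classify_boot(after_soft_reboot))
```

Running this against the saved log text of each boot on a test cluster would give the kind of concrete per-boot record that the thread has been missing, without touching the live system.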