From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
Date: Mon, 22 May 2017 09:19:34 +0000 (UTC)
Message-ID: <pan$df61e$e305754a$711742a$a2be8e5d@cox.net>
In-Reply-To: 20170522013553.hspdrwpmxe5kyoas@merlins.org
Marc MERLIN posted on Sun, 21 May 2017 18:35:53 -0700 as excerpted:
> On Sun, May 21, 2017 at 04:45:57PM -0700, Marc MERLIN wrote:
>> On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote:
>> > gargamel:~# btrfs check --repair /dev/mapper/dshelf1
>> > enabling repair mode
>> > Checking filesystem on /dev/mapper/dshelf1
>> > UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
>> > checking extents
>> >
>> > This causes a bunch of these:
>> > btrfs-transacti: page allocation stalls for 23508ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
>> > btrfs-transacti cpuset=/ mems_allowed=0
>> >
>> > What's the recommended way out of this and which code is at fault? I
>> > can't tell if btrfs is doing memory allocations wrong, or if it's
>> > just being undermined by the block layer dying underneath.
>>
>> I went back to 4.8.10, and similar problem.
>> It looks like btrfs check exercises the kernel and causes everything to
>> come down to a halt :(
btrfs check is userspace, not kernelspace. The btrfs-transacti threads
are indeed kernelspace, but the problem would appear to be either IO or
memory starvation triggered by the userspace check hogging all available
resources, not leaving enough for the rest of the system, kernel threads
included.
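To see which tasks are being starved and for how long, the stall warnings can be pulled out of the kernel log. The following is only a minimal sketch: the pattern is modelled on the single warning quoted above, so other kernel versions may format the line differently, and parse_stall and worst_stall are hypothetical helper names.

```python
import re

# Pattern modelled on the one stall warning quoted in this thread;
# treat the field layout as an assumption, not a stable kernel format.
STALL_RE = re.compile(
    r"(?P<comm>\S+): page allocation stalls for (?P<ms>\d+)ms, "
    r"order:(?P<order>\d+)"
)


def parse_stall(line):
    """Return (task name, stall in ms, allocation order), or None if no match."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    return m.group("comm"), int(m.group("ms")), int(m.group("order"))


def worst_stall(lines):
    """Map each task name to its longest reported stall in milliseconds."""
    worst = {}
    for line in lines:
        parsed = parse_stall(line)
        if parsed is not None:
            comm, ms, _order = parsed
            worst[comm] = max(ms, worst.get(comm, 0))
    return worst
```

Feeding it the lines of a saved kernel log gives a quick per-task summary; a kernel thread such as btrfs-transacti showing multi-second stalls while check runs is consistent with the starvation described above.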
Check is /known/ to be memory intensive, with multi-TB filesystems often
requiring tens of GiB of memory, and qgroups and snapshots are both known
to dramatically intensify the scaling issues. (btrfs balance, by
contrast, has the same scaling issues, but is kernelspace.)
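Given that appetite, it can be worth looking at how much memory is actually available before starting a check. A minimal sketch, assuming a Linux /proc/meminfo with a MemAvailable line; the function names, the needed_gib estimate and the 2 GiB headroom are illustrative assumptions, not figures from btrfs-progs.

```python
def mem_available_gib(meminfo_text):
    """Extract MemAvailable from /proc/meminfo-style text and return it in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib / (1024 * 1024)
    raise ValueError("no MemAvailable line found")


def safe_to_run(meminfo_text, needed_gib, headroom_gib=2.0):
    """True if available memory covers the estimate plus some headroom.

    needed_gib is the caller's own guess; nothing here can predict what
    btrfs check will really use on a given filesystem.
    """
    return mem_available_gib(meminfo_text) >= needed_gib + headroom_gib
```

On a live system the text would come from open("/proc/meminfo").read(). If the answer is False for an estimate in the tens of GiB, the memory stalls reported in this thread are not a surprise.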
That's one reason why (not all of these may apply to your case) ...
* Keeping the number of snapshots as low as possible is strongly
recommended by pretty much everyone here, definitely under 300 per
subvolume and, if possible, down to double digits per subvolume.
* I personally recommend disabling qgroups, unless you're actively
working with the devs on improving them. In addition to the scaling
issues, quotas simply aren't reliable enough on btrfs yet to depend on
if the use-case requires them (in which case using a mature filesystem
where they're proven to work is recommended), and if it doesn't, there
are simply too many remaining issues for the qgroups option to be worth it.
* I personally recommend keeping overall filesystem size to something one
can reasonably manage. Most people's use-cases aren't going to allow for
an fsck taking days and tens of GiB, but /will/ allow for multi-TB
filesystems to be split out into multiple independent filesystems of
perhaps a TB or two each, tops, if that's the alternative to multiple-day
fscks taking tens of GiB. (Some use-cases are of course exceptions.)
* The low-memory-mode btrfs check is being developed, tho unfortunately
it doesn't yet do repairs. (Lower memory use isn't the only reason to
want it: it's an alternate implementation that provides a very useful
second opinion and the ability to cross-check one implementation against
the other in hard problem cases.)
(The two "I personally recommend" points above aren't recommendations
shared by everyone on the list, but obviously I've found them very useful
here. =:^)
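The snapshot guideline in the first point is easy to turn into a routine check. A minimal sketch, assuming the per-subvolume snapshot counts have already been gathered by other means; snapshot_advice and its wording are hypothetical, and only the thresholds (under 300, ideally double digits) come from the recommendation above.

```python
def snapshot_advice(counts, hard_limit=300, soft_limit=100):
    """Classify each subvolume by its snapshot count.

    counts maps a subvolume name to the number of snapshots taken of it.
    hard_limit follows the "definitely under 300" guideline, soft_limit
    the "double digits if possible" one.
    """
    advice = {}
    for subvol, n in counts.items():
        if n >= hard_limit:
            advice[subvol] = "thin now"
        elif n >= soft_limit:
            advice[subvol] = "thin when possible"
        else:
            advice[subvol] = "ok"
    return advice
```

For example, snapshot_advice({"home": 42, "backup": 850}) flags only backup, the sort of subvolume that makes check and balance scale so badly.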
>> Sadly, I tried a scrub on the same device, and it stalled after 6TB.
>> The scrub process went zombie and the scrub never succeeded, nor could
>> it be stopped.
Quite apart from the "... after 6TB" bit setting off my own "it's too big
to reasonably manage" alarm, the filesystem is obviously damaged, and
scrub has a bug of its own, since it shouldn't just go zombie no matter
what the problem is -- it should fail much more gracefully.
Meanwhile, FWIW, unlike check, scrub /is/ kernelspace.
> So, putting the btrfs scrub that stalled issue aside, I didn't quite
> realize that btrfs check memory issues actually caused the kernel to eat
> all the memory until everything crashed/deadlocked/stalled.
> Is that actually working as intended?
> Why doesn't it fail and stop instead of taking my entire server down?
> Clearly there must be a rule against a kernel subsystem taking all the
> memory from everything until everything crashes/deadlocks, right?
As explained, check is userspace, but as you found, it can still
interfere with kernelspace, including unrelated btrfs-transaction
threads. When the system's out of memory, it's out of memory.
There is, tho, ongoing work on better predicting memory allocation needs
for btrfs kernel threads and reserving memory space accordingly, so that
this sort of thing stops happening.
Of course it could also be some sort of (not necessarily directly btrfs)
lockdep issue, and there's ongoing kernel-wide and btrfs work there as
well.
> So for now, I'm doing a lowmem check, but it's not going to be very
> helpful since it cannot repair anything if it finds a problem.
>
> At least my machine isn't crashing anymore, I suppose that's still an
> improvement.
> gargamel:~# btrfs check --mode=lowmem /dev/mapper/dshelf1
> We'll see how many days it takes.
Agreed. Lowmem mode looks like about your only option, beyond simply
blowing it away, at this point. Too bad it doesn't do repair yet, but
with a bit of luck it should at least give you and the devs some idea
what's wrong, information that can in turn be used to fix both scrub and
normal check mode, as well as low-mem repair mode, once it's available.
Of course your "days" comment is triggering my "it's too big to maintain"
reflex again, but obviously it's something you've found to be tolerable
or possibly required in your use-case, so who am I to second-guess...
maybe you have /files/ of multi-TB size, which of course kills the
split-the-filesystem-down-to-under-that idea. <shrug>
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman