From: Josef Bacik <josef@toxicpanda.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY
Date: Mon, 12 Sep 2022 14:18:52 -0400
Message-ID: <Yx94DA75iNdsZUGX@localhost.localdomain>
In-Reply-To: <20220912095907.GA269395@falcondesktop>
On Mon, Sep 12, 2022 at 10:59:07AM +0100, Filipe Manana wrote:
> On Fri, Sep 09, 2022 at 09:35:01AM -0400, Josef Bacik wrote:
> > Inside of FB, as well as some user reports, we've had a consistent
> > problem of occasional ENOSPC transaction aborts. Inside FB we were
> > seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
> > low occurrence rate given the size of our fleet, but it's not nothing.
> >
> > There are two causes of this particular problem.
> >
> > First is delayed allocation. The reservation system for delalloc
> > assumes that contiguous dirty ranges will result in 1 file extent item.
> > However if there is memory pressure that results in fragmented writeout,
> > or there is fragmentation in the block groups, this won't necessarily be
> > true. Consider the case where we do a single 256MiB write to a file and
> > then close it. We will have 1 reservation for the inode update, the
> > reservations for the checksum updates, and 1 reservation for the file
> > extent item. At some point later we decide to write this entire range
> > out, but we're so fragmented that we break this into 100 different file
> > extents. Since we've already closed the file and are no longer writing
> > to it there's nothing to trigger a refill of the delalloc block rsv to
> > satisfy the 99 new file extent reservations we need. At this point we
> > exhaust our delalloc reservation, and we begin to steal from the global
> > reserve. If you have enough of these cases going in parallel you can
> > easily exhaust the global reserve, get an ENOSPC at
> > btrfs_alloc_tree_block() time, and then abort the transaction.
>
> There's also another problem I pointed out in the past regarding delalloc
> reservations. The thing is that we assume for each ordered extent, the new
> file extent item will require changing only 1 leaf in the subvolume tree.
>
> If our new extent has a size of 128M and currently for that range we
> have 32768 extents each with a size of 4K, then we need to touch 157
> leaves in order to drop those file extent items before inserting the new
> one at ordered extent completion. And our reservation that happened at
> buffered write time only accounted for 1 leaf/path for file extent items
> (plus 1 for the inode item). This is assuming the default leaf size of 16K,
> where we can have at most 208 file extent items per leaf.
>
Yeah, I should have been less absolute in my descriptions; as you point out,
there are a few other situations. This one is another good example, and it's
even more annoying because there isn't even the "oh we're going to allocate
another file extent at some point in the future" buffer, it happens right as
you're completing that ordered extent.
>
> That's another elephant in the room. We assume that if a task reserves
> space, it will be able to allocate that space later.
>
> There are several things that can happen which will result in not being
> able to allocate space we reserved before:
>
> 1) Discard/fitrim - it removes a free space entry, does the discard, and
> after that it adds back the free space entry. If all the available space
> is in such an entry being discarded, the task fails to allocate space;
>
Yup, I've worried this was happening before, but when I tracked down the
actual problem this wasn't the issue we were seeing, so I set aside fixing it.
For this I want to have another pass through the allocator that waits for all
discards to finish and doesn't allow new ones to start, so we know we can
actually make our allocation.
> 2) fsync - it joins a transaction and starts allocating space for tree
> blocks without ever reserving space (because we want it to be fast, and in
> most cases we can't know in advance, or estimate, how many tree blocks we
> will need to allocate). So it can steal space that was reserved by some
> other task;
I'm far less worried about this than I used to be. Before, we had all these
heuristics about when we would commit the transaction for ENOSPC flushing.
Now we unconditionally commit it and thus free all this space. We may need to
add another "wait for the tree log freeing code to finish running" wait, but
this should be straightforward to address.
>
> 3) Scrub - scrub temporarily turns a block group into RO mode - if all the
> available space was in that block group, then when the task tries to
> allocate it will fail because the block group is now RO;
>
Another thing we could add to the ENOSPC flushing thing, probably as a last
ditch "stop scrub, we're really in trouble here". It could even go in the
allocator loop like discard where we definitely have the space, it's just
temporarily tied up.
> 4) During space reservation we only check the free space counters. There
> may be block groups with plenty of free space but their profile is not
> compatible, so when trying to allocate an extent we are forced to allocate
> a new block group with the necessary profile, which will fail if there's
> not enough unallocated space.
> This mostly affects degraded mode only (hopefully).
Yeah, this really shouldn't happen; everything should be the same profile.
Which is to say all discrete types should be on the same profile: if we had
some on dupe metadata but some on single metadata mixed into the same
space_info, we're going to have a bad time. We could probably address this by
making the space_infos completely tied to their profile and their
data/metadata type, but that might be hairy.
Degraded is special, but I'm happy slapping a "try not to use degraded mode
heavily in production" warning label on it. Thanks,
Josef