From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: try to search for data csums in commit root
Date: Mon, 7 Oct 2024 15:30:45 -0700 [thread overview]
Message-ID: <20241007223045.GA388113@zen.localdomain> (raw)
In-Reply-To: <6a34b96c-0b00-46c8-a0e8-69f0028173e4@suse.com>
On Sat, Oct 05, 2024 at 04:30:29PM +0930, Qu Wenruo wrote:
>
>
> 在 2024/10/5 08:53, Boris Burkov 写道:
> > If you run a workload like:
> > - a cgroup that does tons of data reading, with a harsh memory limit
> > - a second cgroup that tries to write new files
> > e.g.:
> > https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh
> >
> > then what quickly occurs is:
> > - a high degree of contention on the csum root node eb rwsem
> > - memory starved cgroup doing tons of reclaim on CPU.
> > - many reader threads in the memory starved cgroup "holding" the sem
> > as readers, but not scheduling promptly. i.e., task __state == 0, but
> > not running on a cpu.
> > - btrfs_commit_transaction stuck trying to acquire the sem as a writer.
> >
> > This results in VERY long transactions. On my test system, that script
> > produces 20-30s long transaction commits. This then results in
> > seriously degraded performance for any cgroup using the filesystem (the
> > victim cgroup in the script).
> >
> > This reproducer is a bit silly, as the villanous cgroup is using almost
> > all of its memory.max for kernel memory (specifically pagetables) but it
> > sort of doesn't matter, as I am most interested in the btrfs locking
> > behavior. It isn't an academic problem, as we see this exact problem in
> > production at Meta with one cgroup over memory.max ruining btrfs
> > performance for the whole system.
> >
> > The underlying scheduling "problem" with global rwsems is sort of thorny
> > and apparently well known. e.g.
> > https://lpc.events/event/18/contributions/1883/
> >
> > As a result, our main lever in the short term is just trying to reduce
> > contention on our various rwsems. In the case of the csum tree, we can
> > either redesign btree locking (hard...) or try to use the commit root
> > when we can. Luckily, it seems likely that many reads are for old extents
> > written many transactions ago, and that for those we *can* in fact
> > search the commit root!
>
> The idea looks good to me.
>
> The extent_map::generation is updated to the larger one during merge, so if
> we got a em whose generation is smaller than the current generation it's
> definitely older.
>
> And since data extents in commit root won't be overwritten until the current
> transaction committed, so it should also be fine.
>
>
> But my concern is, the path->need_commit_sem is only blocking transaction
> from happening when the path is holding something.
> And inside search_csum_tree() we can release the path halfway, would that
> cause 2 transaction to be committed during that release window?
>
> Shouldn't we hold the semaphore manually inside btrfs_lookup_bio_sums()
> other than relying on the btrfs_path::need_commit_sem?
Yes, I think you are right. Good catch! I will test that version and
re-send, assuming it still works well.
>
> Thanks,
> Qu
> >
> > This change detects when we are trying to read an old extent (according
> > to extent map generation) and then wires that through bio_ctrl to the
> > btrfs_bio, which unfortunately isn't allocated yet when we have this
> > information. When we go to lookup the csums in lookup_bio_sums we can
> > check this condition on the btrfs_bio and do the commit root lookup
> > accordingly.
> >
> > With the fix, on that same test case, commit latencies no longer exceed
> > ~400ms on my system.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > fs/btrfs/bio.h | 1 +
> > fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
> > fs/btrfs/file-item.c | 7 +++++++
> > 3 files changed, 29 insertions(+)
> >
> > diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> > index e48612340745..159f6a4425a6 100644
> > --- a/fs/btrfs/bio.h
> > +++ b/fs/btrfs/bio.h
> > @@ -48,6 +48,7 @@ struct btrfs_bio {
> > u8 *csum;
> > u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> > struct bvec_iter saved_iter;
> > + bool commit_root_csum;
> > };
> > /*
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index cb0a39370d30..8544fe2302ff 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -108,6 +108,21 @@ struct btrfs_bio_ctrl {
> > * This is to avoid touching ranges covered by compression/inline.
> > */
> > unsigned long submit_bitmap;
> > + /*
> > + * If this is a data read bio, we set this to true if it is safe to
> > + * search for csums in the commit root. Otherwise, it is set to false.
> > + *
> > + * This is an optimization to reduce the contention on the csum tree
> > + * root rwsem. Due to how rwsem is implemented, there is a possible
> > + * priority inversion where the readers holding the lock don't get
> > + * scheduled (say they're in a cgroup stuck in heavy reclaim) which
> > + * then blocks btrfs transactions. The only real help is to try to
> > + * reduce the contention on the lock as much as we can.
> > + *
> > + * Do this by detecting when a data read is reading data from an old
> > + * transaction so it's safe to look in the commit root.
> > + */
> > + bool commit_root_csum;
> > };
> > static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
> > @@ -770,6 +785,9 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
> > alloc_new_bio(inode, bio_ctrl, disk_bytenr,
> > folio_pos(folio) + pg_offset);
> > }
> > + if (btrfs_op(&bio_ctrl->bbio->bio) == BTRFS_MAP_READ && is_data_inode(inode))
> > + bio_ctrl->bbio->commit_root_csum = bio_ctrl->commit_root_csum;
> > +
> > /* Cap to the current ordered extent boundary if there is one. */
> > if (len > bio_ctrl->len_to_oe_boundary) {
> > @@ -1048,6 +1066,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > if (prev_em_start)
> > *prev_em_start = em->start;
> > + if (em->generation < btrfs_get_fs_generation(fs_info))
> > + bio_ctrl->commit_root_csum = true;
> > +
> > free_extent_map(em);
> > em = NULL;
> > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> > index 886749b39672..2433b169a4e6 100644
> > --- a/fs/btrfs/file-item.c
> > +++ b/fs/btrfs/file-item.c
> > @@ -401,6 +401,13 @@ blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
> > path->skip_locking = 1;
> > }
> > + /* See the comment on btrfs_bio_ctrl->commit_root_csum. */
> > + if (bbio->commit_root_csum) {
> > + path->search_commit_root = 1;
> > + path->skip_locking = 1;
> > + path->need_commit_sem = 1;
> > + }
> > +
> > while (bio_offset < orig_len) {
> > int count;
> > u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
>
next prev parent reply other threads:[~2024-10-07 22:30 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-04 23:23 [PATCH] btrfs: try to search for data csums in commit root Boris Burkov
2024-10-05 7:00 ` Qu Wenruo
2024-10-07 22:30 ` Boris Burkov [this message]
2024-10-07 23:43 ` Qu Wenruo
2024-10-08 6:00 ` Boris Burkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241007223045.GA388113@zen.localdomain \
--to=boris@bur.io \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox