From: Josef Bacik <josef@toxicpanda.com>
To: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 3/3] btrfs: do not hold the extent lock for entire read
Date: Mon, 26 Aug 2024 10:49:47 -0400
Message-ID: <20240826144947.GF2393039@perftesting>
In-Reply-To: <tu2lxd7s3obnk6my6jcypnid44ndz2hpdmarxyrrowvkrpftqb@7dluyzmfc2is>
On Sat, Aug 24, 2024 at 12:03:47PM -0400, Goldwyn Rodrigues wrote:
> On 12:02 21/08, Josef Bacik wrote:
> > Historically we've held the extent lock throughout the entire read.
> > There have been a few reasons for this, but mostly it has just caused
> > us problems. For example, it prevents us from allowing page faults
> > during direct io reads, because we could deadlock. This has forced us
> > to allow only 4k reads at a time for io_uring NOWAIT requests, because
> > we have no idea whether we'll be forced to page fault and thus have to
> > do a whole lot of work.
> >
> > On the buffered side we are protected by the page lock while we're
> > reading: buffered writes, punch hole, and even direct IO to a certain
> > degree will get hung up on the page lock while the page is in flight.
> >
> > On the direct side we have the dio extent lock, which acts much the
> > way the extent lock worked prior to this patch, but only for direct
> > reads. This protects direct reads from concurrent direct writes, while
> > the inode lock protects us from buffered writes.
> >
> > Now that we're protected in all cases, narrow the extent lock to the
> > part where we're getting the extent map to submit the reads, no longer
> > holding the extent lock for the entire read operation.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/btrfs/compression.c | 2 +-
> > fs/btrfs/direct-io.c | 51 ++++++++++++------------
> > fs/btrfs/extent_io.c | 88 +++---------------------------------------
> > 3 files changed, 31 insertions(+), 110 deletions(-)
> >
> > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> > index 832ab8984c41..511f81f312af 100644
> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -521,6 +521,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
> > }
> > add_size = min(em->start + em->len, page_end + 1) - cur;
> > free_extent_map(em);
> > + unlock_extent(tree, cur, page_end, NULL);
> >
> > if (folio->index == end_index) {
> > size_t zero_offset = offset_in_folio(folio, isize);
> > @@ -534,7 +535,6 @@ static noinline int add_ra_bio_pages(struct inode *inode,
> >
> > if (!bio_add_folio(orig_bio, folio, add_size,
> > offset_in_folio(folio, cur))) {
> > - unlock_extent(tree, cur, page_end, NULL);
> > folio_unlock(folio);
> > folio_put(folio);
> > break;
> > diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> > index 576f469cacee..66f1ce5fcfd2 100644
> > --- a/fs/btrfs/direct-io.c
> > +++ b/fs/btrfs/direct-io.c
> > @@ -366,7 +366,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> > int ret = 0;
> > u64 len = length;
> > const u64 data_alloc_len = length;
> > - bool unlock_extents = false;
> > + u32 unlock_bits = EXTENT_LOCKED;
> >
> > /*
> > * We could potentially fault if we have a buffer > PAGE_SIZE, and if
> > @@ -527,7 +527,6 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> > start, &len, flags);
> > if (ret < 0)
> > goto unlock_err;
> > - unlock_extents = true;
> > /* Recalc len in case the new em is smaller than requested */
> > len = min(len, em->len - (start - em->start));
> > if (dio_data->data_space_reserved) {
> > @@ -548,23 +547,8 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> > release_offset,
> > release_len);
> > }
> > - } else {
> > - /*
> > - * We need to unlock only the end area that we aren't using.
> > - * The rest is going to be unlocked by the endio routine.
> > - */
> > - lockstart = start + len;
> > - if (lockstart < lockend)
> > - unlock_extents = true;
> > }
> >
> > - if (unlock_extents)
> > - clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> > - EXTENT_LOCKED | EXTENT_DIO_LOCKED,
> > - &cached_state);
> > - else
> > - free_extent_state(cached_state);
> > -
> > /*
> > * Translate extent map information to iomap.
> > * We trim the extents (and move the addr) even though iomap code does
> > @@ -583,6 +567,23 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
> > iomap->length = len;
> > free_extent_map(em);
> >
> > + /*
> > + * Reads will hold the EXTENT_DIO_LOCKED bit until the io is completed,
> > + * writes only hold it for this part. We hold the extent lock until
> > + * we're completely done with the extent map to make sure it remains
> > + * valid.
> > + */
> > + if (write)
> > + unlock_bits |= EXTENT_DIO_LOCKED;
> > +
> > + clear_extent_bit(&BTRFS_I(inode)->io_tree, lockstart, lockend,
> > + unlock_bits, &cached_state);
> > +
> > + /* We didn't use everything, unlock the dio extent for the remainder. */
> > + if (!write && (start + len) < lockend)
> > + unlock_dio_extent(&BTRFS_I(inode)->io_tree, start + len,
> > + lockend, NULL);
> > +
> > return 0;
> >
> > unlock_err:
> > @@ -615,9 +616,8 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> >
> > if (!write && (iomap->type == IOMAP_HOLE)) {
> > /* If reading from a hole, unlock and return */
> > - clear_extent_bit(&BTRFS_I(inode)->io_tree, pos,
> > - pos + length - 1,
> > - EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL);
> > + unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
> > + pos + length - 1, NULL);
> > return 0;
> > }
> >
> > @@ -628,10 +628,8 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> > btrfs_finish_ordered_extent(dio_data->ordered, NULL,
> > pos, length, false);
> > else
> > - clear_extent_bit(&BTRFS_I(inode)->io_tree, pos,
> > - pos + length - 1,
> > - EXTENT_LOCKED | EXTENT_DIO_LOCKED,
> > - NULL);
> > + unlock_dio_extent(&BTRFS_I(inode)->io_tree, pos,
> > + pos + length - 1, NULL);
> > ret = -ENOTBLK;
> > }
> > if (write) {
> > @@ -663,9 +661,8 @@ static void btrfs_dio_end_io(struct btrfs_bio *bbio)
> > dip->file_offset, dip->bytes,
> > !bio->bi_status);
> > } else {
> > - clear_extent_bit(&inode->io_tree, dip->file_offset,
> > - dip->file_offset + dip->bytes - 1,
> > - EXTENT_LOCKED | EXTENT_DIO_LOCKED, NULL);
> > + unlock_dio_extent(&inode->io_tree, dip->file_offset,
> > + dip->file_offset + dip->bytes - 1, NULL);
> > }
> >
> > bbio->bio.bi_private = bbio->private;
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 822e2bf8bc99..3f3baec23717 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -480,75 +480,6 @@ static void end_bbio_data_write(struct btrfs_bio *bbio)
> > bio_put(bio);
> > }
> >
> > -/*
> > - * Record previously processed extent range
> > - *
> > - * For endio_readpage_release_extent() to handle a full extent range, reducing
> > - * the extent io operations.
> > - */
> > -struct processed_extent {
> > - struct btrfs_inode *inode;
> > - /* Start of the range in @inode */
> > - u64 start;
> > - /* End of the range in @inode */
> > - u64 end;
> > - bool uptodate;
> > -};
> > -
> > -/*
> > - * Try to release processed extent range
> > - *
> > - * May not release the extent range right now if the current range is
> > - * contiguous to processed extent.
> > - *
> > - * Will release processed extent when any of @inode, @uptodate, the range is
> > - * no longer contiguous to the processed range.
> > - *
> > - * Passing @inode == NULL will force processed extent to be released.
> > - */
> > -static void endio_readpage_release_extent(struct processed_extent *processed,
> > - struct btrfs_inode *inode, u64 start, u64 end,
> > - bool uptodate)
> > -{
> > - struct extent_state *cached = NULL;
> > - struct extent_io_tree *tree;
> > -
> > - /* The first extent, initialize @processed */
> > - if (!processed->inode)
> > - goto update;
> > -
> > - /*
> > - * Contiguous to processed extent, just uptodate the end.
> > - *
> > - * Several things to notice:
> > - *
> > - * - bio can be merged as long as on-disk bytenr is contiguous
> > - * This means we can have page belonging to other inodes, thus need to
> > - * check if the inode still matches.
> > - * - bvec can contain range beyond current page for multi-page bvec
> > - * Thus we need to do processed->end + 1 >= start check
> > - */
> > - if (processed->inode == inode && processed->uptodate == uptodate &&
> > - processed->end + 1 >= start && end >= processed->end) {
> > - processed->end = end;
> > - return;
> > - }
> > -
> > - tree = &processed->inode->io_tree;
> > - /*
> > - * Now we don't have range contiguous to the processed range, release
> > - * the processed range now.
> > - */
> > - unlock_extent(tree, processed->start, processed->end, &cached);
> > -
> > -update:
> > - /* Update processed to current range */
> > - processed->inode = inode;
> > - processed->start = start;
> > - processed->end = end;
> > - processed->uptodate = uptodate;
> > -}
> > -
> > static void begin_folio_read(struct btrfs_fs_info *fs_info, struct folio *folio)
> > {
> > ASSERT(folio_test_locked(folio));
> > @@ -575,7 +506,6 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
> > {
> > struct btrfs_fs_info *fs_info = bbio->fs_info;
> > struct bio *bio = &bbio->bio;
> > - struct processed_extent processed = { 0 };
> > struct folio_iter fi;
> > const u32 sectorsize = fs_info->sectorsize;
> >
> > @@ -640,11 +570,7 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
> >
> > /* Update page status and unlock. */
> > end_folio_read(folio, uptodate, start, len);
> > - endio_readpage_release_extent(&processed, BTRFS_I(inode),
> > - start, end, uptodate);
> > }
> > - /* Release the last extent */
> > - endio_readpage_release_extent(&processed, NULL, 0, 0, false);
> > bio_put(bio);
> > }
> >
> > @@ -1019,11 +945,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > size_t pg_offset = 0;
> > size_t iosize;
> > size_t blocksize = fs_info->sectorsize;
> > - struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> >
> > ret = set_folio_extent_mapped(folio);
> > if (ret < 0) {
> > - unlock_extent(tree, start, end, NULL);
> > folio_unlock(folio);
> > return ret;
> > }
> > @@ -1047,14 +971,12 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > if (cur >= last_byte) {
> > iosize = folio_size(folio) - pg_offset;
> > folio_zero_range(folio, pg_offset, iosize);
> > - unlock_extent(tree, cur, cur + iosize - 1, NULL);
> > end_folio_read(folio, true, cur, iosize);
> > break;
> > }
> > em = __get_extent_map(inode, folio, cur, end - cur + 1,
> > em_cached);
> > if (IS_ERR(em)) {
> > - unlock_extent(tree, cur, end, NULL);
> > end_folio_read(folio, false, cur, end + 1 - cur);
> > return PTR_ERR(em);
> > }
> > @@ -1123,7 +1045,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > if (block_start == EXTENT_MAP_HOLE) {
> > folio_zero_range(folio, pg_offset, iosize);
> >
> > - unlock_extent(tree, cur, cur + iosize - 1, NULL);
> > end_folio_read(folio, true, cur, iosize);
> > cur = cur + iosize;
> > pg_offset += iosize;
> > @@ -1131,7 +1052,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > }
> > /* the get_extent function already copied into the folio */
> > if (block_start == EXTENT_MAP_INLINE) {
> > - unlock_extent(tree, cur, cur + iosize - 1, NULL);
> > end_folio_read(folio, true, cur, iosize);
> > cur = cur + iosize;
> > pg_offset += iosize;
> > @@ -1161,11 +1081,13 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
> > u64 end = start + folio_size(folio) - 1;
> > struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
> > struct extent_map *em_cached = NULL;
> > + struct extent_state *cached = NULL;
> > int ret;
> >
> > - btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
> > + btrfs_lock_and_flush_ordered_range(inode, start, end, &cached);
> >
> > ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> > + unlock_extent(&inode->io_tree, start, end, &cached);
> > free_extent_map(em_cached);
> >
> > /*
> > @@ -2350,13 +2272,15 @@ void btrfs_readahead(struct readahead_control *rac)
> > u64 start = readahead_pos(rac);
> > u64 end = start + readahead_length(rac) - 1;
> > struct extent_map *em_cached = NULL;
> > + struct extent_state *cached = NULL;
> > u64 prev_em_start = (u64)-1;
> >
> > - btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
> > + btrfs_lock_and_flush_ordered_range(inode, start, end, &cached);
> >
> > while ((folio = readahead_folio(rac)) != NULL)
> > btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
> >
> > + unlock_extent(&inode->io_tree, start, end, &cached);
> > if (em_cached)
> > free_extent_map(em_cached);
> > submit_one_bio(&bio_ctrl);
>
> I think you can combine the patch I sent a couple of days ago with this
> one. It will reduce the scope of extent locking/unlocking still further
> and limit it to fetching the extent_map within __get_extent_map(). I
> have tested the following patch on top of this one, and it passes
> xfstests. If not, I can send it as part of my next series.
Ok I'll integrate this and re-send. Thanks!
Josef
Thread overview: 6+ messages
2024-08-21 16:01 [PATCH 0/3] btrfs: no longer hold the extent lock for the entire read Josef Bacik
2024-08-21 16:01 ` [PATCH 1/3] btrfs: introduce EXTENT_DIO_LOCKED Josef Bacik
2024-08-21 16:02 ` [PATCH 2/3] btrfs: take the dio extent lock during O_DIRECT operations Josef Bacik
2024-08-21 16:02 ` [PATCH 3/3] btrfs: do not hold the extent lock for entire read Josef Bacik
2024-08-24 16:03 ` Goldwyn Rodrigues
2024-08-26 14:49 ` Josef Bacik [this message]