From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] xfs, iomap: limit individual ioend chain lengths in writeback
Date: Wed, 26 Jan 2022 09:20:42 -0800
Message-ID: <20220126172042.GA13540@magnolia>
In-Reply-To: <20220120034733.221737-1-david@fromorbit.com>
On Thu, Jan 20, 2022 at 02:47:33PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Trond Myklebust reported soft lockups in XFS IO completion such as
> this:
>
> watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
> CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
> Workqueue: xfs-conv/md127 xfs_end_io [xfs]
> RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
> Call Trace:
> wake_up_page_bit+0x8a/0x110
> iomap_finish_ioend+0xd7/0x1c0
> iomap_finish_ioends+0x7f/0xb0
> xfs_end_ioend+0x6b/0x100 [xfs]
> xfs_end_io+0xb9/0xe0 [xfs]
> process_one_work+0x1a7/0x360
> worker_thread+0x1fa/0x390
> kthread+0x116/0x130
> ret_from_fork+0x35/0x40
>
> Ioends are processed as an atomic completion unit when all the
> chained bios in the ioend have completed their IO. Logically
> contiguous ioends can also be merged and completed as a single,
> larger unit. Both of these things can be problematic because the
> bio chains per ioend and the size of the merged ioends processed as
> a single completion are both unbound.
>
> If we have a large sequential dirty region in the page cache,
> write_cache_pages() will keep feeding us sequential pages and we
> will keep mapping them into ioends and bios until we get a dirty
> page at a non-sequential file offset. These large sequential runs
> will result in bio and ioend chaining to optimise the IO
> patterns. The pages under writeback are pinned within these chains
> until the submission chaining is broken, allowing the entire chain
> to be completed. This can result in huge chains being processed
> in IO completion context.
>
> We get deep bio chaining if we have large contiguous physical
> extents. We will keep adding pages to the current bio until it is
> full, then we'll chain a new bio to keep adding pages for writeback.
> Hence we can build bio chains that map millions of pages and tens of
> gigabytes of RAM if the page cache contains big enough contiguous
> dirty file regions. This long bio chain pins those pages until the
> final bio in the chain completes and the ioend can iterate all the
> chained bios and complete them.
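>
> As a rough illustration of the submission-side pattern (a
> simplified sketch with an illustrative helper name, not the exact
> iomap code), chaining a new bio onto a full one looks like:
>
>	static struct bio *
>	chain_new_bio(struct bio *prev)
>	{
>		struct bio *new;
>
>		/* start the new bio where the full one left off */
>		new = bio_alloc(GFP_NOFS, BIO_MAX_VECS);
>		bio_copy_dev(new, prev);
>		new->bi_iter.bi_sector = bio_end_sector(prev);
>		new->bi_opf = prev->bi_opf;
>
>		/* link them so completion can walk the whole chain */
>		bio_chain(prev, new);
>		bio_get(prev);	/* reference held until ioend completion */
>		submit_bio(prev);
>		return new;
>	}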
>
> OTOH, if we have a physically fragmented file, we end up submitting
> one ioend per physical fragment, each with a small bio or bio
> chain attached. We do not chain these at IO submission time,
> but instead we chain them at completion time based on file
> offset via iomap_ioend_try_merge(). Hence we can end up with unbound
> ioend chains being built via completion merging.
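>
> The completion-time merge itself is just a list walk; a minimal
> sketch of the pattern (relying on the iomap_ioend_can_merge()
> checks shown in the patch below) looks like:
>
>	void
>	ioend_try_merge(struct iomap_ioend *ioend, struct list_head *more)
>	{
>		struct iomap_ioend *next;
>
>		INIT_LIST_HEAD(&ioend->io_list);
>
>		/* pull logically adjacent ioends onto this ioend's list */
>		while ((next = list_first_entry_or_null(more,
>				struct iomap_ioend, io_list))) {
>			if (!iomap_ioend_can_merge(ioend, next))
>				break;
>			list_move_tail(&next->io_list, &ioend->io_list);
>			ioend->io_size += next->io_size;
>		}
>	}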
>
> XFS can then do COW remapping or unwritten extent conversion on that
> merged chain, which involves walking an extent fragment at a time
> and running a transaction to modify the physical extent information.
> IOWs, we merge all the discontiguous ioends together into a
> contiguous file range, only to then process them individually as
> discontiguous extents.
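>
> For concreteness, that transactional work dispatches on the ioend
> type roughly like this (a condensed sketch of the xfs_end_ioend()
> logic, with error handling omitted):
>
>	if (ioend->io_flags & IOMAP_F_SHARED)
>		error = xfs_reflink_end_cow(ip, offset, size);
>	else if (ioend->io_type == IOMAP_UNWRITTEN)
>		error = xfs_iomap_write_unwritten(ip, offset, size, false);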
>
> This extent manipulation is computationally expensive and can run in
> a tight loop, so merging logically contiguous but physically
> discontiguous ioends gains us nothing except for hiding the fact
> that we broke the ioends up into individual physical extents at
> submission and then need to loop over those individual physical
> extents at completion.
>
> Hence we need to have mechanisms to limit ioend sizes and
> to break up completion processing of large merged ioend chains:
>
> 1. bio chains per ioend need to be bound in length. Pure overwrites
> go straight to iomap_finish_ioend() in softirq context with the
> exact bio chain attached to the ioend by submission. Hence the only
> way to prevent long holdoffs here is to bound ioend submission
> sizes because we can't reschedule in softirq context.
>
> 2. iomap_finish_ioends() has to handle unbound merged ioend chains
> correctly. This relies on any one call to iomap_finish_ioend() being
> bound in runtime so that cond_resched() can be issued regularly as
> the long ioend chain is processed. I.e. this relies on mechanism #1
> limiting individual ioend sizes in order to work correctly.
>
> 3. filesystems have to loop over the merged ioends to process
> physical extent manipulations. This means they can loop internally,
> and so we break merging at physical extent boundaries so the
> filesystem can easily insert reschedule points between individual
> extent manipulations.
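>
> To put rough numbers on this (assuming 4kB folios): capping each
> ioend at IOEND_BATCH_SIZE (4096) folios bounds a single ioend at
> ~16MB of page cache, and rescheduling after 8 such batches in
> iomap_finish_ioends() bounds each uninterrupted completion run at
> ~128MB of pages, which should keep individual scheduler holdoffs
> well below the soft lockup threshold.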
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
Looks good, sorry I didn't get to this earlier...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/iomap/buffered-io.c | 52 ++++++++++++++++++++++++++++++++++++++----
> fs/xfs/xfs_aops.c | 16 ++++++++++++-
> include/linux/iomap.h | 2 ++
> 3 files changed, 65 insertions(+), 5 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index c938bbad075e..6c51a75d0be6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -21,6 +21,8 @@
>
> #include "../internal.h"
>
> +#define IOEND_BATCH_SIZE 4096
> +
> /*
> * Structure allocated for each folio when block size < folio size
> * to track sub-folio uptodate status and I/O completions.
> @@ -1039,7 +1041,7 @@ static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
> * state, release holds on bios, and finally free up memory. Do not use the
> * ioend after this.
> */
> -static void
> +static u32
> iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> {
> struct inode *inode = ioend->io_inode;
> @@ -1048,6 +1050,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> u64 start = bio->bi_iter.bi_sector;
> loff_t offset = ioend->io_offset;
> bool quiet = bio_flagged(bio, BIO_QUIET);
> + u32 folio_count = 0;
>
> for (bio = &ioend->io_inline_bio; bio; bio = next) {
> struct folio_iter fi;
> @@ -1062,9 +1065,11 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> next = bio->bi_private;
>
> /* walk all folios in bio, ending page IO on them */
> - bio_for_each_folio_all(fi, bio)
> + bio_for_each_folio_all(fi, bio) {
> iomap_finish_folio_write(inode, fi.folio, fi.length,
> error);
> + folio_count++;
> + }
> bio_put(bio);
> }
> /* The ioend has been freed by bio_put() */
> @@ -1074,20 +1079,36 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> "%s: writeback error on inode %lu, offset %lld, sector %llu",
> inode->i_sb->s_id, inode->i_ino, offset, start);
> }
> + return folio_count;
> }
>
> +/*
> + * Ioend completion routine for merged bios. This can only be called from task
> + * contexts as merged ioends can be of unbound length. Hence we have to break up
> + * the writeback completions into manageable chunks to avoid long scheduler
> + * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
> + * good batch processing throughput without creating adverse scheduler latency
> + * conditions.
> + */
> void
> iomap_finish_ioends(struct iomap_ioend *ioend, int error)
> {
> struct list_head tmp;
> + u32 completions;
> +
> + might_sleep();
>
> list_replace_init(&ioend->io_list, &tmp);
> - iomap_finish_ioend(ioend, error);
> + completions = iomap_finish_ioend(ioend, error);
>
> while (!list_empty(&tmp)) {
> + if (completions > IOEND_BATCH_SIZE * 8) {
> + cond_resched();
> + completions = 0;
> + }
> ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
> list_del_init(&ioend->io_list);
> - iomap_finish_ioend(ioend, error);
> + completions += iomap_finish_ioend(ioend, error);
> }
> }
> EXPORT_SYMBOL_GPL(iomap_finish_ioends);
> @@ -1108,6 +1129,18 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
> return false;
> if (ioend->io_offset + ioend->io_size != next->io_offset)
> return false;
> + /*
> + * Do not merge physically discontiguous ioends. The filesystem
> + * completion functions will have to iterate the physical
> + * discontiguities even if we merge the ioends at a logical level, so
> + * we don't gain anything by merging physical discontiguities here.
> + *
> + * We cannot use bio->bi_iter.bi_sector here as it is modified during
> + * submission so does not point to the start sector of the bio at
> + * completion.
> + */
> + if (ioend->io_sector + (ioend->io_size >> 9) != next->io_sector)
> + return false;
> return true;
> }
>
> @@ -1209,8 +1242,10 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
> ioend->io_flags = wpc->iomap.flags;
> ioend->io_inode = inode;
> ioend->io_size = 0;
> + ioend->io_folios = 0;
> ioend->io_offset = offset;
> ioend->io_bio = bio;
> + ioend->io_sector = sector;
> return ioend;
> }
>
> @@ -1251,6 +1286,13 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
> return false;
> if (sector != bio_end_sector(wpc->ioend->io_bio))
> return false;
> + /*
> + * Limit ioend bio chain lengths to minimise IO completion latency. This
> + * also prevents long tight loops ending page writeback on all the
> + * folios in the ioend.
> + */
> + if (wpc->ioend->io_folios >= IOEND_BATCH_SIZE)
> + return false;
> return true;
> }
>
> @@ -1335,6 +1377,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> &submit_list);
> count++;
> }
> + if (count)
> + wpc->ioend->io_folios++;
>
> WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
> WARN_ON_ONCE(!folio_test_locked(folio));
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 2705f91bdd0d..9d6a67c7d227 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -136,7 +136,20 @@ xfs_end_ioend(
> memalloc_nofs_restore(nofs_flag);
> }
>
> -/* Finish all pending io completions. */
> +/*
> + * Finish all pending IO completions that require transactional modifications.
> + *
> + * We try to merge physically and logically contiguous ioends before completion to
> + * minimise the number of transactions we need to perform during IO completion.
> + * Both unwritten extent conversion and COW remapping need to iterate and modify
> + * one physical extent at a time, so we gain nothing by merging physically
> + * discontiguous extents here.
> + *
> + * The ioend chains that we can be processing here are largely unbound in
> + * length and we may have to perform significant amounts of work on each ioend
> + * to complete it. Hence we have to be careful about holding the CPU for too
> + * long in this loop.
> + */
> void
> xfs_end_io(
> struct work_struct *work)
> @@ -157,6 +170,7 @@ xfs_end_io(
> list_del_init(&ioend->io_list);
> iomap_ioend_try_merge(ioend, &tmp);
> xfs_end_ioend(ioend);
> + cond_resched();
> }
> }
>
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index b55bd49e55f5..97a3a2edb585 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -263,9 +263,11 @@ struct iomap_ioend {
> struct list_head io_list; /* next ioend in chain */
> u16 io_type;
> u16 io_flags; /* IOMAP_F_* */
> + u32 io_folios; /* folios added to ioend */
> struct inode *io_inode; /* file being written to */
> size_t io_size; /* size of the extent */
> loff_t io_offset; /* offset in the file */
> + sector_t io_sector; /* start sector of ioend */
> struct bio *io_bio; /* bio being built */
> struct bio io_inline_bio; /* MUST BE LAST! */
> };
> --
> 2.33.0
>