From: Trond Myklebust <trondmy@hammerspace.com>
To: "bfoster@redhat.com" <bfoster@redhat.com>,
"david@fromorbit.com" <david@fromorbit.com>
Cc: "djwong@kernel.org" <djwong@kernel.org>,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
"willy@infradead.org" <willy@infradead.org>,
"hch@infradead.org" <hch@infradead.org>,
"axboe@kernel.dk" <axboe@kernel.dk>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"trondmy@kernel.org" <trondmy@kernel.org>
Subject: Re: [PATCH] iomap: Address soft lockup in iomap_finish_ioend()
Date: Tue, 11 Jan 2022 14:33:11 +0000
Message-ID: <615fa140d0dd855a252de09490411be0afe5ae3d.camel@hammerspace.com>
In-Reply-To: <YdxwnaT0nYHgGQZR@bfoster>

On Mon, 2022-01-10 at 12:45 -0500, Brian Foster wrote:
> On Mon, Jan 10, 2022 at 07:18:47PM +1100, Dave Chinner wrote:
> > On Thu, Jan 06, 2022 at 11:44:23AM -0500, Brian Foster wrote:
> > > On Thu, Jan 06, 2022 at 09:04:21AM +1100, Dave Chinner wrote:
> > > > On Wed, Jan 05, 2022 at 08:56:33AM -0500, Brian Foster wrote:
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 71a36ae120ee..39214577bc46 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1066,17 +1066,34 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> > > >  	}
> > > >  }
> > > >  
> > > > +/*
> > > > + * Ioend completion routine for merged bios. This can only be called from task
> > > > + * contexts as merged ioends can be of unbound length. Hence we have to break up
> > > > + * the page writeback completion into manageable chunks to avoid long scheduler
> > > > + * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
> > > > + * good batch processing throughput without creating adverse scheduler latency
> > > > + * conditions.
> > > > + */
> > > >  void
> > > >  iomap_finish_ioends(struct iomap_ioend *ioend, int error)
> > > >  {
> > > >  	struct list_head tmp;
> > > > +	int segments;
> > > > +
> > > > +	might_sleep();
> > > >  
> > > >  	list_replace_init(&ioend->io_list, &tmp);
> > > > +	segments = ioend->io_segments;
> > > >  	iomap_finish_ioend(ioend, error);
> > > >  
> > > >  	while (!list_empty(&tmp)) {
> > > > +		if (segments > 32768) {
> > > > +			cond_resched();
> > > > +			segments = 0;
> > > > +		}
> > >
> > > How is this intended to address the large bi_vec scenario? AFAICT
> > > bio_segments() doesn't account for multipage bvecs so the above logic
> > > can allow something like 34b (?) 4k pages before a yield.
> >
> > Right now the bvec segment iteration in iomap_finish_ioend() is
> > completely unaware of multipage bvecs - as per above
> > bio_for_each_segment_all() iterates by PAGE_SIZE within a bvec,
> > regardless of whether they are stored in a multipage bvec or not.
> > Hence it always iterates the entire bio a single page at a time.
> >
> > IOWs, we don't use multi-page bvecs in iomap writeback, nor is it
> > aware of them at all. We're adding single pages to bios via
> > bio_add_page() which may merge them internally into multipage
> > bvecs.
> > However, all our iterators use single page interfaces, hence we
> > don't see the internal multi-page structure of the bio at all.
> > As such, bio_segments() should return the number of PAGE_SIZE pages
> > attached to the bio regardless of its internal structure.
> >
>
> That is pretty much the point. The completion loop doesn't really care
> whether the amount of page processing work is due to a large bio chain,
> multipage bi_bvec(s), merged ioends, or some odd combination thereof.
> As you note, these conditions can manifest from various layers above or
> below iomap. I don't think iomap really needs to know or care about any
> of this. It just needs to yield when it has spent "too much" time
> processing pages.
>
> With regard to the iterators, my understanding was that
> bio_for_each_segment_all() walks the multipage bvecs but
> bio_for_each_segment() does not, but that could certainly be wrong as I
> find the iterators a bit confusing. Either way, the most recent test
> with the ioend granular filter implies that a single ioend can still
> become a soft lockup vector from non-atomic context.
>
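FWIW, here is how I read the "all" variant (quite possibly wrong, and the
helper below is just made-up illustration, not anything in the tree):
bio_for_each_segment_all() advances a page at a time even when
bio_add_page() has merged physically contiguous pages into a single bvec,
so counting its iterations counts pages rather than bvecs:

static unsigned int count_bio_pages(struct bio *bio)
{
	struct bio_vec *bvec;
	struct bvec_iter_all iter_all;
	unsigned int npages = 0;

	/* one iteration per PAGE_SIZE chunk, even inside a multipage bvec */
	bio_for_each_segment_all(bvec, bio, iter_all)
		npages++;

	return npages;
}

If that reading is correct, then a yield threshold driven by a count like
this would bound the completion work per batch no matter how the pages
happen to be packed into bvecs.
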
> > That is what I see on a trace from a large single file submission,
> > comparing bio_segments() output from the page count on an ioend:
> >
> > kworker/u67:2-187 [017] 13530.753548: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x370400 bi_vcnt 1, bi_size 16777216
> > kworker/u67:2-187 [017] 13530.759706: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x378400 bi_vcnt 1, bi_size 16777216
> > kworker/u67:2-187 [017] 13530.766326: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x380400 bi_vcnt 1, bi_size 16777216
> > kworker/u67:2-187 [017] 13530.770689: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x388400 bi_vcnt 1, bi_size 16777216
> > kworker/u67:2-187 [017] 13530.774716: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x390400 bi_vcnt 1, bi_size 16777216
> > kworker/u67:2-187 [017] 13530.777157: iomap_writepages: 3. bios 2048, pages 2048, start sector 0x398400 bi_vcnt 1, bi_size 8388608
> >
> > Which shows we are building ioends with a single bio with a single
> > bvec, containing 4096 pages and 4096 bio segments. So, as expected,
> > bio_segments() matches the page count and we submit 4096 page ioends
> > with a single bio attached to it.
> >
> > This is clearly a case where we are getting physically contiguous
> > page cache page allocation during write() syscalls, and the result
> > is a single contiguous bvec from bio_add_page() doing physical page
> > merging at the bvec level. Hence we see bio->bi_vcnt = 1 and a
> > physically contiguous 4096 multipage bvec being dispatched. The
> > lower layers slice and dice these huge bios to what the hardware can
> > handle...
> >
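Just to spell out the arithmetic in that trace (my own back-of-the-envelope,
not anything from the patch):

	16777216 bytes (bi_size) / 4096 bytes per page = 4096 pages per bio

so a bio with bi_vcnt == 1 still carries 4096 pages' worth of completion
work, which matches the "pages 4096" column above.
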
>
> I think we're in violent agreement here. That is the crux of multipage
> bvecs and what I've been trying to point out [1]. Ming (who I believe
> implemented it) pointed this out back when the problem was first
> reported. This is also why I asked Trond to test out the older patch
> series, because that was intended to cover this case.
>
> [1]
> https://lore.kernel.org/linux-xfs/20220104192321.GF31606@magnolia/T/#mc08ffe4b619c1b503b2c1342157bdaa9823167c1
>
> > What I'm not yet reproducing is whatever vector that Trond is seeing
> > that is causing the multi-second hold-offs. I get page completion
> > processed at a rate of about a million pages per second per CPU, but
> > I'm bandwidth limited to about 400,000 pages per second due to
> > mapping->i_pages lock contention (reclaim vs page cache
> > instantiation vs writeback completion). I'm not seeing merged ioend
> > batches of larger than about 40,000 pages being processed at once.
> > Hence I can't yet see where the millions of pages in a single ioend
> > completion that would be required to hold a CPU for tens of seconds
> > is coming from yet...
> >
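Running the same sort of numbers on the rates above (again, my rough math
only): at ~1,000,000 pages/sec per CPU, a 32768-page batch is roughly

	32768 pages / 1,000,000 pages per sec ~= 33 ms

of work between cond_resched() calls, and a holdoff measured in tens of
seconds would need on the order of 10,000,000+ pages completed without a
yield, far beyond the ~40,000-page merged batches you are seeing.
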
>
> I was never able to reproduce the actual warning either (only construct
> the unexpectedly large page sequences through various means), so I'm
> equally as curious about that aspect of the problem. My only guess at
> the moment is that perhaps hardware is enough of a factor to increase
> the cost (i.e. slow cpu, cacheline misses, etc.)? I dunno..
>
So I did try patches 1 and 2 from this series over the weekend, but for
some reason the resulting kernel started hanging during I/O. I'm still
trying to figure out why.

I think I'll try Dave's new patch to see if that behaves similarly (in
case this is something new in the stable kernel series) and report
back.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
Thread overview: 51+ messages
2021-12-30 19:35 [PATCH] iomap: Address soft lockup in iomap_finish_ioend() trondmy
2021-12-30 21:24 ` Jens Axboe
2021-12-30 22:25 ` Trond Myklebust
2021-12-30 22:27 ` Jens Axboe
2021-12-30 22:55 ` Trond Myklebust
2021-12-31 1:42 ` Matthew Wilcox
2021-12-31 6:16 ` Trond Myklebust
2022-01-01 3:55 ` Dave Chinner
2022-01-01 17:39 ` Trond Myklebust
2022-01-03 22:03 ` Dave Chinner
2022-01-04 0:04 ` Trond Myklebust
2022-01-04 1:22 ` Dave Chinner
2022-01-04 3:01 ` Trond Myklebust
2022-01-04 7:08 ` hch
2022-01-04 18:08 ` Matthew Wilcox
2022-01-04 18:14 ` hch
2022-01-04 19:22 ` Darrick J. Wong
2022-01-04 21:52 ` Dave Chinner
2022-01-04 23:12 ` Darrick J. Wong
2022-01-05 2:10 ` Dave Chinner
2022-01-05 13:56 ` Brian Foster
2022-01-05 22:04 ` Dave Chinner
2022-01-06 16:44 ` Brian Foster
2022-01-10 8:18 ` Dave Chinner
2022-01-10 17:45 ` Brian Foster
2022-01-10 18:11 ` hch
2022-01-11 14:33 ` Trond Myklebust [this message]
2022-01-05 13:42 ` hch
2022-01-04 21:16 ` Dave Chinner
2022-01-05 13:43 ` hch
2022-01-05 22:34 ` Dave Chinner
2022-01-05 2:09 ` Trond Myklebust
2022-01-05 20:45 ` Trond Myklebust
2022-01-05 22:48 ` Dave Chinner
2022-01-05 23:29 ` Trond Myklebust
2022-01-06 0:01 ` Darrick J. Wong
2022-01-09 23:09 ` Dave Chinner
2022-01-06 18:36 ` Trond Myklebust
2022-01-06 18:38 ` Trond Myklebust
2022-01-06 20:07 ` Brian Foster
2022-01-07 3:08 ` Trond Myklebust
2022-01-07 15:15 ` Brian Foster
2022-01-09 23:34 ` Dave Chinner
2022-01-10 23:37 ` Dave Chinner
2022-01-11 0:08 ` Dave Chinner
2022-01-13 17:01 ` Trond Myklebust
2022-01-17 17:24 ` Trond Myklebust
2022-01-17 17:36 ` Darrick J. Wong
2022-01-04 13:36 ` Brian Foster
2022-01-04 19:23 ` Darrick J. Wong
2022-01-05 2:31 ` [iomap] f5934dda54: BUG:sleeping_function_called_from_invalid_context_at_fs/iomap/buffered-io.c kernel test robot