From: Trond Myklebust <trondmy@hammerspace.com>
To: "bfoster@redhat.com" <bfoster@redhat.com>,
	"david@fromorbit.com" <david@fromorbit.com>
Cc: "djwong@kernel.org" <djwong@kernel.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	"willy@infradead.org" <willy@infradead.org>,
	"hch@infradead.org" <hch@infradead.org>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"trondmy@kernel.org" <trondmy@kernel.org>
Subject: Re: [PATCH] iomap: Address soft lockup in iomap_finish_ioend()
Date: Tue, 11 Jan 2022 14:33:11 +0000
Message-ID: <615fa140d0dd855a252de09490411be0afe5ae3d.camel@hammerspace.com>
In-Reply-To: <YdxwnaT0nYHgGQZR@bfoster>

On Mon, 2022-01-10 at 12:45 -0500, Brian Foster wrote:
> On Mon, Jan 10, 2022 at 07:18:47PM +1100, Dave Chinner wrote:
> > On Thu, Jan 06, 2022 at 11:44:23AM -0500, Brian Foster wrote:
> > > On Thu, Jan 06, 2022 at 09:04:21AM +1100, Dave Chinner wrote:
> > > > On Wed, Jan 05, 2022 at 08:56:33AM -0500, Brian Foster wrote:
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 71a36ae120ee..39214577bc46 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1066,17 +1066,34 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> > > >         }
> > > >  }
> > > >  
> > > > +/*
> > > > + * Ioend completion routine for merged bios. This can only be called from task
> > > > + * contexts as merged ioends can be of unbound length. Hence we have to break up
> > > > + * the page writeback completion into manageable chunks to avoid long scheduler
> > > > + * holdoffs. We aim to keep scheduler holdoffs down below 10ms so that we get
> > > > + * good batch processing throughput without creating adverse scheduler latency
> > > > + * conditions.
> > > > + */
> > > >  void
> > > >  iomap_finish_ioends(struct iomap_ioend *ioend, int error)
> > > >  {
> > > >         struct list_head tmp;
> > > > +       int segments;
> > > > +
> > > > +       might_sleep();
> > > >  
> > > >         list_replace_init(&ioend->io_list, &tmp);
> > > > +       segments = ioend->io_segments;
> > > >         iomap_finish_ioend(ioend, error);
> > > >  
> > > >         while (!list_empty(&tmp)) {
> > > > +               if (segments > 32768) {
> > > > +                       cond_resched();
> > > > +                       segments = 0;
> > > > +               }
> > > 
> > > How is this intended to address the large bi_vec scenario? AFAICT
> > > bio_segments() doesn't account for multipage bvecs so the above logic
> > > can allow something like 34b (?) 4k pages before a yield.
> > 
> > Right now the bvec segment iteration in iomap_finish_ioend() is
> > completely unaware of multipage bvecs - as per above
> > bio_for_each_segment_all() iterates by PAGE_SIZE within a bvec,
> > regardless of whether they are stored in a multipage bvec or not.
> > Hence it always iterates the entire bio a single page at a time.
> > 
> > IOWs, we don't use multi-page bvecs in iomap writeback, nor is it
> > aware of them at all. We're adding single pages to bios via
> > bio_add_page() which may merge them internally into multipage
> > bvecs.
> > However, all our iterators use single page interfaces, hence we
> > don't see the internal multi-page structure of the bio at all.
> > As such, bio_segments() should return the number of PAGE_SIZE pages
> > attached to the bio regardless of its internal structure.
> > 
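
For reference, bio_segments() is (modulo the discard/write-same special
cases) just a walk of the bio at single-page granularity. A rough sketch
of the include/linux/bio.h helper, paraphrased from memory rather than
quoted verbatim:

static inline unsigned bio_segments_sketch(struct bio *bio)
{
        unsigned segs = 0;
        struct bio_vec bv;
        struct bvec_iter iter;

        /*
         * bio_for_each_segment() advances in steps of at most PAGE_SIZE,
         * so a single 16MB multipage bvec still contributes 4096 counts.
         */
        bio_for_each_segment(bv, bio, iter)
                segs++;

        return segs;
}
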
> 
> That is pretty much the point. The completion loop doesn't really care
> whether the amount of page processing work is due to a large bio chain,
> multipage bi_bvec(s), merged ioends, or some odd combination thereof. As
> you note, these conditions can manifest from various layers above or
> below iomap. I don't think iomap really needs to know or care about any
> of this. It just needs to yield when it has spent "too much" time
> processing pages.
> 
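
(Purely for illustration, not what either posted patch actually does: a
minimal sketch of that per-page accounting idea, reusing the existing
iomap_finish_page_writeback() helper from fs/iomap/buffered-io.c if I'm
reading the 5.16 code right, and the 32k batch size from the quoted
patch. The process-context-only caveat from the patch comment still
applies because of cond_resched().)

static void iomap_finish_bio_pages(struct bio *bio, struct inode *inode,
                                   int error, unsigned int *done)
{
        struct bio_vec *bv;
        struct bvec_iter_all iter_all;

        bio_for_each_segment_all(bv, bio, iter_all) {
                /* per-page completion, as iomap_finish_ioend() already does */
                iomap_finish_page_writeback(inode, bv->bv_page, error,
                                            bv->bv_len);
                /* yield on work actually done, not on bio/bvec structure */
                if (++(*done) >= 32768) {
                        cond_resched();
                        *done = 0;
                }
        }
}
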
> With regard to the iterators, my understanding was that
> bio_for_each_segment_all() walks the multipage bvecs but
> bio_for_each_segment() does not, but that could certainly be wrong as I
> find the iterators a bit confusing. Either way, the most recent test
> with the ioend granular filter implies that a single ioend can still
> become a soft lockup vector from non-atomic context.
> 
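
As far as I can tell from include/linux/bio.h (same "could be wrong"
caveat), both bio_for_each_segment() and bio_for_each_segment_all() step
a page at a time, and it is bio_for_each_bvec() that walks the multipage
bvecs as stored. Illustrative counting loops only:

/* One step per PAGE_SIZE chunk, even when bio_add_page() merged the
 * pages into a single multipage bvec. */
static unsigned count_page_segments(struct bio *bio)
{
        struct bio_vec *bv;
        struct bvec_iter_all iter_all;
        unsigned n = 0;

        bio_for_each_segment_all(bv, bio, iter_all)
                n++;
        return n;
}

/* One step per bvec as stored, so each of the 16MB single-bvec bios in
 * the trace below would count as 1 here. */
static unsigned count_bvecs(struct bio *bio)
{
        struct bio_vec bv;
        struct bvec_iter iter;
        unsigned n = 0;

        bio_for_each_bvec(bv, bio, iter)
                n++;
        return n;
}
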
> > That is what I see on a trace from a large single file submission,
> > comparing bio_segments() output from the page count on an ioend:
> > 
> >    kworker/u67:2-187   [017] 13530.753548: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x370400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.759706: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x378400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.766326: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x380400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.770689: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x388400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.774716: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x390400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.777157: iomap_writepages: 3. bios 2048, pages 2048, start sector 0x398400 bi_vcnt 1, bi_size 8388608
> > 
> > Which shows we are building ioends with a single bio with a single
> > bvec, containing 4096 pages and 4096 bio segments. So, as expected,
> > bio_segments() matches the page count and we submit 4096 page ioends
> > with a single bio attached to it.
> > 
> > This is clearly a case where we are getting physically contiguous
> > page cache page allocation during write() syscalls, and the result
> > is a single contiguous bvec from bio_add_page() doing physical page
> > merging at the bvec level. Hence we see bio->bi_vcnt = 1 and a
> > physically contiguous 4096 multipage bvec being dispatched. The
> > lower layers slice and dice these huge bios to what the hardware can
> > handle...
> > 
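
(Aside: that collapse into a single bvec is just bio_add_page() merging
physically adjacent pages. A rough sketch of the submission side; the
pages array and count here are hypothetical and error handling is
elided:)

static void add_contig_run(struct bio *bio, struct page **pages, int nr)
{
        int i;

        for (i = 0; i < nr; i++) {
                /*
                 * bio_add_page() first tries to extend the tail bvec when
                 * the new page is physically adjacent to it, so bi_vcnt
                 * can stay at 1 while bi_size grows to many megabytes.
                 */
                if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) != PAGE_SIZE)
                        break;  /* bio full: caller would chain a new bio */
        }
}
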
> 
> I think we're in violent agreement here. That is the crux of multipage
> bvecs and what I've been trying to point out [1]. Ming (who I believe
> implemented it) pointed this out back when the problem was first
> reported. This is also why I asked Trond to test out the older patch
> series, because that was intended to cover this case.
> 
> [1]
> https://lore.kernel.org/linux-xfs/20220104192321.GF31606@magnolia/T/#mc08ffe4b619c1b503b2c1342157bdaa9823167c1
> 
> > What I'm not yet reproducing is whatever vector that Trond is seeing
> > that is causing the multi-second hold-offs. I get page completion
> > processed at a rate of about a million pages per second per CPU, but
> > I'm bandwidth limited to about 400,000 pages per second due to
> > mapping->i_pages lock contention (reclaim vs page cache
> > instantiation vs writeback completion). I'm not seeing merged ioend
> > batches of larger than about 40,000 pages being processed at once.
> > Hence I can't yet see where the millions of pages in a single ioend
> > completion that would be required to hold a CPU for tens of seconds
> > is coming from yet...
> > 
> 
> I was never able to reproduce the actual warning either (only construct
> the unexpectedly large page sequences through various means), so I'm
> equally as curious about that aspect of the problem. My only guess at
> the moment is that perhaps hardware is enough of a factor to increase
> the cost (i.e. slow cpu, cacheline misses, etc.)? I dunno..
> 

So I did try patches 1 and 2 from this series over the weekend, but for
some reason the resulting kernel started hanging during I/O. I'm still
trying to figure out why.

I think I'll try Dave's new patch to see if that behaves similarly (in
case this is something new in the stable kernel series) and report
back.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com


