From: Brian Foster <bfoster@redhat.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
Dave Chinner <david@fromorbit.com>,
Ritesh Harjani <riteshh@linux.ibm.com>,
Anju T Sudhakar <anju@linux.vnet.ibm.com>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, willy@infradead.org,
minlei@redhat.com
Subject: Re: [PATCH] iomap: Fix the write_count in iomap_add_to_ioend().
Date: Wed, 16 Sep 2020 09:07:14 -0400
Message-ID: <20200916130714.GA1681377@bfoster>
In-Reply-To: <20200916084510.GA30815@infradead.org>

On Wed, Sep 16, 2020 at 09:45:10AM +0100, Christoph Hellwig wrote:
> On Tue, Sep 15, 2020 at 05:12:42PM -0700, Darrick J. Wong wrote:
> > On Tue, Aug 25, 2020 at 10:49:17AM -0400, Brian Foster wrote:
> > > cc Ming
> > >
> > > On Tue, Aug 25, 2020 at 10:42:03AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 24, 2020 at 11:48:41AM -0400, Brian Foster wrote:
> > > > > On Mon, Aug 24, 2020 at 04:04:17PM +0100, Christoph Hellwig wrote:
> > > > > > On Mon, Aug 24, 2020 at 10:28:23AM -0400, Brian Foster wrote:
> > > > > > > Do I understand the current code (__bio_try_merge_page() ->
> > > > > > > page_is_mergeable()) correctly in that we're checking for physical page
> > > > > > > contiguity and not necessarily requiring a new bio_vec per physical
> > > > > > > page?
> > > > > >
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > >
> > > > > Ok. I also realize now that this occurs on a kernel without commit
> > > > > 07173c3ec276 ("block: enable multipage bvecs"). That is probably a
> > > > > contributing factor, but it's not clear to me whether it's feasible to
> > > > > backport whatever supporting infrastructure is required for that
> > > > > mechanism to work (I suspect not).
> > > > >
> > > > > > > With regard to Dave's earlier point around seeing excessively sized bio
> > > > > > > chains.. If I set up a large memory box with high dirty mem ratios and
> > > > > > > do contiguous buffered overwrites over a 32GB range followed by fsync, I
> > > > > > > can see upwards of 1GB per bio and thus chains on the order of 32+ bios
> > > > > > > for the entire write. If I play games with how the buffered overwrite is
> > > > > > > submitted (i.e., in reverse) however, then I can occasionally reproduce
> > > > > > > a ~32GB chain of ~32k bios, which I think is what leads to problems in
> > > > > > > I/O completion on some systems. Granted, I don't reproduce soft lockup
> > > > > > > issues on my system with that behavior, so perhaps there's more to that
> > > > > > > particular issue.
> > > > > > >
> > > > > > > Regardless, it seems reasonable to me to at least have a conservative
> > > > > > > limit on the length of an ioend bio chain. Would anybody object to
> > > > > > > iomap_ioend growing a chain counter and perhaps forcing into a new ioend
> > > > > > > if we chain something like more than 1k bios at once?
> > > > > >
> > > > > > So what exactly is the problem of processing a long chain in the
> > > > > > workqueue vs multiple small chains? Maybe we need a cond_resched()
> > > > > > here and there, but I don't see how we'd substantially change behavior.
> > > > > >
> > > > >
> > > > > The immediate problem is a watchdog lockup detection in bio completion:
> > > > >
> > > > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 25
> > > > >
> > > > > This effectively lands at the following segment of iomap_finish_ioend():
> > > > >
> > > > > ...
> > > > > /* walk each page on bio, ending page IO on them */
> > > > > bio_for_each_segment_all(bv, bio, iter_all)
> > > > > iomap_finish_page_writeback(inode, bv->bv_page, error);
> > > > >
> > > > > I suppose we could add a cond_resched(), but is that safe directly
> > > > > inside of a ->bi_end_io() handler? Another option could be to dump large
> > > > > chains into the completion workqueue, but we may still need to track the
> > > > > length to do that. Thoughts?
> > > >
> > > > We have ioend completion merging that will run the completion once
> > > > for all the pending ioend completions on that inode. IOWs, we do not
> > > > need to build huge chains at submission time to batch up completions
> > > > efficiently. However, huge bio chains at submission time do cause
> > > > issues with writeback fairness, pinning GBs of ram as unreclaimable
> > > > for seconds because they are queued for completion while we are
> > > > still submitting the bio chain and submission is being throttled by
> > > > the block layer writeback throttle, etc. Not to mention the latency
> > > > of stable pages in a situation like this - a mmap() write fault
> > > > could stall for many seconds waiting for a huge bio chain to finish
> > > > submission and run completion processing even when the IO for the
> > > > given page we faulted on was completed before the page fault
> > > > occurred...
> > > >
> > > > Hence I think we really do need to cap the length of the bio
> > > > chains here so that we start completing and ending page writeback on
> > > > large writeback ranges long before the writeback code finishes
> > > > submitting the range it was asked to write back.
> > > >
> > >
> > > Ming pointed out separately that limiting the bio chain itself might not
> > > be enough because with multipage bvecs, we can effectively capture the
> > > same number of pages in far fewer bios. Given that, what do you think
> > > about something like the patch below to limit ioend size? This
> > > effectively limits the number of pages per ioend regardless of whether
> > > in-core state results in a small chain of dense bios or a large chain of
> > > smaller bios, without requiring any new explicit page count tracking.
> > >
> > > Brian
> >
> > Dave was asking on IRC if I was going to pull this patch in. I'm unsure
> > of its status (other than it hasn't been sent as a proper [PATCH]) so I
> > wonder, is this necessary, and if so, can it be cleaned up and
> > submitted?
>

I was waiting on some feedback from a few different angles before
posting a proper patch..

> Maybe it is lost somewhere, but what is the point of this patch?
> What does the magic number try to represent?
>

Dave described the main purpose earlier in this thread [1]. The initial
motivation is that we've had downstream reports of soft lockup problems
in writeback bio completion down in the bio -> bvec loop of
iomap_finish_ioend() that has to finish writeback on each individual
page of insanely large bios and/or chains. We've also had an upstream
report of a similar problem on linux-xfs [2].

The magic number itself was just pulled out of a hat. I picked it
because it seemed conservative enough to still allow large contiguous
bios (1GB w/ 4k pages) while hopefully preventing I/O completion
problems, but was hoping for some feedback on that bit if the general
approach was acceptable. I was also waiting for some feedback from either
of the two users who reported the problem but I don't think I've heard
back on that yet...
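
For illustration only, the size-based cap described above could be
modelled roughly as follows. This is a userspace sketch, not the patch
itself: the struct, the function names and the SKETCH_* constants are all
hypothetical, and the cap is simply the "1GB w/ 4k pages" figure written
out as a page count (2^30 / 2^12 = 262144 pages). The check refuses to
grow an ioend past the cap, which is the point at which a new ioend would
have to be started.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4k pages, matching the "1GB w/ 4k pages" figure. */
#define SKETCH_PAGE_SHIFT	12
/* Hypothetical cap: 1GB expressed in pages (2^30 / 2^12 = 262144). */
#define SKETCH_MAX_PAGES	(UINT64_C(1) << (30 - SKETCH_PAGE_SHIFT))
/* The same cap expressed in bytes. */
#define SKETCH_MAX_IOSIZE	(SKETCH_MAX_PAGES << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for the offset/size tracking of an ioend. */
struct sketch_ioend {
	uint64_t io_offset;	/* file offset where this ioend starts */
	uint64_t io_size;	/* bytes accumulated so far */
};

/*
 * Return true if len bytes at offset may be appended to the ioend:
 * the range must be contiguous with what is already there and must
 * not push the ioend past the size cap. A false return means the
 * caller would start a new ioend instead.
 */
static bool sketch_can_add_to_ioend(const struct sketch_ioend *ioend,
				    uint64_t offset, uint64_t len)
{
	assert(len > 0);

	if (offset != ioend->io_offset + ioend->io_size)
		return false;
	return ioend->io_size + len <= SKETCH_MAX_IOSIZE;
}

/* Number of capped ioends needed to cover a contiguous byte range. */
static uint64_t sketch_ioends_for_range(uint64_t bytes)
{
	return (bytes + SKETCH_MAX_IOSIZE - 1) / SKETCH_MAX_IOSIZE;
}
```

Under these assumed values, the contiguous 32GB overwrite mentioned
earlier in the thread would be split into 32 ioends of at most 262144
pages each, so no single completion has to walk the whole range,
regardless of how many bios those pages end up in.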
Brian

[1] https://lore.kernel.org/linux-fsdevel/20200821215358.GG7941@dread.disaster.area/
[2] https://lore.kernel.org/linux-xfs/alpine.LRH.2.02.2008311513150.7870@file01.intranet.prod.int.rdu2.redhat.com/