From: Minchan Kim <minchan@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm <linux-mm@kvack.org>, Josef Bacik <josef@toxicpanda.com>,
Johannes Weiner <hannes@cmpxchg.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: fix long time stall from mm_populate
Date: Wed, 12 Feb 2020 09:40:15 -0800 [thread overview]
Message-ID: <20200212174015.GB93795@google.com> (raw)
In-Reply-To: <20200212101804.GD25573@quack2.suse.cz>
Hello Jan,
On Wed, Feb 12, 2020 at 11:18:04AM +0100, Jan Kara wrote:
> On Tue 11-02-20 09:57:31, Minchan Kim wrote:
> > On Tue, Feb 11, 2020 at 09:28:03AM -0800, Matthew Wilcox wrote:
> > > On Tue, Feb 11, 2020 at 08:34:04AM -0800, Minchan Kim wrote:
> > > > On Tue, Feb 11, 2020 at 04:23:23AM -0800, Matthew Wilcox wrote:
> > > > > On Mon, Feb 10, 2020 at 08:25:36PM -0800, Minchan Kim wrote:
> > > > > > On Mon, Feb 10, 2020 at 07:54:12PM -0800, Matthew Wilcox wrote:
> > > > > > > On Mon, Feb 10, 2020 at 07:50:04PM -0800, Minchan Kim wrote:
> > > > > > > > On Mon, Feb 10, 2020 at 05:10:21PM -0800, Matthew Wilcox wrote:
> > > > > > > > > On Mon, Feb 10, 2020 at 04:19:58PM -0800, Minchan Kim wrote:
> > > > > > > > > > filemap_fault
> > > > > > > > > > find a page form page(PG_uptodate|PG_readahead|PG_writeback)
> > > > > > > > >
> > > > > > > > > Uh ... That shouldn't be possible.
> > > > > > > >
> > > > > > > > Please see shrink_page_list. Vmscan uses PG_reclaim to accelerate
> > > > > > > > page reclaim when the writeback is done so the page will have both
> > > > > > > > flags at the same time and the PG reclaim could be regarded as
> > > > > > > > PG_readahead in fault conext.
> > > > > > >
> > > > > > > What part of fault context can make that mistake? The snippet I quoted
> > > > > > > below is from page_cache_async_readahead() where it will clearly not
> > > > > > > make that mistake. There's a lot of code here; please don't presume I
> > > > > > > know all the areas you're talking about.
> > > > > >
> > > > > > Sorry about being not clear. I am saying filemap_fault ->
> > > > > > do_async_mmap_readahead
> > > > > >
> > > > > > Let's assume the page is hit in page cache and vmf->flags is !FAULT_FLAG
> > > > > > TRIED so it calls do_async_mmap_readahead. Since the page has PG_reclaim
> > > > > > and PG_writeback by shrink_page_list, it goes to
> > > > > >
> > > > > > do_async_mmap_readahead
> > > > > > if (PageReadahead(page))
> > > > > > fpin = maybe_unlock_mmap_for_io();
> > > > > > page_cache_async_readahead
> > > > > > if (PageWriteback(page))
> > > > > > return;
> > > > > > ClearPageReadahead(page); <- doesn't reach here until the writeback is clear
> > > > > >
> > > > > > So, mm_populate will repeat the loop until the writeback is done.
> > > > > > It's my just theory but didn't comfirm it by the testing.
> > > > > > If I miss something clear, let me know it.
> > > > >
> > > > > Ah! Surely the right way to fix this is ...
> > > >
> > > > I'm not sure it's right fix. Actually, I wanted to remove PageWriteback check
> > > > in page_cache_async_readahead because I don't see corelation. Why couldn't we
> > > > do readahead if the marker page is PG_readahead|PG_writeback design PoV?
> > > > Only reason I can think of is it makes *a page* will be delayed for freeing
> > > > since we removed PG_reclaim bit, which would be over-optimization for me.
> > >
> > > You're confused. Because we have a shortage of bits in the page flags,
> > > we use the same bit for both PageReadahead and PageReclaim. That doesn't
> > > mean that a page marked as PageReclaim should be treated as PageReadahead.
> >
> > My point is why we couldn't do readahead if the marker page is under PG_writeback.
>
> Well, as far as I'm reading the code, this shouldn't usually happen -
> PageReadahead is set on a page that the preread into memory. Once it is
> used for the first time (either by page fault or normal read), readahead
> logic triggers starting further readahead and PageReadahead gets cleared.
>
> What could happen though is that the page gets written into (say just a few
> bytes). That would keep PageReadahead set although the page will become
> dirty and can later be written back. I don't find not triggering writeback in
> this case too serious though since it should be very rare.
Please see pageout in vmscan.c which set PG_reclaim righ before calling
writepage. Since PG_reclaim and PG_readahead shares the same bit of the
page flags, do_async_mmap_readahead will decipher the PG_reclaim as
PageReadahead so it releases mmap but page_cache_async_readahead just
returns by PageWriteback check. It will be repeated until the writeback
is done.
>
> So I'd be OK with the change Matthew suggested although I'd prefer if we
> had this magic "!PageWriteback && PageReadahead" test in some helper
> function (like page_should_trigger_readahead()?) with a good comment explaining
> the details.
How about this?
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..d07d602476df 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -363,8 +363,28 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
/* PG_readahead is only used for reads; PG_reclaim is only for writes */
PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
- TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+SETPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+CLEARPAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+
+/*
+ * Since PG_readahead is shared with PG_reclaim of the page flags,
+ * PageReadahead should double check whether it's readahead marker
+ * or PG_reclaim. It could be done by PageWriteback check because
+ * PG_reclaim is always with PG_writeback.
+ */
+static inline int PageReadahead(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+ return test_bit(PG_reclaim, &(page)->flags) && !PageWriteback(page);
+}
+
+static inline int TestClearPageReadahead(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(PageCompound(page), page);
+
+ return test_and_clear_bit(PG_reclaim, &page->flags) && !PageWriteback(page);
+}
#ifdef CONFIG_HIGHMEM
/*
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..85b15e5a1d7b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -553,12 +553,6 @@ page_cache_async_readahead(struct address_space *mapping,
if (!ra->ra_pages)
return;
- /*
- * Same bit is used for PG_readahead and PG_reclaim.
- */
- if (PageWriteback(page))
- return;
-
ClearPageReadahead(page);
/*
--
2.25.0.225.g125e21ebc7-goog
>
> > It was there for a long time and you were adding one more so I was curious what's
> > reasoning comes from. Let me find why PageWriteback check in
> > page_cache_async_readahead from the beginning.
> >
> > fe3cba17c4947, mm: share PG_readahead and PG_reclaim
> >
> > The reason comes from the description
> >
> > b) clear PG_readahead => implicit clear of PG_reclaim
> > one(and only one) page will not be reclaimed in time
> > it can be avoided by checking PageWriteback(page) in readahead first
> >
> > The goal was to avoid delay freeing of the page by clearing PG_reclaim.
> > I'm saying that I feel it's over optimization. IOW, it would be okay to
> > lose a page to be accelerated reclaim.
> >
> > >
> > > > Other concern is isn't it's racy? IOW, page was !PG_writeback at the check below
> > > > in your snippet but it was under PG_writeback in page_cache_async_readahead and
> > > > then the IO was done before refault reaching the code again. It could be repeated
> > > > *theoretically* even though it's very hard to happen in real practice.
> > > > Thus, I think it would be better to remove PageWriteback check from
> > > > page_cache_async_readahead if we really want to go the approach.
> > >
> > > PageReclaim is always cleared before PageWriteback. eg here:
> > >
> > > void end_page_writeback(struct page *page)
> > > ...
> > > if (PageReclaim(page)) {
> > > ClearPageReclaim(page);
> > > rotate_reclaimable_page(page);
> > > }
> > >
> > > if (!test_clear_page_writeback(page))
> > > BUG();
> > >
> > > so if PageWriteback is clear, PageReclaim must already be observable as clear.
> > >
> >
> > I'm saying live lock siutation below.
> > It would be hard to trigger since IO is very slow but isn't it possible
> > theoretically?
> >
> >
> > CPU 1 CPU 2
> > mm_populate
> > 1st trial
> > __get_user_pages
> > handle_mm_fault
> > filemap_fault
> > do_async_mmap_readahead
> > if (!PageWriteback(page) && PageReadahead(page)) {
> > fpin = maybe_unlock_mmap_for_io
> > page_cache_async_readahead
> > set_page_writeback here
> > if (PageWriteback(page))
> > return; <- hit
> >
> > writeback completed and reclaimed the page
> > ..
> > ondemand readahead allocates new page and mark it to PG_readahead
> > 2nd trial
> > __get_user_pages
> > handle_mm_fault
> > filemap_fault
> > do_async_mmap_readahead
> > if (!PageWriteback(page) && PageReadahead(page)) {
> > fpin = maybe_unlock_mmap_for_io
> > page_cache_async_readahead
> > set_page_writeback here
> > if (PageWriteback(page))
> > return; <- hit
> >
> > writeback completed and reclaimed the page
> > ..
> > ondemand readahead allocates new page and mark it to PG_readahead
> >
> > 3rd trial
> > ..
> >
> >
> > Let's consider ra_pages, too as I mentioned. Isn't it another hole to make
> > such live lock if other task suddenly reset it to zero?
> >
> > void page_cache_async_readahead(..)
> > {
> > /* no read-ahead */
> > if (!ra->ra_pages)
> > return;
>
> So this is definitively a bug which was already reported previously. I've
> just sent out a patch to fix this which has somehow fallen through the
> cracks.
>
> Now I agree that regardless of this fix and the fix Matthew has proposed,
> mm_populate() would benefit from being more robust like you suggested so
> I'll check that separately (but I'm no expert in that area).
True because I think we couldn't prevent live lock as I wrote above.
Thanks for the review, Jan!
next prev parent reply other threads:[~2020-02-12 17:40 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-02-11 0:19 [PATCH] mm: fix long time stall from mm_populate Minchan Kim
2020-02-11 1:10 ` Matthew Wilcox
2020-02-11 3:50 ` Minchan Kim
2020-02-11 3:54 ` Matthew Wilcox
2020-02-11 4:25 ` Minchan Kim
2020-02-11 12:23 ` Matthew Wilcox
2020-02-11 16:34 ` Minchan Kim
2020-02-11 17:28 ` Matthew Wilcox
2020-02-11 17:57 ` Minchan Kim
2020-02-12 10:18 ` Jan Kara
2020-02-12 17:40 ` Minchan Kim [this message]
2020-02-12 18:28 ` Matthew Wilcox
2020-02-12 19:53 ` Minchan Kim
2020-02-12 22:24 ` Andrew Morton
2020-02-12 23:12 ` Minchan Kim
2020-02-13 2:00 ` Andrew Morton
2020-02-13 17:24 ` Minchan Kim
2020-02-11 18:14 ` Yang Shi
2020-02-12 10:22 ` Jan Kara
2020-02-12 17:43 ` Minchan Kim
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200212174015.GB93795@google.com \
--to=minchan@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=jack@suse.cz \
--cc=josef@toxicpanda.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.