From: Johannes Weiner <hannes@cmpxchg.org>
To: Shakeel Butt <shakeelb@google.com>
Cc: "Vlastimil Babka" <vbabka@suse.cz>,
"Huang Ying" <ying.huang@intel.com>,
"Tim Chen" <tim.c.chen@linux.intel.com>,
"Michal Hocko" <mhocko@kernel.org>,
"Greg Thelen" <gthelen@google.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Balbir Singh" <bsingharora@gmail.com>,
"Minchan Kim" <minchan@kernel.org>, "Shaohua Li" <shli@fb.com>,
"Jérôme Glisse" <jglisse@redhat.com>, "Jan Kara" <jack@suse.cz>,
"Nicholas Piggin" <npiggin@gmail.com>,
"Dan Williams" <dan.j.williams@intel.com>,
"Mel Gorman" <mgorman@suse.de>, "Hugh Dickins" <hughd@google.com>,
"Linux MM" <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm, mlock, vmscan: no more skipping pagevecs
Date: Tue, 21 Nov 2017 16:34:00 -0500 [thread overview]
Message-ID: <20171121213400.GA1503@cmpxchg.org> (raw)
In-Reply-To: <CALvZod5f9oC97OJBZuHypFn1nuyYKF77TpNSyPs31mAGFc=4Ag@mail.gmail.com>
On Tue, Nov 21, 2017 at 10:22:23AM -0800, Shakeel Butt wrote:
> On Tue, Nov 21, 2017 at 7:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Nov 21, 2017 at 01:39:57PM +0100, Vlastimil Babka wrote:
> >> On 11/04/2017 11:43 PM, Shakeel Butt wrote:
> >> > When a thread mlocks an address space backed by file, a new
> >> > page is allocated (assuming file page is not in memory), added
> >> > to the local pagevec (lru_add_pvec), I/O is triggered and the
> >> > thread then sleeps on the page. On I/O completion, the thread
> >> > can wake on a different CPU, the mlock syscall will then sets
> >> > the PageMlocked() bit of the page but will not be able to put
> >> > that page in unevictable LRU as the page is on the pagevec of
> >> > a different CPU. Even on drain, that page will go to evictable
> >> > LRU because the PageMlocked() bit is not checked on pagevec
> >> > drain.
> >> >
> >> > The page will eventually go to right LRU on reclaim but the
> >> > LRU stats will remain skewed for a long time.
> >> >
> >> > However, this issue does not happen for anon pages on swap
> >> > because unlike file pages, anon pages are not added to pagevec
> >> > until they have been fully swapped in. Also the fault handler
> >> > uses vm_flags to set the PageMlocked() bit of such anon pages
> >> > even before returning to mlock() syscall and mlocked pages will
> >> > skip pagevecs and directly be put into unevictable LRU. No such
> >> > luck for file pages.
> >> >
> >> > One way to resolve this issue, is to somehow plumb vm_flags from
> >> > filemap_fault() to add_to_page_cache_lru() which will then skip
> >> > the pagevec for pages of VM_LOCKED vma and directly put them to
> >> > unevictable LRU. However this patch took a different approach.
> >> >
> >> > All the pages, even unevictable, will be added to the pagevecs
> >> > and on the drain, the pages will be added on their LRUs correctly
> >> > by checking their evictability. This resolves the mlocked file
> >> > pages on pagevec of other CPUs issue because when those pagevecs
> >> > will be drained, the mlocked file pages will go to unevictable
> >> > LRU. Also this makes the race with munlock easier to resolve
> >> > because the pagevec drains happen in LRU lock.
> >> >
> >> > There is one (good) side effect though. Without this patch, the
> >> > pages allocated for System V shared memory segment are added to
> >> > evictable LRUs even after shmctl(SHM_LOCK) on that segment. This
> >> > patch will correctly put such pages to unevictable LRU.
> >> >
> >> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> >>
> >> I like the approach in general, as it seems to make the code simpler,
> >> and the diffstats support that. I found no bugs, but I can't say that
> >> with certainty that there aren't any, though. This code is rather
> >> tricky. But it should be enough for an ack, so.
> >>
> >> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >>
> >> A question below, though.
> >>
> >> ...
> >>
> >> > @@ -883,15 +855,41 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
> >> > static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
> >> > void *arg)
> >> > {
> >> > - int file = page_is_file_cache(page);
> >> > - int active = PageActive(page);
> >> > - enum lru_list lru = page_lru(page);
> >> > + enum lru_list lru;
> >> > + int was_unevictable = TestClearPageUnevictable(page);
> >> >
> >> > VM_BUG_ON_PAGE(PageLRU(page), page);
> >> >
> >> > SetPageLRU(page);
> >> > + /*
> >> > + * Page becomes evictable in two ways:
> >> > + * 1) Within LRU lock [munlock_vma_pages() and __munlock_pagevec()].
> >> > + * 2) Before acquiring LRU lock to put the page to correct LRU and then
> >> > + * a) do PageLRU check with lock [check_move_unevictable_pages]
> >> > + * b) do PageLRU check before lock [isolate_lru_page]
> >> > + *
> >> > + * (1) & (2a) are ok as LRU lock will serialize them. For (2b), if the
> >> > + * other thread does not observe our setting of PG_lru and fails
> >> > + * isolation, the following page_evictable() check will make us put
> >> > + * the page in correct LRU.
> >> > + */
> >> > + smp_mb();
> >>
> >> Could you elaborate on the purpose of smp_mb() here? Previously there
> >> was "The other side is TestClearPageMlocked() or shmem_lock()" in
> >> putback_lru_page(), which seems rather unclear to me (neither has an
> >> explicit barrier?).
> >
> > The TestClearPageMlocked() is an RMW operation with return value, and
> > thus an implicit full barrier (see Documentation/atomic_bitops.txt).
> >
> > The ordering is between putback and munlock:
> >
> > #0 #1
> > list_add(&page->lru,...) if (TestClearPageMlock())
> > SetPageLRU() __munlock_isolate_lru_page()
> > smp_mb()
> > if (page_evictable())
> > rescue
> >
> > The scenario that the barrier prevents from happening is:
> >
> > list_add(&page->lru,...)
> > if (page_evictable())
> > rescue
> > if (TestClearPageMlock())
> > __munlock_isolate_lru_page() // FAILS on !PageLRU
> > SetPageLRU()
> >
> > and now an evictable page is stranded on the unevictable LRU.
> >
> > The barrier guarantees that if #0 doesn't see the page evictable yet,
> > #1 WILL see the PageLRU and succeed in isolation and rescue.
> >
> > Shakeel, please don't drop that "the other side" comment. You mention
> > the places that make the page evictable - which is great, and please
> > keep that as well - but for barriers it's always good to know exactly
> > which operation guarantees the ordering on the other side. In fact, it
> > would be great if you could add comments to the TestClearPageMlocked()
> > sites that mention how they order against the smp_mb() in LRU putback.
>
> Johannes, I have a question. The example you presented is valid before
> this patch as '#0' was happening outside LRU lock. This patch moves
> '#0' inside LRU lock and '#1' was already in LRU lock therefore no
> issue for this particular scenario. However there is still a
> TestClearPageMlocked() in clear_page_mlock() which happens outside LRU
> lock and same issue which you have explained can happen even with this
> patch (but without smp_mb()).
>
> So, "the other side" for smp_mb() after this patch will only be the
> TestClearPageMlock() in clear_page_mlock() because all other
> TestClearPageMlocked() instances are serialized by LRU lock. Please
> let me know if I missed something.
You are right, I overlooked the lru lock in __munlock_pagevec(). It's
really only clear_page_mlock() that needs the ordering.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-11-21 21:34 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-04 22:43 [PATCH] mm, mlock, vmscan: no more skipping pagevecs Shakeel Butt
2017-11-16 1:45 ` Shakeel Butt
2017-11-21 12:39 ` Vlastimil Babka
2017-11-21 15:06 ` Johannes Weiner
2017-11-21 17:13 ` Shakeel Butt
2017-11-21 18:22 ` Shakeel Butt
2017-11-21 21:34 ` Johannes Weiner [this message]
2017-11-21 15:32 ` Johannes Weiner
2017-11-21 17:20 ` Shakeel Butt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171121213400.GA1503@cmpxchg.org \
--to=hannes@cmpxchg.org \
--cc=akpm@linux-foundation.org \
--cc=bsingharora@gmail.com \
--cc=dan.j.williams@intel.com \
--cc=gthelen@google.com \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=jglisse@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mhocko@kernel.org \
--cc=minchan@kernel.org \
--cc=npiggin@gmail.com \
--cc=shakeelb@google.com \
--cc=shli@fb.com \
--cc=tim.c.chen@linux.intel.com \
--cc=vbabka@suse.cz \
--cc=ying.huang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).