All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Andi Kleen <ak@linux.intel.com>,
	Matthew Wilcox <matthew.r.wilcox@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Dave Chinner <david@fromorbit.com>, linux-mm <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
Date: Mon, 17 Feb 2014 21:49:55 +0200	[thread overview]
Message-ID: <20140217194955.GA30908@node.dhcp.inet.fi> (raw)
In-Reply-To: <CA+55aFwz+36NOk=uanDvii7zn46-s1kpMT1Lt=C0hhhn9v6w-Q@mail.gmail.com>

On Mon, Feb 17, 2014 at 11:01:58AM -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> > reachable without blocking. We request one page a time. It's not terribly
> > efficient and I will probably re-think the interface once again to expose
> > iterator or something...
> 
> Hmm. Yeah, clearly this isn't working, since the real workloads all
> end up looking like
> 
> >        115,493,976      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       59.686645587 seconds time elapsed                                          ( +-  0.30% )
>  becomes
> >         47,428,068      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       60.241766430 seconds time elapsed                                          ( +-  0.85% )
> 
> and
> 
> >        268,039,365      minor-faults                                                 [100.00%]
> >      132.830612471 seconds time elapsed
> becomes
> >        193,550,437      minor-faults                                                 [100.00%]
> >      132.851823758 seconds time elapsed
> 
> and
> 
> >          4,967,540      minor-faults                                                  ( +-  0.06% ) [100.00%]
> >       27.215434226 seconds time elapsed                                          ( +-  0.18% )
> becomes
> >          2,285,563      minor-faults                                                  ( +-  0.26% ) [100.00%]
> >       27.292854546 seconds time elapsed                                          ( +-  0.29% )
> 
> ie it shows a clear reduction in faults, but the added costs clearly
> eat up any wins and it all becomes (just _slightly_) slower.
> 
> Sad.
> 
> I do wonder if we really need to lock the pages we fault in. We lock
> them in order to test for being up-to-date and still mapped. The
> up-to-date check we don't really need to worry about: that we can test
> without locking by just reading "page->flags" atomically and verifying
> that it's uptodate and not locked.
> 
> The other reason to lock the page is:
> 
>  - for anonymous pages we need the lock for rmap, so the VM generally
> always locks the page. But that's not an issue for file-backed pages:
> the "rmap" for a filebacked page is just the page mapcount and the
> cgroup statistics, and those don't need the page lock.
> 
>  - the whole truncation/unmapping thing
> 
> So the complex part is racing with truncate/unmapping the page. But
> since we hold the page table lock, I *think* what we should be able to
> do is:
> 
>  - increment the page _mapcount (iow, do "page_add_file_rmap()"
> early). This guarantees that any *subsequent* unmap activity on this
> page will walk the file mapping lists, and become serialized by the
> page table lock we hold.
> 
>  - mb_after_atomic_inc() (this is generally free)
> 
>  - test that the page is still unlocked and uptodate, and the page
> mapping still points to our page.
> 
>  - if that is true, we're all good, we can use the page, otherwise we
> decrement the mapcount (page_remove_rmap()) and skip the page.
> 
> Hmm? Doing something like this means that we would never lock the
> pages we prefault, and you can go back to your gang lookup rather than
> that "one page at a time". And the race case is basically never going
> to trigger.
> 
> Comments?

Sounds reasonable to me. I'll take a closer look tomorrow.

But it could be safer to keep locking in place and reduce lookup cost by
exposing something like ->fault_iter_init() and ->fault_iter_next(). It
will still return one page a time, but it will keep radix-tree context
around for cheaper next-page lookup.

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
	Andi Kleen <ak@linux.intel.com>,
	Matthew Wilcox <matthew.r.wilcox@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Dave Chinner <david@fromorbit.com>, linux-mm <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache
Date: Mon, 17 Feb 2014 21:49:55 +0200	[thread overview]
Message-ID: <20140217194955.GA30908@node.dhcp.inet.fi> (raw)
In-Reply-To: <CA+55aFwz+36NOk=uanDvii7zn46-s1kpMT1Lt=C0hhhn9v6w-Q@mail.gmail.com>

On Mon, Feb 17, 2014 at 11:01:58AM -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 10:38 AM, Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > Now we have ->fault_nonblock() to ask filesystem for a page, if it's
> > reachable without blocking. We request one page a time. It's not terribly
> > efficient and I will probably re-think the interface once again to expose
> > iterator or something...
> 
> Hmm. Yeah, clearly this isn't working, since the real workloads all
> end up looking like
> 
> >        115,493,976      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       59.686645587 seconds time elapsed                                          ( +-  0.30% )
>  becomes
> >         47,428,068      minor-faults                                                  ( +-  0.00% ) [100.00%]
> >       60.241766430 seconds time elapsed                                          ( +-  0.85% )
> 
> and
> 
> >        268,039,365      minor-faults                                                 [100.00%]
> >      132.830612471 seconds time elapsed
> becomes
> >        193,550,437      minor-faults                                                 [100.00%]
> >      132.851823758 seconds time elapsed
> 
> and
> 
> >          4,967,540      minor-faults                                                  ( +-  0.06% ) [100.00%]
> >       27.215434226 seconds time elapsed                                          ( +-  0.18% )
> becomes
> >          2,285,563      minor-faults                                                  ( +-  0.26% ) [100.00%]
> >       27.292854546 seconds time elapsed                                          ( +-  0.29% )
> 
> ie it shows a clear reduction in faults, but the added costs clearly
> eat up any wins and it all becomes (just _slightly_) slower.
> 
> Sad.
> 
> I do wonder if we really need to lock the pages we fault in. We lock
> them in order to test for being up-to-date and still mapped. The
> up-to-date check we don't really need to worry about: that we can test
> without locking by just reading "page->flags" atomically and verifying
> that it's uptodate and not locked.
> 
> The other reason to lock the page is:
> 
>  - for anonymous pages we need the lock for rmap, so the VM generally
> always locks the page. But that's not an issue for file-backed pages:
> the "rmap" for a filebacked page is just the page mapcount and the
> cgroup statistics, and those don't need the page lock.
> 
>  - the whole truncation/unmapping thing
> 
> So the complex part is racing with truncate/unmapping the page. But
> since we hold the page table lock, I *think* what we should be able to
> do is:
> 
>  - increment the page _mapcount (iow, do "page_add_file_rmap()"
> early). This guarantees that any *subsequent* unmap activity on this
> page will walk the file mapping lists, and become serialized by the
> page table lock we hold.
> 
>  - mb_after_atomic_inc() (this is generally free)
> 
>  - test that the page is still unlocked and uptodate, and the page
> mapping still points to our page.
> 
>  - if that is true, we're all good, we can use the page, otherwise we
> decrement the mapcount (page_remove_rmap()) and skip the page.
> 
> Hmm? Doing something like this means that we would never lock the
> pages we prefault, and you can go back to your gang lookup rather than
> that "one page at a time". And the race case is basically never going
> to trigger.
> 
> Comments?

Sounds reasonable to me. I'll take a closer look tomorrow.

But it could be safer to keep locking in place and reduce lookup cost by
exposing something like ->fault_iter_init() and ->fault_iter_next(). It
will still return one page a time, but it will keep radix-tree context
around for cheaper next-page lookup.

-- 
 Kirill A. Shutemov

  reply	other threads:[~2014-02-17 19:49 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-02-17 18:38 [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in page cache Kirill A. Shutemov
2014-02-17 18:38 ` Kirill A. Shutemov
2014-02-17 18:38 ` [PATCH 1/2] mm: introduce vm_ops->fault_nonblock() Kirill A. Shutemov
2014-02-17 18:38   ` Kirill A. Shutemov
2014-02-17 18:38 ` [PATCH 2/2] mm: implement ->fault_nonblock() for page cache Kirill A. Shutemov
2014-02-17 18:38   ` Kirill A. Shutemov
2014-02-17 19:01 ` [RFC, PATCHv2 0/2] mm: map few pages around fault address if they are in " Linus Torvalds
2014-02-17 19:01   ` Linus Torvalds
2014-02-17 19:49   ` Kirill A. Shutemov [this message]
2014-02-17 19:49     ` Kirill A. Shutemov
2014-02-17 20:24     ` Linus Torvalds
2014-02-17 20:24       ` Linus Torvalds
2014-02-18 13:28   ` Rik van Riel
2014-02-18 13:28     ` Rik van Riel
2014-02-18 14:15     ` Wilcox, Matthew R
2014-02-18 14:15       ` Wilcox, Matthew R
2014-02-18 18:02       ` Linus Torvalds
2014-02-18 18:02         ` Linus Torvalds
2014-02-18 18:53         ` Matthew Wilcox
2014-02-18 18:53           ` Matthew Wilcox
2014-02-18 19:07           ` Linus Torvalds
2014-02-18 19:07             ` Linus Torvalds
2014-02-18 14:23     ` Kirill A. Shutemov
2014-02-18 14:23       ` Kirill A. Shutemov
2014-02-18 17:51     ` Linus Torvalds
2014-02-18 17:51       ` Linus Torvalds
2014-02-18 17:59   ` Kirill A. Shutemov
2014-02-18 17:59     ` Kirill A. Shutemov
2014-02-18 18:07     ` Kirill A. Shutemov
2014-02-18 18:07       ` Kirill A. Shutemov
2014-02-18 18:07       ` Kirill A. Shutemov
2014-02-18 18:28       ` Linus Torvalds
2014-02-18 18:28         ` Linus Torvalds
2014-02-18 23:57         ` Kirill A. Shutemov
2014-02-18 23:57           ` Kirill A. Shutemov
2014-02-19  0:29           ` Linus Torvalds
2014-02-19  0:29             ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140217194955.GA30908@node.dhcp.inet.fi \
    --to=kirill@shutemov.name \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@fromorbit.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=matthew.r.wilcox@intel.com \
    --cc=mgorman@suse.de \
    --cc=riel@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.