From: Andrea Arcangeli <aarcange@redhat.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Linus Torvalds <torvalds@linux-foundation.org>,
Dave Jones <davej@redhat.com>,
Darren Hart <dvhart@linux.intel.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Mel Gorman <mgorman@suse.de>
Subject: Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit)
Date: Tue, 17 Dec 2013 19:06:03 +0100 [thread overview]
Message-ID: <20131217180603.GC5441@redhat.com> (raw)
In-Reply-To: <20131217165335.GA24799@redhat.com>
On Tue, Dec 17, 2013 at 05:53:35PM +0100, Oleg Nesterov wrote:
> On 12/16, Oleg Nesterov wrote:
> >
> > On 12/16, Andrea Arcangeli wrote:
> > >
> > > On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote:
> > > >
> > > > And compound_lock_irqsave() looks racy even after get_page_unless_zero().
> > > >
> > > > For example, suppose that page_head was already freed and then re-allocated
> > > > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
> > > > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
> > > > race with the non-atomic prep_compound_page()->__SetPageHead() ?
> > >
> > > Yes. We need to change to:
> > >
> > > if (order && (gfp_flags & __GFP_COMP))
> > > prep_compound_page(page, order);
> > > smp_wmb();
> > > /* as the compound_lock can be taken after it's refcounted */
> > > set_page_refcounted(page);
> > >
> > > __SetPageHead uses bts asm insn so literally only a "lock" prefix is
> > > missing in a assembly instruction. So the race window is incredibly
> > > small, but it must be fixed indeed. This also puts set_page_refcounted
> > > as the last action of buffered_rmqueue so there shouldn't be any other
> > > issues like this left in the page allocation code.
> > >
> > > Can you reorder set_page_refcount in your v2?
> >
> > OK. I'll try to make something on Wednesday.
>
> Yes, I will, but...
>
> I can't stop thinking about another change. What if we simply change
> __split_huge_page_refcount() to also do compound_lock/unlock(page_tail)
> in a main loop?
>
> This way we can greatly simplify get/put_page paths, we can rely on
> compound_lock(sub-page) and avoid get_page_unless_zero(page_head).
> Yes, this will make _split a bit slower, but PG_compound_lock should
> not be contended? And we should change page_tail->flags carefully, but
> this looks simple.
>
> Or this is not possible/desirable?
That would be 512 nested spinlocks instead of 1, last time I did
something like that in mm_take_all_locks people weren't too pleased as
it started firing lockdeps complains too. Generally I try to avoid
taking too many locks nested if I can.
mm_take_all_locks is fine because it only runs when you register an mm
into a device driver, so it is a very rare event and not performance
critical at all, it is a slow path by all means (only runs when you
start a virtual machine or start X with nvidia etc..). So it is not a
concern. split_huge_page to the contrary could run in a flood if
you're unlucky. split_huge_page is needed not just to handle non-THP
aware paths that mostly disappeared by now, but also when you truncate
a vma so that a THP doesn't fit in it anymore. So it's up to userland
how frequently it needs to run.
I think it's reasonable to consider it though, but then it's not
guaranteed that a put_page on a THP tail is more frequent than
split_huge_page. Keep in mind we do the get_page_unless_zero to
stabilize the head to take the compound_lock on it, only for the
tails, never for the heads. So this restricts it to _only_ the
put_page following a gup_fast. Only gup_fast can ever take a reference
on the tail pages of a THP. Nothing else can.
I intend to add the foll_flags to gup_fast parameter so we remove
FOLL_GET from it in the KVM page fault to avoid doing atomic_inc
immediately followed by atomic_dec when establishing sptes. So the
only transient mappings that cannot be converted to pin-less
mmu_notifier will have to run put_page in gup_fast.
In short you're only going to help O_DIRECT with that change, and
removing 1 locked op for every 4k written to disk may not offset the
cost of 511 locked ops in split_huge_page.
Still worth thinking about it, but not obvious win in my view (plus
the lockdep trouble with taking too many locks nested). Comments welcome.
Thanks!
Andrea
next prev parent reply other threads:[~2013-12-17 18:06 UTC|newest]
Thread overview: 114+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-10 15:47 process 'stuck' at exit Dave Jones
2013-12-10 18:40 ` Linus Torvalds
2013-12-10 19:18 ` Thomas Gleixner
2013-12-10 19:55 ` Linus Torvalds
2013-12-10 20:27 ` Dave Jones
2013-12-10 20:34 ` Thomas Gleixner
2013-12-10 20:55 ` Dave Jones
2013-12-10 21:25 ` Darren Hart
2013-12-10 21:28 ` Thomas Gleixner
2013-12-10 21:39 ` Steven Rostedt
2013-12-10 20:33 ` Thomas Gleixner
2013-12-10 20:43 ` Linus Torvalds
2013-12-10 21:17 ` Thomas Gleixner
2013-12-10 20:35 ` Oleg Nesterov
2013-12-10 20:49 ` Dave Jones
2013-12-10 21:06 ` Darren Hart
2013-12-10 21:12 ` Dave Jones
2013-12-10 21:18 ` Linus Torvalds
2013-12-10 21:24 ` Linus Torvalds
2013-12-10 21:32 ` Dave Jones
2013-12-10 21:49 ` Linus Torvalds
2013-12-10 21:56 ` Dave Jones
2013-12-10 21:59 ` Linus Torvalds
2013-12-10 22:07 ` Dave Jones
2013-12-11 12:45 ` Ingo Molnar
2013-12-10 21:34 ` Oleg Nesterov
2013-12-10 21:41 ` Dave Jones
2013-12-10 21:57 ` Linus Torvalds
2013-12-10 22:02 ` Dave Jones
2013-12-10 22:09 ` Oleg Nesterov
2013-12-10 22:19 ` Linus Torvalds
2013-12-10 22:33 ` Linus Torvalds
2013-12-10 22:38 ` Darren Hart
2013-12-10 22:57 ` Thomas Gleixner
2013-12-10 23:05 ` Linus Torvalds
2013-12-10 23:28 ` Thomas Gleixner
2013-12-10 23:31 ` Al Viro
2013-12-11 17:08 ` Oleg Nesterov
2013-12-11 17:18 ` Thomas Gleixner
2013-12-11 17:56 ` Oleg Nesterov
2013-12-11 19:18 ` PATCH? introduce get_compound_page (Was: process 'stuck' at exit) Oleg Nesterov
2013-12-13 15:10 ` Andrea Arcangeli
2013-12-13 16:22 ` Oleg Nesterov
2013-12-13 17:34 ` Andrea Arcangeli
2013-12-16 18:36 ` Oleg Nesterov
2013-12-16 20:19 ` Andrea Arcangeli
2013-12-16 20:46 ` Oleg Nesterov
2013-12-17 16:53 ` Oleg Nesterov
2013-12-17 18:06 ` Andrea Arcangeli [this message]
2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov
2013-12-18 19:19 ` [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() Oleg Nesterov
2013-12-18 19:19 ` [PATCH -mm 2/7] mm: thp: change __get_page_tail() to use __put_nontail_page() Oleg Nesterov
2013-12-18 19:19 ` [PATCH -mm 3/7] mm: change release_pages() to use put_page() rather than put_compound_page() Oleg Nesterov
2013-12-18 19:19 ` [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() Oleg Nesterov
2013-12-18 19:36 ` Peter Zijlstra
2013-12-18 19:50 ` Oleg Nesterov
2013-12-18 19:20 ` [PATCH -mm 5/7] mm: thp: reorganize out_put_single code in __put_page_tail() Oleg Nesterov
2013-12-18 19:20 ` [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() Oleg Nesterov
2013-12-18 21:37 ` Linus Torvalds
2013-12-19 16:29 ` Oleg Nesterov
2013-12-18 19:20 ` [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it Oleg Nesterov
2013-12-18 19:28 ` Peter Zijlstra
2013-12-18 19:40 ` Oleg Nesterov
2013-12-19 19:08 ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov
2013-12-19 19:09 ` [PATCH 1/1] " Oleg Nesterov
2013-12-23 11:43 ` Andrea Arcangeli
2014-01-03 19:55 ` [PATCH v2 0/1] " Oleg Nesterov
2014-01-03 19:55 ` [PATCH v2 1/1] " Oleg Nesterov
2014-01-03 21:00 ` Andrew Morton
2014-01-04 16:43 ` Oleg Nesterov
2014-01-08 11:54 ` Mel Gorman
2014-01-08 13:14 ` Mel Gorman
2014-01-08 16:13 ` Oleg Nesterov
2014-01-08 18:02 ` Mel Gorman
2014-01-08 19:04 ` Oleg Nesterov
2014-01-09 11:27 ` Mel Gorman
2014-01-09 14:04 ` Oleg Nesterov
2014-01-09 18:52 ` Andrea Arcangeli
2014-01-09 19:43 ` Oleg Nesterov
2014-01-09 21:36 ` Andrea Arcangeli
2014-01-10 16:12 ` Oleg Nesterov
2014-01-10 16:50 ` Peter Zijlstra
2014-01-10 16:12 ` Mel Gorman
2013-12-20 14:19 ` [PATCH 0/1] " Martin Schwidefsky
2013-12-16 20:19 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov
2013-12-16 20:19 ` [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() Oleg Nesterov
2013-12-16 20:19 ` [PATCH -mm 2/2] mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail() Oleg Nesterov
2013-12-16 20:27 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Andrea Arcangeli
2013-12-10 22:42 ` process 'stuck' at exit Thomas Gleixner
2013-12-10 22:48 ` Linus Torvalds
2013-12-10 22:58 ` Darren Hart
2013-12-10 23:01 ` Dave Jones
2013-12-10 23:00 ` Dave Jones
2013-12-11 0:05 ` Steven Rostedt
2013-12-11 0:23 ` Dave Jones
2013-12-11 0:55 ` Dave Jones
2013-12-14 20:17 ` Oleg Nesterov
2013-12-11 4:09 ` Dave Jones
2013-12-12 4:26 ` Dave Jones
2013-12-12 5:29 ` Darren Hart
2013-12-10 22:51 ` Darren Hart
2013-12-10 23:24 ` Al Viro
2013-12-11 16:31 ` Mel Gorman
2013-12-11 16:38 ` Thomas Gleixner
2013-12-11 17:57 ` Mel Gorman
2013-12-12 19:00 ` Andrea Arcangeli
2013-12-12 19:03 ` Linus Torvalds
2013-12-10 22:09 ` Steven Rostedt
2013-12-10 22:16 ` Dave Jones
2013-12-10 22:21 ` Steven Rostedt
2013-12-10 22:27 ` Dave Jones
2013-12-11 1:02 ` Mel Gorman
2013-12-10 20:57 ` Darren Hart
2013-12-10 21:09 ` Dave Jones
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131217180603.GC5441@redhat.com \
--to=aarcange@redhat.com \
--cc=davej@redhat.com \
--cc=dvhart@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=oleg@redhat.com \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).