Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma

From: Nai Xia <nai.xia@gmail.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Mel Gorman <mgorman@suse.de>, Pawel Sikora <pluto@agmk.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, jpiszcz@lucidpixels.com, arekm@pld-linux.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
Date: Thu, 17 Nov 2011 10:49:24 +0800
Message-ID: <201111171049.24779.nai.xia@gmail.com>
In-Reply-To: <alpine.LSU.2.00.1111161540060.1861@sister.anvils>

On Thursday 17 November 2011 08:16:57 Hugh Dickins wrote:
> On Wed, 16 Nov 2011, Andrea Arcangeli wrote:
> > On Wed, Nov 09, 2011 at 02:25:42AM +0100, Andrea Arcangeli wrote:
> > > Also note, if we find a way to enforce orderings in the prio tree (not
> > > sure if it's possible, apparently it's already using list_add_tail
> > > so..), then we could also remove the i_mmap_lock from mremap and fork.
> > 
> > I'm not optimistic we can enforce ordering there. Being a tree it's
> > walked in range order.
> > 
> > I thought of another solution that would avoid having to reorder the
> > list in mremap and avoid the i_mmap_mutex to be added to fork (and
> > then we can remove it from mremap too). The solution is to rmap_walk
> > twice. I mean two loops over the same_anon_vma for those rmap walks
> > that must be reliable (that includes two calls of
> > unmap_mapping_range). For both same_anon_vma and prio tree.
> > 
> > Reading truncate_pagecache I see two loops already and a comment
> > saying it's for fork(), to avoid leaking ptes in the child. So fork is
> > probably ok already without having to take the i_mmap_mutex, but then
> > I wonder why that also doesn't fix mremap if we do two loops there and
> > why that i_mmap_mutex is really needed in mremap considering those two
> > calls already present in truncate_pagecache. I wonder if that was a
> > "theoretical" fix that missed the fact truncate already walks the prio
> > tree twice, so it doesn't matter if the rmap_walk goes in the opposite
> > direction of move_page_tables? That i_mmap_lock in mremap (now
> > i_mmap_mutex) is there since start of git history. The double loop was
> > introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
> > possible that i_mmap_mutex is now useless (after
> > d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
> > already taking care of mremap too and that i_mmap_mutex can now be
> > removed.
> 
> As you found, the mremap locking long predates truncation's double unmap.
> 
> That's an interesting point, and you may be right - though, what about
> the *very* unlikely case where unmap_mapping_range looks at new vma
> when pte is in old, then at old vma when pte is in new, then
> move_page_tables runs out of memory and cannot complete, then the
> second unmap_mapping_range looks at old vma while pte is still in new
> (I guess this needs some other activity to have jumbled the prio_tree,
> and may just be impossible), then at new (to be abandoned) vma after
> pte has moved back to old.
> 
> Probably not an everyday occurrence :)
> 
> But, setting that aside, I've always thought of that second call to
> unmap_mapping_range() as a regrettable expedient that we should try
> to eliminate e.g. by checking for private mappings in the first pass,
> and skipping the second call if there were none.
> 
> But since nobody ever complained about that added overhead, I never
> got around to bothering; and you may consider the i_mmap_mutex in
> move_ptes a more serious unnecessary overhead.
> 
> By the way, you mention "a comment saying it's for fork()": I don't
> find "fork" anywhere in mm/truncate.c, my understanding is in this
> comment (probably mine) from truncate_pagecache():

I think you guys are talking about two different COWs:

Andrea's question is about a new VMA being created by fork() between
the two loops, with PTEs being copied into it.

And you are referring to new PTEs getting COWed by __do_fault() in
the same VMA before the cache pages are really dropped.

From my point of view, the two loops there really have nothing to do
with fork(). As you said, they are only for COWed ptes missed in the
same VMA before the cache page really becomes invisible to
find_get_page().
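
To make the same-VMA COW case concrete, here is a toy user-space model
of the sequence, not kernel code. All toy_* names are hypothetical
stand-ins, and the model assumes the simplest possible state: one file
page, one private mapping, and a write fault that can only COW while
the file page is still findable in the page cache.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state for one file page mapped privately in one VMA (hypothetical). */
struct toy_state {
	bool page_in_cache;	/* would find_get_page() still find the page? */
	bool cowed_pte;		/* does the VMA hold a private COWed copy? */
};

/* Stand-in for unmap_mapping_range() with even_cows set: zap the pte. */
static void toy_unmap(struct toy_state *s)
{
	s->cowed_pte = false;
}

/* Stand-in for a write fault: COW needs the file page to be findable. */
static void toy_cow_fault(struct toy_state *s)
{
	if (s->page_in_cache)
		s->cowed_pte = true;
}

/* Stand-in for truncate_inode_pages(): the page leaves the cache. */
static void toy_truncate_pages(struct toy_state *s)
{
	s->page_in_cache = false;
}

/* Returns true if a stale COWed pte survives the truncation. */
bool toy_truncate(bool second_unmap)
{
	struct toy_state s = { .page_in_cache = true, .cowed_pte = false };

	toy_unmap(&s);		/* first pass, for efficiency */
	toy_cow_fault(&s);	/* racing fault in the same VMA */
	toy_truncate_pages(&s);
	assert(!s.page_in_cache);
	toy_cow_fault(&s);	/* too late, nothing left to COW from */
	if (second_unmap)
		toy_unmap(&s);	/* second pass, for correctness */
	return s.cowed_pte;
}
```

In this model toy_truncate(false) leaves a stale COWed pte behind and
toy_truncate(true) does not. No second VMA is involved anywhere, which
is the point: this race does not need fork().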




As for Andrea's reasoning, this is how I see the race:

1. fork() is safe without the tree lock/mutex after the second loop,
for the same reason that it is safe for try_to_unmap_file(): the new
VMA is linked at the list tail of the *same* tree node as the old VMA
in the vma prio_tree, so vma_prio_tree_foreach() visits the old and
the new VMA in the proper order. And fork() has no error path that
copies page tables backward, which would need the reverse order.

2. Partial mremap is not safe here without the tree lock/mutex,
because the src and dst VMAs are different prio_tree nodes, and their
relative order is not meant to be tampered with.
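
The ordering argument in both points can be sketched with a toy
user-space model, not kernel code. The toy_* names are hypothetical,
and the model only covers the worst-case interleaving, in which the pte
moves between the walker's two VMA visits. Any other interleaving lets
the walker find the pte whatever the order.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_vma { TOY_SRC, TOY_DST };

/*
 * An rmap walker visits two VMAs, `first` then `second`, while a pte
 * moves from `from` to `to` between the two visits.
 * Returns true if the walker sees the pte at either visit.
 */
bool toy_walk_finds_pte(enum toy_vma first, enum toy_vma second,
			enum toy_vma from, enum toy_vma to)
{
	enum toy_vma pte_at = from;
	bool found;

	assert(first != second);
	found = (pte_at == first);		/* first visit */
	pte_at = to;				/* concurrent pte move */
	found = found || (pte_at == second);	/* second visit */
	return found;
}
```

Walking src then dst finds a pte that moves src to dst. Walking dst
then src misses it, which is the case in 2. when the two prio_tree
nodes come in the wrong order. A backward move (dst to src, as on an
error path) is missed by the src-then-dst walk, which is why the lack
of such a path in fork() matters in 1.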



Nai

> 
> 	/*
> 	 * unmap_mapping_range is called twice, first simply for
> 	 * efficiency so that truncate_inode_pages does fewer
> 	 * single-page unmaps.  However after this first call, and
> 	 * before truncate_inode_pages finishes, it is possible for
> 	 * private pages to be COWed, which remain after
> 	 * truncate_inode_pages finishes, hence the second
> 	 * unmap_mapping_range call must be made for correctness.
> 	 */
> 
> The second call was not (I think) necessary when we relied upon
> truncate_count, but became necessary once Nick relied upon page lock
> (the page lock on the file page providing no guarantee for the COWed
> page).
> 
> Hugh
>

