Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma

From: Nai Xia <nai.xia@gmail.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Mel Gorman <mgorman@suse.de>, Pawel Sikora <pluto@agmk.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, jpiszcz@lucidpixels.com, arekm@pld-linux.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
Date: Thu, 17 Nov 2011 10:49:24 +0800
Message-ID: <201111171049.24779.nai.xia@gmail.com>
In-Reply-To: <alpine.LSU.2.00.1111161540060.1861@sister.anvils>

On Thursday 17 November 2011 08:16:57 Hugh Dickins wrote:
> On Wed, 16 Nov 2011, Andrea Arcangeli wrote:
> > On Wed, Nov 09, 2011 at 02:25:42AM +0100, Andrea Arcangeli wrote:
> > > Also note, if we find a way to enforce orderings in the prio tree (not
> > > sure if it's possible, apparently it's already using list_add_tail
> > > so..), then we could also remove the i_mmap_lock from mremap and fork.
> > 
> > I'm not optimistic we can enforce ordering there. Being a tree it's
> > walked in range order.
> > 
> > I thought of another solution that would avoid having to reorder the
> > list in mremap and avoid the i_mmap_mutex to be added to fork (and
> > then we can remove it from mremap too). The solution is to rmap_walk
> > twice. I mean two loops over the same_anon_vma for those rmap walks
> > that must be reliable (that includes two calls of
> > unmap_mapping_range). For both same_anon_vma and prio tree.
> > 
> > Reading truncate_pagecache I see two loops already and a comment
> > saying it's for fork(), to avoid leaking ptes in the child. So fork is
> > probably ok already without having to take the i_mmap_mutex, but then
> > I wonder why that also doesn't fix mremap if we do two loops there and
> > why that i_mmap_mutex is really needed in mremap considering those two
> > calls already present in truncate_pagecache. I wonder if that was a
> > "theoretical" fix that missed the fact truncate already walks the prio
> > tree twice, so it doesn't matter if the rmap_walk goes in the opposite
> > direction of move_page_tables? That i_mmap_lock in mremap (now
> > i_mmap_mutex) is there since start of git history. The double loop was
> > introduced in d00806b183152af6d24f46f0c33f14162ca1262a. So it's very
> > possible that i_mmap_mutex is now useless (after
> > d00806b183152af6d24f46f0c33f14162ca1262a) and the fix for fork, was
> > already taking care of mremap too and that i_mmap_mutex can now be
> > removed.
> 
> As you found, the mremap locking long predates truncation's double unmap.
> 
> That's an interesting point, and you may be right - though, what about
> the *very* unlikely case where unmap_mapping_range looks at new vma
> when pte is in old, then at old vma when pte is in new, then
> move_page_tables runs out of memory and cannot complete, then the
> second unmap_mapping_range looks at old vma while pte is still in new
> (I guess this needs some other activity to have jumbled the prio_tree,
> and may just be impossible), then at new (to be abandoned) vma after
> pte has moved back to old.
> 
> Probably not an everyday occurrence :)
> 
> But, setting that aside, I've always thought of that second call to
> unmap_mapping_range() as a regrettable expedient that we should try
> to eliminate e.g. by checking for private mappings in the first pass,
> and skipping the second call if there were none.
> 
> But since nobody ever complained about that added overhead, I never
> got around to bothering; and you may consider the i_mmap_mutex in
> move_ptes a more serious unnecessary overhead.
> 
> By the way, you mention "a comment saying it's for fork()": I don't
> find "fork" anywhere in mm/truncate.c, my understanding is in this
> comment (probably mine) from truncate_pagecache():

I think you two are talking about two different COW cases:

Andrea's question is about a new VMA being created by fork() between
the two loops, with PTEs being copied into it.

You, on the other hand, are referring to new PTEs being COWed by
__do_fault() in the same VMA before the cache pages are really dropped.

From my point of view, the two loops there have nothing to do with
fork(). As you said, they only catch COWed ptes that were missed in the
same VMA before the cache page really becomes invisible to
find_get_page().

As for Andrea's reasoning, here is how I see the race:

1. fork() is safe without the tree lock/mutex after the second loop, for
the same reason it is safe for try_to_unmap_file(): the new VMA is
linked at the list tail of the *same* tree node as the old VMA in the
vma prio_tree, so vma_prio_tree_foreach() visits the old and new VMAs
in the proper order. And fork() has no error path that copies page
tables backward, which would need the reverse order.

2. Partial mremap is not safe without the tree lock/mutex, because the
src and dst VMAs are different prio_tree nodes, and their relative
order in the walk is not enforced.
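
To make the ordering argument in the two points above concrete, here is
a toy userspace model, not kernel code. Everything in it (vma_id,
walk_finds_pte(), order_is_safe(), the move_after parameter) is a
hypothetical name invented for illustration. It assumes a single pte
that starts in the source vma and is moved once to the destination vma,
and a walker that visits the two vmas one after the other without any
lock against the mover.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two vmas involved in the move. */
enum vma_id { VMA_SRC = 0, VMA_DST = 1 };

/*
 * Toy model: the pte starts in VMA_SRC and is moved to VMA_DST just
 * before walker visit number `move_after` (0 = before the first visit,
 * 1 = between the two visits, 2 = only after the walk has finished).
 * Returns true if a walker visiting order[0] and then order[1] sees
 * the pte in the vma it is looking at.
 */
static bool walk_finds_pte(const enum vma_id order[2], int move_after)
{
	enum vma_id pte_at = VMA_SRC;

	assert(move_after >= 0 && move_after <= 2);

	for (int visit = 0; visit < 2; visit++) {
		if (visit == move_after)
			pte_at = VMA_DST;	/* the mover runs here */
		if (order[visit] == pte_at)
			return true;		/* walker found the pte */
	}
	return false;				/* walker missed the pte */
}

/* True if the walk finds the pte no matter when the move happens. */
static bool order_is_safe(const enum vma_id order[2])
{
	for (int move_after = 0; move_after <= 2; move_after++)
		if (!walk_finds_pte(order, move_after))
			return false;
	return true;
}
```

In this model, order_is_safe() holds for a walk that visits the source
before the destination and fails for the reverse walk, where the move
can slip in between the two visits. A move in the opposite direction
(such as an error path moving ptes back) would need the opposite walk
order, which is the property fork() does not depend on.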



Nai

> 
> 	/*
> 	 * unmap_mapping_range is called twice, first simply for
> 	 * efficiency so that truncate_inode_pages does fewer
> 	 * single-page unmaps.  However after this first call, and
> 	 * before truncate_inode_pages finishes, it is possible for
> 	 * private pages to be COWed, which remain after
> 	 * truncate_inode_pages finishes, hence the second
> 	 * unmap_mapping_range call must be made for correctness.
> 	 */
> 
> The second call was not (I think) necessary when we relied upon
> truncate_count, but became necessary once Nick relied upon page lock
> (the page lock on the file page providing no guarantee for the COWed
> page).
> 
> Hugh
> 

