From: Mel Gorman <mgorman@suse.de>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nai Xia <nai.xia@gmail.com>, Hugh Dickins <hughd@google.com>,
Pawel Sikora <pluto@agmk.net>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, jpiszcz@lucidpixels.com, arekm@pld-linux.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
Date: Tue, 1 Nov 2011 12:07:26 +0000 [thread overview]
Message-ID: <20111101120726.GA25123@suse.de> (raw)
In-Reply-To: <1320082040-1190-1-git-send-email-aarcange@redhat.com>
On Mon, Oct 31, 2011 at 06:27:20PM +0100, Andrea Arcangeli wrote:
> migrate was doing a rmap_walk with speculative lock-less access on
> pagetables. That could lead it to not serialize properly against
> mremap PT locks. But a second problem remains in the order of vmas in
> the same_anon_vma list used by the rmap_walk.
>
> If vma_merge would succeed in copy_vma, the src vma could be placed
> after the dst vma in the same_anon_vma list. That could still lead
> migrate to miss some pte.
>
For future reference, why? How about this as an explanation?
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That leads to a race
between migration and mremap whereby a migration PTE is left behind.
mremap migration
create dst VMA
rmap_walk
finds dst, no ptes, release PTL
move_ptes
copies src PTEs to dst
finds src, ptes empty, releases PTL
The migration PTE is now left behind because the order of VMAs matter.
> This patch adds a anon_vma_order_tail() function to force the dst vma
> at the end of the list before mremap starts to solve the problem.
>
Document the alternative just in case?
"One fix would be to have mremap take the anon_vma lock which would
serialise migration and mremap but this would hurt scalability. Instead,
this patch adds....."
I would also prefer something like anon_vma_moveto_tail() but maybe
it's just me that sees "order" and thinks "high-order allocation".
> If the mremap is very large and there are a lots of parents or childs
> sharing the anon_vma root lock, this should still scale better than
> taking the anon_vma root lock around every pte copy practically for
> the whole duration of mremap.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> include/linux/rmap.h | 1 +
> mm/mmap.c | 8 ++++++++
> mm/rmap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 53 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..45eb098 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
> int anon_vma_prepare(struct vm_area_struct *);
> void unlink_anon_vmas(struct vm_area_struct *);
> int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> +void anon_vma_order_tail(struct vm_area_struct *);
> int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
> void __anon_vma_link(struct vm_area_struct *);
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a65efd4..a5858dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> */
> if (vma_start >= new_vma->vm_start &&
> vma_start < new_vma->vm_end)
> + /*
> + * No need to call anon_vma_order_tail() in
> + * this case because the same PT lock will
> + * serialize the rmap_walk against both src
> + * and dst vmas.
> + */
> *vmap = new_vma;
> + else
> + anon_vma_order_tail(new_vma);
> } else {
> new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
> if (new_vma) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..6dbc165 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
> }
>
> /*
> + * Some rmap walk that needs to find all ptes/hugepmds without false
> + * negatives (like migrate and split_huge_page) running concurrent
> + * with operations that copy or move pagetables (like mremap() and
> + * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
> + * in a certain order: the dst_vma must be placed after the src_vma in
> + * the list. This is always guaranteed by fork() but mremap() needs to
> + * call this function to enforce it in case the dst_vma isn't newly
> + * allocated and chained with the anon_vma_clone() function but just
> + * an extension of a pre-existing vma through vma_merge.
> + *
> + * NOTE: the same_anon_vma list can still be changed by other
> + * processes while mremap runs because mremap doesn't hold the
> + * anon_vma mutex to prevent modifications to the list while it
> + * runs. All we need to enforce is that the relative order of this
> + * process vmas isn't changing (we don't care about other vmas
> + * order). Each vma corresponds to an anon_vma_chain structure so
> + * there's no risk that other processes calling anon_vma_order_tail()
> + * and changing the same_anon_vma list under mremap() will screw with
> + * the relative order of this process vmas in the list, because we
> + * won't alter the order of any vma that isn't belonging to this
> + * process. And there can't be another anon_vma_order_tail running
> + * concurrently with mremap() coming from this process because we hold
> + * the mmap_sem for the whole mremap(). fork() ordering dependency
> + * also shouldn't be affected because we only care that the parent
> + * vmas are placed in the list before the child vmas and
> + * anon_vma_order_tail won't reorder vmas from either the fork parent
> + * or child.
> + */
> +void anon_vma_order_tail(struct vm_area_struct *dst)
> +{
> + struct anon_vma_chain *pavc;
> + struct anon_vma *root = NULL;
> +
> + list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
> + struct anon_vma *anon_vma = pavc->anon_vma;
> + VM_BUG_ON(pavc->vma != dst);
> + root = lock_anon_vma_root(root, anon_vma);
> + list_del(&pavc->same_anon_vma);
> + list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
> + }
> + unlock_anon_vma_root(root);
> +}
> +
This is following the same rules as anon_vma_clone() and I didn't see a
flaw in your explanation as to why it's safe.
Acked-by: Mel Gorman <mgorman@suse.de>
--
Mel Gorman
SUSE Labs
WARNING: multiple messages have this Message-ID (diff)
From: Mel Gorman <mgorman@suse.de>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nai Xia <nai.xia@gmail.com>, Hugh Dickins <hughd@google.com>,
Pawel Sikora <pluto@agmk.net>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, jpiszcz@lucidpixels.com, arekm@pld-linux.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma
Date: Tue, 1 Nov 2011 12:07:26 +0000 [thread overview]
Message-ID: <20111101120726.GA25123@suse.de> (raw)
In-Reply-To: <1320082040-1190-1-git-send-email-aarcange@redhat.com>
On Mon, Oct 31, 2011 at 06:27:20PM +0100, Andrea Arcangeli wrote:
> migrate was doing a rmap_walk with speculative lock-less access on
> pagetables. That could lead it to not serialize properly against
> mremap PT locks. But a second problem remains in the order of vmas in
> the same_anon_vma list used by the rmap_walk.
>
> If vma_merge would succeed in copy_vma, the src vma could be placed
> after the dst vma in the same_anon_vma list. That could still lead
> migrate to miss some pte.
>
For future reference, why? How about this as an explanation?
If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That leads to a race
between migration and mremap whereby a migration PTE is left behind.
mremap migration
create dst VMA
rmap_walk
finds dst, no ptes, release PTL
move_ptes
copies src PTEs to dst
finds src, ptes empty, releases PTL
The migration PTE is now left behind because the order of VMAs matter.
> This patch adds a anon_vma_order_tail() function to force the dst vma
> at the end of the list before mremap starts to solve the problem.
>
Document the alternative just in case?
"One fix would be to have mremap take the anon_vma lock which would
serialise migration and mremap but this would hurt scalability. Instead,
this patch adds....."
I would also prefer something like anon_vma_moveto_tail() but maybe
it's just me that sees "order" and thinks "high-order allocation".
> If the mremap is very large and there are a lots of parents or childs
> sharing the anon_vma root lock, this should still scale better than
> taking the anon_vma root lock around every pte copy practically for
> the whole duration of mremap.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> include/linux/rmap.h | 1 +
> mm/mmap.c | 8 ++++++++
> mm/rmap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 53 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..45eb098 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -120,6 +120,7 @@ void anon_vma_init(void); /* create anon_vma_cachep */
> int anon_vma_prepare(struct vm_area_struct *);
> void unlink_anon_vmas(struct vm_area_struct *);
> int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> +void anon_vma_order_tail(struct vm_area_struct *);
> int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
> void __anon_vma_link(struct vm_area_struct *);
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a65efd4..a5858dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> */
> if (vma_start >= new_vma->vm_start &&
> vma_start < new_vma->vm_end)
> + /*
> + * No need to call anon_vma_order_tail() in
> + * this case because the same PT lock will
> + * serialize the rmap_walk against both src
> + * and dst vmas.
> + */
> *vmap = new_vma;
> + else
> + anon_vma_order_tail(new_vma);
> } else {
> new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
> if (new_vma) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..6dbc165 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
> }
>
> /*
> + * Some rmap walk that needs to find all ptes/hugepmds without false
> + * negatives (like migrate and split_huge_page) running concurrent
> + * with operations that copy or move pagetables (like mremap() and
> + * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
> + * in a certain order: the dst_vma must be placed after the src_vma in
> + * the list. This is always guaranteed by fork() but mremap() needs to
> + * call this function to enforce it in case the dst_vma isn't newly
> + * allocated and chained with the anon_vma_clone() function but just
> + * an extension of a pre-existing vma through vma_merge.
> + *
> + * NOTE: the same_anon_vma list can still be changed by other
> + * processes while mremap runs because mremap doesn't hold the
> + * anon_vma mutex to prevent modifications to the list while it
> + * runs. All we need to enforce is that the relative order of this
> + * process vmas isn't changing (we don't care about other vmas
> + * order). Each vma corresponds to an anon_vma_chain structure so
> + * there's no risk that other processes calling anon_vma_order_tail()
> + * and changing the same_anon_vma list under mremap() will screw with
> + * the relative order of this process vmas in the list, because we
> + * won't alter the order of any vma that isn't belonging to this
> + * process. And there can't be another anon_vma_order_tail running
> + * concurrently with mremap() coming from this process because we hold
> + * the mmap_sem for the whole mremap(). fork() ordering dependency
> + * also shouldn't be affected because we only care that the parent
> + * vmas are placed in the list before the child vmas and
> + * anon_vma_order_tail won't reorder vmas from either the fork parent
> + * or child.
> + */
> +void anon_vma_order_tail(struct vm_area_struct *dst)
> +{
> + struct anon_vma_chain *pavc;
> + struct anon_vma *root = NULL;
> +
> + list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
> + struct anon_vma *anon_vma = pavc->anon_vma;
> + VM_BUG_ON(pavc->vma != dst);
> + root = lock_anon_vma_root(root, anon_vma);
> + list_del(&pavc->same_anon_vma);
> + list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
> + }
> + unlock_anon_vma_root(root);
> +}
> +
This is following the same rules as anon_vma_clone() and I didn't see a
flaw in your explanation as to why it's safe.
Acked-by: Mel Gorman <mgorman@suse.de>
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2011-11-01 12:07 UTC|newest]
Thread overview: 137+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-10-12 18:12 kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Paweł Sikora
2011-10-13 23:16 ` Hugh Dickins
2011-10-13 23:30 ` Hugh Dickins
2011-10-13 23:30 ` Hugh Dickins
2011-10-16 16:11 ` Christoph Hellwig
2011-10-16 16:11 ` Christoph Hellwig
2011-10-16 23:54 ` Andrea Arcangeli
2011-10-16 23:54 ` Andrea Arcangeli
2011-10-17 18:51 ` Hugh Dickins
2011-10-17 18:51 ` Hugh Dickins
2011-10-17 22:05 ` Andrea Arcangeli
2011-10-17 22:05 ` Andrea Arcangeli
2011-10-19 7:43 ` Mel Gorman
2011-10-19 7:43 ` Mel Gorman
2011-10-19 13:39 ` Linus Torvalds
2011-10-19 13:39 ` Linus Torvalds
2011-10-19 19:42 ` Hugh Dickins
2011-10-19 19:42 ` Hugh Dickins
2011-10-20 6:30 ` Paweł Sikora
2011-10-20 6:30 ` Paweł Sikora
2011-10-20 6:51 ` Linus Torvalds
2011-10-20 6:51 ` Linus Torvalds
2011-10-21 6:54 ` Nai Xia
2011-10-21 6:54 ` Nai Xia
2011-10-21 7:35 ` Pawel Sikora
2011-10-21 7:35 ` Pawel Sikora
2011-10-20 12:51 ` Nai Xia
2011-10-20 12:51 ` Nai Xia
2011-10-20 18:36 ` Hugh Dickins
2011-10-21 6:22 ` Nai Xia
2011-10-21 6:22 ` Nai Xia
2011-10-21 8:07 ` Pawel Sikora
2011-10-21 8:07 ` Pawel Sikora
2011-10-21 9:07 ` Nai Xia
2011-10-21 9:07 ` Nai Xia
2011-10-21 21:36 ` Paweł Sikora
2011-10-21 21:36 ` Paweł Sikora
2011-10-22 6:21 ` Nai Xia
2011-10-22 6:21 ` Nai Xia
2011-10-22 16:42 ` Paweł Sikora
2011-10-22 16:42 ` Paweł Sikora
[not found] ` <CAPQyPG5HJKTo8AEy_khdJeciTgtNQepK6XLcpzvPF8PYS0V-Lw@mail.gmail.com>
2011-10-25 7:33 ` Pawel Sikora
2011-10-20 9:11 ` Nai Xia
2011-10-20 9:11 ` Nai Xia
2011-10-21 15:56 ` Mel Gorman
2011-10-21 15:56 ` Mel Gorman
2011-10-21 17:21 ` Nai Xia
2011-10-21 17:21 ` Nai Xia
2011-10-21 17:41 ` Andrea Arcangeli
2011-10-21 17:41 ` Andrea Arcangeli
2011-10-21 22:50 ` Andrea Arcangeli
2011-10-21 22:50 ` Andrea Arcangeli
2011-10-22 5:52 ` Nai Xia
2011-10-22 5:52 ` Nai Xia
2011-10-31 17:14 ` Andrea Arcangeli
2011-10-31 17:14 ` Andrea Arcangeli
2011-10-31 17:27 ` [PATCH] mremap: enforce rmap src/dst vma ordering in case of vma_merge succeeding in copy_vma Andrea Arcangeli
2011-10-31 17:27 ` Andrea Arcangeli
2011-11-01 12:07 ` Mel Gorman [this message]
2011-11-01 12:07 ` Mel Gorman
2011-11-01 14:35 ` Nai Xia
2011-11-01 14:35 ` Nai Xia
2011-11-04 7:31 ` Hugh Dickins
2011-11-04 7:31 ` Hugh Dickins
2011-11-04 14:34 ` Nai Xia
2011-11-04 14:34 ` Nai Xia
2011-11-04 15:59 ` Pawel Sikora
2011-11-04 15:59 ` Pawel Sikora
2011-11-05 2:21 ` Nai Xia
2011-11-05 2:21 ` Nai Xia
2011-11-04 19:16 ` Hugh Dickins
2011-11-04 20:54 ` Andrea Arcangeli
2011-11-04 20:54 ` Andrea Arcangeli
2011-11-05 0:09 ` Nai Xia
2011-11-05 0:09 ` Nai Xia
2011-11-05 2:21 ` Hugh Dickins
2011-11-05 3:07 ` Andrea Arcangeli
2011-11-05 3:07 ` Andrea Arcangeli
2011-11-05 17:06 ` Andrea Arcangeli
2011-11-05 17:06 ` Andrea Arcangeli
2011-12-08 3:24 ` David Rientjes
2011-12-08 3:24 ` David Rientjes
2011-12-08 12:42 ` Andrea Arcangeli
2011-12-08 12:42 ` Andrea Arcangeli
2011-12-09 0:08 ` Andrew Morton
2011-12-09 0:08 ` Andrew Morton
2011-12-09 1:55 ` Andrea Arcangeli
2011-12-09 1:55 ` Andrea Arcangeli
2011-11-04 23:56 ` Andrea Arcangeli
2011-11-04 23:56 ` Andrea Arcangeli
2011-11-05 0:21 ` Nai Xia
2011-11-05 0:21 ` Nai Xia
2011-11-05 0:59 ` Nai Xia
2011-11-05 0:59 ` Nai Xia
2011-11-05 1:33 ` Andrea Arcangeli
2011-11-05 1:33 ` Andrea Arcangeli
2011-11-05 2:00 ` Nai Xia
2011-11-05 2:00 ` Nai Xia
2011-11-07 13:14 ` Mel Gorman
2011-11-07 13:14 ` Mel Gorman
2011-11-07 15:42 ` Andrea Arcangeli
2011-11-07 15:42 ` Andrea Arcangeli
2011-11-07 16:28 ` Mel Gorman
2011-11-07 16:28 ` Mel Gorman
2011-11-09 1:25 ` Andrea Arcangeli
2011-11-09 1:25 ` Andrea Arcangeli
2011-11-11 9:14 ` Nai Xia
2011-11-11 9:14 ` Nai Xia
2011-11-16 14:00 ` Andrea Arcangeli
2011-11-16 14:00 ` Andrea Arcangeli
2011-11-17 0:16 ` Hugh Dickins
2011-11-17 0:16 ` Hugh Dickins
2011-11-17 2:49 ` Nai Xia
2011-11-17 2:49 ` Nai Xia
2011-11-17 6:21 ` Nai Xia
2011-11-17 6:21 ` Nai Xia
2011-11-17 18:42 ` Andrea Arcangeli
2011-11-17 18:42 ` Andrea Arcangeli
2011-11-18 1:42 ` Nai Xia
2011-11-18 1:42 ` Nai Xia
2011-11-18 2:17 ` Andrea Arcangeli
2011-11-18 2:17 ` Andrea Arcangeli
2011-11-19 9:15 ` Nai Xia
2011-11-19 9:15 ` Nai Xia
2011-10-22 5:07 ` kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Nai Xia
2011-10-22 5:07 ` Nai Xia
2011-10-31 16:34 ` Andrea Arcangeli
2011-10-31 16:34 ` Andrea Arcangeli
2011-10-16 22:37 ` Linus Torvalds
2011-10-16 22:37 ` Linus Torvalds
2011-10-17 3:02 ` Hugh Dickins
2011-10-17 3:02 ` Hugh Dickins
2011-10-17 3:09 ` Linus Torvalds
2011-10-17 3:09 ` Linus Torvalds
2011-10-18 19:17 ` Paweł Sikora
2011-10-19 7:30 ` Mel Gorman
2011-10-21 12:44 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20111101120726.GA25123@suse.de \
--to=mgorman@suse.de \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=arekm@pld-linux.org \
--cc=hughd@google.com \
--cc=jpiszcz@lucidpixels.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=nai.xia@gmail.com \
--cc=pluto@agmk.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.