From: Jerome Glisse <j.glisse@gmail.com>
To: Frank Mehnert <frank.mehnert@oracle.com>
Cc: Robin Holt <holt@sgi.com>,
    linux-mm@kvack.org,
    "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
    Hugh Dickins <hughd@google.com>
Subject: Re: Handling NUMA page migration
Date: Tue, 4 Jun 2013 11:45:01 -0400
Message-ID: <20130604154500.GA5664@gmail.com>
In-Reply-To: <201306041414.52237.frank.mehnert@oracle.com>
[-- Attachment #1: Type: text/plain, Size: 2934 bytes --]
On Tue, Jun 04, 2013 at 02:14:45PM +0200, Frank Mehnert wrote:
> On Tuesday 04 June 2013 13:58:07 Robin Holt wrote:
> > This is probably more appropriate to be directed at the linux-mm
> > mailing list.
> >
> > On Tue, Jun 04, 2013 at 09:22:10AM +0200, Frank Mehnert wrote:
> > > Hi,
> > >
> > > our memory management on Linux hosts conflicts with NUMA page migration.
> > > I assume this problem existed for a longer time but Linux 3.8 introduced
> > > automatic NUMA page balancing which makes the problem visible on
> > > multi-node hosts leading to kernel oopses.
> > >
> > > NUMA page migration means that the physical address of a page changes.
> > > This is fatal if the application assumes that this never happens for
> > > that page as it was supposed to be pinned.
> > >
> > > > We have two kinds of pinned memory:
> > > >
> > > > A) 1. allocate memory in userland with mmap()
> > > >    2. madvise(MADV_DONTFORK)
> > > >    3. pin with get_user_pages().
> > > >    4. flush_dcache_page()
> > > >    5. vm_flags |= (VM_DONTCOPY | VM_LOCKED)
> > > >
> > > >    (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND |
> > > >     VM_DONTCOPY | VM_LOCKED | 0xff)
> >
> > I don't think this type of allocation should be affected. The
> > get_user_pages() call should elevate the pages reference count which
> > should prevent migration from completing. I would, however, wait for
> > a more definitive answer.
>
> Thanks Robin! Actually case B) is more important for us so I'm waiting
> for more feedback :)
>
> Frank
>
> > > > B) 1. allocate memory with alloc_pages()
> > > >    2. SetPageReserved()
> > > >    3. vm_mmap() to allocate a userspace mapping
> > > >    4. vm_insert_page()
> > > >    5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
> > > >
> > > >    (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND |
> > > >     0xff)
> > >
> > > At least the memory allocated like B) is affected by automatic NUMA page
> > > migration. I'm not sure about A).
> > >
> > > > 1. How can I prevent automatic NUMA page migration on this memory?
> > > > 2. Can NUMA page migration also be handled on such kind of memory without
> > > >    preventing migration?
> > >
> > > Thanks,
> > >
> > > Frank
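
Robin's point about case A), that the extra reference taken by get_user_pages() should keep migration from completing, can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct toy_page` and the `toy_*` helpers are invented for illustration and are not kernel APIs; the only idea carried over is that a migration attempt backs off when it sees more references than it can account for.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page; NOT the kernel's struct page. */
struct toy_page {
	int refcount;	/* all references currently held */
	int mapcount;	/* references that come from page-table mappings */
	int node;	/* NUMA node the page currently lives on */
};

/* Model of a long-term pin: just one extra reference. */
static void toy_pin(struct toy_page *p)
{
	p->refcount++;
}

static void toy_unpin(struct toy_page *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/*
 * Model of a migration attempt. The migrator takes its own temporary
 * reference and only proceeds if every remaining reference is explained
 * by a mapping. Any extra reference (for example a pin) makes it give up.
 */
static bool toy_try_migrate(struct toy_page *p, int target_node)
{
	int expected = 1 + p->mapcount;	/* migrator's ref + one per mapping */
	bool ok;

	p->refcount++;			/* migrator's temporary reference */
	ok = (p->refcount == expected);
	if (ok)
		p->node = target_node;	/* "move" the page */
	p->refcount--;			/* drop the temporary reference */
	return ok;
}
```

In this model a page that is mapped once and pinned once has a refcount of 2, so the migrator sees 3 where it expects 2 and leaves the page alone. Case B) has no such extra reference in the model, which is consistent with it being the case that is affected.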
I was looking at the migration code lately, and while I am not an expert in
this area at all, I think there is a bug in the way handle_mm_fault deals, or
rather does not deal, with migration entries.

When a huge page is migrated, its pmd is replaced with a special swap-entry
pmd, which is non-zero but has none of the huge pmd flags set, so none of the
handle_mm_fault paths detects it as a swap entry. The code then believes it
is a valid pmd and tries to allocate a pte under it, which should oops.
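
To make the failure mode concrete, here is a rough userspace sketch. Everything in it is an assumption made for illustration: the `TOY_*` bit layout and the `toy_*` functions are invented and do not match the real x86 pmd or swp_entry_t encoding. It only shows how a check that looks at "zero" and "huge" alone falls through to "page table" for a non-present, non-zero entry, and how testing the present bit first avoids that.

```c
#include <assert.h>
#include <stdint.h>

/* Toy pmd layout, NOT the real hardware or kernel encoding. */
#define TOY_PRESENT		(1ULL << 0)
#define TOY_HUGE		(1ULL << 7)
#define TOY_TYPE_SHIFT		1		/* bits 1..5: swap-style type */
#define TOY_TYPE_MASK		0x1fULL
#define TOY_OFFSET_SHIFT	8		/* bits 8..63: swap-style offset */
#define TOY_TYPE_MIGRATION	30ULL

enum toy_pmd_kind {
	TOY_PMD_NONE,
	TOY_PMD_TABLE,		/* points to a page of ptes */
	TOY_PMD_HUGE,		/* maps a huge page directly */
	TOY_PMD_MIGRATION,	/* huge page currently being migrated */
	TOY_PMD_UNKNOWN,
};

/* Build a non-present entry: present and huge bits clear, value non-zero. */
static uint64_t toy_make_migration_pmd(uint64_t pfn)
{
	assert((pfn >> (64 - TOY_OFFSET_SHIFT)) == 0);	/* pfn must fit */
	return (TOY_TYPE_MIGRATION << TOY_TYPE_SHIFT) |
	       (pfn << TOY_OFFSET_SHIFT);
}

/* Only checks "none" and "huge", like the path described above. */
static enum toy_pmd_kind toy_classify_naive(uint64_t pmd)
{
	if (pmd == 0)
		return TOY_PMD_NONE;
	if (pmd & TOY_HUGE)
		return TOY_PMD_HUGE;
	return TOY_PMD_TABLE;	/* a migration entry wrongly ends up here */
}

/* Looks at the present bit before trusting the entry as a table pointer. */
static enum toy_pmd_kind toy_classify_checked(uint64_t pmd)
{
	if (pmd == 0)
		return TOY_PMD_NONE;
	if (!(pmd & TOY_PRESENT)) {
		uint64_t type = (pmd >> TOY_TYPE_SHIFT) & TOY_TYPE_MASK;

		return type == TOY_TYPE_MIGRATION ? TOY_PMD_MIGRATION
						  : TOY_PMD_UNKNOWN;
	}
	if (pmd & TOY_HUGE)
		return TOY_PMD_HUGE;
	return TOY_PMD_TABLE;
}
```

With this layout, `toy_classify_naive()` reports a migration entry as a page table, which is the analogue of trying to allocate a pte under it, while `toy_classify_checked()` reports it as a migration entry so the caller can wait and retry.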

The attached patch is what I believe should be done (not even compile tested).
Again, I might be missing a subtlety somewhere else and may simply have missed
where huge migration entries are dealt with.

Cheers,
Jerome
[-- Attachment #2: 0001-mm-properly-handle-fault-on-huge-page-migration.patch --]
[-- Type: text/plain, Size: 1383 bytes --]
From 22d00055bdd4d88eb01958828e4c0121231a9e01 Mon Sep 17 00:00:00 2001
From: Jerome Glisse <jglisse@redhat.com>
Date: Tue, 4 Jun 2013 11:34:14 -0400
Subject: [PATCH] mm: properly handle fault on huge page migration

When huge page is being migrated it's pmd is non zero but does not have
any of the huge pmd flags set. It's a swap entry pmd. The handle_mm_fault
never check for this case and thus if a fault happen in the huge page
range while it's being migrated handle_mm_fault will interpret badly the
pmd.

Signed-off-by: Jerome Glisse <jglisse@redhat.com>
---
 mm/memory.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..e2a039c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3793,6 +3793,7 @@ retry:
 							  pmd, flags);
 	} else {
 		pmd_t orig_pmd = *pmd;
+		swp_entry_t entry;
 		int ret;

 		barrier();
@@ -3829,6 +3830,20 @@ retry:

 			return 0;
 		}
+
+		swp_entry_t entry = pte_to_swp_entry((pte_t)orig_pmd);
+		if (unlikely(non_swap_entry(entry))) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(mm, pmd, address);
+				/* Retry the fault */
+				return 0;
+			} else if (is_hwpoison_entry(entry)) {
+				return VM_FAULT_HWPOISON;
+			} else {
+				/* Something else is wrong invalid pmd print it ? */
+				return VM_FAULT_SIGBUS;
+			}
+		}
 	}

 	if (pmd_numa(*pmd))
--
1.7.11.7