Re: Handling NUMA page migration


From: Mel Gorman <mgorman@suse.de>
To: Frank Mehnert <frank.mehnert@oracle.com>
Cc: Robin Holt <holt@sgi.com>,
	linux-mm@kvack.org,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Hugh Dickins <hughd@google.com>
Subject: Re: Handling NUMA page migration
Date: Wed, 5 Jun 2013 13:34:00 +0100
Message-ID: <20130605123400.GA1936@suse.de>
In-Reply-To: <201306051235.35678.frank.mehnert@oracle.com>

On Wed, Jun 05, 2013 at 12:35:35PM +0200, Frank Mehnert wrote:
> On Wednesday 05 June 2013 12:10:19 Mel Gorman wrote:
> > On Tue, Jun 04, 2013 at 06:58:07AM -0500, Robin Holt wrote:
> > > > B) 1. allocate memory with alloc_pages()
> > > >    2. SetPageReserved()
> > > >    3. vm_mmap() to allocate a userspace mapping
> > > >    4. vm_insert_page()
> > > >    5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
> > > >
> > > >       (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND |
> > > >       0xff)
> > > > 
> > > > At least the memory allocated like B) is affected by automatic NUMA
> > > > page migration. I'm not sure about A).
> > > > 
> > > > 1. How can I prevent automatic NUMA page migration on this memory?
> > > > 2. Can NUMA page migration also be handled on such kind of memory
> > > >    without preventing migration?
> > 
> > Page migration does not expect a PageReserved && PageLRU page. The only
> > reserved check that is made by migration is for the zero page and that
> > happens in the syscall path for move_pages() which is not used by either
> > compaction or automatic balancing.
> > 
> > At some point you must have a driver that is setting PageReserved on
> > anonymous pages that is later encountered by automatic numa balancing
> > during a NUMA hinting fault.  I expect this is an out-of-tree driver or
> > a custom kernel of some sort. Memory should be pinned by elevating the
> > reference count of the page, not setting PageReserved.
> 
> Yes, this is ring 0 code from VirtualBox. The VBox ring 0 driver does the
> steps which are shown above. Setting PageReserved is not only for pinning
> but also for fork() protection.

Offhand I don't see what setting PageReserved on an LRU page has to do
with fork() protection. If the VMA should not be copied by fork then use
MADV_DONTFORK.
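
As a rough userspace sketch of the MADV_DONTFORK suggestion: the helper
names region_set_dontfork() and region_mapped_in_child() are made up for
illustration and are not part of any driver. The second helper forks and
uses mincore() in the child to see whether the range is still mapped
there. This assumes Linux and a page-aligned range.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel not to copy this range on fork(). */
int region_set_dontfork(void *addr, size_t len)
{
    assert(len > 0);
    return madvise(addr, len, MADV_DONTFORK);
}

/*
 * Hypothetical helper: fork and report whether the child still has the
 * range mapped. Returns 1 if mapped, 0 if not mapped, -1 on error.
 */
int region_mapped_in_child(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    int status, code;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: mincore() fails with ENOMEM on an unmapped range. */
        unsigned char *vec = malloc(npages);
        if (!vec)
            _exit(2);
        if (mincore(addr, len, vec) == 0)
            _exit(1);
        _exit(errno == ENOMEM ? 0 : 2);
    }

    /* Parent: translate the child's exit code into the return value. */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

Called on an anonymous mapping, region_mapped_in_child() should report
the range as mapped before region_set_dontfork() and as not mapped
afterwards.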

> I've tried to do get_page() as well but
> it did not help preventing the migration during NUMA balancing.
> 

I think you mean elevating the page count did not prevent the unmapping. The
elevated count should have prevented the actual migration but would not
prevent the unmapping.
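
To make the distinction concrete, here is a toy model. It is not kernel
code: struct toy_page, toy_numa_scan() and toy_numa_fault() are invented
for illustration, and the reference-count comparison is a simplification
of what migration does. The scan step always unmaps; only the later
migration step looks at the reference count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page; not the kernel's struct page. */
struct toy_page {
    int refcount;  /* references held, including the mapping's own */
    bool mapped;   /* true while a present PTE points at the page */
    int node;      /* NUMA node currently backing the page */
};

/* Scanner step: the PTE is made inaccessible so the next touch faults.
 * Nothing here looks at the reference count. */
void toy_numa_scan(struct toy_page *p)
{
    p->mapped = false;
}

/*
 * Hinting-fault step: migrate only if nobody else holds a reference,
 * then restore the mapping either way. Returns true if the page moved.
 */
bool toy_numa_fault(struct toy_page *p, int target_node, int expected_refs)
{
    bool migrated = false;

    assert(expected_refs > 0);
    if (p->refcount == expected_refs && p->node != target_node) {
        p->node = target_node;
        migrated = true;
    }
    p->mapped = true;
    return migrated;
}
```

In this model a page with an extra reference is still unmapped by
toy_numa_scan(), but toy_numa_fault() leaves it on its original node.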

> As I wrote, the code for allocating + mapping the memory assumes that
> the memory is finally pinned and will be never unmapped. That assumption
> might be wrong or wrong under certain/rare conditions. I would like to
> know these conditions and how we can prevent them from happening or how
> we can handle them correctly.

Memory compaction for THP allocations will break that assumption, as
compaction ignores VM_LOCKED. I strongly suspect that moving a process
into a cpuset bound to another node would also break it. If a process
like numad is running, it would probably break VirtualBox as well,
because it triggers migration from userspace. It is a fragile assumption
to make.

> > It's not particularly clear how you avoid hitting the same bug due to THP
> > and memory compaction to be honest but maybe your setup hits a steady
> > state that simply never hit the problem or it happens rarely and it was
> > not identified.
> 
> I'm currently using the stock Ubuntu 13.04 generic kernel (3.8.0-23),

and an out-of-tree driver which is what is hitting the problem.

A few of your options, in order of estimated time to completion, are:

1. Disable NUMA balancing within your driver, or fail to start if it is
   enabled.
2. Create a patch that adds a new NUMA_PTE_SCAN_IGNORE value for
   mm->first_nid (see include/linux/mm_types.h). In kernel/sched/fair.c,
   add a check so that an mm with first_nid == NUMA_PTE_SCAN_IGNORE is
   skipped. Document that only VirtualBox needs this and set it within
   your driver. This will not fix the compaction cases or numad using
   cpusets to migrate your processes, though.
3. When the driver affects a region, set mm->numa_next_reset and
   mm->numa_next_scan to large values to prevent the pages being unmapped.
   This would be very fragile, could break again in the future, and is
   ugly.
4. Add a check in change_pte_range() for the !prot_numa case to check
   PageReserved. This will prevent automatic numa balancing unmapping the
   page. Document that only virtualbox requires this.
5. Add a check in change_pte_range() for an elevated page count.
   Document that there is no point unmapping a page for a NUMA hinting
   fault that will only fail migration later anyway, which is true albeit
   of marginal benefit. Then, in the vbox driver, elevate the page count,
   do away with the PageReserved trick, and use MADV_DONTFORK to prevent
   copying at fork time.
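
For option 1, the "fail to start" half could look roughly like the
userspace check below. This is only a sketch: numa_balancing_enabled() is
a made-up helper, and it assumes the kernel exposes the setting as an
integer in a sysctl file such as /proc/sys/kernel/numa_balancing, which
may not exist on the 3.8 kernel discussed here. The path is passed in
rather than hard-coded for that reason.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read an integer switch from a sysctl-style file.
 * Returns 1 if the value is non-zero (enabled), 0 if it is zero
 * (disabled), and -1 if the file is missing or cannot be parsed.
 */
int numa_balancing_enabled(const char *path)
{
    int value;
    FILE *f;

    assert(path != NULL);

    f = fopen(path, "r");
    if (!f)
        return -1;

    if (fscanf(f, "%d", &value) != 1) {
        fclose(f);
        return -1;
    }

    fclose(f);
    return value != 0;
}
```

A caller would refuse to start when
numa_balancing_enabled("/proc/sys/kernel/numa_balancing") returns 1, and
would have to decide for itself how to treat -1, since a missing file
does not prove that balancing is off.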

-- 
Mel Gorman
SUSE Labs
