From: Mel Gorman <mgorman@suse.de>
To: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
Andrea Arcangeli <aarcange@redhat.com>,
Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Hugh Dickins <hughd@google.com>,
Thomas Gleixner <tglx@linutronix.de>,
Paul Turner <pjt@google.com>, Hillf Danton <dhillf@gmail.com>,
David Rientjes <rientjes@google.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Alex Shi <lkml.alex@gmail.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 00/49] Automatic NUMA Balancing v10
Date: Tue, 11 Dec 2012 01:02:01 +0000
Message-ID: <20121211010201.GP1009@suse.de>
In-Reply-To: <20121210152405.GJ1009@suse.de>
On Mon, Dec 10, 2012 at 03:24:05PM +0000, Mel Gorman wrote:
> For example, I think that point 5 above is the potential source of the
> corruption because you're not flushing the TLBs for the PTEs you are
> updating in batch. Granted, you're relaxing rather than restricting access
> so it should be ok and at worst cause a spurious fault, but I also find
> it suspicious that you do not recheck pte_same under the PTL when doing
> the final PTE update.
Looking again, the lack of a pte_same check should be ok. The addr,
addr_start, ptep and ptep_start handling is a little messy but also looks fine.
You're not accidentally crossing a PMD boundary. You should be protected
against huge pages being collapsed underneath you as you hold mmap_sem for
read. If the first page in the pmd (or VMA) is not present then
target_nid == -1 which gets passed into __do_numa_page. This check
if (target_nid == -1 || target_nid == page_nid)
goto out;
then means you never actually migrate for that whole PMD and will just
clear the PTEs. Possibly wrong, but not what we're looking for. Holding
PTL across task_numa_fault is bad, but not the bad we're looking for.
/me scratches his head
The machine is still unavailable, so in an attempt to rattle this out I
prototyped the equivalent patch for balancenuma and then went back to numacore
to see whether I could spot a major difference. Comparing them, there is no
guarantee you clear pte_numa for the address that was originally faulted if a
racing fault cleared it underneath you, but in itself that should not be an
issue. Your use of ptep++ instead of pte_offset_map() might break on 32-bit
with NUMA support if PTE pages are stored in highmem. Still, the wrong kind
of wrong.
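To illustrate the highmem point, here is the pattern I mean. This is a sketch
only, not lifted from either tree: with CONFIG_HIGHPTE the PTE page may live
in highmem, so each access has to be bracketed by pte_offset_map() (a
kmap_atomic() underneath) and pte_unmap(), whereas bare ptep++ arithmetic is
only safe when the PTE page is permanently mapped in lowmem.

	for (; addr < end_addr; addr += PAGE_SIZE) {
		/* kmap_atomic() the PTE page on CONFIG_HIGHPTE */
		pte_t *ptep = pte_offset_map(pmd, addr);
		pte_t entry = *ptep;

		/* ... examine or update entry under suitable locking ... */

		pte_unmap(ptep);	/* kunmap_atomic() */
	}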
If the bug is indeed here, it's not obvious. I don't know why I'm
triggering it or why it only triggers for specjbb, as I cannot imagine
what the JVM would be doing that is that weird or that would not have
triggered before. Maybe we both suffer this type of problem, but
numacore's rate of migration is high enough to trigger it.
> Basically, if I felt that handling ptes in batch like this was of
> critical importance I would have implemented it very differently on top of
> balancenuma. I would have only taken the PTL lock if updating the PTE to
> keep contention down and redone racy checks under PTL, I'd have only used
> trylock for every non-faulted PTE and I would only have migrated if it
> was a remote->local copy. I certainly would not hold PTL while calling
> task_numa_fault(). I would have kept the handling on a per-pmd basis when
> it was expected that most PTEs underneath should be on the same node.
>
This is a prototype only, but it is what I was using as a reference to see
whether I could spot a problem in yours. It has not even been boot tested,
but it avoids remote->remote copies, contending on the PTL, and holding the
PTL longer than necessary (or it should, anyway).
---8<---
mm: numa: Batch pte handling
diff --git a/mm/memory.c b/mm/memory.c
index 33e20b3..f871d5d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3461,30 +3461,14 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
return mpol_misplaced(page, vma, addr);
}
-int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static
+int __do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd,
+ spinlock_t *ptl, bool only_local, bool *migrated)
{
struct page *page = NULL;
- spinlock_t *ptl;
int current_nid = -1;
int target_nid;
- bool migrated = false;
-
- /*
- * The "pte" at this point cannot be used safely without
- * validation through pte_unmap_same(). It's of NUMA type but
- * the pfn may be screwed if the read is non atomic.
- *
- * ptep_modify_prot_start is not called as this is clearing
- * the _PAGE_NUMA bit and it is not really expected that there
- * would be concurrent hardware modifications to the PTE.
- */
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*ptep, pte))) {
- pte_unmap_unlock(ptep, ptl);
- goto out;
- }
pte = pte_mknonnuma(pte);
set_pte_at(mm, addr, ptep, pte);
@@ -3493,7 +3477,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = vm_normal_page(vma, addr, pte);
if (!page) {
pte_unmap_unlock(ptep, ptl);
- return 0;
+ goto out;
}
current_nid = page_to_nid(page);
@@ -3509,15 +3493,89 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out;
}
+ /*
+ * Only do remote->local copies when handling PTEs in batch. This does
+ * mean we effectively lose the NUMA hinting fault if the workload
+ * was not converged on a PMD boundary. This is bad, but is it worse
+ * than doing a remote->remote copy?
+ */
+ if (only_local && target_nid != numa_node_id()) {
+ current_nid = -1;
+ put_page(page);
+ goto out;
+ }
+
/* Migrate to the requested node */
- migrated = migrate_misplaced_page(page, target_nid);
- if (migrated)
+ *migrated = migrate_misplaced_page(page, target_nid);
+ if (*migrated)
current_nid = target_nid;
out:
+ return current_nid;
+}
+
+int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+{
+ spinlock_t *ptl;
+ int current_nid = -1;
+ bool migrated = false;
+ unsigned long end_addr;
+
+ /*
+ * The "pte" at this point cannot be used safely without
+ * validation through pte_unmap_same(). It's of NUMA type but
+ * the pfn may be screwed if the read is non atomic.
+ *
+ * ptep_modify_prot_start is not called as this is clearing
+ * the _PAGE_NUMA bit and it is not really expected that there
+ * would be concurrent hardware modifications to the PTE.
+ */
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*ptep, pte))) {
+ pte_unmap_unlock(ptep, ptl);
+ goto out;
+ }
+
+ current_nid = __do_numa_page(mm, vma, addr, pte, ptep, pmd, ptl, false, &migrated);
+
+ /* Batch handle all PTEs in this area. PTL is not held initially */
+ addr = max(addr & PMD_MASK, vma->vm_start);
+ end_addr = min((addr + PMD_SIZE) & PMD_MASK, vma->vm_end);
+ for (; addr < end_addr; addr += PAGE_SIZE) {
+ bool batch_migrated = false;
+ int batch_nid = -1;
+
+ ptep = pte_offset_map(pmd, addr);
+ pte = *ptep;
+ if (!pte_present(pte))
+ continue;
+ if (!pte_numa(pte))
+ continue;
+
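+ /* Never contend with the fault path for the PTL; give up on the batch instead */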
+ if (!spin_trylock(ptl)) {
+ pte_unmap(ptep);
+ break;
+ }
+
+ /* Recheck PTE under lock */
+ if (!pte_same(*ptep, pte)) {
+ pte_unmap_unlock(ptep, ptl);
+ continue;
+ }
+
+ batch_nid = __do_numa_page(mm, vma, addr, pte, ptep, pmd, ptl, true, &batch_migrated);
+ if (batch_nid != -1)
+ task_numa_fault(batch_nid, 1, batch_migrated);
+ }
+
+out:
if (current_nid != -1)
task_numa_fault(current_nid, 1, migrated);
return 0;
+
}
/* NUMA hinting page fault entry point for regular pmds */
--