From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 932546B006E for ; Fri, 16 Nov 2012 11:25:34 -0500 (EST) Received: by mail-ea0-f169.google.com with SMTP id a12so432634eaa.14 for ; Fri, 16 Nov 2012 08:25:32 -0800 (PST) From: Ingo Molnar Subject: [PATCH 00/19] latest numa/base patches Date: Fri, 16 Nov 2012 17:25:02 +0100 Message-Id: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins This is the split-out series of mm/ patches that got no objections from the latest (v15) posting of numa/core. If everyone is still fine with these then these will be merge candidates for v3.8. I left out the more contentious policy bits that people are still arguing about. 
The numa/base tree can also be found here: git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/base Thanks, Ingo -------------------> Andrea Arcangeli (1): numa, mm: Support NUMA hinting page faults from gup/gup_fast Gerald Schaefer (1): sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390 Ingo Molnar (1): mm/pgprot: Move the pgprot_modify() fallback definition to mm.h Lee Schermerhorn (3): mm/mpol: Add MPOL_MF_NOOP mm/mpol: Check for misplaced page mm/mpol: Add MPOL_MF_LAZY Peter Zijlstra (7): sched, numa, mm: Make find_busiest_queue() a method sched, numa, mm: Describe the NUMA scheduling problem formally mm/thp: Preserve pgprot across huge page split mm/mpol: Make MPOL_LOCAL a real policy mm/mpol: Create special PROT_NONE infrastructure mm/migrate: Introduce migrate_misplaced_page() mm/mpol: Use special PROT_NONE to migrate pages Ralf Baechle (1): sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation Rik van Riel (5): mm/generic: Only flush the local TLB in ptep_set_access_flags() x86/mm: Only do a local tlb flush in ptep_set_access_flags() x86/mm: Introduce pte_accessible() mm: Only flush the TLB when clearing an accessible pte x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++ arch/mips/include/asm/pgtable.h | 2 + arch/s390/include/asm/pgtable.h | 13 ++ arch/x86/include/asm/pgtable.h | 7 + arch/x86/mm/pgtable.c | 8 +- include/asm-generic/pgtable.h | 4 + include/linux/huge_mm.h | 19 +++ include/linux/mempolicy.h | 8 ++ include/linux/migrate.h | 7 + include/linux/migrate_mode.h | 3 + include/linux/mm.h | 32 +++++ include/uapi/linux/mempolicy.h | 16 ++- kernel/sched/fair.c | 20 +-- mm/huge_memory.c | 174 +++++++++++++++-------- mm/memory.c | 119 +++++++++++++++- mm/mempolicy.c | 143 +++++++++++++++---- mm/migrate.c | 85 ++++++++++-- mm/mprotect.c | 31 +++-- mm/pgtable-generic.c | 9 +- 19 files changed, 807 insertions(+), 123 deletions(-) create 
mode 100644 Documentation/scheduler/numa-problem.txt -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 022F96B0071 for ; Fri, 16 Nov 2012 11:25:36 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so2125055eek.14 for ; Fri, 16 Nov 2012 08:25:35 -0800 (PST) From: Ingo Molnar Subject: [PATCH 01/19] mm/generic: Only flush the local TLB in ptep_set_access_flags() Date: Fri, 16 Nov 2012 17:25:03 +0100 Message-Id: <1353083121-4560-2-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Michel Lespinasse From: Rik van Riel The function ptep_set_access_flags() is only ever used to upgrade access permissions to a page - i.e. they make it less restrictive. That means the only negative side effect of not flushing remote TLBs in this function is that other CPUs may incur spurious page faults, if they happen to access the same address, and still have a PTE with the old permissions cached in their TLB caches. Having another CPU maybe incur a spurious page fault is faster than always incurring the cost of a remote TLB flush, so replace the remote TLB flush with a purely local one. 
This should be safe on every architecture that correctly implements flush_tlb_fix_spurious_fault() to actually invalidate the local TLB entry that caused a page fault, as well as on architectures where the hardware invalidates TLB entries that cause page faults. In the unlikely event that you are hitting what appears to be an infinite loop of page faults, and 'git bisect' took you to this changeset, your architecture needs to implement flush_tlb_fix_spurious_fault() to actually flush the TLB entry. Signed-off-by: Rik van Riel Acked-by: Linus Torvalds Acked-by: Peter Zijlstra Cc: Andrew Morton Cc: Michel Lespinasse [ Changelog massage. ] Signed-off-by: Ingo Molnar --- mm/pgtable-generic.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index e642627..d8397da 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -12,8 +12,8 @@ #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* - * Only sets the access flags (dirty, accessed, and - * writable). Furthermore, we know it always gets set to a "more + * Only sets the access flags (dirty, accessed), as well as write + * permission. Furthermore, we know it always gets set to a "more * permissive" setting, which allows most architectures to optimize * this. We return whether the PTE actually changed, which in turn * instructs the caller to do things like update__mmu_cache. This @@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma, int changed = !pte_same(*ptep, entry); if (changed) { set_pte_at(vma->vm_mm, address, ptep, entry); - flush_tlb_page(vma, address); + flush_tlb_fix_spurious_fault(vma, address); } return changed; } -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
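Editorial illustration of patch 01/19 (not part of the posted series): a minimal user-space sketch of why a purely local flush is enough when a PTE is only ever upgraded. It models one page, one PTE and one cached TLB copy per CPU. Every name here (`toy_mm`, `toy_tlb_fill`, `toy_upgrade_local_flush`, `toy_write`) is hypothetical and none of this is kernel API; it only assumes the behaviour the changelog describes, namely that a stale remote entry costs one spurious fault and is then reloaded.

```c
#include <stdbool.h>

#define NCPUS 2

/* Toy model (assumption): one page, one PTE, one cached TLB copy per CPU. */
struct toy_mm {
	bool pte_writable;		/* permission stored in the page table */
	bool tlb_valid[NCPUS];		/* does this CPU cache the PTE?        */
	bool tlb_writable[NCPUS];	/* permission as cached by this CPU    */
	int  spurious_faults[NCPUS];	/* faults caused only by stale entries */
};

/* Load the current PTE into one CPU's TLB, as a TLB miss would. */
static void toy_tlb_fill(struct toy_mm *mm, int cpu)
{
	mm->tlb_valid[cpu] = true;
	mm->tlb_writable[cpu] = mm->pte_writable;
}

/* Upgrade the PTE to writable and invalidate only the local TLB entry. */
static void toy_upgrade_local_flush(struct toy_mm *mm, int cpu)
{
	mm->pte_writable = true;
	mm->tlb_valid[cpu] = false;
}

/*
 * Attempt a write from 'cpu'. Returns the number of spurious faults taken
 * before the write succeeds, or -1 if the PTE itself is still read-only.
 */
static int toy_write(struct toy_mm *mm, int cpu)
{
	int faults = 0;

	if (!mm->tlb_valid[cpu])
		toy_tlb_fill(mm, cpu);	/* plain TLB miss, not a fault */

	while (!mm->tlb_writable[cpu]) {
		if (!mm->pte_writable)
			return -1;	/* genuine protection fault */
		/* Spurious fault: drop the stale entry and reload it. */
		faults++;
		mm->spurious_faults[cpu]++;
		mm->tlb_valid[cpu] = false;
		toy_tlb_fill(mm, cpu);
	}
	return faults;
}
```

In this model, if both CPUs cache the read-only entry and CPU 0 performs the upgrade, CPU 0 writes without faulting while CPU 1 takes one spurious fault on its next write and none after that. That is the trade the changelog argues for: an occasional cheap fault in place of a remote flush on every upgrade.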
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id 3E8506B0074 for ; Fri, 16 Nov 2012 11:25:39 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so2125091eek.14 for ; Fri, 16 Nov 2012 08:25:37 -0800 (PST) From: Ingo Molnar Subject: [PATCH 02/19] x86/mm: Only do a local tlb flush in ptep_set_access_flags() Date: Fri, 16 Nov 2012 17:25:04 +0100 Message-Id: <1353083121-4560-3-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Michel Lespinasse From: Rik van Riel Because we only ever upgrade a PTE when calling ptep_set_access_flags(), it is safe to skip flushing entries on remote TLBs. The worst that can happen is a spurious page fault on other CPUs, which would flush that TLB entry. Lazily letting another CPU incur a spurious page fault occasionally is (much!) cheaper than aggressively flushing everybody else's TLB. Signed-off-by: Rik van Riel Acked-by: Linus Torvalds Acked-by: Peter Zijlstra Cc: Andrew Morton Cc: Michel Lespinasse Signed-off-by: Ingo Molnar --- arch/x86/mm/pgtable.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 8573b83..be3bb46 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd) free_page((unsigned long)pgd); } +/* + * Used to set accessed or dirty bits in the page table entries + * on other architectures. 
On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
 int ptep_set_access_flags(struct vm_area_struct *vma,
 			  unsigned long address, pte_t *ptep,
 			  pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
+		__flush_tlb_one(address);
 	}

 	return changed;
-- 
1.7.11.7

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 6529F6B0075 for ; Fri, 16 Nov 2012 11:25:40 -0500 (EST)
Received: by mail-ea0-f169.google.com with SMTP id a12so432634eaa.14 for ; Fri, 16 Nov 2012 08:25:39 -0800 (PST)
From: Ingo Molnar
Subject: [PATCH 03/19] sched, numa, mm: Make find_busiest_queue() a method
Date: Fri, 16 Nov 2012 17:25:05 +0100
Message-Id: <1353083121-4560-4-git-send-email-mingo@kernel.org>
In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli, Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Hugh Dickins

From: Peter Zijlstra

It's a bit awkward, but it was the least painful means of modifying the
queue selection. Used in a later patch to conditionally use a random queue.

Signed-off-by: Peter Zijlstra Cc: Paul Turner Cc: Lee Schermerhorn Cc: Christoph Lameter Cc: Rik van Riel Cc: Andrew Morton Cc: Linus Torvalds Link: http://lkml.kernel.org/n/tip-lfpez319yryvdhwqfqrh99f2@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6b800a1..6ab627e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3063,6 +3063,9 @@ struct lb_env { unsigned int loop; unsigned int loop_break; unsigned int loop_max; + + struct rq * (*find_busiest_queue)(struct lb_env *, + struct sched_group *); }; /* @@ -4236,13 +4239,14 @@ static int load_balance(int this_cpu, struct rq *this_rq, struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask); struct lb_env env = { - .sd = sd, - .dst_cpu = this_cpu, - .dst_rq = this_rq, - .dst_grpmask = sched_group_cpus(sd->groups), - .idle = idle, - .loop_break = sched_nr_migrate_break, - .cpus = cpus, + .sd = sd, + .dst_cpu = this_cpu, + .dst_rq = this_rq, + .dst_grpmask = sched_group_cpus(sd->groups), + .idle = idle, + .loop_break = sched_nr_migrate_break, + .cpus = cpus, + .find_busiest_queue = find_busiest_queue, }; cpumask_copy(cpus, cpu_active_mask); @@ -4261,7 +4265,7 @@ redo: goto out_balanced; } - busiest = find_busiest_queue(&env, group); + busiest = env.find_busiest_queue(&env, group); if (!busiest) { schedstat_inc(sd, lb_nobusyq[idle]); goto out_balanced; -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
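Editorial illustration of patch 03/19 (not part of the posted series): a minimal sketch of the "method in the environment struct" pattern the patch introduces, where the caller goes through a function pointer so that a later patch can swap the selection policy. All names below (`toy_rq`, `toy_env`, `toy_find_busiest`, `toy_find_first`, `toy_pick`) are hypothetical stand-ins, and `toy_find_first` is an invented alternative policy used only to show the swap; the real `struct lb_env` and `find_busiest_queue()` take different arguments.

```c
#include <stddef.h>

/* Hypothetical stand-in for a runqueue: just a CPU id and a load figure. */
struct toy_rq {
	int cpu;
	unsigned long load;
};

/* Hypothetical stand-in for struct lb_env, carrying its own selector. */
struct toy_env {
	struct toy_rq *rqs;
	size_t nr;
	struct toy_rq *(*find_busiest_queue)(struct toy_env *env);
};

/* Default policy: the queue with the highest load, or NULL if none exist. */
static struct toy_rq *toy_find_busiest(struct toy_env *env)
{
	struct toy_rq *busiest = NULL;
	size_t i;

	for (i = 0; i < env->nr; i++)
		if (!busiest || env->rqs[i].load > busiest->load)
			busiest = &env->rqs[i];
	return busiest;
}

/* Invented alternative policy, only to demonstrate swapping the method. */
static struct toy_rq *toy_find_first(struct toy_env *env)
{
	return env->nr ? &env->rqs[0] : NULL;
}

/* The caller no longer names a policy; it calls whatever env carries. */
static struct toy_rq *toy_pick(struct toy_env *env)
{
	return env->find_busiest_queue(env);
}
```

Initialising `.find_busiest_queue = toy_find_busiest` reproduces the old direct call, which mirrors how the patch sets `.find_busiest_queue = find_busiest_queue` in `load_balance()` without changing behaviour.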
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 251756B007D for ; Fri, 16 Nov 2012 11:25:43 -0500 (EST) Received: by mail-ea0-f169.google.com with SMTP id a12so432634eaa.14 for ; Fri, 16 Nov 2012 08:25:42 -0800 (PST) From: Ingo Molnar Subject: [PATCH 04/19] sched, numa, mm: Describe the NUMA scheduling problem formally Date: Fri, 16 Nov 2012 17:25:06 +0100 Message-Id: <1353083121-4560-5-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , "H. Peter Anvin" , Mike Galbraith From: Peter Zijlstra This is probably a first: formal description of a complex high-level computing problem, within the kernel source. Signed-off-by: Peter Zijlstra Cc: Linus Torvalds Cc: Andrew Morton Cc: Peter Zijlstra Cc: "H. Peter Anvin" Cc: Mike Galbraith Rik van Riel Link: http://lkml.kernel.org/n/tip-mmnlpupoetcatimvjEld16Pb@git.kernel.org [ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! 
]

Signed-off-by: Ingo Molnar
---
 Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
 1 file changed, 230 insertions(+)
 create mode 100644 Documentation/scheduler/numa-problem.txt

diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
new file mode 100644
index 0000000..a5d2fee
--- /dev/null
+++ b/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory, this memory can be spread over multiple
+physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
+node 'k' in [pages].
+
+If a task shares memory with another task let us denote this as:
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', i.e. the page placement.
+
+Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+       because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly
+      restrict p/s above to be the working-set. (It also makes explicit the
+      requirement for to change about a change in the working set.)
+
+      Doing this does have the nice property that it lets you use your frequency
+      measurement as a weak-ordering for the benefit a task would receive when
+      we can't fit everything.
+
+      e.g. task1 has working set 10mb, f=90%
+           task2 has working set 90mb, f=10%
+
+      Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+      from task1 being on the right node than task2.
) + +Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i': + + C: t_i -> {c_i, n_i} + +This gives us the total interconnect traffic between nodes 'k' and 'l', +'T_k,l', as: + + T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum bp_j,k + bs_j,k where n_i == k, n_j == l + +And our goal is to obtain C0 and M0 such that: + + T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l + +(note: we could introduce 'nc(k,l)' as the cost function of accessing memory + on node 'l' from node 'k', this would be useful for bigger NUMA systems + + pjt: I agree nice to have, but intuition suggests diminishing returns on more + usual systems given factors like things like Haswell's enormous 35mb l3 + cache and QPI being able to do a direct fetch.) + +(note: do we need a limit on the total memory per node?) + + + * fairness + +For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu +'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a +load 'L_n': + + L_n = 1/P_n * \Sum_i w_i for all c_i = n + +using that we can formulate a load difference between CPUs + + L_n,m = | L_n - L_m | + +Which allows us to state the fairness goal like: + + L_n,m(C0) =< L_n,m(C) for all C, n != m + +(pjt: It can also be usefully stated that, having converged at C0: + + | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) | + + Where G_n,m is the greedy partition of tasks between L_n and L_m. This is + the "worst" partition we should accept; but having it gives us a useful + bound on how much we can reasonably adjust L_n/L_m at a Pareto point to + favor T_n,m. ) + +Together they give us the complete multi-objective optimization problem: + + min_C,M [ L_n,m(C), T_k,l(C,M) ] + + + +Notes: + + - the memory bandwidth problem is very much an inter-process problem, in + particular there is no such concept as a process in the above problem. 
+
+ - the naive solution would completely prefer fairness over interconnect
+   traffic, the more complicated solution could pick another Pareto point using
+   an aggregate objective function such that we balance the loss of work
+   efficiency against the gain of running, we'd want to more or less suggest
+   there to be a fixed bound on the error from the Pareto line for any
+   such solution.
+
+References:
+
+  http://en.wikipedia.org/wiki/Mathematical_optimization
+  http://en.wikipedia.org/wiki/Multi-objective_optimization
+
+
+* warning, significant hand-waving ahead, improvements welcome *
+
+
+Partial solutions / approximations:
+
+ 1) have task node placement be a pure preference from the 'fairness' pov.
+
+This means we always prefer fairness over interconnect bandwidth. This reduces
+the problem to:
+
+  min_C,M [ T_k,l(C,M) ]
+
+ 2a) migrate memory towards 'n_i' (the task's node).
+
+This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
+provided 'n_i' stays stable enough and there's sufficient memory (looks like
+we might need memory limits for this).
+
+This does however not provide us with any 's_i' (shared) information. It does
+however remove 'M' since it defines memory placement in terms of task
+placement.
+
+XXX properties of this M vs a potential optimal
+
+ 2b) migrate memory towards 'n_i' using 2 samples.
+
+This separates pages into those that will migrate and those that will not due
+to the two samples not matching. We could consider the first to be of 'p_i'
+(private) and the second to be of 's_i' (shared).
+
+This interpretation can be motivated by the previously observed property that
+'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
+'s_i' (shared). (here we lose the need for memory limits again, since it
+becomes indistinguishable from shared).
+
+XXX include the statistical babble on double sampling somewhere near
+
+This reduces the problem further; we lose 'M' as per 2a, it further reduces
+the 'T_k,l' (interconnect traffic) term to only include shared (since per the
+above all private will be local):
+
+  T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
+
+[ more or less matches the state of sched/numa and describes its remaining
+  problems and assumptions. It should work well for tasks without significant
+  shared memory usage between tasks. ]
+
+Possible future directions:
+
+Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
+can evaluate it;
+
+ 3a) add per-task per node counters
+
+At fault time, count the number of pages the task faults on for each node.
+This should give an approximation of 'p_i' for the local node and 's_i,k' for
+all remote nodes.
+
+While these numbers provide pages per scan, and so have the unit [pages/s] they
+don't count repeat access and thus aren't actually representative of our
+bandwidth numbers.
+
+ 3b) additional frequency term
+
+Additionally (or instead if it turns out we don't need the raw 'p' and 's'
+numbers) we can approximate the repeat accesses by using the time since marking
+the pages as indication of the access frequency.
+
+Let 'I' be the interval of marking pages and 'e' the elapsed time since the
+last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
+If we then increment the node counters using 'a' instead of 1 we might get
+a better estimate of bandwidth terms.
+
+ 3c) additional averaging; can be applied on top of either a/b.
+
+[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
+  the decaying avg includes the old accesses and therefore has a measure of repeat
+  accesses.
+
+  Rik also argued that the sample frequency is too low to get accurate access
+  frequency measurements, I'm not entirely convinced, even at low sample
+  frequencies the avg elapsed time 'e' over multiple samples should still
+  give us a fair approximation of the avg access frequency 'a'.
+
+  So doing both b&c has a fair chance of working and allowing us to distinguish
+  between important and less important memory accesses.
+
+  Experimentation has shown no benefit from the added frequency term so far. ]
+
+This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
+'T_k,l'. Our optimization problem now reads:
+
+  min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
+
+And includes only shared terms, this makes sense since all task private memory
+will become local as per 2.
+
+This suggests that if there is significant shared memory, we should try and
+move towards it.
+
+ 4) move towards where 'most' memory is
+
+The simplest significance test is comparing the biggest shared 's_i,k' against
+the private 'p_i'. If we have more shared than private, move towards it.
+
+This effectively makes us move towards where most of our memory is and forms a
+feed-back loop with 2. We migrate memory towards us and we migrate towards
+where 'most' memory is.
+
+(Note: even if there were two tasks fully thrashing the same shared memory, it
+       is very rare for there to be a 50/50 split in memory, lacking a perfect
+       split, the smaller will move towards the larger. In case of the perfect
+       split, we'll tie-break towards the lower node number.)
+
+ 5) 'throttle' 4's node placement
+
+Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
+and show representative numbers, we should limit node-migration to not be
+faster than this.
+
+ n) poke holes in previous that require more stuff and describe it.
-- 
1.7.11.7

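Editorial illustration of patch 04/19 (not part of the posted series): a minimal sketch of two pieces of the problem statement, the fairness terms 'L_n' and 'L_n,m' and the step 4 significance test. The names (`toy_task`, `toy_cpu_load`, `toy_load_diff`, `toy_preferred_node`) are hypothetical, and the sketch rests on two assumptions: that per-node shared page counts are already available as a plain array, and that equal shared counts resolve to the lower node number, which is one reading of the tie-break note in the text. It is not the scheduler's implementation.

```c
#include <stddef.h>

/* Hypothetical task record: weight 'w_i' and the CPU 'c_i' it is mapped to. */
struct toy_task {
	double weight;
	int cpu;
};

/* L_n = 1/P_n * \Sum_i w_i for all c_i = n, with 'capacity' standing for P_n. */
static double toy_cpu_load(const struct toy_task *tasks, size_t nr,
			   int cpu, double capacity)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (tasks[i].cpu == cpu)
			sum += tasks[i].weight;
	return sum / capacity;
}

/* L_n,m = | L_n - L_m | */
static double toy_load_diff(double l_n, double l_m)
{
	double d = l_n - l_m;

	return d < 0 ? -d : d;
}

/*
 * Step 4: compare the biggest shared 's_i,k' against the private 'p_i'.
 * If there is more shared than private, prefer that node; otherwise stay.
 * Scanning upwards with a strict '>' keeps the lowest node on equal counts
 * (assumed tie-break).
 */
static int toy_preferred_node(int cur_node, unsigned long private_pages,
			      const unsigned long *shared_pages, size_t nr_nodes)
{
	size_t k, best = 0;

	for (k = 1; k < nr_nodes; k++)
		if (shared_pages[k] > shared_pages[best])
			best = k;

	if (nr_nodes && shared_pages[best] > private_pages)
		return (int)best;
	return cur_node;
}
```

With two tasks of weight 1024 on CPU 0, one task of weight 512 on CPU 1 and unit capacities, the loads are 2048 and 512 and their difference is 1536. A task with 100 private pages and shared counts {0, 250, 250} would prefer node 1, while the same task with 300 private pages would stay where it is.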
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 44A976B007D for ; Fri, 16 Nov 2012 11:25:45 -0500 (EST)
Received: by mail-ee0-f41.google.com with SMTP id d41so2125055eek.14 for ; Fri, 16 Nov 2012 08:25:44 -0800 (PST)
From: Ingo Molnar
Subject: [PATCH 05/19] sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390
Date: Fri, 16 Nov 2012 17:25:07 +0100
Message-Id: <1353083121-4560-6-git-send-email-mingo@kernel.org>
In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli, Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Hugh Dickins, Gerald Schaefer, Martin Schwidefsky, Heiko Carstens, Peter Zijlstra, Ralf Baechle

From: Gerald Schaefer

This patch adds an implementation of pmd_pgprot() for s390, in preparation
for future THP changes.

Reported-by: Stephen Rothwell Signed-off-by: Gerald Schaefer Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Peter Zijlstra Cc: Ralf Baechle Signed-off-by: Ingo Molnar --- arch/s390/include/asm/pgtable.h | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index dd647c9..098fc5a 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1240,6 +1240,19 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr, *pmdp = entry; } +static inline pgprot_t pmd_pgprot(pmd_t pmd) +{ + pgprot_t prot = PAGE_RW; + + if (pmd_val(pmd) & _SEGMENT_ENTRY_RO) { + if (pmd_val(pmd) & _SEGMENT_ENTRY_INV) + prot = PAGE_NONE; + else + prot = PAGE_RO; + } + return prot; +} + static inline unsigned long massage_pgprot_pmd(pgprot_t pgprot) { unsigned long pgprot_pmd = 0; -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id 4513F6B0083 for ; Fri, 16 Nov 2012 11:25:47 -0500 (EST)
Received: by mail-ee0-f41.google.com with SMTP id d41so2125091eek.14 for ; Fri, 16 Nov 2012 08:25:46 -0800 (PST)
From: Ingo Molnar
Subject: [PATCH 06/19] mm/thp: Preserve pgprot across huge page split
Date: Fri, 16 Nov 2012 17:25:08 +0100
Message-Id: <1353083121-4560-7-git-send-email-mingo@kernel.org>
In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli, Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Hugh Dickins

From: Peter Zijlstra

We're going to play games with page protections, so ensure we don't lose
them over a THP split.

Collapse seems to always allocate a new (huge) page, which should already
end up on the new target node, so losing protections there isn't a problem.

Signed-off-by: Peter Zijlstra Reviewed-by: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Cc: Andrea Arcangeli Link: http://lkml.kernel.org/n/tip-eyi25t4eh3l4cd2zp4k3bj6c@git.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/include/asm/pgtable.h | 1 + mm/huge_memory.c | 103 ++++++++++++++++++++--------------------- 2 files changed, 50 insertions(+), 54 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a1f780d..f85dccd 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -349,6 +349,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) } #define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK) +#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK) #define canon_pgprot(p) __pgprot(massage_pgprot(p)) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 40f17c3..176fe3d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1343,63 +1343,60 @@ static int __split_huge_page_map(struct page *page, int ret = 0, i; pgtable_t pgtable; unsigned long haddr; + pgprot_t prot; spin_lock(&mm->page_table_lock); pmd = page_check_address_pmd(page, mm, address, PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); - if (pmd) { - pgtable = pgtable_trans_huge_withdraw(mm); - pmd_populate(mm, &_pmd, pgtable); - - haddr = address; - for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { - pte_t *pte, entry; - BUG_ON(PageCompound(page+i)); - entry = mk_pte(page + i, vma->vm_page_prot); - entry = maybe_mkwrite(pte_mkdirty(entry), vma); - if (!pmd_write(*pmd)) - entry = pte_wrprotect(entry); - else - BUG_ON(page_mapcount(page) != 1); - if (!pmd_young(*pmd)) - entry = pte_mkold(entry); - pte = pte_offset_map(&_pmd, haddr); - BUG_ON(!pte_none(*pte)); - set_pte_at(mm, haddr, pte, entry); - pte_unmap(pte); - } + if (!pmd) + goto unlock; - smp_wmb(); /* make pte visible before pmd */ - /* - * Up to this point the pmd is present and huge and - * userland has the whole 
access to the hugepage - * during the split (which happens in place). If we - * overwrite the pmd with the not-huge version - * pointing to the pte here (which of course we could - * if all CPUs were bug free), userland could trigger - * a small page size TLB miss on the small sized TLB - * while the hugepage TLB entry is still established - * in the huge TLB. Some CPU doesn't like that. See - * http://support.amd.com/us/Processor_TechDocs/41322.pdf, - * Erratum 383 on page 93. Intel should be safe but is - * also warns that it's only safe if the permission - * and cache attributes of the two entries loaded in - * the two TLB is identical (which should be the case - * here). But it is generally safer to never allow - * small and huge TLB entries for the same virtual - * address to be loaded simultaneously. So instead of - * doing "pmd_populate(); flush_tlb_range();" we first - * mark the current pmd notpresent (atomically because - * here the pmd_trans_huge and pmd_trans_splitting - * must remain set at all times on the pmd until the - * split is complete for this pmd), then we flush the - * SMP TLB and finally we write the non-huge version - * of the pmd entry with pmd_populate. - */ - pmdp_invalidate(vma, address, pmd); - pmd_populate(mm, pmd, pgtable); - ret = 1; + prot = pmd_pgprot(*pmd); + pgtable = pgtable_trans_huge_withdraw(mm); + pmd_populate(mm, &_pmd, pgtable); + + for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { + pte_t *pte, entry; + + BUG_ON(PageCompound(page+i)); + entry = mk_pte(page + i, prot); + entry = pte_mkdirty(entry); + if (!pmd_young(*pmd)) + entry = pte_mkold(entry); + pte = pte_offset_map(&_pmd, haddr); + BUG_ON(!pte_none(*pte)); + set_pte_at(mm, haddr, pte, entry); + pte_unmap(pte); } + + smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */ + /* + * Up to this point the pmd is present and huge. 
+ * + * If we overwrite the pmd with the not-huge version, we could trigger + * a small page size TLB miss on the small sized TLB while the hugepage + * TLB entry is still established in the huge TLB. + * + * Some CPUs don't like that. See + * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383 + * on page 93. + * + * Thus it is generally safer to never allow small and huge TLB entries + * for overlapping virtual addresses to be loaded. So we first mark the + * current pmd not present, then we flush the TLB and finally we write + * the non-huge version of the pmd entry with pmd_populate. + * + * The above needs to be done under the ptl because pmd_trans_huge and + * pmd_trans_splitting must remain set on the pmd until the split is + * complete. The ptl also protects against concurrent faults due to + * making the pmd not-present. + */ + set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd)); + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); + pmd_populate(mm, pmd, pgtable); + ret = 1; + +unlock: spin_unlock(&mm->page_table_lock); return ret; @@ -2287,10 +2284,8 @@ static void khugepaged_do_scan(void) { struct page *hpage = NULL; unsigned int progress = 0, pass_through_head = 0; - unsigned int pages = khugepaged_pages_to_scan; bool wait = true; - - barrier(); /* write khugepaged_pages_to_scan to local stack */ + unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan); while (progress < pages) { if (!khugepaged_prealloc_page(&hpage, &wait)) -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 013546B0085 for ; Fri, 16 Nov 2012 11:25:49 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so2125055eek.14 for ; Fri, 16 Nov 2012 08:25:49 -0800 (PST) From: Ingo Molnar Subject: [PATCH 07/19] x86/mm: Introduce pte_accessible() Date: Fri, 16 Nov 2012 17:25:09 +0100 Message-Id: <1353083121-4560-8-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins From: Rik van Riel We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that the pte is associated with a page. However, for TLB flushing purposes, we would like to know whether the pte points to an actually accessible page. This allows us to skip remote TLB flushes for pages that are not actually accessible. Fill in this method for x86 and provide a safe (but slower) method on other architectures. Signed-off-by: Rik van Riel Signed-off-by: Peter Zijlstra Fixed-by: Linus Torvalds Cc: Andrew Morton Cc: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org [ Added Linus's review fixes. 
] Signed-off-by: Ingo Molnar --- arch/x86/include/asm/pgtable.h | 6 ++++++ include/asm-generic/pgtable.h | 4 ++++ 2 files changed, 10 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index f85dccd..a984cf9 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -408,6 +408,12 @@ static inline int pte_present(pte_t a) return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); } +#define pte_accessible pte_accessible +static inline int pte_accessible(pte_t a) +{ + return pte_flags(a) & _PAGE_PRESENT; +} + static inline int pte_hidden(pte_t pte) { return pte_flags(pte) & _PAGE_HIDDEN; diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index b36ce40..48fc1dc 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) #define move_pte(pte, prot, old_addr, new_addr) (pte) #endif +#ifndef pte_accessible +# define pte_accessible(pte) ((void)(pte),1) +#endif + #ifndef flush_tlb_fix_spurious_fault #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address) #endif -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id EEF406B0089 for ; Fri, 16 Nov 2012 11:25:51 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so2125091eek.14 for ; Fri, 16 Nov 2012 08:25:51 -0800 (PST) From: Ingo Molnar Subject: [PATCH 08/19] mm: Only flush the TLB when clearing an accessible pte Date: Fri, 16 Nov 2012 17:25:10 +0100 Message-Id: <1353083121-4560-9-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins From: Rik van Riel If ptep_clear_flush() is called to clear a page table entry that is not accessible by the CPU anyway, e.g. a _PAGE_PROTNONE page table entry, there is no need to flush the TLB on remote CPUs. Signed-off-by: Rik van Riel Signed-off-by: Peter Zijlstra Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.org Signed-off-by: Ingo Molnar --- mm/pgtable-generic.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index d8397da..0c8323f 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address, { pte_t pte; pte = ptep_get_and_clear((vma)->vm_mm, address, ptep); - flush_tlb_page(vma, address); + if (pte_accessible(pte)) + flush_tlb_page(vma, address); return pte; } #endif -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 20FCB6B0089 for ; Fri, 16 Nov 2012 11:25:54 -0500 (EST) Received: by mail-ea0-f169.google.com with SMTP id a12so432634eaa.14 for ; Fri, 16 Nov 2012 08:25:53 -0800 (PST) From: Ingo Molnar Subject: [PATCH 09/19] sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation Date: Fri, 16 Nov 2012 17:25:11 +0100 Message-Id: <1353083121-4560-10-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Ralf Baechle , Martin Schwidefsky , Heiko Carstens , Peter Zijlstra From: Ralf Baechle Add the pmd_pgprot() method that will be needed by the new NUMA code. Reported-by: Stephen Rothwell Signed-off-by: Ralf Baechle Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Peter Zijlstra Signed-off-by: Ingo Molnar --- arch/mips/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index c02158b..bbe4cda 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -89,6 +89,8 @@ static inline int is_zero_pfn(unsigned long pfn) extern void paging_init(void); +#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_PAGE_CHG_MASK) + /* * Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. 
-- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 341A26B0092 for ; Fri, 16 Nov 2012 11:25:56 -0500 (EST) Received: by mail-ea0-f169.google.com with SMTP id a12so432634eaa.14 for ; Fri, 16 Nov 2012 08:25:55 -0800 (PST) From: Ingo Molnar Subject: [PATCH 10/19] mm/pgprot: Move the pgprot_modify() fallback definition to mm.h Date: Fri, 16 Nov 2012 17:25:12 +0100 Message-Id: <1353083121-4560-11-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins pgprot_modify() is available on x86, but on other architectures it only gets defined in mm/mprotect.c - breaking the build if anything outside of mprotect.c tries to make use of this function. Move it to the generic pgprot area in mm.h, so that an upcoming patch can make use of it. 
Acked-by: Peter Zijlstra Cc: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/n/tip-nfvarGMj9gjavowroorkizb4@git.kernel.org Signed-off-by: Ingo Molnar --- include/linux/mm.h | 13 +++++++++++++ mm/mprotect.c | 7 ------- 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index fa06804..2a32cf8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -164,6 +164,19 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_TRIED 0x40 /* second try */ /* + * Some architectures (such as x86) may need to preserve certain pgprot + * bits, without complicating generic pgprot code. + * + * Most architectures don't care: + */ +#ifndef pgprot_modify +static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) +{ + return newprot; +} +#endif + +/* * vm_fault is filled by the the pagefault handler and passed to the vma's * ->fault function. The vma's ->fault is responsible for returning a bitmask * of VM_FAULT_xxx flags that give details about how the fault was handled. diff --git a/mm/mprotect.c b/mm/mprotect.c index a409926..e97b0d6 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -28,13 +28,6 @@ #include #include -#ifndef pgprot_modify -static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) -{ - return newprot; -} -#endif - static void change_pte_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable) -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 296856B0095 for ; Fri, 16 Nov 2012 11:25:59 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so2125055eek.14 for ; Fri, 16 Nov 2012 08:25:58 -0800 (PST) From: Ingo Molnar Subject: [PATCH 11/19] mm/mpol: Make MPOL_LOCAL a real policy Date: Fri, 16 Nov 2012 17:25:13 +0100 Message-Id: <1353083121-4560-12-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins From: Peter Zijlstra Make MPOL_LOCAL a real and exposed policy such that applications that relied on the previous default behaviour can explicitly request it. 
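A minimal userspace sketch of the mode handling described above, under stated assumptions: nodemasks are modelled as a plain unsigned long bitmask, and the model_* names are hypothetical and do not exist in the kernel. It only illustrates that an explicit "local" request must come with an empty nodemask and is then treated as a preferred policy with no preferred node; the checks for the other modes are a simplification.

```c
#include <assert.h>
#include <errno.h>

/* Mode numbering follows the uapi enum once MPOL_LOCAL is a real member. */
enum model_mode {
	MODEL_DEFAULT,
	MODEL_PREFERRED,
	MODEL_BIND,
	MODEL_INTERLEAVE,
	MODEL_LOCAL,
	MODEL_MAX	/* always last */
};

/*
 * Hypothetical helper: return the mode that would be stored internally,
 * or -EINVAL if the mode/nodemask combination is rejected.
 */
static int model_normalize_mode(int mode, unsigned long nodemask)
{
	if (mode < 0 || mode >= MODEL_MAX)
		return -EINVAL;

	switch (mode) {
	case MODEL_DEFAULT:
		/* default policy takes no nodes */
		return nodemask ? -EINVAL : MODEL_DEFAULT;
	case MODEL_LOCAL:
		/* "local" takes no nodes and becomes "preferred, no node" */
		if (nodemask)
			return -EINVAL;
		return MODEL_PREFERRED;
	case MODEL_PREFERRED:
		/* an empty mask already meant local allocation */
		return MODEL_PREFERRED;
	default:
		/* bind and interleave need at least one node */
		return nodemask ? mode : -EINVAL;
	}
}

/* Example checks; call from main() to exercise the model. */
void model_selftest(void)
{
	assert(model_normalize_mode(MODEL_LOCAL, 0) == MODEL_PREFERRED);
	assert(model_normalize_mode(MODEL_LOCAL, 0x1) == -EINVAL);
	assert(model_normalize_mode(MODEL_BIND, 0x3) == MODEL_BIND);
	assert(model_normalize_mode(MODEL_BIND, 0) == -EINVAL);
}
```

The point of the model is that the new mode adds no new internal state: it is only an explicit name for behaviour that a preferred policy with an empty nodemask already had.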
Requested-by: Christoph Lameter Reviewed-by: Rik van Riel Cc: Lee Schermerhorn Cc: Andrew Morton Cc: Linus Torvalds Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 9 ++++++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 23e62e0..3e835c9 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -20,6 +20,7 @@ enum { MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE, + MPOL_LOCAL, MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d04a8a5..72f50ba 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, (flags & MPOL_F_RELATIVE_NODES))) return ERR_PTR(-EINVAL); } + } else if (mode == MPOL_LOCAL) { + if (!nodes_empty(*nodes)) + return ERR_PTR(-EINVAL); + mode = MPOL_PREFERRED; } else if (nodes_empty(*nodes)) return ERR_PTR(-EINVAL); policy = kmem_cache_alloc(policy_cache, GFP_KERNEL); @@ -2397,7 +2401,6 @@ void numa_default_policy(void) * "local" is pseudo-policy: MPOL_PREFERRED with MPOL_F_LOCAL flag * Used only for mpol_parse_str() and mpol_to_str() */ -#define MPOL_LOCAL MPOL_MAX static const char * const policy_modes[] = { [MPOL_DEFAULT] = "default", @@ -2450,12 +2453,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context) if (flags) *flags++ = '\0'; /* terminate mode string */ - for (mode = 0; mode <= MPOL_LOCAL; mode++) { + for (mode = 0; mode < MPOL_MAX; mode++) { if (!strcmp(str, policy_modes[mode])) { break; } } - if (mode > MPOL_LOCAL) + if (mode >= MPOL_MAX) goto out; switch (mode) { -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id CE5B86B0099 for ; Fri, 16 Nov 2012 11:26:01 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so2125091eek.14 for ; Fri, 16 Nov 2012 08:26:01 -0800 (PST) From: Ingo Molnar Subject: [PATCH 12/19] mm/mpol: Add MPOL_MF_NOOP Date: Fri, 16 Nov 2012 17:25:14 +0100 Message-Id: <1353083121-4560-13-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Lee Schermerhorn From: Lee Schermerhorn This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy to mbind(). When the NOOP policy is used with the 'MOVE and 'LAZY flags, mbind() will map the pages PROT_NONE so that they will be migrated on the next touch. This allows an application to prepare for a new phase of operation where different regions of shared storage will be assigned to worker threads, w/o changing policy. Note that we could just use "default" policy in this case. However, this also allows an application to request that pages be migrated, only if necessary, to follow any arbitrary policy that might currently apply to a range of pages, without knowing the policy, or without specifying multiple mbind()s for ranges with different policies. [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. 
] Reported-by: Fengguang Wu Signed-off-by: Lee Schermerhorn Reviewed-by: Rik van Riel Cc: Andrew Morton Cc: Linus Torvalds Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 11 ++++++----- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 3e835c9..d23dca8 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -21,6 +21,7 @@ enum { MPOL_BIND, MPOL_INTERLEAVE, MPOL_LOCAL, + MPOL_NOOP, /* retain existing policy for range */ MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 72f50ba..c7c7c86 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, pr_debug("setting mode %d flags %d nodes[0] %lx\n", mode, flags, nodes ? nodes_addr(*nodes)[0] : -1); - if (mode == MPOL_DEFAULT) { + if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) { if (nodes && !nodes_empty(*nodes)) return ERR_PTR(-EINVAL); - return NULL; /* simply delete any existing policy */ + return NULL; } VM_BUG_ON(!nodes); @@ -1146,7 +1146,7 @@ static long do_mbind(unsigned long start, unsigned long len, if (start & ~PAGE_MASK) return -EINVAL; - if (mode == MPOL_DEFAULT) + if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) flags &= ~MPOL_MF_STRICT; len = (len + PAGE_SIZE - 1) & PAGE_MASK; @@ -2407,7 +2407,8 @@ static const char * const policy_modes[] = [MPOL_PREFERRED] = "prefer", [MPOL_BIND] = "bind", [MPOL_INTERLEAVE] = "interleave", - [MPOL_LOCAL] = "local" + [MPOL_LOCAL] = "local", + [MPOL_NOOP] = "noop", /* should not actually be used */ }; @@ -2458,7 +2459,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context) break; } } - if (mode >= MPOL_MAX) + if (mode >= MPOL_MAX || mode == MPOL_NOOP) goto out; switch (mode) { -- 1.7.11.7 -- To unsubscribe, send a message with
'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 4FA5E6B009E for ; Fri, 16 Nov 2012 11:26:05 -0500 (EST) Received: by mail-ea0-f169.google.com with SMTP id a12so432634eaa.14 for ; Fri, 16 Nov 2012 08:26:04 -0800 (PST) From: Ingo Molnar Subject: [PATCH 13/19] mm/mpol: Check for misplaced page Date: Fri, 16 Nov 2012 17:25:15 +0100 Message-Id: <1353083121-4560-14-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Lee Schermerhorn From: Lee Schermerhorn This patch provides a new function to test whether a page resides on a node that is appropriate for the mempolicy for the vma and address where the page is supposed to be mapped. This involves looking up the node where the page belongs. So, the function returns that node so that it may be used to allocate the page without consulting the policy again. A subsequent patch will call this function from the fault path. Because of this, I don't want to go ahead and allocate the page, e.g., via alloc_page_vma() only to have to free it if it has the correct policy. So, I just mimic the alloc_page_vma() node computation logic--sort of.
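The node computation described above can be sketched in userspace, assuming a simplified model: nodemasks are a plain unsigned long bitmask, the faulting CPU's node is passed in explicitly, and for the bind case the lowest allowed node stands in for "nearest allowed node" (the real code walks a zonelist). All model_* names are hypothetical.

```c
#include <assert.h>

enum model_mode { MODEL_PREFERRED, MODEL_BIND, MODEL_INTERLEAVE };

struct model_policy {
	enum model_mode mode;
	int migrate_on_fault;	/* stands in for the migrate-on-fault flag */
	int local;		/* stands in for the "local" flag */
	int preferred_node;
	unsigned long nodes;	/* bitmask of allowed nodes */
};

/* Number of nodes set in the mask. */
static unsigned long model_count_nodes(unsigned long nodes)
{
	unsigned long count = 0;

	for (; nodes; nodes >>= 1)
		count += nodes & 1;
	return count;
}

/* Return the n-th (0-based) node set in the mask, or -1 if there is none. */
static int model_nth_node(unsigned long nodes, unsigned long n)
{
	int nid;

	for (nid = 0; nid < (int)(8 * sizeof(nodes)); nid++) {
		if (!(nodes & (1UL << nid)))
			continue;
		if (n == 0)
			return nid;
		n--;
	}
	return -1;
}

/*
 * Return -1 if the page is fine where it is, otherwise the node it
 * should be on according to the policy.
 */
static int model_misplaced(const struct model_policy *pol, int page_nid,
			   int cpu_nid, unsigned long pgoff)
{
	int target;

	if (!pol->migrate_on_fault)
		return -1;

	switch (pol->mode) {
	case MODEL_INTERLEAVE:
		assert(pol->nodes != 0);
		target = model_nth_node(pol->nodes,
					pgoff % model_count_nodes(pol->nodes));
		break;
	case MODEL_PREFERRED:
		target = pol->local ? cpu_nid : pol->preferred_node;
		break;
	case MODEL_BIND:
		/* any node in the mask is acceptable */
		if (pol->nodes & (1UL << page_nid))
			return -1;
		target = model_nth_node(pol->nodes, 0);
		if (target < 0)
			return -1;
		break;
	default:
		return -1;
	}
	return target == page_nid ? -1 : target;
}

/* Example checks; call from main() to exercise the model. */
void model_selftest(void)
{
	struct model_policy pref = { .mode = MODEL_PREFERRED,
				     .migrate_on_fault = 1,
				     .preferred_node = 2 };
	struct model_policy il = { .mode = MODEL_INTERLEAVE,
				   .migrate_on_fault = 1, .nodes = 0x5 };

	assert(model_misplaced(&pref, 2, 0, 0) == -1);
	assert(model_misplaced(&pref, 1, 0, 0) == 2);
	assert(model_misplaced(&il, 0, 0, 3) == 2);
	assert(model_misplaced(&il, 0, 0, 4) == -1);
}
```

Returning the target node rather than a boolean mirrors the description: the caller can go straight to allocating on that node without consulting the policy a second time.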
Note: we could use this function to implement a MPOL_MF_STRICT behavior when migrating pages to match mbind() mempolicy--e.g., to ensure that pages in an interleaved range are reinterleaved rather than left where they are when they reside on any node in the interleave nodemask. Signed-off-by: Lee Schermerhorn Reviewed-by: Rik van Riel Cc: Andrew Morton Cc: Linus Torvalds [ Added MPOL_F_LAZY to trigger migrate-on-fault; simplified code now that we don't have to bother with special crap for interleaved ] Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-z3mgep4tgrc08o07vl1ahb2m@git.kernel.org Signed-off-by: Ingo Molnar --- include/linux/mempolicy.h | 8 +++++ include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 76 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 85 insertions(+) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index e5ccb9d..c511e25 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma) return 1; } +extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long); + #else struct mempolicy {}; @@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, return 0; } +static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma, + unsigned long address) +{ + return -1; /* no node preference */ +} + #endif /* CONFIG_NUMA */ #endif diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index d23dca8..472de8a 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -61,6 +61,7 @@ enum mpol_rebind_step { #define MPOL_F_SHARED (1 << 0) /* identify shared policies */ #define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */ #define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */ +#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ #endif /* _UAPI_LINUX_MEMPOLICY_H
*/ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index c7c7c86..1b2890c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2179,6 +2179,82 @@ static void sp_free(struct sp_node *n) kmem_cache_free(sn_cache, n); } +/** + * mpol_misplaced - check whether current page node is valid in policy + * + * @page - page to be checked + * @vma - vm area where page mapped + * @addr - virtual address where page mapped + * + * Lookup current policy node id for vma,addr and "compare to" page's + * node id. + * + * Returns: + * -1 - not misplaced, page is in the right node + * node - node id where the page should be + * + * Policy determination "mimics" alloc_page_vma(). + * Called from fault path where we know the vma and faulting address. + */ +int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr) +{ + struct mempolicy *pol; + struct zone *zone; + int curnid = page_to_nid(page); + unsigned long pgoff; + int polnid = -1; + int ret = -1; + + BUG_ON(!vma); + + pol = get_vma_policy(current, vma, addr); + if (!(pol->flags & MPOL_F_MOF)) + goto out; + + switch (pol->mode) { + case MPOL_INTERLEAVE: + BUG_ON(addr >= vma->vm_end); + BUG_ON(addr < vma->vm_start); + + pgoff = vma->vm_pgoff; + pgoff += (addr - vma->vm_start) >> PAGE_SHIFT; + polnid = offset_il_node(pol, vma, pgoff); + break; + + case MPOL_PREFERRED: + if (pol->flags & MPOL_F_LOCAL) + polnid = numa_node_id(); + else + polnid = pol->v.preferred_node; + break; + + case MPOL_BIND: + /* + * allows binding to multiple nodes. + * use current page if in policy nodemask, + * else select nearest allowed node, if any. + * If no allowed nodes, use current [!misplaced]. 
+ */ + if (node_isset(curnid, pol->v.nodes)) + goto out; + (void)first_zones_zonelist( + node_zonelist(numa_node_id(), GFP_HIGHUSER), + gfp_zone(GFP_HIGHUSER), + &pol->v.nodes, &zone); + polnid = zone->node; + break; + + default: + BUG(); + } + if (curnid != polnid) + ret = polnid; +out: + mpol_cond_put(pol); + + return ret; +} + static void sp_delete(struct shared_policy *sp, struct sp_node *n) { pr_debug("deleting %lx-l%lx\n", n->start, n->end); -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx146.postini.com [74.125.245.146]) by kanga.kvack.org (Postfix) with SMTP id 42B208D0001 for ; Fri, 16 Nov 2012 11:26:08 -0500 (EST) Received: by mail-ea0-f169.google.com with SMTP id a12so432891eaa.14 for ; Fri, 16 Nov 2012 08:26:06 -0800 (PST) From: Ingo Molnar Subject: [PATCH 14/19] mm/mpol: Create special PROT_NONE infrastructure Date: Fri, 16 Nov 2012 17:25:16 +0100 Message-Id: <1353083121-4560-15-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins From: Peter Zijlstra In order to facilitate a lazy -- fault driven -- migration of pages, create a special transient PROT_NONE variant, we can then use the 'spurious' protection faults to drive our migrations from. 
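As a rough userspace model of the detection idea, assuming a pte is just an integer whose low three bits hold read/write/exec permissions and whose remaining bits hold the frame number: an entry counts as one of the special transient PROT_NONE entries only if it differs from what the vma's normal protection would produce and matches the protection-stripped form. The model_* names and the bit layout are hypothetical and do not correspond to any real architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_READ	0x1u
#define MODEL_WRITE	0x2u
#define MODEL_EXEC	0x4u
#define MODEL_PROT_MASK	(MODEL_READ | MODEL_WRITE | MODEL_EXEC)
#define MODEL_PROT_NONE	0x0u

typedef uint64_t model_pte;

/* Replace the protection bits of a pte, keeping the frame number. */
static model_pte model_pte_modify(model_pte pte, unsigned int prot)
{
	return (pte & ~(uint64_t)MODEL_PROT_MASK) | (prot & MODEL_PROT_MASK);
}

/*
 * True if the pte looks like a transient PROT_NONE entry for a vma whose
 * normal protection is vma_prot.
 */
static bool model_pte_numa(model_pte pte, unsigned int vma_prot)
{
	/* Already carries the vma's normal protection: nothing special. */
	if (pte == model_pte_modify(pte, vma_prot))
		return false;

	return pte == model_pte_modify(pte, MODEL_PROT_NONE);
}

/* Example checks; call from main() to exercise the model. */
void model_selftest(void)
{
	unsigned int rw = MODEL_READ | MODEL_WRITE;
	model_pte normal = model_pte_modify((model_pte)0x1234 << 12, rw);
	model_pte hinted = model_pte_modify(normal, MODEL_PROT_NONE);

	assert(!model_pte_numa(normal, rw));
	assert(model_pte_numa(hinted, rw));
	/* A genuine PROT_NONE vma cannot be told apart, so it never matches. */
	assert(!model_pte_numa(hinted, MODEL_PROT_NONE));
}
```

The last check shows the limitation in the model: when the vma itself is PROT_NONE, the normal and the transient encodings are identical, so no special fault can be recognised.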
Pages that already had an effective PROT_NONE mapping will not generate these 'spurious' faults, for the simple reason that we cannot distinguish them by their protection bits; see pte_numa(). This isn't a problem since PROT_NONE (and possibly PROT_WRITE with dirty tracking) aren't used or are rare enough for us to not care about their placement. Suggested-by: Rik van Riel Signed-off-by: Peter Zijlstra Reviewed-by: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Cc: Andrea Arcangeli Link: http://lkml.kernel.org/n/tip-0g5k80y4df8l83lha9j75xph@git.kernel.org [ fixed various cross-arch and THP/!THP details ] Signed-off-by: Ingo Molnar --- include/linux/huge_mm.h | 19 +++++++++++++ include/linux/mm.h | 18 ++++++++++++ mm/huge_memory.c | 32 +++++++++++++++++++++ mm/memory.c | 75 ++++++++++++++++++++++++++++++++++++++++++++----- mm/mprotect.c | 24 +++++++++++----- 5 files changed, 154 insertions(+), 14 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b31cb7d..4f0f948 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -159,6 +159,13 @@ static inline struct page *compound_trans_head(struct page *page) } return page; } + +extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd); + +extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags, pmd_t orig_pmd); + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) @@ -195,6 +202,18 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd, { return 0; } + +static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd) +{ + return false; +} + +static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags, pmd_t orig_pmd) +{ +} + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h index 2a32cf8..0025bf9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1091,6 +1091,9 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, extern unsigned long do_mremap(unsigned long addr, unsigned long old_len, unsigned long new_len, unsigned long flags, unsigned long new_addr); +extern void change_protection(struct vm_area_struct *vma, unsigned long start, + unsigned long end, pgprot_t newprot, + int dirty_accountable); extern int mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags); @@ -1561,6 +1564,21 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags) } #endif +static inline pgprot_t vma_prot_none(struct vm_area_struct *vma) +{ + /* + * obtain PROT_NONE by removing READ|WRITE|EXEC privs + */ + vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); + return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags)); +} + +static inline void +change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end) +{ + change_protection(vma, start, end, vma_prot_none(vma), 0); +} + struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); int remap_pfn_range(struct vm_area_struct *, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 176fe3d..6924edf 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -725,6 +725,38 @@ out: return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd) +{ + /* + * See pte_numa(). 
+ */ + if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot))) + return false; + + return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma))); +} + +void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags, pmd_t entry) +{ + unsigned long haddr = address & HPAGE_PMD_MASK; + + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(*pmd, entry))) + goto out_unlock; + + /* do fancy stuff */ + + /* change back to regular protection */ + entry = pmd_modify(entry, vma->vm_page_prot); + if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1)) + update_mmu_cache_pmd(vma, address, entry); + +out_unlock: + spin_unlock(&mm->page_table_lock); +} + int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *vma) diff --git a/mm/memory.c b/mm/memory.c index fb135ba..e3e8ab2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1464,6 +1464,25 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL_GPL(zap_vma_ptes); +static bool pte_numa(struct vm_area_struct *vma, pte_t pte) +{ + /* + * If we have the normal vma->vm_page_prot protections we're not a + * 'special' PROT_NONE page. + * + * This means we cannot get 'special' PROT_NONE faults from genuine + * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty + * tracking. + * + * Neither case is really interesting for our current use though so we + * don't care. 
+ */ + if (pte_same(pte, pte_modify(pte, vma->vm_page_prot))) + return false; + + return pte_same(pte, pte_modify(pte, vma_prot_none(vma))); +} + /** * follow_page - look up a page descriptor from a user-virtual address * @vma: vm_area_struct mapping @address @@ -3433,6 +3452,41 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma, return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte); } +static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, pmd_t *pmd, + unsigned int flags, pte_t entry) +{ + spinlock_t *ptl; + int ret = 0; + + if (!pte_unmap_same(mm, pmd, ptep, entry)) + goto out; + + /* + * Do fancy stuff... + */ + + /* + * OK, nothing to do,.. change the protection back to what it + * ought to be. + */ + ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + if (unlikely(!pte_same(*ptep, entry))) + goto unlock; + + flush_cache_page(vma, address, pte_pfn(entry)); + + ptep_modify_prot_start(mm, address, ptep); + entry = pte_modify(entry, vma->vm_page_prot); + ptep_modify_prot_commit(mm, address, ptep, entry); + + update_mmu_cache(vma, address, ptep); +unlock: + pte_unmap_unlock(ptep, ptl); +out: + return ret; +} + /* * These routines also need to handle stuff like marking pages dirty * and/or accessed for architectures that don't do it in hardware (most @@ -3471,6 +3525,9 @@ int handle_pte_fault(struct mm_struct *mm, pte, pmd, flags, entry); } + if (pte_numa(vma, entry)) + return do_numa_page(mm, vma, address, pte, pmd, flags, entry); + ptl = pte_lockptr(mm, pmd); spin_lock(ptl); if (unlikely(!pte_same(*pte, entry))) @@ -3535,13 +3592,16 @@ retry: pmd, flags); } else { pmd_t orig_pmd = *pmd; - int ret; + int ret = 0; barrier(); - if (pmd_trans_huge(orig_pmd)) { - if (flags & FAULT_FLAG_WRITE && - !pmd_write(orig_pmd) && - !pmd_trans_splitting(orig_pmd)) { + if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) { + if (pmd_numa(vma, orig_pmd)) { + 
do_huge_pmd_numa_page(mm, vma, address, pmd, + flags, orig_pmd); + } + + if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) { ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd); /* @@ -3551,12 +3611,13 @@ retry: */ if (unlikely(ret & VM_FAULT_OOM)) goto retry; - return ret; } - return 0; + + return ret; } } + /* * Use __pte_alloc instead of pte_alloc_map, because we can't * run pte_offset_map on the pmd, if an huge pmd could diff --git a/mm/mprotect.c b/mm/mprotect.c index e97b0d6..392b124 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -112,7 +112,7 @@ static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd, } while (pud++, addr = next, addr != end); } -static void change_protection(struct vm_area_struct *vma, +static void change_protection_range(struct vm_area_struct *vma, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable) { @@ -134,6 +134,20 @@ static void change_protection(struct vm_area_struct *vma, flush_tlb_range(vma, start, end); } +void change_protection(struct vm_area_struct *vma, unsigned long start, + unsigned long end, pgprot_t newprot, + int dirty_accountable) +{ + struct mm_struct *mm = vma->vm_mm; + + mmu_notifier_invalidate_range_start(mm, start, end); + if (is_vm_hugetlb_page(vma)) + hugetlb_change_protection(vma, start, end, newprot); + else + change_protection_range(vma, start, end, newprot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); +} + int mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags) @@ -206,12 +220,8 @@ success: dirty_accountable = 1; } - mmu_notifier_invalidate_range_start(mm, start, end); - if (is_vm_hugetlb_page(vma)) - hugetlb_change_protection(vma, start, end, vma->vm_page_prot); - else - change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); - mmu_notifier_invalidate_range_end(mm, start, end); + change_protection(vma, start, end, 
vma->vm_page_prot, dirty_accountable); + vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); perf_event_mmap(vma); -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 3D23B6B009B for ; Fri, 16 Nov 2012 11:26:09 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so2125055eek.14 for ; Fri, 16 Nov 2012 08:26:08 -0800 (PST) From: Ingo Molnar Subject: [PATCH 15/19] mm/mpol: Add MPOL_MF_LAZY Date: Fri, 16 Nov 2012 17:25:17 +0100 Message-Id: <1353083121-4560-16-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Lee Schermerhorn From: Lee Schermerhorn This patch adds another mbind() flag to request "lazy migration". The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected pages are marked PROT_NONE. The pages will be migrated in the fault path on "first touch", if the policy dictates at that time. "Lazy Migration" will allow testing of migrate-on-fault via mbind(). Also allows applications to specify that only subsequently touched pages be migrated to obey new policy, instead of all pages in range. This can be useful for multi-threaded applications working on a large shared data area that is initialized by an initial thread resulting in all pages on one [or a few, if overflowed] nodes. 
After PROT_NONE, the pages in regions assigned to the worker threads will be automatically migrated local to the threads on 1st touch. Signed-off-by: Lee Schermerhorn Reviewed-by: Rik van Riel Cc: Lee Schermerhorn Cc: Andrew Morton Cc: Linus Torvalds [ nearly complete rewrite.. ] Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-7rsodo9x8zvm5awru5o7zo0y@git.kernel.org Signed-off-by: Ingo Molnar --- include/uapi/linux/mempolicy.h | 13 ++++++++--- mm/mempolicy.c | 49 +++++++++++++++++++++++++++--------------- 2 files changed, 42 insertions(+), 20 deletions(-) diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 472de8a..6a1baae 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -49,9 +49,16 @@ enum mpol_rebind_step { /* Flags for mbind */ #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ -#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */ -#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ -#define MPOL_MF_INTERNAL (1<<3) /* Internal flags start here */ +#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform + to policy */ +#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */ +#define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */ +#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */ + +#define MPOL_MF_VALID (MPOL_MF_STRICT | \ + MPOL_MF_MOVE | \ + MPOL_MF_MOVE_ALL | \ + MPOL_MF_LAZY) /* * Internal flags that share the struct mempolicy flags word with diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 1b2890c..5ee326c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -583,22 +583,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end, return ERR_PTR(-EFAULT); prev = NULL; for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) { + unsigned long endvma = vma->vm_end; + + if (endvma > end) + endvma = 
end; + if (vma->vm_start > start) + start = vma->vm_start; + if (!(flags & MPOL_MF_DISCONTIG_OK)) { if (!vma->vm_next && vma->vm_end < end) return ERR_PTR(-EFAULT); if (prev && prev->vm_end < vma->vm_start) return ERR_PTR(-EFAULT); } - if (!is_vm_hugetlb_page(vma) && - ((flags & MPOL_MF_STRICT) || + + if (is_vm_hugetlb_page(vma)) + goto next; + + if (flags & MPOL_MF_LAZY) { + change_prot_none(vma, start, endvma); + goto next; + } + + if ((flags & MPOL_MF_STRICT) || ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) && - vma_migratable(vma)))) { - unsigned long endvma = vma->vm_end; + vma_migratable(vma))) { - if (endvma > end) - endvma = end; - if (vma->vm_start > start) - start = vma->vm_start; err = check_pgd_range(vma, start, endvma, nodes, flags, private); if (err) { @@ -606,6 +616,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end, break; } } +next: prev = vma; } return first; @@ -1137,8 +1148,7 @@ static long do_mbind(unsigned long start, unsigned long len, int err; LIST_HEAD(pagelist); - if (flags & ~(unsigned long)(MPOL_MF_STRICT | - MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) + if (flags & ~(unsigned long)MPOL_MF_VALID) return -EINVAL; if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE)) return -EPERM; @@ -1161,6 +1171,9 @@ static long do_mbind(unsigned long start, unsigned long len, if (IS_ERR(new)) return PTR_ERR(new); + if (flags & MPOL_MF_LAZY) + new->flags |= MPOL_F_MOF; + /* * If we are using the default policy then operation * on discontinuous address spaces is okay after all @@ -1197,21 +1210,23 @@ static long do_mbind(unsigned long start, unsigned long len, vma = check_range(mm, start, end, nmask, flags | MPOL_MF_INVERT, &pagelist); - err = PTR_ERR(vma); - if (!IS_ERR(vma)) { - int nr_failed = 0; - + err = PTR_ERR(vma); /* maybe ... 
*/ + if (!IS_ERR(vma) && mode != MPOL_NOOP) err = mbind_range(mm, start, end, new); + if (!err) { + int nr_failed = 0; + if (!list_empty(&pagelist)) { + WARN_ON_ONCE(flags & MPOL_MF_LAZY); nr_failed = migrate_pages(&pagelist, new_vma_page, - (unsigned long)vma, - false, MIGRATE_SYNC); + (unsigned long)vma, + false, MIGRATE_SYNC); if (nr_failed) putback_lru_pages(&pagelist); } - if (!err && nr_failed && (flags & MPOL_MF_STRICT)) + if (nr_failed && (flags & MPOL_MF_STRICT)) err = -EIO; } else putback_lru_pages(&pagelist); -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 9375E8D0001 for ; Fri, 16 Nov 2012 11:26:11 -0500 (EST) Received: by mail-ea0-f169.google.com with SMTP id a12so432634eaa.14 for ; Fri, 16 Nov 2012 08:26:11 -0800 (PST) From: Ingo Molnar Subject: [PATCH 16/19] numa, mm: Support NUMA hinting page faults from gup/gup_fast Date: Fri, 16 Nov 2012 17:25:18 +0100 Message-Id: <1353083121-4560-17-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins From: Andrea Arcangeli Introduce FOLL_NUMA to tell follow_page to check pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do so because it always invokes handle_mm_fault and retries the follow_page later. 
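[ Illustration only, not part of the patch: a minimal userspace sketch of the flag logic described above, assuming simplified stand-ins for the kernel flags. FOLL_NUMA's value is the one this patch adds to mm.h; the FOLL_FORCE value and both helper names are assumptions made up for the example. It shows that __get_user_pages() adds FOLL_NUMA only when FOLL_FORCE is clear, and that follow_page() then declines to return a NUMA-hinting pte so the caller falls back to handle_mm_fault(). ]

```c
/*
 * Stand-in flag values. SK_FOLL_NUMA mirrors the 0x200 added by this
 * patch; SK_FOLL_FORCE is an assumed value used only for illustration.
 */
#define SK_FOLL_FORCE 0x10u
#define SK_FOLL_NUMA  0x200u

/*
 * Hypothetical helper modelling the gup-side decision: request NUMA
 * hinting faults unless the caller forces access to the range.
 */
static unsigned int gup_effective_flags(unsigned int gup_flags)
{
	if (!(gup_flags & SK_FOLL_FORCE))
		gup_flags |= SK_FOLL_NUMA;
	return gup_flags;
}

/*
 * Hypothetical helper modelling the follow_page-side check: returns 1
 * when the lookup should give up on the page so that the fault handler
 * runs first, 0 when the page can be returned directly.
 */
static int follow_page_defers_to_fault(unsigned int flags, int pte_is_numa)
{
	return (flags & SK_FOLL_NUMA) && pte_is_numa;
}
```

[ With these stand-ins, a plain gup call on a NUMA-hinting pte defers to the fault path, while a FOLL_FORCE call never does, which matches the reasoning in the comment the patch adds to __get_user_pages(). ]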
KVM secondary MMU page faults will trigger the NUMA hinting page faults through gup_fast -> get_user_pages -> follow_page -> handle_mm_fault. Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages. [ This patch was picked up from the AutoNUMA tree. ] Originally-by: Andrea Arcangeli Cc: Linus Torvalds Cc: Andrew Morton Cc: Peter Zijlstra Cc: Andrea Arcangeli Cc: Rik van Riel [ ported to this tree. ] Signed-off-by: Ingo Molnar --- include/linux/mm.h | 1 + mm/memory.c | 17 +++++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 0025bf9..1821629 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1600,6 +1600,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address, #define FOLL_MLOCK 0x40 /* mark page as mlocked */ #define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */ #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ +#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, void *data); diff --git a/mm/memory.c b/mm/memory.c index e3e8ab2..a660fd0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1536,6 +1536,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE); goto out; } + if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd)) + goto no_page_table; if (pmd_trans_huge(*pmd)) { if (flags & FOLL_SPLIT) { split_huge_page_pmd(mm, pmd); @@ -1565,6 +1567,8 @@ split_fallthrough: pte = *ptep; if (!pte_present(pte)) goto no_page; + if ((flags & FOLL_NUMA) && pte_numa(vma, pte)) + goto no_page; if ((flags & FOLL_WRITE) && !pte_write(pte)) goto unlock; @@ -1716,6 +1720,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD); vm_flags &= (gup_flags & 
FOLL_FORCE) ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE); + + /* + * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault + * would be called on PROT_NONE ranges. We must never invoke + * handle_mm_fault on PROT_NONE ranges or the NUMA hinting + * page faults would unprotect the PROT_NONE ranges if + * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd + * bitflag. So to avoid that, don't set FOLL_NUMA if + * FOLL_FORCE is set. + */ + if (!(gup_flags & FOLL_FORCE)) + gup_flags |= FOLL_NUMA; + i = 0; do { -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id A29558D0005 for ; Fri, 16 Nov 2012 11:26:13 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so2125091eek.14 for ; Fri, 16 Nov 2012 08:26:13 -0800 (PST) From: Ingo Molnar Subject: [PATCH 17/19] mm/migrate: Introduce migrate_misplaced_page() Date: Fri, 16 Nov 2012 17:25:19 +0100 Message-Id: <1353083121-4560-18-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins From: Peter Zijlstra Add migrate_misplaced_page() which deals with migrating pages from faults. This includes adding a new MIGRATE_FAULT migration mode to deal with the extra page reference required due to having to look up the page. 
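[ Illustration only, not part of the patch: a small userspace sketch of the reference-count arithmetic that migrate_page_move_mapping() does with this change. The enum and the function name are hypothetical stand-ins; only the numbers follow the patch: an anonymous page expects 1 reference, a page with a mapping expects 2 plus one if it has private data, and MIGRATE_FAULT adds one more for the reference taken when the fault path looked the page up. ]

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's enum migrate_mode. */
enum sk_migrate_mode {
	SK_MIGRATE_ASYNC,
	SK_MIGRATE_SYNC_LIGHT,
	SK_MIGRATE_SYNC,
	SK_MIGRATE_FAULT,
};

/*
 * Number of references the migration code expects to see on a page
 * before it is willing to move it. Any other count means someone else
 * holds the page and the migration attempt is abandoned.
 */
static int expected_page_count(enum sk_migrate_mode mode,
			       bool has_mapping, bool has_private)
{
	int expected = 0;

	/* The fault path holds one extra reference from its page lookup. */
	if (mode == SK_MIGRATE_FAULT)
		expected++;

	/* Anonymous page without mapping: one base reference. */
	if (!has_mapping)
		return expected + 1;

	/* Mapped page: two base references, plus one for private data. */
	return expected + 2 + (has_private ? 1 : 0);
}
```

[ So an anonymous page migrated from the fault path must show a count of exactly 2, where the other modes would require 1. ]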
Based-on-work-by: Lee Schermerhorn Signed-off-by: Peter Zijlstra Reviewed-by: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/n/tip-es03i8ne7xee0981brw40fl5@git.kernel.org Signed-off-by: Ingo Molnar --- include/linux/migrate.h | 7 ++++ include/linux/migrate_mode.h | 3 ++ mm/migrate.c | 85 +++++++++++++++++++++++++++++++++++++++----- 3 files changed, 87 insertions(+), 8 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index ce7e667..9a5afea 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct *mm, extern void migrate_page_copy(struct page *newpage, struct page *page); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); +extern int migrate_misplaced_page(struct page *page, int node); #else static inline void putback_lru_pages(struct list_head *l) {} @@ -63,5 +64,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #define migrate_page NULL #define fail_migrate_page NULL +static inline +int migrate_misplaced_page(struct page *page, int node) +{ + return -EAGAIN; /* can't migrate now */ +} #endif /* CONFIG_MIGRATION */ + #endif /* _LINUX_MIGRATE_H */ diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h index ebf3d89..40b37dc 100644 --- a/include/linux/migrate_mode.h +++ b/include/linux/migrate_mode.h @@ -6,11 +6,14 @@ * on most operations but not ->writepage as the potential stall time * is too significant * MIGRATE_SYNC will block when migrating pages + * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy + * this path has an extra reference count */ enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC, + MIGRATE_FAULT, }; #endif /* MIGRATE_MODE_H_INCLUDED */ diff --git a/mm/migrate.c b/mm/migrate.c index 77ed2d7..3299949 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ 
-225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head, struct buffer_head *bh = head; /* Simple case, sync compaction */ - if (mode != MIGRATE_ASYNC) { + if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) { do { get_bh(bh); lock_buffer(bh); @@ -279,12 +279,22 @@ static int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, struct buffer_head *head, enum migrate_mode mode) { - int expected_count; + int expected_count = 0; void **pslot; + if (mode == MIGRATE_FAULT) { + /* + * MIGRATE_FAULT has an extra reference on the page and + * otherwise acts like ASYNC, no point in delaying the + * fault, we'll try again next time. + */ + expected_count++; + } + if (!mapping) { /* Anonymous page without mapping */ - if (page_count(page) != 1) + expected_count += 1; + if (page_count(page) != expected_count) return -EAGAIN; return 0; } @@ -294,7 +304,7 @@ static int migrate_page_move_mapping(struct address_space *mapping, pslot = radix_tree_lookup_slot(&mapping->page_tree, page_index(page)); - expected_count = 2 + page_has_private(page); + expected_count += 2 + page_has_private(page); if (page_count(page) != expected_count || radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) { spin_unlock_irq(&mapping->tree_lock); @@ -313,7 +323,7 @@ static int migrate_page_move_mapping(struct address_space *mapping, * the mapping back due to an elevated page count, we would have to * block waiting on other references to be dropped. */ - if (mode == MIGRATE_ASYNC && head && + if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head && !buffer_migrate_lock_buffers(head, mode)) { page_unfreeze_refs(page, expected_count); spin_unlock_irq(&mapping->tree_lock); @@ -521,7 +531,7 @@ int buffer_migrate_page(struct address_space *mapping, * with an IRQ-safe spinlock held. 
In the sync case, the buffers * need to be locked now */ - if (mode != MIGRATE_ASYNC) + if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) BUG_ON(!buffer_migrate_lock_buffers(head, mode)); ClearPagePrivate(page); @@ -687,7 +697,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, struct anon_vma *anon_vma = NULL; if (!trylock_page(page)) { - if (!force || mode == MIGRATE_ASYNC) + if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) goto out; /* @@ -1403,4 +1413,63 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to, } return err; } -#endif + +/* + * Attempt to migrate a misplaced page to the specified destination + * node. + */ +int migrate_misplaced_page(struct page *page, int node) +{ + struct address_space *mapping = page_mapping(page); + int page_lru = page_is_file_cache(page); + struct page *newpage; + int ret = -EAGAIN; + gfp_t gfp = GFP_HIGHUSER_MOVABLE; + + /* + * Don't migrate pages that are mapped in multiple processes. + */ + if (page_mapcount(page) != 1) + goto out; + + /* + * Never wait for allocations just to migrate on fault, but don't dip + * into reserves. And, only accept pages from the specified node. No + * sense migrating to a different "misplaced" page! + */ + if (mapping) + gfp = mapping_gfp_mask(mapping); + gfp &= ~__GFP_WAIT; + gfp |= __GFP_NOMEMALLOC | GFP_THISNODE; + + newpage = alloc_pages_node(node, gfp, 0); + if (!newpage) { + ret = -ENOMEM; + goto out; + } + + if (isolate_lru_page(page)) { + ret = -EBUSY; + goto put_new; + } + + inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru); + ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT); + /* + * A page that has been migrated has all references removed and will be + * freed. A page that has not been migrated will have kepts its + * references and be restored. + */ + dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru); + putback_lru_page(page); +put_new: + /* + * Move the new page to the LRU. 
If migration was not successful + * then this will free the page. + */ + putback_lru_page(newpage); +out: + return ret; +} + +#endif /* CONFIG_NUMA */ -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 1796D8D0003 for ; Fri, 16 Nov 2012 11:26:18 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so2125055eek.14 for ; Fri, 16 Nov 2012 08:26:17 -0800 (PST) From: Ingo Molnar Subject: [PATCH 19/19] x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Date: Fri, 16 Nov 2012 17:25:21 +0100 Message-Id: <1353083121-4560-20-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Michel Lespinasse From: Rik van Riel Intel has an architectural guarantee that the TLB entry causing a page fault gets invalidated automatically. This means we should be able to drop the local TLB invalidation. Because of the way other areas of the page fault code work, chances are good that all x86 CPUs do this. However, if someone somewhere has an x86 CPU that does not invalidate the TLB entry causing a page fault, this one-liner should be easy to revert - or a CPU model specific quirk could be added to retain this optimization on most CPUs. 
Signed-off-by: Rik van Riel Acked-by: Linus Torvalds Acked-by: Peter Zijlstra Cc: Andrew Morton Cc: Michel Lespinasse [ Applied changelog massage and moved this last in the series, to create bisection distance. ] Signed-off-by: Ingo Molnar --- arch/x86/mm/pgtable.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index be3bb46..7353de3 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma, if (changed && dirty) { *ptep = entry; pte_update_defer(vma->vm_mm, address, ptep); - __flush_tlb_one(address); } return changed; -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx146.postini.com [74.125.245.146]) by kanga.kvack.org (Postfix) with SMTP id DF03C8D0002 for ; Fri, 16 Nov 2012 11:26:15 -0500 (EST) Received: by mail-ea0-f169.google.com with SMTP id a12so432891eaa.14 for ; Fri, 16 Nov 2012 08:26:15 -0800 (PST) From: Ingo Molnar Subject: [PATCH 18/19] mm/mpol: Use special PROT_NONE to migrate pages Date: Fri, 16 Nov 2012 17:25:20 +0100 Message-Id: <1353083121-4560-19-git-send-email-mingo@kernel.org> In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins From: Peter Zijlstra Combine our previous PROT_NONE, mpol_misplaced and migrate_misplaced_page() pieces into an effective migrate on fault scheme. 
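[ Illustration only, not part of the patch: a userspace sketch of the decision flow that do_numa_page() ends up with once the three pieces are combined. All names are hypothetical; the inputs stand for the result of mpol_misplaced() (-1 meaning the page is already well placed), the result of migrate_misplaced_page() (0 meaning success), and whether the pte still matches the faulting entry after the lock is re-taken. ]

```c
/* Possible results of handling one NUMA hinting fault in this sketch. */
enum numa_fault_result {
	NUMA_FAULT_RESTORE_PROT, /* put the regular vma protection back */
	NUMA_FAULT_MIGRATED,     /* page was moved, nothing left to fix up */
	NUMA_FAULT_RACED,        /* pte changed under us, leave it alone */
};

/*
 * Hypothetical model of the control flow:
 *  - well placed page: just restore the normal protection;
 *  - misplaced page, migration succeeded: done;
 *  - misplaced page, migration failed: restore the protection, but
 *    only if the pte is still the one that faulted.
 */
static enum numa_fault_result numa_fault_decide(int target_node,
						int migrate_ret,
						int pte_unchanged)
{
	if (target_node == -1)
		return NUMA_FAULT_RESTORE_PROT;

	if (migrate_ret == 0)
		return NUMA_FAULT_MIGRATED;

	return pte_unchanged ? NUMA_FAULT_RESTORE_PROT : NUMA_FAULT_RACED;
}
```

[ A failed migration therefore degrades to the "nothing to do" case rather than to an error: the task keeps running on the remote page and the fault returns 0 either way. ]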
Note that (on x86) we rely on PROT_NONE pages being !present and avoid the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the page-migration performance. Suggested-by: Rik van Riel Signed-off-by: Peter Zijlstra Reviewed-by: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Cc: Andrea Arcangeli Link: http://lkml.kernel.org/n/tip-e98gyl8kr9jzooh2s4piuils@git.kernel.org Signed-off-by: Ingo Molnar --- mm/huge_memory.c | 41 +++++++++++++++++++++++++++++++++++- mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++---------------- 2 files changed, 85 insertions(+), 19 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6924edf..c4c0a57 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include "internal.h" @@ -741,12 +742,48 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned int flags, pmd_t entry) { unsigned long haddr = address & HPAGE_PMD_MASK; + struct page *page = NULL; + int node; spin_lock(&mm->page_table_lock); if (unlikely(!pmd_same(*pmd, entry))) goto out_unlock; - /* do fancy stuff */ + if (unlikely(pmd_trans_splitting(entry))) { + spin_unlock(&mm->page_table_lock); + wait_split_huge_page(vma->anon_vma, pmd); + return; + } + +#ifdef CONFIG_NUMA + page = pmd_page(entry); + VM_BUG_ON(!PageCompound(page) || !PageHead(page)); + + get_page(page); + spin_unlock(&mm->page_table_lock); + + /* + * XXX should we serialize against split_huge_page ? + */ + + node = mpol_misplaced(page, vma, haddr); + if (node == -1) + goto do_fixup; + + /* + * Due to lacking code to migrate thp pages, we'll split + * (which preserves the special PROT_NONE) and re-take the + * fault on the normal pages. 
+ */ + split_huge_page(page); + put_page(page); + return; + +do_fixup: + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(*pmd, entry))) + goto out_unlock; +#endif /* change back to regular protection */ entry = pmd_modify(entry, vma->vm_page_prot); @@ -755,6 +792,8 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, out_unlock: spin_unlock(&mm->page_table_lock); + if (page) + put_page(page); } int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, diff --git a/mm/memory.c b/mm/memory.c index a660fd0..0d26a28 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -57,6 +57,7 @@ #include #include #include +#include #include #include @@ -1467,8 +1468,10 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes); static bool pte_numa(struct vm_area_struct *vma, pte_t pte) { /* - * If we have the normal vma->vm_page_prot protections we're not a - * 'special' PROT_NONE page. + * For NUMA page faults, we use PROT_NONE ptes in VMAs with + * "normal" vma->vm_page_prot protections. Genuine PROT_NONE + * VMAs should never get here, because the fault handling code + * will notice that the VMA has no read or write permissions. * * This means we cannot get 'special' PROT_NONE faults from genuine * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty @@ -3473,35 +3476,59 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pmd_t *pmd, unsigned int flags, pte_t entry) { + struct page *page = NULL; + int node, page_nid = -1; spinlock_t *ptl; - int ret = 0; - - if (!pte_unmap_same(mm, pmd, ptep, entry)) - goto out; - /* - * Do fancy stuff... - */ - - /* - * OK, nothing to do,.. change the protection back to what it - * ought to be. 
- */ - ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + ptl = pte_lockptr(mm, pmd); + spin_lock(ptl); if (unlikely(!pte_same(*ptep, entry))) - goto unlock; + goto out_unlock; + page = vm_normal_page(vma, address, entry); + if (page) { + get_page(page); + page_nid = page_to_nid(page); + node = mpol_misplaced(page, vma, address); + if (node != -1) + goto migrate; + } + +out_pte_upgrade_unlock: flush_cache_page(vma, address, pte_pfn(entry)); ptep_modify_prot_start(mm, address, ptep); entry = pte_modify(entry, vma->vm_page_prot); ptep_modify_prot_commit(mm, address, ptep, entry); + /* No TLB flush needed because we upgraded the PTE */ + update_mmu_cache(vma, address, ptep); -unlock: + +out_unlock: pte_unmap_unlock(ptep, ptl); out: - return ret; + if (page) + put_page(page); + + return 0; + +migrate: + pte_unmap_unlock(ptep, ptl); + + if (!migrate_misplaced_page(page, node)) { + page_nid = node; + goto out; + } + + ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + if (!pte_same(*ptep, entry)) { + put_page(page); + page = NULL; + goto out_unlock; + } + + goto out_pte_upgrade_unlock; } /* -- 1.7.11.7 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx120.postini.com [74.125.245.120]) by kanga.kvack.org (Postfix) with SMTP id 14C9C6B004D for ; Sat, 17 Nov 2012 03:35:39 -0500 (EST)
Received: by mail-oa0-f41.google.com with SMTP id k14so4334014oag.14 for ; Sat, 17 Nov 2012 00:35:38 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org>
Date: Sat, 17 Nov 2012 16:35:38 +0800
Message-ID:
Subject: Re: [PATCH 00/19] latest numa/base patches
From: Alex Shi
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Alex Shi

Just find imbalance issue on the patchset.

I write a one line program:
int main ()
{
	int i;
	for (i=0; i< 1; )
		__asm__ __volatile__ ("nop");
}
it was compiled with name pl and start it on my 2 socket * 4 cores * HT NUMA machine:
the cpu domain top like this:
domain 0: span 4,12 level SIBLING
  groups: 4 (cpu_power = 589) 12 (cpu_power = 589)
domain 1: span 0,2,4,6,8,10,12,14 level MC
  groups: 4,12 (cpu_power = 1178) 6,14 (cpu_power = 1178) 0,8 (cpu_power = 1178) 2,10 (cpu_power = 1178)
domain 2: span 0,2,4,6,8,10,12,14 level CPU
  groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712)
domain 3: span 0-15 level NUMA
  groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712) 1,3,5,7,9,11,13,15 (cpu_power = 4712)

$for ((i=0; i< I; i++)); do ./pl & done
when I = 2, they are running on cpu 0,12
     I = 4, they are running on cpu 0,9,12,14
     I = 8, they are running on cpu 0,4,9,10,11,12,13,14

Regards!
Alex On Sat, Nov 17, 2012 at 12:25 AM, Ingo Molnar wrote: > This is the split-out series of mm/ patches that got no objections > from the latest (v15) posting of numa/core. If everyone is still > fine with these then these will be merge candidates for v3.8. > > I left out the more contentious policy bits that people are still > arguing about. > > The numa/base tree can also be found here: > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/base > > Thanks, > > Ingo > > -------------------> > > Andrea Arcangeli (1): > numa, mm: Support NUMA hinting page faults from gup/gup_fast > > Gerald Schaefer (1): > sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390 > > Ingo Molnar (1): > mm/pgprot: Move the pgprot_modify() fallback definition to mm.h > > Lee Schermerhorn (3): > mm/mpol: Add MPOL_MF_NOOP > mm/mpol: Check for misplaced page > mm/mpol: Add MPOL_MF_LAZY > > Peter Zijlstra (7): > sched, numa, mm: Make find_busiest_queue() a method > sched, numa, mm: Describe the NUMA scheduling problem formally > mm/thp: Preserve pgprot across huge page split > mm/mpol: Make MPOL_LOCAL a real policy > mm/mpol: Create special PROT_NONE infrastructure > mm/migrate: Introduce migrate_misplaced_page() > mm/mpol: Use special PROT_NONE to migrate pages > > Ralf Baechle (1): > sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation > > Rik van Riel (5): > mm/generic: Only flush the local TLB in ptep_set_access_flags() > x86/mm: Only do a local tlb flush in ptep_set_access_flags() > x86/mm: Introduce pte_accessible() > mm: Only flush the TLB when clearing an accessible pte > x86/mm: Completely drop the TLB flush from ptep_set_access_flags() > > Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++ > arch/mips/include/asm/pgtable.h | 2 + > arch/s390/include/asm/pgtable.h | 13 ++ > arch/x86/include/asm/pgtable.h | 7 + > arch/x86/mm/pgtable.c | 8 +- > include/asm-generic/pgtable.h | 4 + > include/linux/huge_mm.h | 19 +++ > 
include/linux/mempolicy.h | 8 ++ > include/linux/migrate.h | 7 + > include/linux/migrate_mode.h | 3 + > include/linux/mm.h | 32 +++++ > include/uapi/linux/mempolicy.h | 16 ++- > kernel/sched/fair.c | 20 +-- > mm/huge_memory.c | 174 +++++++++++++++-------- > mm/memory.c | 119 +++++++++++++++- > mm/mempolicy.c | 143 +++++++++++++++---- > mm/migrate.c | 85 ++++++++++-- > mm/mprotect.c | 31 +++-- > mm/pgtable-generic.c | 9 +- > 19 files changed, 807 insertions(+), 123 deletions(-) > create mode 100644 Documentation/scheduler/numa-problem.txt > > -- > 1.7.11.7 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Thanks Alex -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx202.postini.com [74.125.245.202]) by kanga.kvack.org (Postfix) with SMTP id 8E4D66B0068 for ; Sat, 17 Nov 2012 03:40:08 -0500 (EST) Received: by mail-ob0-f169.google.com with SMTP id lz20so4349318obb.14 for ; Sat, 17 Nov 2012 00:40:07 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Date: Sat, 17 Nov 2012 16:40:07 +0800 Message-ID: Subject: Re: [PATCH 00/19] latest numa/base patches From: Alex Shi Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Alex Shi On Sat, Nov 17, 2012 at 4:35 PM, Alex Shi wrote: > Just find imbalance issue on the patchset. 
>
> I write a one line program:
> int main ()
> {
>	int i;
>	for (i=0; i< 1; )
>		__asm__ __volatile__ ("nop");
> }
> it was compiled with name pl and start it on my 2 socket * 4 cores *
> HT NUMA machine:
> the cpu domain top like this:
> domain 0: span 4,12 level SIBLING
>   groups: 4 (cpu_power = 589) 12 (cpu_power = 589)
> domain 1: span 0,2,4,6,8,10,12,14 level MC
>   groups: 4,12 (cpu_power = 1178) 6,14 (cpu_power = 1178) 0,8
> (cpu_power = 1178) 2,10 (cpu_power = 1178)
> domain 2: span 0,2,4,6,8,10,12,14 level CPU
>   groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712)
> domain 3: span 0-15 level NUMA
>   groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712) 1,3,5,7,9,11,13,15
> (cpu_power = 4712)
>
> $for ((i=0; i< I; i++)); do ./pl & done
> when I = 2, they are running on cpu 0,12
>      I = 4, they are running on cpu 0,9,12,14
>      I = 8, they are running on cpu 0,4,9,10,11,12,13,14
>

Oops, this was tested on the latest v15 tip/master tree (head is
a7b7a8ad4476bb641c8455a4e0d7d0fd3eb86f90), not on this series. Sorry.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx186.postini.com [74.125.245.186]) by kanga.kvack.org (Postfix) with SMTP id 6938A6B005D for ; Sun, 18 Nov 2012 21:26:03 -0500 (EST) Received: by mail-ee0-f41.google.com with SMTP id d41so3184794eek.14 for ; Sun, 18 Nov 2012 18:26:01 -0800 (PST) Date: Mon, 19 Nov 2012 03:25:58 +0100 From: Ingo Molnar Subject: [PATCH 17/19, v2] mm/migrate: Introduce migrate_misplaced_page() Message-ID: <20121119022558.GA3186@gmail.com> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> <1353083121-4560-18-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1353083121-4560-18-git-send-email-mingo@kernel.org> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins * Ingo Molnar wrote: > From: Peter Zijlstra > > Add migrate_misplaced_page() which deals with migrating pages from > faults. > > This includes adding a new MIGRATE_FAULT migration mode to > deal with the extra page reference required due to having to look up > the page. [...] > --- a/include/linux/migrate_mode.h > +++ b/include/linux/migrate_mode.h > @@ -6,11 +6,14 @@ > * on most operations but not ->writepage as the potential stall time > * is too significant > * MIGRATE_SYNC will block when migrating pages > + * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy > + * this path has an extra reference count > */ Note, this is still the older, open-coded version. The newer replacement version created from Mel's patch which reuses migrate_pages() and is nicer on out-of-node-memory conditions and is cleaner all around can be found below. 
I tested it today and it appears to work fine. I noticed no performance improvement or performance drop from it - if it holds up in testing it will be part of the -v17 release of numa/core. Thanks, Ingo --------------------------> Subject: mm/migration: Introduce migrate_misplaced_page() From: Mel Gorman Date: Fri, 16 Nov 2012 11:22:23 +0000 Note: This was originally based on Peter's patch "mm/migrate: Introduce migrate_misplaced_page()" but borrows extremely heavily from Andrea's "autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection". The end result is barely recognisable so signed-offs had to be dropped. If original authors are ok with it, I'll re-add the signed-off-bys. Add migrate_misplaced_page() which deals with migrating pages from faults. Based-on-work-by: Lee Schermerhorn Based-on-work-by: Peter Zijlstra Based-on-work-by: Andrea Arcangeli Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Johannes Weiner Cc: Hugh Dickins Cc: Linus Torvalds Cc: Linux-MM Cc: Peter Zijlstra Cc: Andrea Arcangeli Link: http://lkml.kernel.org/r/1353064973-26082-14-git-send-email-mgorman@suse.de [ Adapted to the numa/core tree. 
] Signed-off-by: Ingo Molnar
---
 mm/memory.c  |   13 ++-----
 mm/migrate.c |  103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 106 insertions(+), 10 deletions(-)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -3494,28 +3494,25 @@ out_pte_upgrade_unlock:
 
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
-out:
+
 	if (page) {
 		task_numa_fault(page_nid, last_cpu, 1);
 		put_page(page);
 	}
-
+out:
 	return 0;
 
 migrate:
 	pte_unmap_unlock(ptep, ptl);
 
-	if (!migrate_misplaced_page(page, node)) {
-		page_nid = node;
+	if (migrate_misplaced_page(page, node)) {
 		goto out;
 	}
+	page = NULL;
 
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_same(*ptep, entry)) {
-		put_page(page);
-		page = NULL;
+	if (!pte_same(*ptep, entry))
 		goto out_unlock;
-	}
 
 	goto out_pte_upgrade_unlock;
 }
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c
+++ linux/mm/migrate.c
@@ -279,7 +279,7 @@ static int migrate_page_move_mapping(str
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode)
 {
-	int expected_count;
+	int expected_count = 0;
 	void **pslot;
 
 	if (!mapping) {
@@ -1403,4 +1403,103 @@ int migrate_vmas(struct mm_struct *mm, c
 	}
 	return err;
 }
-#endif
+
+/*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks which crude
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+				   int nr_migrate_pages)
+{
+	int z;
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable)
+			continue;
+
+		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
+		if (!zone_watermark_ok(zone, 0,
+				       high_wmark_pages(zone) +
+				       nr_migrate_pages,
+				       0, 0))
+			continue;
+		return true;
+	}
+	return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+					   unsigned long data,
+					   int **result)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_exact_node(nid,
+					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+					  __GFP_NOMEMALLOC | __GFP_NORETRY |
+					  __GFP_NOWARN) &
+					 ~GFP_IOFS, 0);
+	return newpage;
+}
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+	int isolated = 0;
+	LIST_HEAD(migratepages);
+
+	/*
+	 * Don't migrate pages that are mapped in multiple processes.
+	 * TODO: Handle false sharing detection instead of this hammer
+	 */
+	if (page_mapcount(page) != 1)
+		goto out;
+
+	/* Avoid migrating to a node that is nearly full */
+	if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+		int page_lru;
+
+		if (isolate_lru_page(page)) {
+			put_page(page);
+			goto out;
+		}
+		isolated = 1;
+
+		/*
+		 * Page is isolated which takes a reference count so now the
+		 * callers reference can be safely dropped without the page
+		 * disappearing underneath us during migration
+		 */
+		put_page(page);
+
+		page_lru = page_is_file_cache(page);
+		inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+		list_add(&page->lru, &migratepages);
+	}
+
+	if (isolated) {
+		int nr_remaining;
+
+		nr_remaining = migrate_pages(&migratepages,
+				alloc_misplaced_dst_page,
+				node, false, MIGRATE_ASYNC);
+		if (nr_remaining) {
+			putback_lru_pages(&migratepages);
+			isolated = 0;
+		}
+	}
+	BUG_ON(!list_empty(&migratepages));
+out:
+	return isolated;
+}
+
+#endif /* CONFIG_NUMA */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx190.postini.com [74.125.245.190]) by kanga.kvack.org (Postfix) with SMTP id 914F16B006E for ; Mon, 19 Nov 2012 11:03:09 -0500 (EST) Message-ID: <50AA582E.30602@redhat.com> Date: Mon, 19 Nov 2012 11:02:54 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 17/19, v2] mm/migrate: Introduce migrate_misplaced_page() References: <1353083121-4560-1-git-send-email-mingo@kernel.org> <1353083121-4560-18-git-send-email-mingo@kernel.org> <20121119022558.GA3186@gmail.com> In-Reply-To: <20121119022558.GA3186@gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Paul Turner , Lee Schermerhorn , Christoph Lameter , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins On 11/18/2012 09:25 PM, Ingo Molnar wrote: > > * Ingo Molnar wrote: > >> From: Peter Zijlstra >> >> Add migrate_misplaced_page() which deals with migrating pages from >> faults. >> >> This includes adding a new MIGRATE_FAULT migration mode to >> deal with the extra page reference required due to having to look up >> the page. > [...] > >> --- a/include/linux/migrate_mode.h >> +++ b/include/linux/migrate_mode.h >> @@ -6,11 +6,14 @@ >> * on most operations but not ->writepage as the potential stall time >> * is too significant >> * MIGRATE_SYNC will block when migrating pages >> + * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy >> + * this path has an extra reference count >> */ > > Note, this is still the older, open-coded version. 
>
> The newer replacement version created from Mel's patch which
> reuses migrate_pages() and is nicer on out-of-node-memory
> conditions and is cleaner all around can be found below.
>
> I tested it today and it appears to work fine. I noticed no
> performance improvement or performance drop from it - if it
> holds up in testing it will be part of the -v17 release of
> numa/core.

Excellent. That gets rid of the last issue with numa/base :)

--
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx182.postini.com [74.125.245.182]) by kanga.kvack.org (Postfix) with SMTP id 5A7076B005A for ; Sun, 25 Nov 2012 01:07:13 -0500 (EST)
Received: by mail-ie0-f169.google.com with SMTP id c14so1620541ieb.14 for ; Sat, 24 Nov 2012 22:07:12 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <1353083121-4560-5-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org> <1353083121-4560-5-git-send-email-mingo@kernel.org>
Date: Sun, 25 Nov 2012 11:37:12 +0530
Message-ID:
Subject: Re: [PATCH 04/19] sched, numa, mm: Describe the NUMA scheduling problem formally
From: abhishek agarwal
Content-Type: multipart/alternative; boundary=f46d04339c96b7a92704cf4ba0b1
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli, Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Hugh Dickins, "H. Peter Anvin", Mike Galbraith

--f46d04339c96b7a92704cf4ba0b1
Content-Type: text/plain; charset=ISO-8859-1

As per 4) "move towards where most memory is": if we have more shared
memory than private memory, why not just move the process towards the
memory, instead of moving the memory towards the node? That would, I guess,
be less cumbersome than moving all the shared memory.

On Fri, Nov 16, 2012 at 9:55 PM, Ingo Molnar wrote:

> +Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
> +and show representative numbers, we should limit node-migration to not be
> +faster than this.
>

--f46d04339c96b7a92704cf4ba0b1
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
--f46d04339c96b7a92704cf4ba0b1--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx171.postini.com [74.125.245.171]) by kanga.kvack.org (Postfix) with SMTP id DF76A6B0068 for ; Sun, 25 Nov 2012 01:09:45 -0500 (EST)
Received: by mail-ie0-f169.google.com with SMTP id c14so1621835ieb.14 for ; Sat, 24 Nov 2012 22:09:45 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <1353083121-4560-5-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org> <1353083121-4560-5-git-send-email-mingo@kernel.org>
Date: Sun, 25 Nov 2012 11:39:45 +0530
Message-ID:
Subject: Re: [PATCH 04/19] sched, numa, mm: Describe the NUMA scheduling problem formally
From: abhishek agarwal
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli, Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Hugh Dickins, "H. Peter Anvin", Mike Galbraith

As per 4) "move towards where most memory is": if we have more shared
memory than private memory, why not just move the process towards the
memory, instead of moving the memory towards the node? That would, I guess,
be less cumbersome than moving all the shared memory.

On Fri, Nov 16, 2012 at 9:55 PM, Ingo Molnar wrote:
> From: Peter Zijlstra
>
> This is probably a first: formal description of a complex high-level
> computing problem, within the kernel source.
>
> Signed-off-by: Peter Zijlstra
> Cc: Linus Torvalds
> Cc: Andrew Morton
> Cc: Peter Zijlstra
> Cc: "H.
Peter Anvin" > Cc: Mike Galbraith > Rik van Riel > Link: http://lkml.kernel.org/n/tip-mmnlpupoetcatimvjEld16Pb@git.kernel.org > [ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ] > Signed-off-by: Ingo Molnar > --- > Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++ > 1 file changed, 230 insertions(+) > create mode 100644 Documentation/scheduler/numa-problem.txt > > diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt > new file mode 100644 > index 0000000..a5d2fee > --- /dev/null > +++ b/Documentation/scheduler/numa-problem.txt > @@ -0,0 +1,230 @@ > + > + > +Effective NUMA scheduling problem statement, described formally: > + > + * minimize interconnect traffic > + > +For each task 't_i' we have memory, this memory can be spread over multiple > +physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on > +node 'k' in [pages]. > + > +If a task shares memory with another task let us denote this as: > +'s_i,k', the memory shared between tasks including 't_i' residing on node > +'k'. > + > +Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement. > + > +Similarly, lets define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage > +frequency over those memory regions [1/s] such that the product gives an > +(average) bandwidth 'bp' and 'bs' in [pages/s]. > + > +(note: multiple tasks sharing memory naturally avoid duplicat accounting > + because each task will have its own access frequency 'fs') > + > +(pjt: I think this frequency is more numerically consistent if you explicitly > + restrict p/s above to be the working-set. (It also makes explicit the > + requirement for to change about a change in the working set.) > + > + Doing this does have the nice property that it lets you use your frequency > + measurement as a weak-ordering for the benefit a task would receive when > + we can't fit everything. 
> + > + e.g. task1 has working set 10mb, f=90% > + task2 has working set 90mb, f=10% > + > + Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit > + from task1 being on the right node than task2. ) > + > +Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i': > + > + C: t_i -> {c_i, n_i} > + > +This gives us the total interconnect traffic between nodes 'k' and 'l', > +'T_k,l', as: > + > + T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum bp_j,k + bs_j,k where n_i == k, n_j == l > + > +And our goal is to obtain C0 and M0 such that: > + > + T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l > + > +(note: we could introduce 'nc(k,l)' as the cost function of accessing memory > + on node 'l' from node 'k', this would be useful for bigger NUMA systems > + > + pjt: I agree nice to have, but intuition suggests diminishing returns on more > + usual systems given factors like things like Haswell's enormous 35mb l3 > + cache and QPI being able to do a direct fetch.) > + > +(note: do we need a limit on the total memory per node?) > + > + > + * fairness > + > +For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu > +'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a > +load 'L_n': > + > + L_n = 1/P_n * \Sum_i w_i for all c_i = n > + > +using that we can formulate a load difference between CPUs > + > + L_n,m = | L_n - L_m | > + > +Which allows us to state the fairness goal like: > + > + L_n,m(C0) =< L_n,m(C) for all C, n != m > + > +(pjt: It can also be usefully stated that, having converged at C0: > + > + | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) | > + > + Where G_n,m is the greedy partition of tasks between L_n and L_m. This is > + the "worst" partition we should accept; but having it gives us a useful > + bound on how much we can reasonably adjust L_n/L_m at a Pareto point to > + favor T_n,m. 
) > + > +Together they give us the complete multi-objective optimization problem: > + > + min_C,M [ L_n,m(C), T_k,l(C,M) ] > + > + > + > +Notes: > + > + - the memory bandwidth problem is very much an inter-process problem, in > + particular there is no such concept as a process in the above problem. > + > + - the naive solution would completely prefer fairness over interconnect > + traffic, the more complicated solution could pick another Pareto point using > + an aggregate objective function such that we balance the loss of work > + efficiency against the gain of running, we'd want to more or less suggest > + there to be a fixed bound on the error from the Pareto line for any > + such solution. > + > +References: > + > + http://en.wikipedia.org/wiki/Mathematical_optimization > + http://en.wikipedia.org/wiki/Multi-objective_optimization > + > + > +* warning, significant hand-waving ahead, improvements welcome * > + > + > +Partial solutions / approximations: > + > + 1) have task node placement be a pure preference from the 'fairness' pov. > + > +This means we always prefer fairness over interconnect bandwidth. This reduces > +the problem to: > + > + min_C,M [ T_k,l(C,M) ] > + > + 2a) migrate memory towards 'n_i' (the task's node). > + > +This creates memory movement such that 'p_i,k for k != n_i' becomes 0 -- > +provided 'n_i' stays stable enough and there's sufficient memory (looks like > +we might need memory limits for this). > + > +This does however not provide us with any 's_i' (shared) information. It does > +however remove 'M' since it defines memory placement in terms of task > +placement. > + > +XXX properties of this M vs a potential optimal > + > + 2b) migrate memory towards 'n_i' using 2 samples. > + > +This separates pages into those that will migrate and those that will not due > +to the two samples not matching. We could consider the first to be of 'p_i' > +(private) and the second to be of 's_i' (shared). 
> + > +This interpretation can be motivated by the previously observed property that > +'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only > +'s_i' (shared). (here we loose the need for memory limits again, since it > +becomes indistinguishable from shared). > + > +XXX include the statistical babble on double sampling somewhere near > + > +This reduces the problem further; we loose 'M' as per 2a, it further reduces > +the 'T_k,l' (interconnect traffic) term to only include shared (since per the > +above all private will be local): > + > + T_k,l = \Sum_i bs_i,l for every n_i = k, l != k > + > +[ more or less matches the state of sched/numa and describes its remaining > + problems and assumptions. It should work well for tasks without significant > + shared memory usage between tasks. ] > + > +Possible future directions: > + > +Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we > +can evaluate it; > + > + 3a) add per-task per node counters > + > +At fault time, count the number of pages the task faults on for each node. > +This should give an approximation of 'p_i' for the local node and 's_i,k' for > +all remote nodes. > + > +While these numbers provide pages per scan, and so have the unit [pages/s] they > +don't count repeat access and thus aren't actually representable for our > +bandwidth numberes. > + > + 3b) additional frequency term > + > +Additionally (or instead if it turns out we don't need the raw 'p' and 's' > +numbers) we can approximate the repeat accesses by using the time since marking > +the pages as indication of the access frequency. > + > +Let 'I' be the interval of marking pages and 'e' the elapsed time since the > +last marking, then we could estimate the number of accesses 'a' as 'a = I / e'. > +If we then increment the node counters using 'a' instead of 1 we might get > +a better estimate of bandwidth terms. > + > + 3c) additional averaging; can be applied on top of either a/b. 
> + > +[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since > + the decaying avg includes the old accesses and therefore has a measure of repeat > + accesses. > + > + Rik also argued that the sample frequency is too low to get accurate access > + frequency measurements, I'm not entirely convinced, event at low sample > + frequencies the avg elapsed time 'e' over multiple samples should still > + give us a fair approximation of the avg access frequency 'a'. > + > + So doing both b&c has a fair chance of working and allowing us to distinguish > + between important and less important memory accesses. > + > + Experimentation has shown no benefit from the added frequency term so far. ] > + > +This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute > +'T_k,l' Our optimization problem now reads: > + > + min_C [ \Sum_i bs_i,l for every n_i = k, l != k ] > + > +And includes only shared terms, this makes sense since all task private memory > +will become local as per 2. > + > +This suggests that if there is significant shared memory, we should try and > +move towards it. > + > + 4) move towards where 'most' memory is > + > +The simplest significance test is comparing the biggest shared 's_i,k' against > +the private 'p_i'. If we have more shared than private, move towards it. > + > +This effectively makes us move towards where most our memory is and forms a > +feed-back loop with 2. We migrate memory towards us and we migrate towards > +where 'most' memory is. > + > +(Note: even if there were two tasks fully trashing the same shared memory, it > + is very rare for there to be an 50/50 split in memory, lacking a perfect > + split, the small will move towards the larger. In case of the perfect > + split, we'll tie-break towards the lower node number.) 
> + > + 5) 'throttle' 4's node placement > + > +Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize' > +and show representative numbers, we should limit node-migration to not be > +faster than this. > + > + n) poke holes in previous that require more stuff and describe it. > -- > 1.7.11.7 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752748Ab2KPQZj (ORCPT ); Fri, 16 Nov 2012 11:25:39 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752722Ab2KPQZh (ORCPT ); Fri, 16 Nov 2012 11:25:37 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Michel Lespinasse Subject: [PATCH 01/19] mm/generic: Only flush the local TLB in ptep_set_access_flags() Date: Fri, 16 Nov 2012 17:25:03 +0100 Message-Id: <1353083121-4560-2-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel The function ptep_set_access_flags() is only ever used to upgrade access permissions to a page - i.e. they make it less restrictive. That means the only negative side effect of not flushing remote TLBs in this function is that other CPUs may incur spurious page faults, if they happen to access the same address, and still have a PTE with the old permissions cached in their TLB caches. Having another CPU maybe incur a spurious page fault is faster than always incurring the cost of a remote TLB flush, so replace the remote TLB flush with a purely local one.
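The argument above can be sketched as a toy userspace model. This is only an illustration under stated assumptions, not kernel code: `toy_mm`, `toy_pte`, `toy_set_access_flags()` and `toy_write()` are hypothetical names, there is a single PTE with a single write bit, and each CPU's TLB is modelled as a cached copy of that PTE. The permission upgrade refreshes only the local CPU's copy; a remote CPU holding a stale copy takes one spurious fault, re-reads the PTE, and then proceeds.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not kernel code): one PTE, one cached copy per CPU. */
enum { NR_CPUS = 2 };

struct toy_pte {
	bool write;			/* write permission bit */
};

struct toy_mm {
	struct toy_pte pte;		/* the entry in the page table */
	struct toy_pte tlb[NR_CPUS];	/* what each CPU currently has cached */
	int spurious_faults[NR_CPUS];	/* faults taken although the PTE was fine */
};

/* Upgrade permissions and refresh only the calling CPU's cached entry. */
static void toy_set_access_flags(struct toy_mm *mm, int cpu, struct toy_pte entry)
{
	/* Assumption from the changelog: this path never removes permissions. */
	assert(entry.write || !mm->pte.write);

	mm->pte = entry;
	mm->tlb[cpu] = entry;		/* local flush only; remote copies stay stale */
}

/* A write access from 'cpu'; returns true if the write is allowed in the end. */
static bool toy_write(struct toy_mm *mm, int cpu)
{
	if (!mm->tlb[cpu].write) {
		/* Fault: drop the stale entry and reload it from the page table. */
		mm->tlb[cpu] = mm->pte;
		if (mm->pte.write)
			mm->spurious_faults[cpu]++;
	}
	return mm->tlb[cpu].write;
}
```

With CPU 0 doing the upgrade, a later write on CPU 1 succeeds after exactly one spurious fault and further writes fault no more. That is the trade the changelog describes: one cheap fault on the rare remote access instead of a remote flush on every upgrade.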
This should be safe on every architecture that correctly implements flush_tlb_fix_spurious_fault() to actually invalidate the local TLB entry that caused a page fault, as well as on architectures where the hardware invalidates TLB entries that cause page faults. In the unlikely event that you are hitting what appears to be an infinite loop of page faults, and 'git bisect' took you to this changeset, your architecture needs to implement flush_tlb_fix_spurious_fault() to actually flush the TLB entry. Signed-off-by: Rik van Riel Acked-by: Linus Torvalds Acked-by: Peter Zijlstra Cc: Andrew Morton Cc: Michel Lespinasse [ Changelog massage. ] Signed-off-by: Ingo Molnar --- mm/pgtable-generic.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index e642627..d8397da 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -12,8 +12,8 @@ #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS /* - * Only sets the access flags (dirty, accessed, and - * writable). Furthermore, we know it always gets set to a "more + * Only sets the access flags (dirty, accessed), as well as write + * permission. Furthermore, we know it always gets set to a "more * permissive" setting, which allows most architectures to optimize * this. We return whether the PTE actually changed, which in turn * instructs the caller to do things like update__mmu_cache. 
This @@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma, int changed = !pte_same(*ptep, entry); if (changed) { set_pte_at(vma->vm_mm, address, ptep, entry); - flush_tlb_page(vma, address); + flush_tlb_fix_spurious_fault(vma, address); } return changed; } -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752796Ab2KPQZr (ORCPT ); Fri, 16 Nov 2012 11:25:47 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752772Ab2KPQZn (ORCPT ); Fri, 16 Nov 2012 11:25:43 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , "H. Peter Anvin" , Mike Galbraith Subject: [PATCH 04/19] sched, numa, mm: Describe the NUMA scheduling problem formally Date: Fri, 16 Nov 2012 17:25:06 +0100 Message-Id: <1353083121-4560-5-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Peter Zijlstra This is probably a first: formal description of a complex high-level computing problem, within the kernel source. Signed-off-by: Peter Zijlstra Cc: Linus Torvalds Cc: Andrew Morton Cc: Peter Zijlstra Cc: "H. Peter Anvin" Cc: Mike Galbraith Rik van Riel Link: http://lkml.kernel.org/n/tip-mmnlpupoetcatimvjEld16Pb@git.kernel.org [ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! 
] Signed-off-by: Ingo Molnar --- Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++ 1 file changed, 230 insertions(+) create mode 100644 Documentation/scheduler/numa-problem.txt diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt new file mode 100644 index 0000000..a5d2fee --- /dev/null +++ b/Documentation/scheduler/numa-problem.txt @@ -0,0 +1,230 @@ + + +Effective NUMA scheduling problem statement, described formally: + + * minimize interconnect traffic + +For each task 't_i' we have memory, this memory can be spread over multiple +physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on +node 'k' in [pages]. + +If a task shares memory with another task let us denote this as: +'s_i,k', the memory shared between tasks including 't_i' residing on node +'k'. + +Let 'M' be the distribution that governs all 'p' and 's', i.e. the page placement. + +Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage +frequency over those memory regions [1/s] such that the product gives an +(average) bandwidth 'bp' and 'bs' in [pages/s]. + +(note: multiple tasks sharing memory naturally avoid duplicate accounting + because each task will have its own access frequency 'fs') + +(pjt: I think this frequency is more numerically consistent if you explicitly + restrict p/s above to be the working-set. (It also makes explicit the + requirement for <C0,M0> to change about a change in the working set.) + + Doing this does have the nice property that it lets you use your frequency + measurement as a weak-ordering for the benefit a task would receive when + we can't fit everything. + + e.g. task1 has working set 10mb, f=90% + task2 has working set 90mb, f=10% + + Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit + from task1 being on the right node than task2. 
) + +Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i': + + C: t_i -> {c_i, n_i} + +This gives us the total interconnect traffic between nodes 'k' and 'l', +'T_k,l', as: + + T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum bp_j,k + bs_j,k where n_i == k, n_j == l + +And our goal is to obtain C0 and M0 such that: + + T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l + +(note: we could introduce 'nc(k,l)' as the cost function of accessing memory + on node 'l' from node 'k', this would be useful for bigger NUMA systems + + pjt: I agree nice to have, but intuition suggests diminishing returns on more + usual systems given factors like things like Haswell's enormous 35mb l3 + cache and QPI being able to do a direct fetch.) + +(note: do we need a limit on the total memory per node?) + + + * fairness + +For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu +'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a +load 'L_n': + + L_n = 1/P_n * \Sum_i w_i for all c_i = n + +using that we can formulate a load difference between CPUs + + L_n,m = | L_n - L_m | + +Which allows us to state the fairness goal like: + + L_n,m(C0) =< L_n,m(C) for all C, n != m + +(pjt: It can also be usefully stated that, having converged at C0: + + | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) | + + Where G_n,m is the greedy partition of tasks between L_n and L_m. This is + the "worst" partition we should accept; but having it gives us a useful + bound on how much we can reasonably adjust L_n/L_m at a Pareto point to + favor T_n,m. ) + +Together they give us the complete multi-objective optimization problem: + + min_C,M [ L_n,m(C), T_k,l(C,M) ] + + + +Notes: + + - the memory bandwidth problem is very much an inter-process problem, in + particular there is no such concept as a process in the above problem. 
+ + - the naive solution would completely prefer fairness over interconnect + traffic, the more complicated solution could pick another Pareto point using + an aggregate objective function such that we balance the loss of work + efficiency against the gain of running, we'd want to more or less suggest + there to be a fixed bound on the error from the Pareto line for any + such solution. + +References: + + http://en.wikipedia.org/wiki/Mathematical_optimization + http://en.wikipedia.org/wiki/Multi-objective_optimization + + +* warning, significant hand-waving ahead, improvements welcome * + + +Partial solutions / approximations: + + 1) have task node placement be a pure preference from the 'fairness' pov. + +This means we always prefer fairness over interconnect bandwidth. This reduces +the problem to: + + min_C,M [ T_k,l(C,M) ] + + 2a) migrate memory towards 'n_i' (the task's node). + +This creates memory movement such that 'p_i,k for k != n_i' becomes 0 -- +provided 'n_i' stays stable enough and there's sufficient memory (looks like +we might need memory limits for this). + +This does however not provide us with any 's_i' (shared) information. It does +however remove 'M' since it defines memory placement in terms of task +placement. + +XXX properties of this M vs a potential optimal + + 2b) migrate memory towards 'n_i' using 2 samples. + +This separates pages into those that will migrate and those that will not due +to the two samples not matching. We could consider the first to be of 'p_i' +(private) and the second to be of 's_i' (shared). + +This interpretation can be motivated by the previously observed property that +'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only +'s_i' (shared). (here we lose the need for memory limits again, since it +becomes indistinguishable from shared). 
+ +XXX include the statistical babble on double sampling somewhere near + +This reduces the problem further; we lose 'M' as per 2a, it further reduces +the 'T_k,l' (interconnect traffic) term to only include shared (since per the +above all private will be local): + + T_k,l = \Sum_i bs_i,l for every n_i = k, l != k + +[ more or less matches the state of sched/numa and describes its remaining + problems and assumptions. It should work well for tasks without significant + shared memory usage between tasks. ] + +Possible future directions: + +Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we +can evaluate it; + + 3a) add per-task per node counters + +At fault time, count the number of pages the task faults on for each node. +This should give an approximation of 'p_i' for the local node and 's_i,k' for +all remote nodes. + +While these numbers provide pages per scan, and so have the unit [pages/s] they +don't count repeat access and thus aren't actually representative of our +bandwidth numbers. + + 3b) additional frequency term + +Additionally (or instead if it turns out we don't need the raw 'p' and 's' +numbers) we can approximate the repeat accesses by using the time since marking +the pages as indication of the access frequency. + +Let 'I' be the interval of marking pages and 'e' the elapsed time since the +last marking, then we could estimate the number of accesses 'a' as 'a = I / e'. +If we then increment the node counters using 'a' instead of 1 we might get +a better estimate of bandwidth terms. + + 3c) additional averaging; can be applied on top of either a/b. + +[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since + the decaying avg includes the old accesses and therefore has a measure of repeat + accesses. 
+ + Rik also argued that the sample frequency is too low to get accurate access + frequency measurements, I'm not entirely convinced, even at low sample + frequencies the avg elapsed time 'e' over multiple samples should still + give us a fair approximation of the avg access frequency 'a'. + + So doing both b&c has a fair chance of working and allowing us to distinguish + between important and less important memory accesses. + + Experimentation has shown no benefit from the added frequency term so far. ] + +This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute +'T_k,l'. Our optimization problem now reads: + + min_C [ \Sum_i bs_i,l for every n_i = k, l != k ] + +And includes only shared terms, this makes sense since all task private memory +will become local as per 2. + +This suggests that if there is significant shared memory, we should try and +move towards it. + + 4) move towards where 'most' memory is + +The simplest significance test is comparing the biggest shared 's_i,k' against +the private 'p_i'. If we have more shared than private, move towards it. + +This effectively makes us move towards where most of our memory is and forms a +feed-back loop with 2. We migrate memory towards us and we migrate towards +where 'most' memory is. + +(Note: even if there were two tasks fully thrashing the same shared memory, it + is very rare for there to be a 50/50 split in memory, lacking a perfect + split, the smaller will move towards the larger. In case of the perfect + split, we'll tie-break towards the lower node number.) + + 5) 'throttle' 4's node placement + +Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize' +and show representative numbers, we should limit node-migration to not be +faster than this. + + n) poke holes in previous that require more stuff and describe it. 
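[ Editorial sketch, not part of the patch: the significance test of step 4 can be modelled in a few lines of userspace C. The function name preferred_node() and its parameters are hypothetical. The handling of an exact tie between shared and private memory is one assumed reading of the "tie-break towards the lower node number" note above. ]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only; preferred_node() and its arguments are hypothetical
 * names, not kernel API.
 *
 *   private_pages   ~ 'p_i'   (pages the task has on its current node)
 *   shared_pages[k] ~ 's_i,k' (shared pages observed on node k)
 */
static int preferred_node(int cur_node, unsigned long private_pages,
			  const unsigned long *shared_pages, size_t nr_nodes)
{
	size_t k;
	int best = -1;

	assert(shared_pages != NULL);

	/*
	 * Find the remote node holding the most shared memory. The strict
	 * '>' keeps the lowest-numbered node when several nodes tie.
	 */
	for (k = 0; k < nr_nodes; k++) {
		if ((int)k == cur_node)
			continue;
		if (best < 0 || shared_pages[k] > shared_pages[best])
			best = (int)k;
	}

	if (best < 0)
		return cur_node;	/* no remote node to compare against */

	/* Step 4: more shared than private, so move towards the shared memory. */
	if (shared_pages[best] > private_pages)
		return best;

	/* Perfect split: tie-break towards the lower node number (assumed). */
	if (shared_pages[best] == private_pages && best < cur_node)
		return best;

	return cur_node;
}
```

[ With a two-node system and shared_pages = {0, 20}, a task on node 0 with 10 private pages would be steered to node 1, while one with 30 private pages would stay put. Step 5 would then rate-limit how often the result is acted on. ]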
-- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752771Ab2KPQZm (ORCPT ); Fri, 16 Nov 2012 11:25:42 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752722Ab2KPQZk (ORCPT ); Fri, 16 Nov 2012 11:25:40 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 03/19] sched, numa, mm: Make find_busiest_queue() a method Date: Fri, 16 Nov 2012 17:25:05 +0100 Message-Id: <1353083121-4560-4-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Peter Zijlstra It's a bit awkward but it was the least painful means of modifying the queue selection. Used in a later patch to conditionally use a random queue. 
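[ Editorial sketch, not part of the patch: the "method" idea is a function pointer stored in the balancing environment, so the caller can swap the selection policy without touching the call site. The types and names below (struct queue, struct balance_env, busiest_by_load(), first_nonempty()) are hypothetical stand-ins for the kernel's struct rq and struct lb_env, and first_nonempty() is only a deterministic placeholder for the random selector mentioned above. ]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a runqueue. */
struct queue {
	int load;
};

/* Hypothetical stand-in for the balancing environment. */
struct balance_env {
	struct queue *queues;
	size_t nr_queues;
	/* The selection policy travels with the environment. */
	struct queue *(*find_busiest_queue)(struct balance_env *env);
};

/* Default policy: pick the queue with the highest load. */
static struct queue *busiest_by_load(struct balance_env *env)
{
	struct queue *busiest = NULL;
	size_t i;

	for (i = 0; i < env->nr_queues; i++) {
		if (!busiest || env->queues[i].load > busiest->load)
			busiest = &env->queues[i];
	}
	return busiest;
}

/* Alternative policy: pick the first queue that has any load at all. */
static struct queue *first_nonempty(struct balance_env *env)
{
	size_t i;

	for (i = 0; i < env->nr_queues; i++) {
		if (env->queues[i].load > 0)
			return &env->queues[i];
	}
	return NULL;
}
```

[ A caller would fill in .find_busiest_queue when initializing the environment and later invoke env.find_busiest_queue(&env), which is the shape the diff below gives load_balance(). ]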
Signed-off-by: Peter Zijlstra Cc: Paul Turner Cc: Lee Schermerhorn Cc: Christoph Lameter Cc: Rik van Riel Cc: Andrew Morton Cc: Linus Torvalds Link: http://lkml.kernel.org/n/tip-lfpez319yryvdhwqfqrh99f2@git.kernel.org Signed-off-by: Ingo Molnar --- kernel/sched/fair.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6b800a1..6ab627e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3063,6 +3063,9 @@ struct lb_env { unsigned int loop; unsigned int loop_break; unsigned int loop_max; + + struct rq * (*find_busiest_queue)(struct lb_env *, + struct sched_group *); }; /* @@ -4236,13 +4239,14 @@ static int load_balance(int this_cpu, struct rq *this_rq, struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask); struct lb_env env = { - .sd = sd, - .dst_cpu = this_cpu, - .dst_rq = this_rq, - .dst_grpmask = sched_group_cpus(sd->groups), - .idle = idle, - .loop_break = sched_nr_migrate_break, - .cpus = cpus, + .sd = sd, + .dst_cpu = this_cpu, + .dst_rq = this_rq, + .dst_grpmask = sched_group_cpus(sd->groups), + .idle = idle, + .loop_break = sched_nr_migrate_break, + .cpus = cpus, + .find_busiest_queue = find_busiest_queue, }; cpumask_copy(cpus, cpu_active_mask); @@ -4261,7 +4265,7 @@ redo: goto out_balanced; } - busiest = find_busiest_queue(&env, group); + busiest = env.find_busiest_queue(&env, group); if (!busiest) { schedstat_inc(sd, lb_nobusyq[idle]); goto out_balanced; -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752882Ab2KPQ0B (ORCPT ); Fri, 16 Nov 2012 11:26:01 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752772Ab2KPQZ7 (ORCPT ); Fri, 16 Nov 2012 11:25:59 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , 
Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 11/19] mm/mpol: Make MPOL_LOCAL a real policy Date: Fri, 16 Nov 2012 17:25:13 +0100 Message-Id: <1353083121-4560-12-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Peter Zijlstra Make MPOL_LOCAL a real and exposed policy such that applications that relied on the previous default behaviour can explicitly request it. Requested-by: Christoph Lameter Reviewed-by: Rik van Riel Cc: Lee Schermerhorn Cc: Andrew Morton Cc: Linus Torvalds Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 9 ++++++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 23e62e0..3e835c9 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -20,6 +20,7 @@ enum { MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE, + MPOL_LOCAL, MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d04a8a5..72f50ba 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, (flags & MPOL_F_RELATIVE_NODES))) return ERR_PTR(-EINVAL); } + } else if (mode == MPOL_LOCAL) { + if (!nodes_empty(*nodes)) + return ERR_PTR(-EINVAL); + mode = MPOL_PREFERRED; } else if (nodes_empty(*nodes)) return ERR_PTR(-EINVAL); policy = kmem_cache_alloc(policy_cache, GFP_KERNEL); @@ -2397,7 +2401,6 @@ void numa_default_policy(void) * "local" is pseudo-policy: MPOL_PREFERRED with MPOL_F_LOCAL flag * Used only for mpol_parse_str() and 
mpol_to_str() */ -#define MPOL_LOCAL MPOL_MAX static const char * const policy_modes[] = { [MPOL_DEFAULT] = "default", @@ -2450,12 +2453,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context) if (flags) *flags++ = '\0'; /* terminate mode string */ - for (mode = 0; mode <= MPOL_LOCAL; mode++) { + for (mode = 0; mode < MPOL_MAX; mode++) { if (!strcmp(str, policy_modes[mode])) { break; } } - if (mode > MPOL_LOCAL) + if (mode >= MPOL_MAX) goto out; switch (mode) { -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752909Ab2KPQ0G (ORCPT ); Fri, 16 Nov 2012 11:26:06 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752877Ab2KPQ0B (ORCPT ); Fri, 16 Nov 2012 11:26:01 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Lee Schermerhorn Subject: [PATCH 12/19] mm/mpol: Add MPOL_MF_NOOP Date: Fri, 16 Nov 2012 17:25:14 +0100 Message-Id: <1353083121-4560-13-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Lee Schermerhorn This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy to mbind(). When the NOOP policy is used with the 'MOVE and 'LAZY flags, mbind() will map the pages PROT_NONE so that they will be migrated on the next touch. This allows an application to prepare for a new phase of operation where different regions of shared storage will be assigned to worker threads, w/o changing policy. 
Note that we could just use the "default" policy in this case. However, this also allows an application to request that pages be migrated, only if necessary, to follow any arbitrary policy that might currently apply to a range of pages, without knowing the policy, or without specifying multiple mbind()s for ranges with different policies. [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ] Reported-by: Fengguang Wu Signed-off-by: Lee Schermerhorn Reviewed-by: Rik van Riel Cc: Andrew Morton Cc: Linus Torvalds Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 11 ++++++----- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 3e835c9..d23dca8 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -21,6 +21,7 @@ enum { MPOL_BIND, MPOL_INTERLEAVE, MPOL_LOCAL, + MPOL_NOOP, /* retain existing policy for range */ MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 72f50ba..c7c7c86 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, pr_debug("setting mode %d flags %d nodes[0] %lx\n", mode, flags, nodes ? 
nodes_addr(*nodes)[0] : -1); - if (mode == MPOL_DEFAULT) { + if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) { if (nodes && !nodes_empty(*nodes)) return ERR_PTR(-EINVAL); - return NULL; /* simply delete any existing policy */ + return NULL; } VM_BUG_ON(!nodes); @@ -1146,7 +1146,7 @@ static long do_mbind(unsigned long start, unsigned long len, if (start & ~PAGE_MASK) return -EINVAL; - if (mode == MPOL_DEFAULT) + if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) flags &= ~MPOL_MF_STRICT; len = (len + PAGE_SIZE - 1) & PAGE_MASK; @@ -2407,7 +2407,8 @@ static const char * const policy_modes[] = [MPOL_PREFERRED] = "prefer", [MPOL_BIND] = "bind", [MPOL_INTERLEAVE] = "interleave", - [MPOL_LOCAL] = "local" + [MPOL_LOCAL] = "local", + [MPOL_NOOP] = "noop", /* should not actually be used */ }; @@ -2458,7 +2459,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context) break; } } - if (mode >= MPOL_MAX) + if (mode >= MPOL_MAX || mode == MPOL_NOOP) goto out; switch (mode) { -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752972Ab2KPQ0P (ORCPT ); Fri, 16 Nov 2012 11:26:15 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:41441 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752888Ab2KPQ0I (ORCPT ); Fri, 16 Nov 2012 11:26:08 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 14/19] mm/mpol: Create special PROT_NONE infrastructure Date: Fri, 16 Nov 2012 17:25:16 +0100 Message-Id: <1353083121-4560-15-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> 
Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Peter Zijlstra In order to facilitate a lazy -- fault driven -- migration of pages, create a special transient PROT_NONE variant; we can then use the 'spurious' protection faults to drive our migrations from. Pages that already had an effective PROT_NONE mapping will not be detected to generate these 'spurious' faults for the simple reason that we cannot distinguish them on their protection bits, see pte_numa(). This isn't a problem since PROT_NONE (and possibly PROT_WRITE with dirty tracking) aren't used or are rare enough for us to not care about their placement. Suggested-by: Rik van Riel Signed-off-by: Peter Zijlstra Reviewed-by: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Cc: Andrea Arcangeli Link: http://lkml.kernel.org/n/tip-0g5k80y4df8l83lha9j75xph@git.kernel.org [ fixed various cross-arch and THP/!THP details ] Signed-off-by: Ingo Molnar --- include/linux/huge_mm.h | 19 +++++++++++++ include/linux/mm.h | 18 ++++++++++++ mm/huge_memory.c | 32 +++++++++++++++++++++ mm/memory.c | 75 ++++++++++++++++++++++++++++++++++++++++++++----- mm/mprotect.c | 24 +++++++++++----- 5 files changed, 154 insertions(+), 14 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b31cb7d..4f0f948 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -159,6 +159,13 @@ static inline struct page *compound_trans_head(struct page *page) } return page; } + +extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd); + +extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags, pmd_t orig_pmd); + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) @@ -195,6 +202,18 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd, { return 0; } + +static inline bool 
pmd_numa(struct vm_area_struct *vma, pmd_t pmd) +{ + return false; +} + +static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags, pmd_t orig_pmd) +{ +} + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* _LINUX_HUGE_MM_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 2a32cf8..0025bf9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1091,6 +1091,9 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, extern unsigned long do_mremap(unsigned long addr, unsigned long old_len, unsigned long new_len, unsigned long flags, unsigned long new_addr); +extern void change_protection(struct vm_area_struct *vma, unsigned long start, + unsigned long end, pgprot_t newprot, + int dirty_accountable); extern int mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags); @@ -1561,6 +1564,21 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags) } #endif +static inline pgprot_t vma_prot_none(struct vm_area_struct *vma) +{ + /* + * obtain PROT_NONE by removing READ|WRITE|EXEC privs + */ + vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC); + return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags)); +} + +static inline void +change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end) +{ + change_protection(vma, start, end, vma_prot_none(vma), 0); +} + struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr); int remap_pfn_range(struct vm_area_struct *, unsigned long addr, unsigned long pfn, unsigned long size, pgprot_t); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 176fe3d..6924edf 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -725,6 +725,38 @@ out: return handle_pte_fault(mm, vma, address, pte, pmd, flags); } +bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd) +{ + /* + * 
See pte_numa(). + */ + if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot))) + return false; + + return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma))); +} + +void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pmd_t *pmd, + unsigned int flags, pmd_t entry) +{ + unsigned long haddr = address & HPAGE_PMD_MASK; + + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(*pmd, entry))) + goto out_unlock; + + /* do fancy stuff */ + + /* change back to regular protection */ + entry = pmd_modify(entry, vma->vm_page_prot); + if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1)) + update_mmu_cache_pmd(vma, address, entry); + +out_unlock: + spin_unlock(&mm->page_table_lock); +} + int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *vma) diff --git a/mm/memory.c b/mm/memory.c index fb135ba..e3e8ab2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1464,6 +1464,25 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL_GPL(zap_vma_ptes); +static bool pte_numa(struct vm_area_struct *vma, pte_t pte) +{ + /* + * If we have the normal vma->vm_page_prot protections we're not a + * 'special' PROT_NONE page. + * + * This means we cannot get 'special' PROT_NONE faults from genuine + * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty + * tracking. + * + * Neither case is really interesting for our current use though so we + * don't care. 
+ */ + if (pte_same(pte, pte_modify(pte, vma->vm_page_prot))) + return false; + + return pte_same(pte, pte_modify(pte, vma_prot_none(vma))); +} + /** * follow_page - look up a page descriptor from a user-virtual address * @vma: vm_area_struct mapping @address @@ -3433,6 +3452,41 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma, return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte); } +static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, pmd_t *pmd, + unsigned int flags, pte_t entry) +{ + spinlock_t *ptl; + int ret = 0; + + if (!pte_unmap_same(mm, pmd, ptep, entry)) + goto out; + + /* + * Do fancy stuff... + */ + + /* + * OK, nothing to do,.. change the protection back to what it + * ought to be. + */ + ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + if (unlikely(!pte_same(*ptep, entry))) + goto unlock; + + flush_cache_page(vma, address, pte_pfn(entry)); + + ptep_modify_prot_start(mm, address, ptep); + entry = pte_modify(entry, vma->vm_page_prot); + ptep_modify_prot_commit(mm, address, ptep, entry); + + update_mmu_cache(vma, address, ptep); +unlock: + pte_unmap_unlock(ptep, ptl); +out: + return ret; +} + /* * These routines also need to handle stuff like marking pages dirty * and/or accessed for architectures that don't do it in hardware (most @@ -3471,6 +3525,9 @@ int handle_pte_fault(struct mm_struct *mm, pte, pmd, flags, entry); } + if (pte_numa(vma, entry)) + return do_numa_page(mm, vma, address, pte, pmd, flags, entry); + ptl = pte_lockptr(mm, pmd); spin_lock(ptl); if (unlikely(!pte_same(*pte, entry))) @@ -3535,13 +3592,16 @@ retry: pmd, flags); } else { pmd_t orig_pmd = *pmd; - int ret; + int ret = 0; barrier(); - if (pmd_trans_huge(orig_pmd)) { - if (flags & FAULT_FLAG_WRITE && - !pmd_write(orig_pmd) && - !pmd_trans_splitting(orig_pmd)) { + if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) { + if (pmd_numa(vma, orig_pmd)) { + 
do_huge_pmd_numa_page(mm, vma, address, pmd, + flags, orig_pmd); + } + + if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) { ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd); /* @@ -3551,12 +3611,13 @@ retry: */ if (unlikely(ret & VM_FAULT_OOM)) goto retry; - return ret; } - return 0; + + return ret; } } + /* * Use __pte_alloc instead of pte_alloc_map, because we can't * run pte_offset_map on the pmd, if an huge pmd could diff --git a/mm/mprotect.c b/mm/mprotect.c index e97b0d6..392b124 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -112,7 +112,7 @@ static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd, } while (pud++, addr = next, addr != end); } -static void change_protection(struct vm_area_struct *vma, +static void change_protection_range(struct vm_area_struct *vma, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable) { @@ -134,6 +134,20 @@ static void change_protection(struct vm_area_struct *vma, flush_tlb_range(vma, start, end); } +void change_protection(struct vm_area_struct *vma, unsigned long start, + unsigned long end, pgprot_t newprot, + int dirty_accountable) +{ + struct mm_struct *mm = vma->vm_mm; + + mmu_notifier_invalidate_range_start(mm, start, end); + if (is_vm_hugetlb_page(vma)) + hugetlb_change_protection(vma, start, end, newprot); + else + change_protection_range(vma, start, end, newprot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); +} + int mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags) @@ -206,12 +220,8 @@ success: dirty_accountable = 1; } - mmu_notifier_invalidate_range_start(mm, start, end); - if (is_vm_hugetlb_page(vma)) - hugetlb_change_protection(vma, start, end, vma->vm_page_prot); - else - change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); - mmu_notifier_invalidate_range_end(mm, start, end); + change_protection(vma, start, end, 
vma->vm_page_prot, dirty_accountable); + vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); perf_event_mmap(vma); -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753018Ab2KPQ0V (ORCPT ); Fri, 16 Nov 2012 11:26:21 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752977Ab2KPQ0S (ORCPT ); Fri, 16 Nov 2012 11:26:18 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Michel Lespinasse Subject: [PATCH 19/19] x86/mm: Completely drop the TLB flush from ptep_set_access_flags() Date: Fri, 16 Nov 2012 17:25:21 +0100 Message-Id: <1353083121-4560-20-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel Intel has an architectural guarantee that the TLB entry causing a page fault gets invalidated automatically. This means we should be able to drop the local TLB invalidation. Because of the way other areas of the page fault code work, chances are good that all x86 CPUs do this. However, if someone somewhere has an x86 CPU that does not invalidate the TLB entry causing a page fault, this one-liner should be easy to revert - or a CPU model specific quirk could be added to retain this optimization on most CPUs. 
Signed-off-by: Rik van Riel Acked-by: Linus Torvalds Acked-by: Peter Zijlstra Cc: Andrew Morton Cc: Michel Lespinasse [ Applied changelog massage and moved this last in the series, to create bisection distance. ] Signed-off-by: Ingo Molnar --- arch/x86/mm/pgtable.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index be3bb46..7353de3 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma, if (changed && dirty) { *ptep = entry; pte_update_defer(vma->vm_mm, address, ptep); - __flush_tlb_one(address); } return changed; -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752994Ab2KPQ0S (ORCPT ); Fri, 16 Nov 2012 11:26:18 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752975Ab2KPQ0Q (ORCPT ); Fri, 16 Nov 2012 11:26:16 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 18/19] mm/mpol: Use special PROT_NONE to migrate pages Date: Fri, 16 Nov 2012 17:25:20 +0100 Message-Id: <1353083121-4560-19-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Peter Zijlstra Combine our previous PROT_NONE, mpol_misplaced and migrate_misplaced_page() pieces into an effective migrate on fault scheme. 
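[ Editorial sketch, not part of the patch: the migrate-on-fault flow can be modelled as a toy in userspace C. Everything here (struct toy_page, toy_misplaced(), toy_numa_fault()) is hypothetical. The only detail borrowed from the diff below is the convention that the placement check returns -1 when the page is already where it should be. The placement policy is assumed to be "local to the faulting task", which is a simplification of what mpol_misplaced() evaluates. ]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy page: the node it lives on and whether it is marked. */
struct toy_page {
	int node;
	bool prot_none;	/* stands in for the special PROT_NONE marking */
};

/*
 * Stand-in for the placement check: returns the node the page should be
 * on, or -1 if it is already well placed. Assumes a "local" policy.
 */
static int toy_misplaced(const struct toy_page *page, int task_node)
{
	return page->node == task_node ? -1 : task_node;
}

/* Handles one hinting fault; returns true if the page was migrated. */
static bool toy_numa_fault(struct toy_page *page, int task_node)
{
	bool migrated = false;
	int node;

	if (!page->prot_none)
		return false;		/* not a hinting fault */

	node = toy_misplaced(page, task_node);
	if (node != -1) {
		page->node = node;	/* stand-in for the actual migration */
		migrated = true;
	}

	/* Either way, change back to the regular protection. */
	page->prot_none = false;
	return migrated;
}
```

[ In this model a marked page on node 1 touched by a task on node 0 ends up on node 0 with its marking cleared, while a marked page that is already local only has its marking cleared. ]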
Note that (on x86) we rely on PROT_NONE pages being !present and avoid the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the page-migration performance. Suggested-by: Rik van Riel Signed-off-by: Peter Zijlstra Reviewed-by: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Cc: Andrea Arcangeli Link: http://lkml.kernel.org/n/tip-e98gyl8kr9jzooh2s4piuils@git.kernel.org Signed-off-by: Ingo Molnar --- mm/huge_memory.c | 41 +++++++++++++++++++++++++++++++++++- mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++---------------- 2 files changed, 85 insertions(+), 19 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6924edf..c4c0a57 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include "internal.h" @@ -741,12 +742,48 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned int flags, pmd_t entry) { unsigned long haddr = address & HPAGE_PMD_MASK; + struct page *page = NULL; + int node; spin_lock(&mm->page_table_lock); if (unlikely(!pmd_same(*pmd, entry))) goto out_unlock; - /* do fancy stuff */ + if (unlikely(pmd_trans_splitting(entry))) { + spin_unlock(&mm->page_table_lock); + wait_split_huge_page(vma->anon_vma, pmd); + return; + } + +#ifdef CONFIG_NUMA + page = pmd_page(entry); + VM_BUG_ON(!PageCompound(page) || !PageHead(page)); + + get_page(page); + spin_unlock(&mm->page_table_lock); + + /* + * XXX should we serialize against split_huge_page ? + */ + + node = mpol_misplaced(page, vma, haddr); + if (node == -1) + goto do_fixup; + + /* + * Due to lacking code to migrate thp pages, we'll split + * (which preserves the special PROT_NONE) and re-take the + * fault on the normal pages. 
+ */ + split_huge_page(page); + put_page(page); + return; + +do_fixup: + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(*pmd, entry))) + goto out_unlock; +#endif /* change back to regular protection */ entry = pmd_modify(entry, vma->vm_page_prot); @@ -755,6 +792,8 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, out_unlock: spin_unlock(&mm->page_table_lock); + if (page) + put_page(page); } int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, diff --git a/mm/memory.c b/mm/memory.c index a660fd0..0d26a28 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -57,6 +57,7 @@ #include #include #include +#include #include #include @@ -1467,8 +1468,10 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes); static bool pte_numa(struct vm_area_struct *vma, pte_t pte) { /* - * If we have the normal vma->vm_page_prot protections we're not a - * 'special' PROT_NONE page. + * For NUMA page faults, we use PROT_NONE ptes in VMAs with + * "normal" vma->vm_page_prot protections. Genuine PROT_NONE + * VMAs should never get here, because the fault handling code + * will notice that the VMA has no read or write permissions. * * This means we cannot get 'special' PROT_NONE faults from genuine * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty @@ -3473,35 +3476,59 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, pte_t *ptep, pmd_t *pmd, unsigned int flags, pte_t entry) { + struct page *page = NULL; + int node, page_nid = -1; spinlock_t *ptl; - int ret = 0; - - if (!pte_unmap_same(mm, pmd, ptep, entry)) - goto out; - /* - * Do fancy stuff... - */ - - /* - * OK, nothing to do,.. change the protection back to what it - * ought to be. 
- */ - ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + ptl = pte_lockptr(mm, pmd); + spin_lock(ptl); if (unlikely(!pte_same(*ptep, entry))) - goto unlock; + goto out_unlock; + page = vm_normal_page(vma, address, entry); + if (page) { + get_page(page); + page_nid = page_to_nid(page); + node = mpol_misplaced(page, vma, address); + if (node != -1) + goto migrate; + } + +out_pte_upgrade_unlock: flush_cache_page(vma, address, pte_pfn(entry)); ptep_modify_prot_start(mm, address, ptep); entry = pte_modify(entry, vma->vm_page_prot); ptep_modify_prot_commit(mm, address, ptep, entry); + /* No TLB flush needed because we upgraded the PTE */ + update_mmu_cache(vma, address, ptep); -unlock: + +out_unlock: pte_unmap_unlock(ptep, ptl); out: - return ret; + if (page) + put_page(page); + + return 0; + +migrate: + pte_unmap_unlock(ptep, ptl); + + if (!migrate_misplaced_page(page, node)) { + page_nid = node; + goto out; + } + + ptep = pte_offset_map_lock(mm, pmd, address, &ptl); + if (!pte_same(*ptep, entry)) { + put_page(page); + page = NULL; + goto out_unlock; + } + + goto out_pte_upgrade_unlock; } /* -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752948Ab2KPQ0M (ORCPT ); Fri, 16 Nov 2012 11:26:12 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752915Ab2KPQ0J (ORCPT ); Fri, 16 Nov 2012 11:26:09 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Lee Schermerhorn Subject: [PATCH 15/19] mm/mpol: Add MPOL_MF_LAZY Date: Fri, 16 Nov 2012 17:25:17 +0100 Message-Id: <1353083121-4560-16-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: 
<1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Lee Schermerhorn This patch adds another mbind() flag to request "lazy migration". The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected pages are marked PROT_NONE. The pages will be migrated in the fault path on "first touch", if the policy dictates at that time. "Lazy Migration" will allow testing of migrate-on-fault via mbind(). Also allows applications to specify that only subsequently touched pages be migrated to obey new policy, instead of all pages in range. This can be useful for multi-threaded applications working on a large shared data area that is initialized by an initial thread resulting in all pages on one [or a few, if overflowed] nodes. After PROT_NONE, the pages in regions assigned to the worker threads will be automatically migrated local to the threads on 1st touch. Signed-off-by: Lee Schermerhorn Reviewed-by: Rik van Riel Cc: Lee Schermerhorn Cc: Andrew Morton Cc: Linus Torvalds [ nearly complete rewrite.. 
] Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-7rsodo9x8zvm5awru5o7zo0y@git.kernel.org Signed-off-by: Ingo Molnar --- include/uapi/linux/mempolicy.h | 13 ++++++++--- mm/mempolicy.c | 49 +++++++++++++++++++++++++++--------------- 2 files changed, 42 insertions(+), 20 deletions(-) diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 472de8a..6a1baae 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -49,9 +49,16 @@ enum mpol_rebind_step { /* Flags for mbind */ #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ -#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */ -#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ -#define MPOL_MF_INTERNAL (1<<3) /* Internal flags start here */ +#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform + to policy */ +#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */ +#define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */ +#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */ + +#define MPOL_MF_VALID (MPOL_MF_STRICT | \ + MPOL_MF_MOVE | \ + MPOL_MF_MOVE_ALL | \ + MPOL_MF_LAZY) /* * Internal flags that share the struct mempolicy flags word with diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 1b2890c..5ee326c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -583,22 +583,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end, return ERR_PTR(-EFAULT); prev = NULL; for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) { + unsigned long endvma = vma->vm_end; + + if (endvma > end) + endvma = end; + if (vma->vm_start > start) + start = vma->vm_start; + if (!(flags & MPOL_MF_DISCONTIG_OK)) { if (!vma->vm_next && vma->vm_end < end) return ERR_PTR(-EFAULT); if (prev && prev->vm_end < vma->vm_start) return ERR_PTR(-EFAULT); } - if (!is_vm_hugetlb_page(vma) && - ((flags & 
MPOL_MF_STRICT) || + + if (is_vm_hugetlb_page(vma)) + goto next; + + if (flags & MPOL_MF_LAZY) { + change_prot_none(vma, start, endvma); + goto next; + } + + if ((flags & MPOL_MF_STRICT) || ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) && - vma_migratable(vma)))) { - unsigned long endvma = vma->vm_end; + vma_migratable(vma))) { - if (endvma > end) - endvma = end; - if (vma->vm_start > start) - start = vma->vm_start; err = check_pgd_range(vma, start, endvma, nodes, flags, private); if (err) { @@ -606,6 +616,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end, break; } } +next: prev = vma; } return first; @@ -1137,8 +1148,7 @@ static long do_mbind(unsigned long start, unsigned long len, int err; LIST_HEAD(pagelist); - if (flags & ~(unsigned long)(MPOL_MF_STRICT | - MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) + if (flags & ~(unsigned long)MPOL_MF_VALID) return -EINVAL; if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE)) return -EPERM; @@ -1161,6 +1171,9 @@ static long do_mbind(unsigned long start, unsigned long len, if (IS_ERR(new)) return PTR_ERR(new); + if (flags & MPOL_MF_LAZY) + new->flags |= MPOL_F_MOF; + /* * If we are using the default policy then operation * on discontinuous address spaces is okay after all @@ -1197,21 +1210,23 @@ static long do_mbind(unsigned long start, unsigned long len, vma = check_range(mm, start, end, nmask, flags | MPOL_MF_INVERT, &pagelist); - err = PTR_ERR(vma); - if (!IS_ERR(vma)) { - int nr_failed = 0; - + err = PTR_ERR(vma); /* maybe ... 
*/ + if (!IS_ERR(vma) && mode != MPOL_NOOP) err = mbind_range(mm, start, end, new); + if (!err) { + int nr_failed = 0; + if (!list_empty(&pagelist)) { + WARN_ON_ONCE(flags & MPOL_MF_LAZY); nr_failed = migrate_pages(&pagelist, new_vma_page, - (unsigned long)vma, - false, MIGRATE_SYNC); + (unsigned long)vma, + false, MIGRATE_SYNC); if (nr_failed) putback_lru_pages(&pagelist); } - if (!err && nr_failed && (flags & MPOL_MF_STRICT)) + if (nr_failed && (flags & MPOL_MF_STRICT)) err = -EIO; } else putback_lru_pages(&pagelist); -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753063Ab2KPQ04 (ORCPT ); Fri, 16 Nov 2012 11:26:56 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:41441 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752915Ab2KPQ0N (ORCPT ); Fri, 16 Nov 2012 11:26:13 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 17/19] mm/migrate: Introduce migrate_misplaced_page() Date: Fri, 16 Nov 2012 17:25:19 +0100 Message-Id: <1353083121-4560-18-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Peter Zijlstra Add migrate_misplaced_page() which deals with migrating pages from faults. This includes adding a new MIGRATE_FAULT migration mode to deal with the extra page reference required due to having to look up the page. 
Based-on-work-by: Lee Schermerhorn Signed-off-by: Peter Zijlstra Reviewed-by: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/n/tip-es03i8ne7xee0981brw40fl5@git.kernel.org Signed-off-by: Ingo Molnar --- include/linux/migrate.h | 7 ++++ include/linux/migrate_mode.h | 3 ++ mm/migrate.c | 85 +++++++++++++++++++++++++++++++++++++++----- 3 files changed, 87 insertions(+), 8 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index ce7e667..9a5afea 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct *mm, extern void migrate_page_copy(struct page *newpage, struct page *page); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); +extern int migrate_misplaced_page(struct page *page, int node); #else static inline void putback_lru_pages(struct list_head *l) {} @@ -63,5 +64,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #define migrate_page NULL #define fail_migrate_page NULL +static inline +int migrate_misplaced_page(struct page *page, int node) +{ + return -EAGAIN; /* can't migrate now */ +} #endif /* CONFIG_MIGRATION */ + #endif /* _LINUX_MIGRATE_H */ diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h index ebf3d89..40b37dc 100644 --- a/include/linux/migrate_mode.h +++ b/include/linux/migrate_mode.h @@ -6,11 +6,14 @@ * on most operations but not ->writepage as the potential stall time * is too significant * MIGRATE_SYNC will block when migrating pages + * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy + * this path has an extra reference count */ enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC, + MIGRATE_FAULT, }; #endif /* MIGRATE_MODE_H_INCLUDED */ diff --git a/mm/migrate.c b/mm/migrate.c index 77ed2d7..3299949 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ 
-225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head, struct buffer_head *bh = head; /* Simple case, sync compaction */ - if (mode != MIGRATE_ASYNC) { + if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) { do { get_bh(bh); lock_buffer(bh); @@ -279,12 +279,22 @@ static int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, struct buffer_head *head, enum migrate_mode mode) { - int expected_count; + int expected_count = 0; void **pslot; + if (mode == MIGRATE_FAULT) { + /* + * MIGRATE_FAULT has an extra reference on the page and + * otherwise acts like ASYNC, no point in delaying the + * fault, we'll try again next time. + */ + expected_count++; + } + if (!mapping) { /* Anonymous page without mapping */ - if (page_count(page) != 1) + expected_count += 1; + if (page_count(page) != expected_count) return -EAGAIN; return 0; } @@ -294,7 +304,7 @@ static int migrate_page_move_mapping(struct address_space *mapping, pslot = radix_tree_lookup_slot(&mapping->page_tree, page_index(page)); - expected_count = 2 + page_has_private(page); + expected_count += 2 + page_has_private(page); if (page_count(page) != expected_count || radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) { spin_unlock_irq(&mapping->tree_lock); @@ -313,7 +323,7 @@ static int migrate_page_move_mapping(struct address_space *mapping, * the mapping back due to an elevated page count, we would have to * block waiting on other references to be dropped. */ - if (mode == MIGRATE_ASYNC && head && + if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head && !buffer_migrate_lock_buffers(head, mode)) { page_unfreeze_refs(page, expected_count); spin_unlock_irq(&mapping->tree_lock); @@ -521,7 +531,7 @@ int buffer_migrate_page(struct address_space *mapping, * with an IRQ-safe spinlock held. 
In the sync case, the buffers * need to be locked now */ - if (mode != MIGRATE_ASYNC) + if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) BUG_ON(!buffer_migrate_lock_buffers(head, mode)); ClearPagePrivate(page); @@ -687,7 +697,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, struct anon_vma *anon_vma = NULL; if (!trylock_page(page)) { - if (!force || mode == MIGRATE_ASYNC) + if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) goto out; /* @@ -1403,4 +1413,63 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to, } return err; } -#endif + +/* + * Attempt to migrate a misplaced page to the specified destination + * node. + */ +int migrate_misplaced_page(struct page *page, int node) +{ + struct address_space *mapping = page_mapping(page); + int page_lru = page_is_file_cache(page); + struct page *newpage; + int ret = -EAGAIN; + gfp_t gfp = GFP_HIGHUSER_MOVABLE; + + /* + * Don't migrate pages that are mapped in multiple processes. + */ + if (page_mapcount(page) != 1) + goto out; + + /* + * Never wait for allocations just to migrate on fault, but don't dip + * into reserves. And, only accept pages from the specified node. No + * sense migrating to a different "misplaced" page! + */ + if (mapping) + gfp = mapping_gfp_mask(mapping); + gfp &= ~__GFP_WAIT; + gfp |= __GFP_NOMEMALLOC | GFP_THISNODE; + + newpage = alloc_pages_node(node, gfp, 0); + if (!newpage) { + ret = -ENOMEM; + goto out; + } + + if (isolate_lru_page(page)) { + ret = -EBUSY; + goto put_new; + } + + inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru); + ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT); + /* + * A page that has been migrated has all references removed and will be + * freed. A page that has not been migrated will have kepts its + * references and be restored. + */ + dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru); + putback_lru_page(page); +put_new: + /* + * Move the new page to the LRU. 
If migration was not successful + * then this will free the page. + */ + putback_lru_page(newpage); +out: + return ret; +} + +#endif /* CONFIG_NUMA */ -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753075Ab2KPQ1Y (ORCPT ); Fri, 16 Nov 2012 11:27:24 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752926Ab2KPQ0L (ORCPT ); Fri, 16 Nov 2012 11:26:11 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 16/19] numa, mm: Support NUMA hinting page faults from gup/gup_fast Date: Fri, 16 Nov 2012 17:25:18 +0100 Message-Id: <1353083121-4560-17-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Andrea Arcangeli Introduce FOLL_NUMA to tell follow_page to check pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do so because it always invokes handle_mm_fault and retries the follow_page later. KVM secondary MMU page faults will trigger the NUMA hinting page faults through gup_fast -> get_user_pages -> follow_page -> handle_mm_fault. Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages. [ This patch was picked up from the AutoNUMA tree. ] Originally-by: Andrea Arcangeli Cc: Linus Torvalds Cc: Andrew Morton Cc: Peter Zijlstra Cc: Andrea Arcangeli Cc: Rik van Riel [ ported to this tree. 
] Signed-off-by: Ingo Molnar --- include/linux/mm.h | 1 + mm/memory.c | 17 +++++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 0025bf9..1821629 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1600,6 +1600,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address, #define FOLL_MLOCK 0x40 /* mark page as mlocked */ #define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */ #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ +#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, void *data); diff --git a/mm/memory.c b/mm/memory.c index e3e8ab2..a660fd0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1536,6 +1536,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE); goto out; } + if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd)) + goto no_page_table; if (pmd_trans_huge(*pmd)) { if (flags & FOLL_SPLIT) { split_huge_page_pmd(mm, pmd); @@ -1565,6 +1567,8 @@ split_fallthrough: pte = *ptep; if (!pte_present(pte)) goto no_page; + if ((flags & FOLL_NUMA) && pte_numa(vma, pte)) + goto no_page; if ((flags & FOLL_WRITE) && !pte_write(pte)) goto unlock; @@ -1716,6 +1720,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD); vm_flags &= (gup_flags & FOLL_FORCE) ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE); + + /* + * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault + * would be called on PROT_NONE ranges. We must never invoke + * handle_mm_fault on PROT_NONE ranges or the NUMA hinting + * page faults would unprotect the PROT_NONE ranges if + * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd + * bitflag. So to avoid that, don't set FOLL_NUMA if + * FOLL_FORCE is set. 
+ */ + if (!(gup_flags & FOLL_FORCE)) + gup_flags |= FOLL_NUMA; + i = 0; do { -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753102Ab2KPQ1p (ORCPT ); Fri, 16 Nov 2012 11:27:45 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:41441 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752889Ab2KPQ0F (ORCPT ); Fri, 16 Nov 2012 11:26:05 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Lee Schermerhorn Subject: [PATCH 13/19] mm/mpol: Check for misplaced page Date: Fri, 16 Nov 2012 17:25:15 +0100 Message-Id: <1353083121-4560-14-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Lee Schermerhorn This patch provides a new function to test whether a page resides on a node that is appropriate for the mempolicy for the vma and address where the page is supposed to be mapped. This involves looking up the node where the page belongs. So, the function returns that node so that it may be used to allocate the page without consulting the policy again. A subsequent patch will call this function from the fault path. Because of this, I don't want to go ahead and allocate the page, e.g., via alloc_page_vma() only to have to free it if it has the correct policy. So, I just mimic the alloc_page_vma() node computation logic--sort of.
Note: we could use this function to implement a MPOL_MF_STRICT behavior when migrating pages to match mbind() mempolicy--e.g., to ensure that pages in an interleaved range are reinterleaved rather than left where they are when they reside on any page in the interleave nodemask. Signed-off-by: Lee Schermerhorn Reviewed-by: Rik van Riel Cc: Andrew Morton Cc: Linus Torvalds [ Added MPOL_F_LAZY to trigger migrate-on-fault; simplified code now that we don't have to bother with special crap for interleaved ] Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-z3mgep4tgrc08o07vl1ahb2m@git.kernel.org Signed-off-by: Ingo Molnar --- include/linux/mempolicy.h | 8 +++++ include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 76 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 85 insertions(+) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index e5ccb9d..c511e25 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma) return 1; } +extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long); + #else struct mempolicy {}; @@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, return 0; } +static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma, + unsigned long address) +{ + return -1; /* no node preference */ +} + #endif /* CONFIG_NUMA */ #endif diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index d23dca8..472de8a 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -61,6 +61,7 @@ enum mpol_rebind_step { #define MPOL_F_SHARED (1 << 0) /* identify shared policies */ #define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */ #define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */ +#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ #endif /* _UAPI_LINUX_MEMPOLICY_H 
*/ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index c7c7c86..1b2890c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2179,6 +2179,82 @@ static void sp_free(struct sp_node *n) kmem_cache_free(sn_cache, n); } +/** + * mpol_misplaced - check whether current page node is valid in policy + * + * @page - page to be checked + * @vma - vm area where page mapped + * @addr - virtual address where page mapped + * + * Lookup current policy node id for vma,addr and "compare to" page's + * node id. + * + * Returns: + * -1 - not misplaced, page is in the right node + * node - node id where the page should be + * + * Policy determination "mimics" alloc_page_vma(). + * Called from fault path where we know the vma and faulting address. + */ +int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr) +{ + struct mempolicy *pol; + struct zone *zone; + int curnid = page_to_nid(page); + unsigned long pgoff; + int polnid = -1; + int ret = -1; + + BUG_ON(!vma); + + pol = get_vma_policy(current, vma, addr); + if (!(pol->flags & MPOL_F_MOF)) + goto out; + + switch (pol->mode) { + case MPOL_INTERLEAVE: + BUG_ON(addr >= vma->vm_end); + BUG_ON(addr < vma->vm_start); + + pgoff = vma->vm_pgoff; + pgoff += (addr - vma->vm_start) >> PAGE_SHIFT; + polnid = offset_il_node(pol, vma, pgoff); + break; + + case MPOL_PREFERRED: + if (pol->flags & MPOL_F_LOCAL) + polnid = numa_node_id(); + else + polnid = pol->v.preferred_node; + break; + + case MPOL_BIND: + /* + * allows binding to multiple nodes. + * use current page if in policy nodemask, + * else select nearest allowed node, if any. + * If no allowed nodes, use current [!misplaced]. 
+ */ + if (node_isset(curnid, pol->v.nodes)) + goto out; + (void)first_zones_zonelist( + node_zonelist(numa_node_id(), GFP_HIGHUSER), + gfp_zone(GFP_HIGHUSER), + &pol->v.nodes, &zone); + polnid = zone->node; + break; + + default: + BUG(); + } + if (curnid != polnid) + ret = polnid; +out: + mpol_cond_put(pol); + + return ret; +} + static void sp_delete(struct shared_policy *sp, struct sp_node *n) { pr_debug("deleting %lx-l%lx\n", n->start, n->end); -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752863Ab2KPQZ5 (ORCPT ); Fri, 16 Nov 2012 11:25:57 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:41441 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752772Ab2KPQZy (ORCPT ); Fri, 16 Nov 2012 11:25:54 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Ralf Baechle , Martin Schwidefsky , Heiko Carstens , Peter Zijlstra Subject: [PATCH 09/19] sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation Date: Fri, 16 Nov 2012 17:25:11 +0100 Message-Id: <1353083121-4560-10-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Ralf Baechle Add the pmd_pgprot() method that will be needed by the new NUMA code. 
Reported-by: Stephen Rothwell Signed-off-by: Ralf Baechle Cc: Martin Schwidefsky Cc: Heiko Carstens Cc: Peter Zijlstra Signed-off-by: Ingo Molnar --- arch/mips/include/asm/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h index c02158b..bbe4cda 100644 --- a/arch/mips/include/asm/pgtable.h +++ b/arch/mips/include/asm/pgtable.h @@ -89,6 +89,8 @@ static inline int is_zero_pfn(unsigned long pfn) extern void paging_init(void); +#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_PAGE_CHG_MASK) + /* * Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752841Ab2KPQZ4 (ORCPT ); Fri, 16 Nov 2012 11:25:56 -0500 Received: from mail-ea0-f174.google.com ([209.85.215.174]:41441 "EHLO mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752799Ab2KPQZu (ORCPT ); Fri, 16 Nov 2012 11:25:50 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 06/19] mm/thp: Preserve pgprot across huge page split Date: Fri, 16 Nov 2012 17:25:08 +0100 Message-Id: <1353083121-4560-7-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Peter Zijlstra We're going to play games with page-protections, ensure we don't lose them over a THP split. 
Collapse seems to always allocate a new (huge) page which should already end up on the new target node so losing protections there isn't a problem. Signed-off-by: Peter Zijlstra Reviewed-by: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Cc: Andrea Arcangeli Link: http://lkml.kernel.org/n/tip-eyi25t4eh3l4cd2zp4k3bj6c@git.kernel.org Signed-off-by: Ingo Molnar --- arch/x86/include/asm/pgtable.h | 1 + mm/huge_memory.c | 103 ++++++++++++++++++++--------------------- 2 files changed, 50 insertions(+), 54 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a1f780d..f85dccd 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -349,6 +349,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) } #define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK) +#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK) #define canon_pgprot(p) __pgprot(massage_pgprot(p)) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 40f17c3..176fe3d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1343,63 +1343,60 @@ static int __split_huge_page_map(struct page *page, int ret = 0, i; pgtable_t pgtable; unsigned long haddr; + pgprot_t prot; spin_lock(&mm->page_table_lock); pmd = page_check_address_pmd(page, mm, address, PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); - if (pmd) { - pgtable = pgtable_trans_huge_withdraw(mm); - pmd_populate(mm, &_pmd, pgtable); - - haddr = address; - for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { - pte_t *pte, entry; - BUG_ON(PageCompound(page+i)); - entry = mk_pte(page + i, vma->vm_page_prot); - entry = maybe_mkwrite(pte_mkdirty(entry), vma); - if (!pmd_write(*pmd)) - entry = pte_wrprotect(entry); - else - BUG_ON(page_mapcount(page) != 1); - if (!pmd_young(*pmd)) - entry = pte_mkold(entry); - pte = pte_offset_map(&_pmd, haddr); - BUG_ON(!pte_none(*pte)); - set_pte_at(mm, haddr, pte, entry); - pte_unmap(pte); - } + if (!pmd) +
goto unlock; - smp_wmb(); /* make pte visible before pmd */ - /* - * Up to this point the pmd is present and huge and - * userland has the whole access to the hugepage - * during the split (which happens in place). If we - * overwrite the pmd with the not-huge version - * pointing to the pte here (which of course we could - * if all CPUs were bug free), userland could trigger - * a small page size TLB miss on the small sized TLB - * while the hugepage TLB entry is still established - * in the huge TLB. Some CPU doesn't like that. See - * http://support.amd.com/us/Processor_TechDocs/41322.pdf, - * Erratum 383 on page 93. Intel should be safe but is - * also warns that it's only safe if the permission - * and cache attributes of the two entries loaded in - * the two TLB is identical (which should be the case - * here). But it is generally safer to never allow - * small and huge TLB entries for the same virtual - * address to be loaded simultaneously. So instead of - * doing "pmd_populate(); flush_tlb_range();" we first - * mark the current pmd notpresent (atomically because - * here the pmd_trans_huge and pmd_trans_splitting - * must remain set at all times on the pmd until the - * split is complete for this pmd), then we flush the - * SMP TLB and finally we write the non-huge version - * of the pmd entry with pmd_populate. 
- */ - pmdp_invalidate(vma, address, pmd); - pmd_populate(mm, pmd, pgtable); - ret = 1; + prot = pmd_pgprot(*pmd); + pgtable = pgtable_trans_huge_withdraw(mm); + pmd_populate(mm, &_pmd, pgtable); + + for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { + pte_t *pte, entry; + + BUG_ON(PageCompound(page+i)); + entry = mk_pte(page + i, prot); + entry = pte_mkdirty(entry); + if (!pmd_young(*pmd)) + entry = pte_mkold(entry); + pte = pte_offset_map(&_pmd, haddr); + BUG_ON(!pte_none(*pte)); + set_pte_at(mm, haddr, pte, entry); + pte_unmap(pte); } + + smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */ + /* + * Up to this point the pmd is present and huge. + * + * If we overwrite the pmd with the not-huge version, we could trigger + * a small page size TLB miss on the small sized TLB while the hugepage + * TLB entry is still established in the huge TLB. + * + * Some CPUs don't like that. See + * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383 + * on page 93. + * + * Thus it is generally safer to never allow small and huge TLB entries + * for overlapping virtual addresses to be loaded. So we first mark the + * current pmd not present, then we flush the TLB and finally we write + * the non-huge version of the pmd entry with pmd_populate. + * + * The above needs to be done under the ptl because pmd_trans_huge and + * pmd_trans_splitting must remain set on the pmd until the split is + * complete. The ptl also protects against concurrent faults due to + * making the pmd not-present. 
+ */ + set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd)); + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); + pmd_populate(mm, pmd, pgtable); + ret = 1; + +unlock: spin_unlock(&mm->page_table_lock); return ret; @@ -2287,10 +2284,8 @@ static void khugepaged_do_scan(void) { struct page *hpage = NULL; unsigned int progress = 0, pass_through_head = 0; - unsigned int pages = khugepaged_pages_to_scan; bool wait = true; - - barrier(); /* write khugepaged_pages_to_scan to local stack */ + unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan); while (progress < pages) { if (!khugepaged_prealloc_page(&hpage, &wait)) -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752950Ab2KPQ2e (ORCPT ); Fri, 16 Nov 2012 11:28:34 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752832Ab2KPQZ4 (ORCPT ); Fri, 16 Nov 2012 11:25:56 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 10/19] mm/pgprot: Move the pgprot_modify() fallback definition to mm.h Date: Fri, 16 Nov 2012 17:25:12 +0100 Message-Id: <1353083121-4560-11-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org pgprot_modify() is available on x86, but on other architectures it only gets defined in mm/mprotect.c - breaking the build if anything outside of mprotect.c tries to make use of this function. 
Move it to the generic pgprot area in mm.h, so that an upcoming patch can make use of it. Acked-by: Peter Zijlstra Cc: Rik van Riel Cc: Paul Turner Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/n/tip-nfvarGMj9gjavowroorkizb4@git.kernel.org Signed-off-by: Ingo Molnar --- include/linux/mm.h | 13 +++++++++++++ mm/mprotect.c | 7 ------- 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index fa06804..2a32cf8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -164,6 +164,19 @@ extern pgprot_t protection_map[16]; #define FAULT_FLAG_TRIED 0x40 /* second try */ /* + * Some architectures (such as x86) may need to preserve certain pgprot + * bits, without complicating generic pgprot code. + * + * Most architectures don't care: + */ +#ifndef pgprot_modify +static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) +{ + return newprot; +} +#endif + +/* * vm_fault is filled by the the pagefault handler and passed to the vma's * ->fault function. The vma's ->fault is responsible for returning a bitmask * of VM_FAULT_xxx flags that give details about how the fault was handled. 
diff --git a/mm/mprotect.c b/mm/mprotect.c index a409926..e97b0d6 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -28,13 +28,6 @@ #include #include -#ifndef pgprot_modify -static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) -{ - return newprot; -} -#endif - static void change_pte_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable) -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752818Ab2KPQZx (ORCPT ); Fri, 16 Nov 2012 11:25:53 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752772Ab2KPQZu (ORCPT ); Fri, 16 Nov 2012 11:25:50 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 07/19] x86/mm: Introduce pte_accessible() Date: Fri, 16 Nov 2012 17:25:09 +0100 Message-Id: <1353083121-4560-8-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that the pte is associated with a page. However, for TLB flushing purposes, we would like to know whether the pte points to an actually accessible page. This allows us to skip remote TLB flushes for pages that are not actually accessible. Fill in this method for x86 and provide a safe (but slower) method on other architectures. 
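For illustration only, the difference between "present" and "accessible"
described above can be modelled in a few lines of userspace C. This is a
sketch, not kernel code: the MODEL_* bit values and the model_* helper
names are made up for the example and make no claim to match the real
x86 definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this userspace model; not the real x86 values. */
#define MODEL_PAGE_PRESENT   (UINT64_C(1) << 0)
#define MODEL_PAGE_PROTNONE  (UINT64_C(1) << 1)

/* "Present" in the mm sense: the entry is associated with a page. */
static inline bool model_pte_present(uint64_t flags)
{
	return (flags & (MODEL_PAGE_PRESENT | MODEL_PAGE_PROTNONE)) != 0;
}

/* "Accessible": the hardware could have cached this entry in a TLB. */
static inline bool model_pte_accessible(uint64_t flags)
{
	return (flags & MODEL_PAGE_PRESENT) != 0;
}

/* A TLB flush is only worth doing when the cleared entry was accessible. */
static inline bool model_needs_tlb_flush(uint64_t old_flags)
{
	return model_pte_accessible(old_flags);
}
```

In this model a PROTNONE-only entry is present but not accessible, so
clearing it would not need a flush.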
Signed-off-by: Rik van Riel Signed-off-by: Peter Zijlstra Fixed-by: Linus Torvalds Cc: Andrew Morton Cc: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org [ Added Linus's review fixes. ] Signed-off-by: Ingo Molnar --- arch/x86/include/asm/pgtable.h | 6 ++++++ include/asm-generic/pgtable.h | 4 ++++ 2 files changed, 10 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index f85dccd..a984cf9 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -408,6 +408,12 @@ static inline int pte_present(pte_t a) return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); } +#define pte_accessible pte_accessible +static inline int pte_accessible(pte_t a) +{ + return pte_flags(a) & _PAGE_PRESENT; +} + static inline int pte_hidden(pte_t pte) { return pte_flags(pte) & _PAGE_HIDDEN; diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index b36ce40..48fc1dc 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) #define move_pte(pte, prot, old_addr, new_addr) (pte) #endif +#ifndef pte_accessible +# define pte_accessible(pte) ((void)(pte),1) +#endif + #ifndef flush_tlb_fix_spurious_fault #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address) #endif -- 1.7.11.7 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753029Ab2KPQ3E (ORCPT ); Fri, 16 Nov 2012 11:29:04 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752800Ab2KPQZw (ORCPT ); Fri, 16 Nov 2012 11:25:52 -0500 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus 
Torvalds, Peter Zijlstra, Thomas Gleixner, Hugh Dickins
Subject: [PATCH 08/19] mm: Only flush the TLB when clearing an accessible pte
Date: Fri, 16 Nov 2012 17:25:10 +0100
Message-Id: <1353083121-4560-9-git-send-email-mingo@kernel.org>
X-Mailer: git-send-email 1.7.11.7
In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Rik van Riel

If ptep_clear_flush() is called to clear a page table entry that is
not accessible by the CPU anyway, e.g. a _PAGE_PROTNONE page table
entry, there is no need to flush the TLB on remote CPUs.

Signed-off-by: Rik van Riel
Signed-off-by: Peter Zijlstra
Cc: Linus Torvalds
Cc: Andrew Morton
Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.org
Signed-off-by: Ingo Molnar
---
 mm/pgtable-generic.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d8397da..0c8323f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 {
 	pte_t pte;
 	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
-	flush_tlb_page(vma, address);
+	if (pte_accessible(pte))
+		flush_tlb_page(vma, address);
 	return pte;
 }
 #endif
-- 
1.7.11.7

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752880Ab2KPQ3e (ORCPT ); Fri, 16 Nov 2012 11:29:34 -0500
Received: from mail-ea0-f174.google.com ([209.85.215.174]:41441 "EHLO
	mail-ea0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752775Ab2KPQZp (ORCPT ); Fri, 16 Nov 2012 11:25:45 -0500
From: Ingo Molnar
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel,
	Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Hugh Dickins,
	Gerald Schaefer, Martin Schwidefsky, Heiko Carstens, Peter Zijlstra,
	Ralf Baechle
Subject: [PATCH 05/19] sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390
Date: Fri, 16 Nov 2012 17:25:07 +0100
Message-Id: <1353083121-4560-6-git-send-email-mingo@kernel.org>
X-Mailer: git-send-email 1.7.11.7
In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Gerald Schaefer

This patch adds an implementation of pmd_pgprot() for s390, in
preparation for future THP changes.

Reported-by: Stephen Rothwell
Signed-off-by: Gerald Schaefer
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Peter Zijlstra
Cc: Ralf Baechle
Signed-off-by: Ingo Molnar
---
 arch/s390/include/asm/pgtable.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index dd647c9..098fc5a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1240,6 +1240,19 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 	*pmdp = entry;
 }
 
+static inline pgprot_t pmd_pgprot(pmd_t pmd)
+{
+	pgprot_t prot = PAGE_RW;
+
+	if (pmd_val(pmd) & _SEGMENT_ENTRY_RO) {
+		if (pmd_val(pmd) & _SEGMENT_ENTRY_INV)
+			prot = PAGE_NONE;
+		else
+			prot = PAGE_RO;
+	}
+	return prot;
+}
+
 static inline unsigned long massage_pgprot_pmd(pgprot_t pgprot)
 {
 	unsigned long pgprot_pmd = 0;
-- 
1.7.11.7

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752834Ab2KPQaW (ORCPT ); Fri, 16 Nov 2012 11:30:22 -0500
Received: from mail-ee0-f46.google.com ([74.125.83.46]:58067 "EHLO
	mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752725Ab2KPQZi (ORCPT ); Fri, 16 Nov 2012 11:25:38 -0500
From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Michel Lespinasse Subject: [PATCH 02/19] x86/mm: Only do a local tlb flush in ptep_set_access_flags() Date: Fri, 16 Nov 2012 17:25:04 +0100 Message-Id: <1353083121-4560-3-git-send-email-mingo@kernel.org> X-Mailer: git-send-email 1.7.11.7 In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel Because we only ever upgrade a PTE when calling ptep_set_access_flags(), it is safe to skip flushing entries on remote TLBs. The worst that can happen is a spurious page fault on other CPUs, which would flush that TLB entry. Lazily letting another CPU incur a spurious page fault occasionally is (much!) cheaper than aggressively flushing everybody else's TLB. Signed-off-by: Rik van Riel Acked-by: Linus Torvalds Acked-by: Peter Zijlstra Cc: Andrew Morton Cc: Michel Lespinasse Signed-off-by: Ingo Molnar --- arch/x86/mm/pgtable.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 8573b83..be3bb46 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd) free_page((unsigned long)pgd); } +/* + * Used to set accessed or dirty bits in the page table entries + * on other architectures. On x86, the accessed and dirty bits + * are tracked by hardware. However, do_wp_page calls this function + * to also make the pte writeable at the same time the dirty bit is + * set. In that case we do actually need to write the PTE. 
+ */
 int ptep_set_access_flags(struct vm_area_struct *vma,
 			  unsigned long address, pte_t *ptep,
 			  pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
+		__flush_tlb_one(address);
 	}
 
 	return changed;
-- 
1.7.11.7

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753609Ab2KQIfk (ORCPT ); Sat, 17 Nov 2012 03:35:40 -0500
Received: from mail-oa0-f46.google.com ([209.85.219.46]:54997 "EHLO
	mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752506Ab2KQIfi (ORCPT ); Sat, 17 Nov 2012 03:35:38 -0500
MIME-Version: 1.0
In-Reply-To: <1353083121-4560-1-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org>
Date: Sat, 17 Nov 2012 16:35:38 +0800
Message-ID:
Subject: Re: [PATCH 00/19] latest numa/base patches
From: Alex Shi
To: Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Hugh Dickins, Alex Shi
Content-Type: text/plain; charset=ISO-8859-1
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

I just found an imbalance issue with the patchset.

I wrote a one-line busy-loop program:

int main ()
{
	int i;
	for (i=0; i< 1; )
		__asm__ __volatile__ ("nop");
}

It was compiled as pl and started on my 2 socket * 4 cores * HT NUMA
machine. The CPU domain topology looks like this:

domain 0: span 4,12 level SIBLING
 groups: 4 (cpu_power = 589) 12 (cpu_power = 589)
domain 1: span 0,2,4,6,8,10,12,14 level MC
 groups: 4,12 (cpu_power = 1178) 6,14 (cpu_power = 1178) 0,8 (cpu_power = 1178) 2,10 (cpu_power = 1178)
domain 2: span 0,2,4,6,8,10,12,14 level CPU
 groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712)
domain 3: span 0-15 level NUMA
 groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712) 1,3,5,7,9,11,13,15 (cpu_power = 4712)

$for ((i=0; i< I; i++)); do ./pl & done

when I = 2, they are running on cpu 0,12
     I = 4, they are running on cpu 0,9,12,14
     I = 8, they are running on cpu 0,4,9,10,11,12,13,14

Regards!
Alex

On Sat, Nov 17, 2012 at 12:25 AM, Ingo Molnar wrote:
> This is the split-out series of mm/ patches that got no objections
> from the latest (v15) posting of numa/core. If everyone is still
> fine with these then these will be merge candidates for v3.8.
>
> I left out the more contentious policy bits that people are still
> arguing about.
> > The numa/base tree can also be found here: > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/base > > Thanks, > > Ingo > > -------------------> > > Andrea Arcangeli (1): > numa, mm: Support NUMA hinting page faults from gup/gup_fast > > Gerald Schaefer (1): > sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390 > > Ingo Molnar (1): > mm/pgprot: Move the pgprot_modify() fallback definition to mm.h > > Lee Schermerhorn (3): > mm/mpol: Add MPOL_MF_NOOP > mm/mpol: Check for misplaced page > mm/mpol: Add MPOL_MF_LAZY > > Peter Zijlstra (7): > sched, numa, mm: Make find_busiest_queue() a method > sched, numa, mm: Describe the NUMA scheduling problem formally > mm/thp: Preserve pgprot across huge page split > mm/mpol: Make MPOL_LOCAL a real policy > mm/mpol: Create special PROT_NONE infrastructure > mm/migrate: Introduce migrate_misplaced_page() > mm/mpol: Use special PROT_NONE to migrate pages > > Ralf Baechle (1): > sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation > > Rik van Riel (5): > mm/generic: Only flush the local TLB in ptep_set_access_flags() > x86/mm: Only do a local tlb flush in ptep_set_access_flags() > x86/mm: Introduce pte_accessible() > mm: Only flush the TLB when clearing an accessible pte > x86/mm: Completely drop the TLB flush from ptep_set_access_flags() > > Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++ > arch/mips/include/asm/pgtable.h | 2 + > arch/s390/include/asm/pgtable.h | 13 ++ > arch/x86/include/asm/pgtable.h | 7 + > arch/x86/mm/pgtable.c | 8 +- > include/asm-generic/pgtable.h | 4 + > include/linux/huge_mm.h | 19 +++ > include/linux/mempolicy.h | 8 ++ > include/linux/migrate.h | 7 + > include/linux/migrate_mode.h | 3 + > include/linux/mm.h | 32 +++++ > include/uapi/linux/mempolicy.h | 16 ++- > kernel/sched/fair.c | 20 +-- > mm/huge_memory.c | 174 +++++++++++++++-------- > mm/memory.c | 119 +++++++++++++++- > mm/mempolicy.c | 143 +++++++++++++++---- > mm/migrate.c | 85 
++++++++++-- > mm/mprotect.c | 31 +++-- > mm/pgtable-generic.c | 9 +- > 19 files changed, 807 insertions(+), 123 deletions(-) > create mode 100644 Documentation/scheduler/numa-problem.txt > > -- > 1.7.11.7 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Thanks Alex From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753560Ab2KQIkJ (ORCPT ); Sat, 17 Nov 2012 03:40:09 -0500 Received: from mail-ob0-f174.google.com ([209.85.214.174]:48471 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752325Ab2KQIkI (ORCPT ); Sat, 17 Nov 2012 03:40:08 -0500 MIME-Version: 1.0 In-Reply-To: References: <1353083121-4560-1-git-send-email-mingo@kernel.org> Date: Sat, 17 Nov 2012 16:40:07 +0800 Message-ID: Subject: Re: [PATCH 00/19] latest numa/base patches From: Alex Shi To: Ingo Molnar Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins , Alex Shi Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Nov 17, 2012 at 4:35 PM, Alex Shi wrote: > Just find imbalance issue on the patchset. 
>
> I write a one line program:
> int main ()
> {
> int i;
> for (i=0; i< 1; )
> __asm__ __volatile__ ("nop");
> }
> it was compiled with name pl and start it on my 2 socket * 4 cores *
> HT NUMA machine:
> the cpu domain top like this:
> domain 0: span 4,12 level SIBLING
> groups: 4 (cpu_power = 589) 12 (cpu_power = 589)
> domain 1: span 0,2,4,6,8,10,12,14 level MC
> groups: 4,12 (cpu_power = 1178) 6,14 (cpu_power = 1178) 0,8
> (cpu_power = 1178) 2,10 (cpu_power = 1178)
> domain 2: span 0,2,4,6,8,10,12,14 level CPU
> groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712)
> domain 3: span 0-15 level NUMA
> groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712) 1,3,5,7,9,11,13,15
> (cpu_power = 4712)
>
> $for ((i=0; i< I; i++)); do ./pl & done
> when I = 2, they are running on cpu 0,12
> I = 4, they are running on cpu 0,9,12,14
> I = 8, they are running on cpu 0,4,9,10,11,12,13,14
>

Oops, this was tested on the latest v15 tip/master tree (head is
a7b7a8ad4476bb641c8455a4e0d7d0fd3eb86f90), not on this series. Sorry.

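To put numbers on the placements listed above, here is a minimal sketch
in C. It assumes, from the quoted domain dump, that the even-numbered
CPUs form one NUMA group and the odd-numbered CPUs the other, and that
CPU n and CPU n+8 are HT siblings (the dump only shows this for 4/12,
6/14, 0/8 and 2/10; extending it to the odd CPUs is a guess). The helper
names are made up for the example.

```c
#include <stddef.h>

/* Assumption from the domain dump: even CPUs are one NUMA group, odd CPUs the other. */
static inline int cpu_to_node_guess(int cpu)
{
	return cpu & 1;
}

/* Assumption: CPUs n and n+8 are HT siblings, so cpu % 8 identifies the core. */
static inline int cpu_to_core_guess(int cpu)
{
	return cpu % 8;
}

/* Count how many of the listed CPUs (0..15) belong to the given NUMA group. */
static inline int tasks_on_node(const int *cpus, size_t n, int node)
{
	int count = 0;

	for (size_t i = 0; i < n; i++)
		if (cpu_to_node_guess(cpus[i]) == node)
			count++;
	return count;
}

/* Count cores on which more than one of the listed CPUs is busy. */
static inline int cores_with_both_siblings_busy(const int *cpus, size_t n)
{
	int busy[8] = { 0 };
	int doubled = 0;

	for (size_t i = 0; i < n; i++)
		busy[cpu_to_core_guess(cpus[i])]++;
	for (int c = 0; c < 8; c++)
		if (busy[c] > 1)
			doubled++;
	return doubled;
}
```

Under these assumptions, the I = 8 placement (0,4,9,10,11,12,13,14) puts
5 tasks on the even group and 3 on the odd group, with one core (4/12)
running two tasks; the I = 2 placement (0,12) puts both tasks in the
same group.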
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752635Ab2KSC0G (ORCPT ); Sun, 18 Nov 2012 21:26:06 -0500 Received: from mail-ee0-f46.google.com ([74.125.83.46]:40320 "EHLO mail-ee0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752181Ab2KSC0D (ORCPT ); Sun, 18 Nov 2012 21:26:03 -0500 Date: Mon, 19 Nov 2012 03:25:58 +0100 From: Ingo Molnar To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Paul Turner , Lee Schermerhorn , Christoph Lameter , Rik van Riel , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: [PATCH 17/19, v2] mm/migrate: Introduce migrate_misplaced_page() Message-ID: <20121119022558.GA3186@gmail.com> References: <1353083121-4560-1-git-send-email-mingo@kernel.org> <1353083121-4560-18-git-send-email-mingo@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1353083121-4560-18-git-send-email-mingo@kernel.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Ingo Molnar wrote: > From: Peter Zijlstra > > Add migrate_misplaced_page() which deals with migrating pages from > faults. > > This includes adding a new MIGRATE_FAULT migration mode to > deal with the extra page reference required due to having to look up > the page. [...] > --- a/include/linux/migrate_mode.h > +++ b/include/linux/migrate_mode.h > @@ -6,11 +6,14 @@ > * on most operations but not ->writepage as the potential stall time > * is too significant > * MIGRATE_SYNC will block when migrating pages > + * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy > + * this path has an extra reference count > */ Note, this is still the older, open-coded version. 
The newer replacement version created from Mel's patch which reuses migrate_pages() and is nicer on out-of-node-memory conditions and is cleaner all around can be found below. I tested it today and it appears to work fine. I noticed no performance improvement or performance drop from it - if it holds up in testing it will be part of the -v17 release of numa/core. Thanks, Ingo --------------------------> Subject: mm/migration: Introduce migrate_misplaced_page() From: Mel Gorman Date: Fri, 16 Nov 2012 11:22:23 +0000 Note: This was originally based on Peter's patch "mm/migrate: Introduce migrate_misplaced_page()" but borrows extremely heavily from Andrea's "autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection". The end result is barely recognisable so signed-offs had to be dropped. If original authors are ok with it, I'll re-add the signed-off-bys. Add migrate_misplaced_page() which deals with migrating pages from faults. Based-on-work-by: Lee Schermerhorn Based-on-work-by: Peter Zijlstra Based-on-work-by: Andrea Arcangeli Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Johannes Weiner Cc: Hugh Dickins Cc: Linus Torvalds Cc: Linux-MM Cc: Peter Zijlstra Cc: Andrea Arcangeli Link: http://lkml.kernel.org/r/1353064973-26082-14-git-send-email-mgorman@suse.de [ Adapted to the numa/core tree. 
] Signed-off-by: Ingo Molnar --- mm/memory.c | 13 ++----- mm/migrate.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 106 insertions(+), 10 deletions(-) Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c +++ linux/mm/memory.c @@ -3494,28 +3494,25 @@ out_pte_upgrade_unlock: out_unlock: pte_unmap_unlock(ptep, ptl); -out: + if (page) { task_numa_fault(page_nid, last_cpu, 1); put_page(page); } - +out: return 0; migrate: pte_unmap_unlock(ptep, ptl); - if (!migrate_misplaced_page(page, node)) { - page_nid = node; + if (migrate_misplaced_page(page, node)) { goto out; } + page = NULL; ptep = pte_offset_map_lock(mm, pmd, address, &ptl); - if (!pte_same(*ptep, entry)) { - put_page(page); - page = NULL; + if (!pte_same(*ptep, entry)) goto out_unlock; - } goto out_pte_upgrade_unlock; } Index: linux/mm/migrate.c =================================================================== --- linux.orig/mm/migrate.c +++ linux/mm/migrate.c @@ -279,7 +279,7 @@ static int migrate_page_move_mapping(str struct page *newpage, struct page *page, struct buffer_head *head, enum migrate_mode mode) { - int expected_count; + int expected_count = 0; void **pslot; if (!mapping) { @@ -1403,4 +1403,103 @@ int migrate_vmas(struct mm_struct *mm, c } return err; } -#endif + +/* + * Returns true if this is a safe migration target node for misplaced NUMA + * pages. Currently it only checks the watermarks which crude + */ +static bool migrate_balanced_pgdat(struct pglist_data *pgdat, + int nr_migrate_pages) +{ + int z; + for (z = pgdat->nr_zones - 1; z >= 0; z--) { + struct zone *zone = pgdat->node_zones + z; + + if (!populated_zone(zone)) + continue; + + if (zone->all_unreclaimable) + continue; + + /* Avoid waking kswapd by allocating pages_to_migrate pages. 
*/ + if (!zone_watermark_ok(zone, 0, + high_wmark_pages(zone) + + nr_migrate_pages, + 0, 0)) + continue; + return true; + } + return false; +} + +static struct page *alloc_misplaced_dst_page(struct page *page, + unsigned long data, + int **result) +{ + int nid = (int) data; + struct page *newpage; + + newpage = alloc_pages_exact_node(nid, + (GFP_HIGHUSER_MOVABLE | GFP_THISNODE | + __GFP_NOMEMALLOC | __GFP_NORETRY | + __GFP_NOWARN) & + ~GFP_IOFS, 0); + return newpage; +} + +/* + * Attempt to migrate a misplaced page to the specified destination + * node. Caller is expected to have an elevated reference count on + * the page that will be dropped by this function before returning. + */ +int migrate_misplaced_page(struct page *page, int node) +{ + int isolated = 0; + LIST_HEAD(migratepages); + + /* + * Don't migrate pages that are mapped in multiple processes. + * TODO: Handle false sharing detection instead of this hammer + */ + if (page_mapcount(page) != 1) + goto out; + + /* Avoid migrating to a node that is nearly full */ + if (migrate_balanced_pgdat(NODE_DATA(node), 1)) { + int page_lru; + + if (isolate_lru_page(page)) { + put_page(page); + goto out; + } + isolated = 1; + + /* + * Page is isolated which takes a reference count so now the + * callers reference can be safely dropped without the page + * disappearing underneath us during migration + */ + put_page(page); + + page_lru = page_is_file_cache(page); + inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru); + list_add(&page->lru, &migratepages); + } + + if (isolated) { + int nr_remaining; + + nr_remaining = migrate_pages(&migratepages, + alloc_misplaced_dst_page, + node, false, MIGRATE_ASYNC); + if (nr_remaining) { + putback_lru_pages(&migratepages); + isolated = 0; + } + } + BUG_ON(!list_empty(&migratepages)); +out: + return isolated; +} + +#endif /* CONFIG_NUMA */ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753743Ab2KSQDU 
(ORCPT ); Mon, 19 Nov 2012 11:03:20 -0500 Received: from mx1.redhat.com ([209.132.183.28]:64420 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752845Ab2KSQDT (ORCPT ); Mon, 19 Nov 2012 11:03:19 -0500 Message-ID: <50AA582E.30602@redhat.com> Date: Mon, 19 Nov 2012 11:02:54 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120827 Thunderbird/15.0 MIME-Version: 1.0 To: Ingo Molnar CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Paul Turner , Lee Schermerhorn , Christoph Lameter , Mel Gorman , Andrew Morton , Andrea Arcangeli , Linus Torvalds , Peter Zijlstra , Thomas Gleixner , Hugh Dickins Subject: Re: [PATCH 17/19, v2] mm/migrate: Introduce migrate_misplaced_page() References: <1353083121-4560-1-git-send-email-mingo@kernel.org> <1353083121-4560-18-git-send-email-mingo@kernel.org> <20121119022558.GA3186@gmail.com> In-Reply-To: <20121119022558.GA3186@gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/18/2012 09:25 PM, Ingo Molnar wrote: > > * Ingo Molnar wrote: > >> From: Peter Zijlstra >> >> Add migrate_misplaced_page() which deals with migrating pages from >> faults. >> >> This includes adding a new MIGRATE_FAULT migration mode to >> deal with the extra page reference required due to having to look up >> the page. > [...] > >> --- a/include/linux/migrate_mode.h >> +++ b/include/linux/migrate_mode.h >> @@ -6,11 +6,14 @@ >> * on most operations but not ->writepage as the potential stall time >> * is too significant >> * MIGRATE_SYNC will block when migrating pages >> + * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy >> + * this path has an extra reference count >> */ > > Note, this is still the older, open-coded version. 
>
> The newer replacement version created from Mel's patch which
> reuses migrate_pages() and is nicer on out-of-node-memory
> conditions and is cleaner all around can be found below.
>
> I tested it today and it appears to work fine. I noticed no
> performance improvement or performance drop from it - if it
> holds up in testing it will be part of the -v17 release of
> numa/core.

Excellent. That gets rid of the last issue with numa/base :)

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751086Ab2KYGJr (ORCPT ); Sun, 25 Nov 2012 01:09:47 -0500
Received: from mail-ia0-f174.google.com ([209.85.210.174]:61190 "EHLO mail-ia0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750830Ab2KYGJp (ORCPT ); Sun, 25 Nov 2012 01:09:45 -0500
MIME-Version: 1.0
In-Reply-To: <1353083121-4560-5-git-send-email-mingo@kernel.org>
References: <1353083121-4560-1-git-send-email-mingo@kernel.org> <1353083121-4560-5-git-send-email-mingo@kernel.org>
Date: Sun, 25 Nov 2012 11:39:45 +0530
Message-ID:
Subject: Re: [PATCH 04/19] sched, numa, mm: Describe the NUMA scheduling problem formally
From: abhishek agarwal
To: Ingo Molnar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli, Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Hugh Dickins, "H. Peter Anvin", Mike Galbraith
Content-Type: text/plain; charset=ISO-8859-1
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

As per 4) move towards where "most" memory is: if we have more shared memory than private memory, why not just move the process towards the memory, instead of moving the memory towards the node?
This will, I guess, be less cumbersome than moving all the shared memory.

On Fri, Nov 16, 2012 at 9:55 PM, Ingo Molnar wrote:
> From: Peter Zijlstra
>
> This is probably a first: formal description of a complex high-level
> computing problem, within the kernel source.
>
> Signed-off-by: Peter Zijlstra
> Cc: Linus Torvalds
> Cc: Andrew Morton
> Cc: Peter Zijlstra
> Cc: "H. Peter Anvin"
> Cc: Mike Galbraith
> Rik van Riel
> Link: http://lkml.kernel.org/n/tip-mmnlpupoetcatimvjEld16Pb@git.kernel.org
> [ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ]
> Signed-off-by: Ingo Molnar
> ---
> Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
> 1 file changed, 230 insertions(+)
> create mode 100644 Documentation/scheduler/numa-problem.txt
>
> diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
> new file mode 100644
> index 0000000..a5d2fee
> --- /dev/null
> +++ b/Documentation/scheduler/numa-problem.txt
> @@ -0,0 +1,230 @@
> +
> +
> +Effective NUMA scheduling problem statement, described formally:
> +
> + * minimize interconnect traffic
> +
> +For each task 't_i' we have memory, this memory can be spread over multiple
> +physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
> +node 'k' in [pages].
> +
> +If a task shares memory with another task let us denote this as:
> +'s_i,k', the memory shared between tasks including 't_i' residing on node
> +'k'.
> +
> +Let 'M' be the distribution that governs all 'p' and 's', i.e. the page placement.
> +
> +Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
> +frequency over those memory regions [1/s] such that the product gives an
> +(average) bandwidth 'bp' and 'bs' in [pages/s].
> +
> +(note: multiple tasks sharing memory naturally avoid duplicate accounting
> +       because each task will have its own access frequency 'fs')
> +
> +(pjt: I think this frequency is more numerically consistent if you explicitly
> +      restrict p/s above to be the working-set. (It also makes explicit the
> +      requirement for <C0,M0> to change about a change in the working set.)
> +
> +      Doing this does have the nice property that it lets you use your frequency
> +      measurement as a weak-ordering for the benefit a task would receive when
> +      we can't fit everything.
> +
> +      e.g. task1 has working set 10mb, f=90%
> +           task2 has working set 90mb, f=10%
> +
> +      Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
> +      from task1 being on the right node than task2. )
> +
> +Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
> +
> +  C: t_i -> {c_i, n_i}
> +
> +This gives us the total interconnect traffic between nodes 'k' and 'l',
> +'T_k,l', as:
> +
> +  T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k where n_i == k, n_j == l
> +
> +And our goal is to obtain C0 and M0 such that:
> +
> +  T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
> +
> +(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
> +       on node 'l' from node 'k', this would be useful for bigger NUMA systems
> +
> + pjt: I agree nice to have, but intuition suggests diminishing returns on more
> +      usual systems given factors like Haswell's enormous 35mb l3
> +      cache and QPI being able to do a direct fetch.)
> +
> +(note: do we need a limit on the total memory per node?)
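A minimal sketch of the 'T_k,l' sum quoted above, only to make the two halves of the sum concrete. The struct and function names are hypothetical and not taken from the patch; bandwidths are assumed to be already measured in [pages/s], and 'k != l' is assumed.

```c
#include <stddef.h>

#define NR_NODES 2	/* assumption: a small two-node system */

/* Hypothetical per-task record; not a kernel data structure. */
struct task_bw {
	int node;		/* n_i: node the task currently runs on */
	long bp[NR_NODES];	/* bp_i,k: private bandwidth to node k [pages/s] */
	long bs[NR_NODES];	/* bs_i,k: shared bandwidth to node k [pages/s] */
};

/*
 * T_k,l for k != l: tasks running on node k that touch memory on node l,
 * plus tasks running on node l that touch memory on node k.
 */
static long interconnect_traffic(const struct task_bw *t, size_t nr,
				 int k, int l)
{
	long sum = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (t[i].node == k)
			sum += t[i].bp[l] + t[i].bs[l];
		else if (t[i].node == l)
			sum += t[i].bp[k] + t[i].bs[k];
	}
	return sum;
}
```

Written this way the result is symmetric in 'k' and 'l', which matches reading 'T_k,l' as the traffic on the link between the two nodes.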
> +
> +
> + * fairness
> +
> +For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
> +'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
> +load 'L_n':
> +
> +  L_n = 1/P_n * \Sum_i w_i for all c_i = n
> +
> +using that we can formulate a load difference between CPUs
> +
> +  L_n,m = | L_n - L_m |
> +
> +Which allows us to state the fairness goal like:
> +
> +  L_n,m(C0) =< L_n,m(C) for all C, n != m
> +
> +(pjt: It can also be usefully stated that, having converged at C0:
> +
> +   | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
> +
> +      Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
> +      the "worst" partition we should accept; but having it gives us a useful
> +      bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
> +      favor T_n,m. )
> +
> +Together they give us the complete multi-objective optimization problem:
> +
> +  min_C,M [ L_n,m(C), T_k,l(C,M) ]
> +
> +
> +
> +Notes:
> +
> + - the memory bandwidth problem is very much an inter-process problem, in
> +   particular there is no such concept as a process in the above problem.
> +
> + - the naive solution would completely prefer fairness over interconnect
> +   traffic, the more complicated solution could pick another Pareto point using
> +   an aggregate objective function such that we balance the loss of work
> +   efficiency against the gain of running, we'd want to more or less suggest
> +   there to be a fixed bound on the error from the Pareto line for any
> +   such solution.
> +
> +References:
> +
> +  http://en.wikipedia.org/wiki/Mathematical_optimization
> +  http://en.wikipedia.org/wiki/Multi-objective_optimization
> +
> +
> +* warning, significant hand-waving ahead, improvements welcome *
> +
> +
> +Partial solutions / approximations:
> +
> + 1) have task node placement be a pure preference from the 'fairness' pov.
> +
> +This means we always prefer fairness over interconnect bandwidth.
This reduces
> +the problem to:
> +
> +  min_C,M [ T_k,l(C,M) ]
> +
> + 2a) migrate memory towards 'n_i' (the task's node).
> +
> +This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
> +provided 'n_i' stays stable enough and there's sufficient memory (looks like
> +we might need memory limits for this).
> +
> +This does however not provide us with any 's_i' (shared) information. It does
> +however remove 'M' since it defines memory placement in terms of task
> +placement.
> +
> +XXX properties of this M vs a potential optimal
> +
> + 2b) migrate memory towards 'n_i' using 2 samples.
> +
> +This separates pages into those that will migrate and those that will not due
> +to the two samples not matching. We could consider the first to be of 'p_i'
> +(private) and the second to be of 's_i' (shared).
> +
> +This interpretation can be motivated by the previously observed property that
> +'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
> +'s_i' (shared). (here we lose the need for memory limits again, since it
> +becomes indistinguishable from shared).
> +
> +XXX include the statistical babble on double sampling somewhere near
> +
> +This reduces the problem further; we lose 'M' as per 2a, it further reduces
> +the 'T_k,l' (interconnect traffic) term to only include shared (since per the
> +above all private will be local):
> +
> +  T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
> +
> +[ more or less matches the state of sched/numa and describes its remaining
> +  problems and assumptions. It should work well for tasks without significant
> +  shared memory usage between tasks. ]
> +
> +Possible future directions:
> +
> +Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
> +can evaluate it;
> +
> + 3a) add per-task per node counters
> +
> +At fault time, count the number of pages the task faults on for each node.
> +This should give an approximation of 'p_i' for the local node and 's_i,k' for
> +all remote nodes.
> +
> +While these numbers provide pages per scan, and so have the unit [pages/s] they
> +don't count repeat access and thus aren't actually representative of our
> +bandwidth numbers.
> +
> + 3b) additional frequency term
> +
> +Additionally (or instead if it turns out we don't need the raw 'p' and 's'
> +numbers) we can approximate the repeat accesses by using the time since marking
> +the pages as indication of the access frequency.
> +
> +Let 'I' be the interval of marking pages and 'e' the elapsed time since the
> +last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
> +If we then increment the node counters using 'a' instead of 1 we might get
> +a better estimate of bandwidth terms.
> +
> + 3c) additional averaging; can be applied on top of either a/b.
> +
> +[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
> +  the decaying avg includes the old accesses and therefore has a measure of repeat
> +  accesses.
> +
> +  Rik also argued that the sample frequency is too low to get accurate access
> +  frequency measurements, I'm not entirely convinced, even at low sample
> +  frequencies the avg elapsed time 'e' over multiple samples should still
> +  give us a fair approximation of the avg access frequency 'a'.
> +
> +  So doing both b&c has a fair chance of working and allowing us to distinguish
> +  between important and less important memory accesses.
> +
> +  Experimentation has shown no benefit from the added frequency term so far. ]
> +
> +This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
> +'T_k,l'. Our optimization problem now reads:
> +
> +  min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
> +
> +And includes only shared terms, this makes sense since all task private memory
> +will become local as per 2.
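The counter scheme in 3a to 3c can be sketched roughly as below. This is only an illustration under stated assumptions: the names are hypothetical, time is in arbitrary integer units, the weight 'a = I / e' is clamped to at least 1, and "decay" is taken to mean halving the counters once per scan. None of these choices comes from the patch.

```c
#define NR_NODES 4	/* assumption: four NUMA nodes */

/* Hypothetical per-task fault counters, one per node (3a). */
struct numa_faults {
	unsigned long count[NR_NODES];
};

/* 3b: estimate repeat accesses as a = I / e, never less than 1. */
static unsigned long access_weight(unsigned long interval,
				   unsigned long elapsed)
{
	unsigned long a;

	if (elapsed == 0)
		elapsed = 1;	/* assumption: avoid dividing by zero */
	a = interval / elapsed;
	return a ? a : 1;
}

/* At fault time: bump the counter of the faulting node by 'a', not by 1. */
static void account_fault(struct numa_faults *f, int node,
			  unsigned long interval, unsigned long elapsed)
{
	f->count[node] += access_weight(interval, elapsed);
}

/* 3c: once per scan, halve the counters so that old accesses fade out. */
static void decay_faults(struct numa_faults *f)
{
	int n;

	for (n = 0; n < NR_NODES; n++)
		f->count[n] /= 2;
}
```

A page touched soon after marking (small 'e') counts for more than one touched late in the interval, which is the intent of the frequency term; whether that helps in practice is left open by the text above.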
> +
> +This suggests that if there is significant shared memory, we should try and
> +move towards it.
> +
> + 4) move towards where 'most' memory is
> +
> +The simplest significance test is comparing the biggest shared 's_i,k' against
> +the private 'p_i'. If we have more shared than private, move towards it.
> +
> +This effectively makes us move towards where most of our memory is and forms a
> +feed-back loop with 2. We migrate memory towards us and we migrate towards
> +where 'most' memory is.
> +
> +(Note: even if there were two tasks fully thrashing the same shared memory, it
> +       is very rare for there to be a 50/50 split in memory; lacking a perfect
> +       split, the smaller will move towards the larger. In case of the perfect
> +       split, we'll tie-break towards the lower node number.)
> +
> + 5) 'throttle' 4's node placement
> +
> +Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
> +and show representative numbers, we should limit node-migration to not be
> +faster than this.
> +
> + n) poke holes in previous that require more stuff and describe it.
> --
> 1.7.11.7
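Steps 4 and 5 of the quoted document already describe moving the task towards its memory. A rough sketch of that decision follows, with hypothetical names and with the assumption that the throttle is expressed as "at least two scans since the last move"; ties between nodes go to the lower node number.

```c
#define NR_NODES 4	/* assumption: four NUMA nodes */

/* Hypothetical per-task placement state; not a kernel structure. */
struct task_numa {
	int node;			/* n_i: current node */
	unsigned long private_pages;	/* p_i */
	unsigned long shared[NR_NODES];	/* s_i,k */
	unsigned long last_move_scan;	/* scan number of the last node move */
};

/* Step 4: prefer the node with the biggest 's_i,k' if it exceeds 'p_i'. */
static int pick_node(const struct task_numa *t)
{
	int k, best = 0;

	for (k = 1; k < NR_NODES; k++) {
		/* strict '>' keeps the lower node number on a tie */
		if (t->shared[k] > t->shared[best])
			best = k;
	}
	return t->shared[best] > t->private_pages ? best : t->node;
}

/* Step 5: only move if at least two scans passed since the last move. */
static int maybe_move(struct task_numa *t, unsigned long scan_seq)
{
	int target = pick_node(t);

	if (target == t->node)
		return 0;
	if (scan_seq - t->last_move_scan < 2)
		return 0;
	t->node = target;
	t->last_move_scan = scan_seq;
	return 1;
}
```

Together with 2 (memory follows the task) this gives the feed-back loop the text mentions: private pages migrate to the task, and the task migrates to where most of its shared pages are.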