From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qe0-f53.google.com (mail-qe0-f53.google.com [209.85.128.53]) by kanga.kvack.org (Postfix) with ESMTP id EE3686B0031 for ; Fri, 17 Jan 2014 16:14:06 -0500 (EST) Received: by mail-qe0-f53.google.com with SMTP id t7so4445845qeb.26 for ; Fri, 17 Jan 2014 13:14:06 -0800 (PST) Received: from shelob.surriel.com (shelob.surriel.com. [2002:4a5c:3b41:1:216:3eff:fe57:7f4]) by mx.google.com with ESMTPS id t32si3266781qgd.102.2014.01.17.13.14.05 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 17 Jan 2014 13:14:05 -0800 (PST) From: riel@redhat.com Subject: [PATCH v2 0/7] pseudo-interleaving for automatic NUMA balancing Date: Fri, 17 Jan 2014 16:12:02 -0500 Message-Id: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being available to the workload. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration of pages 3) distribute shared memory across the active nodes, to maximize memory bandwidth available to the workload This patch series identifies the NUMA nodes on which the workload is actively running, and balances (somewhat lazily) the memory between those nodes, satisfying the criteria above. As usual, the series has had some performance testing, but it could always benefit from more testing, on other systems. 
Changes since v1: - fix divide by zero found by Chegu Vinod - improve comment, as suggested by Peter Zijlstra - do stats calculations in task_numa_placement in local variables Some performance numbers, with two 40-warehouse specjbb instances on an 8 node system with 10 CPU cores per node, using a pre-cleanup version of these patches, courtesy of Chegu Vinod: numactl manual pinning spec1.txt: throughput = 755900.20 SPECjbb2005 bops spec2.txt: throughput = 754914.40 SPECjbb2005 bops NO-pinning results (Automatic NUMA balancing, with patches) spec1.txt: throughput = 706439.84 SPECjbb2005 bops spec2.txt: throughput = 729347.75 SPECjbb2005 bops NO-pinning results (Automatic NUMA balancing, without patches) spec1.txt: throughput = 667988.47 SPECjbb2005 bops spec2.txt: throughput = 638220.45 SPECjbb2005 bops No Automatic NUMA and NO-pinning results spec1.txt: throughput = 544120.97 SPECjbb2005 bops spec2.txt: throughput = 453553.41 SPECjbb2005 bops My own performance numbers are not as relevant, since I have been running with a more hostile workload on purpose, and I have run into a scheduler issue that caused the workload to run on only two of the four NUMA nodes on my test system... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f178.google.com (mail-qc0-f178.google.com [209.85.216.178]) by kanga.kvack.org (Postfix) with ESMTP id C94B96B0035 for ; Fri, 17 Jan 2014 16:14:56 -0500 (EST) Received: by mail-qc0-f178.google.com with SMTP id m20so4180692qcx.9 for ; Fri, 17 Jan 2014 13:14:56 -0800 (PST) Received: from shelob.surriel.com (shelob.surriel.com. 
[2002:4a5c:3b41:1:216:3eff:fe57:7f4]) by mx.google.com with ESMTPS id o46si3268562qgo.158.2014.01.17.13.14.55 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 17 Jan 2014 13:14:55 -0800 (PST) From: riel@redhat.com Subject: [PATCH 1/7] numa,sched,mm: remove p->numa_migrate_deferred Date: Fri, 17 Jan 2014 16:12:03 -0500 Message-Id: <1389993129-28180-2-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com From: Rik van Riel Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now that the second stage of the automatic numa balancing code has stabilized, it is time to replace the simplistic migration deferral code with something smarter. 
Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- include/linux/sched.h | 1 - kernel/sched/fair.c | 8 -------- kernel/sysctl.c | 7 ------- mm/mempolicy.c | 45 --------------------------------------------- 4 files changed, 61 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 68a0e84..97efba4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1469,7 +1469,6 @@ struct task_struct { unsigned int numa_scan_period; unsigned int numa_scan_period_max; int numa_preferred_nid; - int numa_migrate_deferred; unsigned long numa_migrate_retry; u64 node_stamp; /* migration stamp */ struct callback_head numa_work; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 867b0a4..41e2176 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -819,14 +819,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256; /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ unsigned int sysctl_numa_balancing_scan_delay = 1000; -/* - * After skipping a page migration on a shared page, skip N more numa page - * migrations unconditionally. This reduces the number of NUMA migrations - * in shared memory workloads, and has the effect of pulling tasks towards - * where their memory lives, over pulling the memory towards the task. 
- */ -unsigned int sysctl_numa_balancing_migrate_deferred = 16; - static unsigned int task_nr_scan_windows(struct task_struct *p) { unsigned long rss = 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 096db74..4d19492 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -384,13 +384,6 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, { - .procname = "numa_balancing_migrate_deferred", - .data = &sysctl_numa_balancing_migrate_deferred, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = proc_dointvec, - }, - { .procname = "numa_balancing", .data = NULL, /* filled in by handler */ .maxlen = sizeof(unsigned int), diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 36cb46c..052abac 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2301,35 +2301,6 @@ static void sp_free(struct sp_node *n) kmem_cache_free(sn_cache, n); } -#ifdef CONFIG_NUMA_BALANCING -static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid) -{ - /* Never defer a private fault */ - if (cpupid_match_pid(p, last_cpupid)) - return false; - - if (p->numa_migrate_deferred) { - p->numa_migrate_deferred--; - return true; - } - return false; -} - -static inline void defer_numa_migrate(struct task_struct *p) -{ - p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred; -} -#else -static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid) -{ - return false; -} - -static inline void defer_numa_migrate(struct task_struct *p) -{ -} -#endif /* CONFIG_NUMA_BALANCING */ - /** * mpol_misplaced - check whether current page node is valid in policy * @@ -2432,24 +2403,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long */ last_cpupid = page_cpupid_xchg_last(page, this_cpupid); if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) { - - /* See sysctl_numa_balancing_migrate_deferred comment */ - if (!cpupid_match_pid(current, last_cpupid)) - defer_numa_migrate(current); - 
goto out; } - - /* - * The quadratic filter above reduces extraneous migration - * of shared pages somewhat. This code reduces it even more, - * reducing the overhead of page migrations of shared pages. - * This makes workloads with shared pages rely more on - * "move task near its memory", and less on "move memory - * towards its task", which is exactly what we want. - */ - if (numa_migrate_deferred(current, last_cpupid)) - goto out; } if (curnid != polnid) -- 1.8.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f54.google.com (mail-qa0-f54.google.com [209.85.216.54]) by kanga.kvack.org (Postfix) with ESMTP id C68556B0031 for ; Fri, 17 Jan 2014 16:16:02 -0500 (EST) Received: by mail-qa0-f54.google.com with SMTP id i13so3722086qae.41 for ; Fri, 17 Jan 2014 13:16:02 -0800 (PST) Received: from shelob.surriel.com (shelob.surriel.com. [2002:4a5c:3b41:1:216:3eff:fe57:7f4]) by mx.google.com with ESMTPS id x3si14125029qat.95.2014.01.17.13.16.01 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 17 Jan 2014 13:16:02 -0800 (PST) From: riel@redhat.com Subject: [PATCH 2/7] numa,sched: track from which nodes NUMA faults are triggered Date: Fri, 17 Jan 2014 16:12:04 -0500 Message-Id: <1389993129-28180-3-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com From: Rik van Riel Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. 
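As a rough illustration of the bookkeeping (a userspace sketch, not the kernel code), the counters can be pictured as one allocation carved into four regions: memory-side faults, CPU-side faults, and a buffer for each. The names task_faults, faults_idx, task_faults_alloc and record_fault are made up for this example, and the two-slots-per-node indexing is an assumption about how task_faults_idx lays out shared and private counters.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the per-task fault counters. */
struct task_faults {
	unsigned long *faults;             /* node the faulting memory lives on */
	unsigned long *faults_from;        /* node the faulting CPU was on */
	unsigned long *faults_buffer;      /* memory-side faults since last scan */
	unsigned long *faults_from_buffer; /* CPU-side faults since last scan */
};

/* Assumed layout: two slots (shared = 0, private = 1) per node. */
int faults_idx(int nid, int priv)
{
	return 2 * nid + priv;
}

/* One zeroed allocation, split into four regions of 2 * nr_nodes counters. */
int task_faults_alloc(struct task_faults *tf, int nr_nodes)
{
	size_t region;

	assert(nr_nodes > 0);
	region = 2 * (size_t)nr_nodes;

	tf->faults = calloc(4 * region, sizeof(*tf->faults));
	if (!tf->faults)
		return -1;
	tf->faults_from = tf->faults + region;
	tf->faults_buffer = tf->faults + 2 * region;
	tf->faults_from_buffer = tf->faults + 3 * region;
	return 0;
}

/* Record a fault on memory at mem_nid, taken by a CPU on cpu_nid. */
void record_fault(struct task_faults *tf, int mem_nid, int cpu_nid,
		  int priv, unsigned long pages)
{
	tf->faults_buffer[faults_idx(mem_nid, priv)] += pages;
	tf->faults_from_buffer[faults_idx(cpu_nid, priv)] += pages;
}
```

Keeping the two live regions adjacent means a loop over the first 4 * nr_nodes entries covers both the memory-side and the CPU-side statistics in one pass.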
The next patches use this to build up a bitmap of which nodes a workload is actively running on. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- include/linux/sched.h | 10 ++++++++-- kernel/sched/fair.c | 30 +++++++++++++++++++++++------- 2 files changed, 31 insertions(+), 9 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 97efba4..a9f7f05 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1492,6 +1492,14 @@ struct task_struct { unsigned long *numa_faults_buffer; /* + * Track the nodes where faults are incurred. This is not very + * interesting on a per-task basis, but it helps with smarter + * numa memory placement for groups of processes. + */ + unsigned long *numa_faults_from; + unsigned long *numa_faults_from_buffer; + + /* * numa_faults_locality tracks if faults recorded during the last * scan window were remote/local. The task scan period is adapted * based on the locality of the faults with different weights @@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags); extern pid_t task_numa_group_id(struct task_struct *p); extern void set_numabalancing_state(bool enabled); extern void task_numa_free(struct task_struct *p); - -extern unsigned int sysctl_numa_balancing_migrate_deferred; #else static inline void task_numa_fault(int last_node, int node, int pages, int flags) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 41e2176..1945ddc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -886,6 +886,7 @@ struct numa_group { struct rcu_head rcu; unsigned long total_faults; + unsigned long *faults_from; unsigned long faults[0]; }; @@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p) int priv, i; for (priv = 0; priv < 2; priv++) { - long diff; + long diff, f_diff; i = task_faults_idx(nid, priv); diff = -p->numa_faults[i]; + f_diff = -p->numa_faults_from[i]; /* Decay existing window, 
copy faults since last scan */ p->numa_faults[i] >>= 1; @@ -1383,12 +1385,18 @@ static void task_numa_placement(struct task_struct *p) fault_types[priv] += p->numa_faults_buffer[i]; p->numa_faults_buffer[i] = 0; + p->numa_faults_from[i] >>= 1; + p->numa_faults_from[i] += p->numa_faults_from_buffer[i]; + p->numa_faults_from_buffer[i] = 0; + faults += p->numa_faults[i]; diff += p->numa_faults[i]; + f_diff += p->numa_faults_from[i]; p->total_numa_faults += diff; if (p->numa_group) { /* safe because we can only change our own group */ p->numa_group->faults[i] += diff; + p->numa_group->faults_from[i] += f_diff; p->numa_group->total_faults += diff; group_faults += p->numa_group->faults[i]; } @@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, if (unlikely(!p->numa_group)) { unsigned int size = sizeof(struct numa_group) + - 2*nr_node_ids*sizeof(unsigned long); + 4*nr_node_ids*sizeof(unsigned long); grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN); if (!grp) @@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, spin_lock_init(&grp->lock); INIT_LIST_HEAD(&grp->task_list); grp->gid = p->pid; + /* Second half of the array tracks where faults come from */ + grp->faults_from = grp->faults + 2 * nr_node_ids; - for (i = 0; i < 2*nr_node_ids; i++) + for (i = 0; i < 4*nr_node_ids; i++) grp->faults[i] = p->numa_faults[i]; grp->total_faults = p->total_numa_faults; @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, double_lock(&my_grp->lock, &grp->lock); - for (i = 0; i < 2*nr_node_ids; i++) { + for (i = 0; i < 4*nr_node_ids; i++) { my_grp->faults[i] -= p->numa_faults[i]; grp->faults[i] += p->numa_faults[i]; } @@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p) if (grp) { spin_lock(&grp->lock); - for (i = 0; i < 2*nr_node_ids; i++) + for (i = 0; i < 4*nr_node_ids; i++) grp->faults[i] -= p->numa_faults[i]; grp->total_faults -= 
p->total_numa_faults; @@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p) p->numa_faults = NULL; p->numa_faults_buffer = NULL; + p->numa_faults_from = NULL; + p->numa_faults_from_buffer = NULL; kfree(numa_faults); } @@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags) { struct task_struct *p = current; bool migrated = flags & TNF_MIGRATED; + int this_node = task_node(current); int priv; if (!numabalancing_enabled) @@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags) /* Allocate buffer to track faults on a per-node basis */ if (unlikely(!p->numa_faults)) { - int size = sizeof(*p->numa_faults) * 2 * nr_node_ids; + int size = sizeof(*p->numa_faults) * 4 * nr_node_ids; /* numa_faults and numa_faults_buffer share the allocation */ p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN); @@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags) return; BUG_ON(p->numa_faults_buffer); - p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids); + p->numa_faults_from = p->numa_faults + (2 * nr_node_ids); + p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids); + p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids); p->total_numa_faults = 0; memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); } @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags) p->numa_pages_migrated += pages; p->numa_faults_buffer[task_faults_idx(node, priv)] += pages; + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages; p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages; } -- 1.8.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qe0-f50.google.com (mail-qe0-f50.google.com [209.85.128.50]) by kanga.kvack.org (Postfix) with ESMTP id 10C016B0035 for ; Fri, 17 Jan 2014 16:17:08 -0500 (EST) Received: by mail-qe0-f50.google.com with SMTP id 1so4591781qec.9 for ; Fri, 17 Jan 2014 13:17:07 -0800 (PST) Received: from shelob.surriel.com (shelob.surriel.com. [2002:4a5c:3b41:1:216:3eff:fe57:7f4]) by mx.google.com with ESMTPS id h2si6220628qcn.9.2014.01.17.13.17.06 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 17 Jan 2014 13:17:07 -0800 (PST) From: riel@redhat.com Subject: [PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics Date: Fri, 17 Jan 2014 16:12:05 -0500 Message-Id: <1389993129-28180-4-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com From: Rik van Riel The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. 
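The mask update can be sketched in userspace roughly as follows. This is only an illustration under a few assumptions: a uint64_t bitmask stands in for nodemask_t, the per-node counts are passed in already summed over shared and private faults, and the function name update_active_nodes is made up. The 40% and 20% thresholds are the ones used in the patch below.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64 /* limit of the uint64_t stand-in, not a kernel limit */

/*
 * Return the new active-node bitmask. A node joins the mask when it
 * causes more than 40% of the busiest node's faults, and only leaves
 * once it drops below 20%, so nodes hovering in between keep their
 * current state.
 */
uint64_t update_active_nodes(uint64_t active,
			     const unsigned long *faults_from, int nr_nodes)
{
	unsigned long max_faults = 0;
	int nid;

	assert(nr_nodes > 0 && nr_nodes <= MAX_NODES);

	for (nid = 0; nid < nr_nodes; nid++)
		if (faults_from[nid] > max_faults)
			max_faults = faults_from[nid];

	for (nid = 0; nid < nr_nodes; nid++) {
		uint64_t bit = UINT64_C(1) << nid;

		if (!(active & bit)) {
			if (faults_from[nid] > max_faults * 4 / 10)
				active |= bit;
		} else if (faults_from[nid] < max_faults * 2 / 10) {
			active &= ~bit;
		}
	}
	return active;
}
```

With counts of 100, 30, 50 and 10 faults, a node at 30 is not added to an empty mask but stays in the mask if it was already active, which is the point of using two thresholds.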
Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1945ddc..aa680e2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -885,6 +885,7 @@ struct numa_group { struct list_head task_list; struct rcu_head rcu; + nodemask_t active_nodes; unsigned long total_faults; unsigned long *faults_from; unsigned long faults[0]; @@ -1275,6 +1276,38 @@ static void numa_migrate_preferred(struct task_struct *p) } /* + * Iterate over the nodes from which NUMA hinting faults were triggered, in + * other words where the CPUs that incurred NUMA hinting faults are. The + * bitmask is used to limit NUMA page migrations, and spread out memory + * between the actively used nodes. To prevent flip-flopping, and excessive + * page migrations, nodes are added when they cause over 40% of the maximum + * number of faults, but only removed when they drop below 20%. + */ +static void update_numa_active_node_mask(struct task_struct *p) +{ + unsigned long faults, max_faults = 0; + struct numa_group *numa_group = p->numa_group; + int nid; + + for_each_online_node(nid) { + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + + numa_group->faults_from[task_faults_idx(nid, 1)]; + if (faults > max_faults) + max_faults = faults; + } + + for_each_online_node(nid) { + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + + numa_group->faults_from[task_faults_idx(nid, 1)]; + if (!node_isset(nid, numa_group->active_nodes)) { + if (faults > max_faults * 4 / 10) + node_set(nid, numa_group->active_nodes); + } else if (faults < max_faults * 2 / 10) + node_clear(nid, numa_group->active_nodes); + } +} + +/* * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS * increments. 
The more local the fault statistics are, the higher the scan * period will be for the next scan window. If local/remote ratio is below @@ -1416,6 +1449,7 @@ static void task_numa_placement(struct task_struct *p) update_task_scan_period(p, fault_types[0], fault_types[1]); if (p->numa_group) { + update_numa_active_node_mask(p); /* * If the preferred task and group nids are different, * iterate over the nodes again to find the best place. @@ -1478,6 +1512,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, /* Second half of the array tracks where faults come from */ grp->faults_from = grp->faults + 2 * nr_node_ids; + node_set(task_node(current), grp->active_nodes); + for (i = 0; i < 4*nr_node_ids; i++) grp->faults[i] = p->numa_faults[i]; @@ -1547,6 +1583,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, my_grp->nr_tasks--; grp->nr_tasks++; + update_numa_active_node_mask(p); + spin_unlock(&my_grp->lock); spin_unlock(&grp->lock); -- 1.8.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qe0-f48.google.com (mail-qe0-f48.google.com [209.85.128.48]) by kanga.kvack.org (Postfix) with ESMTP id 4908B6B0031 for ; Fri, 17 Jan 2014 16:18:13 -0500 (EST) Received: by mail-qe0-f48.google.com with SMTP id ne12so2642209qeb.7 for ; Fri, 17 Jan 2014 13:18:13 -0800 (PST) Received: from shelob.surriel.com (shelob.surriel.com. 
[2002:4a5c:3b41:1:216:3eff:fe57:7f4]) by mx.google.com with ESMTPS id i33si3275414qgf.180.2014.01.17.13.18.12 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 17 Jan 2014 13:18:12 -0800 (PST) From: riel@redhat.com Subject: [PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes Date: Fri, 17 Jan 2014 16:12:06 -0500 Message-Id: <1389993129-28180-5-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com From: Rik van Riel Being able to see how the active nodemask changes over time, and why, can be quite useful. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 8 ++++++-- 2 files changed, 40 insertions(+), 2 deletions(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 67e1bbf..91726b6 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa, __entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid, __entry->dst_cpu, __entry->dst_nid) ); + +TRACE_EVENT(update_numa_active_nodes_mask, + + TP_PROTO(int pid, int gid, int nid, int set, long faults, long max_faults), + + TP_ARGS(pid, gid, nid, set, faults, max_faults), + + TP_STRUCT__entry( + __field( pid_t, pid) + __field( pid_t, gid) + __field( int, nid) + __field( int, set) + __field( long, faults) + __field( long, max_faults); + ), + + TP_fast_assign( + __entry->pid = pid; + __entry->gid = gid; + __entry->nid = nid; + __entry->set = set; + __entry->faults = faults; + __entry->max_faults = max_faults; + ), + + TP_printk("pid=%d gid=%d nid=%d set=%d faults=%ld max_faults=%ld", + 
__entry->pid, + __entry->gid, + __entry->nid, + __entry->set, + __entry->faults, + __entry->max_faults) + +); #endif /* _TRACE_SCHED_H */ /* This part must be outside protection */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index aa680e2..3551009 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p) faults = numa_group->faults_from[task_faults_idx(nid, 0)] + numa_group->faults_from[task_faults_idx(nid, 1)]; if (!node_isset(nid, numa_group->active_nodes)) { - if (faults > max_faults * 4 / 10) + if (faults > max_faults * 4 / 10) { + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults); node_set(nid, numa_group->active_nodes); - } else if (faults < max_faults * 2 / 10) + } + } else if (faults < max_faults * 2 / 10) { + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, false, faults, max_faults); node_clear(nid, numa_group->active_nodes); + } } } -- 1.8.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f171.google.com (mail-qc0-f171.google.com [209.85.216.171]) by kanga.kvack.org (Postfix) with ESMTP id AEB086B0035 for ; Fri, 17 Jan 2014 16:19:21 -0500 (EST) Received: by mail-qc0-f171.google.com with SMTP id n7so4111491qcx.16 for ; Fri, 17 Jan 2014 13:19:21 -0800 (PST) Received: from shelob.surriel.com (shelob.surriel.com. 
[2002:4a5c:3b41:1:216:3eff:fe57:7f4]) by mx.google.com with ESMTPS id q18si14147756qeu.44.2014.01.17.13.19.20 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 17 Jan 2014 13:19:20 -0800 (PST) From: riel@redhat.com Subject: [PATCH 5/7] numa,sched,mm: use active_nodes nodemask to limit numa migrations Date: Fri, 17 Jan 2014 16:12:07 -0500 Message-Id: <1389993129-28180-6-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com From: Rik van Riel Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration of pages 3) distribute shared memory across the active nodes, to maximize memory bandwidth available to the workload This patch accomplishes that by implementing the following policy for NUMA migrations: 1) always migrate on a private fault 2) never migrate to a node that is not in the set of active nodes for the numa_group 3) always migrate from a node outside of the set of active nodes, to a node that is in that set 4) within the set of active nodes in the numa_group, only migrate from a node with more NUMA page faults, to a node with fewer NUMA page faults, with a 25% margin to avoid ping-ponging This results in most pages of a workload ending up on the actively used nodes, with reduced ping-ponging of pages between those nodes. 
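The four rules above can be sketched as a small userspace function. This is an illustration only: the name should_migrate, the private_fault flag (standing in for the cpupid check) and the uint64_t bitmask (standing in for the nodemask) are assumptions of the sketch, and the case where no numa_group exists yet is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a page should move from src_nid to dst_nid.
 * group_faults[] holds the group's NUMA fault count per node.
 */
bool should_migrate(bool private_fault, uint64_t active_nodes,
		    const unsigned long *group_faults,
		    int src_nid, int dst_nid)
{
	assert(src_nid >= 0 && src_nid < 64);
	assert(dst_nid >= 0 && dst_nid < 64);

	/* 1) always migrate on a private fault */
	if (private_fault)
		return true;

	/* 2) never migrate to a node outside the active set */
	if (!(active_nodes & (UINT64_C(1) << dst_nid)))
		return false;

	/* 3) always migrate from outside the active set into it */
	if (!(active_nodes & (UINT64_C(1) << src_nid)))
		return true;

	/*
	 * 4) both nodes are active: only move from the busier node to
	 * the less busy one, and only if the destination has fewer
	 * than 3/4 of the source's faults (the 25% margin).
	 */
	return group_faults[dst_nid] < group_faults[src_nid] * 3 / 4;
}
```

The margin in rule 4 means two active nodes with similar fault counts never trade pages back and forth, since neither is below 3/4 of the other.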
Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- include/linux/sched.h | 7 +++++++ kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++ mm/mempolicy.c | 3 +++ 3 files changed, 47 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index a9f7f05..0af6c1a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags); extern pid_t task_numa_group_id(struct task_struct *p); extern void set_numabalancing_state(bool enabled); extern void task_numa_free(struct task_struct *p); +extern bool should_numa_migrate(struct task_struct *p, int last_cpupid, + int src_nid, int dst_nid); #else static inline void task_numa_fault(int last_node, int node, int pages, int flags) @@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled) static inline void task_numa_free(struct task_struct *p) { } +static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid, + int src_nid, int dst_nid) +{ + return true; +} #endif static inline struct pid *task_pid(struct task_struct *task) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3551009..8e0a53a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -948,6 +948,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid) return 1000 * group_faults(p, nid) / p->numa_group->total_faults; } +bool should_numa_migrate(struct task_struct *p, int last_cpupid, + int src_nid, int dst_nid) +{ + struct numa_group *ng = p->numa_group; + + /* Always allow migrate on private faults */ + if (cpupid_match_pid(p, last_cpupid)) + return true; + + /* A shared fault, but p->numa_group has not been set up yet. */ + if (!ng) + return true; + + /* + * Do not migrate if the destination is not a node that + * is actively used by this numa group. 
+ */ + if (!node_isset(dst_nid, ng->active_nodes)) + return false; + + /* + * Source is a node that is not actively used by this + * numa group, while the destination is. Migrate. + */ + if (!node_isset(src_nid, ng->active_nodes)) + return true; + + /* + * Both source and destination are nodes in active + * use by this numa group. Maximize memory bandwidth + * by migrating from more heavily used groups, to less + * heavily used ones, spreading the load around. + * Use a 1/4 hysteresis to avoid spurious page movement. + */ + return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4); +} + static unsigned long weighted_cpuload(const int cpu); static unsigned long source_load(int cpu, int type); static unsigned long target_load(int cpu, int type); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 052abac..050962b 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) { goto out; } + + if (!should_numa_migrate(current, last_cpupid, curnid, polnid)) + goto out; } if (curnid != polnid) -- 1.8.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qe0-f43.google.com (mail-qe0-f43.google.com [209.85.128.43]) by kanga.kvack.org (Postfix) with ESMTP id B8A376B0031 for ; Fri, 17 Jan 2014 16:20:24 -0500 (EST) Received: by mail-qe0-f43.google.com with SMTP id nc12so4495598qeb.16 for ; Fri, 17 Jan 2014 13:20:24 -0800 (PST) Received: from shelob.surriel.com (shelob.surriel.com. 
[2002:4a5c:3b41:1:216:3eff:fe57:7f4]) by mx.google.com with ESMTPS id 75si3279061qgv.147.2014.01.17.13.20.23 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 17 Jan 2014 13:20:23 -0800 (PST) From: riel@redhat.com Subject: [PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use Date: Fri, 17 Jan 2014 16:12:08 -0500 Message-Id: <1389993129-28180-7-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com From: Rik van Riel The tracepoint has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitude more memory than the threads that do all the active work. This resulted in the node with the garbage collector being marked the only active node in the group. This issue is avoided if we weigh the statistics by CPU use of each task in the numa group, instead of by how many faults each thread has incurred. To achieve this, we normalize the number of faults to the fraction of faults that occurred on each node, and then multiply that fraction by the fraction of CPU time the task has used since the last time task_numa_placement was invoked. This way the nodes in the active node mask will be the ones where the tasks from the numa group are most actively running, and the influence of e.g. the garbage collector and other do-little threads is properly minimized. 
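To see why this tames the garbage collector case, here is a small userspace sketch of the weight calculation. The formula (the 1024 scale factor and the +1 in the denominator) follows the patch below; the function name faults_from_weight, the use of uint64_t throughout, and the precondition that a node's faults do not exceed the total are assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Weight of one node's faults for one task:
 *   (faults on this node / total faults) * (runtime / period),
 * scaled by 1024 to keep precision in integer arithmetic.
 * total_faults is expected to already include a +1 from the caller,
 * and the +1 here keeps the divisor nonzero when period is 0.
 */
uint64_t faults_from_weight(uint64_t runtime, uint64_t period,
			    uint64_t node_faults, uint64_t total_faults)
{
	assert(node_faults <= total_faults);

	return (1024 * runtime * node_faults) / (total_faults * period + 1);
}
```

For example, a collector thread that ran for 10 out of 1000 time units and took 9000 faults on one node ends up with a weight of 10, while a worker that ran for 900 out of 1000 units and took only 50 of its 100 faults on a node gets 456 for that node.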
Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- include/linux/sched.h | 2 ++ kernel/sched/core.c | 2 ++ kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++-- 3 files changed, 50 insertions(+), 2 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 0af6c1a..52de567 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1471,6 +1471,8 @@ struct task_struct { int numa_preferred_nid; unsigned long numa_migrate_retry; u64 node_stamp; /* migration stamp */ + u64 last_task_numa_placement; + u64 last_sum_exec_runtime; struct callback_head numa_work; struct list_head numa_entry; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7f45fd5..9a0908a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->numa_work.next = &p->numa_work; p->numa_faults = NULL; p->numa_faults_buffer = NULL; + p->last_task_numa_placement = 0; + p->last_sum_exec_runtime = 0; INIT_LIST_HEAD(&p->numa_entry); p->numa_group = NULL; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8e0a53a..0d395a0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p, memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); } +/* + * Get the fraction of time the task has been running since the last + * NUMA placement cycle. The scheduler keeps similar statistics, but + * decays those on a 32ms period, which is orders of magnitude off + * from the dozens-of-seconds NUMA balancing period. Use the scheduler + * stats only if the task is so new there are no NUMA statistics yet. + */ +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) +{ + u64 runtime, delta, now; + /* Use the start of this time slice to avoid calculations. 
*/ + now = p->se.exec_start; + runtime = p->se.sum_exec_runtime; + + if (p->last_task_numa_placement) { + delta = runtime - p->last_sum_exec_runtime; + *period = now - p->last_task_numa_placement; + } else { + delta = p->se.avg.runnable_avg_sum; + *period = p->se.avg.runnable_avg_period; + } + + p->last_sum_exec_runtime = runtime; + p->last_task_numa_placement = now; + + return delta; +} + static void task_numa_placement(struct task_struct *p) { int seq, nid, max_nid = -1, max_group_nid = -1; unsigned long max_faults = 0, max_group_faults = 0; unsigned long fault_types[2] = { 0, 0 }; + unsigned long total_faults; + u64 runtime, period; spinlock_t *group_lock = NULL; seq = ACCESS_ONCE(p->mm->numa_scan_seq); @@ -1435,6 +1465,10 @@ static void task_numa_placement(struct task_struct *p) p->numa_scan_seq = seq; p->numa_scan_period_max = task_scan_max(p); + total_faults = p->numa_faults_locality[0] + + p->numa_faults_locality[1] + 1; + runtime = numa_get_avg_runtime(p, &period); + /* If the task is part of a group prevent parallel updates to group stats */ if (p->numa_group) { group_lock = &p->numa_group->lock; @@ -1447,7 +1481,7 @@ static void task_numa_placement(struct task_struct *p) int priv, i; for (priv = 0; priv < 2; priv++) { - long diff, f_diff; + long diff, f_diff, f_weight; i = task_faults_idx(nid, priv); diff = -p->numa_faults[i]; @@ -1459,8 +1493,18 @@ static void task_numa_placement(struct task_struct *p) fault_types[priv] += p->numa_faults_buffer[i]; p->numa_faults_buffer[i] = 0; + /* + * Normalize the faults_from, so all tasks in a group + * count according to CPU use, instead of by the raw + * number of faults. Tasks with little runtime have + * little over-all impact on throughput, and thus their + * faults are less important. 
+ */ + f_weight = (1024 * runtime * + p->numa_faults_from_buffer[i]) / + (total_faults * period + 1); p->numa_faults_from[i] >>= 1; - p->numa_faults_from[i] += p->numa_faults_from_buffer[i]; + p->numa_faults_from[i] += f_weight; p->numa_faults_from_buffer[i] = 0; faults += p->numa_faults[i]; -- 1.8.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f182.google.com (mail-qc0-f182.google.com [209.85.216.182]) by kanga.kvack.org (Postfix) with ESMTP id B912A6B0037 for ; Fri, 17 Jan 2014 16:21:30 -0500 (EST) Received: by mail-qc0-f182.google.com with SMTP id c9so4160574qcz.13 for ; Fri, 17 Jan 2014 13:21:30 -0800 (PST) Received: from shelob.surriel.com (shelob.surriel.com. [2002:4a5c:3b41:1:216:3eff:fe57:7f4]) by mx.google.com with ESMTPS id y1si3580862qal.40.2014.01.17.13.21.29 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 17 Jan 2014 13:21:29 -0800 (PST) From: riel@redhat.com Subject: [PATCH 7/7] numa,sched: do statistics calculation using local variables only Date: Fri, 17 Jan 2014 16:12:09 -0500 Message-Id: <1389993129-28180-8-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com From: Rik van Riel The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other tasks temporarily see halved statistics could lead to unwanted numa migrations. 
This can be avoided by doing all the math in local variables. This change also simplifies the code a little. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- kernel/sched/fair.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0d395a0..0f48382 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1484,12 +1484,9 @@ static void task_numa_placement(struct task_struct *p) long diff, f_diff, f_weight; i = task_faults_idx(nid, priv); - diff = -p->numa_faults[i]; - f_diff = -p->numa_faults_from[i]; /* Decay existing window, copy faults since last scan */ - p->numa_faults[i] >>= 1; - p->numa_faults[i] += p->numa_faults_buffer[i]; + diff = p->numa_faults_buffer[i] - p->numa_faults[i] / 2; fault_types[priv] += p->numa_faults_buffer[i]; p->numa_faults_buffer[i] = 0; @@ -1503,13 +1500,12 @@ static void task_numa_placement(struct task_struct *p) f_weight = (1024 * runtime * p->numa_faults_from_buffer[i]) / (total_faults * period + 1); - p->numa_faults_from[i] >>= 1; - p->numa_faults_from[i] += f_weight; + f_diff = f_weight - p->numa_faults_from[i] / 2; p->numa_faults_from_buffer[i] = 0; + p->numa_faults[i] += diff; + p->numa_faults_from[i] += f_diff; faults += p->numa_faults[i]; - diff += p->numa_faults[i]; - f_diff += p->numa_faults_from[i]; p->total_numa_faults += diff; if (p->numa_group) { /* safe because we can only change our own group */ -- 1.8.4.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qe0-f49.google.com (mail-qe0-f49.google.com [209.85.128.49]) by kanga.kvack.org (Postfix) with ESMTP id 6A2136B0031 for ; Fri, 17 Jan 2014 22:32:07 -0500 (EST) Received: by mail-qe0-f49.google.com with SMTP id w4so4718102qeb.36 for ; Fri, 17 Jan 2014 19:32:07 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id b11si14822868qen.65.2014.01.17.19.32.05 for ; Fri, 17 Jan 2014 19:32:06 -0800 (PST) Message-ID: <52D9F599.3040508@redhat.com> Date: Fri, 17 Jan 2014 22:31:37 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 7/7] numa,sched: do statistics calculation using local variables only References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-8-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-8-git-send-email-riel@redhat.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com, Joe Mario On 01/17/2014 04:12 PM, riel@redhat.com wrote: > From: Rik van Riel > > The current code in task_numa_placement calculates the difference > between the old and the new value, but also temporarily stores half > of the old value in the per-process variables. > > The NUMA balancing code looks at those per-process variables, and > having other tasks temporarily see halved statistics could lead to > unwanted numa migrations. This can be avoided by doing all the math > in local variables. > > This change also simplifies the code a little. I am seeing what looks like a performance improvement with this patch, so it is not just a theoretical bug. 
The improvement is small, as is to be expected with such a small race,
but with two 32-warehouse specjbb instances on a 4-node, 10-core/20-thread
per node system, I see the following change in performance, and reduced
numa page migrations.

Without the patch:
run 1: throughput 367660 367660, migrated 3112982
run 2: throughput 353821 355612, migrated 2881317
run 3: throughput 355027 355027, migrated 3358105
run 4: throughput 354366 354366, migrated 3466687
run 5: throughput 356186 356186, migrated 3152194
run 6: throughput 361431 361431, migrated 3336219
run 7: throughput 354704 354704, migrated 3345418
run 8: throughput 363770 363770, migrated 3642925
run 9: throughput 363380 363380, migrated 3192836
run 10: throughput 358440 358440, migrated 3354028
avg: throughput 358968, migrated 3284271

With the patch:
run 1: throughput 360580 360580, migrated 3169872
run 2: throughput 361303 361303, migrated 3220280
run 3: throughput 367692 367692, migrated 3096093
run 4: throughput 362320 362320, migrated 2981762
run 5: throughput 364201 364201, migrated 3089107
run 6: throughput 364561 364561, migrated 2892364
run 7: throughput 360771 360771, migrated 3086638
run 8: throughput 361530 361530, migrated 2933256
run 9: throughput 365841 365841, migrated 3356944
run 10: throughput 359188 359188, migrated 3394545
avg: throughput 362798, migrated 3122086

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f175.google.com (mail-ob0-f175.google.com [209.85.214.175]) by kanga.kvack.org (Postfix) with ESMTP id B44B56B0031 for ; Sat, 18 Jan 2014 17:06:05 -0500 (EST) Received: by mail-ob0-f175.google.com with SMTP id wn1so206673obc.20 for ; Sat, 18 Jan 2014 14:06:05 -0800 (PST) Received: from g4t0014.houston.hp.com (g4t0014.houston.hp.com.
[15.201.24.17]) by mx.google.com with ESMTPS id ds9si14092174obc.60.2014.01.18.14.06.03 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Sat, 18 Jan 2014 14:06:04 -0800 (PST) Message-ID: <52DAFAC7.7080307@hp.com> Date: Sat, 18 Jan 2014 14:05:59 -0800 From: Chegu Vinod MIME-Version: 1.0 Subject: Re: [PATCH v2 0/7] pseudo-interleaving for automatic NUMA balancing References: <1389993129-28180-1-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: riel@redhat.com, linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com On 1/17/2014 1:12 PM, riel@redhat.com wrote: > The current automatic NUMA balancing code base has issues with > workloads that do not fit on one NUMA load. Page migration is > slowed down, but memory distribution between the nodes where > the workload runs is essentially random, often resulting in a > suboptimal amount of memory bandwidth being available to the > workload. > > In order to maximize performance of workloads that do not fit in one NUMA > node, we want to satisfy the following criteria: > 1) keep private memory local to each thread > 2) avoid excessive NUMA migration of pages > 3) distribute shared memory across the active nodes, to > maximize memory bandwidth available to the workload > > This patch series identifies the NUMA nodes on which the workload > is actively running, and balances (somewhat lazily) the memory > between those nodes, satisfying the criteria above. > > As usual, the series has had some performance testing, but it > could always benefit from more testing, on other systems. 
> > Changes since v1: > - fix divide by zero found by Chegu Vinod > - improve comment, as suggested by Peter Zijlstra > - do stats calculations in task_numa_placement in local variables > > > Some performance numbers, with two 40-warehouse specjbb instances > on an 8 node system with 10 CPU cores per node, using a pre-cleanup > version of these patches, courtesy of Chegu Vinod: > > numactl manual pinning > spec1.txt: throughput = 755900.20 SPECjbb2005 bops > spec2.txt: throughput = 754914.40 SPECjbb2005 bops > > NO-pinning results (Automatic NUMA balancing, with patches) > spec1.txt: throughput = 706439.84 SPECjbb2005 bops > spec2.txt: throughput = 729347.75 SPECjbb2005 bops > > NO-pinning results (Automatic NUMA balancing, without patches) > spec1.txt: throughput = 667988.47 SPECjbb2005 bops > spec2.txt: throughput = 638220.45 SPECjbb2005 bops > > No Automatic NUMA and NO-pinning results > spec1.txt: throughput = 544120.97 SPECjbb2005 bops > spec2.txt: throughput = 453553.41 SPECjbb2005 bops > > > My own performance numbers are not as relevant, since I have been > running with a more hostile workload on purpose, and I have run > into a scheduler issue that caused the workload to run on only > two of the four NUMA nodes on my test system... > > . 
>

Acked-by: Chegu Vinod

----

Here are some results using the v2 version of the patches
on an 8-socket box using SPECjbb2005 as a workload:

I) Eight 1-socket wide instances (10 warehouse threads each):

                              Without patches    With patches
                              ---------------    ------------
a) numactl pinning results
spec1.txt: throughput =       270620.04          273675.10
spec2.txt: throughput =       274115.33          272845.17
spec3.txt: throughput =       277830.09          272057.33
spec4.txt: throughput =       270898.52          270670.54
spec5.txt: throughput =       270397.30          270906.82
spec6.txt: throughput =       270451.93          268217.55
spec7.txt: throughput =       269511.07          269354.46
spec8.txt: throughput =       269386.06          270540.00

b) Automatic NUMA balancing results
spec1.txt: throughput =       244333.41          248072.72
spec2.txt: throughput =       252166.99          251818.30
spec3.txt: throughput =       251365.58          258266.24
spec4.txt: throughput =       245247.91          256873.51
spec5.txt: throughput =       245579.68          247743.18
spec6.txt: throughput =       249767.38          256285.86
spec7.txt: throughput =       244570.64          255343.99
spec8.txt: throughput =       245703.60          254434.36

c) NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput =       132959.73          136957.12
spec2.txt: throughput =       127937.11          129326.23
spec3.txt: throughput =       130697.10          125772.11
spec4.txt: throughput =       134978.49          141607.58
spec5.txt: throughput =       127574.34          126748.18
spec6.txt: throughput =       138699.99          128597.95
spec7.txt: throughput =       133247.25          137344.57
spec8.txt: throughput =       124548.00          139040.98

------

II) Four 2-socket wide instances (20 warehouse threads each):

                              Without patches    With patches
                              ---------------    ------------
a) numactl pinning results
spec1.txt: throughput =       479931.16          472467.58
spec2.txt: throughput =       466652.15          466237.10
spec3.txt: throughput =       473591.51          466891.98
spec4.txt: throughput =       462346.62          466891.98

b) Automatic NUMA balancing results
spec1.txt: throughput =       383758.29          437489.99
spec2.txt: throughput =       370926.06          435692.97
spec3.txt: throughput =       368872.72          444615.08
spec4.txt: throughput =       404422.82          435236.20

c) NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput =       252752.12          231762.30
spec2.txt: throughput =       255391.51          253250.95
spec3.txt: throughput =       264764.00          263721.03
spec4.txt: throughput =       254833.39          242892.72

------

III) Two 4-socket wide instances (40 warehouse threads each):

                              Without patches    With patches
                              ---------------    ------------
a) numactl pinning results
spec1.txt: throughput =       771340.84          769039.53
spec2.txt: throughput =       762184.48          760745.65

b) Automatic NUMA balancing results
spec1.txt: throughput =       667182.98          720197.01
spec2.txt: throughput =       692564.11          739872.51

c) NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput =       457079.28          467199.30
spec2.txt: throughput =       479790.47          456279.07

-----

IV) One 8-socket wide instance (80 warehouse threads):

                              Without patches    With patches
                              ---------------    ------------
a) numactl pinning results
spec1.txt: throughput =       982113.03          985836.96

b) Automatic NUMA balancing results
spec1.txt: throughput =       755615.94          843632.09

c) NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput =       671583.26          661768.54

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-bk0-f41.google.com (mail-bk0-f41.google.com [209.85.214.41]) by kanga.kvack.org (Postfix) with ESMTP id D0BAF6B0035 for ; Mon, 20 Jan 2014 11:31:23 -0500 (EST) Received: by mail-bk0-f41.google.com with SMTP id na10so287143bkb.0 for ; Mon, 20 Jan 2014 08:31:23 -0800 (PST) Received: from merlin.infradead.org (merlin.infradead.org.
[2001:4978:20e::2]) by mx.google.com with ESMTPS id qw9si2050101bkb.1.2014.01.20.08.31.21 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 20 Jan 2014 08:31:21 -0800 (PST) Date: Mon, 20 Jan 2014 17:31:03 +0100 From: Peter Zijlstra Subject: Re: [PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics Message-ID: <20140120163103.GI31570@twins.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-4-git-send-email-riel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1389993129-28180-4-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: riel@redhat.com Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com On Fri, Jan 17, 2014 at 04:12:05PM -0500, riel@redhat.com wrote: > /* > + * Iterate over the nodes from which NUMA hinting faults were triggered, in > + * other words where the CPUs that incurred NUMA hinting faults are. The > + * bitmask is used to limit NUMA page migrations, and spread out memory > + * between the actively used nodes. To prevent flip-flopping, and excessive > + * page migrations, nodes are added when they cause over 40% of the maximum > + * number of faults, but only removed when they drop below 20%. 
> + */ > +static void update_numa_active_node_mask(struct task_struct *p) > +{ > + unsigned long faults, max_faults = 0; > + struct numa_group *numa_group = p->numa_group; > + int nid; > + > + for_each_online_node(nid) { > + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > + numa_group->faults_from[task_faults_idx(nid, 1)]; > + if (faults > max_faults) > + max_faults = faults; > + } > + > + for_each_online_node(nid) { > + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > + numa_group->faults_from[task_faults_idx(nid, 1)]; > + if (!node_isset(nid, numa_group->active_nodes)) { > + if (faults > max_faults * 4 / 10) > + node_set(nid, numa_group->active_nodes); > + } else if (faults < max_faults * 2 / 10) > + node_clear(nid, numa_group->active_nodes); > + } > +} Why not use 6/16 and 3/16 resp.? That avoids an actual division. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qe0-f47.google.com (mail-qe0-f47.google.com [209.85.128.47]) by kanga.kvack.org (Postfix) with ESMTP id 6873C6B0037 for ; Mon, 20 Jan 2014 11:52:28 -0500 (EST) Received: by mail-qe0-f47.google.com with SMTP id 5so6683361qeb.34 for ; Mon, 20 Jan 2014 08:52:28 -0800 (PST) Received: from merlin.infradead.org (merlin.infradead.org. 
[2001:4978:20e::2]) by mx.google.com with ESMTPS id 6si1038018qgy.136.2014.01.20.08.52.23 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 20 Jan 2014 08:52:24 -0800 (PST) Date: Mon, 20 Jan 2014 17:52:05 +0100 From: Peter Zijlstra Subject: Re: [PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes Message-ID: <20140120165205.GJ31570@twins.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-5-git-send-email-riel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1389993129-28180-5-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: riel@redhat.com Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com, Steven Rostedt On Fri, Jan 17, 2014 at 04:12:06PM -0500, riel@redhat.com wrote: > From: Rik van Riel > > Being able to see how the active nodemask changes over time, and why, > can be quite useful. > > Cc: Peter Zijlstra > Cc: Mel Gorman > Cc: Ingo Molnar > Cc: Chegu Vinod > Signed-off-by: Rik van Riel > --- > include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++ > kernel/sched/fair.c | 8 ++++++-- > 2 files changed, 40 insertions(+), 2 deletions(-) > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > index 67e1bbf..91726b6 100644 > --- a/include/trace/events/sched.h > +++ b/include/trace/events/sched.h > @@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa, > __entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid, > __entry->dst_cpu, __entry->dst_nid) > ); > + > +TRACE_EVENT(update_numa_active_nodes_mask, Please stick to the sched_ naming for these things. 
Ideally we'd rename the sysctls too :/ > +++ b/kernel/sched/fair.c > @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p) > faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > numa_group->faults_from[task_faults_idx(nid, 1)]; > if (!node_isset(nid, numa_group->active_nodes)) { > - if (faults > max_faults * 4 / 10) > + if (faults > max_faults * 4 / 10) { > + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults); While I think the tracepoint hookery is smart enough to avoid evaluating arguments when they're disabled, it might be best to simply pass: current and numa_group and do the dereference in fast_assign(). That said, this is the first and only numa tracepoint, I'm not sure why this qualifies and other metrics do not. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f173.google.com (mail-qc0-f173.google.com [209.85.216.173]) by kanga.kvack.org (Postfix) with ESMTP id BA6316B0037 for ; Mon, 20 Jan 2014 11:55:42 -0500 (EST) Received: by mail-qc0-f173.google.com with SMTP id i8so6008011qcq.4 for ; Mon, 20 Jan 2014 08:55:42 -0800 (PST) Received: from merlin.infradead.org (merlin.infradead.org. 
[2001:4978:20e::2]) by mx.google.com with ESMTPS id 6si1068956qgr.10.2014.01.20.08.55.41 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 20 Jan 2014 08:55:41 -0800 (PST) Date: Mon, 20 Jan 2014 17:55:23 +0100 From: Peter Zijlstra Subject: Re: [PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics Message-ID: <20140120165523.GK31570@twins.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-4-git-send-email-riel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1389993129-28180-4-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: riel@redhat.com Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com On Fri, Jan 17, 2014 at 04:12:05PM -0500, riel@redhat.com wrote: > /* > + * Iterate over the nodes from which NUMA hinting faults were triggered, in > + * other words where the CPUs that incurred NUMA hinting faults are. The > + * bitmask is used to limit NUMA page migrations, and spread out memory > + * between the actively used nodes. To prevent flip-flopping, and excessive > + * page migrations, nodes are added when they cause over 40% of the maximum > + * number of faults, but only removed when they drop below 20%. > + */ Maybe break the above into two paragraphs for added readability. Also, I think this might be a good spot to explain why you need the second fault metric -- that is, why can't we create the interleave mask from the existing memory location faults. 
> +static void update_numa_active_node_mask(struct task_struct *p) > +{ > + unsigned long faults, max_faults = 0; > + struct numa_group *numa_group = p->numa_group; > + int nid; > + > + for_each_online_node(nid) { > + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > + numa_group->faults_from[task_faults_idx(nid, 1)]; > + if (faults > max_faults) > + max_faults = faults; > + } > + > + for_each_online_node(nid) { > + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > + numa_group->faults_from[task_faults_idx(nid, 1)]; > + if (!node_isset(nid, numa_group->active_nodes)) { > + if (faults > max_faults * 4 / 10) > + node_set(nid, numa_group->active_nodes); > + } else if (faults < max_faults * 2 / 10) > + node_clear(nid, numa_group->active_nodes); > + } > +} -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yh0-f49.google.com (mail-yh0-f49.google.com [209.85.213.49]) by kanga.kvack.org (Postfix) with ESMTP id BABD46B0037 for ; Mon, 20 Jan 2014 11:58:05 -0500 (EST) Received: by mail-yh0-f49.google.com with SMTP id b6so2353364yha.22 for ; Mon, 20 Jan 2014 08:58:05 -0800 (PST) Received: from merlin.infradead.org (merlin.infradead.org. 
[2001:4978:20e::2]) by mx.google.com with ESMTPS id p8si1042413qeo.143.2014.01.20.08.58.04 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 20 Jan 2014 08:58:04 -0800 (PST) Date: Mon, 20 Jan 2014 17:57:47 +0100 From: Peter Zijlstra Subject: Re: [PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use Message-ID: <20140120165747.GL31570@twins.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-7-git-send-email-riel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1389993129-28180-7-git-send-email-riel@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: riel@redhat.com Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com On Fri, Jan 17, 2014 at 04:12:08PM -0500, riel@redhat.com wrote: > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 0af6c1a..52de567 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1471,6 +1471,8 @@ struct task_struct { > int numa_preferred_nid; > unsigned long numa_migrate_retry; > u64 node_stamp; /* migration stamp */ > + u64 last_task_numa_placement; > + u64 last_sum_exec_runtime; > struct callback_head numa_work; > > struct list_head numa_entry; > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 8e0a53a..0d395a0 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p, > memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); > } > > +/* > + * Get the fraction of time the task has been running since the last > + * NUMA placement cycle. The scheduler keeps similar statistics, but > + * decays those on a 32ms period, which is orders of magnitude off > + * from the dozens-of-seconds NUMA balancing period. 
Use the scheduler > + * stats only if the task is so new there are no NUMA statistics yet. > + */ > +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) > +{ > + u64 runtime, delta, now; > + /* Use the start of this time slice to avoid calculations. */ > + now = p->se.exec_start; > + runtime = p->se.sum_exec_runtime; > + > + if (p->last_task_numa_placement) { > + delta = runtime - p->last_sum_exec_runtime; > + *period = now - p->last_task_numa_placement; > + } else { > + delta = p->se.avg.runnable_avg_sum; > + *period = p->se.avg.runnable_avg_period; > + } > + > + p->last_sum_exec_runtime = runtime; > + p->last_task_numa_placement = now; > + > + return delta; > +} Have you tried what happens if you use p->se.avg.runnable_avg_sum / p->se.avg.runnable_avg_period instead? If that also works it avoids growing the datastructures and keeping of yet another set of runtime stats. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f52.google.com (mail-qa0-f52.google.com [209.85.216.52]) by kanga.kvack.org (Postfix) with ESMTP id 185006B0035 for ; Mon, 20 Jan 2014 13:51:41 -0500 (EST) Received: by mail-qa0-f52.google.com with SMTP id j15so5812115qaq.39 for ; Mon, 20 Jan 2014 10:51:40 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTP id y1si1252179qal.136.2014.01.20.10.51.39 for ; Mon, 20 Jan 2014 10:51:40 -0800 (PST) Message-ID: <52DD7016.9080708@redhat.com> Date: Mon, 20 Jan 2014 13:51:02 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-5-git-send-email-riel@redhat.com> <20140120165205.GJ31570@twins.programming.kicks-ass.net> In-Reply-To: <20140120165205.GJ31570@twins.programming.kicks-ass.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com, Steven Rostedt On 01/20/2014 11:52 AM, Peter Zijlstra wrote: > On Fri, Jan 17, 2014 at 04:12:06PM -0500, riel@redhat.com wrote: >> +++ b/kernel/sched/fair.c >> @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p) >> faults = numa_group->faults_from[task_faults_idx(nid, 0)] + >> numa_group->faults_from[task_faults_idx(nid, 1)]; >> if (!node_isset(nid, numa_group->active_nodes)) { >> - if (faults > max_faults * 4 / 10) >> + if (faults > max_faults * 4 / 10) { >> + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults); > > While I think the tracepoint hookery is smart enough to avoid evaluating > arguments when they're disabled, it might be best to simply pass: > current and numa_group and do the dereference in fast_assign(). > > That said, this is the first and only numa tracepoint, I'm not sure why > this qualifies and other metrics do not. It's there because I needed it in development. If you think it is not merge material, I would be comfortable leaving it out. 
-- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f173.google.com (mail-qc0-f173.google.com [209.85.216.173]) by kanga.kvack.org (Postfix) with ESMTP id 96B2F6B0035 for ; Mon, 20 Jan 2014 14:03:06 -0500 (EST) Received: by mail-qc0-f173.google.com with SMTP id i8so6129350qcq.4 for ; Mon, 20 Jan 2014 11:03:06 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id nh12si1291994qeb.4.2014.01.20.11.03.05 for ; Mon, 20 Jan 2014 11:03:05 -0800 (PST) Message-ID: <52DD72C8.2050602@redhat.com> Date: Mon, 20 Jan 2014 14:02:32 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-7-git-send-email-riel@redhat.com> <20140120165747.GL31570@twins.programming.kicks-ass.net> In-Reply-To: <20140120165747.GL31570@twins.programming.kicks-ass.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com On 01/20/2014 11:57 AM, Peter Zijlstra wrote: > On Fri, Jan 17, 2014 at 04:12:08PM -0500, riel@redhat.com wrote: >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index 0af6c1a..52de567 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -1471,6 +1471,8 @@ struct task_struct { >> int numa_preferred_nid; >> unsigned long numa_migrate_retry; >> u64 node_stamp; /* migration stamp */ >> + u64 last_task_numa_placement; >> + u64 last_sum_exec_runtime; >> struct callback_head numa_work; >> >> struct list_head numa_entry; > >> diff --git 
a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 8e0a53a..0d395a0 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p, >> memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); >> } >> >> +/* >> + * Get the fraction of time the task has been running since the last >> + * NUMA placement cycle. The scheduler keeps similar statistics, but >> + * decays those on a 32ms period, which is orders of magnitude off >> + * from the dozens-of-seconds NUMA balancing period. Use the scheduler >> + * stats only if the task is so new there are no NUMA statistics yet. >> + */ >> +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) >> +{ >> + u64 runtime, delta, now; >> + /* Use the start of this time slice to avoid calculations. */ >> + now = p->se.exec_start; >> + runtime = p->se.sum_exec_runtime; >> + >> + if (p->last_task_numa_placement) { >> + delta = runtime - p->last_sum_exec_runtime; >> + *period = now - p->last_task_numa_placement; >> + } else { >> + delta = p->se.avg.runnable_avg_sum; >> + *period = p->se.avg.runnable_avg_period; >> + } >> + >> + p->last_sum_exec_runtime = runtime; >> + p->last_task_numa_placement = now; >> + >> + return delta; >> +} > > Have you tried what happens if you use p->se.avg.runnable_avg_sum / > p->se.avg.runnable_avg_period instead? If that also works it avoids > growing the datastructures and keeping of yet another set of runtime > stats. That is what I started out with, and the results were not as stable as with this calculation. Having said that, I did that before I came up with patch 7/7, so maybe the effect would no longer be as pronounced any more as it was before... I can send in a simplified version, if you prefer. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f46.google.com (mail-qa0-f46.google.com [209.85.216.46]) by kanga.kvack.org (Postfix) with ESMTP id B0C2A6B0035 for ; Mon, 20 Jan 2014 14:05:54 -0500 (EST) Received: by mail-qa0-f46.google.com with SMTP id ii20so5865955qab.33 for ; Mon, 20 Jan 2014 11:05:54 -0800 (PST) Received: from cdptpa-oedge-vip.email.rr.com (cdptpa-outbound-snat.email.rr.com. [107.14.166.225]) by mx.google.com with ESMTP id f91si1290172qge.48.2014.01.20.11.05.53 for ; Mon, 20 Jan 2014 11:05:53 -0800 (PST) Date: Mon, 20 Jan 2014 14:05:51 -0500 From: Steven Rostedt Subject: Re: [PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes Message-ID: <20140120140551.3343ab2b@gandalf.local.home> In-Reply-To: <20140120165205.GJ31570@twins.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-5-git-send-email-riel@redhat.com> <20140120165205.GJ31570@twins.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: riel@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com On Mon, 20 Jan 2014 17:52:05 +0100 Peter Zijlstra wrote: > On Fri, Jan 17, 2014 at 04:12:06PM -0500, riel@redhat.com wrote: > > From: Rik van Riel > > > > +++ b/kernel/sched/fair.c > > @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p) > > faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > > numa_group->faults_from[task_faults_idx(nid, 1)]; > > if (!node_isset(nid, numa_group->active_nodes)) { > > - if (faults > max_faults * 4 / 10) > > + if (faults > max_faults * 4 / 10) { > > + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults); > > While I think the tracepoint hookery is smart 
enough to avoid evaluating > arguments when they're disabled, it might be best to simply pass: > current and numa_group and do the dereference in fast_assign(). It's really up to gcc to optimize it. But that said, it is more efficient to just pass the pointer and do the dereferencing in the fast_assign(). At least it keeps any bad optimization in gcc from infecting the tracepoint caller. It also makes it easier to get other information if you want to later extend that tracepoint. Does this tracepoint always use current? If so, why bother passing it in? -- Steve -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yk0-f172.google.com (mail-yk0-f172.google.com [209.85.160.172]) by kanga.kvack.org (Postfix) with ESMTP id 2DCA26B0035 for ; Mon, 20 Jan 2014 14:11:03 -0500 (EST) Received: by mail-yk0-f172.google.com with SMTP id 200so3514624ykr.3 for ; Mon, 20 Jan 2014 11:11:02 -0800 (PST) Received: from merlin.infradead.org (merlin.infradead.org. 
[2001:4978:20e::2]) by mx.google.com with ESMTPS id g5si2372538yhd.162.2014.01.20.11.10.57 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 20 Jan 2014 11:10:57 -0800 (PST) Date: Mon, 20 Jan 2014 20:10:51 +0100 From: Peter Zijlstra Subject: Re: [PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use Message-ID: <20140120191051.GQ11314@laptop.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-7-git-send-email-riel@redhat.com> <20140120165747.GL31570@twins.programming.kicks-ass.net> <52DD72C8.2050602@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <52DD72C8.2050602@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com On Mon, Jan 20, 2014 at 02:02:32PM -0500, Rik van Riel wrote: > That is what I started out with, and the results were not > as stable as with this calculation. > > Having said that, I did that before I came up with patch 7/7, > so maybe the effect would no longer be as pronounced any more > as it was before... > > I can send in a simplified version, if you prefer. If you could retry with 7/7, I don't mind adding the extra stats too much, but it would be nice if we can avoid it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qe0-f41.google.com (mail-qe0-f41.google.com [209.85.128.41]) by kanga.kvack.org (Postfix) with ESMTP id 1C2EB6B0035 for ; Mon, 20 Jan 2014 14:39:19 -0500 (EST) Received: by mail-qe0-f41.google.com with SMTP id gc15so3488080qeb.14 for ; Mon, 20 Jan 2014 11:39:18 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. 
[209.132.183.28]) by mx.google.com with ESMTP id o8si1337859qey.81.2014.01.20.11.39.17 for ; Mon, 20 Jan 2014 11:39:18 -0800 (PST) Message-ID: <52DD7129.3040106@redhat.com> Date: Mon, 20 Jan 2014 13:55:37 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-4-git-send-email-riel@redhat.com> <20140120163103.GI31570@twins.programming.kicks-ass.net> In-Reply-To: <20140120163103.GI31570@twins.programming.kicks-ass.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com On 01/20/2014 11:31 AM, Peter Zijlstra wrote: > On Fri, Jan 17, 2014 at 04:12:05PM -0500, riel@redhat.com wrote: >> /* >> + * Iterate over the nodes from which NUMA hinting faults were triggered, in >> + * other words where the CPUs that incurred NUMA hinting faults are. The >> + * bitmask is used to limit NUMA page migrations, and spread out memory >> + * between the actively used nodes. To prevent flip-flopping, and excessive >> + * page migrations, nodes are added when they cause over 40% of the maximum >> + * number of faults, but only removed when they drop below 20%. 
>> + */ >> +static void update_numa_active_node_mask(struct task_struct *p) >> +{ >> + unsigned long faults, max_faults = 0; >> + struct numa_group *numa_group = p->numa_group; >> + int nid; >> + >> + for_each_online_node(nid) { >> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + >> + numa_group->faults_from[task_faults_idx(nid, 1)]; >> + if (faults > max_faults) >> + max_faults = faults; >> + } >> + >> + for_each_online_node(nid) { >> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + >> + numa_group->faults_from[task_faults_idx(nid, 1)]; >> + if (!node_isset(nid, numa_group->active_nodes)) { >> + if (faults > max_faults * 4 / 10) >> + node_set(nid, numa_group->active_nodes); >> + } else if (faults < max_faults * 2 / 10) >> + node_clear(nid, numa_group->active_nodes); >> + } >> +} > > Why not use 6/16 and 3/16 resp.? That avoids an actual division. OK, will do. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752819AbaAQVOH (ORCPT ); Fri, 17 Jan 2014 16:14:07 -0500 Received: from shelob.surriel.com ([74.92.59.67]:59110 "EHLO shelob.surriel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751369AbaAQVOG (ORCPT ); Fri, 17 Jan 2014 16:14:06 -0500 From: riel@redhat.com To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com Subject: [PATCH v2 0/7] pseudo-interleaving for automatic NUMA balancing Date: Fri, 17 Jan 2014 16:12:02 -0500 Message-Id: <1389993129-28180-1-git-send-email-riel@redhat.com> X-Mailer: git-send-email 1.8.4.2 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The current automatic NUMA balancing code base has issues with workloads that do not fit on one NUMA node. Page migration is slowed down, but memory distribution between the nodes where the workload runs is essentially random, often resulting in a suboptimal amount of memory bandwidth being available to the workload. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration of pages 3) distribute shared memory across the active nodes, to maximize memory bandwidth available to the workload This patch series identifies the NUMA nodes on which the workload is actively running, and balances (somewhat lazily) the memory between those nodes, satisfying the criteria above. As usual, the series has had some performance testing, but it could always benefit from more testing, on other systems. 
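[Archive annotation, not part of the original message.] For readers who want a feel for how "actively running" nodes could be identified, here is a minimal userspace sketch of the hysteresis that patch 3/7 of this series describes: a node joins the active set when it causes over 40% of the maximum fault count and is only dropped when it falls below 20%. The function name update_active_nodes, the plain bool array standing in for the nodemask, and the flat per-node fault array are assumptions made for illustration; the kernel code operates on numa_group->faults_from and a nodemask_t instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace sketch only, not the kernel implementation.
 * faults[nid] stands in for the summed faults_from counters of one node;
 * active[nid] stands in for that node's bit in the active_nodes mask.
 */
static void update_active_nodes(const unsigned long *faults, bool *active,
				size_t nr_nodes)
{
	unsigned long max_faults = 0;
	size_t nid;

	assert(faults && active);

	/* First pass: find the busiest node. */
	for (nid = 0; nid < nr_nodes; nid++)
		if (faults[nid] > max_faults)
			max_faults = faults[nid];

	/* Second pass: apply the 40% join / 20% leave thresholds. */
	for (nid = 0; nid < nr_nodes; nid++) {
		if (!active[nid]) {
			if (faults[nid] > max_faults * 4 / 10)
				active[nid] = true;
		} else if (faults[nid] < max_faults * 2 / 10) {
			active[nid] = false;
		}
	}
}
```

The gap between the two thresholds is what keeps a node that hovers around 30% of the maximum from being added and removed on every placement cycle.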
Changes since v1: - fix divide by zero found by Chegu Vinod - improve comment, as suggested by Peter Zijlstra - do stats calculations in task_numa_placement in local variables Some performance numbers, with two 40-warehouse specjbb instances on an 8 node system with 10 CPU cores per node, using a pre-cleanup version of these patches, courtesy of Chegu Vinod: numactl manual pinning spec1.txt: throughput = 755900.20 SPECjbb2005 bops spec2.txt: throughput = 754914.40 SPECjbb2005 bops NO-pinning results (Automatic NUMA balancing, with patches) spec1.txt: throughput = 706439.84 SPECjbb2005 bops spec2.txt: throughput = 729347.75 SPECjbb2005 bops NO-pinning results (Automatic NUMA balancing, without patches) spec1.txt: throughput = 667988.47 SPECjbb2005 bops spec2.txt: throughput = 638220.45 SPECjbb2005 bops No Automatic NUMA and NO-pinning results spec1.txt: throughput = 544120.97 SPECjbb2005 bops spec2.txt: throughput = 453553.41 SPECjbb2005 bops My own performance numbers are not as relevant, since I have been running with a more hostile workload on purpose, and I have run into a scheduler issue that caused the workload to run on only two of the four NUMA nodes on my test system... 
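[Archive annotation, not part of the original message.] The SPECjbb2005 figures above are easier to compare as relative changes. The sketch below averages the two instances of each configuration and expresses one configuration relative to another; the helper names mean_bops and percent_change are made up for this illustration and appear nowhere in the series.

```c
#include <assert.h>

/* Mean throughput of the two SPECjbb2005 instances of one configuration. */
static double mean_bops(double spec1, double spec2)
{
	return (spec1 + spec2) / 2.0;
}

/* Change of throughput a relative to baseline b, in percent. */
static double percent_change(double a, double b)
{
	assert(b > 0.0);
	return (a - b) / b * 100.0;
}
```

Fed with the numbers quoted above, percent_change(mean_bops(706439.84, 729347.75), mean_bops(667988.47, 638220.45)) comes out just under 10%, and the same patched mean against the numactl-pinned mean (755900.20, 754914.40) comes out just under 5% below pinning.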
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753218AbaAQVO7 (ORCPT ); Fri, 17 Jan 2014 16:14:59 -0500 Received: from shelob.surriel.com ([74.92.59.67]:59115 "EHLO shelob.surriel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751484AbaAQVOz (ORCPT ); Fri, 17 Jan 2014 16:14:55 -0500 From: riel@redhat.com To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com Subject: [PATCH 1/7] numa,sched,mm: remove p->numa_migrate_deferred Date: Fri, 17 Jan 2014 16:12:03 -0500 Message-Id: <1389993129-28180-2-git-send-email-riel@redhat.com> X-Mailer: git-send-email 1.8.4.2 In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now that the second stage of the automatic numa balancing code has stabilized, it is time to replace the simplistic migration deferral code with something smarter. 
Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- include/linux/sched.h | 1 - kernel/sched/fair.c | 8 -------- kernel/sysctl.c | 7 ------- mm/mempolicy.c | 45 --------------------------------------------- 4 files changed, 61 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 68a0e84..97efba4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1469,7 +1469,6 @@ struct task_struct { unsigned int numa_scan_period; unsigned int numa_scan_period_max; int numa_preferred_nid; - int numa_migrate_deferred; unsigned long numa_migrate_retry; u64 node_stamp; /* migration stamp */ struct callback_head numa_work; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 867b0a4..41e2176 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -819,14 +819,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256; /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ unsigned int sysctl_numa_balancing_scan_delay = 1000; -/* - * After skipping a page migration on a shared page, skip N more numa page - * migrations unconditionally. This reduces the number of NUMA migrations - * in shared memory workloads, and has the effect of pulling tasks towards - * where their memory lives, over pulling the memory towards the task. 
- */ -unsigned int sysctl_numa_balancing_migrate_deferred = 16; - static unsigned int task_nr_scan_windows(struct task_struct *p) { unsigned long rss = 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 096db74..4d19492 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -384,13 +384,6 @@ static struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, { - .procname = "numa_balancing_migrate_deferred", - .data = &sysctl_numa_balancing_migrate_deferred, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = proc_dointvec, - }, - { .procname = "numa_balancing", .data = NULL, /* filled in by handler */ .maxlen = sizeof(unsigned int), diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 36cb46c..052abac 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2301,35 +2301,6 @@ static void sp_free(struct sp_node *n) kmem_cache_free(sn_cache, n); } -#ifdef CONFIG_NUMA_BALANCING -static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid) -{ - /* Never defer a private fault */ - if (cpupid_match_pid(p, last_cpupid)) - return false; - - if (p->numa_migrate_deferred) { - p->numa_migrate_deferred--; - return true; - } - return false; -} - -static inline void defer_numa_migrate(struct task_struct *p) -{ - p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred; -} -#else -static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid) -{ - return false; -} - -static inline void defer_numa_migrate(struct task_struct *p) -{ -} -#endif /* CONFIG_NUMA_BALANCING */ - /** * mpol_misplaced - check whether current page node is valid in policy * @@ -2432,24 +2403,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long */ last_cpupid = page_cpupid_xchg_last(page, this_cpupid); if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) { - - /* See sysctl_numa_balancing_migrate_deferred comment */ - if (!cpupid_match_pid(current, last_cpupid)) - defer_numa_migrate(current); - 
goto out; } - - /* - * The quadratic filter above reduces extraneous migration - * of shared pages somewhat. This code reduces it even more, - * reducing the overhead of page migrations of shared pages. - * This makes workloads with shared pages rely more on - * "move task near its memory", and less on "move memory - * towards its task", which is exactly what we want. - */ - if (numa_migrate_deferred(current, last_cpupid)) - goto out; } if (curnid != polnid) -- 1.8.4.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753395AbaAQVQE (ORCPT ); Fri, 17 Jan 2014 16:16:04 -0500 Received: from shelob.surriel.com ([74.92.59.67]:59119 "EHLO shelob.surriel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751276AbaAQVQB (ORCPT ); Fri, 17 Jan 2014 16:16:01 -0500 From: riel@redhat.com To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com Subject: [PATCH 2/7] numa,sched: track from which nodes NUMA faults are triggered Date: Fri, 17 Jan 2014 16:12:04 -0500 Message-Id: <1389993129-28180-3-git-send-email-riel@redhat.com> X-Mailer: git-send-email 1.8.4.2 In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes a workload is actively running on. 
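[Archive annotation, not part of the original message.] The per-task bookkeeping in this patch shares one allocation between four arrays, which is easy to misread in diff form. The sketch below works out the offsets the patch assigns (numa_faults at 0, numa_faults_from at 2 * nr_node_ids, numa_faults_buffer at 4 * nr_node_ids, numa_faults_from_buffer at 6 * nr_node_ids). The struct fault_layout type and both helper names are hypothetical, and the index formula 2 * nid + priv is an assumption about task_faults_idx(), whose body is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed index helper: two counters per node (priv is 0 or 1). */
static size_t faults_idx(size_t nid, int priv)
{
	assert(priv == 0 || priv == 1);
	return 2 * nid + (size_t)priv;
}

/* Offsets, in unsigned long entries, into the shared allocation. */
struct fault_layout {
	size_t faults;       /* where the memory lives */
	size_t faults_from;  /* where the faulting CPU was */
	size_t buffer;       /* staging area for faults */
	size_t from_buffer;  /* staging area for faults_from */
	size_t total;        /* entries in the whole allocation */
};

static struct fault_layout compute_fault_layout(size_t nr_node_ids)
{
	struct fault_layout l = {
		.faults = 0,
		.faults_from = 2 * nr_node_ids,
		.buffer = 4 * nr_node_ids,
		.from_buffer = 6 * nr_node_ids,
		.total = 8 * nr_node_ids,
	};
	return l;
}
```

On a 4-node system this gives 32 entries in total, with the last valid slot, from_buffer + faults_idx(3, 1), at index 31.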
Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- include/linux/sched.h | 10 ++++++++-- kernel/sched/fair.c | 30 +++++++++++++++++++++++------- 2 files changed, 31 insertions(+), 9 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 97efba4..a9f7f05 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1492,6 +1492,14 @@ struct task_struct { unsigned long *numa_faults_buffer; /* + * Track the nodes where faults are incurred. This is not very + * interesting on a per-task basis, but it help with smarter + * numa memory placement for groups of processes. + */ + unsigned long *numa_faults_from; + unsigned long *numa_faults_from_buffer; + + /* * numa_faults_locality tracks if faults recorded during the last * scan window were remote/local. The task scan period is adapted * based on the locality of the faults with different weights @@ -1594,8 +1602,6 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags); extern pid_t task_numa_group_id(struct task_struct *p); extern void set_numabalancing_state(bool enabled); extern void task_numa_free(struct task_struct *p); - -extern unsigned int sysctl_numa_balancing_migrate_deferred; #else static inline void task_numa_fault(int last_node, int node, int pages, int flags) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 41e2176..1945ddc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -886,6 +886,7 @@ struct numa_group { struct rcu_head rcu; unsigned long total_faults; + unsigned long *faults_from; unsigned long faults[0]; }; @@ -1372,10 +1373,11 @@ static void task_numa_placement(struct task_struct *p) int priv, i; for (priv = 0; priv < 2; priv++) { - long diff; + long diff, f_diff; i = task_faults_idx(nid, priv); diff = -p->numa_faults[i]; + f_diff = -p->numa_faults_from[i]; /* Decay existing window, copy faults since last scan */ p->numa_faults[i] >>= 1; @@ -1383,12 +1385,18 @@ static void 
task_numa_placement(struct task_struct *p) fault_types[priv] += p->numa_faults_buffer[i]; p->numa_faults_buffer[i] = 0; + p->numa_faults_from[i] >>= 1; + p->numa_faults_from[i] += p->numa_faults_from_buffer[i]; + p->numa_faults_from_buffer[i] = 0; + faults += p->numa_faults[i]; diff += p->numa_faults[i]; + f_diff += p->numa_faults_from[i]; p->total_numa_faults += diff; if (p->numa_group) { /* safe because we can only change our own group */ p->numa_group->faults[i] += diff; + p->numa_group->faults_from[i] += f_diff; p->numa_group->total_faults += diff; group_faults += p->numa_group->faults[i]; } @@ -1457,7 +1465,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, if (unlikely(!p->numa_group)) { unsigned int size = sizeof(struct numa_group) + - 2*nr_node_ids*sizeof(unsigned long); + 4*nr_node_ids*sizeof(unsigned long); grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN); if (!grp) @@ -1467,8 +1475,10 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, spin_lock_init(&grp->lock); INIT_LIST_HEAD(&grp->task_list); grp->gid = p->pid; + /* Second half of the array tracks where faults come from */ + grp->faults_from = grp->faults + 2 * nr_node_ids; - for (i = 0; i < 2*nr_node_ids; i++) + for (i = 0; i < 4*nr_node_ids; i++) grp->faults[i] = p->numa_faults[i]; grp->total_faults = p->total_numa_faults; @@ -1526,7 +1536,7 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, double_lock(&my_grp->lock, &grp->lock); - for (i = 0; i < 2*nr_node_ids; i++) { + for (i = 0; i < 4*nr_node_ids; i++) { my_grp->faults[i] -= p->numa_faults[i]; grp->faults[i] += p->numa_faults[i]; } @@ -1558,7 +1568,7 @@ void task_numa_free(struct task_struct *p) if (grp) { spin_lock(&grp->lock); - for (i = 0; i < 2*nr_node_ids; i++) + for (i = 0; i < 4*nr_node_ids; i++) grp->faults[i] -= p->numa_faults[i]; grp->total_faults -= p->total_numa_faults; @@ -1571,6 +1581,8 @@ void task_numa_free(struct task_struct *p) p->numa_faults = NULL; 
p->numa_faults_buffer = NULL; + p->numa_faults_from = NULL; + p->numa_faults_from_buffer = NULL; kfree(numa_faults); } @@ -1581,6 +1593,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags) { struct task_struct *p = current; bool migrated = flags & TNF_MIGRATED; + int this_node = task_node(current); int priv; if (!numabalancing_enabled) @@ -1596,7 +1609,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags) /* Allocate buffer to track faults on a per-node basis */ if (unlikely(!p->numa_faults)) { - int size = sizeof(*p->numa_faults) * 2 * nr_node_ids; + int size = sizeof(*p->numa_faults) * 4 * nr_node_ids; /* numa_faults and numa_faults_buffer share the allocation */ p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN); @@ -1604,7 +1617,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags) return; BUG_ON(p->numa_faults_buffer); - p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids); + p->numa_faults_from = p->numa_faults + (2 * nr_node_ids); + p->numa_faults_buffer = p->numa_faults + (4 * nr_node_ids); + p->numa_faults_from_buffer = p->numa_faults + (6 * nr_node_ids); p->total_numa_faults = 0; memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); } @@ -1634,6 +1649,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags) p->numa_pages_migrated += pages; p->numa_faults_buffer[task_faults_idx(node, priv)] += pages; + p->numa_faults_from_buffer[task_faults_idx(this_node, priv)] += pages; p->numa_faults_locality[!!(flags & TNF_FAULT_LOCAL)] += pages; } -- 1.8.4.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753458AbaAQVRJ (ORCPT ); Fri, 17 Jan 2014 16:17:09 -0500 Received: from shelob.surriel.com ([74.92.59.67]:59125 "EHLO shelob.surriel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751892AbaAQVRH (ORCPT ); Fri, 17 Jan 2014 16:17:07 -0500 From: riel@redhat.com 
To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com Subject: [PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics Date: Fri, 17 Jan 2014 16:12:05 -0500 Message-Id: <1389993129-28180-4-git-send-email-riel@redhat.com> X-Mailer: git-send-email 1.8.4.2 In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel The faults_from statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1945ddc..aa680e2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -885,6 +885,7 @@ struct numa_group { struct list_head task_list; struct rcu_head rcu; + nodemask_t active_nodes; unsigned long total_faults; unsigned long *faults_from; unsigned long faults[0]; @@ -1275,6 +1276,38 @@ static void numa_migrate_preferred(struct task_struct *p) } /* + * Iterate over the nodes from which NUMA hinting faults were triggered, in + * other words where the CPUs that incurred NUMA hinting faults are. The + * bitmask is used to limit NUMA page migrations, and spread out memory + * between the actively used nodes. To prevent flip-flopping, and excessive + * page migrations, nodes are added when they cause over 40% of the maximum + * number of faults, but only removed when they drop below 20%. 
+ */ +static void update_numa_active_node_mask(struct task_struct *p) +{ + unsigned long faults, max_faults = 0; + struct numa_group *numa_group = p->numa_group; + int nid; + + for_each_online_node(nid) { + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + + numa_group->faults_from[task_faults_idx(nid, 1)]; + if (faults > max_faults) + max_faults = faults; + } + + for_each_online_node(nid) { + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + + numa_group->faults_from[task_faults_idx(nid, 1)]; + if (!node_isset(nid, numa_group->active_nodes)) { + if (faults > max_faults * 4 / 10) + node_set(nid, numa_group->active_nodes); + } else if (faults < max_faults * 2 / 10) + node_clear(nid, numa_group->active_nodes); + } +} + +/* * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS * increments. The more local the fault statistics are, the higher the scan * period will be for the next scan window. If local/remote ratio is below @@ -1416,6 +1449,7 @@ static void task_numa_placement(struct task_struct *p) update_task_scan_period(p, fault_types[0], fault_types[1]); if (p->numa_group) { + update_numa_active_node_mask(p); /* * If the preferred task and group nids are different, * iterate over the nodes again to find the best place. 
@@ -1478,6 +1512,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, /* Second half of the array tracks where faults come from */ grp->faults_from = grp->faults + 2 * nr_node_ids; + node_set(task_node(current), grp->active_nodes); + for (i = 0; i < 4*nr_node_ids; i++) grp->faults[i] = p->numa_faults[i]; @@ -1547,6 +1583,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags, my_grp->nr_tasks--; grp->nr_tasks++; + update_numa_active_node_mask(p); + spin_unlock(&my_grp->lock); spin_unlock(&grp->lock); -- 1.8.4.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753465AbaAQVSP (ORCPT ); Fri, 17 Jan 2014 16:18:15 -0500 Received: from shelob.surriel.com ([74.92.59.67]:59143 "EHLO shelob.surriel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751812AbaAQVSM (ORCPT ); Fri, 17 Jan 2014 16:18:12 -0500 From: riel@redhat.com To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com Subject: [PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes Date: Fri, 17 Jan 2014 16:12:06 -0500 Message-Id: <1389993129-28180-5-git-send-email-riel@redhat.com> X-Mailer: git-send-email 1.8.4.2 In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel Being able to see how the active nodemask changes over time, and why, can be quite useful. 
Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 8 ++++++-- 2 files changed, 40 insertions(+), 2 deletions(-) diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 67e1bbf..91726b6 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa, __entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid, __entry->dst_cpu, __entry->dst_nid) ); + +TRACE_EVENT(update_numa_active_nodes_mask, + + TP_PROTO(int pid, int gid, int nid, int set, long faults, long max_faults), + + TP_ARGS(pid, gid, nid, set, faults, max_faults), + + TP_STRUCT__entry( + __field( pid_t, pid) + __field( pid_t, gid) + __field( int, nid) + __field( int, set) + __field( long, faults) + __field( long, max_faults); + ), + + TP_fast_assign( + __entry->pid = pid; + __entry->gid = gid; + __entry->nid = nid; + __entry->set = set; + __entry->faults = faults; + __entry->max_faults = max_faults; + ), + + TP_printk("pid=%d gid=%d nid=%d set=%d faults=%ld max_faults=%ld", + __entry->pid, + __entry->gid, + __entry->nid, + __entry->set, + __entry->faults, + __entry->max_faults) + +); #endif /* _TRACE_SCHED_H */ /* This part must be outside protection */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index aa680e2..3551009 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p) faults = numa_group->faults_from[task_faults_idx(nid, 0)] + numa_group->faults_from[task_faults_idx(nid, 1)]; if (!node_isset(nid, numa_group->active_nodes)) { - if (faults > max_faults * 4 / 10) + if (faults > max_faults * 4 / 10) { + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults); node_set(nid, numa_group->active_nodes); - } else if (faults < max_faults * 2 / 10) + } + } 
else if (faults < max_faults * 2 / 10) { + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, false, faults, max_faults); node_clear(nid, numa_group->active_nodes); + } } } -- 1.8.4.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753483AbaAQVTW (ORCPT ); Fri, 17 Jan 2014 16:19:22 -0500 Received: from shelob.surriel.com ([74.92.59.67]:59149 "EHLO shelob.surriel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751369AbaAQVTT (ORCPT ); Fri, 17 Jan 2014 16:19:19 -0500 From: riel@redhat.com To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com Subject: [PATCH 5/7] numa,sched,mm: use active_nodes nodemask to limit numa migrations Date: Fri, 17 Jan 2014 16:12:07 -0500 Message-Id: <1389993129-28180-6-git-send-email-riel@redhat.com> X-Mailer: git-send-email 1.8.4.2 In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel Use the active_nodes nodemask to make smarter decisions on NUMA migrations. 
In order to maximize performance of workloads that do not fit in one
NUMA node, we want to satisfy the following criteria:
1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to
   maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for
NUMA migrations:
1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes
   for the numa_group
3) always migrate from a node outside of the set of active nodes,
   to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate
   from a node with more NUMA page faults, to a node with fewer
   NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively
used nodes, with reduced ping-ponging of pages between those nodes.

Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: Chegu Vinod
Signed-off-by: Rik van Riel
---
 include/linux/sched.h |  7 +++++++
 kernel/sched/fair.c   | 37 +++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c        |  3 +++
 3 files changed, 47 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9f7f05..0af6c1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1602,6 +1602,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags);
 extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
+extern bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+				int src_nid, int dst_nid);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -1617,6 +1619,11 @@ static inline void set_numabalancing_state(bool enabled)
 static inline void task_numa_free(struct task_struct *p)
 {
 }
+static inline bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+				       int src_nid, int dst_nid)
+{
+	return true;
+}
 #endif
 
 static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3551009..8e0a53a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -948,6 +948,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
 	return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
 }
 
+bool should_numa_migrate(struct task_struct *p, int last_cpupid,
+			 int src_nid, int dst_nid)
+{
+	struct numa_group *ng = p->numa_group;
+
+	/* Always allow migrate on private faults */
+	if (cpupid_match_pid(p, last_cpupid))
+		return true;
+
+	/* A shared fault, but p->numa_group has not been set up yet. */
+	if (!ng)
+		return true;
+
+	/*
+	 * Do not migrate if the destination is not a node that
+	 * is actively used by this numa group.
+	 */
+	if (!node_isset(dst_nid, ng->active_nodes))
+		return false;
+
+	/*
+	 * Source is a node that is not actively used by this
+	 * numa group, while the destination is. Migrate.
+	 */
+	if (!node_isset(src_nid, ng->active_nodes))
+		return true;
+
+	/*
+	 * Both source and destination are nodes in active
+	 * use by this numa group. Maximize memory bandwidth
+	 * by migrating from more heavily used groups, to less
+	 * heavily used ones, spreading the load around.
+	 * Use a 1/4 hysteresis to avoid spurious page movement.
+	 */
+	return group_faults(p, dst_nid) < (group_faults(p, src_nid) * 3 / 4);
+}
+
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 052abac..050962b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,6 +2405,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
 			goto out;
 		}
+
+		if (!should_numa_migrate(current, last_cpupid, curnid, polnid))
+			goto out;
 	}
 
 	if (curnid != polnid)
-- 
1.8.4.2

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753526AbaAQVUZ (ORCPT ); Fri, 17 Jan 2014 16:20:25 -0500
Received: from shelob.surriel.com ([74.92.59.67]:59163 "EHLO shelob.surriel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753086AbaAQVUX (ORCPT ); Fri, 17 Jan 2014 16:20:23 -0500
From: riel@redhat.com
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com
Subject: [PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use
Date: Fri, 17 Jan 2014 16:12:08 -0500
Message-Id: <1389993129-28180-7-git-send-email-riel@redhat.com>
X-Mailer: git-send-email 1.8.4.2
In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com>
References: <1389993129-28180-1-git-send-email-riel@redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Rik van Riel

The tracepoint has made it abundantly clear that the naive
implementation of the faults_from code has issues.

Specifically, the garbage collector in some workloads will
access orders of magnitude more memory than the threads
that do all the active work.
This resulted in the node with the garbage collector being
marked the only active node in the group.

This issue is avoided if we weigh the statistics by the CPU use
of each task in the numa group, instead of by how many
faults each thread has incurred.

To achieve this, we normalize the number of faults to the
fraction of faults that occurred on each node, and then
multiply that fraction by the fraction of CPU time the
task has used since the last time task_numa_placement was
invoked.

This way the nodes in the active node mask will be the ones
where the tasks from the numa group are most actively running,
and the influence of e.g. the garbage collector and other
do-little threads is properly minimized.

Cc: Peter Zijlstra
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: Chegu Vinod
Signed-off-by: Rik van Riel
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0af6c1a..52de567 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,8 @@ struct task_struct {
 	int numa_preferred_nid;
 	unsigned long numa_migrate_retry;
 	u64 node_stamp;			/* migration stamp */
+	u64 last_task_numa_placement;
+	u64 last_sum_exec_runtime;
 	struct callback_head numa_work;
 
 	struct list_head numa_entry;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f45fd5..9a0908a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1758,6 +1758,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->numa_work.next = &p->numa_work;
 	p->numa_faults = NULL;
 	p->numa_faults_buffer = NULL;
+	p->last_task_numa_placement = 0;
+	p->last_sum_exec_runtime = 0;
 
 	INIT_LIST_HEAD(&p->numa_entry);
 	p->numa_group = NULL;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e0a53a..0d395a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1422,11 +1422,41 @@
static void update_task_scan_period(struct task_struct *p, memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); } +/* + * Get the fraction of time the task has been running since the last + * NUMA placement cycle. The scheduler keeps similar statistics, but + * decays those on a 32ms period, which is orders of magnitude off + * from the dozens-of-seconds NUMA balancing period. Use the scheduler + * stats only if the task is so new there are no NUMA statistics yet. + */ +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) +{ + u64 runtime, delta, now; + /* Use the start of this time slice to avoid calculations. */ + now = p->se.exec_start; + runtime = p->se.sum_exec_runtime; + + if (p->last_task_numa_placement) { + delta = runtime - p->last_sum_exec_runtime; + *period = now - p->last_task_numa_placement; + } else { + delta = p->se.avg.runnable_avg_sum; + *period = p->se.avg.runnable_avg_period; + } + + p->last_sum_exec_runtime = runtime; + p->last_task_numa_placement = now; + + return delta; +} + static void task_numa_placement(struct task_struct *p) { int seq, nid, max_nid = -1, max_group_nid = -1; unsigned long max_faults = 0, max_group_faults = 0; unsigned long fault_types[2] = { 0, 0 }; + unsigned long total_faults; + u64 runtime, period; spinlock_t *group_lock = NULL; seq = ACCESS_ONCE(p->mm->numa_scan_seq); @@ -1435,6 +1465,10 @@ static void task_numa_placement(struct task_struct *p) p->numa_scan_seq = seq; p->numa_scan_period_max = task_scan_max(p); + total_faults = p->numa_faults_locality[0] + + p->numa_faults_locality[1] + 1; + runtime = numa_get_avg_runtime(p, &period); + /* If the task is part of a group prevent parallel updates to group stats */ if (p->numa_group) { group_lock = &p->numa_group->lock; @@ -1447,7 +1481,7 @@ static void task_numa_placement(struct task_struct *p) int priv, i; for (priv = 0; priv < 2; priv++) { - long diff, f_diff; + long diff, f_diff, f_weight; i = task_faults_idx(nid, priv); diff = 
-p->numa_faults[i]; @@ -1459,8 +1493,18 @@ static void task_numa_placement(struct task_struct *p) fault_types[priv] += p->numa_faults_buffer[i]; p->numa_faults_buffer[i] = 0; + /* + * Normalize the faults_from, so all tasks in a group + * count according to CPU use, instead of by the raw + * number of faults. Tasks with little runtime have + * little over-all impact on throughput, and thus their + * faults are less important. + */ + f_weight = (1024 * runtime * + p->numa_faults_from_buffer[i]) / + (total_faults * period + 1); p->numa_faults_from[i] >>= 1; - p->numa_faults_from[i] += p->numa_faults_from_buffer[i]; + p->numa_faults_from[i] += f_weight; p->numa_faults_from_buffer[i] = 0; faults += p->numa_faults[i]; -- 1.8.4.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753608AbaAQVVc (ORCPT ); Fri, 17 Jan 2014 16:21:32 -0500 Received: from shelob.surriel.com ([74.92.59.67]:59184 "EHLO shelob.surriel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751276AbaAQVV3 (ORCPT ); Fri, 17 Jan 2014 16:21:29 -0500 From: riel@redhat.com To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com Subject: [PATCH 7/7] numa,sched: do statistics calculation using local variables only Date: Fri, 17 Jan 2014 16:12:09 -0500 Message-Id: <1389993129-28180-8-git-send-email-riel@redhat.com> X-Mailer: git-send-email 1.8.4.2 In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> References: <1389993129-28180-1-git-send-email-riel@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rik van Riel The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. 
The NUMA balancing code looks at those per-process variables, and having other tasks temporarily see halved statistics could lead to unwanted numa migrations. This can be avoided by doing all the math in local variables. This change also simplifies the code a little. Cc: Peter Zijlstra Cc: Mel Gorman Cc: Ingo Molnar Cc: Chegu Vinod Signed-off-by: Rik van Riel --- kernel/sched/fair.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0d395a0..0f48382 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1484,12 +1484,9 @@ static void task_numa_placement(struct task_struct *p) long diff, f_diff, f_weight; i = task_faults_idx(nid, priv); - diff = -p->numa_faults[i]; - f_diff = -p->numa_faults_from[i]; /* Decay existing window, copy faults since last scan */ - p->numa_faults[i] >>= 1; - p->numa_faults[i] += p->numa_faults_buffer[i]; + diff = p->numa_faults_buffer[i] - p->numa_faults[i] / 2; fault_types[priv] += p->numa_faults_buffer[i]; p->numa_faults_buffer[i] = 0; @@ -1503,13 +1500,12 @@ static void task_numa_placement(struct task_struct *p) f_weight = (1024 * runtime * p->numa_faults_from_buffer[i]) / (total_faults * period + 1); - p->numa_faults_from[i] >>= 1; - p->numa_faults_from[i] += f_weight; + f_diff = f_weight - p->numa_faults_from[i] / 2; p->numa_faults_from_buffer[i] = 0; + p->numa_faults[i] += diff; + p->numa_faults_from[i] += f_diff; faults += p->numa_faults[i]; - diff += p->numa_faults[i]; - f_diff += p->numa_faults_from[i]; p->total_numa_faults += diff; if (p->numa_group) { /* safe because we can only change our own group */ -- 1.8.4.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753363AbaARDcK (ORCPT ); Fri, 17 Jan 2014 22:32:10 -0500 Received: from mx1.redhat.com ([209.132.183.28]:29102 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752100AbaARDcJ 
(ORCPT ); Fri, 17 Jan 2014 22:32:09 -0500 Message-ID: <52D9F599.3040508@redhat.com> Date: Fri, 17 Jan 2014 22:31:37 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: linux-kernel@vger.kernel.org CC: linux-mm@kvack.org, chegu_vinod@hp.com, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com, Joe Mario Subject: Re: [PATCH 7/7] numa,sched: do statistics calculation using local variables only References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-8-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-8-git-send-email-riel@redhat.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/17/2014 04:12 PM, riel@redhat.com wrote: > From: Rik van Riel > > The current code in task_numa_placement calculates the difference > between the old and the new value, but also temporarily stores half > of the old value in the per-process variables. > > The NUMA balancing code looks at those per-process variables, and > having other tasks temporarily see halved statistics could lead to > unwanted numa migrations. This can be avoided by doing all the math > in local variables. > > This change also simplifies the code a little. I am seeing what looks like a performance improvement with this patch, so it is not just a theoretical bug. The improvement is small, as is to be expected with such a small race, but with two 32-warehouse specjbb instances on a 4-node, 10core/20thread per node system, I see the following change in performance, and reduced numa page migrations. 
Without the patch:
run 1: throughput 367660 367660, migrated 3112982
run 2: throughput 353821 355612, migrated 2881317
run 3: throughput 355027 355027, migrated 3358105
run 4: throughput 354366 354366, migrated 3466687
run 5: throughput 356186 356186, migrated 3152194
run 6: throughput 361431 361431, migrated 3336219
run 7: throughput 354704 354704, migrated 3345418
run 8: throughput 363770 363770, migrated 3642925
run 9: throughput 363380 363380, migrated 3192836
run 10: throughput 358440 358440, migrated 3354028
avg: throughput 358968, migrated 3284271

With the patch:
run 1: throughput 360580 360580, migrated 3169872
run 2: throughput 361303 361303, migrated 3220280
run 3: throughput 367692 367692, migrated 3096093
run 4: throughput 362320 362320, migrated 2981762
run 5: throughput 364201 364201, migrated 3089107
run 6: throughput 364561 364561, migrated 2892364
run 7: throughput 360771 360771, migrated 3086638
run 8: throughput 361530 361530, migrated 2933256
run 9: throughput 365841 365841, migrated 3356944
run 10: throughput 359188 359188, migrated 3394545
avg: throughput 362798, migrated 3122086

-- 
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751647AbaARWGJ (ORCPT ); Sat, 18 Jan 2014 17:06:09 -0500
Received: from g4t0014.houston.hp.com ([15.201.24.17]:42722 "EHLO g4t0014.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751436AbaARWGE (ORCPT ); Sat, 18 Jan 2014 17:06:04 -0500
Message-ID: <52DAFAC7.7080307@hp.com>
Date: Sat, 18 Jan 2014 14:05:59 -0800
From: Chegu Vinod
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: riel@redhat.com, linux-kernel@vger.kernel.org
CC: linux-mm@kvack.org, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com
Subject: Re: [PATCH v2 0/7] pseudo-interleaving for automatic NUMA balancing
References:
<1389993129-28180-1-git-send-email-riel@redhat.com> In-Reply-To: <1389993129-28180-1-git-send-email-riel@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 1/17/2014 1:12 PM, riel@redhat.com wrote: > The current automatic NUMA balancing code base has issues with > workloads that do not fit on one NUMA load. Page migration is > slowed down, but memory distribution between the nodes where > the workload runs is essentially random, often resulting in a > suboptimal amount of memory bandwidth being available to the > workload. > > In order to maximize performance of workloads that do not fit in one NUMA > node, we want to satisfy the following criteria: > 1) keep private memory local to each thread > 2) avoid excessive NUMA migration of pages > 3) distribute shared memory across the active nodes, to > maximize memory bandwidth available to the workload > > This patch series identifies the NUMA nodes on which the workload > is actively running, and balances (somewhat lazily) the memory > between those nodes, satisfying the criteria above. > > As usual, the series has had some performance testing, but it > could always benefit from more testing, on other systems. 
> > Changes since v1: > - fix divide by zero found by Chegu Vinod > - improve comment, as suggested by Peter Zijlstra > - do stats calculations in task_numa_placement in local variables > > > Some performance numbers, with two 40-warehouse specjbb instances > on an 8 node system with 10 CPU cores per node, using a pre-cleanup > version of these patches, courtesy of Chegu Vinod: > > numactl manual pinning > spec1.txt: throughput = 755900.20 SPECjbb2005 bops > spec2.txt: throughput = 754914.40 SPECjbb2005 bops > > NO-pinning results (Automatic NUMA balancing, with patches) > spec1.txt: throughput = 706439.84 SPECjbb2005 bops > spec2.txt: throughput = 729347.75 SPECjbb2005 bops > > NO-pinning results (Automatic NUMA balancing, without patches) > spec1.txt: throughput = 667988.47 SPECjbb2005 bops > spec2.txt: throughput = 638220.45 SPECjbb2005 bops > > No Automatic NUMA and NO-pinning results > spec1.txt: throughput = 544120.97 SPECjbb2005 bops > spec2.txt: throughput = 453553.41 SPECjbb2005 bops > > > My own performance numbers are not as relevant, since I have been > running with a more hostile workload on purpose, and I have run > into a scheduler issue that caused the workload to run on only > two of the four NUMA nodes on my test system... > > . 
>

Acked-by: Chegu Vinod

----

Here are some results using the v2 version of the patches
on an 8 socket box using SPECjbb2005 as a workload:

I) Eight 1-socket wide instances (10 warehouse threads each):

                          Without patches    With patches
                          ---------------    ------------
a) numactl pinning results
spec1.txt: throughput =        270620.04       273675.10
spec2.txt: throughput =        274115.33       272845.17
spec3.txt: throughput =        277830.09       272057.33
spec4.txt: throughput =        270898.52       270670.54
spec5.txt: throughput =        270397.30       270906.82
spec6.txt: throughput =        270451.93       268217.55
spec7.txt: throughput =        269511.07       269354.46
spec8.txt: throughput =        269386.06       270540.00

b) Automatic NUMA balancing results
spec1.txt: throughput =        244333.41       248072.72
spec2.txt: throughput =        252166.99       251818.30
spec3.txt: throughput =        251365.58       258266.24
spec4.txt: throughput =        245247.91       256873.51
spec5.txt: throughput =        245579.68       247743.18
spec6.txt: throughput =        249767.38       256285.86
spec7.txt: throughput =        244570.64       255343.99
spec8.txt: throughput =        245703.60       254434.36

c) NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput =        132959.73       136957.12
spec2.txt: throughput =        127937.11       129326.23
spec3.txt: throughput =        130697.10       125772.11
spec4.txt: throughput =        134978.49       141607.58
spec5.txt: throughput =        127574.34       126748.18
spec6.txt: throughput =        138699.99       128597.95
spec7.txt: throughput =        133247.25       137344.57
spec8.txt: throughput =        124548.00       139040.98

------

II) Four 2-socket wide instances (20 warehouse threads each):

                          Without patches    With patches
                          ---------------    ------------
a) numactl pinning results
spec1.txt: throughput =        479931.16       472467.58
spec2.txt: throughput =        466652.15       466237.10
spec3.txt: throughput =        473591.51       466891.98
spec4.txt: throughput =        462346.62       466891.98

b) Automatic NUMA balancing results
spec1.txt: throughput =        383758.29       437489.99
spec2.txt: throughput =        370926.06       435692.97
spec3.txt: throughput =        368872.72       444615.08
spec4.txt: throughput =        404422.82       435236.20

c) NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput =        252752.12       231762.30
spec2.txt: throughput =        255391.51       253250.95
spec3.txt: throughput =        264764.00       263721.03
spec4.txt: throughput =        254833.39       242892.72

------

III) Two 4-socket wide instances (40 warehouse threads each):

                          Without patches    With patches
                          ---------------    ------------
a) numactl pinning results
spec1.txt: throughput =        771340.84       769039.53
spec2.txt: throughput =        762184.48       760745.65

b) Automatic NUMA balancing results
spec1.txt: throughput =        667182.98       720197.01
spec2.txt: throughput =        692564.11       739872.51

c) NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput =        457079.28       467199.30
spec2.txt: throughput =        479790.47       456279.07

-----

IV) One 8-socket wide instance (80 warehouse threads):

                          Without patches    With patches
                          ---------------    ------------
a) numactl pinning results
spec1.txt: throughput =        982113.03       985836.96

b) Automatic NUMA balancing results
spec1.txt: throughput =        755615.94       843632.09

c) NO Automatic NUMA balancing and NO-pinning results
spec1.txt: throughput =        671583.26       661768.54

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754064AbaATQb1 (ORCPT ); Mon, 20 Jan 2014 11:31:27 -0500
Received: from merlin.infradead.org ([205.233.59.134]:32913 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751669AbaATQbY (ORCPT ); Mon, 20 Jan 2014 11:31:24 -0500
Date: Mon, 20 Jan 2014 17:31:03 +0100
From: Peter Zijlstra
To: riel@redhat.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com
Subject: Re: [PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics
Message-ID: <20140120163103.GI31570@twins.programming.kicks-ass.net>
References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-4-git-send-email-riel@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1389993129-28180-4-git-send-email-riel@redhat.com> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 17, 2014 at 04:12:05PM -0500, riel@redhat.com wrote: > /* > + * Iterate over the nodes from which NUMA hinting faults were triggered, in > + * other words where the CPUs that incurred NUMA hinting faults are. The > + * bitmask is used to limit NUMA page migrations, and spread out memory > + * between the actively used nodes. To prevent flip-flopping, and excessive > + * page migrations, nodes are added when they cause over 40% of the maximum > + * number of faults, but only removed when they drop below 20%. > + */ > +static void update_numa_active_node_mask(struct task_struct *p) > +{ > + unsigned long faults, max_faults = 0; > + struct numa_group *numa_group = p->numa_group; > + int nid; > + > + for_each_online_node(nid) { > + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > + numa_group->faults_from[task_faults_idx(nid, 1)]; > + if (faults > max_faults) > + max_faults = faults; > + } > + > + for_each_online_node(nid) { > + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > + numa_group->faults_from[task_faults_idx(nid, 1)]; > + if (!node_isset(nid, numa_group->active_nodes)) { > + if (faults > max_faults * 4 / 10) > + node_set(nid, numa_group->active_nodes); > + } else if (faults < max_faults * 2 / 10) > + node_clear(nid, numa_group->active_nodes); > + } > +} Why not use 6/16 and 3/16 resp.? That avoids an actual division. 
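
To make the two threshold pairs easier to compare, here is a minimal
userspace sketch of the add/remove hysteresis from the quoted
update_numa_active_node_mask(). It is not kernel code: the function
names node_active_tenths() and node_active_sixteenths() are made up for
illustration, and the "active" flag stands in for the node_isset() test
on numa_group->active_nodes.

```c
#include <stdbool.h>

/*
 * Illustrative userspace sketch, not kernel code. Both helpers return
 * whether a node should be in the active set after one update, given
 * whether it was active before.
 */

/* Thresholds as posted: add above 4/10 of max, remove below 2/10. */
bool node_active_tenths(bool active, unsigned long faults,
			unsigned long max_faults)
{
	if (!active)
		return faults > max_faults * 4 / 10;
	return !(faults < max_faults * 2 / 10);
}

/*
 * Suggested variant: add above 6/16 (37.5%), remove below 3/16 (18.75%).
 * Dividing by 16 is written as a right shift by 4.
 */
bool node_active_sixteenths(bool active, unsigned long faults,
			    unsigned long max_faults)
{
	if (!active)
		return faults > ((max_faults * 6) >> 4);
	return !(faults < ((max_faults * 3) >> 4));
}
```

With max_faults = 1000 the posted thresholds are 400 and 200, while the
shift-based ones are 375 and 187, so a node at 390 faults would be added
by the second variant but not by the first. In both cases a node that
sits between the two thresholds keeps whatever state it already had.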
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753831AbaATQwc (ORCPT ); Mon, 20 Jan 2014 11:52:32 -0500 Received: from merlin.infradead.org ([205.233.59.134]:33608 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750758AbaATQw3 (ORCPT ); Mon, 20 Jan 2014 11:52:29 -0500 Date: Mon, 20 Jan 2014 17:52:05 +0100 From: Peter Zijlstra To: riel@redhat.com Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com, Steven Rostedt Subject: Re: [PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes Message-ID: <20140120165205.GJ31570@twins.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-5-git-send-email-riel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1389993129-28180-5-git-send-email-riel@redhat.com> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 17, 2014 at 04:12:06PM -0500, riel@redhat.com wrote: > From: Rik van Riel > > Being able to see how the active nodemask changes over time, and why, > can be quite useful. 
> > Cc: Peter Zijlstra > Cc: Mel Gorman > Cc: Ingo Molnar > Cc: Chegu Vinod > Signed-off-by: Rik van Riel > --- > include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++ > kernel/sched/fair.c | 8 ++++++-- > 2 files changed, 40 insertions(+), 2 deletions(-) > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > index 67e1bbf..91726b6 100644 > --- a/include/trace/events/sched.h > +++ b/include/trace/events/sched.h > @@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa, > __entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid, > __entry->dst_cpu, __entry->dst_nid) > ); > + > +TRACE_EVENT(update_numa_active_nodes_mask, Please stick to the sched_ naming for these things. Ideally we'd rename the sysctls too :/ > +++ b/kernel/sched/fair.c > @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p) > faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > numa_group->faults_from[task_faults_idx(nid, 1)]; > if (!node_isset(nid, numa_group->active_nodes)) { > - if (faults > max_faults * 4 / 10) > + if (faults > max_faults * 4 / 10) { > + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults); While I think the tracepoint hookery is smart enough to avoid evaluating arguments when they're disabled, it might be best to simply pass: current and numa_group and do the dereference in fast_assign(). That said, this is the first and only numa tracepoint, I'm not sure why this qualifies and other metrics do not. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754280AbaATQzq (ORCPT ); Mon, 20 Jan 2014 11:55:46 -0500 Received: from merlin.infradead.org ([205.233.59.134]:33699 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751976AbaATQzn (ORCPT ); Mon, 20 Jan 2014 11:55:43 -0500 Date: Mon, 20 Jan 2014 17:55:23 +0100 From: Peter Zijlstra To: riel@redhat.com Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com Subject: Re: [PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics Message-ID: <20140120165523.GK31570@twins.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-4-git-send-email-riel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1389993129-28180-4-git-send-email-riel@redhat.com> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 17, 2014 at 04:12:05PM -0500, riel@redhat.com wrote: > /* > + * Iterate over the nodes from which NUMA hinting faults were triggered, in > + * other words where the CPUs that incurred NUMA hinting faults are. The > + * bitmask is used to limit NUMA page migrations, and spread out memory > + * between the actively used nodes. To prevent flip-flopping, and excessive > + * page migrations, nodes are added when they cause over 40% of the maximum > + * number of faults, but only removed when they drop below 20%. > + */ Maybe break the above into two paragraphs for added readability. Also, I think this might be a good spot to explain why you need the second fault metric -- that is, why can't we create the interleave mask from the existing memory location faults. 
> +static void update_numa_active_node_mask(struct task_struct *p) > +{ > + unsigned long faults, max_faults = 0; > + struct numa_group *numa_group = p->numa_group; > + int nid; > + > + for_each_online_node(nid) { > + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > + numa_group->faults_from[task_faults_idx(nid, 1)]; > + if (faults > max_faults) > + max_faults = faults; > + } > + > + for_each_online_node(nid) { > + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > + numa_group->faults_from[task_faults_idx(nid, 1)]; > + if (!node_isset(nid, numa_group->active_nodes)) { > + if (faults > max_faults * 4 / 10) > + node_set(nid, numa_group->active_nodes); > + } else if (faults < max_faults * 2 / 10) > + node_clear(nid, numa_group->active_nodes); > + } > +} From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754023AbaATQ6J (ORCPT ); Mon, 20 Jan 2014 11:58:09 -0500 Received: from merlin.infradead.org ([205.233.59.134]:33739 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751740AbaATQ6G (ORCPT ); Mon, 20 Jan 2014 11:58:06 -0500 Date: Mon, 20 Jan 2014 17:57:47 +0100 From: Peter Zijlstra To: riel@redhat.com Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com Subject: Re: [PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use Message-ID: <20140120165747.GL31570@twins.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-7-git-send-email-riel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1389993129-28180-7-git-send-email-riel@redhat.com> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 17, 2014 at 04:12:08PM -0500, riel@redhat.com wrote: > diff --git 
a/include/linux/sched.h b/include/linux/sched.h > index 0af6c1a..52de567 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1471,6 +1471,8 @@ struct task_struct { > int numa_preferred_nid; > unsigned long numa_migrate_retry; > u64 node_stamp; /* migration stamp */ > + u64 last_task_numa_placement; > + u64 last_sum_exec_runtime; > struct callback_head numa_work; > > struct list_head numa_entry; > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 8e0a53a..0d395a0 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p, > memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); > } > > +/* > + * Get the fraction of time the task has been running since the last > + * NUMA placement cycle. The scheduler keeps similar statistics, but > + * decays those on a 32ms period, which is orders of magnitude off > + * from the dozens-of-seconds NUMA balancing period. Use the scheduler > + * stats only if the task is so new there are no NUMA statistics yet. > + */ > +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) > +{ > + u64 runtime, delta, now; > + /* Use the start of this time slice to avoid calculations. */ > + now = p->se.exec_start; > + runtime = p->se.sum_exec_runtime; > + > + if (p->last_task_numa_placement) { > + delta = runtime - p->last_sum_exec_runtime; > + *period = now - p->last_task_numa_placement; > + } else { > + delta = p->se.avg.runnable_avg_sum; > + *period = p->se.avg.runnable_avg_period; > + } > + > + p->last_sum_exec_runtime = runtime; > + p->last_task_numa_placement = now; > + > + return delta; > +} Have you tried what happens if you use p->se.avg.runnable_avg_sum / p->se.avg.runnable_avg_period instead? If that also works it avoids growing the datastructures and keeping of yet another set of runtime stats. 
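
For anyone following along, here is a small userspace sketch of what the
quoted helper computes and how patch 6/7 turns it into a fault weight.
It is an illustration under assumptions, not the kernel code: struct
task_sample is a hypothetical stand-in for the few task_struct fields
involved, only the branch that uses the NUMA placement snapshots is
modelled (the runnable_avg fallback is left out), and total_faults is
passed in as the caller computes it in the patch (including its +1).

```c
#include <stdint.h>

/* Hypothetical stand-in for the task_struct fields used by the patch. */
struct task_sample {
	uint64_t sum_exec_runtime;		/* total CPU time consumed */
	uint64_t exec_start;			/* start of the current time slice */
	uint64_t last_sum_exec_runtime;		/* runtime at previous placement */
	uint64_t last_task_numa_placement;	/* time of previous placement */
};

/*
 * Return the CPU time used since the previous placement and store the
 * wall-clock length of that interval in *period, then take a new
 * snapshot. Mirrors the non-fallback branch of numa_get_avg_runtime().
 */
uint64_t sample_runtime(struct task_sample *t, uint64_t *period)
{
	uint64_t delta = t->sum_exec_runtime - t->last_sum_exec_runtime;

	*period = t->exec_start - t->last_task_numa_placement;
	t->last_sum_exec_runtime = t->sum_exec_runtime;
	t->last_task_numa_placement = t->exec_start;
	return delta;
}

/*
 * Weight one node's faults by (faults on node / total faults) times
 * (runtime / period), scaled by 1024, as f_weight does in patch 6/7.
 * The +1 in the divisor avoids a division by zero.
 */
uint64_t fault_weight(uint64_t runtime, uint64_t period,
		      uint64_t node_faults, uint64_t total_faults)
{
	return (1024 * runtime * node_faults) / (total_faults * period + 1);
}
```

As a made-up example, a worker that ran for 900 out of 1000 time units
and took 50 of its 100 faults on a node gets a weight of 460 for that
node, while a thread that ran for only 10 out of 1000 units gets a
weight of 10 even if all 5000 of its faults were on one node.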
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752229AbaATSvp (ORCPT ); Mon, 20 Jan 2014 13:51:45 -0500 Received: from mx1.redhat.com ([209.132.183.28]:35062 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750822AbaATSvl (ORCPT ); Mon, 20 Jan 2014 13:51:41 -0500 Message-ID: <52DD7016.9080708@redhat.com> Date: Mon, 20 Jan 2014 13:51:02 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Peter Zijlstra CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com, Steven Rostedt Subject: Re: [PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-5-git-send-email-riel@redhat.com> <20140120165205.GJ31570@twins.programming.kicks-ass.net> In-Reply-To: <20140120165205.GJ31570@twins.programming.kicks-ass.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/20/2014 11:52 AM, Peter Zijlstra wrote: > On Fri, Jan 17, 2014 at 04:12:06PM -0500, riel@redhat.com wrote: >> +++ b/kernel/sched/fair.c >> @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p) >> faults = numa_group->faults_from[task_faults_idx(nid, 0)] + >> numa_group->faults_from[task_faults_idx(nid, 1)]; >> if (!node_isset(nid, numa_group->active_nodes)) { >> - if (faults > max_faults * 4 / 10) >> + if (faults > max_faults * 4 / 10) { >> + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults); > > While I think the tracepoint hookery is smart enough to avoid evaluating > arguments when they're disabled, it might be best to simply pass: > current and numa_group and do the 
dereference in fast_assign(). > > That said, this is the first and only numa tracepoint, I'm not sure why > this qualifies and other metrics do not. It's there because I needed it in development. If you think it is not merge material, I would be comfortable leaving it out. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753600AbaATTDM (ORCPT ); Mon, 20 Jan 2014 14:03:12 -0500 Received: from mx1.redhat.com ([209.132.183.28]:24334 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752992AbaATTDH (ORCPT ); Mon, 20 Jan 2014 14:03:07 -0500 Message-ID: <52DD72C8.2050602@redhat.com> Date: Mon, 20 Jan 2014 14:02:32 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Peter Zijlstra CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com Subject: Re: [PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-7-git-send-email-riel@redhat.com> <20140120165747.GL31570@twins.programming.kicks-ass.net> In-Reply-To: <20140120165747.GL31570@twins.programming.kicks-ass.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/20/2014 11:57 AM, Peter Zijlstra wrote: > On Fri, Jan 17, 2014 at 04:12:08PM -0500, riel@redhat.com wrote: >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index 0af6c1a..52de567 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -1471,6 +1471,8 @@ struct task_struct { >> int numa_preferred_nid; >> unsigned long numa_migrate_retry; >> u64 node_stamp; /* migration stamp */ >> + u64 last_task_numa_placement; >> + u64 last_sum_exec_runtime; >> 
 struct callback_head numa_work; >> >> struct list_head numa_entry; > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 8e0a53a..0d395a0 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -1422,11 +1422,41 @@ static void update_task_scan_period(struct task_struct *p, >> memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality)); >> } >> >> +/* >> + * Get the fraction of time the task has been running since the last >> + * NUMA placement cycle. The scheduler keeps similar statistics, but >> + * decays those on a 32ms period, which is orders of magnitude off >> + * from the dozens-of-seconds NUMA balancing period. Use the scheduler >> + * stats only if the task is so new there are no NUMA statistics yet. >> + */ >> +static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) >> +{ >> + u64 runtime, delta, now; >> + /* Use the start of this time slice to avoid calculations. */ >> + now = p->se.exec_start; >> + runtime = p->se.sum_exec_runtime; >> + >> + if (p->last_task_numa_placement) { >> + delta = runtime - p->last_sum_exec_runtime; >> + *period = now - p->last_task_numa_placement; >> + } else { >> + delta = p->se.avg.runnable_avg_sum; >> + *period = p->se.avg.runnable_avg_period; >> + } >> + >> + p->last_sum_exec_runtime = runtime; >> + p->last_task_numa_placement = now; >> + >> + return delta; >> +} > > Have you tried what happens if you use p->se.avg.runnable_avg_sum / > p->se.avg.runnable_avg_period instead? If that also works it avoids > growing the datastructures and keeping of yet another set of runtime > stats. That is what I started out with, and the results were not as stable as with this calculation. Having said that, I did that before I came up with patch 7/7, so maybe the effect would no longer be as pronounced as it was before... I can send in a simplified version, if you prefer.
-- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752878AbaATTF4 (ORCPT ); Mon, 20 Jan 2014 14:05:56 -0500 Received: from cdptpa-outbound-snat.email.rr.com ([107.14.166.225]:39039 "EHLO cdptpa-oedge-vip.email.rr.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751380AbaATTFy (ORCPT ); Mon, 20 Jan 2014 14:05:54 -0500 Date: Mon, 20 Jan 2014 14:05:51 -0500 From: Steven Rostedt To: Peter Zijlstra Cc: riel@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com Subject: Re: [PATCH 4/7] numa,sched: tracepoints for NUMA balancing active nodemask changes Message-ID: <20140120140551.3343ab2b@gandalf.local.home> In-Reply-To: <20140120165205.GJ31570@twins.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-5-git-send-email-riel@redhat.com> <20140120165205.GJ31570@twins.programming.kicks-ass.net> X-Mailer: Claws Mail 3.9.3 (GTK+ 2.24.22; x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-RR-Connecting-IP: 107.14.168.142:25 X-Cloudmark-Score: 0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 20 Jan 2014 17:52:05 +0100 Peter Zijlstra wrote: > On Fri, Jan 17, 2014 at 04:12:06PM -0500, riel@redhat.com wrote: > > From: Rik van Riel > > > > +++ b/kernel/sched/fair.c > > @@ -1300,10 +1300,14 @@ static void update_numa_active_node_mask(struct task_struct *p) > > faults = numa_group->faults_from[task_faults_idx(nid, 0)] + > > numa_group->faults_from[task_faults_idx(nid, 1)]; > > if (!node_isset(nid, numa_group->active_nodes)) { > > - if (faults > max_faults * 4 / 10) > > + if (faults > max_faults * 4 / 10) { > > + trace_update_numa_active_nodes_mask(current->pid, numa_group->gid, nid, true, faults, max_faults); > > While I think the 
tracepoint hookery is smart enough to avoid evaluating > arguments when they're disabled, it might be best to simply pass: > current and numa_group and do the dereference in fast_assign(). It's really up to gcc to optimize it. But that said, it is more efficient to just pass the pointer and do the dereferencing in the fast_assign(). At least it keeps any bad optimization in gcc from infecting the tracepoint caller. It also makes it easier to get other information if you want to later extend that tracepoint. Does this tracepoint always use current? If so, why bother passing it in? -- Steve From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752738AbaATTLG (ORCPT ); Mon, 20 Jan 2014 14:11:06 -0500 Received: from merlin.infradead.org ([205.233.59.134]:36503 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751936AbaATTLE (ORCPT ); Mon, 20 Jan 2014 14:11:04 -0500 Date: Mon, 20 Jan 2014 20:10:51 +0100 From: Peter Zijlstra To: Rik van Riel Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com Subject: Re: [PATCH 6/7] numa,sched: normalize faults_from stats and weigh by CPU use Message-ID: <20140120191051.GQ11314@laptop.programming.kicks-ass.net> References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-7-git-send-email-riel@redhat.com> <20140120165747.GL31570@twins.programming.kicks-ass.net> <52DD72C8.2050602@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <52DD72C8.2050602@redhat.com> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 20, 2014 at 02:02:32PM -0500, Rik van Riel wrote: > That is what I started out with, and the results were not > as stable as with this calculation.
> > Having said that, I did that before I came up with patch 7/7, > so maybe the effect would no longer be as pronounced any more > as it was before... > > I can send in a simplified version, if you prefer. If you could retry with 7/7, I don't mind adding the extra stats too much, but it would be nice if we can avoid it. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752475AbaATTjY (ORCPT ); Mon, 20 Jan 2014 14:39:24 -0500 Received: from mx1.redhat.com ([209.132.183.28]:63370 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750721AbaATTjU (ORCPT ); Mon, 20 Jan 2014 14:39:20 -0500 Message-ID: <52DD7129.3040106@redhat.com> Date: Mon, 20 Jan 2014 13:55:37 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Peter Zijlstra CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, chegu_vinod@hp.com, mgorman@suse.de, mingo@redhat.com Subject: Re: [PATCH 3/7] numa,sched: build per numa_group active node mask from faults_from statistics References: <1389993129-28180-1-git-send-email-riel@redhat.com> <1389993129-28180-4-git-send-email-riel@redhat.com> <20140120163103.GI31570@twins.programming.kicks-ass.net> In-Reply-To: <20140120163103.GI31570@twins.programming.kicks-ass.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/20/2014 11:31 AM, Peter Zijlstra wrote: > On Fri, Jan 17, 2014 at 04:12:05PM -0500, riel@redhat.com wrote: >> /* >> + * Iterate over the nodes from which NUMA hinting faults were triggered, in >> + * other words where the CPUs that incurred NUMA hinting faults are. The >> + * bitmask is used to limit NUMA page migrations, and spread out memory >> + * between the actively used nodes. 
To prevent flip-flopping, and excessive >> + * page migrations, nodes are added when they cause over 40% of the maximum >> + * number of faults, but only removed when they drop below 20%. >> + */ >> +static void update_numa_active_node_mask(struct task_struct *p) >> +{ >> + unsigned long faults, max_faults = 0; >> + struct numa_group *numa_group = p->numa_group; >> + int nid; >> + >> + for_each_online_node(nid) { >> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + >> + numa_group->faults_from[task_faults_idx(nid, 1)]; >> + if (faults > max_faults) >> + max_faults = faults; >> + } >> + >> + for_each_online_node(nid) { >> + faults = numa_group->faults_from[task_faults_idx(nid, 0)] + >> + numa_group->faults_from[task_faults_idx(nid, 1)]; >> + if (!node_isset(nid, numa_group->active_nodes)) { >> + if (faults > max_faults * 4 / 10) >> + node_set(nid, numa_group->active_nodes); >> + } else if (faults < max_faults * 2 / 10) >> + node_clear(nid, numa_group->active_nodes); >> + } >> +} > > Why not use 6/16 and 3/16 resp.? That avoids an actual division. OK, will do. -- All rights reversed
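The hysteresis in update_numa_active_node_mask(), with the 6/16 and 3/16 thresholds suggested above, can be sketched in userspace C as follows. This is a rough illustration under assumptions: update_active_mask_sketch() and SKETCH_MAX_NODES are hypothetical names, a plain 64-bit word stands in for the kernel nodemask, and the per-node fault totals are passed in as an array instead of being read from numa_group->faults_from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_NODES 64 /* one bit per node in a uint64_t mask */

/*
 * Return the new active-node mask. A node joins the mask once its fault
 * count exceeds 6/16 (37.5%) of the busiest node's count, and leaves
 * only when it drops below 3/16 (18.75%), so a node hovering near one
 * threshold does not flip in and out on every pass.
 */
static uint64_t update_active_mask_sketch(const unsigned long *faults,
					  size_t nr_nodes, uint64_t active)
{
	unsigned long max_faults = 0;
	size_t nid;

	assert(nr_nodes <= SKETCH_MAX_NODES);

	/* First pass: find the node with the most faults. */
	for (nid = 0; nid < nr_nodes; nid++) {
		if (faults[nid] > max_faults)
			max_faults = faults[nid];
	}

	/* Second pass: apply the two thresholds. */
	for (nid = 0; nid < nr_nodes; nid++) {
		uint64_t bit = UINT64_C(1) << nid;

		if (!(active & bit)) {
			if (faults[nid] > max_faults * 6 / 16)
				active |= bit;
		} else if (faults[nid] < max_faults * 3 / 16) {
			active &= ~bit;
		}
	}

	return active;
}
```

Because the operands are unsigned and 16 is a power of two, a compiler can turn the division into a shift, which is the point of moving from tenths to sixteenths. The thresholds shift slightly (37.5% and 18.75% instead of 40% and 20%), but the add-high, remove-low shape is unchanged.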