From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
Ingo Molnar <mingo@kernel.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 4/8] sched: Update NUMA hinting faults once per scan
Date: Fri, 28 Jun 2013 12:02:33 +0530
Message-ID: <20130628063233.GC17195@linux.vnet.ibm.com>
In-Reply-To: <1372257487-9749-5-git-send-email-mgorman@suse.de>

* Mel Gorman <mgorman@suse.de> [2013-06-26 15:38:03]:
> NUMA hinting fault counts and placement decisions are both recorded in the
> same array, which distorts the samples in an unpredictable fashion. The values
> linearly accumulate during the scan and then decay, creating a sawtooth-like
> pattern in the per-node counts. It also means that placement decisions are
> time sensitive. At best it means that it is very difficult to state that
> the buffer holds a decaying average of past faulting behaviour. At worst,
> it can confuse the load balancer if it sees one node with an artificially high
> count due to very recent faulting activity and may create a bouncing effect.
>
> This patch adds a second array. numa_faults stores the historical data
> which is used for placement decisions. numa_faults_buffer holds the
> fault activity during the current scan window. When the scan completes,
> numa_faults decays and the values from numa_faults_buffer are copied
> across.
>
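To make the scheme described above concrete, here is a minimal user-space sketch of the decay-and-merge step. It is only a model under stated assumptions, not the kernel code: NR_NODES, record_fault() and complete_scan() are hypothetical names, and the arrays are plain globals rather than task_struct fields.

```c
#include <assert.h>

#define NR_NODES 2	/* hypothetical node count, for illustration only */

/* Decayed history; the only data placement decisions look at. */
static unsigned long numa_faults[NR_NODES];
/* Faults seen during the current scan window. */
static unsigned long numa_faults_buffer[NR_NODES];

/* Record a fault in the buffer; the history is untouched mid-scan. */
static void record_fault(int node, unsigned long pages)
{
	assert(node >= 0 && node < NR_NODES);
	numa_faults_buffer[node] += pages;
}

/*
 * End of a scan window: halve the history, fold in the buffered
 * faults, clear the buffer, and return the node with the most faults
 * (-1 if no faults have been recorded at all).
 */
static int complete_scan(void)
{
	unsigned long max_faults = 0;
	int max_nid = -1;

	for (int nid = 0; nid < NR_NODES; nid++) {
		numa_faults[nid] >>= 1;
		numa_faults[nid] += numa_faults_buffer[nid];
		numa_faults_buffer[nid] = 0;

		if (numa_faults[nid] > max_faults) {
			max_faults = numa_faults[nid];
			max_nid = nid;
		}
	}
	return max_nid;
}
```

In this model, a burst of faults on one node cannot change the preferred node until the scan completes, and even then it competes against the halved history rather than a partially accumulated count.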
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> include/linux/sched.h | 13 +++++++++++++
> kernel/sched/core.c | 1 +
> kernel/sched/fair.c | 16 +++++++++++++---
> 3 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ba46a64..42f9818 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1506,7 +1506,20 @@ struct task_struct {
>  	u64 node_stamp;			/* migration stamp */
>  	struct callback_head numa_work;
>
> +	/*
> +	 * Exponential decaying average of faults on a per-node basis.
> +	 * Scheduling placement decisions are made based on these counts.
> +	 * The values remain static for the duration of a PTE scan.
> +	 */
>  	unsigned long *numa_faults;
> +
> +	/*
> +	 * numa_faults_buffer records faults per node during the current
> +	 * scan window. When the scan completes, the counts in numa_faults
> +	 * decay and these values are copied.
> +	 */
> +	unsigned long *numa_faults_buffer;
> +
>  	int numa_preferred_nid;
> #endif /* CONFIG_NUMA_BALANCING */
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 019baae..b00b81a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
>  	p->numa_preferred_nid = -1;
>  	p->numa_work.next = &p->numa_work;
>  	p->numa_faults = NULL;
> +	p->numa_faults_buffer = NULL;
> #endif /* CONFIG_NUMA_BALANCING */
> }
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f8c3f61..5893399 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
>
>  	/* Find the node with the highest number of faults */
>  	for (nid = 0; nid < nr_node_ids; nid++) {
> -		unsigned long faults = p->numa_faults[nid];
> +		unsigned long faults;
> +
> +		/* Decay existing window and copy faults since last scan */
>  		p->numa_faults[nid] >>= 1;
> +		p->numa_faults[nid] += p->numa_faults_buffer[nid];
> +		p->numa_faults_buffer[nid] = 0;
> +
> +		faults = p->numa_faults[nid];
>  		if (faults > max_faults) {
>  			max_faults = faults;
>  			max_nid = nid;
> @@ -831,9 +837,13 @@ void task_numa_fault(int node, int pages, bool migrated)
>  	if (unlikely(!p->numa_faults)) {
>  		int size = sizeof(*p->numa_faults) * nr_node_ids;
>
> -		p->numa_faults = kzalloc(size, GFP_KERNEL);
> +		/* numa_faults and numa_faults_buffer share the allocation */
> +		p->numa_faults = kzalloc(size * 2, GFP_KERNEL);

Instead of allocating a buffer to hold the current faults, can't we pass
the number of pages and the node information (and probably migrated) to
task_numa_placement()?

Why should task_struct be passed as an argument to task_numa_placement()?
It seems it will always be current.

>  		if (!p->numa_faults)
>  			return;
> +
> +		BUG_ON(p->numa_faults_buffer);
> +		p->numa_faults_buffer = p->numa_faults + nr_node_ids;
>  	}
>
> /*
> @@ -847,7 +857,7 @@ void task_numa_fault(int node, int pages, bool migrated)
> task_numa_placement(p);
>
> /* Record the fault, double the weight if pages were migrated */
> - p->numa_faults[node] += pages << migrated;
> + p->numa_faults_buffer[node] += pages << migrated;
> }
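
For reference, the shared allocation and the weighted recording in the hunks above can be modelled in user space roughly as follows. This is a sketch under assumptions, not the patch itself: struct numa_stats and the numa_stats_*() helpers are hypothetical names, and calloc()/free() stand in for kzalloc()/kfree().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the NUMA fields of task_struct. */
struct numa_stats {
	size_t nr_nodes;
	unsigned long *faults;		/* decayed history, first half */
	unsigned long *faults_buffer;	/* current scan window, second half */
};

/* One zeroed allocation backs both arrays. Returns 0 on success. */
static int numa_stats_alloc(struct numa_stats *s, size_t nr_nodes)
{
	s->faults = calloc(nr_nodes * 2, sizeof(*s->faults));
	if (!s->faults)
		return -1;

	s->nr_nodes = nr_nodes;
	s->faults_buffer = s->faults + nr_nodes;
	return 0;
}

/* Record a fault in the buffer; migrated pages count double. */
static void numa_stats_record(struct numa_stats *s, size_t node,
			      unsigned long pages, bool migrated)
{
	assert(node < s->nr_nodes);
	s->faults_buffer[node] += pages << migrated;
}

/* Free once through the base pointer; the buffer is not owned separately. */
static void numa_stats_free(struct numa_stats *s)
{
	free(s->faults);
	s->faults = NULL;
	s->faults_buffer = NULL;
}
```

The point of the layout is that the second pointer is only an offset into the first allocation, so there is a single allocation failure path and a single free.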
>
> static void reset_ptenuma_scan(struct task_struct *p)
> --
> 1.8.1.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Thanks and Regards
Srikar Dronamraju