From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>, Ingo Molnar <mingo@kernel.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>,
Preeti U Murthy <preeti@linux.vnet.ibm.com>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
Date: Wed, 31 Jul 2013 23:05:13 +0530
Message-ID: <20130731173513.GA12770@linux.vnet.ibm.com>
In-Reply-To: <20130730093321.GO3008@twins.programming.kicks-ass.net>
* Peter Zijlstra <peterz@infradead.org> [2013-07-30 11:33:21]:
> On Tue, Jul 30, 2013 at 02:45:43PM +0530, Srikar Dronamraju wrote:
>
> > Can you please suggest workloads that I could try which might showcase
> > why you hate pure process based approach?
>
> 2 processes, 1 sysvshm segment. I know there's multi-process MPI
> libraries out there.
>
> Something like: perf bench numa mem -p 2 -G 4096 -0 -z --no-data_rand_walk -Z
>
The above dumped core; it looks like -T is required with -G.
I then tried "perf bench numa mem -p 2 -T 32 -G 4096 -0 -z --no-data_rand_walk -Z",
but it still didn't seem to do anything on my 4-node box (nothing had
happened after almost 2 hours).

Finally I ran "perf bench numa mem -a" (with HT both disabled and enabled).

Convergence-wise, my patchset did really well.
Bandwidth looks like a mixed bag: there are improvements, but we also see
degradations. I am not sure how to quantify which was the best among the
three. The nx1 tests were the ones where this patchset was negative; it
was positive for all the others.

Is this what you were looking for? Or was it something else?
(Lower is better)
testcase                    3.9.0    Mels v5   this_patchset  Units
--------------------------------------------------------------------
1x3-convergence             0.320    100.060         100.204  secs
1x4-convergence           100.139    100.162         100.155  secs
1x6-convergence           100.455    100.179           1.078  secs
2x3-convergence           100.261    100.339           9.743  secs
3x3-convergence           100.213    100.168          10.073  secs
4x4-convergence           100.307    100.201          19.686  secs
4x4-convergence-NOTHP     100.229    100.221           3.189  secs
4x6-convergence           101.441    100.632           6.204  secs
4x8-convergence           100.680    100.588           5.275  secs
8x4-convergence           100.335    100.365          34.069  secs
8x4-convergence-NOTHP     100.331    100.412         100.478  secs
3x1-convergence             1.227      1.536           0.576  secs
4x1-convergence             1.224      1.063           1.390  secs
8x1-convergence             1.713      2.437           1.704  secs
16x1-convergence            2.750      2.677           1.856  secs
32x1-convergence            1.985      1.795           1.391  secs
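
One rough way to summarize the convergence table above is to count how many test cases finished under the time limit for each kernel. The sketch below is only illustrative: it assumes that any run of roughly 100 secs or more hit the benchmark's time limit and did not converge (that reading is an assumption, not something stated in this mail), and the names `CONVERGENCE`, `TIMEOUT_SECS`, and `converged_counts` are made up for the example.

```python
# Convergence times in seconds, copied from the table above.
# Tuple order: (3.9.0, Mels v5, this_patchset).
CONVERGENCE = {
    "1x3": (0.320, 100.060, 100.204),
    "1x4": (100.139, 100.162, 100.155),
    "1x6": (100.455, 100.179, 1.078),
    "2x3": (100.261, 100.339, 9.743),
    "3x3": (100.213, 100.168, 10.073),
    "4x4": (100.307, 100.201, 19.686),
    "4x4-NOTHP": (100.229, 100.221, 3.189),
    "4x6": (101.441, 100.632, 6.204),
    "4x8": (100.680, 100.588, 5.275),
    "8x4": (100.335, 100.365, 34.069),
    "8x4-NOTHP": (100.331, 100.412, 100.478),
    "3x1": (1.227, 1.536, 0.576),
    "4x1": (1.224, 1.063, 1.390),
    "8x1": (1.713, 2.437, 1.704),
    "16x1": (2.750, 2.677, 1.856),
    "32x1": (1.985, 1.795, 1.391),
}

KERNELS = ("3.9.0", "Mels v5", "this_patchset")

# Assumption: a run at or above this value timed out instead of converging.
TIMEOUT_SECS = 100.0


def converged_counts(results, timeout=TIMEOUT_SECS):
    """Count, per kernel, the test cases that finished below the timeout."""
    counts = dict.fromkeys(KERNELS, 0)
    for times in results.values():
        for kernel, secs in zip(KERNELS, times):
            if secs < timeout:
                counts[kernel] += 1
    return counts


if __name__ == "__main__":
    total = len(CONVERGENCE)
    for kernel, count in converged_counts(CONVERGENCE).items():
        print(f"{kernel:>14}: {count}/{total} converged")
```

Under that assumption, a count like this only says how often each kernel converged, not how quickly, so it complements the raw times rather than replacing them.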
(Higher is better)
testcase                    3.9.0    Mels v5   this_patchset  Units
--------------------------------------------------------------------
RAM-bw-local                3.341      3.340           3.325  GB/sec
RAM-bw-local-NOTHP          3.308      3.307           3.290  GB/sec
RAM-bw-remote               1.815      1.815           1.815  GB/sec
RAM-bw-local-2x             6.410      6.413           6.412  GB/sec
RAM-bw-remote-2x            3.020      3.041           3.027  GB/sec
RAM-bw-cross                4.397      3.425           4.374  GB/sec
2x1-bw-process              3.481      3.442           3.492  GB/sec
3x1-bw-process              5.423      7.547           5.445  GB/sec
4x1-bw-process              5.108     11.009           5.118  GB/sec
8x1-bw-process              8.929     10.935           8.825  GB/sec
8x1-bw-process-NOTHP       12.754     11.442          22.889  GB/sec
16x1-bw-process            12.886     12.685          13.546  GB/sec
4x1-bw-thread              19.147     17.964           9.622  GB/sec
8x1-bw-thread              26.342     30.171          14.679  GB/sec
16x1-bw-thread             41.527     36.363          40.070  GB/sec
32x1-bw-thread             45.005     40.950          49.846  GB/sec
2x3-bw-thread               9.493     14.444           8.145  GB/sec
4x4-bw-thread              18.309     16.382          45.384  GB/sec
4x6-bw-thread              14.524     18.502          17.058  GB/sec
4x8-bw-thread              13.315     16.852          33.693  GB/sec
4x8-bw-thread-NOTHP        12.273     12.226          24.887  GB/sec
3x3-bw-thread              17.614     11.960          16.119  GB/sec
5x5-bw-thread              13.415     17.585          24.251  GB/sec
2x16-bw-thread             11.718     11.174          17.971  GB/sec
1x32-bw-thread             11.360     10.902          14.330  GB/sec
numa02-bw                  48.999     44.173          54.795  GB/sec
numa02-bw-NOTHP            47.655     42.600          53.445  GB/sec
numa01-bw-thread           36.983     39.692          45.254  GB/sec
numa01-bw-thread-NOTHP     38.486     35.208          44.118  GB/sec
With HT ON
(Lower is better)
testcase                    3.9.0    Mels v5   this_patchset  Units
--------------------------------------------------------------------
1x3-convergence           100.114    100.138         100.084  secs
1x4-convergence             0.468    100.227         100.153  secs
1x6-convergence           100.278    100.400         100.197  secs
2x3-convergence           100.186      1.833          13.132  secs
3x3-convergence           100.302    100.457           2.087  secs
4x4-convergence           100.237    100.178           2.466  secs
4x4-convergence-NOTHP     100.148    100.251           2.985  secs
4x6-convergence           100.931      3.632           9.184  secs
4x8-convergence           100.398    100.456           4.801  secs
8x4-convergence           100.649    100.458           4.179  secs
8x4-convergence-NOTHP     100.391    100.428           9.758  secs
3x1-convergence             1.472      1.501           0.727  secs
4x1-convergence             1.478      1.489           1.408  secs
8x1-convergence             2.380      2.385           2.432  secs
16x1-convergence            3.260      3.399           2.219  secs
32x1-convergence            2.622      2.067           1.951  secs
(Higher is better)
testcase                    3.9.0    Mels v5   this_patchset  Units
--------------------------------------------------------------------
RAM-bw-local                3.333      3.342           3.345  GB/sec
RAM-bw-local-NOTHP          3.305      3.306           3.307  GB/sec
RAM-bw-remote               1.814      1.814           1.816  GB/sec
RAM-bw-local-2x             7.896      6.400           6.538  GB/sec
RAM-bw-remote-2x            2.982      3.038           3.034  GB/sec
RAM-bw-cross                4.313      3.427           4.372  GB/sec
2x1-bw-process              3.473      4.708           3.784  GB/sec
3x1-bw-process              5.397      4.983           5.399  GB/sec
4x1-bw-process              5.040      8.775           5.098  GB/sec
8x1-bw-process              8.989      6.862          13.745  GB/sec
8x1-bw-process-NOTHP        8.457     19.094           8.118  GB/sec
16x1-bw-process            13.482     23.067          15.138  GB/sec
4x1-bw-thread              14.904     18.258           9.713  GB/sec
8x1-bw-thread              24.160     29.153          12.495  GB/sec
16x1-bw-thread             41.283     36.642          32.140  GB/sec
32x1-bw-thread             46.983     43.068          48.153  GB/sec
2x3-bw-thread               9.718     15.344          10.846  GB/sec
4x4-bw-thread              12.602     15.758          13.148  GB/sec
4x6-bw-thread              13.807     11.278          18.540  GB/sec
4x8-bw-thread              13.316     11.677          22.795  GB/sec
4x8-bw-thread-NOTHP        12.548     21.797          30.807  GB/sec
3x3-bw-thread              13.500     18.758          18.569  GB/sec
5x5-bw-thread              14.575     14.199          36.521  GB/sec
2x16-bw-thread             11.345     11.434          19.569  GB/sec
1x32-bw-thread             14.123     10.586          14.587  GB/sec
numa02-bw                  50.963     44.092          53.419  GB/sec
numa02-bw-NOTHP            50.553     42.724          51.106  GB/sec
numa01-bw-thread           33.724     33.050          37.801  GB/sec
numa01-bw-thread-NOTHP     39.064     35.139          43.314  GB/sec
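
On the question of how to quantify which of the three did best on bandwidth, one common option is the geometric mean of each kernel's per-test ratio against the 3.9.0 baseline. The sketch below is only a suggestion, not something that was run for this mail: it uses the HT ON bandwidth rows directly above, weights every test case equally (which may not be what you want), and the names `BANDWIDTH` and `geomean_vs_baseline` are invented for the example.

```python
import math

# Rows copied from the HT ON bandwidth table directly above (GB/sec).
# Column order: testcase, 3.9.0, Mels v5, this_patchset.
BANDWIDTH = """
RAM-bw-local 3.333 3.342 3.345
RAM-bw-local-NOTHP 3.305 3.306 3.307
RAM-bw-remote 1.814 1.814 1.816
RAM-bw-local-2x 7.896 6.400 6.538
RAM-bw-remote-2x 2.982 3.038 3.034
RAM-bw-cross 4.313 3.427 4.372
2x1-bw-process 3.473 4.708 3.784
3x1-bw-process 5.397 4.983 5.399
4x1-bw-process 5.040 8.775 5.098
8x1-bw-process 8.989 6.862 13.745
8x1-bw-process-NOTHP 8.457 19.094 8.118
16x1-bw-process 13.482 23.067 15.138
4x1-bw-thread 14.904 18.258 9.713
8x1-bw-thread 24.160 29.153 12.495
16x1-bw-thread 41.283 36.642 32.140
32x1-bw-thread 46.983 43.068 48.153
2x3-bw-thread 9.718 15.344 10.846
4x4-bw-thread 12.602 15.758 13.148
4x6-bw-thread 13.807 11.278 18.540
4x8-bw-thread 13.316 11.677 22.795
4x8-bw-thread-NOTHP 12.548 21.797 30.807
3x3-bw-thread 13.500 18.758 18.569
5x5-bw-thread 14.575 14.199 36.521
2x16-bw-thread 11.345 11.434 19.569
1x32-bw-thread 14.123 10.586 14.587
numa02-bw 50.963 44.092 53.419
numa02-bw-NOTHP 50.553 42.724 51.106
numa01-bw-thread 33.724 33.050 37.801
numa01-bw-thread-NOTHP 39.064 35.139 43.314
"""


def geomean_vs_baseline(table):
    """Geometric mean of each column's ratio to the first numeric column."""
    rows = [line.split() for line in table.strip().splitlines()]
    values = [[float(x) for x in row[1:]] for row in rows]
    result = []
    for col in range(len(values[0])):
        # Average the logs of the per-test ratios, then exponentiate.
        log_sum = sum(math.log(v[col] / v[0]) for v in values)
        result.append(math.exp(log_sum / len(values)))
    return result


if __name__ == "__main__":
    names = ("3.9.0", "Mels v5", "this_patchset")
    for name, ratio in zip(names, geomean_vs_baseline(BANDWIDTH)):
        print(f"{name:>14}: {ratio:.3f}x of baseline")
```

A value above 1.0 means higher bandwidth than 3.9.0 on average. The same function could be applied separately to the nx1 rows and to the remaining rows to check whether the regression is confined to the nx1 cases.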