Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks

From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>, Ingo Molnar <mingo@kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Preeti U Murthy <preeti@linux.vnet.ibm.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
Date: Wed, 31 Jul 2013 23:05:13 +0530
Message-ID: <20130731173513.GA12770@linux.vnet.ibm.com>
In-Reply-To: <20130730093321.GO3008@twins.programming.kicks-ass.net>

* Peter Zijlstra <peterz@infradead.org> [2013-07-30 11:33:21]:

> On Tue, Jul 30, 2013 at 02:45:43PM +0530, Srikar Dronamraju wrote:
> 
> > Can you please suggest workloads that I could try which might showcase
> > why you hate pure process based approach?
> 
> 2 processes, 1 sysvshm segment. I know there's multi-process MPI
> libraries out there.
> 
> Something like: perf bench numa mem -p 2 -G 4096 -0 -z --no-data_rand_walk -Z
> 

The above dumped core; it looks like -T is required along with -G.

I tried "perf bench numa mem -p 2 -T 32 -G 4096 -0 -z --no-data_rand_walk -Z"
It still didn't seem to do anything on my 4-node box (almost 2 hours
and nothing happened).
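
A run that sits for hours without output can be bounded with a small
wrapper. The following is only a sketch, not something used for the
numbers in this mail: the helper name run_with_timeout and the 600 sec
limit are assumptions, and the perf flags are copied verbatim from the
command above.

```python
import subprocess
import sys

# Flags copied verbatim from the command quoted above; nothing is added.
PERF_CMD = [
    "perf", "bench", "numa", "mem",
    "-p", "2", "-T", "32", "-G", "4096",
    "-0", "-z", "--no-data_rand_walk", "-Z",
]


def run_with_timeout(argv, timeout_secs):
    """Run argv and return (finished, returncode).

    finished is False (and returncode is None) if the command was still
    running after timeout_secs. subprocess.run() kills the direct child
    in that case; worker processes forked by the child may need separate
    cleanup.
    """
    try:
        proc = subprocess.run(argv, timeout=timeout_secs)
    except subprocess.TimeoutExpired:
        return (False, None)
    return (True, proc.returncode)
```

Calling run_with_timeout(PERF_CMD, 600) would then give up after ten
minutes instead of two hours; the limit is arbitrary and should be
picked to suit the box.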

Finally, I ran "perf bench numa mem -a"
(with HT both disabled and enabled).

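Convergence-wise, my patchset did really well: it converged well under
100 secs in most of the NxM cases where 3.9.0 and Mel's v5 mostly ran
for ~100 secs.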

Bandwidth (bw) looks like a mixed bag: there are improvements, but we
also see degradations. I am not sure how to quantify which was the best
among the three. The nx1 tests were the ones where this patchset was
-ve; it was +ve for all others.

Is this what you were looking for? Or was it something else?

With HT OFF

(Lower is better)
testcase		3.9.0		Mel's v5	this_patchset 	Units
------------------------------------------------------------------------------
1x3-convergence		0.320		100.060		100.204		secs
1x4-convergence		100.139		100.162		100.155		secs
1x6-convergence		100.455		100.179		1.078		secs
2x3-convergence		100.261		100.339		9.743		secs
3x3-convergence		100.213		100.168		10.073		secs
4x4-convergence		100.307		100.201		19.686		secs
4x4-convergence-NOTHP	100.229		100.221		3.189		secs
4x6-convergence		101.441		100.632		6.204		secs
4x8-convergence		100.680		100.588		5.275		secs
8x4-convergence		100.335		100.365		34.069		secs
8x4-convergence-NOTHP	100.331		100.412		100.478		secs
3x1-convergence		1.227		1.536		0.576		secs
4x1-convergence		1.224		1.063		1.390		secs
8x1-convergence		1.713		2.437		1.704		secs
16x1-convergence	2.750		2.677		1.856		secs
32x1-convergence	1.985		1.795		1.391		secs
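
One rough way to summarise the convergence table above is to count, per
kernel, how many testcases finished well below the ~100 sec mark. This
is only a sketch: the 50 sec threshold and the reading of ~100 secs as
"did not converge within the run" are assumptions, not something the
benchmark output states here. The numbers are copied from the table.

```python
# (testcase, 3.9.0, Mel's v5, this_patchset) in secs, from the table above.
ROWS = [
    ("1x3-convergence", 0.320, 100.060, 100.204),
    ("1x4-convergence", 100.139, 100.162, 100.155),
    ("1x6-convergence", 100.455, 100.179, 1.078),
    ("2x3-convergence", 100.261, 100.339, 9.743),
    ("3x3-convergence", 100.213, 100.168, 10.073),
    ("4x4-convergence", 100.307, 100.201, 19.686),
    ("4x4-convergence-NOTHP", 100.229, 100.221, 3.189),
    ("4x6-convergence", 101.441, 100.632, 6.204),
    ("4x8-convergence", 100.680, 100.588, 5.275),
    ("8x4-convergence", 100.335, 100.365, 34.069),
    ("8x4-convergence-NOTHP", 100.331, 100.412, 100.478),
    ("3x1-convergence", 1.227, 1.536, 0.576),
    ("4x1-convergence", 1.224, 1.063, 1.390),
    ("8x1-convergence", 1.713, 2.437, 1.704),
    ("16x1-convergence", 2.750, 2.677, 1.856),
    ("32x1-convergence", 1.985, 1.795, 1.391),
]

KERNELS = ("3.9.0", "Mel's v5", "this_patchset")

# Assumed threshold: anything below this counts as "converged".
CAP_SECS = 50.0


def count_converged(rows, cap=CAP_SECS):
    """Return {kernel: number of testcases with time below cap}."""
    counts = dict.fromkeys(KERNELS, 0)
    for _testcase, *times in rows:
        for kernel, secs in zip(KERNELS, times):
            if secs < cap:
                counts[kernel] += 1
    return counts


print(count_converged(ROWS))
```

With the 50 sec threshold this counts 6, 5 and 13 of the 16 testcases
for 3.9.0, Mel's v5 and this patchset respectively.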


(Higher is better)
testcase		3.9.0		Mel's v5	this_patchset 	Units
------------------------------------------------------------------------------
RAM-bw-local		3.341		3.340		3.325		GB/sec
RAM-bw-local-NOTHP	3.308		3.307		3.290		GB/sec
RAM-bw-remote		1.815		1.815		1.815		GB/sec
RAM-bw-local-2x		6.410		6.413		6.412		GB/sec
RAM-bw-remote-2x	3.020		3.041		3.027		GB/sec
RAM-bw-cross		4.397		3.425		4.374		GB/sec
2x1-bw-process		3.481		3.442		3.492		GB/sec
3x1-bw-process		5.423		7.547		5.445		GB/sec
4x1-bw-process		5.108		11.009		5.118		GB/sec
8x1-bw-process		8.929		10.935		8.825		GB/sec
8x1-bw-process-NOTHP	12.754		11.442		22.889		GB/sec
16x1-bw-process		12.886		12.685		13.546		GB/sec
4x1-bw-thread		19.147		17.964		9.622		GB/sec
8x1-bw-thread		26.342		30.171		14.679		GB/sec
16x1-bw-thread		41.527		36.363		40.070		GB/sec
32x1-bw-thread		45.005		40.950		49.846		GB/sec
2x3-bw-thread		9.493		14.444		8.145		GB/sec
4x4-bw-thread		18.309		16.382		45.384		GB/sec
4x6-bw-thread		14.524		18.502		17.058		GB/sec
4x8-bw-thread		13.315		16.852		33.693		GB/sec
4x8-bw-thread-NOTHP	12.273		12.226		24.887		GB/sec
3x3-bw-thread		17.614		11.960		16.119		GB/sec
5x5-bw-thread		13.415		17.585		24.251		GB/sec
2x16-bw-thread		11.718		11.174		17.971		GB/sec
1x32-bw-thread		11.360		10.902		14.330		GB/sec
numa02-bw		48.999		44.173		54.795		GB/sec
numa02-bw-NOTHP		47.655		42.600		53.445		GB/sec
numa01-bw-thread	36.983		39.692		45.254		GB/sec
numa01-bw-thread-NOTHP	38.486		35.208		44.118		GB/sec
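
On the question of how to quantify which of the three was best: one
possible single-number summary is the geometric mean of each kernel's
bandwidth relative to 3.9.0 across all rows above. This is a sketch of
that idea, not a method used for this mail; the choice of geometric
mean and of 3.9.0 as the baseline are assumptions, and a single number
hides per-test regressions such as the nx1 thread cases.

```python
import math

# (testcase, 3.9.0, Mel's v5, this_patchset) in GB/sec, from the table above.
BW_ROWS = [
    ("RAM-bw-local", 3.341, 3.340, 3.325),
    ("RAM-bw-local-NOTHP", 3.308, 3.307, 3.290),
    ("RAM-bw-remote", 1.815, 1.815, 1.815),
    ("RAM-bw-local-2x", 6.410, 6.413, 6.412),
    ("RAM-bw-remote-2x", 3.020, 3.041, 3.027),
    ("RAM-bw-cross", 4.397, 3.425, 4.374),
    ("2x1-bw-process", 3.481, 3.442, 3.492),
    ("3x1-bw-process", 5.423, 7.547, 5.445),
    ("4x1-bw-process", 5.108, 11.009, 5.118),
    ("8x1-bw-process", 8.929, 10.935, 8.825),
    ("8x1-bw-process-NOTHP", 12.754, 11.442, 22.889),
    ("16x1-bw-process", 12.886, 12.685, 13.546),
    ("4x1-bw-thread", 19.147, 17.964, 9.622),
    ("8x1-bw-thread", 26.342, 30.171, 14.679),
    ("16x1-bw-thread", 41.527, 36.363, 40.070),
    ("32x1-bw-thread", 45.005, 40.950, 49.846),
    ("2x3-bw-thread", 9.493, 14.444, 8.145),
    ("4x4-bw-thread", 18.309, 16.382, 45.384),
    ("4x6-bw-thread", 14.524, 18.502, 17.058),
    ("4x8-bw-thread", 13.315, 16.852, 33.693),
    ("4x8-bw-thread-NOTHP", 12.273, 12.226, 24.887),
    ("3x3-bw-thread", 17.614, 11.960, 16.119),
    ("5x5-bw-thread", 13.415, 17.585, 24.251),
    ("2x16-bw-thread", 11.718, 11.174, 17.971),
    ("1x32-bw-thread", 11.360, 10.902, 14.330),
    ("numa02-bw", 48.999, 44.173, 54.795),
    ("numa02-bw-NOTHP", 47.655, 42.600, 53.445),
    ("numa01-bw-thread", 36.983, 39.692, 45.254),
    ("numa01-bw-thread-NOTHP", 38.486, 35.208, 44.118),
]


def geomean_ratio(values, baseline):
    """Geometric mean of values[i] / baseline[i]; above 1.0 means more bandwidth overall."""
    logs = [math.log(v / b) for v, b in zip(values, baseline)]
    return math.exp(sum(logs) / len(logs))


base = [row[1] for row in BW_ROWS]
for label, col in (("Mel's v5", 2), ("this_patchset", 3)):
    score = geomean_ratio([row[col] for row in BW_ROWS], base)
    print(f"{label}: {score:.3f}x of 3.9.0")
```

The geometric mean is used so that one test doubling and another
halving cancel out, which a plain average of ratios would not do.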



With HT ON

(Lower is better)
testcase		3.9.0		Mel's v5	this_patchset 	Units
------------------------------------------------------------------------------
1x3-convergence		100.114		100.138		100.084		secs
1x4-convergence		0.468		100.227		100.153		secs
1x6-convergence		100.278		100.400		100.197		secs
2x3-convergence		100.186		1.833		13.132		secs
3x3-convergence		100.302		100.457		2.087		secs
4x4-convergence		100.237		100.178		2.466		secs
4x4-convergence-NOTHP	100.148		100.251		2.985		secs
4x6-convergence		100.931		3.632		9.184		secs
4x8-convergence		100.398		100.456		4.801		secs
8x4-convergence		100.649		100.458		4.179		secs
8x4-convergence-NOTHP	100.391		100.428		9.758		secs
3x1-convergence		1.472		1.501		0.727		secs
4x1-convergence		1.478		1.489		1.408		secs
8x1-convergence		2.380		2.385		2.432		secs
16x1-convergence	3.260		3.399		2.219		secs
32x1-convergence	2.622		2.067		1.951		secs



(Higher is better)
testcase		3.9.0		Mel's v5	this_patchset 	Units
------------------------------------------------------------------------------
RAM-bw-local		3.333		3.342		3.345		GB/sec
RAM-bw-local-NOTHP	3.305		3.306		3.307		GB/sec
RAM-bw-remote		1.814		1.814		1.816		GB/sec
RAM-bw-local-2x		7.896		6.400		6.538		GB/sec
RAM-bw-remote-2x	2.982		3.038		3.034		GB/sec
RAM-bw-cross		4.313		3.427		4.372		GB/sec
2x1-bw-process		3.473		4.708		3.784		GB/sec
3x1-bw-process		5.397		4.983		5.399		GB/sec
4x1-bw-process		5.040		8.775		5.098		GB/sec
8x1-bw-process		8.989		6.862		13.745		GB/sec
8x1-bw-process-NOTHP	8.457		19.094		8.118		GB/sec
16x1-bw-process		13.482		23.067		15.138		GB/sec
4x1-bw-thread		14.904		18.258		9.713		GB/sec
8x1-bw-thread		24.160		29.153		12.495		GB/sec
16x1-bw-thread		41.283		36.642		32.140		GB/sec
32x1-bw-thread		46.983		43.068		48.153		GB/sec
2x3-bw-thread		9.718		15.344		10.846		GB/sec
4x4-bw-thread		12.602		15.758		13.148		GB/sec
4x6-bw-thread		13.807		11.278		18.540		GB/sec
4x8-bw-thread		13.316		11.677		22.795		GB/sec
4x8-bw-thread-NOTHP	12.548		21.797		30.807		GB/sec
3x3-bw-thread		13.500		18.758		18.569		GB/sec
5x5-bw-thread		14.575		14.199		36.521		GB/sec
2x16-bw-thread		11.345		11.434		19.569		GB/sec
1x32-bw-thread		14.123		10.586		14.587		GB/sec
numa02-bw		50.963		44.092		53.419		GB/sec
numa02-bw-NOTHP		50.553		42.724		51.106		GB/sec
numa01-bw-thread	33.724		33.050		37.801		GB/sec
numa01-bw-thread-NOTHP	39.064		35.139		43.314		GB/sec


