All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	Christoph Lameter <clameter@sgi.com>,
	Eric Whitney <eric.whitney@hp.com>
Subject: Re: [PATCH] change global zonelist order v4 [0/2]
Date: Fri, 04 May 2007 13:12:08 -0400	[thread overview]
Message-ID: <1178298729.5236.33.camel@localhost> (raw)
In-Reply-To: <20070503224730.3bc6f8a8.akpm@linux-foundation.org>

On Thu, 2007-05-03 at 22:47 -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 14:45:30 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Hi, this is version 4. including Lee Schermerhon's good rework.
> > and automatic configuration at boot time.
> 
> hm, this adds rather a lot of code.  Have we established that it's worth
> it?

See below.  Something is needed here on some platforms.  The current
zonelist ordering results in some unfortunate behavior on some
platforms.


> 
> And it's complex - how do poor users know what to do with this new control?
> 
Kame's autoconfig seems to be doing the right thing for our platform.
Might not be the case for other platforms, or some workloads on them.  I
suppose the documentation in sysctl.txt could be expanded to describe
when you might want to select a non-default setting, should we decide to
provide that capability.

> 
> This:
> 
> + *	= "[dD]efault | "0"	- default, automatic configuration.
> + *	= "[nN]ode"|"1" 	- order by node locality,
> + *         			  then zone within node.
> + *	= "[zZ]one"|"2" - order by zone, then by locality within zone
> 
> seems a bit excessive.  I think just the 0/1/2 plus documentation would
> suffice?

I agree, but I was considering dropping the "0/1/2" in favor of the more
descriptive [IMO] values ;-).

> 
> 
> I haven't followed this discussion very closely I'm afraid.  If we came up
> with a good reason why Linux needs this feature then could someone please
> (re)describe it?

Kame originally described the need for it in:

	http://marc.info/?l=linux-mm&m=117747120307559&w=4

I chimed in with support as we have a similar need for our cell-based
ia64 platforms:

	http://marc.info/?l=linux-mm&m=117760331328012&w=4

I can easily consume all of DMA on our platforms [configured as 100%
"cell local memory" -- always leaves some "cache-line interleaved" at
phys addr zero => ZONE_DMA] by allocating, e.g., a shared memory segment
of size > 1 node's memory + size of ZONE_DMA.  This occurs because the
node containing zone DMA is always 2nd in a zone's ZONE_NORMAL zonelist
[after the zone itself, assuming it has memory].  Then, any driver that
requests memory from ZONE_DMA will be denied, resulting in IO errors,
death of hald [maybe that's a feature? ;-)], ...

I guess I would be happy with Kame's V3 patch that unconditionally
changes the order to be zone first--i.e., ZONE_NORMAL for all nodes
before ZONE_DMA*:

	http://marc.info/?l=linux-mm&m=117758484122663&w=4

However, this patch apparently crossed in the mail with Christoph's
observation that making the new order [zone order] the default w/o any
option wouldn't be appropriate for some configurations:

	http://marc.info/?l=linux-mm&m=117760245022005&w=4

Meanwhile, I was factoring out common code in Kame's V1/V2 patch and
adding the "excessive" user interface to the boot parameter/sysctl.
After some additional rework, Kame posted this a V4--the one you're
questioning.

If we decide to proceed with this, I have another "cleanup" patch that
eliminates some redundant "estimating of zone order" [autoconfig] and
reports what order was chosen in the "Build %d zonelists..." message.


Lee


WARNING: multiple messages have this Message-ID (diff)
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	Christoph Lameter <clameter@sgi.com>,
	Eric Whitney <eric.whitney@hp.com>
Subject: Re: [PATCH] change global zonelist order v4 [0/2]
Date: Fri, 04 May 2007 13:12:08 -0400	[thread overview]
Message-ID: <1178298729.5236.33.camel@localhost> (raw)
In-Reply-To: <20070503224730.3bc6f8a8.akpm@linux-foundation.org>

On Thu, 2007-05-03 at 22:47 -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 14:45:30 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Hi, this is version 4. including Lee Schermerhon's good rework.
> > and automatic configuration at boot time.
> 
> hm, this adds rather a lot of code.  Have we established that it's worth
> it?

See below.  Something is needed here on some platforms.  The current
zonelist ordering results in some unfortunate behavior on some
platforms.


> 
> And it's complex - how do poor users know what to do with this new control?
> 
Kame's autoconfig seems to be doing the right thing for our platform.
Might not be the case for other platforms, or some workloads on them.  I
suppose the documentation in sysctl.txt could be expanded to describe
when you might want to select a non-default setting, should we decide to
provide that capability.

> 
> This:
> 
> + *	= "[dD]efault | "0"	- default, automatic configuration.
> + *	= "[nN]ode"|"1" 	- order by node locality,
> + *         			  then zone within node.
> + *	= "[zZ]one"|"2" - order by zone, then by locality within zone
> 
> seems a bit excessive.  I think just the 0/1/2 plus documentation would
> suffice?

I agree, but I was considering dropping the "0/1/2" in favor of the more
descriptive [IMO] values ;-).

> 
> 
> I haven't followed this discussion very closely I'm afraid.  If we came up
> with a good reason why Linux needs this feature then could someone please
> (re)describe it?

Kame originally described the need for it in:

	http://marc.info/?l=linux-mm&m=117747120307559&w=4

I chimed in with support as we have a similar need for our cell-based
ia64 platforms:

	http://marc.info/?l=linux-mm&m=117760331328012&w=4

I can easily consume all of DMA on our platforms [configured as 100%
"cell local memory" -- always leaves some "cache-line interleaved" at
phys addr zero => ZONE_DMA] by allocating, e.g., a shared memory segment
of size > 1 node's memory + size of ZONE_DMA.  This occurs because the
node containing zone DMA is always 2nd in a zone's ZONE_NORMAL zonelist
[after the zone itself, assuming it has memory].  Then, any driver that
requests memory from ZONE_DMA will be denied, resulting in IO errors,
death of hald [maybe that's a feature? ;-)], ...

I guess I would be happy with Kame's V3 patch that unconditionally
changes the order to be zone first--i.e., ZONE_NORMAL for all nodes
before ZONE_DMA*:

	http://marc.info/?l=linux-mm&m=117758484122663&w=4

However, this patch apparently crossed in the mail with Christoph's
observation that making the new order [zone order] the default w/o any
option wouldn't be appropriate for some configurations:

	http://marc.info/?l=linux-mm&m=117760245022005&w=4

Meanwhile, I was factoring out common code in Kame's V1/V2 patch and
adding the "excessive" user interface to the boot parameter/sysctl.
After some additional rework, Kame posted this a V4--the one you're
questioning.

If we decide to proceed with this, I have another "cleanup" patch that
eliminates some redundant "estimating of zone order" [autoconfig] and
reports what order was chosen in the "Build %d zonelists..." message.


Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2007-05-04 17:12 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-04-27  5:45 [PATCH] change global zonelist order v4 [0/2] KAMEZAWA Hiroyuki
2007-04-27  5:45 ` KAMEZAWA Hiroyuki
2007-04-27  6:04 ` [PATCH] change global zonelist order v4 [1/2] change zonelist ordering KAMEZAWA Hiroyuki
2007-04-27  6:04   ` KAMEZAWA Hiroyuki
2007-04-30 16:12   ` Lee Schermerhorn
2007-04-30 16:12     ` Lee Schermerhorn
2007-04-27  6:17 ` [PATCH] change global zonelist order v4 [2/2] auto configuration KAMEZAWA Hiroyuki
2007-04-27  6:17   ` KAMEZAWA Hiroyuki
2007-04-30 16:26   ` Lee Schermerhorn
2007-04-30 16:26     ` Lee Schermerhorn
2007-05-04  5:47 ` [PATCH] change global zonelist order v4 [0/2] Andrew Morton
2007-05-04  5:47   ` Andrew Morton
2007-05-04 15:26   ` Jesse Barnes
2007-05-04 15:26     ` Jesse Barnes
2007-05-04 16:18     ` Christoph Lameter
2007-05-04 16:18       ` Christoph Lameter
2007-05-04 17:24       ` Lee Schermerhorn
2007-05-04 17:24         ` Lee Schermerhorn
2007-05-04 17:28         ` Christoph Lameter
2007-05-04 17:28           ` Christoph Lameter
2007-05-04 17:36           ` Jesse Barnes
2007-05-04 17:36             ` Jesse Barnes
2007-05-04 18:03             ` Christoph Lameter
2007-05-04 18:03               ` Christoph Lameter
2007-05-04 17:12   ` Lee Schermerhorn [this message]
2007-05-04 17:12     ` Lee Schermerhorn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1178298729.5236.33.camel@localhost \
    --to=lee.schermerhorn@hp.com \
    --cc=akpm@linux-foundation.org \
    --cc=clameter@sgi.com \
    --cc=eric.whitney@hp.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.