Re: [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE

From: Tejun Heo <tj@kernel.org>
To: Yinghai Lu <yinghai@kernel.org>
Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com>,
	Zhang Yanfei <zhangyanfei@cn.fujitsu.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Toshi Kani <toshi.kani@hp.com>,
	Ingo Molnar <mingo@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE
Date: Mon, 14 Oct 2013 16:04:37 -0400
Message-ID: <20131014200437.GA5720@htj.dyndns.org>
In-Reply-To: <CAE9FiQX0xeR61ehPy2SfWfN-fgSZVS=dtaWT-QQ+RR12tx+A0w@mail.gmail.com>

Hello, Yinghai.

On Mon, Oct 14, 2013 at 12:34:49PM -0700, Yinghai Lu wrote:
> The points for parsing SRAT early instead of Yanfei/Tang v7:
>
> 1. We just reached one unified path to setup page tables for 32bit,
> 64bit and xen or non xen after several years. We should not have add
> another path for system
> that support hotplug.

The separate code path we're talking about is tiny.  It's just an
extra function for page table allocation and another for memblock
allocation which is symmetric to the existing one.  Sure, there are
benefits to not diverging code paths but these are fairly trivial in
terms of maintenance overhead and test coverage.
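
As a rough illustration of what "symmetric to the existing one" means,
here is a minimal sketch of a top-down range search next to its
bottom-up mirror.  The struct and function names are hypothetical and
are not the kernel's memblock API; the sketch assumes the free ranges
are sorted by address, that align is a power of two, and it uses 0 as
the failure value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-range descriptor, not the kernel's memblock_region. */
struct free_range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
};

/* align must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

static uint64_t align_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

/* Walk ranges from the highest down; return the highest fitting address. */
uint64_t find_top_down(const struct free_range *r, size_t n,
		       uint64_t size, uint64_t align)
{
	for (size_t i = n; i-- > 0; ) {
		uint64_t cand;

		if (r[i].end - r[i].start < size)
			continue;	/* too small, also avoids underflow */
		cand = align_down(r[i].end - size, align);
		if (cand >= r[i].start)
			return cand;
	}
	return 0;	/* nothing fits */
}

/* Walk ranges from the lowest up; return the lowest fitting address. */
uint64_t find_bottom_up(const struct free_range *r, size_t n,
			uint64_t size, uint64_t align)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t cand = align_up(r[i].start, align);

		if (cand <= r[i].end && r[i].end - cand >= size)
			return cand;
	}
	return 0;	/* nothing fits */
}
```

The two functions differ only in the walk direction and in which end
of the range the candidate is aligned against, which is the sense in
which the extra path stays small.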

> 2. also we should avoid adding "movable_nodes" command line.

Can we?  What about the pgdat?  We're allocating them off-node with
movable_nodes which can't be the default behavior.

> 3. debug mapping 4k, and it is working all the way, why breaking it even for
> memory hotplug path?

If it comes for free, sure, no reason to break it.  On the other hand,
if maintaining it fully with a niche feature costs overhead, it's
something to be traded off.  It's not like using 4k page mapping with
bottom-up allocation will be immediately broken either.  It might
affect devices which can't DMA to higher addresses on gigantic
machines under debug configs.  It's quite a corner case.
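
To put a number on why this only matters on gigantic machines, the
sketch below estimates how much memory the last-level page tables take
for a given amount of RAM and page size.  It assumes x86-64 style
tables (4 KiB per table, 512 eight-byte entries) and ignores the upper
levels; the function name is made up for illustration.  The reading
that these tables would end up in low memory with bottom-up allocation
is an interpretation of the point above, not something measured.

```c
#include <stdint.h>

/* Assumed x86-64 layout: a 4 KiB table holds 512 eight-byte entries. */
#define ENTRIES_PER_TABLE	512ULL
#define TABLE_BYTES		4096ULL

/*
 * Bytes of last-level page tables needed to map mem_bytes of memory
 * using pages of page_bytes each.  Upper-level tables are ignored.
 */
uint64_t leaf_table_bytes(uint64_t mem_bytes, uint64_t page_bytes)
{
	uint64_t pages = (mem_bytes + page_bytes - 1) / page_bytes;
	uint64_t tables = (pages + ENTRIES_PER_TABLE - 1) / ENTRIES_PER_TABLE;

	return tables * TABLE_BYTES;
}
```

For 1 TiB of RAM this gives 2 GiB of tables with 4k pages versus 4 MiB
with 2M pages, so only the 4k debug case takes a noticeable bite out
of the low addresses that DMA-limited devices depend on.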

> 4. numa_meminfo now is static structure.
> we have no reason that we can not parse SRAT etc to fill that struct.

Sure, there's no reason we can't.  The whole point is that the
benefits aren't strong enough.  We don't do things just because we
can.

> 5. for device tree, i assume that we could do same like srat parsing to find out
> numa to fill the numa_meminfo early. or with help of BRK.

Digesting device tree involves a lot more complexity.  That's the whole
reason why things like SRAT are broken into tables in the first place.
We'll basically be pulling a huge chunk of ACPICA into early boot.
Again, justifications.  The *only* things which may benefit from that
are debug setups.  We'll have to pull in a lot of complexity before
page table setup and modify page table allocation to be
memory-device-specific just for debug configs, which is not a good
trade-off.  The benefit / cost ratio doesn't make any sense.

> 6. in the long run, We should rework our NUMA booting:
> a. boot system with boot numa nodes early only.
> b. in later init stage or user space, init other nodes
> RAM/CPU/PCI...in parallel.
> that will reduce boot time for 8 sockets/32 sockets dramatically.
> 
> We will need to parse srat table early so could avoid init memory for
> non-boot nodes.

Among the six you listed, this one sounds somewhat valid, but still,
assuming huge pages, what difference does it make?  We're just talking
about page table alloc / init and ACPI init.  If you wanna speed up
huge NUMA machine booting and chop down memory init per-NUMA, sure,
move those pieces to later stages.  You can init the amount necessary
during early boot and then bring up the rest later on.  I don't see
why that'd require parsing SRAT.  In fact, I think there'll be more
cases where you want to actively ignore NUMA mapping during early
boot.  What if the system maps low memory to a non-boot NUMA node?

Optimizing NUMA boot just requires moving the heavy lifting to
appropriate NUMA nodes.  It doesn't require that early boot phase
should strictly follow NUMA node boundaries.

Thanks.

-- 
tejun

