From: Jaegeuk Hanse <jaegeuk.hanse@gmail.com>
To: Jiang Liu <jiang.liu@huawei.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>,
hpa@zytor.com, akpm@linux-foundation.org, rob@landley.net,
isimatu.yasuaki@jp.fujitsu.com, laijs@cn.fujitsu.com,
wency@cn.fujitsu.com, linfeng@cn.fujitsu.com, yinghai@kernel.org,
kosaki.motohiro@jp.fujitsu.com, minchan.kim@gmail.com,
mgorman@suse.de, rientjes@google.com, rusty@rustcorp.com.au,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
linux-doc@vger.kernel.org, Len Brown <lenb@kernel.org>,
Tony Luck <tony.luck@intel.com>,
"Wang, Frank" <frank.wang@intel.com>
Subject: Re: [PATCH v2 0/5] Add movablecore_map boot option
Date: Thu, 29 Nov 2012 09:42:51 +0800
Message-ID: <20121129014251.GA9217@kernel>
In-Reply-To: <50B5CFAE.80103@huawei.com>

On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>Hi all,
> Seems it's a great chance to discuss the memory hotplug feature
>within this thread. So I will try to give some high-level thoughts about the memory
>hotplug feature on x86/IA64. Any comments are welcome!
> First of all, I think usability really matters. Ideally, the memory hotplug
>feature should just work out of the box, and we shouldn't expect administrators to
>add several extra platform-dependent parameters to enable memory hotplug.
>But how do we enable memory (or CPU/node) hotplug out of the box? I think the key point
>is to cooperate with BIOS/ACPI/firmware/device management teams.
> I still position memory hotplug as an advanced feature for high-end
>servers, and those systems may/should provide some management interfaces to
>configure CPU/memory/node hotplug features. The configuration UI may be provided
>by the BIOS, the BMC or a centralized system management suite. Once an administrator enables
>the hotplug feature through that management UI, the OS should support system device
>hotplug out of the box. For example, the HP SuperDome2 management suite provides an interface
>to configure a node as a floating (hot-removable) node. And OpenSolaris supports
>CPU/memory hotplug out of the box without any extra configuration. So we should
>shape the interfaces between firmware and OS to better support system device hotplug.
> On the other hand, I think there are no commercially available x86/IA64
>platforms with system device hotplug capabilities in the field yet, or at most
>a limited quantity. So backward compatibility is not a big issue for us now,
>and I think it's doable to rely on firmware to provide better support for system
>device hotplug.
> Then what should be enhanced to better support system device hotplug?
>
>1) The ACPI specification should be enhanced to provide a static table describing
>components with hotplug features, so the OS can reserve special resources for
>hotplug at early boot stages, for example enough CPU IDs for CPU
>hot-add. Currently we guess the maximum number of CPUs supported by the platform
>by counting CPU entries in the APIC table, which is not reliable.
>
>2) The BIOS should implement the SRAT, MPST and PMTT tables to better support memory
>hotplug. SRAT associates memory ranges with proximity domains and carries an extra
>"hotpluggable" flag. PMTT provides memory device topology information, such
>as "socket->memory controller->DIMM". MPST is used for memory power management
>and provides a way to associate memory ranges with memory devices in PMTT.
>With all the information from SRAT, MPST and PMTT, the OS could figure out hotplug
>memory ranges automatically, so no extra kernel parameters are needed.
>
>3) Enhance ACPICA to provide a method to scan static ACPI tables before
>the memory subsystem has been initialized, because the OS needs to access SRAT,
>MPST and PMTT when initializing the memory subsystem.
>
>4) The last and most important issue is how to minimize the performance
>drop caused by memory hotplug. As proposed by this patchset, once we
>configure all memory of a NUMA node as movable, it essentially disables
>NUMA optimization of kernel memory allocation from that node. In our
>experience, that causes a huge performance drop. We have observed a
>10-30% performance drop with memory hotplug enabled. And on another
>OS the average performance drop caused by memory hotplug is about 10%.
>If we can't resolve the performance drop, memory hotplug is just a feature
>for demo:( With help from hardware, we do have some chances to reduce
>the performance penalty caused by memory hotplug.
> As we know, Linux can migrate movable pages, but can't migrate
>non-movable pages used by kernel/DMA etc. And the hardest part is how
>to deal with those unmovable pages when hot-removing a memory device.
>Now hardware has given us a hand with a technology named memory migration,
>which can transparently migrate memory between memory devices. There are
>no OS-visible changes except the NUMA topology before and after hardware memory
>migration.
> And if there are multiple memory devices within a NUMA node,
>we could configure some memory devices to host unmovable memory and the
>others to host movable memory. With this configuration, there won't be
>a big performance drop because we have preserved all NUMA optimizations.
>We could also achieve memory hot-remove by:
>1) Use existing page migration mechanism to reclaim movable pages.
>2) For memory devices hosting unmovable pages, we need:
>2.1) find a movable memory device on other nodes with enough capacity
>and reclaim it.
>2.2) use hardware migration technology to migrate unmovable memory to

Hi Jiang,

Could you explain how hardware migration technology works?

Regards,
Jaegeuk

>the just reclaimed memory device on other nodes.
>
> I hope we can expect users to adopt memory hotplug technology
>once all of this is implemented.
>
> Back to this patch, we could rely on the mechanism it provides
>to automatically mark memory ranges as movable using information
>from the ACPI SRAT/MPST/PMTT tables. So we don't need the administrator to
>manually configure kernel parameters to enable memory hotplug.
>
> Again, any comments are welcome!
>
>Regards!
>Gerry
>
>
>On 2012-11-23 18:44, Tang Chen wrote:
>> [What we are doing]
>> This patchset provides a boot option for the user to specify the ZONE_MOVABLE memory
>> map for each node in the system.
>>
>> movablecore_map=nn[KMG]@ss[KMG]
>>
>> This option makes sure the memory range from ss to ss+nn is movable memory.
>>
>>
>> [Why we do this]
>> If we hot remove memory, that memory cannot contain kernel memory,
>> because Linux currently cannot migrate kernel memory. Therefore,
>> we have to guarantee that the hot-removed memory contains only movable
>> memory.
>>
>> Linux has two boot options, kernelcore= and movablecore=, for
>> creating movable memory. These boot options specify the amount
>> of memory to use as kernel or movable memory. Using them, we can
>> create a ZONE_MOVABLE which has only movable memory.
>>
>> But this does not fulfill a requirement of memory hot remove, because
>> even if we specify these boot options, movable memory is distributed
>> evenly across the nodes. So when we want to hot remove memory whose
>> range is 0x80000000-0xc0000000, we have no way to specify
>> that memory as movable memory.
>>
>> So we proposed a new feature which specifies memory range to use as
>> movable memory.
>>
>>
>> [Ways to do this]
>> There may be 2 ways to specify movable memory.
>> 1. use firmware information
>> 2. use boot option
>>
>> 1. use firmware information
>> According to ACPI spec 5.0, the SRAT table has a memory affinity structure
>> and the structure has a Hot Pluggable field. See "5.2.16.2 Memory
>> Affinity Structure". If we use this information, we might be able to
>> specify movable memory by firmware. For example, if the Hot Pluggable
>> field is set, Linux sets the memory as movable memory.
>>
>> 2. use boot option
>> This is our proposal. New boot option can specify memory range to use
>> as movable memory.
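
To make the firmware-based way concrete, here is a minimal Python sketch of
how hot-pluggable ranges could be picked out of SRAT memory affinity entries.
It is only an illustration, not code from this patchset: the MemAffinity
class and hotpluggable_ranges() are hypothetical names, and the entries are
assumed to be already decoded from the table. The flag bit positions (bit 0
Enabled, bit 1 Hot Pluggable) follow the Memory Affinity Structure in the
ACPI spec.

```python
from dataclasses import dataclass

# Flag bits of the SRAT Memory Affinity Structure (ACPI 5.0, 5.2.16.2).
SRAT_MEM_ENABLED = 1 << 0
SRAT_MEM_HOT_PLUGGABLE = 1 << 1


@dataclass(frozen=True)
class MemAffinity:
    """Hypothetical, already-decoded view of one memory affinity entry."""
    proximity_domain: int
    base: int      # physical base address in bytes
    length: int    # length in bytes
    flags: int


def hotpluggable_ranges(entries):
    """Return (start, end) byte ranges, end exclusive, that are enabled and hot pluggable."""
    wanted = SRAT_MEM_ENABLED | SRAT_MEM_HOT_PLUGGABLE
    return [
        (e.base, e.base + e.length)
        for e in entries
        if (e.flags & wanted) == wanted
    ]
```

Ranges returned this way are the ones an OS could treat as candidates for
movable memory without any boot option.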
>>
>>
>> [How we do this]
>> We chose the second way, because with the first way users cannot easily change
>> the memory range to use as movable memory. We think that creating
>> movable memory may cause a NUMA performance regression. In that case,
>> the user can easily turn off the feature if we provide the boot option.
>> And with the boot option, the user can easily select which memory
>> to use as movable memory.
>>
>>
>> [How to use]
>> Specify the following boot option:
>> movablecore_map=nn[KMG]@ss[KMG]
>>
>> That means physical address range from ss to ss+nn will be allocated as
>> ZONE_MOVABLE.
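
As an illustration of the nn[KMG]@ss[KMG] syntax, the following Python sketch
turns one option value into a byte range. It is not the kernel's parser:
parse_movablecore_map() is a hypothetical helper, and accepting decimal or
0x-prefixed hex numbers with an optional K/M/G suffix is an assumption about
what the option should accept.

```python
import re

# Binary multipliers for the optional size suffix.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
_NUM = r"(0x[0-9a-f]+|\d+)([KMG]?)"
_PATTERN = re.compile(rf"^{_NUM}@{_NUM}$", re.IGNORECASE)


def _to_bytes(digits, suffix):
    """Convert a decimal or 0x-prefixed number plus optional suffix to bytes."""
    base = 16 if digits.lower().startswith("0x") else 10
    return int(digits, base) * _UNITS[suffix.upper()]


def parse_movablecore_map(value):
    """Parse 'nn[KMG]@ss[KMG]' and return (start, end) in bytes, end exclusive."""
    match = _PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"malformed movablecore_map value: {value!r}")
    size = _to_bytes(match.group(1), match.group(2))
    start = _to_bytes(match.group(3), match.group(4))
    if size == 0:
        raise ValueError("size must be non-zero")
    return start, start + size
```

For example, movablecore_map=1G@2G describes the range from 2G up to, but not
including, 3G.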
>>
>> And the following points should be considered.
>>
>> 1) If the range lies within a single node, then from ss to the end of
>> the node will be ZONE_MOVABLE.
>> 2) If the range covers two or more nodes, then from ss to the end of
>> the node will be ZONE_MOVABLE, and all the other nodes will only
>> have ZONE_MOVABLE.
>> 3) If no range is in the node, then the node will have no ZONE_MOVABLE
>> unless kernelcore or movablecore is specified.
>> 4) This option could be specified at most MAX_NUMNODES times.
>> 5) If kernelcore or movablecore is also specified, movablecore_map will have
>> higher priority to be satisfied.
>> 6) This option has no conflict with memmap option.
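
Points 1) to 3) above can be sketched as a small Python function that
computes, for each node, where ZONE_MOVABLE would start. This is one reading
of the rules, not the patch's code: movable_start_per_node() is a
hypothetical name, nodes and ranges are plain (start, end) byte pairs with
end exclusive, and "all the other nodes" in point 2) is assumed to mean the
other nodes that the range touches.

```python
def movable_start_per_node(nodes, ranges):
    """Return, per node, the address where ZONE_MOVABLE starts, or None.

    nodes:  list of (start, end) physical ranges, one per node, end exclusive
    ranges: list of (start, end) ranges given via movablecore_map
    """
    result = []
    for node_start, node_end in nodes:
        # A range affects this node only if the two intervals overlap.
        starts = [
            max(range_start, node_start)
            for range_start, range_end in ranges
            if range_start < node_end and range_end > node_start
        ]
        # Everything from the lowest overlapping address to the end of the
        # node becomes ZONE_MOVABLE; with no overlap the node has none.
        result.append(min(starts) if starts else None)
    return result
```

With three 4G nodes and a single range covering 2G to 6G, the first node is
movable from 2G, the second node is movable from its start at 4G, and the
third node has no ZONE_MOVABLE.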
>>
>>
>>
>> Tang Chen (4):
>> page_alloc: add movable_memmap kernel parameter
>> page_alloc: Introduce zone_movable_limit[] to keep movable limit for
>> nodes
>> page_alloc: Make movablecore_map has higher priority
>> page_alloc: Bootmem limit with movablecore_map
>>
>> Yasuaki Ishimatsu (1):
>> x86: get pg_data_t's memory from other node
>>
>> Documentation/kernel-parameters.txt | 17 +++
>> arch/x86/mm/numa.c | 11 ++-
>> include/linux/memblock.h | 1 +
>> include/linux/mm.h | 11 ++
>> mm/memblock.c | 15 +++-
>> mm/page_alloc.c | 216 ++++++++++++++++++++++++++++++++++-
>> 6 files changed, 263 insertions(+), 8 deletions(-)
>>
>>
>> .
>>
>
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org. For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>