From: Michal Hocko <mhocko@kernel.org>
To: Igor Mammedov <imammedo@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Greg KH <gregkh@linuxfoundation.org>,
"K. Y. Srinivasan" <kys@microsoft.com>,
David Rientjes <rientjes@google.com>,
Daniel Kiper <daniel.kiper@oracle.com>,
linux-api@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
linux-s390@vger.kernel.org, xen-devel@lists.xenproject.org,
linux-acpi@vger.kernel.org, qiuxishi@huawei.com,
toshi.kani@hpe.com, xieyisheng1@huawei.com, slaoub@gmail.com,
iamjoonsoo.kim@lge.com, vbabka@suse.cz,
Zhang Zhen <zhenzhang.zhang@huawei.com>,
Reza Arbab <arbab@linux.vnet.ibm.com>,
Yasuaki Ishimatsu <yasu.isimatu@gmail.com>,
Tang Chen <tangchen@cn.fujitsu.com>
Subject: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
Date: Fri, 10 Mar 2017 14:58:07 +0100
Message-ID: <20170310135807.GI3753@dhcp22.suse.cz>
In-Reply-To: <20170309125400.GI11592@dhcp22.suse.cz>

Let's CC people touching this logic. A short summary is that onlining
memory via udev is currently unusable for online_movable, because blocks
are added starting from the lowest addresses while only the last block is
allowed to become movable. More below.

On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
> > On Mon, 6 Mar 2017 15:54:17 +0100
> > Michal Hocko <mhocko@kernel.org> wrote:
> >
> > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> [...]
> > > > in current mainline kernel it triggers following code path:
> > > >
> > > > online_pages()
> > > >     ...
> > > >     if (online_type == MMOP_ONLINE_KERNEL) {
> > > >             if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
> > > >                     return -EINVAL;
> > >
> > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here
> > pretty much, reproducer is above so try and see for yourself
>
> I will play with this...

OK, so I did, running qemu with
	-m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G
which generated
[...]
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
[ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
[ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
[ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
[ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
[ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]
So there is neither a Normal nor a Movable zone at boot time.
Then I hotplugged a 1G slot:
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Unfortunately the memory didn't show up automatically and I got
[ 116.375781] acpi PNP0C80:00: Enumeration failure
so I had to probe it manually (probably the BIOS my qemu uses doesn't
support auto probing - I haven't really dug further). Anyway, the SRAT
table printed during boot says that we should start at 0x100000000:
# echo 0x100000000 > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory32/valid_zones
Normal Movable
which looks reasonable, right? Both Normal and Movable zones are allowed.
# echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
Huh, so our valid_zones have changed under our feet...
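
As a cross-check on the block numbering in these listings, here is a
minimal sketch of how the probed physical addresses map to the memoryN
directory names. It assumes 128 MiB memory blocks, which is what the
(128<<20) step in the probe commands implies; block_index() is a
hypothetical helper for illustration, not a kernel or sysfs interface.

```python
# Assumption: 128 MiB memory blocks, matching the (128<<20) step
# used in the probe commands in this mail.
MEMORY_BLOCK_SIZE = 128 << 20


def block_index(phys_addr: int) -> int:
    """Hypothetical helper: the N in /sys/devices/system/memory/memoryN."""
    if phys_addr % MEMORY_BLOCK_SIZE:
        raise ValueError("address is not memory-block aligned")
    return phys_addr // MEMORY_BLOCK_SIZE


if __name__ == "__main__":
    base = 0x100000000  # start of the hotpluggable SRAT range
    for i in range(3):
        addr = base + i * MEMORY_BLOCK_SIZE
        print(f"{addr:#x} -> memory{block_index(addr)}")
```

Under that assumption the three probed addresses land on memory32,
memory33 and memory34, which agrees with the sysfs paths in the output.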
# echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable
and again. So only the last memblock is considered movable. Let's try to
online them now.
# echo online_movable > /sys/devices/system/memory/memory34/state
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal
This would explain why onlining from the last block actually works, but
to me this sounds like a completely crappy behavior. All we need to
guarantee AFAICS is that Normal and Movable zones do not overlap. I
believe there is not even a real requirement about the ordering of the
physical memory in Normal vs. Movable zones as long as they do not
overlap. But let's keep it simple for a start and always enforce the
current status quo that the Normal zone physically precedes the Movable
zone.
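
The ordering problem can be made concrete with a toy model. This is only
a sketch fitted to the valid_zones output shown above for the case where
all three blocks are probed before any of them is onlined; it is not
derived from the kernel source, and movable_allowed() and
online_movable_in_order() are hypothetical names. The assumed rule is
that a block may go Movable only if it is the highest block that has not
already been onlined as movable.

```python
def movable_allowed(block, present, movable):
    """Toy rule fitted to the transcript: only the highest block that is
    not yet movable may be onlined as movable."""
    not_movable = sorted(b for b in present if b not in movable)
    return bool(not_movable) and block == not_movable[-1]


def online_movable_in_order(order, present):
    """Try 'online_movable' on each block in the given order and record
    whether the toy rule would accept it."""
    movable = set()
    results = []
    for block in order:
        ok = movable_allowed(block, present, movable)
        if ok:
            movable.add(block)
        results.append((block, ok))
    return results


if __name__ == "__main__":
    present = {32, 33, 34}
    print("ascending: ", online_movable_in_order([32, 33, 34], present))
    print("descending:", online_movable_in_order([34, 33, 32], present))
```

In this model, walking the blocks from the lowest address upwards gets
the first two requests rejected, while walking from the highest address
downwards succeeds for every block.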
Can somebody explain why we cannot have a simple rule for Normal vs.
Movable which would be:
	- block [pfn, pfn+block_size] can be Normal if
	  !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
	- block [pfn, pfn+block_size] can be Movable if
	  !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn
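
The two conditions above can be sketched as a small executable model.
This is an illustration under stated assumptions, not kernel code:
ZoneSpan, can_be_normal(), can_be_movable() and valid_zones() are
hypothetical names, zone and block ranges are treated as half-open
[start, end) pfn intervals, and the strict "<" of the rule is read as
"does not overlap", so the comparisons become "<=" to let directly
adjacent blocks qualify. Block size is assumed to be 128 MiB with 4 KiB
pages.

```python
from dataclasses import dataclass


@dataclass
class ZoneSpan:
    """Hypothetical stand-in for a zone's pfn span, half-open [start, end)."""
    start_pfn: int = 0
    end_pfn: int = 0

    def populated(self) -> bool:
        return self.end_pfn > self.start_pfn


def can_be_normal(pfn, nr_pages, movable):
    # Normal is fine if there is no Movable zone yet, or the block ends
    # at or below the start of the Movable zone.
    return not movable.populated() or pfn + nr_pages <= movable.start_pfn


def can_be_movable(pfn, nr_pages, normal):
    # Movable is fine if there is no Normal zone yet, or the block starts
    # at or above the end of the Normal zone.
    return not normal.populated() or normal.end_pfn <= pfn


def valid_zones(pfn, nr_pages, normal, movable):
    zones = []
    if can_be_normal(pfn, nr_pages, movable):
        zones.append("Normal")
    if can_be_movable(pfn, nr_pages, normal):
        zones.append("Movable")
    return zones


if __name__ == "__main__":
    PAGES_PER_BLOCK = (128 << 20) >> 12   # assumed: 128 MiB blocks, 4 KiB pages
    first = 0x100000000 >> 12             # pfn of the first hotplugged block
    blocks = [first + i * PAGES_PER_BLOCK for i in range(3)]

    # Lowest block onlined as Normal, highest block onlined as Movable.
    normal = ZoneSpan(blocks[0], blocks[0] + PAGES_PER_BLOCK)
    movable = ZoneSpan(blocks[2], blocks[2] + PAGES_PER_BLOCK)
    for pfn in blocks:
        print(hex(pfn), valid_zones(pfn, PAGES_PER_BLOCK, normal, movable))
```

With both zones empty, every block is eligible for either zone, so the
order in which blocks are onlined no longer matters; once a block is
onlined, only the blocks that would break the non-overlap requirement
lose an option.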
I haven't fully grokked all the restrictions on the movable zone size
based on the kernel parameters (find_zone_movable_pfns_for_nodes), but I
believe they shouldn't make the situation much more complicated, because
those parameters should mostly be about static initialization rather
than hotplug. I might easily be missing something, though.
--
Michal Hocko
SUSE Labs