From: Greg Kurz <groug@kaod.org>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: "Laurent Vivier" <lvivier@redhat.com>,
qemu-ppc@nongnu.org, "Daniel P. Berrangé" <berrange@redhat.com>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [Qemu-ppc] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Date: Mon, 2 Sep 2019 15:58:53 +0200
Message-ID: <20190902155853.5ecb42e2@bahia.lan>
In-Reply-To: <20190902062718.GG415@umbus.fritz.box>
On Mon, 2 Sep 2019 16:27:18 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> > On Fri, 30 Aug 2019 17:34:13 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > > crashes.
> > > >
> > > > This happens because linux kernel needs to know the NUMA topology at
> > > > start to be able to initialize the distance lookup table.
> > > >
> > > > On pseries, the topology is provided by the firmware via the existing
> > > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > > discovered by the kernel.
> > > >
> > > > To avoid the kernel crash, do not allow to start pseries with empty
> > > > nodes.
> > >
> > > This describes one possible guest OS. Is there any reasonable chance
> > > that a non-Linux guest might be able to handle this situation correctly,
> > > or do you expect any guest to have the same restriction ?
>
> That's... a more complicated question than you'd think.
>
> The problem here is it's not really obvious in PAPR how topology
> information for nodes without memory should be described in the device
> tree (which is the only way we give that information to the guest).
>
The reported issue is about a node without memory AND without CPUs.
> It's possible there's some way to encode this information that would
> make AIX happy and we just need to fix Linux to cope with that, but
> it's not really clear what it would be.
>
> > I can try to grab an AIX image and give a try, but anyway this looks like
> > a very big hammer to me... :-\
>
> I'm not really sure why everyone seems to think losing zero-memory
> node capability is such a big deal. It's never worked in practice on
> POWER and we can always put it back if we figure out a sensible way to
> do it.
>
It isn't really about losing the memory-less/cpu-less node capability, but
more about finding the appropriate fix. The changelog doesn't give many
clues about what exactly is happening: QEMU command line? Linux call stack?
For example, I could hit a crash with the following command line:
-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0 \
-numa node,nodeid=1
(qemu) info numa
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
(qemu) device_add host-spapr-cpu-core,core-id=1
[ 24.507552] Built 1 zonelists, mobility grouping on. Total pages: 7656
[ 24.507592] Policy zone: Normal
[ 24.553481] WARNING: workqueue cpumask: online intersect > possible intersect
[ 24.608814] BUG: Unable to handle kernel data access at 0x14e13da04c5bc37e
[ 24.608875] Faulting instruction address: 0xc000000000175650
[ 24.608931] Oops: Kernel access of bad area, sig: 11 [#1]
[ 24.608976] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA pSeries
[ 24.609042] Modules linked in: virtio_net vmx_crypto net_failover failover crct10dif_vpmsum ip_tables xfs libcrc32c crc32c_vpmsum virtio_blk kvm rpadlpar_io rpaphp 9p fscache 9pnet_virtio 9pnet
[ 24.609222] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.1.17-300.fc30.ppc64le #1
[ 24.609286] NIP: c000000000175650 LR: c000000000175310 CTR: 0000000000000000
[ 24.609351] REGS: c00000001e597210 TRAP: 0380 Not tainted (5.1.17-300.fc30.ppc64le)
[ 24.609414] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44444248 XER: 00000000
[ 24.609482] CFAR: c000000000175528 IRQMASK: 0
[ 24.609482] GPR00: c000000000175310 c00000001e5974a0 c0000000015fc400 0000000000000002
[ 24.609482] GPR04: 0000000000000001 0000000000000001 0000000000000001 0000000000000400
[ 24.609482] GPR08: 14e13da04c5bc37e 0000000000000000 0000000000000000 0000000000000000
[ 24.609482] GPR12: 0000000024022248 c00000000fffee00 0000000000000007 c00000001e0e8fb0
[ 24.609482] GPR16: c00000000162dc70 0000000000000008 c00000001e5976d8 0000000020000000
[ 24.609482] GPR20: 0000000100000003 0000000000000001 0000000000000000 14e13da04c5bc35e
[ 24.609482] GPR24: c000000001630164 0000000000000010 14e13da04c5bc37e 0000000000000000
[ 24.609482] GPR28: 0000000000000002 c0000000142a0e00 c00000001ff25d80 c00000001e5975a8
[ 24.610052] NIP [c000000000175650] find_busiest_group+0x510/0xe10
[ 24.610107] LR [c000000000175310] find_busiest_group+0x1d0/0xe10
[ 24.610169] Call Trace:
[ 24.610203] [c00000001e5974a0] [c000000000175310] find_busiest_group+0x1d0/0xe10 (unreliable)
[ 24.610304] [c00000001e597680] [c000000000176110] load_balance+0x1c0/0xe80
[ 24.610377] [c00000001e5977d0] [c000000000176ff8] rebalance_domains+0x228/0x380
[ 24.610467] [c00000001e597880] [c000000000c7c170] __do_softirq+0x170/0x404
[ 24.610542] [c00000001e597980] [c000000000124368] irq_exit+0xd8/0x110
[ 24.610617] [c00000001e5979a0] [c000000000028778] timer_interrupt+0x128/0x2e0
[ 24.610706] [c00000001e597a00] [c000000000009314] decrementer_common+0x154/0x160
[ 24.610799] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
[ 24.610799] LR = check_and_cede_processor+0x48/0x60
[ 24.610915] [c00000001e597d00] [c00000001e597d60] 0xc00000001e597d60 (unreliable)
[ 24.611004] [c00000001e597d60] [c0000000009e22a8] shared_cede_loop+0x68/0x180
[ 24.611096] [c00000001e597da0] [c0000000009dec64] cpuidle_enter_state+0xa4/0x660
[ 24.611191] [c00000001e597e30] [c0000000001647a0] call_cpuidle+0x50/0xa0
[ 24.611270] [c00000001e597e50] [c000000000164d6c] do_idle+0x2cc/0x3b0
[ 24.611350] [c00000001e597ec0] [c00000000016508c] cpu_startup_entry+0x3c/0x50
[ 24.611445] [c00000001e597ef0] [c000000000051dd0] start_secondary+0x630/0x660
[ 24.611539] [c00000001e597f90] [c00000000000b25c] start_secondary_prolog+0x10/0x14
[ 24.611632] Instruction dump:
[ 24.611680] 7c374800 41820234 e8920016 3b570020 8152002c 7c893670 7d290194 548506be
[ 24.611775] 788606a0 7d2907b4 79291f24 7d1a4a14 <7cfa482a> 7ce72c36 78e707e0 2d270000
[ 24.611871] ---[ end trace 0e5e3ed14d31f59d ]---
[ 24.617852]
[ 25.617885] Kernel panic - not syncing: Aiee, killing interrupt handler!
(qemu) info numa
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 1
node 1 size: 0 MB
node 1 plugged: 0 MB
but the crash doesn't occur with:
-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0 \
-numa node,nodeid=1 \
-device spapr-pci-host-bridge,index=1,id=phb1,numa_node=1
(qemu) info numa
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
(qemu) device_add host-spapr-cpu-core,core-id=1
[ 154.637304] Policy zone: Normal
[ 154.665463] WARNING: workqueue cpumask: online intersect > possible intersect
(qemu) info numa
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 1
node 1 size: 0 MB
node 1 plugged: 0 MB
nor with:
-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0,cpus=0 \
-numa node,nodeid=1
qemu-system-ppc64: warning: CPU(s) not present in any NUMA nodes: CPU 1 [core-id: 1]
qemu-system-ppc64: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
(qemu) device_add host-spapr-cpu-core,core-id=1
(qemu) info numa
2 nodes
node 0 cpus: 0 1
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
So I don't know why Linux crashes, but it isn't strictly because of having
a cpu-less/memory-less node, and this patch would also reject the
non-crashing cases anyway.
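
To make the comparison between the three cases easier to repeat, here is a
minimal sketch (not QEMU code) that parses `info numa` output in the format
shown in this message and lists the nodes that have neither CPUs nor memory.
The function name and the parsing rules are assumptions made for
illustration. The exact check done by the patch is not shown in this
message, so "no CPUs and 0 MB" is only an approximation of its criterion.

```python
import re


def empty_numa_nodes(info_numa):
    """Return the ids of nodes with no CPUs and 0 MB of memory.

    Hypothetical helper: it assumes the `info numa` layout shown in
    this message ("node N cpus: ..." and "node N size: M MB").
    """
    cpus = {}
    size_mb = {}
    for line in info_numa.splitlines():
        m = re.match(r"\s*node (\d+) cpus:(.*)$", line)
        if m:
            # An empty CPU list ("node 1 cpus:") yields [].
            cpus[int(m.group(1))] = m.group(2).split()
            continue
        m = re.match(r"\s*node (\d+) size: (\d+) MB", line)
        if m:
            size_mb[int(m.group(1))] = int(m.group(2))
    return sorted(n for n in cpus if not cpus[n] and size_mb.get(n, 0) == 0)


# Output copied from the monitor before the CPU hotplug.
before_hotplug = """\
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
"""

print(empty_numa_nodes(before_hotplug))  # prints [1]
```

Before the hotplug, all three command lines above give the same `info numa`
output, so a check of this kind flags node 1 in each of them, including the
two configurations that do not crash.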