Re: [Qemu-devel] [Qemu-ppc] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node

From: Greg Kurz <groug@kaod.org>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: "Laurent Vivier" <lvivier@redhat.com>,
	qemu-ppc@nongnu.org, "Daniel P. Berrangé" <berrange@redhat.com>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [Qemu-ppc] [PATCH] pseries: do not allow memory-less/cpu-less NUMA node
Date: Mon, 2 Sep 2019 15:58:53 +0200
Message-ID: <20190902155853.5ecb42e2@bahia.lan>
In-Reply-To: <20190902062718.GG415@umbus.fritz.box>

On Mon, 2 Sep 2019 16:27:18 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> > On Fri, 30 Aug 2019 17:34:13 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > 
> > > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > > crashes.
> > > > 
> > > > This happens because the Linux kernel needs to know the NUMA topology at
> > > > start to be able to initialize the distance lookup table.
> > > > 
> > > > On pseries, the topology is provided by the firmware via the existing
> > > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > > discovered by the kernel.
> > > > 
> > > > To avoid the kernel crash, do not allow pseries to start with empty
> > > > nodes.
> > > 
> > > This describes one possible guest OS. Is there any reasonable chance
> > > that a non-Linux guest might be able to handle this situation correctly,
> > > or do you expect any guest to have the same restriction ?
> 
> That's... a more complicated question than you'd think.
> 
> The problem here is it's not really obvious in PAPR how topology
> information for nodes without memory should be described in the device
> tree (which is the only way we give that information to the guest).
> 

The reported issue concerns a node without memory AND without CPUs.

> It's possible there's some way to encode this information that would
> make AIX happy and we just need to fix Linux to cope with that, but
> it's not really clear what it would be.
> 
> > I can try to grab an AIX image and give it a try, but anyway this looks like
> > a very big hammer to me... :-\
> 
> I'm not really sure why everyone seems to think losing zero-memory
> node capability is such a big deal.  It's never worked in practice on
> POWER and we can always put it back if we figure out a sensible way to
> do it.
> 

It isn't really about losing the memory-less/cpu-less node capability, but
more about finding the appropriate fix. The changelog doesn't give many
clues about what exactly is happening: QEMU command line? Linux call stack?

For example, I could hit a crash with the following command line:

-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0 \
-numa node,nodeid=1

(qemu) info numa 
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
(qemu) device_add host-spapr-cpu-core,core-id=1

[   24.507552] Built 1 zonelists, mobility grouping on.  Total pages: 7656
[   24.507592] Policy zone: Normal
[   24.553481] WARNING: workqueue cpumask: online intersect > possible intersect
[   24.608814] BUG: Unable to handle kernel data access at 0x14e13da04c5bc37e
[   24.608875] Faulting instruction address: 0xc000000000175650
[   24.608931] Oops: Kernel access of bad area, sig: 11 [#1]
[   24.608976] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA pSeries
[   24.609042] Modules linked in: virtio_net vmx_crypto net_failover failover crct10dif_vpmsum ip_tables xfs libcrc32c crc32c_vpmsum virtio_blk kvm rpadlpar_io rpaphp 9p fscache 9pnet_virtio 9pnet
[   24.609222] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.1.17-300.fc30.ppc64le #1
[   24.609286] NIP:  c000000000175650 LR: c000000000175310 CTR: 0000000000000000
[   24.609351] REGS: c00000001e597210 TRAP: 0380   Not tainted  (5.1.17-300.fc30.ppc64le)
[   24.609414] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 44444248  XER: 00000000
[   24.609482] CFAR: c000000000175528 IRQMASK: 0 
[   24.609482] GPR00: c000000000175310 c00000001e5974a0 c0000000015fc400 0000000000000002 
[   24.609482] GPR04: 0000000000000001 0000000000000001 0000000000000001 0000000000000400 
[   24.609482] GPR08: 14e13da04c5bc37e 0000000000000000 0000000000000000 0000000000000000 
[   24.609482] GPR12: 0000000024022248 c00000000fffee00 0000000000000007 c00000001e0e8fb0 
[   24.609482] GPR16: c00000000162dc70 0000000000000008 c00000001e5976d8 0000000020000000 
[   24.609482] GPR20: 0000000100000003 0000000000000001 0000000000000000 14e13da04c5bc35e 
[   24.609482] GPR24: c000000001630164 0000000000000010 14e13da04c5bc37e 0000000000000000 
[   24.609482] GPR28: 0000000000000002 c0000000142a0e00 c00000001ff25d80 c00000001e5975a8 
[   24.610052] NIP [c000000000175650] find_busiest_group+0x510/0xe10
[   24.610107] LR [c000000000175310] find_busiest_group+0x1d0/0xe10
[   24.610169] Call Trace:
[   24.610203] [c00000001e5974a0] [c000000000175310] find_busiest_group+0x1d0/0xe10 (unreliable)
[   24.610304] [c00000001e597680] [c000000000176110] load_balance+0x1c0/0xe80
[   24.610377] [c00000001e5977d0] [c000000000176ff8] rebalance_domains+0x228/0x380
[   24.610467] [c00000001e597880] [c000000000c7c170] __do_softirq+0x170/0x404
[   24.610542] [c00000001e597980] [c000000000124368] irq_exit+0xd8/0x110
[   24.610617] [c00000001e5979a0] [c000000000028778] timer_interrupt+0x128/0x2e0
[   24.610706] [c00000001e597a00] [c000000000009314] decrementer_common+0x154/0x160
[   24.610799] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
[   24.610799]     LR = check_and_cede_processor+0x48/0x60
[   24.610915] [c00000001e597d00] [c00000001e597d60] 0xc00000001e597d60 (unreliable)
[   24.611004] [c00000001e597d60] [c0000000009e22a8] shared_cede_loop+0x68/0x180
[   24.611096] [c00000001e597da0] [c0000000009dec64] cpuidle_enter_state+0xa4/0x660
[   24.611191] [c00000001e597e30] [c0000000001647a0] call_cpuidle+0x50/0xa0
[   24.611270] [c00000001e597e50] [c000000000164d6c] do_idle+0x2cc/0x3b0
[   24.611350] [c00000001e597ec0] [c00000000016508c] cpu_startup_entry+0x3c/0x50
[   24.611445] [c00000001e597ef0] [c000000000051dd0] start_secondary+0x630/0x660
[   24.611539] [c00000001e597f90] [c00000000000b25c] start_secondary_prolog+0x10/0x14
[   24.611632] Instruction dump:
[   24.611680] 7c374800 41820234 e8920016 3b570020 8152002c 7c893670 7d290194 548506be 
[   24.611775] 788606a0 7d2907b4 79291f24 7d1a4a14 <7cfa482a> 7ce72c36 78e707e0 2d270000 
[   24.611871] ---[ end trace 0e5e3ed14d31f59d ]---
[   24.617852] 
[   25.617885] Kernel panic - not syncing: Aiee, killing interrupt handler!

(qemu) info numa 
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 1
node 1 size: 0 MB
node 1 plugged: 0 MB

but the crash doesn't occur with:

-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0 \
-numa node,nodeid=1 \
-device spapr-pci-host-bridge,index=1,id=phb1,numa_node=1

(qemu) info numa 
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
(qemu) device_add host-spapr-cpu-core,core-id=1

[  154.637304] Policy zone: Normal
[  154.665463] WARNING: workqueue cpumask: online intersect > possible intersect

(qemu) info numa 
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 1
node 1 size: 0 MB
node 1 plugged: 0 MB


nor with:

-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0,cpus=0 \
-numa node,nodeid=1

qemu-system-ppc64: warning: CPU(s) not present in any NUMA nodes: CPU 1 [core-id: 1]
qemu-system-ppc64: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
(qemu) device_add host-spapr-cpu-core,core-id=1
(qemu) info numa 
2 nodes
node 0 cpus: 0 1
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB

So I don't know why Linux crashes, but having a cpu-less/memory-less node
is not by itself the cause, and this patch would also reject the
non-crashing configurations shown above.
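
For anyone wanting to compare more configurations, the "info numa"
transcripts can be checked mechanically. The following is only a minimal
sketch: it assumes the monitor output keeps the line format shown in this
message (sizes reported in MB), and the helper names parse_info_numa and
empty_nodes are hypothetical, not part of QEMU.

```python
import re

# Matches lines such as "node 1 cpus: 0 1" or "node 0 size: 512 MB".
# Assumption: the line format is the one shown in the transcripts above.
_NODE_LINE = re.compile(r"\s*node (\d+) (cpus|size|plugged):\s*(.*)$")


def parse_info_numa(text):
    """Parse `info numa` output into {node_id: {"cpus", "size_mb", "plugged_mb"}}."""
    nodes = {}
    for line in text.splitlines():
        match = _NODE_LINE.match(line)
        if not match:
            continue  # skips prompts and the "N nodes" summary line
        node_id = int(match.group(1))
        key = match.group(2)
        value = match.group(3).strip()
        node = nodes.setdefault(
            node_id, {"cpus": [], "size_mb": 0, "plugged_mb": 0}
        )
        if key == "cpus":
            # An empty value means the node currently has no CPUs.
            node["cpus"] = [int(cpu) for cpu in value.split()]
        else:
            # "512 MB" -> 512; the unit is assumed to be MB.
            node[key + "_mb"] = int(value.split()[0])
    return nodes


def empty_nodes(nodes):
    """Return the sorted IDs of nodes that have neither CPUs nor memory."""
    return sorted(
        node_id
        for node_id, info in nodes.items()
        if not info["cpus"] and info["size_mb"] == 0
    )
```

Run against the transcripts above, this reports node 1 as empty before the
hotplug in all three configurations, which is consistent with the
observation that an empty node alone does not predict the crash.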
