From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:35545)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <bharata@linux.vnet.ibm.com>) id 1RRHvk-0006rB-Hr
	for qemu-devel@nongnu.org; Fri, 18 Nov 2011 01:28:09 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <bharata@linux.vnet.ibm.com>) id 1RRHvi-0006e2-NE
	for qemu-devel@nongnu.org; Fri, 18 Nov 2011 01:28:08 -0500
Received: from e28smtp02.in.ibm.com ([122.248.162.2]:60557)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <bharata@linux.vnet.ibm.com>) id 1RRHvf-0006dh-LC
	for qemu-devel@nongnu.org; Fri, 18 Nov 2011 01:28:06 -0500
Received: from /spool/local
	by e28smtp02.in.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use
	Only! Violators will be prosecuted
	for <qemu-devel@nongnu.org> from <bharata@linux.vnet.ibm.com>;
	Fri, 18 Nov 2011 11:57:46 +0530
Received: from d28av04.in.ibm.com (d28av04.in.ibm.com [9.184.220.66])
	by d28relay01.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id
	pAI6RgND4853878
	for <qemu-devel@nongnu.org>; Fri, 18 Nov 2011 11:57:43 +0530
Received: from d28av04.in.ibm.com (loopback [127.0.0.1])
	by d28av04.in.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id
	pAI6RWAi015940
	for <qemu-devel@nongnu.org>; Fri, 18 Nov 2011 17:27:32 +1100
Date: Fri, 18 Nov 2011 11:57:27 +0530
From: Bharata B Rao <bharata@linux.vnet.ibm.com>
Message-ID: <20111118062727.GD4993@in.ibm.com>
References: <20111117085751.16490.43574.malonedeb@chaenomeles.canonical.com>
	<20111117085751.16490.43574.malonedeb@chaenomeles.canonical.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20111117085751.16490.43574.malonedeb@chaenomeles.canonical.com>
Subject: Re: [Qemu-devel] [Bug 891525] [NEW] Guest kernel crashes when
 booting a NUMA guest	without explicitly specifying cpus= in -numa option
Reply-To: bharata@linux.vnet.ibm.com
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Bharata B Rao <891525@bugs.launchpad.net>
Cc: qemu-devel@nongnu.org

The guest kernel crashes because qemu assigns the VCPUs to the nodes in
a round-robin fashion. According to this comment in vl.c, the guest
kernel is expected to handle this:

/* assigning the VCPUs round-robin is easier to implement, guest OSes
 * must cope with this anyway, because there are BIOSes out there in
 * real machines which also use this scheme.
 */

I am not sure whether this should be considered a bug in the guest kernel,
but I have verified that assigning the VCPUs to the nodes serially (a
contiguous block of VCPUs per node) fixes the problem for me. I am not
aware of the history behind the round-robin assignment, nor do I understand
the full implications of changing it, but here is a potential patch.
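
To make the difference between the two schemes concrete, here is a minimal
standalone sketch. It is not the vl.c code: the function names
assign_round_robin() and assign_serial() are hypothetical, the masks are
assumed to be 64-bit, and the fallback for fewer VCPUs than nodes is an
assumption added so that the sketch never divides by zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round-robin: VCPU i is placed on node (i % nb_nodes). */
static void assign_round_robin(uint64_t *mask, int nb_nodes, int smp_cpus)
{
    assert(nb_nodes > 0 && smp_cpus >= 0 && smp_cpus <= 64);
    memset(mask, 0, nb_nodes * sizeof(*mask));
    for (int i = 0; i < smp_cpus; i++) {
        mask[i % nb_nodes] |= UINT64_C(1) << i;
    }
}

/* Serial: each node gets a contiguous block of smp_cpus / nb_nodes VCPUs;
 * VCPUs left over after the last block go to node 0. */
static void assign_serial(uint64_t *mask, int nb_nodes, int smp_cpus)
{
    assert(nb_nodes > 0 && smp_cpus >= 0 && smp_cpus <= 64);
    int cpus_per_node = smp_cpus / nb_nodes;

    memset(mask, 0, nb_nodes * sizeof(*mask));
    for (int i = 0; i < smp_cpus; i++) {
        /* Assumption: with fewer VCPUs than nodes (cpus_per_node == 0),
         * put every VCPU on node 0 instead of dividing by zero. */
        int nodeid = cpus_per_node ? i / cpus_per_node : 0;

        if (nodeid >= nb_nodes) {
            nodeid = 0;
        }
        mask[nodeid] |= UINT64_C(1) << i;
    }
}
```

With 4 VCPUs and 2 nodes, round-robin gives node 0 the mask 0x5 (VCPUs 0, 2)
and node 1 the mask 0xA (VCPUs 1, 3), while the serial scheme gives 0x3
(VCPUs 0, 1) and 0xC (VCPUs 2, 3).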
---

Fix VCPU enumeration between nodes

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Currently VCPUs are assigned to nodes in a round-robin manner.
This is seen to break the guest kernel for x86_64-softmmu. Hence assign
VCPUs serially between nodes.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---

 vl.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)


diff --git a/vl.c b/vl.c
index f5afed4..75348d0 100644
--- a/vl.c
+++ b/vl.c
@@ -3307,13 +3307,18 @@ int main(int argc, char **argv, char **envp)
             if (node_cpumask[i] != 0)
                 break;
         }
-        /* assigning the VCPUs round-robin is easier to implement, guest OSes
-         * must cope with this anyway, because there are BIOSes out there in
-         * real machines which also use this scheme.
-         */
+        /* Assign VCPUs to nodes in serial manner */
         if (i == nb_numa_nodes) {
+            int cpus_per_node = smp_cpus / nb_numa_nodes;
+
             for (i = 0; i < smp_cpus; i++) {
-                node_cpumask[i % nb_numa_nodes] |= 1 << i;
+                int nodeid = i / cpus_per_node;
+
+                /* Extra VCPUs go to node 0 */
+                if (nodeid >= nb_numa_nodes) {
+                    nodeid = 0;
+                }
+                node_cpumask[nodeid] |= 1 << i;
             }
         }
     }