From: Michael Roth
To: Bharata B Rao, qemu-devel@nongnu.org
Cc: qemu-ppc@nongnu.org, david@gibson.dropbear.id.au, aik@au1.ibm.com, Nathan Fontenot
Subject: Re: [Qemu-devel] [RFC PATCH v0 for 2.7] spapr: Work around the memory hotplug failure with DDW
Date: Tue, 03 May 2016 18:30:51 -0500
Message-ID: <20160503233051.19088.74310@loki>
In-Reply-To: <1461066701-23212-1-git-send-email-bharata@linux.vnet.ibm.com>
References: <1461066701-23212-1-git-send-email-bharata@linux.vnet.ibm.com>

Quoting Bharata B Rao (2016-04-19 06:51:41)
> Memory hotplug can fail for some combinations of RAM and maxmem when
> DDW is enabled in the presence of devices like nec-xhci-usb. DDW depends
> on the maximum memory addressable by the guest, and this value is
> currently calculated wrongly by the guest kernel routine
> memory_hotplug_max(). While there is an attempt to fix the guest
> kernel(*), this patch works around the problem within QEMU itself.
>
> The memory_hotplug_max() routine in the guest kernel arrives at the max
> addressable memory by multiplying the lmb-size with the lmb-count obtained
> from the ibm,dynamic-memory property. There are two assumptions here:
>
> - All LMBs are part of ibm,dynamic-memory: This is not true for PowerKVM,
>   where only hot-pluggable LMBs are present in this property.
> - The memory area comprising RAM and the hotplug region is contiguous: This
>   needn't always be true for PowerKVM, as there can be a gap between
>   boot time RAM and the hotplug region.
>
> This workaround involves having all the LMBs (RMA, rest of the boot time
> LMBs and hot-pluggable LMBs) as part of ibm,dynamic-memory so that the
> guest kernel's calculation of max addressable memory comes out correct,
> resulting in a correct DDW value which prevents memory hotplug failures.
> memory@0 is created for the RMA, but the RMA LMBs are also represented as
> "reserved" LMBs in ibm,dynamic-memory. Parts of this are essentially a
> revert of e8f986fc57a664a74b9f685b466506366a15201b.
>
> In addition to this, the alignment of the hotplug memory region is reduced
> from the current 1G to 256M (LMB size in PowerKVM) so that we don't end up
> with any gaps between boot time RAM and the hotplug region.

I don't see the actual change to SPAPR_HOTPLUG_MEM_ALIGN here? Is it
aligned by some other means?
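Also, to make sure I'm following the failure mode described above: the
guest-side calculation being worked around is, as I understand it,
roughly the following. This is an illustrative sketch only -- not the
actual memory_hotplug_max() code, and the function and parameter names
here are made up:

#include <stdint.h>

/*
 * Sketch of the calculation the commit message describes: the guest
 * derives its maximum addressable memory purely from the number of
 * LMBs listed in ibm,dynamic-memory.
 */
static uint64_t guest_max_addr_estimate(uint32_t drconf_lmb_count,
                                        uint64_t lmb_size)
{
    /*
     * Assumes every LMB (boot-time and hot-pluggable) appears in
     * ibm,dynamic-memory and that they form one contiguous range
     * starting at 0. If only hot-pluggable LMBs are listed, or there
     * is a gap between boot-time RAM and the hotplug region, this
     * undershoots, and a DDW sized from it is too small.
     */
    return (uint64_t)drconf_lmb_count * lmb_size;
}

With every LMB present in ibm,dynamic-memory (and the RMA ones merely
flagged "reserved"), that product covers the full maxmem range, which is
what this patch arranges.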
> This change has a side effect on how the memory nodes in DT are
> represented before and after this change. As an example consider
> a guest with the following memory related command line options:
>
> -m 4G,slots=32,maxmem=8G -numa node,nodeid=0,mem=2G -numa node,nodeid=1,mem=2G
>
> Before this change, the guest would have
>
> Scenario 1
> ----------
> memory@0 for RMA
> memory@80000000 for rest of the boot time memory
> ibm,dynamic-reconfiguration-memory for hot-pluggable memory.
>
> After this commit, the guest will have
>
> Scenario 2
> ----------
> memory@0 for RMA
> ibm,dynamic-reconfiguration-memory for the entire memory including
> RMA, boot time memory as well as hot-pluggable memory.
>
> If an existing guest having DT nodes as in Scenario 1 above is migrated
> to a QEMU which has this change, at the target, it continues to have the
> DT nodes as in Scenario 1. However after the 1st reboot, the DT
> representation changes over to Scenario 2.
>
> I haven't yet looked at Jian Jun's DRC migration patchset to ascertain
> if this change works well with DRC migration.
>
> (*) https://patchwork.ozlabs.org/patch/606912/
>
> Signed-off-by: Bharata B Rao
> Cc: Nathan Fontenot
> Cc: Michael Roth
> ---
>  hw/ppc/spapr.c         | 59 +++++++++++++++++++++++++++++++++++---------------
>  include/hw/ppc/spapr.h |  1 +
>  2 files changed, 43 insertions(+), 17 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 79a70a9..6d8de2e 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -566,7 +566,6 @@ static int spapr_populate_memory(sPAPRMachineState *spapr, void *fdt)
>          }
>          if (!mem_start) {
>              /* ppc_spapr_init() checks for rma_size <= node0_size already */
> -            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
>              mem_start += spapr->rma_size;
>              node_size -= spapr->rma_size;
>          }
> @@ -759,18 +758,13 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>      int ret, i, offset;
>      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
>      uint32_t prop_lmb_size[] = {0, cpu_to_be32(lmb_size)};
> -    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> +    uint32_t nr_rma_lmbs = spapr->rma_size / lmb_size;
> +    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> +    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
>      uint32_t *int_buf, *cur_index, buf_len;
>      int nr_nodes = nb_numa_nodes ? nb_numa_nodes : 1;
>
>      /*
> -     * Don't create the node if there are no DR LMBs.
> -     */
> -    if (!nr_lmbs) {
> -        return 0;
> -    }
> -
> -    /*
>       * Allocate enough buffer size to fit in ibm,dynamic-memory
>       * or ibm,associativity-lookup-arrays
>       */
> @@ -802,9 +796,15 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>      for (i = 0; i < nr_lmbs; i++) {
>          sPAPRDRConnector *drc;
>          sPAPRDRConnectorClass *drck;
> -        uint64_t addr = i * lmb_size + spapr->hotplug_memory.base;;
> +        uint64_t addr;
>          uint32_t *dynamic_memory = cur_index;
>
> +        if (i < nr_assigned_lmbs) {
> +            addr = i * lmb_size;
> +        } else {
> +            addr = (i - nr_assigned_lmbs) * lmb_size +
> +                   spapr->hotplug_memory.base;

If the fix relies on there being no gap between hotplug_memory.base and
machine->ram_size, could we instead assert that
(nr_assigned_lmbs * lmb_size == spapr->hotplug_memory.base) and then use
the same addr calculation for all lmbs (here and elsewhere)?
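Something along these lines is what I'm picturing -- an untested sketch
only, using the names already in the patch; the same shape would apply
in spapr_create_lmb_dr_connectors():

    /*
     * Sketch: with boot-time RAM and the hotplug region guaranteed to
     * be contiguous, the address of LMB i is simply i * lmb_size for
     * every entry, whether boot-time or hot-pluggable.
     */
    g_assert(nr_assigned_lmbs * lmb_size == spapr->hotplug_memory.base);

    for (i = 0; i < nr_lmbs; i++) {
        sPAPRDRConnector *drc;
        uint64_t addr = (uint64_t)i * lmb_size;

        drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB,
                                       addr / lmb_size);
        g_assert(drc);
        /* ... rest of the loop body unchanged ... */
    }

That would drop the special-casing entirely and make the layout
assumption explicit.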
Otherwise the patch looks good, and it seems like a reasonable workaround
to have on the QEMU side even if we still pursue the kernel fix.

> +        }
>          drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB,
>                                         addr/lmb_size);
>          g_assert(drc);
> @@ -817,7 +817,11 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>          dynamic_memory[4] = cpu_to_be32(numa_get_node(addr, NULL));
>          if (addr < machine->ram_size ||
>              memory_region_present(get_system_memory(), addr)) {
> -            dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> +            if (i < nr_rma_lmbs) {
> +                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_RESERVED);
> +            } else {
> +                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> +            }
>          } else {
>              dynamic_memory[5] = cpu_to_be32(0);
>          }
> @@ -879,6 +883,8 @@ int spapr_h_cas_compose_response(sPAPRMachineState *spapr,
>      /* Generate ibm,dynamic-reconfiguration-memory node if required */
>      if (memory_update && smc->dr_lmb_enabled) {
>          _FDT((spapr_populate_drconf_memory(spapr, fdt)));
> +    } else {
> +        _FDT((spapr_populate_memory(spapr, fdt)));
>      }
>
>      /* Pack resulting tree */
> @@ -916,10 +922,23 @@ static void spapr_finalize_fdt(sPAPRMachineState *spapr,
>      /* open out the base tree into a temp buffer for the final tweaks */
>      _FDT((fdt_open_into(spapr->fdt_skel, fdt, FDT_MAX_SIZE)));
>
> -    ret = spapr_populate_memory(spapr, fdt);
> -    if (ret < 0) {
> -        fprintf(stderr, "couldn't setup memory nodes in fdt\n");
> -        exit(1);
> +    /*
> +     * Add memory@0 node to represent RMA. Rest of the memory is either
> +     * represented by memory nodes or ibm,dynamic-reconfiguration-memory
> +     * node later during ibm,client-architecture-support call.
> +     *
> +     * If NUMA is configured, ensure that memory@0 ends up in the
> +     * first memory-less node.
> +     */
> +    if (nb_numa_nodes) {
> +        for (i = 0; i < nb_numa_nodes; ++i) {
> +            if (numa_info[i].node_mem) {
> +                spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> +                break;
> +            }
> +        }
> +    } else {
> +        spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
>      }
>
>      ret = spapr_populate_vdevice(spapr->vio_bus, fdt);
> @@ -1659,14 +1678,20 @@ static void spapr_create_lmb_dr_connectors(sPAPRMachineState *spapr)
>  {
>      MachineState *machine = MACHINE(spapr);
>      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
> -    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> +    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> +    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
>      int i;
>
>      for (i = 0; i < nr_lmbs; i++) {
>          sPAPRDRConnector *drc;
>          uint64_t addr;
>
> -        addr = i * lmb_size + spapr->hotplug_memory.base;
> +        if (i < nr_assigned_lmbs) {
> +            addr = i * lmb_size;
> +        } else {
> +            addr = (i - nr_assigned_lmbs) * lmb_size +
> +                   spapr->hotplug_memory.base;
> +        }
>          drc = spapr_dr_connector_new(OBJECT(spapr), SPAPR_DR_CONNECTOR_TYPE_LMB,
>                                       addr/lmb_size);
>          qemu_register_reset(spapr_drc_reset, drc);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 098d85d..9f2050d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -627,5 +627,6 @@ int spapr_rng_populate_dt(void *fdt);
>   * property under ibm,dynamic-reconfiguration-memory node.
>   */
>  #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
> +#define SPAPR_LMB_FLAGS_RESERVED 0x00000080
>
>  #endif /* !defined (__HW_SPAPR_H__) */
> --
> 2.1.0
>
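One more note for anyone reading along: my understanding (worth
double-checking on the guest side) is that a guest walking
ibm,dynamic-memory skips reserved entries and only treats assigned ones
as usable memory, so flagging the RMA LMBs as reserved keeps them out of
hotplug handling while still letting them count toward the address-range
calculation above. Illustrative sketch only, reusing the flag values
from the patch; the helper name is made up and this is not the actual
kernel parser:

#include <stdbool.h>
#include <stdint.h>

#define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
#define SPAPR_LMB_FLAGS_RESERVED 0x00000080

/*
 * Decide whether an ibm,dynamic-memory entry describes memory the guest
 * should treat as usable/hot-pluggable. Reserved entries (e.g. the RMA
 * LMBs added by this patch) still contribute to the LMB count, but are
 * not onlined or hot-added through this path.
 */
static bool lmb_is_usable(uint32_t flags)
{
    if (flags & SPAPR_LMB_FLAGS_RESERVED) {
        return false;
    }
    return (flags & SPAPR_LMB_FLAGS_ASSIGNED) != 0;
}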