From: Michael Roth
To: Bharata B Rao, qemu-devel@nongnu.org
Cc: qemu-ppc@nongnu.org, david@gibson.dropbear.id.au, aik@au1.ibm.com, Nathan Fontenot
Subject: Re: [Qemu-devel] [RFC PATCH v0 for 2.7] spapr: Work around the memory hotplug failure with DDW
Date: Tue, 03 May 2016 18:30:51 -0500
Message-ID: <20160503233051.19088.74310@loki>
In-Reply-To: <1461066701-23212-1-git-send-email-bharata@linux.vnet.ibm.com>
References: <1461066701-23212-1-git-send-email-bharata@linux.vnet.ibm.com>

Quoting Bharata B Rao (2016-04-19 06:51:41)
> Memory hotplug can fail for some combinations of RAM and maxmem when
> DDW is enabled in the presence of devices like nec-xhci-usb. DDW depends
> on the maximum memory addressable by the guest, and this value is
> currently calculated wrongly by the guest kernel routine
> memory_hotplug_max(). While there is an attempt to fix the guest
> kernel(*), this patch works around the problem within QEMU itself.
>
> The memory_hotplug_max() routine in the guest kernel arrives at the max
> addressable memory by multiplying the lmb-size with the lmb-count obtained
> from the ibm,dynamic-memory property. There are two assumptions here:
>
> - All LMBs are part of ibm,dynamic-memory: This is not true for PowerKVM,
>   where only hot-pluggable LMBs are present in this property.
> - The memory area comprising RAM and the hotplug region is contiguous: This
>   needn't always be true for PowerKVM, as there can be a gap between
>   boot time RAM and the hotplug region.
>
> This workaround involves having all the LMBs (RMA, rest of the boot time
> LMBs and hot-pluggable LMBs) as part of ibm,dynamic-memory so that the
> guest kernel's calculation of max addressable memory comes out correct,
> resulting in a correct DDW value which prevents memory hotplug failures.
> memory@0 is created for the RMA, but the RMA LMBs are also represented as
> "reserved" LMBs in ibm,dynamic-memory. Parts of this are essentially a
> revert of e8f986fc57a664a74b9f685b466506366a15201b.
>
> In addition to this, the alignment of the hotplug memory region is reduced
> from the current 1G to 256M (LMB size in PowerKVM) so that we don't end up
> with any gaps between boot time RAM and the hotplug region.

I don't see the actual change to SPAPR_HOTPLUG_MEM_ALIGN here? Is it
aligned by some other means?
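Also, to make sure I'm following the failure mode described above: the
guest-side calculation being worked around is, as I understand it,
roughly the following. This is an illustrative sketch only -- not the
actual memory_hotplug_max() code, and the function and parameter names
here are made up:

#include <stdint.h>

/*
 * Sketch of the calculation the commit message describes: the guest
 * derives its maximum addressable memory purely from the number of
 * LMBs listed in ibm,dynamic-memory.
 */
static uint64_t guest_max_addr_estimate(uint32_t drconf_lmb_count,
                                        uint64_t lmb_size)
{
    /*
     * Assumes every LMB (boot-time and hot-pluggable) appears in
     * ibm,dynamic-memory and that they form one contiguous range
     * starting at 0. If only hot-pluggable LMBs are listed, or there
     * is a gap between boot-time RAM and the hotplug region, this
     * undershoots, and a DDW sized from it is too small.
     */
    return (uint64_t)drconf_lmb_count * lmb_size;
}

With every LMB present in ibm,dynamic-memory (and the RMA ones merely
flagged "reserved"), that product covers the full maxmem range, which is
what this patch arranges.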
> This change has a side effect on how the memory nodes in DT are
> represented before and after this change. As an example consider
> a guest with the following memory related command line options:
>
> -m 4G,slots=32,maxmem=8G -numa node,nodeid=0,mem=2G -numa node,nodeid=1,mem=2G
>
> Before this change, the guest would have
>
> Scenario 1
> ----------
> memory@0 for RMA
> memory@80000000 for rest of the boot time memory
> ibm,dynamic-reconfiguration-memory for hot-pluggable memory.
>
> After this commit, the guest will have
>
> Scenario 2
> ----------
> memory@0 for RMA
> ibm,dynamic-reconfiguration-memory for the entire memory including
> RMA, boot time memory as well as hot-pluggable memory.
>
> If an existing guest having DT nodes as in Scenario 1 above is migrated
> to a QEMU which has this change, at the target, it continues to have the
> DT nodes as in Scenario 1. However after the 1st reboot, the DT
> representation changes over to Scenario 2.
>
> I haven't yet looked at Jian Jun's DRC migration patchset to ascertain
> if this change works well with DRC migration.
>
> (*) https://patchwork.ozlabs.org/patch/606912/
>
> Signed-off-by: Bharata B Rao
> Cc: Nathan Fontenot
> Cc: Michael Roth
> ---
>  hw/ppc/spapr.c         | 59 +++++++++++++++++++++++++++++++++++---------------
>  include/hw/ppc/spapr.h |  1 +
>  2 files changed, 43 insertions(+), 17 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 79a70a9..6d8de2e 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -566,7 +566,6 @@ static int spapr_populate_memory(sPAPRMachineState *spapr, void *fdt)
>          }
>          if (!mem_start) {
>              /* ppc_spapr_init() checks for rma_size <= node0_size already */
> -            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
>              mem_start += spapr->rma_size;
>              node_size -= spapr->rma_size;
>          }
> @@ -759,18 +758,13 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>      int ret, i, offset;
>      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
>      uint32_t prop_lmb_size[] = {0, cpu_to_be32(lmb_size)};
> -    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> +    uint32_t nr_rma_lmbs = spapr->rma_size / lmb_size;
> +    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> +    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
>      uint32_t *int_buf, *cur_index, buf_len;
>      int nr_nodes = nb_numa_nodes ? nb_numa_nodes : 1;
>
>      /*
> -     * Don't create the node if there are no DR LMBs.
> -     */
> -    if (!nr_lmbs) {
> -        return 0;
> -    }
> -
> -    /*
>       * Allocate enough buffer size to fit in ibm,dynamic-memory
>       * or ibm,associativity-lookup-arrays
>       */
> @@ -802,9 +796,15 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>      for (i = 0; i < nr_lmbs; i++) {
>          sPAPRDRConnector *drc;
>          sPAPRDRConnectorClass *drck;
> -        uint64_t addr = i * lmb_size + spapr->hotplug_memory.base;;
> +        uint64_t addr;
>          uint32_t *dynamic_memory = cur_index;
>
> +        if (i < nr_assigned_lmbs) {
> +            addr = i * lmb_size;
> +        } else {
> +            addr = (i - nr_assigned_lmbs) * lmb_size +
> +                   spapr->hotplug_memory.base;

If the fix relies on there being no gap between hotplug_memory.base and
machine->ram_size, could we instead assert that
(nr_assigned_lmbs * lmb_size == spapr->hotplug_memory.base) and then use
the same addr calculation for all lmbs (here and elsewhere)?
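Something along these lines is what I'm picturing -- an untested sketch
only, using the names already in the patch; the same shape would apply
in spapr_create_lmb_dr_connectors():

    /*
     * Sketch: with boot-time RAM and the hotplug region guaranteed to
     * be contiguous, the address of LMB i is simply i * lmb_size for
     * every entry, whether boot-time or hot-pluggable.
     */
    g_assert(nr_assigned_lmbs * lmb_size == spapr->hotplug_memory.base);

    for (i = 0; i < nr_lmbs; i++) {
        sPAPRDRConnector *drc;
        uint64_t addr = (uint64_t)i * lmb_size;

        drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB,
                                       addr / lmb_size);
        g_assert(drc);
        /* ... rest of the loop body unchanged ... */
    }

That would drop the special-casing entirely and make the layout
assumption explicit.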
Otherwise the patch looks good, and it seems like a reasonable workaround
to have on the QEMU side even if we still pursue the kernel fix.

> +        }
>          drc = spapr_dr_connector_by_id(SPAPR_DR_CONNECTOR_TYPE_LMB,
>                                         addr/lmb_size);
>          g_assert(drc);
> @@ -817,7 +817,11 @@ static int spapr_populate_drconf_memory(sPAPRMachineState *spapr, void *fdt)
>          dynamic_memory[4] = cpu_to_be32(numa_get_node(addr, NULL));
>          if (addr < machine->ram_size ||
>              memory_region_present(get_system_memory(), addr)) {
> -            dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> +            if (i < nr_rma_lmbs) {
> +                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_RESERVED);
> +            } else {
> +                dynamic_memory[5] = cpu_to_be32(SPAPR_LMB_FLAGS_ASSIGNED);
> +            }
>          } else {
>              dynamic_memory[5] = cpu_to_be32(0);
>          }
> @@ -879,6 +883,8 @@ int spapr_h_cas_compose_response(sPAPRMachineState *spapr,
>      /* Generate ibm,dynamic-reconfiguration-memory node if required */
>      if (memory_update && smc->dr_lmb_enabled) {
>          _FDT((spapr_populate_drconf_memory(spapr, fdt)));
> +    } else {
> +        _FDT((spapr_populate_memory(spapr, fdt)));
>      }
>
>      /* Pack resulting tree */
> @@ -916,10 +922,23 @@ static void spapr_finalize_fdt(sPAPRMachineState *spapr,
>      /* open out the base tree into a temp buffer for the final tweaks */
>      _FDT((fdt_open_into(spapr->fdt_skel, fdt, FDT_MAX_SIZE)));
>
> -    ret = spapr_populate_memory(spapr, fdt);
> -    if (ret < 0) {
> -        fprintf(stderr, "couldn't setup memory nodes in fdt\n");
> -        exit(1);
> +    /*
> +     * Add memory@0 node to represent RMA. Rest of the memory is either
> +     * represented by memory nodes or ibm,dynamic-reconfiguration-memory
> +     * node later during ibm,client-architecture-support call.
> +     *
> +     * If NUMA is configured, ensure that memory@0 ends up in the
> +     * first memory-less node.
> +     */
> +    if (nb_numa_nodes) {
> +        for (i = 0; i < nb_numa_nodes; ++i) {
> +            if (numa_info[i].node_mem) {
> +                spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
> +                break;
> +            }
> +        }
> +    } else {
> +        spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
>      }
>
>      ret = spapr_populate_vdevice(spapr->vio_bus, fdt);
> @@ -1659,14 +1678,20 @@ static void spapr_create_lmb_dr_connectors(sPAPRMachineState *spapr)
>  {
>      MachineState *machine = MACHINE(spapr);
>      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
> -    uint32_t nr_lmbs = (machine->maxram_size - machine->ram_size)/lmb_size;
> +    uint32_t nr_lmbs = machine->maxram_size / lmb_size;
> +    uint32_t nr_assigned_lmbs = machine->ram_size / lmb_size;
>      int i;
>
>      for (i = 0; i < nr_lmbs; i++) {
>          sPAPRDRConnector *drc;
>          uint64_t addr;
>
> -        addr = i * lmb_size + spapr->hotplug_memory.base;
> +        if (i < nr_assigned_lmbs) {
> +            addr = i * lmb_size;
> +        } else {
> +            addr = (i - nr_assigned_lmbs) * lmb_size +
> +                   spapr->hotplug_memory.base;
> +        }
>          drc = spapr_dr_connector_new(OBJECT(spapr), SPAPR_DR_CONNECTOR_TYPE_LMB,
>                                       addr/lmb_size);
>          qemu_register_reset(spapr_drc_reset, drc);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 098d85d..9f2050d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -627,5 +627,6 @@ int spapr_rng_populate_dt(void *fdt);
>   * property under ibm,dynamic-reconfiguration-memory node.
>   */
>  #define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
> +#define SPAPR_LMB_FLAGS_RESERVED 0x00000080
>
>  #endif /* !defined (__HW_SPAPR_H__) */
> --
> 2.1.0
>
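One more note for anyone reading along: my understanding (worth
double-checking on the guest side) is that a guest walking
ibm,dynamic-memory skips reserved entries and only treats assigned ones
as usable memory, so flagging the RMA LMBs as reserved keeps them out of
hotplug handling while still letting them count toward the address-range
calculation above. Illustrative sketch only, reusing the flag values
from the patch; the helper name is made up and this is not the actual
kernel parser:

#include <stdbool.h>
#include <stdint.h>

#define SPAPR_LMB_FLAGS_ASSIGNED 0x00000008
#define SPAPR_LMB_FLAGS_RESERVED 0x00000080

/*
 * Decide whether an ibm,dynamic-memory entry describes memory the guest
 * should treat as usable/hot-pluggable. Reserved entries (e.g. the RMA
 * LMBs added by this patch) still contribute to the LMB count, but are
 * not onlined or hot-added through this path.
 */
static bool lmb_is_usable(uint32_t flags)
{
    if (flags & SPAPR_LMB_FLAGS_RESERVED) {
        return false;
    }
    return (flags & SPAPR_LMB_FLAGS_ASSIGNED) != 0;
}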