From: Auger Eric <eric.auger@redhat.com>
Date: Thu, 28 Feb 2019 08:48:18 +0100
Subject: Re: [Qemu-devel] [PATCH v7 00/17] ARM virt: Initial RAM expansion and PCDIMM/NVDIMM support
To: Igor Mammedov, Shameerali Kolothum Thodi
Cc: peter.maydell@linaro.org, drjones@redhat.com, david@redhat.com, Linuxarm, qemu-devel@nongnu.org, dgilbert@redhat.com, qemu-arm@nongnu.org, david@gibson.dropbear.id.au, eric.auger.pro@gmail.com
In-Reply-To: <20190227185123.314171ae@redhat.com>
References: <20190220224003.4420-1-eric.auger@redhat.com> <20190222172742.18c3835a@redhat.com> <20190225104212.7d40e65e@Igors-MacBook-Pro.local> <70249194-349e-37f6-0e8d-dc50b39082b7@redhat.com> <20190226175653.6ca2b6c4@Igors-MacBook-Pro.local> <20190227111025.4bb39cc7@redhat.com> <116c5375-0ff4-8f91-ac05-05a53e7fe206@redhat.com> <5FC3163CFD30C246ABAA99954A238FA8392D6690@lhreml524-mbs.china.huawei.com> <20190227185123.314171ae@redhat.com>

Hi Igor, Shameer,

On 2/27/19 6:51 PM, Igor Mammedov wrote:
> On Wed, 27 Feb 2019 10:41:45 +0000
> Shameerali Kolothum Thodi wrote:
>
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Auger Eric [mailto:eric.auger@redhat.com]
>>> Sent: 27 February 2019 10:27
>>> To: Igor Mammedov
>>> Cc: peter.maydell@linaro.org; drjones@redhat.com; david@redhat.com;
>>> dgilbert@redhat.com; Shameerali Kolothum Thodi; qemu-devel@nongnu.org;
>>> qemu-arm@nongnu.org; eric.auger.pro@gmail.com; david@gibson.dropbear.id.au
>>> Subject: Re: [Qemu-devel] [PATCH v7 00/17] ARM virt: Initial RAM expansion
>>> and PCDIMM/NVDIMM support
>>>
>>> Hi Igor, Shameer,
>>>
>>> On 2/27/19 11:10 AM, Igor Mammedov wrote:
>>>> On Tue, 26 Feb 2019 18:53:24 +0100
>>>> Auger Eric wrote:
>>>>
>>>>> Hi Igor,
>>>>>
>>>>> On 2/26/19 5:56 PM, Igor Mammedov wrote:
>>>>>> On Tue, 26 Feb 2019 14:11:58 +0100
>>>>>> Auger Eric wrote:
>>>>>>
>>>>>>> Hi Igor,
>>>>>>>
>>>>>>> On 2/26/19 9:40 AM, Auger Eric wrote:
>>>>>>>> Hi Igor,
>>>>>>>>
>>>>>>>> On 2/25/19 10:42 AM, Igor Mammedov wrote:
>>>>>>>>> On Fri, 22 Feb 2019 18:35:26 +0100
>>>>>>>>> Auger Eric wrote:
>>>>>>>>>
>>>>>>>>>> Hi Igor,
>>>>>>>>>>
>>>>>>>>>> On 2/22/19 5:27 PM, Igor Mammedov wrote:
>>>>>>>>>>> On Wed, 20 Feb 2019 23:39:46 +0100
>>>>>>>>>>> Eric Auger wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This series aims to bump the 255GB RAM limit in machvirt and to
>>>>>>>>>>>> support device memory in general, and especially PCDIMM/NVDIMM.
>>>>>>>>>>>>
>>>>>>>>>>>> In machvirt versions < 4.0, the initial RAM starts at 1GB and can
>>>>>>>>>>>> grow up to 255GB. From 256GB onwards we find IO regions such as the
>>>>>>>>>>>> additional GICv3 RDIST region, high PCIe ECAM region and high PCIe
>>>>>>>>>>>> MMIO region. The address map was 1TB large. This corresponded to
>>>>>>>>>>>> the max IPA capacity KVM was able to manage.
>>>>>>>>>>>>
>>>>>>>>>>>> Since 4.20, the host kernel is able to support a larger and dynamic
>>>>>>>>>>>> IPA range. So the guest physical address can go beyond 1TB. The
>>>>>>>>>>>> max GPA size depends on the host kernel configuration and physical CPUs.
>>>>>>>>>>>>
>>>>>>>>>>>> In this series we use this feature and allow the RAM to grow without
>>>>>>>>>>>> any other limit than the one put by the host kernel.
>>>>>>>>>>>>
>>>>>>>>>>>> The RAM still starts at 1GB. First comes the initial ram (-m) of size
>>>>>>>>>>>> ram_size and then comes the device memory (,maxmem) of size
>>>>>>>>>>>> maxram_size - ram_size. The device memory is potentially hotpluggable
>>>>>>>>>>>> depending on the instantiated memory objects.
>>>>>>>>>>>>
>>>>>>>>>>>> IO regions previously located between 256GB and 1TB are moved after
>>>>>>>>>>>> the RAM. Their offset is dynamically computed and depends on ram_size
>>>>>>>>>>>> and maxram_size. Size alignment is enforced.
>>>>>>>>>>>>
>>>>>>>>>>>> In case the maxmem value is inferior to 255GB, the legacy memory map
>>>>>>>>>>>> is still used. The change of memory map becomes effective from 4.0
>>>>>>>>>>>> onwards.
>>>>>>>>>>>>
>>>>>>>>>>>> As we keep the initial RAM at the 1GB base address, we do not need to
>>>>>>>>>>>> do invasive changes in the EDK2 FW. It seems nobody is eager to do
>>>>>>>>>>>> that job at the moment.
>>>>>>>>>>>>
>>>>>>>>>>>> Device memory being put just after the initial RAM, it is possible
>>>>>>>>>>>> to get access to this feature while keeping a 1TB address map.
>>>>>>>>>>>>
>>>>>>>>>>>> This series reuses/rebases patches initially submitted by Shameer
>>>>>>>>>>>> in [1] and Kwangwoo in [2] for the PC-DIMM and NV-DIMM parts.
>>>>>>>>>>>>
>>>>>>>>>>>> Functionally, the series is split into 3 parts:
>>>>>>>>>>>> 1) bump of the initial RAM limit [1 - 9] and change in
>>>>>>>>>>>> the memory map
>>>>>>>>>>>
>>>>>>>>>>>> 2) Support of PC-DIMM [10 - 13]
>>>>>>>>>>> Is this part complete ACPI wise (for coldplug)? I haven't noticed
>>>>>>>>>>> DSDT AML here nor E820 changes, so ACPI wise pc-dimm shouldn't be
>>>>>>>>>>> visible to the guest. It might be that DT is masking the problem
>>>>>>>>>>> but well, that won't work on ACPI only guests.
>>>>>>>>>>
>>>>>>>>>> guest /proc/meminfo or "lshw -class memory" reflects the amount of mem
>>>>>>>>>> added with the DIMM slots.
>>>>>>>>> Question is how does it get there? Does it come from DT or from firmware
>>>>>>>>> via UEFI interfaces?
>>>>>>>>>
>>>>>>>>>> So it looks fine to me. Isn't E820 a pure x86 matter?
>>>>>>>>> sorry for misleading, I meant UEFI GetMemoryMap().
>>>>>>>>> On x86, I'm wary of adding PC-DIMMs to E820 which then gets exposed
>>>>>>>>> via UEFI GetMemoryMap(), as the guest kernel might start using it as
>>>>>>>>> normal memory early at boot and later put that memory into zone normal
>>>>>>>>> and hence make it non-hot-un-pluggable. The same concerns apply to DT
>>>>>>>>> based means of discovery.
>>>>>>>>> (That's a guest issue but it's easy to work around by not putting
>>>>>>>>> hotpluggable memory into UEFI GetMemoryMap() or DT and letting the DSDT
>>>>>>>>> describe it properly.) That way memory doesn't get (ab)used by firmware
>>>>>>>>> or early boot kernel stages and doesn't get locked up.
>>>>>>>>>
>>>>>>>>>> What else would you expect in the dsdt?
>>>>>>>>> Memory device descriptions, look for code that adds PNP0C80 with _CRS
>>>>>>>>> describing memory ranges
>>>>>>>>
>>>>>>>> OK thank you for the explanations. I will work on the PNP0C80 addition then.
>>>>>>>> Does it mean that in ACPI mode we must not output DT hotplug memory
>>>>>>>> nodes, or, assuming that PNP0C80 is properly described, will it "override"
>>>>>>>> the DT description?
>>>>>>>
>>>>>>> After further investigations, I think the pieces you pointed out are
>>>>>>> added by Shameer's series, i.e. through the build_memory_hotplug_aml()
>>>>>>> call. So I suggest we separate the concerns: this series brings support
>>>>>>> for DIMM coldplug. Hotplug, including all the relevant ACPI structures,
>>>>>>> will be added later on by Shameer.
>>>>>>
>>>>>> Maybe we should not put pc-dimms in DT for this series until it gets clear
>>>>>> whether it conflicts with ACPI in some way.
>>>>>
>>>>> I guess you mean removing the DT hotpluggable memory nodes only in ACPI
>>>>> mode? Otherwise you simply remove the DIMM feature, right?
>>>> Something like this, so DT won't get in conflict with ACPI.
>>>> Only we don't have a switch for it, something like -machine fdt=on (with
>>>> default off).
>>>>
>>>>> I double checked and if you remove the hotpluggable memory DT nodes in
>>>>> ACPI mode:
>>>>> - you do not see the PCDIMM slots in guest /proc/meminfo anymore. So I
>>>>> guess you're right, if the DT nodes are available, that memory is
>>>>> considered as not unpluggable by the guest.
>>>>> - You can see the NVDIMM slots using ndctl list -u. You can mount a DAX
>>>>> file system.
>>>>>
>>>>> Hotplug/unplug is clearly not supported by this series and any attempt
>>>>> results in "memory hotplug is not supported". Is it really an issue if
>>>>> the guest does not consider DIMM slots as hot-unpluggable memory? I
>>>>> am not even sure the guest kernel would support unplugging that memory.
>>>>>
>>>>> In case we want all ACPI tables to be ready for making this memory seen
>>>>> as hot-unpluggable, we need some of Shameer's patches on top of this series.
>>>> Maybe we should push for this way (into 4.0), it's just several patches
>>>> after all, or even merge them into your series (I'd guess they would need
>>>> to be rebased on top of your latest work).
>>>
>>> Shameer, would you agree if we merge PATCH 1 of your RFC hotplug series
>>> (without the reduced hw_reduced_acpi flag) into this series and isolate in
>>> a second PATCH the acpi_memory_hotplug_init() + build_memory_hotplug_aml
>>> called in virt code?
> probably we can do it as a transitional step as we need a working mmio interface
> in place for build_memory_hotplug_aml() to work, provided it won't create
> migration issues (do we need VMSTATE_MEMORY_HOTPLUG for the cold-plug case?).
>
> What about a dummy initial GED (empty device) that manages the mmio region
> only, and then later it will be filled with the remaining logic and IRQ. In
> this case the mmio region and vmstate won't change (maybe) so it won't cause
> ABI or migration issues.
>
>
>> Sure, that's fine with me. So what would you use for the event_handler_method in
>> build_memory_hotplug_aml()? GPO0 device?
>
> A method name not defined in the spec, so it won't be called, might do.

At this point the event_handler_method, i.e. \_SB.GPO0._E02, is not
supposed to be called, right?
So effectively we should be able to use any other method name (unlinked
to any GPIO/GED). I guess at this stage only the PNP0C80 definition
blocks + methods are used.

What still remains fuzzy for me is that in the cold plug case only the
mmio hotplug control region is read (apart from the slot selection, of
course), and it returns 0 for addr/size and also for the flags, meaning
the slot is not enabled. So although the slots are advertised as
hotpluggable/enabled in the SRAT, I am not sure it actually makes any
difference to the OS whether the DSDT definition blocks are described
or not.

To be honest I am afraid it is too late to add those additional
features for 4.0 now. This would jeopardize the first preliminary part,
which is the introduction of the new memory map, allowing the expansion
of the initial RAM and paving the way for device memory introduction.
So I think I am going to resend the first 10 patches as a standalone
series, and we can iterate on the PCDIMM/NVDIMM parts independently.

Thanks

Eric

>
>
>>
>> Thanks,
>> Shameer
>>
>>> Then would remain the actual GED/GPIO integration.
>>>
>>> Thanks
>>>
>>> Eric
>>>>
>>>>> Also, don't DIMM slots already make sense in DT mode? Usually we accept
>>>>> to add one feature in DT and then in ACPI. For instance we can benefit
>>>> usually it doesn't conflict with each other (at least I'm not aware of it)
>>>> but I see a problem with it in this case.
>>>>
>>>>> from nvdimm in dt mode, right? So, considering an incremental approach I
>>>>> would be in favour of keeping the DT nodes.
>>>> I'd guess it is the same as for DIMMs, ACPI support for NVDIMMs is much
>>>> more versatile.
>>>>
>>>> I consider the target application of arm/virt to be a board that's used to
>>>> run generic ACPI capable guests in production in most use cases and
>>>> various DT only guests as secondary ones. It's hard to make
>>>> both use cases happy with defaults (that's probably one of the
>>>> reasons why the 'sbsa' board is being added).
>>>>
>>>> So I'd give priority to ACPI based arm/virt versus DT when defaults are
>>>> considered.
>>>>
>>>>> Thanks
>>>>>
>>>>> Eric
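
As a rough illustration of the memory map described in the cover letter
quoted above (initial RAM kept at its 1GB base, device memory placed
right behind it, and the relocated high IO regions after that), here is a
minimal standalone C sketch. The names (RAM_BASE, round_up, high_io_base)
and the exact alignment policy are illustrative assumptions only, not the
actual hw/arm/virt.c code:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define GiB (1ULL << 30)

/* Assumption: initial RAM keeps its legacy 1GB base so EDK2 needs no change. */
#define RAM_BASE GiB

static uint64_t round_up(uint64_t val, uint64_t align)
{
    /* align is assumed to be a power of two */
    return (val + align - 1) & ~(align - 1);
}

/* Device memory (-m ...,maxmem=...) sits right after the initial RAM;
 * the relocated high IO regions (extra GICv3 redistributors, high ECAM,
 * high MMIO) are then placed behind it, aligned to their own size. */
static uint64_t high_io_base(uint64_t ram_size, uint64_t maxram_size,
                             uint64_t io_region_size)
{
    uint64_t device_mem_base = round_up(RAM_BASE + ram_size, GiB);
    uint64_t device_mem_size = maxram_size - ram_size;

    return round_up(device_mem_base + device_mem_size, io_region_size);
}

int main(void)
{
    /* e.g. -m 256G,maxmem=512G and a 256MB high redistributor region */
    printf("high IO regions would start at 0x%" PRIx64 "\n",
           high_io_base(256 * GiB, 512 * GiB, 256ULL << 20));
    return 0;
}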