From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([209.51.188.92]:42439) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gzMK3-0006RX-1M for qemu-devel@nongnu.org; Thu, 28 Feb 2019 09:06:04 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gzMJu-0007o8-AH for qemu-devel@nongnu.org; Thu, 28 Feb 2019 09:06:02 -0500
Date: Thu, 28 Feb 2019 15:05:31 +0100
From: Igor Mammedov
Message-ID: <20190228150531.184f5974@redhat.com>
In-Reply-To:
References: <20190220224003.4420-1-eric.auger@redhat.com>
 <20190222172742.18c3835a@redhat.com>
 <20190225104212.7d40e65e@Igors-MacBook-Pro.local>
 <70249194-349e-37f6-0e8d-dc50b39082b7@redhat.com>
 <20190226175653.6ca2b6c4@Igors-MacBook-Pro.local>
 <20190227111025.4bb39cc7@redhat.com>
 <116c5375-0ff4-8f91-ac05-05a53e7fe206@redhat.com>
 <5FC3163CFD30C246ABAA99954A238FA8392D6690@lhreml524-mbs.china.huawei.com>
 <20190227185123.314171ae@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH v7 00/17] ARM virt: Initial RAM expansion
 and PCDIMM/NVDIMM support
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Auger Eric
Cc: Shameerali Kolothum Thodi , "peter.maydell@linaro.org" ,
 "drjones@redhat.com" , "david@redhat.com" , Linuxarm ,
 "qemu-devel@nongnu.org" , "dgilbert@redhat.com" , "qemu-arm@nongnu.org" ,
 "david@gibson.dropbear.id.au" , "eric.auger.pro@gmail.com"

On Thu, 28 Feb 2019 08:48:18 +0100
Auger Eric wrote:

> Hi Igor, Shameer,
> 
> On 2/27/19 6:51 PM, Igor Mammedov wrote:
> > On Wed, 27 Feb 2019 10:41:45 +0000
> > Shameerali Kolothum Thodi wrote:
> >   
> >> Hi Eric,
> >>   
> >>> -----Original Message-----
> >>> From: Auger Eric [mailto:eric.auger@redhat.com]
> >>> Sent: 27 February 2019 10:27
> >>> To: Igor Mammedov
> >>> Cc: peter.maydell@linaro.org; drjones@redhat.com; david@redhat.com;
> >>> dgilbert@redhat.com; Shameerali Kolothum Thodi
> >>> ; qemu-devel@nongnu.org;
> >>> qemu-arm@nongnu.org; eric.auger.pro@gmail.com;
> >>> david@gibson.dropbear.id.au
> >>> Subject: Re: [Qemu-devel] [PATCH v7 00/17] ARM virt: Initial RAM expansion
> >>> and PCDIMM/NVDIMM support
> >>>
> >>> Hi Igor, Shameer,
> >>>
> >>> On 2/27/19 11:10 AM, Igor Mammedov wrote:
> >>>> On Tue, 26 Feb 2019 18:53:24 +0100
> >>>> Auger Eric wrote:
> >>>>   
> >>>>> Hi Igor,
> >>>>>
> >>>>> On 2/26/19 5:56 PM, Igor Mammedov wrote:
> >>>>>> On Tue, 26 Feb 2019 14:11:58 +0100
> >>>>>> Auger Eric wrote:
> >>>>>>   
> >>>>>>> Hi Igor,
> >>>>>>>
> >>>>>>> On 2/26/19 9:40 AM, Auger Eric wrote:
> >>>>>>>> Hi Igor,
> >>>>>>>>
> >>>>>>>> On 2/25/19 10:42 AM, Igor Mammedov wrote:
> >>>>>>>>> On Fri, 22 Feb 2019 18:35:26 +0100
> >>>>>>>>> Auger Eric wrote:
> >>>>>>>>>   
> >>>>>>>>>> Hi Igor,
> >>>>>>>>>>
> >>>>>>>>>> On 2/22/19 5:27 PM, Igor Mammedov wrote:
> >>>>>>>>>>> On Wed, 20 Feb 2019 23:39:46 +0100
> >>>>>>>>>>> Eric Auger wrote:
> >>>>>>>>>>>   
> >>>>>>>>>>>> This series aims to bump the 255GB RAM limit in machvirt and to
> >>>>>>>>>>>> support device memory in general, and especially PCDIMM/NVDIMM.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In machvirt versions < 4.0, the initial RAM starts at 1GB and can
> >>>>>>>>>>>> grow up to 255GB. From 256GB onwards we find IO regions such as the
> >>>>>>>>>>>> additional GICv3 RDIST region, high PCIe ECAM region and high PCIe
> >>>>>>>>>>>> MMIO region. The address map was 1TB large.
> >>>>>>>>>>>> This corresponded to
> >>>>>>>>>>>> the max IPA capacity KVM was able to manage.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Since 4.20, the host kernel is able to support a larger and dynamic
> >>>>>>>>>>>> IPA range. So the guest physical address can go beyond 1TB. The
> >>>>>>>>>>>> max GPA size depends on the host kernel configuration and physical CPUs.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In this series we use this feature and allow the RAM to grow without
> >>>>>>>>>>>> any limit other than the one set by the host kernel.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The RAM still starts at 1GB. First comes the initial RAM (-m) of size
> >>>>>>>>>>>> ram_size and then comes the device memory (,maxmem) of size
> >>>>>>>>>>>> maxram_size - ram_size. The device memory is potentially hotpluggable
> >>>>>>>>>>>> depending on the instantiated memory objects.
> >>>>>>>>>>>>
> >>>>>>>>>>>> IO regions previously located between 256GB and 1TB are moved after
> >>>>>>>>>>>> the RAM. Their offset is dynamically computed and depends on ram_size
> >>>>>>>>>>>> and maxram_size. Size alignment is enforced.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In case the maxmem value is less than 255GB, the legacy memory map
> >>>>>>>>>>>> is still used. The change of memory map becomes effective from 4.0
> >>>>>>>>>>>> onwards.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As we keep the initial RAM at the 1GB base address, we do not need to do
> >>>>>>>>>>>> invasive changes in the EDK2 FW. It seems nobody is eager to do
> >>>>>>>>>>>> that job at the moment.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Since the device memory is put just after the initial RAM, it is possible
> >>>>>>>>>>>> to get access to this feature while keeping a 1TB address map.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This series reuses/rebases patches initially submitted by Shameer
> >>>>>>>>>>>> in [1] and Kwangwoo in [2] for the PC-DIMM and NV-DIMM parts.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Functionally, the series is split into 3 parts:
> >>>>>>>>>>>> 1) bump of the initial RAM limit [1 - 9] and change in
> >>>>>>>>>>>>    the memory map
> >>>>>>>>>>>   
> >>>>>>>>>>>> 2) Support of PC-DIMM [10 - 13]
> >>>>>>>>>>> Is this part complete ACPI-wise (for coldplug)? I haven't noticed
> >>>>>>>>>>> DSDT AML here nor E820 changes, so ACPI-wise pc-dimm shouldn't be
> >>>>>>>>>>> visible to the guest. It might be that DT is masking the problem,
> >>>>>>>>>>> but that won't work on ACPI-only guests.
> >>>>>>>>>>
> >>>>>>>>>> guest /proc/meminfo or "lshw -class memory" reflects the amount of mem
> >>>>>>>>>> added with the DIMM slots.
> >>>>>>>>> The question is how does it get there? Does it come from DT or from firmware
> >>>>>>>>> via UEFI interfaces?
> >>>>>>>>>   
> >>>>>>>>>> So it looks fine to me. Isn't E820 a pure x86 matter?
> >>>>>>>>> Sorry for misleading, I meant UEFI GetMemoryMap().
> >>>>>>>>> On x86, I'm wary of adding PC-DIMMs to E820, which then gets exposed
> >>>>>>>>> via UEFI GetMemoryMap(), as the guest kernel might start using it as normal
> >>>>>>>>> memory early at boot and later put that memory into zone normal and hence
> >>>>>>>>> make it non-hot-un-pluggable. The same concerns apply to DT based means
> >>>>>>>>> of discovery.
> >>>>>>>>> (That's a guest issue, but it's easy to work around by not putting hotpluggable
> >>>>>>>>> memory into UEFI GetMemoryMap() or DT and letting DSDT describe it properly.)
> >>>>>>>>> That way memory doesn't get (ab)used by firmware or early boot kernel stages
> >>>>>>>>> and doesn't get locked up.
> >>>>>>>>>   
> >>>>>>>>>> What else would you expect in the DSDT?
> >>>>>>>>> Memory device descriptions; look for code that adds PNP0C80 with _CRS
> >>>>>>>>> describing memory ranges.
> >>>>>>>>
> >>>>>>>> OK, thank you for the explanations. I will work on the PNP0C80 addition then.
> >>>>>>>> Does it mean that in ACPI mode we must not output DT hotplug memory
> >>>>>>>> nodes, or, assuming that PNP0C80 is properly described, will it "override"
> >>>>>>>> the DT description?
> >>>>>>>
> >>>>>>> After further investigation, I think the pieces you pointed out are
> >>>>>>> added by Shameer's series, i.e. through the build_memory_hotplug_aml()
> >>>>>>> call. So I suggest we separate the concerns: this series brings support
> >>>>>>> for DIMM coldplug. Hotplug, including all the relevant ACPI structures,
> >>>>>>> will be added later on by Shameer.
> >>>>>>
> >>>>>> Maybe we should not put pc-dimms in DT for this series until it gets clear
> >>>>>> whether it conflicts with ACPI in some way.
> >>>>>
> >>>>> I guess you mean removing the DT hotpluggable memory nodes only in ACPI
> >>>>> mode? Otherwise you simply remove the DIMM feature, right?
> >>>> Something like this, so DT won't get in conflict with ACPI.
> >>>> Only we don't have a switch for it, something like -machine fdt=on (with
> >>>> default off).
> >>>>   
> >>>>> I double-checked and if you remove the hotpluggable memory DT nodes in
> >>>>> ACPI mode:
> >>>>> - you do not see the PCDIMM slots in guest /proc/meminfo anymore. So I
> >>>>> guess you're right: if the DT nodes are available, that memory is
> >>>>> considered as not unpluggable by the guest.
> >>>>> - You can see the NVDIMM slots using ndctl list -u. You can mount a DAX
> >>>>> system.
> >>>>>
> >>>>> Hotplug/unplug is clearly not supported by this series and any attempt
> >>>>> results in "memory hotplug is not supported". Is it really an issue if
> >>>>> the guest does not consider DIMM slots as hot-unpluggable memory? I
> >>>>> am not even sure the guest kernel would support unplugging that memory.
> >>>>>
> >>>>> In case we want all ACPI tables to be ready for making this memory seen
> >>>>> as hot-unpluggable, we need some of Shameer's patches on top of this series.
> >>>> Maybe we should push for this (into 4.0), it's just several patches
> >>>> after all, or even merge them in your series (I'd guess they would need to be
> >>>> rebased on top of your latest work).
> >>>
> >>> Shameer, would you agree if we merge PATCH 1 of your RFC hotplug series
> >>> (without the reduced hw_reduced_acpi flag) in this series and isolate in
> >>> a second PATCH the acpi_memory_hotplug_init() + build_memory_hotplug_aml()
> >>> called in virt code?
> > Probably we can do it as a transitional step, as we need a working MMIO interface
> > in place for build_memory_hotplug_aml() to work, provided it won't create
> > migration issues (do we need VMSTATE_MEMORY_HOTPLUG for the cold-plug case?).
> > 
> > What about a dummy initial GED (empty device) that manages the MMIO region only
> > and then later will be filled with the remaining logic/IRQ.
> > In this case the MMIO region
> > and vmstate won't change (maybe), so it won't cause ABI or migration issues.
> 
> 
> > 
> >   
> >> Sure, that's fine with me. So what would you use for the event_handler_method in
> >> build_memory_hotplug_aml()? GPO0 device?
> > 
> > A method name not defined in the spec, so it won't be called, might do.
> 
> At this point the event_handler_method, i.e. \_SB.GPO0._E02, is not
> supposed to be called, right? So effectively we should be able to use
> any other method name (unlinked to any GPIO/GED). I guess at this stage
> only the PNP0C80 definition blocks + methods are used.
pretty much yes.

> What still remains fuzzy for me is that in case of cold plug only the MMIO hotplug
> control region part is read (besides the slot selection, of course)
> and returns 0 for addr/size and also for the flags, meaning the slot is not
> enabled.
If you mean the guest reads 0s, then it looks broken; could you show a trace log
with mhp_* tracepoints enabled during a DIMM hotplug?

> So although the slots are advertised as hotpluggable/enabled in
> the SRAT, I am not sure it actually makes any difference for the OS
> whether the DSDT definition blocks are described or not.
SRAT isn't used for informing guests about the amount of present RAM;
it holds affinity information for present and possible RAM.

> To be honest I am afraid it is too late to add those additional
> features for 4.0 now. This is going to jeopardize the first preliminary
> part, which is the introduction of the new memory map, allowing the
> expansion of the initial RAM and paving the way for device memory
> introduction. So I think I am going to resend the first 10 patches as a
> standalone series. And we can iterate on the PCDIMM/NVDIMM parts
> independently.
sounds good to me, I'll try to review 1-10 today

> Thanks
> 
> Eric
> > 
> >   
> >>
> >> Thanks,
> >> Shameer
> >>   
> >>> Then would remain the actual GED/GPIO integration.
> >>>
> >>> Thanks
> >>>
> >>> Eric
> >>>>   
> >>>>> Also, don't DIMM slots already make sense in DT mode? Usually we accept
> >>>>> adding one feature in DT and then in ACPI. For instance we can benefit
> >>>> Usually they don't conflict with each other (at least I'm not aware of it)
> >>>> but I see a problem with it in this case.
> >>>>   
> >>>>> from nvdimm in DT mode, right? So, considering an incremental approach I
> >>>>> would be in favour of keeping the DT nodes.
> >>>> I'd guess it is the same as for DIMMs; ACPI support for NVDIMMs is much
> >>>> more versatile.
> >>>>
> >>>> I consider the target application of arm/virt to be a board that's used to
> >>>> run a generic ACPI-capable guest in production in most use cases, with
> >>>> various DT-only guests as secondary ones. It's hard to make
> >>>> both use cases happy with defaults (that's probably one of the
> >>>> reasons why the 'sbsa' board is being added).
> >>>>
> >>>> So I'd give priority to ACPI-based arm/virt versus DT when defaults are
> >>>> considered.
> >>>>   
> >>>>> Thanks
> >>>>>
> >>>>> Eric
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>   
> >>>>   
> > 
> >   
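
For reference, the "dummy initial GED" idea discussed above is an empty device
that only claims the memory-hotplug MMIO window, so the guest-visible layout
(and any vmstate) stays stable when the event/IRQ logic is filled in later. A
minimal sketch of such a sysbus device could look roughly like the code below.
This is an illustration only, not part of the posted series or of QEMU's actual
GED implementation: the type name, window size and register behaviour are
assumptions made for the example.

/*
 * Illustrative sketch only: a minimal "empty" GED-like sysbus device that
 * just claims an MMIO window. Names and sizes are assumed, not QEMU API.
 */
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "hw/sysbus.h"

#define TYPE_VIRT_DUMMY_GED "virt-dummy-ged"    /* assumed type name */
#define VIRT_DUMMY_GED_MMIO_SIZE 0x1000         /* assumed window size */

typedef struct VirtDummyGedState {
    SysBusDevice parent_obj;
    MemoryRegion mmio;
} VirtDummyGedState;

#define VIRT_DUMMY_GED(obj) \
    OBJECT_CHECK(VirtDummyGedState, (obj), TYPE_VIRT_DUMMY_GED)

static uint64_t virt_dummy_ged_read(void *opaque, hwaddr addr, unsigned size)
{
    /* No hotplug logic yet: every register reads back as zero. */
    return 0;
}

static void virt_dummy_ged_write(void *opaque, hwaddr addr, uint64_t data,
                                 unsigned size)
{
    /* Writes are accepted and ignored until the real logic is added. */
}

static const MemoryRegionOps virt_dummy_ged_ops = {
    .read = virt_dummy_ged_read,
    .write = virt_dummy_ged_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

static void virt_dummy_ged_realize(DeviceState *dev, Error **errp)
{
    VirtDummyGedState *s = VIRT_DUMMY_GED(dev);

    /*
     * Claim the MMIO window up front so the guest-visible layout stays
     * stable when the IRQ/event handling is added later.
     */
    memory_region_init_io(&s->mmio, OBJECT(dev), &virt_dummy_ged_ops, s,
                          TYPE_VIRT_DUMMY_GED, VIRT_DUMMY_GED_MMIO_SIZE);
    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &s->mmio);
}

static void virt_dummy_ged_class_init(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);

    dc->realize = virt_dummy_ged_realize;
}

static const TypeInfo virt_dummy_ged_info = {
    .name          = TYPE_VIRT_DUMMY_GED,
    .parent        = TYPE_SYS_BUS_DEVICE,
    .instance_size = sizeof(VirtDummyGedState),
    .class_init    = virt_dummy_ged_class_init,
};

static void virt_dummy_ged_register_types(void)
{
    type_register_static(&virt_dummy_ged_info);
}

type_init(virt_dummy_ged_register_types)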