From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:45004) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1cwCZT-0001bB-EV for qemu-devel@nongnu.org; Thu, 06 Apr 2017 14:55:54 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1cwCZO-0001jj-FV for qemu-devel@nongnu.org; Thu, 06 Apr 2017 14:55:51 -0400 Received: from mx1.redhat.com ([209.132.183.28]:37272) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1cwCZO-0001iy-5L for qemu-devel@nongnu.org; Thu, 06 Apr 2017 14:55:46 -0400 References: <76795e20-2f20-1e54-cfa5-7444f28b18ee@huawei.com> <20170321113428.GC15920@cbox> <58D17AF0.2010802@arm.com> <20170321193933.GB31111@cbox> <58DA3F68.6090901@arm.com> <20170328112328.GA31156@cbox> <20170328115413.GJ23682@e104320-lin> <58DA67BA.8070404@arm.com> <5b7352f4-4965-3ed5-3879-db871797be47@huawei.com> <20170329103658.GQ23682@e104320-lin> <2a427164-9b37-6711-3a56-906634ba7f12@redhat.com> <7c5c8ab7-8fcc-1c98-0bc1-cccb66c4c84d@huawei.com> From: Laszlo Ersek Message-ID: <6ac1597a-2ed5-36b2-848d-5fd048b16d66@redhat.com> Date: Thu, 6 Apr 2017 20:55:34 +0200 MIME-Version: 1.0 In-Reply-To: <7c5c8ab7-8fcc-1c98-0bc1-cccb66c4c84d@huawei.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH] kvm: pass the virtual SEI syndrome to guest OS List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: gengdongjiu , Achin Gupta Cc: ard.biesheuvel@linaro.org, edk2-devel@lists.01.org, qemu-devel@nongnu.org, zhaoshenglong@huawei.com, James Morse , Christoffer Dall , xiexiuqi@huawei.com, Marc Zyngier , catalin.marinas@arm.com, will.deacon@arm.com, christoffer.dall@linaro.org, rkrcmar@redhat.com, suzuki.poulose@arm.com, andre.przywara@arm.com, mark.rutland@arm.com, vladimir.murzin@arm.com, linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, wangxiongfeng2@huawei.com, wuquanming@huawei.com, huangshaoyu@huawei.com, Leif.Lindholm@linaro.comnd@arm.com, Michael Tsirkin , Igor Mammedov On 04/06/17 14:35, gengdongjiu wrote: > Dear, Laszlo > Thanks for your detailed explanation. >=20 > On 2017/3/29 19:58, Laszlo Ersek wrote: >> (This ought to be one of the longest address lists I've ever seen :) >> Thanks for the CC. I'm glad Shannon is already on the CC list. For goo= d >> measure, I'm adding MST and Igor.) >> >> On 03/29/17 12:36, Achin Gupta wrote: >>> Hi gengdongjiu, >>> >>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote: >>>> >>>> Hi Laszlo/Biesheuvel/Qemu developer, >>>> >>>> Now I encounter a issue and want to consult with you in ARM64 pla= tform=EF=BC=8C as described below: >>>> >>>> when guest OS happen synchronous or asynchronous abort, kvm needs >>>> to send the error address to Qemu or UEFI through sigbus to >>>> dynamically generate APEI table. from my investigation, there are >>>> two ways: >>>> >>>> (1) Qemu get the error address, and generate the APEI table, then >>>> notify UEFI to know this generation, then inject abort error to >>>> guest OS, guest OS read the APEI table. >>>> (2) Qemu get the error address, and let UEFI to generate the APEI >>>> table, then inject abort error to guest OS, guest OS read the APEI >>>> table. >>> >>> Just being pedantic! I don't think we are talking about creating the = APEI table >>> dynamically here. The issue is: Once KVM has received an error that i= s destined >>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can injec= t the error >>> into the guest OS, a CPER (Common Platform Error Record) has to be ge= nerated >>> corresponding to the error source (GHES corresponding to memory subsy= stem, >>> processor etc) to allow the guest OS to do anything meaningful with t= he >>> error. So who should create the CPER is the question. >>> >>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an erro= r arrives >>> at EL3 and secure firmware (at EL3 or a lower secure exception level)= is >>> responsible for creating the CPER. ARM is experimenting with using a = Standalone >>> MM EDK2 image in the secure world to do the CPER creation. This will = avoid >>> adding the same code in ARM TF in EL3 (better for security). The erro= r will then >>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM= Trusted >>> Firmware. >>> >>> Qemu is essentially fulfilling the role of secure firmware at the EL2= /EL1 >>> interface (as discussed with Christoffer below). So it should generat= e the CPER >>> before injecting the error. >>> >>> This is corresponds to (1) above apart from notifying UEFI (I am assu= ming you >>> mean guest UEFI). At this time, the guest OS already knows where to p= ick up the >>> CPER from through the HEST. Qemu has to create the CPER and populate = its address >>> at the address exported in the HEST. Guest UEFI should not be involve= d in this >>> flow. Its job was to create the HEST at boot and that has been done b= y this >>> stage. >>> >>> Qemu folk will be able to add but it looks like support for CPER gene= ration will >>> need to be added to Qemu. We need to resolve this. >>> >>> Do shout if I am missing anything above. >> >> After reading this email, the use case looks *very* similar to what >> we've just done with VMGENID for QEMU 2.9. >> >> We have a facility between QEMU and the guest firmware, called "ACPI >> linker/loader", with which QEMU instructs the firmware to >> >> - allocate and download blobs into guest RAM (AcpiNVS type memory) -- >> ALLOCATE command, >> >> - relocate pointers in those blobs, to fields in other (or the same) >> blobs -- ADD_POINTER command, >> >> - set ACPI table checksums -- ADD_CHECKSUM command, >> >> - and send GPAs of fields within such blobs back to QEMU -- >> WRITE_POINTER command. >> >> This is how I imagine we can map the facility to the current use case >> (note that this is the first time I read about HEST / GHES / CPER): >> >> etc/acpi/tables etc/hardware_errors >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D =3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> +-----------+ >> +--------------+ | address | +-> +--------------+ >> | HEST + | registers | | | Error Status | >> + +------------+ | +---------+ | | Data Block 1 | >> | | GHES | --> | | address | --------+ | +------------+ >> | | GHES | --> | | address | ------+ | | CPER | >> | | GHES | --> | | address | ----+ | | | CPER | >> | | GHES | --> | | address | -+ | | | | CPER | >> +-+------------+ +-+---------+ | | | +-+------------+ >> | | | >> | | +---> +--------------+ >> | | | Error Status | >> | | | Data Block 2 | >> | | | +------------+ >> | | | | CPER | >> | | | | CPER | >> | | +-+------------+ >> | | >> | +-----> +--------------+ >> | | Error Status | >> | | Data Block 3 | >> | | +------------+ >> | | | CPER | >> | +-+------------+ >> | >> +--------> +--------------+ >> | Error Status | >> | Data Block 4 | >> | +------------+ >> | | CPER | >> | | CPER | >> | | CPER | >> +-+------------+ >> >> (1) QEMU generates the HEST ACPI table. This table goes in the current >> "etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N >> GHES objects in the HEST. >> >> (2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU >> also populates this blob. >> >> (2a) Given N error sources, the (unnamed) table of address registers >> will contain N address registers. >> >> (2b) Given N error sources, the "etc/hardwre_errors" fw_cfg blob will >> also contain N Error Status Data Blocks. >> >> I don't know about the sizing (number of CPERs) each Error Status Data >> Block has to contain, but I understand it is all pre-allocated as far = as >> the OS is concerned, which matches our capabilities well. > here I have a question. as you comment: " 'etc/hardwre_errors' fw_cfg b= lob will also contain N Error Status Data Blocks", > Because the CPER numbers is not fixed, how to assign each "Error Status= Data Block" size using one "etc/hardwre_errors" fw_cfg blob. > when use one etc/hardwre_errors, will the N Error Status Data Block use= one continuous buffer? as shown below. if so, maybe it not convenient fo= r each data block size extension. > I see the bios_linker_loader_alloc will allocate one continuous buffer = for a blob(such as VMGENID_GUID_FW_CFG_FILE) >=20 > /* Allocate guest memory for the Data fw_cfg blob */ > bios_linker_loader_alloc(linker, VMGENID_GUID_FW_CFG_FILE, guid, 40= 96, > false /* page boundary, high memory */); >=20 >=20 >=20 > -> +--------------+ > | HEST + | registers | | Error Status | > + +------------+ | +---------+ | Data Block | > | | GHES | --> | | address | --------+-->| +------------+ > | | GHES | --> | | address | ------+ | | CPER | > | | GHES | --> | | address | ----+ | | | CPER | > | | GHES | --> | | address | -+ | | | | CPER | > +-+------------+ +-+---------+ | | +---> +--------------+ > | | | | CPER | > | | | | CPER | > | +-----> +--------------+ > | | | CPER | > +--------> +--------------+ > | | CPER | > | | CPER | > | | CPER | > +-+------------+ >=20 >=20 >=20 > so how about we use separate etc/hardwre_errorsN for each Error Status = status Block? then >=20 > etc/hardwre_errors0 > etc/hardwre_errors1 > ................... > etc/hardwre_errors10 > (the max N is 10) >=20 >=20 > the N can be one of below values, according to ACPI spec "Table 18-345 = Hardware Error Notification Structure" > 0 =E2=80=93 Polled > 1 =E2=80=93 External Interrupt > 2 =E2=80=93 Local Interrupt > 3 =E2=80=93 SCI > 4 =E2=80=93 NMI > 5 - CMCI > 6 - MCE > 7 - GPIO-Signal > 8 - ARMv8 SEA > 9 - ARMv8 SEI > 10 - External Interrupt - GSIV I'm unsure if, by "not fixed", you are saying the number of CPER entries that fits in Error Status Data Block N is not *uniform* across 0 <=3D N <=3D 10 [1] or the number of CPER entries that fits in Error Status Data Block N is not *known* in advance, for all of 0 <=3D N <=3D 10 [2] Which one is your point? If [1], that's no problem; you can simply sum the individual error status data block sizes in advance, and allocate "etc/hardware_errors" accordingly, using the total size. (Allocating one shared fw_cfg blob for all status data blocks is more memory efficient, as each ALLOCATE command will allocate whole pages (rounded up from the actual blob size).) If your point is [2], then splitting the error status data blocks to separate fw_cfg blobs makes no difference: regardless of whether we try to place all the error status data blocks in a single fw_cfg blob, or in separate fw_cfg blobs, the individual data block cannot be resized at OS runtime, so there's no way to make it work. Thanks, Laszlo >=20 >=20 >=20 >=20 >> >> (3) QEMU generates the ACPI linker/loader script for the firmware, as >> always. >> >> (3a) The HEST table is part of "etc/acpi/tables", which the firmware >> already allocates memory for, and downloads (because QEMU already >> generates an ALLOCATE linker/loader command for it already). >> >> (3b) QEMU will have to create another ALLOCATE command for the >> "etc/hardware_errors" blob. The firmware allocates memory for this blo= b, >> and downloads it. >> >> (4) QEMU generates, in the ACPI linker/loader script for the firwmare,= N >> ADD_POINTER commands, which point the GHES."Error Status >> Address" fields in the HEST table, to the corresponding address >> registers in the downloaded "etc/hardware_errors" blob. >> >> (5) QEMU generates an ADD_CHECKSUM command for the firmware, so that t= he >> HEST table is correctly checksummed after executing the N ADD_POINTER >> commands from (4). >> >> (6) QEMU generates N ADD_POINTER commands for the firmware, pointing t= he >> address registers (located in guest memory, in the downloaded >> "etc/hardware_errors" blob) to the respective Error Status Data Blocks= . >> >> (7) (This is the trick.) For this step, we need a third, write-only >> fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the >> firmware can send back the guest-side allocation addresses to QEMU. >> >> Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries. >> QEMU generates N WRITE_POINTER commands for the firmware. >> >> For error source K (0 <=3D K < N), QEMU instructs the firmware to >> calculate the guest address of Error Status Data Block K, from the >> QEMU-dictated offset within "etc/hardware_errors", and from the >> guest-determined allocation base address for "etc/hardware_errors". Th= e >> firmware then writes the calculated address back to fw_cfg file >> "etc/hardware_errors_addr", at offset K*8, according to the >> WRITE_POINTER command. >> >> This way QEMU will know the GPA of each Error Status Data Block. >> >> (In fact this can be simplified to a single WRITE_POINTER command: the >> address of the "address register table" can be sent back to QEMU as >> well, which already contains all Error Status Data Block addresses.) >> >> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to co= me >> through a signalfd -- QEMU can format the CPER right into guest memory= , >> and then inject whatever interrupt (or assert whatever GPIO line) is >> necessary for notifying the guest. >> >> (9) This notification (in virtual hardware) can either be handled by t= he >> guest kernel stand-alone, or else the guest kernel can invoke an ACPI >> event handler method with it (which would be in the DSDT or one of the >> SSDTs, also generated by QEMU). The ACPI event handler method could >> invoke the specific guest kernel driver for errror handling via a >> Notify() operation. >> >> I'm attracted to the above design because: >> - it would leave the firmware alone after OS boot, and >> - it would leave the firmware blissfully ignorant about HEST, GHES, >> CPER, and the like. (That's why QEMU's ACPI linker/loader was invented >> in the first place.) >> >> Thanks >> Laszlo >> >>>> Do you think which modules generates the APEI table is better? UE= FI or Qemu? >>>> >>>> >>>> >>>> >>>> On 2017/3/28 21:40, James Morse wrote: >>>>> Hi gengdongjiu, >>>>> >>>>> On 28/03/17 13:16, gengdongjiu wrote: >>>>>> On 2017/3/28 19:54, Achin Gupta wrote: >>>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote: >>>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote: >>>>>>>>> On the host, part of UEFI is involved to generate the CPER reco= rds. >>>>>>>>> In a guest?, I don't know. >>>>>>>>> Qemu could generate the records, or drive some other component = to do it. >>>>>>>> >>>>>>>> I think I am beginning to understand this a bit. Since the guet= UEFI >>>>>>>> instance is specifically built for the machine it runs on, QEMU'= s virt >>>>>>>> machine in this case, they could simply agree (by some contract)= to >>>>>>>> place the records at some specific location in memory, and if th= e guest >>>>>>>> kernel asks its guest UEFI for that location, things should just= work by >>>>>>>> having logic in QEMU to process error reports and populate guest= memory. >>>>>>>> >>>>>>>> Is this how others see the world too? >>>>>>> >>>>>>> I think so! >>>>>>> >>>>>>> AFAIU, the memory where CPERs will reside should be specified in = a GHES entry in >>>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest= UEFI creates a >>>>>>> HEST for the guest Kernel? >>>>>>> >>>>>>> If so, then the question is how the guest UEFI finds out where QE= MU (acting as >>>>>>> EL3 firmware) will populate the CPERs. This could either be a con= tract between >>>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [= 1]) to ask QEMU >>>>>>> where the memory is. >>>>>> >>>>>> whether invoke the guest UEFI will be complex? not see the advanta= ge. it seems x86 Qemu >>>>>> directly generate the ACPI table, but I am not sure, we are checki= ng the qemu >>>>> logical. >>>>>> let Qemu generate CPER record may be clear. >>>>> >>>>> At boot UEFI in the guest will need to make sure the areas of memor= y that may be >>>>> used for CPER records are reserved. Whether UEFI or Qemu decides wh= ere these are >>>>> needs deciding, (but probably not here)... >>>>> >>>>> At runtime, when an error has occurred, I agree it would be simpler= (fewer >>>>> components involved) if Qemu generates the CPER records. But if UEF= I made the >>>>> memory choice above they need to interact and it gets complicated a= gain. The >>>>> CPER records are defined in the UEFI spec, so I would expect UEFI t= o contain >>>>> code to generate/parse them. >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> James >>>>> >>>>> >>>>> . >>>>> >>>> >> >> >> . >> >=20