From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:57525) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fEGkV-0005XD-SR for qemu-devel@nongnu.org; Thu, 03 May 2018 12:06:30 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fEGkQ-0003VE-O0 for qemu-devel@nongnu.org; Thu, 03 May 2018 12:06:27 -0400 Received: from 10.mo69.mail-out.ovh.net ([46.105.73.241]:35288) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1fEGkQ-0003UM-Cx for qemu-devel@nongnu.org; Thu, 03 May 2018 12:06:22 -0400 Received: from player714.ha.ovh.net (unknown [10.109.108.70]) by mo69.mail-out.ovh.net (Postfix) with ESMTP id C622B138AA for ; Thu, 3 May 2018 18:06:20 +0200 (CEST) References: <20180419124331.3915-1-clg@kaod.org> <20180419124331.3915-7-clg@kaod.org> <20180426071129.GJ8800@umbus.fritz.box> <9273a240-6518-155f-ed78-79abe53761e3@kaod.org> <20180503053532.GR13229@umbus.fritz.box> From: =?UTF-8?Q?C=c3=a9dric_Le_Goater?= Message-ID: <9c6d62fe-40e7-0bb9-678e-ed8246373e98@kaod.org> Date: Thu, 3 May 2018 18:06:14 +0200 MIME-Version: 1.0 In-Reply-To: <20180503053532.GR13229@umbus.fritz.box> Content-Type: text/plain; charset=windows-1252 Content-Language: en-US Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: David Gibson Cc: qemu-ppc@nongnu.org, qemu-devel@nongnu.org, Benjamin Herrenschmidt On 05/03/2018 07:35 AM, David Gibson wrote: > On Thu, Apr 26, 2018 at 11:27:21AM +0200, C=E9dric Le Goater wrote: >> On 04/26/2018 09:11 AM, David Gibson wrote: >>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, C=E9dric Le Goater wrote: >>>> The XIVE presenter engine uses a set of registers to handle priority >>>> management and interrupt acknowledgment among other things. The most >>>> important ones being : >>>> >>>> - Interrupt Priority Register (PIPR) >>>> - Interrupt Pending Buffer (IPB) >>>> - Current Processor Priority (CPPR) >>>> - Notification Source Register (NSR) >>>> >>>> There is one set of registers per level of privilege, four in all : >>>> HW, HV pool, OS and User. These are called rings. All registers are >>>> accessible through a specific MMIO region called the Thread Interrup= t >>>> Management Areas (TIMA) but, depending on the privilege level of the >>>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the >>>> OS privilege and therefore can only accesses the OS and the User >>>> rings. The others are for hypervisor levels. >>>> >>>> The CPU interrupt state is modeled with a XiveNVT object which store= s >>>> the values of the different registers. The different TIMA views are >>>> mapped at the same address for each CPU and 'current_cpu' is used to >>>> retrieve the XiveNVT holding the ring registers. >>>> >>>> Signed-off-by: C=E9dric Le Goater >>>> --- >>>> >>>> Changes since v2 : >>>> >>>> - introduced the XiveFabric interface >>>> >>>> hw/intc/spapr_xive.c | 25 ++++ >>>> hw/intc/xive.c | 279 +++++++++++++++++++++++++++++++++= +++++++++++ >>>> include/hw/ppc/spapr_xive.h | 5 + >>>> include/hw/ppc/xive.h | 31 +++++ >>>> include/hw/ppc/xive_regs.h | 84 +++++++++++++ >>>> 5 files changed, 424 insertions(+) >>>> >>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c >>>> index 90cde8a4082d..f07832bf0a00 100644 >>>> --- a/hw/intc/spapr_xive.c >>>> +++ b/hw/intc/spapr_xive.c >>>> @@ -13,6 +13,7 @@ >>>> #include "target/ppc/cpu.h" >>>> #include "sysemu/cpus.h" >>>> #include "monitor/monitor.h" >>>> +#include "hw/ppc/spapr.h" >>>> #include "hw/ppc/spapr_xive.h" >>>> #include "hw/ppc/xive.h" >>>> #include "hw/ppc/xive_regs.h" >>>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, = Error **errp) >>>> =20 >>>> /* Allocate the Interrupt Virtualization Table */ >>>> xive->ivt =3D g_new0(XiveIVE, xive->nr_irqs); >>>> + >>>> + /* The Thread Interrupt Management Area has the same address fo= r >>>> + * each chip. On sPAPR, we only need to expose the User and OS >>>> + * level views of the TIMA. >>>> + */ >>>> + xive->tm_base =3D XIVE_TM_BASE; >>> >>> The constant should probably have PAPR in the name somewhere, since >>> it's just for PAPR machines (same for the ESB mappings, actually). >> >> ok.=20 >> >> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in=20 >> case we want to change the value when the guest is instantiated.=20 >> I doubt it but this is an address in the global address space, so=20 >> letting the machine have control is better I think. >=20 > I agree. >=20 >>>> + >>>> + memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive), >>>> + &xive_tm_user_ops, xive, "xive.tima.user"= , >>>> + 1ull << TM_SHIFT); >>>> + sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user); >>>> + >>>> + memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive), >>>> + &xive_tm_os_ops, xive, "xive.tima.os", >>>> + 1ull << TM_SHIFT); >>>> + sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os); >>>> } >>>> =20 >>>> static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn) >>>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *= xf, uint32_t lisn) >>>> return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL; >>>> } >>>> =20 >>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server) >>>> +{ >>>> + PowerPCCPU *cpu =3D spapr_find_cpu(server); >>>> + >>>> + return cpu ? XIVE_NVT(cpu->intc) : NULL; >>>> +} >>> >>> So this is a bit of a tangent, but I've been thinking of implementing >>> a scheme where there's an opaque pointer in the cpu structure for the >>> use of the machine. I'm planning for that to replace the intc pointe= r >>> (which isn't really used directly by the cpu). That would allow us to >>> have spapr put a structure there and have both xics and xive pointers >>> which could be useful later on. >> >> ok. That should simplify the patchset at the end, in which we need to=20 >> switch the 'intc' pointer.=20 >> >>> I think we'd need something similar to correctly handle migration of >>> the VPA state, which is currently horribly broken. >>> >>>> + >>>> static const VMStateDescription vmstate_spapr_xive_ive =3D { >>>> .name =3D TYPE_SPAPR_XIVE "/ive", >>>> .version_id =3D 1, >>>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *k= lass, void *data) >>>> dc->vmsd =3D &vmstate_spapr_xive; >>>> =20 >>>> xfc->get_ive =3D spapr_xive_get_ive; >>>> + xfc->get_nvt =3D spapr_xive_get_nvt; >>>> } >>>> =20 >>>> static const TypeInfo spapr_xive_info =3D { >>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c >>>> index dccad0318834..5691bb9474e4 100644 >>>> --- a/hw/intc/xive.c >>>> +++ b/hw/intc/xive.c >>>> @@ -14,7 +14,278 @@ >>>> #include "sysemu/cpus.h" >>>> #include "sysemu/dma.h" >>>> #include "monitor/monitor.h" >>>> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */ >>>> #include "hw/ppc/xive.h" >>>> +#include "hw/ppc/xive_regs.h" >>>> + >>>> +/* >>>> + * XIVE Interrupt Presenter >>>> + */ >>>> + >>>> +static uint64_t xive_nvt_accept(XiveNVT *nvt) >>>> +{ >>>> + return 0; >>>> +} >>>> + >>>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr) >>>> +{ >>>> + if (cppr > XIVE_PRIORITY_MAX) { >>>> + cppr =3D 0xff; >>>> + } >>>> + >>>> + nvt->ring_os[TM_CPPR] =3D cppr; >>> >>> Surely this needs to recheck if we should be interrupting the cpu? >> >> yes. In patch 9, when we introduce the nvt notify routine. >=20 > Ok. >=20 >>>> +} >>>> + >>>> +/* >>>> + * OS Thread Interrupt Management Area MMIO >>>> + */ >>>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset, >>>> + unsigned size) >>>> +{ >>>> + uint64_t ret =3D -1; >>>> + >>>> + if (offset =3D=3D TM_SPC_ACK_OS_REG && size =3D=3D 2) { >>>> + ret =3D xive_nvt_accept(nvt); >>>> + } else { >>>> + qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%" >>>> + HWADDR_PRIx" size %d\n", offset, size); >>>> + } >>>> + >>>> + return ret; >>>> +} >>>> + >>>> +#define TM_RING(offset) ((offset) & 0xf0) >>>> + >>>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset, >>>> + unsigned size) >>>> +{ >>>> + PowerPCCPU *cpu =3D POWERPC_CPU(current_cpu); >>> >>> So, as I said on a previous version of this, we can actually correctl= y >>> represent different mappings in different cpu spaces, by exploiting >>> cpu->as and not just having them all point to &address_space_memory. >> >> Yes, you did and I haven't studied the question yet. For the next vers= ion. >=20 > So, it's possible that using the cpu->as thing will be more trouble > that it's worth.=20 One of the trouble is the number of memory regions to use, one per cpu,=20 and the KVM support. Having a single region is much easier.=20 > I am a little concerned about using current_cpu though. =20 > First, will it work with KVM with kernel_irqchip=3Doff - the > cpus are running truly concurrently, FWIW, I didn't see any issue yet while stressing.=20 > but we still need to work out who's poking at the TIMA. =20 I understand. The registers are accessed by the current cpu to set the=20 CPPR and to ack an interrupt. But when we route an event, we also access=20 and modify the registers. Do you suggest some locking ? I am not sure how are protected the TIMA region accesses vs. the routing, which is=20 necessarily initiated by an ESB MMIO though. > Second, are there any cases where we might > need to trip this "on behalf of" a specific cpu that's not the current > one. ah. yes. sort of :) only in powernv, when the xive is reseted (and when=20 dumping the state for debug). The IC has a way to access indirectly the registers of a HW thread.=20 It, first, sets the PC_TCTXT_INDIR_THRDID register with the PIR of=20 the targeted thread and then loads on the indirect TIMA can be done=20 as if it was the current thread. The indirect TIMA is mapped 4 pages=20 after the IC BAR. The resulting memory region op is a little ugly and might need=20 some rework :=20 static uint64_t xive_tm_hv_read(void *opaque, hwaddr offset, unsigned size) { PowerPCCPU **cpuptr =3D opaque; PowerPCCPU *cpu =3D *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu); ... >>>> + XiveNVT *nvt =3D XIVE_NVT(cpu->intc); >>>> + uint64_t ret =3D -1; >>>> + int i; >>>> + >>>> + if (offset >=3D TM_SPC_ACK_EBB) { >>>> + return xive_tm_read_special(nvt, offset, size); >>>> + } >>>> + >>>> + if (TM_RING(offset) !=3D TM_QW1_OS) { >>>> + qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non= -OS ring @%" >>>> + HWADDR_PRIx"\n", offset); >>>> + return ret; >>> >>> Just return -1 would be clearer here; >> >> ok. >> >>> >>>> + } >>>> + >>>> + ret =3D 0; >>>> + for (i =3D 0; i < size; i++) { >>>> + ret |=3D (uint64_t) nvt->regs[offset + i] << (8 * (size - i= - 1)); >>>> + } >>>> + >>>> + return ret; >>>> +} >>>> + >>>> +static bool xive_tm_is_readonly(uint8_t offset) >>>> +{ >>>> + return offset !=3D TM_QW1_OS + TM_CPPR; >>>> +} >>>> + >>>> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset, >>>> + uint64_t value, unsigned si= ze) >>>> +{ >>>> + /* TODO: support TM_SPC_SET_OS_PENDING */ >>>> + >>>> + /* TODO: support TM_SPC_ACK_OS_EL */ >>>> +} >>>> + >>>> +static void xive_tm_os_write(void *opaque, hwaddr offset, >>>> + uint64_t value, unsigned size) >>>> +{ >>>> + PowerPCCPU *cpu =3D POWERPC_CPU(current_cpu); >>>> + XiveNVT *nvt =3D XIVE_NVT(cpu->intc); >>>> + int i; >>>> + >>>> + if (offset >=3D TM_SPC_ACK_EBB) { >>>> + xive_tm_write_special(nvt, offset, value, size); >>>> + return; >>>> + } >>>> + >>>> + if (TM_RING(offset) !=3D TM_QW1_OS) { >>> >>> Why have this if you have separate OS and user regions as you appear >>> to do below? >> >> This is another problem we are trying to solve.=20 >> >> The registers a CPU can access depends on the TIMA view it is using.=20 >> The OS TIMA view only sees the OS ring registers. The HV view sees all= .=20 >> >>> Or to look at it another way, shouldn't it be possible to make the >>> read/write accessors the same for the OS and user rings? >> >> For some parts yes, but the special load/store addresses are different >> for each view, the read-only register also. It seemed easier to duplic= ate. >> >> I think the problem will become clearer (or worse) with pnv which uses= =20 >> the HV mode. >=20 > Oh. I had the impression that each ring had a basically identical set > of registers and you just had access to the region for your ring and > the ones below. Are you saying instead it's basically a single block > of registers with various different privilege levels for each of them? yes. I think I answered this question more clearly in a previous email. > [snip] >>>> +} >>>> + >>>> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp) >>>> +{ >>>> + qemu_unregister_reset(xive_nvt_reset, dev); >>>> +} >>>> + >>>> +static void xive_nvt_init(Object *obj) >>>> +{ >>>> + XiveNVT *nvt =3D XIVE_NVT(obj); >>>> + >>>> + nvt->ring_os =3D &nvt->regs[TM_QW1_OS]; >>> >>> The ring_os field is basically pointless, being just an offset into a >>> structure you already have. A macro or inline would be a better idea= . >> >> ok. I liked the idea but I agree it's overkill to have an init routine >> just for this. I will find something. >=20 > That too, but it's also something that looks like an optimization but > isn't, which is bad practice. On modern cpus math is cheap (and this > is just a trivial offset), memory accesses are expensive. You're > essentially caching this offset - raising all the usual invalidation > questions for a cache - when caching it is *more* expensive than just > computing it every time. ok. removing this offset was a good opportunity to generalize the=20 routing algorithm and use a 'ring' parameter in all routines. Same=20 for the accept path.=20 C.