From: Hong Bo Li
Date: Wed, 01 Jul 2015 19:46:01 +0800
Message-ID: <5593D2F9.6010600@linux.vnet.ibm.com>
In-Reply-To: <20150701131635-mutt-send-email-mst@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 1/1] KVM s390 pci infrastructure modelling
To: "Michael S. Tsirkin"
Cc: cornelia.huck@de.ibm.com, borntraeger@de.ibm.com, qemu-devel@nongnu.org, agraf@suse.de

On 7/1/2015 19:23, Michael S. Tsirkin wrote:
> On Wed, Jul 01, 2015 at 07:11:38PM +0800, Hong Bo Li wrote:
>>
>> On 7/1/2015 18:36, Michael S. Tsirkin wrote:
>>> On Wed, Jul 01, 2015 at 06:04:24PM +0800, Hong Bo Li wrote:
>>>> On 7/1/2015 17:22, Michael S. Tsirkin wrote:
>>>>> On Wed, Jul 01, 2015 at 05:13:11PM +0800, Hong Bo Li wrote:
>>>>>> On 7/1/2015 16:05, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Jul 01, 2015 at 03:56:25PM +0800, Hong Bo Li wrote:
>>>>>>>> On 7/1/2015 14:22, Michael S. Tsirkin wrote:
>>>>>>>>> On Tue, Jun 30, 2015 at 02:16:59PM +0800, Hong Bo Li wrote:
>>>>>>>>>> On 6/29/2015 18:01, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Mon, Jun 29, 2015 at 05:24:53PM +0800, Hong Bo Li wrote:
>>>>>>>>>>>> This patch introduces a new facility (and bus)
>>>>>>>>>>>> to hold devices representing information actually
>>>>>>>>>>>> provided by s390 firmware and I/O configuration.
>>>>>>>>>>>> Usage example:
>>>>>>>>>>>>   -device s390-pcihost
>>>>>>>>>>>>   -device vfio-pci,host=0000:00:00.0,id=vpci1
>>>>>>>>>>>>   -device zpci,fid=2,uid=5,pci_id=vpci1,id=zpci1
>>>>>>>>>>>>
>>>>>>>>>>>> The first line will create an s390 pci host bridge
>>>>>>>>>>>> and init the root bus.
>>>>>>>>>>>> The second line will create
>>>>>>>>>>>> a standard vfio pci device, and attach it to the
>>>>>>>>>>>> root bus. These are similar to the standard process
>>>>>>>>>>>> for defining a pci device on other platforms.
>>>>>>>>>>>>
>>>>>>>>>>>> The third line will create an s390 pci device to
>>>>>>>>>>>> store s390-specific information, and reference
>>>>>>>>>>>> the corresponding vfio pci device via its device id.
>>>>>>>>>>>> We create an s390 pci facility bus to hold all the
>>>>>>>>>>>> zpci devices.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Hong Bo Li
>>>>>>>>>>> It's mostly up to s390 maintainers, but I'd like to note
>>>>>>>>>>> one thing below
>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>>  hw/s390x/s390-pci-bus.c    | 314 +++++++++++++++++++++++++++++++++------------
>>>>>>>>>>>>  hw/s390x/s390-pci-bus.h    |  48 ++++++-
>>>>>>>>>>>>  hw/s390x/s390-pci-inst.c   |   4 +-
>>>>>>>>>>>>  hw/s390x/s390-virtio-ccw.c |   5 +-
>>>>>>>>>>>>  4 files changed, 283 insertions(+), 88 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
>>>>>>>>>>>> index 560b66a..d5e7b2e 100644
>>>>>>>>>>>> --- a/hw/s390x/s390-pci-bus.c
>>>>>>>>>>>> +++ b/hw/s390x/s390-pci-bus.c
>>>>>>>>>>>> @@ -32,8 +32,8 @@ int chsc_sei_nt2_get_event(void *res)
>>>>>>>>>>>>      PciCcdfErr *eccdf;
>>>>>>>>>>>>      int rc = 1;
>>>>>>>>>>>>      SeiContainer *sei_cont;
>>>>>>>>>>>> -    S390pciState *s = S390_PCI_HOST_BRIDGE(
>>>>>>>>>>>> -        object_resolve_path(TYPE_S390_PCI_HOST_BRIDGE, NULL));
>>>>>>>>>>>> +    S390PCIFacility *s = S390_PCI_FACILITY(
>>>>>>>>>>>> +        object_resolve_path(TYPE_S390_PCI_FACILITY, NULL));
>>>>>>>>>>>>
>>>>>>>>>>>>      if (!s) {
>>>>>>>>>>>>          return rc;
>>>>>>>>>>>>      }
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -72,8 +72,8 @@ int chsc_sei_nt2_get_event(void *res)
>>>>>>>>>>>>
>>>>>>>>>>>>  int chsc_sei_nt2_have_event(void)
>>>>>>>>>>>>  {
>>>>>>>>>>>> -    S390pciState *s = S390_PCI_HOST_BRIDGE(
>>>>>>>>>>>> -        object_resolve_path(TYPE_S390_PCI_HOST_BRIDGE, NULL));
>>>>>>>>>>>> +    S390PCIFacility *s = S390_PCI_FACILITY(
>>>>>>>>>>>> +        object_resolve_path(TYPE_S390_PCI_FACILITY, NULL));
>>>>>>>>>>>>
>>>>>>>>>>>>      if (!s) {
>>>>>>>>>>>>          return 0;
>>>>>>>>>>>>      }
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -82,20 +82,32 @@ int chsc_sei_nt2_have_event(void)
>>>>>>>>>>>>      return !QTAILQ_EMPTY(&s->pending_sei);
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> +void s390_pci_device_enable(S390PCIBusDevice *zpci)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    zpci->fh = zpci->fh | 1 << ENABLE_BIT_OFFSET;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +void s390_pci_device_disable(S390PCIBusDevice *zpci)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    zpci->fh = zpci->fh & ~(1 << ENABLE_BIT_OFFSET);
>>>>>>>>>>>> +    if (zpci->is_unplugged)
>>>>>>>>>>>> +        object_unparent(OBJECT(zpci));
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>>  S390PCIBusDevice *s390_pci_find_dev_by_fid(uint32_t fid)
>>>>>>>>>>>>  {
>>>>>>>>>>>>      S390PCIBusDevice *pbdev;
>>>>>>>>>>>> -    int i;
>>>>>>>>>>>> -    S390pciState *s = S390_PCI_HOST_BRIDGE(
>>>>>>>>>>>> -        object_resolve_path(TYPE_S390_PCI_HOST_BRIDGE, NULL));
>>>>>>>>>>>> +    BusChild *kid;
>>>>>>>>>>>> +    S390PCIFacility *s = S390_PCI_FACILITY(
>>>>>>>>>>>> +        object_resolve_path(TYPE_S390_PCI_FACILITY, NULL));
>>>>>>>>>>>>
>>>>>>>>>>>>      if (!s) {
>>>>>>>>>>>>          return NULL;
>>>>>>>>>>>>      }
>>>>>>>>>>>>
>>>>>>>>>>>> -    for (i = 0; i < PCI_SLOT_MAX; i++) {
>>>>>>>>>>>> -        pbdev = &s->pbdev[i];
>>>>>>>>>>>> -        if ((pbdev->fh != 0) && (pbdev->fid == fid)) {
>>>>>>>>>>>> +    QTAILQ_FOREACH(kid, &s->fbus->qbus.children, sibling) {
>>>>>>>>>>>> +        pbdev = (S390PCIBusDevice *)kid->child;
>>>>>>>>>>>> +        if (pbdev->fid == fid) {
>>>>>>>>>>>>              return pbdev;
>>>>>>>>>>>>          }
>>>>>>>>>>>>      }
>>>>>>>>>>>>
>>>>>>>>>>>> @@ -126,39 +138,24 @@ void s390_pci_sclp_configure(int configure, SCCB *sccb)
>>>>>>>>>>>>      return;
>>>>>>>>>>>>  }
>>>>>>>>>>>>
>>>>>>>>>>>> -static uint32_t s390_pci_get_pfid(PCIDevice *pdev)
>>>>>>>>>>>> -{
>>>>>>>>>>>> -    return PCI_SLOT(pdev->devfn);
>>>>>>>>>>>> -}
>>>>>>>>>>>> -
>>>>>>>>>>>> -static uint32_t s390_pci_get_pfh(PCIDevice *pdev)
>>>>>>>>>>>> -{
>>>>>>>>>>>> -    return PCI_SLOT(pdev->devfn) | FH_VIRT;
>>>>>>>>>>>> -}
>>>>>>>>>>>> -
>>>>>>>>>>>>  S390PCIBusDevice *s390_pci_find_dev_by_idx(uint32_t idx)
>>>>>>>>>>>>  {
>>>>>>>>>>>>      S390PCIBusDevice *pbdev;
>>>>>>>>>>>> -    int i;
>>>>>>>>>>>> -    int j = 0;
>>>>>>>>>>>> -    S390pciState *s = S390_PCI_HOST_BRIDGE(
>>>>>>>>>>>> -        object_resolve_path(TYPE_S390_PCI_HOST_BRIDGE, NULL));
>>>>>>>>>>>> +    BusChild *kid;
>>>>>>>>>>>> +    int i = 0;
>>>>>>>>>>>> +    S390PCIFacility *s = S390_PCI_FACILITY(
>>>>>>>>>>>> +        object_resolve_path(TYPE_S390_PCI_FACILITY, NULL));
>>>>>>>>>>>>
>>>>>>>>>>>>      if (!s) {
>>>>>>>>>>>>          return NULL;
>>>>>>>>>>>>      }
>>>>>>>>>>>>
>>>>>>>>>>>> -    for (i = 0; i < PCI_SLOT_MAX; i++) {
>>>>>>>>>>>> -        pbdev = &s->pbdev[i];
>>>>>>>>>>>> -
>>>>>>>>>>>> -        if (pbdev->fh == 0) {
>>>>>>>>>>>> -            continue;
>>>>>>>>>>>> -        }
>>>>>>>>>>>> -
>>>>>>>>>>>> -        if (j == idx) {
>>>>>>>>>>>> +    QTAILQ_FOREACH(kid, &s->fbus->qbus.children, sibling) {
>>>>>>>>>>>> +        pbdev = (S390PCIBusDevice *)kid->child;
>>>>>>>>>>>> +        if (i == idx) {
>>>>>>>>>>>>              return pbdev;
>>>>>>>>>>>>          }
>>>>>>>>>>>> -        j++;
>>>>>>>>>>>> +        i++;
>>>>>>>>>>>>      }
>>>>>>>>>>>>
>>>>>>>>>>>>      return NULL;
>>>>>>>>>>> This relies on the order of children on the qbus; that's wrong, I think.
>>>>>>>>>>> Generally, I'm not sure why you convert all slot lookups to child
>>>>>>>>>>> lookups: more code to achieve the same effect?
>>>>>>>>>> Thank you, Michael.
>>>>>>>>>> I made the change for two reasons:
>>>>>>>>>> 1. The old implementation only supports one s390 pci root bus, and
>>>>>>>>>> 32 (PCI_SLOT_MAX) slots at most. So when it comes to multiple s390
>>>>>>>>>> pci root buses, the old code does not work.
>>>>>>>>>> 2. Now the zpci device "S390PCIBusDevice" is only a structure to
>>>>>>>>>> store s390-specific information, so we can attach all the zpci
>>>>>>>>>> devices to an s390 pci facility bus. Since these zpci devices have
>>>>>>>>>> no relation to the "slot", the order of them does not matter.
>>>>>>>>> But you make this order guest-visible, which seems wrong.
>>>>>>>>>
>>>>>>>> The guest uses an s390-specific "list pci" instruction to get all the
>>>>>>>> zpci devices, and will create a root s390 pci bus for each device. So
>>>>>>>> the order has no relation to the pci topology on the guest.
>>>>>>>>
>>>>>>>> If we assign too many zpci devices to one guest, the "list pci"
>>>>>>>> instruction will use a resume token to get all the zpci devices. For
>>>>>>>> example, the first time we return 32 zpci devices to the guest; the
>>>>>>>> next time we'll return another 32 zpci devices. The resume token is
>>>>>>>> used to store the beginning of the zpci devices that will be returned
>>>>>>>> to the guest the next time.
>>>>>>>>
>>>>>>>> So, if we change the order of the zpci devices on the s390 facility
>>>>>>>> bus, it may change the "batch" in which a device is returned to the
>>>>>>>> guest. But this will not change the pci topology on the guest.
>>>>>>> Yes, but that's still guest-visible, and will break,
>>>>>>> for example, if the guest is migrated between qemu instances
>>>>>>> where the list order is different precisely when
>>>>>>> it's enumerating the bus.
>>>>>>>
>>>>>> Yes, and the list order is not the only s390-specific information
>>>>>> exposed to the guest. Besides that, we would need to migrate all the
>>>>>> other zpci information. For now, we have no plan to support zpci
>>>>>> migration yet.
>>>>> BTW how will hotplug work? If it happens while the guest
>>>>> enumerates the bus, then naturally all index values
>>>>> become invalid.
>>>> The list zpci only happens when the guest is doing pci_base_init() for
>>>> s390. At that moment, hotplug does not work yet.
>>> You can't prevent this: the user can request hotplug at this time.
>>>
>>>> And assuming we have
>>>> that case, we would still have the index issue even when scanning a
>>>> standard pci bus. Please see my following words.
>>>>
>>>>> Just don't expose internal qdev data structures to the guest.
>>>>> It's not by chance that we don't have a look-up-by-index
>>>>> capability; it's an attempt to enforce sane usage.
>>>>> You are misusing the API with your hack.
>>>> The resume token of list zpci is indeed an index of iteration :(
>>>>
>>>>> PCI has standard ways to enumerate the bus, maybe you
>>>>> should emulate it. Or find some other way that works.
>>>>> The idea to poke at s->fbus->qbus and count things there
>>>>> is a bad one.
>>>>>
>>>> I can define multiple zpci buses, and attach each zpci device to a slot
>>>> of a root bus. Then I need to add an api to the common pci code to do
>>>> the scan of all the pci host bridges. And in this way, it still has the
>>>> index issue: I need to scan from the first bus to count the index. So,
>>>> any suggestion from you?
>>> OK, I looked at arch/s390/pci/pci.c.
>>> First of all, it seems to run the regular PCI thing on bridges.
>>>
>>> zdev->bus = pci_scan_root_bus(NULL, ZPCI_BUS_NR, &pci_root_ops,
>>>                               zdev, &resources);
>> At this moment, the guest has already got all the zpci devices through
>> the clp list zpci instruction. For each device, pci_scan_root_bus() will
>> create a root bus. So for s390, we get the pci devices first, then
>> create a new root bus for each of them.
> I don't see this in guest code.
>
> I looked at pci_scan_root_bus and it's completely generic.
> It sets up the bus:
>     b = pci_create_root_bus(parent, bus, ops, sysdata, resources);
>
> then it scans it:
>     max = pci_scan_child_bus(b);
>
> that one does
>     /* Go find them, Rover! */
>     for (devfn = 0; devfn < 0x100; devfn += 8)
>         pci_scan_slot(bus, devfn);
>
> next
>     dev = pci_scan_single_device(bus, devfn);
>
> and so on. Eventually you get
>     if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, 60*1000))
>         return NULL;
>
> and that one does the clp thing using zpci_cfg_load.

The call chain is:

    pci_base_init() -> clp_scan_pci_devices():
        rc = clp_list_pci(rrb, __clp_add);

In clp_list_pci(), there is a while loop that gets all the zpci devices
by means of the resume token (an index). For each device:

    __clp_add() -> clp_add_pci_device()

In clp_add_pci_device(), we use the zpci information to create a
struct zpci_dev zdev. Then:

    zpci_create_device() -> zpci_scan_bus() -> pci_scan_root_bus()

    zdev->bus = pci_scan_root_bus(NULL, ZPCI_BUS_NR, &pci_root_ops,
                                  zdev, &resources);

So, you see, each zdev has its own root bus, and there is no child bus
under that root bus.
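To make the resume-token flow concrete, here is a minimal, self-contained
sketch of that guest-side loop. Everything in it is an illustrative
stand-in rather than the real kernel code; in particular
clp_list_pci_batch(), add_zpci_device() and CLP_BATCH_SIZE are names
invented for this sketch only:

    /*
     * Minimal sketch of the guest-side "list pci" loop described above.
     * clp_list_pci_batch() models one CLP LIST PCI call: it fills up to
     * CLP_BATCH_SIZE entries starting at *resume_token and updates the
     * token; a returned token of 0 means the list is complete.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define CLP_BATCH_SIZE 32            /* example batch size */

    struct zpci_entry {
        uint32_t fid;                    /* function id */
        uint32_t fh;                     /* function handle */
    };

    int clp_list_pci_batch(uint64_t *resume_token,
                           struct zpci_entry *batch, size_t *count);
    void add_zpci_device(const struct zpci_entry *e);  /* like __clp_add() */

    int scan_all_zpci_devices(void)
    {
        struct zpci_entry batch[CLP_BATCH_SIZE];
        uint64_t resume_token = 0;       /* 0: start from the beginning */
        size_t count, i;

        do {
            /* each call returns the next batch, starting at the token */
            if (clp_list_pci_batch(&resume_token, batch, &count))
                return -1;
            /* each device later gets its own root bus via pci_scan_root_bus() */
            for (i = 0; i < count; i++)
                add_zpci_device(&batch[i]);
        } while (resume_token != 0);     /* 0: firmware says we are done */

        return 0;
    }

The only state carried between iterations is the resume token, which is
exactly why its meaning has to stay stable while the guest is in the
middle of this loop.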
>
>>> so to me, it looks like there's no need to expose
>>> non-root buses through special means.
>>>
>>> What to do for root buses is a different question but again,
>>> you definitely do not want to rely on the order of things
>>> on that linked list.
>>> The simplest thing is to ask the user to give them unique
>>> numbers, or find some stable way to sort them that
>>> does not rely on the order of initialization (e.g. device IDs?).
>>>
>>> But again, this only works ok for root buses.
>>>
>> Basically, it does not expose the buses to the guest; it exposes an
>> index to the guest.
>> Here is the process to get all the zpci devices for a guest.
>> For example: we have 10 zpci devices, and the batch size for the list
>> zpci instruction is 4.
>> First, qemu will return devices 0-3; the index of list zpci is 0.
>> Second, qemu will return devices 4-7; the index of list zpci is 4.
>> Third, qemu will return devices 8-9; the index of list zpci is 8.
>> We have device ids, but list zpci does not use them as a flag to get
>> the next batch; it uses an index instead.
>> This process is defined by the s390 arch; we can't change it.
>> So no matter how we organize the zpci devices in qemu, slot or linked
>> list, we cannot get rid of the index issue.
>>
>> How about I add a flag to identify whether the linked list
>> is valid or not? When a hotplug/unplug event occurs, I will
>> reset the index, and make the guest refetch the zpci devices
>> from the beginning.
>>
> You should just use something stable for IDs.
> And avoid doing it for anything that isn't a root or maybe a bridge,
> since it'll just cause everyone maintenance problems down the road.
>
The list zpci instruction is defined by the arch, not a software thing;
I cannot change it to use an ID instead...
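For what it's worth, if the token value is chosen by the hypervisor and
only echoed back by the guest, as the loop sketched above suggests, then
Michael's "something stable" could be encoded in the token itself: make
it carry the last fid handed out instead of a list position. A rough,
untested sketch against the types in this patch; s390_pci_find_next_dev()
is a hypothetical helper, not part of the patch:

    /*
     * Hypothetical helper: treat the resume token as "last fid returned"
     * rather than a position in the qbus child list, so device order and
     * hotplug no longer change what the token means.
     */
    static S390PCIBusDevice *s390_pci_find_next_dev(S390PCIFacility *s,
                                                    uint32_t prev_fid)
    {
        BusChild *kid;
        S390PCIBusDevice *pbdev, *next = NULL;

        QTAILQ_FOREACH(kid, &s->fbus->qbus.children, sibling) {
            pbdev = (S390PCIBusDevice *)kid->child;
            /* pick the smallest fid strictly greater than prev_fid */
            if (pbdev->fid > prev_fid && (!next || pbdev->fid < next->fid)) {
                next = pbdev;
            }
        }
        return next;   /* NULL: no more devices, end the list for the guest */
    }

Each batch would then be filled by repeated calls to this helper, with
the token set to the fid of the last entry returned, so hotplug or a
different child order on the qbus would not invalidate a token the guest
is still holding.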