From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5593E04C.3080204@linux.vnet.ibm.com>
Date: Wed, 01 Jul 2015 20:42:52 +0800
From: Hong Bo Li
MIME-Version: 1.0
References: <20150701080106-mutt-send-email-mst@redhat.com>
 <55939D29.1060701@linux.vnet.ibm.com>
 <20150701100206-mutt-send-email-mst@redhat.com>
 <5593AF27.8030208@linux.vnet.ibm.com>
 <20150701111650-mutt-send-email-mst@redhat.com>
 <5593BB28.7060904@linux.vnet.ibm.com>
 <20150701122334-mutt-send-email-mst@redhat.com>
 <5593CAEA.2000301@linux.vnet.ibm.com>
 <20150701131635-mutt-send-email-mst@redhat.com>
 <5593D2F9.6010600@linux.vnet.ibm.com>
 <20150701134925-mutt-send-email-mst@redhat.com>
In-Reply-To: <20150701134925-mutt-send-email-mst@redhat.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v2 1/1] KVM s390 pci infrastructure modelling
To: "Michael S. Tsirkin"
Cc: cornelia.huck@de.ibm.com, borntraeger@de.ibm.com, qemu-devel@nongnu.org, agraf@suse.de

On 7/1/2015 19:57, Michael S. Tsirkin wrote:
> On Wed, Jul 01, 2015 at 07:46:01PM +0800, Hong Bo Li wrote:
>>
>> On 7/1/2015 19:23, Michael S. Tsirkin wrote:
>>> On Wed, Jul 01, 2015 at 07:11:38PM +0800, Hong Bo Li wrote:
>>>> On 7/1/2015 18:36, Michael S. Tsirkin wrote:
>>>>> On Wed, Jul 01, 2015 at 06:04:24PM +0800, Hong Bo Li wrote:
>>>>>> On 7/1/2015 17:22, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Jul 01, 2015 at 05:13:11PM +0800, Hong Bo Li wrote:
>>>>>>>> On 7/1/2015 16:05, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Jul 01, 2015 at 03:56:25PM +0800, Hong Bo Li wrote:
>>>>>>>>>> On 7/1/2015 14:22, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Tue, Jun 30, 2015 at 02:16:59PM +0800, Hong Bo Li wrote:
>>>>>>>>>>>> On 6/29/2015 18:01, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Mon, Jun 29, 2015 at 05:24:53PM +0800, Hong Bo Li wrote:
>>>>>>>>>>>>>> This patch introduces a new facility (and bus)
>>>>>>>>>>>>>> to hold devices representing information actually
>>>>>>>>>>>>>> provided by s390 firmware and I/O configuration.
>>>>>>>>>>>>>> Usage example:
>>>>>>>>>>>>>> -device s390-pcihost
>>>>>>>>>>>>>> -device vfio-pci,host=0000:00:00.0,id=vpci1
>>>>>>>>>>>>>> -device zpci,fid=2,uid=5,pci_id=vpci1,id=zpci1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The first line will create an s390 pci host bridge
>>>>>>>>>>>>>> and init the root bus. The second line will create
>>>>>>>>>>>>>> a standard vfio pci device, and attach it to the
>>>>>>>>>>>>>> root bus. These are similar to the standard process
>>>>>>>>>>>>>> to define a pci device on other platforms.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The third line will create an s390 pci device to
>>>>>>>>>>>>>> store s390-specific information, and references
>>>>>>>>>>>>>> the corresponding vfio pci device via device id.
>>>>>>>>>>>>>> We create an s390 pci facility bus to hold all the
>>>>>>>>>>>>>> zpci devices.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Hong Bo Li
>>>>>>>>>>>>> It's mostly up to s390 maintainers, but I'd like to note
>>>>>>>>>>>>> one thing below
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>  hw/s390x/s390-pci-bus.c    | 314 +++++++++++++++++++++++++++++++++------------
>>>>>>>>>>>>>>  hw/s390x/s390-pci-bus.h    |  48 ++++++-
>>>>>>>>>>>>>>  hw/s390x/s390-pci-inst.c   |   4 +-
>>>>>>>>>>>>>>  hw/s390x/s390-virtio-ccw.c |   5 +-
>>>>>>>>>>>>>>  4 files changed, 283 insertions(+), 88 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
>>>>>>>>>>>>>> index 560b66a..d5e7b2e 100644
>>>>>>>>>>>>>> --- a/hw/s390x/s390-pci-bus.c
>>>>>>>>>>>>>> +++ b/hw/s390x/s390-pci-bus.c
>>>>>>>>>>>>>> @@ -32,8 +32,8 @@ int chsc_sei_nt2_get_event(void *res)
>>>>>>>>>>>>>>      PciCcdfErr *eccdf;
>>>>>>>>>>>>>>      int rc = 1;
>>>>>>>>>>>>>>      SeiContainer *sei_cont;
>>>>>>>>>>>>>> -    S390pciState *s = S390_PCI_HOST_BRIDGE(
>>>>>>>>>>>>>> -        object_resolve_path(TYPE_S390_PCI_HOST_BRIDGE, NULL));
>>>>>>>>>>>>>> +    S390PCIFacility *s = S390_PCI_FACILITY(
>>>>>>>>>>>>>> +        object_resolve_path(TYPE_S390_PCI_FACILITY, NULL));
>>>>>>>>>>>>>>      if (!s) {
>>>>>>>>>>>>>>          return rc;
>>>>>>>>>>>>>> @@ -72,8 +72,8 @@ int chsc_sei_nt2_get_event(void)
>>>>>>>>>>>>>>  int chsc_sei_nt2_have_event(void)
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>> -    S390pciState *s = S390_PCI_HOST_BRIDGE(
>>>>>>>>>>>>>> -        object_resolve_path(TYPE_S390_PCI_HOST_BRIDGE, NULL));
>>>>>>>>>>>>>> +    S390PCIFacility *s = S390_PCI_FACILITY(
>>>>>>>>>>>>>> +        object_resolve_path(TYPE_S390_PCI_FACILITY, NULL));
>>>>>>>>>>>>>>      if (!s) {
>>>>>>>>>>>>>>          return 0;
>>>>>>>>>>>>>> @@ -82,20 +82,32 @@ int chsc_sei_nt2_have_event(void)
>>>>>>>>>>>>>>      return !QTAILQ_EMPTY(&s->pending_sei);
>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>> +void s390_pci_device_enable(S390PCIBusDevice *zpci)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +    zpci->fh = zpci->fh | 1 << ENABLE_BIT_OFFSET;
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +void s390_pci_device_disable(S390PCIBusDevice *zpci)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +    zpci->fh = zpci->fh & ~(1 << ENABLE_BIT_OFFSET);
>>>>>>>>>>>>>> +    if (zpci->is_unplugged)
>>>>>>>>>>>>>> +        object_unparent(OBJECT(zpci));
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>  S390PCIBusDevice *s390_pci_find_dev_by_fid(uint32_t fid)
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>      S390PCIBusDevice *pbdev;
>>>>>>>>>>>>>> -    int i;
>>>>>>>>>>>>>> -    S390pciState *s = S390_PCI_HOST_BRIDGE(
>>>>>>>>>>>>>> -        object_resolve_path(TYPE_S390_PCI_HOST_BRIDGE, NULL));
>>>>>>>>>>>>>> +    BusChild *kid;
>>>>>>>>>>>>>> +    S390PCIFacility *s = S390_PCI_FACILITY(
>>>>>>>>>>>>>> +        object_resolve_path(TYPE_S390_PCI_FACILITY, NULL));
>>>>>>>>>>>>>>      if (!s) {
>>>>>>>>>>>>>>          return NULL;
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>> -    for (i = 0; i < PCI_SLOT_MAX; i++) {
>>>>>>>>>>>>>> -        pbdev = &s->pbdev[i];
>>>>>>>>>>>>>> -        if ((pbdev->fh != 0) && (pbdev->fid == fid)) {
>>>>>>>>>>>>>> +    QTAILQ_FOREACH(kid, &s->fbus->qbus.children, sibling) {
>>>>>>>>>>>>>> +        pbdev = (S390PCIBusDevice *)kid->child;
>>>>>>>>>>>>>> +        if (pbdev->fid == fid) {
>>>>>>>>>>>>>>              return pbdev;
>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>> @@ -126,39 +138,24 @@ void s390_pci_sclp_configure(int configure, SCCB *sccb)
>>>>>>>>>>>>>>      return;
>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>> -static uint32_t s390_pci_get_pfid(PCIDevice *pdev)
>>>>>>>>>>>>>> -{
>>>>>>>>>>>>>> -    return PCI_SLOT(pdev->devfn);
>>>>>>>>>>>>>> -}
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -static uint32_t s390_pci_get_pfh(PCIDevice *pdev)
>>>>>>>>>>>>>> -{
>>>>>>>>>>>>>> -    return PCI_SLOT(pdev->devfn) | FH_VIRT;
>>>>>>>>>>>>>> -}
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>  S390PCIBusDevice *s390_pci_find_dev_by_idx(uint32_t idx)
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>      S390PCIBusDevice *pbdev;
>>>>>>>>>>>>>> -    int i;
>>>>>>>>>>>>>> -    int j = 0;
>>>>>>>>>>>>>> -    S390pciState *s = S390_PCI_HOST_BRIDGE(
>>>>>>>>>>>>>> -        object_resolve_path(TYPE_S390_PCI_HOST_BRIDGE, NULL));
>>>>>>>>>>>>>> +    BusChild *kid;
>>>>>>>>>>>>>> +    int i = 0;
>>>>>>>>>>>>>> +    S390PCIFacility *s = S390_PCI_FACILITY(
>>>>>>>>>>>>>> +        object_resolve_path(TYPE_S390_PCI_FACILITY, NULL));
>>>>>>>>>>>>>>      if (!s) {
>>>>>>>>>>>>>>          return NULL;
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>> -    for (i = 0; i < PCI_SLOT_MAX; i++) {
>>>>>>>>>>>>>> -        pbdev = &s->pbdev[i];
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -        if (pbdev->fh == 0) {
>>>>>>>>>>>>>> -            continue;
>>>>>>>>>>>>>> -        }
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> -        if (j == idx) {
>>>>>>>>>>>>>> +    QTAILQ_FOREACH(kid, &s->fbus->qbus.children, sibling) {
>>>>>>>>>>>>>> +        pbdev = (S390PCIBusDevice *)kid->child;
>>>>>>>>>>>>>> +        if (i == idx) {
>>>>>>>>>>>>>>              return pbdev;
>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>> -        j++;
>>>>>>>>>>>>>> +        i++;
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>      return NULL;
>>>>>>>>>>>>> This relies on the order of children on the qbus; that's wrong, I think.
>>>>>>>>>>>>> Generally I'm not sure why you convert all slot lookups to child
>>>>>>>>>>>>> lookups: more code to achieve the same effect?
>>>>>>>>>>>> Thank you Michael.
>>>>>>>>>>>> I made the change for two reasons:
>>>>>>>>>>>> 1. The old implementation only supports one s390 pci root bus, and
>>>>>>>>>>>> at most 32 (PCI_SLOT_MAX) slots. So when it comes to multiple s390
>>>>>>>>>>>> pci root buses, the old code does not work.
>>>>>>>>>>>> 2. Now the zpci device "S390PCIBusDevice" is only a structure to
>>>>>>>>>>>> store s390-specific information, so we can attach all the zpci
>>>>>>>>>>>> devices to an s390 pci facility bus. Since these zpci devices have
>>>>>>>>>>>> no relation to the "slot", their order does not matter.
>>>>>>>>>>> But you make this order guest-visible, which seems wrong.
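[For readers skimming the patch: the new enable/disable helpers above only toggle a single bit of the function handle. A minimal stand-alone sketch of the same idea; the ENABLE_BIT_OFFSET value and the toy names are assumptions for illustration, not taken from the patch:]

```c
#include <stdint.h>

/* Assumed value for illustration; the real constant is defined in
 * hw/s390x/s390-pci-bus.h. */
#define ENABLE_BIT_OFFSET 31

/* Toy stand-in for S390PCIBusDevice: only the function handle matters here. */
typedef struct {
    uint32_t fh; /* function handle; one bit marks "enabled" */
} ToyZpciDev;

/* Mirrors s390_pci_device_enable(): set the enable bit. */
static inline void toy_zpci_enable(ToyZpciDev *d)
{
    d->fh |= 1u << ENABLE_BIT_OFFSET;
}

/* Mirrors s390_pci_device_disable(): clear the enable bit,
 * leaving the rest of the handle untouched. */
static inline void toy_zpci_disable(ToyZpciDev *d)
{
    d->fh &= ~(1u << ENABLE_BIT_OFFSET);
}
```

[The point of keeping the rest of the handle intact is that the handle still identifies the function while it is disabled.]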
>>>>>>>>>>>
>>>>>>>>>> The guest uses an s390-specific "list pci" instruction to get all
>>>>>>>>>> the zpci devices, and will create a root s390 pci bus for each
>>>>>>>>>> device. So the order has no relation to the pci topology on the
>>>>>>>>>> guest.
>>>>>>>>>>
>>>>>>>>>> If we assign too many zpci devices to one guest, the "list pci"
>>>>>>>>>> instruction will use a resume token to get all the zpci devices.
>>>>>>>>>> For example, the first time we return 32 zpci devices to the guest;
>>>>>>>>>> the next time we return another 32. The resume token stores where
>>>>>>>>>> the next batch of zpci devices returned to the guest will begin.
>>>>>>>>>>
>>>>>>>>>> So, if we change the order of the zpci devices on the s390 facility
>>>>>>>>>> bus, it may change the "batch" in which a device is returned to the
>>>>>>>>>> guest. But this will not change the pci topology on the guest.
>>>>>>>>> Yes, but that's still guest-visible, and will break,
>>>>>>>>> for example, if the guest is migrated between qemu instances
>>>>>>>>> where the list order is different precisely when
>>>>>>>>> it's enumerating the bus.
>>>>>>>>>
>>>>>>>> Yes, and the list order is not the only s390-specific information
>>>>>>>> exposed to the guest. Besides that, we would need to migrate all the
>>>>>>>> other zpci information as well. For now, we have no plan to support
>>>>>>>> zpci migration.
>>>>>>> BTW how will hotplug work? If it happens while the guest
>>>>>>> enumerates the bus, then naturally all index values
>>>>>>> become invalid.
>>>>>> The list zpci only happens when the guest does pci_base_init() for
>>>>>> s390. At that moment, hotplug does not work yet.
>>>>> You can't prevent this: the user can request hotplug at this time.
>>>>>
>>>>>> And assuming we have that case, we still have the index issue even
>>>>>> when scanning a standard pci bus. Please see my following words.
>>>>>>
>>>>>>> Just don't expose internal qdev data structures to the guest.
>>>>>>> It's not by chance that we don't have a look-up-by-index
>>>>>>> capability; it's an attempt to enforce sane usage.
>>>>>>> You are misusing the API with your hack.
>>>>>> The resume token of list zpci is indeed an index of iteration :(
>>>>>>> PCI has standard ways to enumerate the bus, maybe you
>>>>>>> should emulate it. Or find some other way that works.
>>>>>>> The idea to poke at s->fbus->qbus and count things there
>>>>>>> is a bad one.
>>>>>>>
>>>>>> I can define multiple zpci buses, and attach each zpci device to a
>>>>>> slot of a root bus. Then I need to add an API to the common pci code
>>>>>> to scan all the pci host bridges. And in this way, it still has the
>>>>>> index issue: I need to scan from the first bus to count the index.
>>>>>> So do you have any suggestion?
>>>>> OK, I looked at arch/s390/pci/pci.c.
>>>>> First of all, it seems to run the regular PCI thing on bridges.
>>>>>
>>>>> zdev->bus = pci_scan_root_bus(NULL, ZPCI_BUS_NR, &pci_root_ops,
>>>>>                               zdev, &resources);
>>>> At this moment, the guest has got all the zpci devices through the clp
>>>> list zpci instruction. For each device, in pci_scan_root_bus(), it will
>>>> create a root bus. So for s390, we get the pci devices first, then
>>>> create a new root bus for each.
>>> I don't see this in guest code.
>>>
>>> I looked at pci_scan_root_bus and it's completely generic.
>>> It sets up the bus:
>>>     b = pci_create_root_bus(parent, bus, ops, sysdata, resources);
>>>
>>> then it scans it:
>>>     max = pci_scan_child_bus(b);
>>>
>>>
>>> that one does
>>>     /* Go find them, Rover! */
>>>     for (devfn = 0; devfn < 0x100; devfn += 8)
>>>         pci_scan_slot(bus, devfn);
>>>
>>> next
>>>     dev = pci_scan_single_device(bus, devfn);
>>>
>>> and so on. Eventually you get
>>>     if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, 60*1000))
>>>         return NULL;
>>>
>>> and that one does the clp thing using zpci_cfg_load.
>>>
>> pci_base_init() -> clp_scan_pci_devices():
>>     rc = clp_list_pci(rrb, __clp_add);
>> In this function, there is a while loop to get all the zpci devices by
>> means of the resume token (index). And for each device,
>>     __clp_add() -> clp_add_pci_device();
>> In clp_add_pci_device(), we use the zpci information to create a struct
>> zpci_dev zdev.
>> Then zpci_create_device() -> zpci_scan_bus() -> pci_scan_root_bus():
>>     zdev->bus = pci_scan_root_bus(NULL, ZPCI_BUS_NR, &pci_root_ops,
>>                                   zdev, &resources);
>> So, you see, each zdev has its own root bus. And there is no child bus
>> under that root bus.
> Right - zdev *is* the root. But there are pci devices hanging off it.

We have multiple zdevs in the kernel, and each zdev has only one pci
device attached to it.

>
> So why not model it like this?
>
> vfio should attach to zdev, zdev is the pci host.
>
> Also, you can stick a pci-to-pci bridge under the root, and
> everything will just work.
>
>
>
>
>
>>>
>>>>> so to me, it looks like there's no need to expose
>>>>> non-root buses through special means.
>>>>>
>>>>> What to do for root buses is a different question, but again,
>>>>> you definitely do not want to rely on the order of things
>>>>> on that linked list.
>>>>> The simplest thing is to ask the user to give them unique
>>>>> numbers, or find some stable way to sort them that
>>>>> does not rely on order of initialization (e.g. device IDs?).
>>>>>
>>>>> But again, this only works ok for root buses.
>>>>>
>>>> Basically, it does not expose the buses to the guest; it exposes an
>>>> index to the guest.
>>>> Here is the process to get all the zpci devices for a guest.
>>>> For example: we have 10 zpci devices, and the batch size for the list
>>>> zpci instruction is 4.
>>>> First, qemu will return devices 0-3; the index of list zpci is 0.
>>>> Second, qemu will return devices 4-7; the index of list zpci is 4.
>>>> Third, qemu will return devices 8-9; the index of list zpci is 8.
>>>> We have device ids, but list zpci does not use them as a flag to get
>>>> the next batch; it uses an index instead.
>>>> This process is defined by the s390 architecture; we can't change it.
>>>> So no matter how we organize the zpci devices in qemu, in slots or in
>>>> a linked list, we cannot get rid of the index issue.
>>>>
>>>> How about I add a flag to identify whether the linked list
>>>> is valid or not? When a hotplug/unplug event occurs, I will
>>>> reset the index and make the guest refetch the zpci devices
>>>> from the beginning.
>>>>
>>>>
>>> You should just use something stable for IDs.
>>> And avoid doing it for anything that isn't a root or maybe a bridge,
>>> since it'll just cause everyone maintenance problems down the road.
>>>
>> The list zpci instruction is defined by the architecture, not software;
>> I could not change it to use an ID instead...
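[The batched enumeration being debated, where the guest passes back a resume token (the start index) to fetch the next batch, can be sketched in C. The names and the fixed batch size below are illustrative, not the real CLP interface:]

```c
#include <stddef.h>

/* Illustrative batch size; the example in the thread uses 4 (real
 * "list pci" batches are bounded by the response block size). */
#define BATCH 4

/* Toy "list pci": copy up to BATCH entries starting at *resume_token,
 * then advance the token so the caller can ask for the next batch.
 * Returns the number of entries copied; 0 means enumeration is done. */
static size_t toy_list_pci(const unsigned *fids, size_t total,
                           size_t *resume_token, unsigned out[BATCH])
{
    size_t n = 0;

    while (n < BATCH && *resume_token + n < total) {
        out[n] = fids[*resume_token + n];
        n++;
    }
    *resume_token += n;
    return n;
}
```

[With 10 devices this yields batches starting at tokens 0, 4, and 8, matching the example above. It also shows the fragility under discussion: the token is a position, not an identity, so if the device list is reordered or a device is hot-plugged between calls, the same token silently points at different devices.]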