From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49751) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Zr0it-0003ln-Sw for qemu-devel@nongnu.org; Tue, 27 Oct 2015 05:39:21 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Zr0iq-0000G1-H5 for qemu-devel@nongnu.org; Tue, 27 Oct 2015 05:39:19 -0400 Received: from [59.151.112.132] (port=4971 helo=heian.cn.fujitsu.com) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Zr0ip-0000F6-9X for qemu-devel@nongnu.org; Tue, 27 Oct 2015 05:39:16 -0400 References: <1445830158-20721-1-git-send-email-caoj.fnst@cn.fujitsu.com> <1445830158-20721-3-git-send-email-caoj.fnst@cn.fujitsu.com> <20151026100131-mutt-send-email-mst@redhat.com> <562E151C.1040208@cn.fujitsu.com> <20151026140916-mutt-send-email-mst@redhat.com> From: Cao jin Message-ID: <562F4635.1060503@cn.fujitsu.com> Date: Tue, 27 Oct 2015 17:39:01 +0800 MIME-Version: 1.0 In-Reply-To: <20151026140916-mutt-send-email-mst@redhat.com> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH v5 2/2] enable multi-function hot-add List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Michael S. Tsirkin" Cc: izumi.taku@jp.fujitsu.com, alex.williamson@redhat.com, qemu-devel@nongnu.org hello Michael, there is still a long story with many personal understanding below, hope to get directions. If all my understanding are right, I think maybe I could say: finally I understand your implementation:) On 10/26/2015 08:14 PM, Michael S. Tsirkin wrote: > On Mon, Oct 26, 2015 at 07:57:16PM +0800, Cao jin wrote: >> Hi, >> warning, there is a long story below O:-) >> >> On 10/26/2015 04:29 PM, Michael S. Tsirkin wrote: >>> On Mon, Oct 26, 2015 at 11:29:18AM +0800, Cao jin wrote: >>>> Enable pcie device multifunction hot, just ensure the function 0 >>>> added last, then driver will got the notification to scan all the >>>> function in the slot. >>>> >>>> Signed-off-by: Cao jin > > > In my opinion, what's confusing you is that you keep > thinking about ARI. > ARI is just a way where PCI_SLOT(devfn) can be != 0 > when you are actually just part of the same device. > > But with or without ARI, PCIE devices with upstream ports > can only occupy slot 0. > Yes, I think I got you, please see my understanding and correct me please if I am wrong: In the time of PCI bus, there are as many as 21 slots(device) under a PCI-PCI bridge, as IDSEL pin can be only connected to AD[31:11], but system software still need so scan the bus to enumerate device from 0 to 31, and each function has 8 functions at most, so there is devices[PCI_SLOT_MAX * PCI_FUNC_MAX] under struct PCIBus. But PCIe bus is composed of point-to-point Links that interconnect a set of components, so physically, there is only 1 slot(device) at the leaf tip of the topology, and so PCI_SLOT(devfn) must be 0, and the old(PCI) spec says there is 8 functions at most in a device, if there are totally only 8 functions under a downstream port, that would be a big restriction. For backward compatible and other reasons, PCIe designed the ARI feature, to extend device up to 256 functions, so we now could have PCI_SLOT(devfn) can be != 0. I think it also could be said that, ARI give the standard that vendors could follow to product a PCIe device which could has up to 256 functions(in reality, at least now, I never heard of a 256 functions device, yes I am a newbie about this:P) summary: 1. PCI device, each slot has 8 functions at most, 32 slots at most(actually 21 slots) below a PCI-PCI bridge 2. PCIe device without ARI, only 1 slot with 8 functions at most below the downstream port 3. PCIe device with ARI, only 1 slot with 256 functions at most I did the test as following, start vm with: ./qemu-system-x86_64 -M q35 -device ioh3420,bus=pcie.0,addr=1c.0,port=1,id=br1,chassis=1 -hda /var/lib/libvirt/images/fedora21.qcow2 -smp 2 --enable-kvm -m 1024 --monitor stdio then: device_add e1000,multifunction=on,bus=br1,id=net0,addr=2.0 find that guest cannot find net0 at addr=2.0 indeed, and net0 also flash and gone via "info pci", so the ARI feature in the future is going to fix this kind issue? Is the understanding above right? > So let's check for that. > > >>>> --- >>>> hw/pci/pci.c | 31 ++++++++++++++++++++++++++++++- >>>> hw/pci/pci_host.c | 13 +++++++++++-- >>>> hw/pci/pcie.c | 18 +++++++++--------- >>>> include/hw/pci/pci.h | 1 + >>>> 4 files changed, 51 insertions(+), 12 deletions(-) >>>> >>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c >>>> index b0bf540..6f43b12 100644 >>>> --- a/hw/pci/pci.c >>>> +++ b/hw/pci/pci.c >>>> @@ -847,6 +847,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, >>>> PCIConfigWriteFunc *config_write = pc->config_write; >>>> Error *local_err = NULL; >>>> AddressSpace *dma_as; >>>> + DeviceState *dev = DEVICE(pci_dev); >>>> + >>>> + pci_dev->bus = bus; >>>> >>>> if (devfn < 0) { >>>> for(devfn = bus->devfn_min ; devfn < ARRAY_SIZE(bus->devices); >>>> @@ -864,9 +867,17 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, >>>> PCI_SLOT(devfn), PCI_FUNC(devfn), name, >>>> bus->devices[devfn]->name); >>>> return NULL; >>>> + } else if (dev->hotplugged && >>>> + pci_get_function_0(pci_dev)) { >>>> + error_setg(errp, "PCI: slot %d function 0 already ocuppied by %s," >>>> + " new func %s cannot be exposed to guest.", >>>> + PCI_SLOT(devfn), >>>> + bus->devices[PCI_DEVFN(PCI_SLOT(devfn), 0)]->name, >>>> + name); >>>> + >>>> + return NULL; >>>> } >>>> >>>> - pci_dev->bus = bus; >>>> pci_dev->devfn = devfn; >>>> dma_as = pci_device_iommu_address_space(pci_dev); >>>> >>>> @@ -2454,6 +2465,24 @@ void pci_bus_get_w64_range(PCIBus *bus, Range *range) >>>> pci_for_each_device_under_bus(bus, pci_dev_get_w64, range); >>>> } >>>> >>>> +/* ARI device function number range is 0-255, means has only 1 function0; >>>> + * while non-ARI device has 1 function0 in each slot. non-ARI device could >>>> + * be PCI or PCIe, and there is up to 32 slots for PCI */ >>>> +PCIDevice *pci_get_function_0(PCIDevice *pci_dev) >>>> +{ >>>> + PCIDevice *parent_dev; >>>> + >>>> + parent_dev = pci_bridge_get_device(pci_dev->bus); >>>> + if (pcie_cap_is_arifwd_enabled(parent_dev) && >>>> + pci_find_capability(pci_dev, PCI_EXT_CAP_ID_ARI)) { >>>> + /* ARI enabled */ >>>> + return pci_dev->bus->devices[0]; >>> >>> That's wrong I think since software might enable ARI after hotplug. >> I read pci_configure_ari() in linux driver again, find something new to me: driver enable ARI only by seeing the ari forwarding bit in upstream bridge and the ARI capability register in function 0. if it is like this, I think, if we do function hotplug, while want to enable ARI, we must ensure function0 has ARI capability register, but driver won`t see the other functions who don`t have ARI capability register >> According to the spec, ARI feature is enabled only based on the following 2 >> conditions: >> 1. For an ARI Downstream Port, the capability is communicated through the >> Device Capabilities 2 register. >> 2. For an ARI Device, the capability is communicated through the ARI >> Capability structure. >> >> also according to the driver code pci_configure_ari(), I think my >> implementation does follows spec? >> >> And as you know, only ARI feature is enabled, we return >> pci_dev->bus->devices[0] > I think I am also wrong here yesterday, the saying may should be: it is only non root complex integrated device who return pci_dev->bus->devices[0]. > > Yes but that's not the point. > The point is whether a device can occupy slot != 0. > >>> And I'm not sure all functions must have the ARI capability. >>> >> Based on the understanding above, to enable ARI, function0 must have ARI capability, but if the other non-zero functions don`t have, linux driver won`t see it >> do you means there maybe the following condition: ARI forwarding bit is >> enabled in downstream port, but the functions below, some have ARI >> Capability structure while the others don`t. Shouldn`t we forbid the >> condition? Because If this condition happens, I am not sure whether the >> device could work normally. >> >> In the IMPLEMENTATION NOTE: ARI Forwarding Enable Being Set Inappropriately, >> It seems the spec don`t want that complicated condition? While actually, >> qemu calls pcie_cap_arifwd_init() in root port/downstream port >> initialization first. >> >>> I don't see why don't you just go by spec: >>> >> >> I do read the spec, and also referred the pcie driver code(see >> pci_configure_ari()). I think it is my inaccurate understanding about >> "upstream port" results in my implementation(I also consult PCISIG support >> team for this question, see attachment). The concept of "upstream port" in >> ARI device definition confuse me a lot. >> >> Talking about the definition of ARI device, I always thought the "upstream >> port" should on the ARI device itself(like the figure I drew before), or >> else why the definition add the words "with an upstream port"? It seems to >> me that the emphasis is on "with an upstream port", and implies to me that >> the non-ARI device doesn`t have an upstream port. Seeing their replies in >> attachment, says that I am wrong at this point, both of them have an >> upstream port.(their saying actually make me more confused at that time, but >> now I think I am clear about this concept after reading your implementation >> below) >> >> After reading your implementation, and the PCISIG support team replies >> again, finally I figure out that, "upstream port" in ARI device definition >> is just a port whose position is closer to root complex, and the point is, >> the "upstream port" doesn`t need to exist on the ARI device itself, which >> also means, take Figure 6-13 in PCIe spec 3.1, the Root Port A is the >> "upstream port" for ARI Device X, and the Downstream Port D in Switch is the >> "upstream port" for ARI Device Y. Now, do I understand it right? >> >> Hope I can get you understood from the long description above. If I still >> don`t understand it right, please point it out, after all, it confused me >> for a long time. >> >>> static >>> bool pcie_has_upstream_port(PCIDevice *dev) >>> { >>> PCIDevice *parent_dev = pci_bridge_get_device(pci_dev->bus); >>> >>> /* >>> * Device associated with an upstream port. >>> * As there are several types of these, it's easier to check the >>> * parent device: upstream ports are always connected to >>> * root or downstream ports. >>> */ >>> return parent_dev && >>> pci_is_express(parent_dev) && >>> parent_dev->exp.exp_cap && >>> (pcie_cap_get_type(parent_dev) == PCI_EXP_TYPE_ROOT_PORT || >>> pcie_cap_get_type(parent_dev) == PCI_EXP_TYPE_DOWNSTREAM); >>> } >>> >> >> Assume my understanding is right, which means both ARI and non-ARI device >> have the upstream port(root port or downstream port), could the existence of >> upstream port be the judgment condition? > > This tells us whether we are behind a port that > can address devices in slot != 0. > > Seems I find something, according to spec: Endpoints are classified as either legacy, PCI Express, or Root Complex Integrated Endpoints. I think this is also what you means in comment "As there are several types of these" And "we are behind a port" means the device is not a Root Complex Integrated, only the other two types can be behind a port. Am I right about this? >>> >>> PCIDevice *pci_get_function_0(PCIDevice *pci_dev) >>> { >>> if (pcie_has_upstream_port(dev)) { >>> /* With an upstream PCIE port, we only support 1 device at slot 0 */ >>> return pci_dev->bus->devices[0]; >>> } else { >>> /* Other bus types might support multiple devices at slots 0-31 */ >>> return pci_dev->bus->devices[PCI_DEVFN(PCI_SLOT(pci_dev->devfn), 0)]; >>> } >>> } >>> >> >>>> + } else { >>>> + /* no ARI */ >>>> + return pci_dev->bus->devices[PCI_DEVFN(PCI_SLOT(pci_dev->devfn), 0)]; >>>> + } >>>> +} >>>> + >>>> static const TypeInfo pci_device_type_info = { >>>> .name = TYPE_PCI_DEVICE, >>>> .parent = TYPE_DEVICE, >>>> diff --git a/hw/pci/pci_host.c b/hw/pci/pci_host.c >>>> index 3e26f92..63d7d2f 100644 >>>> --- a/hw/pci/pci_host.c >>>> +++ b/hw/pci/pci_host.c >>>> @@ -20,6 +20,7 @@ >>>> >>>> #include "hw/pci/pci.h" >>>> #include "hw/pci/pci_host.h" >>>> +#include "hw/pci/pci_bus.h" >>>> #include "trace.h" >>>> >>>> /* debug PCI */ >>>> @@ -75,7 +76,11 @@ void pci_data_write(PCIBus *s, uint32_t addr, uint32_t val, int len) >>>> PCIDevice *pci_dev = pci_dev_find_by_addr(s, addr); >>>> uint32_t config_addr = addr & (PCI_CONFIG_SPACE_SIZE - 1); >>>> >>>> - if (!pci_dev) { >>>> + /* non-zero functions are only exposed when function 0 is present, >>>> + * allowing direct removal of unexposed functions. >>>> + */ >>>> + if (!pci_dev || >>>> + (pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev))) { >>>> return; >>>> } >>>> >>>> @@ -91,7 +96,11 @@ uint32_t pci_data_read(PCIBus *s, uint32_t addr, int len) >>>> uint32_t config_addr = addr & (PCI_CONFIG_SPACE_SIZE - 1); >>>> uint32_t val; >>>> >>>> - if (!pci_dev) { >>>> + /* non-zero functions are only exposed when function 0 is present, >>>> + * allowing direct removal of unexposed functions. >>>> + */ >>>> + if (!pci_dev || >>>> + (pci_dev->qdev.hotplugged && !pci_get_function_0(pci_dev))) { >>>> return ~0x0; >>>> } >>>> >>>> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c >>>> index b1adeaf..4ba9501 100644 >>>> --- a/hw/pci/pcie.c >>>> +++ b/hw/pci/pcie.c >>>> @@ -249,16 +249,16 @@ void pcie_cap_slot_hotplug_cb(HotplugHandler *hotplug_dev, DeviceState *dev, >>>> return; >>>> } >>>> >>>> - /* TODO: multifunction hot-plug. >>>> - * Right now, only a device of function = 0 is allowed to be >>>> - * hot plugged/unplugged. >>>> + /* To enable multifunction hot-plug, we just ensure the function >>>> + * 0 added last. When function 0 is added, we set the sltsta and >>>> + * inform OS via event notification. >>>> */ >>>> - assert(PCI_FUNC(pci_dev->devfn) == 0); >>>> - >>>> - pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA, >>>> - PCI_EXP_SLTSTA_PDS); >>>> - pcie_cap_slot_event(PCI_DEVICE(hotplug_dev), >>>> - PCI_EXP_HP_EV_PDC | PCI_EXP_HP_EV_ABP); >>>> + if (pci_get_function_0(pci_dev)) { >>>> + pci_word_test_and_set_mask(exp_cap + PCI_EXP_SLTSTA, >>>> + PCI_EXP_SLTSTA_PDS); >>>> + pcie_cap_slot_event(PCI_DEVICE(hotplug_dev), >>>> + PCI_EXP_HP_EV_PDC | PCI_EXP_HP_EV_ABP); >>>> + } >>>> } >>>> >>>> static void pcie_unplug_device(PCIBus *bus, PCIDevice *dev, void *opaque) >>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h >>>> index f5e7fd8..379b6e1 100644 >>>> --- a/include/hw/pci/pci.h >>>> +++ b/include/hw/pci/pci.h >>>> @@ -397,6 +397,7 @@ void pci_for_each_bus_depth_first(PCIBus *bus, >>>> void *(*begin)(PCIBus *bus, void *parent_state), >>>> void (*end)(PCIBus *bus, void *state), >>>> void *parent_state); >>>> +PCIDevice *pci_get_function_0(PCIDevice *pci_dev); >>>> >>>> /* Use this wrapper when specific scan order is not required. */ >>>> static inline >>>> -- >>>> 2.1.0 >>> . >>> >> >> -- >> Yours Sincerely, >> >> Cao Jin > > . > -- Yours Sincerely, Cao Jin