From: "Longpeng (Mike, Cloud Infrastructure Service Product Dept.)" <longpeng2@huawei.com>
To: Leon Romanovsky <leon@kernel.org>
Cc: <bhelgaas@google.com>, <linux-pci@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <jianjay.zhou@huawei.com>,
<zhuangshengen@huawei.com>, <arei.gonglei@huawei.com>,
<yechuan@huawei.com>, <huangzhichao@huawei.com>,
<xiehong@huawei.com>
Subject: Re: [RFC 0/4] pci/sriov: support VFs dynamic addition
Date: Mon, 14 Nov 2022 22:06:49 +0800
Message-ID: <3a8efc92-eda8-9c61-50c5-5ec97e2e2342@huawei.com>
In-Reply-To: <Y3I+Fs0/dXH/hnpL@unreal>
On 2022/11/14 21:09, Leon Romanovsky wrote:
> On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>
>>
>> On 2022/11/14 15:04, Leon Romanovsky wrote:
>>> On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>> Hi leon,
>>>>
>>>> On 2022/11/12 0:39, Leon Romanovsky wrote:
>>>>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
>>>>>> From: Longpeng <longpeng2@huawei.com>
>>>>>>
>>>>>> We can enable SR-IOV and add VFs via /sys/bus/pci/devices/..../sriov_numvfs,
>>>>>> but this operation takes a long time when there is a large number of VFs.
>>>>>> For example, if the machine has 10 PFs with 250 VFs per PF, enabling all the
>>>>>> VFs concurrently costs about 200-250ms. However, most of them do not need to
>>>>>> be used right away, so we could enable SR-IOV first and add VFs on demand.
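
(For reference, a minimal Python sketch of the flow described above: write each
PF's sriov_numvfs concurrently and time it. The PF addresses and VF count below
are placeholders, not the actual topology that was measured.)

  import concurrent.futures
  import pathlib
  import time

  PF_ADDRS = ["0000:3b:00.0", "0000:5e:00.0"]  # placeholder PF addresses
  NUM_VFS = 250                                # placeholder VF count per PF

  def enable_vfs(pf):
      # Writing N to sriov_numvfs makes the PCI core create and add all N
      # VFs before the write returns, which is where the time is spent.
      path = pathlib.Path("/sys/bus/pci/devices") / pf / "sriov_numvfs"
      path.write_text(str(NUM_VFS))

  start = time.monotonic()
  with concurrent.futures.ThreadPoolExecutor(max_workers=len(PF_ADDRS)) as pool:
      list(pool.map(enable_vfs, PF_ADDRS))
  print(f"enabled all VFs in {time.monotonic() - start:.3f}s")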
>>>>>
>>>>> It is unclear what took 200-250ms: is it physical VF creation or binding
>>>>> the driver to these VFs?
>>>>>
>>>> It is neither. In our test, the physical VFs had already been created, so
>>>> we skipped the 100ms wait after writing PCI_SRIOV_CTRL. And our driver only
>>>> probes the PF; it just returns an error if the function is a VF.
>>>
>>> It means that you didn't try sriov_drivers_autoprobe. Once it is set to
>>> false, the PCI core won't even try to probe the VFs.
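
(A minimal sketch of that workflow, with a placeholder PF address: clear
sriov_drivers_autoprobe first so the VFs are created but no driver probe runs
for them.)

  import pathlib

  pf = pathlib.Path("/sys/bus/pci/devices/0000:3b:00.0")  # placeholder PF
  (pf / "sriov_drivers_autoprobe").write_text("0")  # don't autoprobe VF drivers
  (pf / "sriov_numvfs").write_text("250")           # create the VFs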
>>>
>>>>
>>>> The hotspot is sriov_add_vfs (with no driver probe involved, in fact), which
>>>> is a long procedure. Each step costs only a little, but the total cost is not
>>>> acceptable in some time-sensitive cases.
>>>
>>> This is also cryptic to me. In a standard SR-IOV deployment, all VFs are
>>> created and configured when the operator boots the machine with
>>> sriov_drivers_autoprobe set to false. Once the machine is ready, VFs are
>>> assigned to the relevant VMs/users through orchestration SW (IMHO, this is
>>> supported by all orchestration SW).
>>>
>>> And only the last part (assigning to users) is a time-sensitive operation.
>>>
>> VF creation and configuration are also time-sensitive in some cases, for
>> example the hypervisor live update case (such as [1]):
>> save VMs -> kexec -> restore VMs
>>
>> After the new kernel starts, the VFs must be added back into the system
>> before the original VFs can be assigned to QEMU. This means we must enable
>> all 2K+ VFs at once, which increases the downtime.
>>
>> If we could enable only the VFs that are used by the existing VMs, then
>> restore the VMs, and enable the other unused VFs last, the downtime would be
>> significantly reduced.
>>
>> [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf
>
> As written in the presentation, the standard way of doing this is the VFIO
> live migration feature, where 2K+ VMs are migrated to another server when the
> first server is scheduled for maintenance.
>
Live migration is not the best choice in a production environment; it's
too heavy. Some cloud providers prefer to use hypervisor live update in
their systems, such as AWS's Nitro hypervisor.
> However, even in the live update case mentioned in the presentation, you
> should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time,
> so you don't need a per-VF-id enable knob.
>
The presentation is just a reference; some points could be optimized,
including the disabling and enabling of PFs/VFs.
Hypervisor live update can finish in less than 1 second, so the cost of
disabling and re-enabling PFs/VFs (~200-250ms or even worse) is too
high.
>>
>>>>
>>>> What's more, sriov_add_vfs adds the VFs of a PF one by one, so we can have
>>>> at most 10 concurrent calls if there are 10 PFs.
>>>
>>> I wonder, are you using real HW or QEMU SR-IOV? What server do you have
>>> that supports such a large number of VFs?
>>>
>> Physical devices. Some devices on the market support a large number of VFs,
>> especially in the hardware offloading area, e.g. DPUs/IPUs. I think the
>> SR-IOV software should keep pace with the times too.
>
> Our devices (and Intel's too) support many VFs. The thing is that
> servers are unlikely to be able to support 10 physical devices with 2K+
> VFs. There are many limitations that make such a setup unusable, like the
> global MSI-X pool and the PCI bandwidth needed to support all these devices.
>
>>
>>> BTW, your change will probably break all SR-IOV devices on the market, as
>>> they rely on the PCI subsystem to have the VFs ready and configured.
>>>
>> I see, but maybe this change could be a choice for some users.
>
> It should come with relevant driver changes and a very strong justification
> for why such functionality is needed now and can't be achieved by anything
> other than user-facing sysfs.
>
Adding 2K+ VFs through sysfs needs too much time.
Look at the bottom half of the hypervisor live update:
kexec --> add 2K VFs --> restore VMs
The downtime can be reduced if the sequence is instead:
kexec --> add 100 VFs (the ones the VMs use) --> restore VMs --> add 1.9K VFs
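
To make the intent concrete, a purely hypothetical sketch of the two orderings;
enable_vf() and restore_vms() are stand-ins for whatever knobs the series ends
up providing, not existing sysfs interfaces:

  # Hypothetical timing model only; enable_vf()/restore_vms() are stand-ins.
  def bring_up_all_then_restore(enable_vf, restore_vms, total=2000):
      for vf_id in range(total):      # downtime covers all 2K+ additions
          enable_vf(vf_id)
      restore_vms()

  def bring_up_needed_then_rest(enable_vf, restore_vms, needed, total=2000):
      for vf_id in sorted(needed):    # e.g. the ~100 VFs the saved VMs use
          enable_vf(vf_id)
      restore_vms()                   # downtime ends here
      for vf_id in range(total):      # the remaining VFs come up afterwards
          if vf_id not in needed:
              enable_vf(vf_id)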
> I don't see anything in this presentation and discussion that supports the
> need for such a UAPI.
>
> Thanks
>
>>
>>> Thanks