Linux PCI subsystem development
From: "Longpeng (Mike, Cloud Infrastructure Service Product Dept.)"  <longpeng2@huawei.com>
To: Leon Romanovsky <leon@kernel.org>
Cc: <bhelgaas@google.com>, <linux-pci@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <jianjay.zhou@huawei.com>,
	<zhuangshengen@huawei.com>, <arei.gonglei@huawei.com>,
	<yechuan@huawei.com>, <huangzhichao@huawei.com>,
	<xiehong@huawei.com>
Subject: Re: [RFC 0/4] pci/sriov: support VFs dynamic addition
Date: Mon, 14 Nov 2022 22:06:49 +0800
Message-ID: <3a8efc92-eda8-9c61-50c5-5ec97e2e2342@huawei.com>
In-Reply-To: <Y3I+Fs0/dXH/hnpL@unreal>



On 2022/11/14 21:09, Leon Romanovsky wrote:
> On Mon, Nov 14, 2022 at 08:38:42PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>
>>
>> On 2022/11/14 15:04, Leon Romanovsky wrote:
>>> On Sun, Nov 13, 2022 at 09:47:12PM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>>>> Hi leon,
>>>>
>>>> On 2022/11/12 0:39, Leon Romanovsky wrote:
>>>>> On Fri, Nov 11, 2022 at 10:27:18PM +0800, Longpeng(Mike) wrote:
>>>>>> From: Longpeng <longpeng2@huawei.com>
>>>>>>
>>>>>> We can enable SRIOV and add VFs via /sys/bus/pci/devices/..../sriov_numvfs, but
>>>>>> this operation takes a long time when there is a large number of VFs.
>>>>>> For example, if the machine has 10 PFs and 250 VFs per PF, enabling all the VFs
>>>>>> concurrently costs about 200-250ms. However, most of them do not need to be
>>>>>> used at that moment, so we can enable SRIOV first and add VFs on demand.
>>>>>
>>>>> It is unclear what took 200-250ms: is it physical VF creation or binding
>>>>> of the driver to these VFs?
>>>>>
>>>> It is neither. In our test, the physical VFs had already been created, so
>>>> we skipped the 100ms wait when writing PCI_SRIOV_CTRL. Our driver only
>>>> probes the PF; it just returns an error if the function is a VF.
>>>
>>> It means that you didn't try sriov_drivers_autoprobe. Once it is set to
>>> false, it won't even try to probe VFs.
>>>
>>>>
>>>> The hotspot is sriov_add_vfs (with no driver probe involved), which is a
>>>> long procedure. Each step costs only a little, but the total cost is not
>>>> acceptable in some time-sensitive cases.
>>>
>>> This is also cryptic to me. In a standard SR-IOV deployment, all VFs are
>>> created and configured while the operator boots the machine with
>>> sriov_drivers_autoprobe set to false. Once this machine is ready, VFs are
>>> assigned to the relevant VMs/users through orchestration SW (IMHO, this is
>>> supported by all orchestration SW).
>>>
>>> And only the last part (assigning to users) is a time-sensitive operation.
>>>
>> The VF creation and configuration are also time-sensitive in some cases, for
>> example, the hypervisor live update case (such as [1]):
>>   save VMs -> kexec -> restore VMs
>>
>> After the new kernel starts, the VFs must be added into the system, and then
>> the original VFs are assigned to QEMU. This means we must enable all 2K+ VFs
>> at once, which increases the downtime.
>>
>> If we could first enable the VFs that are used by existing VMs, then restore
>> the VMs, and enable the remaining unused VFs last, the downtime would be
>> significantly reduced.
>>
>> [1] https://static.sched.com/hosted_files/kvmforum2022/65/kvmforum2022-Preserving%20IOMMU%20states%20during%20kexec%20reboot-v4.pdf
> 
> As written in the presentation, the standard way of doing it is the VFIO
> live migration feature, where 2K+ VMs are migrated to another server
> when the first server is scheduled for maintenance.
> 
Live migration is not the best choice in a production environment; it's
too heavy. Some cloud providers prefer to use hypervisor live update
in their systems, such as AWS's Nitro hypervisor.

> However, even in the live update case mentioned in the presentation, you
> should disable ALL PFs/VFs and enable ALL PFs/VFs at the same time,
> so you don't need a per-VF id enable knob.
> 
The presentation is just a reference; some points could be optimized,
including disabling and enabling PFs/VFs.

Hypervisor live update can finish in less than 1 second, so the cost of
disabling PFs/VFs and enabling PFs/VFs (~200-250ms or even worse) is too
high.
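
To put a number on the enable cost for a given PF, a minimal sketch along
these lines could be used. It only touches the standard sriov_drivers_autoprobe
and sriov_numvfs sysfs attributes; the helper name enable_vfs and the <BDF>
placeholder are illustrative assumptions, not part of this series, and it
assumes the PF currently has 0 VFs enabled and that the caller is root.

```python
import time
from pathlib import Path


def enable_vfs(pf_dir, num_vfs, autoprobe=False):
    """Enable num_vfs VFs on one PF and time the sriov_numvfs write.

    pf_dir is the PF's sysfs directory, e.g. /sys/bus/pci/devices/<BDF>.
    (enable_vfs is a hypothetical helper name used only for illustration.)
    Returns the seconds spent in the sriov_numvfs write.
    """
    pf = Path(pf_dir)
    # 0 = do not bind drivers to the new VFs, 1 = probe them (kernel default).
    (pf / "sriov_drivers_autoprobe").write_text("1" if autoprobe else "0")
    start = time.perf_counter()
    # The write is synchronous: it returns after the VFs have been added.
    (pf / "sriov_numvfs").write_text(str(num_vfs))
    return time.perf_counter() - start
```

Calling enable_vfs("/sys/bus/pci/devices/<BDF>", 250) for each PF from its own
thread would approximate the concurrent case described in the cover letter.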

>>
>>>>
>>>> What’s more, sriov_add_vfs adds the VFs of a PF one by one. So we can
>>>> support at most 10 concurrent calls if there are 10 PFs.
>>>
>>> I wonder, are you using real HW or QEMU SR-IOV? What is your server
>>> that supports such a large number of VFs?
>>>
>> Physical device. Some devices on the market support a large number of VFs,
>> especially in the hardware offloading area, e.g. DPU/IPU. I think the SR-IOV
>> software should keep pace with the times too.
> 
> Our devices (and Intel's too) support many VFs. The thing is that
> servers are unlikely to be able to support 10 physical devices with 2K+
> VFs. There are many limitations that will make such a setup unusable,
> like the global MSI-X pool and the PCI bandwidth needed to support all
> these devices.
> 
>>
>>> BTW, your change will probably break all SR-IOV devices on the market as
>>> they rely on the PCI subsystem to have VFs ready and configured.
>>>
>> I see, but maybe this change could be a choice for some users.
> 
> It should come with relevant driver changes and very strong justification why
> such functionality is needed now and can't be achieved by anything else
> except user-facing sysfs.
> 
Adding 2K+ VFs to sysfs takes too much time.

Look at the bottom half of the hypervisor live update:
kexec --> add 2K VFs --> restore VMs

The downtime can be reduced if the sequence is:
kexec --> add 100 VFs (those used by the VMs) --> restore VMs --> add 1.9K VFs
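
As a rough illustration of the two sequences, here is a back-of-the-envelope
model. It assumes (these are assumptions, not measurements) that the per-VF
add cost is constant at about 900us, derived from ~225ms for 250 VFs per PF,
that the VFs of one PF are added serially, and that different PFs proceed in
parallel. The function name add_time_us is made up for this sketch.

```python
def add_time_us(vfs_per_pf, per_vf_us):
    """Time until all PFs finish: serial within a PF, parallel across PFs."""
    # The PF with the most VFs to add determines the total time.
    return max(vfs_per_pf, default=0) * per_vf_us


PER_VF_US = 900  # assumption: ~225 ms / 250 VFs, from the 200-250 ms figure

# Everything before restoring VMs: 10 PFs x 250 VFs.
all_at_once = add_time_us([250] * 10, PER_VF_US)
# Only the 100 in-use VFs, spread evenly over the 10 PFs.
in_use_even = add_time_us([10] * 10, PER_VF_US)
# Worst case: all 100 in-use VFs belong to a single PF.
in_use_one_pf = add_time_us([100] + [0] * 9, PER_VF_US)

print(all_at_once / 1000, in_use_even / 1000, in_use_one_pf / 1000)
# prints: 225.0 9.0 90.0  (milliseconds)
```

Under these assumptions the VF-add portion of the downtime drops from roughly
225ms to somewhere between 9ms and 90ms, depending on how the in-use VFs are
distributed across the PFs.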


> I don't see anything in this presentation and discussion that supports
> the need for such a UAPI.
>
> Thanks
> 
>>
>>> Thanks

Thread overview: 23+ messages
2022-11-11 14:27 [RFC 0/4] pci/sriov: support VFs dynamic addition Longpeng(Mike)
2022-11-11 14:27 ` [RFC 1/4] pci/sriov: extract sriov_numvfs common helper Longpeng(Mike)
2022-11-11 14:27 ` [RFC 2/4] pci/sriov: add vf_bitmap to mark the vf id allocation Longpeng(Mike)
2022-11-11 14:27 ` [RFC 3/4] pci/sriov: add sriov_numfs_no_scan interface Longpeng(Mike)
2022-11-11 14:27 ` [RFC 4/4] pci/sriov: add sriov_scan_vf_id interface Longpeng(Mike)
2022-11-11 16:39 ` [RFC 0/4] pci/sriov: support VFs dynamic addition Leon Romanovsky
2022-11-13 13:47   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2022-11-14  7:04     ` Leon Romanovsky
2022-11-14 12:38       ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2022-11-14 13:09         ` Leon Romanovsky
2022-11-14 14:06           ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.) [this message]
2022-11-14 14:20             ` Leon Romanovsky
2022-11-15  1:38               ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2022-11-15  1:50               ` Oliver O'Halloran
2022-11-15  8:32                 ` Leon Romanovsky
2022-11-15  9:36                   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2022-11-15 10:02                     ` Leon Romanovsky
2022-11-15 10:27                       ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2022-11-15 12:49                   ` Oliver O'Halloran
2022-11-15  2:06             ` Oliver O'Halloran
2022-11-16  0:52               ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2022-11-11 23:07 ` Bjorn Helgaas
2022-11-13 13:49   ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
