From: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
To: liulongfang <liulongfang@huawei.com>,
Alex Williamson <alex.williamson@redhat.com>
Cc: "jgg@nvidia.com" <jgg@nvidia.com>,
Jonathan Cameron <jonathan.cameron@huawei.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linuxarm@openeuler.org" <linuxarm@openeuler.org>
Subject: RE: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register location of the XQC address
Date: Mon, 13 May 2024 08:35:10 +0000 [thread overview]
Message-ID: <4e7f3027d07645aca80e811598d4ce21@huawei.com> (raw)
In-Reply-To: <e63c0c85-7f3a-100c-5059-322268b3f517@huawei.com>
> -----Original Message-----
> From: liulongfang <liulongfang@huawei.com>
> Sent: Monday, May 13, 2024 9:16 AM
> To: Alex Williamson <alex.williamson@redhat.com>; Shameerali Kolothum
> Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: jgg@nvidia.com; Jonathan Cameron <jonathan.cameron@huawei.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linuxarm@openeuler.org
> Subject: Re: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register location of
> the XQC address
>
> On 2024/5/9 22:29, Alex Williamson wrote:
> > On Thu, 9 May 2024 09:37:51 +0000
> > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> wrote:
> >
> >>> -----Original Message-----
> >>> From: Alex Williamson <alex.williamson@redhat.com>
> >>> Sent: Wednesday, May 8, 2024 7:00 PM
> >>> To: liulongfang <liulongfang@huawei.com>
> >>> Cc: jgg@nvidia.com; Shameerali Kolothum Thodi
> >>> <shameerali.kolothum.thodi@huawei.com>; Jonathan Cameron
> >>> <jonathan.cameron@huawei.com>; kvm@vger.kernel.org; linux-
> >>> kernel@vger.kernel.org; linuxarm@openeuler.org
> >>> Subject: Re: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register
> location of
> >>> the XQC address
> >>
> >> [...]
> >>
> >>>> HiSilicon accelerator equipment can perform general services after
> >>> completing live migration.
> >>>> This kind of business is executed through the user mode driver and only
> >>> needs to use SQE and CQE.
> >>>>
> >>>> At the same time, this device can also perform kernel-mode services in
> the
> >>> VM through the crypto
> >>>> subsystem. This kind of service requires the use of EQE.
> >>>>
> >>>> Finally, if the device is abnormal, the driver needs to perform a device
> >>> reset, and AEQE needs to
> >>>> be used in this case.
> >>>>
> >>>> Therefore, a complete device live migration function needs to ensure
> that
> >>> device functions are
> >>>> normal in all these scenarios.
> >>>> Therefore, this data still needs to be migrated.
> >>>
> >>> Ok, I had jumped to an in-kernel host driver in reference to "kernel
> >>> mode" rather than a guest kernel. Migrating with bad data only affects
> >>> the current configuration of the device, reloading a guest driver to
> >>> update these registers or a reset of the device would allow proper
> >>> operation of the device, correct?
> >>
> >> Yes, after talking to Longfang, the device RAS will trigger a reset and
> >> would function after reset.
> >>
> >>>
> >>> But I think this still isn't really a complete solution, we know
> >>> there's a bug in the migration data stream, so not only would we fix
> >>> the data stream, but I think we should also take measures to prevent
> >>> loading a known bad data stream. AIUI migration of this device while
> >>> running in kernel mode (ie. a kernel driver within a guest VM) is
> >>> broken. Therefore, the least we can do in a new kernel, knowing that
> >>> there was previously a bug in the migration data stream, is to fail to
> >>> load that migration data because it risks this scenario where the
> >>> device is broken after migration. Shouldn't we then also increment a
> >>> migration version field in the data stream to block migrations that
> >>> risk this breakage, or barring that, change the magic data field to
> >>> prevent the migration? Thanks,
> >>
> >> Ok. We could add a new ACC_DEV_MAGIC_V2 and prevent the migration
> >> in vf_qm_check_match(). The only concern here is that, it will completely
> >> block old kernel to new kernel migration and since we can recover the
> >> device after the reset whether it is too restrictive or not.
> >
> > What's the impact to the running driver, kernel or userspace, if the
> > device is reset? Migration is intended to be effectively transparent
>
> If the device is reset, the user's task needs to be restarted.
> If an exception has been detected, the best way is not to migrate.
>
> > to the driver. If the driver stalls and needs to reset the device,
> > what has the migration driver accomplished versus an offline migration?
> >
> > If there's a way to detect from the migration data if the device is
> > running in kernel mode or user mode then you could potentially accept
> > and send v1 magic conditional that the device is in user mode and
> > require v2 magic for any migration where the device is in kernel mode.
> > This all adds complication though and seems like it has corner cases
> > where we might allow migration to an old kernel that might trap the
> > device there if the use case changes.
> >
>
> The driver does not support checking whether the device is running in
> kernel mode or user mode.
> Moreover, the device supports user-mode services and kernel-mode services
> to run at the same time.
>
> > Essentially it comes down to what should the migration experience be
> > and while restricting old->new and new->old migration is undesirable,
> > it seems old->old migration is effectively already broken anyway. As
> > you consider a v2 magic, perhaps consider how the migration data
> > structure might be improved overall to better handle new features and
> > bugs. Thanks,
> >
>
> We discussed a plan:
> Update ACC_DEV_MAGIC to ACC_DEV_MAGIC_VERSION and configure its
> last byte
> as version information:
>
> /* QM match information, last byte is version number */
> #define ACC_DEV_MAGIC_VERSION 0XACCDEVFEEDCAFE01
Oops..cant have V there. But the idea is replace magic with last byte
as version info which can be used in future for handling bugs/features
etc.
Thanks,
Shameer
next prev parent reply other threads:[~2024-05-13 8:35 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-25 13:23 [PATCH v6 0/5] add debugfs to hisilicon migration driver Longfang Liu
2024-04-25 13:23 ` [PATCH v6 1/5] hisi_acc_vfio_pci: extract public functions for container_of Longfang Liu
2024-04-25 13:23 ` [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register location of the XQC address Longfang Liu
2024-05-03 16:11 ` Alex Williamson
2024-05-07 8:29 ` liulongfang
2024-05-07 11:10 ` Shameerali Kolothum Thodi
2024-05-07 12:35 ` Alex Williamson
2024-05-08 7:18 ` liulongfang
2024-05-08 17:59 ` Alex Williamson
2024-05-09 9:37 ` Shameerali Kolothum Thodi
2024-05-09 14:29 ` Alex Williamson
[not found] ` <e63c0c85-7f3a-100c-5059-322268b3f517@huawei.com>
2024-05-13 8:35 ` Shameerali Kolothum Thodi [this message]
2024-04-25 13:23 ` [PATCH v6 3/5] hisi_acc_vfio_pci: create subfunction for data reading Longfang Liu
2024-05-03 16:25 ` Alex Williamson
2024-05-07 8:30 ` liulongfang
2024-04-25 13:23 ` [PATCH v6 4/5] hisi_acc_vfio_pci: register debugfs for hisilicon migration driver Longfang Liu
2024-05-03 17:21 ` Alex Williamson
2024-05-07 8:06 ` Shameerali Kolothum Thodi
2024-04-25 13:23 ` [PATCH v6 5/5] Documentation: add debugfs description for hisi migration Longfang Liu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4e7f3027d07645aca80e811598d4ce21@huawei.com \
--to=shameerali.kolothum.thodi@huawei.com \
--cc=alex.williamson@redhat.com \
--cc=jgg@nvidia.com \
--cc=jonathan.cameron@huawei.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linuxarm@openeuler.org \
--cc=liulongfang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox