public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/4] bugfix some issues under abnormal scenarios.
@ 2026-01-04  7:07 Longfang Liu
  2026-01-04  7:07 ` [PATCH 1/4] hisi_acc_vfio_pci: fix VF reset timeout issue Longfang Liu
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Longfang Liu @ 2026-01-04  7:07 UTC (permalink / raw)
  To: alex.williamson, jgg, jonathan.cameron; +Cc: kvm, linux-kernel, liulongfang

The device driver must keep functioning properly in certain reset scenarios,
repeated-migration scenarios, and error-injection scenarios. This series
fixes the issues found in those scenarios.

Longfang Liu (3):
  hisi_acc_vfio_pci: update status after RAS error
  hisi_acc_vfio_pci: resolve duplicate migration states
  hisi_acc_vfio_pci: fix the queue parameter anomaly issue

Weili Qian (1):
  hisi_acc_vfio_pci: fix VF reset timeout issue

 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    | 40 +++++++++++++++----
 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.h    |  2 +
 2 files changed, 34 insertions(+), 8 deletions(-)

-- 
2.24.0



* [PATCH 1/4] hisi_acc_vfio_pci: fix VF reset timeout issue
  2026-01-04  7:07 [PATCH 0/4] bugfix some issues under abnormal scenarios Longfang Liu
@ 2026-01-04  7:07 ` Longfang Liu
  2026-01-16 16:47   ` Alex Williamson
  2026-01-04  7:07 ` [PATCH 2/4] hisi_acc_vfio_pci: update status after RAS error Longfang Liu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Longfang Liu @ 2026-01-04  7:07 UTC (permalink / raw)
  To: alex.williamson, jgg, jonathan.cameron; +Cc: kvm, linux-kernel, liulongfang

From: Weili Qian <qianweili@huawei.com>

If a device error occurs during live migration, QEMU will
reset the VF. At this time, the VF reset and the device reset are performed
simultaneously, and the VF reset will time out. Therefore, the QM_RESETTING
flag is used to ensure that the VF reset and the device reset are performed
serially.

Fixes: b0eed085903e ("hisi_acc_vfio_pci: Add support for VFIO live migration")
Signed-off-by: Weili Qian <qianweili@huawei.com>
---
 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    | 24 +++++++++++++++++++
 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.h    |  2 ++
 2 files changed, 26 insertions(+)

diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
index fe2ffcd00d6e..d55365b21f78 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
@@ -1188,14 +1188,37 @@ hisi_acc_vfio_pci_get_device_state(struct vfio_device *vdev,
 	return 0;
 }
 
+static void hisi_acc_vf_pci_reset_prepare(struct pci_dev *pdev)
+{
+	struct hisi_acc_vf_core_device *hisi_acc_vdev = hisi_acc_drvdata(pdev);
+	struct hisi_qm *qm = hisi_acc_vdev->pf_qm;
+	struct device *dev = &qm->pdev->dev;
+	u32 delay = 0;
+
+	/* All reset requests need to be queued for processing */
+	while (test_and_set_bit(QM_RESETTING, &qm->misc_ctl)) {
+		msleep(1);
+		if (++delay > QM_RESET_WAIT_TIMEOUT) {
+			dev_err(dev, "reset prepare failed\n");
+			return;
+		}
+	}
+
+	hisi_acc_vdev->set_reset_flag = true;
+}
+
 static void hisi_acc_vf_pci_aer_reset_done(struct pci_dev *pdev)
 {
 	struct hisi_acc_vf_core_device *hisi_acc_vdev = hisi_acc_drvdata(pdev);
+	struct hisi_qm *qm = hisi_acc_vdev->pf_qm;
 
 	if (hisi_acc_vdev->core_device.vdev.migration_flags !=
 				VFIO_MIGRATION_STOP_COPY)
 		return;
 
+	if (hisi_acc_vdev->set_reset_flag)
+		clear_bit(QM_RESETTING, &qm->misc_ctl);
+
 	mutex_lock(&hisi_acc_vdev->state_mutex);
 	hisi_acc_vf_reset(hisi_acc_vdev);
 	mutex_unlock(&hisi_acc_vdev->state_mutex);
@@ -1746,6 +1769,7 @@ static const struct pci_device_id hisi_acc_vfio_pci_table[] = {
 MODULE_DEVICE_TABLE(pci, hisi_acc_vfio_pci_table);
 
 static const struct pci_error_handlers hisi_acc_vf_err_handlers = {
+	.reset_prepare = hisi_acc_vf_pci_reset_prepare,
 	.reset_done = hisi_acc_vf_pci_aer_reset_done,
 	.error_detected = vfio_pci_core_aer_err_detected,
 };
diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
index cd55eba64dfb..a3d91a31e3d8 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
@@ -27,6 +27,7 @@
 
 #define ERROR_CHECK_TIMEOUT		100
 #define CHECK_DELAY_TIME		100
+#define QM_RESET_WAIT_TIMEOUT  60000
 
 #define QM_SQC_VFT_BASE_SHIFT_V2	28
 #define QM_SQC_VFT_BASE_MASK_V2		GENMASK(15, 0)
@@ -128,6 +129,7 @@ struct hisi_acc_vf_migration_file {
 struct hisi_acc_vf_core_device {
 	struct vfio_pci_core_device core_device;
 	u8 match_done;
+	bool set_reset_flag;
 	/*
 	 * io_base is only valid when dev_opened is true,
 	 * which is protected by open_mutex.
-- 
2.24.0



* [PATCH 2/4] hisi_acc_vfio_pci: update status after RAS error
  2026-01-04  7:07 [PATCH 0/4] bugfix some issues under abnormal scenarios Longfang Liu
  2026-01-04  7:07 ` [PATCH 1/4] hisi_acc_vfio_pci: fix VF reset timeout issue Longfang Liu
@ 2026-01-04  7:07 ` Longfang Liu
  2026-01-04  7:07 ` [PATCH 3/4] hisi_acc_vfio_pci: resolve duplicate migration states Longfang Liu
  2026-01-04  7:07 ` [PATCH 4/4] hisi_acc_vfio_pci: fix the queue parameter anomaly issue Longfang Liu
  3 siblings, 0 replies; 9+ messages in thread
From: Longfang Liu @ 2026-01-04  7:07 UTC (permalink / raw)
  To: alex.williamson, jgg, jonathan.cameron; +Cc: kvm, linux-kernel, liulongfang

After a RAS error occurs on the accelerator device, the device is
reset. The reset leaves the live migration state abnormal, so the
original state needs to be restored as part of the reset process.
Therefore, perform the reset handling whenever the device supports
migration (mig_ops is set), instead of only when migration_flags
equals VFIO_MIGRATION_STOP_COPY.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
index d55365b21f78..e782c2274871 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
@@ -1212,8 +1212,7 @@ static void hisi_acc_vf_pci_aer_reset_done(struct pci_dev *pdev)
 	struct hisi_acc_vf_core_device *hisi_acc_vdev = hisi_acc_drvdata(pdev);
 	struct hisi_qm *qm = hisi_acc_vdev->pf_qm;
 
-	if (hisi_acc_vdev->core_device.vdev.migration_flags !=
-				VFIO_MIGRATION_STOP_COPY)
+	if (!hisi_acc_vdev->core_device.vdev.mig_ops)
 		return;
 
 	if (hisi_acc_vdev->set_reset_flag)
-- 
2.24.0



* [PATCH 3/4] hisi_acc_vfio_pci: resolve duplicate migration states
  2026-01-04  7:07 [PATCH 0/4] bugfix some issues under abnormal scenarios Longfang Liu
  2026-01-04  7:07 ` [PATCH 1/4] hisi_acc_vfio_pci: fix VF reset timeout issue Longfang Liu
  2026-01-04  7:07 ` [PATCH 2/4] hisi_acc_vfio_pci: update status after RAS error Longfang Liu
@ 2026-01-04  7:07 ` Longfang Liu
  2026-01-04  7:07 ` [PATCH 4/4] hisi_acc_vfio_pci: fix the queue parameter anomaly issue Longfang Liu
  3 siblings, 0 replies; 9+ messages in thread
From: Longfang Liu @ 2026-01-04  7:07 UTC (permalink / raw)
  To: alex.williamson, jgg, jonathan.cameron; +Cc: kvm, linux-kernel, liulongfang

When a VF is migrated more than once, the state that indicates
completion of data migration for the VF device (match_done) is not
reset after the first migration. If the original VF device is used
again and then migrated to another destination, the second migration
is skipped without performing data migration.
Reset match_done when the device is opened, so that each subsequent
migration performs a complete data migration.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
index e782c2274871..394f1952a7ed 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
@@ -1583,6 +1583,7 @@ static int hisi_acc_vfio_pci_open_device(struct vfio_device *core_vdev)
 		}
 		hisi_acc_vdev->mig_state = VFIO_DEVICE_STATE_RUNNING;
 		hisi_acc_vdev->dev_opened = true;
+		hisi_acc_vdev->match_done = 0;
 		mutex_unlock(&hisi_acc_vdev->open_mutex);
 	}
 
-- 
2.24.0



* [PATCH 4/4] hisi_acc_vfio_pci: fix the queue parameter anomaly issue
  2026-01-04  7:07 [PATCH 0/4] bugfix some issues under abnormal scenarios Longfang Liu
                   ` (2 preceding siblings ...)
  2026-01-04  7:07 ` [PATCH 3/4] hisi_acc_vfio_pci: resolve duplicate migration states Longfang Liu
@ 2026-01-04  7:07 ` Longfang Liu
  2026-01-16 17:07   ` Alex Williamson
  3 siblings, 1 reply; 9+ messages in thread
From: Longfang Liu @ 2026-01-04  7:07 UTC (permalink / raw)
  To: alex.williamson, jgg, jonathan.cameron; +Cc: kvm, linux-kernel, liulongfang

When the number of QPs initialized by the device, as read via vft, is zero,
it indicates either an abnormal device configuration or an abnormal read
result.
Returning 0 directly in this case would allow the live migration operation
to complete successfully, leading to incorrect parameter configuration after
migration and preventing the service from recovering normal functionality.
Therefore, in such situations, an error should be returned to roll back the
live migration operation.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
---
 drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
index 394f1952a7ed..e0cc20f5f38b 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
@@ -406,7 +406,7 @@ static int vf_qm_check_match(struct hisi_acc_vf_core_device *hisi_acc_vdev,
 	struct hisi_qm *pf_qm = hisi_acc_vdev->pf_qm;
 	struct device *dev = &vf_qm->pdev->dev;
 	u32 que_iso_state;
-	int ret;
+	int qp_num, ret;
 
 	if (migf->total_length < QM_MATCH_SIZE || hisi_acc_vdev->match_done)
 		return 0;
@@ -423,18 +423,18 @@ static int vf_qm_check_match(struct hisi_acc_vf_core_device *hisi_acc_vdev,
 	}
 
 	/* VF qp num check */
-	ret = qm_get_vft(vf_qm, &vf_qm->qp_base);
-	if (ret <= 0) {
+	qp_num = qm_get_vft(vf_qm, &vf_qm->qp_base);
+	if (qp_num <= 0) {
 		dev_err(dev, "failed to get vft qp nums\n");
-		return ret;
+		return -EINVAL;
 	}
 
-	if (ret != vf_data->qp_num) {
+	if (qp_num != vf_data->qp_num) {
 		dev_err(dev, "failed to match VF qp num\n");
 		return -EINVAL;
 	}
 
-	vf_qm->qp_num = ret;
+	vf_qm->qp_num = qp_num;
 
 	/* VF isolation state check */
 	ret = qm_read_regs(pf_qm, QM_QUE_ISO_CFG_V, &que_iso_state, 1);
-- 
2.24.0



* Re: [PATCH 1/4] hisi_acc_vfio_pci: fix VF reset timeout issue
  2026-01-04  7:07 ` [PATCH 1/4] hisi_acc_vfio_pci: fix VF reset timeout issue Longfang Liu
@ 2026-01-16 16:47   ` Alex Williamson
  2026-01-20  7:32     ` liulongfang
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2026-01-16 16:47 UTC (permalink / raw)
  To: Longfang Liu; +Cc: alex.williamson, jgg, jonathan.cameron, kvm, linux-kernel

On Sun, 4 Jan 2026 15:07:03 +0800
Longfang Liu <liulongfang@huawei.com> wrote:

> From: Weili Qian <qianweili@huawei.com>
> 
> If device error occurs during live migration, qemu will
> reset the VF. At this time, VF reset and device reset are performed
> simultaneously. The VF reset will timeout. Therefore, the QM_RESETTING
> flag is used to ensure that VF reset and device reset are performed
> serially.
> 
> Fixes: b0eed085903e ("hisi_acc_vfio_pci: Add support for VFIO live migration")
> Signed-off-by: Weili Qian <qianweili@huawei.com>
> ---
>  .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    | 24 +++++++++++++++++++
>  .../vfio/pci/hisilicon/hisi_acc_vfio_pci.h    |  2 ++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> index fe2ffcd00d6e..d55365b21f78 100644
> --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> @@ -1188,14 +1188,37 @@ hisi_acc_vfio_pci_get_device_state(struct vfio_device *vdev,
>  	return 0;
>  }
>  
> +static void hisi_acc_vf_pci_reset_prepare(struct pci_dev *pdev)
> +{
> +	struct hisi_acc_vf_core_device *hisi_acc_vdev = hisi_acc_drvdata(pdev);
> +	struct hisi_qm *qm = hisi_acc_vdev->pf_qm;
> +	struct device *dev = &qm->pdev->dev;
> +	u32 delay = 0;
> +
> +	/* All reset requests need to be queued for processing */
> +	while (test_and_set_bit(QM_RESETTING, &qm->misc_ctl)) {
> +		msleep(1);
> +		if (++delay > QM_RESET_WAIT_TIMEOUT) {
> +			dev_err(dev, "reset prepare failed\n");
> +			return;
> +		}
> +	}
> +
> +	hisi_acc_vdev->set_reset_flag = true;
> +}
> +
>  static void hisi_acc_vf_pci_aer_reset_done(struct pci_dev *pdev)
>  {
>  	struct hisi_acc_vf_core_device *hisi_acc_vdev = hisi_acc_drvdata(pdev);
> +	struct hisi_qm *qm = hisi_acc_vdev->pf_qm;
>  
>  	if (hisi_acc_vdev->core_device.vdev.migration_flags !=
>  				VFIO_MIGRATION_STOP_COPY)
>  		return;
>  
> +	if (hisi_acc_vdev->set_reset_flag)
> +		clear_bit(QM_RESETTING, &qm->misc_ctl);


.reset_prepare sets QM_RESETTING unconditionally, .reset_done clears
QM_RESETTING conditionally based on the migration state.  In 2/ this
becomes conditional on the device supporting migration ops.  Doesn't
this enable a scenario where a device that does not support migration
puts QM_RESETTING into an inconsistent state that is never cleared?
Should the clear_bit() occur before the migration state/capability
check?

Thanks,
Alex

> +
>  	mutex_lock(&hisi_acc_vdev->state_mutex);
>  	hisi_acc_vf_reset(hisi_acc_vdev);
>  	mutex_unlock(&hisi_acc_vdev->state_mutex);
> @@ -1746,6 +1769,7 @@ static const struct pci_device_id hisi_acc_vfio_pci_table[] = {
>  MODULE_DEVICE_TABLE(pci, hisi_acc_vfio_pci_table);
>  
>  static const struct pci_error_handlers hisi_acc_vf_err_handlers = {
> +	.reset_prepare = hisi_acc_vf_pci_reset_prepare,
>  	.reset_done = hisi_acc_vf_pci_aer_reset_done,
>  	.error_detected = vfio_pci_core_aer_err_detected,
>  };
> diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
> index cd55eba64dfb..a3d91a31e3d8 100644
> --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
> +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
> @@ -27,6 +27,7 @@
>  
>  #define ERROR_CHECK_TIMEOUT		100
>  #define CHECK_DELAY_TIME		100
> +#define QM_RESET_WAIT_TIMEOUT  60000
>  
>  #define QM_SQC_VFT_BASE_SHIFT_V2	28
>  #define QM_SQC_VFT_BASE_MASK_V2		GENMASK(15, 0)
> @@ -128,6 +129,7 @@ struct hisi_acc_vf_migration_file {
>  struct hisi_acc_vf_core_device {
>  	struct vfio_pci_core_device core_device;
>  	u8 match_done;
> +	bool set_reset_flag;
>  	/*
>  	 * io_base is only valid when dev_opened is true,
>  	 * which is protected by open_mutex.



* Re: [PATCH 4/4] hisi_acc_vfio_pci: fix the queue parameter anomaly issue
  2026-01-04  7:07 ` [PATCH 4/4] hisi_acc_vfio_pci: fix the queue parameter anomaly issue Longfang Liu
@ 2026-01-16 17:07   ` Alex Williamson
  2026-01-20  7:51     ` liulongfang
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2026-01-16 17:07 UTC (permalink / raw)
  To: Longfang Liu; +Cc: alex.williamson, jgg, jonathan.cameron, kvm, linux-kernel

On Sun, 4 Jan 2026 15:07:06 +0800
Longfang Liu <liulongfang@huawei.com> wrote:

> When the number of QPs initialized by the device, as read via vft, is zero,
> it indicates either an abnormal device configuration or an abnormal read
> result.
> Returning 0 directly in this case would allow the live migration operation
> to complete successfully, leading to incorrect parameter configuration after
> migration and preventing the service from recovering normal functionality.
> Therefore, in such situations, an error should be returned to roll back the
> live migration operation.
> 
> Signed-off-by: Longfang Liu <liulongfang@huawei.com>
> ---
>  drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> index 394f1952a7ed..e0cc20f5f38b 100644
> --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> @@ -406,7 +406,7 @@ static int vf_qm_check_match(struct hisi_acc_vf_core_device *hisi_acc_vdev,
>  	struct hisi_qm *pf_qm = hisi_acc_vdev->pf_qm;
>  	struct device *dev = &vf_qm->pdev->dev;
>  	u32 que_iso_state;
> -	int ret;
> +	int qp_num, ret;
>  
>  	if (migf->total_length < QM_MATCH_SIZE || hisi_acc_vdev->match_done)
>  		return 0;
> @@ -423,18 +423,18 @@ static int vf_qm_check_match(struct hisi_acc_vf_core_device *hisi_acc_vdev,
>  	}
>  
>  	/* VF qp num check */
> -	ret = qm_get_vft(vf_qm, &vf_qm->qp_base);
> -	if (ret <= 0) {
> +	qp_num = qm_get_vft(vf_qm, &vf_qm->qp_base);
> +	if (qp_num <= 0) {
>  		dev_err(dev, "failed to get vft qp nums\n");
> -		return ret;
> +		return -EINVAL;
>  	}

Do you really want to clobber the errno or should this be something
like:

		return qp_num < 0 ? qp_num : -EINVAL;

And if you do that it might make sense to continue to use ret rather
than add the new variable.  Thanks,

Alex
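
The suggestion above can be sketched in user space. This is only an
illustration under stated assumptions: model_check_qp_num() and
model_demo() are hypothetical names, and the first argument stands in
for the value returned by qm_get_vft() (a negative errno on failure,
otherwise a QP count). It keeps a single ret-style value rather than
adding a new variable.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper, not part of the driver. "ret" models the
 * return value of a qm_get_vft()-style call. Returns 0 when the QP
 * count is usable, otherwise a negative errno.
 */
static int model_check_qp_num(int ret, int expected_qp_num)
{
	/* Keep a real errno as-is; a count of zero becomes -EINVAL. */
	if (ret <= 0)
		return ret < 0 ? ret : -EINVAL;

	/* A valid count that does not match the migrated data. */
	if (ret != expected_qp_num)
		return -EINVAL;

	return 0;
}

/* Small usage example covering the cases discussed in the review. */
static void model_demo(void)
{
	assert(model_check_qp_num(-EIO, 8) == -EIO);	/* errno preserved */
	assert(model_check_qp_num(0, 8) == -EINVAL);	/* zero QPs rejected */
	assert(model_check_qp_num(8, 8) == 0);		/* matching count */
}
```

The point of the conditional is that a read failure reports its own
errno, while a zero count still fails the migration instead of
returning 0.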

>  
> -	if (ret != vf_data->qp_num) {
> +	if (qp_num != vf_data->qp_num) {
>  		dev_err(dev, "failed to match VF qp num\n");
>  		return -EINVAL;
>  	}
>  
> -	vf_qm->qp_num = ret;
> +	vf_qm->qp_num = qp_num;
>  
>  	/* VF isolation state check */
>  	ret = qm_read_regs(pf_qm, QM_QUE_ISO_CFG_V, &que_iso_state, 1);



* Re: [PATCH 1/4] hisi_acc_vfio_pci: fix VF reset timeout issue
  2026-01-16 16:47   ` Alex Williamson
@ 2026-01-20  7:32     ` liulongfang
  0 siblings, 0 replies; 9+ messages in thread
From: liulongfang @ 2026-01-20  7:32 UTC (permalink / raw)
  To: Alex Williamson; +Cc: alex.williamson, jgg, jonathan.cameron, kvm, linux-kernel

On 2026/1/17 0:47, Alex Williamson wrote:
> On Sun, 4 Jan 2026 15:07:03 +0800
> Longfang Liu <liulongfang@huawei.com> wrote:
> 
>> From: Weili Qian <qianweili@huawei.com>
>>
>> If device error occurs during live migration, qemu will
>> reset the VF. At this time, VF reset and device reset are performed
>> simultaneously. The VF reset will timeout. Therefore, the QM_RESETTING
>> flag is used to ensure that VF reset and device reset are performed
>> serially.
>>
>> Fixes: b0eed085903e ("hisi_acc_vfio_pci: Add support for VFIO live migration")
>> Signed-off-by: Weili Qian <qianweili@huawei.com>
>> ---
>>  .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    | 24 +++++++++++++++++++
>>  .../vfio/pci/hisilicon/hisi_acc_vfio_pci.h    |  2 ++
>>  2 files changed, 26 insertions(+)
>>
>> diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
>> index fe2ffcd00d6e..d55365b21f78 100644
>> --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
>> +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
>> @@ -1188,14 +1188,37 @@ hisi_acc_vfio_pci_get_device_state(struct vfio_device *vdev,
>>  	return 0;
>>  }
>>  
>> +static void hisi_acc_vf_pci_reset_prepare(struct pci_dev *pdev)
>> +{
>> +	struct hisi_acc_vf_core_device *hisi_acc_vdev = hisi_acc_drvdata(pdev);
>> +	struct hisi_qm *qm = hisi_acc_vdev->pf_qm;
>> +	struct device *dev = &qm->pdev->dev;
>> +	u32 delay = 0;
>> +
>> +	/* All reset requests need to be queued for processing */
>> +	while (test_and_set_bit(QM_RESETTING, &qm->misc_ctl)) {
>> +		msleep(1);
>> +		if (++delay > QM_RESET_WAIT_TIMEOUT) {
>> +			dev_err(dev, "reset prepare failed\n");
>> +			return;
>> +		}
>> +	}
>> +
>> +	hisi_acc_vdev->set_reset_flag = true;
>> +}
>> +
>>  static void hisi_acc_vf_pci_aer_reset_done(struct pci_dev *pdev)
>>  {
>>  	struct hisi_acc_vf_core_device *hisi_acc_vdev = hisi_acc_drvdata(pdev);
>> +	struct hisi_qm *qm = hisi_acc_vdev->pf_qm;
>>  
>>  	if (hisi_acc_vdev->core_device.vdev.migration_flags !=
>>  				VFIO_MIGRATION_STOP_COPY)
>>  		return;
>>  
>> +	if (hisi_acc_vdev->set_reset_flag)
>> +		clear_bit(QM_RESETTING, &qm->misc_ctl);
> 
> 
> .reset_prepare sets QM_RESETTING unconditionally, .reset_done clears
> QM_RESETTING conditionally based on the migration state.  In 2/ this
> becomes conditional on the device supporting migration ops.  Doesn't
> this enable a scenario where a device that does not support migration
> puts QM_RESETTING into an inconsistent state that is never cleared?
> Should the clear_bit() occur before the migration state/capability
> check?
>

Yes, it makes more sense to move clear_bit() before the migration state
or capability check.

Thanks,
Longfang.
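
The ordering agreed on above can be modeled in user space. The
following is a minimal sketch, not the driver code. It assumes C11
atomics in place of the kernel's test_and_set_bit()/clear_bit(), a
small retry count in place of QM_RESET_WAIT_TIMEOUT with msleep(1),
and hypothetical model_* names throughout. Clearing set_reset_flag
after the release is an extra assumption of this sketch; the patch as
posted does not do that.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the driver structures. */
struct model_qm {
	atomic_flag resetting;	/* models the QM_RESETTING bit */
};

struct model_vdev {
	struct model_qm *pf_qm;
	bool set_reset_flag;	/* this VF took the bit in reset_prepare */
	bool has_mig_ops;	/* models vdev.mig_ops != NULL */
	int vf_resets;		/* counts modeled hisi_acc_vf_reset() calls */
};

#define MODEL_RESET_WAIT_TRIES 3	/* stands in for QM_RESET_WAIT_TIMEOUT */

/* Models .reset_prepare: take the bit, give up after a bounded wait. */
static void model_reset_prepare(struct model_vdev *v)
{
	int tries = 0;

	assert(v && v->pf_qm);
	while (atomic_flag_test_and_set(&v->pf_qm->resetting)) {
		/* The driver sleeps here; the model only counts retries. */
		if (++tries > MODEL_RESET_WAIT_TRIES)
			return;	/* timed out, bit is not owned by this VF */
	}
	v->set_reset_flag = true;
}

/* Models .reset_done with the release ahead of the migration check. */
static void model_reset_done(struct model_vdev *v)
{
	if (v->set_reset_flag) {
		atomic_flag_clear(&v->pf_qm->resetting);
		v->set_reset_flag = false;	/* assumption of this sketch */
	}

	if (!v->has_mig_ops)
		return;	/* no migration support: nothing else to restore */

	v->vf_resets++;	/* stands in for the locked hisi_acc_vf_reset() */
}
```

With the release placed first, a VF without migration ops still drops
the bit it took in reset_prepare, which is the case raised in the
review. If reset_prepare times out, reset_done leaves the bit alone,
because this VF never owned it.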

> Thanks,
> Alex
> 
>> +
>>  	mutex_lock(&hisi_acc_vdev->state_mutex);
>>  	hisi_acc_vf_reset(hisi_acc_vdev);
>>  	mutex_unlock(&hisi_acc_vdev->state_mutex);
>> @@ -1746,6 +1769,7 @@ static const struct pci_device_id hisi_acc_vfio_pci_table[] = {
>>  MODULE_DEVICE_TABLE(pci, hisi_acc_vfio_pci_table);
>>  
>>  static const struct pci_error_handlers hisi_acc_vf_err_handlers = {
>> +	.reset_prepare = hisi_acc_vf_pci_reset_prepare,
>>  	.reset_done = hisi_acc_vf_pci_aer_reset_done,
>>  	.error_detected = vfio_pci_core_aer_err_detected,
>>  };
>> diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
>> index cd55eba64dfb..a3d91a31e3d8 100644
>> --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
>> +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h
>> @@ -27,6 +27,7 @@
>>  
>>  #define ERROR_CHECK_TIMEOUT		100
>>  #define CHECK_DELAY_TIME		100
>> +#define QM_RESET_WAIT_TIMEOUT  60000
>>  
>>  #define QM_SQC_VFT_BASE_SHIFT_V2	28
>>  #define QM_SQC_VFT_BASE_MASK_V2		GENMASK(15, 0)
>> @@ -128,6 +129,7 @@ struct hisi_acc_vf_migration_file {
>>  struct hisi_acc_vf_core_device {
>>  	struct vfio_pci_core_device core_device;
>>  	u8 match_done;
>> +	bool set_reset_flag;
>>  	/*
>>  	 * io_base is only valid when dev_opened is true,
>>  	 * which is protected by open_mutex.
> 
> .
> 


* Re: [PATCH 4/4] hisi_acc_vfio_pci: fix the queue parameter anomaly issue
  2026-01-16 17:07   ` Alex Williamson
@ 2026-01-20  7:51     ` liulongfang
  0 siblings, 0 replies; 9+ messages in thread
From: liulongfang @ 2026-01-20  7:51 UTC (permalink / raw)
  To: Alex Williamson; +Cc: alex.williamson, jgg, jonathan.cameron, kvm, linux-kernel


On 2026/1/17 1:07, Alex Williamson wrote:
> On Sun, 4 Jan 2026 15:07:06 +0800
> Longfang Liu <liulongfang@huawei.com> wrote:
> 
>> When the number of QPs initialized by the device, as read via vft, is zero,
>> it indicates either an abnormal device configuration or an abnormal read
>> result.
>> Returning 0 directly in this case would allow the live migration operation
>> to complete successfully, leading to incorrect parameter configuration after
>> migration and preventing the service from recovering normal functionality.
>> Therefore, in such situations, an error should be returned to roll back the
>> live migration operation.
>>
>> Signed-off-by: Longfang Liu <liulongfang@huawei.com>
>> ---
>>  drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 12 ++++++------
>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
>> index 394f1952a7ed..e0cc20f5f38b 100644
>> --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
>> +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
>> @@ -406,7 +406,7 @@ static int vf_qm_check_match(struct hisi_acc_vf_core_device *hisi_acc_vdev,
>>  	struct hisi_qm *pf_qm = hisi_acc_vdev->pf_qm;
>>  	struct device *dev = &vf_qm->pdev->dev;
>>  	u32 que_iso_state;
>> -	int ret;
>> +	int qp_num, ret;
>>  
>>  	if (migf->total_length < QM_MATCH_SIZE || hisi_acc_vdev->match_done)
>>  		return 0;
>> @@ -423,18 +423,18 @@ static int vf_qm_check_match(struct hisi_acc_vf_core_device *hisi_acc_vdev,
>>  	}
>>  
>>  	/* VF qp num check */
>> -	ret = qm_get_vft(vf_qm, &vf_qm->qp_base);
>> -	if (ret <= 0) {
>> +	qp_num = qm_get_vft(vf_qm, &vf_qm->qp_base);
>> +	if (qp_num <= 0) {
>>  		dev_err(dev, "failed to get vft qp nums\n");
>> -		return ret;
>> +		return -EINVAL;
>>  	}
> 
> Do you really want to clobber the errno or should this be something
> like:
> 
> 		return qp_num < 0 ? qp_num : -EINVAL;
> 
> And if you do that it might make sense to continue to use ret rather
> than add the new variable.  Thanks,
>

OK, your proposed fix doesn't require introducing a new variable.
I'll address these issues in the next version.

Thanks.
Longfang.

> Alex
> 
>>  
>> -	if (ret != vf_data->qp_num) {
>> +	if (qp_num != vf_data->qp_num) {
>>  		dev_err(dev, "failed to match VF qp num\n");
>>  		return -EINVAL;
>>  	}
>>  
>> -	vf_qm->qp_num = ret;
>> +	vf_qm->qp_num = qp_num;
>>  
>>  	/* VF isolation state check */
>>  	ret = qm_read_regs(pf_qm, QM_QUE_ISO_CFG_V, &que_iso_state, 1);
> 
> .
> 

