* [PATCH v4 4/5] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
From: Dexuan Cui @ 2023-06-15 4:44 UTC
To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
wei.liu, longli, boqun.feng, ssengar, helgaas
Cc: linux-hyperv, linux-kernel, linux-rdma, netdev, josete,
simon.horman, stable
In-Reply-To: <20230615044451.5580-1-decui@microsoft.com>
This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.
The statement "the hv_pci_bus_exit() call releases structures of all its
child devices" in commit d6af2ed29c7c is not true. In the path
hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true), the
parameter "keep_devs" is true, so hv_pci_bus_exit() does *not* release the
child "struct hv_pci_dev *hpdev" that was created earlier in
pci_devices_present_work() -> new_pcichild_device().
Commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
where the old version of hv_pci_bus_exit() was used; when the commit was
rebased and merged upstream, nobody noticed that it was not really
necessary. The commit itself doesn't cause any issue, but it
makes hv_pci_probe() more complicated. Revert it to facilitate some
upcoming changes to hv_pci_probe().
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Wei Hu <weh@microsoft.com>
Cc: stable@vger.kernel.org
---
v2:
No change to the patch body.
Added Wei Hu's Acked-by.
Added Cc:stable
v3:
Added Michael's Reviewed-by.
v4:
No change since v3.
drivers/pci/controller/pci-hyperv.c | 71 ++++++++++++++---------------
1 file changed, 34 insertions(+), 37 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index a826b41c949a1..1a5296fad1c48 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3318,8 +3318,10 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
struct pci_bus_d0_entry *d0_entry;
struct hv_pci_compl comp_pkt;
struct pci_packet *pkt;
+ bool retry = true;
int ret;
+enter_d0_retry:
/*
* Tell the host that the bus is ready to use, and moved into the
* powered-on state. This includes telling the host which region
@@ -3346,6 +3348,38 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
if (ret)
goto exit;
+ /*
+ * In certain case (Kdump) the pci device of interest was
+ * not cleanly shut down and resource is still held on host
+ * side, the host could return invalid device status.
+ * We need to explicitly request host to release the resource
+ * and try to enter D0 again.
+ */
+ if (comp_pkt.completion_status < 0 && retry) {
+ retry = false;
+
+ dev_err(&hdev->device, "Retrying D0 Entry\n");
+
+ /*
+ * Hv_pci_bus_exit() calls hv_send_resource_released()
+ * to free up resources of its child devices.
+ * In the kdump kernel we need to set the
+ * wslot_res_allocated to 255 so it scans all child
+ * devices to release resources allocated in the
+ * normal kernel before panic happened.
+ */
+ hbus->wslot_res_allocated = 255;
+
+ ret = hv_pci_bus_exit(hdev, true);
+
+ if (ret == 0) {
+ kfree(pkt);
+ goto enter_d0_retry;
+ }
+ dev_err(&hdev->device,
+ "Retrying D0 failed with ret %d\n", ret);
+ }
+
if (comp_pkt.completion_status < 0) {
dev_err(&hdev->device,
"PCI Pass-through VSP failed D0 Entry with status %x\n",
@@ -3591,7 +3625,6 @@ static int hv_pci_probe(struct hv_device *hdev,
struct hv_pcibus_device *hbus;
u16 dom_req, dom;
char *name;
- bool enter_d0_retry = true;
int ret;
bridge = devm_pci_alloc_host_bridge(&hdev->device, 0);
@@ -3708,47 +3741,11 @@ static int hv_pci_probe(struct hv_device *hdev,
if (ret)
goto free_fwnode;
-retry:
ret = hv_pci_query_relations(hdev);
if (ret)
goto free_irq_domain;
ret = hv_pci_enter_d0(hdev);
- /*
- * In certain case (Kdump) the pci device of interest was
- * not cleanly shut down and resource is still held on host
- * side, the host could return invalid device status.
- * We need to explicitly request host to release the resource
- * and try to enter D0 again.
- * Since the hv_pci_bus_exit() call releases structures
- * of all its child devices, we need to start the retry from
- * hv_pci_query_relations() call, requesting host to send
- * the synchronous child device relations message before this
- * information is needed in hv_send_resources_allocated()
- * call later.
- */
- if (ret == -EPROTO && enter_d0_retry) {
- enter_d0_retry = false;
-
- dev_err(&hdev->device, "Retrying D0 Entry\n");
-
- /*
- * Hv_pci_bus_exit() calls hv_send_resources_released()
- * to free up resources of its child devices.
- * In the kdump kernel we need to set the
- * wslot_res_allocated to 255 so it scans all child
- * devices to release resources allocated in the
- * normal kernel before panic happened.
- */
- hbus->wslot_res_allocated = 255;
- ret = hv_pci_bus_exit(hdev, true);
-
- if (ret == 0)
- goto retry;
-
- dev_err(&hdev->device,
- "Retrying D0 failed with ret %d\n", ret);
- }
if (ret)
goto free_irq_domain;
--
2.25.1
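The retry-once flow that this revert moves back into hv_pci_enter_d0() can
be sketched in user space roughly as follows. This is only an illustration
under stated assumptions: struct fake_host, fake_send_d0_entry() and
fake_release_resources() are hypothetical stand-ins for the host and for the
hv_pci_bus_exit(hdev, true) step, not driver APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the host side (not a kernel structure). */
struct fake_host {
	bool stale_resources;	/* resources left over from a crashed kernel */
	int d0_requests;	/* number of D0 entry requests seen */
	int release_requests;	/* number of resource release requests seen */
};

/* Returns a negative status while stale resources are still held. */
static int fake_send_d0_entry(struct fake_host *host)
{
	host->d0_requests++;
	return host->stale_resources ? -1 : 0;
}

/* Models asking the host to release resources; always succeeds here. */
static int fake_release_resources(struct fake_host *host)
{
	host->release_requests++;
	host->stale_resources = false;
	return 0;
}

/* Try D0 entry; on failure, release resources and retry exactly once. */
static int enter_d0_with_retry(struct fake_host *host)
{
	bool retry = true;
	int status;

	assert(host != NULL);

enter_d0_retry:
	status = fake_send_d0_entry(host);

	if (status < 0 && retry) {
		retry = false;	/* guarantees at most one retry */
		fprintf(stderr, "Retrying D0 Entry\n");
		if (fake_release_resources(host) == 0)
			goto enter_d0_retry;
	}

	return status < 0 ? -EPROTO : 0;
}
```

Keeping the retry flag local to the function means the caller sees a single
call that either succeeds or fails, which is the simplification to
hv_pci_probe() that the commit message describes.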
* [PATCH v4 5/5] PCI: hv: Add a per-bus mutex state_lock
From: Dexuan Cui @ 2023-06-15 4:44 UTC
To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
wei.liu, longli, boqun.feng, ssengar, helgaas
Cc: linux-hyperv, linux-kernel, linux-rdma, netdev, josete,
simon.horman, stable
In-Reply-To: <20230615044451.5580-1-decui@microsoft.com>
In the case of fast device addition/removal, hv_eject_device_work() can
start to run before create_root_hv_pci_bus() does. As a result, the
pci_get_domain_bus_and_slot() call in hv_eject_device_work() can return
a NULL 'pdev'; hv_eject_device_work() then removes the 'hpdev' and
immediately sends a PCI_EJECTION_COMPLETE message to the host, and the
host immediately unassigns the PCI device from the guest. Meanwhile,
create_root_hv_pci_bus() and the PCI device driver can still be probing
the dead PCI device and reporting timeout errors.
Fix the issue by adding a per-bus mutex 'state_lock' and grabbing the
mutex before powering on the PCI bus in hv_pci_enter_d0(). As a result,
when hv_eject_device_work() starts to run, it is able to find the 'pdev'
and call pci_stop_and_remove_bus_device(pdev). If the PCI device driver
has loaded, its probe() function has already been called in
create_root_hv_pci_bus() -> pci_bus_add_devices(), and
hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able
to call the driver's remove() function and remove the device reliably.
If the PCI device driver hasn't loaded yet, hv_eject_device_work() ->
pci_stop_and_remove_bus_device() is able to remove the PCI device
reliably, and the driver's probe() function won't be called. If the PCI
device driver's probe() is already running (e.g., systemd-udev is loading
the PCI device driver), it must be holding the per-device lock; after
probe() finishes and releases the lock, hv_eject_device_work() ->
pci_stop_and_remove_bus_device() is able to proceed and remove the device
reliably.
Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: stable@vger.kernel.org
---
v2:
Removed the "debug code".
Fixed the "goto out" in hv_pci_resume() [Michael Kelley]
Added Cc:stable
v3:
Added Michael's Reviewed-by.
v4:
Added Lorenzo's Acked-by.
drivers/pci/controller/pci-hyperv.c | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 1a5296fad1c48..2d93d0c4f10db 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -489,7 +489,10 @@ struct hv_pcibus_device {
struct fwnode_handle *fwnode;
/* Protocol version negotiated with the host */
enum pci_protocol_version_t protocol_version;
+
+ struct mutex state_lock;
enum hv_pcibus_state state;
+
struct hv_device *hdev;
resource_size_t low_mmio_space;
resource_size_t high_mmio_space;
@@ -2605,6 +2608,8 @@ static void pci_devices_present_work(struct work_struct *work)
if (!dr)
return;
+ mutex_lock(&hbus->state_lock);
+
/* First, mark all existing children as reported missing. */
spin_lock_irqsave(&hbus->device_list_lock, flags);
list_for_each_entry(hpdev, &hbus->children, list_entry) {
@@ -2686,6 +2691,8 @@ static void pci_devices_present_work(struct work_struct *work)
break;
}
+ mutex_unlock(&hbus->state_lock);
+
kfree(dr);
}
@@ -2834,6 +2841,8 @@ static void hv_eject_device_work(struct work_struct *work)
hpdev = container_of(work, struct hv_pci_dev, wrk);
hbus = hpdev->hbus;
+ mutex_lock(&hbus->state_lock);
+
/*
* Ejection can come before or after the PCI bus has been set up, so
* attempt to find it and tear down the bus state, if it exists. This
@@ -2870,6 +2879,8 @@ static void hv_eject_device_work(struct work_struct *work)
put_pcichild(hpdev);
put_pcichild(hpdev);
/* hpdev has been freed. Do not use it any more. */
+
+ mutex_unlock(&hbus->state_lock);
}
/**
@@ -3636,6 +3647,7 @@ static int hv_pci_probe(struct hv_device *hdev,
return -ENOMEM;
hbus->bridge = bridge;
+ mutex_init(&hbus->state_lock);
hbus->state = hv_pcibus_init;
hbus->wslot_res_allocated = -1;
@@ -3745,9 +3757,11 @@ static int hv_pci_probe(struct hv_device *hdev,
if (ret)
goto free_irq_domain;
+ mutex_lock(&hbus->state_lock);
+
ret = hv_pci_enter_d0(hdev);
if (ret)
- goto free_irq_domain;
+ goto release_state_lock;
ret = hv_pci_allocate_bridge_windows(hbus);
if (ret)
@@ -3765,12 +3779,15 @@ static int hv_pci_probe(struct hv_device *hdev,
if (ret)
goto free_windows;
+ mutex_unlock(&hbus->state_lock);
return 0;
free_windows:
hv_pci_free_bridge_windows(hbus);
exit_d0:
(void) hv_pci_bus_exit(hdev, true);
+release_state_lock:
+ mutex_unlock(&hbus->state_lock);
free_irq_domain:
irq_domain_remove(hbus->irq_domain);
free_fwnode:
@@ -4020,20 +4037,26 @@ static int hv_pci_resume(struct hv_device *hdev)
if (ret)
goto out;
+ mutex_lock(&hbus->state_lock);
+
ret = hv_pci_enter_d0(hdev);
if (ret)
- goto out;
+ goto release_state_lock;
ret = hv_send_resources_allocated(hdev);
if (ret)
- goto out;
+ goto release_state_lock;
prepopulate_bars(hbus);
hv_pci_restore_msi_state(hbus);
hbus->state = hv_pcibus_installed;
+ mutex_unlock(&hbus->state_lock);
return 0;
+
+release_state_lock:
+ mutex_unlock(&hbus->state_lock);
out:
vmbus_close(hdev->channel);
return ret;
--
2.25.1
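The error-path ordering that the new release_state_lock label introduces in
hv_pci_probe() can be illustrated with a small single-threaded model. This
is a sketch under stated assumptions, not the driver code: struct fake_bus,
fake_lock(), fake_unlock() and the fail_step parameter are hypothetical, and
a plain flag stands in for the kernel mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the bus state touched by the probe path. */
struct fake_bus {
	bool state_locked;	/* stand-in for the per-bus state_lock */
	bool irq_domain_live;	/* resource acquired before taking the lock */
	bool in_d0;		/* resource acquired while holding the lock */
};

static void fake_lock(struct fake_bus *bus)
{
	assert(!bus->state_locked);	/* catch double locking */
	bus->state_locked = true;
}

static void fake_unlock(struct fake_bus *bus)
{
	assert(bus->state_locked);	/* catch unlocking an unheld lock */
	bus->state_locked = false;
}

/* fail_step: 0 = success, 1 = D0 entry fails, 2 = a later step fails. */
static int fake_probe(struct fake_bus *bus, int fail_step)
{
	int ret;

	bus->irq_domain_live = true;

	fake_lock(bus);

	ret = (fail_step == 1) ? -1 : 0;
	if (ret)
		goto release_state_lock;
	bus->in_d0 = true;

	ret = (fail_step == 2) ? -1 : 0;
	if (ret)
		goto exit_d0;

	fake_unlock(bus);
	return 0;

	/* Unwind in reverse order of acquisition. */
exit_d0:
	bus->in_d0 = false;
release_state_lock:
	fake_unlock(bus);
	/* The lock was taken after the IRQ domain, so it is dropped first. */
	bus->irq_domain_live = false;
	return ret;
}
```

Because the labels mirror the acquisition order, every exit from the
function leaves the lock released, whichever step failed.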
* RE: [PATCH v2 1/2] x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline
From: Dexuan Cui @ 2023-06-15 4:57 UTC
To: Michael Kelley (LINUX), bp@alien8.de
Cc: KY Srinivasan, Haiyang Zhang, catalin.marinas@arm.com,
will@kernel.org, tglx@linutronix.de, mingo@redhat.com,
dave.hansen@linux.intel.com, hpa@zytor.com,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
Wei Liu, linux-arm-kernel@lists.infradead.org, x86@kernel.org
In-Reply-To: <BYAPR21MB168882149A235E16CDECA1D8D750A@BYAPR21MB1688.namprd21.prod.outlook.com>
> From: Michael Kelley (LINUX) <mikelley@microsoft.com>
> Sent: Thursday, June 8, 2023 7:39 AM
> To: bp@alien8.de
>
> From: Wei Liu <wei.liu@kernel.org>
> >
> > On Tue, May 23, 2023 at 10:14:21AM -0700, Michael Kelley wrote:
> > [...]
> > > diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> > > index 0f1001d..3ceb9df 100644
> > > --- a/include/linux/cpuhotplug.h
> > > +++ b/include/linux/cpuhotplug.h
> > > @@ -200,6 +200,7 @@ enum cpuhp_state {
> > >
> > > /* Online section invoked on the hotplugged CPU from the hotplug thread */
> > > CPUHP_AP_ONLINE_IDLE,
> > > + CPUHP_AP_HYPERV_ONLINE,
> >
> > x86 maintainers, are you okay with this?
>
> Boris -- Are you OK with this, and could give an ACK? This small patch
> set fixes a problem introduced into 6.4-rc1 by other Confidential VM
> changes, so this fix needs to be incorporated before 6.4 is released.
>
> Michael
Hi Boris, gentle ping -- I hope this patch can be accepted soon, as
Tianyu and I need to rebase our SNP/TDX patches on it.
> > > CPUHP_AP_KVM_ONLINE,
> > > CPUHP_AP_SCHED_WAIT_EMPTY,
> > > CPUHP_AP_SMPBOOT_THREADS,
> > > --
> > > 1.8.3.1
> > >
* RE: [PATCH 1/1] clocksource: hyper-v: Rework clocksource and sched clock setup
From: Dexuan Cui @ 2023-06-15 5:15 UTC
To: Michael Kelley (LINUX), KY Srinivasan, Haiyang Zhang,
wei.liu@kernel.org, daniel.lezcano@linaro.org, tglx@linutronix.de,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <1686325621-16382-1-git-send-email-mikelley@microsoft.com>
> From: Michael Kelley (LINUX) <mikelley@microsoft.com>
> Sent: Friday, June 9, 2023 8:47 AM
> ...
Looks good to me.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
* [PATCH v3 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Wei Hu @ 2023-06-15 11:14 UTC
To: netdev, linux-hyperv, linux-rdma, longli, sharmaajay, jgg, leon,
kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
vkuznets, ssengar, shradhagupta, weh
Add EQ interrupt support for the mana ib driver. Allocate EQs per ucontext
to receive interrupts. Attach an EQ when a CQ is created. Call the CQ
interrupt handler when a completion interrupt happens. EQs are destroyed
when the ucontext is deallocated.
The change calls some public APIs in the mana ethernet driver to
allocate EQs and other resources. The EQ processing routine is also shared
by the mana ethernet and mana ib drivers.
Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Wei Hu <weh@microsoft.com>
---
v2: Use ibdev_dbg to print error messages and return -ENOMEM
when kzalloc fails.
v3: Check return value on mana_ib_gd_destroy_dma_region(). Remove most
debug prints.
drivers/infiniband/hw/mana/cq.c | 35 ++++-
drivers/infiniband/hw/mana/main.c | 85 ++++++++++++
drivers/infiniband/hw/mana/mana_ib.h | 4 +
drivers/infiniband/hw/mana/qp.c | 79 ++++++++++-
.../net/ethernet/microsoft/mana/gdma_main.c | 131 ++++++++++--------
drivers/net/ethernet/microsoft/mana/mana_en.c | 1 +
include/net/mana/gdma.h | 9 +-
7 files changed, 280 insertions(+), 64 deletions(-)
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index d141cab8a1e6..b6f61cd2d5eb 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -12,13 +12,20 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
struct ib_device *ibdev = ibcq->device;
struct mana_ib_create_cq ucmd = {};
struct mana_ib_dev *mdev;
+ struct gdma_context *gc;
+ struct gdma_dev *gd;
int err;
mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ gd = mdev->gdma_dev;
+ gc = gd->gdma_context;
if (udata->inlen < sizeof(ucmd))
return -EINVAL;
+ cq->comp_vector = attr->comp_vector > gc->max_num_queues ?
+ 0 : attr->comp_vector;
+
err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
if (err) {
ibdev_dbg(ibdev,
@@ -69,11 +76,35 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
struct ib_device *ibdev = ibcq->device;
struct mana_ib_dev *mdev;
+ struct gdma_context *gc;
+ struct gdma_dev *gd;
+ int err;
+
mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ gd = mdev->gdma_dev;
+ gc = gd->gdma_context;
+
- mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
- ib_umem_release(cq->umem);
+
+ if (atomic_read(&ibcq->usecnt) == 0) {
+ err = mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
+ if (err) {
+ ibdev_dbg(ibdev,
+ "Failed to destroy dma region, %d\n", err);
+ return err;
+ }
+ kfree(gc->cq_table[cq->id]);
+ gc->cq_table[cq->id] = NULL;
+ ib_umem_release(cq->umem);
+ }
return 0;
}
+
+void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq)
+{
+ struct mana_ib_cq *cq = ctx;
+
+ cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+}
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 7be4c3adb4e2..e2affb6ae5ad 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -143,6 +143,79 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
return err;
}
+static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
+ struct mana_ib_dev *mdev)
+{
+ struct gdma_context *gc = mdev->gdma_dev->gdma_context;
+ struct ib_device *ibdev = ucontext->ibucontext.device;
+ struct gdma_queue *eq;
+ int i;
+
+ if (!ucontext->eqs)
+ return;
+
+ for (i = 0; i < gc->max_num_queues; i++) {
+ eq = ucontext->eqs[i].eq;
+ if (!eq)
+ continue;
+
+ mana_gd_destroy_queue(gc, eq);
+ }
+
+ kfree(ucontext->eqs);
+ ucontext->eqs = NULL;
+}
+
+static int mana_ib_create_eq(struct mana_ib_ucontext *ucontext,
+ struct mana_ib_dev *mdev)
+{
+ struct gdma_queue_spec spec = {};
+ struct gdma_queue *queue;
+ struct gdma_context *gc;
+ struct ib_device *ibdev;
+ struct gdma_dev *gd;
+ int err;
+ int i;
+
+ if (!ucontext || !mdev)
+ return -EINVAL;
+
+ ibdev = ucontext->ibucontext.device;
+ gd = mdev->gdma_dev;
+
+ gc = gd->gdma_context;
+
+ ucontext->eqs = kcalloc(gc->max_num_queues, sizeof(struct mana_eq),
+ GFP_KERNEL);
+ if (!ucontext->eqs)
+ return -ENOMEM;
+
+ spec.type = GDMA_EQ;
+ spec.monitor_avl_buf = false;
+ spec.queue_size = EQ_SIZE;
+ spec.eq.callback = NULL;
+ spec.eq.context = ucontext->eqs;
+ spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
+ spec.eq.msix_allocated = true;
+
+ for (i = 0; i < gc->max_num_queues; i++) {
+ spec.eq.msix_index = i;
+ err = mana_gd_create_mana_eq(gd, &spec, &queue);
+ if (err)
+ goto out;
+
+ queue->eq.disable_needed = true;
+ ucontext->eqs[i].eq = queue;
+ }
+
+ return 0;
+
+out:
+ ibdev_dbg(ibdev, "Failed to allocate eq err %d\n", err);
+ mana_ib_destroy_eq(ucontext, mdev);
+ return err;
+}
+
static int mana_gd_destroy_doorbell_page(struct gdma_context *gc,
int doorbell_page)
{
@@ -225,7 +298,17 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
ucontext->doorbell = doorbell_page;
+ ret = mana_ib_create_eq(ucontext, mdev);
+ if (ret) {
+ ibdev_dbg(ibdev, "Failed to create eq's , ret %d\n", ret);
+ goto err;
+ }
+
return 0;
+
+err:
+ mana_gd_destroy_doorbell_page(gc, doorbell_page);
+ return ret;
}
void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
@@ -240,6 +323,8 @@ void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
gc = mdev->gdma_dev->gdma_context;
+ mana_ib_destroy_eq(mana_ucontext, mdev);
+
ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
if (ret)
ibdev_dbg(ibdev, "Failed to destroy doorbell page %d\n", ret);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 502cc8672eef..9672fa1670a5 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -67,6 +67,7 @@ struct mana_ib_cq {
int cqe;
u64 gdma_region;
u64 id;
+ u32 comp_vector;
};
struct mana_ib_qp {
@@ -86,6 +87,7 @@ struct mana_ib_qp {
struct mana_ib_ucontext {
struct ib_ucontext ibucontext;
u32 doorbell;
+ struct mana_eq *eqs;
};
struct mana_ib_rwq_ind_table {
@@ -159,4 +161,6 @@ int mana_ib_query_gid(struct ib_device *ibdev, u32 port, int index,
void mana_ib_disassociate_ucontext(struct ib_ucontext *ibcontext);
+void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq);
+
#endif
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 54b61930a7fd..b8fcb7a8eae0 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -96,16 +96,20 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
struct mana_ib_dev *mdev =
container_of(pd->device, struct mana_ib_dev, ib_dev);
+ struct ib_ucontext *ib_ucontext = pd->uobject->context;
struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
struct mana_ib_create_qp_rss_resp resp = {};
struct mana_ib_create_qp_rss ucmd = {};
+ struct mana_ib_ucontext *mana_ucontext;
struct gdma_dev *gd = mdev->gdma_dev;
mana_handle_t *mana_ind_table;
struct mana_port_context *mpc;
+ struct gdma_queue *gdma_cq;
struct mana_context *mc;
struct net_device *ndev;
struct mana_ib_cq *cq;
struct mana_ib_wq *wq;
+ struct mana_eq *eq;
unsigned int ind_tbl_size;
struct ib_cq *ibcq;
struct ib_wq *ibwq;
@@ -114,6 +118,8 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
int ret;
mc = gd->driver_data;
+ mana_ucontext =
+ container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
if (!udata || udata->inlen < sizeof(ucmd))
return -EINVAL;
@@ -180,6 +186,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
for (i = 0; i < ind_tbl_size; i++) {
struct mana_obj_spec wq_spec = {};
struct mana_obj_spec cq_spec = {};
+ unsigned int max_num_queues = gd->gdma_context->max_num_queues;
ibwq = ind_tbl->ind_tbl[i];
wq = container_of(ibwq, struct mana_ib_wq, ibwq);
@@ -193,7 +200,8 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
cq_spec.gdma_region = cq->gdma_region;
cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
- cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+ eq = &mana_ucontext->eqs[cq->comp_vector % max_num_queues];
+ cq_spec.attached_eq = eq->eq->id;
ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
&wq_spec, &cq_spec, &wq->rx_object);
@@ -215,6 +223,22 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
resp.entries[i].wqid = wq->id;
mana_ind_table[i] = wq->rx_object;
+
+ if (gd->gdma_context->cq_table[cq->id] == NULL) {
+
+ gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
+ if (!gdma_cq) {
+ ret = -ENOMEM;
+ goto free_cq;
+ }
+
+ gdma_cq->cq.context = cq;
+ gdma_cq->type = GDMA_CQ;
+ gdma_cq->cq.callback = mana_ib_cq_handler;
+ gdma_cq->id = cq->id;
+ gd->gdma_context->cq_table[cq->id] = gdma_cq;
+ }
+
}
resp.num_entries = i;
@@ -224,7 +248,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
ucmd.rx_hash_key_len,
ucmd.rx_hash_key);
if (ret)
- goto fail;
+ goto free_cq;
ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
if (ret) {
@@ -238,6 +262,23 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
return 0;
+free_cq:
+ {
+ int j = i;
+ u64 cqid;
+
+ while (j-- > 0) {
+ cqid = resp.entries[j].cqid;
+ gdma_cq = gd->gdma_context->cq_table[cqid];
+ cq = gdma_cq->cq.context;
+ if (atomic_read(&cq->ibcq.usecnt) == 0) {
+ kfree(gd->gdma_context->cq_table[cqid]);
+ gd->gdma_context->cq_table[cqid] = NULL;
+ }
+ }
+
+ }
+
fail:
while (i-- > 0) {
ibwq = ind_tbl->ind_tbl[i];
@@ -269,10 +310,12 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
struct mana_obj_spec wq_spec = {};
struct mana_obj_spec cq_spec = {};
struct mana_port_context *mpc;
+ struct gdma_queue *gdma_cq;
struct mana_context *mc;
struct net_device *ndev;
struct ib_umem *umem;
- int err;
+ struct mana_eq *eq;
+ int err, eq_vec;
u32 port;
mc = gd->driver_data;
@@ -350,7 +393,9 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
cq_spec.gdma_region = send_cq->gdma_region;
cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
- cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+ eq_vec = send_cq->comp_vector % gd->gdma_context->max_num_queues;
+ eq = &mana_ucontext->eqs[eq_vec];
+ cq_spec.attached_eq = eq->eq->id;
err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
&cq_spec, &qp->tx_object);
@@ -368,6 +413,23 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
qp->sq_id = wq_spec.queue_index;
send_cq->id = cq_spec.queue_index;
+ if (gd->gdma_context->cq_table[send_cq->id] == NULL) {
+
+ gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
+ if (!gdma_cq) {
+ err = -ENOMEM;
+ goto err_destroy_wqobj_and_cq;
+ }
+
+ gdma_cq->cq.context = send_cq;
+ gdma_cq->type = GDMA_CQ;
+ gdma_cq->cq.callback = mana_ib_cq_handler;
+ gdma_cq->id = send_cq->id;
+ gd->gdma_context->cq_table[send_cq->id] = gdma_cq;
+ } else {
+ gdma_cq = gd->gdma_context->cq_table[send_cq->id];
+ }
+
ibdev_dbg(&mdev->ib_dev,
"ret %d qp->tx_object 0x%llx sq id %llu cq id %llu\n", err,
qp->tx_object, qp->sq_id, send_cq->id);
@@ -381,12 +443,17 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
ibdev_dbg(&mdev->ib_dev,
"Failed copy udata for create qp-raw, %d\n",
err);
- goto err_destroy_wq_obj;
+ goto err_destroy_wqobj_and_cq;
}
return 0;
-err_destroy_wq_obj:
+err_destroy_wqobj_and_cq:
+ if (atomic_read(&send_cq->ibcq.usecnt) == 0) {
+ kfree(gdma_cq);
+ gd->gdma_context->cq_table[send_cq->id] = NULL;
+ }
+
mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
err_destroy_dma_region:
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..8231d77628d9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -368,53 +368,57 @@ static void mana_gd_process_eqe(struct gdma_queue *eq)
}
}
-static void mana_gd_process_eq_events(void *arg)
+static void mana_gd_process_eq_events(struct list_head *eq_list)
{
u32 owner_bits, new_bits, old_bits;
union gdma_eqe_info eqe_info;
struct gdma_eqe *eq_eqe_ptr;
- struct gdma_queue *eq = arg;
struct gdma_context *gc;
+ struct gdma_queue *eq;
struct gdma_eqe *eqe;
u32 head, num_eqe;
int i;
- gc = eq->gdma_dev->gdma_context;
+ list_for_each_entry_rcu(eq, eq_list, entry) {
+ gc = eq->gdma_dev->gdma_context;
- num_eqe = eq->queue_size / GDMA_EQE_SIZE;
- eq_eqe_ptr = eq->queue_mem_ptr;
+ num_eqe = eq->queue_size / GDMA_EQE_SIZE;
+ eq_eqe_ptr = eq->queue_mem_ptr;
- /* Process up to 5 EQEs at a time, and update the HW head. */
- for (i = 0; i < 5; i++) {
- eqe = &eq_eqe_ptr[eq->head % num_eqe];
- eqe_info.as_uint32 = eqe->eqe_info;
- owner_bits = eqe_info.owner_bits;
+ /* Process up to 5 EQEs at a time, and update the HW head. */
+ for (i = 0; i < 5; i++) {
+ eqe = &eq_eqe_ptr[eq->head % num_eqe];
+ eqe_info.as_uint32 = eqe->eqe_info;
+ owner_bits = eqe_info.owner_bits;
- old_bits = (eq->head / num_eqe - 1) & GDMA_EQE_OWNER_MASK;
- /* No more entries */
- if (owner_bits == old_bits)
- break;
+ old_bits =
+ (eq->head / num_eqe - 1) & GDMA_EQE_OWNER_MASK;
+ /* No more entries */
+ if (owner_bits == old_bits)
+ break;
- new_bits = (eq->head / num_eqe) & GDMA_EQE_OWNER_MASK;
- if (owner_bits != new_bits) {
- dev_err(gc->dev, "EQ %d: overflow detected\n", eq->id);
- break;
- }
+ new_bits = (eq->head / num_eqe) & GDMA_EQE_OWNER_MASK;
+ if (owner_bits != new_bits) {
+ dev_err(gc->dev, "EQ %d: overflow detected\n",
+ eq->id);
+ break;
+ }
- /* Per GDMA spec, rmb is necessary after checking owner_bits, before
- * reading eqe.
- */
- rmb();
+ /* Per GDMA spec, rmb is necessary after checking
+ * owner_bits, before reading eqe.
+ */
+ rmb();
- mana_gd_process_eqe(eq);
+ mana_gd_process_eqe(eq);
- eq->head++;
- }
+ eq->head++;
+ }
- head = eq->head % (num_eqe << GDMA_EQE_OWNER_BITS);
+ head = eq->head % (num_eqe << GDMA_EQE_OWNER_BITS);
- mana_gd_ring_doorbell(gc, eq->gdma_dev->doorbell, eq->type, eq->id,
- head, SET_ARM_BIT);
+ mana_gd_ring_doorbell(gc, eq->gdma_dev->doorbell, eq->type,
+ eq->id, head, SET_ARM_BIT);
+ }
}
static int mana_gd_register_irq(struct gdma_queue *queue,
@@ -432,44 +436,49 @@ static int mana_gd_register_irq(struct gdma_queue *queue,
gc = gd->gdma_context;
r = &gc->msix_resource;
dev = gc->dev;
+ msi_index = spec->eq.msix_index;
spin_lock_irqsave(&r->lock, flags);
- msi_index = find_first_zero_bit(r->map, r->size);
- if (msi_index >= r->size || msi_index >= gc->num_msix_usable) {
- err = -ENOSPC;
- } else {
- bitmap_set(r->map, msi_index, 1);
- queue->eq.msix_index = msi_index;
- }
-
- spin_unlock_irqrestore(&r->lock, flags);
+ if (!spec->eq.msix_allocated) {
+ msi_index = find_first_zero_bit(r->map, r->size);
+ if (msi_index >= r->size || msi_index >= gc->num_msix_usable)
+ err = -ENOSPC;
+ else
+ bitmap_set(r->map, msi_index, 1);
- if (err) {
- dev_err(dev, "Register IRQ err:%d, msi:%u rsize:%u, nMSI:%u",
- err, msi_index, r->size, gc->num_msix_usable);
+ if (err) {
+ dev_err(dev, "Register IRQ err:%d, msi:%u rsize:%u, nMSI:%u",
+ err, msi_index, r->size, gc->num_msix_usable);
- return err;
+ goto out;
+ }
}
+ queue->eq.msix_index = msi_index;
gic = &gc->irq_contexts[msi_index];
- WARN_ON(gic->handler || gic->arg);
+ list_add_rcu(&queue->entry, &gic->eq_list);
- gic->arg = queue;
+ WARN_ON(gic->handler);
gic->handler = mana_gd_process_eq_events;
- return 0;
+out:
+ spin_unlock_irqrestore(&r->lock, flags);
+
+ return err;
}
-static void mana_gd_deregiser_irq(struct gdma_queue *queue)
+static void mana_gd_deregister_irq(struct gdma_queue *queue)
{
struct gdma_dev *gd = queue->gdma_dev;
struct gdma_irq_context *gic;
struct gdma_context *gc;
struct gdma_resource *r;
unsigned int msix_index;
+ struct list_head *p, *n;
+ struct gdma_queue *eq;
unsigned long flags;
gc = gd->gdma_context;
@@ -480,13 +489,25 @@ static void mana_gd_deregiser_irq(struct gdma_queue *queue)
if (WARN_ON(msix_index >= gc->num_msix_usable))
return;
+ spin_lock_irqsave(&r->lock, flags);
+
gic = &gc->irq_contexts[msix_index];
- gic->handler = NULL;
- gic->arg = NULL;
- spin_lock_irqsave(&r->lock, flags);
- bitmap_clear(r->map, msix_index, 1);
+ list_for_each_safe(p, n, &gic->eq_list) {
+ eq = list_entry(p, struct gdma_queue, entry);
+ if (queue == eq) {
+ list_del_rcu(&eq->entry);
+ break;
+ }
+ }
+
+ if (list_empty(&gic->eq_list)) {
+ gic->handler = NULL;
+ bitmap_clear(r->map, msix_index, 1);
+ }
+
spin_unlock_irqrestore(&r->lock, flags);
+ synchronize_rcu();
queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
}
@@ -550,7 +571,7 @@ static void mana_gd_destroy_eq(struct gdma_context *gc, bool flush_evenets,
dev_warn(gc->dev, "Failed to flush EQ: %d\n", err);
}
- mana_gd_deregiser_irq(queue);
+ mana_gd_deregister_irq(queue);
if (queue->eq.disable_needed)
mana_gd_disable_queue(queue);
@@ -565,7 +586,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
u32 log2_num_entries;
int err;
- queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
+ queue->eq.msix_index = spec->eq.msix_index;
log2_num_entries = ilog2(queue->queue_size / GDMA_EQE_SIZE);
@@ -602,6 +623,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
mana_gd_destroy_eq(gc, false, queue);
return err;
}
+EXPORT_SYMBOL(mana_gd_create_mana_eq);
static void mana_gd_create_cq(const struct gdma_queue_spec *spec,
struct gdma_queue *queue)
@@ -873,6 +895,7 @@ void mana_gd_destroy_queue(struct gdma_context *gc, struct gdma_queue *queue)
mana_gd_free_memory(gmi);
kfree(queue);
}
+EXPORT_SYMBOL(mana_gd_destroy_queue);
int mana_gd_verify_vf_version(struct pci_dev *pdev)
{
@@ -1188,7 +1211,7 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
struct gdma_irq_context *gic = arg;
if (gic->handler)
- gic->handler(gic->arg);
+ gic->handler(&gic->eq_list);
return IRQ_HANDLED;
}
@@ -1241,7 +1264,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
for (i = 0; i < nvec; i++) {
gic = &gc->irq_contexts[i];
gic->handler = NULL;
- gic->arg = NULL;
+ INIT_LIST_HEAD(&gic->eq_list);
if (!i)
snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 06d6292e09b3..85345225813f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1156,6 +1156,7 @@ static int mana_create_eq(struct mana_context *ac)
spec.eq.callback = NULL;
spec.eq.context = ac->eqs;
spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
+ spec.eq.msix_allocated = false;
for (i = 0; i < gc->max_num_queues; i++) {
err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 96c120160f15..cc728fc42043 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -6,6 +6,7 @@
#include <linux/dma-mapping.h>
#include <linux/netdevice.h>
+#include <linux/list.h>
#include "shm_channel.h"
@@ -291,6 +292,8 @@ struct gdma_queue {
u32 head;
u32 tail;
+ struct list_head entry;
+
/* Extra fields specific to EQ/CQ. */
union {
struct {
@@ -325,6 +328,8 @@ struct gdma_queue_spec {
void *context;
unsigned long log2_throttle_limit;
+ bool msix_allocated;
+ unsigned int msix_index;
} eq;
struct {
@@ -340,8 +345,8 @@ struct gdma_queue_spec {
#define MANA_IRQ_NAME_SZ 32
struct gdma_irq_context {
- void (*handler)(void *arg);
- void *arg;
+ void (*handler)(struct list_head *arg);
+ struct list_head eq_list;
char name[MANA_IRQ_NAME_SZ];
};
--
2.25.1
^ permalink raw reply related
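[Editorial sketch, not part of the patch] The hunks above replace the single gic->arg pointer with a list head, so one interrupt vector can carry several EQs. A minimal user-space model of that dispatch pattern follows. All names in it (list_node, event_queue, irq_context, process_all_eqs, fake_intr) are hypothetical stand-ins, and what the real handler does with the list is not shown in this excerpt.

```c
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail_node(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical event queue embedding a list node, like 'entry' in gdma_queue. */
struct event_queue {
	int id;
	unsigned int events_handled;
	struct list_node entry;
};

/* Recover the containing event_queue from its embedded node. */
#define eq_from_node(ptr) \
	((struct event_queue *)((char *)(ptr) - offsetof(struct event_queue, entry)))

/* Hypothetical per-vector context: a handler plus the EQs sharing the vector. */
struct irq_context {
	void (*handler)(struct list_node *eq_list);
	struct list_node eq_list;
};

/* Assumed handler behaviour: service every EQ attached to the vector. */
static void process_all_eqs(struct list_node *eq_list)
{
	struct list_node *pos;

	for (pos = eq_list->next; pos != eq_list; pos = pos->next)
		eq_from_node(pos)->events_handled++;
}

/* Model of the interrupt entry point: dispatch to the handler, if any. */
static int fake_intr(struct irq_context *ctx)
{
	if (ctx->handler)
		ctx->handler(&ctx->eq_list);
	return 1; /* stands in for IRQ_HANDLED */
}
```

Attaching two event_queue objects with list_add_tail_node() and calling fake_intr() once bumps events_handled on both, which is the property a shared vector needs.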
* Re: [PATCH v2 1/2] x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline
From: Borislav Petkov @ 2023-06-15 11:40 UTC (permalink / raw)
To: Michael Kelley
Cc: kys, haiyangz, wei.liu, decui, catalin.marinas, will, tglx, mingo,
dave.hansen, hpa, linux-kernel, linux-hyperv, linux-arm-kernel,
x86
In-Reply-To: <1684862062-51576-1-git-send-email-mikelley@microsoft.com>
On Tue, May 23, 2023 at 10:14:21AM -0700, Michael Kelley wrote:
> These commits
>
> a494aef23dfc ("PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg")
> 2c6ba4216844 ("PCI: hv: Enable PCI pass-thru devices in Confidential VMs")
>
> update the Hyper-V virtual PCI driver to use the hyperv_pcpu_input_arg
> because that memory will be correctly marked as decrypted or encrypted
> for all VM types (CoCo or normal). But problems ensue when CPUs in the
> VM go online or offline after virtual PCI devices have been configured.
>
> When a CPU is brought online, the hyperv_pcpu_input_arg for that CPU is
> initialized by hv_cpu_init() running under state CPUHP_AP_ONLINE_DYN.
> But this state occurs after state CPUHP_AP_IRQ_AFFINITY_ONLINE, which
> may call the virtual PCI driver and fault trying to use the as yet
> uninitialized hyperv_pcpu_input_arg. A similar problem occurs in a CoCo
> VM if the MMIO read and write hypercalls are used from state
> CPUHP_AP_IRQ_AFFINITY_ONLINE.
>
> When a CPU is taken offline, IRQs may be reassigned in state
> CPUHP_TEARDOWN_CPU. Again, the virtual PCI driver may fault trying to
> use the hyperv_pcpu_input_arg that has already been freed by a
> higher state.
>
> Fix the onlining problem by adding state CPUHP_AP_HYPERV_ONLINE
> immediately after CPUHP_AP_ONLINE_IDLE (similar to CPUHP_AP_KVM_ONLINE)
> and before CPUHP_AP_IRQ_AFFINITY_ONLINE. Use this new state for
> Hyper-V initialization so that hyperv_pcpu_input_arg is allocated
> early enough.
>
> Fix the offlining problem by not freeing hyperv_pcpu_input_arg when
> a CPU goes offline. Retain the allocated memory, and reuse it if
> the CPU comes back online later.
>
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> ---
>
> Changes in v2:
> * Put CPUHP_AP_HYPERV_ONLINE before CPUHP_AP_KVM_ONLINE [Vitaly
> Kuznetsov]
>
> arch/x86/hyperv/hv_init.c | 2 +-
> drivers/hv/hv_common.c | 48 +++++++++++++++++++++++-----------------------
> include/linux/cpuhotplug.h | 1 +
> 3 files changed, 26 insertions(+), 25 deletions(-)
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
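[Editorial sketch, not part of the thread] The offlining fix quoted above keeps the per-CPU buffer allocated when a CPU goes offline and reuses it when the CPU returns. A rough user-space model of that policy is below. The names model_cpu_online(), model_cpu_offline(), input_arg[] and MODEL_NR_CPUS are hypothetical, and calloc() stands in for whatever allocator the kernel code uses.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4
#define MODEL_BUF_SIZE 4096

/* One buffer slot per CPU; NULL until the CPU first comes online. */
static void *input_arg[MODEL_NR_CPUS];
static unsigned int allocations;

/* Online path: allocate only if this CPU has no buffer yet. */
static int model_cpu_online(unsigned int cpu)
{
	if (cpu >= MODEL_NR_CPUS)
		return -1;

	if (!input_arg[cpu]) {
		input_arg[cpu] = calloc(1, MODEL_BUF_SIZE);
		if (!input_arg[cpu])
			return -1;
		allocations++;
	}
	return 0;
}

/*
 * Offline path: deliberately does not free the buffer, so code that runs
 * late in the teardown sequence can still use it.
 */
static void model_cpu_offline(unsigned int cpu)
{
	(void)cpu;
}
```

Cycling one CPU online, offline and online again leaves the pointer unchanged and the allocation count at one. The cost is that memory for offlined CPUs stays allocated.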
* Re: [PATCH v3 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Simon Horman @ 2023-06-15 14:00 UTC (permalink / raw)
To: Wei Hu
Cc: netdev, linux-hyperv, linux-rdma, longli, sharmaajay, jgg, leon,
kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
vkuznets, ssengar, shradhagupta
In-Reply-To: <20230615111412.1687573-1-weh@microsoft.com>
On Thu, Jun 15, 2023 at 11:14:12AM +0000, Wei Hu wrote:
Hi Wei Hu,
some minor nits from my side.
...
> @@ -69,11 +76,35 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
> struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
> struct ib_device *ibdev = ibcq->device;
> struct mana_ib_dev *mdev;
> + struct gdma_context *gc;
> + struct gdma_dev *gd;
> + int err;
> +
>
> mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> + gd = mdev->gdma_dev;
> + gc = gd->gdma_context;
> +
>
> - mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> - ib_umem_release(cq->umem);
> +
> + if (atomic_read(&ibcq->usecnt) == 0) {
> + err = mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> + if (err) {
> + ibdev_dbg(ibdev,
> + "Faile to destroy dma region, %d\n", err);
nit: Faile -> Failed
> + return err;
> + }
> + kfree(gc->cq_table[cq->id]);
> + gc->cq_table[cq->id] = NULL;
> + ib_umem_release(cq->umem);
> + }
>
> return 0;
> }
...
> diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
> index 7be4c3adb4e2..e2affb6ae5ad 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
> @@ -143,6 +143,79 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
> return err;
> }
>
> +static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
> + struct mana_ib_dev *mdev)
> +{
> + struct gdma_context *gc = mdev->gdma_dev->gdma_context;
> + struct ib_device *ibdev = ucontext->ibucontext.device;
nit: ibdev is set but unused.
GCC 12.3.0 with W=1 says:
drivers/infiniband/hw/mana/main.c: In function 'mana_ib_destroy_eq':
drivers/infiniband/hw/mana/main.c:150:27: warning: unused variable 'ibdev' [-Wunused-variable]
150 | struct ib_device *ibdev = ucontext->ibucontext.device;
> + struct gdma_queue *eq;
> + int i;
> +
> + if (!ucontext->eqs)
> + return;
> +
> + for (i = 0; i < gc->max_num_queues; i++) {
> + eq = ucontext->eqs[i].eq;
> + if (!eq)
> + continue;
> +
> + mana_gd_destroy_queue(gc, eq);
> + }
> +
> + kfree(ucontext->eqs);
> + ucontext->eqs = NULL;
> +}
...
^ permalink raw reply
* [PATCH] net: mana: Batch ringing RX queue doorbell on receiving packets
From: longli @ 2023-06-15 23:27 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Leon Romanovsky, Shradha Gupta, Ajay Sharma, Shachar Raindel,
Stephen Hemminger, linux-hyperv, netdev, linux-kernel
Cc: linux-rdma, Long Li, stable
From: Long Li <longli@microsoft.com>
It's inefficient to ring the doorbell page every time a WQE is posted to
the receive queue.
Instead, ring the doorbell page once, after all WQEs have been posted to
the receive queue during a callback from napi_poll().
Tests showed no regression in network latency benchmarks.
Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index cd4d5ceb9f2d..ef1f0ce8e44d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1383,8 +1383,8 @@ static void mana_post_pkt_rxq(struct mana_rxq *rxq)
recv_buf_oob = &rxq->rx_oobs[curr_index];
- err = mana_gd_post_and_ring(rxq->gdma_rq, &recv_buf_oob->wqe_req,
- &recv_buf_oob->wqe_inf);
+ err = mana_gd_post_work_request(rxq->gdma_rq, &recv_buf_oob->wqe_req,
+ &recv_buf_oob->wqe_inf);
if (WARN_ON_ONCE(err))
return;
@@ -1654,6 +1654,12 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
mana_process_rx_cqe(rxq, cq, &comp[i]);
}
+ if (comp_read) {
+ struct gdma_context *gc = rxq->gdma_rq->gdma_dev->gdma_context;
+
+ mana_gd_wq_ring_doorbell(gc, rxq->gdma_rq);
+ }
+
if (rxq->xdp_flush)
xdp_do_flush();
}
--
2.34.1
^ permalink raw reply related
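[Editorial sketch, not part of the patch] The effect of the change can be seen in a small counting model: post one WQE per completion, and either ring the doorbell each time or once at the end of the poll. The struct rx_queue and the four functions below are hypothetical stand-ins and do not model the real GDMA queue layout.

```c
/* Hypothetical receive queue model: counts posted WQEs and doorbell rings. */
struct rx_queue {
	unsigned int posted;    /* WQEs written to the queue */
	unsigned int visible;   /* WQEs the device has been told about */
	unsigned int doorbells; /* doorbell writes issued */
};

static void post_wqe(struct rx_queue *q)
{
	q->posted++;
}

/* A doorbell write publishes everything posted so far. */
static void ring_doorbell(struct rx_queue *q)
{
	q->visible = q->posted;
	q->doorbells++;
}

/* Behaviour before the patch: post and ring for every completion. */
static void poll_ring_each(struct rx_queue *q, unsigned int completions)
{
	unsigned int i;

	for (i = 0; i < completions; i++) {
		post_wqe(q);
		ring_doorbell(q);
	}
}

/* Behaviour after the patch: post for every completion, ring once. */
static void poll_ring_batched(struct rx_queue *q, unsigned int completions)
{
	unsigned int i;

	for (i = 0; i < completions; i++)
		post_wqe(q);

	/* Mirrors the 'if (comp_read)' guard: no completions, no doorbell. */
	if (completions)
		ring_doorbell(q);
}
```

For the same number of completions both variants leave the same number of WQEs visible to the device, but the batched variant issues a single doorbell write per poll.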
* Re: [PATCH v3 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: kernel test robot @ 2023-06-16 0:02 UTC (permalink / raw)
To: Wei Hu, netdev, linux-hyperv, linux-rdma, longli, sharmaajay, jgg,
leon, kys, haiyangz, wei.liu, decui, davem, edumazet, kuba,
pabeni, vkuznets, ssengar, shradhagupta
Cc: oe-kbuild-all
In-Reply-To: <20230615111412.1687573-1-weh@microsoft.com>
Hi Wei,
kernel test robot noticed the following build warnings:
[auto build test WARNING on linus/master]
[also build test WARNING on horms-ipvs/master v6.4-rc6 next-20230615]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Wei-Hu/RDMA-mana_ib-Add-EQ-interrupt-support-to-mana-ib-driver/20230615-191709
base: linus/master
patch link: https://lore.kernel.org/r/20230615111412.1687573-1-weh%40microsoft.com
patch subject: [PATCH v3 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20230616/202306160702.qHOTsE7v-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build):
git checkout linus/master
b4 shazam https://lore.kernel.org/r/20230615111412.1687573-1-weh@microsoft.com
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/infiniband/hw/mana/
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306160702.qHOTsE7v-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/infiniband/hw/mana/main.c: In function 'mana_ib_destroy_eq':
>> drivers/infiniband/hw/mana/main.c:150:27: warning: unused variable 'ibdev' [-Wunused-variable]
150 | struct ib_device *ibdev = ucontext->ibucontext.device;
| ^~~~~
vim +/ibdev +150 drivers/infiniband/hw/mana/main.c
145
146 static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
147 struct mana_ib_dev *mdev)
148 {
149 struct gdma_context *gc = mdev->gdma_dev->gdma_context;
> 150 struct ib_device *ibdev = ucontext->ibucontext.device;
151 struct gdma_queue *eq;
152 int i;
153
154 if (!ucontext->eqs)
155 return;
156
157 for (i = 0; i < gc->max_num_queues; i++) {
158 eq = ucontext->eqs[i].eq;
159 if (!eq)
160 continue;
161
162 mana_gd_destroy_queue(gc, eq);
163 }
164
165 kfree(ucontext->eqs);
166 ucontext->eqs = NULL;
167 }
168
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* [PATCH v7 0/2] Support TDX guests on Hyper-V (the x86/tdx part)
From: Dexuan Cui @ 2023-06-16 4:46 UTC (permalink / raw)
To: ak, arnd, bp, brijesh.singh, dan.j.williams, dave.hansen,
dave.hansen, haiyangz, hpa, jane.chu, kirill.shutemov, kys,
linux-arch, linux-hyperv, luto, mingo, peterz, rostedt,
sathyanarayanan.kuppuswamy, seanjc, tglx, tony.luck, wei.liu, x86,
mikelley
Cc: linux-kernel, Tianyu.Lan, rick.p.edgecombe, Dexuan Cui
The two patches (which are based on the latest x86/tdx branch in the tip
tree) are the x86/tdx part of the v6 patchset:
https://lwn.net/ml/linux-kernel/20230504225351.10765-1-decui@microsoft.com/
The other patches of the v6 patchset need more changes in preparation for
the upcoming paravisor support, so let me post the x86/tdx part first.
This v7 patchset addressed Dave's comments on patch 1:
see https://lwn.net/ml/linux-kernel/SA1PR21MB1335736123C2BCBBFD7460C3BF46A@SA1PR21MB1335.namprd21.prod.outlook.com/
Patch 2 is just a repost. There was a race between set_memory_encrypted()
and load_unaligned_zeropad(), which has been fixed by the 3 patches of
Kirill in the x86/tdx branch of the tip tree:
3f6819dd192e ("x86/mm: Allow guest.enc_status_change_prepare() to fail")
195edce08b63 ("x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()")
94142c9d1bdf ("x86/mm: Fix enc_status_change_finish_noop()")
(see https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=x86/tdx)
If you want to view the patchset on github, it is here:
https://github.com/dcui/tdx/commits/decui/upstream-tip/x86/tdx/v7
Thanks,
Dexuan
Dexuan Cui (2):
x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
x86/tdx: Support vmalloc() for tdx_enc_status_changed()
arch/x86/coco/tdx/tdx.c | 123 +++++++++++++++++++++++++++++++---------
1 file changed, 96 insertions(+), 27 deletions(-)
--
2.25.1
^ permalink raw reply
* [PATCH v7 1/2] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
From: Dexuan Cui @ 2023-06-16 4:47 UTC (permalink / raw)
To: ak, arnd, bp, brijesh.singh, dan.j.williams, dave.hansen,
dave.hansen, haiyangz, hpa, jane.chu, kirill.shutemov, kys,
linux-arch, linux-hyperv, luto, mingo, peterz, rostedt,
sathyanarayanan.kuppuswamy, seanjc, tglx, tony.luck, wei.liu, x86,
mikelley
Cc: linux-kernel, Tianyu.Lan, rick.p.edgecombe, Dexuan Cui
In-Reply-To: <20230616044701.15888-1-decui@microsoft.com>
GHCI spec for TDX 1.0 says that the MapGPA call may fail with the R10
error code = TDG.VP.VMCALL_RETRY (1), and the guest must retry this
operation for the pages in the region starting at the GPA specified
in R11.
When a fully enlightened TDX guest runs on Hyper-V, Hyper-V can return
the retry error when set_memory_decrypted() is called to decrypt up to
1GB of swiotlb bounce buffers.
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
Changes in v2:
Used __tdx_hypercall() directly in tdx_map_gpa().
Added a max_retry_cnt of 1000.
Renamed a few variables, e.g., r11 -> map_fail_paddr.
Changes in v3:
Changed max_retry_cnt from 1000 to 3.
Changes in v4:
__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT) -> __tdx_hypercall_ret()
Added Kirill's Acked-by.
Changes in v5:
Added Michael's Reviewed-by.
Changes in v6: None.
Changes in v7:
Addressed Dave's comments:
see https://lwn.net/ml/linux-kernel/SA1PR21MB1335736123C2BCBBFD7460C3BF46A@SA1PR21MB1335.namprd21.prod.outlook.com
arch/x86/coco/tdx/tdx.c | 65 +++++++++++++++++++++++++++++++++--------
1 file changed, 53 insertions(+), 12 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index cde174f4e239..5b62a1f5bd79 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -28,6 +28,8 @@
#define TDVMCALL_MAP_GPA 0x10001
#define TDVMCALL_REPORT_FATAL_ERROR 0x10003
+#define TDVMCALL_STATUS_RETRY 1
+
/* MMIO direction */
#define EPT_READ 0
#define EPT_WRITE 1
@@ -777,14 +779,16 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
}
/*
- * Inform the VMM of the guest's intent for this physical page: shared with
- * the VMM or private to the guest. The VMM is expected to change its mapping
- * of the page in response.
+ * Notify the VMM about page mapping conversion. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface (GHCI),
+ * section "TDG.VP.VMCALL<MapGPA>".
*/
-static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
{
- phys_addr_t start = __pa(vaddr);
- phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+ const int max_retries_per_page = 3;
+ struct tdx_hypercall_args args;
+ u64 map_fail_paddr, ret;
+ int retry_count = 0;
if (!enc) {
/* Set the shared (decrypted) bits: */
@@ -792,12 +796,49 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
end |= cc_mkdec(0);
}
- /*
- * Notify the VMM about page mapping conversion. More info about ABI
- * can be found in TDX Guest-Host-Communication Interface (GHCI),
- * section "TDG.VP.VMCALL<MapGPA>"
- */
- if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+ while (retry_count < max_retries_per_page) {
+ memset(&args, 0, sizeof(args));
+ args.r10 = TDX_HYPERCALL_STANDARD;
+ args.r11 = TDVMCALL_MAP_GPA;
+ args.r12 = start;
+ args.r13 = end - start;
+
+ ret = __tdx_hypercall_ret(&args);
+ if (ret != TDVMCALL_STATUS_RETRY)
+ return !ret;
+ /*
+ * The guest must retry the operation for the pages in the
+ * region starting at the GPA specified in R11. R11 comes
+ * from the untrusted VMM. Sanity check it.
+ */
+ map_fail_paddr = args.r11;
+ if (map_fail_paddr < start || map_fail_paddr >= end)
+ return false;
+
+ /* "Consume" a retry without forward progress */
+ if (map_fail_paddr == start) {
+ retry_count++;
+ continue;
+ }
+
+ start = map_fail_paddr;
+ retry_count = 0;
+ }
+
+ return false;
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest. The VMM is expected to change its mapping
+ * of the page in response.
+ */
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+ phys_addr_t start = __pa(vaddr);
+ phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+ if (!tdx_map_gpa(start, end, enc))
return false;
/* private->shared conversion requires only MapGPA call */
--
2.25.1
^ permalink raw reply related
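[Editorial sketch, not part of the patch] The retry policy in tdx_map_gpa() can be exercised outside the kernel with a small model. The control flow below follows the loop in the patch, but struct fake_vmm, fake_map_gpa() and map_gpa_with_retry() are hypothetical names, and the fake VMM simply converts a fixed number of bytes per call.

```c
#include <stdbool.h>
#include <stdint.h>

#define STATUS_OK    0
#define STATUS_RETRY 1 /* mirrors TDVMCALL_STATUS_RETRY from the patch */

/* Hypothetical stand-in for the VMM side of MapGPA. */
struct fake_vmm {
	uint64_t chunk;      /* bytes converted per call; 0 means no progress */
	uint64_t bogus_addr; /* if non-zero, always reported as failing address */
	unsigned int calls;  /* number of calls issued so far */
};

/* Model of one MapGPA call; *fail_addr plays the role of R11 on retry. */
static int fake_map_gpa(struct fake_vmm *vmm, uint64_t start, uint64_t len,
			uint64_t *fail_addr)
{
	vmm->calls++;

	if (vmm->bogus_addr) {
		*fail_addr = vmm->bogus_addr;
		return STATUS_RETRY;
	}

	if (vmm->chunk >= len)
		return STATUS_OK;

	*fail_addr = start + vmm->chunk;
	return STATUS_RETRY;
}

/* Same control flow as the loop in the patch, on plain integers. */
static bool map_gpa_with_retry(struct fake_vmm *vmm, uint64_t start,
			       uint64_t end)
{
	const int max_retries_per_page = 3;
	int retry_count = 0;

	while (retry_count < max_retries_per_page) {
		uint64_t fail_addr = 0;
		int ret = fake_map_gpa(vmm, start, end - start, &fail_addr);

		if (ret != STATUS_RETRY)
			return ret == STATUS_OK;

		/* The address comes from the untrusted side: range-check it. */
		if (fail_addr < start || fail_addr >= end)
			return false;

		/* No forward progress: consume one retry. */
		if (fail_addr == start) {
			retry_count++;
			continue;
		}

		/* Forward progress: resume there with a fresh retry budget. */
		start = fail_addr;
		retry_count = 0;
	}

	return false;
}
```

With chunk set to one page and a four-page range, the model succeeds after four calls. With chunk set to zero it gives up after three calls, because the retry budget is only refilled when the failing address moves forward.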
* [PATCH v7 2/2] x86/tdx: Support vmalloc() for tdx_enc_status_changed()
From: Dexuan Cui @ 2023-06-16 4:47 UTC (permalink / raw)
To: ak, arnd, bp, brijesh.singh, dan.j.williams, dave.hansen,
dave.hansen, haiyangz, hpa, jane.chu, kirill.shutemov, kys,
linux-arch, linux-hyperv, luto, mingo, peterz, rostedt,
sathyanarayanan.kuppuswamy, seanjc, tglx, tony.luck, wei.liu, x86,
mikelley
Cc: linux-kernel, Tianyu.Lan, rick.p.edgecombe, Dexuan Cui
In-Reply-To: <20230616044701.15888-1-decui@microsoft.com>
When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
allocates buffers using vzalloc(), and needs to share the buffers with the
host OS by calling set_memory_decrypted(), which is not working for
vmalloc() yet. Add the support by handling the pages one by one.
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
arch/x86/coco/tdx/tdx.c | 76 ++++++++++++++++++++++++++++-------------
1 file changed, 52 insertions(+), 24 deletions(-)
Changes in v2:
Changed tdx_enc_status_changed() in place.
Changes in v3:
No change since v2.
Changes in v4:
Added Kirill's Co-developed-by since Kirill helped to improve the
code by adding tdx_enc_status_changed_phys().
Thanks Kirill for the clarification on load_unaligned_zeropad()!
Changes in v5:
Added Kirill's Signed-off-by.
Added Michael's Reviewed-by.
Changes in v6: None.
Changes in v7: None.
Note: there was a race between set_memory_encrypted() and
load_unaligned_zeropad(), which has been fixed by the 3 patches of
Kirill in the x86/tdx branch of the tip tree.
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 5b62a1f5bd79..8b2a2dcb2efd 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@
#include <linux/cpufeature.h>
#include <linux/export.h>
#include <linux/io.h>
+#include <linux/mm.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -778,6 +779,34 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
return true;
}
+static bool try_accept_page(phys_addr_t start, phys_addr_t end)
+{
+ /*
+ * For shared->private conversion, accept the page using
+ * TDX_ACCEPT_PAGE TDX module call.
+ */
+ while (start < end) {
+ unsigned long len = end - start;
+
+ /*
+ * Try larger accepts first. It gives chance to VMM to keep
+ * 1G/2M SEPT entries where possible and speeds up process by
+ * cutting number of hypercalls (if successful).
+ */
+
+ if (try_accept_one(&start, len, PG_LEVEL_1G))
+ continue;
+
+ if (try_accept_one(&start, len, PG_LEVEL_2M))
+ continue;
+
+ if (!try_accept_one(&start, len, PG_LEVEL_4K))
+ return false;
+ }
+
+ return true;
+}
+
/*
* Notify the VMM about page mapping conversion. More info about ABI
* can be found in TDX Guest-Host-Communication Interface (GHCI),
@@ -828,6 +857,19 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
return false;
}
+static bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end,
+ bool enc)
+{
+ if (!tdx_map_gpa(start, end, enc))
+ return false;
+
+ /* private->shared conversion requires only MapGPA call */
+ if (!enc)
+ return true;
+
+ return try_accept_page(start, end);
+}
+
/*
* Inform the VMM of the guest's intent for this physical page: shared with
* the VMM or private to the guest. The VMM is expected to change its mapping
@@ -835,37 +877,23 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
*/
static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
{
- phys_addr_t start = __pa(vaddr);
- phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+ unsigned long start = vaddr;
+ unsigned long end = start + numpages * PAGE_SIZE;
- if (!tdx_map_gpa(start, end, enc))
+ if (offset_in_page(start) != 0)
return false;
- /* private->shared conversion requires only MapGPA call */
- if (!enc)
- return true;
+ if (!is_vmalloc_addr((void *)start))
+ return tdx_enc_status_changed_phys(__pa(start), __pa(end), enc);
- /*
- * For shared->private conversion, accept the page using
- * TDX_ACCEPT_PAGE TDX module call.
- */
while (start < end) {
- unsigned long len = end - start;
+ phys_addr_t start_pa = slow_virt_to_phys((void *)start);
+ phys_addr_t end_pa = start_pa + PAGE_SIZE;
- /*
- * Try larger accepts first. It gives chance to VMM to keep
- * 1G/2M SEPT entries where possible and speeds up process by
- * cutting number of hypercalls (if successful).
- */
-
- if (try_accept_one(&start, len, PG_LEVEL_1G))
- continue;
-
- if (try_accept_one(&start, len, PG_LEVEL_2M))
- continue;
-
- if (!try_accept_one(&start, len, PG_LEVEL_4K))
+ if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
return false;
+
+ start += PAGE_SIZE;
}
return true;
--
2.25.1
^ permalink raw reply related
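[Editorial sketch, not part of the patch] The reason for the page-by-page loop is that a vmalloc() buffer is contiguous only in virtual address space, so each page has to be translated and converted separately. The model below illustrates this with a table of physical frame numbers. struct conv_log, convert_phys() and convert_virt() are hypothetical names; convert_phys() only records the ranges it is given and stands in for tdx_enc_status_changed_phys().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL
#define LOG_MAX   16

/* Hypothetical record of one physical range passed to the conversion step. */
struct phys_range {
	uint64_t start;
	uint64_t end;
};

struct conv_log {
	struct phys_range ranges[LOG_MAX];
	size_t count;
};

/* Stand-in for the physical-range conversion: just records the range. */
static bool convert_phys(struct conv_log *log, uint64_t start, uint64_t end)
{
	if (log->count >= LOG_MAX)
		return false;

	log->ranges[log->count].start = start;
	log->ranges[log->count].end = end;
	log->count++;
	return true;
}

/*
 * Convert 'numpages' virtually contiguous pages. frames[i] is the physical
 * frame number backing page i. 'contiguous' models a direct-map address,
 * where one physical range starting at frames[0] covers the whole buffer.
 */
static bool convert_virt(struct conv_log *log, const uint64_t *frames,
			 size_t numpages, bool contiguous)
{
	size_t i;

	if (contiguous)
		return convert_phys(log, frames[0] * PAGE_SIZE,
				    (frames[0] + numpages) * PAGE_SIZE);

	/* vmalloc-style buffer: each page may live anywhere, so go one by one. */
	for (i = 0; i < numpages; i++) {
		uint64_t pa = frames[i] * PAGE_SIZE;

		if (!convert_phys(log, pa, pa + PAGE_SIZE))
			return false;
	}

	return true;
}
```

A three-page buffer backed by frames 7, 2 and 9 produces three separate one-page conversions, while a direct-map buffer of the same size produces a single three-page conversion.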
* RE: [PATCH] net: mana: Batch ringing RX queue doorbell on receiving packets
From: Haiyang Zhang @ 2023-06-16 16:49 UTC (permalink / raw)
To: longli@linuxonhyperv.com, KY Srinivasan, Wei Liu, Dexuan Cui,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Leon Romanovsky, Shradha Gupta, Ajay Sharma, Shachar Raindel,
Stephen Hemminger, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: linux-rdma@vger.kernel.org, Long Li, stable@vger.kernel.org
In-Reply-To: <1686871671-31110-1-git-send-email-longli@linuxonhyperv.com>
> -----Original Message-----
> From: longli@linuxonhyperv.com <longli@linuxonhyperv.com>
> Sent: Thursday, June 15, 2023 7:28 PM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; David S. Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo
> Abeni <pabeni@redhat.com>; Leon Romanovsky <leon@kernel.org>; Shradha
> Gupta <shradhagupta@linux.microsoft.com>; Ajay Sharma
> <sharmaajay@microsoft.com>; Shachar Raindel <shacharr@microsoft.com>;
> Stephen Hemminger <stephen@networkplumber.org>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Cc: linux-rdma@vger.kernel.org; Long Li <longli@microsoft.com>;
> stable@vger.kernel.org
> Subject: [PATCH] net: mana: Batch ringing RX queue doorbell on receiving
> packets
>
> From: Long Li <longli@microsoft.com>
>
> It's inefficient to ring the doorbell page every time a WQE is posted to
> the receive queue.
>
> Move the code for ringing doorbell page to where after we have posted all
> WQEs to the receive queue during a callback from napi_poll().
>
> Tests showed no regression in network latency benchmarks.
>
> Cc: stable@vger.kernel.org
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network
> Adapter (MANA)")
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
> drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index cd4d5ceb9f2d..ef1f0ce8e44d 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1383,8 +1383,8 @@ static void mana_post_pkt_rxq(struct mana_rxq
> *rxq)
>
> recv_buf_oob = &rxq->rx_oobs[curr_index];
>
> - err = mana_gd_post_and_ring(rxq->gdma_rq, &recv_buf_oob-
> >wqe_req,
> - &recv_buf_oob->wqe_inf);
> + err = mana_gd_post_work_request(rxq->gdma_rq, &recv_buf_oob-
> >wqe_req,
> + &recv_buf_oob->wqe_inf);
> if (WARN_ON_ONCE(err))
> return;
>
> @@ -1654,6 +1654,12 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
> mana_process_rx_cqe(rxq, cq, &comp[i]);
> }
>
> + if (comp_read) {
> + struct gdma_context *gc = rxq->gdma_rq->gdma_dev-
> >gdma_context;
> +
> + mana_gd_wq_ring_doorbell(gc, rxq->gdma_rq);
> + }
> +
Thank you!
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
^ permalink raw reply
* RE: [PATCH] net: mana: Batch ringing RX queue doorbell on receiving packets
From: Long Li @ 2023-06-16 18:54 UTC (permalink / raw)
To: Haiyang Zhang, longli@linuxonhyperv.com, KY Srinivasan, Wei Liu,
Dexuan Cui, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Leon Romanovsky, Shradha Gupta, Ajay Sharma,
Shachar Raindel, Stephen Hemminger, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: linux-rdma@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <PH7PR21MB3116FB2C7E12556B0007C9BFCA58A@PH7PR21MB3116.namprd21.prod.outlook.com>
Hi,
I'm sending v2 to address some corner cases discovered during tests.
Thanks,
Long
> -----Original Message-----
> From: Haiyang Zhang <haiyangz@microsoft.com>
> Sent: Friday, June 16, 2023 9:49 AM
> To: longli@linuxonhyperv.com; KY Srinivasan <kys@microsoft.com>; Wei Liu
> <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; David S. Miller
> <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub
> Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>; Leon
> Romanovsky <leon@kernel.org>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Ajay Sharma
> <sharmaajay@microsoft.com>; Shachar Raindel <shacharr@microsoft.com>;
> Stephen Hemminger <stephen@networkplumber.org>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Cc: linux-rdma@vger.kernel.org; Long Li <longli@microsoft.com>;
> stable@vger.kernel.org
> Subject: RE: [PATCH] net: mana: Batch ringing RX queue doorbell on receiving
> packets
>
>
>
> > -----Original Message-----
> > From: longli@linuxonhyperv.com <longli@linuxonhyperv.com>
> > Sent: Thursday, June 15, 2023 7:28 PM
> > To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> > <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> > <decui@microsoft.com>; David S. Miller <davem@davemloft.net>; Eric
> > Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>;
> Paolo
> > Abeni <pabeni@redhat.com>; Leon Romanovsky <leon@kernel.org>;
> Shradha
> > Gupta <shradhagupta@linux.microsoft.com>; Ajay Sharma
> > <sharmaajay@microsoft.com>; Shachar Raindel <shacharr@microsoft.com>;
> > Stephen Hemminger <stephen@networkplumber.org>; linux-
> > hyperv@vger.kernel.org; netdev@vger.kernel.org; linux-
> > kernel@vger.kernel.org
> > Cc: linux-rdma@vger.kernel.org; Long Li <longli@microsoft.com>;
> > stable@vger.kernel.org
> > Subject: [PATCH] net: mana: Batch ringing RX queue doorbell on
> > receiving packets
> >
> > From: Long Li <longli@microsoft.com>
> >
> > It's inefficient to ring the doorbell page every time a WQE is posted
> > to the receive queue.
> >
> > Move the code for ringing doorbell page to where after we have posted
> > all WQEs to the receive queue during a callback from napi_poll().
> >
> > Tests showed no regression in network latency benchmarks.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure
> > Network Adapter (MANA)")
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> > drivers/net/ethernet/microsoft/mana/mana_en.c | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index cd4d5ceb9f2d..ef1f0ce8e44d 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -1383,8 +1383,8 @@ static void mana_post_pkt_rxq(struct mana_rxq
> > *rxq)
> >
> > recv_buf_oob = &rxq->rx_oobs[curr_index];
> >
> > - err = mana_gd_post_and_ring(rxq->gdma_rq, &recv_buf_oob-
> > >wqe_req,
> > - &recv_buf_oob->wqe_inf);
> > + err = mana_gd_post_work_request(rxq->gdma_rq, &recv_buf_oob-
> > >wqe_req,
> > + &recv_buf_oob->wqe_inf);
> > if (WARN_ON_ONCE(err))
> > return;
> >
> > @@ -1654,6 +1654,12 @@ static void mana_poll_rx_cq(struct mana_cq
> *cq)
> > mana_process_rx_cqe(rxq, cq, &comp[i]);
> > }
> >
> > + if (comp_read) {
> > + struct gdma_context *gc = rxq->gdma_rq->gdma_dev-
> > >gdma_context;
> > +
> > + mana_gd_wq_ring_doorbell(gc, rxq->gdma_rq);
> > + }
> > +
>
> Thank you!
>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
^ permalink raw reply
* Re: [PATCH v4 0/5] pci-hyperv: Fix race condition bugs for fast device hotplug
From: Wei Liu @ 2023-06-18 3:06 UTC (permalink / raw)
To: Dexuan Cui
Cc: bhelgaas, davem, edumazet, haiyangz, jakeo, kuba, kw, kys, leon,
linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm, wei.liu,
longli, boqun.feng, ssengar, helgaas, linux-hyperv, linux-kernel,
linux-rdma, netdev, josete, simon.horman
In-Reply-To: <20230615044451.5580-1-decui@microsoft.com>
On Wed, Jun 14, 2023 at 09:44:46PM -0700, Dexuan Cui wrote:
> Before the guest finishes probing a device, the host may be already starting
> to remove the device. Currently there are multiple race condition bugs in the
> pci-hyperv driver, which can cause the guest to panic. The patchset fixes
> the crashes.
>
> The patchset also does some cleanup work: patch 3 removes the useless
> hv_pcichild_state, and patch 4 reverts an old patch which is not really
> useful (without patch 4, it would be hard to make patch 5 clean).
>
> Patch 6 in v3 is dropped for now since it's a feature rather than a fix.
> Patch 6 will be split into two patches as suggested by Lorenzo and will be
> posted after the 5 patches are accepted first.
>
> The v4 addressed Lorenzo's comments and added Lorenzo's Acks to patches
> 1, 3 and 5.
>
> The v4 is based on v6.4-rc6, and can apply cleanly to the Hyper-V tree's
> hyperv-fixes branch.
>
> The patchset is also available in my github branch:
> https://github.com/dcui/tdx/commits/decui/vpci/v6.4-rc6-vpci-v4
>
> FYI, v3 can be found here:
> https://lwn.net/ml/linux-kernel/20230420024037.5921-1-decui@microsoft.com/
>
> Please review. Thanks!
>
>
> Dexuan Cui (5):
> PCI: hv: Fix a race condition bug in hv_pci_query_relations()
> PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
> PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
> Revert "PCI: hv: Fix a timing issue which causes kdump to fail
> occasionally"
> PCI: hv: Add a per-bus mutex state_lock
Applied to hyperv-fixes. Thanks.
^ permalink raw reply
* Re: [PATCH v7 0/2] Support TDX guests on Hyper-V (the x86/tdx part)
From: Kirill A. Shutemov @ 2023-06-19 13:47 UTC (permalink / raw)
To: Dexuan Cui
Cc: ak, arnd, bp, brijesh.singh, dan.j.williams, dave.hansen,
dave.hansen, haiyangz, hpa, jane.chu, kirill.shutemov, kys,
linux-arch, linux-hyperv, luto, mingo, peterz, rostedt,
sathyanarayanan.kuppuswamy, seanjc, tglx, tony.luck, wei.liu, x86,
mikelley, linux-kernel, Tianyu.Lan, rick.p.edgecombe
In-Reply-To: <20230616044701.15888-1-decui@microsoft.com>
On Thu, Jun 15, 2023 at 09:46:59PM -0700, Dexuan Cui wrote:
> The two patches (which are based on the latest x86/tdx branch in the tip
> tree) are the x86/tdx part of the v6 patchset:
> https://lwn.net/ml/linux-kernel/20230504225351.10765-1-decui@microsoft.com/
>
> The other patches of the v6 patchset need more changes in preparation for
> the upcoming paravisor support, so let me post the x86/tdx part first.
>
> This v7 patchset addressed Dave's comments on patch 1:
> see https://lwn.net/ml/linux-kernel/SA1PR21MB1335736123C2BCBBFD7460C3BF46A@SA1PR21MB1335.namprd21.prod.outlook.com/
>
> Patch 2 is just a repost. There was a race between set_memory_encrypted()
> and load_unaligned_zeropad(), which has been fixed by the 3 patches of
> Kirill in the x86/tdx branch of the tip tree:
> 3f6819dd192e ("x86/mm: Allow guest.enc_status_change_prepare() to fail")
> 195edce08b63 ("x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()")
> 94142c9d1bdf ("x86/mm: Fix enc_status_change_finish_noop()")
> (see https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=x86/tdx)
>
> If you want to view the patchset on github, it is here:
> https://github.com/dcui/tdx/commits/decui/upstream-tip/x86/tdx/v7
JFYI, it won't apply to tip/master. Unaccepted memory changed the code you
are patching.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 1/1] clocksource: hyper-v: Rework clocksource and sched clock setup
From: Daniel Lezcano @ 2023-06-19 16:16 UTC (permalink / raw)
To: Michael Kelley, kys, haiyangz, wei.liu, decui, tglx, linux-kernel,
linux-hyperv
In-Reply-To: <1686325621-16382-1-git-send-email-mikelley@microsoft.com>
On 09/06/2023 17:47, Michael Kelley wrote:
> Current code assigns either the Hyper-V TSC page or MSR-based ref counter
> as the sched clock. This may be sub-optimal in two cases. First, if there
> is hardware support to ensure consistent TSC frequency across live
> migrations and Hyper-V is using that support, the raw TSC is a faster
> source of time than the Hyper-V TSC page. Second, the MSR-based ref
> counter is relatively slow because reads require a trap to the hypervisor.
> As such, it should never be used as the sched clock. The native sched
> clock based on the raw TSC or jiffies is much better.
>
> Rework the sched clock setup so it is set to the TSC page only if
> Hyper-V indicates that the TSC may have inconsistent frequency across
> live migrations. Also, remove the code that sets the sched clock to
> the MSR-based ref counter. In the cases where it is not set, the sched
> clock will then be the native sched clock.
>
> As part of the rework, always enable both the TSC page clocksource and
> the MSR-based ref counter clocksource. Set the ratings so the TSC page
> clocksource is preferred. While the MSR-based ref counter clocksource
> is unlikely to ever be the default, having it available for manual
> selection is convenient for development purposes.
>
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
The patch does not apply, does it depend on another patch?
Rejected chunk:
--- drivers/clocksource/hyperv_timer.c
+++ drivers/clocksource/hyperv_timer.c
@@ -485,15 +485,9 @@ static u64 notrace read_hv_clock_msr_cs(struct
clocksource *arg)
return read_hv_clock_msr();
}
-static u64 noinstr read_hv_sched_clock_msr(void)
-{
- return (read_hv_clock_msr() - hv_sched_clock_offset) *
- (NSEC_PER_SEC / HV_CLOCK_HZ);
-}
-
static struct clocksource hyperv_cs_msr = {
.name = "hyperv_clocksource_msr",
- .rating = 500,
+ .rating = 495,
.read = read_hv_clock_msr_cs,
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
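For readers following the rejected chunk: the removed helper read_hv_sched_clock_msr() scales the Hyper-V reference counter, which ticks in 100 ns units, to nanoseconds relative to a boot-time offset. The following is a minimal userspace sketch of that arithmetic, not the kernel code. The function name ref_ticks_to_ns() is hypothetical, and HV_CLOCK_HZ is assumed to be 10 MHz, which is consistent with the clocksource_register_hz(..., NSEC_PER_SEC/100) calls in the driver.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
/* Assumption: the reference counter ticks every 100 ns, i.e. at 10 MHz. */
#define HV_CLOCK_HZ  (NSEC_PER_SEC / 100)

/*
 * Hypothetical helper mirroring the arithmetic of the removed
 * read_hv_sched_clock_msr(): subtract the offset captured at setup time,
 * then scale 100 ns ticks to nanoseconds (a factor of 100).
 */
static uint64_t ref_ticks_to_ns(uint64_t now_ticks, uint64_t offset_ticks)
{
	return (now_ticks - offset_ticks) * (NSEC_PER_SEC / HV_CLOCK_HZ);
}
```

The arithmetic itself is cheap. The cost that motivates the patch is the counter read, which the commit message says requires a trap to the hypervisor on every call.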
* RE: [PATCH v7 0/2] Support TDX guests on Hyper-V (the x86/tdx part)
From: Dexuan Cui @ 2023-06-19 16:23 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: ak@linux.intel.com, arnd@arndb.de, bp@alien8.de,
brijesh.singh@amd.com, dan.j.williams@intel.com,
dave.hansen@intel.com, dave.hansen@linux.intel.com, Haiyang Zhang,
hpa@zytor.com, jane.chu@oracle.com,
kirill.shutemov@linux.intel.com, KY Srinivasan,
linux-arch@vger.kernel.org, linux-hyperv@vger.kernel.org,
luto@kernel.org, mingo@redhat.com, peterz@infradead.org,
rostedt@goodmis.org, sathyanarayanan.kuppuswamy@linux.intel.com,
seanjc@google.com, tglx@linutronix.de, tony.luck@intel.com,
wei.liu@kernel.org, x86@kernel.org, Michael Kelley (LINUX),
linux-kernel@vger.kernel.org, Tianyu Lan,
rick.p.edgecombe@intel.com
In-Reply-To: <20230619134709.6c4sgargh67xwc5g@box.shutemov.name>
> From: Kirill A. Shutemov <kirill@shutemov.name>
> Sent: Monday, June 19, 2023 6:47 AM
> ...
> JFYI, it won't apply to tip/master. Unaccepted memory changed the code you
> are patching.
Thanks for letting me know! I'll rebase to tip/master and repost shortly.
* RE: [PATCH 1/1] clocksource: hyper-v: Rework clocksource and sched clock setup
From: Michael Kelley (LINUX) @ 2023-06-19 16:44 UTC (permalink / raw)
To: Daniel Lezcano, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
Dexuan Cui, tglx@linutronix.de, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org
In-Reply-To: <fdc643c4-6298-d337-1d8d-3f28f6c1acfc@linaro.org>
From: Daniel Lezcano <daniel.lezcano@linaro.org> Sent: Monday, June 19, 2023 9:16 AM
>
> On 09/06/2023 17:47, Michael Kelley wrote:
> > Current code assigns either the Hyper-V TSC page or MSR-based ref counter
> > as the sched clock. This may be sub-optimal in two cases. First, if there
> > is hardware support to ensure consistent TSC frequency across live
> > migrations and Hyper-V is using that support, the raw TSC is a faster
> > source of time than the Hyper-V TSC page. Second, the MSR-based ref
> > counter is relatively slow because reads require a trap to the hypervisor.
> > As such, it should never be used as the sched clock. The native sched
> > clock based on the raw TSC or jiffies is much better.
> >
> > Rework the sched clock setup so it is set to the TSC page only if
> > Hyper-V indicates that the TSC may have inconsistent frequency across
> > live migrations. Also, remove the code that sets the sched clock to
> > the MSR-based ref counter. In the cases where it is not set, the sched
> > clock will then be the native sched clock.
> >
> > As part of the rework, always enable both the TSC page clocksource and
> > the MSR-based ref counter clocksource. Set the ratings so the TSC page
> > clocksource is preferred. While the MSR-based ref counter clocksource
> > is unlikely to ever be the default, having it available for manual
> > selection is convenient for development purposes.
> >
> > Signed-off-by: Michael Kelley <mikelley@microsoft.com>
>
> The patch does not apply, does it depend on another patch?
It should apply to linux-next. It depends on two previous patches from
Peter Zijlstra in the sched/core branch of tip. See:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=9397fa2ea3e7634f61da1ab76b9eb88ba04dfdfc
and
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=e39acc37db34f6688e2c16e958fb1d662c422c81
Michael
>
> Rejected chunk:
>
> --- drivers/clocksource/hyperv_timer.c
> +++ drivers/clocksource/hyperv_timer.c
> @@ -485,15 +485,9 @@ static u64 notrace read_hv_clock_msr_cs(struct
> clocksource *arg)
> return read_hv_clock_msr();
> }
>
> -static u64 noinstr read_hv_sched_clock_msr(void)
> -{
> - return (read_hv_clock_msr() - hv_sched_clock_offset) *
> - (NSEC_PER_SEC / HV_CLOCK_HZ);
> -}
> -
> static struct clocksource hyperv_cs_msr = {
> .name = "hyperv_clocksource_msr",
> - .rating = 500,
> + .rating = 495,
> .read = read_hv_clock_msr_cs,
> .mask = CLOCKSOURCE_MASK(64),
> .flags = CLOCK_SOURCE_IS_CONTINUOUS,
>
>
> --
* Re: [PATCH 1/1] clocksource: hyper-v: Rework clocksource and sched clock setup
From: Daniel Lezcano @ 2023-06-19 16:58 UTC (permalink / raw)
To: Michael Kelley (LINUX), KY Srinivasan, Haiyang Zhang,
wei.liu@kernel.org, Dexuan Cui, tglx@linutronix.de,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <BYAPR21MB1688E1163BB36DF03CC8E00BD75FA@BYAPR21MB1688.namprd21.prod.outlook.com>
On 19/06/2023 18:44, Michael Kelley (LINUX) wrote:
> From: Daniel Lezcano <daniel.lezcano@linaro.org> Sent: Monday, June 19, 2023 9:16 AM
>>
>> On 09/06/2023 17:47, Michael Kelley wrote:
>>> Current code assigns either the Hyper-V TSC page or MSR-based ref counter
>>> as the sched clock. This may be sub-optimal in two cases. First, if there
>>> is hardware support to ensure consistent TSC frequency across live
>>> migrations and Hyper-V is using that support, the raw TSC is a faster
>>> source of time than the Hyper-V TSC page. Second, the MSR-based ref
>>> counter is relatively slow because reads require a trap to the hypervisor.
>>> As such, it should never be used as the sched clock. The native sched
>>> clock based on the raw TSC or jiffies is much better.
>>>
>>> Rework the sched clock setup so it is set to the TSC page only if
>>> Hyper-V indicates that the TSC may have inconsistent frequency across
>>> live migrations. Also, remove the code that sets the sched clock to
>>> the MSR-based ref counter. In the cases where it is not set, the sched
>>> clock will then be the native sched clock.
>>>
>>> As part of the rework, always enable both the TSC page clocksource and
>>> the MSR-based ref counter clocksource. Set the ratings so the TSC page
>>> clocksource is preferred. While the MSR-based ref counter clocksource
>>> is unlikely to ever be the default, having it available for manual
>>> selection is convenient for development purposes.
>>>
>>> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
>>
>> The patch does not apply, does it depend on another patch?
>
> It should apply to linux-next. It depends on two previous patches from
> Peter Zijlstra in the sched/core branch of tip. See:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=9397fa2ea3e7634f61da1ab76b9eb88ba04dfdfc
>
> and
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=e39acc37db34f6688e2c16e958fb1d662c422c81
Yeah, but the branch is tip/timers/core
Could you respin against it?
Thanks
* Re: Fwd: nvsp_rndis_pkt_complete error status and net_ratelimit: callbacks suppressed messages on 6.4.0rc4
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-06-19 17:19 UTC (permalink / raw)
To: Bagas Sanjaya, Linux Kernel Mailing List, Linux Regressions,
Linux Kernel Network Developers, Linux BPF, Linux on Hyper-V
Cc: Michael Kelley, Haiyang Zhang, Paolo Abeni, David S. Miller,
Jakub Kicinski
In-Reply-To: <15dd93af-fcd5-5b9a-a6ba-9781768dbae7@gmail.com>
[TLDR: This mail is primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]
On 30.05.23 14:25, Bagas Sanjaya wrote:
>
> I notice a regression report on Bugzilla [1]. Quoting from it:
> [...]
> #regzbot introduced: dca5161f9bd052 https://bugzilla.kernel.org/show_bug.cgi?id=217503
> #regzbot title: net_ratelimit and nvsp_rndis_pkt_complete error due to SEND_RNDIS_PKT status check
>
> Thanks.
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217503
#regzbot resolve: turned out it's not a regression, see
https://bugzilla.kernel.org/show_bug.cgi?id=217503#c10
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
#regzbot ignore-activity
* RE: [PATCH 1/1] clocksource: hyper-v: Rework clocksource and sched clock setup
From: Michael Kelley (LINUX) @ 2023-06-19 17:41 UTC (permalink / raw)
To: Daniel Lezcano, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
Dexuan Cui, tglx@linutronix.de, linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org
In-Reply-To: <07c5efe5-4eb0-121f-7b50-8f3fba68beab@linaro.org>
From: Daniel Lezcano <daniel.lezcano@linaro.org> Sent: Monday, June 19, 2023 9:58 AM
>
> On 19/06/2023 18:44, Michael Kelley (LINUX) wrote:
> > From: Daniel Lezcano <daniel.lezcano@linaro.org> Sent: Monday, June 19, 2023 9:16 AM
> >>
> >> On 09/06/2023 17:47, Michael Kelley wrote:
> >>> Current code assigns either the Hyper-V TSC page or MSR-based ref counter
> >>> as the sched clock. This may be sub-optimal in two cases. First, if there
> >>> is hardware support to ensure consistent TSC frequency across live
> >>> migrations and Hyper-V is using that support, the raw TSC is a faster
> >>> source of time than the Hyper-V TSC page. Second, the MSR-based ref
> >>> counter is relatively slow because reads require a trap to the hypervisor.
> >>> As such, it should never be used as the sched clock. The native sched
> >>> clock based on the raw TSC or jiffies is much better.
> >>>
> >>> Rework the sched clock setup so it is set to the TSC page only if
> >>> Hyper-V indicates that the TSC may have inconsistent frequency across
> >>> live migrations. Also, remove the code that sets the sched clock to
> >>> the MSR-based ref counter. In the cases where it is not set, the sched
> >>> clock will then be the native sched clock.
> >>>
> >>> As part of the rework, always enable both the TSC page clocksource and
> >>> the MSR-based ref counter clocksource. Set the ratings so the TSC page
> >>> clocksource is preferred. While the MSR-based ref counter clocksource
> >>> is unlikely to ever be the default, having it available for manual
> >>> selection is convenient for development purposes.
> >>>
> >>> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> >>
> >> The patch does not apply, does it depend on another patch?
> >
> > It should apply to linux-next. It depends on two previous patches from
> > Peter Zijlstra in the sched/core branch of tip. See:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=9397fa2ea3e7634f61da1ab76b9eb88ba04dfdfc
> >
> > and
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=sched/core&id=e39acc37db34f6688e2c16e958fb1d662c422c81
>
> Yeah, but the branch is tip/timers/core
>
> Could you respin against it?
>
Ah, OK. Just to confirm, you are saying that this patch would go through the
tip/timers/core branch, which does *not* have Peter's patches. Resolving the
merge conflict will be done within the tip branches.
Yes, I'll respin for tip/timers/core.
Michael
* [PATCH v2 1/1] clocksource: hyper-v: Rework clocksource and sched clock setup
From: Michael Kelley @ 2023-06-19 19:02 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, daniel.lezcano, tglx, linux-kernel,
linux-hyperv
Cc: mikelley
Current code assigns either the Hyper-V TSC page or MSR-based ref counter
as the sched clock. This may be sub-optimal in two cases. First, if there
is hardware support to ensure consistent TSC frequency across live
migrations and Hyper-V is using that support, the raw TSC is a faster
source of time than the Hyper-V TSC page. Second, the MSR-based ref
counter is relatively slow because reads require a trap to the hypervisor.
As such, it should never be used as the sched clock. The native sched
clock based on the raw TSC or jiffies is much better.
Rework the sched clock setup so it is set to the TSC page only if
Hyper-V indicates that the TSC may have inconsistent frequency across
live migrations. Also, remove the code that sets the sched clock to
the MSR-based ref counter. In the cases where it is not set, the sched
clock will then be the native sched clock.
As part of the rework, always enable both the TSC page clocksource and
the MSR-based ref counter clocksource. Set the ratings so the TSC page
clocksource is preferred. While the MSR-based ref counter clocksource
is unlikely to ever be the default, having it available for manual
selection is convenient for development purposes.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
---
Changes in v2:
* Rebased to tip/timers/core branch [Daniel Lezcano]
drivers/clocksource/hyperv_timer.c | 54 ++++++++++++++++----------------------
1 file changed, 23 insertions(+), 31 deletions(-)
diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c
index bcd9042..9fc008c 100644
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -475,15 +475,9 @@ static u64 notrace read_hv_clock_msr_cs(struct clocksource *arg)
return read_hv_clock_msr();
}
-static u64 notrace read_hv_sched_clock_msr(void)
-{
- return (read_hv_clock_msr() - hv_sched_clock_offset) *
- (NSEC_PER_SEC / HV_CLOCK_HZ);
-}
-
static struct clocksource hyperv_cs_msr = {
.name = "hyperv_clocksource_msr",
- .rating = 500,
+ .rating = 495,
.read = read_hv_clock_msr_cs,
.mask = CLOCKSOURCE_MASK(64),
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
@@ -513,7 +507,7 @@ static __always_inline void hv_setup_sched_clock(void *sched_clock)
static __always_inline void hv_setup_sched_clock(void *sched_clock) {}
#endif /* CONFIG_GENERIC_SCHED_CLOCK */
-static bool __init hv_init_tsc_clocksource(void)
+static void __init hv_init_tsc_clocksource(void)
{
union hv_reference_tsc_msr tsc_msr;
@@ -524,17 +518,14 @@ static bool __init hv_init_tsc_clocksource(void)
* Hyper-V Reference TSC rating, causing the generic TSC to be used.
* TSC_INVARIANT is not offered on ARM64, so the Hyper-V Reference
* TSC will be preferred over the virtualized ARM64 arch counter.
- * While the Hyper-V MSR clocksource won't be used since the
- * Reference TSC clocksource is present, change its rating as
- * well for consistency.
*/
if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT) {
hyperv_cs_tsc.rating = 250;
- hyperv_cs_msr.rating = 250;
+ hyperv_cs_msr.rating = 245;
}
if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
- return false;
+ return;
hv_read_reference_counter = read_hv_clock_tsc;
@@ -565,33 +556,34 @@ static bool __init hv_init_tsc_clocksource(void)
clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
- hv_sched_clock_offset = hv_read_reference_counter();
- hv_setup_sched_clock(read_hv_sched_clock_tsc);
-
- return true;
+ /*
+ * If TSC is invariant, then let it stay as the sched clock since it
+ * will be faster than reading the TSC page. But if not invariant, use
+ * the TSC page so that live migrations across hosts with different
+ * frequencies is handled correctly.
+ */
+ if (!(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)) {
+ hv_sched_clock_offset = hv_read_reference_counter();
+ hv_setup_sched_clock(read_hv_sched_clock_tsc);
+ }
}
void __init hv_init_clocksource(void)
{
/*
- * Try to set up the TSC page clocksource. If it succeeds, we're
- * done. Otherwise, set up the MSR clocksource. At least one of
- * these will always be available except on very old versions of
- * Hyper-V on x86. In that case we won't have a Hyper-V
+ * Try to set up the TSC page clocksource, then the MSR clocksource.
+ * At least one of these will always be available except on very old
+ * versions of Hyper-V on x86. In that case we won't have a Hyper-V
* clocksource, but Linux will still run with a clocksource based
* on the emulated PIT or LAPIC timer.
+ *
+ * Never use the MSR clocksource as sched clock. It's too slow.
+ * Better to use the native sched clock as the fallback.
*/
- if (hv_init_tsc_clocksource())
- return;
-
- if (!(ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE))
- return;
-
- hv_read_reference_counter = read_hv_clock_msr;
- clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
+ hv_init_tsc_clocksource();
- hv_sched_clock_offset = hv_read_reference_counter();
- hv_setup_sched_clock(read_hv_sched_clock_msr);
+ if (ms_hyperv.features & HV_MSR_TIME_REF_COUNT_AVAILABLE)
+ clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
}
void __init hv_remap_tsc_clocksource(void)
--
1.8.3.1
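The selection logic described in the commit message above can be summarized as a small decision function. This is a userspace sketch under stated assumptions, not the kernel implementation: the FEAT_* bits, struct clock_plan and plan_clocks() are hypothetical names, and the default TSC page rating of 500 is an assumption, since the diff only shows the MSR rating changing from 500 to 495.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Hyper-V feature bits (not the kernel's values). */
enum {
	FEAT_REF_TSC_PAGE  = 1u << 0, /* models HV_MSR_REFERENCE_TSC_AVAILABLE */
	FEAT_TSC_INVARIANT = 1u << 1, /* models HV_ACCESS_TSC_INVARIANT */
	FEAT_REF_COUNTER   = 1u << 2, /* models HV_MSR_TIME_REF_COUNT_AVAILABLE */
};

enum sched_clock_src { SCHED_CLOCK_NATIVE, SCHED_CLOCK_TSC_PAGE };

/* Hypothetical summary of what the reworked init code would decide. */
struct clock_plan {
	bool register_tsc_page;
	bool register_msr;
	int tsc_page_rating;
	int msr_rating;
	enum sched_clock_src sched_clock;
};

static struct clock_plan plan_clocks(uint32_t features)
{
	struct clock_plan p = {
		.tsc_page_rating = 500, /* assumed default, not shown in the diff */
		.msr_rating = 495,      /* always below the TSC page rating */
		.sched_clock = SCHED_CLOCK_NATIVE,
	};
	bool invariant = (features & FEAT_TSC_INVARIANT) != 0;

	/* With an invariant TSC, lower both ratings so the generic TSC wins. */
	if (invariant) {
		p.tsc_page_rating = 250;
		p.msr_rating = 245;
	}

	if (features & FEAT_REF_TSC_PAGE) {
		p.register_tsc_page = true;
		/* Only a non-invariant TSC needs the TSC page as sched clock. */
		if (!invariant)
			p.sched_clock = SCHED_CLOCK_TSC_PAGE;
	}

	/* The MSR clocksource is registered when available, never as sched clock. */
	if (features & FEAT_REF_COUNTER)
		p.register_msr = true;

	return p;
}
```

Under this model, a guest offering the TSC page without an invariant TSC gets the TSC page as sched clock. Every other combination falls back to the native sched clock, which matches the patch's stated goal of never using the MSR-based counter for that purpose.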
* [GIT PULL] Hyper-V fixes for 6.4-rc8
From: Wei Liu @ 2023-06-19 23:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Wei Liu, Linux on Hyper-V List, Linux Kernel List, kys, haiyangz,
decui, Michael Kelley
Hi Linus,
The following changes since commit ac9a78681b921877518763ba0e89202254349d1b:
Linux 6.4-rc1 (2023-05-07 13:34:35 -0700)
are available in the Git repository at:
ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed-20230619
for you to fetch changes up to 067d6ec7ed5b49380688e06c1e5f883a71bef4fe:
PCI: hv: Add a per-bus mutex state_lock (2023-06-18 03:05:40 +0000)
----------------------------------------------------------------
hyperv-fixes for 6.4-rc8
- Fix races in Hyper-V PCI controller (Dexuan Cui)
- Fix handling of hyperv_pcpu_input_arg (Michael Kelley)
- Fix vmbus_wait_for_unload to scan present CPUs (Michael Kelley)
- Call hv_synic_free in the failure path of hv_synic_alloc (Dexuan
Cui)
- Add noop for real mode handlers for virtual trust level code
(Saurabh Sengar)
----------------------------------------------------------------
Dexuan Cui (6):
Drivers: hv: vmbus: Call hv_synic_free() if hv_synic_alloc() fails
PCI: hv: Fix a race condition bug in hv_pci_query_relations()
PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
PCI: hv: Add a per-bus mutex state_lock
Michael Kelley (3):
Drivers: hv: vmbus: Fix vmbus_wait_for_unload() to scan present CPUs
x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline
arm64/hyperv: Use CPUHP_AP_HYPERV_ONLINE state to fix CPU online sequencing
Saurabh Sengar (1):
x86/hyperv/vtl: Add noop for realmode pointers
arch/arm64/hyperv/mshyperv.c | 2 +-
arch/x86/hyperv/hv_init.c | 2 +-
arch/x86/hyperv/hv_vtl.c | 2 +
drivers/hv/channel_mgmt.c | 18 ++++-
drivers/hv/hv_common.c | 48 ++++++-------
drivers/hv/vmbus_drv.c | 5 +-
drivers/pci/controller/pci-hyperv.c | 139 +++++++++++++++++++++---------------
include/linux/cpuhotplug.h | 1 +
8 files changed, 129 insertions(+), 88 deletions(-)
* [PATCH 0/3] Do IRQ move cleanup with a timer instead of an IPI
From: Xin Li @ 2023-06-19 23:16 UTC (permalink / raw)
To: linux-kernel, platform-driver-x86, iommu, linux-hyperv,
linux-perf-users, x86
Cc: tglx, mingo, bp, dave.hansen, hpa, steve.wahl, mike.travis,
dimitri.sivanich, russ.anderson, dvhart, andy, joro,
suravee.suthikulpanit, will, robin.murphy, kys, haiyangz, wei.liu,
decui, dwmw2, baolu.lu, peterz, acme, mark.rutland,
alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter,
xin3.li, seanjc, jiangshanlai, jgg, yangtiezhu
There is no point in wasting a vector on cleaning up the leftovers of a moved
interrupt. Aside from that, this must be the lowest priority of all vectors,
which makes FRED systems utilizing vectors 0x10-0x1f more complicated
than necessary.
Schedule a timer instead.
Thomas Gleixner (2):
x86/vector: Rename send_cleanup_vector() to vector_schedule_cleanup()
x86/vector: Replace IRQ_MOVE_CLEANUP_VECTOR with a timer callback
Xin Li (1):
tools: Get rid of IRQ_MOVE_CLEANUP_VECTOR from tools
arch/x86/include/asm/hw_irq.h | 4 +-
arch/x86/include/asm/idtentry.h | 1 -
arch/x86/include/asm/irq_vectors.h | 7 --
arch/x86/kernel/apic/vector.c | 109 ++++++++++++++----
arch/x86/kernel/idt.c | 1 -
arch/x86/platform/uv/uv_irq.c | 2 +-
drivers/iommu/amd/iommu.c | 2 +-
drivers/iommu/hyperv-iommu.c | 4 +-
drivers/iommu/intel/irq_remapping.c | 2 +-
tools/arch/x86/include/asm/irq_vectors.h | 7 --
.../beauty/tracepoints/x86_irq_vectors.sh | 2 +-
11 files changed, 92 insertions(+), 49 deletions(-)
--
2.34.1