From: mhkelley58@gmail.com
To: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, lpieralisi@kernel.org, kw@linux.com,
robh@kernel.org, bhelgaas@google.com,
James.Bottomley@HansenPartnership.com,
martin.petersen@oracle.com, arnd@arndb.de,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-pci@vger.kernel.org, linux-scsi@vger.kernel.org,
linux-arch@vger.kernel.org
Cc: maz@kernel.org, den@valinux.co.jp, jgowans@amazon.com,
dawei.li@shingroup.cn
Subject: [RFC 11/12] Drivers: hv: vmbus: Wait for MODIFYCHANNEL to finish when offlining CPUs
Date: Mon, 3 Jun 2024 22:09:39 -0700
Message-ID: <20240604050940.859909-12-mhklinux@outlook.com>
In-Reply-To: <20240604050940.859909-1-mhklinux@outlook.com>
From: Michael Kelley <mhklinux@outlook.com>
vmbus_irq_set_affinity() may issue a MODIFYCHANNEL request to Hyper-V to
target a VMBus channel interrupt to a different CPU. While newer versions
of Hyper-V send a response to the guest when the change is complete,
vmbus_irq_set_affinity() does not wait for the response because it is
running with interrupts disabled. So Hyper-V may continue to direct
interrupts to the old CPU for a short window after vmbus_irq_set_affinity()
completes. This lag is not a problem during normal operation. But if
the old CPU is taken offline during that window, Hyper-V may drop
the interrupt because the synic in the target CPU is disabled. Dropping
the interrupt may cause the VMBus channel to hang.
To prevent this, wait for in-process MODIFYCHANNEL requests when taking
a CPU offline. On newer versions of Hyper-V, completion can be confirmed
by waiting for the response sent by Hyper-V. But on older versions of
Hyper-V that don't send a response, wait a fixed interval of time that
empirically should be "long enough", as that's the best that can be done.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
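The counter comparison described above can be sketched in user-space C.
This is only an illustration under stated assumptions, not the kernel
code: struct modchan_counts, nap_fn, wait_for_modchan() and the two nap
helpers are hypothetical names, the sketch is single-threaded (so it
omits READ_ONCE()), and the nap callback stands in for usleep_range()
while also simulating Hyper-V responses arriving. Unlike the patch, the
sketch reports the outcome by re-reading the counter after the loop
rather than by testing the remaining try count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters added to struct vmbus_connection. */
struct modchan_counts {
	uint64_t sent;       /* bumped after each successfully posted request */
	uint64_t completed;  /* bumped for each response received */
};

/* Hypothetical stand-in for usleep_range(); may simulate host responses. */
typedef void (*nap_fn)(struct modchan_counts *c);

/*
 * Snapshot 'sent', then poll 'completed' at most max_tries times.
 * Because requests are processed in order, comparing the two counts is
 * enough to know that everything sent before the snapshot has finished.
 * Returns true if those requests completed, false on timeout.
 */
static bool wait_for_modchan(struct modchan_counts *c, int max_tries, nap_fn nap)
{
	uint64_t base = c->sent;

	assert(max_tries >= 0);
	while (c->completed < base && max_tries > 0) {
		nap(c);
		max_tries--;
	}

	/* Decide from the counter itself, not from the leftover tries. */
	return c->completed >= base;
}

/* Simulated host that completes one outstanding request per nap. */
static void nap_one_response(struct modchan_counts *c)
{
	if (c->completed < c->sent)
		c->completed++;
}

/* Simulated host that never responds. */
static void nap_no_response(struct modchan_counts *c)
{
	(void)c;
}
```

With the values used in the patch (5 tries, usleep_range(5000, 10000)),
the polling loop sleeps for between 25 and 50 milliseconds in total
before giving up.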
drivers/hv/channel.c | 3 ++
drivers/hv/channel_mgmt.c | 32 ++++--------------
drivers/hv/hv.c | 70 +++++++++++++++++++++++++++++++++++----
drivers/hv/hyperv_vmbus.h | 8 +++++
4 files changed, 81 insertions(+), 32 deletions(-)
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index b7920072e243..b7ee95373049 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -246,6 +246,9 @@ int vmbus_send_modifychannel(struct vmbus_channel *channel, u32 target_vp)
ret = vmbus_post_msg(&msg, sizeof(msg), false);
trace_vmbus_send_modifychannel(&msg, ret);
+ if (!ret)
+ vmbus_connection.modchan_sent++;
+
return ret;
}
EXPORT_SYMBOL_GPL(vmbus_send_modifychannel);
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index da42aaae994e..960a2f0367d8 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -1465,40 +1465,20 @@ static void vmbus_ongpadl_created(struct vmbus_channel_message_header *hdr)
* vmbus_onmodifychannel_response - Modify Channel response handler.
*
* This is invoked when we received a response to our channel modify request.
- * Find the matching request, copy the response and signal the requesting thread.
+ * Increment the count of responses received. No locking is needed because
+ * responses are always received on the VMBUS_CONNECT_CPU.
*/
static void vmbus_onmodifychannel_response(struct vmbus_channel_message_header *hdr)
{
struct vmbus_channel_modifychannel_response *response;
- struct vmbus_channel_msginfo *msginfo;
- unsigned long flags;
response = (struct vmbus_channel_modifychannel_response *)hdr;
+ if (response->status)
+ pr_err("Error status %x in MODIFYCHANNEL response for relid %d\n",
+ response->status, response->child_relid);
+ vmbus_connection.modchan_completed++;
trace_vmbus_onmodifychannel_response(response);
-
- /*
- * Find the modify msg, copy the response and signal/unblock the wait event.
- */
- spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
-
- list_for_each_entry(msginfo, &vmbus_connection.chn_msg_list, msglistentry) {
- struct vmbus_channel_message_header *responseheader =
- (struct vmbus_channel_message_header *)msginfo->msg;
-
- if (responseheader->msgtype == CHANNELMSG_MODIFYCHANNEL) {
- struct vmbus_channel_modifychannel *modifymsg;
-
- modifymsg = (struct vmbus_channel_modifychannel *)msginfo->msg;
- if (modifymsg->child_relid == response->child_relid) {
- memcpy(&msginfo->response.modify_response, response,
- sizeof(*response));
- complete(&msginfo->waitevent);
- break;
- }
- }
- }
- spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
}
/*
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index a8ad728354cb..76658dfc5008 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -401,6 +401,56 @@ void hv_synic_disable_regs(unsigned int cpu)
disable_percpu_irq(vmbus_irq);
}
+static void hv_synic_wait_for_modifychannel(int cpu)
+{
+ int i = 5;
+ u64 base;
+
+ /*
+ * If we're on a VMBus version where MODIFYCHANNEL doesn't send acks,
+ * just sleep for 20 milliseconds and hope that gives Hyper-V enough
+ * time to process them. Empirical data on recent server-class CPUs
+ * (both x86 and arm64) shows that the Hyper-V response is typically
+ * received and processed in the guest within a few hundred
+ * microseconds. The 20 millisecond wait is somewhat arbitrary and
+ * intended to give plenty of time in case there are multiple
+ * MODIFYCHANNEL requests in progress and the host is busy. It's
+ * the best we can do.
+ */
+ if (vmbus_proto_version < VERSION_WIN10_V5_3) {
+ usleep_range(20000, 25000);
+ return;
+ }
+
+ /*
+ * Otherwise compare the current value of modchan_completed against
+ * modchan_sent. If some MODIFYCHANNEL requests have been sent that
+ * haven't completed, sleep 5 milliseconds and check again. If the
+ * requests still haven't completed after 5 attempts, output a
+ * message and proceed anyway.
+ *
+ * Hyper-V guarantees to process MODIFYCHANNEL requests in the order
+ * they are received from the guest, so simply comparing the counts
+ * is sufficient.
+ *
+ * Note that this check may encompass MODIFYCHANNEL requests that are
+ * unrelated to the CPU that is going offline. But the only effect is
+ * to potentially wait a little bit longer than necessary. CPUs going
+ * offline and affinity changes that result in MODIFYCHANNEL are
+ * relatively rare and it's not worth the complexity to track them more
+ * precisely.
+ */
+ base = READ_ONCE(vmbus_connection.modchan_sent);
+ while (READ_ONCE(vmbus_connection.modchan_completed) < base && i) {
+ usleep_range(5000, 10000);
+ i--;
+ }
+
+ if (i == 0)
+ pr_err("Timed out waiting for MODIFYCHANNEL. CPU %d sent %llu completed %llu\n",
+ cpu, base, vmbus_connection.modchan_completed);
+}
+
#define HV_MAX_TRIES 3
/*
* Scan the event flags page of 'this' CPU looking for any bit that is set. If we find one
@@ -485,13 +535,21 @@ int hv_synic_cleanup(unsigned int cpu)
/*
* channel_found == false means that any channels that were previously
* assigned to the CPU have been reassigned elsewhere with a call of
- * vmbus_send_modifychannel(). Scan the event flags page looking for
- * bits that are set and waiting with a timeout for vmbus_chan_sched()
- * to process such bits. If bits are still set after this operation
- * and VMBus is connected, fail the CPU offlining operation.
+ * vmbus_send_modifychannel(). First wait until any MODIFYCHANNEL
+ * requests have been completed by Hyper-V, after which we know that
+ * no new bits in the event flags will be set. Then scan the event flags
+ * page looking for bits that are set and waiting with a timeout for
+ * vmbus_chan_sched() to process such bits. If bits are still set
+ * after this operation, fail the CPU offlining operation.
*/
- if (vmbus_proto_version >= VERSION_WIN10_V4_1 && hv_synic_event_pending())
- return -EBUSY;
+ if (vmbus_proto_version >= VERSION_WIN10_V4_1) {
+ hv_synic_wait_for_modifychannel(cpu);
+ if (hv_synic_event_pending()) {
+ pr_err("Events pending when trying to offline CPU %d\n",
+ cpu);
+ return -EBUSY;
+ }
+ }
always_cleanup:
hv_stimer_legacy_cleanup(cpu);
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index bf35bb40c55e..571b2955b38e 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -264,6 +264,14 @@ struct vmbus_connection {
struct irq_domain *vmbus_irq_domain;
struct irq_chip vmbus_irq_chip;
+ /*
+ * VM-wide counts of MODIFYCHANNEL messages sent and completed.
+ * Used when taking a CPU offline to make sure the relevant
+ * MODIFYCHANNEL messages have been completed.
+ */
+ u64 modchan_sent;
+ u64 modchan_completed;
+
/*
* An offer message is handled first on the work_queue, and then
* is further handled on handle_primary_chan_wq or
--
2.25.1