* [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
@ 2025-10-02 5:05 longli
2025-10-07 2:05 ` Martin K. Petersen
2025-10-07 15:41 ` Michael Kelley
0 siblings, 2 replies; 7+ messages in thread
From: longli @ 2025-10-02 5:05 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
James E.J. Bottomley, Martin K. Petersen, James Bottomley,
linux-hyperv, linux-scsi, linux-kernel
Cc: Long Li
From: Long Li <longli@microsoft.com>
When selecting an outgoing channel for I/O, storvsc tries to select a
channel with a returning CPU that is not the same as issuing CPU. This
worked well in the past, however it doesn't work well when the Hyper-V
exposes a large number of channels (up to the number of all CPUs). Use
a different CPU for returning channel is not efficient on Hyper-V.
Change this behavior by preferring to the channel with the same CPU
as the current I/O issuing CPU whenever possible.
Tests have shown improvements in newer Hyper-V/Azure environment, and
no regression with older Hyper-V/Azure environments.
Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/scsi/storvsc_drv.c | 96 ++++++++++++++++++--------------------
1 file changed, 45 insertions(+), 51 deletions(-)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index d9e59204a9c3..092939791ea0 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
}
/*
- * Our channel array is sparsley populated and we
+ * Our channel array could be sparsley populated and we
* initiated I/O on a processor/hw-q that does not
* currently have a designated channel. Fix this.
* The strategy is simple:
- * I. Ensure NUMA locality
- * II. Distribute evenly (best effort)
+ * I. Prefer the channel associated with the current CPU
+ * II. Ensure NUMA locality
+ * III. Distribute evenly (best effort)
*/
+ /* Prefer the channel on the I/O issuing processor/hw-q */
+ if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
+ return stor_device->stor_chns[q_num];
+
node_mask = cpumask_of_node(cpu_to_node(q_num));
num_channels = 0;
@@ -1469,59 +1474,48 @@ static int storvsc_do_io(struct hv_device *device,
/* See storvsc_change_target_cpu(). */
outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]);
if (outgoing_channel != NULL) {
- if (outgoing_channel->target_cpu == q_num) {
- /*
- * Ideally, we want to pick a different channel if
- * available on the same NUMA node.
- */
- node_mask = cpumask_of_node(cpu_to_node(q_num));
- for_each_cpu_wrap(tgt_cpu,
- &stor_device->alloced_cpus, q_num + 1) {
- if (!cpumask_test_cpu(tgt_cpu, node_mask))
- continue;
- if (tgt_cpu == q_num)
- continue;
- channel = READ_ONCE(
- stor_device->stor_chns[tgt_cpu]);
- if (channel == NULL)
- continue;
- if (hv_get_avail_to_write_percent(
- &channel->outbound)
- > ring_avail_percent_lowater) {
- outgoing_channel = channel;
- goto found_channel;
- }
- }
+ if (hv_get_avail_to_write_percent(&outgoing_channel->outbound)
+ > ring_avail_percent_lowater)
+ goto found_channel;
- /*
- * All the other channels on the same NUMA node are
- * busy. Try to use the channel on the current CPU
- */
- if (hv_get_avail_to_write_percent(
- &outgoing_channel->outbound)
- > ring_avail_percent_lowater)
+ /*
+ * Channel is busy, try to find a channel on the same NUMA node
+ */
+ node_mask = cpumask_of_node(cpu_to_node(q_num));
+ for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
+ q_num + 1) {
+ if (!cpumask_test_cpu(tgt_cpu, node_mask))
+ continue;
+ channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
+ if (!channel)
+ continue;
+ if (hv_get_avail_to_write_percent(&channel->outbound)
+ > ring_avail_percent_lowater) {
+ outgoing_channel = channel;
goto found_channel;
+ }
+ }
- /*
- * If we reach here, all the channels on the current
- * NUMA node are busy. Try to find a channel in
- * other NUMA nodes
- */
- for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
- if (cpumask_test_cpu(tgt_cpu, node_mask))
- continue;
- channel = READ_ONCE(
- stor_device->stor_chns[tgt_cpu]);
- if (channel == NULL)
- continue;
- if (hv_get_avail_to_write_percent(
- &channel->outbound)
- > ring_avail_percent_lowater) {
- outgoing_channel = channel;
- goto found_channel;
- }
+ /*
+ * If we reach here, all the channels on the current
+ * NUMA node are busy. Try to find a channel in
+ * all NUMA nodes
+ */
+ for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
+ q_num + 1) {
+ channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
+ if (!channel)
+ continue;
+ if (hv_get_avail_to_write_percent(&channel->outbound)
+ > ring_avail_percent_lowater) {
+ outgoing_channel = channel;
+ goto found_channel;
}
}
+ /*
+ * If we reach here, all the channels are busy. Use the
+ * original channel found.
+ */
} else {
spin_lock_irqsave(&stor_device->lock, flags);
outgoing_channel = stor_device->stor_chns[q_num];
--
2.34.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
2025-10-02 5:05 [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU longli
@ 2025-10-07 2:05 ` Martin K. Petersen
2025-10-07 15:41 ` Michael Kelley
1 sibling, 0 replies; 7+ messages in thread
From: Martin K. Petersen @ 2025-10-07 2:05 UTC (permalink / raw)
To: longli
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
James E.J. Bottomley, Martin K. Petersen, James Bottomley,
linux-hyperv, linux-scsi, linux-kernel, Long Li
Long,
> When selecting an outgoing channel for I/O, storvsc tries to select a
> channel with a returning CPU that is not the same as issuing CPU. This
> worked well in the past, however it doesn't work well when the Hyper-V
> exposes a large number of channels (up to the number of all CPUs). Use
> a different CPU for returning channel is not efficient on Hyper-V.
Applied to 6.18/scsi-staging, thanks!
--
Martin K. Petersen
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
2025-10-02 5:05 [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU longli
2025-10-07 2:05 ` Martin K. Petersen
@ 2025-10-07 15:41 ` Michael Kelley
2025-10-08 0:55 ` Long Li
1 sibling, 1 reply; 7+ messages in thread
From: Michael Kelley @ 2025-10-07 15:41 UTC (permalink / raw)
To: longli@linux.microsoft.com, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen,
James Bottomley, linux-hyperv@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Long Li
From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent: Wednesday, October 1, 2025 10:06 PM
>
> When selecting an outgoing channel for I/O, storvsc tries to select a
> channel with a returning CPU that is not the same as issuing CPU. This
> worked well in the past, however it doesn't work well when the Hyper-V
> exposes a large number of channels (up to the number of all CPUs). Use
> a different CPU for returning channel is not efficient on Hyper-V.
>
> Change this behavior by preferring to the channel with the same CPU
> as the current I/O issuing CPU whenever possible.
>
> Tests have shown improvements in newer Hyper-V/Azure environment, and
> no regression with older Hyper-V/Azure environments.
>
> Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
> drivers/scsi/storvsc_drv.c | 96 ++++++++++++++++++--------------------
> 1 file changed, 45 insertions(+), 51 deletions(-)
>
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index d9e59204a9c3..092939791ea0 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> }
>
> /*
> - * Our channel array is sparsley populated and we
> + * Our channel array could be sparsley populated and we
> * initiated I/O on a processor/hw-q that does not
> * currently have a designated channel. Fix this.
> * The strategy is simple:
> - * I. Ensure NUMA locality
> - * II. Distribute evenly (best effort)
> + * I. Prefer the channel associated with the current CPU
> + * II. Ensure NUMA locality
> + * III. Distribute evenly (best effort)
> */
>
> + /* Prefer the channel on the I/O issuing processor/hw-q */
> + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> + return stor_device->stor_chns[q_num];
> +
Hmmm. When get_og_chn() is called, we know that
stor_device->stor_chns[q_num] is NULL since storvsc_do_io() has
already handled the non-NULL case. And the checks are all done
with stor_device->lock held, so the stor_chns array can't change.
Hence the above code will return NULL, which will cause a NULL
reference when storvsc_do_io() sends out the VMBus packet.
My recollection is that get_og_chan() is called when there is no
channel that interrupts the current CPU (that's what it means
for stor_device->stor_chns[<current CPU>] to be NULL). So the
algorithm must pick a channel that interrupts some other CPU,
preferably a CPU in the current NUMA node. Adding code to prefer
the channel associated with the current CPU doesn't make sense in
get_og_chn(), as get_og_chn() is only called when it is already
known that there is no such channel.
Or is there a case that I'm missing? Regardless, the above code
seems problematic because it would return NULL.
Michael
> node_mask = cpumask_of_node(cpu_to_node(q_num));
>
> num_channels = 0;
> @@ -1469,59 +1474,48 @@ static int storvsc_do_io(struct hv_device *device,
> /* See storvsc_change_target_cpu(). */
> outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]);
> if (outgoing_channel != NULL) {
> - if (outgoing_channel->target_cpu == q_num) {
> - /*
> - * Ideally, we want to pick a different channel if
> - * available on the same NUMA node.
> - */
> - node_mask = cpumask_of_node(cpu_to_node(q_num));
> - for_each_cpu_wrap(tgt_cpu,
> - &stor_device->alloced_cpus, q_num + 1) {
> - if (!cpumask_test_cpu(tgt_cpu, node_mask))
> - continue;
> - if (tgt_cpu == q_num)
> - continue;
> - channel = READ_ONCE(
> - stor_device->stor_chns[tgt_cpu]);
> - if (channel == NULL)
> - continue;
> - if (hv_get_avail_to_write_percent(
> - &channel->outbound)
> - > ring_avail_percent_lowater) {
> - outgoing_channel = channel;
> - goto found_channel;
> - }
> - }
> + if (hv_get_avail_to_write_percent(&outgoing_channel->outbound)
> + > ring_avail_percent_lowater)
> + goto found_channel;
>
> - /*
> - * All the other channels on the same NUMA node are
> - * busy. Try to use the channel on the current CPU
> - */
> - if (hv_get_avail_to_write_percent(
> - &outgoing_channel->outbound)
> - > ring_avail_percent_lowater)
> + /*
> + * Channel is busy, try to find a channel on the same NUMA node
> + */
> + node_mask = cpumask_of_node(cpu_to_node(q_num));
> + for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
> + q_num + 1) {
> + if (!cpumask_test_cpu(tgt_cpu, node_mask))
> + continue;
> + channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> + if (!channel)
> + continue;
> + if (hv_get_avail_to_write_percent(&channel->outbound)
> + > ring_avail_percent_lowater) {
> + outgoing_channel = channel;
> goto found_channel;
> + }
> + }
>
> - /*
> - * If we reach here, all the channels on the current
> - * NUMA node are busy. Try to find a channel in
> - * other NUMA nodes
> - */
> - for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
> - if (cpumask_test_cpu(tgt_cpu, node_mask))
> - continue;
> - channel = READ_ONCE(
> - stor_device->stor_chns[tgt_cpu]);
> - if (channel == NULL)
> - continue;
> - if (hv_get_avail_to_write_percent(
> - &channel->outbound)
> - > ring_avail_percent_lowater) {
> - outgoing_channel = channel;
> - goto found_channel;
> - }
> + /*
> + * If we reach here, all the channels on the current
> + * NUMA node are busy. Try to find a channel in
> + * all NUMA nodes
> + */
> + for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
> + q_num + 1) {
> + channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> + if (!channel)
> + continue;
> + if (hv_get_avail_to_write_percent(&channel->outbound)
> + > ring_avail_percent_lowater) {
> + outgoing_channel = channel;
> + goto found_channel;
> }
> }
> + /*
> + * If we reach here, all the channels are busy. Use the
> + * original channel found.
> + */
> } else {
> spin_lock_irqsave(&stor_device->lock, flags);
> outgoing_channel = stor_device->stor_chns[q_num];
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
2025-10-07 15:41 ` Michael Kelley
@ 2025-10-08 0:55 ` Long Li
2025-10-08 15:30 ` Michael Kelley
0 siblings, 1 reply; 7+ messages in thread
From: Long Li @ 2025-10-08 0:55 UTC (permalink / raw)
To: Michael Kelley, longli@linux.microsoft.com, KY Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley,
Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Tuesday, October 7, 2025 8:42 AM
> To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>; Haiyang
> Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; James E.J. Bottomley
> <James.Bottomley@HansenPartnership.com>; Martin K. Petersen
> <martin.petersen@oracle.com>; James Bottomley <JBottomley@Odin.com>;
> linux-hyperv@vger.kernel.org; linux-scsi@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Cc: Long Li <longli@microsoft.com>
> Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> same CPU as on the I/O issuing CPU
>
> From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent:
> Wednesday, October 1, 2025 10:06 PM
> >
> > When selecting an outgoing channel for I/O, storvsc tries to select a
> > channel with a returning CPU that is not the same as issuing CPU. This
> > worked well in the past, however it doesn't work well when the Hyper-V
> > exposes a large number of channels (up to the number of all CPUs). Use
> > a different CPU for returning channel is not efficient on Hyper-V.
> >
> > Change this behavior by preferring to the channel with the same CPU as
> > the current I/O issuing CPU whenever possible.
> >
> > Tests have shown improvements in newer Hyper-V/Azure environment, and
> > no regression with older Hyper-V/Azure environments.
> >
> > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> > drivers/scsi/storvsc_drv.c | 96
> > ++++++++++++++++++--------------------
> > 1 file changed, 45 insertions(+), 51 deletions(-)
> >
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index d9e59204a9c3..092939791ea0 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct
> storvsc_device *stor_device,
> > }
> >
> > /*
> > - * Our channel array is sparsley populated and we
> > + * Our channel array could be sparsley populated and we
> > * initiated I/O on a processor/hw-q that does not
> > * currently have a designated channel. Fix this.
> > * The strategy is simple:
> > - * I. Ensure NUMA locality
> > - * II. Distribute evenly (best effort)
> > + * I. Prefer the channel associated with the current CPU
> > + * II. Ensure NUMA locality
> > + * III. Distribute evenly (best effort)
> > */
> >
> > + /* Prefer the channel on the I/O issuing processor/hw-q */
> > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > + return stor_device->stor_chns[q_num];
> > +
>
> Hmmm. When get_og_chn() is called, we know that stor_device-
> >stor_chns[q_num] is NULL since storvsc_do_io() has already handled the non-
> NULL case. And the checks are all done with stor_device->lock held, so the
> stor_chns array can't change.
> Hence the above code will return NULL, which will cause a NULL reference when
> storvsc_do_io() sends out the VMBus packet.
>
> My recollection is that get_og_chan() is called when there is no channel that
> interrupts the current CPU (that's what it means for stor_device-
> >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a channel
> that interrupts some other CPU, preferably a CPU in the current NUMA node.
> Adding code to prefer the channel associated with the current CPU doesn't make
> sense in get_og_chn(), as get_og_chn() is only called when it is already known
> that there is no such channel.
The initial values for stor_chns[] and alloced_cpus are set in storvsc_channel_init() (for primary channel) and handle_sc_creation() (for subchannels).
As a result, the check for cpumask_test_cpu(q_num, &stor_device->alloced_cpus) will guarantee we are getting a channel. If the check fails, the code follows the old behavior to find a channel.
This check is needed because storvsc supports change_target_cpu_callback() callback via vmbus.
Thanks,
Long
>
> Or is there a case that I'm missing? Regardless, the above code seems
> problematic because it would return NULL.
>
> Michael
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
2025-10-08 0:55 ` Long Li
@ 2025-10-08 15:30 ` Michael Kelley
2025-10-08 17:20 ` Long Li
0 siblings, 1 reply; 7+ messages in thread
From: Michael Kelley @ 2025-10-08 15:30 UTC (permalink / raw)
To: Long Li, longli@linux.microsoft.com, KY Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen,
James Bottomley, linux-hyperv@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
From: Long Li <longli@microsoft.com> Sent: Tuesday, October 7, 2025 5:56 PM
>
> > -----Original Message-----
> > From: Michael Kelley <mhklinux@outlook.com>
> > Sent: Tuesday, October 7, 2025 8:42 AM
> > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>; Haiyang
> > Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> > <decui@microsoft.com>; James E.J. Bottomley
> > <James.Bottomley@HansenPartnership.com>; Martin K. Petersen
> > <martin.petersen@oracle.com>; James Bottomley <JBottomley@Odin.com>;
> > linux-hyperv@vger.kernel.org; linux-scsi@vger.kernel.org; linux-
> > kernel@vger.kernel.org
> > Cc: Long Li <longli@microsoft.com>
> > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> > same CPU as on the I/O issuing CPU
> >
> > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent:
> > Wednesday, October 1, 2025 10:06 PM
> > >
> > > When selecting an outgoing channel for I/O, storvsc tries to select a
> > > channel with a returning CPU that is not the same as issuing CPU. This
> > > worked well in the past, however it doesn't work well when the Hyper-V
> > > exposes a large number of channels (up to the number of all CPUs). Use
> > > a different CPU for returning channel is not efficient on Hyper-V.
> > >
> > > Change this behavior by preferring to the channel with the same CPU as
> > > the current I/O issuing CPU whenever possible.
> > >
> > > Tests have shown improvements in newer Hyper-V/Azure environment, and
> > > no regression with older Hyper-V/Azure environments.
> > >
> > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> > > Signed-off-by: Long Li <longli@microsoft.com>
> > > ---
> > > drivers/scsi/storvsc_drv.c | 96
> > > ++++++++++++++++++--------------------
> > > 1 file changed, 45 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > index d9e59204a9c3..092939791ea0 100644
> > > --- a/drivers/scsi/storvsc_drv.c
> > > +++ b/drivers/scsi/storvsc_drv.c
> > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> > > }
> > >
> > > /*
> > > - * Our channel array is sparsley populated and we
> > > + * Our channel array could be sparsley populated and we
> > > * initiated I/O on a processor/hw-q that does not
> > > * currently have a designated channel. Fix this.
> > > * The strategy is simple:
> > > - * I. Ensure NUMA locality
> > > - * II. Distribute evenly (best effort)
> > > + * I. Prefer the channel associated with the current CPU
> > > + * II. Ensure NUMA locality
> > > + * III. Distribute evenly (best effort)
> > > */
> > >
> > > + /* Prefer the channel on the I/O issuing processor/hw-q */
> > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > + return stor_device->stor_chns[q_num];
> > > +
> >
> > Hmmm. When get_og_chn() is called, we know that stor_device-
> > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled the non-
> > NULL case. And the checks are all done with stor_device->lock held, so the
> > stor_chns array can't change.
> > Hence the above code will return NULL, which will cause a NULL reference when
> > storvsc_do_io() sends out the VMBus packet.
> >
> > My recollection is that get_og_chan() is called when there is no channel that
> > interrupts the current CPU (that's what it means for stor_device-
> > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a channel
> > that interrupts some other CPU, preferably a CPU in the current NUMA node.
> > Adding code to prefer the channel associated with the current CPU doesn't make
> > sense in get_og_chn(), as get_og_chn() is only called when it is already known
> > that there is no such channel.
>
> The initial values for stor_chns[] and alloced_cpus are set in storvsc_channel_init() (for
> primary channel) and handle_sc_creation() (for subchannels).
OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry in
stor_chns[] will not be NULL. And if the entry in stor_chns[] is NULL, the CPU bit in
alloced_cpus is *not* set. All the places that manipulate these fields update both so
they are in sync with each other. Hence I'll agree the code you've added in get_og_chn()
will never return a NULL value.
(However, FWIW the reverse is not true: If the entry in stor_chns[] is not NULL, the
corresponding CPU bit in alloced_cpus may or may not be set.)
>
> As a result, the check for cpumask_test_cpu(q_num, &stor_device->alloced_cpus) will
> guarantee we are getting a channel. If the check fails, the code follows the old behavior
> to find a channel.
>
> This check is needed because storvsc supports change_target_cpu_callback() callback
> via vmbus.
But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied the
code here for discussion purposes. This is the only place that get_og_chn() is called:
spin_lock_irqsave(&stor_device->lock, flags);
outgoing_channel = stor_device->stor_chns[q_num];
if (outgoing_channel != NULL) {
spin_unlock_irqrestore(&stor_device->lock, flags);
goto found_channel;
}
outgoing_channel = get_og_chn(stor_device, q_num);
spin_unlock_irqrestore(&stor_device->lock, flags);
The code gets the spin lock, then reads the stor_chns[] entry. If the entry is
non-NULL, then we've found a suitable channel and get_og_chn() is *not*
called. The only time get_og_chan() is called is when the stor_chn[] entry
*is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
So the check you've added in get_og_chn() can never be true and the check
is not needed. You said the check is needed because of
change_target_cpu_callback(), which will invoke
storvsc_change_target_cpu(). But I don't see how that matters given the
checks done before get_og_chn() is called. The spin lock synchronizes with
any changes made by storvsc_change_target_cpu().
Related, there are a couple of occurrences of code like this:
for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
q_num + 1) {
channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
if (!channel)
continue;
Given your point that the alloced_cpus and stor_chns[] entries are
always in sync with each other, the check for channel being NULL
is not necessary.
Michael
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
2025-10-08 15:30 ` Michael Kelley
@ 2025-10-08 17:20 ` Long Li
2025-10-08 18:24 ` Michael Kelley
0 siblings, 1 reply; 7+ messages in thread
From: Long Li @ 2025-10-08 17:20 UTC (permalink / raw)
To: Michael Kelley, longli@linux.microsoft.com, KY Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley,
Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
> Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> same CPU as on the I/O issuing CPU
>
> From: Long Li <longli@microsoft.com> Sent: Tuesday, October 7, 2025 5:56 PM
> >
> > > -----Original Message-----
> > > From: Michael Kelley <mhklinux@outlook.com>
> > > Sent: Tuesday, October 7, 2025 8:42 AM
> > > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>;
> > > Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> > > <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; James E.J.
> > > Bottomley <James.Bottomley@HansenPartnership.com>; Martin K.
> > > Petersen <martin.petersen@oracle.com>; James Bottomley
> > > <JBottomley@Odin.com>; linux-hyperv@vger.kernel.org;
> > > linux-scsi@vger.kernel.org; linux- kernel@vger.kernel.org
> > > Cc: Long Li <longli@microsoft.com>
> > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning
> > > channel with the same CPU as on the I/O issuing CPU
> > >
> > > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent:
> > > Wednesday, October 1, 2025 10:06 PM
> > > >
> > > > When selecting an outgoing channel for I/O, storvsc tries to
> > > > select a channel with a returning CPU that is not the same as
> > > > issuing CPU. This worked well in the past, however it doesn't work
> > > > well when the Hyper-V exposes a large number of channels (up to
> > > > the number of all CPUs). Use a different CPU for returning channel is not
> efficient on Hyper-V.
> > > >
> > > > Change this behavior by preferring to the channel with the same
> > > > CPU as the current I/O issuing CPU whenever possible.
> > > >
> > > > Tests have shown improvements in newer Hyper-V/Azure environment,
> > > > and no regression with older Hyper-V/Azure environments.
> > > >
> > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> > > > Signed-off-by: Long Li <longli@microsoft.com>
> > > > ---
> > > > drivers/scsi/storvsc_drv.c | 96
> > > > ++++++++++++++++++--------------------
> > > > 1 file changed, 45 insertions(+), 51 deletions(-)
> > > >
> > > > diff --git a/drivers/scsi/storvsc_drv.c
> > > > b/drivers/scsi/storvsc_drv.c index d9e59204a9c3..092939791ea0
> > > > 100644
> > > > --- a/drivers/scsi/storvsc_drv.c
> > > > +++ b/drivers/scsi/storvsc_drv.c
> > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel
> *get_og_chn(struct storvsc_device *stor_device,
> > > > }
> > > >
> > > > /*
> > > > - * Our channel array is sparsley populated and we
> > > > + * Our channel array could be sparsley populated and we
> > > > * initiated I/O on a processor/hw-q that does not
> > > > * currently have a designated channel. Fix this.
> > > > * The strategy is simple:
> > > > - * I. Ensure NUMA locality
> > > > - * II. Distribute evenly (best effort)
> > > > + * I. Prefer the channel associated with the current CPU
> > > > + * II. Ensure NUMA locality
> > > > + * III. Distribute evenly (best effort)
> > > > */
> > > >
> > > > + /* Prefer the channel on the I/O issuing processor/hw-q */
> > > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > > + return stor_device->stor_chns[q_num];
> > > > +
> > >
> > > Hmmm. When get_og_chn() is called, we know that stor_device-
> > > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled
> > > >the non-
> > > NULL case. And the checks are all done with stor_device->lock held,
> > > so the stor_chns array can't change.
> > > Hence the above code will return NULL, which will cause a NULL
> > > reference when
> > > storvsc_do_io() sends out the VMBus packet.
> > >
> > > My recollection is that get_og_chan() is called when there is no
> > > channel that interrupts the current CPU (that's what it means for
> > > stor_device-
> > > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a
> > > >channel
> > > that interrupts some other CPU, preferably a CPU in the current NUMA node.
> > > Adding code to prefer the channel associated with the current CPU
> > > doesn't make sense in get_og_chn(), as get_og_chn() is only called
> > > when it is already known that there is no such channel.
> >
> > The initial values for stor_chns[] and alloced_cpus are set in
> > storvsc_channel_init() (for primary channel) and handle_sc_creation() (for
> subchannels).
>
> OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry
> in stor_chns[] will not be NULL. And if the entry in stor_chns[] is NULL, the CPU
> bit in alloced_cpus is *not* set. All the places that manipulate these fields update
> both so they are in sync with each other. Hence I'll agree the code you've added
> in get_og_chn() will never return a NULL value.
>
> (However, FWIW the reverse is not true: If the entry in stor_chns[] is not NULL,
> the corresponding CPU bit in alloced_cpus may or may not be set.)
>
> >
> > As a result, the check for cpumask_test_cpu(q_num,
> > &stor_device->alloced_cpus) will guarantee we are getting a channel.
> > If the check fails, the code follows the old behavior to find a channel.
> >
> > This check is needed because storvsc supports
> > change_target_cpu_callback() callback via vmbus.
>
> But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied
> the code here for discussion purposes. This is the only place that get_og_chn() is
> called:
>
> spin_lock_irqsave(&stor_device->lock, flags);
> outgoing_channel = stor_device->stor_chns[q_num];
> if (outgoing_channel != NULL) {
> spin_unlock_irqrestore(&stor_device->lock, flags);
> goto found_channel;
> }
> outgoing_channel = get_og_chn(stor_device, q_num);
> spin_unlock_irqrestore(&stor_device->lock, flags);
>
> The code gets the spin lock, then reads the stor_chns[] entry. If the entry is non-
> NULL, then we've found a suitable channel and get_og_chn() is *not* called. The
> only time get_og_chan() is called is when the stor_chn[] entry
> *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
> So the check you've added in get_og_chn() can never be true and the check is not
> needed. You said the check is needed because of change_target_cpu_callback(),
> which will invoke storvsc_change_target_cpu(). But I don't see how that matters
> given the checks done before get_og_chn() is called. The spin lock synchronizes
> with any changes made by storvsc_change_target_cpu().
I agree this check may not be necessary. Given the current patch is correct, how about I
submit another patch to remove this check after more testing are done?
>
> Related, there are a couple of occurrences of code like this:
>
> for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
> q_num + 1) {
> channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> if (!channel)
> continue;
>
> Given your point that the alloced_cpus and stor_chns[] entries are always in sync
> with each other, the check for channel being NULL is not necessary.
I don't' agree with removing the check for NULL. This code follows the existing storvsc behavior
on checking allocated CPUs. Looking at storvsc_change_target_cpu(), it changes stor_chns
before changing alloced_cpus. It can run at the same time as this code. I think we may need
to add some memory barriers in storvsc_change_target_cpu(), but that is for another topic.
Thanks,
Long
^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
2025-10-08 17:20 ` Long Li
@ 2025-10-08 18:24 ` Michael Kelley
0 siblings, 0 replies; 7+ messages in thread
From: Michael Kelley @ 2025-10-08 18:24 UTC (permalink / raw)
To: Long Li, longli@linux.microsoft.com, KY Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen,
James Bottomley, linux-hyperv@vger.kernel.org,
linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
From: Long Li <longli@microsoft.com> Sent: Wednesday, October 8, 2025 10:20 AM
>
> > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> > same CPU as on the I/O issuing CPU
> >
> > From: Long Li <longli@microsoft.com> Sent: Tuesday, October 7, 2025 5:56 PM
> > >
> > > > -----Original Message-----
> > > > From: Michael Kelley <mhklinux@outlook.com>
> > > > Sent: Tuesday, October 7, 2025 8:42 AM
> > > > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>;
> > > > Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> > > > <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; James E.J.
> > > > Bottomley <James.Bottomley@HansenPartnership.com>; Martin K.
> > > > Petersen <martin.petersen@oracle.com>; James Bottomley
> > > > <JBottomley@Odin.com>; linux-hyperv@vger.kernel.org;
> > > > linux-scsi@vger.kernel.org; linux- kernel@vger.kernel.org
> > > > Cc: Long Li <longli@microsoft.com>
> > > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning
> > > > channel with the same CPU as on the I/O issuing CPU
> > > >
> > > > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent:
> > > > Wednesday, October 1, 2025 10:06 PM
> > > > >
> > > > > When selecting an outgoing channel for I/O, storvsc tries to
> > > > > select a channel with a returning CPU that is not the same as
> > > > > issuing CPU. This worked well in the past, however it doesn't work
> > > > > well when the Hyper-V exposes a large number of channels (up to
> > > > > the number of all CPUs). Use a different CPU for returning channel is not efficient on Hyper-V.
> > > > >
> > > > > Change this behavior by preferring to the channel with the same
> > > > > CPU as the current I/O issuing CPU whenever possible.
> > > > >
> > > > > Tests have shown improvements in newer Hyper-V/Azure environment,
> > > > > and no regression with older Hyper-V/Azure environments.
> > > > >
> > > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> > > > > Signed-off-by: Long Li <longli@microsoft.com>
> > > > > ---
> > > > > drivers/scsi/storvsc_drv.c | 96
> > > > > ++++++++++++++++++--------------------
> > > > > 1 file changed, 45 insertions(+), 51 deletions(-)
> > > > >
> > > > > diff --git a/drivers/scsi/storvsc_drv.c
> > > > > b/drivers/scsi/storvsc_drv.c index d9e59204a9c3..092939791ea0
> > > > > 100644
> > > > > --- a/drivers/scsi/storvsc_drv.c
> > > > > +++ b/drivers/scsi/storvsc_drv.c
> > > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> > > > > }
> > > > >
> > > > > /*
> > > > > - * Our channel array is sparsley populated and we
> > > > > + * Our channel array could be sparsley populated and we
> > > > > * initiated I/O on a processor/hw-q that does not
> > > > > * currently have a designated channel. Fix this.
> > > > > * The strategy is simple:
> > > > > - * I. Ensure NUMA locality
> > > > > - * II. Distribute evenly (best effort)
> > > > > + * I. Prefer the channel associated with the current CPU
> > > > > + * II. Ensure NUMA locality
> > > > > + * III. Distribute evenly (best effort)
> > > > > */
> > > > >
> > > > > + /* Prefer the channel on the I/O issuing processor/hw-q */
> > > > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > > > + return stor_device->stor_chns[q_num];
> > > > > +
> > > >
> > > > Hmmm. When get_og_chn() is called, we know that stor_device-
> > > > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled
> > > > >the non-
> > > > NULL case. And the checks are all done with stor_device->lock held,
> > > > so the stor_chns array can't change.
> > > > Hence the above code will return NULL, which will cause a NULL
> > > > reference when
> > > > storvsc_do_io() sends out the VMBus packet.
> > > >
> > > > My recollection is that get_og_chan() is called when there is no
> > > > channel that interrupts the current CPU (that's what it means for
> > > > stor_device-
> > > > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a
> > > > >channel
> > > > that interrupts some other CPU, preferably a CPU in the current NUMA node.
> > > > Adding code to prefer the channel associated with the current CPU
> > > > doesn't make sense in get_og_chn(), as get_og_chn() is only called
> > > > when it is already known that there is no such channel.
> > >
> > > The initial values for stor_chns[] and alloced_cpus are set in
> > > storvsc_channel_init() (for primary channel) and handle_sc_creation() (for
> > subchannels).
> >
> > OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry
> > in stor_chns[] will not be NULL. And if the entry in stor_chns[] is NULL, the CPU
> > bit in alloced_cpus is *not* set. All the places that manipulate these fields update
> > both so they are in sync with each other. Hence I'll agree the code you've added
> > in get_og_chn() will never return a NULL value.
> >
> > (However, FWIW the reverse is not true: If the entry in stor_chns[] is not NULL,
> > the corresponding CPU bit in alloced_cpus may or may not be set.)
> >
> > >
> > > As a result, the check for cpumask_test_cpu(q_num,
> > > &stor_device->alloced_cpus) will guarantee we are getting a channel.
> > > If the check fails, the code follows the old behavior to find a channel.
> > >
> > > This check is needed because storvsc supports
> > > change_target_cpu_callback() callback via vmbus.
> >
> > But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied
> > the code here for discussion purposes. This is the only place that get_og_chn() is
> > called:
> >
> > spin_lock_irqsave(&stor_device->lock, flags);
> > outgoing_channel = stor_device->stor_chns[q_num];
> > if (outgoing_channel != NULL) {
> > spin_unlock_irqrestore(&stor_device->lock, flags);
> > goto found_channel;
> > }
> > outgoing_channel = get_og_chn(stor_device, q_num);
> > spin_unlock_irqrestore(&stor_device->lock, flags);
> >
> > The code gets the spin lock, then reads the stor_chns[] entry. If the entry is non-
> > NULL, then we've found a suitable channel and get_og_chn() is *not* called. The
> > only time get_og_chan() is called is when the stor_chn[] entry
> > *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
> > So the check you've added in get_og_chn() can never be true and the check is not
> > needed. You said the check is needed because of change_target_cpu_callback(),
> > which will invoke storvsc_change_target_cpu(). But I don't see how that matters
> > given the checks done before get_og_chn() is called. The spin lock synchronizes
> > with any changes made by storvsc_change_target_cpu().
>
> I agree this check may not be necessary. Given the current patch is correct, how about I
> submit another patch to remove this check after more testing are done?
That's OK with me.
>
> >
> > Related, there are a couple of occurrences of code like this:
> >
> > for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
> > q_num + 1) {
> > channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> > if (!channel)
> > continue;
> >
> > Given your point that the alloced_cpus and stor_chns[] entries are always in sync
> > with each other, the check for channel being NULL is not necessary.
>
> I don't' agree with removing the check for NULL. This code follows the existing storvsc behavior
> on checking allocated CPUs. Looking at storvsc_change_target_cpu(), it changes stor_chns
> before changing alloced_cpus. It can run at the same time as this code. I think we may need
> to add some memory barriers in storvsc_change_target_cpu(), but that is for another topic.
Yes, you are right. The spin lock isn't held in these cases, so the checks for NULL
should stay.
Michael
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-10-08 18:24 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-02 5:05 [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU longli
2025-10-07 2:05 ` Martin K. Petersen
2025-10-07 15:41 ` Michael Kelley
2025-10-08 0:55 ` Long Li
2025-10-08 15:30 ` Michael Kelley
2025-10-08 17:20 ` Long Li
2025-10-08 18:24 ` Michael Kelley
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).