[PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
@ 2025-10-02  5:05 longli
  2025-10-07  2:05 ` Martin K. Petersen
  2025-10-07 15:41 ` Michael Kelley
  0 siblings, 2 replies; 7+ messages in thread
From: longli @ 2025-10-02  5:05 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	James E.J. Bottomley, Martin K. Petersen, James Bottomley,
	linux-hyperv, linux-scsi, linux-kernel
  Cc: Long Li

From: Long Li <longli@microsoft.com>

When selecting an outgoing channel for I/O, storvsc tries to select a
channel with a returning CPU that is not the same as issuing CPU. This
worked well in the past, however it doesn't work well when the Hyper-V
exposes a large number of channels (up to the number of all CPUs). Use
a different CPU for returning channel is not efficient on Hyper-V.

Change this behavior by preferring to the channel with the same CPU
as the current I/O issuing CPU whenever possible.

Tests have shown improvements in newer Hyper-V/Azure environment, and
no regression with older Hyper-V/Azure environments.

Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/scsi/storvsc_drv.c | 96 ++++++++++++++++++--------------------
 1 file changed, 45 insertions(+), 51 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index d9e59204a9c3..092939791ea0 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
 	}
 
 	/*
-	 * Our channel array is sparsley populated and we
+	 * Our channel array could be sparsley populated and we
 	 * initiated I/O on a processor/hw-q that does not
 	 * currently have a designated channel. Fix this.
 	 * The strategy is simple:
-	 * I. Ensure NUMA locality
-	 * II. Distribute evenly (best effort)
+	 * I. Prefer the channel associated with the current CPU
+	 * II. Ensure NUMA locality
+	 * III. Distribute evenly (best effort)
 	 */
 
+	/* Prefer the channel on the I/O issuing processor/hw-q */
+	if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
+		return stor_device->stor_chns[q_num];
+
 	node_mask = cpumask_of_node(cpu_to_node(q_num));
 
 	num_channels = 0;
@@ -1469,59 +1474,48 @@ static int storvsc_do_io(struct hv_device *device,
 	/* See storvsc_change_target_cpu(). */
 	outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]);
 	if (outgoing_channel != NULL) {
-		if (outgoing_channel->target_cpu == q_num) {
-			/*
-			 * Ideally, we want to pick a different channel if
-			 * available on the same NUMA node.
-			 */
-			node_mask = cpumask_of_node(cpu_to_node(q_num));
-			for_each_cpu_wrap(tgt_cpu,
-				 &stor_device->alloced_cpus, q_num + 1) {
-				if (!cpumask_test_cpu(tgt_cpu, node_mask))
-					continue;
-				if (tgt_cpu == q_num)
-					continue;
-				channel = READ_ONCE(
-					stor_device->stor_chns[tgt_cpu]);
-				if (channel == NULL)
-					continue;
-				if (hv_get_avail_to_write_percent(
-							&channel->outbound)
-						> ring_avail_percent_lowater) {
-					outgoing_channel = channel;
-					goto found_channel;
-				}
-			}
+		if (hv_get_avail_to_write_percent(&outgoing_channel->outbound)
+				> ring_avail_percent_lowater)
+			goto found_channel;
 
-			/*
-			 * All the other channels on the same NUMA node are
-			 * busy. Try to use the channel on the current CPU
-			 */
-			if (hv_get_avail_to_write_percent(
-						&outgoing_channel->outbound)
-					> ring_avail_percent_lowater)
+		/*
+		 * Channel is busy, try to find a channel on the same NUMA node
+		 */
+		node_mask = cpumask_of_node(cpu_to_node(q_num));
+		for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
+				  q_num + 1) {
+			if (!cpumask_test_cpu(tgt_cpu, node_mask))
+				continue;
+			channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
+			if (!channel)
+				continue;
+			if (hv_get_avail_to_write_percent(&channel->outbound)
+					> ring_avail_percent_lowater) {
+				outgoing_channel = channel;
 				goto found_channel;
+			}
+		}
 
-			/*
-			 * If we reach here, all the channels on the current
-			 * NUMA node are busy. Try to find a channel in
-			 * other NUMA nodes
-			 */
-			for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
-				if (cpumask_test_cpu(tgt_cpu, node_mask))
-					continue;
-				channel = READ_ONCE(
-					stor_device->stor_chns[tgt_cpu]);
-				if (channel == NULL)
-					continue;
-				if (hv_get_avail_to_write_percent(
-							&channel->outbound)
-						> ring_avail_percent_lowater) {
-					outgoing_channel = channel;
-					goto found_channel;
-				}
+		/*
+		 * If we reach here, all the channels on the current
+		 * NUMA node are busy. Try to find a channel in
+		 * all NUMA nodes
+		 */
+		for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
+				  q_num + 1) {
+			channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
+			if (!channel)
+				continue;
+			if (hv_get_avail_to_write_percent(&channel->outbound)
+					> ring_avail_percent_lowater) {
+				outgoing_channel = channel;
+				goto found_channel;
 			}
 		}
+		/*
+		 * If we reach here, all the channels are busy. Use the
+		 * original channel found.
+		 */
 	} else {
 		spin_lock_irqsave(&stor_device->lock, flags);
 		outgoing_channel = stor_device->stor_chns[q_num];
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
  2025-10-02  5:05 [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU longli
@ 2025-10-07  2:05 ` Martin K. Petersen
  2025-10-07 15:41 ` Michael Kelley
  1 sibling, 0 replies; 7+ messages in thread
From: Martin K. Petersen @ 2025-10-07  2:05 UTC (permalink / raw)
  To: longli
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	James E.J. Bottomley, Martin K. Petersen, James Bottomley,
	linux-hyperv, linux-scsi, linux-kernel, Long Li


Long,

> When selecting an outgoing channel for I/O, storvsc tries to select a
> channel with a returning CPU that is not the same as issuing CPU. This
> worked well in the past, however it doesn't work well when the Hyper-V
> exposes a large number of channels (up to the number of all CPUs). Use
> a different CPU for returning channel is not efficient on Hyper-V.

Applied to 6.18/scsi-staging, thanks!

-- 
Martin K. Petersen

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
  2025-10-02  5:05 [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU longli
  2025-10-07  2:05 ` Martin K. Petersen
@ 2025-10-07 15:41 ` Michael Kelley
  2025-10-08  0:55   ` Long Li
  1 sibling, 1 reply; 7+ messages in thread
From: Michael Kelley @ 2025-10-07 15:41 UTC (permalink / raw)
  To: longli@linux.microsoft.com, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen,
	James Bottomley, linux-hyperv@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org
  Cc: Long Li

From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent: Wednesday, October 1, 2025 10:06 PM
> 
> When selecting an outgoing channel for I/O, storvsc tries to select a
> channel with a returning CPU that is not the same as issuing CPU. This
> worked well in the past, however it doesn't work well when the Hyper-V
> exposes a large number of channels (up to the number of all CPUs). Use
> a different CPU for returning channel is not efficient on Hyper-V.
> 
> Change this behavior by preferring to the channel with the same CPU
> as the current I/O issuing CPU whenever possible.
> 
> Tests have shown improvements in newer Hyper-V/Azure environment, and
> no regression with older Hyper-V/Azure environments.
> 
> Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  drivers/scsi/storvsc_drv.c | 96 ++++++++++++++++++--------------------
>  1 file changed, 45 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index d9e59204a9c3..092939791ea0 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
>  	}
> 
>  	/*
> -	 * Our channel array is sparsley populated and we
> +	 * Our channel array could be sparsley populated and we
>  	 * initiated I/O on a processor/hw-q that does not
>  	 * currently have a designated channel. Fix this.
>  	 * The strategy is simple:
> -	 * I. Ensure NUMA locality
> -	 * II. Distribute evenly (best effort)
> +	 * I. Prefer the channel associated with the current CPU
> +	 * II. Ensure NUMA locality
> +	 * III. Distribute evenly (best effort)
>  	 */
> 
> +	/* Prefer the channel on the I/O issuing processor/hw-q */
> +	if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> +		return stor_device->stor_chns[q_num];
> +

Hmmm. When get_og_chn() is called, we know that
stor_device->stor_chns[q_num] is NULL since storvsc_do_io() has
already handled the non-NULL case. And the checks are all done
with stor_device->lock held, so the stor_chns array can't change. 
Hence the above code will return NULL, which will cause a NULL
reference when storvsc_do_io() sends out the VMBus packet.

My recollection is that get_og_chan() is called when there is no
channel that interrupts the current CPU (that's what it means
for stor_device->stor_chns[<current CPU>] to be NULL). So the
algorithm must pick a channel that interrupts some other CPU,
preferably a CPU in the current NUMA node. Adding code to prefer
the channel associated with the current CPU doesn't make sense in
get_og_chn(), as get_og_chn() is only called when it is already 
known that there is no such channel.

Or is there a case that I'm missing? Regardless, the above code
seems problematic because it would return NULL.

Michael

>  	node_mask = cpumask_of_node(cpu_to_node(q_num));
> 
>  	num_channels = 0;
> @@ -1469,59 +1474,48 @@ static int storvsc_do_io(struct hv_device *device,
>  	/* See storvsc_change_target_cpu(). */
>  	outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]);
>  	if (outgoing_channel != NULL) {
> -		if (outgoing_channel->target_cpu == q_num) {
> -			/*
> -			 * Ideally, we want to pick a different channel if
> -			 * available on the same NUMA node.
> -			 */
> -			node_mask = cpumask_of_node(cpu_to_node(q_num));
> -			for_each_cpu_wrap(tgt_cpu,
> -				 &stor_device->alloced_cpus, q_num + 1) {
> -				if (!cpumask_test_cpu(tgt_cpu, node_mask))
> -					continue;
> -				if (tgt_cpu == q_num)
> -					continue;
> -				channel = READ_ONCE(
> -					stor_device->stor_chns[tgt_cpu]);
> -				if (channel == NULL)
> -					continue;
> -				if (hv_get_avail_to_write_percent(
> -							&channel->outbound)
> -						> ring_avail_percent_lowater) {
> -					outgoing_channel = channel;
> -					goto found_channel;
> -				}
> -			}
> +		if (hv_get_avail_to_write_percent(&outgoing_channel->outbound)
> +				> ring_avail_percent_lowater)
> +			goto found_channel;
> 
> -			/*
> -			 * All the other channels on the same NUMA node are
> -			 * busy. Try to use the channel on the current CPU
> -			 */
> -			if (hv_get_avail_to_write_percent(
> -						&outgoing_channel->outbound)
> -					> ring_avail_percent_lowater)
> +		/*
> +		 * Channel is busy, try to find a channel on the same NUMA node
> +		 */
> +		node_mask = cpumask_of_node(cpu_to_node(q_num));
> +		for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
> +				  q_num + 1) {
> +			if (!cpumask_test_cpu(tgt_cpu, node_mask))
> +				continue;
> +			channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> +			if (!channel)
> +				continue;
> +			if (hv_get_avail_to_write_percent(&channel->outbound)
> +					> ring_avail_percent_lowater) {
> +				outgoing_channel = channel;
>  				goto found_channel;
> +			}
> +		}
> 
> -			/*
> -			 * If we reach here, all the channels on the current
> -			 * NUMA node are busy. Try to find a channel in
> -			 * other NUMA nodes
> -			 */
> -			for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) {
> -				if (cpumask_test_cpu(tgt_cpu, node_mask))
> -					continue;
> -				channel = READ_ONCE(
> -					stor_device->stor_chns[tgt_cpu]);
> -				if (channel == NULL)
> -					continue;
> -				if (hv_get_avail_to_write_percent(
> -							&channel->outbound)
> -						> ring_avail_percent_lowater) {
> -					outgoing_channel = channel;
> -					goto found_channel;
> -				}
> +		/*
> +		 * If we reach here, all the channels on the current
> +		 * NUMA node are busy. Try to find a channel in
> +		 * all NUMA nodes
> +		 */
> +		for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
> +				  q_num + 1) {
> +			channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> +			if (!channel)
> +				continue;
> +			if (hv_get_avail_to_write_percent(&channel->outbound)
> +					> ring_avail_percent_lowater) {
> +				outgoing_channel = channel;
> +				goto found_channel;
>  			}
>  		}
> +		/*
> +		 * If we reach here, all the channels are busy. Use the
> +		 * original channel found.
> +		 */
>  	} else {
>  		spin_lock_irqsave(&stor_device->lock, flags);
>  		outgoing_channel = stor_device->stor_chns[q_num];
> --
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
  2025-10-07 15:41 ` Michael Kelley
@ 2025-10-08  0:55   ` Long Li
  2025-10-08 15:30     ` Michael Kelley
  0 siblings, 1 reply; 7+ messages in thread
From: Long Li @ 2025-10-08  0:55 UTC (permalink / raw)
  To: Michael Kelley, longli@linux.microsoft.com, KY Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley,
	Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org



> -----Original Message-----
> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Tuesday, October 7, 2025 8:42 AM
> To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>; Haiyang
> Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <decui@microsoft.com>; James E.J. Bottomley
> <James.Bottomley@HansenPartnership.com>; Martin K. Petersen
> <martin.petersen@oracle.com>; James Bottomley <JBottomley@Odin.com>;
> linux-hyperv@vger.kernel.org; linux-scsi@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Cc: Long Li <longli@microsoft.com>
> Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> same CPU as on the I/O issuing CPU
> 
> From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent:
> Wednesday, October 1, 2025 10:06 PM
> >
> > When selecting an outgoing channel for I/O, storvsc tries to select a
> > channel with a returning CPU that is not the same as issuing CPU. This
> > worked well in the past, however it doesn't work well when the Hyper-V
> > exposes a large number of channels (up to the number of all CPUs). Use
> > a different CPU for returning channel is not efficient on Hyper-V.
> >
> > Change this behavior by preferring to the channel with the same CPU as
> > the current I/O issuing CPU whenever possible.
> >
> > Tests have shown improvements in newer Hyper-V/Azure environment, and
> > no regression with older Hyper-V/Azure environments.
> >
> > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> >  drivers/scsi/storvsc_drv.c | 96
> > ++++++++++++++++++--------------------
> >  1 file changed, 45 insertions(+), 51 deletions(-)
> >
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index d9e59204a9c3..092939791ea0 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct
> storvsc_device *stor_device,
> >  	}
> >
> >  	/*
> > -	 * Our channel array is sparsley populated and we
> > +	 * Our channel array could be sparsley populated and we
> >  	 * initiated I/O on a processor/hw-q that does not
> >  	 * currently have a designated channel. Fix this.
> >  	 * The strategy is simple:
> > -	 * I. Ensure NUMA locality
> > -	 * II. Distribute evenly (best effort)
> > +	 * I. Prefer the channel associated with the current CPU
> > +	 * II. Ensure NUMA locality
> > +	 * III. Distribute evenly (best effort)
> >  	 */
> >
> > +	/* Prefer the channel on the I/O issuing processor/hw-q */
> > +	if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > +		return stor_device->stor_chns[q_num];
> > +
> 
> Hmmm. When get_og_chn() is called, we know that stor_device-
> >stor_chns[q_num] is NULL since storvsc_do_io() has already handled the non-
> NULL case. And the checks are all done with stor_device->lock held, so the
> stor_chns array can't change.
> Hence the above code will return NULL, which will cause a NULL reference when
> storvsc_do_io() sends out the VMBus packet.
> 
> My recollection is that get_og_chan() is called when there is no channel that
> interrupts the current CPU (that's what it means for stor_device-
> >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a channel
> that interrupts some other CPU, preferably a CPU in the current NUMA node.
> Adding code to prefer the channel associated with the current CPU doesn't make
> sense in get_og_chn(), as get_og_chn() is only called when it is already known
> that there is no such channel.

The initial values for stor_chns[] and alloced_cpus are set in storvsc_channel_init() (for primary channel) and handle_sc_creation() (for subchannels).

As a result, the check for cpumask_test_cpu(q_num, &stor_device->alloced_cpus) will guarantee we are getting a channel. If the check fails, the code follows the old behavior to find a channel.

This check is needed because storvsc supports change_target_cpu_callback() callback via vmbus.

Thanks,

Long

> 
> Or is there a case that I'm missing? Regardless, the above code seems
> problematic because it would return NULL.
> 
> Michael

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
  2025-10-08  0:55   ` Long Li
@ 2025-10-08 15:30     ` Michael Kelley
  2025-10-08 17:20       ` Long Li
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Kelley @ 2025-10-08 15:30 UTC (permalink / raw)
  To: Long Li, longli@linux.microsoft.com, KY Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen,
	James Bottomley, linux-hyperv@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org

From: Long Li <longli@microsoft.com> Sent: Tuesday, October 7, 2025 5:56 PM
> 
> > -----Original Message-----
> > From: Michael Kelley <mhklinux@outlook.com>
> > Sent: Tuesday, October 7, 2025 8:42 AM
> > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>; Haiyang
> > Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> > <decui@microsoft.com>; James E.J. Bottomley
> > <James.Bottomley@HansenPartnership.com>; Martin K. Petersen
> > <martin.petersen@oracle.com>; James Bottomley <JBottomley@Odin.com>;
> > linux-hyperv@vger.kernel.org; linux-scsi@vger.kernel.org; linux-
> > kernel@vger.kernel.org
> > Cc: Long Li <longli@microsoft.com>
> > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> > same CPU as on the I/O issuing CPU
> >
> > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent:
> > Wednesday, October 1, 2025 10:06 PM
> > >
> > > When selecting an outgoing channel for I/O, storvsc tries to select a
> > > channel with a returning CPU that is not the same as issuing CPU. This
> > > worked well in the past, however it doesn't work well when the Hyper-V
> > > exposes a large number of channels (up to the number of all CPUs). Use
> > > a different CPU for returning channel is not efficient on Hyper-V.
> > >
> > > Change this behavior by preferring to the channel with the same CPU as
> > > the current I/O issuing CPU whenever possible.
> > >
> > > Tests have shown improvements in newer Hyper-V/Azure environment, and
> > > no regression with older Hyper-V/Azure environments.
> > >
> > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> > > Signed-off-by: Long Li <longli@microsoft.com>
> > > ---
> > >  drivers/scsi/storvsc_drv.c | 96
> > > ++++++++++++++++++--------------------
> > >  1 file changed, 45 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > index d9e59204a9c3..092939791ea0 100644
> > > --- a/drivers/scsi/storvsc_drv.c
> > > +++ b/drivers/scsi/storvsc_drv.c
> > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> > >  	}
> > >
> > >  	/*
> > > -	 * Our channel array is sparsley populated and we
> > > +	 * Our channel array could be sparsley populated and we
> > >  	 * initiated I/O on a processor/hw-q that does not
> > >  	 * currently have a designated channel. Fix this.
> > >  	 * The strategy is simple:
> > > -	 * I. Ensure NUMA locality
> > > -	 * II. Distribute evenly (best effort)
> > > +	 * I. Prefer the channel associated with the current CPU
> > > +	 * II. Ensure NUMA locality
> > > +	 * III. Distribute evenly (best effort)
> > >  	 */
> > >
> > > +	/* Prefer the channel on the I/O issuing processor/hw-q */
> > > +	if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > +		return stor_device->stor_chns[q_num];
> > > +
> >
> > Hmmm. When get_og_chn() is called, we know that stor_device-
> > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled the non-
> > NULL case. And the checks are all done with stor_device->lock held, so the
> > stor_chns array can't change.
> > Hence the above code will return NULL, which will cause a NULL reference when
> > storvsc_do_io() sends out the VMBus packet.
> >
> > My recollection is that get_og_chan() is called when there is no channel that
> > interrupts the current CPU (that's what it means for stor_device-
> > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a channel
> > that interrupts some other CPU, preferably a CPU in the current NUMA node.
> > Adding code to prefer the channel associated with the current CPU doesn't make
> > sense in get_og_chn(), as get_og_chn() is only called when it is already known
> > that there is no such channel.
> 
> The initial values for stor_chns[] and alloced_cpus are set in storvsc_channel_init() (for
> primary channel) and handle_sc_creation() (for subchannels).

OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry in
stor_chns[] will not be NULL.  And if the entry in stor_chns[] is NULL, the CPU bit in
alloced_cpus is *not* set. All the places that manipulate these fields update both so
they are in sync with each other.  Hence I'll agree the code you've added in get_og_chn()
will never return a NULL value.

(However, FWIW the reverse is not true:  If the entry in stor_chns[] is not NULL, the
corresponding CPU bit in alloced_cpus may or may not be set.)

> 
> As a result, the check for cpumask_test_cpu(q_num, &stor_device->alloced_cpus) will
> guarantee we are getting a channel. If the check fails, the code follows the old behavior
> to find a channel.
> 
> This check is needed because storvsc supports change_target_cpu_callback() callback
> via vmbus.

But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied the
code here for discussion purposes. This is the only place that get_og_chn() is called:

                spin_lock_irqsave(&stor_device->lock, flags);
                outgoing_channel = stor_device->stor_chns[q_num];
                if (outgoing_channel != NULL) {
                        spin_unlock_irqrestore(&stor_device->lock, flags);
                        goto found_channel;
                }
                outgoing_channel = get_og_chn(stor_device, q_num);
                spin_unlock_irqrestore(&stor_device->lock, flags);

The code gets the spin lock, then reads the stor_chns[] entry. If the entry is
non-NULL, then we've found a suitable channel and get_og_chn() is *not*
called. The only time get_og_chan() is called is when the stor_chn[] entry
*is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
So the check you've added in get_og_chn() can never be true and the check
is not needed. You said the check is needed because of 
change_target_cpu_callback(), which will invoke
storvsc_change_target_cpu(). But I don't see how that matters given the
checks done before get_og_chn() is called. The spin lock synchronizes with
any changes made by storvsc_change_target_cpu().

Related, there are a couple of occurrences of code like this:

                for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
                                  q_num + 1) {
                        channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
                        if (!channel)
                                continue;

Given your point that the alloced_cpus and stor_chns[] entries are
always in sync with each other, the check for channel being NULL
is not necessary.

Michael

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
  2025-10-08 15:30     ` Michael Kelley
@ 2025-10-08 17:20       ` Long Li
  2025-10-08 18:24         ` Michael Kelley
  0 siblings, 1 reply; 7+ messages in thread
From: Long Li @ 2025-10-08 17:20 UTC (permalink / raw)
  To: Michael Kelley, longli@linux.microsoft.com, KY Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley,
	Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org

> Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> same CPU as on the I/O issuing CPU
> 
> From: Long Li <longli@microsoft.com> Sent: Tuesday, October 7, 2025 5:56 PM
> >
> > > -----Original Message-----
> > > From: Michael Kelley <mhklinux@outlook.com>
> > > Sent: Tuesday, October 7, 2025 8:42 AM
> > > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>;
> > > Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> > > <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; James E.J.
> > > Bottomley <James.Bottomley@HansenPartnership.com>; Martin K.
> > > Petersen <martin.petersen@oracle.com>; James Bottomley
> > > <JBottomley@Odin.com>; linux-hyperv@vger.kernel.org;
> > > linux-scsi@vger.kernel.org; linux- kernel@vger.kernel.org
> > > Cc: Long Li <longli@microsoft.com>
> > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning
> > > channel with the same CPU as on the I/O issuing CPU
> > >
> > > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent:
> > > Wednesday, October 1, 2025 10:06 PM
> > > >
> > > > When selecting an outgoing channel for I/O, storvsc tries to
> > > > select a channel with a returning CPU that is not the same as
> > > > issuing CPU. This worked well in the past, however it doesn't work
> > > > well when the Hyper-V exposes a large number of channels (up to
> > > > the number of all CPUs). Use a different CPU for returning channel is not
> efficient on Hyper-V.
> > > >
> > > > Change this behavior by preferring to the channel with the same
> > > > CPU as the current I/O issuing CPU whenever possible.
> > > >
> > > > Tests have shown improvements in newer Hyper-V/Azure environment,
> > > > and no regression with older Hyper-V/Azure environments.
> > > >
> > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> > > > Signed-off-by: Long Li <longli@microsoft.com>
> > > > ---
> > > >  drivers/scsi/storvsc_drv.c | 96
> > > > ++++++++++++++++++--------------------
> > > >  1 file changed, 45 insertions(+), 51 deletions(-)
> > > >
> > > > diff --git a/drivers/scsi/storvsc_drv.c
> > > > b/drivers/scsi/storvsc_drv.c index d9e59204a9c3..092939791ea0
> > > > 100644
> > > > --- a/drivers/scsi/storvsc_drv.c
> > > > +++ b/drivers/scsi/storvsc_drv.c
> > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel
> *get_og_chn(struct storvsc_device *stor_device,
> > > >  	}
> > > >
> > > >  	/*
> > > > -	 * Our channel array is sparsley populated and we
> > > > +	 * Our channel array could be sparsley populated and we
> > > >  	 * initiated I/O on a processor/hw-q that does not
> > > >  	 * currently have a designated channel. Fix this.
> > > >  	 * The strategy is simple:
> > > > -	 * I. Ensure NUMA locality
> > > > -	 * II. Distribute evenly (best effort)
> > > > +	 * I. Prefer the channel associated with the current CPU
> > > > +	 * II. Ensure NUMA locality
> > > > +	 * III. Distribute evenly (best effort)
> > > >  	 */
> > > >
> > > > +	/* Prefer the channel on the I/O issuing processor/hw-q */
> > > > +	if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > > +		return stor_device->stor_chns[q_num];
> > > > +
> > >
> > > Hmmm. When get_og_chn() is called, we know that stor_device-
> > > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled
> > > >the non-
> > > NULL case. And the checks are all done with stor_device->lock held,
> > > so the stor_chns array can't change.
> > > Hence the above code will return NULL, which will cause a NULL
> > > reference when
> > > storvsc_do_io() sends out the VMBus packet.
> > >
> > > My recollection is that get_og_chan() is called when there is no
> > > channel that interrupts the current CPU (that's what it means for
> > > stor_device-
> > > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a
> > > >channel
> > > that interrupts some other CPU, preferably a CPU in the current NUMA node.
> > > Adding code to prefer the channel associated with the current CPU
> > > doesn't make sense in get_og_chn(), as get_og_chn() is only called
> > > when it is already known that there is no such channel.
> >
> > The initial values for stor_chns[] and alloced_cpus are set in
> > storvsc_channel_init() (for primary channel) and handle_sc_creation() (for
> subchannels).
> 
> OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry
> in stor_chns[] will not be NULL.  And if the entry in stor_chns[] is NULL, the CPU
> bit in alloced_cpus is *not* set. All the places that manipulate these fields update
> both so they are in sync with each other.  Hence I'll agree the code you've added
> in get_og_chn() will never return a NULL value.
> 
> (However, FWIW the reverse is not true:  If the entry in stor_chns[] is not NULL,
> the corresponding CPU bit in alloced_cpus may or may not be set.)
> 
> >
> > As a result, the check for cpumask_test_cpu(q_num,
> > &stor_device->alloced_cpus) will guarantee we are getting a channel.
> > If the check fails, the code follows the old behavior to find a channel.
> >
> > This check is needed because storvsc supports
> > change_target_cpu_callback() callback via vmbus.
> 
> But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied
> the code here for discussion purposes. This is the only place that get_og_chn() is
> called:
> 
>                 spin_lock_irqsave(&stor_device->lock, flags);
>                 outgoing_channel = stor_device->stor_chns[q_num];
>                 if (outgoing_channel != NULL) {
>                         spin_unlock_irqrestore(&stor_device->lock, flags);
>                         goto found_channel;
>                 }
>                 outgoing_channel = get_og_chn(stor_device, q_num);
>                 spin_unlock_irqrestore(&stor_device->lock, flags);
> 
> The code gets the spin lock, then reads the stor_chns[] entry. If the entry is non-
> NULL, then we've found a suitable channel and get_og_chn() is *not* called. The
> only time get_og_chan() is called is when the stor_chn[] entry
> *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
> So the check you've added in get_og_chn() can never be true and the check is not
> needed. You said the check is needed because of change_target_cpu_callback(),
> which will invoke storvsc_change_target_cpu(). But I don't see how that matters
> given the checks done before get_og_chn() is called. The spin lock synchronizes
> with any changes made by storvsc_change_target_cpu().

I agree this check may not be necessary. Given the current patch is correct, how about I
submit another patch to remove this check after more testing are done?

> 
> Related, there are a couple of occurrences of code like this:
> 
>                 for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
>                                   q_num + 1) {
>                         channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
>                         if (!channel)
>                                 continue;
> 
> Given your point that the alloced_cpus and stor_chns[] entries are always in sync
> with each other, the check for channel being NULL is not necessary.

I don't' agree with removing the check for NULL. This code follows the existing storvsc behavior
on checking allocated CPUs. Looking at storvsc_change_target_cpu(), it changes stor_chns
before changing alloced_cpus. It can run at the same time as this code. I think we may need
to add some memory barriers in storvsc_change_target_cpu(), but that is for another topic.
 
Thanks,

Long

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
  2025-10-08 17:20       ` Long Li
@ 2025-10-08 18:24         ` Michael Kelley
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Kelley @ 2025-10-08 18:24 UTC (permalink / raw)
  To: Long Li, longli@linux.microsoft.com, KY Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen,
	James Bottomley, linux-hyperv@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org

From: Long Li <longli@microsoft.com> Sent: Wednesday, October 8, 2025 10:20 AM
> 
> > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> > same CPU as on the I/O issuing CPU
> >
> > From: Long Li <longli@microsoft.com> Sent: Tuesday, October 7, 2025 5:56 PM
> > >
> > > > -----Original Message-----
> > > > From: Michael Kelley <mhklinux@outlook.com>
> > > > Sent: Tuesday, October 7, 2025 8:42 AM
> > > > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>;
> > > > Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu
> > > > <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; James E.J.
> > > > Bottomley <James.Bottomley@HansenPartnership.com>; Martin K.
> > > > Petersen <martin.petersen@oracle.com>; James Bottomley
> > > > <JBottomley@Odin.com>; linux-hyperv@vger.kernel.org;
> > > > linux-scsi@vger.kernel.org; linux- kernel@vger.kernel.org
> > > > Cc: Long Li <longli@microsoft.com>
> > > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning
> > > > channel with the same CPU as on the I/O issuing CPU
> > > >
> > > > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent:
> > > > Wednesday, October 1, 2025 10:06 PM
> > > > >
> > > > > When selecting an outgoing channel for I/O, storvsc tries to
> > > > > select a channel with a returning CPU that is not the same as
> > > > > issuing CPU. This worked well in the past, however it doesn't work
> > > > > well when the Hyper-V exposes a large number of channels (up to
> > > > > the number of all CPUs). Use a different CPU for returning channel is not efficient on Hyper-V.
> > > > >
> > > > > Change this behavior by preferring to the channel with the same
> > > > > CPU as the current I/O issuing CPU whenever possible.
> > > > >
> > > > > Tests have shown improvements in newer Hyper-V/Azure environment,
> > > > > and no regression with older Hyper-V/Azure environments.
> > > > >
> > > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
> > > > > Signed-off-by: Long Li <longli@microsoft.com>
> > > > > ---
> > > > >  drivers/scsi/storvsc_drv.c | 96
> > > > > ++++++++++++++++++--------------------
> > > > >  1 file changed, 45 insertions(+), 51 deletions(-)
> > > > >
> > > > > diff --git a/drivers/scsi/storvsc_drv.c
> > > > > b/drivers/scsi/storvsc_drv.c index d9e59204a9c3..092939791ea0
> > > > > 100644
> > > > > --- a/drivers/scsi/storvsc_drv.c
> > > > > +++ b/drivers/scsi/storvsc_drv.c
> > > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> > > > >  	}
> > > > >
> > > > >  	/*
> > > > > -	 * Our channel array is sparsley populated and we
> > > > > +	 * Our channel array could be sparsley populated and we
> > > > >  	 * initiated I/O on a processor/hw-q that does not
> > > > >  	 * currently have a designated channel. Fix this.
> > > > >  	 * The strategy is simple:
> > > > > -	 * I. Ensure NUMA locality
> > > > > -	 * II. Distribute evenly (best effort)
> > > > > +	 * I. Prefer the channel associated with the current CPU
> > > > > +	 * II. Ensure NUMA locality
> > > > > +	 * III. Distribute evenly (best effort)
> > > > >  	 */
> > > > >
> > > > > +	/* Prefer the channel on the I/O issuing processor/hw-q */
> > > > > +	if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > > > +		return stor_device->stor_chns[q_num];
> > > > > +
> > > >
> > > > Hmmm. When get_og_chn() is called, we know that stor_device-
> > > > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled
> > > > >the non-
> > > > NULL case. And the checks are all done with stor_device->lock held,
> > > > so the stor_chns array can't change.
> > > > Hence the above code will return NULL, which will cause a NULL
> > > > reference when
> > > > storvsc_do_io() sends out the VMBus packet.
> > > >
> > > > My recollection is that get_og_chan() is called when there is no
> > > > channel that interrupts the current CPU (that's what it means for
> > > > stor_device-
> > > > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a
> > > > >channel
> > > > that interrupts some other CPU, preferably a CPU in the current NUMA node.
> > > > Adding code to prefer the channel associated with the current CPU
> > > > doesn't make sense in get_og_chn(), as get_og_chn() is only called
> > > > when it is already known that there is no such channel.
> > >
> > > The initial values for stor_chns[] and alloced_cpus are set in
> > > storvsc_channel_init() (for primary channel) and handle_sc_creation() (for
> > subchannels).
> >
> > OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry
> > in stor_chns[] will not be NULL.  And if the entry in stor_chns[] is NULL, the CPU
> > bit in alloced_cpus is *not* set. All the places that manipulate these fields update
> > both so they are in sync with each other.  Hence I'll agree the code you've added
> > in get_og_chn() will never return a NULL value.
> >
> > (However, FWIW the reverse is not true:  If the entry in stor_chns[] is not NULL,
> > the corresponding CPU bit in alloced_cpus may or may not be set.)
> >
> > >
> > > As a result, the check for cpumask_test_cpu(q_num,
> > > &stor_device->alloced_cpus) will guarantee we are getting a channel.
> > > If the check fails, the code follows the old behavior to find a channel.
> > >
> > > This check is needed because storvsc supports
> > > change_target_cpu_callback() callback via vmbus.
> >
> > But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied
> > the code here for discussion purposes. This is the only place that get_og_chn() is
> > called:
> >
> >                 spin_lock_irqsave(&stor_device->lock, flags);
> >                 outgoing_channel = stor_device->stor_chns[q_num];
> >                 if (outgoing_channel != NULL) {
> >                         spin_unlock_irqrestore(&stor_device->lock, flags);
> >                         goto found_channel;
> >                 }
> >                 outgoing_channel = get_og_chn(stor_device, q_num);
> >                 spin_unlock_irqrestore(&stor_device->lock, flags);
> >
> > The code gets the spin lock, then reads the stor_chns[] entry. If the entry is non-
> > NULL, then we've found a suitable channel and get_og_chn() is *not* called. The
> > only time get_og_chan() is called is when the stor_chn[] entry
> > *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
> > So the check you've added in get_og_chn() can never be true and the check is not
> > needed. You said the check is needed because of change_target_cpu_callback(),
> > which will invoke storvsc_change_target_cpu(). But I don't see how that matters
> > given the checks done before get_og_chn() is called. The spin lock synchronizes
> > with any changes made by storvsc_change_target_cpu().
> 
> I agree this check may not be necessary. Given the current patch is correct, how about I
> submit another patch to remove this check after more testing are done?

That's OK with me.

> 
> >
> > Related, there are a couple of occurrences of code like this:
> >
> >                 for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
> >                                   q_num + 1) {
> >                         channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> >                         if (!channel)
> >                                 continue;
> >
> > Given your point that the alloced_cpus and stor_chns[] entries are always in sync
> > with each other, the check for channel being NULL is not necessary.
> 
> I don't' agree with removing the check for NULL. This code follows the existing storvsc behavior
> on checking allocated CPUs. Looking at storvsc_change_target_cpu(), it changes stor_chns
> before changing alloced_cpus. It can run at the same time as this code. I think we may need
> to add some memory barriers in storvsc_change_target_cpu(), but that is for another topic.

Yes, you are right. The spin lock isn't held in these cases, so the checks for NULL
should stay.

Michael

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-10-08 18:24 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-02  5:05 [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU longli
2025-10-07  2:05 ` Martin K. Petersen
2025-10-07 15:41 ` Michael Kelley
2025-10-08  0:55   ` Long Li
2025-10-08 15:30     ` Michael Kelley
2025-10-08 17:20       ` Long Li
2025-10-08 18:24         ` Michael Kelley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).