* [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU @ 2025-10-02 5:05 longli 2025-10-07 2:05 ` Martin K. Petersen 2025-10-07 15:41 ` Michael Kelley 0 siblings, 2 replies; 7+ messages in thread From: longli @ 2025-10-02 5:05 UTC (permalink / raw) To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen, James Bottomley, linux-hyperv, linux-scsi, linux-kernel Cc: Long Li From: Long Li <longli@microsoft.com> When selecting an outgoing channel for I/O, storvsc tries to select a channel with a returning CPU that is not the same as issuing CPU. This worked well in the past, however it doesn't work well when the Hyper-V exposes a large number of channels (up to the number of all CPUs). Use a different CPU for returning channel is not efficient on Hyper-V. Change this behavior by preferring to the channel with the same CPU as the current I/O issuing CPU whenever possible. Tests have shown improvements in newer Hyper-V/Azure environment, and no regression with older Hyper-V/Azure environments. Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com> Signed-off-by: Long Li <longli@microsoft.com> --- drivers/scsi/storvsc_drv.c | 96 ++++++++++++++++++-------------------- 1 file changed, 45 insertions(+), 51 deletions(-) diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index d9e59204a9c3..092939791ea0 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device, } /* - * Our channel array is sparsley populated and we + * Our channel array could be sparsley populated and we * initiated I/O on a processor/hw-q that does not * currently have a designated channel. Fix this. * The strategy is simple: - * I. Ensure NUMA locality - * II. Distribute evenly (best effort) + * I. Prefer the channel associated with the current CPU + * II. Ensure NUMA locality + * III. Distribute evenly (best effort) */ + /* Prefer the channel on the I/O issuing processor/hw-q */ + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus)) + return stor_device->stor_chns[q_num]; + node_mask = cpumask_of_node(cpu_to_node(q_num)); num_channels = 0; @@ -1469,59 +1474,48 @@ static int storvsc_do_io(struct hv_device *device, /* See storvsc_change_target_cpu(). */ outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]); if (outgoing_channel != NULL) { - if (outgoing_channel->target_cpu == q_num) { - /* - * Ideally, we want to pick a different channel if - * available on the same NUMA node. - */ - node_mask = cpumask_of_node(cpu_to_node(q_num)); - for_each_cpu_wrap(tgt_cpu, - &stor_device->alloced_cpus, q_num + 1) { - if (!cpumask_test_cpu(tgt_cpu, node_mask)) - continue; - if (tgt_cpu == q_num) - continue; - channel = READ_ONCE( - stor_device->stor_chns[tgt_cpu]); - if (channel == NULL) - continue; - if (hv_get_avail_to_write_percent( - &channel->outbound) - > ring_avail_percent_lowater) { - outgoing_channel = channel; - goto found_channel; - } - } + if (hv_get_avail_to_write_percent(&outgoing_channel->outbound) + > ring_avail_percent_lowater) + goto found_channel; - /* - * All the other channels on the same NUMA node are - * busy. Try to use the channel on the current CPU - */ - if (hv_get_avail_to_write_percent( - &outgoing_channel->outbound) - > ring_avail_percent_lowater) + /* + * Channel is busy, try to find a channel on the same NUMA node + */ + node_mask = cpumask_of_node(cpu_to_node(q_num)); + for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus, + q_num + 1) { + if (!cpumask_test_cpu(tgt_cpu, node_mask)) + continue; + channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]); + if (!channel) + continue; + if (hv_get_avail_to_write_percent(&channel->outbound) + > ring_avail_percent_lowater) { + outgoing_channel = channel; goto found_channel; + } + } - /* - * If we reach here, all the channels on the current - * NUMA node are busy. Try to find a channel in - * other NUMA nodes - */ - for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) { - if (cpumask_test_cpu(tgt_cpu, node_mask)) - continue; - channel = READ_ONCE( - stor_device->stor_chns[tgt_cpu]); - if (channel == NULL) - continue; - if (hv_get_avail_to_write_percent( - &channel->outbound) - > ring_avail_percent_lowater) { - outgoing_channel = channel; - goto found_channel; - } + /* + * If we reach here, all the channels on the current + * NUMA node are busy. Try to find a channel in + * all NUMA nodes + */ + for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus, + q_num + 1) { + channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]); + if (!channel) + continue; + if (hv_get_avail_to_write_percent(&channel->outbound) + > ring_avail_percent_lowater) { + outgoing_channel = channel; + goto found_channel; } } + /* + * If we reach here, all the channels are busy. Use the + * original channel found. + */ } else { spin_lock_irqsave(&stor_device->lock, flags); outgoing_channel = stor_device->stor_chns[q_num]; -- 2.34.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU 2025-10-02 5:05 [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU longli @ 2025-10-07 2:05 ` Martin K. Petersen 2025-10-07 15:41 ` Michael Kelley 1 sibling, 0 replies; 7+ messages in thread From: Martin K. Petersen @ 2025-10-07 2:05 UTC (permalink / raw) To: longli Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen, James Bottomley, linux-hyperv, linux-scsi, linux-kernel, Long Li Long, > When selecting an outgoing channel for I/O, storvsc tries to select a > channel with a returning CPU that is not the same as issuing CPU. This > worked well in the past, however it doesn't work well when the Hyper-V > exposes a large number of channels (up to the number of all CPUs). Use > a different CPU for returning channel is not efficient on Hyper-V. Applied to 6.18/scsi-staging, thanks! -- Martin K. Petersen ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU 2025-10-02 5:05 [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU longli 2025-10-07 2:05 ` Martin K. Petersen @ 2025-10-07 15:41 ` Michael Kelley 2025-10-08 0:55 ` Long Li 1 sibling, 1 reply; 7+ messages in thread From: Michael Kelley @ 2025-10-07 15:41 UTC (permalink / raw) To: longli@linux.microsoft.com, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org, linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Long Li From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent: Wednesday, October 1, 2025 10:06 PM > > When selecting an outgoing channel for I/O, storvsc tries to select a > channel with a returning CPU that is not the same as issuing CPU. This > worked well in the past, however it doesn't work well when the Hyper-V > exposes a large number of channels (up to the number of all CPUs). Use > a different CPU for returning channel is not efficient on Hyper-V. > > Change this behavior by preferring to the channel with the same CPU > as the current I/O issuing CPU whenever possible. > > Tests have shown improvements in newer Hyper-V/Azure environment, and > no regression with older Hyper-V/Azure environments. > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com> > Signed-off-by: Long Li <longli@microsoft.com> > --- > drivers/scsi/storvsc_drv.c | 96 ++++++++++++++++++-------------------- > 1 file changed, 45 insertions(+), 51 deletions(-) > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > index d9e59204a9c3..092939791ea0 100644 > --- a/drivers/scsi/storvsc_drv.c > +++ b/drivers/scsi/storvsc_drv.c > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device, > } > > /* > - * Our channel array is sparsley populated and we > + * Our channel array could be sparsley populated and we > * initiated I/O on a processor/hw-q that does not > * currently have a designated channel. Fix this. > * The strategy is simple: > - * I. Ensure NUMA locality > - * II. Distribute evenly (best effort) > + * I. Prefer the channel associated with the current CPU > + * II. Ensure NUMA locality > + * III. Distribute evenly (best effort) > */ > > + /* Prefer the channel on the I/O issuing processor/hw-q */ > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus)) > + return stor_device->stor_chns[q_num]; > + Hmmm. When get_og_chn() is called, we know that stor_device->stor_chns[q_num] is NULL since storvsc_do_io() has already handled the non-NULL case. And the checks are all done with stor_device->lock held, so the stor_chns array can't change. Hence the above code will return NULL, which will cause a NULL reference when storvsc_do_io() sends out the VMBus packet. My recollection is that get_og_chan() is called when there is no channel that interrupts the current CPU (that's what it means for stor_device->stor_chns[<current CPU>] to be NULL). So the algorithm must pick a channel that interrupts some other CPU, preferably a CPU in the current NUMA node. Adding code to prefer the channel associated with the current CPU doesn't make sense in get_og_chn(), as get_og_chn() is only called when it is already known that there is no such channel. Or is there a case that I'm missing? Regardless, the above code seems problematic because it would return NULL. Michael > node_mask = cpumask_of_node(cpu_to_node(q_num)); > > num_channels = 0; > @@ -1469,59 +1474,48 @@ static int storvsc_do_io(struct hv_device *device, > /* See storvsc_change_target_cpu(). */ > outgoing_channel = READ_ONCE(stor_device->stor_chns[q_num]); > if (outgoing_channel != NULL) { > - if (outgoing_channel->target_cpu == q_num) { > - /* > - * Ideally, we want to pick a different channel if > - * available on the same NUMA node. > - */ > - node_mask = cpumask_of_node(cpu_to_node(q_num)); > - for_each_cpu_wrap(tgt_cpu, > - &stor_device->alloced_cpus, q_num + 1) { > - if (!cpumask_test_cpu(tgt_cpu, node_mask)) > - continue; > - if (tgt_cpu == q_num) > - continue; > - channel = READ_ONCE( > - stor_device->stor_chns[tgt_cpu]); > - if (channel == NULL) > - continue; > - if (hv_get_avail_to_write_percent( > - &channel->outbound) > - > ring_avail_percent_lowater) { > - outgoing_channel = channel; > - goto found_channel; > - } > - } > + if (hv_get_avail_to_write_percent(&outgoing_channel->outbound) > + > ring_avail_percent_lowater) > + goto found_channel; > > - /* > - * All the other channels on the same NUMA node are > - * busy. Try to use the channel on the current CPU > - */ > - if (hv_get_avail_to_write_percent( > - &outgoing_channel->outbound) > - > ring_avail_percent_lowater) > + /* > + * Channel is busy, try to find a channel on the same NUMA node > + */ > + node_mask = cpumask_of_node(cpu_to_node(q_num)); > + for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus, > + q_num + 1) { > + if (!cpumask_test_cpu(tgt_cpu, node_mask)) > + continue; > + channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]); > + if (!channel) > + continue; > + if (hv_get_avail_to_write_percent(&channel->outbound) > + > ring_avail_percent_lowater) { > + outgoing_channel = channel; > goto found_channel; > + } > + } > > - /* > - * If we reach here, all the channels on the current > - * NUMA node are busy. Try to find a channel in > - * other NUMA nodes > - */ > - for_each_cpu(tgt_cpu, &stor_device->alloced_cpus) { > - if (cpumask_test_cpu(tgt_cpu, node_mask)) > - continue; > - channel = READ_ONCE( > - stor_device->stor_chns[tgt_cpu]); > - if (channel == NULL) > - continue; > - if (hv_get_avail_to_write_percent( > - &channel->outbound) > - > ring_avail_percent_lowater) { > - outgoing_channel = channel; > - goto found_channel; > - } > + /* > + * If we reach here, all the channels on the current > + * NUMA node are busy. Try to find a channel in > + * all NUMA nodes > + */ > + for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus, > + q_num + 1) { > + channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]); > + if (!channel) > + continue; > + if (hv_get_avail_to_write_percent(&channel->outbound) > + > ring_avail_percent_lowater) { > + outgoing_channel = channel; > + goto found_channel; > } > } > + /* > + * If we reach here, all the channels are busy. Use the > + * original channel found. > + */ > } else { > spin_lock_irqsave(&stor_device->lock, flags); > outgoing_channel = stor_device->stor_chns[q_num]; > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU 2025-10-07 15:41 ` Michael Kelley @ 2025-10-08 0:55 ` Long Li 2025-10-08 15:30 ` Michael Kelley 0 siblings, 1 reply; 7+ messages in thread From: Long Li @ 2025-10-08 0:55 UTC (permalink / raw) To: Michael Kelley, longli@linux.microsoft.com, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org, linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org > -----Original Message----- > From: Michael Kelley <mhklinux@outlook.com> > Sent: Tuesday, October 7, 2025 8:42 AM > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>; Haiyang > Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui > <decui@microsoft.com>; James E.J. Bottomley > <James.Bottomley@HansenPartnership.com>; Martin K. Petersen > <martin.petersen@oracle.com>; James Bottomley <JBottomley@Odin.com>; > linux-hyperv@vger.kernel.org; linux-scsi@vger.kernel.org; linux- > kernel@vger.kernel.org > Cc: Long Li <longli@microsoft.com> > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the > same CPU as on the I/O issuing CPU > > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent: > Wednesday, October 1, 2025 10:06 PM > > > > When selecting an outgoing channel for I/O, storvsc tries to select a > > channel with a returning CPU that is not the same as issuing CPU. This > > worked well in the past, however it doesn't work well when the Hyper-V > > exposes a large number of channels (up to the number of all CPUs). Use > > a different CPU for returning channel is not efficient on Hyper-V. > > > > Change this behavior by preferring to the channel with the same CPU as > > the current I/O issuing CPU whenever possible. > > > > Tests have shown improvements in newer Hyper-V/Azure environment, and > > no regression with older Hyper-V/Azure environments. > > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com> > > Signed-off-by: Long Li <longli@microsoft.com> > > --- > > drivers/scsi/storvsc_drv.c | 96 > > ++++++++++++++++++-------------------- > > 1 file changed, 45 insertions(+), 51 deletions(-) > > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > > index d9e59204a9c3..092939791ea0 100644 > > --- a/drivers/scsi/storvsc_drv.c > > +++ b/drivers/scsi/storvsc_drv.c > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct > storvsc_device *stor_device, > > } > > > > /* > > - * Our channel array is sparsley populated and we > > + * Our channel array could be sparsley populated and we > > * initiated I/O on a processor/hw-q that does not > > * currently have a designated channel. Fix this. > > * The strategy is simple: > > - * I. Ensure NUMA locality > > - * II. Distribute evenly (best effort) > > + * I. Prefer the channel associated with the current CPU > > + * II. Ensure NUMA locality > > + * III. Distribute evenly (best effort) > > */ > > > > + /* Prefer the channel on the I/O issuing processor/hw-q */ > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus)) > > + return stor_device->stor_chns[q_num]; > > + > > Hmmm. When get_og_chn() is called, we know that stor_device- > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled the non- > NULL case. And the checks are all done with stor_device->lock held, so the > stor_chns array can't change. > Hence the above code will return NULL, which will cause a NULL reference when > storvsc_do_io() sends out the VMBus packet. > > My recollection is that get_og_chan() is called when there is no channel that > interrupts the current CPU (that's what it means for stor_device- > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a channel > that interrupts some other CPU, preferably a CPU in the current NUMA node. > Adding code to prefer the channel associated with the current CPU doesn't make > sense in get_og_chn(), as get_og_chn() is only called when it is already known > that there is no such channel. The initial values for stor_chns[] and alloced_cpus are set in storvsc_channel_init() (for primary channel) and handle_sc_creation() (for subchannels). As a result, the check for cpumask_test_cpu(q_num, &stor_device->alloced_cpus) will guarantee we are getting a channel. If the check fails, the code follows the old behavior to find a channel. This check is needed because storvsc supports change_target_cpu_callback() callback via vmbus. Thanks, Long > > Or is there a case that I'm missing? Regardless, the above code seems > problematic because it would return NULL. > > Michael ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU 2025-10-08 0:55 ` Long Li @ 2025-10-08 15:30 ` Michael Kelley 2025-10-08 17:20 ` Long Li 0 siblings, 1 reply; 7+ messages in thread From: Michael Kelley @ 2025-10-08 15:30 UTC (permalink / raw) To: Long Li, longli@linux.microsoft.com, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org, linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org From: Long Li <longli@microsoft.com> Sent: Tuesday, October 7, 2025 5:56 PM > > > -----Original Message----- > > From: Michael Kelley <mhklinux@outlook.com> > > Sent: Tuesday, October 7, 2025 8:42 AM > > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>; Haiyang > > Zhang <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui > > <decui@microsoft.com>; James E.J. Bottomley > > <James.Bottomley@HansenPartnership.com>; Martin K. Petersen > > <martin.petersen@oracle.com>; James Bottomley <JBottomley@Odin.com>; > > linux-hyperv@vger.kernel.org; linux-scsi@vger.kernel.org; linux- > > kernel@vger.kernel.org > > Cc: Long Li <longli@microsoft.com> > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the > > same CPU as on the I/O issuing CPU > > > > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent: > > Wednesday, October 1, 2025 10:06 PM > > > > > > When selecting an outgoing channel for I/O, storvsc tries to select a > > > channel with a returning CPU that is not the same as issuing CPU. This > > > worked well in the past, however it doesn't work well when the Hyper-V > > > exposes a large number of channels (up to the number of all CPUs). Use > > > a different CPU for returning channel is not efficient on Hyper-V. > > > > > > Change this behavior by preferring to the channel with the same CPU as > > > the current I/O issuing CPU whenever possible. > > > > > > Tests have shown improvements in newer Hyper-V/Azure environment, and > > > no regression with older Hyper-V/Azure environments. > > > > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com> > > > Signed-off-by: Long Li <longli@microsoft.com> > > > --- > > > drivers/scsi/storvsc_drv.c | 96 > > > ++++++++++++++++++-------------------- > > > 1 file changed, 45 insertions(+), 51 deletions(-) > > > > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > > > index d9e59204a9c3..092939791ea0 100644 > > > --- a/drivers/scsi/storvsc_drv.c > > > +++ b/drivers/scsi/storvsc_drv.c > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device, > > > } > > > > > > /* > > > - * Our channel array is sparsley populated and we > > > + * Our channel array could be sparsley populated and we > > > * initiated I/O on a processor/hw-q that does not > > > * currently have a designated channel. Fix this. > > > * The strategy is simple: > > > - * I. Ensure NUMA locality > > > - * II. Distribute evenly (best effort) > > > + * I. Prefer the channel associated with the current CPU > > > + * II. Ensure NUMA locality > > > + * III. Distribute evenly (best effort) > > > */ > > > > > > + /* Prefer the channel on the I/O issuing processor/hw-q */ > > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus)) > > > + return stor_device->stor_chns[q_num]; > > > + > > > > Hmmm. When get_og_chn() is called, we know that stor_device- > > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled the non- > > NULL case. And the checks are all done with stor_device->lock held, so the > > stor_chns array can't change. > > Hence the above code will return NULL, which will cause a NULL reference when > > storvsc_do_io() sends out the VMBus packet. > > > > My recollection is that get_og_chan() is called when there is no channel that > > interrupts the current CPU (that's what it means for stor_device- > > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a channel > > that interrupts some other CPU, preferably a CPU in the current NUMA node. > > Adding code to prefer the channel associated with the current CPU doesn't make > > sense in get_og_chn(), as get_og_chn() is only called when it is already known > > that there is no such channel. > > The initial values for stor_chns[] and alloced_cpus are set in storvsc_channel_init() (for > primary channel) and handle_sc_creation() (for subchannels). OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry in stor_chns[] will not be NULL. And if the entry in stor_chns[] is NULL, the CPU bit in alloced_cpus is *not* set. All the places that manipulate these fields update both so they are in sync with each other. Hence I'll agree the code you've added in get_og_chn() will never return a NULL value. (However, FWIW the reverse is not true: If the entry in stor_chns[] is not NULL, the corresponding CPU bit in alloced_cpus may or may not be set.) > > As a result, the check for cpumask_test_cpu(q_num, &stor_device->alloced_cpus) will > guarantee we are getting a channel. If the check fails, the code follows the old behavior > to find a channel. > > This check is needed because storvsc supports change_target_cpu_callback() callback > via vmbus. But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied the code here for discussion purposes. This is the only place that get_og_chn() is called: spin_lock_irqsave(&stor_device->lock, flags); outgoing_channel = stor_device->stor_chns[q_num]; if (outgoing_channel != NULL) { spin_unlock_irqrestore(&stor_device->lock, flags); goto found_channel; } outgoing_channel = get_og_chn(stor_device, q_num); spin_unlock_irqrestore(&stor_device->lock, flags); The code gets the spin lock, then reads the stor_chns[] entry. If the entry is non-NULL, then we've found a suitable channel and get_og_chn() is *not* called. The only time get_og_chan() is called is when the stor_chn[] entry *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set. So the check you've added in get_og_chn() can never be true and the check is not needed. You said the check is needed because of change_target_cpu_callback(), which will invoke storvsc_change_target_cpu(). But I don't see how that matters given the checks done before get_og_chn() is called. The spin lock synchronizes with any changes made by storvsc_change_target_cpu(). Related, there are a couple of occurrences of code like this: for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus, q_num + 1) { channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]); if (!channel) continue; Given your point that the alloced_cpus and stor_chns[] entries are always in sync with each other, the check for channel being NULL is not necessary. Michael ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU 2025-10-08 15:30 ` Michael Kelley @ 2025-10-08 17:20 ` Long Li 2025-10-08 18:24 ` Michael Kelley 0 siblings, 1 reply; 7+ messages in thread From: Long Li @ 2025-10-08 17:20 UTC (permalink / raw) To: Michael Kelley, longli@linux.microsoft.com, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org, linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the > same CPU as on the I/O issuing CPU > > From: Long Li <longli@microsoft.com> Sent: Tuesday, October 7, 2025 5:56 PM > > > > > -----Original Message----- > > > From: Michael Kelley <mhklinux@outlook.com> > > > Sent: Tuesday, October 7, 2025 8:42 AM > > > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>; > > > Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu > > > <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; James E.J. > > > Bottomley <James.Bottomley@HansenPartnership.com>; Martin K. > > > Petersen <martin.petersen@oracle.com>; James Bottomley > > > <JBottomley@Odin.com>; linux-hyperv@vger.kernel.org; > > > linux-scsi@vger.kernel.org; linux- kernel@vger.kernel.org > > > Cc: Long Li <longli@microsoft.com> > > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning > > > channel with the same CPU as on the I/O issuing CPU > > > > > > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent: > > > Wednesday, October 1, 2025 10:06 PM > > > > > > > > When selecting an outgoing channel for I/O, storvsc tries to > > > > select a channel with a returning CPU that is not the same as > > > > issuing CPU. This worked well in the past, however it doesn't work > > > > well when the Hyper-V exposes a large number of channels (up to > > > > the number of all CPUs). Use a different CPU for returning channel is not > efficient on Hyper-V. > > > > > > > > Change this behavior by preferring to the channel with the same > > > > CPU as the current I/O issuing CPU whenever possible. > > > > > > > > Tests have shown improvements in newer Hyper-V/Azure environment, > > > > and no regression with older Hyper-V/Azure environments. > > > > > > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com> > > > > Signed-off-by: Long Li <longli@microsoft.com> > > > > --- > > > > drivers/scsi/storvsc_drv.c | 96 > > > > ++++++++++++++++++-------------------- > > > > 1 file changed, 45 insertions(+), 51 deletions(-) > > > > > > > > diff --git a/drivers/scsi/storvsc_drv.c > > > > b/drivers/scsi/storvsc_drv.c index d9e59204a9c3..092939791ea0 > > > > 100644 > > > > --- a/drivers/scsi/storvsc_drv.c > > > > +++ b/drivers/scsi/storvsc_drv.c > > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel > *get_og_chn(struct storvsc_device *stor_device, > > > > } > > > > > > > > /* > > > > - * Our channel array is sparsley populated and we > > > > + * Our channel array could be sparsley populated and we > > > > * initiated I/O on a processor/hw-q that does not > > > > * currently have a designated channel. Fix this. > > > > * The strategy is simple: > > > > - * I. Ensure NUMA locality > > > > - * II. Distribute evenly (best effort) > > > > + * I. Prefer the channel associated with the current CPU > > > > + * II. Ensure NUMA locality > > > > + * III. Distribute evenly (best effort) > > > > */ > > > > > > > > + /* Prefer the channel on the I/O issuing processor/hw-q */ > > > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus)) > > > > + return stor_device->stor_chns[q_num]; > > > > + > > > > > > Hmmm. When get_og_chn() is called, we know that stor_device- > > > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled > > > >the non- > > > NULL case. And the checks are all done with stor_device->lock held, > > > so the stor_chns array can't change. > > > Hence the above code will return NULL, which will cause a NULL > > > reference when > > > storvsc_do_io() sends out the VMBus packet. > > > > > > My recollection is that get_og_chan() is called when there is no > > > channel that interrupts the current CPU (that's what it means for > > > stor_device- > > > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a > > > >channel > > > that interrupts some other CPU, preferably a CPU in the current NUMA node. > > > Adding code to prefer the channel associated with the current CPU > > > doesn't make sense in get_og_chn(), as get_og_chn() is only called > > > when it is already known that there is no such channel. > > > > The initial values for stor_chns[] and alloced_cpus are set in > > storvsc_channel_init() (for primary channel) and handle_sc_creation() (for > subchannels). > > OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry > in stor_chns[] will not be NULL. And if the entry in stor_chns[] is NULL, the CPU > bit in alloced_cpus is *not* set. All the places that manipulate these fields update > both so they are in sync with each other. Hence I'll agree the code you've added > in get_og_chn() will never return a NULL value. > > (However, FWIW the reverse is not true: If the entry in stor_chns[] is not NULL, > the corresponding CPU bit in alloced_cpus may or may not be set.) > > > > > As a result, the check for cpumask_test_cpu(q_num, > > &stor_device->alloced_cpus) will guarantee we are getting a channel. > > If the check fails, the code follows the old behavior to find a channel. > > > > This check is needed because storvsc supports > > change_target_cpu_callback() callback via vmbus. > > But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied > the code here for discussion purposes. This is the only place that get_og_chn() is > called: > > spin_lock_irqsave(&stor_device->lock, flags); > outgoing_channel = stor_device->stor_chns[q_num]; > if (outgoing_channel != NULL) { > spin_unlock_irqrestore(&stor_device->lock, flags); > goto found_channel; > } > outgoing_channel = get_og_chn(stor_device, q_num); > spin_unlock_irqrestore(&stor_device->lock, flags); > > The code gets the spin lock, then reads the stor_chns[] entry. If the entry is non- > NULL, then we've found a suitable channel and get_og_chn() is *not* called. The > only time get_og_chan() is called is when the stor_chn[] entry > *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set. > So the check you've added in get_og_chn() can never be true and the check is not > needed. You said the check is needed because of change_target_cpu_callback(), > which will invoke storvsc_change_target_cpu(). But I don't see how that matters > given the checks done before get_og_chn() is called. The spin lock synchronizes > with any changes made by storvsc_change_target_cpu(). I agree this check may not be necessary. Given the current patch is correct, how about I submit another patch to remove this check after more testing are done? > > Related, there are a couple of occurrences of code like this: > > for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus, > q_num + 1) { > channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]); > if (!channel) > continue; > > Given your point that the alloced_cpus and stor_chns[] entries are always in sync > with each other, the check for channel being NULL is not necessary. I don't' agree with removing the check for NULL. This code follows the existing storvsc behavior on checking allocated CPUs. Looking at storvsc_change_target_cpu(), it changes stor_chns before changing alloced_cpus. It can run at the same time as this code. I think we may need to add some memory barriers in storvsc_change_target_cpu(), but that is for another topic. Thanks, Long ^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU 2025-10-08 17:20 ` Long Li @ 2025-10-08 18:24 ` Michael Kelley 0 siblings, 0 replies; 7+ messages in thread From: Michael Kelley @ 2025-10-08 18:24 UTC (permalink / raw) To: Long Li, longli@linux.microsoft.com, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, James E.J. Bottomley, Martin K. Petersen, James Bottomley, linux-hyperv@vger.kernel.org, linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org From: Long Li <longli@microsoft.com> Sent: Wednesday, October 8, 2025 10:20 AM > > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the > > same CPU as on the I/O issuing CPU > > > > From: Long Li <longli@microsoft.com> Sent: Tuesday, October 7, 2025 5:56 PM > > > > > > > -----Original Message----- > > > > From: Michael Kelley <mhklinux@outlook.com> > > > > Sent: Tuesday, October 7, 2025 8:42 AM > > > > To: longli@linux.microsoft.com; KY Srinivasan <kys@microsoft.com>; > > > > Haiyang Zhang <haiyangz@microsoft.com>; Wei Liu > > > > <wei.liu@kernel.org>; Dexuan Cui <decui@microsoft.com>; James E.J. > > > > Bottomley <James.Bottomley@HansenPartnership.com>; Martin K. > > > > Petersen <martin.petersen@oracle.com>; James Bottomley > > > > <JBottomley@Odin.com>; linux-hyperv@vger.kernel.org; > > > > linux-scsi@vger.kernel.org; linux- kernel@vger.kernel.org > > > > Cc: Long Li <longli@microsoft.com> > > > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning > > > > channel with the same CPU as on the I/O issuing CPU > > > > > > > > From: longli@linux.microsoft.com <longli@linux.microsoft.com> Sent: > > > > Wednesday, October 1, 2025 10:06 PM > > > > > > > > > > When selecting an outgoing channel for I/O, storvsc tries to > > > > > select a channel with a returning CPU that is not the same as > > > > > issuing CPU. This worked well in the past, however it doesn't work > > > > > well when the Hyper-V exposes a large number of channels (up to > > > > > the number of all CPUs). Use a different CPU for returning channel is not efficient on Hyper-V. > > > > > > > > > > Change this behavior by preferring to the channel with the same > > > > > CPU as the current I/O issuing CPU whenever possible. > > > > > > > > > > Tests have shown improvements in newer Hyper-V/Azure environment, > > > > > and no regression with older Hyper-V/Azure environments. > > > > > > > > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com> > > > > > Signed-off-by: Long Li <longli@microsoft.com> > > > > > --- > > > > > drivers/scsi/storvsc_drv.c | 96 > > > > > ++++++++++++++++++-------------------- > > > > > 1 file changed, 45 insertions(+), 51 deletions(-) > > > > > > > > > > diff --git a/drivers/scsi/storvsc_drv.c > > > > > b/drivers/scsi/storvsc_drv.c index d9e59204a9c3..092939791ea0 > > > > > 100644 > > > > > --- a/drivers/scsi/storvsc_drv.c > > > > > +++ b/drivers/scsi/storvsc_drv.c > > > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device, > > > > > } > > > > > > > > > > /* > > > > > - * Our channel array is sparsley populated and we > > > > > + * Our channel array could be sparsley populated and we > > > > > * initiated I/O on a processor/hw-q that does not > > > > > * currently have a designated channel. Fix this. > > > > > * The strategy is simple: > > > > > - * I. Ensure NUMA locality > > > > > - * II. Distribute evenly (best effort) > > > > > + * I. Prefer the channel associated with the current CPU > > > > > + * II. Ensure NUMA locality > > > > > + * III. Distribute evenly (best effort) > > > > > */ > > > > > > > > > > + /* Prefer the channel on the I/O issuing processor/hw-q */ > > > > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus)) > > > > > + return stor_device->stor_chns[q_num]; > > > > > + > > > > > > > > Hmmm. When get_og_chn() is called, we know that stor_device- > > > > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled > > > > >the non- > > > > NULL case. And the checks are all done with stor_device->lock held, > > > > so the stor_chns array can't change. > > > > Hence the above code will return NULL, which will cause a NULL > > > > reference when > > > > storvsc_do_io() sends out the VMBus packet. > > > > > > > > My recollection is that get_og_chan() is called when there is no > > > > channel that interrupts the current CPU (that's what it means for > > > > stor_device- > > > > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a > > > > >channel > > > > that interrupts some other CPU, preferably a CPU in the current NUMA node. > > > > Adding code to prefer the channel associated with the current CPU > > > > doesn't make sense in get_og_chn(), as get_og_chn() is only called > > > > when it is already known that there is no such channel. > > > > > > The initial values for stor_chns[] and alloced_cpus are set in > > > storvsc_channel_init() (for primary channel) and handle_sc_creation() (for > > subchannels). > > > > OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry > > in stor_chns[] will not be NULL. And if the entry in stor_chns[] is NULL, the CPU > > bit in alloced_cpus is *not* set. All the places that manipulate these fields update > > both so they are in sync with each other. Hence I'll agree the code you've added > > in get_og_chn() will never return a NULL value. > > > > (However, FWIW the reverse is not true: If the entry in stor_chns[] is not NULL, > > the corresponding CPU bit in alloced_cpus may or may not be set.) > > > > > > > > As a result, the check for cpumask_test_cpu(q_num, > > > &stor_device->alloced_cpus) will guarantee we are getting a channel. > > > If the check fails, the code follows the old behavior to find a channel. > > > > > > This check is needed because storvsc supports > > > change_target_cpu_callback() callback via vmbus. > > > > But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied > > the code here for discussion purposes. This is the only place that get_og_chn() is > > called: > > > > spin_lock_irqsave(&stor_device->lock, flags); > > outgoing_channel = stor_device->stor_chns[q_num]; > > if (outgoing_channel != NULL) { > > spin_unlock_irqrestore(&stor_device->lock, flags); > > goto found_channel; > > } > > outgoing_channel = get_og_chn(stor_device, q_num); > > spin_unlock_irqrestore(&stor_device->lock, flags); > > > > The code gets the spin lock, then reads the stor_chns[] entry. If the entry is non- > > NULL, then we've found a suitable channel and get_og_chn() is *not* called. The > > only time get_og_chan() is called is when the stor_chn[] entry > > *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set. > > So the check you've added in get_og_chn() can never be true and the check is not > > needed. You said the check is needed because of change_target_cpu_callback(), > > which will invoke storvsc_change_target_cpu(). But I don't see how that matters > > given the checks done before get_og_chn() is called. The spin lock synchronizes > > with any changes made by storvsc_change_target_cpu(). > > I agree this check may not be necessary. Given the current patch is correct, how about I > submit another patch to remove this check after more testing are done? That's OK with me. > > > > > Related, there are a couple of occurrences of code like this: > > > > for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus, > > q_num + 1) { > > channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]); > > if (!channel) > > continue; > > > > Given your point that the alloced_cpus and stor_chns[] entries are always in sync > > with each other, the check for channel being NULL is not necessary. > > I don't' agree with removing the check for NULL. This code follows the existing storvsc behavior > on checking allocated CPUs. Looking at storvsc_change_target_cpu(), it changes stor_chns > before changing alloced_cpus. It can run at the same time as this code. I think we may need > to add some memory barriers in storvsc_change_target_cpu(), but that is for another topic. Yes, you are right. The spin lock isn't held in these cases, so the checks for NULL should stay. Michael ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-10-08 18:24 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-10-02 5:05 [PATCH] scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU longli 2025-10-07 2:05 ` Martin K. Petersen 2025-10-07 15:41 ` Michael Kelley 2025-10-08 0:55 ` Long Li 2025-10-08 15:30 ` Michael Kelley 2025-10-08 17:20 ` Long Li 2025-10-08 18:24 ` Michael Kelley
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).