qemu-devel.nongnu.org archive mirror
* A question about how to calculate the "Maximum transfer length" in case of its absence in the Block Limits VPD device response from the hardware
@ 2025-04-17 11:27 lma
  2025-04-18 15:34 ` Stefan Hajnoczi
  0 siblings, 1 reply; 6+ messages in thread
From: lma @ 2025-04-17 11:27 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, stefanha

Hi all,

With SCSI passthrough, if the hardware does not return a Block Limits 
VPD response, QEMU emulates it.

Several limits are involved in this process:
* bl.max_transfer
* bl.max_iov, which is associated with IOV_MAX.
* bl.max_hw_iov, which is associated with the max_segments sysfs 
setting of the relevant block device on the host.
* bl.max_hw_transfer, which is associated with the BLKSECTGET ioctl, 
in other words with the current max_sectors_kb sysfs setting of 
the relevant block device on the host.

QEMU then takes the smallest resulting value and returns it as the 
"Maximum transfer length". See:
static uint64_t calculate_max_transfer(SCSIDevice *s)
{
     uint64_t max_transfer = blk_get_max_hw_transfer(s->conf.blk);
     uint32_t max_iov = blk_get_max_hw_iov(s->conf.blk);

     assert(max_transfer);
     max_transfer = MIN_NON_ZERO(max_transfer,
                                 max_iov * qemu_real_host_page_size());

     return max_transfer / s->blocksize;
}


However, because of the IOV_MAX limit, no matter how capable the 
host SCSI hardware is, the "Maximum transfer length" that QEMU emulates 
in the Block Limits VPD page is capped at 8192 sectors with a 4 KiB page 
size and a 512-byte logical block size.
For example:
host:~ # sg_vpd -p bl /dev/sda
Block limits VPD page (SBC)
   ......
   Maximum transfer length: 0 blocks [not reported]
   ......


host:~ # cat /sys/class/block/sda/queue/max_sectors_kb
16384

host:~ # cat /sys/class/block/sda/queue/max_hw_sectors_kb
32767

host:~ # cat /sys/class/block/sda/queue/max_segments
4096


Expected:
guest:~ # sg_vpd -p bl /dev/sda
Block limits VPD page (SBC)
   ......
   Maximum transfer length: 0x8000
   ......

guest:~ # cat /sys/class/block/sda/queue/max_sectors_kb
16384

guest:~ # cat /sys/class/block/sda/queue/max_hw_sectors_kb
32767


Actual:
guest:~ # sg_vpd -p bl /dev/sda
Block limits VPD page (SBC)
   ......
   Maximum transfer length: 0x2000
   ......

guest:~ # cat /sys/class/block/sda/queue/max_sectors_kb
4096

guest:~ # cat /sys/class/block/sda/queue/max_hw_sectors_kb
32767
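
As a rough check of the numbers above, here is a minimal stand-alone
sketch of the same calculation. The function names are made up for
illustration and are not QEMU APIs, and it assumes the effective segment
limit is IOV_MAX = 1024 (the usual Linux value, not stated in this
thread).

```c
#include <assert.h>
#include <stdint.h>

/* Same idea as QEMU's MIN_NON_ZERO(): the smaller value, ignoring zeros. */
uint64_t min_non_zero(uint64_t a, uint64_t b)
{
    if (a == 0) {
        return b;
    }
    if (b == 0) {
        return a;
    }
    return a < b ? a : b;
}

/*
 * Hypothetical stand-alone version of calculate_max_transfer().
 *   max_hw_transfer: byte limit derived from BLKSECTGET
 *   max_iov:         segment limit (assumed already clamped to IOV_MAX)
 *   page_size:       host page size in bytes
 *   block_size:      logical block size in bytes
 * Returns the limit in logical blocks.
 */
uint64_t max_transfer_blocks(uint64_t max_hw_transfer, uint32_t max_iov,
                             uint64_t page_size, uint32_t block_size)
{
    assert(max_hw_transfer);
    assert(block_size);
    uint64_t limit = min_non_zero(max_hw_transfer,
                                  (uint64_t)max_iov * page_size);
    return limit / block_size;
}
```

With max_sectors_kb = 16384 on the host (16 MiB), 1024 segments, 4 KiB
pages and 512-byte blocks, max_transfer_blocks() gives 0x2000, which
matches the "Actual" value. Passing 0 for max_iov (no segment limit)
gives 0x8000, the "Expected" value.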


It seems the current design cannot fully utilize the performance of 
the SCSI hardware. I have two questions:
1. Is it reasonable to drop the IOV_MAX limitation and directly use the 
return value of BLKSECTGET as the maximum transfer length when QEMU 
emulates the Block Limits VPD page?
    If we did so, the maximum transfer length in the guest would be 
consistent with the capabilities of the host hardware.

2. Also, assume I set max_sectors_kb in the guest to a value (e.g. 
8192 KB) that does not exceed the capabilities of the host hardware 
(e.g. 16384 KB) but exceeds the limit (e.g. 4096 KB) imposed by IOV_MAX.
    Are there any risks in readv()/writev() in raw-posix?

Lin


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: A question about how to calculate the "Maximum transfer length" in case of its absence in the Block Limits VPD device response from the hardware
  2025-04-17 11:27 A question about how to calculate the "Maximum transfer length" in case of its absence in the Block Limits VPD device response from the hardware lma
@ 2025-04-18 15:34 ` Stefan Hajnoczi
  2025-04-23  9:47   ` lma
  0 siblings, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2025-04-18 15:34 UTC (permalink / raw)
  To: lma; +Cc: qemu-devel, pbonzini, qemu-block

On Thu, Apr 17, 2025 at 07:27:26PM +0800, lma wrote:
> It seems the current design cannot fully utilize the performance of
> the SCSI hardware. I have two questions:
> 1. Is it reasonable to drop the IOV_MAX limitation and directly use the
> return value of BLKSECTGET as the maximum transfer length when QEMU
> emulates the Block Limits VPD page?
>    If we did so, the maximum transfer length in the guest would be
> consistent with the capabilities of the host hardware.
> 
> 2. Also, assume I set max_sectors_kb in the guest to a value (e.g.
> 8192 KB) that does not exceed the capabilities of the host hardware
> (e.g. 16384 KB) but exceeds the limit (e.g. 4096 KB) imposed by IOV_MAX.
>    Are there any risks in readv()/writev() in raw-posix?

Not a definitive answer, but just something to encourage discussion:

In theory IOV_MAX should not be factored into the Block Limits VPD page
Maximum Transfer Length field because there is already a HBA limit on
the maximum number of segments. For example, virtio-scsi has a seg_max
Configuration Space field that guest drivers honor independently of
Maximum Transfer Length.

However, I can imagine why IOV_MAX needs to be factored in:

1. The maximum number of segments might be hardcoded in guest drivers
   for some SCSI HBAs and QEMU has no way of exposing IOV_MAX to the
   guest in that case.

2. Guest physical RAM addresses translate to host virtual memory. That
   means 1 segment as seen by the guest might actually require multiple
   physical DMA segments on the host. A conservative calculation that
   assumes the worst-case 1 iovec per 4 KB memory page prevents the
   host maximum segments limit (note this is not the Maximum Transfer
   Length limit!) from being exceeded.

So there seem to be at least two problems here. If you relax the
calculation there will be corner cases that break because the guest can
send too many segments.
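
To make point 2 concrete, here is a small sketch of the worst-case
count. It assumes, pessimistically, that every host page touched by a
guest segment needs its own iovec entry. The function name is
hypothetical and this is not code from QEMU.

```c
#include <stdint.h>

/*
 * Worst-case number of host iovec entries for one guest segment,
 * assuming each host page it touches needs a separate entry.
 *   addr:      start address of the segment
 *   len:       length of the segment in bytes
 *   page_size: host page size in bytes (e.g. 4096)
 */
uint64_t worst_case_iovecs(uint64_t addr, uint64_t len, uint64_t page_size)
{
    if (len == 0) {
        return 0;
    }
    uint64_t first_page = addr / page_size;
    uint64_t last_page = (addr + len - 1) / page_size;
    return last_page - first_page + 1;
}
```

Under that assumption a single page-aligned 4 MiB segment can need up to
1024 entries with 4 KiB pages, and an 8 MiB one up to 2048, which would
be more than an IOV_MAX of 1024 (again an assumed, typical Linux value).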

Stefan


* Re: A question about how to calculate the "Maximum transfer length" in case of its absence in the Block Limits VPD device response from the hardware
  2025-04-18 15:34 ` Stefan Hajnoczi
@ 2025-04-23  9:47   ` lma
  2025-04-23 13:24     ` Stefan Hajnoczi
  0 siblings, 1 reply; 6+ messages in thread
From: lma @ 2025-04-23  9:47 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, pbonzini, qemu-block

On 2025-04-18 23:34, Stefan Hajnoczi wrote:
> Not a definitive answer, but just something to encourage discussion:
> 
> In theory IOV_MAX should not be factored into the Block Limits VPD page
> Maximum Transfer Length field because there is already a HBA limit on
> the maximum number of segments. For example, virtio-scsi has a seg_max
> Configuration Space field that guest drivers honor independently of
> Maximum Transfer Length.
> 
> However, I can imagine why IOV_MAX needs to be factored in:
> 
> 1. The maximum number of segments might be hardcoded in guest drivers
>    for some SCSI HBAs and QEMU has no way of exposing IOV_MAX to the
>    guest in that case.
> 
> 2. Guest physical RAM addresses translate to host virtual memory. That
>    means 1 segment as seen by the guest might actually require multiple
>    physical DMA segments on the host. A conservative calculation that
>    assumes the worst-case 1 iovec per 4 KB memory page prevents the
>    host maximum segments limit (note this is not the Maximum Transfer
>    Length limit!) from being exceeded.
> 
> So there seem to be at least two problems here. If you relax the
> calculation there will be corner cases that break because the guest can
> send too many segments.
> 
> Stefan

The maximum allowed value for
/sys/class/block/<GUEST_DEV>/queue/max_sectors_kb in the guest OS is
the smaller of these two values in the guest:
the "Maximum transfer length" of the Block Limits VPD page
and
/sys/class/block/<GUEST_DEV>/queue/max_hw_sectors_kb.


The "max_sectors" Configuration Space field (not seg_max) in
hw/scsi/virtio-scsi.c:
static const Property virtio_scsi_properties[] = {
     ...
     DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI,
                        parent_obj.conf.max_sectors, 0xFFFF),
     ...
};

This field determines the value of max_hw_sectors_kb in sysfs in the
guest OS. For example, with a 512-byte logical block size, 0xFFFF
sectors means max_hw_sectors_kb = 0xFFFF/2 = 32767. I believe many users
keep this default when using virtio-scsi rather than customizing it.

But with the current design, and because of IOV_MAX, the upper limit of
/sys/class/block/<GUEST_DEV>/queue/max_sectors_kb is 4096 for the SCSI
passthrough scenario with a 4 KiB page size and a 512-byte logical
block size. Therefore the gap between the upper limit of max_sectors_kb
and max_hw_sectors_kb is very large.
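
The arithmetic behind these two figures can be sketched as follows. The
function names are hypothetical, and treating a VPD value of 0 as "not
reported, so only the HBA limit applies" is an assumption of this
sketch, not something taken from guest kernel code.

```c
#include <stdint.h>

/* Convert a count of logical blocks to KiB, rounding down. */
uint32_t sectors_to_kb(uint32_t sectors, uint32_t block_size)
{
    return (uint32_t)(((uint64_t)sectors * block_size) / 1024);
}

/*
 * Upper limit for the guest's max_sectors_kb: the smaller of the VPD
 * "Maximum transfer length" and the HBA's max_sectors, both in KiB.
 * Assumption: a VPD value of 0 means "not reported" and is ignored.
 */
uint32_t max_sectors_kb_ceiling(uint32_t vpd_max_transfer_blocks,
                                uint32_t hba_max_sectors,
                                uint32_t block_size)
{
    uint32_t hw_kb = sectors_to_kb(hba_max_sectors, block_size);
    if (vpd_max_transfer_blocks == 0) {
        return hw_kb;
    }
    uint32_t vpd_kb = sectors_to_kb(vpd_max_transfer_blocks, block_size);
    return vpd_kb < hw_kb ? vpd_kb : hw_kb;
}
```

With 512-byte blocks, sectors_to_kb(0xFFFF, 512) is 32767 and
max_sectors_kb_ceiling(0x2000, 0xFFFF, 512) is 4096, the two values
compared above. A VPD value of 0x8000 would raise the ceiling to 16384.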

I think this design logic is a bit strange.

Anyway, Thanks for the detailed answer,
Lin


* Re: A question about how to calculate the "Maximum transfer length" in case of its absence in the Block Limits VPD device response from the hardware
  2025-04-23  9:47   ` lma
@ 2025-04-23 13:24     ` Stefan Hajnoczi
       [not found]       ` <32c2072d6fc017786f4d6ef0dd681ae7@suse.de>
  0 siblings, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2025-04-23 13:24 UTC (permalink / raw)
  To: lma; +Cc: qemu-devel, pbonzini, qemu-block

On Wed, Apr 23, 2025 at 05:47:44PM +0800, lma wrote:
> But by the current design and affected by IOV_MAX, the upper limit of
> /sys/class/block/<GUEST_DEV>/queue/max_sectors_kb is 4096 for SCSI
> passthrough scenario in case of 4kb page size and 512 bytes logical
> block size. Therefore, the gap between the upper limit of max_sectors_kb
> and the max_hw_sectors_kb is very large.
> 
> I think this design logic is a bit strange.

Unless you can think of a different correct way to report block limits
for scsi-generic devices, then I think we're stuck with the sub-optimal
conservative value.

By the way, scsi-disk.c's scsi-block and scsi-hd devices are less
restrictive because the host is able to split requests. Splitting is not
possible for SCSI passthrough requests since they could be
vendor-specific requests and the host does not have enough information
to split them.
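
As an illustration only of what "splitting" means here, the sketch below
divides one READ/WRITE of nb_blocks starting at lba into chunks of at
most max_blocks. The function names are hypothetical and this is not
QEMU's block-layer code. It only works because the meaning of the
command (a plain LBA range) is known, which is exactly what is missing
for an opaque passthrough command.

```c
#include <stdint.h>

/* Number of sub-requests needed; returns 0 if max_blocks is 0 (invalid). */
uint64_t split_count(uint64_t nb_blocks, uint64_t max_blocks)
{
    if (max_blocks == 0) {
        return 0;
    }
    return nb_blocks / max_blocks + (nb_blocks % max_blocks != 0);
}

/*
 * Compute the idx-th chunk of the request.
 * Returns 0 on success and fills chunk_lba/chunk_len,
 * or -1 if idx is out of range or max_blocks is 0.
 */
int split_chunk(uint64_t lba, uint64_t nb_blocks, uint64_t max_blocks,
                uint64_t idx, uint64_t *chunk_lba, uint64_t *chunk_len)
{
    if (idx >= split_count(nb_blocks, max_blocks)) {
        return -1;
    }
    uint64_t offset = idx * max_blocks;
    uint64_t remaining = nb_blocks - offset;
    *chunk_lba = lba + offset;
    *chunk_len = remaining < max_blocks ? remaining : max_blocks;
    return 0;
}
```

For example, a 0x8000-block request with a 0x2000-block limit becomes
four chunks of 0x2000 blocks each.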

Can you use -device scsi-block instead of -device scsi-generic? That
would solve this problem.

Stefan


* Re: A question about how to calculate the "Maximum transfer length" in case of its absence in the Block Limits VPD device response from the hardware
       [not found]       ` <32c2072d6fc017786f4d6ef0dd681ae7@suse.de>
@ 2025-04-24 14:51         ` Stefan Hajnoczi
  2025-04-25  3:21           ` lma
  0 siblings, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2025-04-24 14:51 UTC (permalink / raw)
  To: lma; +Cc: qemu-devel, Paolo Bonzini

On Wed, Apr 23, 2025 at 10:07:48PM +0800, lma wrote:
> On 2025-04-23 21:24, Stefan Hajnoczi wrote:
> > Unless you can think of a different correct way to report block limits
> > for scsi-generic devices, then I think we're stuck with the sub-optimal
> > conservative value.
> > 
> > By the way, scsi-disk.c's scsi-block and scsi-hd devices are less
> > restrictive because the host is able to split requests. Splitting is not
> > possible for SCSI passthrough requests since they could be
> > vendor-specific requests and the host does not have enough information
> > to split them.
> > 
> > Can you use -device scsi-block instead of -device scsi-generic? That
> > would solve this problem.
> 
> Well, unfortunately, scsi-block is exactly where I ran into the
> restriction on maximum transfer length. I've never used
> scsi-generic.
> Eg:
> ......
> -device '{"driver":"virtio-scsi-pci","id":"scsi0","bus":"pci.7","addr":"0x0"}' \
> -blockdev '{"driver":"host_device","filename":"/dev/sda","node-name":"libvirt-2-storage","read-only":false}' \
> -device '{"driver":"scsi-block","bus":"scsi0.0","channel":0,"scsi-id":0,"lun":0,"drive":"libvirt-2-storage","id":"scsi0-0-0-0"}' \
> ......

Ah, scsi-block uses scsi_generic_req_ops for INQUIRY commands.

It comes down to whether scsi-block handles all commands that transfer
logical blocks (READ/WRITE/etc) without issuing the SG_IO ioctl. If it
does, then it's safe to increase the Optimal and Maximum Transfer Length
fields to the same value as scsi-disk.

It's possible that a vendor-specific command transfers logical blocks
and honors Maximum Transfer Length, so then it would not be safe to make
this change. But I'm not sure...

Stefan


* Re: A question about how to calculate the "Maximum transfer length" in case of its absence in the Block Limits VPD device response from the hardware
  2025-04-24 14:51         ` Stefan Hajnoczi
@ 2025-04-25  3:21           ` lma
  0 siblings, 0 replies; 6+ messages in thread
From: lma @ 2025-04-25  3:21 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, Paolo Bonzini

On 2025-04-24 22:51, Stefan Hajnoczi wrote:
> On Wed, Apr 23, 2025 at 10:07:48PM +0800, lma wrote:
>> 在 2025-04-23 21:24,Stefan Hajnoczi 写道:
>> > On Wed, Apr 23, 2025 at 05:47:44PM +0800, lma wrote:
>> > > 在 2025-04-18 23:34,Stefan Hajnoczi 写道:
>> > > > On Thu, Apr 17, 2025 at 07:27:26PM +0800, lma wrote:
>> > > > > Hi all,
>> > > > >
>> > > > > In the case of SCSI passthrough, if the Block Limits VPD device
>> > > > > response is absent from the hardware, QEMU handles it.
>> > > > >
>> > > > > There are several variables involved in this process as follows:
>> > > > > * The bl.max_transfer
>> > > > > * The bl.max_iov that is associated with IOV_MAX.
>> > > > > * The bl.max_hw_iov that is associated with the max_segments sysfs
>> > > > >   setting for the relevant block device on the host.
>> > > > > * The bl.max_hw_transfer that is associated with the BLKSECTGET
>> > > > >   ioctl, in other words related to the current max_sectors_kb sysfs
>> > > > >   setting of the relevant block device on the host.
>> > > > >
>> > > > > QEMU then takes the smallest value and returns it as the "Maximum
>> > > > > transfer length" after the relevant calculation, see:
>> > > > > static uint64_t calculate_max_transfer(SCSIDevice *s)
>> > > > > {
>> > > > >     uint64_t max_transfer = blk_get_max_hw_transfer(s->conf.blk);
>> > > > >     uint32_t max_iov = blk_get_max_hw_iov(s->conf.blk);
>> > > > >
>> > > > >     assert(max_transfer);
>> > > > >     max_transfer = MIN_NON_ZERO(max_transfer,
>> > > > >                                 max_iov * qemu_real_host_page_size());
>> > > > >
>> > > > >     return max_transfer / s->blocksize;
>> > > > > }
>> > > > >
>> > > > >
>> > > > > However, due to the limitation of IOV_MAX, no matter how powerful
>> > > > > the host SCSI hardware is, the "Maximum transfer length" that QEMU
>> > > > > emulates in the BL VPD page is capped at 8192 sectors in the case of
>> > > > > a 4 KB page size and a 512-byte logical block size.
>> > > > > For example:
>> > > > > host:~ # sg_vpd -p bl /dev/sda
>> > > > > Block limits VPD page (SBC)
>> > > > >   ......
>> > > > >   Maximum transfer length: 0 blocks [not reported]
>> > > > >   ......
>> > > > >
>> > > > >
>> > > > > host:~ # cat /sys/class/block/sda/queue/max_sectors_kb
>> > > > > 16384
>> > > > >
>> > > > > host:~ # cat /sys/class/block/sda/queue/max_hw_sectors_kb
>> > > > > 32767
>> > > > >
>> > > > > host:~ # cat /sys/class/block/sda/queue/max_segments
>> > > > > 4096
>> > > > >
>> > > > >
>> > > > > Expected:
>> > > > > guest:~ # sg_vpd -p bl /dev/sda
>> > > > > Block limits VPD page (SBC)
>> > > > >   ......
>> > > > >   Maximum transfer length: 0x8000
>> > > > >   ......
>> > > > >
>> > > > > guest:~ # cat /sys/class/block/sda/queue/max_sectors_kb
>> > > > > 16384
>> > > > >
>> > > > > guest:~ # cat /sys/class/block/sda/queue/max_hw_sectors_kb
>> > > > > 32767
>> > > > >
>> > > > >
>> > > > > Actual:
>> > > > > guest:~ # sg_vpd -p bl /dev/sda
>> > > > > Block limits VPD page (SBC)
>> > > > >   ......
>> > > > >   Maximum transfer length: 0x2000
>> > > > >   ......
>> > > > >
>> > > > > guest:~ # cat /sys/class/block/sda/queue/max_sectors_kb
>> > > > > 4096
>> > > > >
>> > > > > guest:~ # cat /sys/class/block/sda/queue/max_hw_sectors_kb
>> > > > > 32767
>> > > > >
>> > > > >
>> > > > > It seems the current design logic is not able to fully utilize the
>> > > > > performance of the SCSI hardware. I have two questions:
>> > > > > 1. Is it reasonable to drop the IOV_MAX limitation and directly use
>> > > > >    the return value of BLKSECTGET as the maximum transfer length
>> > > > >    when QEMU emulates the Block Limits page of the SCSI VPD?
>> > > > >    If we do so, the maximum transfer length in the guest will be
>> > > > >    consistent with the capabilities of the host hardware.
>> > > > >
>> > > > > 2. Besides, assume I set max_sectors_kb in the guest to a value
>> > > > >    (e.g. 8192 KB) that doesn't exceed the capabilities of the host
>> > > > >    hardware (e.g. 16384 KB) but exceeds the limit (e.g. 4096 KB)
>> > > > >    caused by IOV_MAX. Are there any risks in readv()/writev() of
>> > > > >    raw-posix?
>> > > >
>> > > > Not a definitive answer, but just something to encourage discussion:
>> > > >
>> > > > In theory IOV_MAX should not be factored into the Block Limits VPD page
>> > > > Maximum Transfer Length field because there is already a HBA limit on
>> > > > the maximum number of segments. For example, virtio-scsi has a seg_max
>> > > > Configuration Space field that guest drivers honor independently of
>> > > > Maximum Transfer Length.
>> > > >
>> > > > However, I can imagine why IOV_MAX needs to be factored in:
>> > > >
>> > > > 1. The maximum number of segments might be hardcoded in guest drivers
>> > > >    for some SCSI HBAs and QEMU has no way of exposing IOV_MAX to the
>> > > >    guest in that case.
>> > > >
>> > > > 2. Guest physical RAM addresses translate to host virtual memory. That
>> > > >    means 1 segment as seen by the guest might actually require multiple
>> > > >    physical DMA segments on the host. A conservative calculation that
>> > > >    assumes the worst-case 1 iovec per 4 KB memory page prevents the
>> > > >    host maximum segments limit (note this is not the Maximum Transfer
>> > > >    Length limit!) from being exceeded.
>> > > >
>> > > > So there seem to be at least two problems here. If you relax the
>> > > > calculation there will be corner cases that break because the guest can
>> > > > send too many segments.
>> > > >
>> > > > Stefan
>> > >
>> > > The maximum allowed value for
>> > > /sys/class/block/<GUEST_DEV>/queue/max_sectors_kb in guest os depends
>> > > on the smaller of below two items in guest os:
>> > > the "maximum transfer length of block limits VPD page"
>> > > and
>> > > the "/sys/class/block/<GUEST_DEV>/queue/max_hw_sectors_kb".
>> > >
>> > >
>> > > The "seg_max Configuration Space field" in hw/scsi/virtio-scsi.c:
>> > > static const Property virtio_scsi_properties[] = {
>> > >     ...
>> > >     DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI,
>> > >                        parent_obj.conf.max_sectors, 0xFFFF),
>> > >     ...
>> > > };
>> > >
>> > > This field determines the value of max_hw_sectors_kb in sysfs in the
>> > > guest OS. E.g., with a logical block size of 512 bytes, 0xFFFF sectors
>> > > means max_hw_sectors_kb = 0xFFFF/2 = 32767. I believe many users will
>> > > keep this default value when using virtio-scsi rather than customizing
>> > > it.
>> > >
>> > > But with the current design, and because of IOV_MAX, the upper limit of
>> > > /sys/class/block/<GUEST_DEV>/queue/max_sectors_kb is 4096 for the SCSI
>> > > passthrough scenario in the case of a 4 KB page size and a 512-byte
>> > > logical block size. Therefore, the gap between the upper limit of
>> > > max_sectors_kb and max_hw_sectors_kb is very large.
>> > >
>> > > I think this design logic is a bit strange.
>> >
>> > Unless you can think of a different correct way to report block limits
>> > for scsi-generic devices, then I think we're stuck with the sub-optimal
>> > conservative value.
>> >
>> > By the way, scsi-disk.c's scsi-block and scsi-hd devices are less
>> > restrictive because the host is able to split requests. Splitting is not
>> > possible for SCSI passthrough requests since they could be
>> > vendor-specific requests and the host does not have enough information
>> > to split them.
>> >
>> > Can you use -device scsi-block instead of -device scsi-generic? That
>> > would solve this problem.
>> 
>> Well, unfortunately, that's exactly where I ran into the problem with
>> the restriction on maximum transfer length with the scsi-block, I've
>> never used the scsi-generic.
>> Eg:
>> ......
>> -device
>> '{"driver":"virtio-scsi-pci","id":"scsi0","bus":"pci.7","addr":"0x0"}' \
>> -blockdev '{"driver":"host_device","filename":"/dev/sda","node-name":\
>> "libvirt-2-storage","read-only":false}' \
>> -device
>> '{"driver":"scsi-block","bus":"scsi0.0","channel":0,"scsi-id":0,"lun":0,\
>> "drive":"libvirt-2-storage","id":"scsi0-0-0-0"}' \
>> ......
> 
> Ah, scsi-block uses scsi_generic_req_ops for INQUIRY commands.
> 
> It comes down to whether scsi-block handles all commands that transfer
> logical blocks (READ/WRITE/etc) without issuing the SG_IO ioctl. If it
> does, then it's safe to increase the Optimal and Maximum Transfer Length
> fields to the same value as scsi-disk.
> 
> It's possible that a vendor-specific command transfers logical blocks
> and honors Maximum Transfer Length, so then it would not be safe to make
> this change. But I'm not sure...

Okay, let's see if there is more discussion or other comments.

Thanks for your input and time!
Lin



Thread overview: 6+ messages
2025-04-17 11:27 A question about how to calculate the "Maximum transfer length" in case of its absence in the Block Limits VPD device response from the hardware lma
2025-04-18 15:34 ` Stefan Hajnoczi
2025-04-23  9:47   ` lma
2025-04-23 13:24     ` Stefan Hajnoczi
     [not found]       ` <32c2072d6fc017786f4d6ef0dd681ae7@suse.de>
2025-04-24 14:51         ` Stefan Hajnoczi
2025-04-25  3:21           ` lma
