From: lma <lma@suse.de>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: A question about how to calculate the "Maximum transfer length" in case of its absence in the Block Limits VPD device response from the hardware
Date: Fri, 25 Apr 2025 11:21:44 +0800
Message-ID: <a5a0c5a701e40adbd763578af95e3ac5@suse.de>
In-Reply-To: <20250424145151.GA399725@fedora>
On 2025-04-24 22:51, Stefan Hajnoczi wrote:
> On Wed, Apr 23, 2025 at 10:07:48PM +0800, lma wrote:
>> On 2025-04-23 21:24, Stefan Hajnoczi wrote:
>> > On Wed, Apr 23, 2025 at 05:47:44PM +0800, lma wrote:
>> > > On 2025-04-18 23:34, Stefan Hajnoczi wrote:
>> > > > On Thu, Apr 17, 2025 at 07:27:26PM +0800, lma wrote:
>> > > > > Hi all,
>> > > > >
>> > > > > In the case of SCSI passthrough, if the Block Limits VPD page
>> > > > > is absent from the hardware's response, QEMU emulates it.
>> > > > >
>> > > > > There are several variables involved in this process:
>> > > > > * bl.max_transfer
>> > > > > * bl.max_iov, which is associated with IOV_MAX.
>> > > > > * bl.max_hw_iov, which is associated with the max_segments
>> > > > >   sysfs setting of the relevant block device on the host.
>> > > > > * bl.max_hw_transfer, which is associated with the BLKSECTGET
>> > > > >   ioctl, in other words with the current max_sectors_kb sysfs
>> > > > >   setting of the relevant block device on the host.
>> > > > >
>> > > > > QEMU then takes the smallest of these values and, after the
>> > > > > relevant calculation, returns it as the "Maximum transfer
>> > > > > length". See:
>> > > > >
>> > > > > static uint64_t calculate_max_transfer(SCSIDevice *s)
>> > > > > {
>> > > > >     uint64_t max_transfer = blk_get_max_hw_transfer(s->conf.blk);
>> > > > >     uint32_t max_iov = blk_get_max_hw_iov(s->conf.blk);
>> > > > >
>> > > > >     assert(max_transfer);
>> > > > >     max_transfer = MIN_NON_ZERO(max_transfer,
>> > > > >                                 max_iov * qemu_real_host_page_size());
>> > > > >
>> > > > >     return max_transfer / s->blocksize;
>> > > > > }
>> > > > >
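>> > > > > (A worked instance with this host's numbers, assuming Linux's
>> > > > > IOV_MAX of 1024 and a 4 KB page: blk_get_max_hw_transfer()
>> > > > > returns 16384 KB from BLKSECTGET, blk_get_max_hw_iov() is
>> > > > > clamped to IOV_MAX, so max_transfer = MIN(16384 KB, 1024 * 4 KB)
>> > > > > = 4096 KB, and 4096 KB / 512 B = 8192 sectors = 0x2000.)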
>> > > > >
>> > > > > However, due to the IOV_MAX limitation, no matter how powerful
>> > > > > the host SCSI hardware is, the "Maximum transfer length" that
>> > > > > QEMU emulates in the Block Limits VPD page is capped at 8192
>> > > > > sectors with a 4 KB page size and a 512-byte logical block size.
>> > > > > For example:
>> > > > > host:~ # sg_vpd -p bl /dev/sda
>> > > > > Block limits VPD page (SBC)
>> > > > > ......
>> > > > > Maximum transfer length: 0 blocks [not reported]
>> > > > > ......
>> > > > >
>> > > > >
>> > > > > host:~ # cat /sys/class/block/sda/queue/max_sectors_kb
>> > > > > 16384
>> > > > >
>> > > > > host:~ # cat /sys/class/block/sda/queue/max_hw_sectors_kb
>> > > > > 32767
>> > > > >
>> > > > > host:~ # cat /sys/class/block/sda/queue/max_segments
>> > > > > 4096
>> > > > >
>> > > > >
>> > > > > Expected:
>> > > > > guest:~ # sg_vpd -p bl /dev/sda
>> > > > > Block limits VPD page (SBC)
>> > > > > ......
>> > > > > Maximum transfer length: 0x8000
>> > > > > ......
>> > > > >
>> > > > > guest:~ # cat /sys/class/block/sda/queue/max_sectors_kb
>> > > > > 16384
>> > > > >
>> > > > > guest:~ # cat /sys/class/block/sda/queue/max_hw_sectors_kb
>> > > > > 32767
>> > > > >
>> > > > >
>> > > > > Actual:
>> > > > > guest:~ # sg_vpd -p bl /dev/sda
>> > > > > Block limits VPD page (SBC)
>> > > > > ......
>> > > > > Maximum transfer length: 0x2000
>> > > > > ......
>> > > > >
>> > > > > guest:~ # cat /sys/class/block/sda/queue/max_sectors_kb
>> > > > > 4096
>> > > > >
>> > > > > guest:~ # cat /sys/class/block/sda/queue/max_hw_sectors_kb
>> > > > > 32767
>> > > > >
>> > > > >
>> > > > > It seems the current design logic is unable to fully utilize
>> > > > > the performance of the SCSI hardware. I have two questions:
>> > > > > 1. Is it reasonable to drop the IOV_MAX limitation logic and
>> > > > > directly use the return value of BLKSECTGET as the maximum
>> > > > > transfer length when QEMU emulates the Block Limits VPD page?
>> > > > > If we did so, the maximum transfer length in the guest would be
>> > > > > consistent with the capabilities of the host hardware.
>> > > > >
>> > > > > 2. Besides, assume I set max_sectors_kb in the guest to a value
>> > > > > (e.g. 8192 KB) that doesn't exceed the capabilities of the host
>> > > > > hardware (e.g. 16384 KB) but exceeds the limit (e.g. 4096 KB)
>> > > > > caused by IOV_MAX. Are there any risks in readv()/writev() of
>> > > > > raw-posix?
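>> > > > >
>> > > > > (For context, a minimal sketch of the failure mode I'm asking
>> > > > > about; this only illustrates the Linux readv()/writev()
>> > > > > contract, assuming the usual IOV_MAX = 1024, and is not QEMU
>> > > > > code:)
>> > > > >
>> > > > > #include <sys/uio.h>
>> > > > > #include <limits.h>
>> > > > > #include <errno.h>
>> > > > >
>> > > > > /* readv()/writev() fail with EINVAL when iovcnt exceeds
>> > > > >  * IOV_MAX, so a request that needs more than IOV_MAX iovecs
>> > > > >  * cannot be submitted in a single call. */
>> > > > > static ssize_t checked_readv(int fd, const struct iovec *iov,
>> > > > >                              int iovcnt)
>> > > > > {
>> > > > >     if (iovcnt > IOV_MAX) {
>> > > > >         errno = EINVAL; /* what the kernel would return anyway */
>> > > > >         return -1;
>> > > > >     }
>> > > > >     return readv(fd, iov, iovcnt);
>> > > > > }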
>> > > >
>> > > > Not a definitive answer, but just something to encourage discussion:
>> > > >
>> > > > In theory IOV_MAX should not be factored into the Block Limits VPD page
>> > > > Maximum Transfer Length field because there is already an HBA limit on
>> > > > the maximum number of segments. For example, virtio-scsi has a seg_max
>> > > > Configuration Space field that guest drivers honor independently of
>> > > > Maximum Transfer Length.
>> > > >
>> > > > However, I can imagine why IOV_MAX needs to be factored in:
>> > > >
>> > > > 1. The maximum number of segments might be hardcoded in guest drivers
>> > > > for some SCSI HBAs, and QEMU has no way of exposing IOV_MAX to the
>> > > > guest in that case.
>> > > >
>> > > > 2. Guest physical RAM addresses translate to host virtual memory. That
>> > > > means 1 segment as seen by the guest might actually require multiple
>> > > > physical DMA segments on the host. A conservative calculation that
>> > > > assumes the worst-case 1 iovec per 4 KB memory page prevents the
>> > > > host maximum segments limit (note this is not the Maximum Transfer
>> > > > Length limit!) from being exceeded.
>> > > >
>> > > > So there seem to be at least two problems here. If you relax the
>> > > > calculation there will be corner cases that break because the guest can
>> > > > send too many segments.
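>> > > >
>> > > > (A worked instance of point 2: a 4 MiB transfer may arrive from
>> > > > the guest as a page-granular scatter-gather list, i.e. up to
>> > > > 4 MiB / 4 KiB = 1024 iovecs. Capping Maximum Transfer Length at
>> > > > max_iov * page_size guarantees the host's segment limit holds
>> > > > even in that worst case.)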
>> > > >
>> > > > Stefan
>> > >
>> > > The maximum allowed value of
>> > > /sys/class/block/<GUEST_DEV>/queue/max_sectors_kb in the guest OS
>> > > depends on the smaller of the following two items in the guest OS:
>> > > the "Maximum transfer length" of the Block Limits VPD page
>> > > and
>> > > /sys/class/block/<GUEST_DEV>/queue/max_hw_sectors_kb.
>> > >
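>> > > (With the numbers from this thread: the emulated VPD page reports
>> > > 0x2000 blocks = 8192 * 512 B = 4096 KB, and max_hw_sectors_kb is
>> > > 32767, so the guest's max_sectors_kb is capped at
>> > > min(4096, 32767) = 4096 KB.)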
>> > >
>> > > The "seg_max Configuration Space field" in hw/scsi/virtio-scsi.c:
>> > > static const Property virtio_scsi_properties[] = {
>> > > ...
>> > > DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI,
>> > > parent_obj.conf.max_sectors,
>> > > 0xFFFF),
>> > > ...
>> > > };
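>> > >
>> > > (So in principle it can be overridden on the command line, e.g.
>> > > -device virtio-scsi-pci,max_sectors=32768, but as noted below I
>> > > expect most users keep the default.)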
>> > >
>> > > This field determines the value of max_hw_sectors_kb in sysfs in
>> > > the guest OS. E.g. with a 512-byte logical block size, 0xFFFF
>> > > sectors means max_hw_sectors_kb = 0xFFFF / 2 = 32767. I believe
>> > > many users will keep this default value when using virtio-scsi
>> > > rather than customizing it.
>> > >
>> > > But with the current design, limited by IOV_MAX, the upper limit of
>> > > /sys/class/block/<GUEST_DEV>/queue/max_sectors_kb is 4096 in the
>> > > SCSI passthrough scenario with a 4 KB page size and a 512-byte
>> > > logical block size. Therefore, the gap between the upper limit of
>> > > max_sectors_kb and max_hw_sectors_kb is very large.
>> > >
>> > > I think this design logic is a bit strange.
>> >
>> > Unless you can think of a different correct way to report block limits
>> > for scsi-generic devices, I think we're stuck with the sub-optimal
>> > conservative value.
>> >
>> > By the way, scsi-disk.c's scsi-block and scsi-hd devices are less
>> > restrictive because the host is able to split requests. Splitting is not
>> > possible for SCSI passthrough requests since they could be
>> > vendor-specific requests and the host does not have enough information
>> > to split them.
>> >
>> > Can you use -device scsi-block instead of -device scsi-generic? That
>> > would solve this problem.
>>
>> Well, unfortunately, that's exactly where I ran into the restriction
>> on maximum transfer length: I was using scsi-block; I've never used
>> scsi-generic.
>> E.g.:
>> ......
>> -device '{"driver":"virtio-scsi-pci","id":"scsi0","bus":"pci.7","addr":"0x0"}' \
>> -blockdev '{"driver":"host_device","filename":"/dev/sda","node-name":"libvirt-2-storage","read-only":false}' \
>> -device '{"driver":"scsi-block","bus":"scsi0.0","channel":0,"scsi-id":0,"lun":0,"drive":"libvirt-2-storage","id":"scsi0-0-0-0"}' \
>> ......
>
> Ah, scsi-block uses scsi_generic_req_ops for INQUIRY commands.
>
> It comes down to whether scsi-block handles all commands that transfer
> logical blocks (READ/WRITE/etc.) without issuing the SG_IO ioctl; if it
> does, then it's safe to increase the Optimal and Maximum Transfer Length
> fields to the same value as scsi-disk.
>
> It's possible that a vendor-specific command transfers logical blocks
> and honors Maximum Transfer Length, in which case it would not be safe
> to make this change. But I'm not sure...
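
For reference, here is a simplified, from-memory sketch of how
scsi-block decides which commands go through SG_IO (the real routine is
scsi_block_is_passthrough() in hw/scsi/scsi-disk.c; the exact case list
and its protection-information checks may differ, so treat the details
as an assumption):

static bool scsi_block_is_passthrough(SCSIDiskState *s, uint8_t *buf)
{
    switch (buf[0]) {
    case READ_6:  case READ_10:  case READ_12:  case READ_16:
    case WRITE_6: case WRITE_10: case WRITE_12: case WRITE_16:
        /* Routed to scsi-disk's regular read/write path, where the
         * host block layer can split oversized requests. */
        return false;
    default:
        /* Everything else, including vendor-specific CDBs, goes to
         * the host via SG_IO and cannot be split safely. */
        return true;
    }
}

If only the non-passthrough commands above can ever transfer logical
blocks, then raising the reported limits for scsi-block sounds plausible.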
Okay, let's see if there is further discussion or more comments.
Thanks for your input and time!
Lin