* [Qemu-devel] [PATCH 1/1] raw-posix.c - use max transfer length / max segment count only for SCSI passthrough
2019-06-30 15:08 [Qemu-devel] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices Maxim Levitsky
@ 2019-06-30 15:08 ` Maxim Levitsky
2019-07-03 14:50 ` Eric Blake
2019-07-02 16:11 ` [Qemu-devel] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices Maxim Levitsky
2019-07-03 9:52 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2 siblings, 1 reply; 7+ messages in thread
From: Maxim Levitsky @ 2019-06-30 15:08 UTC (permalink / raw)
To: qemu-devel
Cc: Fam Zheng, Kevin Wolf, qemu-block, Maxim Levitsky, John Ferlan,
Max Reitz
The interface of regular block devices (/dev/sda*, /dev/nvme*, etc.) is not limited
by the underlying storage limits; rather, the kernel block layer
takes care of splitting requests that are too large or too fragmented.
Doing so allows us to have less overhead in qemu.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
block/file-posix.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/block/file-posix.c b/block/file-posix.c
index ab05b51a66..66dad34f8a 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1038,15 +1038,13 @@ static void raw_reopen_abort(BDRVReopenState *state)
s->reopen_state = NULL;
}
-static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
+static int sg_get_max_transfer_length(BlockDriverState *bs, int fd)
{
#ifdef BLKSECTGET
int max_bytes = 0;
- short max_sectors = 0;
- if (bs->sg && ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
+
+ if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
return max_bytes;
- } else if (!bs->sg && ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
- return max_sectors << BDRV_SECTOR_BITS;
} else {
return -errno;
}
@@ -1055,7 +1053,7 @@ static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
#endif
}
-static int hdev_get_max_segments(const struct stat *st)
+static int sg_get_max_segments(const struct stat *st)
{
#ifdef CONFIG_LINUX
char buf[32];
@@ -1106,12 +1104,12 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
struct stat st;
if (!fstat(s->fd, &st)) {
- if (S_ISBLK(st.st_mode) || S_ISCHR(st.st_mode)) {
- int ret = hdev_get_max_transfer_length(bs, s->fd);
+ if (bs->sg) {
+ int ret = sg_get_max_transfer_length(bs, s->fd);
if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
bs->bl.max_transfer = pow2floor(ret);
}
- ret = hdev_get_max_segments(&st);
+ ret = sg_get_max_segments(&st);
if (ret > 0) {
bs->bl.max_transfer = MIN(bs->bl.max_transfer,
ret * getpagesize());
--
2.17.2
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [Qemu-devel] [PATCH 1/1] raw-posix.c - use max transfer length / max segment count only for SCSI passthrough
2019-06-30 15:08 ` [Qemu-devel] [PATCH 1/1] raw-posix.c - use max transfer length / max segment count only for SCSI passthrough Maxim Levitsky
@ 2019-07-03 14:50 ` Eric Blake
2019-07-03 15:28 ` Maxim Levitsky
0 siblings, 1 reply; 7+ messages in thread
From: Eric Blake @ 2019-07-03 14:50 UTC (permalink / raw)
To: Maxim Levitsky, qemu-devel
Cc: Fam Zheng, Kevin Wolf, Max Reitz, qemu-block, John Ferlan
On 6/30/19 10:08 AM, Maxim Levitsky wrote:
> Regular block devices (/dev/sda*, /dev/nvme*, etc) interface is not limited
The regular block device interface is
or
Regular block devices interfaces are
> by the underlying storage limits, but rather the kernel block layer
> takes care to split the requests that are too large/fragmented.
>
> Doing so allows us to have less overhead in qemu.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
> block/file-posix.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index ab05b51a66..66dad34f8a 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1038,15 +1038,13 @@ static void raw_reopen_abort(BDRVReopenState *state)
> s->reopen_state = NULL;
> }
>
> -static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
> +static int sg_get_max_transfer_length(BlockDriverState *bs, int fd)
> {
> #ifdef BLKSECTGET
> int max_bytes = 0;
> - short max_sectors = 0;
> - if (bs->sg && ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
> +
> + if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
> return max_bytes;
> - } else if (!bs->sg && ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
> - return max_sectors << BDRV_SECTOR_BITS;
> } else {
> return -errno;
> }
> @@ -1055,7 +1053,7 @@ static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
> #endif
> }
>
> -static int hdev_get_max_segments(const struct stat *st)
> +static int sg_get_max_segments(const struct stat *st)
> {
> #ifdef CONFIG_LINUX
> char buf[32];
> @@ -1106,12 +1104,12 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> struct stat st;
>
> if (!fstat(s->fd, &st)) {
> - if (S_ISBLK(st.st_mode) || S_ISCHR(st.st_mode)) {
> - int ret = hdev_get_max_transfer_length(bs, s->fd);
Is it worth delaying the fstat()...
> + if (bs->sg) {
> + int ret = sg_get_max_transfer_length(bs, s->fd);
> if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> bs->bl.max_transfer = pow2floor(ret);
> }
> - ret = hdev_get_max_segments(&st);
> + ret = sg_get_max_segments(&st);
...until inside the if (bs->sg) condition, to avoid wasted work for
other scenarios?
> if (ret > 0) {
> bs->bl.max_transfer = MIN(bs->bl.max_transfer,
> ret * getpagesize());
>
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org
* Re: [Qemu-devel] [PATCH 1/1] raw-posix.c - use max transfer length / max segment count only for SCSI passthrough
2019-07-03 14:50 ` Eric Blake
@ 2019-07-03 15:28 ` Maxim Levitsky
0 siblings, 0 replies; 7+ messages in thread
From: Maxim Levitsky @ 2019-07-03 15:28 UTC (permalink / raw)
To: Eric Blake, qemu-devel
Cc: Fam Zheng, Kevin Wolf, Max Reitz, qemu-block, John Ferlan
On Wed, 2019-07-03 at 09:50 -0500, Eric Blake wrote:
> On 6/30/19 10:08 AM, Maxim Levitsky wrote:
> > Regular block devices (/dev/sda*, /dev/nvme*, etc) interface is not limited
>
> The regular block device interface is
>
> or
>
> Regular block devices interfaces are
>
> > by the underlying storage limits, but rather the kernel block layer
> > takes care to split the requests that are too large/fragmented.
> >
> > Doing so allows us to have less overhead in qemu.
> >
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> > block/file-posix.c | 16 +++++++---------
> > 1 file changed, 7 insertions(+), 9 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index ab05b51a66..66dad34f8a 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1038,15 +1038,13 @@ static void raw_reopen_abort(BDRVReopenState *state)
> > s->reopen_state = NULL;
> > }
> >
> > -static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
> > +static int sg_get_max_transfer_length(BlockDriverState *bs, int fd)
> > {
> > #ifdef BLKSECTGET
> > int max_bytes = 0;
> > - short max_sectors = 0;
> > - if (bs->sg && ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
> > +
> > + if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
> > return max_bytes;
> > - } else if (!bs->sg && ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
> > - return max_sectors << BDRV_SECTOR_BITS;
> > } else {
> > return -errno;
> > }
> > @@ -1055,7 +1053,7 @@ static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
> > #endif
> > }
> >
> > -static int hdev_get_max_segments(const struct stat *st)
> > +static int sg_get_max_segments(const struct stat *st)
> > {
> > #ifdef CONFIG_LINUX
> > char buf[32];
> > @@ -1106,12 +1104,12 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
> > struct stat st;
> >
> > if (!fstat(s->fd, &st)) {
> > - if (S_ISBLK(st.st_mode) || S_ISCHR(st.st_mode)) {
> > - int ret = hdev_get_max_transfer_length(bs, s->fd);
>
> Is it worth delaying the fstat()...
>
> > + if (bs->sg) {
> > + int ret = sg_get_max_transfer_length(bs, s->fd);
> > if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> > bs->bl.max_transfer = pow2floor(ret);
> > }
> > - ret = hdev_get_max_segments(&st);
> > + ret = sg_get_max_segments(&st);
>
> ...until inside the if (bs->sg) condition, to avoid wasted work for
> other scenarios?
>
> > if (ret > 0) {
> > bs->bl.max_transfer = MIN(bs->bl.max_transfer,
> > ret * getpagesize());
> >
>
> Reviewed-by: Eric Blake <eblake@redhat.com>
>
Thank you very much for the review. I'll send a V2 soon.
Best regards,
Maxim Levitsky
* Re: [Qemu-devel] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices
2019-06-30 15:08 [Qemu-devel] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices Maxim Levitsky
2019-06-30 15:08 ` [Qemu-devel] [PATCH 1/1] raw-posix.c - use max transfer length / max segment count only for SCSI passthrough Maxim Levitsky
@ 2019-07-02 16:11 ` Maxim Levitsky
2019-07-03 9:52 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2 siblings, 0 replies; 7+ messages in thread
From: Maxim Levitsky @ 2019-07-02 16:11 UTC (permalink / raw)
To: qemu-devel; +Cc: Fam Zheng, Kevin Wolf, Max Reitz, qemu-block, John Ferlan
On Sun, 2019-06-30 at 18:08 +0300, Maxim Levitsky wrote:
> It looks like Linux block devices, even in O_DIRECT mode, don't have any user-visible
> limit on transfer size / number of segments that the underlying block device can have.
> The block layer takes care of enforcing these limits by splitting the bios.
>
> By limiting the transfer sizes, we force qemu to do the splitting itself which
> introduces various overheads.
> It is especially visible in nbd server, where the low max transfer size of the
> underlying device forces us to advertise this over NBD, thus increasing the traffic overhead in case of
> image conversion which benefits from large blocks.
>
> More information can be found here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1647104
>
> Tested this with qemu-img convert over nbd and natively and to my surprise, even native IO performance improved a bit.
> (The device on which it was tested is Intel Optane DC P4800X, which has 128k max transfer size)
>
> The benchmark:
>
> Images were created using:
>
> Sparse image: qemu-img create -f qcow2 /dev/nvme0n1p3 1G / 10G / 100G
> Allocated image: qemu-img create -f qcow2 /dev/nvme0n1p3 -o preallocation=metadata 1G / 10G / 100G
>
> The test was:
>
> echo "convert native:"
> rm -rf /dev/shm/disk.img
> time qemu-img convert -p -f qcow2 -O raw -T none $FILE /dev/shm/disk.img > /dev/zero
>
> echo "convert via nbd:"
> qemu-nbd -k /tmp/nbd.sock -v -f qcow2 $FILE -x export --cache=none --aio=native --fork
> rm -rf /dev/shm/disk.img
> time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img > /dev/zero
>
> The results:
>
> =========================================
> 1G sparse image:
> native:
> before: 0.027s
> after: 0.027s
> nbd:
> before: 0.287s
> after: 0.035s
>
> =========================================
> 100G sparse image:
> native:
> before: 0.028s
> after: 0.028s
> nbd:
> before: 23.796s
> after: 0.109s
>
> =========================================
> 1G preallocated image:
> native:
> before: 0.454s
> after: 0.427s
> nbd:
> before: 0.649s
> after: 0.546s
>
> The block limits of max transfer size/max segment size are retained
> for the SCSI passthrough because in this case the kernel passes the userspace request
> directly to the kernel scsi driver, bypassing the block layer, and thus there is no code to split
> such requests.
>
> What do you think?
>
> Fam, since you were the original author of the code that added
> these limits, could you share your opinion on that?
> What was the reason besides SCSI passthrough?
>
> Best regards,
> Maxim Levitsky
>
> Maxim Levitsky (1):
> raw-posix.c - use max transfer length / max segment count only for
> SCSI passthrough
>
> block/file-posix.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
Ping
Best regards,
Maxim Levitsky
* Re: [Qemu-devel] [Qemu-block] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices
2019-06-30 15:08 [Qemu-devel] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices Maxim Levitsky
2019-06-30 15:08 ` [Qemu-devel] [PATCH 1/1] raw-posix.c - use max transfer length / max segment count only for SCSI passthrough Maxim Levitsky
2019-07-02 16:11 ` [Qemu-devel] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices Maxim Levitsky
@ 2019-07-03 9:52 ` Stefan Hajnoczi
2019-07-03 14:46 ` Eric Blake
2 siblings, 1 reply; 7+ messages in thread
From: Stefan Hajnoczi @ 2019-07-03 9:52 UTC (permalink / raw)
To: Maxim Levitsky
Cc: Fam Zheng, Kevin Wolf, qemu-block, qemu-devel, Max Reitz,
John Ferlan
On Sun, Jun 30, 2019 at 06:08:54PM +0300, Maxim Levitsky wrote:
> It looks like Linux block devices, even in O_DIRECT mode, don't have any user-visible
> limit on transfer size / number of segments that the underlying block device can have.
> The block layer takes care of enforcing these limits by splitting the bios.
>
> By limiting the transfer sizes, we force qemu to do the splitting itself which
> introduces various overheads.
> It is especially visible in nbd server, where the low max transfer size of the
> underlying device forces us to advertise this over NBD, thus increasing the traffic overhead in case of
> image conversion which benefits from large blocks.
>
> More information can be found here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1647104
>
> Tested this with qemu-img convert over nbd and natively and to my surprise, even native IO performance improved a bit.
> (The device on which it was tested is Intel Optane DC P4800X, which has 128k max transfer size)
>
> The benchmark:
>
> Images were created using:
>
> Sparse image: qemu-img create -f qcow2 /dev/nvme0n1p3 1G / 10G / 100G
> Allocated image: qemu-img create -f qcow2 /dev/nvme0n1p3 -o preallocation=metadata 1G / 10G / 100G
>
> The test was:
>
> echo "convert native:"
> rm -rf /dev/shm/disk.img
> time qemu-img convert -p -f qcow2 -O raw -T none $FILE /dev/shm/disk.img > /dev/zero
>
> echo "convert via nbd:"
> qemu-nbd -k /tmp/nbd.sock -v -f qcow2 $FILE -x export --cache=none --aio=native --fork
> rm -rf /dev/shm/disk.img
> time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img > /dev/zero
>
> The results:
>
> =========================================
> 1G sparse image:
> native:
> before: 0.027s
> after: 0.027s
> nbd:
> before: 0.287s
> after: 0.035s
>
> =========================================
> 100G sparse image:
> native:
> before: 0.028s
> after: 0.028s
> nbd:
> before: 23.796s
> after: 0.109s
>
> =========================================
> 1G preallocated image:
> native:
> before: 0.454s
> after: 0.427s
> nbd:
> before: 0.649s
> after: 0.546s
>
> The block limits of max transfer size/max segment size are retained
> for the SCSI passthrough because in this case the kernel passes the userspace request
> directly to the kernel scsi driver, bypassing the block layer, and thus there is no code to split
> such requests.
>
> What do you think?
>
> Fam, since you were the original author of the code that added
> these limits, could you share your opinion on that?
> What was the reason besides SCSI passthrough?
>
> Best regards,
> Maxim Levitsky
>
> Maxim Levitsky (1):
> raw-posix.c - use max transfer length / max segment count only for
> SCSI passthrough
>
> block/file-posix.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
Adding Eric Blake, who implemented the generic request splitting in the
block layer and may know if there were any other reasons aside from SCSI
passthrough why file-posix.c enforces the host block device's maximum
transfer size.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [Qemu-devel] [Qemu-block] [PATCH 0/1] RFC: don't obey the block device max transfer len / max segments for block devices
2019-07-03 9:52 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2019-07-03 14:46 ` Eric Blake
0 siblings, 0 replies; 7+ messages in thread
From: Eric Blake @ 2019-07-03 14:46 UTC (permalink / raw)
To: Stefan Hajnoczi, Maxim Levitsky
Cc: Fam Zheng, Kevin Wolf, qemu-block, qemu-devel, Max Reitz,
John Ferlan
On 7/3/19 4:52 AM, Stefan Hajnoczi wrote:
> On Sun, Jun 30, 2019 at 06:08:54PM +0300, Maxim Levitsky wrote:
>> It looks like Linux block devices, even in O_DIRECT mode don't have any user visible
>> limit on transfer size / number of segments, which underlying block device can have.
>> The block layer takes care of enforcing these limits by splitting the bios.
s/The block layer/The kernel block layer/
>>
>> By limiting the transfer sizes, we force qemu to do the splitting itself which
double space
>> introduces various overheads.
>> It is especially visible in nbd server, where the low max transfer size of the
>> underlying device forces us to advertise this over NBD, thus increasing the traffic overhead in case of
Long line for a commit message.
>> image conversion which benefits from large blocks.
>>
>> More information can be found here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1647104
>>
>> Tested this with qemu-img convert over nbd and natively and to my surprise, even native IO performance improved a bit.
>> (The device on which it was tested is Intel Optane DC P4800X, which has 128k max transfer size)
>>
>> The benchmark:
>>
I'm sorry I didn't see this before softfreeze, but as a performance
improvement, I think it still classes as a bug fix and is safe for
inclusion in rc0.
>> The block limits of max transfer size/max segment size are retained
>> for the SCSI passthrough because in this case the kernel passes the userspace request
>> directly to the kernel scsi driver, bypassing the block layer, and thus there is no code to split
>> such requests.
>>
>> What do you think?
Seems like a reasonable explanation.
>>
>> Fam, since you were the original author of the code that added
>> these limits, could you share your opinion on that?
>> What was the reason besides SCSI passthrough?
>>
>> Best regards,
>> Maxim Levitsky
>>
>> Maxim Levitsky (1):
>> raw-posix.c - use max transfer length / max segment count only for
>> SCSI passthrough
>>
>> block/file-posix.c | 16 +++++++---------
>> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> Adding Eric Blake, who implemented the generic request splitting in the
> block layer and may know if there were any other reasons aside from SCSI
> passthrough why file-posix.c enforces the host block device's maximum
> transfer size.
No, I don't have any strong reasons for why file I/O must be capped to a
specific limit other than size_t (since the kernel does just fine at
splitting things up).
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org