* [Qemu-devel] [PATCH for-2.9 v3] file-posix: Consider max_segments for BlockLimits.max_transfer
From: Fam Zheng @ 2017-03-08 12:08 UTC
To: qemu-devel; +Cc: qemu-block, Kevin Wolf, Max Reitz, pbonzini, eblake
Without this fix, BlockLimits.max_transfer can be too high, and the guest
will encounter an I/O error or even get paused with werror=stop or
rerror=stop. The cause is explained below.
Linux has a separate limit, /sys/block/.../queue/max_segments, which in
the worst case can be more restrictive than the BLKSECTGET limit we
already consider (note that they are two different things). So the
failure scenario before this patch is (a sketch of the arithmetic follows
the list):
1) the host device has max_sectors_kb = 4096 and max_segments = 64;
2) the guest learns the max_sectors_kb limit from QEMU, but doesn't know
   max_segments;
3) the guest issues e.g. a 512KB request thinking it's okay, but actually
   it's not, because it is passed through to the host device as an SG_IO
   request with niov > 64;
4) the host kernel rejects the segmenting of the request and returns
   -EINVAL.
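A minimal sketch of the arithmetic in step 3, assuming a 4096-byte host
page size; the values mirror the hypothetical scenario above and the
program is illustration only, not part of the patch:

/* Illustration of why a 512KB request passes the sector limit but can
 * exceed the segment limit. Assumes 4096-byte pages (worst case: each
 * iovec element maps to its own segment). */
#include <stdio.h>

int main(void)
{
    long max_sectors_kb = 4096;   /* advertised to the guest via QEMU */
    long max_segments   = 64;     /* invisible to the guest */
    long page_size      = 4096;   /* assumed host page size */

    long sector_limit  = max_sectors_kb * 1024;     /* 4194304 bytes */
    long segment_limit = max_segments * page_size;  /* 262144 bytes  */
    long request       = 512 * 1024;                /* guest request */

    /* 512KB <= 4MB, so the guest believes the request is fine... */
    printf("ok by sector limit:  %d\n", request <= sector_limit);
    /* ...but a scattered 512KB buffer may need up to 128 segments,
     * exceeding 64, so the host kernel returns -EINVAL. */
    printf("ok by segment limit: %d\n", request <= segment_limit);
    return 0;
}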
This patch checks the max_segments sysfs entry for the host device and
calculates a conservative byte limit using the page size, which is then
merged into the existing max_transfer limit. The guest will discover this
limit through the usual virtual block device interfaces. (In the case of
scsi-generic, it is done in the INQUIRY reply interception in the device
model.)
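For readers who want to poke at this outside QEMU, a rough standalone
sketch of the same approach follows, with no QEMU dependencies; the
helper name, the error handling, and the /dev/sda default are assumptions
of the example, not code from the patch (the real implementation is in
the diff below):

/* Standalone sketch: read the queue's max_segments via the device's
 * major:minor numbers and derive a conservative byte limit from the
 * page size, as the patch does inside QEMU. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

static long max_segments_limit(const char *dev)
{
    struct stat st;
    char path[128];
    FILE *f;
    long max_segments = -1;

    if (stat(dev, &st) < 0 || !S_ISBLK(st.st_mode)) {
        return -1;
    }
    snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/queue/max_segments",
             major(st.st_rdev), minor(st.st_rdev));
    f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    if (fscanf(f, "%ld", &max_segments) != 1) {
        max_segments = -1;
    }
    fclose(f);
    /* Conservative: assume each segment covers only one page. */
    return max_segments > 0 ? max_segments * getpagesize() : -1;
}

int main(int argc, char **argv)
{
    long limit = max_segments_limit(argc > 1 ? argv[1] : "/dev/sda");
    printf("conservative max transfer: %ld bytes\n", limit);
    return 0;
}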
The other possibility is to actually propagate it as a separate limit,
but that is no better. On the one hand, there is a big complication: the
limit is per-LUN from QEMU's point of view (because we can attach LUNs
from different host HBAs to the same virtio-scsi bus), but the channel to
communicate it in a per-LUN manner is missing down the stack; on the
other hand, two limits versus one doesn't change much about the valid
size of I/O (because the guest has no control over host segmenting).
The idea of falling back to bounce buffering in QEMU upon -EINVAL was
also explored. Unfortunately, there is no neat way to ensure the bounce
buffer is less segmented (in terms of DMA addresses) than the guest
buffer.
In practice, this bug is not very common. It has only been reported on an
Emulex HBA (lpfc), so it is okay to fix it in the easier way.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
v3: Clearer commit message. [Kevin]
v2: Use /sys/dev/block/MAJOR:MINOR/queue/max_segments. [Paolo]
---
block/file-posix.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/block/file-posix.c b/block/file-posix.c
index 4de1abd..c4c0663 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -668,6 +668,52 @@ static int hdev_get_max_transfer_length(BlockDriverState *bs, int fd)
 #endif
 }
 
+static int hdev_get_max_segments(const struct stat *st)
+{
+#ifdef CONFIG_LINUX
+    char buf[32];
+    const char *end;
+    char *sysfspath;
+    int ret;
+    int fd = -1;
+    long max_segments;
+
+    sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
+                                major(st->st_rdev), minor(st->st_rdev));
+    fd = open(sysfspath, O_RDONLY);
+    if (fd == -1) {
+        ret = -errno;
+        goto out;
+    }
+    do {
+        /* Leave room for the NUL terminator appended below. */
+        ret = read(fd, buf, sizeof(buf) - 1);
+    } while (ret == -1 && errno == EINTR);
+    if (ret < 0) {
+        ret = -errno;
+        goto out;
+    } else if (ret == 0) {
+        ret = -EIO;
+        goto out;
+    }
+    buf[ret] = 0;
+    /* The file ends with '\n'; pass 'end' to accept that. */
+    ret = qemu_strtol(buf, &end, 10, &max_segments);
+    if (ret == 0 && end && *end == '\n') {
+        ret = max_segments;
+    }
+
+out:
+    if (fd != -1) {
+        close(fd);
+    }
+    g_free(sysfspath);
+    return ret;
+#else
+    return -ENOTSUP;
+#endif
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     BDRVRawState *s = bs->opaque;
@@ -679,6 +725,11 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
             if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
                 bs->bl.max_transfer = pow2floor(ret);
             }
+            ret = hdev_get_max_segments(&st);
+            if (ret > 0) {
+                bs->bl.max_transfer = MIN(bs->bl.max_transfer,
+                                          ret * getpagesize());
+            }
         }
     }
--
2.9.3
* Re: [Qemu-devel] [PATCH for-2.9 v3] file-posix: Consider max_segments for BlockLimits.max_transfer
From: Kevin Wolf @ 2017-03-08 12:34 UTC
To: Fam Zheng; +Cc: qemu-devel, qemu-block, Max Reitz, pbonzini, eblake
On 08.03.2017 13:08, Fam Zheng wrote:
> Without this fix, BlockLimits.max_transfer can be too high, and the guest
> will encounter an I/O error or even get paused with werror=stop or
> rerror=stop. The cause is explained below.
> [...]
Thanks, applied to the block branch.
Kevin