* [Qemu-devel] Guest latency issues due to bdrv_check_byte_request
@ 2010-04-15 18:15 Jan Kiszka
2010-04-17 19:05 ` Stefan Hajnoczi
0 siblings, 1 reply; 6+ messages in thread
From: Jan Kiszka @ 2010-04-15 18:15 UTC (permalink / raw)
To: qemu-devel
Hi,
we are facing sporadic latency issue in our guests due to the
synchronous nature of bdrv_check_byte_request on raw images:
#0 0x00007ff8962e2070 in lseek64 () from /lib64/libpthread.so.0
#1 0x00000000004d7f97 in raw_getlength (bs=0xcab010) at block/raw-posix.c:704
#2 0x00000000004b9d22 in bdrv_getlength (bs=0xcab010) at block.c:816
#3 0x00000000004b95f0 in bdrv_check_byte_request (bs=0xcab010, offset=233472, size=4096) at block.c:618
#4 0x00000000004b9664 in bdrv_check_request (bs=0xcab010, sector_num=456, nb_sectors=8) at block.c:632
#5 0x00000000004bb56a in bdrv_aio_readv (bs=0xcab010, sector_num=456, qiov=0xcf5d48, nb_sectors=8, cb=0x4eaddc <scsi_read_complete>, opaque=0xcf5cc0) at block.c:1554
The host tends to hang in lseek on some kernel mutex. Instead of
addressing this with the big hammer (more preemptible kernel), I
wondered why we need that many raw_getlength requests with all their
file open/ioctl/close/whatever syscalls - and that in the asynchronous
I/O path.
Looking at bdrv_check_byte_request, I find
if (bs->growable)
return 0;
i.e. out-of-bound requests on growable devices are always handled by the
respective IO handler. Just fixed-size devices (like the classic raw
disk image file...) go through
len = bdrv_getlength(bs);
if (offset < 0)
return -EIO;
if ((offset > len) || (len - offset < size))
return -EIO;
for _each_ request!? Why? Looks like there is room for improvement, not
only /wrt latency, isn't it?
Jan
PS: This should also explain at least some of the latencies I once measured
with prio-boosted VCPU threads on PREEMPT-RT kernels.
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Guest latency issues due to bdrv_check_byte_request
2010-04-15 18:15 [Qemu-devel] Guest latency issues due to bdrv_check_byte_request Jan Kiszka
@ 2010-04-17 19:05 ` Stefan Hajnoczi
2010-04-17 19:40 ` Christoph Hellwig
0 siblings, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2010-04-17 19:05 UTC (permalink / raw)
To: Jan Kiszka; +Cc: qemu-devel
$ strace -cf x86_64-softmmu/qemu-system-x86_64 test.raw
Uncached getlength:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
96.40 1.944174 13136 148 4 futex
1.65 0.033259 1 56418 2507 select
0.39 0.007817 0 81118 5561 read
0.33 0.006556 0 78787 timer_gettime
0.31 0.006223 0 56412 timer_settime
0.26 0.005191 0 47723 lseek
0.24 0.004924 0 51896 write
0.24 0.004800 0 51844 2917 rt_sigreturn
0.17 0.003333 833 4 shmdt
0.01 0.000175 0 790 poll
Cached getlength:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
97.25 2.266124 14715 154 4 futex
1.03 0.023984 0 57749 3200 select
0.37 0.008644 0 79926 timer_gettime
0.29 0.006761 0 82390 6601 read
0.27 0.006398 0 57900 timer_settime
0.26 0.006038 0 52503 write
0.26 0.005985 0 52450 3671 rt_sigreturn
0.15 0.003418 1139 3 shmdt
0.10 0.002398 0 23846 lseek
0.01 0.000216 3 81 4 open
I think there are still a lot of lseeks left because
raw-posix.c:raw_pread_aligned() is implemented using lseek+read
instead of pread. Does anyone know the reasoning there or could
pread() be used?
Here is the cached getlength hack (I'm not confident that this patch
is correct in all cases, just a quick experiment):
diff --git a/block.c b/block.c
index 0f6be17..447327f 100644
--- a/block.c
+++ b/block.c
@@ -957,13 +957,25 @@ int bdrv_pwrite(BlockDriverState *bs, int64_t offset,
int bdrv_truncate(BlockDriverState *bs, int64_t offset)
{
BlockDriver *drv = bs->drv;
+ int ret;
if (!drv)
return -ENOMEDIUM;
if (!drv->bdrv_truncate)
return -ENOTSUP;
if (bs->read_only)
return -EACCES;
- return drv->bdrv_truncate(bs, offset);
+ ret = drv->bdrv_truncate(bs, offset);
+ if (ret < 0) {
+ return ret;
+ }
+
+ /* refresh total sectors */
+ if (drv->bdrv_getlength) {
+ bs->total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
+ } else {
+ bs->total_sectors = offset >> BDRV_SECTOR_BITS;
+ }
+ return ret;
}
/**
@@ -974,8 +986,12 @@ int64_t bdrv_getlength(BlockDriverState *bs)
BlockDriver *drv = bs->drv;
if (!drv)
return -ENOMEDIUM;
- if (!drv->bdrv_getlength) {
- /* legacy mode */
+
+ /* Fixed size devices use the total_sectors value for speed instead of
+ issuing a length query (like lseek) on each call. Also, legacy block
+ drivers don't provide a bdrv_getlength function and must use
+ total_sectors. */
+ if ((bs->total_sectors && !bs->growable) || !drv->bdrv_getlength) {
return bs->total_sectors * BDRV_SECTOR_SIZE;
}
return drv->bdrv_getlength(bs);
Stefan
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Guest latency issues due to bdrv_check_byte_request
2010-04-17 19:05 ` Stefan Hajnoczi
@ 2010-04-17 19:40 ` Christoph Hellwig
2010-04-17 21:32 ` Stefan Hajnoczi
0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2010-04-17 19:40 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Jan Kiszka, qemu-devel
On Sat, Apr 17, 2010 at 08:05:45PM +0100, Stefan Hajnoczi wrote:
> I think there are still a lot of lseeks left because
> raw-posix.c:raw_pread_aligned() is implemented using lseek+read
> instead of pread. Does anyone know the reasoning there or could
> pread() be used?
There's no good reason for it except maybe compatiblity to really
olh hosts that do not have pread/pwrite. But given that AIO support
is now mandator and the AIO code uses pread/pwritev exclusively we
would have noticed if that's the case by now.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Guest latency issues due to bdrv_check_byte_request
2010-04-17 19:40 ` Christoph Hellwig
@ 2010-04-17 21:32 ` Stefan Hajnoczi
2010-04-18 17:37 ` Christoph Hellwig
2010-04-18 18:05 ` [Qemu-devel] " Jan Kiszka
0 siblings, 2 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2010-04-17 21:32 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jan Kiszka, qemu-devel
Thanks Christoph.
Cached getlength with pread/pwrite:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
96.97 1.760111 11893 148 4 futex
1.61 0.029209 1 46891 2217 select
0.28 0.005047 0 64609 timer_gettime
0.22 0.004059 0 42745 2578 rt_sigreturn
0.22 0.003911 0 46261 timer_settime
0.18 0.003280 1093 3 shmdt
0.17 0.003095 0 23859 pread <---
0.17 0.003061 0 42800 write
0.16 0.002916 0 47759 5151 read
0.02 0.000285 0 645 writev
[...]
0.00 0.000000 0 13 lseek
Note that this is a Tiny Core Linux boot from disk and shutdown; not
very I/O intensive since it only loads a kernel and ~10 MB initramfs
without touching the disk much after kernel load.
diff --git a/block.c b/block.c
index 0f6be17..5c1652c 100644
--- a/block.c
+++ b/block.c
@@ -363,6 +363,7 @@ static int bdrv_open_common(BlockDriverState *bs,
const char *filename,
assert(drv != NULL);
bs->file = NULL;
+ bs->total_sectors = 0;
bs->is_temporary = 0;
bs->encrypted = 0;
bs->valid_key = 0;
@@ -416,9 +417,7 @@ static int bdrv_open_common(BlockDriverState *bs,
const char *filename,
}
bs->keep_read_only = bs->read_only = !(open_flags & BDRV_O_RDWR);
- if (drv->bdrv_getlength) {
- bs->total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
- }
+ bs->total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
#ifndef _WIN32
if (bs->is_temporary) {
unlink(filename);
@@ -957,13 +956,26 @@ int bdrv_pwrite(BlockDriverState *bs, int64_t offset,
int bdrv_truncate(BlockDriverState *bs, int64_t offset)
{
BlockDriver *drv = bs->drv;
+ int ret;
if (!drv)
return -ENOMEDIUM;
if (!drv->bdrv_truncate)
return -ENOTSUP;
if (bs->read_only)
return -EACCES;
- return drv->bdrv_truncate(bs, offset);
+ ret = drv->bdrv_truncate(bs, offset);
+ if (ret < 0) {
+ return ret;
+ }
+
+ /* refresh total sectors */
+ if (drv->bdrv_getlength) {
+ bs->total_sectors = 0; /* discard cached value */
+ bs->total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
+ } else {
+ bs->total_sectors = offset >> BDRV_SECTOR_BITS;
+ }
+ return ret;
}
/**
@@ -974,8 +986,12 @@ int64_t bdrv_getlength(BlockDriverState *bs)
BlockDriver *drv = bs->drv;
if (!drv)
return -ENOMEDIUM;
- if (!drv->bdrv_getlength) {
- /* legacy mode */
+
+ /* Fixed size devices use the total_sectors value for speed instead of
+ issuing a length query (like lseek) on each call. Also, legacy block
+ drivers don't provide a bdrv_getlength function and must use
+ total_sectors. */
+ if ((bs->total_sectors && !bs->growable) || !drv->bdrv_getlength) {
return bs->total_sectors * BDRV_SECTOR_SIZE;
}
return drv->bdrv_getlength(bs);
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 598ea19..7541ed2 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -105,7 +105,6 @@
typedef struct BDRVRawState {
int fd;
int type;
- unsigned int lseek_err_cnt;
int open_flags;
#if defined(__linux__)
/* linux floppy specific */
@@ -134,8 +133,6 @@ static int raw_open_common(BlockDriverState *bs,
const char *filename,
BDRVRawState *s = bs->opaque;
int fd, ret;
- s->lseek_err_cnt = 0;
-
s->open_flags = open_flags | O_BINARY;
s->open_flags &= ~O_ACCMODE;
if (bdrv_flags & BDRV_O_RDWR) {
@@ -243,19 +240,7 @@ static int raw_pread_aligned(BlockDriverState
*bs, int64_t offset,
if (ret < 0)
return ret;
- if (offset >= 0 && lseek(s->fd, offset, SEEK_SET) == (off_t)-1) {
- ++(s->lseek_err_cnt);
- if(s->lseek_err_cnt <= 10) {
- DEBUG_BLOCK_PRINT("raw_pread(%d:%s, %" PRId64 ", %p, %d) [%" PRId64
- "] lseek failed : %d = %s\n",
- s->fd, bs->filename, offset, buf, count,
- bs->total_sectors, errno, strerror(errno));
- }
- return -1;
- }
- s->lseek_err_cnt=0;
-
- ret = read(s->fd, buf, count);
+ ret = pread(s->fd, buf, count, offset);
if (ret == count)
goto label__raw_read__success;
@@ -276,12 +261,10 @@ static int raw_pread_aligned(BlockDriverState
*bs, int64_t offset,
/* Try harder for CDrom. */
if (bs->type == BDRV_TYPE_CDROM) {
- lseek(s->fd, offset, SEEK_SET);
- ret = read(s->fd, buf, count);
+ ret = pread(s->fd, buf, count, offset);
if (ret == count)
goto label__raw_read__success;
- lseek(s->fd, offset, SEEK_SET);
- ret = read(s->fd, buf, count);
+ ret = pread(s->fd, buf, count, offset);
if (ret == count)
goto label__raw_read__success;
@@ -313,19 +296,7 @@ static int raw_pwrite_aligned(BlockDriverState
*bs, int64_t offset,
if (ret < 0)
return -errno;
- if (offset >= 0 && lseek(s->fd, offset, SEEK_SET) == (off_t)-1) {
- ++(s->lseek_err_cnt);
- if(s->lseek_err_cnt) {
- DEBUG_BLOCK_PRINT("raw_pwrite(%d:%s, %" PRId64 ", %p, %d) [%"
- PRId64 "] lseek failed : %d = %s\n",
- s->fd, bs->filename, offset, buf, count,
- bs->total_sectors, errno, strerror(errno));
- }
- return -EIO;
- }
- s->lseek_err_cnt = 0;
-
- ret = write(s->fd, buf, count);
+ ret = pwrite(s->fd, buf, count, offset);
if (ret == count)
goto label__raw_write__success;
Stefan
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Guest latency issues due to bdrv_check_byte_request
2010-04-17 21:32 ` Stefan Hajnoczi
@ 2010-04-18 17:37 ` Christoph Hellwig
2010-04-18 18:05 ` [Qemu-devel] " Jan Kiszka
1 sibling, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2010-04-18 17:37 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Jan Kiszka, Christoph Hellwig, qemu-devel
You should split this up into two patches - one for the the compat AIO
implementation and one for the getlength caching.
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Qemu-devel] Re: Guest latency issues due to bdrv_check_byte_request
2010-04-17 21:32 ` Stefan Hajnoczi
2010-04-18 17:37 ` Christoph Hellwig
@ 2010-04-18 18:05 ` Jan Kiszka
1 sibling, 0 replies; 6+ messages in thread
From: Jan Kiszka @ 2010-04-18 18:05 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Christoph Hellwig, qemu-devel
[-- Attachment #1: Type: text/plain, Size: 1352 bytes --]
Stefan Hajnoczi wrote:
> Thanks Christoph.
>
> Cached getlength with pread/pwrite:
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 96.97 1.760111 11893 148 4 futex
> 1.61 0.029209 1 46891 2217 select
> 0.28 0.005047 0 64609 timer_gettime
> 0.22 0.004059 0 42745 2578 rt_sigreturn
> 0.22 0.003911 0 46261 timer_settime
> 0.18 0.003280 1093 3 shmdt
> 0.17 0.003095 0 23859 pread <---
> 0.17 0.003061 0 42800 write
> 0.16 0.002916 0 47759 5151 read
> 0.02 0.000285 0 645 writev
> [...]
> 0.00 0.000000 0 13 lseek
>
> Note that this is a Tiny Core Linux boot from disk and shutdown; not
> very I/O intensive since it only loads a kernel and ~10 MB initramfs
> without touching the disk much after kernel load.
Nice. Will give this a try tomorrow with "a bit" more load.
We already played with a hack to completely remove the checks from AIO
requests, thus avoiding lseek this way - effect as desired, but fragile
of course.
Jan
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2010-04-18 18:05 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-15 18:15 [Qemu-devel] Guest latency issues due to bdrv_check_byte_request Jan Kiszka
2010-04-17 19:05 ` Stefan Hajnoczi
2010-04-17 19:40 ` Christoph Hellwig
2010-04-17 21:32 ` Stefan Hajnoczi
2010-04-18 17:37 ` Christoph Hellwig
2010-04-18 18:05 ` [Qemu-devel] " Jan Kiszka
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).