* [Qemu-devel] Guest latency issues due to bdrv_check_byte_request @ 2010-04-15 18:15 Jan Kiszka 2010-04-17 19:05 ` Stefan Hajnoczi 0 siblings, 1 reply; 6+ messages in thread From: Jan Kiszka @ 2010-04-15 18:15 UTC (permalink / raw) To: qemu-devel Hi, we are facing sporadic latency issue in our guests due to the synchronous nature of bdrv_check_byte_request on raw images: #0 0x00007ff8962e2070 in lseek64 () from /lib64/libpthread.so.0 #1 0x00000000004d7f97 in raw_getlength (bs=0xcab010) at block/raw-posix.c:704 #2 0x00000000004b9d22 in bdrv_getlength (bs=0xcab010) at block.c:816 #3 0x00000000004b95f0 in bdrv_check_byte_request (bs=0xcab010, offset=233472, size=4096) at block.c:618 #4 0x00000000004b9664 in bdrv_check_request (bs=0xcab010, sector_num=456, nb_sectors=8) at block.c:632 #5 0x00000000004bb56a in bdrv_aio_readv (bs=0xcab010, sector_num=456, qiov=0xcf5d48, nb_sectors=8, cb=0x4eaddc <scsi_read_complete>, opaque=0xcf5cc0) at block.c:1554 The host tends to hang in lseek on some kernel mutex. Instead of addressing this with the big hammer (more preemptible kernel), I wondered why we need that many raw_getlength requests with all their file open/ioctl/close/whatever syscalls - and that in the asynchronous I/O path. Looking at bdrv_check_byte_request, I find if (bs->growable) return 0; i.e. out-of-bound requests on growable devices are always handled by the respective IO handler. Just fixed-size devices (like the classic raw disk image file...) go through len = bdrv_getlength(bs); if (offset < 0) return -EIO; if ((offset > len) || (len - offset < size)) return -EIO; for _each_ request!? Why? Looks like there is room for improvement, not only /wrt latency, isn't it? Jan PS: This should also explain at least some of the latencies I once measured with prio-boosted VCPU threads on PREEMPT-RT kernels. -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Guest latency issues due to bdrv_check_byte_request 2010-04-15 18:15 [Qemu-devel] Guest latency issues due to bdrv_check_byte_request Jan Kiszka @ 2010-04-17 19:05 ` Stefan Hajnoczi 2010-04-17 19:40 ` Christoph Hellwig 0 siblings, 1 reply; 6+ messages in thread From: Stefan Hajnoczi @ 2010-04-17 19:05 UTC (permalink / raw) To: Jan Kiszka; +Cc: qemu-devel $ strace -cf x86_64-softmmu/qemu-system-x86_64 test.raw Uncached getlength: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 96.40 1.944174 13136 148 4 futex 1.65 0.033259 1 56418 2507 select 0.39 0.007817 0 81118 5561 read 0.33 0.006556 0 78787 timer_gettime 0.31 0.006223 0 56412 timer_settime 0.26 0.005191 0 47723 lseek 0.24 0.004924 0 51896 write 0.24 0.004800 0 51844 2917 rt_sigreturn 0.17 0.003333 833 4 shmdt 0.01 0.000175 0 790 poll Cached getlength: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 97.25 2.266124 14715 154 4 futex 1.03 0.023984 0 57749 3200 select 0.37 0.008644 0 79926 timer_gettime 0.29 0.006761 0 82390 6601 read 0.27 0.006398 0 57900 timer_settime 0.26 0.006038 0 52503 write 0.26 0.005985 0 52450 3671 rt_sigreturn 0.15 0.003418 1139 3 shmdt 0.10 0.002398 0 23846 lseek 0.01 0.000216 3 81 4 open I think there are still a lot of lseeks left because raw-posix.c:raw_pread_aligned() is implemented using lseek+read instead of pread. Does anyone know the reasoning there or could pread() be used? Here is the cached getlength hack (I'm not confident that this patch is correct in all cases, just a quick experiment): diff --git a/block.c b/block.c index 0f6be17..447327f 100644 --- a/block.c +++ b/block.c @@ -957,13 +957,25 @@ int bdrv_pwrite(BlockDriverState *bs, int64_t offset, int bdrv_truncate(BlockDriverState *bs, int64_t offset) { BlockDriver *drv = bs->drv; + int ret; if (!drv) return -ENOMEDIUM; if (!drv->bdrv_truncate) return -ENOTSUP; if (bs->read_only) return -EACCES; - return drv->bdrv_truncate(bs, offset); + ret = drv->bdrv_truncate(bs, offset); + if (ret < 0) { + return ret; + } + + /* refresh total sectors */ + if (drv->bdrv_getlength) { + bs->total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS; + } else { + bs->total_sectors = offset >> BDRV_SECTOR_BITS; + } + return ret; } /** @@ -974,8 +986,12 @@ int64_t bdrv_getlength(BlockDriverState *bs) BlockDriver *drv = bs->drv; if (!drv) return -ENOMEDIUM; - if (!drv->bdrv_getlength) { - /* legacy mode */ + + /* Fixed size devices use the total_sectors value for speed instead of + issuing a length query (like lseek) on each call. Also, legacy block + drivers don't provide a bdrv_getlength function and must use + total_sectors. */ + if ((bs->total_sectors && !bs->growable) || !drv->bdrv_getlength) { return bs->total_sectors * BDRV_SECTOR_SIZE; } return drv->bdrv_getlength(bs); Stefan ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Guest latency issues due to bdrv_check_byte_request 2010-04-17 19:05 ` Stefan Hajnoczi @ 2010-04-17 19:40 ` Christoph Hellwig 2010-04-17 21:32 ` Stefan Hajnoczi 0 siblings, 1 reply; 6+ messages in thread From: Christoph Hellwig @ 2010-04-17 19:40 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Jan Kiszka, qemu-devel On Sat, Apr 17, 2010 at 08:05:45PM +0100, Stefan Hajnoczi wrote: > I think there are still a lot of lseeks left because > raw-posix.c:raw_pread_aligned() is implemented using lseek+read > instead of pread. Does anyone know the reasoning there or could > pread() be used? There's no good reason for it except maybe compatiblity to really olh hosts that do not have pread/pwrite. But given that AIO support is now mandator and the AIO code uses pread/pwritev exclusively we would have noticed if that's the case by now. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Guest latency issues due to bdrv_check_byte_request 2010-04-17 19:40 ` Christoph Hellwig @ 2010-04-17 21:32 ` Stefan Hajnoczi 2010-04-18 17:37 ` Christoph Hellwig 2010-04-18 18:05 ` [Qemu-devel] " Jan Kiszka 0 siblings, 2 replies; 6+ messages in thread From: Stefan Hajnoczi @ 2010-04-17 21:32 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Jan Kiszka, qemu-devel Thanks Christoph. Cached getlength with pread/pwrite: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 96.97 1.760111 11893 148 4 futex 1.61 0.029209 1 46891 2217 select 0.28 0.005047 0 64609 timer_gettime 0.22 0.004059 0 42745 2578 rt_sigreturn 0.22 0.003911 0 46261 timer_settime 0.18 0.003280 1093 3 shmdt 0.17 0.003095 0 23859 pread <--- 0.17 0.003061 0 42800 write 0.16 0.002916 0 47759 5151 read 0.02 0.000285 0 645 writev [...] 0.00 0.000000 0 13 lseek Note that this is a Tiny Core Linux boot from disk and shutdown; not very I/O intensive since it only loads a kernel and ~10 MB initramfs without touching the disk much after kernel load. diff --git a/block.c b/block.c index 0f6be17..5c1652c 100644 --- a/block.c +++ b/block.c @@ -363,6 +363,7 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename, assert(drv != NULL); bs->file = NULL; + bs->total_sectors = 0; bs->is_temporary = 0; bs->encrypted = 0; bs->valid_key = 0; @@ -416,9 +417,7 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename, } bs->keep_read_only = bs->read_only = !(open_flags & BDRV_O_RDWR); - if (drv->bdrv_getlength) { - bs->total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS; - } + bs->total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS; #ifndef _WIN32 if (bs->is_temporary) { unlink(filename); @@ -957,13 +956,26 @@ int bdrv_pwrite(BlockDriverState *bs, int64_t offset, int bdrv_truncate(BlockDriverState *bs, int64_t offset) { BlockDriver *drv = bs->drv; + int ret; if (!drv) return -ENOMEDIUM; if (!drv->bdrv_truncate) return -ENOTSUP; if (bs->read_only) return -EACCES; - return drv->bdrv_truncate(bs, offset); + ret = drv->bdrv_truncate(bs, offset); + if (ret < 0) { + return ret; + } + + /* refresh total sectors */ + if (drv->bdrv_getlength) { + bs->total_sectors = 0; /* discard cached value */ + bs->total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS; + } else { + bs->total_sectors = offset >> BDRV_SECTOR_BITS; + } + return ret; } /** @@ -974,8 +986,12 @@ int64_t bdrv_getlength(BlockDriverState *bs) BlockDriver *drv = bs->drv; if (!drv) return -ENOMEDIUM; - if (!drv->bdrv_getlength) { - /* legacy mode */ + + /* Fixed size devices use the total_sectors value for speed instead of + issuing a length query (like lseek) on each call. Also, legacy block + drivers don't provide a bdrv_getlength function and must use + total_sectors. */ + if ((bs->total_sectors && !bs->growable) || !drv->bdrv_getlength) { return bs->total_sectors * BDRV_SECTOR_SIZE; } return drv->bdrv_getlength(bs); diff --git a/block/raw-posix.c b/block/raw-posix.c index 598ea19..7541ed2 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -105,7 +105,6 @@ typedef struct BDRVRawState { int fd; int type; - unsigned int lseek_err_cnt; int open_flags; #if defined(__linux__) /* linux floppy specific */ @@ -134,8 +133,6 @@ static int raw_open_common(BlockDriverState *bs, const char *filename, BDRVRawState *s = bs->opaque; int fd, ret; - s->lseek_err_cnt = 0; - s->open_flags = open_flags | O_BINARY; s->open_flags &= ~O_ACCMODE; if (bdrv_flags & BDRV_O_RDWR) { @@ -243,19 +240,7 @@ static int raw_pread_aligned(BlockDriverState *bs, int64_t offset, if (ret < 0) return ret; - if (offset >= 0 && lseek(s->fd, offset, SEEK_SET) == (off_t)-1) { - ++(s->lseek_err_cnt); - if(s->lseek_err_cnt <= 10) { - DEBUG_BLOCK_PRINT("raw_pread(%d:%s, %" PRId64 ", %p, %d) [%" PRId64 - "] lseek failed : %d = %s\n", - s->fd, bs->filename, offset, buf, count, - bs->total_sectors, errno, strerror(errno)); - } - return -1; - } - s->lseek_err_cnt=0; - - ret = read(s->fd, buf, count); + ret = pread(s->fd, buf, count, offset); if (ret == count) goto label__raw_read__success; @@ -276,12 +261,10 @@ static int raw_pread_aligned(BlockDriverState *bs, int64_t offset, /* Try harder for CDrom. */ if (bs->type == BDRV_TYPE_CDROM) { - lseek(s->fd, offset, SEEK_SET); - ret = read(s->fd, buf, count); + ret = pread(s->fd, buf, count, offset); if (ret == count) goto label__raw_read__success; - lseek(s->fd, offset, SEEK_SET); - ret = read(s->fd, buf, count); + ret = pread(s->fd, buf, count, offset); if (ret == count) goto label__raw_read__success; @@ -313,19 +296,7 @@ static int raw_pwrite_aligned(BlockDriverState *bs, int64_t offset, if (ret < 0) return -errno; - if (offset >= 0 && lseek(s->fd, offset, SEEK_SET) == (off_t)-1) { - ++(s->lseek_err_cnt); - if(s->lseek_err_cnt) { - DEBUG_BLOCK_PRINT("raw_pwrite(%d:%s, %" PRId64 ", %p, %d) [%" - PRId64 "] lseek failed : %d = %s\n", - s->fd, bs->filename, offset, buf, count, - bs->total_sectors, errno, strerror(errno)); - } - return -EIO; - } - s->lseek_err_cnt = 0; - - ret = write(s->fd, buf, count); + ret = pwrite(s->fd, buf, count, offset); if (ret == count) goto label__raw_write__success; Stefan ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Guest latency issues due to bdrv_check_byte_request 2010-04-17 21:32 ` Stefan Hajnoczi @ 2010-04-18 17:37 ` Christoph Hellwig 2010-04-18 18:05 ` [Qemu-devel] " Jan Kiszka 1 sibling, 0 replies; 6+ messages in thread From: Christoph Hellwig @ 2010-04-18 17:37 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Jan Kiszka, Christoph Hellwig, qemu-devel You should split this up into two patches - one for the the compat AIO implementation and one for the getlength caching. ^ permalink raw reply [flat|nested] 6+ messages in thread
* [Qemu-devel] Re: Guest latency issues due to bdrv_check_byte_request 2010-04-17 21:32 ` Stefan Hajnoczi 2010-04-18 17:37 ` Christoph Hellwig @ 2010-04-18 18:05 ` Jan Kiszka 1 sibling, 0 replies; 6+ messages in thread From: Jan Kiszka @ 2010-04-18 18:05 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Christoph Hellwig, qemu-devel [-- Attachment #1: Type: text/plain, Size: 1352 bytes --] Stefan Hajnoczi wrote: > Thanks Christoph. > > Cached getlength with pread/pwrite: > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 96.97 1.760111 11893 148 4 futex > 1.61 0.029209 1 46891 2217 select > 0.28 0.005047 0 64609 timer_gettime > 0.22 0.004059 0 42745 2578 rt_sigreturn > 0.22 0.003911 0 46261 timer_settime > 0.18 0.003280 1093 3 shmdt > 0.17 0.003095 0 23859 pread <--- > 0.17 0.003061 0 42800 write > 0.16 0.002916 0 47759 5151 read > 0.02 0.000285 0 645 writev > [...] > 0.00 0.000000 0 13 lseek > > Note that this is a Tiny Core Linux boot from disk and shutdown; not > very I/O intensive since it only loads a kernel and ~10 MB initramfs > without touching the disk much after kernel load. Nice. Will give this a try tomorrow with "a bit" more load. We already played with a hack to completely remove the checks from AIO requests, thus avoiding lseek this way - effect as desired, but fragile of course. Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2010-04-18 18:05 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-04-15 18:15 [Qemu-devel] Guest latency issues due to bdrv_check_byte_request Jan Kiszka 2010-04-17 19:05 ` Stefan Hajnoczi 2010-04-17 19:40 ` Christoph Hellwig 2010-04-17 21:32 ` Stefan Hajnoczi 2010-04-18 17:37 ` Christoph Hellwig 2010-04-18 18:05 ` [Qemu-devel] " Jan Kiszka
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).