From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <53BFB6B0.8030603@de.ibm.com>
Date: Fri, 11 Jul 2014 12:04:32 +0200
From: Christian Borntraeger
MIME-Version: 1.0
References: <53BFA6AC.1090204@de.ibm.com> <20140711092334.GA25216@stefanha-thinkpad.redhat.com>
In-Reply-To: <20140711092334.GA25216@stefanha-thinkpad.redhat.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] block device fd consuption
To: Stefan Hajnoczi
Cc: Cornelia Huck, "qemu-devel@nongnu.org", Dominik Dingel

On 11/07/14 11:23, Stefan Hajnoczi wrote:
> On Fri, Jul 11, 2014 at 10:56:12AM +0200, Christian Borntraeger wrote:
>> Stefan,
>>
>> I traced the creation of eventfds with gdb in the case of virtio-blk.
>
> Great, thanks for posting this!
>
> Most of these eventfds are "justified". They are actively used and are
> not leaked. Avoiding them might be possible with some work but is
> likely to make the code messier or notification more expensive (e.g. we
> have to scan more request structs to check for completion).
>
> But see the thread pool case below where I think we can eliminate the
> eventfd.
>
>> With the following setup
>> qemu-system-s390x -enable-kvm -m 1000 -nographic -kernel /boot/vmlinux-3.15.0+ -initrd ramdisk -smp 2 -append "root=/dev/ram0" -M s390-ccw -drive file=/dev/sdc,if=none,id=d0,format=raw,serial=d0,cache=none,aio=native -device virtio-blk-ccw,drive=d0,x-data-plane=on,config-wce=off,scsi=off
>>
>> In addition to the file descriptor for the device itself I have the following eventfd:
>>
>> Breakpoint 1, event_notifier_init (e=e@entry=0x807e8f24, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #0 event_notifier_init (e=e@entry=0x807e8f24, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #1 0x000000008019e766 in aio_context_new () at /home/cborntra/REPOS/qemu/async.c:274
>> #2 0x00000000801ae628 in qemu_init_main_loop () at /home/cborntra/REPOS/qemu/main-loop.c:142
>> #3 0x000000008001598c in main (argc=, argv=0x3fffffff2c8, envp=) at /home/cborntra/REPOS/qemu/vl.c:3972
>> --> main loop: this is ok and not related to virtio-blk.
>
> Yes.
>
>> Breakpoint 1, event_notifier_init (e=e@entry=0x807fed48, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #0 event_notifier_init (e=e@entry=0x807fed48, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #1 0x00000000801ebb58 in laio_init () at /home/cborntra/REPOS/qemu/block/linux-aio.c:289
>> #2 0x00000000801ea17a in raw_set_aio (aio_ctx=0x807fa0a8, use_aio=0x807fa0a0, bdrv_flags=) at /home/cborntra/REPOS/qemu/block/raw-posix.c:351
>> #3 0x00000000801ea2a2 in raw_open_common (bs=bs@entry=0x807fd1a0, options=options@entry=0x807fdce0, bdrv_flags=bdrv_flags@entry=24802, open_flags=open_flags@entry=0, errp=errp@entry=0x3ffffffd698) at /home/cborntra/REPOS/qemu/block/raw-posix.c:433
>> #4 0x00000000801ea6b4 in hdev_open (bs=0x807fd1a0, options=0x807fdce0, flags=, errp=0x3ffffffe830) at /home/cborntra/REPOS/qemu/block/raw-posix.c:1760
>> #5 0x00000000801aba9e in bdrv_open_common (errp=0x3ffffffe818, drv=0x80316088 , flags=57570, options=0x807fdce0, file=0x0, bs=0x807fd1a0) at /home/cborntra/REPOS/qemu/block.c:967
>> #6 bdrv_open (pbs=pbs@entry=0x3ffffffe9e0, filename=, filename@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' , "2580", reference=reference@entry=0x0, options=0x807fdce0, flags=57570, drv=0x80316088 , errp=0x3ffffffe9e8) at /home/cborntra/REPOS/qemu/block.c:1472
>> #7 0x00000000801ac460 in bdrv_open_image (pbs=pbs@entry=0x3ffffffe9e0, filename=filename@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' , "2580", options=options@entry=0x807fb160, bdref_key=bdref_key@entry=0x8027a526 "file", flags=flags@entry=57570, allow_none=true, errp=0x3ffffffe9e8) at /home/cborntra/REPOS/qemu/block.c:1274
>> #8 0x00000000801ab74a in bdrv_open (pbs=pbs@entry=0x807fa5b0, filename=filename@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' , "2580", reference=reference@entry=0x0, options=0x807fb160, options@entry=0x807f8ce0, flags=8418, flags@entry=226, drv=0x80312908 , errp=0x3ffffffead8) at /home/cborntra/REPOS/qemu/block.c:1451
>> #9 0x00000000800ba11e in blockdev_init (file=file@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' , "2580", bs_opts=bs_opts@entry=0x807f8ce0, errp=errp@entry=0x3ffffffec58) at /home/cborntra/REPOS/qemu/blockdev.c:523
>> #10 0x00000000800bb530 in drive_new (all_opts=0x807e7cf0, block_default_type=) at /home/cborntra/REPOS/qemu/blockdev.c:930
>> #11 0x00000000800d11d4 in drive_init_func (opts=, opaque=) at /home/cborntra/REPOS/qemu/vl.c:1144
>> #12 0x00000000802110b0 in qemu_opts_foreach (list=, func=func@entry=0x800d11a8 , opaque=opaque@entry=0x807de6c0, abort_on_failure=abort_on_failure@entry=1) at /home/cborntra/REPOS/qemu/util/qemu-option.c:1072
>> #13 0x0000000080016438 in main (argc=, argv=, envp=) at /home/cborntra/REPOS/qemu/vl.c:4352
>> --> No idea
>
> Ah, I forgot about this one. This is the Linux AIO completion eventfd.
>
> It gets signalled when a Linux AIO request completes and we need to call
> io_getevents(2).
>
> You can avoid it by using aio=threads instead of aio=native. But then
> you cannot use Linux AIO. I am not aware of a good way around using
> this fd.
>
>> Breakpoint 1, event_notifier_init (e=e@entry=0x807f4eb0, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #0 event_notifier_init (e=e@entry=0x807f4eb0, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #1 0x000000008019f04c in thread_pool_init_one (ctx=0x807e8e00, pool=0x807f4eb0) at /home/cborntra/REPOS/qemu/thread-pool.c:296
>> #2 thread_pool_new (ctx=) at /home/cborntra/REPOS/qemu/thread-pool.c:314
>> #3 0x000000008019e590 in aio_get_thread_pool (ctx=0x807e8e00) at /home/cborntra/REPOS/qemu/async.c:245
>> #4 0x00000000801e8f6c in paio_submit (bs=bs@entry=0x807fd1a0, fd=, sector_num=sector_num@entry=0, qiov=qiov@entry=0x3ffffffda98, nb_sectors=nb_sectors@entry=1, cb=0x801a1a4c , opaque=0x3ffba3fe8d0, type=4097) at /home/cborntra/REPOS/qemu/block/raw-posix.c:1027
>> #5 0x00000000801e9b5c in raw_aio_submit (bs=0x807fd1a0, sector_num=0, qiov=0x3ffffffda98, nb_sectors=, cb=cb@entry=0x801a1a4c , opaque=0x3ffba3fe8d0, type=4097) at /home/cborntra/REPOS/qemu/block/raw-posix.c:1056
>> #6 0x00000000801e9c84 in raw_aio_readv (bs=, sector_num=, qiov=, nb_sectors=, cb=0x801a1a4c , opaque=0x3ffba3fe8d0) at /home/cborntra/REPOS/qemu/block/raw-posix.c:1094
>> #7 0x00000000801a2594 in bdrv_co_io_em (is_write=false, iov=0x3ffffffda98, nb_sectors=, sector_num=0, bs=0x807fd1a0) at /home/cborntra/REPOS/qemu/block.c:4835
>> #8 bdrv_co_readv_em (bs=bs@entry=0x807fd1a0, sector_num=sector_num@entry=0, nb_sectors=, iov=iov@entry=0x3ffffffda98) at /home/cborntra/REPOS/qemu/block.c:4852
>> #9 0x00000000801a7d7c in bdrv_aligned_preadv (bs=bs@entry=0x807fd1a0, req=req@entry=0x3ffba3feac8, offset=, bytes=bytes@entry=512, align=, qiov=0x3ffffffda98, flags=0) at /home/cborntra/REPOS/qemu/block.c:3057
>> #10 0x00000000801a82da in bdrv_co_do_preadv (bs=0x807fd1a0, offset=, bytes=512, qiov=0x3ffffffda98, flags=, flags@entry=(unknown: 0)) at /home/cborntra/REPOS/qemu/block.c:3136
>> #11 0x00000000801a83e4 in bdrv_co_do_readv (flags=(unknown: 0), qiov=, nb_sectors=, sector_num=, bs=) at /home/cborntra/REPOS/qemu/block.c:3158
>> #12 bdrv_co_readv (bs=, sector_num=, nb_sectors=, qiov=) at /home/cborntra/REPOS/qemu/block.c:3167
>> #13 0x00000000801a7ce2 in bdrv_aligned_preadv (bs=bs@entry=0x807fa620, req=req@entry=0x3ffba3fedb0, offset=, bytes=bytes@entry=512, align=512, qiov=0x3ffffffda98, flags=0) at /home/cborntra/REPOS/qemu/block.c:3042
>> #14 0x00000000801a82da in bdrv_co_do_preadv (bs=0x807fa620, offset=, bytes=512, qiov=0x3ffffffda98, flags=) at /home/cborntra/REPOS/qemu/block.c:3136
>> #15 0x00000000801a94d8 in bdrv_rw_co_entry (opaque=0x3ffffffd9b8) at /home/cborntra/REPOS/qemu/block.c:2693
>> #16 bdrv_rw_co_entry (opaque=0x3ffffffd9b8) at /home/cborntra/REPOS/qemu/block.c:2688
>> #17 0x00000000801ba140 in coroutine_trampoline (i0=, i1=) at /home/cborntra/REPOS/qemu/coroutine-ucontext.c:118
>> #18 0x000003fffc935892 in __makecontext_ret () from /lib64/libc.so.6
>> --> No idea
>
> Similar deal to the Linux AIO event notifier. It's the fd used to
> signal thread pool work item completion. The thread pool is
> per-AioContext so the fd overhead is per-iothread.
>
> However, we can use a BH instead since the API has now been made
> thread-safe. Previously we used EventNotifier because
> qemu_bh_schedule() was not thread-safe.
>
> I will send a patch but I'm not sure it's critical enough for QEMU 2.1.
> Do you have a bug report or justification for pushing this into QEMU
> 2.1?

Well, 2.1-rc + all patches floating around (most of them in Kevin's tree) is now pretty stable regarding dataplane, so maybe it's better to defer this to 2.2 - I don't know. The iothread thing will be interesting to test when there is libvirt support. On s390 I expect people to hit the 1024 fd limit, but this is a one-line change in libvirt's qemu.conf.
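[Editor's note: the one-line qemu.conf change mentioned above would look roughly like the following. max_files is libvirt's setting for the open-file limit of QEMU processes it spawns; 32768 is only an example value, not a recommendation from the thread.]

```
# /etc/libvirt/qemu.conf
# Raise RLIMIT_NOFILE for libvirt-managed QEMU processes
# (the inherited default is often 1024).
max_files = 32768
```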