From: Christian Borntraeger <borntraeger@de.ibm.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
Dominik Dingel <dingel@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] block device fd consuption
Date: Fri, 11 Jul 2014 12:04:32 +0200
Message-ID: <53BFB6B0.8030603@de.ibm.com>
In-Reply-To: <20140711092334.GA25216@stefanha-thinkpad.redhat.com>
On 11/07/14 11:23, Stefan Hajnoczi wrote:
> On Fri, Jul 11, 2014 at 10:56:12AM +0200, Christian Borntraeger wrote:
>> Stefan,
>>
>> I traced the creation of eventfds with gdb in the case of virtio-blk.
>
> Great, thanks for posting this!
>
> Most of these eventfds are "justified". They are actively used and are
> not leaked. Avoiding them might be possible with some work but is
> likely to make the code messier or notification more expensive (e.g. we
> have to scan more request structs to check for completion).
>
> But see the thread pool case below where I think we can eliminate the
> eventfd.
>
>> With the following setup
>> qemu-system-s390x -enable-kvm -m 1000 -nographic -kernel /boot/vmlinux-3.15.0+ -initrd ramdisk -smp 2 -append "root=/dev/ram0" -M s390-ccw -drive file=/dev/sdc,if=none,id=d0,format=raw,serial=d0,cache=none,aio=native -device virtio-blk-ccw,drive=d0,x-data-plane=on,config-wce=off,scsi=off
>>
>> In addition to the file descriptor for the device itself I have the following eventfd:
>>
>>
>> Breakpoint 1, event_notifier_init (e=e@entry=0x807e8f24, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #0 event_notifier_init (e=e@entry=0x807e8f24, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #1 0x000000008019e766 in aio_context_new () at /home/cborntra/REPOS/qemu/async.c:274
>> #2 0x00000000801ae628 in qemu_init_main_loop () at /home/cborntra/REPOS/qemu/main-loop.c:142
>> #3 0x000000008001598c in main (argc=<optimized out>, argv=0x3fffffff2c8, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:3972
>> --> main loop: this is ok and not related to virtio-blk.
>
> Yes.
>
>> Breakpoint 1, event_notifier_init (e=e@entry=0x807fed48, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #0 event_notifier_init (e=e@entry=0x807fed48, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #1 0x00000000801ebb58 in laio_init () at /home/cborntra/REPOS/qemu/block/linux-aio.c:289
>> #2 0x00000000801ea17a in raw_set_aio (aio_ctx=0x807fa0a8, use_aio=0x807fa0a0, bdrv_flags=<optimized out>) at /home/cborntra/REPOS/qemu/block/raw-posix.c:351
>> #3 0x00000000801ea2a2 in raw_open_common (bs=bs@entry=0x807fd1a0, options=options@entry=0x807fdce0, bdrv_flags=bdrv_flags@entry=24802, open_flags=open_flags@entry=0, errp=errp@entry=0x3ffffffd698)
>> at /home/cborntra/REPOS/qemu/block/raw-posix.c:433
>> #4 0x00000000801ea6b4 in hdev_open (bs=0x807fd1a0, options=0x807fdce0, flags=<optimized out>, errp=0x3ffffffe830) at /home/cborntra/REPOS/qemu/block/raw-posix.c:1760
>> #5 0x00000000801aba9e in bdrv_open_common (errp=0x3ffffffe818, drv=0x80316088 <bdrv_host_device>, flags=57570, options=0x807fdce0, file=0x0, bs=0x807fd1a0) at /home/cborntra/REPOS/qemu/block.c:967
>> #6 bdrv_open (pbs=pbs@entry=0x3ffffffe9e0, filename=<optimized out>, filename@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' <repeats 12 times>, "2580", reference=reference@entry=0x0,
>> options=0x807fdce0, flags=57570, drv=0x80316088 <bdrv_host_device>, errp=0x3ffffffe9e8) at /home/cborntra/REPOS/qemu/block.c:1472
>> #7 0x00000000801ac460 in bdrv_open_image (pbs=pbs@entry=0x3ffffffe9e0, filename=filename@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' <repeats 12 times>, "2580",
>> options=options@entry=0x807fb160, bdref_key=bdref_key@entry=0x8027a526 "file", flags=flags@entry=57570, allow_none=true, errp=0x3ffffffe9e8) at /home/cborntra/REPOS/qemu/block.c:1274
>> #8 0x00000000801ab74a in bdrv_open (pbs=pbs@entry=0x807fa5b0, filename=filename@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' <repeats 12 times>, "2580", reference=reference@entry=
>> 0x0, options=0x807fb160, options@entry=0x807f8ce0, flags=8418, flags@entry=226, drv=0x80312908 <bdrv_raw>, errp=0x3ffffffead8) at /home/cborntra/REPOS/qemu/block.c:1451
>> #9 0x00000000800ba11e in blockdev_init (file=file@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' <repeats 12 times>, "2580", bs_opts=bs_opts@entry=0x807f8ce0, errp=errp@entry=
>> 0x3ffffffec58) at /home/cborntra/REPOS/qemu/blockdev.c:523
>> #10 0x00000000800bb530 in drive_new (all_opts=0x807e7cf0, block_default_type=<optimized out>) at /home/cborntra/REPOS/qemu/blockdev.c:930
>> #11 0x00000000800d11d4 in drive_init_func (opts=<optimized out>, opaque=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:1144
>> #12 0x00000000802110b0 in qemu_opts_foreach (list=<optimized out>, func=func@entry=0x800d11a8 <drive_init_func>, opaque=opaque@entry=0x807de6c0, abort_on_failure=abort_on_failure@entry=1)
>> at /home/cborntra/REPOS/qemu/util/qemu-option.c:1072
>> #13 0x0000000080016438 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:4352
>> --> No idea
>
> Ah, I forgot about this one. This is the Linux AIO completion eventfd.
>
> It gets signalled when a Linux AIO request completes and we need to call
> io_getevents(2).
>
> You can avoid it by using aio=threads instead of aio=native. But then
> you cannot use Linux AIO. I am not aware of a good way around using
> this fd.
>
>> Breakpoint 1, event_notifier_init (e=e@entry=0x807f4eb0, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #0 event_notifier_init (e=e@entry=0x807f4eb0, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #1 0x000000008019f04c in thread_pool_init_one (ctx=0x807e8e00, pool=0x807f4eb0) at /home/cborntra/REPOS/qemu/thread-pool.c:296
>> #2 thread_pool_new (ctx=<optimized out>) at /home/cborntra/REPOS/qemu/thread-pool.c:314
>> #3 0x000000008019e590 in aio_get_thread_pool (ctx=0x807e8e00) at /home/cborntra/REPOS/qemu/async.c:245
>> #4 0x00000000801e8f6c in paio_submit (bs=bs@entry=0x807fd1a0, fd=<optimized out>, sector_num=sector_num@entry=0, qiov=qiov@entry=0x3ffffffda98, nb_sectors=nb_sectors@entry=1, cb=
>> 0x801a1a4c <bdrv_co_io_em_complete>, opaque=0x3ffba3fe8d0, type=4097) at /home/cborntra/REPOS/qemu/block/raw-posix.c:1027
>> #5 0x00000000801e9b5c in raw_aio_submit (bs=0x807fd1a0, sector_num=0, qiov=0x3ffffffda98, nb_sectors=<optimized out>, cb=cb@entry=0x801a1a4c <bdrv_co_io_em_complete>, opaque=0x3ffba3fe8d0, type=4097)
>> at /home/cborntra/REPOS/qemu/block/raw-posix.c:1056
>> #6 0x00000000801e9c84 in raw_aio_readv (bs=<optimized out>, sector_num=<optimized out>, qiov=<optimized out>, nb_sectors=<optimized out>, cb=0x801a1a4c <bdrv_co_io_em_complete>, opaque=0x3ffba3fe8d0)
>> at /home/cborntra/REPOS/qemu/block/raw-posix.c:1094
>> #7 0x00000000801a2594 in bdrv_co_io_em (is_write=false, iov=0x3ffffffda98, nb_sectors=<optimized out>, sector_num=0, bs=0x807fd1a0) at /home/cborntra/REPOS/qemu/block.c:4835
>> #8 bdrv_co_readv_em (bs=bs@entry=0x807fd1a0, sector_num=sector_num@entry=0, nb_sectors=<optimized out>, iov=iov@entry=0x3ffffffda98) at /home/cborntra/REPOS/qemu/block.c:4852
>> #9 0x00000000801a7d7c in bdrv_aligned_preadv (bs=bs@entry=0x807fd1a0, req=req@entry=0x3ffba3feac8, offset=<optimized out>, bytes=bytes@entry=512, align=<optimized out>, qiov=0x3ffffffda98, flags=0)
>> at /home/cborntra/REPOS/qemu/block.c:3057
>> #10 0x00000000801a82da in bdrv_co_do_preadv (bs=0x807fd1a0, offset=<optimized out>, bytes=512, qiov=0x3ffffffda98, flags=<optimized out>, flags@entry=(unknown: 0))
>> at /home/cborntra/REPOS/qemu/block.c:3136
>> #11 0x00000000801a83e4 in bdrv_co_do_readv (flags=(unknown: 0), qiov=<optimized out>, nb_sectors=<optimized out>, sector_num=<optimized out>, bs=<optimized out>)
>> at /home/cborntra/REPOS/qemu/block.c:3158
>> #12 bdrv_co_readv (bs=<optimized out>, sector_num=<optimized out>, nb_sectors=<optimized out>, qiov=<optimized out>) at /home/cborntra/REPOS/qemu/block.c:3167
>> #13 0x00000000801a7ce2 in bdrv_aligned_preadv (bs=bs@entry=0x807fa620, req=req@entry=0x3ffba3fedb0, offset=<optimized out>, bytes=bytes@entry=512, align=512, qiov=0x3ffffffda98, flags=0)
>> at /home/cborntra/REPOS/qemu/block.c:3042
>> #14 0x00000000801a82da in bdrv_co_do_preadv (bs=0x807fa620, offset=<optimized out>, bytes=512, qiov=0x3ffffffda98, flags=<optimized out>) at /home/cborntra/REPOS/qemu/block.c:3136
>> #15 0x00000000801a94d8 in bdrv_rw_co_entry (opaque=0x3ffffffd9b8) at /home/cborntra/REPOS/qemu/block.c:2693
>> #16 bdrv_rw_co_entry (opaque=0x3ffffffd9b8) at /home/cborntra/REPOS/qemu/block.c:2688
>> #17 0x00000000801ba140 in coroutine_trampoline (i0=<optimized out>, i1=<error reading variable: value has been optimized out>) at /home/cborntra/REPOS/qemu/coroutine-ucontext.c:118
>> #18 0x000003fffc935892 in __makecontext_ret () from /lib64/libc.so.6
>> --> No idea
>
> Similar deal to the Linux AIO event notifier. It's the fd used to
> signal thread pool work item completion. The threadpool is
> per-AioContext so the fd overhead is per-iothread.
>
> However, we can use a BH instead since the API has now been made
> thread-safe. Previously we used EventNotifier because
> qemu_bh_schedule() was not thread-safe.
>
> I will send a patch but I'm not sure it's critical enough for QEMU 2.1.
> Do you have a bug report or justification for pushing this into QEMU
> 2.1?
Well, 2.1-rc plus all the patches floating around (most of them in Kevin's tree) is now pretty stable regarding dataplane, so maybe it's better to defer this to 2.2. I don't know. The iothread thing will be interesting to test once there is libvirt support.
On s390 I expect people to hit the 1024 fd limit, but that is a one-line change in libvirt's qemu.conf.
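
To see how close a given guest is to that limit, something like the following could be used. It is only a minimal sketch, assuming a Linux host with procfs mounted. The function names classify_fds() and soft_nofile_limit() are made up for illustration and are not part of QEMU or libvirt. It groups the entries of /proc/<pid>/fd by kind (eventfds show up as "anon_inode:[eventfd]") and reads the soft "Max open files" value from /proc/<pid>/limits.

```python
import os
import sys
from collections import Counter


def classify_fds(pid="self"):
    """Count the open fds of a process by kind, based on /proc/<pid>/fd."""
    kinds = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if target.startswith("anon_inode:"):
            # e.g. "anon_inode:[eventfd]" is counted as "[eventfd]"
            kinds[target[len("anon_inode:"):]] += 1
        elif target.startswith(("pipe:", "socket:")):
            kinds[target.split(":", 1)[0]] += 1
        else:
            kinds["file"] += 1  # regular files, block devices, ttys, ...
    return kinds


def soft_nofile_limit(pid="self"):
    """Return the soft 'Max open files' limit, or None if unlimited/unknown."""
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                soft = line.split()[3]  # fields: Max open files <soft> <hard> files
                return None if soft == "unlimited" else int(soft)
    return None


if __name__ == "__main__":
    # Optional argument: pid of the process to inspect (default: this script).
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    kinds = classify_fds(pid)
    print(f"open fds: {sum(kinds.values())}, soft limit: {soft_nofile_limit(pid)}")
    for kind, count in kinds.most_common():
        print(f"{count:6d}  {kind}")
```

Run with the pid of the qemu process as the only argument (reading another user's /proc/<pid>/fd needs matching privileges). Comparing the "[eventfd]" count before and after adding drives should show the per-device overhead discussed above.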