From: Christian Borntraeger <borntraeger@de.ibm.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
Dominik Dingel <dingel@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] block device fd consuption
Date: Fri, 11 Jul 2014 12:04:32 +0200
Message-ID: <53BFB6B0.8030603@de.ibm.com>
In-Reply-To: <20140711092334.GA25216@stefanha-thinkpad.redhat.com>

On 11/07/14 11:23, Stefan Hajnoczi wrote:
> On Fri, Jul 11, 2014 at 10:56:12AM +0200, Christian Borntraeger wrote:
>> Stefan,
>>
>> I traced the creation of eventfds with gdb in the case of virtio-blk.
>
> Great, thanks for posting this!
>
> Most of these eventfds are "justified". They are actively used and are
> not leaked. Avoiding them might be possible with some work but is
> likely to make the code messier or notification more expensive (e.g. we
> have to scan more request structs to check for completion).
>
> But see the thread pool case below where I think we can eliminate the
> eventfd.
>
>> With the following setup
>> qemu-system-s390x -enable-kvm -m 1000 -nographic -kernel /boot/vmlinux-3.15.0+ -initrd ramdisk -smp 2 -append "root=/dev/ram0" -M s390-ccw -drive file=/dev/sdc,if=none,id=d0,format=raw,serial=d0,cache=none,aio=native -device virtio-blk-ccw,drive=d0,x-data-plane=on,config-wce=off,scsi=off
>>
>> In addition to the file descriptor for the device itself I have the following eventfd:
>>
>>
>> Breakpoint 1, event_notifier_init (e=e@entry=0x807e8f24, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #0 event_notifier_init (e=e@entry=0x807e8f24, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #1 0x000000008019e766 in aio_context_new () at /home/cborntra/REPOS/qemu/async.c:274
>> #2 0x00000000801ae628 in qemu_init_main_loop () at /home/cborntra/REPOS/qemu/main-loop.c:142
>> #3 0x000000008001598c in main (argc=<optimized out>, argv=0x3fffffff2c8, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:3972
>> --> main loop: this is ok and not related to virtio-blk.
>
> Yes.
>
>> Breakpoint 1, event_notifier_init (e=e@entry=0x807fed48, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #0 event_notifier_init (e=e@entry=0x807fed48, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #1 0x00000000801ebb58 in laio_init () at /home/cborntra/REPOS/qemu/block/linux-aio.c:289
>> #2 0x00000000801ea17a in raw_set_aio (aio_ctx=0x807fa0a8, use_aio=0x807fa0a0, bdrv_flags=<optimized out>) at /home/cborntra/REPOS/qemu/block/raw-posix.c:351
>> #3 0x00000000801ea2a2 in raw_open_common (bs=bs@entry=0x807fd1a0, options=options@entry=0x807fdce0, bdrv_flags=bdrv_flags@entry=24802, open_flags=open_flags@entry=0, errp=errp@entry=0x3ffffffd698)
>> at /home/cborntra/REPOS/qemu/block/raw-posix.c:433
>> #4 0x00000000801ea6b4 in hdev_open (bs=0x807fd1a0, options=0x807fdce0, flags=<optimized out>, errp=0x3ffffffe830) at /home/cborntra/REPOS/qemu/block/raw-posix.c:1760
>> #5 0x00000000801aba9e in bdrv_open_common (errp=0x3ffffffe818, drv=0x80316088 <bdrv_host_device>, flags=57570, options=0x807fdce0, file=0x0, bs=0x807fd1a0) at /home/cborntra/REPOS/qemu/block.c:967
>> #6 bdrv_open (pbs=pbs@entry=0x3ffffffe9e0, filename=<optimized out>, filename@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' <repeats 12 times>, "2580", reference=reference@entry=0x0,
>> options=0x807fdce0, flags=57570, drv=0x80316088 <bdrv_host_device>, errp=0x3ffffffe9e8) at /home/cborntra/REPOS/qemu/block.c:1472
>> #7 0x00000000801ac460 in bdrv_open_image (pbs=pbs@entry=0x3ffffffe9e0, filename=filename@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' <repeats 12 times>, "2580",
>> options=options@entry=0x807fb160, bdref_key=bdref_key@entry=0x8027a526 "file", flags=flags@entry=57570, allow_none=true, errp=0x3ffffffe9e8) at /home/cborntra/REPOS/qemu/block.c:1274
>> #8 0x00000000801ab74a in bdrv_open (pbs=pbs@entry=0x807fa5b0, filename=filename@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' <repeats 12 times>, "2580", reference=reference@entry=
>> 0x0, options=0x807fb160, options@entry=0x807f8ce0, flags=8418, flags@entry=226, drv=0x80312908 <bdrv_raw>, errp=0x3ffffffead8) at /home/cborntra/REPOS/qemu/block.c:1451
>> #9 0x00000000800ba11e in blockdev_init (file=file@entry=0x807fa300 "/dev/disk/by-id/scsi-36005076305ffc1ae", '0' <repeats 12 times>, "2580", bs_opts=bs_opts@entry=0x807f8ce0, errp=errp@entry=
>> 0x3ffffffec58) at /home/cborntra/REPOS/qemu/blockdev.c:523
>> #10 0x00000000800bb530 in drive_new (all_opts=0x807e7cf0, block_default_type=<optimized out>) at /home/cborntra/REPOS/qemu/blockdev.c:930
>> #11 0x00000000800d11d4 in drive_init_func (opts=<optimized out>, opaque=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:1144
>> #12 0x00000000802110b0 in qemu_opts_foreach (list=<optimized out>, func=func@entry=0x800d11a8 <drive_init_func>, opaque=opaque@entry=0x807de6c0, abort_on_failure=abort_on_failure@entry=1)
>> at /home/cborntra/REPOS/qemu/util/qemu-option.c:1072
>> #13 0x0000000080016438 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:4352
>> --> No idea
>
> Ah, I forgot about this one. This is the Linux AIO completion eventfd.
>
> It gets signalled when a Linux AIO request completes and we need to call
> io_getevents(2).
>
> You can avoid it by using aio=threads instead of aio=native. But then
> you cannot use Linux AIO. I am not aware of a good way around using
> this fd.
>
>> Breakpoint 1, event_notifier_init (e=e@entry=0x807f4eb0, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #0 event_notifier_init (e=e@entry=0x807f4eb0, active=active@entry=0) at /home/cborntra/REPOS/qemu/util/event_notifier-posix.c:29
>> #1 0x000000008019f04c in thread_pool_init_one (ctx=0x807e8e00, pool=0x807f4eb0) at /home/cborntra/REPOS/qemu/thread-pool.c:296
>> #2 thread_pool_new (ctx=<optimized out>) at /home/cborntra/REPOS/qemu/thread-pool.c:314
>> #3 0x000000008019e590 in aio_get_thread_pool (ctx=0x807e8e00) at /home/cborntra/REPOS/qemu/async.c:245
>> #4 0x00000000801e8f6c in paio_submit (bs=bs@entry=0x807fd1a0, fd=<optimized out>, sector_num=sector_num@entry=0, qiov=qiov@entry=0x3ffffffda98, nb_sectors=nb_sectors@entry=1, cb=
>> 0x801a1a4c <bdrv_co_io_em_complete>, opaque=0x3ffba3fe8d0, type=4097) at /home/cborntra/REPOS/qemu/block/raw-posix.c:1027
>> #5 0x00000000801e9b5c in raw_aio_submit (bs=0x807fd1a0, sector_num=0, qiov=0x3ffffffda98, nb_sectors=<optimized out>, cb=cb@entry=0x801a1a4c <bdrv_co_io_em_complete>, opaque=0x3ffba3fe8d0, type=4097)
>> at /home/cborntra/REPOS/qemu/block/raw-posix.c:1056
>> #6 0x00000000801e9c84 in raw_aio_readv (bs=<optimized out>, sector_num=<optimized out>, qiov=<optimized out>, nb_sectors=<optimized out>, cb=0x801a1a4c <bdrv_co_io_em_complete>, opaque=0x3ffba3fe8d0)
>> at /home/cborntra/REPOS/qemu/block/raw-posix.c:1094
>> #7 0x00000000801a2594 in bdrv_co_io_em (is_write=false, iov=0x3ffffffda98, nb_sectors=<optimized out>, sector_num=0, bs=0x807fd1a0) at /home/cborntra/REPOS/qemu/block.c:4835
>> #8 bdrv_co_readv_em (bs=bs@entry=0x807fd1a0, sector_num=sector_num@entry=0, nb_sectors=<optimized out>, iov=iov@entry=0x3ffffffda98) at /home/cborntra/REPOS/qemu/block.c:4852
>> #9 0x00000000801a7d7c in bdrv_aligned_preadv (bs=bs@entry=0x807fd1a0, req=req@entry=0x3ffba3feac8, offset=<optimized out>, bytes=bytes@entry=512, align=<optimized out>, qiov=0x3ffffffda98, flags=0)
>> at /home/cborntra/REPOS/qemu/block.c:3057
>> #10 0x00000000801a82da in bdrv_co_do_preadv (bs=0x807fd1a0, offset=<optimized out>, bytes=512, qiov=0x3ffffffda98, flags=<optimized out>, flags@entry=(unknown: 0))
>> at /home/cborntra/REPOS/qemu/block.c:3136
>> #11 0x00000000801a83e4 in bdrv_co_do_readv (flags=(unknown: 0), qiov=<optimized out>, nb_sectors=<optimized out>, sector_num=<optimized out>, bs=<optimized out>)
>> at /home/cborntra/REPOS/qemu/block.c:3158
>> #12 bdrv_co_readv (bs=<optimized out>, sector_num=<optimized out>, nb_sectors=<optimized out>, qiov=<optimized out>) at /home/cborntra/REPOS/qemu/block.c:3167
>> #13 0x00000000801a7ce2 in bdrv_aligned_preadv (bs=bs@entry=0x807fa620, req=req@entry=0x3ffba3fedb0, offset=<optimized out>, bytes=bytes@entry=512, align=512, qiov=0x3ffffffda98, flags=0)
>> at /home/cborntra/REPOS/qemu/block.c:3042
>> #14 0x00000000801a82da in bdrv_co_do_preadv (bs=0x807fa620, offset=<optimized out>, bytes=512, qiov=0x3ffffffda98, flags=<optimized out>) at /home/cborntra/REPOS/qemu/block.c:3136
>> #15 0x00000000801a94d8 in bdrv_rw_co_entry (opaque=0x3ffffffd9b8) at /home/cborntra/REPOS/qemu/block.c:2693
>> #16 bdrv_rw_co_entry (opaque=0x3ffffffd9b8) at /home/cborntra/REPOS/qemu/block.c:2688
>> #17 0x00000000801ba140 in coroutine_trampoline (i0=<optimized out>, i1=<error reading variable: value has been optimized out>) at /home/cborntra/REPOS/qemu/coroutine-ucontext.c:118
>> #18 0x000003fffc935892 in __makecontext_ret () from /lib64/libc.so.6
>> --> No idea
>
> Similar deal to the Linux AIO event notifier. It's the fd used to
> signal thread pool work item completion. The threadpool is
> per-AioContext so the fd overhead is per-iothread.
>
> However, we can use a BH instead since the API has now been made
> thread-safe. Previously we used EventNotifier because
> qemu_bh_schedule() was not thread-safe.
>
> I will send a patch but I'm not sure it's critical enough for QEMU 2.1.
> Do you have a bug report or justification for pushing this into QEMU
> 2.1?

Well, 2.1-rc plus all the patches floating around (most of them in
Kevin's tree) is now pretty stable regarding dataplane. So maybe it's
better to defer this to 2.2, I don't know. The iothread thing will be
interesting to test once there is libvirt support.

On s390 I expect people to hit the 1024 fd limit, but this is a
one-line change in libvirt's qemu.conf.
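
For illustration only, and not QEMU code: a minimal sketch, assuming
Linux with glibc, of why each of the notifiers traced above costs one
file descriptor and how that adds up against the per-process limit.
The function names eventfd_roundtrip() and fd_soft_limit() are
hypothetical helpers made up for this example. The first one creates
an eventfd, signals it a few times and drains the counter with a
single read(); the second one reports the soft RLIMIT_NOFILE value,
which is the limit a guest with many such devices would run into.

```c
#include <limits.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical helper: create one eventfd (this consumes one file
 * descriptor), signal it `kicks` times, then drain it with one read().
 * Returns the counter value read back, or -1 on error. With kicks == 0
 * the non-blocking read() fails with EAGAIN, so -1 is returned.
 */
long eventfd_roundtrip(unsigned kicks)
{
    uint64_t one = 1;
    uint64_t count = 0;
    long result = -1;
    int fd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);

    if (fd < 0) {
        return -1;
    }
    for (unsigned i = 0; i < kicks; i++) {
        /* Each write adds its 8-byte value to the kernel-side counter. */
        if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
            goto out;
        }
    }
    /* One read returns the accumulated counter and resets it to zero. */
    if (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count)) {
        result = (long)count;
    }
out:
    close(fd);
    return result;
}

/*
 * Hypothetical helper: soft limit on open file descriptors for this
 * process, or -1 on error. An unlimited (or oversized) value is
 * reported as LONG_MAX.
 */
long fd_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t)LONG_MAX) {
        return LONG_MAX;
    }
    return (long)rl.rlim_cur;
}
```

Calling eventfd_roundtrip(3) should hand back 3, since the three
signals are coalesced into one counter. fd_soft_limit() is the number
to compare against the device count times the eventfds per device seen
in the traces above.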