From: Stefan Hajnoczi <stefanha@redhat.com>
To: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: qemu-devel@nongnu.org, Jens Axboe <axboe@fb.com>,
Christoph Hellwig <hch@lst.de>,
Eliezer Tamir <eliezer.tamir@linux.intel.com>,
Davide Libenzi <davidel@xmailserver.org>,
"Michael S. Tsirkin" <mst@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>, Fam Zheng <famz@redhat.com>
Subject: Re: [Qemu-devel] Linux kernel polling for QEMU
Date: Tue, 29 Nov 2016 11:00:11 +0000
Message-ID: <20161129110011.GB1300@stefanha-x1.localdomain>
In-Reply-To: <d31deaa3-0416-b9e0-19cf-65d35fd58537@de.ibm.com>
On Tue, Nov 29, 2016 at 09:19:22AM +0100, Christian Borntraeger wrote:
> On 11/24/2016 04:12 PM, Stefan Hajnoczi wrote:
> > I looked through the socket SO_BUSY_POLL and blk_mq poll support in
> > recent Linux kernels with an eye towards integrating the ongoing QEMU
> > polling work. The main missing feature is eventfd polling support which
> > I describe below.
> >
> > Background
> > ----------
> > We're experimenting with polling in QEMU so I wondered if there are
> > advantages to having the kernel do polling instead of userspace.
> >
> > One such advantage has been pointed out by Christian Borntraeger and
> > Paolo Bonzini: a userspace thread spins blindly without knowing when it
> > is hogging a CPU that other tasks need. The kernel knows when other
> > tasks need to run and can skip polling in that case.
> >
> > Power management might also benefit if the kernel was aware of polling
> > activity on the system. That way polling can be controlled by the
> > system administrator in a single place. Perhaps smarter power saving
> > choices can also be made by the kernel.
> >
> > Another advantage is that the kernel can poll hardware rings (e.g. NIC
> > rx rings) whereas QEMU can only poll its own virtual memory (including
> > guest RAM). That means the kernel can bypass interrupts for devices
> > that are using kernel drivers.
> >
> > State of polling in Linux
> > -------------------------
> > SO_BUSY_POLL causes recvmsg(2), select(2), and poll(2) family system
> > calls to spin awaiting new receive packets. From what I can tell epoll
> > is not supported so that system call will sleep without polling.
> >
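
(For reference, this is the existing per-socket knob described above; the spin
budget is whatever value the admin picks, and the net.core.busy_poll sysctl
plays the same role globally for poll(2)/select(2). SO_BUSY_POLL needs Linux
3.11+ headers:)

    #include <stdio.h>
    #include <sys/socket.h>

    /* Opt a socket in to busy polling; budget_us is the time, in
     * microseconds, the kernel may spin waiting for new packets before
     * falling back to sleeping. */
    static int enable_busy_poll(int sockfd, int budget_us)
    {
        if (setsockopt(sockfd, SOL_SOCKET, SO_BUSY_POLL,
                       &budget_us, sizeof(budget_us)) < 0) {
            perror("setsockopt(SO_BUSY_POLL)");
            return -1;
        }
        return 0;
    }
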
> > blk_mq poll is mainly supported by NVMe. It is only available with
> > synchronous direct I/O. select(2), poll(2), epoll, and Linux AIO are
> > therefore not integrated. It would be nice to extend the code so a
> > process waiting on Linux AIO using io_getevents(2), select(2), poll(2),
> > or epoll will poll.
> >
> > QEMU and KVM-specific polling
> > -----------------------------
> > There are a few QEMU/KVM-specific items that require polling support:
> >
> > QEMU's event loop aio_notify() mechanism wakes up the event loop from a
> > blocking poll(2) or epoll call. It is used when another thread adds or
> > changes an event loop resource (such as scheduling a BH). There is a
> > userspace memory location (ctx->notified) that is written by
> > aio_notify() as well as an eventfd that can be signalled.
> >
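
(To make that mechanism concrete, here is the general shape of the doorbell
pattern with made-up names; it is only a sketch of what aio_notify() does, not
the actual QEMU code:)

    #include <stdatomic.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the relevant AioContext fields: a progress
     * counter in plain userspace memory plus an eventfd doorbell. */
    struct notify_state {
        _Atomic uint32_t notified;  /* memory location a poller can watch */
        int event_fd;               /* wakes up a blocked poll()/epoll_wait() */
    };

    static void notify(struct notify_state *s)
    {
        uint64_t one = 1;

        atomic_fetch_add(&s->notified, 1);           /* publish progress first */
        (void)write(s->event_fd, &one, sizeof(one)); /* then ring the doorbell */
    }
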
> > kvm.ko's ioeventfd is signalled upon guest MMIO/PIO accesses. Virtio
> > devices use ioeventfd as a doorbell after new requests have been placed
> > in a virtqueue, which is a descriptor ring in userspace memory.
> >
> > Eventfd polling support could look like this:
> >
> >     struct eventfd_poll_info poll_info = {
> >         .addr = ...memory location...,
> >         .size = sizeof(uint32_t),
> >         .op = EVENTFD_POLL_OP_NOT_EQUAL, /* check *addr != val */
> >         .val = ...last value...,
> >     };
> >     ioctl(eventfd, EVENTFD_SET_POLL, &poll_info);
> >
> > In the kernel, eventfd stashes this information and eventfd_poll()
> > evaluates the operation (e.g. not equal, bitwise and, etc) to detect
> > progress.
> >
> > Note that this eventfd polling mechanism doesn't actually poll the
> > eventfd counter value. It's useful for situations where the eventfd is
> > a doorbell/notification that some object in userspace memory has been
> > updated. So it polls that userspace memory location directly.
> >
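
(Continuing the earlier notify_state sketch: with the proposed interface the
poller would register that memory location rather than the eventfd counter.
EVENTFD_SET_POLL and struct eventfd_poll_info are the hypothetical API from
this proposal, and s / last_seen are placeholders:)

    /* Ask the kernel to treat "the counter moved past the value we saw last"
     * as poll progress; the eventfd stays in place as the fallback doorbell. */
    struct eventfd_poll_info poll_info = {
        .addr = (void *)&s->notified,
        .size = sizeof(uint32_t),
        .op   = EVENTFD_POLL_OP_NOT_EQUAL,
        .val  = last_seen,
    };
    ioctl(s->event_fd, EVENTFD_SET_POLL, &poll_info);
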
> > This new eventfd feature also provides a poor man's Linux AIO polling
> > support: set the Linux AIO shared ring index as the eventfd polling
> > memory location. This is not as good as true Linux AIO polling support
> > where the kernel polls the NVMe, virtio_blk, etc ring since we'd still
> > rely on an interrupt to complete I/O requests.
> >
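
(The shared ring index mentioned above is the completion ring that io_setup(2)
maps into the process; its tail advances whenever the kernel appends a
completion event, so handing &ring->tail to the hypothetical EVENTFD_SET_POLL
ioctl with EVENTFD_POLL_OP_NOT_EQUAL would detect that completions have
arrived. The layout below mirrors the kernel's internal struct aio_ring and is
not a stable uapi header, so take it as illustrative only:)

    struct aio_ring {
        unsigned id;
        unsigned nr;                /* number of io_event slots */
        unsigned head;              /* consumed by userspace */
        unsigned tail;              /* advanced by the kernel on completion */
        unsigned magic;
        unsigned compat_features;
        unsigned incompat_features;
        unsigned header_length;     /* size of this header */
        /* struct io_event io_events[] follows */
    };
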
> > Thoughts?
>
> Would be an interesting exercise, but we should really try to avoid making
> the iothreads more costly. When I look at some of our measurements, I/O-wise
> we are slightly behind z/VM, which can be tuned to be in a similar area, but
> we use more host CPUs on s390 for the same throughput.
>
> So I have two concerns, and both are related to overhead.
> a: I am able to get higher bandwidth and lower host CPU utilization
> when running fio for multiple disks when I pin the iothreads to a subset of
> the host CPUs (there is a sweet spot). Is the polling maybe just influencing
> the scheduler to do the same by keeping the iothread from doing sleep/wakeup
> all the time?

Interesting theory. Looking at sched_switch tracing data should tell us
whether that is true. Do you get any benefit from combining the sweet-spot
pinning with polling?

> b: what about contention with other guests on the host? What
> worries me a bit is the fact that most performance measurements and
> tunings are done for workloads without that. We (including myself) do our
> microbenchmarks (or fio runs) with just one guest and are happy if we see
> an improvement. But does that reflect real usage? For example, have you ever
> measured the aio polling with 10 guests or so?
> My gut feeling (and obviously I have not done proper measurements myself) is
> that we want to stop polling as soon as there is contention.
>
> As you outlined, we already have something in place in the kernel to stop
> polling.
>
> Interestingly enough, for SO_BUSY_POLL the network code seems to consider
>     !need_resched() && !signal_pending(current)
> for stopping the poll, which allows the poller to consume its whole time
> slice. KVM instead uses single_task_running() for its halt polling
> (halt_poll_ns). This means that KVM yields much more aggressively, which is
> probably the right thing to do for opportunistic spinning.

Another thing I noticed about the busy_poll implementation is that it
will spin if *any* file descriptor supports polling.

In QEMU we decided to implement the opposite: spin only if *all* event
sources support polling. The reason is that we don't want polling to
introduce any extra latency on the event sources that do not support
polling.
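
(A rough sketch of that rule with made-up types, not QEMU's actual aio-posix.c
code: one event source without a poll callback disables busy-waiting entirely,
so non-pollable sources never pay extra latency:)

    #include <stdbool.h>

    struct handler {
        bool (*io_poll)(void *opaque);  /* NULL if this source cannot be polled */
        void *opaque;
        struct handler *next;
    };

    /* Returns true if busy-waiting made progress; returns false either because
     * some source is not pollable (the caller should block immediately) or
     * because no handler reported progress this round. */
    static bool try_poll_once(struct handler *handlers)
    {
        for (struct handler *h = handlers; h; h = h->next) {
            if (!h->io_poll) {
                return false;           /* spin only if *all* sources can poll */
            }
        }
        for (struct handler *h = handlers; h; h = h->next) {
            if (h->io_poll(h->opaque)) {
                return true;            /* progress, no need to enter the kernel */
            }
        }
        return false;
    }
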
> Another thing to consider: in the kernel we already have other opportunistic
> spinners and we are in the process of making things less aggressive because
> it caused real issues. For example, search for the vcpu_is_preempted patch
> set, which by the way showed another issue: when running nested you want to
> consider not only your own load but also the load of the hypervisor.

These are good points, and they are why I think polling in the kernel can
make smarter decisions than polling in userspace. There are multiple
components in the system that can do polling, so it would be best to have a
single place for it where the different polling activities cannot interfere
with each other.

Stefan