Re: [KVM PATCH v3 3/3] KVM: Fix races in irqfd using new eventfd_kref_get interface

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Gregory Haskins <gregory.haskins@gmail.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Gregory Haskins <ghaskins@novell.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	avi@redhat.com, paulmck@linux.vnet.ibm.com,
	davidel@xmailserver.org, mingo@elte.hu, rusty@rustcorp.com.au
Subject: Re: [KVM PATCH v3 3/3] KVM: Fix races in irqfd using new eventfd_kref_get interface
Date: Mon, 22 Jun 2009 13:31:29 -0400	[thread overview]
Message-ID: <4A3FBFF1.8050504@gmail.com> (raw)
In-Reply-To: <20090622165708.GB15228@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 12351 bytes --]

Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 12:05:57PM -0400, Gregory Haskins wrote:
>   
>> This patch fixes all known races in irqfd, and paves the way to restore
>> DEASSIGN support.  For details of the eventfd races, please see the patch
>> presumably commited immediately prior to this one.
>>
>> In a nutshell, we use eventfd_kref_get/put() to properly manage the
>> lifetime of the underlying eventfd.  We also use careful coordination
>> with our workqueue to ensure that all irqfd objects have terminated
>> before we allow kvm to shutdown.  The logic used for shutdown walks
>> all open irqfds and releases them.  This logic can be generalized in
>> the future to allow a subset of irqfds to be released, thus allowing
>> DEASSIGN support.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>     
>
> I think this patch is a shade too tricky. Some explanation why below.
>
> But I think irqfd_pop is a good idea.
>   

Yeah, next we can add something like "irqfd_remove(gsi)" in a similar
way to do DEASSIGN.

> Here's an alternative design sketch: add a list of irqfds to be shutdown
> in kvm, and create a single-threaded workqueue. To kill an irqfd, move
> it from list of live irqfds to list of dead irqfds, then schedule work
> on a workqueue that walks this list and kills irqfds.
>   

Yeah, I actually thought of that too, and I think that will work.  But
then I realized flush_schedule_work does the same thing and its much
less code.  Perhaps it is also much less clear, too ;)  At the very
least, you have made me realize I need to comment better.
>   
>> ---
>>
>>  virt/kvm/eventfd.c |  144 ++++++++++++++++++++++++++++++++++++++++------------
>>  1 files changed, 110 insertions(+), 34 deletions(-)
>>
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 9656027..67985cd 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -28,6 +28,7 @@
>>  #include <linux/file.h>
>>  #include <linux/list.h>
>>  #include <linux/eventfd.h>
>> +#include <linux/kref.h>
>>  
>>  /*
>>   * --------------------------------------------------------------------
>> @@ -36,26 +37,68 @@
>>   * Credit goes to Avi Kivity for the original idea.
>>   * --------------------------------------------------------------------
>>   */
>> +
>> +enum {
>> +	irqfd_flags_shutdown,
>> +};
>> +
>>  struct _irqfd {
>>  	struct kvm               *kvm;
>> +	struct kref              *eventfd;
>>     
>
>
> Yay, kref.
>
>   
>>  	int                       gsi;
>>  	struct list_head          list;
>>  	poll_table                pt;
>>  	wait_queue_head_t        *wqh;
>>  	wait_queue_t              wait;
>> -	struct work_struct        inject;
>> +	struct work_struct        work;
>> +	unsigned long             flags;
>>     
>
> Just make it "int shutdown"?
>   

Yep, that is probably fine but we will have to use an explicit wmb in
lieu of a set_bit operation.  NBD.

>   
>>  };
>>  
>>  static void
>> -irqfd_inject(struct work_struct *work)
>> +irqfd_release(struct _irqfd *irqfd)
>> +{
>> +	eventfd_kref_put(irqfd->eventfd);
>> +	kfree(irqfd);
>> +}
>> +
>> +static void
>> +irqfd_work(struct work_struct *work)
>>  {
>> -	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
>> +	struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
>>  	struct kvm *kvm = irqfd->kvm;
>>  
>> -	mutex_lock(&kvm->irq_lock);
>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>> -	mutex_unlock(&kvm->irq_lock);
>> +	if (!test_bit(irqfd_flags_shutdown, &irqfd->flags)) {
>>     
>
> Why is it safe to test this bit outside of any lock?
>   
Because the ordering is guaranteed to set_bit(), schedule_work().  All
we need to do is make sure that the work-queue runs at least one more
time after the flag has been set.  (Of course, I could have screwed up
too, but that was my rationale).

>   
>> +		/* Inject an interrupt */
>> +		mutex_lock(&kvm->irq_lock);
>> +		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>> +		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>> +		mutex_unlock(&kvm->irq_lock);
>> +	} else {
>>     
>
>
> Not much shared code here - create a separate showdown work struct?
> They are cheap ...
>   

We can't because we need to ensure that all inject-jobs complete before
release-jobs.  Reading the work-queue code, it would be a deadlock for
the release-job to do a flush_work(inject-job).  Therefore, both
workloads are encapsulated into a single job, and we ensure that the job
is launched at least one more time after the flag has been set.

Of course, now that I wrote that,  I realize it was clear-as-mud in the
code and needs some commenting ;)

>   
>> +		/* shutdown the irqfd */
>> +		struct _irqfd *_irqfd = NULL;
>> +
>> +		mutex_lock(&kvm->lock);
>> +
>> +		if (!list_empty(&irqfd->list))
>> +			_irqfd = irqfd;
>> +
>> +		if (_irqfd)
>> +			list_del(&_irqfd->list);
>> +
>> +		mutex_unlock(&kvm->lock);
>> +
>> +		/*
>> +		 * If the item is not currently on the irqfds list, we know
>> +		 * we are running concurrently with the KVM side trying to
>> +		 * remove this item as well.
>>     
>
> We do? How? As far as I can see list is only empty after it has been
> created.  Generally, it would be better to either use a flag or use
> list_empty as an indication of going down, but not both.
>   

I think you are mis-reading that.  list_empty(&irqfd->list) is the
individual irqfd list-item, not the kvm->irqfds list itself.  This
conditional is telling us whether the irqfd in question is on or off the
list (its effectively an irqfd-specific flag), not whether the global
list is empty.  Again, poor commenting on my part.

>   
>>  Since the KVM side should be
>> +		 * holding the reference now, and will block behind a
>> +		 * flush_work(), lets just let them do the release() for us
>> +		 */
>> +		if (!_irqfd)
>> +			return;
>> +
>> +		irqfd_release(_irqfd);
>> +	}
>>  }
>>  
>>  static int
>> @@ -65,25 +108,20 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>  	unsigned long flags = (unsigned long)key;
>>  
>>  	/*
>> -	 * Assume we will be called with interrupts disabled
>> +	 * called with interrupts disabled
>>  	 */
>> -	if (flags & POLLIN)
>> -		/*
>> -		 * Defer the IRQ injection until later since we need to
>> -		 * acquire the kvm->lock to do so.
>> -		 */
>> -		schedule_work(&irqfd->inject);
>> -
>>  	if (flags & POLLHUP) {
>>  		/*
>> -		 * for now, just remove ourselves from the list and let
>> -		 * the rest dangle.  We will fix this up later once
>> -		 * the races in eventfd are fixed
>> +		 * ordering is important: shutdown flag must be visible
>> +		 * before we schedule
>>  		 */
>>  		__remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> -		irqfd->wqh = NULL;
>> +		set_bit(irqfd_flags_shutdown, &irqfd->flags);
>>     
>
> So what happens if a previously scheduled work runs on irqfd
> and sees this flag?
My original thought was "thats ok", but now that you mention it I am not
so sure.  Ill give it some more thought because maybe you are on to
something.

>  And note that multiple works can run on irqfd
> in parallel.
>   

They can?  I thought work-queue items were guaranteed to only schedule
once?  If what you say is true, its broken, I agree, and Ill need to
revisit.  Let me get back to you.
>   
>>  	}
>>  
>> +	if (flags & (POLLHUP | POLLIN))
>> +		schedule_work(&irqfd->work);
>> +
>>  	return 0;
>>  }
>>  
>> @@ -102,6 +140,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>  {
>>  	struct _irqfd *irqfd;
>>  	struct file *file = NULL;
>> +	struct kref *kref = NULL;
>>  	int ret;
>>  	unsigned int events;
>>  
>> @@ -112,7 +151,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>  	irqfd->kvm = kvm;
>>  	irqfd->gsi = gsi;
>>  	INIT_LIST_HEAD(&irqfd->list);
>> -	INIT_WORK(&irqfd->inject, irqfd_inject);
>> +	INIT_WORK(&irqfd->work, irqfd_work);
>>  
>>  	file = eventfd_fget(fd);
>>  	if (IS_ERR(file)) {
>> @@ -133,11 +172,13 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>  	list_add_tail(&irqfd->list, &kvm->irqfds);
>>  	mutex_unlock(&kvm->lock);
>>  
>> -	/*
>> -	 * Check if there was an event already queued
>> -	 */
>> -	if (events & POLLIN)
>> -		schedule_work(&irqfd->inject);
>> +	kref = eventfd_kref_get(file);
>> +	if (IS_ERR(file)) {
>> +		ret = PTR_ERR(file);
>> +		goto fail;
>> +	}
>> +
>> +	irqfd->eventfd = kref;
>>  
>>  	/*
>>  	 * do not drop the file until the irqfd is fully initialized, otherwise
>> @@ -145,9 +186,18 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>  	 */
>>  	fput(file);
>>  
>> +	/*
>> +	 * Check if there was an event already queued
>> +	 */
>>     
>
> This comment seems to confuse more that it clarifies:
> queued where? eventfd only counts... Just kill the comment?
>
>   
non-zero values in eventfd are "queued" as a signal.  This test just
checks if an interrupt was already injected before we registered.

>> +	if (events & POLLIN)
>> +		schedule_work(&irqfd->work);
>> +
>>  	return 0;
>>  
>>  fail:
>> +	if (kref && !IS_ERR(kref))
>> +		eventfd_kref_put(kref);
>> +
>>  	if (file && !IS_ERR(file))
>>  		fput(file);
>>     
>
> let's add a couple more labels and avoid the kref/file check
> and the initialization above?
>   

I think that just makes it more confusing, personally.  But I will give
it some thought.

>   
>>  
>> @@ -161,21 +211,47 @@ kvm_irqfd_init(struct kvm *kvm)
>>  	INIT_LIST_HEAD(&kvm->irqfds);
>>  }
>>  
>> +static struct _irqfd *
>> +irqfd_pop(struct kvm *kvm)
>> +{
>> +	struct _irqfd *irqfd = NULL;
>> +
>> +	mutex_lock(&kvm->lock);
>> +
>> +	if (!list_empty(&kvm->irqfds)) {
>> +		irqfd = list_first_entry(&kvm->irqfds, struct _irqfd, list);
>> +		list_del(&irqfd->list);
>> +	}
>> +
>> +	mutex_unlock(&kvm->lock);
>> +
>> +	return irqfd;
>> +}
>> +
>>  void
>>  kvm_irqfd_release(struct kvm *kvm)
>>  {
>> -	struct _irqfd *irqfd, *tmp;
>> +	struct _irqfd *irqfd;
>>  
>> -	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
>> -		if (irqfd->wqh)
>> -			remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> +	while ((irqfd = irqfd_pop(kvm))) {
>>  
>> -		flush_work(&irqfd->inject);
>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>  
>> -		mutex_lock(&kvm->lock);
>> -		list_del(&irqfd->list);
>> -		mutex_unlock(&kvm->lock);
>> +		/*
>> +		 * We guarantee there will be no more notifications after
>> +		 * the remove_wait_queue returns.  Now lets make sure we
>> +		 * synchronize behind any outstanding work items before
>> +		 * releasing the resources
>> +		 */
>> +		flush_work(&irqfd->work);
>>  
>> -		kfree(irqfd);
>> +		irqfd_release(irqfd);
>>  	}
>> +
>> +	/*
>> +	 * We need to wait in case there are any outstanding work-items
>> +	 * in flight that had already removed themselves from the list
>> +	 * prior to entry to this function
>> +	 */
>>     
>
> Looks scary. Why doesn't the flush above cover all cases?
>   

The path inside the while() is for when KVM wins the race and finds the
item in the list.  It atomically removes it, and is responsible for
freeing it in a coordinated way.  In this case, we must block with the
flush_work() before we can irqfd_release() so that we do not yank the
memory out from under a running work-item.

The flush_scheduled_work() is for when eventfd wins the race and has
already removed itself from the list in the "shutdown" path in the
work-item.  We want to make sure that kvm_irqfd_release() cannot return
until all work-items have exited to prevent something like the kvm.ko
module unloading while the work-item is still in flight.

Thanks Michael,
-Greg
>   
>> +	flush_scheduled_work();
>>  }
>>     
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>   



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

next prev parent reply	other threads:[~2009-06-22 17:31 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-22 16:05 [KVM PATCH v3 0/3] irqfd/eventfd fixes Gregory Haskins
2009-06-22 16:05 ` [KVM PATCH v3 1/3] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
2009-06-22 16:05 ` [KVM PATCH v3 2/3] eventfd: add internal reference counting to fix notifier race conditions Gregory Haskins
2009-06-22 16:05 ` [KVM PATCH v3 3/3] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
2009-06-22 16:57   ` Michael S. Tsirkin
2009-06-22 17:31     ` Gregory Haskins [this message]
2009-06-22 17:45       ` Michael S. Tsirkin
2009-06-22 18:03         ` Gregory Haskins
2009-06-22 18:26           ` Michael S. Tsirkin
2009-06-22 18:11         ` Davide Libenzi
2009-06-22 18:32           ` Michael S. Tsirkin
2009-06-22 18:41             ` Davide Libenzi
2009-06-22 18:52               ` Michael S. Tsirkin
2009-06-23 14:55       ` Gregory Haskins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A3FBFF1.8050504@gmail.com \
    --to=gregory.haskins@gmail.com \
    --cc=avi@redhat.com \
    --cc=davidel@xmailserver.org \
    --cc=ghaskins@novell.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=mst@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=rusty@rustcorp.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.