[PATCH 0/2] eventfd: new EFD

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] eventfd: new EFD_STATE flag
@ 2009-08-20 15:56 Michael S. Tsirkin
  2009-08-20 16:20 ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-20 15:56 UTC (permalink / raw)
  To: davidel, avi, gleb, kvm, linux-kernel

Davide,
Looks like I got an ack from Avi and no comments from others,
so the following patchset is identical to the last RFC.
Can it be merged for 2.6.32?
Thanks!

This series implements a new EFD_STATE flag for eventfd.
When set, this changes eventfd behaviour in the following way:
- write simply stores the value written, and is always non-blocking
- read unblocks when the value written changes, and
  returns the value written

Motivation: we'd like to use eventfd in qemu to pass interrupts from
(emulated or assigned) devices to guest. For level interrupts, the
counter supported currently by eventfd is not a good match: we really
need to set interrupt to a level, typically 0 or 1, wake the guest if
there was a change and give the guest ability to see the last value
written. The level can be set either from kernel (e.g. with assigned
devices) or from userspace.

Michael S. Tsirkin (2):
  eventfd: reorganize the code to simplify new flags
  eventfd: EFD_STATE flag

 fs/eventfd.c            |   83 +++++++++++++++++++++++++++++++++++++++--------
 include/linux/eventfd.h |    3 +-
 2 files changed, 71 insertions(+), 15 deletions(-)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-20 15:56 [PATCH 0/2] eventfd: new EFD_STATE flag Michael S. Tsirkin
@ 2009-08-20 16:20 ` Davide Libenzi
  2009-08-20 17:38   ` Avi Kivity
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-20 16:20 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, Linux Kernel Mailing List

On Thu, 20 Aug 2009, Michael S. Tsirkin wrote:

> Davide,
> Looks like I got an ack from Avi and no comments from others,
> so the following patchset is identical to the last RFC.
> Can it be merged for 2.6.32?
> Thanks!

I briefly looked at this while in vacation, although I did not reply 
hoping the horrible feeling about this code would go away.
It didn't.
I find this to be an ugly and ad-hoc multiplexing of eventfd with added 
functionalities of questionable general use.
I'm pretty sure you can do better on KVM side, to solve the problem w/out 
littering eventfd.

- Davide

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-20 16:20 ` Davide Libenzi
@ 2009-08-20 17:38   ` Avi Kivity
  2009-08-20 17:44     ` Davide Libenzi
  2009-08-20 17:55     ` Michael S. Tsirkin
  0 siblings, 2 replies; 79+ messages in thread
From: Avi Kivity @ 2009-08-20 17:38 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On 08/20/2009 07:20 PM, Davide Libenzi wrote:
>
> I briefly looked at this while in vacation, although I did not reply
> hoping the horrible feeling about this code would go away.
> It didn't.
> I find this to be an ugly and ad-hoc multiplexing of eventfd with added
> functionalities of questionable general use.
> I'm pretty sure you can do better on KVM side, to solve the problem w/out
> littering eventfd.
>
>    

While we could argue about this my feeling is that we should drop this, 
at least until we can quantify what benefit it has and whether there are 
any Davide-acceptable alternatives.

In the meanwhile, we can let vhost-net support edge-triggered interrupts 
only, let qemu terminate those eventfds and convert then to 
level-triggered interrupts (which it can then inject using the existing 
ioctl).  It will keep vhost-net and kvm simpler at the cost of some 
performance penalty to guests using level interrupts.  These suck anyway 
so we'll point users at msi.

-- 
I have a truly marvellous patch that fixes the bug which thisb
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-20 17:38   ` Avi Kivity
@ 2009-08-20 17:44     ` Davide Libenzi
  2009-08-20 17:56       ` Paolo Bonzini
  2009-08-20 17:55     ` Michael S. Tsirkin
  1 sibling, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-20 17:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On Thu, 20 Aug 2009, Avi Kivity wrote:

> On 08/20/2009 07:20 PM, Davide Libenzi wrote:
> > 
> > I briefly looked at this while in vacation, although I did not reply
> > hoping the horrible feeling about this code would go away.
> > It didn't.
> > I find this to be an ugly and ad-hoc multiplexing of eventfd with added
> > functionalities of questionable general use.
> > I'm pretty sure you can do better on KVM side, to solve the problem w/out
> > littering eventfd.
> > 
> >    
> 
> While we could argue about this my feeling is that we should drop this, at
> least until we can quantify what benefit it has and whether there are any
> Davide-acceptable alternatives.

I really didn't mean to be harsh, but seriously, we cannot just have a 
Multiplexing Feast over eventfd, with one-time users.


- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-20 17:38   ` Avi Kivity
  2009-08-20 17:44     ` Davide Libenzi
@ 2009-08-20 17:55     ` Michael S. Tsirkin
  2009-08-20 18:06       ` Avi Kivity
  1 sibling, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-20 17:55 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On Thu, Aug 20, 2009 at 08:38:48PM +0300, Avi Kivity wrote:
> On 08/20/2009 07:20 PM, Davide Libenzi wrote:
>>
>> I briefly looked at this while in vacation, although I did not reply
>> hoping the horrible feeling about this code would go away.
>> It didn't.
>> I find this to be an ugly and ad-hoc multiplexing of eventfd with added
>> functionalities of questionable general use.
>> I'm pretty sure you can do better on KVM side, to solve the problem w/out
>> littering eventfd.
>>
>>    
>
> While we could argue about this my feeling is that we should drop this,  
> at least until we can quantify what benefit it has and whether there are  
> any Davide-acceptable alternatives.
>
> In the meanwhile, we can let vhost-net support edge-triggered interrupts  
> only, let qemu terminate those eventfds and convert then to  
> level-triggered interrupts (which it can then inject using the existing  
> ioctl).  It will keep vhost-net and kvm simpler at the cost of some  
> performance penalty to guests using level interrupts.  These suck anyway  
> so we'll point users at msi.

I thought the point was to move assigned devices out of KVM?

> -- 
> I have a truly marvellous patch that fixes the bug which thisb
> signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-20 17:44     ` Davide Libenzi
@ 2009-08-20 17:56       ` Paolo Bonzini
  2009-08-21 17:21         ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Paolo Bonzini @ 2009-08-20 17:56 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Avi Kivity, Michael S. Tsirkin, gleb, kvm,
	Linux Kernel Mailing List

On 08/20/2009 07:44 PM, Davide Libenzi wrote:
> On Thu, 20 Aug 2009, Avi Kivity wrote:
>
>> On 08/20/2009 07:20 PM, Davide Libenzi wrote:
>>>
>>> I briefly looked at this while in vacation, although I did not reply
>>> hoping the horrible feeling about this code would go away.
>>> It didn't.
>>> I find this to be an ugly and ad-hoc multiplexing of eventfd with added
>>> functionalities of questionable general use.
>>> I'm pretty sure you can do better on KVM side, to solve the problem w/out
>>> littering eventfd.
>>
>> While we could argue about this my feeling is that we should drop this, at
>> least until we can quantify what benefit it has and whether there are any
>> Davide-acceptable alternatives.
>
> I really didn't mean to be harsh, but seriously, we cannot just have a
> Multiplexing Feast over eventfd, with one-time users.

EFD_STATE actually does two changes: a) makes read block until the value 
changes; b) makes each write override the previous one.  How would you 
feel if the two changes were separated?  I can see each of them has use 
cases

For example, (a) could be implemented by using select's xfds (POLLPRI) 
to poll for value changes (rfds would still poll for non-zeroness). 
Then Michael does not need even to read the eventfd; instead he'd check 
POLLIN with a zero timeout.

(b) could be implemented with a flag like Michael did.

Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-20 17:55     ` Michael S. Tsirkin
@ 2009-08-20 18:06       ` Avi Kivity
  2009-08-20 18:28         ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2009-08-20 18:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On 08/20/2009 08:55 PM, Michael S. Tsirkin wrote:
>
>> While we could argue about this my feeling is that we should drop this,
>> at least until we can quantify what benefit it has and whether there are
>> any Davide-acceptable alternatives.
>>
>> In the meanwhile, we can let vhost-net support edge-triggered interrupts
>> only, let qemu terminate those eventfds and convert then to
>> level-triggered interrupts (which it can then inject using the existing
>> ioctl).  It will keep vhost-net and kvm simpler at the cost of some
>> performance penalty to guests using level interrupts.  These suck anyway
>> so we'll point users at msi.
>>      
> I thought the point was to move assigned devices out of KVM?
>    

Grr.  Forgot about that.

That's much more important.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-20 18:06       ` Avi Kivity
@ 2009-08-20 18:28         ` Michael S. Tsirkin
  2009-08-23 13:01           ` Avi Kivity
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-20 18:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On Thu, Aug 20, 2009 at 09:06:51PM +0300, Avi Kivity wrote:
> On 08/20/2009 08:55 PM, Michael S. Tsirkin wrote:
>>
>>> While we could argue about this my feeling is that we should drop this,
>>> at least until we can quantify what benefit it has and whether there are
>>> any Davide-acceptable alternatives.
>>>
>>> In the meanwhile, we can let vhost-net support edge-triggered interrupts
>>> only, let qemu terminate those eventfds and convert then to
>>> level-triggered interrupts (which it can then inject using the existing
>>> ioctl).  It will keep vhost-net and kvm simpler at the cost of some
>>> performance penalty to guests using level interrupts.  These suck anyway
>>> so we'll point users at msi.
>>>      
>> I thought the point was to move assigned devices out of KVM?
>>    
>
> Grr.  Forgot about that.
>
> That's much more important.

Looks like we'll have do with a separate char device or
something?

> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-20 17:56       ` Paolo Bonzini
@ 2009-08-21 17:21         ` Davide Libenzi
  0 siblings, 0 replies; 79+ messages in thread
From: Davide Libenzi @ 2009-08-21 17:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Avi Kivity, Michael S. Tsirkin, gleb, kvm,
	Linux Kernel Mailing List

On Thu, 20 Aug 2009, Paolo Bonzini wrote:

> On 08/20/2009 07:44 PM, Davide Libenzi wrote:
> > On Thu, 20 Aug 2009, Avi Kivity wrote:
> > 
> > > On 08/20/2009 07:20 PM, Davide Libenzi wrote:
> > > > 
> > > > I briefly looked at this while in vacation, although I did not reply
> > > > hoping the horrible feeling about this code would go away.
> > > > It didn't.
> > > > I find this to be an ugly and ad-hoc multiplexing of eventfd with added
> > > > functionalities of questionable general use.
> > > > I'm pretty sure you can do better on KVM side, to solve the problem
> > > > w/out
> > > > littering eventfd.
> > > 
> > > While we could argue about this my feeling is that we should drop this, at
> > > least until we can quantify what benefit it has and whether there are any
> > > Davide-acceptable alternatives.
> > 
> > I really didn't mean to be harsh, but seriously, we cannot just have a
> > Multiplexing Feast over eventfd, with one-time users.
> 
> EFD_STATE actually does two changes: a) makes read block until the value
> changes; b) makes each write override the previous one.  How would you feel if
> the two changes were separated?

I'm afraid that splitting the change won't make less inappropriate for 
eventfd.
I've no doubt KVM needs something like this, but this just don't belong to 
eventfd, which act as either a mutex or a semaphore.
This EFD_STATE does not belong to a generic interface, proof of which is 
the lack of such abstraction in any generic library I'm aware of.
This aside from any consideration about how the code will look with that 
multiplexing in place.


- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-20 18:28         ` Michael S. Tsirkin
@ 2009-08-23 13:01           ` Avi Kivity
  2009-08-23 13:36             ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2009-08-23 13:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On 08/20/2009 09:28 PM, Michael S. Tsirkin wrote:
>>> I thought the point was to move assigned devices out of KVM?
>>>
>>>        
>> Grr.  Forgot about that.
>>
>> That's much more important.
>>      
> Looks like we'll have do with a separate char device or
> something?
>    

We can still tunnel it through userspace.  Most assigned devices will 
have MSI.  uio can signal the eventfd when the level changes, qemu can 
read the level and write it to kvm.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-23 13:01           ` Avi Kivity
@ 2009-08-23 13:36             ` Michael S. Tsirkin
  2009-08-23 13:40               ` Avi Kivity
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-23 13:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On Sun, Aug 23, 2009 at 04:01:29PM +0300, Avi Kivity wrote:
> On 08/20/2009 09:28 PM, Michael S. Tsirkin wrote:
>>>> I thought the point was to move assigned devices out of KVM?
>>>>
>>>>        
>>> Grr.  Forgot about that.
>>>
>>> That's much more important.
>>>      
>> Looks like we'll have do with a separate char device or
>> something?
>>    
>
> We can still tunnel it through userspace.

More important here is realization that eventfd is a mutex/semaphore
implementation, not a generic event reporting interface as we are trying
to use it.

> Most assigned devices will  
> have MSI.  uio can signal the eventfd when the level changes, qemu can  
> read the level and write it to kvm.
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-23 13:36             ` Michael S. Tsirkin
@ 2009-08-23 13:40               ` Avi Kivity
  2009-08-23 14:30                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2009-08-23 13:40 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
> More important here is realization that eventfd is a mutex/semaphore
> implementation, not a generic event reporting interface as we are trying
> to use it.
>    

Well it is a generic event reporting interface (for example, aio uses it).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-23 13:40               ` Avi Kivity
@ 2009-08-23 14:30                 ` Michael S. Tsirkin
  2009-08-23 16:51                   ` Paolo Bonzini
  2009-08-24 18:25                   ` Davide Libenzi
  0 siblings, 2 replies; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-23 14:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
> On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
>> More important here is realization that eventfd is a mutex/semaphore
>> implementation, not a generic event reporting interface as we are trying
>> to use it.
>>    
>
> Well it is a generic event reporting interface (for example, aio uses it).

Davide, I think it's a valid point.  For example, what read on eventfd
does (zero a counter and return) is not like any semaphore I saw.

Look at eventfd as pipe replacement: for wake-up between processes, it
works well. But what if we want to pass around some kind of key as well?
pipe lets you pass single-byte values atomically. This flag allows us to
use 8 byte values as keys.

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-23 14:30                 ` Michael S. Tsirkin
@ 2009-08-23 16:51                   ` Paolo Bonzini
  2009-08-24 18:25                   ` Davide Libenzi
  1 sibling, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2009-08-23 16:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On 08/23/2009 04:30 PM, Michael S. Tsirkin wrote:
> On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
>> On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
>>> More important here is realization that eventfd is a mutex/semaphore
>>> implementation, not a generic event reporting interface as we are trying
>>> to use it.
>>>
>>
>> Well it is a generic event reporting interface (for example, aio uses it).
>
> Davide, I think it's a valid point.  For example, what read on eventfd
> does (zero a counter and return) is not like any semaphore I saw.

It is similar to Win32 events (which can be used like contorted mutexes 
to sync two processes).

Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-23 14:30                 ` Michael S. Tsirkin
  2009-08-23 16:51                   ` Paolo Bonzini
@ 2009-08-24 18:25                   ` Davide Libenzi
  2009-08-24 18:31                     ` Avi Kivity
  2009-08-24 21:49                     ` Michael S. Tsirkin
  1 sibling, 2 replies; 79+ messages in thread
From: Davide Libenzi @ 2009-08-24 18:25 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Sun, 23 Aug 2009, Michael S. Tsirkin wrote:

> On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
> > On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
> >> More important here is realization that eventfd is a mutex/semaphore
> >> implementation, not a generic event reporting interface as we are trying
> >> to use it.
> >>    
> >
> > Well it is a generic event reporting interface (for example, aio uses it).
> 
> Davide, I think it's a valid point.  For example, what read on eventfd
> does (zero a counter and return) is not like any semaphore I saw.


Indeed, the default eventfd behaviour is like, well, an event. Signaling 
(kernel side) or writing (userspace side), signals the event.
Waiting (reading) it, will reset the event.
If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
Events and sempahores are two widely known and used abstractions.
The EFD_STATE proposed one, well, no. Not at all.



- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-24 18:25                   ` Davide Libenzi
@ 2009-08-24 18:31                     ` Avi Kivity
  2009-08-24 22:08                       ` Davide Libenzi
  2009-08-24 21:49                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2009-08-24 18:31 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On 08/24/2009 09:25 PM, Davide Libenzi wrote:
> Indeed, the default eventfd behaviour is like, well, an event. Signaling
> (kernel side) or writing (userspace side), signals the event.
> Waiting (reading) it, will reset the event.
> If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
> Events and sempahores are two widely known and used abstractions.
> The EFD_STATE proposed one, well, no. Not at all.
>    

There are libraries that provide notifications (or fire watches) when 
some value changes.  They're much less frequently used than events or 
semaphores, though.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-24 18:25                   ` Davide Libenzi
  2009-08-24 18:31                     ` Avi Kivity
@ 2009-08-24 21:49                     ` Michael S. Tsirkin
  2009-08-24 22:15                       ` Davide Libenzi
  1 sibling, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-24 21:49 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote:
> On Sun, 23 Aug 2009, Michael S. Tsirkin wrote:
> 
> > On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
> > > On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
> > >> More important here is realization that eventfd is a mutex/semaphore
> > >> implementation, not a generic event reporting interface as we are trying
> > >> to use it.
> > >>    
> > >
> > > Well it is a generic event reporting interface (for example, aio uses it).
> > 
> > Davide, I think it's a valid point.  For example, what read on eventfd
> > does (zero a counter and return) is not like any semaphore I saw.
> 
> 
> Indeed, the default eventfd behaviour is like, well, an event. Signaling 
> (kernel side) or writing (userspace side), signals the event.
> Waiting (reading) it, will reset the event.
> If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
> Events and sempahores are two widely known and used abstractions.
> The EFD_STATE proposed one, well, no. Not at all.

Hmm. All we try to do is, associate a small key with the event
that we signal. Is it really that uncommon/KVM specific?

> 
> 
> - Davide
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-24 18:31                     ` Avi Kivity
@ 2009-08-24 22:08                       ` Davide Libenzi
  2009-08-24 22:10                         ` Paolo Bonzini
  2009-08-25  4:26                         ` Avi Kivity
  0 siblings, 2 replies; 79+ messages in thread
From: Davide Libenzi @ 2009-08-24 22:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On Mon, 24 Aug 2009, Avi Kivity wrote:

> On 08/24/2009 09:25 PM, Davide Libenzi wrote:
> > Indeed, the default eventfd behaviour is like, well, an event. Signaling
> > (kernel side) or writing (userspace side), signals the event.
> > Waiting (reading) it, will reset the event.
> > If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
> > Events and sempahores are two widely known and used abstractions.
> > The EFD_STATE proposed one, well, no. Not at all.
> >    
> 
> There are libraries that provide notifications (or fire watches) when some
> value changes.  They're much less frequently used than events or semaphores,
> though.

There are userspace libraries that do almost everything, but you hardly 
see things like pthread_(EFD_STATE-like)_create() or similar system 
interfaces based on such abstraction.
Is that really difficult to understand where I'm standing, leaving the KVM 
hat off for a moment?



- Davide

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-24 22:08                       ` Davide Libenzi
@ 2009-08-24 22:10                         ` Paolo Bonzini
  2009-08-24 22:32                           ` Davide Libenzi
  2009-08-25  4:26                         ` Avi Kivity
  1 sibling, 1 reply; 79+ messages in thread
From: Paolo Bonzini @ 2009-08-24 22:10 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Avi Kivity, Michael S. Tsirkin, gleb, kvm,
	Linux Kernel Mailing List


> There are userspace libraries that do almost everything, but you hardly
> see things like pthread_(EFD_STATE-like)_create() or similar system
> interfaces based on such abstraction.

It actually seems as close to a condition variable as an eventfd can be.

Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-24 21:49                     ` Michael S. Tsirkin
@ 2009-08-24 22:15                       ` Davide Libenzi
  2009-08-25  7:22                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-24 22:15 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:

> On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote:
> > On Sun, 23 Aug 2009, Michael S. Tsirkin wrote:
> > 
> > > On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
> > > > On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
> > > >> More important here is realization that eventfd is a mutex/semaphore
> > > >> implementation, not a generic event reporting interface as we are trying
> > > >> to use it.
> > > >>    
> > > >
> > > > Well it is a generic event reporting interface (for example, aio uses it).
> > > 
> > > Davide, I think it's a valid point.  For example, what read on eventfd
> > > does (zero a counter and return) is not like any semaphore I saw.
> > 
> > 
> > Indeed, the default eventfd behaviour is like, well, an event. Signaling 
> > (kernel side) or writing (userspace side), signals the event.
> > Waiting (reading) it, will reset the event.
> > If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
> > Events and sempahores are two widely known and used abstractions.
> > The EFD_STATE proposed one, well, no. Not at all.
> 
> Hmm. All we try to do is, associate a small key with the event
> that we signal. Is it really that uncommon/KVM specific?

All I'm trying to do, is to avoid that eventfd will become an horrible 
multiplexor for every freaky one-time-use behaviors arising inside kernel 
modules.



- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-24 22:10                         ` Paolo Bonzini
@ 2009-08-24 22:32                           ` Davide Libenzi
  2009-08-25  6:59                             ` Paolo Bonzini
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-24 22:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Avi Kivity, Michael S. Tsirkin, gleb, kvm,
	Linux Kernel Mailing List

On Tue, 25 Aug 2009, Paolo Bonzini wrote:

> 
> > There are userspace libraries that do almost everything, but you hardly
> > see things like pthread_(EFD_STATE-like)_create() or similar system
> > interfaces based on such abstraction.
> 
> It actually seems as close to a condition variable as an eventfd can be.

A pthread condition typical code usage maps to eventfd like:

	while (read(efd, ...) > 0)
		if (CONDITION)
			break;

So a pthread condition is really a wakeup gate like eventfd is.
EFD_STATE has nothing to do with a pthread condition.



- Davide

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-24 22:08                       ` Davide Libenzi
  2009-08-24 22:10                         ` Paolo Bonzini
@ 2009-08-25  4:26                         ` Avi Kivity
  1 sibling, 0 replies; 79+ messages in thread
From: Avi Kivity @ 2009-08-25  4:26 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On 08/25/2009 01:08 AM, Davide Libenzi wrote:
> Is that really difficult to understand where I'm standing, leaving the KVM
> hat off for a moment?
>    

I understand it perfectly.  I take the same position with kvm.  I'm 
providing more data in the hope that you'll change you mind, not trying 
to flood you with email so you'll give up.

We can always create our eventfd-lookalike for kvm, but I'd rather not 
do that (other options include a userspace proxy through existing 
interfaces, it might even be better than changing eventfd if we decide 
performance for level-triggered interrupts is not critical).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-24 22:32                           ` Davide Libenzi
@ 2009-08-25  6:59                             ` Paolo Bonzini
  0 siblings, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2009-08-25  6:59 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Avi Kivity, Michael S. Tsirkin, gleb, kvm,
	Linux Kernel Mailing List


>>> There are userspace libraries that do almost everything, but you hardly
>>> see things like pthread_(EFD_STATE-like)_create() or similar system
>>> interfaces based on such abstraction.
>>
>> It actually seems as close to a condition variable as an eventfd can be.
>
> A pthread condition typical code usage maps to eventfd like:
>
> 	while (read(efd, ...)>  0)
> 		if (CONDITION)
> 			break;
>
> So a pthread condition is really a wakeup gate like eventfd is.
> EFD_STATE has nothing to do with a pthread condition.

No, your code does not work for pthread_cond_broadcast (which arguably 
is the common case) because all the eventfd readers after the first 
would block.

Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-24 22:15                       ` Davide Libenzi
@ 2009-08-25  7:22                         ` Michael S. Tsirkin
  2009-08-25 21:57                           ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-25  7:22 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Mon, Aug 24, 2009 at 03:15:05PM -0700, Davide Libenzi wrote:
> On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:
> 
> > On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote:
> > > On Sun, 23 Aug 2009, Michael S. Tsirkin wrote:
> > > 
> > > > On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
> > > > > On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
> > > > >> More important here is realization that eventfd is a mutex/semaphore
> > > > >> implementation, not a generic event reporting interface as we are trying
> > > > >> to use it.
> > > > >>    
> > > > >
> > > > > Well it is a generic event reporting interface (for example, aio uses it).
> > > > 
> > > > Davide, I think it's a valid point.  For example, what read on eventfd
> > > > does (zero a counter and return) is not like any semaphore I saw.
> > > 
> > > 
> > > Indeed, the default eventfd behaviour is like, well, an event. Signaling 
> > > (kernel side) or writing (userspace side), signals the event.
> > > Waiting (reading) it, will reset the event.
> > > If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
> > > Events and sempahores are two widely known and used abstractions.
> > > The EFD_STATE proposed one, well, no. Not at all.
> > 
> > Hmm. All we try to do is, associate a small key with the event
> > that we signal. Is it really that uncommon/KVM specific?
> 
> All I'm trying to do, is to avoid that eventfd will become an horrible 
> multiplexor for every freaky one-time-use behaviors arising inside kernel 
> modules.

Yes, we don't want that. The best thing is to try to restate the problem
in a way that is generic, and then either solve or best use existing
solution. Right?

I thought I had that, but apparently not.  The reason I'm Cc-ing you is
not to try and spam you until you give up and accept the patch, it's
hoping that you see the pattern behind our usage, and help generalize
it.

If I understand it correctly, you believe this is not possible and so
any solution will have to be in KVM? Or maybe I didn't state the problem
clearly enough and should restate it?


> 
> 
> - Davide

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-25  7:22                         ` Michael S. Tsirkin
@ 2009-08-25 21:57                           ` Davide Libenzi
  2009-08-26 10:29                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-25 21:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:

> Yes, we don't want that. The best thing is to try to restate the problem
> in a way that is generic, and then either solve or best use existing
> solution. Right?
> 
> I thought I had that, but apparently not.  The reason I'm Cc-ing you is
> not to try and spam you until you give up and accept the patch, it's
> hoping that you see the pattern behind our usage, and help generalize
> it.
> 
> If I understand it correctly, you believe this is not possible and so
> any solution will have to be in KVM? Or maybe I didn't state the problem
> clearly enough and should restate it?

Please do.



- Davide

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-25 21:57                           ` Davide Libenzi
@ 2009-08-26 10:29                             ` Michael S. Tsirkin
  2009-08-26 10:41                               ` Avi Kivity
  2009-08-26 17:45                               ` Davide Libenzi
  0 siblings, 2 replies; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-26 10:29 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Tue, Aug 25, 2009 at 02:57:01PM -0700, Davide Libenzi wrote:
> On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:
> 
> > Yes, we don't want that. The best thing is to try to restate the problem
> > in a way that is generic, and then either solve or best use existing
> > solution. Right?
> > 
> > I thought I had that, but apparently not.  The reason I'm Cc-ing you is
> > not to try and spam you until you give up and accept the patch, it's
> > hoping that you see the pattern behind our usage, and help generalize
> > it.
> > 
> > If I understand it correctly, you believe this is not possible and so
> > any solution will have to be in KVM? Or maybe I didn't state the problem
> > clearly enough and should restate it?
> 
> Please do.
> 
> 
> 
> - Davide

Problem looks like this:

There are multiple processes (devices) where each has a condition
(interrupt line) which it has logic to determine is either true or
false.

A single other process (hypervisor) is interested in a condition
(interrupt level) which is a logical OR of all interrupt lines.
On changes, an interrupt level value needs to be read and copied to
guest virtual cpu.

We also want ability to replace some or all processes above by a kernel
components, with condition changes done potentially from hardware
interrupt context.

How we wanted to solve it with EFD_STATE: Share a separate eventfd
between each device and the hypervisor.  device sets state to either 0
or 1.  hypervisor polls all eventfds, reads interrupt line on changes,
calculates the interrupt level and updates guest.

Alternative solution: shared memory where each device writes interrupt
line value. This makes setup more complex (need to share around much more
than just an fd), and makes access from interrupt impossible unless we
lock the memory (and locking userspace memory introduces yet another set
of issues).

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 10:29                             ` Michael S. Tsirkin
@ 2009-08-26 10:41                               ` Avi Kivity
  2009-08-26 17:45                               ` Davide Libenzi
  1 sibling, 0 replies; 79+ messages in thread
From: Avi Kivity @ 2009-08-26 10:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On 08/26/2009 01:29 PM, Michael S. Tsirkin wrote:
> How we wanted to solve it with EFD_STATE: Share a separate eventfd
> between each device and the hypervisor.  device sets state to either 0
> or 1.  hypervisor polls all eventfds, reads interrupt line on changes,
> calculates the interrupt level and updates guest.
>
> Alternative solution: shared memory where each device writes interrupt
> line value. This makes setup more complex (need to share around much more
> than just an fd), and makes access from interrupt impossible unless we
> lock the memory (and locking userspace memory introduces yet another set
> of issues).
>    

For completeness:

If the device is implemented in the same process as the hypervisor, an 
eventfd isn't really needed, as there is an ioctl which performs the 
same operation.

An important class of device implementations is real devices that are 
assigned to a guest.  We would like to forward the interrupt directly 
from the host interrupt handler to qemu.  Currently, we have a 
kvm-specific interrupt handler that forwards the interrupt using 
kvm-specific interfaces.  We would like to use a generic interrupt 
handler implemented by uio, so we want a generic interrupt transfer 
mechanism.

uio already supports edge-triggered interrupts using an eventfd-like 
mechanism.  So it makes sense to extend uio to support real eventfds, 
and to make it also support level-triggered interrupts.

We can work around the lack of state eventfd by having userspace wait on 
whatever mechanism uio uses to make the interrupt state visible, and 
then use the ioctl mentioned above to inform the hypervisor of this 
state.  But it's faster and nicer to give both components an eventfd and 
let them communicate directly.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 10:29                             ` Michael S. Tsirkin
  2009-08-26 10:41                               ` Avi Kivity
@ 2009-08-26 17:45                               ` Davide Libenzi
  2009-08-26 18:58                                 ` Avi Kivity
  1 sibling, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-26 17:45 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Wed, 26 Aug 2009, Michael S. Tsirkin wrote:

> On Tue, Aug 25, 2009 at 02:57:01PM -0700, Davide Libenzi wrote:
> > On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:
> > 
> > > Yes, we don't want that. The best thing is to try to restate the problem
> > > in a way that is generic, and then either solve or best use existing
> > > solution. Right?
> > > 
> > > I thought I had that, but apparently not.  The reason I'm Cc-ing you is
> > > not to try and spam you until you give up and accept the patch, it's
> > > hoping that you see the pattern behind our usage, and help generalize
> > > it.
> > > 
> > > If I understand it correctly, you believe this is not possible and so
> > > any solution will have to be in KVM? Or maybe I didn't state the problem
> > > clearly enough and should restate it?
> > 
> > Please do.
> > 
> > 
> > 
> > - Davide
> 
> 
> Problem looks like this:
> 
> There are multiple processes (devices) where each has a condition
> (interrupt line) which it has logic to determine is either true or
> false.
> 
> A single other process (hypervisor) is interested in a condition
> (interrupt level) which is a logical OR of all interrupt lines.
> On changes, an interrupt level value needs to be read and copied to
> guest virtual cpu.
> 
> We also want ability to replace some or all processes above by a kernel
> components, with condition changes done potentially from hardware
> interrupt context.
> 
> 
> How we wanted to solve it with EFD_STATE: Share a separate eventfd
> between each device and the hypervisor.  device sets state to either 0
> or 1.  hypervisor polls all eventfds, reads interrupt line on changes,
> calculates the interrupt level and updates guest.
> 
> Alternative solution: shared memory where each device writes interrupt
> line value. This makes setup more complex (need to share around much more
> than just an fd), and makes access from interrupt impossible unless we
> lock the memory (and locking userspace memory introduces yet another set
> of issues).

OK, if I get it correctly, there is one eventfd signaler (the device), and 
one eventfd reader (the hypervisor), right?
Each hypervisor listens for multiple devices detecting state changes, and 
associating the eventfd "line" to the IRQ number by some configuration 
(ala PCI), right?



- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 17:45                               ` Davide Libenzi
@ 2009-08-26 18:58                                 ` Avi Kivity
  2009-08-26 19:13                                   ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2009-08-26 18:58 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On 08/26/2009 08:45 PM, Davide Libenzi wrote:
> OK, if I get it correctly, there is one eventfd signaler (the device), and
> one eventfd reader (the hypervisor), right?
> Each hypervisor listens for multiple devices detecting state changes, and
> associating the eventfd "line" to the IRQ number by some configuration
> (ala PCI), right?
>    

Yes.  The PCI stuff happens in userspace, all the hypervisor sees is 
"this eventfd is IRQ 10".  There may be multiple eventfds routed to one 
IRQ (corresponding to a shared IRQ line).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 18:58                                 ` Avi Kivity
@ 2009-08-26 19:13                                   ` Davide Libenzi
  2009-08-26 19:42                                     ` Avi Kivity
  2009-08-27  9:05                                     ` Paolo Bonzini
  0 siblings, 2 replies; 79+ messages in thread
From: Davide Libenzi @ 2009-08-26 19:13 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On Wed, 26 Aug 2009, Avi Kivity wrote:

> On 08/26/2009 08:45 PM, Davide Libenzi wrote:
> > OK, if I get it correctly, there is one eventfd signaler (the device), and
> > one eventfd reader (the hypervisor), right?
> > Each hypervisor listens for multiple devices detecting state changes, and
> > associating the eventfd "line" to the IRQ number by some configuration
> > (ala PCI), right?
> >    
> 
> Yes.  The PCI stuff happens in userspace, all the hypervisor sees is "this
> eventfd is IRQ 10".  There may be multiple eventfds routed to one IRQ
> (corresponding to a shared IRQ line).

Ok, so why not using the eventfd counter as state?
On the device side:

void write_state(int sfd, int state) {
	u64 cnt;

	/* Clear the current state, sfd is in non-blocking mode */
	read(sfd, &cnt, sizeof(cnt));
	/* Writes new state */
	cnt = 1 + !!state;
	write(sfd, &cnt, sizeof(cnt));
}


On the hypervisor side:

int read_state(int sfd) {
	u64 cnt;

	read(sfd, &cnt, sizeof(cnt));
	return state - 1;
}




- Davide

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 19:13                                   ` Davide Libenzi
@ 2009-08-26 19:42                                     ` Avi Kivity
  2009-08-26 19:44                                       ` Davide Libenzi
  2009-08-26 19:50                                       ` Gleb Natapov
  2009-08-27  9:05                                     ` Paolo Bonzini
  1 sibling, 2 replies; 79+ messages in thread
From: Avi Kivity @ 2009-08-26 19:42 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On 08/26/2009 10:13 PM, Davide Libenzi wrote:
> Ok, so why not using the eventfd counter as state?
> On the device side:
>
> void write_state(int sfd, int state) {
> 	u64 cnt;
>
> 	/* Clear the current state, sfd is in non-blocking mode */
> 	read(sfd,&cnt, sizeof(cnt));
> 	/* Writes new state */
> 	cnt = 1 + !!state;
> 	write(sfd,&cnt, sizeof(cnt));
> }
>
>
> On the hypervisor side:
>
> int read_state(int sfd) {
> 	u64 cnt;
>
> 	read(sfd,&cnt, sizeof(cnt));
> 	return state - 1;
> }
>
>    

Hadn't though of read+write as set.  While the 1+ is a little ugly, it's 
workable.

I see no kernel equivalent to read(), but that's easily done.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 19:42                                     ` Avi Kivity
@ 2009-08-26 19:44                                       ` Davide Libenzi
  2009-08-26 23:30                                         ` Davide Libenzi
  2009-08-26 19:50                                       ` Gleb Natapov
  1 sibling, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-26 19:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On Wed, 26 Aug 2009, Avi Kivity wrote:

> On 08/26/2009 10:13 PM, Davide Libenzi wrote:
> > Ok, so why not using the eventfd counter as state?
> > On the device side:
> > 
> > void write_state(int sfd, int state) {
> > 	u64 cnt;
> > 
> > 	/* Clear the current state, sfd is in non-blocking mode */
> > 	read(sfd,&cnt, sizeof(cnt));
> > 	/* Writes new state */
> > 	cnt = 1 + !!state;
> > 	write(sfd,&cnt, sizeof(cnt));
> > }
> > 
> > 
> > On the hypervisor side:
> > 
> > int read_state(int sfd) {
> > 	u64 cnt;
> > 
> > 	read(sfd,&cnt, sizeof(cnt));
> > 	return state - 1;
> > }
> > 
> >    
> 
> Hadn't though of read+write as set.  While the 1+ is a little ugly, it's
> workable.

Pick what you want, as long as it always writes something != 0 :)


> I see no kernel equivalent to read(), but that's easily done.

Adding an in-kernel read based on "ctx", that is no problem at all.



- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 19:42                                     ` Avi Kivity
  2009-08-26 19:44                                       ` Davide Libenzi
@ 2009-08-26 19:50                                       ` Gleb Natapov
  2009-08-26 20:04                                         ` Davide Libenzi
  1 sibling, 1 reply; 79+ messages in thread
From: Gleb Natapov @ 2009-08-26 19:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Davide Libenzi, Michael S. Tsirkin, kvm,
	Linux Kernel Mailing List

On Wed, Aug 26, 2009 at 10:42:05PM +0300, Avi Kivity wrote:
> On 08/26/2009 10:13 PM, Davide Libenzi wrote:
> >Ok, so why not using the eventfd counter as state?
> >On the device side:
> >
> >void write_state(int sfd, int state) {
> >	u64 cnt;
> >
> >	/* Clear the current state, sfd is in non-blocking mode */
> >	read(sfd,&cnt, sizeof(cnt));
> >	/* Writes new state */
> >	cnt = 1 + !!state;
> >	write(sfd,&cnt, sizeof(cnt));
> >}
> >
> >
> >On the hypervisor side:
> >
> >int read_state(int sfd) {
> >	u64 cnt;
> >
> >	read(sfd,&cnt, sizeof(cnt));
> >	return state - 1;
> >}
> >
> 
> Hadn't though of read+write as set.  While the 1+ is a little ugly,
> it's workable.
> 
It's two system calls instead of one to inject interrupt.

> I see no kernel equivalent to read(), but that's easily done.
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 19:50                                       ` Gleb Natapov
@ 2009-08-26 20:04                                         ` Davide Libenzi
  2009-08-27  5:25                                           ` Gleb Natapov
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-26 20:04 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Avi Kivity, Michael S. Tsirkin, kvm, Linux Kernel Mailing List

On Wed, 26 Aug 2009, Gleb Natapov wrote:

> On Wed, Aug 26, 2009 at 10:42:05PM +0300, Avi Kivity wrote:
> > On 08/26/2009 10:13 PM, Davide Libenzi wrote:
> > >Ok, so why not using the eventfd counter as state?
> > >On the device side:
> > >
> > >void write_state(int sfd, int state) {
> > >	u64 cnt;
> > >
> > >	/* Clear the current state, sfd is in non-blocking mode */
> > >	read(sfd,&cnt, sizeof(cnt));
> > >	/* Writes new state */
> > >	cnt = 1 + !!state;
> > >	write(sfd,&cnt, sizeof(cnt));
> > >}
> > >
> > >
> > >On the hypervisor side:
> > >
> > >int read_state(int sfd) {
> > >	u64 cnt;
> > >
> > >	read(sfd,&cnt, sizeof(cnt));
> > >	return state - 1;
> > >}
> > >
> > 
> > Hadn't though of read+write as set.  While the 1+ is a little ugly,
> > it's workable.
> > 
> It's two system calls instead of one to inject interrupt.

I guess that's going to completely throw off-chart your RT performance, 
doesn't it?


- Davide

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 19:44                                       ` Davide Libenzi
@ 2009-08-26 23:30                                         ` Davide Libenzi
  2009-08-27  4:13                                           ` Avi Kivity
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-26 23:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On Wed, 26 Aug 2009, Davide Libenzi wrote:

> > I see no kernel equivalent to read(), but that's easily done.
> 
> Adding an in-kernel read based on "ctx", that is no problem at all.

Something like the untested below.
I had thought you said the eventfd readers where in userspace, but I might 
have misunderstood you.



- Davide



---
 fs/eventfd.c            |   51 +++++++++++++++++++++++++++++++++---------------
 include/linux/eventfd.h |    7 ++++++
 2 files changed, 43 insertions(+), 15 deletions(-)

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c	2009-08-26 15:58:03.000000000 -0700
+++ linux-2.6.mod/fs/eventfd.c	2009-08-26 16:20:03.000000000 -0700
@@ -130,26 +130,33 @@ static unsigned int eventfd_poll(struct 
 	return events;
 }
 
-static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
-			    loff_t *ppos)
+/**
+ * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
+ * @ctx: [in] Pointer to eventfd context.
+ * @no_wait: [in] Different from zero if the operation should not block.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN      : The operation would have blocked but @no_wait was nonzero.
+ * -ERESTARTSYS : A signal interrupted the wait operation.
+ */
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
 {
-	struct eventfd_ctx *ctx = file->private_data;
 	ssize_t res;
-	__u64 ucnt = 0;
 	DECLARE_WAITQUEUE(wait, current);
 
-	if (count < sizeof(ucnt))
-		return -EINVAL;
 	spin_lock_irq(&ctx->wqh.lock);
+	*cnt = 0;
 	res = -EAGAIN;
 	if (ctx->count > 0)
-		res = sizeof(ucnt);
-	else if (!(file->f_flags & O_NONBLOCK)) {
+		res = 0;
+	else if (!no_wait) {
 		__add_wait_queue(&ctx->wqh, &wait);
-		for (res = 0;;) {
+		for (;;) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (ctx->count > 0) {
-				res = sizeof(ucnt);
+				res = 0;
 				break;
 			}
 			if (signal_pending(current)) {
@@ -163,18 +170,32 @@ static ssize_t eventfd_read(struct file 
 		__remove_wait_queue(&ctx->wqh, &wait);
 		__set_current_state(TASK_RUNNING);
 	}
-	if (likely(res > 0)) {
-		ucnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
-		ctx->count -= ucnt;
+	if (likely(res == 0)) {
+		*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
+		ctx->count -= *cnt;
 		if (waitqueue_active(&ctx->wqh))
 			wake_up_locked_poll(&ctx->wqh, POLLOUT);
 	}
 	spin_unlock_irq(&ctx->wqh.lock);
-	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
-		return -EFAULT;
 
 	return res;
 }
+EXPORT_SYMBOL_GPL(eventfd_ctx_read);
+
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+			    loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 cnt;
+
+	if (count < sizeof(cnt))
+		return -EINVAL;
+	if ((res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt)) < 0)
+		return res;
+
+	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT: sizeof(cnt);
+}
 
 static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
 			     loff_t *ppos)
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h	2009-08-26 15:58:03.000000000 -0700
+++ linux-2.6.mod/include/linux/eventfd.h	2009-08-26 16:17:01.000000000 -0700
@@ -33,6 +33,7 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 int eventfd_signal(struct eventfd_ctx *ctx, int n);
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
 
 #else /* CONFIG_EVENTFD */
 
@@ -55,6 +56,12 @@ static inline void eventfd_ctx_put(struc
 
 }
 
+static inline ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait,
+				       __u64 *cnt)
+{
+	return -ENOSYS;
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 23:30                                         ` Davide Libenzi
@ 2009-08-27  4:13                                           ` Avi Kivity
  2009-08-27  8:06                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Avi Kivity @ 2009-08-27  4:13 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Michael S. Tsirkin, gleb, kvm, Linux Kernel Mailing List

On 08/27/2009 02:30 AM, Davide Libenzi wrote:
> On Wed, 26 Aug 2009, Davide Libenzi wrote:
>
>    
>>> I see no kernel equivalent to read(), but that's easily done.
>>>        
>> Adding an in-kernel read based on "ctx", that is no problem at all.
>>      
> Something like the untested below.
> I had thought you said the eventfd readers where in userspace, but I might
> have misunderstood you.
>    

No, they're all over the place.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 20:04                                         ` Davide Libenzi
@ 2009-08-27  5:25                                           ` Gleb Natapov
  0 siblings, 0 replies; 79+ messages in thread
From: Gleb Natapov @ 2009-08-27  5:25 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Avi Kivity, Michael S. Tsirkin, kvm, Linux Kernel Mailing List

On Wed, Aug 26, 2009 at 01:04:09PM -0700, Davide Libenzi wrote:
> On Wed, 26 Aug 2009, Gleb Natapov wrote:
> 
> > On Wed, Aug 26, 2009 at 10:42:05PM +0300, Avi Kivity wrote:
> > > On 08/26/2009 10:13 PM, Davide Libenzi wrote:
> > > >Ok, so why not using the eventfd counter as state?
> > > >On the device side:
> > > >
> > > >void write_state(int sfd, int state) {
> > > >	u64 cnt;
> > > >
> > > >	/* Clear the current state, sfd is in non-blocking mode */
> > > >	read(sfd,&cnt, sizeof(cnt));
> > > >	/* Writes new state */
> > > >	cnt = 1 + !!state;
> > > >	write(sfd,&cnt, sizeof(cnt));
> > > >}
> > > >
> > > >
> > > >On the hypervisor side:
> > > >
> > > >int read_state(int sfd) {
> > > >	u64 cnt;
> > > >
> > > >	read(sfd,&cnt, sizeof(cnt));
> > > >	return state - 1;
> > > >}
> > > >
> > > 
> > > Hadn't though of read+write as set.  While the 1+ is a little ugly,
> > > it's workable.
> > > 
> > It's two system calls instead of one to inject interrupt.
> 
> I guess that's going to completely throw off-chart your RT performance, 
> doesn't it?
> 
Do you consider interrupt injection path not worth of optimizing?

--
			Gleb.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-27  4:13                                           ` Avi Kivity
@ 2009-08-27  8:06                                             ` Michael S. Tsirkin
  2009-08-27 14:20                                               ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-27  8:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Davide Libenzi, gleb, kvm, Linux Kernel Mailing List

On Thu, Aug 27, 2009 at 07:13:32AM +0300, Avi Kivity wrote:
> On 08/27/2009 02:30 AM, Davide Libenzi wrote:
>> On Wed, 26 Aug 2009, Davide Libenzi wrote:
>>
>>    
>>>> I see no kernel equivalent to read(), but that's easily done.
>>>>        
>>> Adding an in-kernel read based on "ctx", that is no problem at all.
>>>      
>> Something like the untested below.
>> I had thought you said the eventfd readers where in userspace, but I might
>> have misunderstood you.
>>    
>
> No, they're all over the place.

Further, with Davide's proposal you must be a reader to signal events.

> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-26 19:13                                   ` Davide Libenzi
  2009-08-26 19:42                                     ` Avi Kivity
@ 2009-08-27  9:05                                     ` Paolo Bonzini
  2009-08-27  9:09                                       ` Michael S. Tsirkin
  2009-08-27 14:21                                       ` Davide Libenzi
  1 sibling, 2 replies; 79+ messages in thread
From: Paolo Bonzini @ 2009-08-27  9:05 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, Michael S. Tsirkin, gleb, kvm, linux-kernel

> Ok, so why not using the eventfd counter as state?
> On the device side:
> 
> void write_state(int sfd, int state) {
> 	u64 cnt;
> 
> 	/* Clear the current state, sfd is in non-blocking mode */
> 	read(sfd,&cnt, sizeof(cnt));
> 	/* Writes new state */
> 	cnt = 1 + !!state;
> 	write(sfd,&cnt, sizeof(cnt));
> }

It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
added exactly to avoid a read+write combination for the case of
decrementing a value.  Here it's the same, just it's about the case of
writing a *given* value.  What about having EFD_STATE simply mean "do
not use a counter, just write the value" without affecting the way
read works, and use

	/* Writes new state */
	cnt = 1 + !!state;
	write(sfd,&cnt, sizeof(cnt));

See below?

Paolo

> On the hypervisor side:
> 
> int read_state(int sfd) {
> 	u64 cnt;
> 
> 	read(sfd,&cnt, sizeof(cnt));
> 	return state - 1;
> }


------------- 8<-- ---------------
Subject: [PATCH] eventfd: new EFD_ABSOLUTE flag

This implements a new EFD_ABSOLUTE flag for eventfd.
This changes eventfd behaviour so that write simply
stores the value written, and is always non-blocking.

Untested (I just modified Michael's patch, and given
the simpler code I'm not sure it's now worthwhile
introducing the inline functions), but otherwise

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 fs/eventfd.c            |   13 ++++++++-----
 include/linux/eventfd.h |    4 ++--
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 347a0e0..7b279e3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -31,8 +31,6 @@
 
 static inline int eventfd_writeable(struct eventfd_ctx *ctx, u64 n)
 {
-	return ULLONG_MAX - n > ctx->count;
-	return (ctx->flags & EFD_ABSOLUTE) || (ULLONG_MAX - n > ctx->count);
 }
 
 static inline int eventfd_overflow(struct eventfd_ctx *ctx, u64 cnt)
@@ -42,10 +40,14 @@
 
 static inline void eventfd_dowrite(struct eventfd_ctx *ctx, u64 ucnt)
 {
-	if (eventfd_writeable(ctx, ucnt))
-		ucnt = ULLONG_MAX - ctx->count;
+	if (ctx->flags & EFD_ABSOLUTE)
+		ctx->count = ucnt;
+	else {
+		if (ULLONG_MAX - ctx->count < ucnt)
+			ucnt = ULLONG_MAX - ctx->count;
 
-	ctx->count += ucnt;
+		ctx->count += ucnt;
+	}
 }
 
 static inline u64 eventfd_doread(struct eventfd_ctx *ctx)
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 3b85ba6..78ff649 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -19,11 +19,12 @@
  * shared O_* flags.
  */
 #define EFD_SEMAPHORE (1 << 0)
+#define EFD_ABSOLUTE (1 << 1)
 #define EFD_CLOEXEC O_CLOEXEC
 #define EFD_NONBLOCK O_NONBLOCK
 
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_ABSOLUTE)
 
 #ifdef CONFIG_EVENTFD
--
1.6.2.5 

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-27  9:05                                     ` Paolo Bonzini
@ 2009-08-27  9:09                                       ` Michael S. Tsirkin
  2009-08-27 14:21                                       ` Davide Libenzi
  1 sibling, 0 replies; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-27  9:09 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Davide Libenzi, Avi Kivity, gleb, kvm, linux-kernel

On Thu, Aug 27, 2009 at 11:05:30AM +0200, Paolo Bonzini wrote:
> > Ok, so why not using the eventfd counter as state?
> > On the device side:
> > 
> > void write_state(int sfd, int state) {
> > 	u64 cnt;
> > 
> > 	/* Clear the current state, sfd is in non-blocking mode */
> > 	read(sfd,&cnt, sizeof(cnt));
> > 	/* Writes new state */
> > 	cnt = 1 + !!state;
> > 	write(sfd,&cnt, sizeof(cnt));
> > }
> 
> It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
> added exactly to avoid a read+write combination for the case of
> decrementing a value.  Here it's the same, just it's about the case of
> writing a *given* value.  What about having EFD_STATE simply mean "do
> not use a counter, just write the value" without affecting the way
> read works, and use
> 
> 	/* Writes new state */
> 	cnt = 1 + !!state;
> 	write(sfd,&cnt, sizeof(cnt));

That would work for kvm.

> See below?
> 
> Paolo
> 
> > On the hypervisor side:
> > 
> > int read_state(int sfd) {
> > 	u64 cnt;
> > 
> > 	read(sfd,&cnt, sizeof(cnt));
> > 	return state - 1;
> > }
> 
> 
> ------------- 8<-- ---------------
> Subject: [PATCH] eventfd: new EFD_ABSOLUTE flag
> 
> This implements a new EFD_ABSOLUTE flag for eventfd.
> This changes eventfd behaviour so that write simply
> stores the value written, and is always non-blocking.
> 
> Untested (I just modified Michael's patch, and given
> the simpler code I'm not sure it's now worthwhile
> introducing the inline functions), but otherwise
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  fs/eventfd.c            |   13 ++++++++-----
>  include/linux/eventfd.h |    4 ++--
>  2 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index 347a0e0..7b279e3 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -31,8 +31,6 @@
>  
>  static inline int eventfd_writeable(struct eventfd_ctx *ctx, u64 n)
>  {
> -	return ULLONG_MAX - n > ctx->count;
> -	return (ctx->flags & EFD_ABSOLUTE) || (ULLONG_MAX - n > ctx->count);
>  }
>  
>  static inline int eventfd_overflow(struct eventfd_ctx *ctx, u64 cnt)
> @@ -42,10 +40,14 @@
>  
>  static inline void eventfd_dowrite(struct eventfd_ctx *ctx, u64 ucnt)
>  {
> -	if (eventfd_writeable(ctx, ucnt))
> -		ucnt = ULLONG_MAX - ctx->count;
> +	if (ctx->flags & EFD_ABSOLUTE)
> +		ctx->count = ucnt;
> +	else {
> +		if (ULLONG_MAX - ctx->count < ucnt)
> +			ucnt = ULLONG_MAX - ctx->count;
>  
> -	ctx->count += ucnt;
> +		ctx->count += ucnt;
> +	}
>  }
>  
>  static inline u64 eventfd_doread(struct eventfd_ctx *ctx)
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index 3b85ba6..78ff649 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -19,11 +19,12 @@
>   * shared O_* flags.
>   */
>  #define EFD_SEMAPHORE (1 << 0)
> +#define EFD_ABSOLUTE (1 << 1)
>  #define EFD_CLOEXEC O_CLOEXEC
>  #define EFD_NONBLOCK O_NONBLOCK
>  
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_ABSOLUTE)
>  
>  #ifdef CONFIG_EVENTFD
> --
> 1.6.2.5 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-27  8:06                                             ` Michael S. Tsirkin
@ 2009-08-27 14:20                                               ` Davide Libenzi
  0 siblings, 0 replies; 79+ messages in thread
From: Davide Libenzi @ 2009-08-27 14:20 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:

> On Thu, Aug 27, 2009 at 07:13:32AM +0300, Avi Kivity wrote:
> > On 08/27/2009 02:30 AM, Davide Libenzi wrote:
> >> On Wed, 26 Aug 2009, Davide Libenzi wrote:
> >>
> >>    
> >>>> I see no kernel equivalent to read(), but that's easily done.
> >>>>        
> >>> Adding an in-kernel read based on "ctx", that is no problem at all.
> >>>      
> >> Something like the untested below.
> >> I had thought you said the eventfd readers where in userspace, but I might
> >> have misunderstood you.
> >>    
> >
> > No, they're all over the place.
> 
> Further, with Davide's proposal you must be a reader to signal events.

To signal events, you must have an instance of "ctx" (with the new 
exposed eventfd_ctx_read). How would you do otherwise?


- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-27  9:05                                     ` Paolo Bonzini
  2009-08-27  9:09                                       ` Michael S. Tsirkin
@ 2009-08-27 14:21                                       ` Davide Libenzi
  2009-08-27 14:30                                         ` Michael S. Tsirkin
  1 sibling, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-27 14:21 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Avi Kivity, Michael S. Tsirkin, gleb, kvm,
	Linux Kernel Mailing List

On Thu, 27 Aug 2009, Paolo Bonzini wrote:

> > Ok, so why not using the eventfd counter as state?
> > On the device side:
> > 
> > void write_state(int sfd, int state) {
> > 	u64 cnt;
> > 
> > 	/* Clear the current state, sfd is in non-blocking mode */
> > 	read(sfd,&cnt, sizeof(cnt));
> > 	/* Writes new state */
> > 	cnt = 1 + !!state;
> > 	write(sfd,&cnt, sizeof(cnt));
> > }
> 
> It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
> added exactly to avoid a read+write combination for the case of
> decrementing a value.

Like I repeated 25 times already, EFD_SEMAPHORE was added, because a 
*semaphore* is a pretty widely known and used abstraction.


- Davide

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-27 14:21                                       ` Davide Libenzi
@ 2009-08-27 14:30                                         ` Michael S. Tsirkin
  2009-08-27 14:38                                           ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-27 14:30 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Paolo Bonzini, Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Thu, Aug 27, 2009 at 07:21:49AM -0700, Davide Libenzi wrote:
> On Thu, 27 Aug 2009, Paolo Bonzini wrote:
> 
> > > Ok, so why not using the eventfd counter as state?
> > > On the device side:
> > > 
> > > void write_state(int sfd, int state) {
> > > 	u64 cnt;
> > > 
> > > 	/* Clear the current state, sfd is in non-blocking mode */
> > > 	read(sfd,&cnt, sizeof(cnt));
> > > 	/* Writes new state */
> > > 	cnt = 1 + !!state;
> > > 	write(sfd,&cnt, sizeof(cnt));
> > > }
> > 
> > It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
> > added exactly to avoid a read+write combination for the case of
> > decrementing a value.
> 
> Like I repeated 25 times already, EFD_SEMAPHORE was added, because a 
> *semaphore* is a pretty widely known and used abstraction.

what about an atomic variable, btw?  does it make sense to implement
write that does compare and exchange?

> 
> - Davide
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-27 14:30                                         ` Michael S. Tsirkin
@ 2009-08-27 14:38                                           ` Davide Libenzi
  2009-08-27 14:49                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-27 14:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:

> On Thu, Aug 27, 2009 at 07:21:49AM -0700, Davide Libenzi wrote:
> > On Thu, 27 Aug 2009, Paolo Bonzini wrote:
> > 
> > > > Ok, so why not using the eventfd counter as state?
> > > > On the device side:
> > > > 
> > > > void write_state(int sfd, int state) {
> > > > 	u64 cnt;
> > > > 
> > > > 	/* Clear the current state, sfd is in non-blocking mode */
> > > > 	read(sfd,&cnt, sizeof(cnt));
> > > > 	/* Writes new state */
> > > > 	cnt = 1 + !!state;
> > > > 	write(sfd,&cnt, sizeof(cnt));
> > > > }
> > > 
> > > It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
> > > added exactly to avoid a read+write combination for the case of
> > > decrementing a value.
> > 
> > Like I repeated 25 times already, EFD_SEMAPHORE was added, because a 
> > *semaphore* is a pretty widely known and used abstraction.
> 
> what about an atomic variable, btw?  does it make sense to implement
> write that does compare and exchange?

It is surprising to me, that is front of a workable solution w/out any 
use-once additions, yet you want to try to add optimizations and new 
ad-hoc abstractions to user visible interfaces.
Now, you tell me what an atomic variable has to do with an eventfd.


- Davide

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-27 14:38                                           ` Davide Libenzi
@ 2009-08-27 14:49                                             ` Michael S. Tsirkin
  2009-08-27 15:29                                               ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2009-08-27 14:49 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Paolo Bonzini, Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Thu, Aug 27, 2009 at 07:38:46AM -0700, Davide Libenzi wrote:
> On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:
> 
> > On Thu, Aug 27, 2009 at 07:21:49AM -0700, Davide Libenzi wrote:
> > > On Thu, 27 Aug 2009, Paolo Bonzini wrote:
> > > 
> > > > > Ok, so why not using the eventfd counter as state?
> > > > > On the device side:
> > > > > 
> > > > > void write_state(int sfd, int state) {
> > > > > 	u64 cnt;
> > > > > 
> > > > > 	/* Clear the current state, sfd is in non-blocking mode */
> > > > > 	read(sfd,&cnt, sizeof(cnt));
> > > > > 	/* Writes new state */
> > > > > 	cnt = 1 + !!state;
> > > > > 	write(sfd,&cnt, sizeof(cnt));
> > > > > }
> > > > 
> > > > It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
> > > > added exactly to avoid a read+write combination for the case of
> > > > decrementing a value.
> > > 
> > > Like I repeated 25 times already, EFD_SEMAPHORE was added, because a 
> > > *semaphore* is a pretty widely known and used abstraction.
> > 
> > what about an atomic variable, btw?  does it make sense to implement
> > write that does compare and exchange?
> 
> It is surprising to me, that is front of a workable solution w/out any 
> use-once additions, yet you want to try to add optimizations and new 
> ad-hoc abstractions to user visible interfaces.
> Now, you tell me what an atomic variable has to do with an eventfd.
> 
> 
> - Davide
> 

Oh, I stopped pushing EFD_STATE since we have a solution.
I am just trying to grok what does and what does not consititute a
use-once addition, in your mind, and what does and what does not
belong in eventfd. The reason atomic does not belong there and
semaphore does is because one waits on semaphore but not
on atomic? Is that it?

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-27 14:49                                             ` Michael S. Tsirkin
@ 2009-08-27 15:29                                               ` Davide Libenzi
  2009-08-27 17:09                                                 ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-27 15:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:

> Oh, I stopped pushing EFD_STATE since we have a solution.
> I am just trying to grok what does and what does not consititute a
> use-once addition, in your mind, and what does and what does not
> belong in eventfd. The reason atomic does not belong there and
> semaphore does is because one waits on semaphore but not
> on atomic? Is that it?

An atomic variable is not a synchronization interface.
Even if it'd come up that we really need an atomic variable abstraction 
via an fd, this should not go via eventfd.
Yeah, maybe the name syncfd would have been better, but the very reason 
why eventfd born was to allow KAIO to signal events to poll/select/epoll.
That first implementation was also usable as a mutex.
Then it was propsed to make eventfd to act as a semaphore too. Code was 
trivial, a semaphore fitted the interface, and the abstraction was a 
pretty damn known too. So it went in.
Yes, you could have implemented a semaphore with the existing eventfd, by 
reading the counter and writing counter-1. But again, a semaphore was 
something widely known and generic enough, and was fitting the bill.

- Davide

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2009-08-27 15:29                                               ` Davide Libenzi
@ 2009-08-27 17:09                                                 ` Davide Libenzi
       [not found]                                                   ` <alpine.DEB.2.00.0908311644410.17349@makko.or.mcafeemobile.com>
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2009-08-27 17:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Avi Kivity, gleb, kvm, Linux Kernel Mailing List

On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:

> Oh, I stopped pushing EFD_STATE since we have a solution.

Do you guys need the kernel-side eventfd_ctx_read() I posted or not?
Because if nobody uses it, I'm not going to push it.


- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
       [not found]                                                       ` <alpine.DEB.2.00.0909010723380.28172@makko.or.mcafeemobile.com>
@ 2010-01-06 19:33                                                         ` Michael S. Tsirkin
  2010-01-06 20:43                                                           ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-06 19:33 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Tue, Sep 01, 2009 at 07:24:24AM -0700, Davide Libenzi wrote:
> On Tue, 1 Sep 2009, Avi Kivity wrote:
> 
> > On 09/01/2009 02:45 AM, Davide Libenzi wrote:
> > > On Thu, 27 Aug 2009, Davide Libenzi wrote:
> > > 
> > >    
> > > > On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:
> > > > 
> > > >      
> > > > > Oh, I stopped pushing EFD_STATE since we have a solution.
> > > > >        
> > > > Do you guys need the kernel-side eventfd_ctx_read() I posted or not?
> > > > Because if nobody uses it, I'm not going to push it.
> > > >      
> > > Guys, I did not get a reply on this. Do you need me to push it, or you're
> > > not going to use it at the end?
> > >    
> > 
> > We'll use it eventually, but we're still some ways from it.
> 
> OK, then bug me when you're going to need it. I won't push it before that.
> 
> 
> - Davide

So, it turns out that we need this: be thought we don't because
currently kvm does not zero eventfd counter when it polls eventfd.  But
this causes spurious interrupts when we disconnect irqfd from kvm and
re-connect it back.

However, since kvm does its own thing with the wait queue, and might
read the counter from wait queue callback (which might be from
interrupt context), a simpler, lower-level interface would be better for
us.  Does the following (build tested only) look palatable?

Thanks!


diff --git a/fs/eventfd.c b/fs/eventfd.c
index d26402f..e350ffd 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -135,6 +135,17 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
 	return events;
 }
 
+/* Caller must have wait queue head lock. */
+ssize_t _eventfd_read_ctx(struct eventfd_ctx *ctx, u64 *ucnt)
+{
+	if (!ctx->count)
+		return -EAGAIN;
+	*ucnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
+	ctx->count -= *ucnt;
+	return sizeof *ucnt;
+}
+EXPORT_SYMBOL_GPL(_eventfd_read_ctx);
+
 static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
 			    loff_t *ppos)
 {
@@ -146,17 +157,14 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
 	if (count < sizeof(ucnt))
 		return -EINVAL;
 	spin_lock_irq(&ctx->wqh.lock);
-	res = -EAGAIN;
-	if (ctx->count > 0)
-		res = sizeof(ucnt);
-	else if (!(file->f_flags & O_NONBLOCK)) {
+	res = _eventfd_read_ctx(ctx, &ucnt);
+	if (res < 0 && !(file->f_flags & O_NONBLOCK)) {
 		__add_wait_queue(&ctx->wqh, &wait);
 		for (res = 0;;) {
 			set_current_state(TASK_INTERRUPTIBLE);
-			if (ctx->count > 0) {
-				res = sizeof(ucnt);
+			res = _eventfd_read_ctx(ctx, &ucnt);
+			if (res > 0)
 				break;
-			}
 			if (signal_pending(current)) {
 				res = -ERESTARTSYS;
 				break;
@@ -169,8 +177,6 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
 		__set_current_state(TASK_RUNNING);
 	}
 	if (likely(res > 0)) {
-		ucnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
-		ctx->count -= ucnt;
 		if (waitqueue_active(&ctx->wqh))
 			wake_up_locked_poll(&ctx->wqh, POLLOUT);
 	}
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 94dd103..a3d0ce9 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -34,6 +34,7 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 int eventfd_signal(struct eventfd_ctx *ctx, int n);
+ssize_t _eventfd_read_ctx(struct eventfd_ctx *ctx, u64 *ucnt);
 
 #else /* CONFIG_EVENTFD */
 
@@ -61,6 +62,11 @@ static inline void eventfd_ctx_put(struct eventfd_ctx *ctx)
 
 }
 
+static inline ssize_t _eventfd_read_ctx(struct eventfd_ctx *ctx, u64 *ucnt)
+{
+	return -ENOSYS;
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */


-- 
MST

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-06 19:33                                                         ` Michael S. Tsirkin
@ 2010-01-06 20:43                                                           ` Davide Libenzi
  2010-01-06 20:55                                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-06 20:43 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Wed, 6 Jan 2010, Michael S. Tsirkin wrote:

> On Tue, Sep 01, 2009 at 07:24:24AM -0700, Davide Libenzi wrote:
> > On Tue, 1 Sep 2009, Avi Kivity wrote:
> > 
> > > On 09/01/2009 02:45 AM, Davide Libenzi wrote:
> > > > On Thu, 27 Aug 2009, Davide Libenzi wrote:
> > > > 
> > > >    
> > > > > On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:
> > > > > 
> > > > >      
> > > > > > Oh, I stopped pushing EFD_STATE since we have a solution.
> > > > > >        
> > > > > Do you guys need the kernel-side eventfd_ctx_read() I posted or not?
> > > > > Because if nobody uses it, I'm not going to push it.
> > > > >      
> > > > Guys, I did not get a reply on this. Do you need me to push it, or you're
> > > > not going to use it at the end?
> > > >    
> > > 
> > > We'll use it eventually, but we're still some ways from it.
> > 
> > OK, then bug me when you're going to need it. I won't push it before that.
> > 
> > 
> > - Davide
> 
> So, it turns out that we need this: be thought we don't because
> currently kvm does not zero eventfd counter when it polls eventfd.  But
> this causes spurious interrupts when we disconnect irqfd from kvm and
> re-connect it back.
> 
> However, since kvm does its own thing with the wait queue, and might
> read the counter from wait queue callback (which might be from
> interrupt context), a simpler, lower-level interface would be better for
> us.  Does the following (build tested only) look palatable?

What is wrong with you? :/
There was a patch attached with this email, and yet you did your own, and 
yet again you managed to add an underscore at the beginning of the API 
name, in an API set where there are no leading underscores.
Even if KVM does that for its own reasons, you should be able to fit your 
naming "style" to the interface you're adding your droplets into.
I will post an eventfd_read_ctx() to Andrew ASAP.



- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-06 20:43                                                           ` Davide Libenzi
@ 2010-01-06 20:55                                                             ` Michael S. Tsirkin
  2010-01-06 21:17                                                               ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-06 20:55 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Wed, Jan 06, 2010 at 12:43:07PM -0800, Davide Libenzi wrote:
> On Wed, 6 Jan 2010, Michael S. Tsirkin wrote:
> 
> > On Tue, Sep 01, 2009 at 07:24:24AM -0700, Davide Libenzi wrote:
> > > On Tue, 1 Sep 2009, Avi Kivity wrote:
> > > 
> > > > On 09/01/2009 02:45 AM, Davide Libenzi wrote:
> > > > > On Thu, 27 Aug 2009, Davide Libenzi wrote:
> > > > > 
> > > > >    
> > > > > > On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:
> > > > > > 
> > > > > >      
> > > > > > > Oh, I stopped pushing EFD_STATE since we have a solution.
> > > > > > >        
> > > > > > Do you guys need the kernel-side eventfd_ctx_read() I posted or not?
> > > > > > Because if nobody uses it, I'm not going to push it.
> > > > > >      
> > > > > Guys, I did not get a reply on this. Do you need me to push it, or you're
> > > > > not going to use it at the end?
> > > > >    
> > > > 
> > > > We'll use it eventually, but we're still some ways from it.
> > > 
> > > OK, then bug me when you're going to need it. I won't push it before that.
> > > 
> > > 
> > > - Davide
> > 
> > So, it turns out that we need this: be thought we don't because
> > currently kvm does not zero eventfd counter when it polls eventfd.  But
> > this causes spurious interrupts when we disconnect irqfd from kvm and
> > re-connect it back.
> > 
> > However, since kvm does its own thing with the wait queue, and might
> > read the counter from wait queue callback (which might be from
> > interrupt context), a simpler, lower-level interface would be better for
> > us.  Does the following (build tested only) look palatable?
> 
> What is wrong with you? :/

I was trying to be helpful.

> There was a patch attached with this email, and yet you did your own,

I tried to explain, no?  That patch was taking wait queue spinlock and
was assuming that eventfd_read_ctx is called from a task that can block.
KVM attaches its own poller so this is not a good fit for us.  Hope this
clarifies.

> and yet again you managed to add an underscore at the beginning of the
> API name, in an API set where there are no leading underscores.  Even
> if KVM does that for its own reasons, you should be able to fit your
> naming "style" to the interface you're adding your droplets into.  I
> will post an eventfd_read_ctx() to Andrew ASAP.
> 
> 
> 
> - Davide

Sorry about the underscore - it's easy to fix but I won't bore you with
more patch revisions before I understand what makes you unhappy.

Before KVM starts creating workqueues and doing schedule_work calls just
to work around the API, can we discuss this please? I am not hung on my
patch, but could we have it work so that we can call eventfd_read_ctx
from wait queue callback directly?

Thanks!


-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-06 20:55                                                             ` Michael S. Tsirkin
@ 2010-01-06 21:17                                                               ` Davide Libenzi
  2010-01-06 22:29                                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-06 21:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Wed, 6 Jan 2010, Michael S. Tsirkin wrote:

> I tried to explain, no?  That patch was taking wait queue spinlock and
> was assuming that eventfd_read_ctx is called from a task that can block.
> KVM attaches its own poller so this is not a good fit for us.  Hope this
> clarifies.
> 
> > and yet again you managed to add an underscore at the beginning of the
> > API name, in an API set where there are no leading underscores.  Even
> > if KVM does that for its own reasons, you should be able to fit your
> > naming "style" to the interface you're adding your droplets into.  I
> > will post an eventfd_read_ctx() to Andrew ASAP.
> > 
> > 
> > 
> > - Davide
> 
> Sorry about the underscore - it's easy to fix but I won't bore you with
> more patch revisions before I understand what makes you unhappy.
> 
> Before KVM starts creating workqueues and doing schedule_work calls just
> to work around the API, can we discuss this please? I am not hung on my
> patch, but could we have it work so that we can call eventfd_read_ctx
> from wait queue callback directly?

The "read" needs to wake possible OUT waiters, *cannot* be done your way.
On top of creating an interface which requires a lock, which noone can get 
from the interface itself, since it's not exposed.
I could split the two and have a locked one, and an unlocked one, but that 
looks shitty too (for the above reason).
I thought you *already* do your own stuff from a work queue, why would you 
need to read from IRQ context?



- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-06 21:17                                                               ` Davide Libenzi
@ 2010-01-06 22:29                                                                 ` Michael S. Tsirkin
  2010-01-06 22:46                                                                   ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-06 22:29 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Wed, Jan 06, 2010 at 01:17:02PM -0800, Davide Libenzi wrote:
> On Wed, 6 Jan 2010, Michael S. Tsirkin wrote:
> 
> > I tried to explain, no?  That patch was taking wait queue spinlock and
> > was assuming that eventfd_read_ctx is called from a task that can block.
> > KVM attaches its own poller so this is not a good fit for us.  Hope this
> > clarifies.
> > 
> > > and yet again you managed to add an underscore at the beginning of the
> > > API name, in an API set where there are no leading underscores.  Even
> > > if KVM does that for its own reasons, you should be able to fit your
> > > naming "style" to the interface you're adding your droplets into.  I
> > > will post an eventfd_read_ctx() to Andrew ASAP.
> > > 
> > > 
> > > 
> > > - Davide
> > 
> > Sorry about the underscore - it's easy to fix but I won't bore you with
> > more patch revisions before I understand what makes you unhappy.
> > 
> > Before KVM starts creating workqueues and doing schedule_work calls just
> > to work around the API, can we discuss this please? I am not hung on my
> > patch, but could we have it work so that we can call eventfd_read_ctx
> > from wait queue callback directly?
> 
> The "read" needs to wake possible OUT waiters, *cannot* be done your way.

Right. Now I see. Thanks!

> On top of creating an interface which requires a lock, which noone can get 
> from the interface itself, since it's not exposed.

I think here's how KVM gets it: the way it does is by calling poll with
our own poll table, then in poll_queue_proc we get wait queue pointer,
and we use the wait queue. Lock is in there :)

> I could split the two and have a locked one, and an unlocked one, but that 
> looks shitty too (for the above reason).

Yes, this will work. Thanks!

> I thought you *already* do your own stuff from a work queue, why would you 
> need to read from IRQ context?
> 
> 
> 
> - Davide

This was just work around while interrupt delivery was using mutex
locks, so we had a workqueue to work around that limitation.  Now that
it has been all switched to RCU, we'll be getting getting rid of this.

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-06 22:29                                                                 ` Michael S. Tsirkin
@ 2010-01-06 22:46                                                                   ` Davide Libenzi
  2010-01-06 23:45                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-06 22:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

> > On top of creating an interface which requires a lock, which noone can get 
> > from the interface itself, since it's not exposed.
> 
> I think here's how KVM gets it: the way it does is by calling poll with
> our own poll table, then in poll_queue_proc we get wait queue pointer,
> and we use the wait queue. Lock is in there :)

Yes, I know you are called locked, but it does not lead to a clean 
interface.



> > I could split the two and have a locked one, and an unlocked one, but that 
> > looks shitty too (for the above reason).
> 
> Yes, this will work. Thanks!

This is a lot more complex than I thought. The wakeup code is already 
enumerating the list, and doing a wakeup might trigger a secondary 
enumeration/recursion.
Do you really need to "consume" the value from IRQ context, or you can 
simply "peek" the value, and flush it later?



- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-06 22:46                                                                   ` Davide Libenzi
@ 2010-01-06 23:45                                                                     ` Michael S. Tsirkin
  2010-01-06 23:59                                                                       ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-06 23:45 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> 
> > > On top of creating an interface which requires a lock, which noone can get 
> > > from the interface itself, since it's not exposed.
> > 
> > I think here's how KVM gets it: the way it does is by calling poll with
> > our own poll table, then in poll_queue_proc we get wait queue pointer,
> > and we use the wait queue. Lock is in there :)
> 
> Yes, I know you are called locked, but it does not lead to a clean 
> interface.

True.

> > > I could split the two and have a locked one, and an unlocked one, but that 
> > > looks shitty too (for the above reason).
> > 
> > Yes, this will work. Thanks!
> 
> This is a lot more complex than I thought. The wakeup code is already 
> enumerating the list, and doing a wakeup might trigger a secondary 
> enumeration/recursion.

For KVM what you describe is I think is not a problem: we check wake type
and ignore POLLOUT events.

> Do you really need to "consume" the value from IRQ context, or you can 
> simply "peek" the value, and flush it later?
> 
> 
> 
> - Davide

Maybe yes. I'll think it over and get back to you. Thanks!

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-06 23:45                                                                     ` Michael S. Tsirkin
@ 2010-01-06 23:59                                                                       ` Davide Libenzi
  2010-01-07  0:02                                                                         ` Michael S. Tsirkin
  2010-01-07  6:45                                                                         ` Michael S. Tsirkin
  0 siblings, 2 replies; 79+ messages in thread
From: Davide Libenzi @ 2010-01-06 23:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

> On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote:
> > On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> > 
> > > > On top of creating an interface which requires a lock, which noone can get 
> > > > from the interface itself, since it's not exposed.
> > > 
> > > I think here's how KVM gets it: the way it does is by calling poll with
> > > our own poll table, then in poll_queue_proc we get wait queue pointer,
> > > and we use the wait queue. Lock is in there :)
> > 
> > Yes, I know you are called locked, but it does not lead to a clean 
> > interface.
> 
> True.
> 
> > > > I could split the two and have a locked one, and an unlocked one, but that 
> > > > looks shitty too (for the above reason).
> > > 
> > > Yes, this will work. Thanks!
> > 
> > This is a lot more complex than I thought. The wakeup code is already 
> > enumerating the list, and doing a wakeup might trigger a secondary 
> > enumeration/recursion.
> 
> For KVM what you describe is I think is not a problem: we check wake type
> and ignore POLLOUT events.

You seem to think in one dimension only ;)
The interface needs to be stable for everyone.



> Maybe yes. I'll think it over and get back to you. Thanks!

Let me know.



- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-06 23:59                                                                       ` Davide Libenzi
@ 2010-01-07  0:02                                                                         ` Michael S. Tsirkin
  2010-01-07  6:45                                                                         ` Michael S. Tsirkin
  1 sibling, 0 replies; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-07  0:02 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Wed, Jan 06, 2010 at 03:59:11PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> 
> > On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote:
> > > On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> > > 
> > > > > On top of creating an interface which requires a lock, which noone can get 
> > > > > from the interface itself, since it's not exposed.
> > > > 
> > > > I think here's how KVM gets it: the way it does is by calling poll with
> > > > our own poll table, then in poll_queue_proc we get wait queue pointer,
> > > > and we use the wait queue. Lock is in there :)
> > > 
> > > Yes, I know you are called locked, but it does not lead to a clean 
> > > interface.
> > 
> > True.
> > 
> > > > > I could split the two and have a locked one, and an unlocked one, but that 
> > > > > looks shitty too (for the above reason).
> > > > 
> > > > Yes, this will work. Thanks!
> > > 
> > > This is a lot more complex than I thought. The wakeup code is already 
> > > enumerating the list, and doing a wakeup might trigger a secondary 
> > > enumeration/recursion.
> > 
> > For KVM what you describe is I think is not a problem: we check wake type
> > and ignore POLLOUT events.
> 
> You seem to think in one dimension only ;)
> The interface needs to be stable for everyone.
> 

I agree it's better to have interface that can't be misused.  I was just
pointing out that users *can* break recursion by looking at event type.

> > Maybe yes. I'll think it over and get back to you. Thanks!
> 
> Let me know.
> 
> 
> 
> - Davide
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-06 23:59                                                                       ` Davide Libenzi
  2010-01-07  0:02                                                                         ` Michael S. Tsirkin
@ 2010-01-07  6:45                                                                         ` Michael S. Tsirkin
  2010-01-07  7:25                                                                           ` Davide Libenzi
  1 sibling, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-07  6:45 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Wed, Jan 06, 2010 at 03:59:11PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> 
> > On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote:
> > > On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> > > 
> > > > > On top of creating an interface which requires a lock, which noone can get 
> > > > > from the interface itself, since it's not exposed.
> > > > 
> > > > I think here's how KVM gets it: the way it does is by calling poll with
> > > > our own poll table, then in poll_queue_proc we get wait queue pointer,
> > > > and we use the wait queue. Lock is in there :)
> > > 
> > > Yes, I know you are called locked, but it does not lead to a clean 
> > > interface.
> > 
> > True.
> > 
> > > > > I could split the two and have a locked one, and an unlocked one, but that 
> > > > > looks shitty too (for the above reason).
> > > > 
> > > > Yes, this will work. Thanks!
> > > 
> > > This is a lot more complex than I thought. The wakeup code is already 
> > > enumerating the list, and doing a wakeup might trigger a secondary 
> > > enumeration/recursion.
> > 
> > For KVM what you describe is I think is not a problem: we check wake type
> > and ignore POLLOUT events.
> 
> You seem to think in one dimension only ;)
> The interface needs to be stable for everyone.
> 
> 
> 
> > Maybe yes. I'll think it over and get back to you. Thanks!
> 
> Let me know.
> 
> 
> 
> - Davide

OK. What I think we need is a way to remove ourselves from the eventfd
wait queue and clear the counter atomically.

We currently do
        remove_wait_queue(irqfd->wqh, &irqfd->wait);
where wqh saves the eventfd wait queue head.

If we do this before proposed eventfd_read_ctx, we can lose events.
If we do this after, we can get spurious events.

An unlocked read is one way to fix this.

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-07  6:45                                                                         ` Michael S. Tsirkin
@ 2010-01-07  7:25                                                                           ` Davide Libenzi
  2010-01-07 10:36                                                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-07  7:25 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

> OK. What I think we need is a way to remove ourselves from the eventfd
> wait queue and clear the counter atomically.
> 
> We currently do
>         remove_wait_queue(irqfd->wqh, &irqfd->wait);
> where wqh saves the eventfd wait queue head.

You do a remove_wait_queue() from inside a callback wakeup on the same 
wait queue head?


> If we do this before proposed eventfd_read_ctx, we can lose events.
> If we do this after, we can get spurious events.
> 
> An unlocked read is one way to fix this.

You posted one line of code and a two lines analysis of the issue. Can you 
be a little bit more verbose and show me more code, so that I can actually 
see what is going on?


- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-07  7:25                                                                           ` Davide Libenzi
@ 2010-01-07 10:36                                                                             ` Michael S. Tsirkin
  2010-01-07 23:37                                                                               ` Davide Libenzi
                                                                                                 ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-07 10:36 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Wed, Jan 06, 2010 at 11:25:40PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> 
> > OK. What I think we need is a way to remove ourselves from the eventfd
> > wait queue and clear the counter atomically.
> > 
> > We currently do
> >         remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > where wqh saves the eventfd wait queue head.
> 
> You do a remove_wait_queue() from inside a callback wakeup on the same 
> wait queue head?
> 

No, not from callback, in ioctl context.

> > If we do this before proposed eventfd_read_ctx, we can lose events.
> > If we do this after, we can get spurious events.
> > 
> > An unlocked read is one way to fix this.
> 
> You posted one line of code and a two lines analysis of the issue. Can you 
> be a little bit more verbose and show me more code, so that I can actually 
> see what is going on?
> 
> 
> - Davide

Sure, I was trying to be as brief as possible, here's a detailed summary.

Description of the system (MSI emulation in KVM):

KVM supports an ioctl to assign/deassign an eventfd file to interrupt message
in guest OS.  When this eventfd is signalled, interrupt message is sent.
This assignment is done from qemu system emulator.

eventfd is signalled from device emulation in another thread in
userspace or from kernel, which talks with guest OS through another
eventfd and shared memory (possibility of out of process was discussed
but never got implemented yet).

Note: it's okay to delay messages from correctness point of view, but
generally this is latency-sensitive path. If multiple identical messages
are requested, it's okay to send a single last message, but missing a
message altogether causes deadlocks.  Sending a message when none were
requested might in theory cause crashes, in practice doing this causes
performance degradation.

Another KVM feature is interrupt masking: guest OS requests that we
stop sending some interrupt message, possibly modified mapping
and re-enables this message. This needs to be done without
involving the device that might keep requesting events:
while masked, message is marked "pending", and guest might test
the pending status.

We can implement masking in system emulator in userspace, by using
assign/deassign ioctls: when message is masked, we simply deassign all
eventfd, and when it is unmasked, we assign them back.

Here's some code to illustrate how this all works: assign/deassign code
in kernel looks like the following:

this is called to unmask interrupt

static int
kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
{
	struct _irqfd *irqfd, *tmp;
	struct file *file = NULL;
	struct eventfd_ctx *eventfd = NULL;
	int ret;
	unsigned int events;

	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);

...

	file = eventfd_fget(fd);
	if (IS_ERR(file)) {
		ret = PTR_ERR(file);
		goto fail;
	}

	eventfd = eventfd_ctx_fileget(file);
	if (IS_ERR(eventfd)) {
		ret = PTR_ERR(eventfd);
		goto fail;
	}

	irqfd->eventfd = eventfd;

	/*
	 * Install our own custom wake-up handling so we are notified via
	 * a callback whenever someone signals the underlying eventfd
	 */
	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);

	spin_lock_irq(&kvm->irqfds.lock);

	events = file->f_op->poll(file, &irqfd->pt);

	list_add_tail(&irqfd->list, &kvm->irqfds.items);
	spin_unlock_irq(&kvm->irqfds.lock);

A.
	/*
	 * Check if there was an event already pending on the eventfd
	 * before we registered, and trigger it as if we didn't miss it.
	 */
	if (events & POLLIN)
		schedule_work(&irqfd->inject);

	/*
	 * do not drop the file until the irqfd is fully initialized, otherwise
	 * we might race against the POLLHUP
	 */
	fput(file);

	return 0;

fail:
	...
}

This is called to mask interrupt

/*
 * shutdown any irqfd's that match fd+gsi
 */
static int
kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
{
	struct _irqfd *irqfd, *tmp;
	struct eventfd_ctx *eventfd;

	eventfd = eventfd_ctx_fdget(fd);
	if (IS_ERR(eventfd))
		return PTR_ERR(eventfd);

	spin_lock_irq(&kvm->irqfds.lock);

	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
		if (irqfd->eventfd == eventfd && irqfd->gsi == gsi)
			irqfd_deactivate(irqfd);
	}

	spin_unlock_irq(&kvm->irqfds.lock);
	eventfd_ctx_put(eventfd);

	/*
	 * Block until we know all outstanding shutdown jobs have completed
	 * so that we guarantee there will not be any more interrupts on this
	 * gsi once this deassign function returns.
	 */
	flush_workqueue(irqfd_cleanup_wq);

	return 0;
}

And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
so this is not under the spinlock):

        /*
         * Synchronize with the wait-queue and unhook ourselves to
         * prevent
         * further events.
         */
B.
        remove_wait_queue(irqfd->wqh, &irqfd->wait);

	....

        /*
         * It is now safe to release the object's resources
         */
        eventfd_ctx_put(irqfd->eventfd);
        kfree(irqfd);

The problems (really the same bug) in KVM that I am trying to fix:
1. Because of A above, if event was requested while message was masked,
   we will not miss a message. However, because we never clear
   the counter, so we currently get a spurious message each time
   we unmask.

   We should clear the counter either each time we
   deliver message, or at least when message is masked.

2. We currently always return pending status 0 to guests,
   but for strict spec compliance we must do it correctly
   and report pending message if one was requested while
   it was masked.

   An obvious way to do this is to do nonblocking read on the eventfd.
   Since this clears the counter, we will have to write the value
   back. But again, this requires that all messages that were
   sent before mask are cleared from the counter, otherwise
   we will always report pending messages.

So, how should I clear the counter?

I started with the idea of clearing it in irqfd_wakeup.  However, as you
point out this will cause another list scan, so it's messy.  It
currently uses a workqueue to work around KVM locking issue, but the
issue is fixed so we will stop doing this soon, so can not rely on that.

Thus I came up with the idea to do this on unmask:
this is slow path and can always be done from workqueue or ioctl
context. So I would add eventfd_read_ctx in code at point B:

		eventfd_read_ctx(irqfd->eventfd, &ucnt);
	->
		remove_wait_queue(irqfd->wqh, &irqfd->wait);

now, if device signals eventfd at point marked by ->,
it will be sent but counter not cleared,
so we will get spurious message when we unmask.

Or if I do it the other way:
		remove_wait_queue(irqfd->wqh, &irqfd->wait);
	->
		eventfd_read_ctx(irqfd->eventfd, &ucnt);

now, if device signals eventfd at point marked by ->,
it will not be sent but counter will be cleared,
so we will loose a message.

There does not appear to be a way to avoid this race unless we both
remove from wait queue and clear counter under the wait queue lock, just
like eventfd_read currently does.

I hope this clarifies.

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-07 10:36                                                                             ` Michael S. Tsirkin
@ 2010-01-07 23:37                                                                               ` Davide Libenzi
  2010-01-08  0:13                                                                                 ` Davide Libenzi
  2010-01-08  0:26                                                                               ` Davide Libenzi
  2010-01-11  9:01                                                                               ` Gleb Natapov
  2 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-07 23:37 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

> Sure, I was trying to be as brief as possible, here's a detailed summary.
> 
> Description of the system (MSI emulation in KVM):
> 
> KVM supports an ioctl to assign/deassign an eventfd file to interrupt message
> in guest OS.  When this eventfd is signalled, interrupt message is sent.
> This assignment is done from qemu system emulator.
> 
> eventfd is signalled from device emulation in another thread in
> userspace or from kernel, which talks with guest OS through another
> eventfd and shared memory (possibility of out of process was discussed
> but never got implemented yet).
> 
> Note: it's okay to delay messages from correctness point of view, but
> generally this is latency-sensitive path. If multiple identical messages
> are requested, it's okay to send a single last message, but missing a
> message altogether causes deadlocks.  Sending a message when none were
> requested might in theory cause crashes, in practice doing this causes
> performance degradation.
> 
> Another KVM feature is interrupt masking: guest OS requests that we
> stop sending some interrupt message, possibly modified mapping
> and re-enables this message. This needs to be done without
> involving the device that might keep requesting events:
> while masked, message is marked "pending", and guest might test
> the pending status.
> 
> We can implement masking in system emulator in userspace, by using
> assign/deassign ioctls: when message is masked, we simply deassign all
> eventfd, and when it is unmasked, we assign them back.
> 
> Here's some code to illustrate how this all works: assign/deassign code
> in kernel looks like the following:
> 
> 
> this is called to unmask interrupt
> 
> static int
> kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> {
> 	struct _irqfd *irqfd, *tmp;
> 	struct file *file = NULL;
> 	struct eventfd_ctx *eventfd = NULL;
> 	int ret;
> 	unsigned int events;
> 
> 	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> 
> ...
> 
> 	file = eventfd_fget(fd);
> 	if (IS_ERR(file)) {
> 		ret = PTR_ERR(file);
> 		goto fail;
> 	}
> 
> 	eventfd = eventfd_ctx_fileget(file);
> 	if (IS_ERR(eventfd)) {
> 		ret = PTR_ERR(eventfd);
> 		goto fail;
> 	}
> 
> 	irqfd->eventfd = eventfd;
> 
> 	/*
> 	 * Install our own custom wake-up handling so we are notified via
> 	 * a callback whenever someone signals the underlying eventfd
> 	 */
> 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> 	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> 
> 	spin_lock_irq(&kvm->irqfds.lock);
> 
> 	events = file->f_op->poll(file, &irqfd->pt);
> 
> 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
> 	spin_unlock_irq(&kvm->irqfds.lock);
> 
> A.
> 	/*
> 	 * Check if there was an event already pending on the eventfd
> 	 * before we registered, and trigger it as if we didn't miss it.
> 	 */
> 	if (events & POLLIN)
> 		schedule_work(&irqfd->inject);
> 
> 	/*
> 	 * do not drop the file until the irqfd is fully initialized, otherwise
> 	 * we might race against the POLLHUP
> 	 */
> 	fput(file);
> 
> 	return 0;
> 
> fail:
> 	...
> }
> 
> This is called to mask interrupt
> 
> /*
>  * shutdown any irqfd's that match fd+gsi
>  */
> static int
> kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
> {
> 	struct _irqfd *irqfd, *tmp;
> 	struct eventfd_ctx *eventfd;
> 
> 	eventfd = eventfd_ctx_fdget(fd);
> 	if (IS_ERR(eventfd))
> 		return PTR_ERR(eventfd);
> 
> 	spin_lock_irq(&kvm->irqfds.lock);
> 
> 	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
> 		if (irqfd->eventfd == eventfd && irqfd->gsi == gsi)
> 			irqfd_deactivate(irqfd);
> 	}
> 
> 	spin_unlock_irq(&kvm->irqfds.lock);
> 	eventfd_ctx_put(eventfd);
> 
> 	/*
> 	 * Block until we know all outstanding shutdown jobs have completed
> 	 * so that we guarantee there will not be any more interrupts on this
> 	 * gsi once this deassign function returns.
> 	 */
> 	flush_workqueue(irqfd_cleanup_wq);
> 
> 	return 0;
> }
> 
> 
> And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
> so this is not under the spinlock):
> 
>         /*
>          * Synchronize with the wait-queue and unhook ourselves to
>          * prevent
>          * further events.
>          */
> B.
>         remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 
> 	....
> 
>         /*
>          * It is now safe to release the object's resources
>          */
>         eventfd_ctx_put(irqfd->eventfd);
>         kfree(irqfd);
> 
> 
> The problems (really the same bug) in KVM that I am trying to fix:
> 1. Because of A above, if event was requested while message was masked,
>    we will not miss a message. However, because we never clear
>    the counter, so we currently get a spurious message each time
>    we unmask.
> 
>    We should clear the counter either each time we
>    deliver message, or at least when message is masked.
> 
> 2. We currently always return pending status 0 to guests,
>    but for strict spec compliance we must do it correctly
>    and report pending message if one was requested while
>    it was masked.
> 
>    An obvious way to do this is to do nonblocking read on the eventfd.
>    Since this clears the counter, we will have to write the value
>    back. But again, this requires that all messages that were
>    sent before mask are cleared from the counter, otherwise
>    we will always report pending messages.
> 
> 
> So, how should I clear the counter?
> 
> I started with the idea of clearing it in irqfd_wakeup.  However, as you
> point out this will cause another list scan, so it's messy.  It
> currently uses a workqueue to work around KVM locking issue, but the
> issue is fixed so we will stop doing this soon, so can not rely on that.
> 
> Thus I came up with the idea to do this on unmask:
> this is slow path and can always be done from workqueue or ioctl
> context. So I would add eventfd_read_ctx in code at point B:
> 
> 
> 		eventfd_read_ctx(irqfd->eventfd, &ucnt);
> 	->
> 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 
> now, if device signals eventfd at point marked by ->,
> it will be sent but counter not cleared,
> so we will get spurious message when we unmask.
> 
> Or if I do it the other way:
> 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 	->
> 		eventfd_read_ctx(irqfd->eventfd, &ucnt);
> 
> now, if device signals eventfd at point marked by ->,
> it will not be sent but counter will be cleared,
> so we will loose a message.
> 
> There does not appear to be a way to avoid this race unless we both
> remove from wait queue and clear counter under the wait queue lock, just
> like eventfd_read currently does.
> 
> I hope this clarifies.

Yes, it sure does, thanks.
When you unmask, shouldn't the fd be in a "clear" state anyway?
So, can't you just do  eventfd_ctx_read(ctx, 1, &ucnt)  on unmask, and 
ignore the value (and the POLLIN flags)?
Another solution, is to keep track of the counter (using 
eventfd_ctx_read() and eventfd_ctx_peek_value()) inside your code, and 
decide the status based on its changes.




- Davide


---
 fs/eventfd.c            |   74 ++++++++++++++++++++++++++++++++++++++----------
 include/linux/eventfd.h |   13 ++++++++
 2 files changed, 72 insertions(+), 15 deletions(-)

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c	2010-01-06 12:37:28.000000000 -0800
+++ linux-2.6.mod/fs/eventfd.c	2010-01-06 17:43:09.000000000 -0800
@@ -135,26 +135,56 @@ static unsigned int eventfd_poll(struct 
 	return events;
 }
 
-static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
-			    loff_t *ppos)
+/**
+ * eventfd_ctx_peek_value - Reads the eventfd counter.
+ * @ctx: [in] Pointer to eventfd context.
+ *
+ * NOTE: Must be called with the wait queue lock held.  Locking the
+ * eventfd_ctx interface is not actually exposed outside of the
+ * eventfd implementation, but when poll callback registers with the
+ * eventfd poll, such callbacks will be invoked with the wait queue
+ * lock held.  IOW this means that this function can only be called from
+ * inside a registered poll callback at the moment.
+ * The function returns the current counter value, without consuming it.
+ */
+__u64 eventfd_ctx_peek_value(struct eventfd_ctx *ctx)
+{
+	assert_spin_locked(&ctx->wqh.lock);
+
+	return ctx->count;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_peek_value);
+
+/**
+ * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
+ * @ctx: [in] Pointer to eventfd context.
+ * @no_wait: [in] Different from zero if the operation should not block.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN      : The operation would have blocked but @no_wait was nonzero.
+ * -ERESTARTSYS : A signal interrupted the wait operation.
+ *
+ * If @no_wait is zero, the function might sleep until the eventfd internal
+ * counter becomes greater than zero.
+ */
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
 {
-	struct eventfd_ctx *ctx = file->private_data;
 	ssize_t res;
-	__u64 ucnt = 0;
 	DECLARE_WAITQUEUE(wait, current);
 
-	if (count < sizeof(ucnt))
-		return -EINVAL;
 	spin_lock_irq(&ctx->wqh.lock);
+	*cnt = 0;
 	res = -EAGAIN;
 	if (ctx->count > 0)
-		res = sizeof(ucnt);
-	else if (!(file->f_flags & O_NONBLOCK)) {
+		res = 0;
+	else if (!no_wait) {
 		__add_wait_queue(&ctx->wqh, &wait);
-		for (res = 0;;) {
+		for (;;) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (ctx->count > 0) {
-				res = sizeof(ucnt);
+				res = 0;
 				break;
 			}
 			if (signal_pending(current)) {
@@ -168,18 +198,32 @@ static ssize_t eventfd_read(struct file 
 		__remove_wait_queue(&ctx->wqh, &wait);
 		__set_current_state(TASK_RUNNING);
 	}
-	if (likely(res > 0)) {
-		ucnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
-		ctx->count -= ucnt;
+	if (likely(res == 0)) {
+		*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1: ctx->count;
+		ctx->count -= *cnt;
 		if (waitqueue_active(&ctx->wqh))
 			wake_up_locked_poll(&ctx->wqh, POLLOUT);
 	}
 	spin_unlock_irq(&ctx->wqh.lock);
-	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
-		return -EFAULT;
 
 	return res;
 }
+EXPORT_SYMBOL_GPL(eventfd_ctx_read);
+
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+			    loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 cnt;
+
+	if (count < sizeof(cnt))
+		return -EINVAL;
+	if ((res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt)) < 0)
+		return res;
+
+	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT: sizeof(cnt);
+}
 
 static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
 			     loff_t *ppos)
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h	2010-01-06 12:37:28.000000000 -0800
+++ linux-2.6.mod/include/linux/eventfd.h	2010-01-06 14:54:36.000000000 -0800
@@ -34,6 +34,8 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 int eventfd_signal(struct eventfd_ctx *ctx, int n);
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
+__u64 eventfd_ctx_peek_value(struct eventfd_ctx *ctx);
 
 #else /* CONFIG_EVENTFD */
 
@@ -61,6 +63,17 @@ static inline void eventfd_ctx_put(struc
 
 }
 
+static inline ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait,
+				       __u64 *cnt)
+{
+	return -ENOSYS;
+}
+
+static inline __u64 eventfd_ctx_peek_value(struct eventfd_ctx *ctx)
+{
+	return 0;
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-07 23:37                                                                               ` Davide Libenzi
@ 2010-01-08  0:13                                                                                 ` Davide Libenzi
  0 siblings, 0 replies; 79+ messages in thread
From: Davide Libenzi @ 2010-01-08  0:13 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Thu, 7 Jan 2010, Davide Libenzi wrote:

> When you unmask, shouldn't the fd be in a "clear" state anyway?

Forget that. If it's used for IRQs you likely need to have a current state 
when you unmask.


- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-07 10:36                                                                             ` Michael S. Tsirkin
  2010-01-07 23:37                                                                               ` Davide Libenzi
@ 2010-01-08  0:26                                                                               ` Davide Libenzi
  2010-01-10 10:30                                                                                 ` Michael S. Tsirkin
  2010-01-11  9:01                                                                               ` Gleb Natapov
  2 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-08  0:26 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

> Sure, I was trying to be as brief as possible, here's a detailed summary.
> 
> Description of the system (MSI emulation in KVM):
> 
> KVM supports an ioctl to assign/deassign an eventfd file to interrupt message
> in guest OS.  When this eventfd is signalled, interrupt message is sent.
> This assignment is done from qemu system emulator.
> 
> eventfd is signalled from device emulation in another thread in
> userspace or from kernel, which talks with guest OS through another
> eventfd and shared memory (possibility of out of process was discussed
> but never got implemented yet).
> 
> Note: it's okay to delay messages from correctness point of view, but
> generally this is latency-sensitive path. If multiple identical messages
> are requested, it's okay to send a single last message, but missing a
> message altogether causes deadlocks.  Sending a message when none were
> requested might in theory cause crashes, in practice doing this causes
> performance degradation.
> 
> Another KVM feature is interrupt masking: guest OS requests that we
> stop sending some interrupt message, possibly modified mapping
> and re-enables this message. This needs to be done without
> involving the device that might keep requesting events:
> while masked, message is marked "pending", and guest might test
> the pending status.
> 
> We can implement masking in system emulator in userspace, by using
> assign/deassign ioctls: when message is masked, we simply deassign all
> eventfd, and when it is unmasked, we assign them back.
> 
> Here's some code to illustrate how this all works: assign/deassign code
> in kernel looks like the following:
> 
> 
> this is called to unmask interrupt
> 
> static int
> kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> {
> 	struct _irqfd *irqfd, *tmp;
> 	struct file *file = NULL;
> 	struct eventfd_ctx *eventfd = NULL;
> 	int ret;
> 	unsigned int events;
> 
> 	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> 
> ...
> 
> 	file = eventfd_fget(fd);
> 	if (IS_ERR(file)) {
> 		ret = PTR_ERR(file);
> 		goto fail;
> 	}
> 
> 	eventfd = eventfd_ctx_fileget(file);
> 	if (IS_ERR(eventfd)) {
> 		ret = PTR_ERR(eventfd);
> 		goto fail;
> 	}
> 
> 	irqfd->eventfd = eventfd;
> 
> 	/*
> 	 * Install our own custom wake-up handling so we are notified via
> 	 * a callback whenever someone signals the underlying eventfd
> 	 */
> 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> 	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> 
> 	spin_lock_irq(&kvm->irqfds.lock);
> 
> 	events = file->f_op->poll(file, &irqfd->pt);
> 
> 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
> 	spin_unlock_irq(&kvm->irqfds.lock);
> 
> A.
> 	/*
> 	 * Check if there was an event already pending on the eventfd
> 	 * before we registered, and trigger it as if we didn't miss it.
> 	 */
> 	if (events & POLLIN)
> 		schedule_work(&irqfd->inject);
> 
> 	/*
> 	 * do not drop the file until the irqfd is fully initialized, otherwise
> 	 * we might race against the POLLHUP
> 	 */
> 	fput(file);
> 
> 	return 0;
> 
> fail:
> 	...
> }

What is you do (under proper irqfd locking) something like:

	eventfd_ctx_read(ctx, 1, &cnt);
	if (irqfd->cnt != cnt) {
		irqfd->cnt = cnt;
		schedule_work(&irqfd->inject);
	}




> And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
> so this is not under the spinlock):
> 
>         /*
>          * Synchronize with the wait-queue and unhook ourselves to
>          * prevent
>          * further events.
>          */
> B.
>         remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 
> 	....
> 
>         /*
>          * It is now safe to release the object's resources
>          */
>         eventfd_ctx_put(irqfd->eventfd);
>         kfree(irqfd);

And:

	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
	remove_wait_queue(irqfd->wqh, &irqfd->wait);




- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-08  0:26                                                                               ` Davide Libenzi
@ 2010-01-10 10:30                                                                                 ` Michael S. Tsirkin
  2010-01-10 15:26                                                                                   ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-10 10:30 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Thu, Jan 07, 2010 at 04:26:24PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> 
> > Sure, I was trying to be as brief as possible, here's a detailed summary.
> > 
> > Description of the system (MSI emulation in KVM):
> > 
> > KVM supports an ioctl to assign/deassign an eventfd file to interrupt message
> > in guest OS.  When this eventfd is signalled, interrupt message is sent.
> > This assignment is done from qemu system emulator.
> > 
> > eventfd is signalled from device emulation in another thread in
> > userspace or from kernel, which talks with guest OS through another
> > eventfd and shared memory (possibility of out of process was discussed
> > but never got implemented yet).
> > 
> > Note: it's okay to delay messages from correctness point of view, but
> > generally this is latency-sensitive path. If multiple identical messages
> > are requested, it's okay to send a single last message, but missing a
> > message altogether causes deadlocks.  Sending a message when none were
> > requested might in theory cause crashes, in practice doing this causes
> > performance degradation.
> > 
> > Another KVM feature is interrupt masking: guest OS requests that we
> > stop sending some interrupt message, possibly modified mapping
> > and re-enables this message. This needs to be done without
> > involving the device that might keep requesting events:
> > while masked, message is marked "pending", and guest might test
> > the pending status.
> > 
> > We can implement masking in system emulator in userspace, by using
> > assign/deassign ioctls: when message is masked, we simply deassign all
> > eventfd, and when it is unmasked, we assign them back.
> > 
> > Here's some code to illustrate how this all works: assign/deassign code
> > in kernel looks like the following:
> > 
> > 
> > this is called to unmask interrupt
> > 
> > static int
> > kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> > {
> > 	struct _irqfd *irqfd, *tmp;
> > 	struct file *file = NULL;
> > 	struct eventfd_ctx *eventfd = NULL;
> > 	int ret;
> > 	unsigned int events;
> > 
> > 	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> > 
> > ...
> > 
> > 	file = eventfd_fget(fd);
> > 	if (IS_ERR(file)) {
> > 		ret = PTR_ERR(file);
> > 		goto fail;
> > 	}
> > 
> > 	eventfd = eventfd_ctx_fileget(file);
> > 	if (IS_ERR(eventfd)) {
> > 		ret = PTR_ERR(eventfd);
> > 		goto fail;
> > 	}
> > 
> > 	irqfd->eventfd = eventfd;
> > 
> > 	/*
> > 	 * Install our own custom wake-up handling so we are notified via
> > 	 * a callback whenever someone signals the underlying eventfd
> > 	 */
> > 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> > 	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> > 
> > 	spin_lock_irq(&kvm->irqfds.lock);
> > 
> > 	events = file->f_op->poll(file, &irqfd->pt);
> > 
> > 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
> > 	spin_unlock_irq(&kvm->irqfds.lock);
> > 
> > A.
> > 	/*
> > 	 * Check if there was an event already pending on the eventfd
> > 	 * before we registered, and trigger it as if we didn't miss it.
> > 	 */
> > 	if (events & POLLIN)
> > 		schedule_work(&irqfd->inject);
> > 
> > 	/*
> > 	 * do not drop the file until the irqfd is fully initialized, otherwise
> > 	 * we might race against the POLLHUP
> > 	 */
> > 	fput(file);
> > 
> > 	return 0;
> > 
> > fail:
> > 	...
> > }
> 
> What is you do (under proper irqfd locking) something like:
> 
> 	eventfd_ctx_read(ctx, 1, &cnt);
> 	if (irqfd->cnt != cnt) {
> 		irqfd->cnt = cnt;
> 		schedule_work(&irqfd->inject);
> 	}
> 
> 
> 
> 
> > And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
> > so this is not under the spinlock):
> > 
> >         /*
> >          * Synchronize with the wait-queue and unhook ourselves to
> >          * prevent
> >          * further events.
> >          */
> > B.
> >         remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > 
> > 	....
> > 
> >         /*
> >          * It is now safe to release the object's resources
> >          */
> >         eventfd_ctx_put(irqfd->eventfd);
> >         kfree(irqfd);
> 
> And:
> 
> 	eventfd_ctx_read(ctx, 1, &irqfd->cnt);


->

> 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 
> 
> 
> 
> - Davide

Yes, this is exactly what I wanted to do.  So, here's the issue: if an
event is signalled at point ->: after eventfd_ctx_read but before
remove_wait_queue, then we inject interrupt but counter will be left
non-zero and then when we unmask, we inject antoher, spurious interrupt.

This is why I wanted to have eventfd_ctx_read not take wait queue head
lock: then I could do:

	spin_lock_irqsave(&ctx->wqh.lock, flags);
 	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
 	__remove_wait_queue(irqfd->wqh, &irqfd->wait);
	spin_lock_irqrestore(&ctx->wqh.lock, flags);


-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-10 10:30                                                                                 ` Michael S. Tsirkin
@ 2010-01-10 15:26                                                                                   ` Davide Libenzi
  2010-01-10 16:22                                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-10 15:26 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:

> > What is you do (under proper irqfd locking) something like:
> > 
> > 	eventfd_ctx_read(ctx, 1, &cnt);
> > 	if (irqfd->cnt != cnt) {
> > 		irqfd->cnt = cnt;
> > 		schedule_work(&irqfd->inject);
> > 	}
> > 
> > 
> > 
> > 
> > > And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
> > > so this is not under the spinlock):
> > > 
> > >         /*
> > >          * Synchronize with the wait-queue and unhook ourselves to
> > >          * prevent
> > >          * further events.
> > >          */
> > > B.
> > >         remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > > 
> > > 	....
> > > 
> > >         /*
> > >          * It is now safe to release the object's resources
> > >          */
> > >         eventfd_ctx_put(irqfd->eventfd);
> > >         kfree(irqfd);
> > 
> > And:
> > 
> > 	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
> 
> 
> ->
> 
> > 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > 
> > 
> > 
> > 
> > - Davide
> 
> Yes, this is exactly what I wanted to do.  So, here's the issue: if an
> event is signalled at point ->: after eventfd_ctx_read but before
> remove_wait_queue, then we inject interrupt but counter will be left
> non-zero and then when we unmask, we inject antoher, spurious interrupt.
> 
> This is why I wanted to have eventfd_ctx_read not take wait queue head
> lock: then I could do:
> 
> 	spin_lock_irqsave(&ctx->wqh.lock, flags);
>  	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
>  	__remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 	spin_lock_irqrestore(&ctx->wqh.lock, flags);

This is why I said "under proper irqfd locking".


- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-10 15:26                                                                                   ` Davide Libenzi
@ 2010-01-10 16:22                                                                                     ` Michael S. Tsirkin
  2010-01-10 17:27                                                                                       ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-10 16:22 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Sun, Jan 10, 2010 at 07:26:12AM -0800, Davide Libenzi wrote:
> On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:
> 
> > > What is you do (under proper irqfd locking) something like:
> > > 
> > > 	eventfd_ctx_read(ctx, 1, &cnt);
> > > 	if (irqfd->cnt != cnt) {
> > > 		irqfd->cnt = cnt;
> > > 		schedule_work(&irqfd->inject);
> > > 	}
> > > 
> > > 
> > > 
> > > 
> > > > And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
> > > > so this is not under the spinlock):
> > > > 
> > > >         /*
> > > >          * Synchronize with the wait-queue and unhook ourselves to
> > > >          * prevent
> > > >          * further events.
> > > >          */
> > > > B.
> > > >         remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > > > 
> > > > 	....
> > > > 
> > > >         /*
> > > >          * It is now safe to release the object's resources
> > > >          */
> > > >         eventfd_ctx_put(irqfd->eventfd);
> > > >         kfree(irqfd);
> > > 
> > > And:
> > > 
> > > 	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
> > 
> > 
> > ->
> > 
> > > 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > > 
> > > 
> > > 
> > > 
> > > - Davide
> > 
> > Yes, this is exactly what I wanted to do.  So, here's the issue: if an
> > event is signalled at point ->: after eventfd_ctx_read but before
> > remove_wait_queue, then we inject interrupt but counter will be left
> > non-zero and then when we unmask, we inject antoher, spurious interrupt.
> > 
> > This is why I wanted to have eventfd_ctx_read not take wait queue head
> > lock: then I could do:
> > 
> > 	spin_lock_irqsave(&ctx->wqh.lock, flags);
> >  	eventfd_ctx_read(ctx, 1, &irqfd->cnt);
> >  	__remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > 	spin_lock_irqrestore(&ctx->wqh.lock, flags);
> 
> This is why I said "under proper irqfd locking".
> 
> 
> - Davide

Not sure what you mean by irqfd locking.  IMO we must take waitqueue
lock, otherwise we can not prevent wait queue callback from being
invoked, right? Note that if we take our own lock under wait queue
callback, we won't be able to use this lock around remove_wait_queue.
Also note how fs/evetfd.c uses __remove_wait_queue after taking wait
queue lock - we'd like to do the same with our wait queue.

Could you clarify pls?

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-10 16:22                                                                                     ` Michael S. Tsirkin
@ 2010-01-10 17:27                                                                                       ` Davide Libenzi
  2010-01-10 17:35                                                                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-10 17:27 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:

> Not sure what you mean by irqfd locking.  IMO we must take waitqueue
> lock, otherwise we can not prevent wait queue callback from being
> invoked, right? Note that if we take our own lock under wait queue
> callback, we won't be able to use this lock around remove_wait_queue.
> Also note how fs/evetfd.c uses __remove_wait_queue after taking wait
> queue lock - we'd like to do the same with our wait queue.
> 
> Could you clarify pls?

Wait a second. Looking at the current code in Linus tree, when you 
deassign an irqfd, you are actually destroying it, so the idea of saving 
the counter on deassign is not going to work.



- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-10 17:27                                                                                       ` Davide Libenzi
@ 2010-01-10 17:35                                                                                         ` Michael S. Tsirkin
  2010-01-10 19:04                                                                                           ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-10 17:35 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Sun, Jan 10, 2010 at 09:27:34AM -0800, Davide Libenzi wrote:
> On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:
> 
> > Not sure what you mean by irqfd locking.  IMO we must take waitqueue
> > lock, otherwise we can not prevent wait queue callback from being
> > invoked, right? Note that if we take our own lock under wait queue
> > callback, we won't be able to use this lock around remove_wait_queue.
> > Also note how fs/evetfd.c uses __remove_wait_queue after taking wait
> > queue lock - we'd like to do the same with our wait queue.
> > 
> > Could you clarify pls?
> 
> Wait a second. Looking at the current code in Linus tree, when you 
> deassign an irqfd, you are actually destroying it, so the idea of saving 
> the counter on deassign is not going to work.
> 
> 
> 
> - Davide

Yes, the only context passed between deassign and assign is the eventfd
itself. So I think the following code for deassign would work provided
eventfd_ctx_read_locked works with wait queue lock taken:

        spin_lock_irqsave(&irqfd->wqh.lock, flags);
        eventfd_ctx_read_locked(ctx->eventfd, 1, &ucnt);
        __remove_wait_queue(irqfd->wqh, &irqfd->wait);
        spin_lock_irqrestore(&irqfd->wqh.lock, flags);

And we discard ucnt here. This works because on on assign we do:

	events = file->f_op->poll(file, &irqfd->pt);
	if (events & POLLIN)
		schedule_work(&irqfd->inject);

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-10 17:35                                                                                         ` Michael S. Tsirkin
@ 2010-01-10 19:04                                                                                           ` Davide Libenzi
  2010-01-11  7:34                                                                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-10 19:04 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:

> On Sun, Jan 10, 2010 at 09:27:34AM -0800, Davide Libenzi wrote:
> > On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:
> > 
> > > Not sure what you mean by irqfd locking.  IMO we must take waitqueue
> > > lock, otherwise we can not prevent wait queue callback from being
> > > invoked, right? Note that if we take our own lock under wait queue
> > > callback, we won't be able to use this lock around remove_wait_queue.
> > > Also note how fs/evetfd.c uses __remove_wait_queue after taking wait
> > > queue lock - we'd like to do the same with our wait queue.
> > > 
> > > Could you clarify pls?
> > 
> > Wait a second. Looking at the current code in Linus tree, when you 
> > deassign an irqfd, you are actually destroying it, so the idea of saving 
> > the counter on deassign is not going to work.
> > 
> > 
> > 
> > - Davide
> 
> Yes, the only context passed between deassign and assign is the eventfd
> itself. So I think the following code for deassign would work provided
> eventfd_ctx_read_locked works with wait queue lock taken:
> 
>         spin_lock_irqsave(&irqfd->wqh.lock, flags);
>         eventfd_ctx_read_locked(ctx->eventfd, 1, &ucnt);
>         __remove_wait_queue(irqfd->wqh, &irqfd->wait);
>         spin_lock_irqrestore(&irqfd->wqh.lock, flags);

As I said, you cannot do that from KVM, since write-witers will nedd to be 
woke up.
Use the eventfd_ctx_remove_wait_queue() below (NOTE: compiled, not tested).



- Davide


---
 fs/eventfd.c            |   88 +++++++++++++++++++++++++++++++++++++++---------
 include/linux/eventfd.h |   15 ++++++++
 2 files changed, 88 insertions(+), 15 deletions(-)

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c	2010-01-06 18:33:27.000000000 -0800
+++ linux-2.6.mod/fs/eventfd.c	2010-01-10 11:02:00.000000000 -0800
@@ -135,26 +135,71 @@ static unsigned int eventfd_poll(struct 
 	return events;
 }
 
-static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
-			    loff_t *ppos)
+static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
+{
+	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1: ctx->count;
+	ctx->count -= *cnt;
+}
+
+/**
+ * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait queue.
+ * @ctx: [in] Pointer to eventfd context.
+ * @wait: [in] Wait queue to be removed.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN      : The operation would have blocked.
+ *
+ * This is used to atomically remove a wait queue entry from the eventfd wait
+ * queue head, and read/reset the counter value.
+ */
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+				  __u64 *cnt)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->wqh.lock, flags);
+	eventfd_ctx_do_read(ctx, cnt);
+	__remove_wait_queue(&ctx->wqh, wait);
+	if (*cnt != 0 && waitqueue_active(&ctx->wqh))
+		wake_up_locked_poll(&ctx->wqh, POLLOUT);
+	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+
+	return *cnt != 0 ? 0: -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue);
+
+/**
+ * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
+ * @ctx: [in] Pointer to eventfd context.
+ * @no_wait: [in] Different from zero if the operation should not block.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN      : The operation would have blocked but @no_wait was nonzero.
+ * -ERESTARTSYS : A signal interrupted the wait operation.
+ *
+ * If @no_wait is zero, the function might sleep until the eventfd internal
+ * counter becomes greater than zero.
+ */
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
 {
-	struct eventfd_ctx *ctx = file->private_data;
 	ssize_t res;
-	__u64 ucnt = 0;
 	DECLARE_WAITQUEUE(wait, current);
 
-	if (count < sizeof(ucnt))
-		return -EINVAL;
 	spin_lock_irq(&ctx->wqh.lock);
+	*cnt = 0;
 	res = -EAGAIN;
 	if (ctx->count > 0)
-		res = sizeof(ucnt);
-	else if (!(file->f_flags & O_NONBLOCK)) {
+		res = 0;
+	else if (!no_wait) {
 		__add_wait_queue(&ctx->wqh, &wait);
-		for (res = 0;;) {
+		for (;;) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (ctx->count > 0) {
-				res = sizeof(ucnt);
+				res = 0;
 				break;
 			}
 			if (signal_pending(current)) {
@@ -168,18 +213,31 @@ static ssize_t eventfd_read(struct file 
 		__remove_wait_queue(&ctx->wqh, &wait);
 		__set_current_state(TASK_RUNNING);
 	}
-	if (likely(res > 0)) {
-		ucnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
-		ctx->count -= ucnt;
+	if (likely(res == 0)) {
+		eventfd_ctx_do_read(ctx, cnt);
 		if (waitqueue_active(&ctx->wqh))
 			wake_up_locked_poll(&ctx->wqh, POLLOUT);
 	}
 	spin_unlock_irq(&ctx->wqh.lock);
-	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
-		return -EFAULT;
 
 	return res;
 }
+EXPORT_SYMBOL_GPL(eventfd_ctx_read);
+
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+			    loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 cnt;
+
+	if (count < sizeof(cnt))
+		return -EINVAL;
+	if ((res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt)) < 0)
+		return res;
+
+	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT: sizeof(cnt);
+}
 
 static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
 			     loff_t *ppos)
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h	2010-01-06 18:33:27.000000000 -0800
+++ linux-2.6.mod/include/linux/eventfd.h	2010-01-10 10:57:06.000000000 -0800
@@ -34,6 +34,9 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 int eventfd_signal(struct eventfd_ctx *ctx, int n);
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+				  __u64 *cnt);
 
 #else /* CONFIG_EVENTFD */
 
@@ -61,6 +64,18 @@ static inline void eventfd_ctx_put(struc
 
 }
 
+static inline ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait,
+				       __u64 *cnt)
+{
+	return -ENOSYS;
+}
+
+static inline int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx,
+						wait_queue_t *wait, __u64 *cnt)
+{
+	return -ENOSYS;
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-10 19:04                                                                                           ` Davide Libenzi
@ 2010-01-11  7:34                                                                                             ` Michael S. Tsirkin
  2010-01-11 19:14                                                                                               ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-11  7:34 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Sun, Jan 10, 2010 at 11:04:23AM -0800, Davide Libenzi wrote:
> On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:
> 
> > On Sun, Jan 10, 2010 at 09:27:34AM -0800, Davide Libenzi wrote:
> > > On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:
> > > 
> > > > Not sure what you mean by irqfd locking.  IMO we must take waitqueue
> > > > lock, otherwise we can not prevent wait queue callback from being
> > > > invoked, right? Note that if we take our own lock under wait queue
> > > > callback, we won't be able to use this lock around remove_wait_queue.
> > > > Also note how fs/evetfd.c uses __remove_wait_queue after taking wait
> > > > queue lock - we'd like to do the same with our wait queue.
> > > > 
> > > > Could you clarify pls?
> > > 
> > > Wait a second. Looking at the current code in Linus tree, when you 
> > > deassign an irqfd, you are actually destroying it, so the idea of saving 
> > > the counter on deassign is not going to work.
> > > 
> > > 
> > > 
> > > - Davide
> > 
> > Yes, the only context passed between deassign and assign is the eventfd
> > itself. So I think the following code for deassign would work provided
> > eventfd_ctx_read_locked works with wait queue lock taken:
> > 
> >         spin_lock_irqsave(&irqfd->wqh.lock, flags);
> >         eventfd_ctx_read_locked(ctx->eventfd, 1, &ucnt);
> >         __remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >         spin_lock_irqrestore(&irqfd->wqh.lock, flags);
> 
> As I said, you cannot do that from KVM, since write-witers will nedd to be 
> woke up.
> Use the eventfd_ctx_remove_wait_queue() below (NOTE: compiled, not tested).
> 
> 
> 
> - Davide
> 

Yes, I think this will work. Will test and report. Thanks!

> ---
>  fs/eventfd.c            |   88 +++++++++++++++++++++++++++++++++++++++---------
>  include/linux/eventfd.h |   15 ++++++++
>  2 files changed, 88 insertions(+), 15 deletions(-)
> 
> Index: linux-2.6.mod/fs/eventfd.c
> ===================================================================
> --- linux-2.6.mod.orig/fs/eventfd.c	2010-01-06 18:33:27.000000000 -0800
> +++ linux-2.6.mod/fs/eventfd.c	2010-01-10 11:02:00.000000000 -0800
> @@ -135,26 +135,71 @@ static unsigned int eventfd_poll(struct 
>  	return events;
>  }
>  
> -static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> -			    loff_t *ppos)
> +static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
> +{
> +	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1: ctx->count;
> +	ctx->count -= *cnt;
> +}
> +
> +/**
> + * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait queue.
> + * @ctx: [in] Pointer to eventfd context.
> + * @wait: [in] Wait queue to be removed.
> + * @cnt: [out] Pointer to the 64bit conter value.
> + *
> + * Returns zero if successful, or the following error codes:
> + *
> + * -EAGAIN      : The operation would have blocked.
> + *
> + * This is used to atomically remove a wait queue entry from the eventfd wait
> + * queue head, and read/reset the counter value.
> + */
> +int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
> +				  __u64 *cnt)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&ctx->wqh.lock, flags);
> +	eventfd_ctx_do_read(ctx, cnt);
> +	__remove_wait_queue(&ctx->wqh, wait);
> +	if (*cnt != 0 && waitqueue_active(&ctx->wqh))
> +		wake_up_locked_poll(&ctx->wqh, POLLOUT);
> +	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
> +
> +	return *cnt != 0 ? 0: -EAGAIN;
> +}
> +EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue);
> +
> +/**
> + * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
> + * @ctx: [in] Pointer to eventfd context.
> + * @no_wait: [in] Different from zero if the operation should not block.
> + * @cnt: [out] Pointer to the 64bit conter value.
> + *
> + * Returns zero if successful, or the following error codes:
> + *
> + * -EAGAIN      : The operation would have blocked but @no_wait was nonzero.
> + * -ERESTARTSYS : A signal interrupted the wait operation.
> + *
> + * If @no_wait is zero, the function might sleep until the eventfd internal
> + * counter becomes greater than zero.
> + */
> +ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
>  {
> -	struct eventfd_ctx *ctx = file->private_data;
>  	ssize_t res;
> -	__u64 ucnt = 0;
>  	DECLARE_WAITQUEUE(wait, current);
>  
> -	if (count < sizeof(ucnt))
> -		return -EINVAL;
>  	spin_lock_irq(&ctx->wqh.lock);
> +	*cnt = 0;
>  	res = -EAGAIN;
>  	if (ctx->count > 0)
> -		res = sizeof(ucnt);
> -	else if (!(file->f_flags & O_NONBLOCK)) {
> +		res = 0;
> +	else if (!no_wait) {
>  		__add_wait_queue(&ctx->wqh, &wait);
> -		for (res = 0;;) {
> +		for (;;) {
>  			set_current_state(TASK_INTERRUPTIBLE);
>  			if (ctx->count > 0) {
> -				res = sizeof(ucnt);
> +				res = 0;
>  				break;
>  			}
>  			if (signal_pending(current)) {
> @@ -168,18 +213,31 @@ static ssize_t eventfd_read(struct file 
>  		__remove_wait_queue(&ctx->wqh, &wait);
>  		__set_current_state(TASK_RUNNING);
>  	}
> -	if (likely(res > 0)) {
> -		ucnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
> -		ctx->count -= ucnt;
> +	if (likely(res == 0)) {
> +		eventfd_ctx_do_read(ctx, cnt);
>  		if (waitqueue_active(&ctx->wqh))
>  			wake_up_locked_poll(&ctx->wqh, POLLOUT);
>  	}
>  	spin_unlock_irq(&ctx->wqh.lock);
> -	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
> -		return -EFAULT;
>  
>  	return res;
>  }
> +EXPORT_SYMBOL_GPL(eventfd_ctx_read);
> +
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> +			    loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 cnt;
> +
> +	if (count < sizeof(cnt))
> +		return -EINVAL;
> +	if ((res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt)) < 0)
> +		return res;
> +
> +	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT: sizeof(cnt);
> +}
>  
>  static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
>  			     loff_t *ppos)
> Index: linux-2.6.mod/include/linux/eventfd.h
> ===================================================================
> --- linux-2.6.mod.orig/include/linux/eventfd.h	2010-01-06 18:33:27.000000000 -0800
> +++ linux-2.6.mod/include/linux/eventfd.h	2010-01-10 10:57:06.000000000 -0800
> @@ -34,6 +34,9 @@ struct file *eventfd_fget(int fd);
>  struct eventfd_ctx *eventfd_ctx_fdget(int fd);
>  struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
>  int eventfd_signal(struct eventfd_ctx *ctx, int n);
> +ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
> +int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
> +				  __u64 *cnt);
>  
>  #else /* CONFIG_EVENTFD */
>  
> @@ -61,6 +64,18 @@ static inline void eventfd_ctx_put(struc
>  
>  }
>  
> +static inline ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait,
> +				       __u64 *cnt)
> +{
> +	return -ENOSYS;
> +}
> +
> +static inline int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx,
> +						wait_queue_t *wait, __u64 *cnt)
> +{
> +	return -ENOSYS;
> +}
> +
>  #endif
>  
>  #endif /* _LINUX_EVENTFD_H */

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-07 10:36                                                                             ` Michael S. Tsirkin
  2010-01-07 23:37                                                                               ` Davide Libenzi
  2010-01-08  0:26                                                                               ` Davide Libenzi
@ 2010-01-11  9:01                                                                               ` Gleb Natapov
  2010-01-11  9:02                                                                                 ` Michael S. Tsirkin
  2 siblings, 1 reply; 79+ messages in thread
From: Gleb Natapov @ 2010-01-11  9:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Davide Libenzi, Avi Kivity, kvm

On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
> Or if I do it the other way:
> 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 	->
> 		eventfd_read_ctx(irqfd->eventfd, &ucnt);
> 
> now, if device signals eventfd at point marked by ->,
> it will not be sent but counter will be cleared,
> so we will loose a message.
> 
May be I am missing something here, but why doing it like that is a
problem? Event delivery races with interrupt masking so making masking
succeed before event delivery is OK. Event generation is asynchronous
anyway and could have happened jiffy latter anyway.

--
			Gleb.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-11  9:01                                                                               ` Gleb Natapov
@ 2010-01-11  9:02                                                                                 ` Michael S. Tsirkin
  2010-01-11  9:08                                                                                   ` Gleb Natapov
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-11  9:02 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Davide Libenzi, Avi Kivity, kvm

On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
> On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
> > Or if I do it the other way:
> > 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > 	->
> > 		eventfd_read_ctx(irqfd->eventfd, &ucnt);
> > 
> > now, if device signals eventfd at point marked by ->,
> > it will not be sent but counter will be cleared,
> > so we will loose a message.
> > 
> May be I am missing something here, but why doing it like that is a
> problem? Event delivery races with interrupt masking so making masking
> succeed before event delivery is OK. Event generation is asynchronous
> anyway and could have happened jiffy latter anyway.
> 
> --
> 			Gleb.

No, event generation would only trigger a single interrupt.  This race
generates two interrupts for a single event.  This can never happen with
real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-11  9:02                                                                                 ` Michael S. Tsirkin
@ 2010-01-11  9:08                                                                                   ` Gleb Natapov
  2010-01-11  9:19                                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Gleb Natapov @ 2010-01-11  9:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Davide Libenzi, Avi Kivity, kvm

On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
> > On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
> > > Or if I do it the other way:
> > > 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > > 	->
> > > 		eventfd_read_ctx(irqfd->eventfd, &ucnt);
> > > 
> > > now, if device signals eventfd at point marked by ->,
> > > it will not be sent but counter will be cleared,
> > > so we will loose a message.
> > > 
> > May be I am missing something here, but why doing it like that is a
> > problem? Event delivery races with interrupt masking so making masking
> > succeed before event delivery is OK. Event generation is asynchronous
> > anyway and could have happened jiffy latter anyway.
> > 
> > --
> > 			Gleb.
> 
> No, event generation would only trigger a single interrupt.  This race
> generates two interrupts for a single event.  This can never happen with
> real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.
> 
In quoted test above you are saying that "we will loose a message". So
how does it generates two interrupts when we loose a message?
  
--
			Gleb.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-11  9:08                                                                                   ` Gleb Natapov
@ 2010-01-11  9:19                                                                                     ` Michael S. Tsirkin
  2010-01-11  9:36                                                                                       ` Gleb Natapov
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-11  9:19 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Davide Libenzi, Avi Kivity, kvm

On Mon, Jan 11, 2010 at 11:08:38AM +0200, Gleb Natapov wrote:
> On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
> > > On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
> > > > Or if I do it the other way:
> > > > 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > > > 	->
> > > > 		eventfd_read_ctx(irqfd->eventfd, &ucnt);
> > > > 
> > > > now, if device signals eventfd at point marked by ->,
> > > > it will not be sent but counter will be cleared,
> > > > so we will loose a message.
> > > > 
> > > May be I am missing something here, but why doing it like that is a
> > > problem? Event delivery races with interrupt masking so making masking
> > > succeed before event delivery is OK. Event generation is asynchronous
> > > anyway and could have happened jiffy latter anyway.
> > > 
> > > --
> > > 			Gleb.
> > 
> > No, event generation would only trigger a single interrupt.  This race
> > generates two interrupts for a single event.  This can never happen with
> > real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.
> > 
> In quoted test above you are saying that "we will loose a message". So
> how does it generates two interrupts when we loose a message?
>   
> --
> 			Gleb.


Right, sorry. I think what you miss is the fact that this is done during
interrupt vector masking/unmasking, so events signalled while eventfd is not
assigned to interrupt must not be lost, they should be pending and
delivered later when interrupt vector is unmasked, that is when
eventfd is reassigned to an interrupt.

So this implementation really loses an interrupt:

	remove_wait_queue(irqfd->wqh, &irqfd->wait);
		-> at this point vector is masked: we are not polling
		   eventfd anymore, event triggered at this point should cause interrupt
		   after vector is unmasked, but the only thing is causes
		   is counter increment in eventfd.

	eventfd_read_ctx(irqfd->eventfd, &ucnt);
		-> the above call would clear the counter, so
		   we won't get an interrupt when vector is later
		   unmasked.

-- 
MST

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-11  9:19                                                                                     ` Michael S. Tsirkin
@ 2010-01-11  9:36                                                                                       ` Gleb Natapov
  2010-01-11  9:41                                                                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Gleb Natapov @ 2010-01-11  9:36 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Davide Libenzi, Avi Kivity, kvm

On Mon, Jan 11, 2010 at 11:19:19AM +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 11, 2010 at 11:08:38AM +0200, Gleb Natapov wrote:
> > On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote:
> > > On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
> > > > On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
> > > > > Or if I do it the other way:
> > > > > 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > > > > 	->
> > > > > 		eventfd_read_ctx(irqfd->eventfd, &ucnt);
> > > > > 
> > > > > now, if device signals eventfd at point marked by ->,
> > > > > it will not be sent but counter will be cleared,
> > > > > so we will loose a message.
> > > > > 
> > > > May be I am missing something here, but why doing it like that is a
> > > > problem? Event delivery races with interrupt masking so making masking
> > > > succeed before event delivery is OK. Event generation is asynchronous
> > > > anyway and could have happened jiffy latter anyway.
> > > > 
> > > > --
> > > > 			Gleb.
> > > 
> > > No, event generation would only trigger a single interrupt.  This race
> > > generates two interrupts for a single event.  This can never happen with
> > > real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.
> > > 
> > In quoted test above you are saying that "we will loose a message". So
> > how does it generates two interrupts when we loose a message?
> >   
> > --
> > 			Gleb.
> 
> 
> Right, sorry. I think what you miss is the fact that this is done during
> interrupt vector masking/unmasking, so events signalled while eventfd is not
> assigned to interrupt must not be lost, they should be pending and
> delivered later when interrupt vector is unmasked, that is when
> eventfd is reassigned to an interrupt.
> 
Is this how MSI works? If interrupt is triggered while it is masked it
is reasserted after unmasking? This is certainly no true for interrupt
masking on irq chip level.

> So this implementation really loses an interrupt:
> 
> 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> 		-> at this point vector is masked: we are not polling
> 		   eventfd anymore, event triggered at this point should cause interrupt
> 		   after vector is unmasked, but the only thing is causes
> 		   is counter increment in eventfd.
> 
> 	eventfd_read_ctx(irqfd->eventfd, &ucnt);
> 		-> the above call would clear the counter, so
> 		   we won't get an interrupt when vector is later
> 		   unmasked.
> 
Don't you going to use ucnt to set interrupt status bit? Can't you
re-trigger the interrupt after unmasking if status bit is set?

--
			Gleb.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-11  9:36                                                                                       ` Gleb Natapov
@ 2010-01-11  9:41                                                                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-11  9:41 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Davide Libenzi, Avi Kivity, kvm

On Mon, Jan 11, 2010 at 11:36:16AM +0200, Gleb Natapov wrote:
> On Mon, Jan 11, 2010 at 11:19:19AM +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 11, 2010 at 11:08:38AM +0200, Gleb Natapov wrote:
> > > On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
> > > > > On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
> > > > > > Or if I do it the other way:
> > > > > > 		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > > > > > 	->
> > > > > > 		eventfd_read_ctx(irqfd->eventfd, &ucnt);
> > > > > > 
> > > > > > now, if device signals eventfd at point marked by ->,
> > > > > > it will not be sent but counter will be cleared,
> > > > > > so we will loose a message.
> > > > > > 
> > > > > May be I am missing something here, but why doing it like that is a
> > > > > problem? Event delivery races with interrupt masking so making masking
> > > > > succeed before event delivery is OK. Event generation is asynchronous
> > > > > anyway and could have happened jiffy latter anyway.
> > > > > 
> > > > > --
> > > > > 			Gleb.
> > > > 
> > > > No, event generation would only trigger a single interrupt.  This race
> > > > generates two interrupts for a single event.  This can never happen with
> > > > real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.
> > > > 
> > > In quoted test above you are saying that "we will loose a message". So
> > > how does it generates two interrupts when we loose a message?
> > >   
> > > --
> > > 			Gleb.
> > 
> > 
> > Right, sorry. I think what you miss is the fact that this is done during
> > interrupt vector masking/unmasking, so events signalled while eventfd is not
> > assigned to interrupt must not be lost, they should be pending and
> > delivered later when interrupt vector is unmasked, that is when
> > eventfd is reassigned to an interrupt.
> > 
> Is this how MSI works? If interrupt is triggered while it is masked it
> is reasserted after unmasking? This is certainly no true for interrupt
> masking on irq chip level.

Yes.

> > So this implementation really loses an interrupt:
> > 
> > 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> > 		-> at this point vector is masked: we are not polling
> > 		   eventfd anymore, event triggered at this point should cause interrupt
> > 		   after vector is unmasked, but the only thing is causes
> > 		   is counter increment in eventfd.
> > 
> > 	eventfd_read_ctx(irqfd->eventfd, &ucnt);
> > 		-> the above call would clear the counter, so
> > 		   we won't get an interrupt when vector is later
> > 		   unmasked.
> > 
> Don't you going to use ucnt to set interrupt status bit? Can't you
> re-trigger the interrupt after unmasking if status bit is set?

Yes, this is what Davidel suggested originally.  There's nowhere to save
the ucnt though because event is being deasserted.

> --
> 			Gleb.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-11  7:34                                                                                             ` Michael S. Tsirkin
@ 2010-01-11 19:14                                                                                               ` Davide Libenzi
  2010-01-11 19:19                                                                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-11 19:14 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:

> Yes, I think this will work. Will test and report. Thanks!

If you ever happen to find a solution in KVM that does not require eventfd 
changes, that'd be even better :)
Otherwise ping me when you tested, that I'll puch this to Andrew.


- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-11 19:14                                                                                               ` Davide Libenzi
@ 2010-01-11 19:19                                                                                                 ` Michael S. Tsirkin
  2010-01-11 22:53                                                                                                   ` Davide Libenzi
  0 siblings, 1 reply; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-11 19:19 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Mon, Jan 11, 2010 at 11:14:14AM -0800, Davide Libenzi wrote:
> On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:
> 
> > Yes, I think this will work. Will test and report. Thanks!
> 
> If you ever happen to find a solution in KVM that does not require eventfd 
> changes, that'd be even better :)
> Otherwise ping me when you tested, that I'll puch this to Andrew.
> 
> 
> - Davide

Hmm, the fix also needs changes in kvm driver to call the new API. To
simplify dependencies, can we merge this patch through Avi's kvm tree as
well?


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-11 19:19                                                                                                 ` Michael S. Tsirkin
@ 2010-01-11 22:53                                                                                                   ` Davide Libenzi
  2010-01-13 17:07                                                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 79+ messages in thread
From: Davide Libenzi @ 2010-01-11 22:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, kvm

On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:

> On Mon, Jan 11, 2010 at 11:14:14AM -0800, Davide Libenzi wrote:
> > On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:
> > 
> > > Yes, I think this will work. Will test and report. Thanks!
> > 
> > If you ever happen to find a solution in KVM that does not require eventfd 
> > changes, that'd be even better :)
> > Otherwise ping me when you tested, that I'll puch this to Andrew.
> > 
> > 
> > - Davide
> 
> Hmm, the fix also needs changes in kvm driver to call the new API. To
> simplify dependencies, can we merge this patch through Avi's kvm tree as
> well?

It is fine for me. I can officially post it to Avi, CCing kvm and lkml.


- Davide



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 0/2] eventfd: new EFD_STATE flag
  2010-01-11 22:53                                                                                                   ` Davide Libenzi
@ 2010-01-13 17:07                                                                                                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 79+ messages in thread
From: Michael S. Tsirkin @ 2010-01-13 17:07 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Avi Kivity, kvm

On Mon, Jan 11, 2010 at 02:53:53PM -0800, Davide Libenzi wrote:
> On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:
> 
> > On Mon, Jan 11, 2010 at 11:14:14AM -0800, Davide Libenzi wrote:
> > > On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:
> > > 
> > > > Yes, I think this will work. Will test and report. Thanks!
> > > 
> > > If you ever happen to find a solution in KVM that does not require eventfd 
> > > changes, that'd be even better :)
> > > Otherwise ping me when you tested, that I'll puch this to Andrew.
> > > 
> > > 
> > > - Davide
> > 
> > Hmm, the fix also needs changes in kvm driver to call the new API. To
> > simplify dependencies, can we merge this patch through Avi's kvm tree as
> > well?
> 
> It is fine for me. I can officially post it to Avi, CCing kvm and lkml.
> 
> 
> - Davide

Since it now uses wait_queue_t, I had to add include linux/wait.h to
make it self-contained.  The following works fine for me. Thanks!

--->

From: Davide Libenzi <davidel@xmailserver.org>
Subject: eventfd: support eventfd_ctx_remove_wait_queue

Add eventfd_ctx_remove_wait_queue API so that kernel users can clear
counter while getting out of poll queue, atomically.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Tested-by: Michael S. Tsirkin <mst@redhat.com>

--

diff --git a/fs/eventfd.c b/fs/eventfd.c
index d26402f..4cda278 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -135,26 +135,71 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
 	return events;
 }
 
-static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
-			    loff_t *ppos)
+static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
+{
+	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1: ctx->count;
+	ctx->count -= *cnt;
+}
+
+/**
+ * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait queue.
+ * @ctx: [in] Pointer to eventfd context.
+ * @wait: [in] Wait queue to be removed.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN      : The operation would have blocked.
+ *
+ * This is used to atomically remove a wait queue entry from the eventfd wait
+ * queue head, and read/reset the counter value.
+ */
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+				  __u64 *cnt)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->wqh.lock, flags);
+	eventfd_ctx_do_read(ctx, cnt);
+	__remove_wait_queue(&ctx->wqh, wait);
+	if (*cnt != 0 && waitqueue_active(&ctx->wqh))
+		wake_up_locked_poll(&ctx->wqh, POLLOUT);
+	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
+
+	return *cnt != 0 ? 0: -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue);
+
+/**
+ * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
+ * @ctx: [in] Pointer to eventfd context.
+ * @no_wait: [in] Different from zero if the operation should not block.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN      : The operation would have blocked but @no_wait was nonzero.
+ * -ERESTARTSYS : A signal interrupted the wait operation.
+ *
+ * If @no_wait is zero, the function might sleep until the eventfd internal
+ * counter becomes greater than zero.
+ */
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
 {
-	struct eventfd_ctx *ctx = file->private_data;
 	ssize_t res;
-	__u64 ucnt = 0;
 	DECLARE_WAITQUEUE(wait, current);
 
-	if (count < sizeof(ucnt))
-		return -EINVAL;
 	spin_lock_irq(&ctx->wqh.lock);
+	*cnt = 0;
 	res = -EAGAIN;
 	if (ctx->count > 0)
-		res = sizeof(ucnt);
-	else if (!(file->f_flags & O_NONBLOCK)) {
+		res = 0;
+	else if (!no_wait) {
 		__add_wait_queue(&ctx->wqh, &wait);
-		for (res = 0;;) {
+		for (;;) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (ctx->count > 0) {
-				res = sizeof(ucnt);
+				res = 0;
 				break;
 			}
 			if (signal_pending(current)) {
@@ -168,18 +213,31 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
 		__remove_wait_queue(&ctx->wqh, &wait);
 		__set_current_state(TASK_RUNNING);
 	}
-	if (likely(res > 0)) {
-		ucnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
-		ctx->count -= ucnt;
+	if (likely(res == 0)) {
+		eventfd_ctx_do_read(ctx, cnt);
 		if (waitqueue_active(&ctx->wqh))
 			wake_up_locked_poll(&ctx->wqh, POLLOUT);
 	}
 	spin_unlock_irq(&ctx->wqh.lock);
-	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
-		return -EFAULT;
 
 	return res;
 }
+EXPORT_SYMBOL_GPL(eventfd_ctx_read);
+
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+			    loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 cnt;
+
+	if (count < sizeof(cnt))
+		return -EINVAL;
+	if ((res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt)) < 0)
+		return res;
+
+	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT: sizeof(cnt);
+}
 
 static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
 			     loff_t *ppos)
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 94dd103..91bb4f2 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -10,6 +10,7 @@
 
 #include <linux/fcntl.h>
 #include <linux/file.h>
+#include <linux/wait.h>
 
 /*
  * CAREFUL: Check include/asm-generic/fcntl.h when defining
@@ -34,6 +35,9 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 int eventfd_signal(struct eventfd_ctx *ctx, int n);
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+				  __u64 *cnt);
 
 #else /* CONFIG_EVENTFD */
 
@@ -61,6 +65,18 @@ static inline void eventfd_ctx_put(struct eventfd_ctx *ctx)
 
 }
 
+static inline ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait,
+				       __u64 *cnt)
+{
+	return -ENOSYS;
+}
+
+static inline int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx,
+						wait_queue_t *wait, __u64 *cnt)
+{
+	return -ENOSYS;
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */

^ permalink raw reply related	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2010-01-13 17:10 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-20 15:56 [PATCH 0/2] eventfd: new EFD_STATE flag Michael S. Tsirkin
2009-08-20 16:20 ` Davide Libenzi
2009-08-20 17:38   ` Avi Kivity
2009-08-20 17:44     ` Davide Libenzi
2009-08-20 17:56       ` Paolo Bonzini
2009-08-21 17:21         ` Davide Libenzi
2009-08-20 17:55     ` Michael S. Tsirkin
2009-08-20 18:06       ` Avi Kivity
2009-08-20 18:28         ` Michael S. Tsirkin
2009-08-23 13:01           ` Avi Kivity
2009-08-23 13:36             ` Michael S. Tsirkin
2009-08-23 13:40               ` Avi Kivity
2009-08-23 14:30                 ` Michael S. Tsirkin
2009-08-23 16:51                   ` Paolo Bonzini
2009-08-24 18:25                   ` Davide Libenzi
2009-08-24 18:31                     ` Avi Kivity
2009-08-24 22:08                       ` Davide Libenzi
2009-08-24 22:10                         ` Paolo Bonzini
2009-08-24 22:32                           ` Davide Libenzi
2009-08-25  6:59                             ` Paolo Bonzini
2009-08-25  4:26                         ` Avi Kivity
2009-08-24 21:49                     ` Michael S. Tsirkin
2009-08-24 22:15                       ` Davide Libenzi
2009-08-25  7:22                         ` Michael S. Tsirkin
2009-08-25 21:57                           ` Davide Libenzi
2009-08-26 10:29                             ` Michael S. Tsirkin
2009-08-26 10:41                               ` Avi Kivity
2009-08-26 17:45                               ` Davide Libenzi
2009-08-26 18:58                                 ` Avi Kivity
2009-08-26 19:13                                   ` Davide Libenzi
2009-08-26 19:42                                     ` Avi Kivity
2009-08-26 19:44                                       ` Davide Libenzi
2009-08-26 23:30                                         ` Davide Libenzi
2009-08-27  4:13                                           ` Avi Kivity
2009-08-27  8:06                                             ` Michael S. Tsirkin
2009-08-27 14:20                                               ` Davide Libenzi
2009-08-26 19:50                                       ` Gleb Natapov
2009-08-26 20:04                                         ` Davide Libenzi
2009-08-27  5:25                                           ` Gleb Natapov
2009-08-27  9:05                                     ` Paolo Bonzini
2009-08-27  9:09                                       ` Michael S. Tsirkin
2009-08-27 14:21                                       ` Davide Libenzi
2009-08-27 14:30                                         ` Michael S. Tsirkin
2009-08-27 14:38                                           ` Davide Libenzi
2009-08-27 14:49                                             ` Michael S. Tsirkin
2009-08-27 15:29                                               ` Davide Libenzi
2009-08-27 17:09                                                 ` Davide Libenzi
     [not found]                                                   ` <alpine.DEB.2.00.0908311644410.17349@makko.or.mcafeemobile.com>
     [not found]                                                     ` <4A9CB318.7030401@redhat.com>
     [not found]                                                       ` <alpine.DEB.2.00.0909010723380.28172@makko.or.mcafeemobile.com>
2010-01-06 19:33                                                         ` Michael S. Tsirkin
2010-01-06 20:43                                                           ` Davide Libenzi
2010-01-06 20:55                                                             ` Michael S. Tsirkin
2010-01-06 21:17                                                               ` Davide Libenzi
2010-01-06 22:29                                                                 ` Michael S. Tsirkin
2010-01-06 22:46                                                                   ` Davide Libenzi
2010-01-06 23:45                                                                     ` Michael S. Tsirkin
2010-01-06 23:59                                                                       ` Davide Libenzi
2010-01-07  0:02                                                                         ` Michael S. Tsirkin
2010-01-07  6:45                                                                         ` Michael S. Tsirkin
2010-01-07  7:25                                                                           ` Davide Libenzi
2010-01-07 10:36                                                                             ` Michael S. Tsirkin
2010-01-07 23:37                                                                               ` Davide Libenzi
2010-01-08  0:13                                                                                 ` Davide Libenzi
2010-01-08  0:26                                                                               ` Davide Libenzi
2010-01-10 10:30                                                                                 ` Michael S. Tsirkin
2010-01-10 15:26                                                                                   ` Davide Libenzi
2010-01-10 16:22                                                                                     ` Michael S. Tsirkin
2010-01-10 17:27                                                                                       ` Davide Libenzi
2010-01-10 17:35                                                                                         ` Michael S. Tsirkin
2010-01-10 19:04                                                                                           ` Davide Libenzi
2010-01-11  7:34                                                                                             ` Michael S. Tsirkin
2010-01-11 19:14                                                                                               ` Davide Libenzi
2010-01-11 19:19                                                                                                 ` Michael S. Tsirkin
2010-01-11 22:53                                                                                                   ` Davide Libenzi
2010-01-13 17:07                                                                                                     ` Michael S. Tsirkin
2010-01-11  9:01                                                                               ` Gleb Natapov
2010-01-11  9:02                                                                                 ` Michael S. Tsirkin
2010-01-11  9:08                                                                                   ` Gleb Natapov
2010-01-11  9:19                                                                                     ` Michael S. Tsirkin
2010-01-11  9:36                                                                                       ` Gleb Natapov
2010-01-11  9:41                                                                                         ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).