[patch] sys_epoll ...

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch] sys_epoll ...
@ 2002-10-21  0:16 Davide Libenzi
  2002-10-21  0:50 ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Davide Libenzi @ 2002-10-21  0:16 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Hanna Linder

Per Linus request I implemented a syscall interface to the old /dev/epoll
to avoid the creation of magic inodes inside /dev. Since the new
implementation shares 95% of the code with the old one, the old interface
is maintained for compatibility with existing users. The new syscall
interface adds three system calls ( Linus doesn't like multiplexing ) :

#define EP_CTL_ADD 1
#define EP_CTL_DEL 2
#define EP_CTL_MOD 3

asmlinkage int sys_epoll_create(int maxfds);
asmlinkage int sys_epoll_ctl(int epfd, int op, int fd, unsigned int events);
asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout);

The function sys_epoll_create() creates a sys_epoll "object" by allocation
space for "maxfds" descriptors. The sys_epoll "object" is basically a
file descriptor, and this enable the new interface to :

1) Mantain compatibility with the existing interface
2) Avoid the creation of a sys_epoll_close() syscall
3) Reuse 95% of the existing code
4) Inherit the file* automatic cleanup code w/out having to code a
	dedicated one

The function sys_epoll_ctl() is the controller interface and what it does
is pretty obvious. The "op" parameter is either EP_CTL_ADD, EP_CTL_DEL or
EP_CTL_MOD and the parameter "fd" is the target of the operation. The last
parameter "events" is used in both EP_CTL_ADD and EP_CTL_MOD and rapresent
the event interest mask. The function sys_epoll_wait() waits for events by
allowing a maximum timeout "timeout" in milliseconds and returns the
number of events ( struct pollfd ) that the caller will find available at
"*events". The port of the old /dev/epoll to 2.5.44 and the new sys_epoll
are available at :

http://www.xmailserver.org/linux-patches/nio-improve.html#patches

- Davide

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] sys_epoll ...
  2002-10-21  0:16 [patch] sys_epoll Davide Libenzi
@ 2002-10-21  0:50 ` Andrew Morton
  2002-10-21  1:28   ` Davide Libenzi
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Andrew Morton @ 2002-10-21  0:50 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List, Hanna Linder

Davide Libenzi wrote:
> 
> Per Linus request I implemented a syscall interface to the old /dev/epoll
> to avoid the creation of magic inodes inside /dev.

>From a (very) quick read:


+static struct vm_operations_struct eventpoll_mmap_ops = {
+       open: eventpoll_mm_open,
+       close: eventpoll_mm_close,
+};

Please use the C99 form.

+       ep = (struct eventpoll *) file->private_data;

The cast there just adds noise.  private_data is void *.  In
fact you lose a bit of compile-time checking by having the cast
there.


+       addr = do_mmap_pgoff(file, 0, EP_MAP_SIZE(maxfds + 1), PROT_READ | PROT_WRITE,
+                                                MAP_PRIVATE, 0);

You must hold current->mm->mmap-sem for writing before calling
do_mmap_pgoff().  I shall add the missing comment to do_mmap_pgoff.


+asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout)
+{
+       int error = -EBADF;
+       unsigned long eaddr;
+       struct file *file;
+       struct eventpoll *ep;
+       struct evpoll dvp;
+
+       DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d)\n",
+                                current, epfd, events, timeout));
+
+       file = fget(epfd);
+       if (!file)
+               goto fget_fail;
+       ep = (struct eventpoll *) file->private_data;
+
+       error = -EINVAL;
+       if (!atomic_read(&ep->mmapped))

Will oops if not passed one of your fd's?

(Ditto in sys_epoll_ctl)

+       clear_bit(PG_reserved, &virt_to_page(pages[ii])->flags);

Please use SetPageReserved()/ClearPageReserved().


+static void ep_free(struct eventpoll *ep)
+{
+       int ii;
+       struct list_head *lnk;
+
+       lock_kernel();
+       for (ii = 0; ii <= ep->hmask; ii++) {
+               while ((lnk = list_first(&ep->hash[ii]))) {
+                       struct epitem *dpi = list_entry(lnk, struct epitem, llink);
+

What is locking the nodes in this hashtable?  Here it seems to
be lock_kernel.  Elsewhere it seems to be write_lock_irqsave(&ep->lock);

+static inline struct epitem *ep_find_nl(struct eventpoll *ep, int fd)

Should be uninlined.

+               __copy_from_user(&pfd, buffer, sizeof(pfd));

Check for EFAULT.


+                       if (ep->eventcnt || !timeout)
+                               break;
+                       if (signal_pending(current)) {
+                               res = -EINTR;
+                               break;
+                       }
+
+                       set_current_state(TASK_INTERRUPTIBLE);
+
+                       write_unlock_irqrestore(&ep->lock, flags);
+                       timeout = schedule_timeout(timeout);

You should set current->state before performing the checks.

+               if (res > 0)
+                       copy_to_user((void *) arg, &dvp, sizeof(struct evpoll));

EFAULT?


+       write_lock_irqsave(&ep->lock, flags);
+
+       res = -EINVAL;
+       if (numpages != (2 * ep->numpages))
+               goto out;
+
+       start = vma->vm_start;
+       for (ii = 0; ii < ep->numpages; ii++) {
+               if (remap_page_range(vma, start, __pa(ep->pages0[ii]),
+                                                        PAGE_SIZE, vma->vm_page_prot))

remap_page_range() can sleep in pmd_alloc(), etc.  May not be called under
spinlock.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] sys_epoll ...
  2002-10-21  0:50 ` Andrew Morton
@ 2002-10-21  1:28   ` Davide Libenzi
  2002-10-21  1:54   ` Davide Libenzi
  2002-10-21  3:34   ` Davide Libenzi
  2 siblings, 0 replies; 9+ messages in thread
From: Davide Libenzi @ 2002-10-21  1:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel Mailing List

On Sun, 20 Oct 2002, Andrew Morton wrote:

> Davide Libenzi wrote:
> >
> > Per Linus request I implemented a syscall interface to the old /dev/epoll
> > to avoid the creation of magic inodes inside /dev.
>
> >From a (very) quick read:
>
>
> +static struct vm_operations_struct eventpoll_mmap_ops = {
> +       open: eventpoll_mm_open,
> +       close: eventpoll_mm_close,
> +};
>
> Please use the C99 form.

Consider It Done.



> +       ep = (struct eventpoll *) file->private_data;
>
> The cast there just adds noise.  private_data is void *.  In
> fact you lose a bit of compile-time checking by having the cast
> there.

CID



> +       addr = do_mmap_pgoff(file, 0, EP_MAP_SIZE(maxfds + 1), PROT_READ | PROT_WRITE,
> +                                                MAP_PRIVATE, 0);
>
> You must hold current->mm->mmap-sem for writing before calling
> do_mmap_pgoff().  I shall add the missing comment to do_mmap_pgoff.

True, CID



> +asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout)
> +{
> +       int error = -EBADF;
> +       unsigned long eaddr;
> +       struct file *file;
> +       struct eventpoll *ep;
> +       struct evpoll dvp;
> +
> +       DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d)\n",
> +                                current, epfd, events, timeout));
> +
> +       file = fget(epfd);
> +       if (!file)
> +               goto fget_fail;
> +       ep = (struct eventpoll *) file->private_data;
> +
> +       error = -EINVAL;
> +       if (!atomic_read(&ep->mmapped))
>
> Will oops if not passed one of your fd's?

Yes it will. I will find a solution asap.



> +       clear_bit(PG_reserved, &virt_to_page(pages[ii])->flags);
>
> Please use SetPageReserved()/ClearPageReserved().

CID



> +static void ep_free(struct eventpoll *ep)
> +{
> +       int ii;
> +       struct list_head *lnk;
> +
> +       lock_kernel();
> +       for (ii = 0; ii <= ep->hmask; ii++) {
> +               while ((lnk = list_first(&ep->hash[ii]))) {
> +                       struct epitem *dpi = list_entry(lnk, struct epitem, llink);
> +
>
> What is locking the nodes in this hashtable?  Here it seems to
> be lock_kernel.  Elsewhere it seems to be write_lock_irqsave(&ep->lock);



> +static inline struct epitem *ep_find_nl(struct eventpoll *ep, int fd)
>
> Should be uninlined.

CID



>
> +               __copy_from_user(&pfd, buffer, sizeof(pfd));
>
> Check for EFAULT.

CID



> +                       if (ep->eventcnt || !timeout)
> +                               break;
> +                       if (signal_pending(current)) {
> +                               res = -EINTR;
> +                               break;
> +                       }
> +
> +                       set_current_state(TASK_INTERRUPTIBLE);
> +
> +                       write_unlock_irqrestore(&ep->lock, flags);
> +                       timeout = schedule_timeout(timeout);
>
> You should set current->state before performing the checks.



> +               if (res > 0)
> +                       copy_to_user((void *) arg, &dvp, sizeof(struct evpoll));
>
> EFAULT?

CID



> +       write_lock_irqsave(&ep->lock, flags);
> +
> +       res = -EINVAL;
> +       if (numpages != (2 * ep->numpages))
> +               goto out;
> +
> +       start = vma->vm_start;
> +       for (ii = 0; ii < ep->numpages; ii++) {
> +               if (remap_page_range(vma, start, __pa(ep->pages0[ii]),
> +                                                        PAGE_SIZE, vma->vm_page_prot))
>
> remap_page_range() can sleep in pmd_alloc(), etc.  May not be called under
> spinlock.

Yep, true. CID




- Davide




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] sys_epoll ...
  2002-10-21  0:50 ` Andrew Morton
  2002-10-21  1:28   ` Davide Libenzi
@ 2002-10-21  1:54   ` Davide Libenzi
  2002-10-21  2:01     ` Andrew Morton
  2002-10-21  3:34   ` Davide Libenzi
  2 siblings, 1 reply; 9+ messages in thread
From: Davide Libenzi @ 2002-10-21  1:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel Mailing List

On Sun, 20 Oct 2002, Andrew Morton wrote:

> +                       if (ep->eventcnt || !timeout)
> +                               break;
> +                       if (signal_pending(current)) {
> +                               res = -EINTR;
> +                               break;
> +                       }
> +
> +                       set_current_state(TASK_INTERRUPTIBLE);
> +
> +                       write_unlock_irqrestore(&ep->lock, flags);
> +                       timeout = schedule_timeout(timeout);
>
> You should set current->state before performing the checks.

Why this Andrew ?



- Davide



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] sys_epoll ...
  2002-10-21  1:54   ` Davide Libenzi
@ 2002-10-21  2:01     ` Andrew Morton
  2002-10-21  2:18       ` Davide Libenzi
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2002-10-21  2:01 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List

Davide Libenzi wrote:
> 
> On Sun, 20 Oct 2002, Andrew Morton wrote:
> 
> > +                       if (ep->eventcnt || !timeout)
> > +                               break;
> > +                       if (signal_pending(current)) {
> > +                               res = -EINTR;
> > +                               break;
> > +                       }
> > +
> > +                       set_current_state(TASK_INTERRUPTIBLE);
> > +
> > +                       write_unlock_irqrestore(&ep->lock, flags);
> > +                       timeout = schedule_timeout(timeout);
> >
> > You should set current->state before performing the checks.
> 
> Why this Andrew ?
> 

Well I'm assuming that you don't want to sleep if, say,
ep->eventcnt is non-zero.  The code is currently (simplified):

	add_wait_queue(...);
	if (ep->eventcnt)
		break;
	/* window here */
	set_current_state(TASK_INTERRUPTIBLE);
	schedule();

If another CPU increments eventcnt and sends this task a wakeup in that
window, it is missed and we still sleep.  The conventional fix for that
is:

	add_wait_queue(...);
	set_current_state(TASK_INTERRUPTIBLE);
	if (ep->eventcnt)
		break;
	/* harmless window here */
	schedule();

So if someone delivers a wakeup in the "harmless window" then this task
still calls schedule(), but the wakeup has turned the state from
TASK_INTERRUPTIBLE into TASK_RUNNING, so the schedule() doesn't actually
take this task off the runqueue.  This task will zoom straight through the
schedule() and will then loop back and notice the incremented ep->eventcnt.

So it is important that the waker increment eventcnt _before_ delivering
the wake_up, too.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] sys_epoll ...
  2002-10-21  2:01     ` Andrew Morton
@ 2002-10-21  2:18       ` Davide Libenzi
  0 siblings, 0 replies; 9+ messages in thread
From: Davide Libenzi @ 2002-10-21  2:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel Mailing List

On Sun, 20 Oct 2002, Andrew Morton wrote:

> Well I'm assuming that you don't want to sleep if, say,
> ep->eventcnt is non-zero.  The code is currently (simplified):
>
> 	add_wait_queue(...);
> 	if (ep->eventcnt)
> 		break;
> 	/* window here */
> 	set_current_state(TASK_INTERRUPTIBLE);
> 	schedule();
>
> If another CPU increments eventcnt and sends this task a wakeup in that
> window, it is missed and we still sleep.  The conventional fix for that
> is:
>
> 	add_wait_queue(...);
> 	set_current_state(TASK_INTERRUPTIBLE);
> 	if (ep->eventcnt)
> 		break;
> 	/* harmless window here */
> 	schedule();
>
> So if someone delivers a wakeup in the "harmless window" then this task
> still calls schedule(), but the wakeup has turned the state from
> TASK_INTERRUPTIBLE into TASK_RUNNING, so the schedule() doesn't actually
> take this task off the runqueue.  This task will zoom straight through the
> schedule() and will then loop back and notice the incremented ep->eventcnt.
>
> So it is important that the waker increment eventcnt _before_ delivering
> the wake_up, too.

It's true ... but the window is pretty small there :) Anyway it makes the
code more correct and I changed it. I have the new patch with your
suggestions ready and I will post as sonn as it'll pass a few tests.



- Davide



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch] sys_epoll ...
  2002-10-21  0:50 ` Andrew Morton
  2002-10-21  1:28   ` Davide Libenzi
  2002-10-21  1:54   ` Davide Libenzi
@ 2002-10-21  3:34   ` Davide Libenzi
  2 siblings, 0 replies; 9+ messages in thread
From: Davide Libenzi @ 2002-10-21  3:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Kernel Mailing List

On Sun, 20 Oct 2002, Andrew Morton wrote:

> Davide Libenzi wrote:
> >
> > Per Linus request I implemented a syscall interface to the old /dev/epoll
> > to avoid the creation of magic inodes inside /dev.
>
> >From a (very) quick read:

[ lots of required fixes ]

The new patch ( 0.3 ) with the fixes Andrew suggested is available here :

http://www.xmailserver.org/linux-patches/nio-improve.html#patches




- Davide



^ permalink raw reply	[flat|nested] 9+ messages in thread

* re: [patch] sys_epoll ...
@ 2002-10-21  7:05 Dan Kegel
  2002-10-21 17:10 ` Davide Libenzi
  0 siblings, 1 reply; 9+ messages in thread
From: Dan Kegel @ 2002-10-21  7:05 UTC (permalink / raw)
  To: linux-kernel, Davide Libenzi

Davide wrote:
 >asmlinkage int sys_epoll_create(int maxfds);
 >asmlinkage int sys_epoll_ctl(int epfd, int op, int fd, unsigned int events);
 >asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout);

Hey Davide,
I've always been a bit bothered by the need to specify maxfds in
advance.  What's the preferred way to handle the situation
where you guess wrong on the value of maxfds?  Create a new
epoll and register all the old fds with it?  (Sounds like
a good job for a userspace wrapper library.)

Regardless, thanks for pushing /dev/epoll along towards inclusion
in 2.5.  I'm looking forward to seeing it it integrated.
Even if the interface doesn't please everyone, the performance
should...
- Dan


^ permalink raw reply	[flat|nested] 9+ messages in thread

* re: [patch] sys_epoll ...
  2002-10-21  7:05 Dan Kegel
@ 2002-10-21 17:10 ` Davide Libenzi
  0 siblings, 0 replies; 9+ messages in thread
From: Davide Libenzi @ 2002-10-21 17:10 UTC (permalink / raw)
  To: Dan Kegel; +Cc: linux-kernel

On Mon, 21 Oct 2002, Dan Kegel wrote:

> Davide wrote:
>  >asmlinkage int sys_epoll_create(int maxfds);
>  >asmlinkage int sys_epoll_ctl(int epfd, int op, int fd, unsigned int events);
>  >asmlinkage int sys_epoll_wait(int epfd, struct pollfd **events, int timeout);
>
> Hey Davide,
> I've always been a bit bothered by the need to specify maxfds in
> advance.  What's the preferred way to handle the situation
> where you guess wrong on the value of maxfds?  Create a new
> epoll and register all the old fds with it?  (Sounds like
> a good job for a userspace wrapper library.)

I'm currently looking at doing it automatically, w/out the need of
specifying it. I don't like it either ...


> Regardless, thanks for pushing /dev/epoll along towards inclusion
> in 2.5.  I'm looking forward to seeing it it integrated.
> Even if the interface doesn't please everyone, the performance
> should...

You should thank Hanna Linder that bugged Linus way more than I did :)



- Davide



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2002-10-21 16:55 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-10-21  0:16 [patch] sys_epoll Davide Libenzi
2002-10-21  0:50 ` Andrew Morton
2002-10-21  1:28   ` Davide Libenzi
2002-10-21  1:54   ` Davide Libenzi
2002-10-21  2:01     ` Andrew Morton
2002-10-21  2:18       ` Davide Libenzi
2002-10-21  3:34   ` Davide Libenzi
  -- strict thread matches above, loose matches on Subject: below --
2002-10-21  7:05 Dan Kegel
2002-10-21 17:10 ` Davide Libenzi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox