public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [rfc] epoll interface change and glibc bits ...
@ 2002-11-18 16:05 Davide Libenzi
  2002-11-18 16:12 ` Jakub Jelinek
  2002-11-18 17:51 ` Mark Mielke
  0 siblings, 2 replies; 67+ messages in thread
From: Davide Libenzi @ 2002-11-18 16:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Ulrich Drepper


1) epoll's event structure extension

I received quite a few request to extend the event structure to have space
for an opaque user data object. The eventpoll event structure will turn to
be :

struct epollfd {
	int fd;
	unsigned short int events, revents;
	unsigned long obj;
};

and the epoll_ctl() function will turn to :

int epoll_ctl(int epfd, int op, struct epollfd *pfd);



2) epoll bits in glibc

I was talking to Ulrich Drepper about adding epoll bits inside glibc. His
first objection was to store epoll bits inside poll.h, that IMHO is wrong
because epoll semantics are completely different from poll(). My idea of
the <sys/epoll.h> include file would be this :

#ifndef _SYS_EPOLL_H
#define _SYS_EPOLL_H 1

#include <bits/poll.h>

/* Valid opcodes to issue to sys_epoll_ctl() */
#define EP_CTL_ADD 1
#define EP_CTL_DEL 2
#define EP_CTL_MOD 3

struct epollfd {
        int fd;
        unsigned short int events, revents;
        unsigned long obj;
};

__BEGIN_DECLS

extern int epoll_create(int size);
extern int epoll_ctl(int epfd, int op, struct epollfd *pfd) THROW;
extern int epoll_wait(int epfd, struct epollfd *events, int maxevents,
                      int timeout) THROW;

__END_DECLS

#endif  /* sys/epoll.h */


But he does not like epoll to include <bits/poll.h> and he  would like
epoll to redefine POLLIN, POLLOUT, ... to EPOLLIN, EPOLLOUT, ...
In my opinion it is right for epoll to include <bits/poll.h> because those
are bits that f_op->poll() returns, and renaming those bits inside another
include file will require more maintainance. If the kernel will be
extended to support more POLL* bits, they will have to go only inside
<bits/poll.h> w/out having another file to be updated IMHO.



I would like to receive feedback on those issues ...





- Davide




^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [rfc] epoll interface change and glibc bits ...
@ 2002-11-18 18:40 Grant Taylor
  0 siblings, 0 replies; 67+ messages in thread
From: Grant Taylor @ 2002-11-18 18:40 UTC (permalink / raw)
  To: linux-kernel

[ This is sort of a rehash of a private email or two; Davide asked me
  to air my objections... ]

Davide Libenzi writes:

> I received quite a few request to extend the event structure to have
> space for an opaque user data object. The eventpoll event structure
> will turn to be:
> 
> struct epollfd {
>        int fd;
>        unsigned short int events, revents;
>        unsigned long obj;
> };

NOOOO!!!!!

As it stands, it's very easy to write an event dispatching layer that
runs over both epoll and poll.  You write a little function that,
somehow, gathers events from the kernel in the form of a "struct
pollfd" array, and then the same code can process these events for
both poll and epoll.  If epoll deviates from returning pollfd's, then
there needs to be mapping code or duplicate event dispatch logic for
two different yet substantively similar structures.

More importantly, the whole notion of userspace storing pointers (and
don't kid yourself, that's what people will use it for) in the kernel
is a little suspect.  At best you get some slightly awkward caveats.
For example, when userspace decides to free or move something, it has
to tell the kernel via a syscall AND manually scrub the epollfd's
present in the userspace-owned part of the event mapping.

I also don't think there is a performance argument for this feature.
The usual thing is an array lookup indexed on the fd, and that should
be entirely workable for at least as long as the kernel keeps fd's in
an array.

I'm open to epoll returning non-pollfd records if we find that there
is useful information that should be passed to userspace that way, but
I don't think this is it.

> I was talking to Ulrich Drepper about adding epoll bits inside
> glibc. His first objection was to store epoll bits inside poll.h,
> that IMHO is wrong because epoll semantics are completely different
> from poll(). 

> My idea of the <sys/epoll.h> include file would be this:

> #ifndef _SYS_EPOLL_H
> #define _SYS_EPOLL_H 1
> 
> #include <bits/poll.h>
> [...]

> But he does not like epoll to include <bits/poll.h> and he would
> like epoll to redefine POLLIN, POLLOUT, ... to EPOLLIN, EPOLLOUT,
> ...

> In my opinion it is right for epoll to include <bits/poll.h> because
> those are bits that f_op->poll() returns, and renaming those bits
> inside another include file will require more maintainance. If the
> kernel will be extended to support more POLL* bits, they will have
> to go only inside <bits/poll.h> w/out having another file to be
> updated IMHO.

I'm with you on this one, mostly for the same reasons I disagree with
the userdata thing.  It just adds work to have to do things
differently.

The various flavors of signals for event delivery all give you normal
poll bits and not oddball "SIGPOLLFOO" mechanism-specific bits.  If we
continue adding new mechanisms with new bits, we'll rapidly run out,
at least on 32 bit machines.

I appreciate the interface cleanliness direction the library people
are coming from, but I don't think it's useful here.  The meaning of
the event is a function of both the event (the struct pollfd) and the
event arrival (via epoll, poll, sigio, sig32+, etc).

-- 
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
   Linux Printing Website and HOWTO:  http://www.linuxprinting.org/

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [rfc] epoll interface change and glibc bits ...
@ 2002-11-18 22:04 Grant Taylor
  2002-11-18 22:32 ` Mark Mielke
  2002-11-18 23:07 ` Davide Libenzi
  0 siblings, 2 replies; 67+ messages in thread
From: Grant Taylor @ 2002-11-18 22:04 UTC (permalink / raw)
  To: linux-kernel

Ulrich Drepper writes:

>> epoll does hook f_op->poll() and hence uses the asm/poll.h bits.

> It does today. We are talking about "you promise that this will be
> the case ever after or we'll cut your head off". 
> [...]
> it is not you who has to deal with the fallout of a change when it

Maybe Davide wouldn't, but *I* do; my project at work runs over epoll,
and interface changes would require rework by me.

Sensible interface changes in the future won't bother me.  I don't
expect anything in the future nearly as earth-shattering as this
current driver/ioctl->syscall transition.

> If epoll is so different from poll (and this is what I've been told
> frmo Davide) then there should be a clear separation of the
> interfaces and all those arguing to unify the data types and
> constants better should rethink there understanding.

The main call returns a subset of the information that poll returns.
What could be more natural than to name that subset the same thing?

Really, sys_epoll does two things:

 1 It sets up epoll itself.  

   This interface is entirely epoll-specific, and has all EP-specific
   constants etc, in <sys/epoll.h>, as well it should.


 2 It returns the set of poll bits that have changed since the last
   epoll.

   This interface is defined largely in terms of poll, and that's OK.
   How many changes do you expect in the poll interface?


In the end, I don't really feel all that strongly about this bits
issue (since truth be told the performance impact will be very small
at most) so it's up to you and Davide.


OTOH, I really hate the "user pointer in struct epollfd" thing...

-- 
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
   Linux Printing Website and HOWTO:  http://www.linuxprinting.org/

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [rfc] epoll interface change and glibc bits ...
@ 2002-11-18 22:21 Dan Kegel
  2002-11-18 23:09 ` Davide Libenzi
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Kegel @ 2002-11-18 22:21 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Ulrich wrote:
>> epoll does hook f_op->poll() and hence uses the asm/poll.h bits.
> 
> It does today.  We are talking about "you promise that this will be the
> case ever after or we'll cut your head off".  I have no idea why you're
> so reluctant since you don't have to maintain any of the user-level
> bits.  And it is not you who has to deal with the fallout of a change
> when it happens.
> 
> If epoll is so different from poll (and this is what I've been told frmo
> Davide) then there should be a clear separation of the interfaces and
> all those arguing to unify the data types and constants better should
> rethink there understanding.

epoll is not really that different from poll, is it?
It delivers edge-triggered versions of the same events poll uses.
Or is there something epoll does I'm not aware of?
- Dan


^ permalink raw reply	[flat|nested] 67+ messages in thread
[parent not found: <20021118223125.GB14649@mark.mielke.cc>]
* Re: [rfc] epoll interface change and glibc bits ...
@ 2002-11-19  5:49 Grant Taylor
  2002-11-19  6:22 ` Mark Mielke
  0 siblings, 1 reply; 67+ messages in thread
From: Grant Taylor @ 2002-11-19  5:49 UTC (permalink / raw)
  To: linux-kernel

>>>>> Mark Mielke <mark@mark.mielke.cc> writes:

> We're talking about one extra field to a data structure.

Hmm.  As it happens, it looks like Davide is going to err on the side
of the kitchen sink; that neatly makes us both more or less happy,
even if it's still ugly ;)


Meanwhile, in the more important caveat department (Dan, this will
appeal to you), I found a while back that signals cause pain with
epoll.

For example, sometimes TCP reads return EAGAIN when in fact they have
data.  This seems to stem from the case where the signal is found
before the first segment copy (from tcp.c circa 1425, there's even a
handy FIXME note there).  If you use epoll and get an EAGAIN, you have
no idea if it was a signal or a real empty socket unless you are also
very careful to notice when you got a signal during the read.

	do {
		struct sk_buff * skb;
		u32 offset;

		/* Are we at urgent data? Stop if we have read anything. */
		if (copied && tp->urg_data && tp->urg_seq == *seq)
			break;

		/* We need to check signals first, to get correct SIGURG
		 * handling. FIXME: Need to check this doesnt impact 1003.1g
		 * and move it down to the bottom of the loop
		 */
		if (signal_pending(current)) {
			if (copied)
				break;
			copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
			break;
		}

		/* Next get a buffer. */

		skb = skb_peek(&sk->receive_queue);
		do {
	


-- 
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
   Linux Printing Website and HOWTO:  http://www.linuxprinting.org/

^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [rfc] epoll interface change and glibc bits ...
@ 2002-11-19 17:28 Dan Kegel
  0 siblings, 0 replies; 67+ messages in thread
From: Dan Kegel @ 2002-11-19 17:28 UTC (permalink / raw)
  To: Grant Taylor, Linux Kernel Mailing List

Grant Taylor wrote:
> Meanwhile, in the more important caveat department (Dan, this will
> appeal to you), I found a while back that signals cause pain with
> epoll.
> 
> For example, sometimes TCP reads return EAGAIN when in fact they have
> data.  This seems to stem from the case where the signal is found
> before the first segment copy (from tcp.c circa 1425, there's even a
> handy FIXME note there).  If you use epoll and get an EAGAIN, you have
> no idea if it was a signal or a real empty socket unless you are also
> very careful to notice when you got a signal during the read.
 > ...
> 		/* We need to check signals first, to get correct SIGURG
> 		 * handling. FIXME: Need to check this doesnt impact 1003.1g
> 		 * and move it down to the bottom of the loop
> 		 */
> 		if (signal_pending(current)) {
> 			if (copied)
> 				break;
> 			copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
> 			break;
> 		}

eek!  Thanks for noticing that!

Jamie wrote:

> Mark's right, it should be EINTR.  EAGAIN shouldn't break any
> single-thread user state machines using poll/select, as a non-blocking
> read is always allowed to return EAGAIN for any transient reason.
> 
> I'm not sure if EAGAIN can cause a poll() wakeup event to be missed.
> If so, that would be a TCP bug that breaks epoll, and it would also
> break some user state machines using poll/select, when there are
> multiple processes waiting on a socket.

I guess we should scour the sources looking for ways read() and write()
can return EAGAIN, and make sure that there is no chance this causes
a hang in user state machines that rely on epoll.  (Sure would be nice
if the Stanford Checker was up to this kind of thing.)
- Dan




^ permalink raw reply	[flat|nested] 67+ messages in thread
* Re: [rfc] epoll interface change and glibc bits ...
@ 2002-11-19 19:33 Grant Taylor
  2002-11-19 19:51 ` Davide Libenzi
  0 siblings, 1 reply; 67+ messages in thread
From: Grant Taylor @ 2002-11-19 19:33 UTC (permalink / raw)
  To: linux-kernel

Dan Kegel writes:

>> For example, sometimes TCP reads return EAGAIN when in fact they
> eek!  Thanks for noticing that!

We aim to please.

In fact, it took me a long time to track this sucker down, so it's
good that I remembered before someone else got baffled.

This wouldn't be an issue if nonblocking reads or writes were
guaranteed to be uninterruptible even if on a "slow" device.  But
that's a deeper change than just a bit of caution wrt EAGAIN, and this
is an area to monkey around in only very carefully.

Unfortunately, I never looked elsewhere for this style of error; in
the old epoll you could only epoll TCP sockets (maybe pipes too?).  It
might even be present in TCP sendmsg; I don't recall...


Davide writes:

> I will not enable epoll fds to be stored inside another epoll
> fd. With limited numbers of fds poll scales good, and if you need to
> create a priority hierarchy, you can use a small poll set of epoll
> fds. This is a very corner case and I'm not willing to screw up the
> to to handle something that made 0.1% of users are going to do. It
> _can_ be done in other ways.

I'd vote for the "three levels, tops" approach, myself, but I suspect
that for now you are right and it will be simpler to just be
restrictive until we find a need and/or a good way to let epoll epoll
epoll.  So to speak ;)

If we're bound to poll small sets of epoll fds perhaps a bit of
improvement in poll would be worthwhile.  I should go look at my
profiles again; I've got a program which works this way because some
library code requires poll semantics and it's no good rewriting the
world...

-- 
Grant Taylor - gtaylor<at>picante.com - http://www.picante.com/~gtaylor/
  Linux Printing Website and HOWTO:  http://www.linuxprinting.org/

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2002-11-21 17:37 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-11-18 16:05 [rfc] epoll interface change and glibc bits Davide Libenzi
2002-11-18 16:12 ` Jakub Jelinek
2002-11-18 16:15   ` Davide Libenzi
2002-11-18 16:18     ` Jakub Jelinek
2002-11-18 16:32       ` Davide Libenzi
2002-11-18 22:22         ` Matthew D. Hall
2002-11-18 17:51 ` Mark Mielke
2002-11-18 18:37   ` Davide Libenzi
2002-11-18 19:59     ` Ulrich Drepper
2002-11-18 21:31       ` Davide Libenzi
2002-11-18 22:56         ` Jamie Lokier
2002-11-18 23:56           ` Davide Libenzi
2002-11-19  1:34             ` Jamie Lokier
2002-11-19  1:50               ` Davide Libenzi
  -- strict thread matches above, loose matches on Subject: below --
2002-11-18 18:40 Grant Taylor
2002-11-18 22:04 Grant Taylor
2002-11-18 22:32 ` Mark Mielke
2002-11-18 23:07 ` Davide Libenzi
2002-11-18 22:21 Dan Kegel
2002-11-18 23:09 ` Davide Libenzi
2002-11-18 23:39   ` Dan Kegel
2002-11-18 23:20     ` Davide Libenzi
2002-11-18 23:52       ` Dan Kegel
2002-11-18 23:35         ` Davide Libenzi
2002-11-19  0:53           ` Dan Kegel
2002-11-19  1:34             ` Davide Libenzi
2002-11-19  2:08               ` Dan Kegel
2002-11-19  2:04                 ` Davide Libenzi
2002-11-19  3:46           ` Edgar Toernig
2002-11-19  4:14             ` Davide Libenzi
2002-11-19  5:35               ` Edgar Toernig
2002-11-19  6:09                 ` Mark Mielke
2002-11-19 17:07                   ` Davide Libenzi
2002-11-20  1:59                 ` Davide Libenzi
2002-11-20  3:09                   ` Jamie Lokier
2002-11-20  3:46                     ` Davide Libenzi
2002-11-20  4:04                     ` Davide Libenzi
2002-11-20  8:01                       ` Mark Mielke
2002-11-20 23:19                         ` Davide Libenzi
2002-11-20 23:51                           ` Mark Mielke
2002-11-20 23:57                             ` Davide Libenzi
2002-11-21  0:28                               ` Jamie Lokier
2002-11-21  1:23                                 ` Mark Mielke
2002-11-21  1:20                                   ` Davide Libenzi
2002-11-21  0:33                               ` Mark Mielke
2002-11-21  0:55                                 ` Jamie Lokier
2002-11-21  1:04                                   ` Davide Libenzi
2002-11-21 20:08                                 ` Denis Vlasenko
2002-11-21 16:51                                   ` Mark Mielke
2002-11-21 17:45                                     ` Davide Libenzi
2002-11-20 22:04                       ` Jamie Lokier
2002-11-20 22:08                         ` Davide Libenzi
2002-11-20 23:28                           ` Jamie Lokier
2002-11-20 23:33                             ` Davide Libenzi
2002-11-20  7:47                     ` Mark Mielke
2002-11-19  3:53         ` Mark Mielke
     [not found] <20021118223125.GB14649@mark.mielke.cc>
2002-11-19  0:23 ` Grant Taylor
2002-11-19  3:58   ` Davide Libenzi
2002-11-19  4:04   ` Mark Mielke
2002-11-19  5:49 Grant Taylor
2002-11-19  6:22 ` Mark Mielke
2002-11-19 15:24   ` Jamie Lokier
2002-11-19 17:28 Dan Kegel
2002-11-19 19:33 Grant Taylor
2002-11-19 19:51 ` Davide Libenzi
2002-11-19 20:57   ` Mark Mielke
2002-11-19 20:54     ` Davide Libenzi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox