* epoll patches queue status and glibc submission ...
@ 2002-11-27 20:09 Davide Libenzi
2002-11-27 21:25 ` Chris Friesen
0 siblings, 1 reply; 4+ messages in thread
From: Davide Libenzi @ 2002-11-27 20:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux Kernel Mailing List
[-- Attachment #1: Type: TEXT/PLAIN, Size: 452 bytes --]
This is the latest status of the epoll system include file and man pages
that I'll submit to Ulrich after having received comments from lkml. If
you have some feedback to report pls speak now or forever hold your piece :)
PS: Linus could you give me some feedback about the status of the epoll
patches I sent you for 2.5.49 ? The current kernel does not reflect this
interface and I won't post to Ulrich until all bits will be aligned.
- Davide
[-- Attachment #2: Type: TEXT/PLAIN, Size: 2753 bytes --]
#ifndef _SYS_EPOLL_H
#define _SYS_EPOLL_H 1
#include <sys/types.h>
enum EPOLL_EVENTS {
EPOLLIN = 0x001,
#define EPOLLIN EPOLLIN
EPOLLPRI = 0x002,
#define EPOLLPRI EPOLLPRI
EPOLLOUT = 0x004,
#define EPOLLOUT EPOLLOUT
#ifdef __USE_XOPEN
EPOLLRDNORM = 0x040,
#define EPOLLRDNORM EPOLLRDNORM
EPOLLRDBAND = 0x080,
#define EPOLLRDBAND EPOLLRDBAND
EPOLLWRNORM = 0x100,
#define EPOLLWRNORM EPOLLWRNORM
EPOLLWRBAND = 0x200,
#define EPOLLWRBAND EPOLLWRBAND
#endif /* #ifdef __USE_XOPEN */
#ifdef __USE_GNU
EPOLLMSG = 0x400,
#define EPOLLMSG EPOLLMSG
#endif /* #ifdef __USE_GNU */
EPOLLERR = 0x008,
#define EPOLLERR EPOLLERR
EPOLLHUP = 0x010
#define EPOLLHUP EPOLLHUP
};
/* Valid opcodes ( "op" parameter ) to issue to epoll_ctl() */
#define EPOLL_CTL_ADD 1 /* Add a file decriptor to the interface */
#define EPOLL_CTL_DEL 2 /* Remove a file decriptor from the interface */
#define EPOLL_CTL_MOD 3 /* Change file decriptor epoll_event structure */
typedef union epoll_data {
void *ptr;
int fd;
__uint32_t u32;
__uint64_t u64;
} epoll_data_t;
struct epoll_event {
__uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
};
__BEGIN_DECLS
/*
* Creates an epoll instance.
*
* Returns an fd for the new instance.
*
* The "size" parameter is a hint specifying the number of file
* descriptors to be associated with the new instance.
*
* The fd returned by epoll_create() should be closed with close().
*/
extern int epoll_create(int size);
/*
* Manipulate an epoll instance "epfd".
*
* Returns 0 in case of success, -1 in case of error ( the "errno" variable
* will contain the specific error code )
*
* The "op" parameter is one of the EPOLL_CTL_* constants defined above.
* The "fd" parameter is the target of the operation.
* The "event" parameter describes which events the caller is interested
* in and any associated user data.
*/
extern int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event) __THROW;
/*
* Wait for events on an epoll instance "epfd".
*
* Returns the number of triggered events returned in "events" buffer.
* Or -1 in case of error with the "errno" variable set to the specific error code.
*
* The "events" parameter is a buffer that will contain triggered events.
* The "maxevents" is the maximum number of events to be returned
* ( usually size of "events" ).
* The "timeout" parameter specifies the maximum wait time in milliseconds
* ( -1 == infinite ).
*/
extern int epoll_wait(int epfd, struct epoll_event *events, int maxevents,
int timeout) __THROW;
__END_DECLS
#endif /* #ifndef _SYS_EPOLL_H */
[-- Attachment #3: Type: TEXT/PLAIN, Size: 8540 bytes --]
.\"
.\" epoll by Davide Libenzi ( efficient event notification retrieval )
.\" Copyright (C) 2002 Davide Libenzi
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.\"
.TH EPOLL 4 "23 October 2002" Linux "Linux Programmer's Manual"
.SH NAME
epoll \- edge triggered I/O readiness notification facility
.SH SYNOPSIS
.B #include <sys/epoll.h>
.SH DESCRIPTION
epoll is the edge based variant of
.BR poll (2)
and scales well to large numbers of watched fds. Three system calls are
provided to set up and control an
epoll set:
.BR epoll_create (2) ,
.BR epoll_ctl (2) ,
.BR epoll_wait (2) .
An epoll set is connected to a file descriptor created by
.BR epoll_create (2) .
Interest for certain file descriptors is then registered via
.BR epoll_ctl(2) .
Finally, the actual wait is started by
.BR epoll_wait (2) .
.SH NOTES
epoll should not be regarded as a faster version of
.BR poll (2)
because its semantics are fundamentally different. Whereas
.BR poll (2)
and
.BR select (2)
return immediately if any of their fds are available for reading,
writing or have an exceptional condition, epoll only returns if a
condition changes (level triggered vs. edge triggered). However,
epoll does scale better with large file descriptor sets.
epoll should be used with non-blocking file descriptors to avoid
having a blocking read or write starve the task that is handling
multiple file decriptors. The suggested way to use epoll is below,
and possible pitfalls to avoid follow.
.SR
.TP
.B i
with non-blocking file descriptors
.TP
.B ii
by going to wait for an event only after
.BR read (2)
or
.BR write (2)
return EAGAIN
.SE
.SH EXAMPLE FOR SUGGESTED USAGE
In this example, listener is a non-blocking socket on which
.BR listen (2)
has been called. The function do_use_fd() uses the new ready
file descriptor until EAGAIN is returned by either
.BR read (2)
or
.BR write (2).
An event driven state machine application should, after having received
EAGAIN, record its current state so that at the next call to do_use_fd()
it will continue to
.BR read (2)
or
.BR write (2)
from where it stopped before.
.nf
for(;;) {
nfds = epoll_wait(kdpfd, &pfds, -1);
for(n = 0; n < nfds; ++n) {
if(pfds[n].fd == s) {
client = accept(listener, (struct sockaddr*)&local, &addrlen);
if(client < 0){
perror("accept");
continue;
}
setnonblocking(client);
if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, EPOLLIN) < 0) {
fprintf(stderr, "epoll set insertion error: fd=%d\n", client);
return -1;
}
}
else
do_use_fd(pfds[n].fd);
}
}
.fi
For performance reasons, it is possible to add the file descriptor inside the
epoll interface (
.B EPOLL_CTL_ADD
) once by specifying (EPOLLIN | EPOLLOUT). This allows you to avoid continuously switching
between EPOLLIN and EPOLLOUT calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
.SH QUESTIONS AND ANSWERS (from linux-kernel)
.SR
.TP
.B Q1
What happens if you add the same fd to an epoll_set twice?
.TP
.B A1
You will probably get EEXIST. However, it is possible that two
threads may add the same fd twice. This is a harmless condition.
.TP
.B Q2
Can two epoll sets wait for the same fd? If so, are events reported
to both epoll sets fds?
.TP
.B A2
Yes. However, it is not recommended. Yes it would be reported to both.
.TP
.B Q3
Is the epoll fd itself poll/epoll/selectable?
.TP
.B A3
Yes.
.TP
.B Q4
What happens if the epoll fd is put into its own fd set?
.TP
.B A4
It will fail. However, you can add an epoll fd inside
another epoll fd set.
.TP
.B Q5
Can I send the epoll fd over a unix-socket to another process?
.TP
.B A5
No.
.TP
.B Q6
Will the close of an fd cause it to be removed from all epoll sets automatically?
.TP
.B A6
Yes.
.TP
.B Q7
If more than one event comes in between epoll_wait calls, are they combined
or reported separately?
.TP
.B A7
They will be combined.
.TP
.B Q8
Does an operation on an fd affect the already collected but not yet reported
events?
.TP
.B A8
You can do two operations on an existing fd. Remove would be meaningless for
this case. Modify will re-read available I/O.
.TP
.B Q9
Do I need to continuosly read/write an fd until EAGAIN ?
.TP
.B A9
No you don't. Receiving an event from
.BR epoll_wait (2)
should suggest you that such file descriptor is ready for the requested I/O
operation. You have simply to consider it ready until you will receive the
next EAGAIN. When and how you will use such file descriptor is entirely up
to you.
.SE
.SH POSSIBLE PITFALLS AND WAYS TO AVOID THEM
.SR
.TP
.B o False Positive
.pp
It is possible that while reading (assuming you are reading in a loop waiting for EAGAIN),
more I/O comes in for a second event. That I/O will be read immediately. However, the next
time you call epoll_wait on that fd, it will say there is an event ready even though the I/O
for it has already been consumed.
.pp
.nf
input0 -> triggers event0
epoll_wait () returns True
while (ret != EAGAIN){
ret = read()
input1 -> triggers event1 but read() consumes input1 immediately
}
epoll_wait () returns True but there is no input1 waiting.
read() returns EAGAIN immediately.
.fi
.pp
In the case of non-blocking file descriptors, this will result in the
next call to read immediately returning with EAGAIN. In the case of blocking
file descriptors, you will hang waiting to read I/O that is not there.
The author does not recommend using non-blocking sockets but will not stop you.
.pp
.pp
One way to handle this is to mark the file descriptor as ready in its associated data
structure after the first event is received, then ignore other events while it is in the
ready state. When you read until receiving EAGAIN, set the ready state bit off before
calling epoll_wait again on that fd.
.pp
.TP
.B o Starvation
.pp
If there is a large amount of I/O space, it is possible that by trying to drain
it the other files will not get processed causing starvation. This
is not specific to epoll.
.pp
.pp
The solution is to maintain a ready list and mark the file descriptor as ready
in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst
all the ready files. This also supports ignoring subsequent events you
receive for fd's that are already ready.
.pp
.TP
.B o If using an event cache...
.pp
If you use an event cache or store all the fd's returned from epoll_wait,
then make sure to provide a way to mark its closure dynamically (ie- caused by
a previous event's processing). Suppose you receive 100 events from
epoll_wait, and in eventi #47 a condition causes event #13 to be closed.
If you remove the structure and close() the fd for event #13, then your
event cache might still say there are events waiting for that fd causing
confusion.
.pp
.pp
One solution for this is to call, during the processing of event 47,
epoll_ctl(EPOLL_CTL_DEL) to delete fd 13 and close(), then mark its associated
data structure as removed and link it to a cleanup list. If you find another
event for fd 13 in your batch processing, you will discover the fd had been
previously removed and there will be no confusion.
.pp
.SE
.SH CONFORMING TO
.BR epoll (4)
is a new API introduced in Linux kernel
.BR 2.5.44 .
Its interface should be finalized in Linux kernel
.BR 2.5.50 .
.SH "SEE ALSO"
.BR epoll_create (2)
.BR epoll_ctl (2)
.BR epoll_wait (2)
[-- Attachment #4: Type: TEXT/PLAIN, Size: 2179 bytes --]
.\"
.\" epoll by Davide Libenzi ( efficient event notification retrieval )
.\" Copyright (C) 2002 Davide Libenzi
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.\"
.TH EPOLL_CREATE 2 "23 October 2002" Linux "Linux Programmer's Manual"
.SH NAME
epoll_create \- open an
.B epoll
file descriptor
.SH SYNOPSIS
.B #include <sys/epoll.h>
.sp
.BR "int epoll_create(int " size )
.SH DESCRIPTION
Open an
.B epoll
file descriptor by requesting the kernel allocate
an event backing store dimensioned for
.I size
descriptors. The
.I size
is not the maximum size of the backing store but
just a hint to the kernel about how to dimension internal structures.
The returned file descriptor will be used for all the subsequent calls to the
.B epoll
interface. The file descriptor returned by
.BR epoll_create (2)
must be closed by using
.BR close (2).
.SH "RETURN VALUE"
When successful,
.BR epoll_create (2)
returns a positive integer identifying the descriptor.
When an error occurs,
.BR epoll_create (2)
returns -1 and
.I errno
is set appropriately.
.SH ERRORS
.TP
.B ENOMEM
There was insufficient memory to create the kernel object.
.SH CONFORMING TO
.BR epoll_create (2)
is a new API introduced in Linux kernel
.BR 2.5.44 .
The interface should be finalized by Linux kernel
.BR 2.5.50 .
.SH "SEE ALSO"
.BR epoll (4)
.BR epoll_ctl (2)
.BR epoll_wait (2)
.BR close (2)
[-- Attachment #5: Type: TEXT/PLAIN, Size: 3610 bytes --]
.\"
.\" epoll by Davide Libenzi ( efficient event notification retrieval )
.\" Copyright (C) 2002 Davide Libenzi
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.\"
.TH EPOLL_CTL 2 "23 October 2002" Linux "Linux Programmer's Manual"
.SH NAME
epoll_ctl \- controller interface for an
.B epoll
descriptor
.SH SYNOPSIS
.B #include <sys/epoll.h>
.sp
.BR "int epoll_ctl(int " epfd ", int " op ", int " fd ", struct epoll_event *" event )
.SH DESCRIPTION
Control an
.B epoll
descriptor,
.IR epfd ,
by requesting the operation
.IR op
be performed on the target file descriptor,
.IR fd .
The
.IR event
describe the object linked to the file descriptor
.IR fd .
The
.B struct epoll_event
is defined as :
.sp
.nf
typedef union epoll_data {
void *ptr;
int fd;
__uint32_t u32;
__uint64_t u64;
} epoll_data_t;
struct epoll_event {
__uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
};
.fi
The
.I events
member is a bit set composed using the following available event
types :
.TP
.B EPOLLIN
The associated file is available for
.BR read (2)
opeartions.
.TP
.B EPOLLOUT
The associated file is available for
.BR write (2)
opeartions.
.TP
.B EPOLLPRI
There is urgent data available for
.BR read (2)
opeartions.
.TP
.B EPOLLERR
Error condition happened on the associated file descriptor.
.TP
.B EPOLLHUP
Hang up happened on the associated file descriptor.
.PP
.sp
The
.B epoll
interface supports all file descriptors that support
.BR poll (2).
Valid values for the
.IR op
parameter are :
.SR
.TP
.B EPOLL_CTL_ADD
Add the target file descriptor
.I fd
to the
.B epoll
descriptor
.I epfd
and associate the event
.I event
with the internal file linked to
.IR fd .
.TP
.B EPOLL_CTL_MOD
Change the event
.I event
associated to the target file descriptor
.IR fd .
.TP
.B EPOLL_CTL_DEL
Remove the target file descriptor
.I fd
from the
.B epoll
file descriptor,
.IR epfd .
.SE
.SH "RETURN VALUE"
When successful,
.BR epoll_ctl (2)
returns zero. When an error occurs,
.BR epoll_ctl (2)
returns \-1 and
.I errno
is set appropriately.
.SH ERRORS
.TP
.B EBADF
The
.I epfd
file descriptor is not a valid file descriptor.
.TP
.B EPERM
The target file
.I fd
is not supported by
.BR epoll .
.TP
.B EINVAL
The supplied file descriptor,
.IR epfd ,
is not an
.B epoll
file descriptor, or the requested operation
.I op
is not supported by this interface.
.TP
.B ENOMEM
There was insufficient memory to handle the requested
.I op
controller operation.
.SH CONFORMING TO
.BR epoll_ctl (2)
is a new API introduced in Linux kernel
.BR 2.5.44 .
The interface should be finalized by Linux kernel
.BR 2.5.50 .
.SH "SEE ALSO"
.BR epoll (4)
.BR epoll_create (2)
.BR epoll_wait (2)
[-- Attachment #6: Type: TEXT/PLAIN, Size: 3047 bytes --]
.\"
.\" epoll by Davide Libenzi ( efficient event notification retrieval )
.\" Copyright (C) 2002 Davide Libenzi
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.\"
.TH EPOLL_WAIT 2 "23 October 2002" Linux "Linux Programmer's Manual"
.SH NAME
epoll_wait \- wait for an I/O event on an
.B epoll
file descriptor
.SH SYNOPSIS
.B #include <sys/epoll.h>
.sp
.BR "int epoll_wait(int " epfd ", struct epoll_event * " events ", int " maxevents ", int " timeout)
.SH DESCRIPTION
Wait for events on the
.B epoll
file descriptor
.I epfd
for a maximum time of
.I timeout
milliseconds. The memory area pointed by
.I events
will contain the events that will be available for the caller.
Up to
.I maxevents
are returned by
.BR epoll_wait (2).
The
.I maxevents
parameter must be greater than zero. Specifying a
.I timeout
of \-1 makes
.BR epoll_wait (2)
wait indefinitely. The
.B struct epoll_event
is defined as :
.sp
.nf
typedef union epoll_data {
void *ptr;
int fd;
__uint32_t u32;
__uint64_t u64;
} epoll_data_t;
struct epoll_event {
__uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
};
.fi
The
.I data
of each returned structure will contain the same data the user set with a
.BR epoll_ctl (2)
.IR ( EPOLL_CTL_ADD , EPOLL_CTL_MOD )
while the
.I events
member will contain the returned event bit field.
.SH "RETURN VALUE"
When successful,
.BR epoll_wait (2)
returns the number of file descriptors ready for the requested I/O, or zero
if no file descriptor became ready during the requested
.I timeout
milliseconds. When an error occurs,
.BR epoll_wait (2)
returns \-1 and
.I errno
is set appropriately.
.SH ERRORS
.TP
.B EBADF
.I epfd
is not a valid file descriptor.
.TP
.B EINVAL
The supplied file descriptor,
.IR epfd ,
is not an
.B epoll
file descriptor, or the
.I maxevents
parameter is less than or equal to zero.
.TP
.B EFAULT
The memory area pointed to by
.I events
is not accesible with write permissions.
.SH CONFORMING TO
.BR epoll_wait (2)
is a new API introduced in Linux kernel
.BR 2.5.44 . The interface should be finalized by Linux kernel
.BR 2.5.50 .
.SH "SEE ALSO"
.BR epoll (4)
.BR epoll_create (2)
.BR epoll_ctl (2)
^ permalink raw reply [flat|nested] 4+ messages in thread
* re: epoll patches queue status and glibc submission ...
@ 2002-11-27 20:49 Dan Kegel
2002-11-27 20:56 ` Davide Libenzi
0 siblings, 1 reply; 4+ messages in thread
From: Dan Kegel @ 2002-11-27 20:49 UTC (permalink / raw)
To: Davide Libenzi, linux-kernel
Hi Davide,
I just realized that 'edge triggered I/O readiness notification facility'
is too jargony. A plain English wording of the man page title might be
epoll \- I/O readiness change notification facility
This is shorter and much clearer. Apologies for having suggested
the earlier jargony text.
- Dan
^ permalink raw reply [flat|nested] 4+ messages in thread
* re: epoll patches queue status and glibc submission ...
2002-11-27 20:49 epoll patches queue status and glibc submission Dan Kegel
@ 2002-11-27 20:56 ` Davide Libenzi
0 siblings, 0 replies; 4+ messages in thread
From: Davide Libenzi @ 2002-11-27 20:56 UTC (permalink / raw)
To: Dan Kegel; +Cc: Linux Kernel Mailing List
On Wed, 27 Nov 2002, Dan Kegel wrote:
> Hi Davide,
> I just realized that 'edge triggered I/O readiness notification facility'
> is too jargony. A plain English wording of the man page title might be
>
> epoll \- I/O readiness change notification facility
>
> This is shorter and much clearer. Apologies for having suggested
> the earlier jargony text.
Ok, done. Any other crappyness at your sight ?
- Davide
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: epoll patches queue status and glibc submission ...
2002-11-27 20:09 Davide Libenzi
@ 2002-11-27 21:25 ` Chris Friesen
0 siblings, 0 replies; 4+ messages in thread
From: Chris Friesen @ 2002-11-27 21:25 UTC (permalink / raw)
To: Linux Kernel Mailing List
Davide Libenzi wrote:
> This is the latest status of the epoll system include file and man pages
> that I'll submit to Ulrich after having received comments from lkml. If
> you have some feedback to report pls speak now or forever hold your piece :)
I'm canadian. I don't carry a piece....
Chris
--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2002-11-27 21:18 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-11-27 20:49 epoll patches queue status and glibc submission Dan Kegel
2002-11-27 20:56 ` Davide Libenzi
-- strict thread matches above, loose matches on Subject: below --
2002-11-27 20:09 Davide Libenzi
2002-11-27 21:25 ` Chris Friesen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox