* {sys_,/dev/}epoll waiting timeout
@ 2003-01-22 6:55 Lennert Buytenhek
2003-01-22 8:03 ` Jamie Lokier
0 siblings, 1 reply; 24+ messages in thread
From: Lennert Buytenhek @ 2003-01-22 6:55 UTC (permalink / raw)
To: Davide Libenzi; +Cc: linux-kernel
Hi!
Both /dev/epoll EP_POLL and sys_epoll_wait, when converting the passed
timeout value in msec to jiffies, round down instead of up. This
occasionally causes these functions to return without any active fd's
before the given timeout period has passed.
This can cause fun situations like these:
epoll_wait(epfd, events, maxevents, timeout_until_next_timer_expiry)
[ returns too early ]
gettimeofday(&now, NULL)
[ notice that the first timer has not yet expired, do nothing ]
epoll_wait(epfd, events, maxevents, random_small_value_less_than_jiffy)
[ returns immediately ]
gettimeofday(&now, NULL)
[ notice that first timer still didn't expire yet, do nothing ]
etc.
Effectively causing busy-wait loops of on average half a jiffy.
nanosleep(2) always rounds timeout values up (I think it is required to do
so by some specification which says that this call should sleep _at_least_
the given amount of time), and this approach to me makes sense for
{sys_,/dev/}epoll also. See <linux/time.h>:timespec_to_jiffies and
kernel/time.c:sys_nanosleep.
Will you accept a patch to do this?
cheers,
Lennert
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: {sys_,/dev/}epoll waiting timeout 2003-01-22 6:55 {sys_,/dev/}epoll waiting timeout Lennert Buytenhek @ 2003-01-22 8:03 ` Jamie Lokier 2003-01-22 12:46 ` Ed Tomlinson ` (2 more replies) 0 siblings, 3 replies; 24+ messages in thread From: Jamie Lokier @ 2003-01-22 8:03 UTC (permalink / raw) To: Lennert Buytenhek; +Cc: Davide Libenzi, linux-kernel Lennert Buytenhek wrote: > Both /dev/epoll EP_POLL and sys_epoll_wait, when converting the passed > timeout value in msec to jiffies, round down instead of up. This > occasionally causes these functions to return without any active fd's > before the given timeout period has passed. So you would like it to round up the timeout like poll(), select(), and io_getevents() do? Fair enough, for the sake of consistency! sys_poll() says: if (timeout) { /* Careful about overflow in the intermediate values */ if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ) timeout = (unsigned long)(timeout*HZ+999)/1000+1; else /* Negative or overflow */ timeout = MAX_SCHEDULE_TIMEOUT; } sys_io_getevents() does something more complicated in a function called set_timeout(), but it essentially comes to the same thing. It takes a value in nanseconds (which I prefer, btw, for future usefulness). There is also a nasty overflow in ep_poll(). If HZ == 1000 and you ask to wait for > approx. 2000 seconds (not unreasonable, say if epoll is running an ftp server and the client timeout is set to 1 hour), then ep_poll gets the calculation wrong. I suggest that this line in eventpoll.c: jtimeout = timeout == -1 ? MAX_SCHEDULE_TIMEOUT: (timeout * HZ) / 1000; be changed to this: jtimeout = 0; if (timeout) { /* Careful about overflow in the intermediate values */ if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ) jtimeout = (unsigned long)(timeout*HZ+999)/1000+1; else /* Negative or overflow */ jtimeout = MAX_SCHEDULE_TIMEOUT; } And that the prototypes for ep_poll() and sys_epoll_wait() be changed to take a "long timeout" instead of an "int", just like sys_poll(). > Effectively causing busy-wait loops of on average half a jiffy. Sometimes busy waiting is the right thing to do. With HZ == 100, anything involving video animation needs to busy wait after select() or equivalent, otherwise there is too much display jitter. But all that sort of code needs to know how much select() et al. tend to overrun by anyway, so they can just do the same if using epoll. -- Jamie ps. sys_* system-call functions should never return "int". They should always return "long" or a pointer - even if the user-space equivalent returns "int". Take a look at sys_open() for an example. Technical requirement of the system call return path on 64-bit targets. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-22 8:03 ` Jamie Lokier @ 2003-01-22 12:46 ` Ed Tomlinson 2003-01-22 13:20 ` Jamie Lokier 2003-01-23 14:07 ` Davide Libenzi 2003-01-28 19:42 ` {sys_,/dev/}epoll waiting timeout Randy.Dunlap 2 siblings, 1 reply; 24+ messages in thread From: Ed Tomlinson @ 2003-01-22 12:46 UTC (permalink / raw) To: Jamie Lokier, linux-kernel Jamie Lokier wrote: > jtimeout = 0; > if (timeout) { > /* Careful about overflow in the intermediate values */ > if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ) > jtimeout = (unsigned long)(timeout*HZ+999)/1000+1; > else /* Negative or overflow */ > jtimeout = MAX_SCHEDULE_TIMEOUT; > } Why assume HZ=1000? Would not: timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1; make more sense? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-22 12:46 ` Ed Tomlinson @ 2003-01-22 13:20 ` Jamie Lokier 2003-01-22 19:14 ` Randy.Dunlap 0 siblings, 1 reply; 24+ messages in thread From: Jamie Lokier @ 2003-01-22 13:20 UTC (permalink / raw) To: Ed Tomlinson; +Cc: linux-kernel Ed Tomlinson wrote: > Jamie Lokier wrote: > > > jtimeout = 0; > > if (timeout) { > > /* Careful about overflow in the intermediate values */ > > if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ) > > jtimeout = (unsigned long)(timeout*HZ+999)/1000+1; > > else /* Negative or overflow */ > > jtimeout = MAX_SCHEDULE_TIMEOUT; > > } > > Why assume HZ=1000? Would not: > > timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1; > > make more sense? No, that's silly. Why do you want to multiply by HZ and then divide by HZ? -- Jamie ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-22 13:20 ` Jamie Lokier @ 2003-01-22 19:14 ` Randy.Dunlap 2003-01-22 19:34 ` Jamie Lokier 0 siblings, 1 reply; 24+ messages in thread From: Randy.Dunlap @ 2003-01-22 19:14 UTC (permalink / raw) To: Jamie Lokier; +Cc: Ed Tomlinson, linux-kernel On Wed, 22 Jan 2003, Jamie Lokier wrote: | Ed Tomlinson wrote: | > Jamie Lokier wrote: | > | > > jtimeout = 0; | > > if (timeout) { | > > /* Careful about overflow in the intermediate values */ | > > if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ) | > > jtimeout = (unsigned long)(timeout*HZ+999)/1000+1; | > > else /* Negative or overflow */ | > > jtimeout = MAX_SCHEDULE_TIMEOUT; | > > } | > | > Why assume HZ=1000? Would not: | > | > timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1; | > | > make more sense? | | No, that's silly. Why do you want to multiply by HZ and then divide by HZ? OK, I don't get it. All Ed did was replace 1000 with HZ and 999 with (HZ-1). What's bad about that? Seems to me like the right thing to do. Much more portable. What if HZ changes? Who's going to audit the kernel for changes? -- ~Randy ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-22 19:14 ` Randy.Dunlap @ 2003-01-22 19:34 ` Jamie Lokier 2003-01-22 19:32 ` Randy.Dunlap 0 siblings, 1 reply; 24+ messages in thread From: Jamie Lokier @ 2003-01-22 19:34 UTC (permalink / raw) To: Randy.Dunlap; +Cc: Ed Tomlinson, linux-kernel Randy.Dunlap wrote: > | > Why assume HZ=1000? Would not: > | > > | > timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1; > | > > | > make more sense? > | > | No, that's silly. Why do you want to multiply by HZ and then divide by HZ? > > OK, I don't get it. All Ed did was replace 1000 with HZ and > 999 with (HZ-1). What's bad about that? Seems to me like > the right thing to do. Much more portable. > > What if HZ changes? Who's going to audit the kernel for changes? You're being dense. The input timeout is measured in milliseconds; see poll(2). The calculated timeout is measured in jiffies. Hence multiply by jiffies and divide by milliseconds. -- Jamie ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-22 19:34 ` Jamie Lokier @ 2003-01-22 19:32 ` Randy.Dunlap 0 siblings, 0 replies; 24+ messages in thread From: Randy.Dunlap @ 2003-01-22 19:32 UTC (permalink / raw) To: Jamie Lokier; +Cc: Ed Tomlinson, linux-kernel On Wed, 22 Jan 2003, Jamie Lokier wrote: | Randy.Dunlap wrote: | > | > Why assume HZ=1000? Would not: | > | > | > | > timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1; | > | > | > | > make more sense? | > | | > | No, that's silly. Why do you want to multiply by HZ and then divide by HZ? | > | > OK, I don't get it. All Ed did was replace 1000 with HZ and | > 999 with (HZ-1). What's bad about that? Seems to me like | > the right thing to do. Much more portable. | > | > What if HZ changes? Who's going to audit the kernel for changes? | | You're being dense. The input timeout is measured in milliseconds; | see poll(2). The calculated timeout is measured in jiffies. Hence | multiply by jiffies and divide by milliseconds. Like I said, I didn't get it. Now I do. Thanks. -- ~Randy ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-22 8:03 ` Jamie Lokier 2003-01-22 12:46 ` Ed Tomlinson @ 2003-01-23 14:07 ` Davide Libenzi 2003-01-23 15:43 ` Jamie Lokier 2003-01-28 19:42 ` {sys_,/dev/}epoll waiting timeout Randy.Dunlap 2 siblings, 1 reply; 24+ messages in thread From: Davide Libenzi @ 2003-01-23 14:07 UTC (permalink / raw) To: Jamie Lokier; +Cc: Lennert Buytenhek, Linux Kernel Mailing List On Wed, 22 Jan 2003, Jamie Lokier wrote: > So you would like it to round up the timeout like poll(), select(), > and io_getevents() do? Fair enough, for the sake of consistency! > > sys_poll() says: > > if (timeout) { > /* Careful about overflow in the intermediate values */ > if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ) > timeout = (unsigned long)(timeout*HZ+999)/1000+1; > else /* Negative or overflow */ > timeout = MAX_SCHEDULE_TIMEOUT; > } > > sys_io_getevents() does something more complicated in a function > called set_timeout(), but it essentially comes to the same thing. It > takes a value in nanseconds (which I prefer, btw, for future usefulness). >From a mathematical point of view this is a ceil(v)+1, so this is wrong. It should be : t = (t * HZ + 999) / 1000; The +999 already gives you the round up. Different is if we want to be sure to sleep at least that amount of jiffies ( the rounded up ), in that case since the timer tick might arrive immediately after we go to sleep by making us to lose immediately a jiffie, we need another +1. Anyway I'll do the round up. Same for the overflow check. > And that the prototypes for ep_poll() and sys_epoll_wait() be changed > to take a "long timeout" instead of an "int", just like sys_poll(). I don't see why. The poll(2) timeout is an int. > ps. sys_* system-call functions should never return "int". They > should always return "long" or a pointer - even if the user-space > equivalent returns "int". Take a look at sys_open() for an example. > Technical requirement of the system call return path on 64-bit targets. True, I'll change it. - Davide ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-23 14:07 ` Davide Libenzi @ 2003-01-23 15:43 ` Jamie Lokier 2003-01-23 17:27 ` Mark Mielke 0 siblings, 1 reply; 24+ messages in thread From: Jamie Lokier @ 2003-01-23 15:43 UTC (permalink / raw) To: Davide Libenzi; +Cc: Lennert Buytenhek, Linux Kernel Mailing List Davide Libenzi wrote: > >From a mathematical point of view this is a ceil(v)+1, so this is wrong. > It should be : > > t = (t * HZ + 999) / 1000; > > The +999 already gives you the round up. Different is if we want to be > sure to sleep at least that amount of jiffies ( the rounded up ), in that > case since the timer tick might arrive immediately after we go to sleep by > making us to lose immediately a jiffie, we need another +1. Anyway I'll do > the round up. Same for the overflow check. I wonder if it's appropriate to copy sys_poll(), which has the +1, or sys_select(), which doesn't! > > And that the prototypes for ep_poll() and sys_epoll_wait() be changed > > to take a "long timeout" instead of an "int", just like sys_poll(). > > I don't see why. The poll(2) timeout is an int. poll(2) takes an int, but sys_poll() takes a long. I think everyone is confused :) The reason I suggested "long timeout" for ep_poll is because the multiply in the expression: jtimeout = (unsigned long)(timeout*HZ+999)/1000; can overflow if you don't. If you stick with the int, you'll need to write: jtimeout = (((unsigned long)timeout)*HZ+999)/1000; -- Jamie ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-23 15:43 ` Jamie Lokier @ 2003-01-23 17:27 ` Mark Mielke 2003-01-23 18:28 ` Jamie Lokier 0 siblings, 1 reply; 24+ messages in thread From: Mark Mielke @ 2003-01-23 17:27 UTC (permalink / raw) To: Jamie Lokier; +Cc: Davide Libenzi, Lennert Buytenhek, Linux Kernel Mailing List On Thu, Jan 23, 2003 at 03:43:04PM +0000, Jamie Lokier wrote: > Davide Libenzi wrote: > > >From a mathematical point of view this is a ceil(v)+1, so this is wrong. > > It should be : > > t = (t * HZ + 999) / 1000; > > The +999 already gives you the round up. Different is if we want to be > > sure to sleep at least that amount of jiffies ( the rounded up ), in that > > case since the timer tick might arrive immediately after we go to sleep by > > making us to lose immediately a jiffie, we need another +1. Anyway I'll do > > the round up. Same for the overflow check. > I wonder if it's appropriate to copy sys_poll(), which has the +1, or > sys_select(), which doesn't! Or, fix sys_poll(). With the +1, this means that sys_poll() would have a 1 in 1001 chance per second of returning one jiffie too early. > > > And that the prototypes for ep_poll() and sys_epoll_wait() be changed > > > to take a "long timeout" instead of an "int", just like sys_poll(). > > I don't see why. The poll(2) timeout is an int. > poll(2) takes an int, but sys_poll() takes a long. > I think everyone is confused :) > The reason I suggested "long timeout" for ep_poll is because the > multiply in the expression: > jtimeout = (unsigned long)(timeout*HZ+999)/1000; > can overflow if you don't. If you stick with the int, you'll need to > write: > jtimeout = (((unsigned long)timeout)*HZ+999)/1000; On a 16 bit platform, perhaps... :-) mark -- mark@mielke.cc/markm@ncf.ca/markm@nortelnetworks.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-23 17:27 ` Mark Mielke @ 2003-01-23 18:28 ` Jamie Lokier 2003-01-23 20:40 ` Mark Mielke 0 siblings, 1 reply; 24+ messages in thread From: Jamie Lokier @ 2003-01-23 18:28 UTC (permalink / raw) To: Mark Mielke; +Cc: Davide Libenzi, Lennert Buytenhek, Linux Kernel Mailing List Mark Mielke wrote: > Or, fix sys_poll(). With the +1, this means that sys_poll() would have > a 1 in 1001 chance per second of returning one jiffie too early. Nope. Read the expression again. > > poll(2) takes an int, but sys_poll() takes a long. > > I think everyone is confused :) > > The reason I suggested "long timeout" for ep_poll is because the > > multiply in the expression: > > jtimeout = (unsigned long)(timeout*HZ+999)/1000; > > can overflow if you don't. If you stick with the int, you'll need to > > write: > > jtimeout = (((unsigned long)timeout)*HZ+999)/1000; > > On a 16 bit platform, perhaps... :-) Nope. It will overflow on a _64_ bit platform, if you give it a value of MAXINT for example. -- Jamie ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-23 18:28 ` Jamie Lokier @ 2003-01-23 20:40 ` Mark Mielke 2003-01-23 22:18 ` Jamie Lokier 0 siblings, 1 reply; 24+ messages in thread From: Mark Mielke @ 2003-01-23 20:40 UTC (permalink / raw) To: Jamie Lokier; +Cc: Davide Libenzi, Lennert Buytenhek, Linux Kernel Mailing List On Thu, Jan 23, 2003 at 06:28:31PM +0000, Jamie Lokier wrote: > Mark Mielke wrote: > > Or, fix sys_poll(). With the +1, this means that sys_poll() would have > > a 1 in 1001 chance per second of returning one jiffie too early. > Nope. Read the expression again. Sorry... not 1 in 1001... almost 100% chance of returning of one jiffie too many. In practice, even on a relatively idle system, the process will not be able to wake up as frequently as it might be able to expect. mark -- mark@mielke.cc/markm@ncf.ca/markm@nortelnetworks.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-23 20:40 ` Mark Mielke @ 2003-01-23 22:18 ` Jamie Lokier 2003-01-24 14:41 ` Andreas Schwab ` (2 more replies) 0 siblings, 3 replies; 24+ messages in thread From: Jamie Lokier @ 2003-01-23 22:18 UTC (permalink / raw) To: Mark Mielke; +Cc: Davide Libenzi, Lennert Buytenhek, Linux Kernel Mailing List Mark Mielke wrote: > Sorry... not 1 in 1001... almost 100% chance of returning of one > jiffie too many. It's curious that select() is different from poll() - almost is if completely different people wrote the code :) We have to wonder whether it was a design decision. Perhaps the unix specifications require it (see below). > In practice, even on a relatively idle system, the process will not > be able to wake up as frequently as it might be able to expect. You're right - it's unfortunate that using poll() lets you sleep and wake up no faster than every _two_ ticks. That's actually caused by poll()'s treating zero differently though, not by +1 as such. There's a strange contradiction between rounding up the waiting time to the next number of jiffies, and then rounding it down (in a time-dependent way) by waiting until the next N'th timer interrupt. If, as someone said, the appropriate unix specification says that "wait for 10ms" means to wait for _at minimum_ 10ms, then you do need the +1. (Davide), IMHO epoll should decide whether it means "at minimum" (in which case the +1 is a requirement), or it means "at maximum" (in which case rounding up is wrong). The current method of rounding up and then effectively down means that you get an unpredictable mixture of both. -- Jamie ps. I would always prefer an absolute wakeup time anyway - it avoids a race condition too. What a shame none of the system calls work that way. pps. To summarise, all the time APIs are a complete mess in unix, and there's nothing you can do in user space to make up for the b0rken system call interface. Except not duplicate past errors in new interfaces :) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-23 22:18 ` Jamie Lokier @ 2003-01-24 14:41 ` Andreas Schwab 2003-01-25 1:08 ` Davide Libenzi 2003-01-27 21:27 ` bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) Bill Rugolsky Jr. 2 siblings, 0 replies; 24+ messages in thread From: Andreas Schwab @ 2003-01-24 14:41 UTC (permalink / raw) To: Jamie Lokier Cc: Mark Mielke, Davide Libenzi, Lennert Buytenhek, Linux Kernel Mailing List Jamie Lokier <jamie@shareable.org> writes: |> (Davide), IMHO epoll should decide whether it means "at minimum" (in |> which case the +1 is a requirement), or it means "at maximum" (in |> which case rounding up is wrong). Since Linux isn't a RTOS you can't meet "at maximum". POSIX says "poll () shall wait at least timeout milliseconds" Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-23 22:18 ` Jamie Lokier 2003-01-24 14:41 ` Andreas Schwab @ 2003-01-25 1:08 ` Davide Libenzi 2003-01-27 21:27 ` bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) Bill Rugolsky Jr. 2 siblings, 0 replies; 24+ messages in thread From: Davide Libenzi @ 2003-01-25 1:08 UTC (permalink / raw) To: Jamie Lokier; +Cc: Mark Mielke, Lennert Buytenhek, Linux Kernel Mailing List On Thu, 23 Jan 2003, Jamie Lokier wrote: > (Davide), IMHO epoll should decide whether it means "at minimum" (in > which case the +1 is a requirement), or it means "at maximum" (in > which case rounding up is wrong). > > The current method of rounding up and then effectively down means that > you get an unpredictable mixture of both. I think I'll go with : Tj = (Tms * HZ + 999) / 1000 Somehow I feel it more correct :) - Davide ^ permalink raw reply [flat|nested] 24+ messages in thread
* bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) 2003-01-23 22:18 ` Jamie Lokier 2003-01-24 14:41 ` Andreas Schwab 2003-01-25 1:08 ` Davide Libenzi @ 2003-01-27 21:27 ` Bill Rugolsky Jr. 2003-01-27 22:52 ` Davide Libenzi 2 siblings, 1 reply; 24+ messages in thread From: Bill Rugolsky Jr. @ 2003-01-27 21:27 UTC (permalink / raw) To: Jamie Lokier Cc: Mark Mielke, Davide Libenzi, Lennert Buytenhek, Linux Kernel Mailing List On Thu, Jan 23, 2003 at 10:18:58PM +0000, Jamie Lokier wrote: > If, as someone said, the appropriate unix specification says that > "wait for 10ms" means to wait for _at minimum_ 10ms, then you do need > the +1. > > (Davide), IMHO epoll should decide whether it means "at minimum" (in > which case the +1 is a requirement), or it means "at maximum" (in > which case rounding up is wrong). > > The current method of rounding up and then effectively down means that > you get an unpredictable mixture of both. Quite independent of this discussion, my boss came across this today while looking at some strace output: gettimeofday({1043689947, 402580}, NULL) = 0 select(4, [0], [], [], {1, 999658}) = 0 (Timeout) gettimeofday({1043689949, 401857}, NULL) = 0 gettimeofday({1043689949, 401939}, NULL) = 0 select(4, [0], [], [], {0, 299}) = 0 (Timeout) gettimeofday({1043689949, 403577}, NULL) = 0 Note that 1043689949.401857 - 1043689947.402580 = 1.999277. The Single Unix Specification (v2 and v3), says of select(): Implementations may also place limitations on the granularity of timeout intervals. If the requested timeout interval requires a finer granularity than the implementation supports, the actual timeout interval shall be rounded up to the next supported value. That seems to indicate that a fix is required. Regards, Bill Rugolsky ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) 2003-01-27 21:27 ` bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) Bill Rugolsky Jr. @ 2003-01-27 22:52 ` Davide Libenzi 2003-01-28 9:45 ` Jamie Lokier 0 siblings, 1 reply; 24+ messages in thread From: Davide Libenzi @ 2003-01-27 22:52 UTC (permalink / raw) To: Bill Rugolsky Jr. Cc: Jamie Lokier, Mark Mielke, Lennert Buytenhek, Linux Kernel Mailing List On Mon, 27 Jan 2003, Bill Rugolsky Jr. wrote: > Quite independent of this discussion, my boss came across this today > while looking at some strace output: > > gettimeofday({1043689947, 402580}, NULL) = 0 > select(4, [0], [], [], {1, 999658}) = 0 (Timeout) > gettimeofday({1043689949, 401857}, NULL) = 0 > gettimeofday({1043689949, 401939}, NULL) = 0 > select(4, [0], [], [], {0, 299}) = 0 (Timeout) > gettimeofday({1043689949, 403577}, NULL) = 0 > > Note that 1043689949.401857 - 1043689947.402580 = 1.999277. > > The Single Unix Specification (v2 and v3), says of select(): > > Implementations may also place limitations on the granularity of > timeout intervals. If the requested timeout interval requires a finer > granularity than the implementation supports, the actual timeout > interval shall be rounded up to the next supported value. > > That seems to indicate that a fix is required. The problem is that the schedule_timeout() is not precise. So if you pass N to such function, your sleep interval can be ]N-1, N+1[ This w/out considering other latencies but looking only at how timers are updated ( you can call schedule_timeout() immediately before a timer tick or immediately after ). So if we want to be sure to sleep at least the rounded up number of jiffies, we might end up sleeping one/two jiffies more ( and this w/out accounting other latencies ). And this will lead to the formula ( used by poll() ) : Tj = (Tms * HZ + 999) / 1000 + 1 ( if Tms > 0 ) - Davide ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) 2003-01-27 22:52 ` Davide Libenzi @ 2003-01-28 9:45 ` Jamie Lokier 2003-01-28 10:52 ` Mark Mielke 2003-01-28 22:15 ` Davide Libenzi 0 siblings, 2 replies; 24+ messages in thread From: Jamie Lokier @ 2003-01-28 9:45 UTC (permalink / raw) To: Davide Libenzi Cc: Bill Rugolsky Jr., Mark Mielke, Lennert Buytenhek, Linux Kernel Mailing List Davide Libenzi wrote: > ( if Tms > 0 ) Which is unfortunate, because that doesn't allow for a value of Tms == 0 which is needed when you want to sleep and wake up on every jiffie on systems where HZ >= 1000. Tms == 0 is taken already, to mean do not wait at all. -- Jamie ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) 2003-01-28 9:45 ` Jamie Lokier @ 2003-01-28 10:52 ` Mark Mielke 2003-01-28 21:39 ` Jamie Lokier 2003-01-28 22:15 ` Davide Libenzi 1 sibling, 1 reply; 24+ messages in thread From: Mark Mielke @ 2003-01-28 10:52 UTC (permalink / raw) To: Jamie Lokier Cc: Davide Libenzi, Bill Rugolsky Jr., Lennert Buytenhek, Linux Kernel Mailing List On Tue, Jan 28, 2003 at 09:45:00AM +0000, Jamie Lokier wrote: > Davide Libenzi wrote: > > ( if Tms > 0 ) > Which is unfortunate, because that doesn't allow for a value of Tms == > 0 which is needed when you want to sleep and wake up on every jiffie > on systems where HZ >= 1000. Tms == 0 is taken already, to mean do > not wait at all. To some degree, isn't this the equivalent of yield()? mark -- mark@mielke.cc/markm@ncf.ca/markm@nortelnetworks.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) 2003-01-28 10:52 ` Mark Mielke @ 2003-01-28 21:39 ` Jamie Lokier 0 siblings, 0 replies; 24+ messages in thread From: Jamie Lokier @ 2003-01-28 21:39 UTC (permalink / raw) To: Mark Mielke Cc: Davide Libenzi, Bill Rugolsky Jr., Lennert Buytenhek, Linux Kernel Mailing List Mark Mielke wrote: > On Tue, Jan 28, 2003 at 09:45:00AM +0000, Jamie Lokier wrote: > > Davide Libenzi wrote: > > > ( if Tms > 0 ) > > Which is unfortunate, because that doesn't allow for a value of Tms == > > 0 which is needed when you want to sleep and wake up on every jiffie > > on systems where HZ >= 1000. Tms == 0 is taken already, to mean do > > not wait at all. > > To some degree, isn't this the equivalent of yield()? No. If you want a process to wake every HZ tick, do a little work and then sleep again, yield() won't do that. If HZ >= 1000, you simply can't use Linux poll() to do that; you have to use select(). (Or epoll_wait()). Even if select() is changed to do double-rounding-up like poll(), it will still do this because select() times have microsecond granularity. -- Jamie ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) 2003-01-28 9:45 ` Jamie Lokier 2003-01-28 10:52 ` Mark Mielke @ 2003-01-28 22:15 ` Davide Libenzi 1 sibling, 0 replies; 24+ messages in thread From: Davide Libenzi @ 2003-01-28 22:15 UTC (permalink / raw) To: Jamie Lokier Cc: Bill Rugolsky Jr., Mark Mielke, Lennert Buytenhek, Linux Kernel Mailing List On Tue, 28 Jan 2003, Jamie Lokier wrote: > Davide Libenzi wrote: > > ( if Tms > 0 ) > > Which is unfortunate, because that doesn't allow for a value of Tms == > 0 which is needed when you want to sleep and wake up on every jiffie > on systems where HZ >= 1000. Tms == 0 is taken already, to mean do > not wait at all. Waking up every jiffie does not make a lot of sense in most applications since they probably prefer to deal with seconds and its derivates, to have a predictable behavior on different systems. Functions like poll/select/epoll are simply not the right solution if you want to cut the microsecond on sleep times. - Davide ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-22 8:03 ` Jamie Lokier 2003-01-22 12:46 ` Ed Tomlinson 2003-01-23 14:07 ` Davide Libenzi @ 2003-01-28 19:42 ` Randy.Dunlap 2003-01-28 21:36 ` Jamie Lokier 2 siblings, 1 reply; 24+ messages in thread From: Randy.Dunlap @ 2003-01-28 19:42 UTC (permalink / raw) To: Jamie Lokier; +Cc: Lennert Buytenhek, Davide Libenzi, linux-kernel On Wed, 22 Jan 2003, Jamie Lokier wrote: | ps. sys_* system-call functions should never return "int". They | should always return "long" or a pointer - even if the user-space | equivalent returns "int". Take a look at sys_open() for an example. | Technical requirement of the system call return path on 64-bit targets. Is this a blanket truism? For all architectures? Should current (older/all) syscalls be modified, or should only new ones (like epoll) be corrected? Thanks, -- ~Randy ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-28 19:42 ` {sys_,/dev/}epoll waiting timeout Randy.Dunlap @ 2003-01-28 21:36 ` Jamie Lokier 2003-01-28 21:44 ` David Mosberger 0 siblings, 1 reply; 24+ messages in thread From: Jamie Lokier @ 2003-01-28 21:36 UTC (permalink / raw) To: Randy.Dunlap; +Cc: Lennert Buytenhek, Davide Libenzi, linux-kernel Randy.Dunlap wrote: > On Wed, 22 Jan 2003, Jamie Lokier wrote: > > | ps. sys_* system-call functions should never return "int". They > | should always return "long" or a pointer - even if the user-space > | equivalent returns "int". Take a look at sys_open() for an example. > | Technical requirement of the system call return path on 64-bit targets. > > Is this a blanket truism? For all architectures? I believe so, for all architecture-independent syscall functions. (And architecture-dependent on those that need it -- see end of this message). Linus mentioned it in passing in the recent x86 vsyscall threads. I have never seen it mentioned before. I always assumed that returned longs were due to sloppy programming :) If you look at the syscall return paths on the 64-bit architectures, some of them always check the 64-bit return value register to see if it is negative. If so, they set an error flag, which is what userspace uses to decide whether to return -1, rather than checking if the return value is >= -4096 (or >= -125, architecture implementations vary). This convoluted ABI allows those architectures to return full 64-bit values as legitimate values, which is needed for... ptrace(), only on those architecture of course. The question is therefore are "int" values returned from C functions sign-extended to 64 bits? I don't know, maybe it varies between architectures -- for all I know it happens to work on all the supported 64-bit architectures -- but I believe this is the reason why "long" (or equivalent, e.g. "ssize_t") and pointer types are _supposed_ to be the only permitted return types from syscall functions. > Should current (older/all) syscalls be modified, or should only new ones > (like epoll) be corrected? All of them. Here's a partial list of functions which return int from 2.5.49, based on parsing Alpha, ARM, CRIS, IA64 and x86_64 trees: sys_set_tid_address sys_futex sys_sched_setaffinity sys_sched_getaffinity sys_remap_file_pages sys_epoll_create sys_epoll_ctl sys_epoll_wait sys_lookup_dcookie sys_pause As far as I can guess from reading the GCC sources, 32-bit return values aren't sign extended on x86_64 and IA64, but they are on all the other Linux-supported 64-bit architectures (I could be mistaken about this). Of x86_64 and IA64, only the IA64 syscall return path tests if the return value register is negative. Which suggests that all the architectures are fine with all these "int" returns, except IA64. Curiously, IA64's own sys_perfmonctl() returns int. -- Jamie ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: {sys_,/dev/}epoll waiting timeout 2003-01-28 21:36 ` Jamie Lokier @ 2003-01-28 21:44 ` David Mosberger 0 siblings, 0 replies; 24+ messages in thread From: David Mosberger @ 2003-01-28 21:44 UTC (permalink / raw) To: Jamie Lokier Cc: Randy.Dunlap, Lennert Buytenhek, Davide Libenzi, linux-kernel >>>>> On Tue, 28 Jan 2003 21:36:21 +0000, Jamie Lokier <jamie@shareable.org> said: Jamie> Curiously, IA64's own sys_perfmonctl() returns int. Definitely a bug. Thanks for catching this. I'll let the maintainer know. --david ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2003-01-28 22:00 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-01-22 6:55 {sys_,/dev/}epoll waiting timeout Lennert Buytenhek
2003-01-22 8:03 ` Jamie Lokier
2003-01-22 12:46 ` Ed Tomlinson
2003-01-22 13:20 ` Jamie Lokier
2003-01-22 19:14 ` Randy.Dunlap
2003-01-22 19:34 ` Jamie Lokier
2003-01-22 19:32 ` Randy.Dunlap
2003-01-23 14:07 ` Davide Libenzi
2003-01-23 15:43 ` Jamie Lokier
2003-01-23 17:27 ` Mark Mielke
2003-01-23 18:28 ` Jamie Lokier
2003-01-23 20:40 ` Mark Mielke
2003-01-23 22:18 ` Jamie Lokier
2003-01-24 14:41 ` Andreas Schwab
2003-01-25 1:08 ` Davide Libenzi
2003-01-27 21:27 ` bug in select() (was Re: {sys_,/dev/}epoll waiting timeout) Bill Rugolsky Jr.
2003-01-27 22:52 ` Davide Libenzi
2003-01-28 9:45 ` Jamie Lokier
2003-01-28 10:52 ` Mark Mielke
2003-01-28 21:39 ` Jamie Lokier
2003-01-28 22:15 ` Davide Libenzi
2003-01-28 19:42 ` {sys_,/dev/}epoll waiting timeout Randy.Dunlap
2003-01-28 21:36 ` Jamie Lokier
2003-01-28 21:44 ` David Mosberger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox