[patch] sys_epoll_wait() timeout saga ...

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch] sys_epoll_wait() timeout saga ...
@ 2005-09-23 18:13 Davide Libenzi
  2005-09-23 18:24 ` Nish Aravamudan
  2005-09-24  4:05 ` Willy Tarreau
  0 siblings, 2 replies; 33+ messages in thread
From: Davide Libenzi @ 2005-09-23 18:13 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Andrew Morton

[-- Attachment #1: Type: TEXT/PLAIN, Size: 249 bytes --]


The sys_epoll_wait() function was not handling correctly negative timeouts 
(besides -1), and like sys_poll(), was comparing millisec to secs in 
testing the upper timeout limit.


Signed-off-by: Davide Libenzi <davidel@xmailserver.org>


- Davide

[-- Attachment #2: Type: TEXT/PLAIN, Size: 475 bytes --]

--- a/fs/eventpoll.c	2005-09-23 10:56:57.000000000 -0700
+++ b/fs/eventpoll.c	2005-09-23 11:00:06.000000000 -0700
@@ -1507,7 +1507,7 @@
 	 * and the overflow condition. The passed timeout is in milliseconds,
 	 * that why (t * HZ) / 1000.
 	 */
-	jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
+	jtimeout = timeout < 0 || (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ) ?
 		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
 
 retry:

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-23 18:13 [patch] sys_epoll_wait() timeout saga Davide Libenzi
@ 2005-09-23 18:24 ` Nish Aravamudan
  2005-09-24  4:05 ` Willy Tarreau
  1 sibling, 0 replies; 33+ messages in thread
From: Nish Aravamudan @ 2005-09-23 18:24 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List, Andrew Morton

On 9/23/05, Davide Libenzi <davidel@xmailserver.org> wrote:
>
> The sys_epoll_wait() function was not handling correctly negative timeouts
> (besides -1), and like sys_poll(), was comparing millisec to secs in
> testing the upper timeout limit.
>
>
> Signed-off-by: Davide Libenzi <davidel@xmailserver.org>

Looks a lot more correct :)

Probably want to eventually convert the code path to be similar to
sys_poll(), though? Maybe with a helper to do the converting? I think
the epoll code can probable use msecs_to_jiffies() + 1 as well, no? I
will wait for your patch to go in and send a patch for these ideas
later.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-23 18:13 [patch] sys_epoll_wait() timeout saga Davide Libenzi
  2005-09-23 18:24 ` Nish Aravamudan
@ 2005-09-24  4:05 ` Willy Tarreau
  2005-09-24  4:44   ` Nish Aravamudan
  1 sibling, 1 reply; 33+ messages in thread
From: Willy Tarreau @ 2005-09-24  4:05 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List, Andrew Morton

Hi Davide,

On Fri, Sep 23, 2005 at 11:13:30AM -0700, Davide Libenzi wrote:
> 
> The sys_epoll_wait() function was not handling correctly negative 
> timeouts (besides -1), and like sys_poll(), was comparing millisec to 
> secs in testing the upper timeout limit.
> 
> 
> Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
> 
> 
> - Davide

> --- a/fs/eventpoll.c	2005-09-23 10:56:57.000000000 -0700
> +++ b/fs/eventpoll.c	2005-09-23 11:00:06.000000000 -0700
> @@ -1507,7 +1507,7 @@
>  	 * and the overflow condition. The passed timeout is in milliseconds,
>  	 * that why (t * HZ) / 1000.
>  	 */
> -	jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
> +	jtimeout = timeout < 0 || (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ) ?
>  		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;

Here, I'm not certain that gcc will optimize the divide. It would be better
anyway to write this which is equivalent, and a pure integer comparison :

+	jtimeout = timeout < 0 || timeout >= 1000 * MAX_SCHEDULE_TIMEOUT / HZ ?
>  		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;

gcc will also spit a warning if the constant is too big for an int,
depending on MAX_SCHEDULE_TIMEOUT and HZ, while in the previous case,
it would remain silent, and possibly, timeout/1000 would never reach
the limit.

Regards,
Willy


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24  4:05 ` Willy Tarreau
@ 2005-09-24  4:44   ` Nish Aravamudan
  2005-09-24  6:15     ` Willy Tarreau
  0 siblings, 1 reply; 33+ messages in thread
From: Nish Aravamudan @ 2005-09-24  4:44 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Davide Libenzi, Linux Kernel Mailing List, Andrew Morton

On 9/23/05, Willy Tarreau <willy@w.ods.org> wrote:
> Hi Davide,
>
> On Fri, Sep 23, 2005 at 11:13:30AM -0700, Davide Libenzi wrote:
> >
> > The sys_epoll_wait() function was not handling correctly negative
> > timeouts (besides -1), and like sys_poll(), was comparing millisec to
> > secs in testing the upper timeout limit.
> >
> >
> > Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
> >
> >
> > - Davide
>
> > --- a/fs/eventpoll.c  2005-09-23 10:56:57.000000000 -0700
> > +++ b/fs/eventpoll.c  2005-09-23 11:00:06.000000000 -0700
> > @@ -1507,7 +1507,7 @@
> >        * and the overflow condition. The passed timeout is in milliseconds,
> >        * that why (t * HZ) / 1000.
> >        */
> > -     jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
> > +     jtimeout = timeout < 0 || (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ) ?
> >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
>
> Here, I'm not certain that gcc will optimize the divide. It would be better
> anyway to write this which is equivalent, and a pure integer comparison :
>
> +       jtimeout = timeout < 0 || timeout >= 1000 * MAX_SCHEDULE_TIMEOUT / HZ ?
> >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;

Just a question here, maybe it's dumb.

* and / have the same priority in the order of operations, yes? If so,
won't the the 1000 * MAX_SCHEDULE_TIMEOUT overflow
(MAX_SCHEDULE_TIMEOUT is LONG_MAX)? I really think this code just move
to the same thing that sys_poll() does to avoid overlflow (I fixed the
bug Alexey was experiencing, so I think the changes are safe now). In
any case, this code is approaching unreadable with lots of jiffies
<--> human-time units manipulations done in non-standard ways, which
the updated sys_poll() also tries to avoid.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24  4:44   ` Nish Aravamudan
@ 2005-09-24  6:15     ` Willy Tarreau
  2005-09-24  7:33       ` Vadim Lobanov
                         ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-09-24  6:15 UTC (permalink / raw)
  To: Nish Aravamudan
  Cc: Willy Tarreau, Davide Libenzi, Linux Kernel Mailing List,
	Andrew Morton

On Fri, Sep 23, 2005 at 09:44:10PM -0700, Nish Aravamudan wrote:
> > >        * that why (t * HZ) / 1000.
> > >        */
> > > -     jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
> > > +     jtimeout = timeout < 0 || (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ) ?
> > >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> >
> > Here, I'm not certain that gcc will optimize the divide. It would be better
> > anyway to write this which is equivalent, and a pure integer comparison :
> >
> > +       jtimeout = timeout < 0 || timeout >= 1000 * MAX_SCHEDULE_TIMEOUT / HZ ?
> > >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> 
> Just a question here, maybe it's dumb.

Your question is not dumb, this code is not trivial at all !

> * and / have the same priority in the order of operations, yes? If so,
> won't the the 1000 * MAX_SCHEDULE_TIMEOUT overflow
> (MAX_SCHEDULE_TIMEOUT is LONG_MAX)?

Yes it can, and that's why I said that gcc should send a warning when
comparing an int with something too large for an int. But I should have
forced the constant to be evaluated as long long. At the moment, the
constant cannot overflow, but it can reach a value so high that
timeout/1000 will never reach it. Example :
  MAX_SCHEDULE_TIMEOUT=LONG_MAX
  HZ=250
  timeout=LONG_MAX-1
  => timeout/1000 < MAX_SCHEDULE_TIMEOUT/HZ
  but (timeout * HZ + 999) / 1000 will still overflow !

So I finally think that the safest test would be to avoid the timeout
range which can overflow in the computation, using something like this
(but which will limit the timeout to 49 days on HZ=1000 machines) :

+       jtimeout = timeout < 0 || \
+                    timeout >= (1000ULL * MAX_SCHEDULE_TIMEOUT / HZ) || \
+                    timeout >= (LONG_MAX / HZ - 1000) ?
                   MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;

as both are constants, they can be optimized. Otherwise, we can resort to
using a MAX() macro to reduce this to only one test which will catch all
corner cases.

> I really think this code just move
> to the same thing that sys_poll() does to avoid overlflow (I fixed the
> bug Alexey was experiencing, so I think the changes are safe now).

I'm not totally certain that all overflows are avoided, see above. If
you play with timeout values close to LONG_MAX / HZ, you're still not
caught by the test and can overflow in the multiply.

> In any case, this code is approaching unreadable with lots of jiffies
> <--> human-time units manipulations done in non-standard ways, which
> the updated sys_poll() also tries to avoid.

I've not checked sys_poll(), but I agree with you that it's rather
difficult to imagine all corner cases this way.

Regards,
Willy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24  6:15     ` Willy Tarreau
@ 2005-09-24  7:33       ` Vadim Lobanov
  2005-09-24  7:51         ` Willy Tarreau
  2005-09-24 15:10       ` Davide Libenzi
  2005-09-24 17:19       ` Nishanth Aravamudan
  2 siblings, 1 reply; 33+ messages in thread
From: Vadim Lobanov @ 2005-09-24  7:33 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Nish Aravamudan, Davide Libenzi, Linux Kernel Mailing List,
	Andrew Morton

On Sat, 24 Sep 2005, Willy Tarreau wrote:

> On Fri, Sep 23, 2005 at 09:44:10PM -0700, Nish Aravamudan wrote:
> > > >        * that why (t * HZ) / 1000.
> > > >        */
> > > > -     jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
> > > > +     jtimeout = timeout < 0 || (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ) ?
> > > >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> > >
> > > Here, I'm not certain that gcc will optimize the divide. It would be better
> > > anyway to write this which is equivalent, and a pure integer comparison :
> > >
> > > +       jtimeout = timeout < 0 || timeout >= 1000 * MAX_SCHEDULE_TIMEOUT / HZ ?
> > > >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> >
> > Just a question here, maybe it's dumb.
>
> Your question is not dumb, this code is not trivial at all !
>
> > * and / have the same priority in the order of operations, yes? If so,
> > won't the the 1000 * MAX_SCHEDULE_TIMEOUT overflow
> > (MAX_SCHEDULE_TIMEOUT is LONG_MAX)?
>
> Yes it can, and that's why I said that gcc should send a warning when
> comparing an int with something too large for an int. But I should have
> forced the constant to be evaluated as long long. At the moment, the
> constant cannot overflow, but it can reach a value so high that
> timeout/1000 will never reach it. Example :
>   MAX_SCHEDULE_TIMEOUT=LONG_MAX
>   HZ=250
>   timeout=LONG_MAX-1
>   => timeout/1000 < MAX_SCHEDULE_TIMEOUT/HZ
>   but (timeout * HZ + 999) / 1000 will still overflow !
>
> So I finally think that the safest test would be to avoid the timeout
> range which can overflow in the computation, using something like this
> (but which will limit the timeout to 49 days on HZ=1000 machines) :
>
> +       jtimeout = timeout < 0 || \
> +                    timeout >= (1000ULL * MAX_SCHEDULE_TIMEOUT / HZ) || \
> +                    timeout >= (LONG_MAX / HZ - 1000) ?
>                    MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;

It seems that we can make the second overflow test be less strict by
doing the following instead:
    timeout >= (LONG_MAX - 1000) / HZ
Unless I'm confused. :-)

> as both are constants, they can be optimized. Otherwise, we can resort to
> using a MAX() macro to reduce this to only one test which will catch all
> corner cases.
>
> > I really think this code just move
> > to the same thing that sys_poll() does to avoid overlflow (I fixed the
> > bug Alexey was experiencing, so I think the changes are safe now).
>
> I'm not totally certain that all overflows are avoided, see above. If
> you play with timeout values close to LONG_MAX / HZ, you're still not
> caught by the test and can overflow in the multiply.
>
> > In any case, this code is approaching unreadable with lots of jiffies
> > <--> human-time units manipulations done in non-standard ways, which
> > the updated sys_poll() also tries to avoid.
>
> I've not checked sys_poll(), but I agree with you that it's rather
> difficult to imagine all corner cases this way.
>
> Regards,
> Willy
>

-Vadim Lobanov

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24  7:33       ` Vadim Lobanov
@ 2005-09-24  7:51         ` Willy Tarreau
  0 siblings, 0 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-09-24  7:51 UTC (permalink / raw)
  To: Vadim Lobanov
  Cc: Nish Aravamudan, Davide Libenzi, Linux Kernel Mailing List,
	Andrew Morton

On Sat, Sep 24, 2005 at 12:33:05AM -0700, Vadim Lobanov wrote:
> On Sat, 24 Sep 2005, Willy Tarreau wrote:
> 
> > On Fri, Sep 23, 2005 at 09:44:10PM -0700, Nish Aravamudan wrote:
> > > > >        * that why (t * HZ) / 1000.
> > > > >        */
> > > > > -     jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
> > > > > +     jtimeout = timeout < 0 || (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ) ?
> > > > >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> > > >
> > > > Here, I'm not certain that gcc will optimize the divide. It would be better
> > > > anyway to write this which is equivalent, and a pure integer comparison :
> > > >
> > > > +       jtimeout = timeout < 0 || timeout >= 1000 * MAX_SCHEDULE_TIMEOUT / HZ ?
> > > > >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> > >
> > > Just a question here, maybe it's dumb.
> >
> > Your question is not dumb, this code is not trivial at all !
> >
> > > * and / have the same priority in the order of operations, yes? If so,
> > > won't the the 1000 * MAX_SCHEDULE_TIMEOUT overflow
> > > (MAX_SCHEDULE_TIMEOUT is LONG_MAX)?
> >
> > Yes it can, and that's why I said that gcc should send a warning when
> > comparing an int with something too large for an int. But I should have
> > forced the constant to be evaluated as long long. At the moment, the
> > constant cannot overflow, but it can reach a value so high that
> > timeout/1000 will never reach it. Example :
> >   MAX_SCHEDULE_TIMEOUT=LONG_MAX
> >   HZ=250
> >   timeout=LONG_MAX-1
> >   => timeout/1000 < MAX_SCHEDULE_TIMEOUT/HZ
> >   but (timeout * HZ + 999) / 1000 will still overflow !
> >
> > So I finally think that the safest test would be to avoid the timeout
> > range which can overflow in the computation, using something like this
> > (but which will limit the timeout to 49 days on HZ=1000 machines) :
> >
> > +       jtimeout = timeout < 0 || \
> > +                    timeout >= (1000ULL * MAX_SCHEDULE_TIMEOUT / HZ) || \
> > +                    timeout >= (LONG_MAX / HZ - 1000) ?
> >                    MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> 
> It seems that we can make the second overflow test be less strict by
> doing the following instead:
>     timeout >= (LONG_MAX - 1000) / HZ
> Unless I'm confused. :-)

oops, you're right. Then it produces the following patch :


diff -purN linux-2.6.13/fs/eventpoll.c linux-2.6.13-epoll/fs/eventpoll.c
--- linux-2.6.13/fs/eventpoll.c	Sun Sep 11 08:25:26 2005
+++ linux-2.6.13-epoll/fs/eventpoll.c	Sat Sep 24 09:49:43 2005
@@ -1504,9 +1504,12 @@ static int ep_poll(struct eventpoll *ep,
 	/*
 	 * Calculate the timeout by checking for the "infinite" value ( -1 )
 	 * and the overflow condition. The passed timeout is in milliseconds,
-	 * that why (t * HZ) / 1000.
+	 * that why (t * HZ) / 1000. Note that we also want to avoid an
+	 * overflow in the multiply.
 	 */
-	jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
+	jtimeout = timeout < 0 ||
+		timeout > (MAX_SCHEDULE_TIMEOUT * 1000ULL / HZ) ||
+		timeout > (LONG_MAX - 1000) / HZ ?
 		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
 
 retry:


Interestingly, as long as MAC_SCHEDULE_TIMEOUT == LONG_MAX, the check is
identical to the initial one (and does not add any divide) !

Regards,
Willy


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24  6:15     ` Willy Tarreau
  2005-09-24  7:33       ` Vadim Lobanov
@ 2005-09-24 15:10       ` Davide Libenzi
  2005-09-24 17:20         ` Willy Tarreau
  2005-09-24 17:19       ` Nishanth Aravamudan
  2 siblings, 1 reply; 33+ messages in thread
From: Davide Libenzi @ 2005-09-24 15:10 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Nish Aravamudan, Linux Kernel Mailing List, Andrew Morton

On Sat, 24 Sep 2005, Willy Tarreau wrote:

> Yes it can, and that's why I said that gcc should send a warning when
> comparing an int with something too large for an int. But I should have
> forced the constant to be evaluated as long long. At the moment, the
> constant cannot overflow, but it can reach a value so high that
> timeout/1000 will never reach it. Example :
>  MAX_SCHEDULE_TIMEOUT=LONG_MAX
>  HZ=250
>  timeout=LONG_MAX-1
>  => timeout/1000 < MAX_SCHEDULE_TIMEOUT/HZ
>  but (timeout * HZ + 999) / 1000 will still overflow !
>
> So I finally think that the safest test would be to avoid the timeout
> range which can overflow in the computation, using something like this
> (but which will limit the timeout to 49 days on HZ=1000 machines) :
>
> +       jtimeout = timeout < 0 || \
> +                    timeout >= (1000ULL * MAX_SCHEDULE_TIMEOUT / HZ) || \
> +                    timeout >= (LONG_MAX / HZ - 1000) ?
>                   MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
>
> as both are constants, they can be optimized. Otherwise, we can resort to
> using a MAX() macro to reduce this to only one test which will catch all
> corner cases.

Using the MIN() macro would be better so we have a single check, and the 
compiler optimize that automatically. Or we can force 'timeout * HZ' to 
use ULL math. I don't think it makes a lot of difference for something 
that is in a likely sleep path ;)


- Davide



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24 15:10       ` Davide Libenzi
@ 2005-09-24 17:20         ` Willy Tarreau
  2005-09-24 18:19           ` Davide Libenzi
  0 siblings, 1 reply; 33+ messages in thread
From: Willy Tarreau @ 2005-09-24 17:20 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Willy Tarreau, Nish Aravamudan, Linux Kernel Mailing List,
	Andrew Morton

On Sat, Sep 24, 2005 at 08:10:32AM -0700, Davide Libenzi wrote:
> >+       jtimeout = timeout < 0 || \
> >+                    timeout >= (1000ULL * MAX_SCHEDULE_TIMEOUT / HZ) || 
> >\
> >+                    timeout >= (LONG_MAX / HZ - 1000) ?
> >                  MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> >
> >as both are constants, they can be optimized. Otherwise, we can resort to
> >using a MAX() macro to reduce this to only one test which will catch all
> >corner cases.
> 
> Using the MIN() macro would be better so we have a single check, and the 
> compiler optimize that automatically.

you're right, it's MIN() not MAX() ;-)
Anyway, I've checked the code and the compiler does a single test with -O2.

> Or we can force 'timeout * HZ' to use ULL math. I don't think it makes a lot of difference for something that is in a likely sleep path ;)

"likely", yes, but not necessarily. Under a high load, you can have enough
events queued so that epoll() will not wait at all. I've already encountered
such cases during benchmarks, and I noticed that epoll() took more time than
select() for small numbers of FDs (something like 20% below 100 FDs), but of
course, it is considerably faster above. So turning the multiply to an ULL
may increase this overhead on some architectures, while the double check
will leave the code identical.

Regards,
Willy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24 17:20         ` Willy Tarreau
@ 2005-09-24 18:19           ` Davide Libenzi
  2005-09-25  6:05             ` Andrew Morton
  0 siblings, 1 reply; 33+ messages in thread
From: Davide Libenzi @ 2005-09-24 18:19 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Nish Aravamudan, Linux Kernel Mailing List, Andrew Morton

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1851 bytes --]

On Sat, 24 Sep 2005, Willy Tarreau wrote:

> On Sat, Sep 24, 2005 at 08:10:32AM -0700, Davide Libenzi wrote:
>>> +       jtimeout = timeout < 0 || \
>>> +                    timeout >= (1000ULL * MAX_SCHEDULE_TIMEOUT / HZ) ||
>>> \
>>> +                    timeout >= (LONG_MAX / HZ - 1000) ?
>>>                  MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
>>>
>>> as both are constants, they can be optimized. Otherwise, we can resort to
>>> using a MAX() macro to reduce this to only one test which will catch all
>>> corner cases.
>>
>> Using the MIN() macro would be better so we have a single check, and the
>> compiler optimize that automatically.
>
> you're right, it's MIN() not MAX() ;-)
> Anyway, I've checked the code and the compiler does a single test with -O2.
>
>> Or we can force 'timeout * HZ' to use ULL math. I don't think it makes a lot of difference for something that is in a likely sleep path ;)
>
> "likely", yes, but not necessarily. Under a high load, you can have enough
> events queued so that epoll() will not wait at all. I've already encountered
> such cases during benchmarks, and I noticed that epoll() took more time than
> select() for small numbers of FDs (something like 20% below 100 FDs), but of
> course, it is considerably faster above. So turning the multiply to an ULL
> may increase this overhead on some architectures, while the double check
> will leave the code identical.

The attached patch uses the kernel min() macro, that is optimized has 
single compare by gcc-O2. Andrew, this goes over (hopefully ;) the bits 
you already have in -mm.

PS: It might be possible to move the currently epoll-local EP_MAX_MSTIMEO 
macro, to a MAX_LONG_MSTIMEO (or whatever name you like) somewhere in the 
tree, to be used where needed.


Signed-off-by: Davide Libenzi <davidel@xmailserver.org>


- Davide



[-- Attachment #2: Type: TEXT/plain, Size: 799 bytes --]

--- a/fs/eventpoll.c	2005-09-24 11:07:04.000000000 -0700
+++ b/fs/eventpoll.c	2005-09-24 11:11:06.000000000 -0700
@@ -101,6 +101,10 @@
 /* Maximum number of poll wake up nests we are allowing */
 #define EP_MAX_POLLWAKE_NESTS 4
 
+/* Maximum msec timeout value storeable in a long int */
+#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, LONG_MAX / HZ - 1000ULL)
+
+
 struct epoll_filefd {
 	struct file *file;
 	int fd;
@@ -1507,8 +1511,7 @@
 	 * and the overflow condition. The passed timeout is in milliseconds,
 	 * that why (t * HZ) / 1000.
 	 */
-	jtimeout = (timeout < 0 ||
-		    (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ)) ?
+	jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
 		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
 
 retry:

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24 18:19           ` Davide Libenzi
@ 2005-09-25  6:05             ` Andrew Morton
  2005-09-25  6:20               ` Willy Tarreau
  2005-09-25  7:08               ` Vadim Lobanov
  0 siblings, 2 replies; 33+ messages in thread
From: Andrew Morton @ 2005-09-25  6:05 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: willy, nish.aravamudan, linux-kernel

Davide Libenzi <davidel@xmailserver.org> wrote:
>
> The attached patch uses the kernel min() macro, that is optimized has 
>  single compare by gcc-O2. Andrew, this goes over (hopefully ;) the bits 
>  you already have in -mm.

OK, well I've rather lost the plot with all the patches flying around.

I now have one single patch against epoll.c, below.  Please confirm that
this is what was intended.   If not, I'll drop it and let's start again.





From: Davide Libenzi <davidel@xmailserver.org>

The sys_epoll_wait() function was not handling correctly negative timeouts
(besides -1), and like sys_poll(), was comparing millisec to secs in
testing the upper timeout limit.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 fs/eventpoll.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff -puN fs/eventpoll.c~sys_epoll_wait-fix-handling-of-negative-timeouts fs/eventpoll.c
--- devel/fs/eventpoll.c~sys_epoll_wait-fix-handling-of-negative-timeouts	2005-09-24 23:01:00.000000000 -0700
+++ devel-akpm/fs/eventpoll.c	2005-09-24 23:02:50.000000000 -0700
@@ -101,6 +101,10 @@
 /* Maximum number of poll wake up nests we are allowing */
 #define EP_MAX_POLLWAKE_NESTS 4
 
+/* Maximum msec timeout value storeable in a long int */
+#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, LONG_MAX / HZ - 1000ULL)
+
+
 struct epoll_filefd {
 	struct file *file;
 	int fd;
@@ -1506,8 +1510,8 @@ static int ep_poll(struct eventpoll *ep,
 	 * and the overflow condition. The passed timeout is in milliseconds,
 	 * that why (t * HZ) / 1000.
 	 */
-	jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
-		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
+	jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
+		MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
 
 retry:
 	write_lock_irqsave(&ep->lock, flags);
_


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-25  6:05             ` Andrew Morton
@ 2005-09-25  6:20               ` Willy Tarreau
  2005-09-25  6:32                 ` Andrew Morton
  2005-09-25  7:08               ` Vadim Lobanov
  1 sibling, 1 reply; 33+ messages in thread
From: Willy Tarreau @ 2005-09-25  6:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Davide Libenzi, willy, nish.aravamudan, linux-kernel

Hi Andrew,

On Sat, Sep 24, 2005 at 11:05:45PM -0700, Andrew Morton wrote:
> Davide Libenzi <davidel@xmailserver.org> wrote:
> >
> > The attached patch uses the kernel min() macro, that is optimized has 
> >  single compare by gcc-O2. Andrew, this goes over (hopefully ;) the bits 
> >  you already have in -mm.
> 
> OK, well I've rather lost the plot with all the patches flying around.
> 
> I now have one single patch against epoll.c, below.  Please confirm that
> this is what was intended.   If not, I'll drop it and let's start again.

Well, this one is fine. However, since there are possible overflows in
msec_to_jiffies() that I fixed in another patch, it would be cleaner to
only rely on this function's ability to detect overflows. Moreover, using
this function removes the divide for most common values of HZ, which is
interesting on some archs.

As Davide mentionned it, the EP_MAX_MSTIMEO constant would be useful for
other parts of the kernel (starting with sys_poll()). That's also what my
fix to jiffies.h does.

If you don't feel comfortable with so much changes at once, it's better
to first apply Davide's fix, then later the jiffies one, and then after,
I could resend the ep_poll() and sys_poll() patches to use the functions
in jiffies.h instead of doing their own magics.

Regards,
Willy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-25  6:20               ` Willy Tarreau
@ 2005-09-25  6:32                 ` Andrew Morton
  0 siblings, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2005-09-25  6:32 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: davidel, willy, nish.aravamudan, linux-kernel

Willy Tarreau <willy@w.ods.org> wrote:
>
> If you don't feel comfortable with so much changes at once, it's better
>  to first apply Davide's fix, then later the jiffies one, and then after,
>  I could resend the ep_poll() and sys_poll() patches to use the functions
>  in jiffies.h instead of doing their own magics.

That works for me.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-25  6:05             ` Andrew Morton
  2005-09-25  6:20               ` Willy Tarreau
@ 2005-09-25  7:08               ` Vadim Lobanov
  2005-09-25  8:03                 ` Willy Tarreau
  1 sibling, 1 reply; 33+ messages in thread
From: Vadim Lobanov @ 2005-09-25  7:08 UTC (permalink / raw)
  To: Andrew Morton, Davide Libenzi; +Cc: willy, nish.aravamudan, linux-kernel

On Sat, 24 Sep 2005, Andrew Morton wrote:

> Davide Libenzi <davidel@xmailserver.org> wrote:
> >
> > The attached patch uses the kernel min() macro, that is optimized has
> >  single compare by gcc-O2. Andrew, this goes over (hopefully ;) the bits
> >  you already have in -mm.
>
> OK, well I've rather lost the plot with all the patches flying around.
>
> I now have one single patch against epoll.c, below.  Please confirm that
> this is what was intended.   If not, I'll drop it and let's start again.
>

I hate to be the squeaky wheel here, but the attached patch is not 100%
right -

>
> From: Davide Libenzi <davidel@xmailserver.org>
>
> The sys_epoll_wait() function was not handling correctly negative timeouts
> (besides -1), and like sys_poll(), was comparing millisec to secs in
> testing the upper timeout limit.
>
> Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> ---
>
>  fs/eventpoll.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff -puN fs/eventpoll.c~sys_epoll_wait-fix-handling-of-negative-timeouts fs/eventpoll.c
> --- devel/fs/eventpoll.c~sys_epoll_wait-fix-handling-of-negative-timeouts	2005-09-24 23:01:00.000000000 -0700
> +++ devel-akpm/fs/eventpoll.c	2005-09-24 23:02:50.000000000 -0700
> @@ -101,6 +101,10 @@
>  /* Maximum number of poll wake up nests we are allowing */
>  #define EP_MAX_POLLWAKE_NESTS 4
>
> +/* Maximum msec timeout value storeable in a long int */
> +#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, LONG_MAX / HZ - 1000ULL)

This should instead be:
#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, (LONG_MAX - 999ULL) / HZ)
Here's why:
We want to avoid overflow of (timeout * HZ + 999), or, in other words,
the case where (timeout * HZ + 999) >= LONG_MAX
Unwrapping the equation, we get timeout >= (LONG_MAX - 999) / HZ

The original code isn't _wrong_, but more restrictive than it should be.
In any case, better to fix up the base patch now, before all the other
patches go in. I could do this, or Davide can... it's all good. :-)

> +
> +
>  struct epoll_filefd {
>  	struct file *file;
>  	int fd;
> @@ -1506,8 +1510,8 @@ static int ep_poll(struct eventpoll *ep,
>  	 * and the overflow condition. The passed timeout is in milliseconds,
>  	 * that why (t * HZ) / 1000.
>  	 */
> -	jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
> -		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> +	jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
> +		MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
>
>  retry:
>  	write_lock_irqsave(&ep->lock, flags);
> _
>
> -

-Vadim Lobanov

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-25  7:08               ` Vadim Lobanov
@ 2005-09-25  8:03                 ` Willy Tarreau
  0 siblings, 0 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-09-25  8:03 UTC (permalink / raw)
  To: Vadim Lobanov
  Cc: Andrew Morton, Davide Libenzi, nish.aravamudan, linux-kernel

Hi,

On Sun, Sep 25, 2005 at 12:08:03AM -0700, Vadim Lobanov wrote:
(...)
> > +/* Maximum msec timeout value storeable in a long int */
> > +#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, LONG_MAX / HZ - 1000ULL)
> 
> This should instead be:
> #define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, (LONG_MAX - 999ULL) / HZ)
> Here's why:
> We want to avoid overflow of (timeout * HZ + 999), or, in other words,
> the case where (timeout * HZ + 999) >= LONG_MAX
> Unwrapping the equation, we get timeout >= (LONG_MAX - 999) / HZ
> 
> The original code isn't _wrong_, but more restrictive than it should be.
> In any case, better to fix up the base patch now, before all the other
> patches go in. I could do this, or Davide can... it's all good. :-)

I think it's because with the numerous changes we brought, the '>' test
became '>=' but the old timeout was still used with '>'. With '>=', I
agree with you that it must be -999.

Andrew, Vadim is right. Anyway, this proves why we must really move all
those complicated tests to jiffies.h ASAP !

BTW, Andrew, could you merge the jiffies fix before -mm3, so that we can
remove those annoying tests quickly ?

Thanks in advance,
Willy


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24  6:15     ` Willy Tarreau
  2005-09-24  7:33       ` Vadim Lobanov
  2005-09-24 15:10       ` Davide Libenzi
@ 2005-09-24 17:19       ` Nishanth Aravamudan
  2005-09-24 18:25         ` Davide Libenzi
  2005-09-24 18:30         ` Willy Tarreau
  2 siblings, 2 replies; 33+ messages in thread
From: Nishanth Aravamudan @ 2005-09-24 17:19 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Nish Aravamudan, Davide Libenzi, Linux Kernel Mailing List,
	Andrew Morton

On 24.09.2005 [08:15:00 +0200], Willy Tarreau wrote:
> On Fri, Sep 23, 2005 at 09:44:10PM -0700, Nish Aravamudan wrote:
> > > >        * that why (t * HZ) / 1000.
> > > >        */
> > > > -     jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
> > > > +     jtimeout = timeout < 0 || (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ) ?
> > > >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> > >
> > > Here, I'm not certain that gcc will optimize the divide. It would be better
> > > anyway to write this which is equivalent, and a pure integer comparison :
> > >
> > > +       jtimeout = timeout < 0 || timeout >= 1000 * MAX_SCHEDULE_TIMEOUT / HZ ?
> > > >               MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> > 
> > Just a question here, maybe it's dumb.
> 
> Your question is not dumb, this code is not trivial at all !
> 
> > * and / have the same priority in the order of operations, yes? If so,
> > won't the the 1000 * MAX_SCHEDULE_TIMEOUT overflow
> > (MAX_SCHEDULE_TIMEOUT is LONG_MAX)?
> 
> Yes it can, and that's why I said that gcc should send a warning when
> comparing an int with something too large for an int. But I should have
> forced the constant to be evaluated as long long. At the moment, the
> constant cannot overflow, but it can reach a value so high that
> timeout/1000 will never reach it. Example :

Ok, makes sense!

<snip>

> > In any case, this code is approaching unreadable with lots of jiffies
> > <--> human-time units manipulations done in non-standard ways, which
> > the updated sys_poll() also tries to avoid.
> 
> I've not checked sys_poll(), but I agree with you that it's rather
> difficult to imagine all corner cases this way.

Here is the patch I was referring to, adapted to the epoll use of
schedule_timeout(). A similar fix is in -mm for almost identical code in
sys_poll().

Description: The @timeout parameter to ep_poll() is in milliseconds but
we compare it to (MAX_SCHEDULE_TIMEOUT - 1000 / HZ), which is
(jiffies/jiffies-per-sec) or seconds. That seems blatantly broken.  This
led to improper overflow checking for @timeout. As Andrew Morton pointed
out in a similar fix for sys_poll(), the best solution is to to check
for potential overflow first, then either select an indefinite value or
convert @timeout.

To achieve this and clean-up the code, change the prototype of the
ep_poll() to make it clear that the parameter is in milliseconds and
rename jtimeout to timeout_jiffies to hold the corresonding jiffies
value.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

---

 fs/eventpoll.c |   46 ++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 38 insertions(+), 8 deletions(-)

diff -urpN 2.6.14-rc2-mm1/fs/eventpoll.c 2.6.14-rc2-mm1-dev/fs/eventpoll.c
--- 2.6.14-rc2-mm1/fs/eventpoll.c	2005-09-21 23:50:28.000000000 -0700
+++ 2.6.14-rc2-mm1-dev/fs/eventpoll.c	2005-09-24 10:07:51.000000000 -0700
@@ -260,7 +260,7 @@ static int ep_events_transfer(struct eve
 			      struct epoll_event __user *events,
 			      int maxevents);
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, long timeout);
+		   int maxevents, long timeout_msecs);
 static int eventpollfs_delete_dentry(struct dentry *dentry);
 static struct inode *ep_eventpoll_inode(void);
 static struct super_block *eventpollfs_get_sb(struct file_system_type *fs_type,
@@ -1494,11 +1494,12 @@ static int ep_events_transfer(struct eve
 
 
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-		   int maxevents, long timeout)
+		   int maxevents, long timeout_msecs)
 {
 	int res, eavail;
+	int overflow;
 	unsigned long flags;
-	long jtimeout;
+	long timeout_jiffies;
 	wait_queue_t wait;
 
 	/*
@@ -1506,8 +1507,36 @@ static int ep_poll(struct eventpoll *ep,
 	 * and the overflow condition. The passed timeout is in milliseconds,
 	 * that why (t * HZ) / 1000.
 	 */
-	jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
-		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
+
+	if (timeout_msecs) {
+		/*
+		 * We compare HZ with 1000 to work out which side of the
+		 * expression needs conversion.  Because we want to
+		 * avoid converting any value to a numerically higher
+		 * value, which could overflow.
+		 */
+#if HZ > 1000
+		overflow = timeout_msecs >=
+				jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT);
+#else
+		overflow = msecs_to_jiffies(timeout_msecs) >=
+				MAX_SCHEDULE_TIMEOUT;
+#endif
+
+		/*
+		 * If we would overflow in the conversion or a negative
+		 * timeout is requested, sleep indefinitely.
+		 */
+		if (overflow || timeout_msecs < 0)
+			timeout_jiffies = MAX_SCHEDULE_TIMEOUT;
+		else
+			timeout_jiffies = msecs_to_jiffies(timeout_msecs) + 1;
+	} else {
+		/*
+		 * 0 millisecond requests become 0 jiffy requests
+		 */
+		timeout_jiffies = 0;
+	}
 
 retry:
 	write_lock_irqsave(&ep->lock, flags);
@@ -1529,7 +1558,7 @@ retry:
 			 * to TASK_INTERRUPTIBLE before doing the checks.
 			 */
 			set_current_state(TASK_INTERRUPTIBLE);
-			if (!list_empty(&ep->rdllist) || !jtimeout)
+			if (!list_empty(&ep->rdllist) || !timeout_jiffies)
 				break;
 			if (signal_pending(current)) {
 				res = -EINTR;
@@ -1537,7 +1566,7 @@ retry:
 			}
 
 			write_unlock_irqrestore(&ep->lock, flags);
-			jtimeout = schedule_timeout(jtimeout);
+			timeout_jiffies = schedule_timeout(timeout_jiffies);
 			write_lock_irqsave(&ep->lock, flags);
 		}
 		remove_wait_queue(&ep->wq, &wait);
@@ -1556,7 +1585,8 @@ retry:
 	 * more luck.
 	 */
 	if (!res && eavail &&
-	    !(res = ep_events_transfer(ep, events, maxevents)) && jtimeout)
+	    !(res = ep_events_transfer(ep, events, maxevents)) &&
+	    timeout_jiffies)
 		goto retry;
 
 	return res;

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24 17:19       ` Nishanth Aravamudan
@ 2005-09-24 18:25         ` Davide Libenzi
  2005-09-24 19:38           ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Willy Tarreau
  2005-09-24 21:25           ` [patch] sys_epoll_wait() timeout saga Vadim Lobanov
  2005-09-24 18:30         ` Willy Tarreau
  1 sibling, 2 replies; 33+ messages in thread
From: Davide Libenzi @ 2005-09-24 18:25 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Willy Tarreau, Nish Aravamudan, Linux Kernel Mailing List,
	Andrew Morton

On Sat, 24 Sep 2005, Nishanth Aravamudan wrote:

> Description: The @timeout parameter to ep_poll() is in milliseconds but
> we compare it to (MAX_SCHEDULE_TIMEOUT - 1000 / HZ), which is
> (jiffies/jiffies-per-sec) or seconds. That seems blatantly broken.  This
> led to improper overflow checking for @timeout. As Andrew Morton pointed
> out in a similar fix for sys_poll(), the best solution is to to check
> for potential overflow first, then either select an indefinite value or
> convert @timeout.
>
> To achieve this and clean-up the code, change the prototype of the
> ep_poll() to make it clear that the parameter is in milliseconds and
> rename jtimeout to timeout_jiffies to hold the corresonding jiffies
> value.
>
> fs/eventpoll.c |   46 ++++++++++++++++++++++++++++++++++++++--------
> 1 files changed, 38 insertions(+), 8 deletions(-)

Eh? "To achieve this and clean-up the code"? :)


- Davide


eventpoll.c |    7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

--- a/fs/eventpoll.c	2005-09-24 11:07:04.000000000 -0700
+++ b/fs/eventpoll.c	2005-09-24 11:11:06.000000000 -0700
@@ -101,6 +101,10 @@
  /* Maximum number of poll wake up nests we are allowing */
  #define EP_MAX_POLLWAKE_NESTS 4

+/* Maximum msec timeout value storeable in a long int */
+#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, LONG_MAX / HZ - 1000ULL)
+
+
  struct epoll_filefd {
  	struct file *file;
  	int fd;
@@ -1507,8 +1511,7 @@
  	 * and the overflow condition. The passed timeout is in milliseconds,
  	 * that why (t * HZ) / 1000.
  	 */
-	jtimeout = (timeout < 0 ||
-		    (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ)) ?
+	jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
  		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;

  retry:

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies()
  2005-09-24 18:25         ` Davide Libenzi
@ 2005-09-24 19:38           ` Willy Tarreau
  2005-09-24 19:44             ` [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies() Willy Tarreau
                               ` (4 more replies)
  2005-09-24 21:25           ` [patch] sys_epoll_wait() timeout saga Vadim Lobanov
  1 sibling, 5 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-09-24 19:38 UTC (permalink / raw)
  To: Davide Libenzi, Andrew Morton
  Cc: Nishanth Aravamudan, Willy Tarreau, Nish Aravamudan,
	Linux Kernel Mailing List

Hello,

After the discussion around epoll() timeout, I noticed that the functions used
to detect the timeout could themselves overflow for some values of HZ.

So I decided to fix them by defining a macro which represents the maximal
acceptable argument which is guaranteed not to overflow. As an added bonus,
those functions can now be used in poll() and ep_poll() and remove the divide
if HZ == 1000, or replace it with a shift if (1000 % HZ) or (HZ % 1000) is a
power of two.

Patches against 2.6.14-rc2-mm1 sent as replies to this mail.

Regards,
Willy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies()
  2005-09-24 19:38           ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Willy Tarreau
@ 2005-09-24 19:44             ` Willy Tarreau
  2005-09-29  9:43               ` Andrew Morton
  2005-09-24 19:47             ` [PATCH 2/3] 2.6.14-rc2-mm1: fixes for overflow in epoll() Willy Tarreau
                               ` (3 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: Willy Tarreau @ 2005-09-24 19:44 UTC (permalink / raw)
  To: Davide Libenzi, Andrew Morton
  Cc: Nishanth Aravamudan, Nish Aravamudan, Linux Kernel Mailing List

On Sat, Sep 24, 2005 at 09:38:39PM +0200, Willy Tarreau wrote:

msecs_to_jiffies() can compute a wrong timeout limit for some values of HZ :

#define MAX_JIFFY_OFFSET ((~0UL >> 1)-1)

static inline unsigned int jiffies_to_msecs(const unsigned long j)
{
#if HZ <= 1000 && !(1000 % HZ)
        return (1000 / HZ) * j;
#elif HZ > 1000 && !(HZ % 1000)
        return (j + (HZ / 1000) - 1)/(HZ / 1000);
#else
        return (j * 1000) / HZ;
#endif
}

static inline unsigned long msecs_to_jiffies(const unsigned int m)
{
        if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
                return MAX_JIFFY_OFFSET;
#if HZ <= 1000 && !(1000 % HZ)
        return (m + (1000 / HZ) - 1) / (1000 / HZ);
#elif HZ > 1000 && !(HZ % 1000)
        return m * (HZ / 1000);
#else
        return (m * HZ + 999) / 1000;
#endif
}

Calling jiffies_to_msecs(MAX_JIFFY_OFFSET) overflows if 1000/HZ >=2 or
if (1000 % HZ != 0 || HZ % 1000 != 0). Eg: for HZ=300, the max is set to
0xda7406 while it is set to 0xfffffff8 for HZ=250 !

The following patch implements two macros MAX_MSEC_OFFSET and MAX_USEC_OFFSET
which are used to fix msecs_to_jiffies() and usecs_to_jiffies() respectively.
They are defined depending on the operation which will be performed by the
function, so that they try to limit to the absolute maximum acceptable value.

If someone has better naming, feel free to update the patch. This patch applies
to 2.6.14-rc2 and 2.6.14-rc2-mm1. Please review and apply.

Signed-off-by: Willy Tarreau <willy@w.ods.org>

- Willy


diff -purN linux-2.6.14-rc2-mm1/include/linux/jiffies.h linux-2.6.14-rc2-mm1-jiffies/include/linux/jiffies.h
--- linux-2.6.14-rc2-mm1/include/linux/jiffies.h	Sat Sep 24 21:11:51 2005
+++ linux-2.6.14-rc2-mm1-jiffies/include/linux/jiffies.h	Sat Sep 24 21:17:28 2005
@@ -246,6 +246,37 @@ static inline u64 get_jiffies_64(void)
 
 #endif
 
+
+/*
+ * We define MAX_MSEC_OFFSET and MAX_USEC_OFFSET as maximal values that can be
+ * accepted by msecs_to_jiffies() and usec_to_jiffies() respectively, without
+ * risking a multiply overflow. Those functions return MAX_JIFFY_OFFSET for
+ * arguments above those values.
+ */
+
+#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
+#  define MAX_MSEC_OFFSET \
+	(ULONG_MAX - (MSEC_PER_SEC / HZ) + 1)
+#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
+#  define MAX_MSEC_OFFSET \
+	ULONG_MAX / (HZ / MSEC_PER_SEC)
+#else
+#  define MAX_MSEC_OFFSET \
+	(ULONG_MAX - (MSEC_PER_SEC - 1)) / HZ
+#endif
+
+#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
+#  define MAX_USEC_OFFSET \
+	(ULONG_MAX - (USEC_PER_SEC / HZ) + 1)
+#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
+#  define MAX_USEC_OFFSET \
+	ULONG_MAX / (HZ / USEC_PER_SEC)
+#else
+#  define MAX_USEC_OFFSET \
+	(ULONG_MAX - (USEC_PER_SEC - 1)) / HZ
+#endif
+
+
 /*
  * Convert jiffies to milliseconds and back.
  *
@@ -276,7 +307,7 @@ static inline unsigned int jiffies_to_us
 
 static inline unsigned long msecs_to_jiffies(const unsigned int m)
 {
-	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+	if (m > MAX_MSEC_OFFSET)
 		return MAX_JIFFY_OFFSET;
 #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
 	return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
@@ -289,7 +320,7 @@ static inline unsigned long msecs_to_jif
 
 static inline unsigned long usecs_to_jiffies(const unsigned int u)
 {
-	if (u > jiffies_to_usecs(MAX_JIFFY_OFFSET))
+	if (u > MAX_USEC_OFFSET)
 		return MAX_JIFFY_OFFSET;
 #if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
 	return (u + (USEC_PER_SEC / HZ) - 1) / (USEC_PER_SEC / HZ);



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies()
  2005-09-24 19:44             ` [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies() Willy Tarreau
@ 2005-09-29  9:43               ` Andrew Morton
  2005-09-29 19:41                 ` Willy Tarreau
  0 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2005-09-29  9:43 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: davidel, nacc, nish.aravamudan, linux-kernel

Willy Tarreau <willy@w.ods.org> wrote:
>
> +#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
>  +#  define MAX_MSEC_OFFSET \
>  +	(ULONG_MAX - (MSEC_PER_SEC / HZ) + 1)

That generates numbers which don't fit into unsigned ints, yielding vast
numbers of

include/linux/jiffies.h: In function `msecs_to_jiffies':
include/linux/jiffies.h:310: warning: comparison is always false due to limited range of data type
include/linux/jiffies.h: In function `usecs_to_jiffies':
include/linux/jiffies.h:323: warning: comparison is always false due to limited range of data type



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies()
  2005-09-29  9:43               ` Andrew Morton
@ 2005-09-29 19:41                 ` Willy Tarreau
  2005-09-29 19:52                   ` Andrew Morton
  0 siblings, 1 reply; 33+ messages in thread
From: Willy Tarreau @ 2005-09-29 19:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidel, nacc, nish.aravamudan, linux-kernel

Thanks Andrew,

I'm very sorry because I have verified the code with gcc-2.95.3,
gcc-3.3.6 and gcc-3.4.4 on x86 and alpha to ensure that everything
went smooth on archs where sizeof(long) > sizeof(int). But I've
tested all the combinations in user-space for obvious ease of
validation. I believe I forgot to use -Wall. What architecture
gave you this, and with which compiler please ? I'm willing to
fix this as soon as I can understand the root of the problem.

Regards,
Willy


On Thu, Sep 29, 2005 at 02:43:12AM -0700, Andrew Morton wrote:
> Willy Tarreau <willy@w.ods.org> wrote:
> >
> > +#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
> >  +#  define MAX_MSEC_OFFSET \
> >  +	(ULONG_MAX - (MSEC_PER_SEC / HZ) + 1)
> 
> That generates numbers which don't fit into unsigned ints, yielding vast
> numbers of
> 
> include/linux/jiffies.h: In function `msecs_to_jiffies':
> include/linux/jiffies.h:310: warning: comparison is always false due to limited range of data type
> include/linux/jiffies.h: In function `usecs_to_jiffies':
> include/linux/jiffies.h:323: warning: comparison is always false due to limited range of data type
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies()
  2005-09-29 19:41                 ` Willy Tarreau
@ 2005-09-29 19:52                   ` Andrew Morton
  2005-09-29 20:55                     ` Willy Tarreau
  2005-10-01 17:39                     ` Willy Tarreau
  0 siblings, 2 replies; 33+ messages in thread
From: Andrew Morton @ 2005-09-29 19:52 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: davidel, nacc, nish.aravamudan, linux-kernel

Willy Tarreau <willy@w.ods.org> wrote:
>
> Thanks Andrew,
> 
> I'm very sorry because I have verified the code with gcc-2.95.3,
> gcc-3.3.6 and gcc-3.4.4 on x86 and alpha to ensure that everything
> went smooth on archs where sizeof(long) > sizeof(int). But I've
> tested all the combinations in user-space for obvious ease of
> validation. I believe I forgot to use -Wall. What architecture
> gave you this, and with which compiler please ? I'm willing to
> fix this as soon as I can understand the root of the problem.
> 

http://www.zip.com.au/~akpm/linux/patches/stuff/top-posting.txt

> 
> 
> On Thu, Sep 29, 2005 at 02:43:12AM -0700, Andrew Morton wrote:
> > Willy Tarreau <willy@w.ods.org> wrote:
> > >
> > > +#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
> > >  +#  define MAX_MSEC_OFFSET \
> > >  +	(ULONG_MAX - (MSEC_PER_SEC / HZ) + 1)
> > 
> > That generates numbers which don't fit into unsigned ints, yielding vast
> > numbers of
> > 
> > include/linux/jiffies.h: In function `msecs_to_jiffies':
> > include/linux/jiffies.h:310: warning: comparison is always false due to limited range of data type
> > include/linux/jiffies.h: In function `usecs_to_jiffies':
> > include/linux/jiffies.h:323: warning: comparison is always false due to limited range of data type
> > 

This was a ppc64 build, gcc-3.3.3, CONFIG_HZ=250

Look a the value which MAX_MSEC_OFFSET will take (it's 2^63 minus a bit). 
Comparing that to an unsigned int will generate the always-true or
always-false warning.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies()
  2005-09-29 19:52                   ` Andrew Morton
@ 2005-09-29 20:55                     ` Willy Tarreau
  2005-10-01 17:39                     ` Willy Tarreau
  1 sibling, 0 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-09-29 20:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidel, nacc, nish.aravamudan, linux-kernel

On Thu, Sep 29, 2005 at 12:52:07PM -0700, Andrew Morton wrote:
> Willy Tarreau <willy@w.ods.org> wrote:
> >
> > Thanks Andrew,
> > 
> > I'm very sorry because I have verified the code with gcc-2.95.3,
> > gcc-3.3.6 and gcc-3.4.4 on x86 and alpha to ensure that everything
> > went smooth on archs where sizeof(long) > sizeof(int). But I've
> > tested all the combinations in user-space for obvious ease of
> > validation. I believe I forgot to use -Wall. What architecture
> > gave you this, and with which compiler please ? I'm willing to
> > fix this as soon as I can understand the root of the problem.
> > 
> 
> http://www.zip.com.au/~akpm/linux/patches/stuff/top-posting.txt

Guess what ? I hate it too when others do it, but I often think that
my mail will be a one-liner which will be easier to read this way, and
of course I'm wrong. Nice FAQ BTW.

> > On Thu, Sep 29, 2005 at 02:43:12AM -0700, Andrew Morton wrote:
> > > Willy Tarreau <willy@w.ods.org> wrote:
> > > >
> > > > +#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
> > > >  +#  define MAX_MSEC_OFFSET \
> > > >  +	(ULONG_MAX - (MSEC_PER_SEC / HZ) + 1)
> > > 
> > > That generates numbers which don't fit into unsigned ints, yielding vast
> > > numbers of
> > > 
> > > include/linux/jiffies.h: In function `msecs_to_jiffies':
> > > include/linux/jiffies.h:310: warning: comparison is always false due to limited range of data type
> > > include/linux/jiffies.h: In function `usecs_to_jiffies':
> > > include/linux/jiffies.h:323: warning: comparison is always false due to limited range of data type
> > > 
> 
> This was a ppc64 build, gcc-3.3.3, CONFIG_HZ=250

OK, I have a free account on a ppc64 machine in case I cannot reproduce
on anything else.

> Look a the value which MAX_MSEC_OFFSET will take (it's 2^63 minus a bit). 
> Comparing that to an unsigned int will generate the always-true or
> always-false warning.

I see it. I've done the ulong magic only on the constant computation,
while in theory, I should have casted m to ulong before the comparison
because m is implicitly casted to ulong in the return (common type).
In practise, it should be better to cast MAX_MSEC_OFFSET to unsigned int
in the comparison, but there's still a risk of warning if MAX_MSEC_OFFSET
becomes equal to ~0.

I'll makes a few tests and check that gcc is smart enough to remove the
cast code when unneeded if I cast m to ulong.

Thanks for the details,
Willy


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies()
  2005-09-29 19:52                   ` Andrew Morton
  2005-09-29 20:55                     ` Willy Tarreau
@ 2005-10-01 17:39                     ` Willy Tarreau
  1 sibling, 0 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-10-01 17:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: davidel, nacc, nish.aravamudan, linux-kernel

Hi Andrew,

On Thu, Sep 29, 2005 at 12:52:07PM -0700, Andrew Morton wrote:
> > On Thu, Sep 29, 2005 at 02:43:12AM -0700, Andrew Morton wrote:
> > > Willy Tarreau <willy@w.ods.org> wrote:
> > > >
> > > > +#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
> > > >  +#  define MAX_MSEC_OFFSET \
> > > >  +	(ULONG_MAX - (MSEC_PER_SEC / HZ) + 1)
> > > 
> > > That generates numbers which don't fit into unsigned ints, yielding vast
> > > numbers of
> > > 
> > > include/linux/jiffies.h: In function `msecs_to_jiffies':
> > > include/linux/jiffies.h:310: warning: comparison is always false due to limited range of data type
> > > include/linux/jiffies.h: In function `usecs_to_jiffies':
> > > include/linux/jiffies.h:323: warning: comparison is always false due to limited range of data type
> > > 
> 
> This was a ppc64 build, gcc-3.3.3, CONFIG_HZ=250
> 
> Look a the value which MAX_MSEC_OFFSET will take (it's 2^63 minus a bit). 
> Comparing that to an unsigned int will generate the always-true or
> always-false warning.

OK, this was not trivial because gcc is smart enough to detect that even
after a cast, the comparison is still always false. So I had to cut the
comparison in two parts : one which tests whether we need to compare, and
one which does the test on the unsigned int part if necessary. It has shut
the warnings on my alpha (HZ=1024) and on my ultrasparc (HZ=250). I've
noticed that the code in previous patch could still overflow in the cases
where a multiply was used first, because the common type was still int. My
test case in user-space used MSEC_PER_SEC = 1000UL so it did not happen.
I've fixed this too.

I've also checked the code produced on x86 (because alpha code is unreadable
to me), and it resumes to this :

  - HZ=1000 : no code generated
  - HZ=250  : comparison, add, right shift
  - HZ=100  : comparison, add, divide
  - HZ=1024 : comparison, left shift, add, mul, right shift

I tried to compile on ppc64, but unfortunately, the kernel does not build
there because sizeof(long) == 4 ! I guess this is because gcc's target is
powerpc-linux-gnu. I'm trying to recompile it with powerpc64-linux-gnu.
Could you send me the output of gcc -v on your ppc64, please ?

I've also added missing parenthesis in the #defines which might have caused
trouble to external users of MAX_?SEC_OFFSET (none at the moment).

Here's the new patch, I've build everything on 2.6.14-rc2-mm1 but verified
that this code was not touched in -mm2 and the patch still applies.

Could you please retest it on your ppc64 and apply it if you're OK with it ?

Thanks,
Willy



Signed-off-by: Willy Tarreau <willy@w.ods.org>

--- linux-2.6.14-rc2-mm1/include/linux/jiffies.h	Thu Sep 29 23:04:49 2005
+++ linux-2.6.14-rc2-mm1-jiffies2/include/linux/jiffies.h	Sat Oct  1 19:12:13 2005
@@ -246,6 +246,37 @@
 
 #endif
 
+
+/*
+ * We define MAX_MSEC_OFFSET and MAX_USEC_OFFSET as maximal values that can be
+ * accepted by msecs_to_jiffies() and usec_to_jiffies() respectively, without
+ * risking a multiply overflow. Those functions return MAX_JIFFY_OFFSET for
+ * arguments above those values.
+ */
+
+#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
+#  define MAX_MSEC_OFFSET \
+	(ULONG_MAX - (MSEC_PER_SEC / HZ) + 1)
+#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
+#  define MAX_MSEC_OFFSET \
+	(ULONG_MAX / (HZ / MSEC_PER_SEC))
+#else
+#  define MAX_MSEC_OFFSET \
+	((ULONG_MAX - (MSEC_PER_SEC - 1)) / HZ)
+#endif
+
+#if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
+#  define MAX_USEC_OFFSET \
+	(ULONG_MAX - (USEC_PER_SEC / HZ) + 1)
+#elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
+#  define MAX_USEC_OFFSET \
+	(ULONG_MAX / (HZ / USEC_PER_SEC))
+#else
+#  define MAX_USEC_OFFSET \
+	((ULONG_MAX - (USEC_PER_SEC - 1)) / HZ)
+#endif
+
+
 /*
  * Convert jiffies to milliseconds and back.
  *
@@ -276,27 +307,29 @@
 
 static inline unsigned long msecs_to_jiffies(const unsigned int m)
 {
-	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
+	if (MAX_MSEC_OFFSET < UINT_MAX && m > (unsigned int)MAX_MSEC_OFFSET)
 		return MAX_JIFFY_OFFSET;
 #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
-	return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
+	return ((unsigned long)m + (MSEC_PER_SEC / HZ) - 1) /
+		(MSEC_PER_SEC / HZ);
 #elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
-	return m * (HZ / MSEC_PER_SEC);
+	return (unsigned long)m * (HZ / MSEC_PER_SEC);
 #else
-	return (m * HZ + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
+	return ((unsigned long)m * HZ + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
 #endif
 }
 
 static inline unsigned long usecs_to_jiffies(const unsigned int u)
 {
-	if (u > jiffies_to_usecs(MAX_JIFFY_OFFSET))
+	if (MAX_USEC_OFFSET < UINT_MAX && u > (unsigned int)MAX_USEC_OFFSET)
 		return MAX_JIFFY_OFFSET;
 #if HZ <= USEC_PER_SEC && !(USEC_PER_SEC % HZ)
-	return (u + (USEC_PER_SEC / HZ) - 1) / (USEC_PER_SEC / HZ);
+	return ((unsigned long)u + (USEC_PER_SEC / HZ) - 1) /
+		(USEC_PER_SEC / HZ);
 #elif HZ > USEC_PER_SEC && !(HZ % USEC_PER_SEC)
-	return u * (HZ / USEC_PER_SEC);
+	return (unsigned long)u * (HZ / USEC_PER_SEC);
 #else
-	return (u * HZ + USEC_PER_SEC - 1) / USEC_PER_SEC;
+	return ((unsigned long)u * HZ + USEC_PER_SEC - 1) / USEC_PER_SEC;
 #endif
 }
 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 2/3] 2.6.14-rc2-mm1: fixes for overflow in epoll()
  2005-09-24 19:38           ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Willy Tarreau
  2005-09-24 19:44             ` [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies() Willy Tarreau
@ 2005-09-24 19:47             ` Willy Tarreau
  2005-09-24 19:52             ` [PATCH 3/3] 2.6.14-rc2-mm1 : fixes for overflow in sys_poll() Willy Tarreau
                               ` (2 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-09-24 19:47 UTC (permalink / raw)
  To: Davide Libenzi, Andrew Morton
  Cc: Nishanth Aravamudan, Nish Aravamudan, Linux Kernel Mailing List

This patch fixes the potential timeout overflow in ep_poll(), tries to get
the best timeout and also removes the /1000 divide whenever possible. It
mostly ensures that timeout computation is uniform with other places in the
kernel.

This patch applies both to 2.6.14-rc2-mm1 and 2.6.14-rc2.

Please review and apply.

Signed-off-by: Willy Tarreau <willy@w.ods.org>

- Willy

diff -purN linux-2.6.14-rc2-mm1/fs/eventpoll.c linux-2.6.14-rc2-mm1-epoll/fs/eventpoll.c
--- linux-2.6.14-rc2-mm1/fs/eventpoll.c	Sat Sep 24 21:11:49 2005
+++ linux-2.6.14-rc2-mm1-epoll/fs/eventpoll.c	Sat Sep 24 21:20:10 2005
@@ -1502,12 +1502,11 @@ static int ep_poll(struct eventpoll *ep,
 	wait_queue_t wait;
 
 	/*
-	 * Calculate the timeout by checking for the "infinite" value ( -1 )
-	 * and the overflow condition. The passed timeout is in milliseconds,
-	 * that why (t * HZ) / 1000.
+	 * Calculate the timeout by checking for the "infinite" value ( < 0 )
+	 * and the overflow condition. The passed timeout is in milliseconds.
 	 */
-	jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
-		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
+	jtimeout = (timeout < 0) ?
+	    MAX_SCHEDULE_TIMEOUT : msecs_to_jiffies(timeout);
 
 retry:
 	write_lock_irqsave(&ep->lock, flags);



^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/3] 2.6.14-rc2-mm1 : fixes for overflow in sys_poll()
  2005-09-24 19:38           ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Willy Tarreau
  2005-09-24 19:44             ` [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies() Willy Tarreau
  2005-09-24 19:47             ` [PATCH 2/3] 2.6.14-rc2-mm1: fixes for overflow in epoll() Willy Tarreau
@ 2005-09-24 19:52             ` Willy Tarreau
  2005-10-01 20:39               ` Willy Tarreau
  2005-09-24 20:08             ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Davide Libenzi
  2005-09-25 20:55             ` Nishanth Aravamudan
  4 siblings, 1 reply; 33+ messages in thread
From: Willy Tarreau @ 2005-09-24 19:52 UTC (permalink / raw)
  To: Davide Libenzi, Andrew Morton
  Cc: Nishanth Aravamudan, Nish Aravamudan, Linux Kernel Mailing List

This patch simplifies the overflow fix which was applied to sys_poll() in
2.6.14-rc2-mm1 by relying on the overflow detection code added to jiffies.h
by patch 1/3. This also removes a useless divide for HZ <= 1000.

It might be worth applying too. It's only for -mm1, and does *not* apply
to 2.6.14-rc2 because this code has already received patches in -mm1.

Signed-off-by: Willy Tarreau <willy@w.ods.org>

- Willy


diff -purN linux-2.6.14-rc2-mm1/fs/select.c linux-2.6.14-rc2-mm1-poll/fs/select.c
--- linux-2.6.14-rc2-mm1/fs/select.c	Sat Sep 24 21:12:36 2005
+++ linux-2.6.14-rc2-mm1-poll/fs/select.c	Sat Sep 24 21:23:21 2005
@@ -469,7 +469,6 @@ asmlinkage long sys_poll(struct pollfd _
 {
 	struct poll_wqueues table;
 	int fdcount, err;
-	int overflow;
  	unsigned int i;
 	struct poll_list *head;
  	struct poll_list *walk;
@@ -486,22 +485,11 @@ asmlinkage long sys_poll(struct pollfd _
 		return -EINVAL;
 
 	/*
-	 * We compare HZ with 1000 to work out which side of the
-	 * expression needs conversion.  Because we want to avoid
-	 * converting any value to a numerically higher value, which
-	 * could overflow.
-	 */
-#if HZ > 1000
-	overflow = timeout_msecs >= jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT);
-#else
-	overflow = msecs_to_jiffies(timeout_msecs) >= MAX_SCHEDULE_TIMEOUT;
-#endif
-
-	/*
 	 * If we would overflow in the conversion or a negative timeout
-	 * is requested, sleep indefinitely.
+	 * is requested, sleep indefinitely. Note: msecs_to_jiffies checks
+	 * for the overflow.
 	 */
-	if (overflow || timeout_msecs < 0)
+	if (timeout_msecs < 0)
 		timeout_jiffies = MAX_SCHEDULE_TIMEOUT;
 	else
 		timeout_jiffies = msecs_to_jiffies(timeout_msecs) + 1;



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/3] 2.6.14-rc2-mm1 : fixes for overflow in sys_poll()
  2005-09-24 19:52             ` [PATCH 3/3] 2.6.14-rc2-mm1 : fixes for overflow in sys_poll() Willy Tarreau
@ 2005-10-01 20:39               ` Willy Tarreau
  0 siblings, 0 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-10-01 20:39 UTC (permalink / raw)
  To: Davide Libenzi, Andrew Morton
  Cc: Nishanth Aravamudan, Nish Aravamudan, Linux Kernel Mailing List

Hi,

On Sat, Sep 24, 2005 at 09:52:05PM +0200, Willy Tarreau wrote:
> This patch simplifies the overflow fix which was applied to sys_poll() in
> 2.6.14-rc2-mm1 by relying on the overflow detection code added to jiffies.h
> by patch 1/3. This also removes a useless divide for HZ <= 1000.
> 
> It might be worth applying too. It's only for -mm1, and does *not* apply
> to 2.6.14-rc2 because this code has already received patches in -mm1.

I noticed that the overflow check in the original code is either useless or
buggy, so this patch is even more worth applying. Look below at original
code : if HZ <= 1000, the overflow is set only when
msecs_to_jiffies() >= MAX_SCHEDULE_TIMEOUT.

But the maximal value that msecs_to_jiffies() can return is MAX_JIFFY_OFFSET.
Guess what ?

  include/linux/jiffies.h:#define MAX_JIFFY_OFFSET ((~0UL >> 1)-1)
  include/linux/sched.h:#define    MAX_SCHEDULE_TIMEOUT    LONG_MAX
  include/linux/kernel.h:#define LONG_MAX  ((long)(~0UL>>1))

=> MAX_JIFFY_OFFSET == (MAX_SCHEDULE_TIMEOUT - 1)

=> So msecs_to_jiffies() cannot be >= MAX_SCHEDULE_TIMEOUT, and 'overflow'
   can never be set for HZ <= 1000.

Given how msecs_to_jiffies() is used everywhere, I wonder if we should not
provide a separate overflow detection function which would be used where
needed, and possibly remove the test from msecs_to_jiffies() which is mostly
called with constants or small delays which will never overflow.

Has anyone an opinion on this ?

Regards,
Willy

> diff -purN linux-2.6.14-rc2-mm1/fs/select.c linux-2.6.14-rc2-mm1-poll/fs/select.c
> --- linux-2.6.14-rc2-mm1/fs/select.c	Sat Sep 24 21:12:36 2005
> +++ linux-2.6.14-rc2-mm1-poll/fs/select.c	Sat Sep 24 21:23:21 2005
> @@ -469,7 +469,6 @@ asmlinkage long sys_poll(struct pollfd _
>  {
>  	struct poll_wqueues table;
>  	int fdcount, err;
> -	int overflow;
>   	unsigned int i;
>  	struct poll_list *head;
>   	struct poll_list *walk;
> @@ -486,22 +485,11 @@ asmlinkage long sys_poll(struct pollfd _
>  		return -EINVAL;
>  
>  	/*
> -	 * We compare HZ with 1000 to work out which side of the
> -	 * expression needs conversion.  Because we want to avoid
> -	 * converting any value to a numerically higher value, which
> -	 * could overflow.
> -	 */
> -#if HZ > 1000
> -	overflow = timeout_msecs >= jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT);
> -#else
> -	overflow = msecs_to_jiffies(timeout_msecs) >= MAX_SCHEDULE_TIMEOUT;
> -#endif
> -
> -	/*
>  	 * If we would overflow in the conversion or a negative timeout
> -	 * is requested, sleep indefinitely.
> +	 * is requested, sleep indefinitely. Note: msecs_to_jiffies checks
> +	 * for the overflow.
>  	 */
> -	if (overflow || timeout_msecs < 0)
> +	if (timeout_msecs < 0)
>  		timeout_jiffies = MAX_SCHEDULE_TIMEOUT;
>  	else
>  		timeout_jiffies = msecs_to_jiffies(timeout_msecs) + 1;
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies()
  2005-09-24 19:38           ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Willy Tarreau
                               ` (2 preceding siblings ...)
  2005-09-24 19:52             ` [PATCH 3/3] 2.6.14-rc2-mm1 : fixes for overflow in sys_poll() Willy Tarreau
@ 2005-09-24 20:08             ` Davide Libenzi
  2005-09-24 20:21               ` Willy TARREAU
  2005-09-25 20:55             ` Nishanth Aravamudan
  4 siblings, 1 reply; 33+ messages in thread
From: Davide Libenzi @ 2005-09-24 20:08 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andrew Morton, Nishanth Aravamudan, Nish Aravamudan,
	Linux Kernel Mailing List

On Sat, 24 Sep 2005, Willy Tarreau wrote:

> Hello,
>
> After the discussion around epoll() timeout, I noticed that the functions used
> to detect the timeout could themselves overflow for some values of HZ.
>
> So I decided to fix them by defining a macro which represents the maximal
> acceptable argument which is guaranteed not to overflow. As an added bonus,
> those functions can now be used in poll() and ep_poll() and remove the divide
> if HZ == 1000, or replace it with a shift if (1000 % HZ) or (HZ % 1000) is a
> power of two.

Why all that code, when you can have it with:

#define MAX_LONG_MSTIMEO (long) min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, LONG_MAX / HZ - 1000ULL)

that gcc-O2 collapses into a single constant?


- Davide



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies()
  2005-09-24 20:08             ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Davide Libenzi
@ 2005-09-24 20:21               ` Willy TARREAU
  0 siblings, 0 replies; 33+ messages in thread
From: Willy TARREAU @ 2005-09-24 20:21 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Andrew Morton, Nishanth Aravamudan, Nish Aravamudan,
	Linux Kernel Mailing List

On Sat, Sep 24, 2005 at 01:08:22PM -0700, Davide Libenzi wrote:
> On Sat, 24 Sep 2005, Willy Tarreau wrote:
> 
> >Hello,
> >
> >After the discussion around epoll() timeout, I noticed that the functions 
> >used
> >to detect the timeout could themselves overflow for some values of HZ.
> >
> >So I decided to fix them by defining a macro which represents the maximal
> >acceptable argument which is guaranteed not to overflow. As an added bonus,
> >those functions can now be used in poll() and ep_poll() and remove the 
> >divide
> >if HZ == 1000, or replace it with a shift if (1000 % HZ) or (HZ % 1000) is 
> >a
> >power of two.
> 
> Why all that code, when you can have it with:
> 
> #define MAX_LONG_MSTIMEO (long) min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, 
> LONG_MAX / HZ - 1000ULL)
> 
> that gcc-O2 collapses into a single constant?

It is because I wanted to ensure that it matched exactly the limits of the
functions that we call. And since msec_to_jiffies() is defined in 3 possible
ways depending on HZ, 1000 % HZ and HZ % 1000, there are 3 different limits.

Then, once the msec_to_jiffies() function is fixed, it's valuable to use it
in ep_poll(), because its worst case does exactly what you already have
(timeout * HZ + 999) / 1000, and other optimal cases can do better (for
HZ=100, 250, 1000, there will be no divide at all).

Don't worry, in my case, gcc also produces a single constant. That's just
that it depends on how it will be used (check include/linux/jiffies.h,
you'll understand what I mean).

Regards,
Willy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies()
  2005-09-24 19:38           ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Willy Tarreau
                               ` (3 preceding siblings ...)
  2005-09-24 20:08             ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Davide Libenzi
@ 2005-09-25 20:55             ` Nishanth Aravamudan
  2005-09-25 22:06               ` Willy Tarreau
  4 siblings, 1 reply; 33+ messages in thread
From: Nishanth Aravamudan @ 2005-09-25 20:55 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Davide Libenzi, Andrew Morton, Nish Aravamudan,
	Linux Kernel Mailing List

On 24.09.2005 [21:38:39 +0200], Willy Tarreau wrote:
> Hello,
> 
> After the discussion around epoll() timeout, I noticed that the functions used
> to detect the timeout could themselves overflow for some values of HZ.
> 
> So I decided to fix them by defining a macro which represents the maximal
> acceptable argument which is guaranteed not to overflow. As an added bonus,
> those functions can now be used in poll() and ep_poll() and remove the divide
> if HZ == 1000, or replace it with a shift if (1000 % HZ) or (HZ % 1000) is a
> power of two.
> 
> Patches against 2.6.14-rc2-mm1 sent as replies to this mail.

These look really good, Willy. Thanks for the fixes!

-Nish

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies()
  2005-09-25 20:55             ` Nishanth Aravamudan
@ 2005-09-25 22:06               ` Willy Tarreau
  0 siblings, 0 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-09-25 22:06 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Davide Libenzi, Andrew Morton, Nish Aravamudan,
	Linux Kernel Mailing List

On Sun, Sep 25, 2005 at 01:55:18PM -0700, Nishanth Aravamudan wrote:
> On 24.09.2005 [21:38:39 +0200], Willy Tarreau wrote:
> > Hello,
> > 
> > After the discussion around epoll() timeout, I noticed that the functions used
> > to detect the timeout could themselves overflow for some values of HZ.
> > 
> > So I decided to fix them by defining a macro which represents the maximal
> > acceptable argument which is guaranteed not to overflow. As an added bonus,
> > those functions can now be used in poll() and ep_poll() and remove the divide
> > if HZ == 1000, or replace it with a shift if (1000 % HZ) or (HZ % 1000) is a
> > power of two.
> > 
> > Patches against 2.6.14-rc2-mm1 sent as replies to this mail.
> 
> These look really good, Willy. Thanks for the fixes!

Thanks.
I noticed that there still existed a high risk of overflow with HZ values
for which (HZ % 1000) != 0 && (1000 % HZ) != 0, because we hit the only
situation where we start with a multiply by HZ, which is even worse as HZ
grows. The only cases I've found are :

  - alpha : 1024 or 1200
  - ia64  : 1024

Fortunately, they're both 64 bits, and since the result is an unsigned long,
the multiply returns 64 bits result so it won't overflow in a while. However,
we should avoid those combinations on 32-bits archs, because the limit is
then rather low. For example, using HZ=1200 on an x86 will set MAX_MSEC_OFFSET
to 3579138 ms, which is just below one hour. For reference, with HZ=1000,
MAX_MSEC_OFFSET would be 2^32-1 ms, or 49.7 days, which makes quite a
difference.

It is possible that some self-made setups hit the overflow before the patch.
For this reason, I wonder how we could discourage people from using such
bastard HZ values on 32 bits platforms :-/

Regards,
Willy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24 18:25         ` Davide Libenzi
  2005-09-24 19:38           ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Willy Tarreau
@ 2005-09-24 21:25           ` Vadim Lobanov
  1 sibling, 0 replies; 33+ messages in thread
From: Vadim Lobanov @ 2005-09-24 21:25 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Nishanth Aravamudan, Willy Tarreau, Nish Aravamudan,
	Linux Kernel Mailing List, Andrew Morton

On Sat, 24 Sep 2005, Davide Libenzi wrote:

> On Sat, 24 Sep 2005, Nishanth Aravamudan wrote:
>
> > Description: The @timeout parameter to ep_poll() is in milliseconds but
> > we compare it to (MAX_SCHEDULE_TIMEOUT - 1000 / HZ), which is
> > (jiffies/jiffies-per-sec) or seconds. That seems blatantly broken.  This
> > led to improper overflow checking for @timeout. As Andrew Morton pointed
> > out in a similar fix for sys_poll(), the best solution is to to check
> > for potential overflow first, then either select an indefinite value or
> > convert @timeout.
> >
> > To achieve this and clean-up the code, change the prototype of the
> > ep_poll() to make it clear that the parameter is in milliseconds and
> > rename jtimeout to timeout_jiffies to hold the corresonding jiffies
> > value.
> >
> > fs/eventpoll.c |   46 ++++++++++++++++++++++++++++++++++++++--------
> > 1 files changed, 38 insertions(+), 8 deletions(-)
>
> Eh? "To achieve this and clean-up the code"? :)
>
>
> - Davide
>
>
> eventpoll.c |    7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> --- a/fs/eventpoll.c	2005-09-24 11:07:04.000000000 -0700
> +++ b/fs/eventpoll.c	2005-09-24 11:11:06.000000000 -0700
> @@ -101,6 +101,10 @@
>   /* Maximum number of poll wake up nests we are allowing */
>   #define EP_MAX_POLLWAKE_NESTS 4
>
> +/* Maximum msec timeout value storeable in a long int */
> +#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULE_TIMEOUT / HZ, LONG_MAX / HZ - 1000ULL)

Not quite. It really should be:
#define EP_MAX_MSTIMEO min(1000ULL * MAX_SCHEDULETIMEOUT / HZ, (LONG_MAX - 999ULL) / HZ);

> +
> +
>   struct epoll_filefd {
>   	struct file *file;
>   	int fd;
> @@ -1507,8 +1511,7 @@
>   	 * and the overflow condition. The passed timeout is in milliseconds,
>   	 * that why (t * HZ) / 1000.
>   	 */
> -	jtimeout = (timeout < 0 ||
> -		    (timeout / 1000) >= (MAX_SCHEDULE_TIMEOUT / HZ)) ?
> +	jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
>   		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
>
>   retry:
> -

-Vadim Lobanov

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch] sys_epoll_wait() timeout saga ...
  2005-09-24 17:19       ` Nishanth Aravamudan
  2005-09-24 18:25         ` Davide Libenzi
@ 2005-09-24 18:30         ` Willy Tarreau
  1 sibling, 0 replies; 33+ messages in thread
From: Willy Tarreau @ 2005-09-24 18:30 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Nish Aravamudan, Davide Libenzi, Linux Kernel Mailing List,
	Andrew Morton

On Sat, Sep 24, 2005 at 10:19:28AM -0700, Nishanth Aravamudan wrote:
 
> Here is the patch I was referring to, adapted to the epoll use of
> schedule_timeout(). A similar fix is in -mm for almost identical code in
> sys_poll().
> 
> Description: The @timeout parameter to ep_poll() is in milliseconds but
> we compare it to (MAX_SCHEDULE_TIMEOUT - 1000 / HZ), which is
> (jiffies/jiffies-per-sec) or seconds. That seems blatantly broken.

as you've seen in my last mail, it happens to be valid because it checks
for the multiply overflow before doing it. But it's only after a math
demonstration that it appears valid anyway ;-)

> This
> led to improper overflow checking for @timeout. As Andrew Morton pointed
> out in a similar fix for sys_poll(), the best solution is to to check
> for potential overflow first, then either select an indefinite value or
> convert @timeout.
> 
> To achieve this and clean-up the code, change the prototype of the
> ep_poll() to make it clear that the parameter is in milliseconds and
> rename jtimeout to timeout_jiffies to hold the corresonding jiffies
> value.
> 
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> 
> ---
> 
>  fs/eventpoll.c |   46 ++++++++++++++++++++++++++++++++++++++--------
>  1 files changed, 38 insertions(+), 8 deletions(-)
> 
> diff -urpN 2.6.14-rc2-mm1/fs/eventpoll.c 2.6.14-rc2-mm1-dev/fs/eventpoll.c
> --- 2.6.14-rc2-mm1/fs/eventpoll.c	2005-09-21 23:50:28.000000000 -0700
> +++ 2.6.14-rc2-mm1-dev/fs/eventpoll.c	2005-09-24 10:07:51.000000000 -0700
> @@ -260,7 +260,7 @@ static int ep_events_transfer(struct eve
>  			      struct epoll_event __user *events,
>  			      int maxevents);
>  static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> -		   int maxevents, long timeout);
> +		   int maxevents, long timeout_msecs);
>  static int eventpollfs_delete_dentry(struct dentry *dentry);
>  static struct inode *ep_eventpoll_inode(void);
>  static struct super_block *eventpollfs_get_sb(struct file_system_type *fs_type,
> @@ -1494,11 +1494,12 @@ static int ep_events_transfer(struct eve
>  
>  
>  static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> -		   int maxevents, long timeout)
> +		   int maxevents, long timeout_msecs)
>  {
>  	int res, eavail;
> +	int overflow;
>  	unsigned long flags;
> -	long jtimeout;
> +	long timeout_jiffies;
>  	wait_queue_t wait;
>  
>  	/*
> @@ -1506,8 +1507,36 @@ static int ep_poll(struct eventpoll *ep,
>  	 * and the overflow condition. The passed timeout is in milliseconds,
>  	 * that why (t * HZ) / 1000.
>  	 */
> -	jtimeout = timeout == -1 || timeout > (MAX_SCHEDULE_TIMEOUT - 1000) / HZ ?
> -		MAX_SCHEDULE_TIMEOUT: (timeout * HZ + 999) / 1000;
> +
> +	if (timeout_msecs) {

Why not check for < 0 here ? it's the right moment, instead of doing all
the math first ? gcc will emit only one check with 3 results resumed in
just two consecutive cond jumps for 0, <0, >0.

> +		/*
> +		 * We compare HZ with 1000 to work out which side of the
> +		 * expression needs conversion.  Because we want to
> +		 * avoid converting any value to a numerically higher
> +		 * value, which could overflow.
> +		 */
> +#if HZ > 1000
> +		overflow = timeout_msecs >=
> +				jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT);
> +#else
> +		overflow = msecs_to_jiffies(timeout_msecs) >=
> +				MAX_SCHEDULE_TIMEOUT;
> +#endif

Although this seems clean, what worries me here is that for HZ <= 1000,
we're doing maths on a variable to compare it to a constant. The
msec_to_jiffies() function exists to return exact conversions, while
we're just checking for bounds. The magics here clearly depends on the
jiffies_to_msecs() implementation. So the real problem is not to check
for which values for timeout_msecs we can get an overflow, but to check
if timeout_msecs is higher than an HZ-dependant constant above which some
conversion functions will overflow.

Oh, BTW, while reading msecs_to_jiffies(), I noticed that it can overflow
too for small values of HZ (HZ < 1000) :

#define MAX_JIFFY_OFFSET ((~0UL >> 1)-1)

static inline unsigned int jiffies_to_msecs(const unsigned long j)
{
#if HZ <= 1000 && !(1000 % HZ)
        return (1000 / HZ) * j;
#elif HZ > 1000 && !(HZ % 1000)
        return (j + (HZ / 1000) - 1)/(HZ / 1000);
#else
        return (j * 1000) / HZ;
#endif
}


static inline unsigned long msecs_to_jiffies(const unsigned int m)
{
        if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
                return MAX_JIFFY_OFFSET;
#if HZ <= 1000 && !(1000 % HZ)
        return (m + (1000 / HZ) - 1) / (1000 / HZ);
#elif HZ > 1000 && !(HZ % 1000)
        return m * (HZ / 1000);
#else
        return (m * HZ + 999) / 1000;
#endif
}

Calling jiffies_to_msecs(MAX_JIFFY_OFFSET) is not valid if 1000/HZ >=2 or
if (1000 % HZ != 0 || HZ % 1000 != 0). Eg: for HZ=300, the max is set to
0xda7406 while it is set to 0xfffffff8 for HZ=250 !

This clearly shows that we need some common macros to define some absolute
bounds for each unit and avoid any computation on the timers before having
checked them against the bounds.

I'll see if I can get something working.

> +
> +		/*
> +		 * If we would overflow in the conversion or a negative
> +		 * timeout is requested, sleep indefinitely.
> +		 */
> +		if (overflow || timeout_msecs < 0)
> +			timeout_jiffies = MAX_SCHEDULE_TIMEOUT;
> +		else
> +			timeout_jiffies = msecs_to_jiffies(timeout_msecs) + 1;
> +	} else {
> +		/*
> +		 * 0 millisecond requests become 0 jiffy requests
> +		 */
> +		timeout_jiffies = 0;
> +	}
>  
>  retry:
>  	write_lock_irqsave(&ep->lock, flags);
> @@ -1529,7 +1558,7 @@ retry:
>  			 * to TASK_INTERRUPTIBLE before doing the checks.
>  			 */
>  			set_current_state(TASK_INTERRUPTIBLE);
> -			if (!list_empty(&ep->rdllist) || !jtimeout)
> +			if (!list_empty(&ep->rdllist) || !timeout_jiffies)
>  				break;
>  			if (signal_pending(current)) {
>  				res = -EINTR;
> @@ -1537,7 +1566,7 @@ retry:
>  			}
>  
>  			write_unlock_irqrestore(&ep->lock, flags);
> -			jtimeout = schedule_timeout(jtimeout);
> +			timeout_jiffies = schedule_timeout(timeout_jiffies);
>  			write_lock_irqsave(&ep->lock, flags);
>  		}
>  		remove_wait_queue(&ep->wq, &wait);
> @@ -1556,7 +1585,8 @@ retry:
>  	 * more luck.
>  	 */
>  	if (!res && eavail &&
> -	    !(res = ep_events_transfer(ep, events, maxevents)) && jtimeout)
> +	    !(res = ep_events_transfer(ep, events, maxevents)) &&
> +	    timeout_jiffies)
>  		goto retry;
>  
>  	return res;


Regards,
Willy


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2005-10-01 20:45 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-09-23 18:13 [patch] sys_epoll_wait() timeout saga Davide Libenzi
2005-09-23 18:24 ` Nish Aravamudan
2005-09-24  4:05 ` Willy Tarreau
2005-09-24  4:44   ` Nish Aravamudan
2005-09-24  6:15     ` Willy Tarreau
2005-09-24  7:33       ` Vadim Lobanov
2005-09-24  7:51         ` Willy Tarreau
2005-09-24 15:10       ` Davide Libenzi
2005-09-24 17:20         ` Willy Tarreau
2005-09-24 18:19           ` Davide Libenzi
2005-09-25  6:05             ` Andrew Morton
2005-09-25  6:20               ` Willy Tarreau
2005-09-25  6:32                 ` Andrew Morton
2005-09-25  7:08               ` Vadim Lobanov
2005-09-25  8:03                 ` Willy Tarreau
2005-09-24 17:19       ` Nishanth Aravamudan
2005-09-24 18:25         ` Davide Libenzi
2005-09-24 19:38           ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Willy Tarreau
2005-09-24 19:44             ` [PATCH 1/3] 2.6.14-rc2-mm1: fixes for overflow msec_to_jiffies() Willy Tarreau
2005-09-29  9:43               ` Andrew Morton
2005-09-29 19:41                 ` Willy Tarreau
2005-09-29 19:52                   ` Andrew Morton
2005-09-29 20:55                     ` Willy Tarreau
2005-10-01 17:39                     ` Willy Tarreau
2005-09-24 19:47             ` [PATCH 2/3] 2.6.14-rc2-mm1: fixes for overflow in epoll() Willy Tarreau
2005-09-24 19:52             ` [PATCH 3/3] 2.6.14-rc2-mm1 : fixes for overflow in sys_poll() Willy Tarreau
2005-10-01 20:39               ` Willy Tarreau
2005-09-24 20:08             ` [PATCH 0/3] fixes for overflow in poll(), epoll(), and msec_to_jiffies() Davide Libenzi
2005-09-24 20:21               ` Willy TARREAU
2005-09-25 20:55             ` Nishanth Aravamudan
2005-09-25 22:06               ` Willy Tarreau
2005-09-24 21:25           ` [patch] sys_epoll_wait() timeout saga Vadim Lobanov
2005-09-24 18:30         ` Willy Tarreau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox