From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jamie Lokier Subject: Re: Bug: epoll_wait timeout is shorter than requested Date: Mon, 17 Jan 2005 16:48:36 +0000 Message-ID: <20050117164836.GA25815@mail.shareable.org> References: <87651wl32d.fsf@qrnik.zagroda> <20050117114821.GB20152@mail.shareable.org> <87r7kk41gp.fsf@qrnik.zagroda> <20050117143348.GA23427@mail.shareable.org> <87acr8uizp.fsf@qrnik.zagroda> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from mail.shareable.org ([81.29.64.88]:3513 "EHLO mail.shareable.org") by vger.kernel.org with ESMTP id S262258AbVAQQsh (ORCPT ); Mon, 17 Jan 2005 11:48:37 -0500 Received: from mail.shareable.org (localhost [127.0.0.1]) by mail.shareable.org (8.12.8/8.12.8) with ESMTP id j0HGma81026178 for ; Mon, 17 Jan 2005 16:48:36 GMT Received: (from jamie@localhost) by mail.shareable.org (8.12.8/8.12.8/Submit) id j0HGmaMK026176 for linux-fsdevel@vger.kernel.org; Mon, 17 Jan 2005 16:48:36 GMT To: linux-fsdevel@vger.kernel.org Content-Disposition: inline In-Reply-To: <87acr8uizp.fsf@qrnik.zagroda> Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org Marcin 'Qrczak' Kowalczyk wrote: > > If you call select with { 0, 10000 } - that is, 10 milliseconds, then > > you get a delay between 0ms and 10ms on a 100Hz kernel. > > > > That is easy to measure. Just call select() in a loop and observe the > > times. > > I think HZ is now 1000 on x86, so I can't determine experimentally > which 1ms shifts come from the resolution of poll/epoll interface and > which come from the timer frequency. Sure you can. Call select(0,0,0,0,{0,1000}) in a loop, and you'll see it returns every 1ms. Call poll(0,0,1) and you'll see it returns every 2ms. Call epoll_wait (epfd,0,0,1) and you'll see it returns every 1ms. That 2ms minimum wait with poll() is the problem, although it was more of a problem when HZ was 100 (as it still is on some architectures). > What you are saying implies that the amount by which select/epoll may > shorten the timeout depends on the timer frequency. Thus a program > which does not know the timer frequency can't know how much to make > the timeout longer, without risking having to sleep again. Correct. Another way to look at it is that a program which doesn't know the timer frequency can't know how much to make the poll() timeout shorter if it wants shortest non-zero timeouts, or if it is trying to time an event as close as possible to an absolute time. In my experience with a simple interactive X video game, the only portable way to do this is to actually measure the times at which select/poll return and deduce the OS's granularity and late/early rounding using some kind of control system estimator. (Then if you need finer granularity (as the game did) you need a busy loop to add the remaining sub-tick time). > I guess the 1ms here is actually the timer tick and that in case of > epoll rules are the same, except that the timeout is specified in ms. > That is, it is rounded up to a multiple of timer ticks, and then the > actual timeout is between 0 and 1 tick shorter, such that it ends at > some tick. Right? Yes. Look at the kernel code: the only difference between epoll and poll is that poll has "+1" at the end of the equation for the number of ticks to wait. > This means that depending on the fraction of the current tick which > has elapsed, and the fraction of the timeout we want to sleep, the > optimal request may have two possible values. By optimal request I > mean the one which will give us the shortest delay which is not > shorter than the one we actually want. We don't know which request > to give if we don't know the timer frequency. > > For example, assuming 100Hz clock, if we are 3.333ms after a tick > and we want to sleep at least for 124ms, we should give some timeout > between 121ms and 130ms. We will actually sleep 126.667ms, which is > fine. But if we are 6.666ms after a tick is and we want to sleep at > least for 126ms, we should give some timeout between 131ms and 140ms. > This will give us an actual delay of 133.334ms - one tick earlier > would be too short. > > So perhaps my program should indeed do what it currently does. > Sometimes the actual delay will be too short and a separate epoll call > will sleep the remaining tick. But if the program always added one > tick, it would sometimes sleep one tick longer than necessary (and > another problem would be that it does know the timer frequency, > so it can't add one tick). > > I think this gives the optimal behavior wrt. the number of ticks to > sleep, and the only disadvantage is more syscalls in some cases. > > The kernel could make this better because it knows the timer frequency > and it can determine the fraction of the current tick: it would make > the delay longer by 1 tick in case the rest of the current tick is > shorter than the fractional part of the delay (wrt. a tick) reduced by > the unit of resolution of the interface (to allow for 1 unit to mean > "until the next tick"). > > But it would help only in case the tick is longer than the resolution > of the epoll interface, so perhaps it's not worth the effort - I think > today it's usually 1ms, equal to the epoll resolution. With select it > would help more than with epoll, because the select interface has a > finer resolution, but OTOH select is old-fashioned. And it would only > help in saving some syscalls, it would not provide a behavior which > is unimplementable today. > > > The man page for select says the timeout serves as an upper bound. > > Well, because of other processes a timeout can always become longer > than requested. It should be an upper bound in the sense that it will > return earlier if a fd is ready. But it should not return earlier if > fds are not ready, pretending that the timeout expired while in fact > it did not. > > Except that, as you say, it would prevent specifying a timeout > "until the next tick, even if it's shorter than the resolution > of the interface"... > > > By the way, select(), poll() and epoll_wait() all have another bug: if > > the timeout parameter is too large, they'll wait *indefinitely*. They > > call schedule_timeout(MAX_SCHEDULE_TIMEOUT) in that case, which just > > calls schedule() with no timer. > > Oops. But this should be easy to fix: give MAX_SCHEDULE_TIMEOUT-1. Yes. Feel free to submit the patch. > > If select/poll/epoll were implemented by the kernel reading the > > current time accurately before deciding how many ticks to wait for, > > they could satisfy SUSv3's constraint, _and_ allow the useful > > behaviour of application events at the tick rate, _and_ reduce the > > number of system calls in some programs which call select(). > > Right. > > > If you want to change the code in fs/select.c and fs/eventpoll.c to > > do this, please do so; I'll be happy to support the case for it. > > I'm still not sure what the behavior should be. It seems poll and > epoll with their current interfaces can't be made better if the tick > frequency is 1000Hz... Correct - but the tick frequency isn't 1000Hz on all architectures and it isn't likely to change either, because 1000Hz is too fast for slower CPUs such as in routers and PDAs. > > By the way, the most logically useful interface would take an > > *absolute* end time, in any of the forms that the POSIX timer code > > allows. > > Yes! I actually have an absolute time, and compute a timeout from it. Nearly every program does. > Even if a user of my language specifies a relative time, I convert it > to absolute time first. Then it's converted to relative time in order > to pass the timeout to epoll/poll/select, and then the kernel probably > converts it to absolute time again. Quite. Silly interface, isn't it? :) We even get to waste significant numbers of cycles reading the timer chip every time. 2 microseconds on your system, ~20 microseconds on some others. -- Jamie