From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ulrich Drepper Subject: Re: [take24 0/6] kevent: Generic event handling mechanism. Date: Sun, 19 Nov 2006 16:02:03 -0800 Message-ID: <4560F07B.10608@redhat.com> References: <11630606361046@2ka.mipt.ru> <45564EA5.6020607@redhat.com> <20061113105458.GA8182@2ka.mipt.ru> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: David Miller , Andrew Morton , netdev , Zach Brown , Christoph Hellwig , Chase Venters , Johann Borck , linux-kernel@vger.kernel.org, Jeff Garzik , Alexander Viro Return-path: Received: from mx1.redhat.com ([66.187.233.31]:61910 "EHLO mx1.redhat.com") by vger.kernel.org with ESMTP id S933791AbWKTADr (ORCPT ); Sun, 19 Nov 2006 19:03:47 -0500 To: Evgeniy Polyakov In-Reply-To: <20061113105458.GA8182@2ka.mipt.ru> Sender: netdev-owner@vger.kernel.org List-Id: netdev.vger.kernel.org Evgeniy Polyakov wrote: >> Possible solution: >> >> a) it would be possible to have a "used" flag in each ring buffer en= try. >> That's too expensive, I guess. >> >> b) kevent_wait needs another parameter which specifies the which is = the >> last (i.e., least recently added) entry in the ring buffer. >> Everything between this entry and the current head (in ->kidx) is >> occupied. If multiple threads arrive in kevent_wait the highest = idx >> (with wrap around possibly lowest) is used. >> >> kevent_wait will not try to move more entries into the ring buffe= r >> if ->kidx and the higest index passed in to any kevent_wait call >> is equal (i.e., the ring buffer is full). >> >> There is one issue, though, and that is that a system call is nee= ded >> to signal to the kernel that more entries in the ring buffer are >> processed and that they can be refilled. This goes against the >> kernel filling the ring buffer automatically (see below) >=20 > If thread calls kevent_wait() it means it has processed previous entr= ies,=20 > one can call kevent_wait() with $num parameter as zero, which > means that thread does not want any new events, so nothing will be > copied. This doesn't solve the problem. You could only request new events when= =20 all previously reported events are processed. Plus: how do you report=20 events if the you don't allow get_event pass them on? > Writable ring buffer does not sound too good to me - what if one thre= ad > will overwrite the whole ring buffer so kernel's indexes can be screw= ed? Agreed, there are problems. This is why I suggested the ring buffer ca= n=20 be a structured. Parts of it might be read-only, other parts=20 read/write. I don't necessarily think the 'used' flag is the right way= =2E=20 And front/tail pointer solution seems to be better. > Ring buffer processed not in FIFO order is wrong idea Not necessarily, see my comments about CPU affinity in the previous mai= l. > - ring buffer can > be potentially very big and searching there for the entry, which was > been marked as 'free' by userspace is not a solution at all - userspa= ce > in that case must provide ukevent so fast tree search would be used, > (and although it is already possible) it requires userspace to make > additional syscalls which is not what we want. It is not necessary. I've proposed to only have a fron and tail=20 pointer. The tail pointer is maintained by the application and passed=20 to the kernel explicitly or via shared memory. The kernel maintains th= e=20 front pointer. No tree needed. > As a solution I can create folowing scheme: > there are two syscalls (or one with a switch) which get events and > commits them. >=20 > kevent_wait() becomes a syscall which waits until number of events or > one of them becomes ready and just copies them into ring buffer and > returns. kevent_wait() will fail with special error code when ring > buffer is full. >=20 > kevent_commit() frees requested number of events _from the beginning_= , > i.e. from special index, visible from userspace. Userspace can create > special counters for events (and even put them into read-only ring=20 > buffer overwriting some fields of kevent, especially if we will incre= ase > it's size) and only call kevent_commit() when all events have zero us= age > counter. Right, that's basically the front/tail pointer implementation. That=20 would work. You just have to make sure that the kevent_wait() call=20 takes the current front pointer/index as a parameter. This way if the=20 buffer gets filled between the thread checking the ring buffer (and=20 finding it empty) and the syscall being handled the thread is not suspe= nded. > I disagree that having possibility to have holes in the ring buffer i= s a > good idea at all - it requires much more complex protocol, which will > fill and reuse that holes, and the main disavantge - it requires to > transfer much more information from userspace to kernelspace to free = the > ring entry in the hole - in that case it is already possible just to > call kevent_ctl(KEVENT_REMOVE) and do not wash the brain with new > approach at all. Well, it would require more data transport of we'd use writable shared=20 memory. But I agree, it's far too complicated and might not scale with= =20 growing ring buffer sizes. >> - implementing the kevent_wait syscall the proposed way means we are >> missing out on one possible optimization. The ring buffer is >> currently only filled on kevent_wait calls. I expect that in real= ly >> high traffic situations requests are coming in at a higher rate th= an >> the can be processed. At least for periods of time. If such >> situations it would be nice to not have to call into the kernel at >> all. If the kernel would deliver into the ring buffer on its own >> this would be possible. >=20 > Well, it can be done on behalf of workqueue or dedicated thread which > will bring up appropriate mm context, I think it should be done. It's potentially a huge advantage. > although it means that userspace > can not handle the load it requested, which is a bad sign... I don't understand. What is not supposed to work? There is nothing=20 which cannot work with automatic posting since the get_event() call doe= s=20 nothing but copying the event data over and wake a thread. >> - the kevent_get_event syscall is not needed at all. All reporting >> should be done using a ring buffer. There really is not reason to >> keep two interfaces around which serve the same purpose. Making >> the argument the kevent_get_event is so much easier to use is not >> valid. The exposed interface to access the ring buffer will be ea= sy, >> too. In the OLS paper I more or wait hinted at the interfaces. I >> think they should be like this (names are irrelevant): >=20 > Well, kevent_get_events() _is_ much easier to use. And actually havin= g > only that interface it is possible to implement ring buffer with any > kind or protocol for its controlling - userspace can have a wrapper > which will call kevent_get_events() with pointer which shows to the > place in the shared ring buffer where to place new events, that wrapp= er > can handle essentially any kind of flags/parameters which are suitabl= e > for that ring buffer implementation. That's far too slow. The whole point behind the ring buffer is speed.=20 And emulation would defeat the purpose. > But since we started to implement ring buffer as a additional feature= of > kevent, let's find the way all people will be happy with before remov= ing > something which was proven to work correctly. The get_event interface is basically the userlevel interface the runtim= e=20 (glibc probably) would provide. Programmers don't see the complexity. I'm concerned about the get_event interface holding the kernel=20 implementation back. For instance, automatic filling the ring buffer.=20 This would not be possible if the program is free to mix=20 kevent_get_event and kevent_wait calls freely. If you do away with the= =20 get_event syscall the automatic ring buffer filling is possible and a=20 logical extension. >=20 > The last three are exactly kevent_get_events() with different set of > parameters - it is possible to get events without sleeping, it is > possible to wait until at least something is ready and it is possible= to > sleep for timeout. Exactly. But these interfaces should be implemented at userlevel, not=20 at the syscall level. It's not necessary. The kernel interface should= =20 be kept as small as possible and the get_event syscall is pure duplicat= ion. > They all already imeplemented. Just all above, and it was done severa= l > months ago already. No need to reinvent what is already there. > Even if we will decide to remove kevent_get_events() in favour of rin= g > buffer-only implementation, winting-for-event syscall will be > essentially kevent_get_events() without pointer to the place where to > put events. Right, but this limitation of the interface is important. It means the= =20 interface of the kernel is smaller: fewer possibilities for problems an= d=20 fewer constraints if in future something should be changed (and smaller= =20 kernel). > I agree that having special syscall to initialize kevent is a good id= ea, > and initial kevent implementation had it, but it was removed due to A= PI > cleanup work by Cristoph Hellwing. Well, he is wrong. If, for instance, init or any of the programs which= =20 start first wants to use the syscall it couldn't because /dev isn't=20 mounted. The program might use libraries and therefore not have any=20 influence on whether the kevent stuff is used or not. Yes, the /dev interface is useful for some/many other kernel interfaces= =2E=20 But this is a core interface. For the same reason epoll_create is a=20 syscall. > Do you have _any_ kind of benchmarks with epoll() which would show th= at > it is feasible? ukevent is one cache line (well, 2 cache lines on old > CPUs), which can be setup way too far away from the time when it is > ready, and CPU which origianlly set that up can be busy, so we will l= ose > performance waiting until CPU becomes free instead of calling other > thread on different CPU. If the period between the generation of the event (e.g., incoming=20 network traffic or sent data) and the delivery of the event by waking a= =20 thread is too long, it makes not too much sense. But if the L2 cache=20 hasn't hasn't been flushed it might be a big advantage. I think it's reasonable to only have the last queued entry for a CPU=20 handled special. And note, this is only ever a hint. If an event entr= y=20 was created by the kernel in one CPU but none of the threads which wait= =20 to be waken is on that CPU, nothing has to be done. No, I don't have a benchmark. But it is likely quite easily possible t= o=20 create a synthetic benachmark. Maybe with pipes. > It is possible to specify CPU id in kevent (not in ukevent, i.e. not > in shared by userspace structure, but in it's kernel representation), > and then check if currently active CPU is the same or not, but what i= f > it is not the same CPU? Nothing special. It's up to the userlevel wrapper code. The CPU numbe= r=20 would only be a hint. > Entry order is important, since application can > take advantage of synchronization, so idea to skip some entries is ba= d. That's something the application should be make a call about. It's not= =20 always (or even mostly) the case that the ordering of the notification=20 is important. Furthermore, this would also require the kernel to=20 enforce an ordering. This is expensive on SMP machines. A locally=20 generated event (i.e., source and the thread reporting the event) can b= e=20 delivered faster than an event created on another CPU. > It is management task - kernel should not even know about someone has > died and can not process events it requested. But the kernel has to be involed. > Userspace can open a control pipe (and setup a kevent handler for it)= =20 > and glibc will write there a byte thus awakening some other thread. > It can be done in userspace and should be done in userspace. That's invasive. The problem is that no userlevel interface should hav= e=20 to implicitly keep file descriptors open. This would mean the=20 application would be influenced since suddenly a file descriptor is not= =20 available anymore. Yes, applications shouldn't care but they=20 unfortunately sometimes do. > Will we discuss it for death? >=20 > Kevent does not need to have absolute timeout. Of course it does. Just because you don't see a need for it for your=20 applications right now it doesn't mean it's not a valid use. > Because timeout specified there is always related to the start of > syscall, since it is a timeout which specifies maximum time frame > syscall can live. That's your current implementation. There is absolutely no reason=20 whatsoever why this couldn't be changed. > I created kevent_signal notifications - it allows user to setup any s= et > of interested signals before call to kevent_get_events() and friends. >=20 > No need to solve a problem with operation way when there is tactical = and > strategical ones Of course there is a need and I explained it before. Getting signal=20 notifications is in no way the same as changing the signal mask=20 temporarily. You cannot correctly emulate the case where you want to=20 block a signal while in the call as reenable it afterwards. Receiving=20 the signal as an event and then artificially raising it is not the same= =2E=20 Especially timing-wise, the signal kevent might not be seen long afte= r=20 the syscall returns because other entries are worked on first. The opposite case is equally impossible to emulate: unblocking a signal= =20 just for the duration of the syscall. These are all possible and used=20 cases. >> - the KEVENT_REQ_WAKEUP_ONE functionality is good and needed. But I >> would reverse the default. I cannot see many places where you wan= t >> all threads to be woken. Introduce KEVENT_REQ_WAKEUP_ALL instead. >=20 > I.e. to wake up only first thread always and in addon those threads > which have specified flag set? Ok, will put into todo foer the next > release. It's a flag for an event. So the threads won't have the flag set. If=20 an event is delivered with the flag set, wake all threads. Otherwise=20 just one. >> - there is really no reason to invent yet another timer implementati= on. >> We have the POSIX timers which are feature rich and nicely >> implemented. All that is needed is to implement SIGEV_KEVENT as a >> notification mechanism. The timer is registered as part of the >> timer_create() syscalls. >=20 > Feel free to add any interface you like - it is as simple as call for > kevent_user_add_ukevent() in userspace. No, that's not what I mean. There is no need for the special=20 timer-related part of your patch. Instead the existing POSIX timer=20 syscalls should be modified to handle SIGEV_KEVENT notification. Again= ,=20 keep the interface as small as possible. Plus, the POSIX timer=20 interface is very flexible. You don't want to duplicate all that=20 functionality. > And I almost silently stay behind with the fact that it is possbile t= o > implement _all_ above ring buffer things in userspace with > kevent_get_events() and this functionality is there for almost a year= :) Again, this defeats the purpose completely. The ring buffer is the=20 faster interface, especially when coupled with asynchronous filling of=20 ring buffer (i.e., without a syscal). > Let's solve problem in order of theirs appearance - what do you think > about above interface for ring buffer? Looks better, yes. --=20 =E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro S= t =E2=9E=A7 Mountain View, CA =E2=9D=96