From: Eric Dumazet
Subject: Re: [take25 1/6] kevent: Description.
Date: Fri, 24 Nov 2006 01:48:32 +0100
Message-ID: <45664160.6060504@cosmosbay.com>
References: <11641265982190@2ka.mipt.ru> <456621AC.7000009@redhat.com> <45662522.9090101@garzik.org> <45663298.7000108@redhat.com>
In-Reply-To: <45663298.7000108@redhat.com>
To: Ulrich Drepper
Cc: Jeff Garzik, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel@vger.kernel.org

Ulrich Drepper wrote:
>
> You create worker threads to handle the work for the entire program.  Look
> at something like a web server.  When creating several queues, how do
> you distribute all the connections to the different queues?  To ensure
> every connection is handled as quickly as possible you stuff them all in
> the same queue and then have all threads use this one queue.  Whenever an
> event is posted a thread is woken.  _One_ thread.  If two events are
> posted, two threads are woken.  In this situation we have a few atomic
> ops at userlevel to make sure that the two threads don't pick the same
> event but that's all there is wrt "fighting".
>
> The alternative is the sorry state we have now.  In nscd, for instance,
> we have one single thread waiting for incoming connections and it then
> has to wake up a worker thread to handle the processing.  This is done
> because we cannot "park" all threads in the accept() call since when a
> new connection is announced _all_ the threads are woken.
> With the new event handling this wouldn't be the case, one thread only
> is woken and we don't have to wake worker threads.  All threads can be
> worker threads.

Having one specialized thread handling the distribution of work to worker
threads is better most of the time. This thread can be a worker thread
itself (to avoid context switches), but can decide to wake up 'slave
threads' if it believes it has to (for example if it notices that a *lot*
of requests are pending).

This is because with moderate load, it's better to have only one CPU
running 80% of its time, keeping its cache hot, than to 'distribute' the
work over four CPUs that would each be used 25% of their time, but with a
lot of cache line ping-pongs and poor cache reuse.

If you let 'kevent'/'dumb kernel dispatcher'/'futex'/'whatever' decide to
wake up one thread for each new event, you *may* have lower performance,
because of higher system overhead (system means: system
scheduler/internals, but also bus traffic).

Only the application writer can have a clue about the average use of its
worker threads, and can decide to dynamically adjust parameters if needed
to handle load spikes.

SMP machines are nice, but for many workloads, it's better to avoid
spreading a working set over several CPUs that fight for common resources
(memory).

Back to 'kevent':
-----------------

I think that having a syscall to commit events should not be mandatory. A
syscall is needed only to wait for new events if the ring is empty. But
then maybe we don't even need a new syscall to perform a wait:

We already have nice synchronisation primitives (futex for example).
The user program should be able to update a 'uidx' in user space (using
atomic ops only if multi-threaded). If the ring buffer is empty
(uidx == kidx), it could just use the futex infrastructure and call
FUTEX_WAIT(&kidx, current value = uidx).

I think I already gave my opinion on a ring buffer, but let me just
rephrase it:

One part should be read/write for the application (to be able to change
uidx)
(or the user app just gives the kernel, at init time, the address of a
futex in its vm space)

One part could be read only for the application (but could be read/write:
we don't care if the user application is stupid): the kernel writes its
kidx (or a copy of it) and the events.

For best performance, uidx and kidx should be on different cache lines
(basic isolation of producer / consumer)

When the kernel wants to queue a new event in a ring buffer it can:

See if the user program did consume some events since the last invocation
(the kernel fetches uidx and compares it with its own uidx value: no
syscall needed)
Check if a slot is available in the ring buffer.
Copy the event into the ring buffer, perform a memory barrier, then
increment kidx.
Call futex_wake(&kidx, 1 thread)

The user application is free to have one thread/process or several
threads/processes waiting for new events (or even no thread at all :) )

Eric
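
For illustration only, the uidx/kidx protocol described above could be sketched roughly as follows, in portable C11. This is not kevent code: `struct ring`, `struct event`, `RING_SIZE`, `CACHE_LINE`, `ring_produce()`, `ring_consume()` and `ring_demo()` are all hypothetical names, the 64-byte cache line is an assumption, and the futex calls are only marked in comments so that the sketch stays runnable anywhere. It assumes a single consumer; with several consumer threads the uidx update would need a compare-and-swap.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE  4u   /* hypothetical slot count, must be a power of two */
#define CACHE_LINE 64   /* assumed cache line size */

struct event { uint32_t id; };          /* hypothetical event payload */

struct ring {
    /* Written by the application (consumer) only. */
    alignas(CACHE_LINE) _Atomic uint32_t uidx;
    /* Written by the kernel (producer) only; kept on its own cache line. */
    alignas(CACHE_LINE) _Atomic uint32_t kidx;
    struct event slots[RING_SIZE];
};

/* Producer ("kernel") side: returns false if no slot is available. */
static bool ring_produce(struct ring *r, struct event ev)
{
    uint32_t k = atomic_load_explicit(&r->kidx, memory_order_relaxed);
    /* Fetch uidx to see what the consumer has released: no syscall needed. */
    uint32_t u = atomic_load_explicit(&r->uidx, memory_order_acquire);

    if (k - u == RING_SIZE)
        return false;                   /* ring is full */

    r->slots[k % RING_SIZE] = ev;       /* copy the event first ... */
    /* ... then publish it: the release store acts as the memory barrier. */
    atomic_store_explicit(&r->kidx, k + 1, memory_order_release);
    /* A real producer would now wake one waiter, e.g. futex_wake(&kidx, 1). */
    return true;
}

/* Consumer ("application") side, single consumer assumed. */
static bool ring_consume(struct ring *r, struct event *out)
{
    uint32_t u = atomic_load_explicit(&r->uidx, memory_order_relaxed);
    uint32_t k = atomic_load_explicit(&r->kidx, memory_order_acquire);

    if (u == k)
        return false;   /* empty: a real consumer would FUTEX_WAIT(&kidx, u) */

    *out = r->slots[u % RING_SIZE];
    /* Release the slot back to the producer. */
    atomic_store_explicit(&r->uidx, u + 1, memory_order_release);
    return true;
}

/* Single-threaded walk-through; returns the number of events consumed. */
static int ring_demo(void)
{
    struct ring r;
    struct event ev;
    int consumed = 0;

    atomic_init(&r.uidx, 0);
    atomic_init(&r.kidx, 0);

    /* Try one insert more than the ring can hold: the last one must fail. */
    for (uint32_t i = 0; i < RING_SIZE + 1; i++) {
        bool ok = ring_produce(&r, (struct event){ .id = i });
        assert(ok == (i < RING_SIZE));
    }
    /* Drain the ring; events come out in the order they were queued. */
    while (ring_consume(&r, &ev)) {
        assert(ev.id == (uint32_t)consumed);
        consumed++;
    }
    return consumed;
}
```

The indices are free-running counters, so "full" is kidx - uidx == RING_SIZE and "empty" is uidx == kidx, which is the value the consumer would pass to FUTEX_WAIT. Calling ring_demo() from a main() exercises the full and empty cases.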