From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nicolas Williams
Date: Wed, 29 Jul 2009 14:22:30 -0500
Subject: [Lustre-devel] SMP Scalability, MDS, reducing cpu pingpong
In-Reply-To: <002001ca1062$7b526fc0$71f74f40$@com>
References: <7580C3C1-7634-47C8-827B-C93157C1301A@Sun.COM>
 <002001ca1062$7b526fc0$71f74f40$@com>
Message-ID: <20090729192230.GU1020@Sun.COM>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: lustre-devel@lists.lustre.org

On Wed, Jul 29, 2009 at 04:37:29PM +0100, Eric Barton wrote:
> > Also on lustre front - something I plan to tackle, though not yet
> > completely sure how: Lustre has a concept of reserving one thread for
> > difficult replies handling + one thread for high priority messages
> > handling (if enabled). In SMP scalability branch that becomes 2x
> > num_cpus reserved threads potentially per service since naturally
> > rep_ack reply or high prio message might arrive on any cpu separately
> > now (and message queues are per cpu) - seems like huge overkill to
> > me. I see that there is a handle reply separate threads in HEAD now,
> > so perhaps this could be greatly simplified by proper usage of those.
> > the high prio seems to be harder to improve, though.
>
> These threads are required in case all normal service threads are
> blocking. I don't suppose this can be a performance critical case, so
> violating CPU affinity for the sake of deadlock avoidance seems OK.
> However is 1 extra thread per CPU such a big deal? We'll have
> 10s-100s of them in any case.

Probably not.

You could have a single thread per-CPU if everything was written in an
async I/O, continuation-passing style (CPS), blocking only in an event
loop per-CPU. That'd reduce thread context switches, but it'd increase
the amount of context being saved and restored as that one thread
services each event/event completion. In other words, you'd still have
context switches!
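To make the per-CPU event loop idea concrete, here is a minimal
user-space sketch in C. It is not Lustre or kernel code: every name in
it (struct request, struct event_loop, loop_post, loop_run, the step_*
functions, cps_demo) is hypothetical, and "event completion" is
simulated by re-queueing the request right away. What it tries to show
is that each request's state has to live in an explicit structure that
the loop picks up again before every step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; none of this is a Lustre or Linux API. */

struct request;
typedef void (*continuation_fn)(struct request *req);

struct request {
    int id;
    int state;              /* data a thread would have kept on its stack */
    continuation_fn next;   /* step to run when the pending event completes */
};

#define QUEUE_CAP 16

struct event_loop {
    struct request *ready[QUEUE_CAP];   /* ring buffer of runnable requests */
    size_t head, tail;
    int steps_run;                      /* continuations invoked so far */
};

static void loop_post(struct event_loop *loop, struct request *req)
{
    assert(loop->tail - loop->head < QUEUE_CAP);
    loop->ready[loop->tail++ % QUEUE_CAP] = req;
}

/* Each step does a little work, then names its own continuation. */
static void step_reply(struct request *req)
{
    req->state += 1;
    req->next = NULL;                   /* request is finished */
}

static void step_process(struct request *req)
{
    req->state *= 2;
    req->next = step_reply;
}

static void step_read(struct request *req)
{
    req->state = 4096;
    req->next = step_process;
}

static int loop_run(struct event_loop *loop)
{
    while (loop->head != loop->tail) {
        struct request *req = loop->ready[loop->head++ % QUEUE_CAP];

        /* Picking up req here is the explicit "context restore". */
        req->next(req);
        loop->steps_run++;

        /* Assumption: the next event completes immediately. */
        if (req->next != NULL)
            loop_post(loop, req);
    }
    return loop->steps_run;
}

/* Runs two requests through three steps each on one loop. */
int cps_demo(void)
{
    struct event_loop loop = {0};
    struct request a = { .id = 1, .next = step_read };
    struct request b = { .id = 2, .next = step_read };

    loop_post(&loop, &a);
    loop_post(&loop, &b);
    return loop_run(&loop);
}
```

Calling cps_demo() interleaves the two requests and returns the total
number of steps run. Even in this toy, what would be one straight-line
function in the thread model is split into three callbacks plus a
hand-maintained state structure.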
Also, the code would get insanely complicated -- CPS is for compilers,
not humans (and we don't have Scheme-like continuations in C or in the
Linux kernel; if we did, they'd add quite a bit of run-time overhead
too). And kernels are not usually written this way either, so it may
not even be feasible. The thread model is just easier to code to.

> > Does anybody else have any extra thoughts for lustre side
> > improvements we can get off this?
>
> I think we need measurements to prove/disprove whether object affinity
> trumps client affinity.

If we have secure PTLRPC in the picture then client affinity is more
likely to trump object affinity: between keys, key schedules, and
sequence number windows, the per-client state may add up to enough to
tip the balance. (Of course, we could have multiple streams per-client,
so that a client could be serviced by multiple server CPUs.)

Nico
--