From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964973AbYEBPVk (ORCPT ); Fri, 2 May 2008 11:21:40 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1763137AbYEBPVb (ORCPT ); Fri, 2 May 2008 11:21:31 -0400 Received: from mail.gmx.net ([213.165.64.20]:57240 "HELO mail.gmx.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1759860AbYEBPVa (ORCPT ); Fri, 2 May 2008 11:21:30 -0400 X-Authenticated: #5039886 X-Provags-ID: V01U2FsdGVkX18q1mJ3t0EqwhLyGHv26uij24iCX92RecLe/fB8hE C6262FqfGti7jR Date: Fri, 2 May 2008 17:21:26 +0200 From: =?iso-8859-1?Q?Bj=F6rn?= Steinbrink To: Petr Tesarik Cc: Thomas Gleixner , Oleg Nesterov , linux-kernel@vger.kernel.org, Roland McGrath , Frank Mayhar Subject: Re: CPU POSIX timers livelock Message-ID: <20080502152126.GA3130@atjola.homenet> References: <1209740741.15210.49.camel@elijah.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1209740741.15210.49.camel@elijah.suse.cz> User-Agent: Mutt/1.5.17+20080114 (2008-01-14) X-Y-GMX-Trusted: 0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [Added Roland McGrath and Frank Mayhar to Cc:, as this sounds similar enough to what has been discussed here http://lkml.org/lkml/2008/2/6/505] On 2008.05.02 17:05:41 +0200, Petr Tesarik wrote: > Hello, > > there seems to be a possible livelock in the CPU timers (actually, we've > experienced it once already). The analysis lead to the conclusion that > fixing this properly will be a bit harder. > > Problem Statement > > 1. Any process can set ITIMER_PROF as short as 1 timer tick > 2. If the process is CPU-bound (e.g. a number-crunching application), > the timer will expire with every timer tick > 3. If the process has N threads and each thread runs on its own > CPU, the timer will expire N times per timer tick > > Now, this can become quite interesting on a larger system. For instance, > with 128 CPUs and a (conservative) setting of HZ 300, any process can > effectively ask to be sent a SIGPROF every 26 usec (1/300/128 s). Pretty > fast, but it can get worse if you consider the pitfalls of a NUMA > architecture. > > Whenever the timer expires, a new expiration time must be computed for > each thread in the thread group. The values are stored in task_struct of > each thread, and IIRC those are kept local to the CPU where the thread > is running (quite logically). Put in another way, walking all threads > means touching remote memory except for the few task_struct which are > local to the recomputing CPU. In the 128-CPU example, if each node has 2 > CPUs (think Altix), only 2 accesses are to the local node and 126 > accesses are to remote memory. These are already quite expensive, but it > gets even worse. The timer interrupt is generated locally for each CPU, > so if the process is spread over all of them (see above), all CPUs > recompute the timer (hint: write access), thus constantly invalidating > local caches of the remote memory. > > And those computations cannot run in parallel, of course, because > everything is serialized on sighand->siglock (see run_posix_cpu_timers). > So, each time one CPU finishes walking the threads, there's already a > queue of all the other CPUs waiting to do the same again. For this kind > of load, those 26 usec (approx. 41600 CPU cycles on a 1.6 GHz CPU) may > be well insufficient. Even more so, if somebody else occasionally takes > write lock on tasklist_lock. It may so happen that the system spends all > CPU cycles re-computing the profiling timers over and over again and > never make any progress. Note this is partly caused by the fact that the > time needed to handle the profiling timer is itself also counted to the > profiling time. > > This simply does not scale for large systems (and I was not even > considering those really large 1024p ones). > > Ideas > > Any ideas for improvement? Obviously, storing per-process expiration > times in one place instead of dividing the interval by the number of > currently running threads would help here. But it would spoil the fast > path (when there are no timers at all). Avoiding that multiple threads > re-compute the same values might also help (although I've got no idea > how to implement this). > > Any more comments (before I start hacking something you'll NAK)? > > BTW did POSIX really mean to allow an indefinitely small interval for > the SIGPROF? > > Petr Tesarik