public inbox for linux-kernel@vger.kernel.org
From: Petr Tesarik <ptesarik@suse.cz>
To: Thomas Gleixner <tglx@linutronix.de>, Oleg Nesterov <oleg@tv-sign.ru>
Cc: linux-kernel@vger.kernel.org
Subject: CPU POSIX timers livelock
Date: Fri, 02 May 2008 17:05:41 +0200	[thread overview]
Message-ID: <1209740741.15210.49.camel@elijah.suse.cz> (raw)

Hello,

there seems to be a possible livelock in the CPU timers (in fact, we
have already experienced it once). The analysis led to the conclusion
that fixing this properly will be harder than it looks.

Problem Statement

 1. Any process can set ITIMER_PROF as short as 1 timer tick
 2. If the process is CPU-bound (e.g. a number-crunching application),
    the timer will expire with every timer tick
 3. If the process has N threads and each thread runs on its own
    CPU, the timer will expire N times per timer tick

Now, this can become quite interesting on a larger system. For instance,
with 128 CPUs and a (conservative) setting of HZ 300, any process can
effectively ask to be sent a SIGPROF every 26 usec (1/300/128 s). Pretty
fast, but it can get worse if you consider the pitfalls of a NUMA
architecture.

Whenever the timer expires, a new expiration time must be computed for
each thread in the thread group. The values are stored in task_struct of
each thread, and IIRC those are kept local to the CPU where the thread
is running (quite logically). Put another way, walking all threads
means touching remote memory, except for the few task_structs that are
local to the recomputing CPU. In the 128-CPU example, if each node has 2
CPUs (think Altix), only 2 accesses are to the local node and 126
accesses are to remote memory. These are already quite expensive, but it
gets even worse. The timer interrupt is generated locally for each CPU,
so if the process is spread over all of them (see above), all CPUs
recompute the timer (hint: write access), thus constantly invalidating
local caches of the remote memory.

And those computations cannot run in parallel, of course, because
everything is serialized on sighand->siglock (see run_posix_cpu_timers).
So, each time one CPU finishes walking the threads, there's already a
queue of all the other CPUs waiting to do the same again. For this kind
of load, those 26 usec (approx. 41600 CPU cycles on a 1.6 GHz CPU) may
well be insufficient, even more so if somebody else occasionally takes
the write lock on tasklist_lock. It may so happen that the system
spends all its CPU cycles re-computing the profiling timers over and
over again and never makes any progress. Note this is partly caused by
the fact that the time needed to handle the profiling timer is itself
counted toward the profiling time.
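The expensive walk looks roughly like this (a simplified pseudocode
sketch of the check_process_timers() logic of that era, from memory;
field names are abbreviated and the virtual/sched timers are omitted):

```
/* Called from the tick on every CPU running a thread of the group,
 * fully serialized on sighand->siglock. */
spin_lock(&tsk->sighand->siglock);

/* Pass 1: sum the CPU time of all threads -- reads of mostly
 * remote task_structs. */
prof = 0;
for each thread t in tsk's thread group
        prof += t->utime + t->stime;

if (prof >= it_prof_expires)
        /* queue SIGPROF for the group, rearm from the interval */

/* Pass 2: split the remaining time among the live threads and write
 * the per-thread expiry back -- remote writes that invalidate the
 * owning CPU's cache lines. */
left = (it_prof_expires - prof) / nthreads;
for each thread t in tsk's thread group
        t->it_prof_expires = prof_ticks(t) + left;

spin_unlock(&tsk->sighand->siglock);
```

Both passes are O(nthreads) under a single spinlock, and every CPU in
the group re-runs them on its own tick, which is exactly the
convoy described above.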

This simply does not scale for large systems (and I was not even
considering those really large 1024p ones).

Ideas

Any ideas for improvement? Obviously, storing the per-process
expiration time in one place, instead of dividing the interval by the
number of currently running threads, would help here. But it would
spoil the fast path (the common case when there are no timers at all).
Preventing multiple CPUs from re-computing the same values might also
help (although I have no idea how to implement this).
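One possible shape of the first idea, purely hypothetical (the struct
field and function names below are made up, not existing kernel code):
keep a single shared expiry plus a shared accounted-time counter in
signal_struct, so the tick path does one add and one compare instead of
walking the thread list.

```
/* Hypothetical layout -- names invented for illustration. */
struct signal_struct {
        ...
        cputime_t prof_time;      /* total CPU time of the group    */
        cputime_t prof_expires;   /* single shared expiration time  */
};

/* Tick path: one shared add and one compare, no thread walk. */
account_group_tick(struct task_struct *t)
{
        atomic_add(one_tick, &t->signal->prof_time);
        if (t->signal->prof_time >= t->signal->prof_expires)
                /* queue SIGPROF, rearm from it_prof_incr */
}
```

The cost is the extra shared-counter update on every tick even when no
timer is armed (plus cache-line contention on that counter), which is
precisely the fast-path spoilage mentioned above.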

Any more comments (before I start hacking something you'll NAK)?

BTW, did POSIX really mean to allow an arbitrarily small interval for
SIGPROF?

Petr Tesarik

Thread overview: 7+ messages
2008-05-02 15:05 Petr Tesarik [this message]
2008-05-02 15:21 ` CPU POSIX timers livelock Björn Steinbrink
2008-05-02 16:13   ` Petr Tesarik
2008-05-02 16:43     ` Björn Steinbrink
2008-05-02 17:28       ` Oleg Nesterov
2008-05-02 17:10     ` Frank Mayhar
2008-05-17  0:16   ` CPU POSIX timers livelock (with patch for review) Frank Mayhar
