public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Anna-Maria Behnsen <anna-maria@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	syzbot+5c54bd3eb218bb595aa9@syzkaller.appspotmail.com,
	Dmitry Vyukov <dvyukov@google.com>,
	Sebastian Siewior <bigeasy@linutronix.de>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Andrei Vagin <avagin@openvz.org>,
	Christian Brauner <brauner@kernel.org>,
	Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>,
	Pavel Tikhomirov <ptikhomirov@virtuozzo.com>,
	Pavel Emelyanov <xemul@openvz.org>
Subject: [RFD] posix-timers: CRIU woes
Date: Tue, 09 May 2023 23:42:39 +0200	[thread overview]
Message-ID: <87ednpyyeo.ffs@tglx> (raw)
In-Reply-To: <875y911xeg.ffs@tglx>

Hi!

This is a summary of several mails so that the CRIU people have the full
picture.

A recent syzbot report made me look at the timer ID management, which
was modified 7 years ago to accomodate CRIU:

    5ed67f05f66c ("posix timers: Allocate timer id per process (v2)")

and introduced that reported issue along with a bogus loop termination
problem. See

    https://lore.kernel.org/lkml/000000000000a3723305f9d98fc3@google.com/
    https://lore.kernel.org/lkml/20230425183312.932345089@linutronix.de

for details.

The intent to make the timer IDs per process is definitely correct, but
the implementation is beyond suboptimal. I really regret that I did not
catch this back then when picking those changes up.

The way it works is that each process keeps a 'next_timer_id' which it
uses to allocate the next timer. That allows CRIU to reconstruct timers
with the same ID by walking the ID space via

  do {
     timer_create(&timer, ...., &id);
     if (id == original_id)
        goto success;
     timer_delete(&timer);
  } while (idspace_not_exhausted());

That works by some definition of works, but it is problematic in two ways:

 1) As the timer ID space is up to INT_MAX, a process which creates and
    deletes timers frequently, can easily move up close to the INT_MAX
    id space over time.

    If such a process is checkpointed and restored, then the above loop
    will run for at least an hour to restore a single timer.

    And no, this is not only a hypothetical issue. There are legitimate
    scenarios where threads are created and the control thread arms a
    posix CPU timer on them. Such threads can be torn down on a regular
    base due to thread pool consolidations. As CRIU does not happen
    every 5 minutes it's not completely unlikely that such a process
    surives quite some time on a particular host and thereby approaches
    the ID space limit.

    Sure we can restrict the ID space to a way smaller number so the
    search wraps around earlier, but what's a sensible limit?

    Though restricting the ID space has its own issue vs. backwards
    compability. A process which created a timer on an older kernel with
    an ID larger than the newer kernels ID limit cannot longer be
    restored on that newer kernel.

    Aside of that it does not solve the other problem this created:

 2) That change created an user space ABI, which means that the kernel
    side has to stick with this next ID search mechanism forever.

    That prevents to get rid of that global lock and hash table by
    sticking an xarray into task::signal which makes a lot of sense in
    terms of cache locality and gets rid of the extra timer list
    management in task::signal. Making this change would be very useful
    to address some other issues in the posix-timer code without
    creating yet more duct tape horrors.

    Such a change obviously has to aim for a dense ID space to keep the
    memory overhead low, but that breaks existing CRIU userspace because
    dense ID space and next ID search does not fit together.

    Next ID search is obviously creating non-recoverable holes in the
    case that timers are deleted afterwards.

    A dense ID space approach can create holes too, but they are
    recoverable and well within the resource limits, because the process
    has to be able to create enough timers in the first place in order
    to release those in the middle.

    With the next ID search brute force recovery is not possible on a
    kernel with dense ID as there is no way to create all intermediate
    timers first before reaching the one at the far end due to resource
    limits.

So because of that half thought out user space ABI we are now up the
regression creek without a paddle, unless CRIU can accomodate to a
different restore mechanism to lift this restriction from the kernel.

Thoughts?

Thanks,

        tglx



  reply	other threads:[~2023-05-09 21:42 UTC|newest]

Thread overview: 122+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-25 18:48 [patch 00/20] posix-timers: Fixes and cleanups Thomas Gleixner
2023-04-25 18:48 ` [patch 01/20] posix-timers: Prevent RT livelock in itimer_delete() Thomas Gleixner
2023-05-04 17:06   ` Frederic Weisbecker
2023-05-04 18:20     ` Thomas Gleixner
2023-05-05  7:57       ` Thomas Gleixner
2023-06-01 19:00         ` [patch v2 " Thomas Gleixner
2023-06-01 20:16           ` [patch v2a " Thomas Gleixner
2023-06-05 10:59             ` Frederic Weisbecker
2023-06-05 15:08             ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50             ` tip-bot2 for Thomas Gleixner
2023-04-25 18:48 ` [patch 02/20] posix-timers: Ensure timer ID search-loop limit is valid Thomas Gleixner
2023-05-05 14:50   ` Frederic Weisbecker
2023-05-05 22:58     ` Thomas Gleixner
2023-05-05 23:36       ` Thomas Gleixner
2023-05-08 21:57         ` Thomas Gleixner
2023-05-09  9:30           ` Thomas Gleixner
2023-05-09 12:50             ` Thomas Gleixner
2023-05-09 21:42               ` Thomas Gleixner [this message]
2023-05-10  4:36                 ` [RFD] posix-timers: CRIU woes Pavel Tikhomirov
2023-05-10  8:30                   ` Thomas Gleixner
2023-05-11  4:12                     ` Pavel Tikhomirov
2023-05-11  7:56                       ` Peter Zijlstra
2023-05-11  9:32                       ` Thomas Gleixner
2023-05-11 10:13                   ` David Laight
2023-05-10  8:16                 ` Andrey Vagin
2023-05-11  3:17                   ` Pavel Tikhomirov
2023-05-11  9:36                     ` Thomas Gleixner
2023-05-11  9:52                       ` Pavel Tikhomirov
2023-05-11 13:42                         ` Thomas Gleixner
2023-05-11 14:54                           ` Pavel Tikhomirov
2023-05-11 15:25                           ` Pavel Tikhomirov
2023-05-12  1:21                       ` Andrey Vagin
2023-05-31 17:38                         ` Thomas Gleixner
2023-05-11  7:49                   ` Cyrill Gorcunov
2023-05-10  0:42               ` [patch 02/20] posix-timers: Ensure timer ID search-loop limit is valid Andrey Vagin
2023-05-09  9:42         ` Frederic Weisbecker
2023-05-09 12:04           ` Thomas Gleixner
2023-05-09 12:38           ` Thomas Gleixner
2023-05-09 14:18             ` Frederic Weisbecker
2023-06-01 18:58               ` [patch v2 " Thomas Gleixner
2023-06-05 14:17                 ` Frederic Weisbecker
2023-06-05 15:08                 ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50                 ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 03/20] posix-timers: Clarify timer_wait_running() comment Thomas Gleixner
2023-05-09  9:50   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 04/20] posix-timers: Cleanup comments about timer ID tracking Thomas Gleixner
2023-05-09  9:58   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 05/20] posix-timers: Add comments about timer lookup Thomas Gleixner
2023-05-09 10:58   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 06/20] posix-timers: Annotate concurrent access to k_itimer::it_signal Thomas Gleixner
2023-05-09 11:04   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] posix-timers: Annotate concurrent access to k_itimer:: It_signal tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 07/20] posix-timers: Set k_itimer::it_signal to NULL on exit() Thomas Gleixner
2023-06-01 10:09   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] posix-timers: Set k_itimer:: It_signal " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 08/20] posix-timers: Remove pointless irqsafe from hash_lock Thomas Gleixner
2023-06-01 10:12   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 09/20] posix-timers: Split release_posix_timers() Thomas Gleixner
2023-06-01 10:25   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 10/20] posix-timers: Document sys_clock_getres() correctly Thomas Gleixner
2023-06-01 10:44   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 11/20] posix-timers: Document common_clock_get() correctly Thomas Gleixner
2023-06-01 11:00   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 12/20] posix-timers: Document sys_clock_getoverrun() Thomas Gleixner
2023-06-01 11:06   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 13/20] posix-timers: Document sys_clock_settime() permissions in place Thomas Gleixner
2023-06-01 11:22   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:50   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 14/20] posix-timers: Document nanosleep() details Thomas Gleixner
2023-06-01 12:30   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:49   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 15/20] posix-timers: Add proper comments in do_timer_create() Thomas Gleixner
2023-06-01 12:43   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:49   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 16/20] posix-timers: Comment SIGEV_THREAD_ID properly Thomas Gleixner
2023-06-01 12:47   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:49   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 17/20] posix-timers: Clarify posix_timer_rearm() comment Thomas Gleixner
2023-06-01 12:52   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-18 20:49   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 18/20] posix-timers: Clarify posix_timer_fn() comments Thomas Gleixner
2023-06-01 13:21   ` Frederic Weisbecker
2023-06-01 18:43     ` Thomas Gleixner
2023-06-01 19:07     ` Thomas Gleixner
2023-06-05 14:26       ` Frederic Weisbecker
2023-06-05 15:08       ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-05 22:17       ` tip-bot2 for Thomas Gleixner
2023-06-18 20:49       ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 19/20] posix-timers: Remove pointless comments Thomas Gleixner
2023-06-01 13:48   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-05 22:17   ` tip-bot2 for Thomas Gleixner
2023-06-18 20:49   ` tip-bot2 for Thomas Gleixner
2023-04-25 18:49 ` [patch 20/20] posix-timers: Polish coding style in a few places Thomas Gleixner
2023-06-01 13:50   ` Frederic Weisbecker
2023-06-05 15:08   ` [tip: timers/core] " tip-bot2 for Thomas Gleixner
2023-06-05 22:17   ` tip-bot2 for Thomas Gleixner
2023-06-18 20:49   ` tip-bot2 for Thomas Gleixner
2023-06-05 14:32 ` [patch 00/20] posix-timers: Fixes and cleanups Frederic Weisbecker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ednpyyeo.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=aleksandr.mikhalitsyn@canonical.com \
    --cc=anna-maria@linutronix.de \
    --cc=avagin@openvz.org \
    --cc=bigeasy@linutronix.de \
    --cc=brauner@kernel.org \
    --cc=dvyukov@google.com \
    --cc=frederic@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mtk.manpages@gmail.com \
    --cc=peterz@infradead.org \
    --cc=ptikhomirov@virtuozzo.com \
    --cc=syzbot+5c54bd3eb218bb595aa9@syzkaller.appspotmail.com \
    --cc=xemul@openvz.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox