From: Peter Zijlstra <peterz@infradead.org>
To: Florian Weimer <fweimer@redhat.com>
Cc: "André Almeida" <andrealmeid@igalia.com>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Ingo Molnar" <mingo@redhat.com>,
"Darren Hart" <dvhart@infradead.org>,
"Davidlohr Bueso" <dave@stgolabs.net>,
"Arnd Bergmann" <arnd@arndb.de>,
sonicadvance1@gmail.com, linux-kernel@vger.kernel.org,
kernel-dev@igalia.com, linux-api@vger.kernel.org,
"Nathan Chancellor" <nathan@kernel.org>,
"Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
Subject: Re: [PATCH v2 0/3] futex: Create set_robust_list2
Date: Tue, 5 Nov 2024 13:18:37 +0100 [thread overview]
Message-ID: <20241105121837.GI24862@noisy.programming.kicks-ass.net> (raw)
In-Reply-To: <87jzdjxjj8.fsf@oldenburg3.str.redhat.com>
On Mon, Nov 04, 2024 at 01:36:43PM +0100, Florian Weimer wrote:
> * Peter Zijlstra:
>
> > On Sat, Nov 02, 2024 at 10:58:42PM +0100, Florian Weimer wrote:
> >
> >> QEMU hints towards further problems (in linux-user/syscall.c):
> >>
> >> case TARGET_NR_set_robust_list:
> >> case TARGET_NR_get_robust_list:
> >> /* The ABI for supporting robust futexes has userspace pass
> >> * the kernel a pointer to a linked list which is updated by
> >> * userspace after the syscall; the list is walked by the kernel
> >> * when the thread exits. Since the linked list in QEMU guest
> >> * memory isn't a valid linked list for the host and we have
> >> * no way to reliably intercept the thread-death event, we can't
> >> * support these. Silently return ENOSYS so that guest userspace
> >> * falls back to a non-robust futex implementation (which should
> >> * be OK except in the corner case of the guest crashing while
> >> * holding a mutex that is shared with another process via
> >> * shared memory).
> >> */
> >> return -TARGET_ENOSYS;
> >
> > I don't think we can sanely fix that. Can't QEMU track the robust thing
> > itself and use waitpid() to discover the thread is gone and fudge things
> > from there?
>
> There are race conditions with munmap, I think, and they probably get a
> lot of worse if QEMU does that.
>
> See Rich Felker's bug report:
>
> | The corruption is performed by the kernel when it walks the robust
> | list. The basic situation is the same as in PR #13690, except that
> | here there's actually a potential write to the memory rather than just
> | a read.
> |
> | The sequence of events leading to corruption goes like this:
> |
> | 1. Thread A unlocks the process-shared, robust mutex and is preempted
> | after the mutex is removed from the robust list and atomically
> | unlocked, but before it's removed from the list_op_pending field of
> | the robust list header.
> |
> | 2. Thread B locks the mutex, and, knowing by program logic that it's
> | the last user of the mutex, unlocks and unmaps it, allocates/maps
> | something else that gets assigned the same address as the shared mutex
> | mapping, and then exits.
> |
> | 3. The kernel destroys the process, which involves walking each
> | thread's robust list and processing each thread's list_op_pending
> | field of the robust list header. Since thread A has a list_op_pending
> | pointing at the address previously occupied by the mutex, the kernel
> | obliviously "unlocks the mutex" by writing a 0 to the address and
> | futex-waking it. However, the kernel has instead overwritten part of
> | whatever mapping thread A created. If this is private memory it
> | (probably) doesn't matter since the process is ending anyway (but are
> | there race conditions where this can be seen?). If this is shared
> | memory or a shared file mapping, however, the kernel corrupts it.
> |
> | I suspect the race is difficult to hit since thread A has to get
> | preempted at exactly the wrong time AND thread B has to do a fair
> | amount of work without thread A getting scheduled again. So I'm not
> | sure how much luck we'd have getting a test case.
>
>
> <https://sourceware.org/bugzilla/show_bug.cgi?id=14485#c3>
So I've only managed to conjure up two horrible solutions for this:
- put the robust futex operations under user-space RCU, and mandate a
matching synchronize_rcu() before any munmap() calls.
- add a robust-barrier syscall that waits until all list_op_pending are
either NULL or changed since invocation. And mandate this call before
munmap().
Neither are particularly pretty I admit, but at least they should work.
But doing this and mandating the alignment thing should at least make
this qemu thing workable, no?
> We also have a silent unlocking failure because userspace does not know
> about ROBUST_LIST_LIMIT:
>
> Bug 19089 - Robust mutexes do not take ROBUST_LIST_LIMIT into account
> <https://sourceware.org/bugzilla/show_bug.cgi?id=19089>
>
> (I think we may have discussed this one before, and you may have
> suggested to just hard-code 2048 in userspace because the constant is
> not expected to change.)
>
> So the in-mutex linked list has quite a few problems even outside of
> emulation. 8-(
It's futex, ofcourse its a pain in the arse :-)
And yeah, no better ideas on that limit for now...
next prev parent reply other threads:[~2024-11-05 12:18 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-01 16:21 [PATCH v2 0/3] futex: Create set_robust_list2 André Almeida
2024-11-01 16:21 ` [PATCH v2 1/3] futex: Use explicit sizes for compat_exit_robust_list André Almeida
2024-11-02 5:44 ` kernel test robot
2024-11-02 14:57 ` kernel test robot
2024-11-01 16:21 ` [PATCH v2 2/3] futex: Create set_robust_list2 André Almeida
2024-11-02 16:40 ` kernel test robot
2024-11-04 11:22 ` Peter Zijlstra
2024-11-04 21:55 ` André Almeida
2024-11-05 12:10 ` Peter Zijlstra
2024-11-01 16:21 ` [PATCH v2 3/3] futex: Wire up set_robust_list2 syscall André Almeida
2024-11-02 5:13 ` kernel test robot
2024-11-02 6:05 ` kernel test robot
2024-11-02 21:58 ` [PATCH v2 0/3] futex: Create set_robust_list2 Florian Weimer
2024-11-04 11:32 ` Peter Zijlstra
2024-11-04 11:56 ` Peter Zijlstra
2024-11-04 12:36 ` Florian Weimer
2024-11-05 12:18 ` Peter Zijlstra [this message]
2024-11-04 21:49 ` André Almeida
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241105121837.GI24862@noisy.programming.kicks-ass.net \
--to=peterz@infradead.org \
--cc=andrealmeid@igalia.com \
--cc=arnd@arndb.de \
--cc=dave@stgolabs.net \
--cc=dvhart@infradead.org \
--cc=fweimer@redhat.com \
--cc=kernel-dev@igalia.com \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=mingo@redhat.com \
--cc=nathan@kernel.org \
--cc=sonicadvance1@gmail.com \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.