Prevent inconsistent CPU state after sequence of dlclose/dlopen

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Prevent inconsistent CPU state after sequence of dlclose/dlopen
@ 2025-01-10 15:55 Mathieu Desnoyers
  2025-01-10 16:47 ` Adhemerval Zanella Netto
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2025-01-10 15:55 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, carlos@redhat.com, Mark Rutland, Peter Zijlstra,
	linux-kernel, x86, paulmck, Michael Jeanson

Hi,

I was discussing with Mark Rutland recently, and he pointed out that a
sequence of dlclose/dlopen mapping new code at the same addresses in
multithreaded environments is an issue on ARM, and possibly on Intel/AMD
with the newer TLB broadcast maintenance.

I maintain the membarrier(2) system call, which provides a
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
purpose. It's been there since Linux 4.16. It can be configured
out (CONFIG_MEMBARRIER=n), but it's enabled by default.

Calling this after dlclose() in glibc would prevent this issue.

Is it handled in some other way, or should we open a bugzilla
entry to track this ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 15:55 Prevent inconsistent CPU state after sequence of dlclose/dlopen Mathieu Desnoyers
@ 2025-01-10 16:47 ` Adhemerval Zanella Netto
  2025-01-15 20:16   ` Mathieu Desnoyers
  2025-01-10 16:54 ` Peter Zijlstra
  2025-01-10 17:04 ` Florian Weimer
  2 siblings, 1 reply; 18+ messages in thread
From: Adhemerval Zanella Netto @ 2025-01-10 16:47 UTC (permalink / raw)
  To: Mathieu Desnoyers, libc-alpha
  Cc: Florian Weimer, carlos@redhat.com, Mark Rutland, Peter Zijlstra,
	linux-kernel, x86, paulmck, Michael Jeanson



On 10/01/25 12:55, Mathieu Desnoyers wrote:
> Hi,
> 
> I was discussing with Mark Rutland recently, and he pointed out that a
> sequence of dlclose/dlopen mapping new code at the same addresses in
> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
> with the newer TLB broadcast maintenance.
> 
> I maintain the membarrier(2) system call, which provides a
> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
> purpose. It's been there since Linux 4.16. It can be configured
> out (CONFIG_MEMBARRIER=n), but it's enabled by default.
> 
> Calling this after dlclose() in glibc would prevent this issue.
> 
> Is it handled in some other way, or should we open a bugzilla
> entry to track this ?

Yes please, it would be helpful if you can add some information on 
what kind of hardware and kernel version this is an issue.  

Also, could you add some detail of the issue and why kernel itself does
not or can not guarantee memory consistent after the mmap call?

Is is because this would be an extra non-required overhead on
mmap that userland should handle? 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 16:47 ` Adhemerval Zanella Netto
@ 2025-01-15 20:16   ` Mathieu Desnoyers
  0 siblings, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2025-01-15 20:16 UTC (permalink / raw)
  To: Adhemerval Zanella Netto, libc-alpha
  Cc: Florian Weimer, carlos@redhat.com, Mark Rutland, Peter Zijlstra,
	linux-kernel, x86, paulmck, Michael Jeanson

On 2025-01-10 11:47, Adhemerval Zanella Netto wrote:
> 
> 
> On 10/01/25 12:55, Mathieu Desnoyers wrote:
>> Hi,
>>
>> I was discussing with Mark Rutland recently, and he pointed out that a
>> sequence of dlclose/dlopen mapping new code at the same addresses in
>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>> with the newer TLB broadcast maintenance.
>>
>> I maintain the membarrier(2) system call, which provides a
>> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
>> purpose. It's been there since Linux 4.16. It can be configured
>> out (CONFIG_MEMBARRIER=n), but it's enabled by default.
>>
>> Calling this after dlclose() in glibc would prevent this issue.
>>
>> Is it handled in some other way, or should we open a bugzilla
>> entry to track this ?
> 
> Yes please, it would be helpful if you can add some information on
> what kind of hardware and kernel version this is an issue.

Done:

https://sourceware.org/bugzilla/show_bug.cgi?id=32563

> 
> Also, could you add some detail of the issue and why kernel itself does
> not or can not guarantee memory consistent after the mmap call?

I've added a comment detailing this.

> 
> Is is because this would be an extra non-required overhead on
> mmap that userland should handle?

Yes, overhead is the culprit there, although it could be manageable
if we target this kind of extra sync-core operations on specific
sequences of mmap/munmap/mprotect with the PROT_EXEC flag.

I've documented a possible approach in the bugzilla entry.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 15:55 Prevent inconsistent CPU state after sequence of dlclose/dlopen Mathieu Desnoyers
  2025-01-10 16:47 ` Adhemerval Zanella Netto
@ 2025-01-10 16:54 ` Peter Zijlstra
  2025-01-10 17:02   ` Mathieu Desnoyers
  2025-01-10 17:04 ` Florian Weimer
  2 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2025-01-10 16:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: libc-alpha, Florian Weimer, carlos@redhat.com, Mark Rutland,
	linux-kernel, x86, paulmck, Michael Jeanson

On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
> Hi,
> 
> I was discussing with Mark Rutland recently, and he pointed out that a
> sequence of dlclose/dlopen mapping new code at the same addresses in
> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
> with the newer TLB broadcast maintenance.

What is the exact race? Should not munmap() invalidate the TLBs before
it allows overlapping mmap() to complete?

Any concurrent access after munmap() / before mmap() completes is UB
anyway, no?

> I maintain the membarrier(2) system call, which provides a
> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
> purpose. It's been there since Linux 4.16. It can be configured
> out (CONFIG_MEMBARRIER=n), but it's enabled by default.
> 
> Calling this after dlclose() in glibc would prevent this issue.
> 
> Is it handled in some other way, or should we open a bugzilla
> entry to track this ?

The problem is that the membarrier() call has significant cost, and is
only really needed if dlopen() is called right after (in the same
location).

Unconditionally adding that barrier, just in case, might regress things,
no?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 16:54 ` Peter Zijlstra
@ 2025-01-10 17:02   ` Mathieu Desnoyers
  2025-01-10 17:10     ` Florian Weimer
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2025-01-10 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: libc-alpha, Florian Weimer, carlos@redhat.com, Mark Rutland,
	linux-kernel, x86, paulmck, Michael Jeanson

On 2025-01-10 11:54, Peter Zijlstra wrote:
> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
>> Hi,
>>
>> I was discussing with Mark Rutland recently, and he pointed out that a
>> sequence of dlclose/dlopen mapping new code at the same addresses in
>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>> with the newer TLB broadcast maintenance.
> 
> What is the exact race? Should not munmap() invalidate the TLBs before
> it allows overlapping mmap() to complete?

The race Mark mentioned (on ARM) is AFAIU the following scenario:

CPU 0                     CPU 1

- dlopen()
   - mmap PROT_EXEC @addr
                           - fetch insn @addr, CPU state expects unchanged insn.
                           - execute unrelated code
- dlclose(addr)
   - munmap @addr
- dlopen()
   - mmap PROT_EXEC @addr
                           - fetch new insn @addr. Incoherent CPU state.

> 
> Any concurrent access after munmap() / before mmap() completes is UB
> anyway, no?

The problematic access happens after the second mmap. The issue is
stale CPU state.

> 
>> I maintain the membarrier(2) system call, which provides a
>> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
>> purpose. It's been there since Linux 4.16. It can be configured
>> out (CONFIG_MEMBARRIER=n), but it's enabled by default.
>>
>> Calling this after dlclose() in glibc would prevent this issue.
>>
>> Is it handled in some other way, or should we open a bugzilla
>> entry to track this ?
> 
> The problem is that the membarrier() call has significant cost, and is
> only really needed if dlopen() is called right after (in the same
> location).

Or if it has any overlapping executable range.

> 
> Unconditionally adding that barrier, just in case, might regress things,
> no?

Or perhaps we could add this barrier within mprotect(2) and munmap(2) in the
following cases:

- mprotect removes PROT_EXEC from a mapping,
- munmap unmaps a PROT_EXEC mapping.

Else userspace has to explicitly invoke membarrier sync-core from dlclose.

Thoughts ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:02   ` Mathieu Desnoyers
@ 2025-01-10 17:10     ` Florian Weimer
  2025-01-10 17:14       ` Adhemerval Zanella Netto
  2025-01-10 17:15       ` Mathieu Desnoyers
  2025-01-10 17:11     ` Peter Zijlstra
  2025-01-10 17:12     ` Adhemerval Zanella Netto
  2 siblings, 2 replies; 18+ messages in thread
From: Florian Weimer @ 2025-01-10 17:10 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Peter Zijlstra, libc-alpha

* Mathieu Desnoyers:

> On 2025-01-10 11:54, Peter Zijlstra wrote:
>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
>>> Hi,
>>>
>>> I was discussing with Mark Rutland recently, and he pointed out that a
>>> sequence of dlclose/dlopen mapping new code at the same addresses in
>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>>> with the newer TLB broadcast maintenance.
>> What is the exact race? Should not munmap() invalidate the TLBs
>> before
>> it allows overlapping mmap() to complete?
>
> The race Mark mentioned (on ARM) is AFAIU the following scenario:
>
> CPU 0                     CPU 1
>
> - dlopen()
>   - mmap PROT_EXEC @addr
>                           - fetch insn @addr, CPU state expects unchanged insn.
>                           - execute unrelated code
> - dlclose(addr)
>   - munmap @addr
> - dlopen()
>   - mmap PROT_EXEC @addr
>                           - fetch new insn @addr. Incoherent CPU state.

Unmapping an object while code is executing in it is undefined.

We have a problem with things like pthread_atfork handlers.  We can't
use locking there because fork handlers are expected to perform ample
locking themselves, and an extra lock around them would run into lock
ordering issues.  (We tried for unrelated reasons and saw deadlocks in
applications.)

What we can do is bump a reference counter while we run a pthread_atfork
callback (we already associate them with DSOs) and skip the munmap part
in dlclose if the counter is not zero.  We can complete the unmapping
after the fork handler returns (maybe in the parent only).

There might be other callbacks besides fork handlers that have this
problem.  A similar treatment is possible for some of them, hopefully
all of them in glibc.  We cannot cover things like std::shared_ptr
destructor calls, though.  But adding more barriers won't fix those,
either.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:10     ` Florian Weimer
@ 2025-01-10 17:14       ` Adhemerval Zanella Netto
  2025-01-10 17:15       ` Mathieu Desnoyers
  1 sibling, 0 replies; 18+ messages in thread
From: Adhemerval Zanella Netto @ 2025-01-10 17:14 UTC (permalink / raw)
  To: Florian Weimer, Mathieu Desnoyers
  Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com,
	Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson



On 10/01/25 14:10, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>> On 2025-01-10 11:54, Peter Zijlstra wrote:
>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
>>>> Hi,
>>>>
>>>> I was discussing with Mark Rutland recently, and he pointed out that a
>>>> sequence of dlclose/dlopen mapping new code at the same addresses in
>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>>>> with the newer TLB broadcast maintenance.
>>> What is the exact race? Should not munmap() invalidate the TLBs
>>> before
>>> it allows overlapping mmap() to complete?
>>
>> The race Mark mentioned (on ARM) is AFAIU the following scenario:
>>
>> CPU 0                     CPU 1
>>
>> - dlopen()
>>   - mmap PROT_EXEC @addr
>>                           - fetch insn @addr, CPU state expects unchanged insn.
>>                           - execute unrelated code
>> - dlclose(addr)
>>   - munmap @addr
>> - dlopen()
>>   - mmap PROT_EXEC @addr
>>                           - fetch new insn @addr. Incoherent CPU state.
> 
> Unmapping an object while code is executing in it is undefined.
> 
> We have a problem with things like pthread_atfork handlers.  We can't
> use locking there because fork handlers are expected to perform ample
> locking themselves, and an extra lock around them would run into lock
> ordering issues.  (We tried for unrelated reasons and saw deadlocks in
> applications.)
> 
> What we can do is bump a reference counter while we run a pthread_atfork
> callback (we already associate them with DSOs) and skip the munmap part
> in dlclose if the counter is not zero.  We can complete the unmapping
> after the fork handler returns (maybe in the parent only).

We can also make dlclose a no-op (like some runtimes do), although this
has other implications.

> 
> There might be other callbacks besides fork handlers that have this
> problem.  A similar treatment is possible for some of them, hopefully
> all of them in glibc.  We cannot cover things like std::shared_ptr
> destructor calls, though.  But adding more barriers won't fix those,
> either.
> 
> Thanks,
> Florian
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:10     ` Florian Weimer
  2025-01-10 17:14       ` Adhemerval Zanella Netto
@ 2025-01-10 17:15       ` Mathieu Desnoyers
  2025-01-10 17:24         ` Adhemerval Zanella Netto
  2025-01-10 17:46         ` Florian Weimer
  1 sibling, 2 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2025-01-10 17:15 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com,
	Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson

On 2025-01-10 12:10, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>> On 2025-01-10 11:54, Peter Zijlstra wrote:
>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
>>>> Hi,
>>>>
>>>> I was discussing with Mark Rutland recently, and he pointed out that a
>>>> sequence of dlclose/dlopen mapping new code at the same addresses in
>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>>>> with the newer TLB broadcast maintenance.
>>> What is the exact race? Should not munmap() invalidate the TLBs
>>> before
>>> it allows overlapping mmap() to complete?
>>
>> The race Mark mentioned (on ARM) is AFAIU the following scenario:
>>
>> CPU 0                     CPU 1
>>
>> - dlopen()
>>    - mmap PROT_EXEC @addr
>>                            - fetch insn @addr, CPU state expects unchanged insn.
>>                            - execute unrelated code
>> - dlclose(addr)
>>    - munmap @addr
>> - dlopen()
>>    - mmap PROT_EXEC @addr
>>                            - fetch new insn @addr. Incoherent CPU state.
> 
> Unmapping an object while code is executing in it is undefined.

That's not the scenario though. In this scenario, CPU 1 executes
_unrelated code_ while we unmap @addr.

The issue is the stale CPU state that persists.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:15       ` Mathieu Desnoyers
@ 2025-01-10 17:24         ` Adhemerval Zanella Netto
  2025-01-10 17:35           ` Mathieu Desnoyers
  2025-01-10 17:46         ` Florian Weimer
  1 sibling, 1 reply; 18+ messages in thread
From: Adhemerval Zanella Netto @ 2025-01-10 17:24 UTC (permalink / raw)
  To: Mathieu Desnoyers, Florian Weimer
  Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com,
	Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson



On 10/01/25 14:15, Mathieu Desnoyers wrote:
> On 2025-01-10 12:10, Florian Weimer wrote:
>> * Mathieu Desnoyers:
>>
>>> On 2025-01-10 11:54, Peter Zijlstra wrote:
>>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
>>>>> Hi,
>>>>>
>>>>> I was discussing with Mark Rutland recently, and he pointed out that a
>>>>> sequence of dlclose/dlopen mapping new code at the same addresses in
>>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>>>>> with the newer TLB broadcast maintenance.
>>>> What is the exact race? Should not munmap() invalidate the TLBs
>>>> before
>>>> it allows overlapping mmap() to complete?
>>>
>>> The race Mark mentioned (on ARM) is AFAIU the following scenario:
>>>
>>> CPU 0                     CPU 1
>>>
>>> - dlopen()
>>>    - mmap PROT_EXEC @addr
>>>                            - fetch insn @addr, CPU state expects unchanged insn.
>>>                            - execute unrelated code
>>> - dlclose(addr)
>>>    - munmap @addr
>>> - dlopen()
>>>    - mmap PROT_EXEC @addr
>>>                            - fetch new insn @addr. Incoherent CPU state.
>>
>> Unmapping an object while code is executing in it is undefined.
> 
> That's not the scenario though. In this scenario, CPU 1 executes
> _unrelated code_ while we unmap @addr.

But in this scenario you still a concurrent dlclose while you have a running
thread executing code from that module, right? Or am I still missing something
here? 

Or, are you saying that even after dlopen returns (assuming the scenario where
it maps the code in a previous used mapping), the CPU is in an inconsistent
state unless MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is issued?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:24         ` Adhemerval Zanella Netto
@ 2025-01-10 17:35           ` Mathieu Desnoyers
  0 siblings, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2025-01-10 17:35 UTC (permalink / raw)
  To: Adhemerval Zanella Netto, Florian Weimer
  Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com,
	Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson

On 2025-01-10 12:24, Adhemerval Zanella Netto wrote:
> 
> 
> On 10/01/25 14:15, Mathieu Desnoyers wrote:
>> On 2025-01-10 12:10, Florian Weimer wrote:
>>> * Mathieu Desnoyers:
>>>
>>>> On 2025-01-10 11:54, Peter Zijlstra wrote:
>>>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I was discussing with Mark Rutland recently, and he pointed out that a
>>>>>> sequence of dlclose/dlopen mapping new code at the same addresses in
>>>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>>>>>> with the newer TLB broadcast maintenance.
>>>>> What is the exact race? Should not munmap() invalidate the TLBs
>>>>> before
>>>>> it allows overlapping mmap() to complete?
>>>>
>>>> The race Mark mentioned (on ARM) is AFAIU the following scenario:
>>>>
>>>> CPU 0                     CPU 1
>>>>
>>>> - dlopen()
>>>>     - mmap PROT_EXEC @addr
>>>>                             - fetch insn @addr, CPU state expects unchanged insn.
>>>>                             - execute unrelated code
>>>> - dlclose(addr)
>>>>     - munmap @addr
>>>> - dlopen()
>>>>     - mmap PROT_EXEC @addr
>>>>                             - fetch new insn @addr. Incoherent CPU state.
>>>
>>> Unmapping an object while code is executing in it is undefined.
>>
>> That's not the scenario though. In this scenario, CPU 1 executes
>> _unrelated code_ while we unmap @addr.
> 
> But in this scenario you still a concurrent dlclose while you have a running
> thread executing code from that module, right? Or am I still missing something
> here?

No.

> 
> Or, are you saying that even after dlopen returns (assuming the scenario where
> it maps the code in a previous used mapping), the CPU is in an inconsistent
> state unless MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is issued?

Yes, this is it.

And the issue is not specific to dlopen/dlclose. We can have a similar
issue if we have a sequence of:

dlopen  @addr
dlclose
mmap PROT_EXEC|PROT_WRITE  @addr
   - JIT writes some code and jumps to it.

So it appears to be something we may want to fix at the kernel level.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:15       ` Mathieu Desnoyers
  2025-01-10 17:24         ` Adhemerval Zanella Netto
@ 2025-01-10 17:46         ` Florian Weimer
  2025-01-10 19:16           ` Mathieu Desnoyers
  1 sibling, 1 reply; 18+ messages in thread
From: Florian Weimer @ 2025-01-10 17:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com,
	Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson

* Mathieu Desnoyers:

> On 2025-01-10 12:10, Florian Weimer wrote:
>> * Mathieu Desnoyers:
>> 
>>> On 2025-01-10 11:54, Peter Zijlstra wrote:
>>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
>>>>> Hi,
>>>>>
>>>>> I was discussing with Mark Rutland recently, and he pointed out that a
>>>>> sequence of dlclose/dlopen mapping new code at the same addresses in
>>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>>>>> with the newer TLB broadcast maintenance.
>>>> What is the exact race? Should not munmap() invalidate the TLBs
>>>> before
>>>> it allows overlapping mmap() to complete?
>>>
>>> The race Mark mentioned (on ARM) is AFAIU the following scenario:
>>>
>>> CPU 0                     CPU 1
>>>
>>> - dlopen()
>>>    - mmap PROT_EXEC @addr
>>>                            - fetch insn @addr, CPU state expects unchanged insn.
>>>                            - execute unrelated code
>>> - dlclose(addr)
>>>    - munmap @addr
>>> - dlopen()
>>>    - mmap PROT_EXEC @addr
>>>                            - fetch new insn @addr. Incoherent CPU state.
>> Unmapping an object while code is executing in it is undefined.
>
> That's not the scenario though. In this scenario, CPU 1 executes
> _unrelated code_ while we unmap @addr.

Oh, so CPU 1 initially executes some code, returns to some safe,
persistent code (“the execute unrelated code” part), this code
synchronizes with the dlclose and the dlopen that execute on CPU 0,
obtains a pointer to some supposedly safely published function in the
newly mapped object, and calls it.  And that fails because previously
cached information about the code is invalid?

Additional awkwardness may result if the initial execution is
speculative, and the code on CPU 1 only synchronizes with the dlopen,
and not the previous dlclose because it does not know about it at all?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:46         ` Florian Weimer
@ 2025-01-10 19:16           ` Mathieu Desnoyers
  0 siblings, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2025-01-10 19:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com,
	Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson

On 2025-01-10 12:46, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>> On 2025-01-10 12:10, Florian Weimer wrote:
>>> * Mathieu Desnoyers:
>>>
>>>> On 2025-01-10 11:54, Peter Zijlstra wrote:
>>>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I was discussing with Mark Rutland recently, and he pointed out that a
>>>>>> sequence of dlclose/dlopen mapping new code at the same addresses in
>>>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>>>>>> with the newer TLB broadcast maintenance.
>>>>> What is the exact race? Should not munmap() invalidate the TLBs
>>>>> before
>>>>> it allows overlapping mmap() to complete?
>>>>
>>>> The race Mark mentioned (on ARM) is AFAIU the following scenario:
>>>>
>>>> CPU 0                     CPU 1
>>>>
>>>> - dlopen()
>>>>     - mmap PROT_EXEC @addr
>>>>                             - fetch insn @addr, CPU state expects unchanged insn.
>>>>                             - execute unrelated code
>>>> - dlclose(addr)
>>>>     - munmap @addr
>>>> - dlopen()
>>>>     - mmap PROT_EXEC @addr
>>>>                             - fetch new insn @addr. Incoherent CPU state.
>>> Unmapping an object while code is executing in it is undefined.
>>
>> That's not the scenario though. In this scenario, CPU 1 executes
>> _unrelated code_ while we unmap @addr.
> 
> Oh, so CPU 1 initially executes some code, returns to some safe,
> persistent code (“the execute unrelated code” part), this code
> synchronizes with the dlclose and the dlopen that execute on CPU 0,
> obtains a pointer to some supposedly safely published function in the
> newly mapped object, and calls it.  And that fails because previously
> cached information about the code is invalid?

Correct.

> 
> Additional awkwardness may result if the initial execution is
> speculative, and the code on CPU 1 only synchronizes with the dlopen,
> and not the previous dlclose because it does not know about it at all?

I'm not sure I follow this last example. Can you explain further what
you have in mind ?

Thanks,

Mathieu

> 
> Thanks,
> Florian
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:02   ` Mathieu Desnoyers
  2025-01-10 17:10     ` Florian Weimer
@ 2025-01-10 17:11     ` Peter Zijlstra
  2025-01-10 18:41       ` Mark Rutland
  2025-01-10 17:12     ` Adhemerval Zanella Netto
  2 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2025-01-10 17:11 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: libc-alpha, Florian Weimer, carlos@redhat.com, Mark Rutland,
	linux-kernel, x86, paulmck, Michael Jeanson

On Fri, Jan 10, 2025 at 12:02:27PM -0500, Mathieu Desnoyers wrote:
> On 2025-01-10 11:54, Peter Zijlstra wrote:
> > On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
> > > Hi,
> > > 
> > > I was discussing with Mark Rutland recently, and he pointed out that a
> > > sequence of dlclose/dlopen mapping new code at the same addresses in
> > > multithreaded environments is an issue on ARM, and possibly on Intel/AMD
> > > with the newer TLB broadcast maintenance.
> > 
> > What is the exact race? Should not munmap() invalidate the TLBs before
> > it allows overlapping mmap() to complete?
> 
> The race Mark mentioned (on ARM) is AFAIU the following scenario:
> 
> CPU 0                     CPU 1
> 
> - dlopen()
>   - mmap PROT_EXEC @addr
>                           - fetch insn @addr, CPU state expects unchanged insn.
>                           - execute unrelated code
> - dlclose(addr)
>   - munmap @addr
> - dlopen()
>   - mmap PROT_EXEC @addr
>                           - fetch new insn @addr. Incoherent CPU state.

Urgh.. Mark, is this because of non-coherent i-cache or somesuch misery?

But shouldn't flush_{,i}cache_range() or something along those lines not
handle this?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:11     ` Peter Zijlstra
@ 2025-01-10 18:41       ` Mark Rutland
  0 siblings, 0 replies; 18+ messages in thread
From: Mark Rutland @ 2025-01-10 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, libc-alpha, Florian Weimer, carlos@redhat.com,
	linux-kernel, x86, paulmck, Michael Jeanson, Andy Lutomirski,
	Will Deacon

Hi Peter,

[adding Andy and Will, since we've discussed related cases in the past]

On Fri, Jan 10, 2025 at 06:11:12PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 10, 2025 at 12:02:27PM -0500, Mathieu Desnoyers wrote:
> > On 2025-01-10 11:54, Peter Zijlstra wrote:
> > > On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
> > > > Hi,
> > > > 
> > > > I was discussing with Mark Rutland recently, and he pointed out that a
> > > > sequence of dlclose/dlopen mapping new code at the same addresses in
> > > > multithreaded environments is an issue on ARM, and possibly on Intel/AMD
> > > > with the newer TLB broadcast maintenance.
> > > 
> > > What is the exact race? Should not munmap() invalidate the TLBs before
> > > it allows overlapping mmap() to complete?
> > 
> > The race Mark mentioned (on ARM) is AFAIU the following scenario:
> > 
> > CPU 0                     CPU 1
> > 
> > - dlopen()
> >   - mmap PROT_EXEC @addr
> >                           - fetch insn @addr, CPU state expects unchanged insn.
> >                           - execute unrelated code
> > - dlclose(addr)
> >   - munmap @addr
> > - dlopen()
> >   - mmap PROT_EXEC @addr
> >                           - fetch new insn @addr. Incoherent CPU state.

For the benefit of others, what I specifically said was:

| There's a fun (latent, been around forever) issue whereby reusing the
| same VA for different code (e.g. dlopen() ... dlclose() ... dlopen())
| could blow up in a multi-threaded environment

I hadn't reported this on a list yet because there are many subtleties,
this is vanishingly unlikely to occur in practice today, and I didn't
want to get people excited/confused/angry over an incomplete or
misleading description.

> Urgh.. Mark, is this because of non-coherent i-cache or somesuch misery?

Sort-of. 

The key detail is that while instructions are being executed (including
speculative execution), the CPU pipeline/OoO-engine/whatever effectively
caches a copy of an instruction while it is "in-flight" (e.g.
potentially broken down into micro-ops):

On the ARM architecture, those in-flight copies are only guaranteed to
be discarded by a context-synchronization-event, and are not guaranteed
to be discarded due to TLB maintenance, data cache maintenance, or
instruction cache maintenance. Instruction cache maintenance will
guarantee that *subsequent* fetches from any instruction cache observe
the new value.

The first time a page of executable code is mapped in at a VA, this
isn't a problem because there was nothing previously at that VA which
could have been fetched from (since entering userspace, as exception
return from kernel to user provides a context-synchronization-event).

However, if some code A is mapped at a VA, then unmapped, then some
distinct code B is mapped at that VA, then some CPUs might still have
code A in-flight, regardless of TLB and cache maintenance, unless a
context-synchronization-event occurs.

Imagine you have a CPU microarchitecture with a long speculative
execution window, and you have to threads running pinned on two CPUs
sharing an address space, with some shared function pointer P which is
initially NULL. Then you have something like:

Thread 0			Thread 1

				// Speculating some long-to-resolve
				// sequence of instructions.
- mmap() code A at VA X
  - enters kernel
  - kernel loads data into page
  - kernel performs D$ + I$ maintenance
  - kernel updates page tables
  - returns to userspace

				// Begins speculating if (P) { P() };

  				// Begins speculating P(), predicted as
				// VA X.

				// Fetches code A from X into pipeline
				// Code A now in-flight
- munmap() VA X
  - enters kernel
  - updates page tables
  - performs TLB maintenance (broadcast)
  - returns to userspace

				// Code A still in-flight

- mmap() code B at VA X
  - enters kernel
  - kernel loads data into page
  - kernel performs D$ + I$ maintenance
				// Code A no longer in I$
				// Code A still in-flight
  - kernel updates page tables
  - returns to userspace

				// Code A still in-flight

- Publishes P as pointer in code B
  (e.g. WRITE_ONCE(P, X))

				// Completes speculation of
				// long-to-resolve sequence.

				// Resolves P is VA X, and
				// commits speculation of code A

... and BANG, stale instructions executed.

Note that on architectures that use IPIs for TLB invalidation (e.g. x86
today), the munmap() is likely to provide the serialization to discard
the in-flight copy by virtue of the IPI.

Practically speaking, actually hitting this is very unlikely because you
need to get very unlucky with predication, the predicted instructions
need to remain in-flight for a very long time without being discarded
for other reasons (e.g. a mis-prediction, IRQ, etc), and the VA needs to
be reused within that time window.

> But shouldn't flush_{,i}cache_range() or something along those lines not
> handle this?

Unfortunately not, those only affect the explicit data/instruction
caches, and not the in-flight copies of the instructions.

Mark.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:02   ` Mathieu Desnoyers
  2025-01-10 17:10     ` Florian Weimer
  2025-01-10 17:11     ` Peter Zijlstra
@ 2025-01-10 17:12     ` Adhemerval Zanella Netto
  2 siblings, 0 replies; 18+ messages in thread
From: Adhemerval Zanella Netto @ 2025-01-10 17:12 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra
  Cc: libc-alpha, Florian Weimer, carlos@redhat.com, Mark Rutland,
	linux-kernel, x86, paulmck, Michael Jeanson



On 10/01/25 14:02, Mathieu Desnoyers wrote:
> On 2025-01-10 11:54, Peter Zijlstra wrote:
>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
>>> Hi,
>>>
>>> I was discussing with Mark Rutland recently, and he pointed out that a
>>> sequence of dlclose/dlopen mapping new code at the same addresses in
>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>>> with the newer TLB broadcast maintenance.
>>
>> What is the exact race? Should not munmap() invalidate the TLBs before
>> it allows overlapping mmap() to complete?
> 
> The race Mark mentioned (on ARM) is AFAIU the following scenario:
> 
> CPU 0                     CPU 1
> 
> - dlopen()
>   - mmap PROT_EXEC @addr
>                           - fetch insn @addr, CPU state expects unchanged insn.
>                           - execute unrelated code

It is not clear to me from userland/libc perspective how this would happen,
since to dlopen get the same address you will need to either dlclose or call
munmap.

Either you have UB where some thread dclose a library while is being used
by a different thread, or the thread will ended up executing a potentially
different code it is intended to do.

Do you have a realworld case where current glibc code show this issue?

> - dlclose(addr)
>   - munmap @addr
> - dlopen()
>   - mmap PROT_EXEC @addr
>                           - fetch new insn @addr. Incoherent CPU state.
> 
>>
>> Any concurrent access after munmap() / before mmap() completes is UB
>> anyway, no?
> 
> The problematic access happens after the second mmap. The issue is
> stale CPU state.
> 
>>
>>> I maintain the membarrier(2) system call, which provides a
>>> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
>>> purpose. It's been there since Linux 4.16. It can be configured
>>> out (CONFIG_MEMBARRIER=n), but it's enabled by default.
>>>
>>> Calling this after dlclose() in glibc would prevent this issue.
>>>
>>> Is it handled in some other way, or should we open a bugzilla
>>> entry to track this ?
>>
>> The problem is that the membarrier() call has significant cost, and is
>> only really needed if dlopen() is called right after (in the same
>> location).
> 
> Or if it has any overlapping executable range.
> 
>>
>> Unconditionally adding that barrier, just in case, might regress things,
>> no?
> 
> Or perhaps we could add this barrier within mprotect(2) and munmap(2) in the
> following cases:
> 
> - mprotect removes PROT_EXEC from a mapping,
> - munmap unmaps a PROT_EXEC mapping.
> 
> Else userspace has to explicitly invoke membarrier sync-core from dlclose.
> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 15:55 Prevent inconsistent CPU state after sequence of dlclose/dlopen Mathieu Desnoyers
  2025-01-10 16:47 ` Adhemerval Zanella Netto
  2025-01-10 16:54 ` Peter Zijlstra
@ 2025-01-10 17:04 ` Florian Weimer
  2025-01-10 17:13   ` Mathieu Desnoyers
  2 siblings, 1 reply; 18+ messages in thread
From: Florian Weimer @ 2025-01-10 17:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: libc-alpha, carlos@redhat.com, Mark Rutland, Peter Zijlstra,
	linux-kernel, x86, paulmck, Michael Jeanson

* Mathieu Desnoyers:

> I was discussing with Mark Rutland recently, and he pointed out that a
> sequence of dlclose/dlopen mapping new code at the same addresses in
> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
> with the newer TLB broadcast maintenance.
>
> I maintain the membarrier(2) system call, which provides a
> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
> purpose. It's been there since Linux 4.16. It can be configured
> out (CONFIG_MEMBARRIER=n), but it's enabled by default.
>
> Calling this after dlclose() in glibc would prevent this issue.
>
> Is it handled in some other way, or should we open a bugzilla
> entry to track this ?

There is nothing special about dlopen/dlclose, we just use mmap/munmap.
If there is a synchronization problem, we'd have to add to add barriers
to mmap and munmap.

But why isn't it up to the kernel to handle this correctly?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:04 ` Florian Weimer
@ 2025-01-10 17:13   ` Mathieu Desnoyers
  2025-01-10 18:33     ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2025-01-10 17:13 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, carlos@redhat.com, Mark Rutland, Peter Zijlstra,
	linux-kernel, x86, paulmck, Michael Jeanson

On 2025-01-10 12:04, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>> I was discussing with Mark Rutland recently, and he pointed out that a
>> sequence of dlclose/dlopen mapping new code at the same addresses in
>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD
>> with the newer TLB broadcast maintenance.
>>
>> I maintain the membarrier(2) system call, which provides a
>> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
>> purpose. It's been there since Linux 4.16. It can be configured
>> out (CONFIG_MEMBARRIER=n), but it's enabled by default.
>>
>> Calling this after dlclose() in glibc would prevent this issue.
>>
>> Is it handled in some other way, or should we open a bugzilla
>> entry to track this ?
> 
> There is nothing special about dlopen/dlclose, we just use mmap/munmap.
> If there is a synchronization problem, we'd have to add to add barriers
> to mmap and munmap.
> 
> But why isn't it up to the kernel to handle this correctly?

As I mentioned to Peter, we could add this barrier within mprotect(2)
and munmap(2) in the following cases:

- mprotect removes PROT_EXEC from a mapping,
- munmap unmaps a PROT_EXEC mapping.

We could even go further and batch this: we only need to
issue membarrier-sync-core on the following sequence for an mm:

On either of those, set current->mm->pending_membarrier_sync_core = true:
   - mprotect removes PROT_EXEC from a mapping, or
   - munmap unmaps a PROT_EXEC mapping,

And then, if current->mm->pending_membarrier_sync_core == true when:
   - mmap is called to create a PROT_EXEC mapping, or
   - mprotect sets PROT_EXEC on a mapping.

   invoke membarrier sync-core and set
   current->mm_pending_membarrier = false

Thoughts ?

Thanks,

Mathieu

> 
> Thanks,
> Florian
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen
  2025-01-10 17:13   ` Mathieu Desnoyers
@ 2025-01-10 18:33     ` Paul E. McKenney
  0 siblings, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2025-01-10 18:33 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Florian Weimer, libc-alpha, carlos@redhat.com, Mark Rutland,
	Peter Zijlstra, linux-kernel, x86, Michael Jeanson

On Fri, Jan 10, 2025 at 12:13:58PM -0500, Mathieu Desnoyers wrote:
> On 2025-01-10 12:04, Florian Weimer wrote:
> > * Mathieu Desnoyers:
> > 
> > > I was discussing with Mark Rutland recently, and he pointed out that a
> > > sequence of dlclose/dlopen mapping new code at the same addresses in
> > > multithreaded environments is an issue on ARM, and possibly on Intel/AMD
> > > with the newer TLB broadcast maintenance.
> > > 
> > > I maintain the membarrier(2) system call, which provides a
> > > MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
> > > purpose. It's been there since Linux 4.16. It can be configured
> > > out (CONFIG_MEMBARRIER=n), but it's enabled by default.
> > > 
> > > Calling this after dlclose() in glibc would prevent this issue.
> > > 
> > > Is it handled in some other way, or should we open a bugzilla
> > > entry to track this ?
> > 
> > There is nothing special about dlopen/dlclose, we just use mmap/munmap.
> > If there is a synchronization problem, we'd have to add to add barriers
> > to mmap and munmap.
> > 
> > But why isn't it up to the kernel to handle this correctly?
> 
> As I mentioned to Peter, we could add this barrier within mprotect(2)
> and munmap(2) in the following cases:
> 
> - mprotect removes PROT_EXEC from a mapping,
> - munmap unmaps a PROT_EXEC mapping.
> 
> We could even go further and batch this: we only need to
> issue membarrier-sync-core on the following sequence for an mm:
> 
> On either of those, set current->mm->pending_membarrier_sync_core = true:
>   - mprotect removes PROT_EXEC from a mapping, or
>   - munmap unmaps a PROT_EXEC mapping,
> 
> And then, if current->mm->pending_membarrier_sync_core == true when:
>   - mmap is called to create a PROT_EXEC mapping, or
>   - mprotect sets PROT_EXEC on a mapping.
> 
>   invoke membarrier sync-core and set
>   current->mm_pending_membarrier = false
> 
> Thoughts ?

In the case where the kernel is permitted to choose the address of the
new mapping, we could batch the sys_membarrier() calls, keeping the
corresponding address space out of service in the meantime.

Don't tell me, a given dynamic library always picks the same address?  :-/

							Thanx, Paul

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-01-15 20:17 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-10 15:55 Prevent inconsistent CPU state after sequence of dlclose/dlopen Mathieu Desnoyers
2025-01-10 16:47 ` Adhemerval Zanella Netto
2025-01-15 20:16   ` Mathieu Desnoyers
2025-01-10 16:54 ` Peter Zijlstra
2025-01-10 17:02   ` Mathieu Desnoyers
2025-01-10 17:10     ` Florian Weimer
2025-01-10 17:14       ` Adhemerval Zanella Netto
2025-01-10 17:15       ` Mathieu Desnoyers
2025-01-10 17:24         ` Adhemerval Zanella Netto
2025-01-10 17:35           ` Mathieu Desnoyers
2025-01-10 17:46         ` Florian Weimer
2025-01-10 19:16           ` Mathieu Desnoyers
2025-01-10 17:11     ` Peter Zijlstra
2025-01-10 18:41       ` Mark Rutland
2025-01-10 17:12     ` Adhemerval Zanella Netto
2025-01-10 17:04 ` Florian Weimer
2025-01-10 17:13   ` Mathieu Desnoyers
2025-01-10 18:33     ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox