* Prevent inconsistent CPU state after sequence of dlclose/dlopen
@ 2025-01-10 15:55 Mathieu Desnoyers
2025-01-10 16:47 ` Adhemerval Zanella Netto
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2025-01-10 15:55 UTC (permalink / raw)
To: libc-alpha
Cc: Florian Weimer, carlos@redhat.com, Mark Rutland, Peter Zijlstra,
linux-kernel, x86, paulmck, Michael Jeanson
Hi,
I was discussing with Mark Rutland recently, and he pointed out that a
sequence of dlclose/dlopen mapping new code at the same addresses in
multithreaded environments is an issue on ARM, and possibly on Intel/AMD
with the newer TLB broadcast maintenance.
I maintain the membarrier(2) system call, which provides a
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this
purpose. It's been there since Linux 4.16. It can be configured
out (CONFIG_MEMBARRIER=n), but it's enabled by default.
Calling this after dlclose() in glibc would prevent this issue.
Is it handled in some other way, or should we open a bugzilla
entry to track this ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 15:55 Prevent inconsistent CPU state after sequence of dlclose/dlopen Mathieu Desnoyers @ 2025-01-10 16:47 ` Adhemerval Zanella Netto 2025-01-15 20:16 ` Mathieu Desnoyers 2025-01-10 16:54 ` Peter Zijlstra 2025-01-10 17:04 ` Florian Weimer 2 siblings, 1 reply; 18+ messages in thread From: Adhemerval Zanella Netto @ 2025-01-10 16:47 UTC (permalink / raw) To: Mathieu Desnoyers, libc-alpha Cc: Florian Weimer, carlos@redhat.com, Mark Rutland, Peter Zijlstra, linux-kernel, x86, paulmck, Michael Jeanson On 10/01/25 12:55, Mathieu Desnoyers wrote: > Hi, > > I was discussing with Mark Rutland recently, and he pointed out that a > sequence of dlclose/dlopen mapping new code at the same addresses in > multithreaded environments is an issue on ARM, and possibly on Intel/AMD > with the newer TLB broadcast maintenance. > > I maintain the membarrier(2) system call, which provides a > MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this > purpose. It's been there since Linux 4.16. It can be configured > out (CONFIG_MEMBARRIER=n), but it's enabled by default. > > Calling this after dlclose() in glibc would prevent this issue. > > Is it handled in some other way, or should we open a bugzilla > entry to track this ? Yes please, it would be helpful if you can add some information on what kind of hardware and kernel version this is an issue. Also, could you add some detail of the issue and why kernel itself does not or can not guarantee memory consistent after the mmap call? Is is because this would be an extra non-required overhead on mmap that userland should handle? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 16:47 ` Adhemerval Zanella Netto @ 2025-01-15 20:16 ` Mathieu Desnoyers 0 siblings, 0 replies; 18+ messages in thread From: Mathieu Desnoyers @ 2025-01-15 20:16 UTC (permalink / raw) To: Adhemerval Zanella Netto, libc-alpha Cc: Florian Weimer, carlos@redhat.com, Mark Rutland, Peter Zijlstra, linux-kernel, x86, paulmck, Michael Jeanson On 2025-01-10 11:47, Adhemerval Zanella Netto wrote: > > > On 10/01/25 12:55, Mathieu Desnoyers wrote: >> Hi, >> >> I was discussing with Mark Rutland recently, and he pointed out that a >> sequence of dlclose/dlopen mapping new code at the same addresses in >> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >> with the newer TLB broadcast maintenance. >> >> I maintain the membarrier(2) system call, which provides a >> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this >> purpose. It's been there since Linux 4.16. It can be configured >> out (CONFIG_MEMBARRIER=n), but it's enabled by default. >> >> Calling this after dlclose() in glibc would prevent this issue. >> >> Is it handled in some other way, or should we open a bugzilla >> entry to track this ? > > Yes please, it would be helpful if you can add some information on > what kind of hardware and kernel version this is an issue. Done: https://sourceware.org/bugzilla/show_bug.cgi?id=32563 > > Also, could you add some detail of the issue and why kernel itself does > not or can not guarantee memory consistent after the mmap call? I've added a comment detailing this. > > Is is because this would be an extra non-required overhead on > mmap that userland should handle? Yes, overhead is the culprit there, although it could be manageable if we target this kind of extra sync-core operations on specific sequences of mmap/munmap/mprotect with the PROT_EXEC flag. I've documented a possible approach in the bugzilla entry. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 15:55 Prevent inconsistent CPU state after sequence of dlclose/dlopen Mathieu Desnoyers 2025-01-10 16:47 ` Adhemerval Zanella Netto @ 2025-01-10 16:54 ` Peter Zijlstra 2025-01-10 17:02 ` Mathieu Desnoyers 2025-01-10 17:04 ` Florian Weimer 2 siblings, 1 reply; 18+ messages in thread From: Peter Zijlstra @ 2025-01-10 16:54 UTC (permalink / raw) To: Mathieu Desnoyers Cc: libc-alpha, Florian Weimer, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: > Hi, > > I was discussing with Mark Rutland recently, and he pointed out that a > sequence of dlclose/dlopen mapping new code at the same addresses in > multithreaded environments is an issue on ARM, and possibly on Intel/AMD > with the newer TLB broadcast maintenance. What is the exact race? Should not munmap() invalidate the TLBs before it allows overlapping mmap() to complete? Any concurrent access after munmap() / before mmap() completes is UB anyway, no? > I maintain the membarrier(2) system call, which provides a > MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this > purpose. It's been there since Linux 4.16. It can be configured > out (CONFIG_MEMBARRIER=n), but it's enabled by default. > > Calling this after dlclose() in glibc would prevent this issue. > > Is it handled in some other way, or should we open a bugzilla > entry to track this ? The problem is that the membarrier() call has significant cost, and is only really needed if dlopen() is called right after (in the same location). Unconditionally adding that barrier, just in case, might regress things, no? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 16:54 ` Peter Zijlstra @ 2025-01-10 17:02 ` Mathieu Desnoyers 2025-01-10 17:10 ` Florian Weimer ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Mathieu Desnoyers @ 2025-01-10 17:02 UTC (permalink / raw) To: Peter Zijlstra Cc: libc-alpha, Florian Weimer, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson On 2025-01-10 11:54, Peter Zijlstra wrote: > On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: >> Hi, >> >> I was discussing with Mark Rutland recently, and he pointed out that a >> sequence of dlclose/dlopen mapping new code at the same addresses in >> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >> with the newer TLB broadcast maintenance. > > What is the exact race? Should not munmap() invalidate the TLBs before > it allows overlapping mmap() to complete? The race Mark mentioned (on ARM) is AFAIU the following scenario: CPU 0 CPU 1 - dlopen() - mmap PROT_EXEC @addr - fetch insn @addr, CPU state expects unchanged insn. - execute unrelated code - dlclose(addr) - munmap @addr - dlopen() - mmap PROT_EXEC @addr - fetch new insn @addr. Incoherent CPU state. > > Any concurrent access after munmap() / before mmap() completes is UB > anyway, no? The problematic access happens after the second mmap. The issue is stale CPU state. > >> I maintain the membarrier(2) system call, which provides a >> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this >> purpose. It's been there since Linux 4.16. It can be configured >> out (CONFIG_MEMBARRIER=n), but it's enabled by default. >> >> Calling this after dlclose() in glibc would prevent this issue. >> >> Is it handled in some other way, or should we open a bugzilla >> entry to track this ? > > The problem is that the membarrier() call has significant cost, and is > only really needed if dlopen() is called right after (in the same > location). Or if it has any overlapping executable range. > > Unconditionally adding that barrier, just in case, might regress things, > no? Or perhaps we could add this barrier within mprotect(2) and munmap(2) in the following cases: - mprotect removes PROT_EXEC from a mapping, - munmap unmaps a PROT_EXEC mapping. Else userspace has to explicitly invoke membarrier sync-core from dlclose. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:02 ` Mathieu Desnoyers @ 2025-01-10 17:10 ` Florian Weimer 2025-01-10 17:14 ` Adhemerval Zanella Netto 2025-01-10 17:15 ` Mathieu Desnoyers 2025-01-10 17:11 ` Peter Zijlstra 2025-01-10 17:12 ` Adhemerval Zanella Netto 2 siblings, 2 replies; 18+ messages in thread From: Florian Weimer @ 2025-01-10 17:10 UTC (permalink / raw) To: Mathieu Desnoyers; +Cc: Peter Zijlstra, libc-alpha * Mathieu Desnoyers: > On 2025-01-10 11:54, Peter Zijlstra wrote: >> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: >>> Hi, >>> >>> I was discussing with Mark Rutland recently, and he pointed out that a >>> sequence of dlclose/dlopen mapping new code at the same addresses in >>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >>> with the newer TLB broadcast maintenance. >> What is the exact race? Should not munmap() invalidate the TLBs >> before >> it allows overlapping mmap() to complete? > > The race Mark mentioned (on ARM) is AFAIU the following scenario: > > CPU 0 CPU 1 > > - dlopen() > - mmap PROT_EXEC @addr > - fetch insn @addr, CPU state expects unchanged insn. > - execute unrelated code > - dlclose(addr) > - munmap @addr > - dlopen() > - mmap PROT_EXEC @addr > - fetch new insn @addr. Incoherent CPU state. Unmapping an object while code is executing in it is undefined. We have a problem with things like pthread_atfork handlers. We can't use locking there because fork handlers are expected to perform ample locking themselves, and an extra lock around them would run into lock ordering issues. (We tried for unrelated reasons and saw deadlocks in applications.) What we can do is bump a reference counter while we run a pthread_atfork callback (we already associate them with DSOs) and skip the munmap part in dlclose if the counter is not zero. We can complete the unmapping after the fork handler returns (maybe in the parent only). There might be other callbacks besides fork handlers that have this problem. A similar treatment is possible for some of them, hopefully all of them in glibc. We cannot cover things like std::shared_ptr destructor calls, though. But adding more barriers won't fix those, either. Thanks, Florian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:10 ` Florian Weimer @ 2025-01-10 17:14 ` Adhemerval Zanella Netto 2025-01-10 17:15 ` Mathieu Desnoyers 1 sibling, 0 replies; 18+ messages in thread From: Adhemerval Zanella Netto @ 2025-01-10 17:14 UTC (permalink / raw) To: Florian Weimer, Mathieu Desnoyers Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson On 10/01/25 14:10, Florian Weimer wrote: > * Mathieu Desnoyers: > >> On 2025-01-10 11:54, Peter Zijlstra wrote: >>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: >>>> Hi, >>>> >>>> I was discussing with Mark Rutland recently, and he pointed out that a >>>> sequence of dlclose/dlopen mapping new code at the same addresses in >>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >>>> with the newer TLB broadcast maintenance. >>> What is the exact race? Should not munmap() invalidate the TLBs >>> before >>> it allows overlapping mmap() to complete? >> >> The race Mark mentioned (on ARM) is AFAIU the following scenario: >> >> CPU 0 CPU 1 >> >> - dlopen() >> - mmap PROT_EXEC @addr >> - fetch insn @addr, CPU state expects unchanged insn. >> - execute unrelated code >> - dlclose(addr) >> - munmap @addr >> - dlopen() >> - mmap PROT_EXEC @addr >> - fetch new insn @addr. Incoherent CPU state. > > Unmapping an object while code is executing in it is undefined. > > We have a problem with things like pthread_atfork handlers. We can't > use locking there because fork handlers are expected to perform ample > locking themselves, and an extra lock around them would run into lock > ordering issues. (We tried for unrelated reasons and saw deadlocks in > applications.) > > What we can do is bump a reference counter while we run a pthread_atfork > callback (we already associate them with DSOs) and skip the munmap part > in dlclose if the counter is not zero. We can complete the unmapping > after the fork handler returns (maybe in the parent only). We can also make dlclose a no-op (like some runtimes do), although this has other implications. > > There might be other callbacks besides fork handlers that have this > problem. A similar treatment is possible for some of them, hopefully > all of them in glibc. We cannot cover things like std::shared_ptr > destructor calls, though. But adding more barriers won't fix those, > either. > > Thanks, > Florian > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:10 ` Florian Weimer 2025-01-10 17:14 ` Adhemerval Zanella Netto @ 2025-01-10 17:15 ` Mathieu Desnoyers 2025-01-10 17:24 ` Adhemerval Zanella Netto 2025-01-10 17:46 ` Florian Weimer 1 sibling, 2 replies; 18+ messages in thread From: Mathieu Desnoyers @ 2025-01-10 17:15 UTC (permalink / raw) To: Florian Weimer Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson On 2025-01-10 12:10, Florian Weimer wrote: > * Mathieu Desnoyers: > >> On 2025-01-10 11:54, Peter Zijlstra wrote: >>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: >>>> Hi, >>>> >>>> I was discussing with Mark Rutland recently, and he pointed out that a >>>> sequence of dlclose/dlopen mapping new code at the same addresses in >>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >>>> with the newer TLB broadcast maintenance. >>> What is the exact race? Should not munmap() invalidate the TLBs >>> before >>> it allows overlapping mmap() to complete? >> >> The race Mark mentioned (on ARM) is AFAIU the following scenario: >> >> CPU 0 CPU 1 >> >> - dlopen() >> - mmap PROT_EXEC @addr >> - fetch insn @addr, CPU state expects unchanged insn. >> - execute unrelated code >> - dlclose(addr) >> - munmap @addr >> - dlopen() >> - mmap PROT_EXEC @addr >> - fetch new insn @addr. Incoherent CPU state. > > Unmapping an object while code is executing in it is undefined. That's not the scenario though. In this scenario, CPU 1 executes _unrelated code_ while we unmap @addr. The issue is the stale CPU state that persists. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:15 ` Mathieu Desnoyers @ 2025-01-10 17:24 ` Adhemerval Zanella Netto 2025-01-10 17:35 ` Mathieu Desnoyers 2025-01-10 17:46 ` Florian Weimer 1 sibling, 1 reply; 18+ messages in thread From: Adhemerval Zanella Netto @ 2025-01-10 17:24 UTC (permalink / raw) To: Mathieu Desnoyers, Florian Weimer Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson On 10/01/25 14:15, Mathieu Desnoyers wrote: > On 2025-01-10 12:10, Florian Weimer wrote: >> * Mathieu Desnoyers: >> >>> On 2025-01-10 11:54, Peter Zijlstra wrote: >>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: >>>>> Hi, >>>>> >>>>> I was discussing with Mark Rutland recently, and he pointed out that a >>>>> sequence of dlclose/dlopen mapping new code at the same addresses in >>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >>>>> with the newer TLB broadcast maintenance. >>>> What is the exact race? Should not munmap() invalidate the TLBs >>>> before >>>> it allows overlapping mmap() to complete? >>> >>> The race Mark mentioned (on ARM) is AFAIU the following scenario: >>> >>> CPU 0 CPU 1 >>> >>> - dlopen() >>> - mmap PROT_EXEC @addr >>> - fetch insn @addr, CPU state expects unchanged insn. >>> - execute unrelated code >>> - dlclose(addr) >>> - munmap @addr >>> - dlopen() >>> - mmap PROT_EXEC @addr >>> - fetch new insn @addr. Incoherent CPU state. >> >> Unmapping an object while code is executing in it is undefined. > > That's not the scenario though. In this scenario, CPU 1 executes > _unrelated code_ while we unmap @addr. But in this scenario you still a concurrent dlclose while you have a running thread executing code from that module, right? Or am I still missing something here? Or, are you saying that even after dlopen returns (assuming the scenario where it maps the code in a previous used mapping), the CPU is in an inconsistent state unless MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is issued? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:24 ` Adhemerval Zanella Netto @ 2025-01-10 17:35 ` Mathieu Desnoyers 0 siblings, 0 replies; 18+ messages in thread From: Mathieu Desnoyers @ 2025-01-10 17:35 UTC (permalink / raw) To: Adhemerval Zanella Netto, Florian Weimer Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson On 2025-01-10 12:24, Adhemerval Zanella Netto wrote: > > > On 10/01/25 14:15, Mathieu Desnoyers wrote: >> On 2025-01-10 12:10, Florian Weimer wrote: >>> * Mathieu Desnoyers: >>> >>>> On 2025-01-10 11:54, Peter Zijlstra wrote: >>>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: >>>>>> Hi, >>>>>> >>>>>> I was discussing with Mark Rutland recently, and he pointed out that a >>>>>> sequence of dlclose/dlopen mapping new code at the same addresses in >>>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >>>>>> with the newer TLB broadcast maintenance. >>>>> What is the exact race? Should not munmap() invalidate the TLBs >>>>> before >>>>> it allows overlapping mmap() to complete? >>>> >>>> The race Mark mentioned (on ARM) is AFAIU the following scenario: >>>> >>>> CPU 0 CPU 1 >>>> >>>> - dlopen() >>>> - mmap PROT_EXEC @addr >>>> - fetch insn @addr, CPU state expects unchanged insn. >>>> - execute unrelated code >>>> - dlclose(addr) >>>> - munmap @addr >>>> - dlopen() >>>> - mmap PROT_EXEC @addr >>>> - fetch new insn @addr. Incoherent CPU state. >>> >>> Unmapping an object while code is executing in it is undefined. >> >> That's not the scenario though. In this scenario, CPU 1 executes >> _unrelated code_ while we unmap @addr. > > But in this scenario you still a concurrent dlclose while you have a running > thread executing code from that module, right? Or am I still missing something > here? No. > > Or, are you saying that even after dlopen returns (assuming the scenario where > it maps the code in a previous used mapping), the CPU is in an inconsistent > state unless MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is issued? Yes, this is it. And the issue is not specific to dlopen/dlclose. We can have a similar issue if we have a sequence of: dlopen @addr dlclose mmap PROT_EXEC|PROT_WRITE @addr - JIT writes some code and jumps to it. So it appears to be something we may want to fix at the kernel level. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:15 ` Mathieu Desnoyers 2025-01-10 17:24 ` Adhemerval Zanella Netto @ 2025-01-10 17:46 ` Florian Weimer 2025-01-10 19:16 ` Mathieu Desnoyers 1 sibling, 1 reply; 18+ messages in thread From: Florian Weimer @ 2025-01-10 17:46 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson * Mathieu Desnoyers: > On 2025-01-10 12:10, Florian Weimer wrote: >> * Mathieu Desnoyers: >> >>> On 2025-01-10 11:54, Peter Zijlstra wrote: >>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: >>>>> Hi, >>>>> >>>>> I was discussing with Mark Rutland recently, and he pointed out that a >>>>> sequence of dlclose/dlopen mapping new code at the same addresses in >>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >>>>> with the newer TLB broadcast maintenance. >>>> What is the exact race? Should not munmap() invalidate the TLBs >>>> before >>>> it allows overlapping mmap() to complete? >>> >>> The race Mark mentioned (on ARM) is AFAIU the following scenario: >>> >>> CPU 0 CPU 1 >>> >>> - dlopen() >>> - mmap PROT_EXEC @addr >>> - fetch insn @addr, CPU state expects unchanged insn. >>> - execute unrelated code >>> - dlclose(addr) >>> - munmap @addr >>> - dlopen() >>> - mmap PROT_EXEC @addr >>> - fetch new insn @addr. Incoherent CPU state. >> Unmapping an object while code is executing in it is undefined. > > That's not the scenario though. In this scenario, CPU 1 executes > _unrelated code_ while we unmap @addr. Oh, so CPU 1 initially executes some code, returns to some safe, persistent code (“the execute unrelated code” part), this code synchronizes with the dlclose and the dlopen that execute on CPU 0, obtains a pointer to some supposedly safely published function in the newly mapped object, and calls it. And that fails because previously cached information about the code is invalid? Additional awkwardness may result if the initial execution is speculative, and the code on CPU 1 only synchronizes with the dlopen, and not the previous dlclose because it does not know about it at all? Thanks, Florian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:46 ` Florian Weimer @ 2025-01-10 19:16 ` Mathieu Desnoyers 0 siblings, 0 replies; 18+ messages in thread From: Mathieu Desnoyers @ 2025-01-10 19:16 UTC (permalink / raw) To: Florian Weimer Cc: Peter Zijlstra, libc-alpha@sourceware.org, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson On 2025-01-10 12:46, Florian Weimer wrote: > * Mathieu Desnoyers: > >> On 2025-01-10 12:10, Florian Weimer wrote: >>> * Mathieu Desnoyers: >>> >>>> On 2025-01-10 11:54, Peter Zijlstra wrote: >>>>> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: >>>>>> Hi, >>>>>> >>>>>> I was discussing with Mark Rutland recently, and he pointed out that a >>>>>> sequence of dlclose/dlopen mapping new code at the same addresses in >>>>>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >>>>>> with the newer TLB broadcast maintenance. >>>>> What is the exact race? Should not munmap() invalidate the TLBs >>>>> before >>>>> it allows overlapping mmap() to complete? >>>> >>>> The race Mark mentioned (on ARM) is AFAIU the following scenario: >>>> >>>> CPU 0 CPU 1 >>>> >>>> - dlopen() >>>> - mmap PROT_EXEC @addr >>>> - fetch insn @addr, CPU state expects unchanged insn. >>>> - execute unrelated code >>>> - dlclose(addr) >>>> - munmap @addr >>>> - dlopen() >>>> - mmap PROT_EXEC @addr >>>> - fetch new insn @addr. Incoherent CPU state. >>> Unmapping an object while code is executing in it is undefined. >> >> That's not the scenario though. In this scenario, CPU 1 executes >> _unrelated code_ while we unmap @addr. > > Oh, so CPU 1 initially executes some code, returns to some safe, > persistent code (“the execute unrelated code” part), this code > synchronizes with the dlclose and the dlopen that execute on CPU 0, > obtains a pointer to some supposedly safely published function in the > newly mapped object, and calls it. And that fails because previously > cached information about the code is invalid? Correct. > > Additional awkwardness may result if the initial execution is > speculative, and the code on CPU 1 only synchronizes with the dlopen, > and not the previous dlclose because it does not know about it at all? I'm not sure I follow this last example. Can you explain further what you have in mind ? Thanks, Mathieu > > Thanks, > Florian > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:02 ` Mathieu Desnoyers 2025-01-10 17:10 ` Florian Weimer @ 2025-01-10 17:11 ` Peter Zijlstra 2025-01-10 18:41 ` Mark Rutland 2025-01-10 17:12 ` Adhemerval Zanella Netto 2 siblings, 1 reply; 18+ messages in thread From: Peter Zijlstra @ 2025-01-10 17:11 UTC (permalink / raw) To: Mathieu Desnoyers Cc: libc-alpha, Florian Weimer, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson On Fri, Jan 10, 2025 at 12:02:27PM -0500, Mathieu Desnoyers wrote: > On 2025-01-10 11:54, Peter Zijlstra wrote: > > On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: > > > Hi, > > > > > > I was discussing with Mark Rutland recently, and he pointed out that a > > > sequence of dlclose/dlopen mapping new code at the same addresses in > > > multithreaded environments is an issue on ARM, and possibly on Intel/AMD > > > with the newer TLB broadcast maintenance. > > > > What is the exact race? Should not munmap() invalidate the TLBs before > > it allows overlapping mmap() to complete? > > The race Mark mentioned (on ARM) is AFAIU the following scenario: > > CPU 0 CPU 1 > > - dlopen() > - mmap PROT_EXEC @addr > - fetch insn @addr, CPU state expects unchanged insn. > - execute unrelated code > - dlclose(addr) > - munmap @addr > - dlopen() > - mmap PROT_EXEC @addr > - fetch new insn @addr. Incoherent CPU state. Urgh.. Mark, is this because of non-coherent i-cache or somesuch misery? But shouldn't flush_{,i}cache_range() or something along those lines not handle this? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:11 ` Peter Zijlstra @ 2025-01-10 18:41 ` Mark Rutland 0 siblings, 0 replies; 18+ messages in thread From: Mark Rutland @ 2025-01-10 18:41 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathieu Desnoyers, libc-alpha, Florian Weimer, carlos@redhat.com, linux-kernel, x86, paulmck, Michael Jeanson, Andy Lutomirski, Will Deacon Hi Peter, [adding Andy and Will, since we've discussed related cases in the past] On Fri, Jan 10, 2025 at 06:11:12PM +0100, Peter Zijlstra wrote: > On Fri, Jan 10, 2025 at 12:02:27PM -0500, Mathieu Desnoyers wrote: > > On 2025-01-10 11:54, Peter Zijlstra wrote: > > > On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: > > > > Hi, > > > > > > > > I was discussing with Mark Rutland recently, and he pointed out that a > > > > sequence of dlclose/dlopen mapping new code at the same addresses in > > > > multithreaded environments is an issue on ARM, and possibly on Intel/AMD > > > > with the newer TLB broadcast maintenance. > > > > > > What is the exact race? Should not munmap() invalidate the TLBs before > > > it allows overlapping mmap() to complete? > > > > The race Mark mentioned (on ARM) is AFAIU the following scenario: > > > > CPU 0 CPU 1 > > > > - dlopen() > > - mmap PROT_EXEC @addr > > - fetch insn @addr, CPU state expects unchanged insn. > > - execute unrelated code > > - dlclose(addr) > > - munmap @addr > > - dlopen() > > - mmap PROT_EXEC @addr > > - fetch new insn @addr. Incoherent CPU state. For the benefit of others, what I specifically said was: | There's a fun (latent, been around forever) issue whereby reusing the | same VA for different code (e.g. dlopen() ... dlclose() ... dlopen()) | could blow up in a multi-threaded environment I hadn't reported this on a list yet because there are many subtleties, this is vanishingly unlikely to occur in practice today, and I didn't want to get people excited/confused/angry over an incomplete or misleading description. > Urgh.. Mark, is this because of non-coherent i-cache or somesuch misery? Sort-of. The key detail is that while instructions are being executed (including speculative execution), the CPU pipeline/OoO-engine/whatever effectively caches a copy of an instruction while it is "in-flight" (e.g. potentially broken down into micro-ops): On the ARM architecture, those in-flight copies are only guaranteed to be discarded by a context-synchronization-event, and are not guaranteed to be discarded due to TLB maintenance, data cache maintenance, or instruction cache maintenance. Instruction cache maintenance will guarantee that *subsequent* fetches from any instruction cache observe the new value. The first time a page of executable code is mapped in at a VA, this isn't a problem because there was nothing previously at that VA which could have been fetched from (since entering userspace, as exception return from kernel to user provides a context-synchronization-event). However, if some code A is mapped at a VA, then unmapped, then some distinct code B is mapped at that VA, then some CPUs might still have code A in-flight, regardless of TLB and cache maintenance, unless a context-synchronization-event occurs. Imagine you have a CPU microarchitecture with a long speculative execution window, and you have to threads running pinned on two CPUs sharing an address space, with some shared function pointer P which is initially NULL. Then you have something like: Thread 0 Thread 1 // Speculating some long-to-resolve // sequence of instructions. - mmap() code A at VA X - enters kernel - kernel loads data into page - kernel performs D$ + I$ maintenance - kernel updates page tables - returns to userspace // Begins speculating if (P) { P() }; // Begins speculating P(), predicted as // VA X. // Fetches code A from X into pipeline // Code A now in-flight - munmap() VA X - enters kernel - updates page tables - performs TLB maintenance (broadcast) - returns to userspace // Code A still in-flight - mmap() code B at VA X - enters kernel - kernel loads data into page - kernel performs D$ + I$ maintenance // Code A no longer in I$ // Code A still in-flight - kernel updates page tables - returns to userspace // Code A still in-flight - Publishes P as pointer in code B (e.g. WRITE_ONCE(P, X)) // Completes speculation of // long-to-resolve sequence. // Resolves P is VA X, and // commits speculation of code A ... and BANG, stale instructions executed. Note that on architectures that use IPIs for TLB invalidation (e.g. x86 today), the munmap() is likely to provide the serialization to discard the in-flight copy by virtue of the IPI. Practically speaking, actually hitting this is very unlikely because you need to get very unlucky with predication, the predicted instructions need to remain in-flight for a very long time without being discarded for other reasons (e.g. a mis-prediction, IRQ, etc), and the VA needs to be reused within that time window. > But shouldn't flush_{,i}cache_range() or something along those lines not > handle this? Unfortunately not, those only affect the explicit data/instruction caches, and not the in-flight copies of the instructions. Mark. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:02 ` Mathieu Desnoyers 2025-01-10 17:10 ` Florian Weimer 2025-01-10 17:11 ` Peter Zijlstra @ 2025-01-10 17:12 ` Adhemerval Zanella Netto 2 siblings, 0 replies; 18+ messages in thread From: Adhemerval Zanella Netto @ 2025-01-10 17:12 UTC (permalink / raw) To: Mathieu Desnoyers, Peter Zijlstra Cc: libc-alpha, Florian Weimer, carlos@redhat.com, Mark Rutland, linux-kernel, x86, paulmck, Michael Jeanson On 10/01/25 14:02, Mathieu Desnoyers wrote: > On 2025-01-10 11:54, Peter Zijlstra wrote: >> On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote: >>> Hi, >>> >>> I was discussing with Mark Rutland recently, and he pointed out that a >>> sequence of dlclose/dlopen mapping new code at the same addresses in >>> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >>> with the newer TLB broadcast maintenance. >> >> What is the exact race? Should not munmap() invalidate the TLBs before >> it allows overlapping mmap() to complete? > > The race Mark mentioned (on ARM) is AFAIU the following scenario: > > CPU 0 CPU 1 > > - dlopen() > - mmap PROT_EXEC @addr > - fetch insn @addr, CPU state expects unchanged insn. > - execute unrelated code It is not clear to me from userland/libc perspective how this would happen, since to dlopen get the same address you will need to either dlclose or call munmap. Either you have UB where some thread dclose a library while is being used by a different thread, or the thread will ended up executing a potentially different code it is intended to do. Do you have a realworld case where current glibc code show this issue? > - dlclose(addr) > - munmap @addr > - dlopen() > - mmap PROT_EXEC @addr > - fetch new insn @addr. Incoherent CPU state. > >> >> Any concurrent access after munmap() / before mmap() completes is UB >> anyway, no? > > The problematic access happens after the second mmap. The issue is > stale CPU state. > >> >>> I maintain the membarrier(2) system call, which provides a >>> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this >>> purpose. It's been there since Linux 4.16. It can be configured >>> out (CONFIG_MEMBARRIER=n), but it's enabled by default. >>> >>> Calling this after dlclose() in glibc would prevent this issue. >>> >>> Is it handled in some other way, or should we open a bugzilla >>> entry to track this ? >> >> The problem is that the membarrier() call has significant cost, and is >> only really needed if dlopen() is called right after (in the same >> location). > > Or if it has any overlapping executable range. > >> >> Unconditionally adding that barrier, just in case, might regress things, >> no? > > Or perhaps we could add this barrier within mprotect(2) and munmap(2) in the > following cases: > > - mprotect removes PROT_EXEC from a mapping, > - munmap unmaps a PROT_EXEC mapping. > > Else userspace has to explicitly invoke membarrier sync-core from dlclose. > > Thoughts ? > > Thanks, > > Mathieu > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 15:55 Prevent inconsistent CPU state after sequence of dlclose/dlopen Mathieu Desnoyers 2025-01-10 16:47 ` Adhemerval Zanella Netto 2025-01-10 16:54 ` Peter Zijlstra @ 2025-01-10 17:04 ` Florian Weimer 2025-01-10 17:13 ` Mathieu Desnoyers 2 siblings, 1 reply; 18+ messages in thread From: Florian Weimer @ 2025-01-10 17:04 UTC (permalink / raw) To: Mathieu Desnoyers Cc: libc-alpha, carlos@redhat.com, Mark Rutland, Peter Zijlstra, linux-kernel, x86, paulmck, Michael Jeanson * Mathieu Desnoyers: > I was discussing with Mark Rutland recently, and he pointed out that a > sequence of dlclose/dlopen mapping new code at the same addresses in > multithreaded environments is an issue on ARM, and possibly on Intel/AMD > with the newer TLB broadcast maintenance. > > I maintain the membarrier(2) system call, which provides a > MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this > purpose. It's been there since Linux 4.16. It can be configured > out (CONFIG_MEMBARRIER=n), but it's enabled by default. > > Calling this after dlclose() in glibc would prevent this issue. > > Is it handled in some other way, or should we open a bugzilla > entry to track this ? There is nothing special about dlopen/dlclose, we just use mmap/munmap. If there is a synchronization problem, we'd have to add to add barriers to mmap and munmap. But why isn't it up to the kernel to handle this correctly? Thanks, Florian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:04 ` Florian Weimer @ 2025-01-10 17:13 ` Mathieu Desnoyers 2025-01-10 18:33 ` Paul E. McKenney 0 siblings, 1 reply; 18+ messages in thread From: Mathieu Desnoyers @ 2025-01-10 17:13 UTC (permalink / raw) To: Florian Weimer Cc: libc-alpha, carlos@redhat.com, Mark Rutland, Peter Zijlstra, linux-kernel, x86, paulmck, Michael Jeanson On 2025-01-10 12:04, Florian Weimer wrote: > * Mathieu Desnoyers: > >> I was discussing with Mark Rutland recently, and he pointed out that a >> sequence of dlclose/dlopen mapping new code at the same addresses in >> multithreaded environments is an issue on ARM, and possibly on Intel/AMD >> with the newer TLB broadcast maintenance. >> >> I maintain the membarrier(2) system call, which provides a >> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this >> purpose. It's been there since Linux 4.16. It can be configured >> out (CONFIG_MEMBARRIER=n), but it's enabled by default. >> >> Calling this after dlclose() in glibc would prevent this issue. >> >> Is it handled in some other way, or should we open a bugzilla >> entry to track this ? > > There is nothing special about dlopen/dlclose, we just use mmap/munmap. > If there is a synchronization problem, we'd have to add to add barriers > to mmap and munmap. > > But why isn't it up to the kernel to handle this correctly? As I mentioned to Peter, we could add this barrier within mprotect(2) and munmap(2) in the following cases: - mprotect removes PROT_EXEC from a mapping, - munmap unmaps a PROT_EXEC mapping. We could even go further and batch this: we only need to issue membarrier-sync-core on the following sequence for an mm: On either of those, set current->mm->pending_membarrier_sync_core = true: - mprotect removes PROT_EXEC from a mapping, or - munmap unmaps a PROT_EXEC mapping, And then, if current->mm->pending_membarrier_sync_core == true when: - mmap is called to create a PROT_EXEC mapping, or - mprotect sets PROT_EXEC on a mapping. invoke membarrier sync-core and set current->mm_pending_membarrier = false Thoughts ? Thanks, Mathieu > > Thanks, > Florian > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen 2025-01-10 17:13 ` Mathieu Desnoyers @ 2025-01-10 18:33 ` Paul E. McKenney 0 siblings, 0 replies; 18+ messages in thread From: Paul E. McKenney @ 2025-01-10 18:33 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Florian Weimer, libc-alpha, carlos@redhat.com, Mark Rutland, Peter Zijlstra, linux-kernel, x86, Michael Jeanson On Fri, Jan 10, 2025 at 12:13:58PM -0500, Mathieu Desnoyers wrote: > On 2025-01-10 12:04, Florian Weimer wrote: > > * Mathieu Desnoyers: > > > > > I was discussing with Mark Rutland recently, and he pointed out that a > > > sequence of dlclose/dlopen mapping new code at the same addresses in > > > multithreaded environments is an issue on ARM, and possibly on Intel/AMD > > > with the newer TLB broadcast maintenance. > > > > > > I maintain the membarrier(2) system call, which provides a > > > MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command for this > > > purpose. It's been there since Linux 4.16. It can be configured > > > out (CONFIG_MEMBARRIER=n), but it's enabled by default. > > > > > > Calling this after dlclose() in glibc would prevent this issue. > > > > > > Is it handled in some other way, or should we open a bugzilla > > > entry to track this ? > > > > There is nothing special about dlopen/dlclose, we just use mmap/munmap. > > If there is a synchronization problem, we'd have to add to add barriers > > to mmap and munmap. > > > > But why isn't it up to the kernel to handle this correctly? > > As I mentioned to Peter, we could add this barrier within mprotect(2) > and munmap(2) in the following cases: > > - mprotect removes PROT_EXEC from a mapping, > - munmap unmaps a PROT_EXEC mapping. > > We could even go further and batch this: we only need to > issue membarrier-sync-core on the following sequence for an mm: > > On either of those, set current->mm->pending_membarrier_sync_core = true: > - mprotect removes PROT_EXEC from a mapping, or > - munmap unmaps a PROT_EXEC mapping, > > And then, if current->mm->pending_membarrier_sync_core == true when: > - mmap is called to create a PROT_EXEC mapping, or > - mprotect sets PROT_EXEC on a mapping. > > invoke membarrier sync-core and set > current->mm_pending_membarrier = false > > Thoughts ? In the case where the kernel is permitted to choose the address of the new mapping, we could batch the sys_membarrier() calls, keeping the corresponding address space out of service in the meantime. Don't tell me, a given dynamic library always picks the same address? :-/ Thanx, Paul ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2025-01-15 20:17 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-01-10 15:55 Prevent inconsistent CPU state after sequence of dlclose/dlopen Mathieu Desnoyers 2025-01-10 16:47 ` Adhemerval Zanella Netto 2025-01-15 20:16 ` Mathieu Desnoyers 2025-01-10 16:54 ` Peter Zijlstra 2025-01-10 17:02 ` Mathieu Desnoyers 2025-01-10 17:10 ` Florian Weimer 2025-01-10 17:14 ` Adhemerval Zanella Netto 2025-01-10 17:15 ` Mathieu Desnoyers 2025-01-10 17:24 ` Adhemerval Zanella Netto 2025-01-10 17:35 ` Mathieu Desnoyers 2025-01-10 17:46 ` Florian Weimer 2025-01-10 19:16 ` Mathieu Desnoyers 2025-01-10 17:11 ` Peter Zijlstra 2025-01-10 18:41 ` Mark Rutland 2025-01-10 17:12 ` Adhemerval Zanella Netto 2025-01-10 17:04 ` Florian Weimer 2025-01-10 17:13 ` Mathieu Desnoyers 2025-01-10 18:33 ` Paul E. McKenney
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox