From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>, x86 <x86@kernel.org>,
linux-kernel <linux-kernel@vger.kernel.org>,
Nicholas Piggin <npiggin@gmail.com>,
Arnd Bergmann <arnd@arndb.de>, Anton Blanchard <anton@ozlabs.org>
Subject: Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully
Date: Tue, 1 Dec 2020 15:51:42 -0500 (EST) [thread overview]
Message-ID: <466812913.70175.1606855902511.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <CALCETrUPiBOUEwJsLb8n0DJkHpTAYkFXJQuJ7swuRankvCzrRg@mail.gmail.com>
----- On Dec 1, 2020, at 1:48 PM, Andy Lutomirski luto@kernel.org wrote:
> On Tue, Dec 1, 2020 at 10:29 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> ----- On Dec 1, 2020, at 1:12 PM, Andy Lutomirski luto@kernel.org wrote:
>>
>> > On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers
>> > <mathieu.desnoyers@efficios.com> wrote:
>> >>
>> >> ----- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra peterz@infradead.org wrote:
>> >>
>> >> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
>> >> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
>> >> >> other CPUs, but there are two issues.
>> >> >>
>> >> >> - membarrier() does not sync_core() or rseq_preempt() the calling
>> >> >> CPU. Aside from the logic being mind-bending, this also means
>> >> >> that it may not be safe to modify user code through an alias,
>> >> >> call membarrier(), and then jump to a different executable alias
>> >> >> of the same code.
>> >> >
>> >> > I always understood this to be on purpose. The calling CPU can fix up
>> >> > itself just fine. The pain point is fixing up the other CPUs, and that's
>> >> > where membarrier() helps.
>> >>
>> >> Indeed, as documented in the man page:
>> >>
>> >> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
>> >> In addition to providing the memory ordering guarantees de‐
>> >> scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return from
>> >> system call the calling thread has a guarantee that all its run‐
>> >> ning thread siblings have executed a core serializing instruc‐
>> >> tion. This guarantee is provided only for threads in the same
>> >> process as the calling thread.
>> >>
>> >> membarrier sync core guarantees a core serializing instruction on the siblings,
>> >> not on the caller thread. This has been done on purpose given that the caller
>> >> thread can always issue its core serializing instruction from user-space on
>> >> its own.
>> >>
>> >> >
>> >> > That said, I don't mind including self, these aren't fast calls by any
>> >> > means.
>> >>
>> >> I don't mind including self either, but this would require documentation
>> >> updates, including man pages, to state that starting from kernel Y this
>> >> is the guaranteed behavior. It's then tricky for user-space to query what
>> >> the behavior is unless we introduce a new membarrier command for it. So this
>> >> could introduce issues if software written for the newer kernels runs on older
>> >> kernels.
>> >
>> > For rseq at least, if we do this now we don't have this issue -- I
>> > don't think any released kernel has the rseq mode.
>>
>> But for rseq, there is no core-sync. And considering that it is invalid
>> to issue a system call within an rseq critical section (including membarrier),
>> I don't see what we gain by doing a rseq barrier on self ?
>>
>> The only case where it really changes the semantic is for core-sync I think.
>> And in this case, it would be adding an additional core-sync on self. I
>> am OK with doing that considering that it will simplify use of the system
>> call. I'm just wondering how we should document this change in the man page.
>>
>> >
>> >>
>> >> >
>> >> >> - membarrier() does not explicitly sync_core() remote CPUs either;
>> >> >> instead, it relies on the assumption that an IPI will result in a
>> >> >> core sync. On x86, I think this may be true in practice, but
>> >> >> it's not architecturally reliable. In particular, the SDM and
>> >> >> APM do not appear to guarantee that interrupt delivery is
>> >> >> serializing.
>> >> >
>> >> > Right, I don't think we rely on that, we do rely on interrupt delivery
>> >> > providing order though -- as per the previous email.
>> >> >
>> >> >> On a preemptible kernel, IPI return can schedule,
>> >> >> thereby switching to another task in the same mm that was
>> >> >> sleeping in a syscall. The new task could then SYSRET back to
>> >> >> usermode without ever executing IRET.
>> >> >
>> >> > This; I think we all overlooked this scenario.
>> >>
>> >> Indeed, this is an issue which needs to be fixed.
>> >>
>> >> >
>> >> >> This patch simplifies the code to treat the calling CPU just like
>> >> >> all other CPUs, and explicitly sync_core() on all target CPUs. This
>> >> >> eliminates the need for the smp_mb() at the end of the function
>> >> >> except in the special case of a targeted remote membarrier(). This
>> >> >> patch updates that code and the comments accordingly.
>> >>
>> >> I am not confident that removing the smp_mb at the end of membarrier is
>> >> an appropriate change, nor that it simplifies the model.
>> >
>> > Ah, but I didn't remove it. I carefully made sure that every possible
>> > path through the function does an smp_mb() or stronger after all the
>> > cpu_rq reads. ipi_func(), on_each_cpu(), and the explicit smp_mb()
>> > cover the three cases.
>> >
>> > That being said, if you prefer, I can make the change to skip the
>> > calling CPU, in which case I'll leave the smp_mb() at the end alone.
>>
>> For the memory barrier commands, I prefer skipping self and leaving the
>> smp_mb at the very beginning/end of the system call. Those are the key
>> before/after points we are synchronizing against, and those are simple
>> to document.
>>
>
> Is there a reason that doing the barrier at the very end could make an
> observable difference? The two models are:
>
> membarrier() {
> smp_mb();
> read a bunch of cpu_rq memory and make decisions;
> execute smp_mb() on relevant cpus including self;
you forget the fact that you also add a smp_mb() after each
individual IPI returns, which is why you can get away with removing
the smp_mb at the end of the membarrier syscall without introducing
issues.
> }
>
> versus
>
> membarrier() {
> smp_mb();
> read a bunch of cpu_rq memory and make decisions;
> execute smp_mb() on relevant non-self cpus;
> wait for that to finish (acquire-style on the local cpu);
> smp_mb();
> }
>
> Is the idea that, on a sufficiently weakly ordered architecture, some
> remote CPU could do a store before the IPI and a local load after the
> membarrier() syscall might not observe the load unless the local
> smp_mb() is after the remote smp_mb()? If so, I'm not entirely
> convinced that this is observably different from the store simply
> occurring after the IPI, but maybe there are some gnarly situations in
> which this could happen.
>
> If your concern is something along these lines, I could try to write
> up an appropriate comment, and I'll rework the patch.
This is already documented in the scenarios I added as comments in the
patch sitting in the tip tree:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=25595eb6aaa9fbb31330f1e0b400642694bc6574
See "Scenario B Userspace thread execution before IPI vs membarrier's memory barrier after completing the IPI"
I think the change you proposed would be technically still OK:
- The callback on self issuing the smp_mb would take care of ensuring
that at least one memory barrier is issued after loading rq->curr
state for each cpu.
- The smp_mb after each ipi return would ensure we have barrier ordering
between the smp_mb in the ipi handler / before the membarrier system
call returns to userspace.
But then rather than having one clear spot where the smp_mb needs to be
placed and documented, we have a more complex maze of conditions we need
to consider. Hence my preference for keeping the smp_mb at the beginning/end
of the membarrier system call.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
next prev parent reply other threads:[~2020-12-01 20:52 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-11-30 17:50 [PATCH 0/3] membarrier fixes Andy Lutomirski
2020-11-30 17:50 ` [PATCH 1/3] x86/membarrier: Get rid of a dubious optimization Andy Lutomirski
2020-12-01 14:39 ` Mathieu Desnoyers
2020-12-01 17:47 ` Andy Lutomirski
2020-11-30 17:50 ` [PATCH 2/3] membarrier: Add an actual barrier before rseq_preempt() Andy Lutomirski
2020-12-01 10:06 ` Peter Zijlstra
2020-12-01 14:31 ` Mathieu Desnoyers
2020-12-01 17:55 ` Andy Lutomirski
2020-11-30 17:50 ` [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully Andy Lutomirski
2020-12-01 10:16 ` Peter Zijlstra
2020-12-01 14:28 ` Mathieu Desnoyers
2020-12-01 18:12 ` Andy Lutomirski
2020-12-01 18:29 ` Mathieu Desnoyers
2020-12-01 18:48 ` Andy Lutomirski
2020-12-01 20:51 ` Mathieu Desnoyers [this message]
2020-12-01 18:09 ` Andy Lutomirski
2020-12-01 18:53 ` Peter Zijlstra
2020-12-01 18:55 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=466812913.70175.1606855902511.JavaMail.zimbra@efficios.com \
--to=mathieu.desnoyers@efficios.com \
--cc=anton@ozlabs.org \
--cc=arnd@arndb.de \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@kernel.org \
--cc=npiggin@gmail.com \
--cc=peterz@infradead.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.