* [PATCH AUTOSEL 5.10 06/31] powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
From: Sasha Levin @ 2020-12-30 13:02 UTC (permalink / raw)
To: linux-kernel, stable; +Cc: Sasha Levin, linuxppc-dev, Qinglang Miao
In-Reply-To: <20201230130314.3636961-1-sashal@kernel.org>
From: Qinglang Miao <miaoqinglang@huawei.com>
[ Upstream commit ffa1797040c5da391859a9556be7b735acbe1242 ]
I noticed that iounmap() of msgr_block_addr before return from
mpic_msgr_probe() in the error handling case is missing. So use
devm_ioremap() instead of just ioremap() when remapping the message
register block, so the mapping will be automatically released on
probe failure.
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201028091551.136400-1-miaoqinglang@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/powerpc/sysdev/mpic_msgr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/sysdev/mpic_msgr.c b/arch/powerpc/sysdev/mpic_msgr.c
index f6b253e2be409..36ec0bdd8b63c 100644
--- a/arch/powerpc/sysdev/mpic_msgr.c
+++ b/arch/powerpc/sysdev/mpic_msgr.c
@@ -191,7 +191,7 @@ static int mpic_msgr_probe(struct platform_device *dev)
/* IO map the message register block. */
of_address_to_resource(np, 0, &rsrc);
- msgr_block_addr = ioremap(rsrc.start, resource_size(&rsrc));
+ msgr_block_addr = devm_ioremap(&dev->dev, rsrc.start, resource_size(&rsrc));
if (!msgr_block_addr) {
dev_err(&dev->dev, "Failed to iomap MPIC message registers");
return -EFAULT;
--
2.27.0
^ permalink raw reply related
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Nicholas Piggin @ 2020-12-30 11:57 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: paulmck, Arnd Bergmann, Jann Horn, Peter Zijlstra, x86,
linux-kernel, stable, Will Deacon, Mathieu Desnoyers,
Andy Lutomirski, Catalin Marinas, Paul Mackerras, linuxppc-dev,
linux-arm-kernel
In-Reply-To: <20201230105847.GQ1551@shell.armlinux.org.uk>
Excerpts from Russell King - ARM Linux admin's message of December 30, 2020 8:58 pm:
> On Wed, Dec 30, 2020 at 10:00:28AM +0000, Russell King - ARM Linux admin wrote:
>> On Wed, Dec 30, 2020 at 12:33:02PM +1000, Nicholas Piggin wrote:
>> > Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm:
>> > > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:
>> > >> I think it should certainly be documented in terms of what guarantees
>> > >> it provides to application, _not_ the kinds of instructions it may or
>> > >> may not induce the core to execute. And if existing API can't be
>> > >> re-documented sanely, then deprecatd and new ones added that DTRT.
>> > >> Possibly under a new system call, if arch's like ARM want a range
>> > >> flush and we don't want to expand the multiplexing behaviour of
>> > >> membarrier even more (sigh).
>> > >
>> > > The 32-bit ARM sys_cacheflush() is there only to support self-modifying
>> > > code, and takes whatever actions are necessary to support that.
>> > > Exactly what actions it takes are cache implementation specific, and
>> > > should be of no concern to the caller, but the underlying thing is...
>> > > it's to support self-modifying code.
>> >
>> > Caveat
>> > cacheflush() should not be used in programs intended to be portable.
>> > On Linux, this call first appeared on the MIPS architecture, but nowa‐
>> > days, Linux provides a cacheflush() system call on some other architec‐
>> > tures, but with different arguments.
>> >
>> > What a disaster. Another badly designed interface, although it didn't
>> > originate in Linux it sounds like we weren't to be outdone so
>> > we messed it up even worse.
>> >
>> > flushing caches is neither necessary nor sufficient for code modification
>> > on many processors. Maybe some old MIPS specific private thing was fine,
>> > but certainly before it grew to other architectures, somebody should
>> > have thought for more than 2 minutes about it. Sigh.
>>
>> WARNING: You are bordering on being objectionable and offensive with
>> that comment.
>>
>> The ARM interface was designed by me back in the very early days of
>> Linux, probably while you were still in dypers, based on what was
>> known at the time. Back in the early 2000s, ideas such as relaxed
>> memory ordering were not known. All there was was one level of
>> harvard cache.
I wasn't talking about memory ordering at all, and I assumed it
came earlier and was brought to Linux for portability reasons -
CONFORMING TO
Historically, this system call was available on all MIPS UNIX variants
including RISC/os, IRIX, Ultrix, NetBSD, OpenBSD, and FreeBSD (and also
on some non-UNIX MIPS operating systems), so that the existence of this
call in MIPS operating systems is a de-facto standard.
I don't think the call was totally unreasonable for incoherent virtual
caches or incoherent i/d caches etc. Although early unix system call interface
demonstrates that people understood how to define good interfaces that dealt
with intent at an abstract level rather than implementation -- munmap
doesn't specify anywhere that a TLB flush instruction must be executed,
for example. So "cacheflush" was obviously never a well designed interface
but rather the typical hardware-centric hack to get their stuff working
(which was fine for its purpose I'm sure).
>
> Sorry, I got that slightly wrong. Its origins on ARM were from 12 Dec
> 1998:
>
> http://www.armlinux.org.uk/developer/patches/viewpatch.php?id=88/1
>
> by Philip Blundell, and first appeared in the ARM
> pre-patch-2.1.131-19981214-1.gz. It was subsequently documented in the
> kernel sources by me in July 2001 in ARM patch-2.4.6-rmk2.gz. It has
> a slightly different signature: the third argument on ARM is a flags
> argument, whereas the MIPS code, it is some undocumented "cache"
> parameter.
>
> Whether the ARM version pre or post dates the MIPS code, I couldn't say.
> Whether it was ultimately taken from the MIPS implementation, again, I
> couldn't say.
I can, it was in MIPS in late 1.3 kernels at least. I would guess it
came from IRIX.
> However, please stop insulting your fellow developers ability to think.
Sorry, I was being melodramatic. Everyone makes mistakes or decisions
which with hindsight could have been better or were under some
constraint that isn't apparent. I shouldn't have suggested it indicated
thoughtlessness.
Thanks,
Nick
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Russell King - ARM Linux admin @ 2020-12-30 10:58 UTC (permalink / raw)
To: Nicholas Piggin
Cc: paulmck, Arnd Bergmann, Jann Horn, Peter Zijlstra,
Catalin Marinas, x86, linux-kernel, stable, Will Deacon,
Mathieu Desnoyers, Andy Lutomirski, Paul Mackerras, linuxppc-dev,
linux-arm-kernel
In-Reply-To: <20201230100028.GP1551@shell.armlinux.org.uk>
On Wed, Dec 30, 2020 at 10:00:28AM +0000, Russell King - ARM Linux admin wrote:
> On Wed, Dec 30, 2020 at 12:33:02PM +1000, Nicholas Piggin wrote:
> > Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm:
> > > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:
> > >> I think it should certainly be documented in terms of what guarantees
> > >> it provides to application, _not_ the kinds of instructions it may or
> > >> may not induce the core to execute. And if existing API can't be
> > >> re-documented sanely, then deprecatd and new ones added that DTRT.
> > >> Possibly under a new system call, if arch's like ARM want a range
> > >> flush and we don't want to expand the multiplexing behaviour of
> > >> membarrier even more (sigh).
> > >
> > > The 32-bit ARM sys_cacheflush() is there only to support self-modifying
> > > code, and takes whatever actions are necessary to support that.
> > > Exactly what actions it takes are cache implementation specific, and
> > > should be of no concern to the caller, but the underlying thing is...
> > > it's to support self-modifying code.
> >
> > Caveat
> > cacheflush() should not be used in programs intended to be portable.
> > On Linux, this call first appeared on the MIPS architecture, but nowa‐
> > days, Linux provides a cacheflush() system call on some other architec‐
> > tures, but with different arguments.
> >
> > What a disaster. Another badly designed interface, although it didn't
> > originate in Linux it sounds like we weren't to be outdone so
> > we messed it up even worse.
> >
> > flushing caches is neither necessary nor sufficient for code modification
> > on many processors. Maybe some old MIPS specific private thing was fine,
> > but certainly before it grew to other architectures, somebody should
> > have thought for more than 2 minutes about it. Sigh.
>
> WARNING: You are bordering on being objectionable and offensive with
> that comment.
>
> The ARM interface was designed by me back in the very early days of
> Linux, probably while you were still in dypers, based on what was
> known at the time. Back in the early 2000s, ideas such as relaxed
> memory ordering were not known. All there was was one level of
> harvard cache.
Sorry, I got that slightly wrong. Its origins on ARM were from 12 Dec
1998:
http://www.armlinux.org.uk/developer/patches/viewpatch.php?id=88/1
by Philip Blundell, and first appeared in the ARM
pre-patch-2.1.131-19981214-1.gz. It was subsequently documented in the
kernel sources by me in July 2001 in ARM patch-2.4.6-rmk2.gz. It has
a slightly different signature: the third argument on ARM is a flags
argument, whereas the MIPS code, it is some undocumented "cache"
parameter.
Whether the ARM version pre or post dates the MIPS code, I couldn't say.
Whether it was ultimately taken from the MIPS implementation, again, I
couldn't say.
However, please stop insulting your fellow developers ability to think.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Russell King - ARM Linux admin @ 2020-12-30 10:00 UTC (permalink / raw)
To: Nicholas Piggin
Cc: linuxppc-dev, Catalin Marinas, paulmck, Arnd Bergmann, Jann Horn,
Peter Zijlstra, x86, linux-kernel, stable, Mathieu Desnoyers,
Andy Lutomirski, Paul Mackerras, Will Deacon, linux-arm-kernel
In-Reply-To: <1609290821.wrfh89v23a.astroid@bobo.none>
On Wed, Dec 30, 2020 at 12:33:02PM +1000, Nicholas Piggin wrote:
> Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm:
> > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:
> >> I think it should certainly be documented in terms of what guarantees
> >> it provides to application, _not_ the kinds of instructions it may or
> >> may not induce the core to execute. And if existing API can't be
> >> re-documented sanely, then deprecatd and new ones added that DTRT.
> >> Possibly under a new system call, if arch's like ARM want a range
> >> flush and we don't want to expand the multiplexing behaviour of
> >> membarrier even more (sigh).
> >
> > The 32-bit ARM sys_cacheflush() is there only to support self-modifying
> > code, and takes whatever actions are necessary to support that.
> > Exactly what actions it takes are cache implementation specific, and
> > should be of no concern to the caller, but the underlying thing is...
> > it's to support self-modifying code.
>
> Caveat
> cacheflush() should not be used in programs intended to be portable.
> On Linux, this call first appeared on the MIPS architecture, but nowa‐
> days, Linux provides a cacheflush() system call on some other architec‐
> tures, but with different arguments.
>
> What a disaster. Another badly designed interface, although it didn't
> originate in Linux it sounds like we weren't to be outdone so
> we messed it up even worse.
>
> flushing caches is neither necessary nor sufficient for code modification
> on many processors. Maybe some old MIPS specific private thing was fine,
> but certainly before it grew to other architectures, somebody should
> have thought for more than 2 minutes about it. Sigh.
WARNING: You are bordering on being objectionable and offensive with
that comment.
The ARM interface was designed by me back in the very early days of
Linux, probably while you were still in dypers, based on what was
known at the time. Back in the early 2000s, ideas such as relaxed
memory ordering were not known. All there was was one level of
harvard cache.
So, juts shut up with your insults.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
^ permalink raw reply
* [PATCH] ibmvnic: fix: NULL pointer dereference.
From: YANG LI @ 2020-12-30 7:23 UTC (permalink / raw)
To: davem
Cc: linux-kernel, YANG LI, netdev, ljp, drt, kuba, sukadev,
linuxppc-dev, paulus
The error is due to dereference a null pointer in function
reset_one_sub_crq_queue():
if (!scrq) {
netdev_dbg(adapter->netdev,
"Invalid scrq reset. irq (%d) or msgs(%p).\n",
scrq->irq, scrq->msgs);
return -EINVAL;
}
If the expression is true, scrq must be a null pointer and cannot
dereference.
Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci <abaci@linux.alibaba.com>
---
drivers/net/ethernet/ibm/ibmvnic.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index f302504..d7472be 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2981,9 +2981,7 @@ static int reset_one_sub_crq_queue(struct ibmvnic_adapter *adapter,
int rc;
if (!scrq) {
- netdev_dbg(adapter->netdev,
- "Invalid scrq reset. irq (%d) or msgs (%p).\n",
- scrq->irq, scrq->msgs);
+ netdev_dbg(adapter->netdev, "Invalid scrq reset.\n");
return -EINVAL;
}
--
1.8.3.1
^ permalink raw reply related
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Nicholas Piggin @ 2020-12-30 2:33 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: paulmck, Arnd Bergmann, Jann Horn, Peter Zijlstra, x86,
linux-kernel, stable, Will Deacon, Mathieu Desnoyers,
Andy Lutomirski, Catalin Marinas, Paul Mackerras, linuxppc-dev,
linux-arm-kernel
In-Reply-To: <20201229104456.GK1551@shell.armlinux.org.uk>
Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm:
> On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:
>> I think it should certainly be documented in terms of what guarantees
>> it provides to application, _not_ the kinds of instructions it may or
>> may not induce the core to execute. And if existing API can't be
>> re-documented sanely, then deprecatd and new ones added that DTRT.
>> Possibly under a new system call, if arch's like ARM want a range
>> flush and we don't want to expand the multiplexing behaviour of
>> membarrier even more (sigh).
>
> The 32-bit ARM sys_cacheflush() is there only to support self-modifying
> code, and takes whatever actions are necessary to support that.
> Exactly what actions it takes are cache implementation specific, and
> should be of no concern to the caller, but the underlying thing is...
> it's to support self-modifying code.
Caveat
cacheflush() should not be used in programs intended to be portable.
On Linux, this call first appeared on the MIPS architecture, but nowa‐
days, Linux provides a cacheflush() system call on some other architec‐
tures, but with different arguments.
What a disaster. Another badly designed interface, although it didn't
originate in Linux it sounds like we weren't to be outdone so
we messed it up even worse.
flushing caches is neither necessary nor sufficient for code modification
on many processors. Maybe some old MIPS specific private thing was fine,
but certainly before it grew to other architectures, somebody should
have thought for more than 2 minutes about it. Sigh.
>
> Sadly, because it's existed for 20+ years, and it has historically been
> sufficient for other purposes too, it has seen quite a bit of abuse
> despite its design purpose not changing - it's been used by graphics
> drivers for example. They quickly learnt the error of their ways with
> ARMv6+, since it does not do sufficient for their purposes given the
> cache architectures found there.
>
> Let's not go around redesigning this after twenty odd years, requiring
> a hell of a lot of pain to users. This interface is called by code
> generated by GCC, so to change it you're looking at patching GCC as
> well as the kernel, and you basically will make new programs
> incompatible with older kernels - very bad news for users.
For something to be redesigned it had to have been designed in the first
place, so there is no danger of that don't worry... But no I never
suggested making incompatible changes to any existing system call, I
said "re-documented". And yes I said deprecated but in Linux that really
means kept indefinitely.
If ARM, MIPS, 68k etc programs and toolchains keep using what they are
using it'll keep working no problem.
The point is we're growing new interfaces, and making the same mistakes.
It's not portable (ARCH_HAS_MEMBARRIER_SYNC_CORE), it's also specified
in terms of low level processor operations rather than higher level
intent, and also is not sufficient for self-modifying code (without
additional cache flush on some processors).
The application wants a call that says something like "memory modified
before the call will be visible as instructions (including illegal
instructions) by all threads in the program after the system call
returns, and no threads will be subject to any effects of executing the
previous contents of that memory.
So I think the basics are simple (although should confirm with some JIT
and debugger etc developers, and not just Android mind you). There are
some complications in details, address ranges, virtual/physical, thread
local vs process vs different process or system-wide, memory ordering
and propagation of i and d sides, etc. But that can be worked through,
erring on the side of sanity rather than pointless micro-optmisations.
Thanks,
Nick
^ permalink raw reply
* [PATCH] powerpc/mm: add sanity check to avoid null pointer dereference
From: Defang Bo @ 2020-12-28 3:10 UTC (permalink / raw)
To: mpe, benh, paulus, christophe.leroy, akpm, geert, rppt,
linuxppc-dev, linux-kernel
Cc: Defang Bo
Similar to commit<0dc294f717d4>("powerpc/mm: bail out early when flushing TLB page"),
there should be a check for 'mm' to prevent Null pointer dereference
in case of 'mm' argument was legitimately passed.
Signed-off-by: Defang Bo <bodefang@126.com>
---
arch/powerpc/mm/nohash/tlb.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index 5872f69..1d89335 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -192,6 +192,9 @@ void local_flush_tlb_mm(struct mm_struct *mm)
{
unsigned int pid;
+ if (WARN_ON(!mm))
+ return;
+
preempt_disable();
pid = mm->context.id;
if (pid != MMU_NO_CONTEXT)
@@ -205,8 +208,11 @@ void __local_flush_tlb_page(struct mm_struct *mm, unsigned long vmaddr,
{
unsigned int pid;
+ if (WARN_ON(!mm))
+ return;
+
preempt_disable();
- pid = mm ? mm->context.id : 0;
+ pid = mm->context.id;
if (pid != MMU_NO_CONTEXT)
_tlbil_va(vmaddr, pid, tsize, ind);
preempt_enable();
@@ -268,6 +274,9 @@ void flush_tlb_mm(struct mm_struct *mm)
{
unsigned int pid;
+ if (WARN_ON(!mm))
+ return;
+
preempt_disable();
pid = mm->context.id;
if (unlikely(pid == MMU_NO_CONTEXT))
--
2.7.4
^ permalink raw reply related
* [PATCH] powerpc/mm: avoid null pointer dereference in flush_tlb_mm
From: Defang Bo @ 2020-12-25 7:11 UTC (permalink / raw)
To: mpe, benh, paulus
Cc: linux-kernel, penberg, Defang Bo, akpm, linuxppc-dev, rppt
Similar to commit<0dc294f7>, there should be a check for 'mm' to prevent Null pointer dereference in case of 'mm' argument was legitimately passed.
Signed-off-by: Defang Bo <bodefang@126.com>
---
arch/powerpc/mm/nohash/tlb.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index 5872f69..09796c9 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -268,6 +268,9 @@ void flush_tlb_mm(struct mm_struct *mm)
{
unsigned int pid;
+ if (unlikely(!mm))
+ return;
+
preempt_disable();
pid = mm->context.id;
if (unlikely(pid == MMU_NO_CONTEXT))
--
2.7.4
^ permalink raw reply related
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Russell King - ARM Linux admin @ 2020-12-29 10:44 UTC (permalink / raw)
To: Nicholas Piggin
Cc: paulmck, Arnd Bergmann, Jann Horn, Peter Zijlstra, x86,
linux-kernel, stable, Will Deacon, Mathieu Desnoyers,
Andy Lutomirski, Catalin Marinas, Paul Mackerras, linuxppc-dev,
linux-arm-kernel
In-Reply-To: <1609210162.4d8dqilke6.astroid@bobo.none>
On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote:
> I think it should certainly be documented in terms of what guarantees
> it provides to application, _not_ the kinds of instructions it may or
> may not induce the core to execute. And if existing API can't be
> re-documented sanely, then deprecatd and new ones added that DTRT.
> Possibly under a new system call, if arch's like ARM want a range
> flush and we don't want to expand the multiplexing behaviour of
> membarrier even more (sigh).
The 32-bit ARM sys_cacheflush() is there only to support self-modifying
code, and takes whatever actions are necessary to support that.
Exactly what actions it takes are cache implementation specific, and
should be of no concern to the caller, but the underlying thing is...
it's to support self-modifying code.
Sadly, because it's existed for 20+ years, and it has historically been
sufficient for other purposes too, it has seen quite a bit of abuse
despite its design purpose not changing - it's been used by graphics
drivers for example. They quickly learnt the error of their ways with
ARMv6+, since it does not do sufficient for their purposes given the
cache architectures found there.
Let's not go around redesigning this after twenty odd years, requiring
a hell of a lot of pain to users. This interface is called by code
generated by GCC, so to change it you're looking at patching GCC as
well as the kernel, and you basically will make new programs
incompatible with older kernels - very bad news for users.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
^ permalink raw reply
* unsubscribe
From: Shawn Landden @ 2020-12-29 8:54 UTC (permalink / raw)
To: linuxppc-dev
--
Shawn Landden
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Nicholas Piggin @ 2020-12-29 3:31 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Arnd Bergmann, X86 ML, LKML, stable, Will Deacon,
Mathieu Desnoyers, Catalin Marinas, Paul Mackerras, linuxppc-dev,
linux-arm-kernel
In-Reply-To: <CALCETrX4v1KEf6ikVtFg6juh3Z_esJ-+6PLT1A21JJeTVh2k8g@mail.gmail.com>
Excerpts from Andy Lutomirski's message of December 29, 2020 10:36 am:
> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am:
>> > The old sync_core_before_usermode() comments said that a non-icache-syncing
>> > return-to-usermode instruction is x86-specific and that all other
>> > architectures automatically notice cross-modified code on return to
>> > userspace. Based on my general understanding of how CPUs work and based on
>> > my atttempt to read the ARM manual, this is not true at all. In fact, x86
>> > seems to be a bit of an anomaly in the other direction: x86's IRET is
>> > unusually heavyweight for a return-to-usermode instruction.
>>
>> "sync_core_before_usermode" as I've said says nothing to arch, or to the
>> scheduler, or to membarrier.
>
> Agreed. My patch tries to fix this. I agree that the name is bad and
> could be improved further. We should define what
> membarrier(...SYNC_CORE) actually does and have arch hooks to make it
> happen.
>
>> > So let's drop any pretense that we can have a generic way implementation
>> > behind membarrier's SYNC_CORE flush and require all architectures that opt
>> > in to supply their own. This means x86, arm64, and powerpc for now. Let's
>> > also rename the function from sync_core_before_usermode() to
>> > membarrier_sync_core_before_usermode() because the precise flushing details
>> > may very well be specific to membarrier, and even the concept of
>> > "sync_core" in the kernel is mostly an x86-ism.
>>
>> The concept of "sync_core" (x86: serializing instruction, powerpc: context
>> synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted
>> to add a serializing instruction to generic code so it grew this nasty API,
>> but the concept applies broadly.
>
> I mean that the mapping from the name "sync_core" to its semantics is
> x86 only. The string "sync_core" appears in the kernel only in
> arch/x86, membarrier code, membarrier docs, and a single SGI driver
> that is x86-only. Sure, the idea of serializing things is fairly
> generic, but exactly what operations serialize what, when things need
> serialization, etc is quite architecture specific.
Okay, well yes it's x86 only in name, I was more talking about the
concept.
> Heck, on 486 you serialize the instruction stream with JMP.
x86-specific aside, I did think the semantics of a "serializing
instruction" was reasonably well architected in x86. Sure it could do
other things as well, but if you executed a serializing instruction,
then you had a decent set of guarantees (e.g., what you might want
for code modification).
>
>> > +static inline void membarrier_sync_core_before_usermode(void)
>> > +{
>> > + /*
>> > + * XXX: I know basically nothing about powerpc cache management.
>> > + * Is this correct?
>> > + */
>> > + isync();
>>
>> This is not about memory ordering or cache management, it's about
>> pipeline management. Powerpc's return to user mode serializes the
>> CPU (aka the hardware thread, _not_ the core; another wrongness of
>> the name, but AFAIKS the HW thread is what is required for
>> membarrier). So this is wrong, powerpc needs nothing here.
>
> Fair enough. I'm happy to defer to you on the powerpc details. In
> any case, this just illustrates that we need feedback from a person
> who knows more about ARM64 than I do.
>
Thanks,
Nick
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Nicholas Piggin @ 2020-12-29 3:09 UTC (permalink / raw)
To: Andy Lutomirski
Cc: paulmck, Arnd Bergmann, Jann Horn, Peter Zijlstra, x86,
Russell King, ARM Linux, stable, linux-kernel, Will Deacon,
Mathieu Desnoyers, Catalin Marinas, Paul Mackerras, linuxppc-dev,
linux-arm-kernel
In-Reply-To: <CALCETrX6MOqmN5_jhyO1jJB7M3_T+hbomjxPYZLJmLVNmXAVzA@mail.gmail.com>
Excerpts from Andy Lutomirski's message of December 29, 2020 10:56 am:
> On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:
>> > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
>> > <mathieu.desnoyers@efficios.com> wrote:
>> >>
>> >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:
>> >>
>> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
>> >> > <linux@armlinux.org.uk> wrote:
>> >> >>
>> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> >> >> > After chatting with rmk about this (but without claiming that any of
>> >> >> > this is his opinion), based on the manpage, I think membarrier()
>> >> >> > currently doesn't really claim to be synchronizing caches? It just
>> >> >> > serializes cores. So arguably if userspace wants to use membarrier()
>> >> >> > to synchronize code changes, userspace should first do the code
>> >> >> > change, then flush icache as appropriate for the architecture, and
>> >> >> > then do the membarrier() to ensure that the old code is unused?
>> >>
>> >> ^ exactly, yes.
>> >>
>> >> >> >
>> >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> >> >> > syscall. That might cause you to end up with two IPIs instead of one
>> >> >> > in total, but we probably don't care _that_ much about extra IPIs on
>> >> >> > 32-bit arm?
>> >>
>> >> This was the original thinking, yes. The cacheflush IPI will flush specific
>> >> regions of code, and the membarrier IPI issues context synchronizing
>> >> instructions.
>>
>> APIs should be written in terms of the service they provide to
>> userspace, and in highest level terms as possible, rather than directing
>> hardware to do some low level operation. Unfortunately we're stuck with
>> this for now. We could deprecate it and replace it though.
>>
>> If userspace wants to modify code and ensure that after the system call
>> returns then no other thread will be executing the previous code, then
>> there should be an API for that. It could actually combine the two IPIs
>> into one for architectures that require both too.
>
> I agree. The membarrier API for SYNC_CORE is pretty nasty. I would
> much prefer a real API for JITs to use.
>
>>
>> >>
>> >> Architectures with coherent i/d caches don't need the cacheflush step.
>> >
>> > There are different levels of coherency -- VIVT architectures may have
>> > differing requirements compared to PIPT, etc.
>> >
>> > In any case, I feel like the approach taken by the documentation is
>> > fundamentally confusing. Architectures don't all speak the same
>> > language How about something like:
>> >
>> > The SYNC_CORE operation causes all threads in the caller's address
>> > space (including the caller) to execute an architecture-defined
>> > barrier operation. membarrier() will ensure that this barrier is
>> > executed at a time such that all data writes done by the calling
>> > thread before membarrier() are made visible by the barrier.
>> > Additional architecture-dependent cache management operations may be
>> > required to use this for JIT code.
>>
>> As said this isn't what SYNC_CORE does, and it's not what powerpc
>> context synchronizing instructions do either, it will very much
>> re-order visibility of stores around such an instruction.
>
> Perhaps the docs should be entirely arch-specific. It may well be
> impossible to state what it does in an arch-neutral way.
I think what I wrote above -- after the call returns, none of the
threads in the process will be executing instructions overwritten
previously by the caller (provided their i-caches are made coherent
with the caller's modifications).
>> A thread completes store instructions into a store queue, which is
>> as far as a context synchronizing event goes. Visibility comes at
>> some indeterminite time later.
>
> As currently implemented, it has the same visibility semantics as
> regular membarrier, too.
Ah I actually missed that SYNC_CORE is in _addition_ to the memory
barriers, and that's documented API too, not just implementation
sorry.
> So if I do:
>
> a = 1;
> membarrier(SYNC_CORE);
> b = 1;
>
> and another thread does:
>
> while (READ_ONCE(b) != 1)
> ;
> barrier();
> assert(a == 1);
Right that's true, due to the MEMBARRIER_CMD_PRIVATE_EXPEDITED. Neither
that barrier or the added SYNC_CORE behaviour imply visibility though.
>
> then the assertion will pass. Similarly, one can do this, I hope:
>
> memcpy(codeptr, [some new instructions], len);
> arch_dependent_cache_flush(codeptr, len);
> membarrier(SYNC_CORE);
> ready = 1;
>
> and another thread does:
>
> while (READ_ONCE(ready) != 1)
> ;
> barrier();
> (*codeptr)();
>
> arch_dependent_cache_flush is a nop on x86. On arm and arm64, it
> appears to be a syscall, although maybe arm64 can do it from
> userspace. I still don't know what it is on powerpc.
>
> Even using the term "cache" here is misleading. x86 chips have all
> kinds of barely-documented instruction caches, and they have varying
> degrees of coherency. The architecture actually promises that, if you
> do a certain incantation, then you get the desired result.
> membarrier() allows a user to do this incantation. But trying to
> replicate the incantation verbatim on an architecture like ARM is
> insufficient, and trying to flush the things that are documented as
> being caches on x86 is expensive and a complete waste of time on x86.
> When you write some JIT code, you do *not* want to flush it all the
> way to main memory, especially on CPUs don't have the ability to write
> back invalidating. (That's most CPUs.)
>
> Even on x86, I suspect that the various decoded insn caches are rather
> more coherent than documented, and I have questions in to Intel about
> this. No answers yet.
>
> So perhaps the right approach is to say that membarrier() helps you
> perform the architecture-specific sequence of steps needed to safely
> modify code. On x86, you use it like this. On arm64, you do this
> other thing. On powerpc, you do something else.
I think it should certainly be documented in terms of what guarantees
it provides to application, _not_ the kinds of instructions it may or
may not induce the core to execute. And if existing API can't be
re-documented sanely, then deprecatd and new ones added that DTRT.
Possibly under a new system call, if arch's like ARM want a range
flush and we don't want to expand the multiplexing behaviour of
membarrier even more (sigh).
>
>>
>> I would be surprised if x86's serializing instructions were different
>> than powerpc. rdtsc ordering or flushing stores to cache would be
>> surprising.
>>
>
> At the very least, x86 has several levels of what ARM might call
> "context synchronization" AFAICT. STAC, CLAC, and POPF do a form of
> context synchronization in that the changes they cause to the MMU take
> effect immediately, but they are not documented as synchronizing the
> instruction stream. "Serializing" instructions do all kinds of
> things, not all of which may be architecturally visible at all.
> MFENCE and LFENCE do various complicated things, and LFENCE has magic
> retroactive capabilities on old CPUs that were not documented when
> those CPUs were released. SFENCE does a different form of
> synchronization entirely. LOCK does something else. (The
> relationship between LOCK and MFENCE is confusing at best.) RDTSC
> doesn't serialize anything at all, but RDTSCP does provide a form of
> serialization that's kind of ilke LFENCE. It's a mess. Even the
> manuals are inconsistent about what "serialize" means. JMP has its
> own magic on x86, but only on very very old CPUs.
>
> The specific instruction that flushes everything into the coherency
> domain is SFENCE, and SFENCE is not, for normal purposes, needed for
> self- or cross-modifying code.
>
Good reason to avoid such language in the system call interface!
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH 4/5] ibmvfc: complete commands outside the host/queue lock
From: kernel test robot @ 2020-12-29 2:29 UTC (permalink / raw)
To: Tyrel Datwyler, james.bottomley
Cc: Tyrel Datwyler, kbuild-all, martin.petersen, linux-scsi,
linux-kernel, Brian King, brking, linuxppc-dev
In-Reply-To: <20201218231916.279833-5-tyreld@linux.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 5008 bytes --]
Hi Tyrel,
I love your patch! Perhaps something to improve:
[auto build test WARNING on next-20201218]
[also build test WARNING on v5.11-rc1]
[cannot apply to powerpc/next v5.10 v5.10-rc7 v5.10-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Tyrel-Datwyler/ibmvfc-MQ-preparatory-locking-work/20201219-072334
base: 0d52778b8710eb11cb616761a02aee0a7fd60425
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/2609a2691baaf1f06d8306a56575ae2eb076f625
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Tyrel-Datwyler/ibmvfc-MQ-preparatory-locking-work/20201219-072334
git checkout 2609a2691baaf1f06d8306a56575ae2eb076f625
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>):
>> drivers/scsi/ibmvscsi/ibmvfc.c:1418:6: warning: no previous prototype for 'ibmvfc_locked_done' [-Wmissing-prototypes]
1418 | void ibmvfc_locked_done(struct ibmvfc_event *evt)
| ^~~~~~~~~~~~~~~~~~
In file included from arch/powerpc/include/asm/paca.h:15,
from arch/powerpc/include/asm/current.h:13,
from include/linux/thread_info.h:21,
from include/asm-generic/preempt.h:5,
from ./arch/powerpc/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:78,
from include/linux/spinlock.h:51,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:6,
from include/linux/umh.h:4,
from include/linux/kmod.h:9,
from include/linux/module.h:16,
from drivers/scsi/ibmvscsi/ibmvfc.c:10:
In function 'strncpy',
inlined from 'ibmvfc_gather_partition_info' at drivers/scsi/ibmvscsi/ibmvfc.c:1267:3,
inlined from 'ibmvfc_npiv_login' at drivers/scsi/ibmvscsi/ibmvfc.c:4538:2:
include/linux/string.h:291:30: warning: '__builtin_strncpy' specified bound 97 equals destination size [-Wstringop-truncation]
291 | #define __underlying_strncpy __builtin_strncpy
| ^
include/linux/string.h:301:9: note: in expansion of macro '__underlying_strncpy'
301 | return __underlying_strncpy(p, q, size);
| ^~~~~~~~~~~~~~~~~~~~
In function 'strncpy',
inlined from 'ibmvfc_set_login_info' at drivers/scsi/ibmvscsi/ibmvfc.c:1307:2,
inlined from 'ibmvfc_npiv_login' at drivers/scsi/ibmvscsi/ibmvfc.c:4539:2:
include/linux/string.h:291:30: warning: '__builtin_strncpy' specified bound 256 equals destination size [-Wstringop-truncation]
291 | #define __underlying_strncpy __builtin_strncpy
| ^
include/linux/string.h:301:9: note: in expansion of macro '__underlying_strncpy'
301 | return __underlying_strncpy(p, q, size);
| ^~~~~~~~~~~~~~~~~~~~
In function 'strncpy',
inlined from 'ibmvfc_set_login_info' at drivers/scsi/ibmvscsi/ibmvfc.c:1312:2,
inlined from 'ibmvfc_npiv_login' at drivers/scsi/ibmvscsi/ibmvfc.c:4539:2:
include/linux/string.h:291:30: warning: '__builtin_strncpy' specified bound 256 equals destination size [-Wstringop-truncation]
291 | #define __underlying_strncpy __builtin_strncpy
| ^
include/linux/string.h:301:9: note: in expansion of macro '__underlying_strncpy'
301 | return __underlying_strncpy(p, q, size);
| ^~~~~~~~~~~~~~~~~~~~
vim +/ibmvfc_locked_done +1418 drivers/scsi/ibmvscsi/ibmvfc.c
1409
1410 /**
1411 * ibmvfc_locked_done - Calls evt completion with host_lock held
1412 * @evt: ibmvfc evt to complete
1413 *
1414 * All non-scsi command completion callbacks have the expectation that the
1415 * host_lock is held. This callback is used by ibmvfc_init_event to wrap a
1416 * MAD evt with the host_lock.
1417 **/
> 1418 void ibmvfc_locked_done(struct ibmvfc_event *evt)
1419 {
1420 unsigned long flags;
1421
1422 spin_lock_irqsave(evt->vhost->host->host_lock, flags);
1423 evt->_done(evt);
1424 spin_unlock_irqrestore(evt->vhost->host->host_lock, flags);
1425 }
1426
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 72494 bytes --]
^ permalink raw reply
* [PATCH] tools/perf: Fix powerpc gap between kernel end and module start
From: Athira Rajeev @ 2020-12-29 2:14 UTC (permalink / raw)
To: mpe, acme, jolsa; +Cc: kjain, maddy, linuxppc-dev
Running "perf mem report" in TUI mode fails with ENOMEM message
in powerpc:
failed to process sample
Running with debug and verbose options points that issue is while
allocating memory for sample histograms.
The error path is:
symbol__inc_addr_samples -> __symbol__inc_addr_samples
-> annotated_source__histogram
symbol__inc_addr_samples calls annotated_source__alloc_histograms
to allocate memory for sample histograms using calloc. Here calloc fails
since the size of symbol is huge. The size of a symbol is calculated as
difference between its start and end address.
Example histogram allocation that fails is:
sym->name is _end, sym->start is 0xc0000000027a0000, sym->end is
0xc008000003890000, symbol__size(sym) is 0x80000010f0000
In above case, difference between sym->start (0xc0000000027a0000)
and sym->end (0xc008000003890000) is huge.
This is same problem as in s390 and arm64 which are fixed in commits:
'commit b9c0a64901d5 ("perf annotate: Fix s390 gap between kernel end
and module start")'
'commit 78886f3ed37e ("perf symbols: Fix arm64 gap between kernel start
and module end")'
When this symbol was read first, its start and end address was set to
address which matches with data from /proc/kallsyms.
After symbol__new:
symbol__new: _end 0xc0000000027a0000-0xc0000000027a0000
From /proc/kallsyms:
...
c000000002799370 b backtrace_flag
c000000002799378 B radix_tree_node_cachep
c000000002799380 B __bss_stop
c0000000027a0000 B _end
c008000003890000 t icmp_checkentry [ip_tables]
c008000003890038 t ipt_alloc_initial_table [ip_tables]
c008000003890468 T ipt_do_table [ip_tables]
c008000003890de8 T ipt_unregister_table_pre_exit [ip_tables]
...
Perf calls function symbols__fixup_end() which sets the end of symbol
to 0xc008000003890000, which is the next address and this is the start
address of first module (icmp_checkentry in above) which will make the
huge symbol size of 0x80000010f0000.
After symbols__fixup_end:
symbols__fixup_end: sym->name: _end, sym->start: 0xc0000000027a0000,
sym->end: 0xc008000003890000
On powerpc, kernel text segment is located at 0xc000000000000000
whereas the modules are located at very high memory addresses,
0xc00800000xxxxxxx. Since the gap between end of kernel text segment
and beginning of first module's address is high, histogram allocation
using calloc fails.
Fix this by detecting the kernel's last symbol and limiting
the range of last kernel symbol to pagesize.
Signed-off-by: Athira Rajeev<atrajeev@linux.vnet.ibm.com>
---
tools/perf/arch/powerpc/util/Build | 1 +
tools/perf/arch/powerpc/util/machine.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+)
create mode 100644 tools/perf/arch/powerpc/util/machine.c
diff --git a/tools/perf/arch/powerpc/util/Build b/tools/perf/arch/powerpc/util/Build
index e86e210bf514..b7945e5a543b 100644
--- a/tools/perf/arch/powerpc/util/Build
+++ b/tools/perf/arch/powerpc/util/Build
@@ -1,4 +1,5 @@
perf-y += header.o
+perf-y += machine.o
perf-y += kvm-stat.o
perf-y += perf_regs.o
perf-y += mem-events.o
diff --git a/tools/perf/arch/powerpc/util/machine.c b/tools/perf/arch/powerpc/util/machine.c
new file mode 100644
index 000000000000..c30e5cc88c16
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/machine.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <string.h>
+#include <internal/lib.h> // page_size
+#include "debug.h"
+#include "symbol.h"
+
+/* On powerpc kernel text segment start at memory addresses, 0xc000000000000000
+ * whereas the modules are located at very high memory addresses,
+ * for example 0xc00800000xxxxxxx. The gap between end of kernel text segment
+ * and beginning of first module's text segment is very high.
+ * Therefore do not fill this gap and do not assign it to the kernel dso map.
+ */
+
+void arch__symbols__fixup_end(struct symbol *p, struct symbol *c)
+{
+ if (strchr(p->name, '[') == NULL && strchr(c->name, '['))
+ /* Limit the range of last kernel symbol */
+ p->end += page_size;
+ else
+ p->end = c->start;
+ pr_debug4("%s sym:%s end:%#lx\n", __func__, p->name, p->end);
+}
--
1.8.3.1
^ permalink raw reply related
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Andy Lutomirski @ 2020-12-29 0:56 UTC (permalink / raw)
To: Nicholas Piggin
Cc: paulmck, Arnd Bergmann, Jann Horn, Peter Zijlstra, x86,
Russell King, ARM Linux, stable, linux-kernel, Will Deacon,
Mathieu Desnoyers, Andy Lutomirski, Catalin Marinas,
Paul Mackerras, linuxppc-dev, linux-arm-kernel
In-Reply-To: <1609200902.me5niwm2t6.astroid@bobo.none>
On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:
> > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:
> >>
> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
> >> > <linux@armlinux.org.uk> wrote:
> >> >>
> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
> >> >> > After chatting with rmk about this (but without claiming that any of
> >> >> > this is his opinion), based on the manpage, I think membarrier()
> >> >> > currently doesn't really claim to be synchronizing caches? It just
> >> >> > serializes cores. So arguably if userspace wants to use membarrier()
> >> >> > to synchronize code changes, userspace should first do the code
> >> >> > change, then flush icache as appropriate for the architecture, and
> >> >> > then do the membarrier() to ensure that the old code is unused?
> >>
> >> ^ exactly, yes.
> >>
> >> >> >
> >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
> >> >> > syscall. That might cause you to end up with two IPIs instead of one
> >> >> > in total, but we probably don't care _that_ much about extra IPIs on
> >> >> > 32-bit arm?
> >>
> >> This was the original thinking, yes. The cacheflush IPI will flush specific
> >> regions of code, and the membarrier IPI issues context synchronizing
> >> instructions.
>
> APIs should be written in terms of the service they provide to
> userspace, and in highest level terms as possible, rather than directing
> hardware to do some low level operation. Unfortunately we're stuck with
> this for now. We could deprecate it and replace it though.
>
> If userspace wants to modify code and ensure that after the system call
> returns then no other thread will be executing the previous code, then
> there should be an API for that. It could actually combine the two IPIs
> into one for architectures that require both too.
I agree. The membarrier API for SYNC_CORE is pretty nasty. I would
much prefer a real API for JITs to use.
>
> >>
> >> Architectures with coherent i/d caches don't need the cacheflush step.
> >
> > There are different levels of coherency -- VIVT architectures may have
> > differing requirements compared to PIPT, etc.
> >
> > In any case, I feel like the approach taken by the documentation is
> > fundamentally confusing. Architectures don't all speak the same
> > language How about something like:
> >
> > The SYNC_CORE operation causes all threads in the caller's address
> > space (including the caller) to execute an architecture-defined
> > barrier operation. membarrier() will ensure that this barrier is
> > executed at a time such that all data writes done by the calling
> > thread before membarrier() are made visible by the barrier.
> > Additional architecture-dependent cache management operations may be
> > required to use this for JIT code.
>
> As said this isn't what SYNC_CORE does, and it's not what powerpc
> context synchronizing instructions do either, it will very much
> re-order visibility of stores around such an instruction.
Perhaps the docs should be entirely arch-specific. It may well be
impossible to state what it does in an arch-neutral way.
>
> A thread completes store instructions into a store queue, which is
> as far as a context synchronizing event goes. Visibility comes at
> some indeterminite time later.
As currently implemented, it has the same visibility semantics as
regular membarrier, too. So if I do:
a = 1;
membarrier(SYNC_CORE);
b = 1;
and another thread does:
while (READ_ONCE(b) != 1)
;
barrier();
assert(a == 1);
then the assertion will pass. Similarly, one can do this, I hope:
memcpy(codeptr, [some new instructions], len);
arch_dependent_cache_flush(codeptr, len);
membarrier(SYNC_CORE);
ready = 1;
and another thread does:
while (READ_ONCE(ready) != 1)
;
barrier();
(*codeptr)();
arch_dependent_cache_flush is a nop on x86. On arm and arm64, it
appears to be a syscall, although maybe arm64 can do it from
userspace. I still don't know what it is on powerpc.
Even using the term "cache" here is misleading. x86 chips have all
kinds of barely-documented instruction caches, and they have varying
degrees of coherency. The architecture actually promises that, if you
do a certain incantation, then you get the desired result.
membarrier() allows a user to do this incantation. But trying to
replicate the incantation verbatim on an architecture like ARM is
insufficient, and trying to flush the things that are documented as
being caches on x86 is expensive and a complete waste of time on x86.
When you write some JIT code, you do *not* want to flush it all the
way to main memory, especially on CPUs don't have the ability to write
back invalidating. (That's most CPUs.)
Even on x86, I suspect that the various decoded insn caches are rather
more coherent than documented, and I have questions in to Intel about
this. No answers yet.
So perhaps the right approach is to say that membarrier() helps you
perform the architecture-specific sequence of steps needed to safely
modify code. On x86, you use it like this. On arm64, you do this
other thing. On powerpc, you do something else.
>
> I would be surprised if x86's serializing instructions were different
> than powerpc. rdtsc ordering or flushing stores to cache would be
> surprising.
>
At the very least, x86 has several levels of what ARM might call
"context synchronization" AFAICT. STAC, CLAC, and POPF do a form of
context synchronization in that the changes they cause to the MMU take
effect immediately, but they are not documented as synchronizing the
instruction stream. "Serializing" instructions do all kinds of
things, not all of which may be architecturally visible at all.
MFENCE and LFENCE do various complicated things, and LFENCE has magic
retroactive capabilities on old CPUs that were not documented when
those CPUs were released. SFENCE does a different form of
synchronization entirely. LOCK does something else. (The
relationship between LOCK and MFENCE is confusing at best.) RDTSC
doesn't serialize anything at all, but RDTSCP does provide a form of
serialization that's kind of ilke LFENCE. It's a mess. Even the
manuals are inconsistent about what "serialize" means. JMP has its
own magic on x86, but only on very very old CPUs.
The specific instruction that flushes everything into the coherency
domain is SFENCE, and SFENCE is not, for normal purposes, needed for
self- or cross-modifying code.
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Nicholas Piggin @ 2020-12-29 0:36 UTC (permalink / raw)
To: Andy Lutomirski, Mathieu Desnoyers
Cc: paulmck, Arnd Bergmann, Jann Horn, Peter Zijlstra, x86,
Russell King, ARM Linux, stable, linux-kernel, Will Deacon,
Paul Mackerras, Catalin Marinas, linuxppc-dev, linux-arm-kernel
In-Reply-To: <CALCETrXx3Xe+4Y6WM-mp0cTUU=r3bW6PV2b25yA8bm1Gvak6wQ@mail.gmail.com>
Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:
> On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:
>>
>> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
>> > <linux@armlinux.org.uk> wrote:
>> >>
>> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> >> > After chatting with rmk about this (but without claiming that any of
>> >> > this is his opinion), based on the manpage, I think membarrier()
>> >> > currently doesn't really claim to be synchronizing caches? It just
>> >> > serializes cores. So arguably if userspace wants to use membarrier()
>> >> > to synchronize code changes, userspace should first do the code
>> >> > change, then flush icache as appropriate for the architecture, and
>> >> > then do the membarrier() to ensure that the old code is unused?
>>
>> ^ exactly, yes.
>>
>> >> >
>> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> >> > syscall. That might cause you to end up with two IPIs instead of one
>> >> > in total, but we probably don't care _that_ much about extra IPIs on
>> >> > 32-bit arm?
>>
>> This was the original thinking, yes. The cacheflush IPI will flush specific
>> regions of code, and the membarrier IPI issues context synchronizing
>> instructions.
APIs should be written in terms of the service they provide to
userspace, and in highest level terms as possible, rather than directing
hardware to do some low level operation. Unfortunately we're stuck with
this for now. We could deprecate it and replace it though.
If userspace wants to modify code and ensure that after the system call
returns then no other thread will be executing the previous code, then
there should be an API for that. It could actually combine the two IPIs
into one for architectures that require both too.
>>
>> Architectures with coherent i/d caches don't need the cacheflush step.
>
> There are different levels of coherency -- VIVT architectures may have
> differing requirements compared to PIPT, etc.
>
> In any case, I feel like the approach taken by the documentation is
> fundamentally confusing. Architectures don't all speak the same
> language How about something like:
>
> The SYNC_CORE operation causes all threads in the caller's address
> space (including the caller) to execute an architecture-defined
> barrier operation. membarrier() will ensure that this barrier is
> executed at a time such that all data writes done by the calling
> thread before membarrier() are made visible by the barrier.
> Additional architecture-dependent cache management operations may be
> required to use this for JIT code.
As said this isn't what SYNC_CORE does, and it's not what powerpc
context synchronizing instructions do either, it will very much
re-order visibility of stores around such an instruction.
A thread completes store instructions into a store queue, which is
as far as a context synchronizing event goes. Visibility comes at
some indeterminite time later.
Also note that the regular membarrier syscall which does order
memory also does not make writes visible, it just ensures an
ordering.
>
> x86: SYNC_CORE executes a barrier that will cause subsequent
> instruction fetches to observe prior writes. Currently this will be a
I would be surprised if x86's serializing instructions were different
than powerpc. rdtsc ordering or flushing stores to cache would be
surprising.
Thanks,
Nick
> "serializing" instruction, but, if future improved CPU documentation
> becomes available and relaxes this requirement, the barrier may
> change. The kernel guarantees that writing new or modified
> instructions to normal memory (and issuing SFENCE if the writes were
> non-temporal) then doing a membarrier SYNC_CORE operation is
> sufficient to cause all threads in the caller's address space to
> execute the new or modified instructions. This is true regardless of
> whether or not those instructions are written at the same virtual
> address from which they are subsequently executed. No additional
> cache management is required on x86.
>
> arm: Something about the cache management syscalls.
>
> arm64: Ditto
>
> powerpc: I have no idea.
>
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Andy Lutomirski @ 2020-12-29 0:36 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Arnd Bergmann, X86 ML, LKML, stable, Will Deacon,
Mathieu Desnoyers, Andy Lutomirski, Catalin Marinas,
Paul Mackerras, linuxppc-dev, linux-arm-kernel
In-Reply-To: <1609199804.yrsu9vagzk.astroid@bobo.none>
On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am:
> > The old sync_core_before_usermode() comments said that a non-icache-syncing
> > return-to-usermode instruction is x86-specific and that all other
> > architectures automatically notice cross-modified code on return to
> > userspace. Based on my general understanding of how CPUs work and based on
> > my atttempt to read the ARM manual, this is not true at all. In fact, x86
> > seems to be a bit of an anomaly in the other direction: x86's IRET is
> > unusually heavyweight for a return-to-usermode instruction.
>
> "sync_core_before_usermode" as I've said says nothing to arch, or to the
> scheduler, or to membarrier.
Agreed. My patch tries to fix this. I agree that the name is bad and
could be improved further. We should define what
membarrier(...SYNC_CORE) actually does and have arch hooks to make it
happen.
> > So let's drop any pretense that we can have a generic way implementation
> > behind membarrier's SYNC_CORE flush and require all architectures that opt
> > in to supply their own. This means x86, arm64, and powerpc for now. Let's
> > also rename the function from sync_core_before_usermode() to
> > membarrier_sync_core_before_usermode() because the precise flushing details
> > may very well be specific to membarrier, and even the concept of
> > "sync_core" in the kernel is mostly an x86-ism.
>
> The concept of "sync_core" (x86: serializing instruction, powerpc: context
> synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted
> to add a serializing instruction to generic code so it grew this nasty API,
> but the concept applies broadly.
I mean that the mapping from the name "sync_core" to its semantics is
x86 only. The string "sync_core" appears in the kernel only in
arch/x86, membarrier code, membarrier docs, and a single SGI driver
that is x86-only. Sure, the idea of serializing things is fairly
generic, but exactly what operations serialize what, when things need
serialization, etc is quite architecture specific.
Heck, on 486 you serialize the instruction stream with JMP.
> > +static inline void membarrier_sync_core_before_usermode(void)
> > +{
> > + /*
> > + * XXX: I know basically nothing about powerpc cache management.
> > + * Is this correct?
> > + */
> > + isync();
>
> This is not about memory ordering or cache management, it's about
> pipeline management. Powerpc's return to user mode serializes the
> CPU (aka the hardware thread, _not_ the core; another wrongness of
> the name, but AFAIKS the HW thread is what is required for
> membarrier). So this is wrong, powerpc needs nothing here.
Fair enough. I'm happy to defer to you on the powerpc details. In
any case, this just illustrates that we need feedback from a person
who knows more about ARM64 than I do.
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Andy Lutomirski @ 2020-12-29 0:30 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Catalin Marinas, Arnd Bergmann, paulmck, Peter Zijlstra, x86,
linux-kernel, Nicholas Piggin, Russell King, ARM Linux,
Will Deacon, Paul Mackerras, stable, Andy Lutomirski,
linuxppc-dev, linux-arm-kernel
In-Reply-To: <1670059472.3671.1609189779376.JavaMail.zimbra@efficios.com>
On Mon, Dec 28, 2020 at 1:09 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Dec 27, 2020, at 4:36 PM, Andy Lutomirski luto@kernel.org wrote:
>
> [...]
>
> >> You seem to have noticed odd cases on arm64 where this guarantee does not
> >> match reality. Where exactly can we find this in the code, and which part
> >> of the architecture manual can you point us to which supports your concern ?
> >>
> >> Based on the notes I have, use of `eret` on aarch64 guarantees a context
> >> synchronizing
> >> instruction when returning to user-space.
> >
> > Based on my reading of the manual, ERET on ARM doesn't synchronize
> > anything at all. I can't find any evidence that it synchronizes data
> > or instructions, and I've seen reports that the CPU will happily
> > speculate right past it.
>
> Reading [1] there appears to be 3 kind of context synchronization events:
>
> - Taking an exception,
> - Returning from an exception,
> - ISB.
My reading of [1] is that all three of these are "context
synchronization event[s]", but that only ISB flushes the pipeline,
etc. The little description of context synchronization seems to
suggest that it only implies that certain register changes become
effective.
>
> This other source [2] adds (search for Context synchronization operation):
>
> - Exit from Debug state
> - Executing a DCPS instruction
> - Executing a DRPS instruction
>
> "ERET" falls into the second kind of events, and AFAIU should be context
> synchronizing. That was confirmed to me by Will Deacon when membarrier
> sync-core was implemented for aarch64. If the architecture reference manuals
> are wrong, is there an errata ?
>
> As for the algorithm to use on ARMv8 to update instructions, see [2]
> B2.3.4 Implication of caches for the application programmer
> "Synchronization and coherency issues between data and instruction accesses"
This specifically discusses ISB.
Let's wait for an actual ARM64 expert to chime in, though.
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Nicholas Piggin @ 2020-12-29 0:11 UTC (permalink / raw)
To: Andy Lutomirski, Mathieu Desnoyers, x86
Cc: Arnd Bergmann, LKML, stable, Will Deacon, Paul Mackerras,
Catalin Marinas, linuxppc-dev, linux-arm-kernel
In-Reply-To: <bf59ecb5487171a852bcc8cdd553ec797aedc485.1609093476.git.luto@kernel.org>
Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am:
> The old sync_core_before_usermode() comments said that a non-icache-syncing
> return-to-usermode instruction is x86-specific and that all other
> architectures automatically notice cross-modified code on return to
> userspace. Based on my general understanding of how CPUs work and based on
> my atttempt to read the ARM manual, this is not true at all. In fact, x86
> seems to be a bit of an anomaly in the other direction: x86's IRET is
> unusually heavyweight for a return-to-usermode instruction.
"sync_core_before_usermode" as I've said says nothing to arch, or to the
scheduler, or to membarrier. It's badly named to start with so if
renaming it it should be something else. exit_lazy_tlb() at least says
something quite precise to scheudler and arch code that implements
the membarrier.
But I don't mind the idea of just making it x86 specific if as you say the
arch code can detect lazy mm switches more precisely than generic and
you want to do that.
> So let's drop any pretense that we can have a generic way implementation
> behind membarrier's SYNC_CORE flush and require all architectures that opt
> in to supply their own. This means x86, arm64, and powerpc for now. Let's
> also rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.
The concept of "sync_core" (x86: serializing instruction, powerpc: context
synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted
to add a serializing instruction to generic code so it grew this nasty API,
but the concept applies broadly.
>
> I admit that I'm rather surprised that the code worked at all on arm64,
> and I'm suspicious that it has never been very well tested. My apologies
> for not reviewing this more carefully in the first place.
>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: x86@kernel.org
> Cc: stable@vger.kernel.org
> Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>
> Hi arm64 and powerpc people-
>
> This is part of a series here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes
>
> Before I send out the whole series, I'm hoping that some arm64 and powerpc
> people can help me verify that I did this patch right. Once I get
> some feedback on this patch, I'll send out the whole pile. And once
> *that's* done, I'll start giving the mm lazy stuff some serious thought.
>
> The x86 part is already fixed in Linus' tree.
>
> Thanks,
> Andy
>
> arch/arm64/include/asm/sync_core.h | 21 +++++++++++++++++++++
> arch/powerpc/include/asm/sync_core.h | 20 ++++++++++++++++++++
> arch/x86/Kconfig | 1 -
> arch/x86/include/asm/sync_core.h | 7 +++----
> include/linux/sched/mm.h | 1 -
> include/linux/sync_core.h | 21 ---------------------
> init/Kconfig | 3 ---
> kernel/sched/membarrier.c | 15 +++++++++++----
> 8 files changed, 55 insertions(+), 34 deletions(-)
> create mode 100644 arch/arm64/include/asm/sync_core.h
> create mode 100644 arch/powerpc/include/asm/sync_core.h
> delete mode 100644 include/linux/sync_core.h
>
> diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
> new file mode 100644
> index 000000000000..5be4531caabd
> --- /dev/null
> +++ b/arch/arm64/include/asm/sync_core.h
> @@ -0,0 +1,21 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_ARM64_SYNC_CORE_H
> +#define _ASM_ARM64_SYNC_CORE_H
> +
> +#include <asm/barrier.h>
> +
> +/*
> + * Ensure that the CPU notices any instruction changes before the next time
> + * it returns to usermode.
> + */
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> + /*
> + * XXX: is this enough or do we need a DMB first to make sure that
> + * writes from other CPUs become visible to this CPU? We have an
> + * smp_mb() already, but that's not quite the same thing.
> + */
> + isb();
> +}
> +
> +#endif /* _ASM_ARM64_SYNC_CORE_H */
> diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
> new file mode 100644
> index 000000000000..71dfbe7794e5
> --- /dev/null
> +++ b/arch/powerpc/include/asm/sync_core.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_SYNC_CORE_H
> +#define _ASM_POWERPC_SYNC_CORE_H
> +
> +#include <asm/barrier.h>
> +
> +/*
> + * Ensure that the CPU notices any instruction changes before the next time
> + * it returns to usermode.
> + */
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> + /*
> + * XXX: I know basically nothing about powerpc cache management.
> + * Is this correct?
> + */
> + isync();
This is not about memory ordering or cache management, it's about
pipeline management. Powerpc's return to user mode serializes the
CPU (aka the hardware thread, _not_ the core; another wrongness of
the name, but AFAIKS the HW thread is what is required for
membarrier). So this is wrong, powerpc needs nothing here.
Thanks,
Nick
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Mathieu Desnoyers @ 2020-12-28 21:26 UTC (permalink / raw)
To: Andy Lutomirski
Cc: linuxppc-dev, Catalin Marinas, Arnd Bergmann, paulmck, Jann Horn,
Peter Zijlstra, x86, Russell King, ARM Linux, Nicholas Piggin,
linux-kernel, Paul Mackerras, stable, Will Deacon,
linux-arm-kernel
In-Reply-To: <CALCETrXx3Xe+4Y6WM-mp0cTUU=r3bW6PV2b25yA8bm1Gvak6wQ@mail.gmail.com>
----- On Dec 28, 2020, at 4:06 PM, Andy Lutomirski luto@kernel.org wrote:
> On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:
>>
>> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
>> > <linux@armlinux.org.uk> wrote:
>> >>
>> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> >> > After chatting with rmk about this (but without claiming that any of
>> >> > this is his opinion), based on the manpage, I think membarrier()
>> >> > currently doesn't really claim to be synchronizing caches? It just
>> >> > serializes cores. So arguably if userspace wants to use membarrier()
>> >> > to synchronize code changes, userspace should first do the code
>> >> > change, then flush icache as appropriate for the architecture, and
>> >> > then do the membarrier() to ensure that the old code is unused?
>>
>> ^ exactly, yes.
>>
>> >> >
>> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> >> > syscall. That might cause you to end up with two IPIs instead of one
>> >> > in total, but we probably don't care _that_ much about extra IPIs on
>> >> > 32-bit arm?
>>
>> This was the original thinking, yes. The cacheflush IPI will flush specific
>> regions of code, and the membarrier IPI issues context synchronizing
>> instructions.
>>
>> Architectures with coherent i/d caches don't need the cacheflush step.
>
> There are different levels of coherency -- VIVT architectures may have
> differing requirements compared to PIPT, etc.
>
> In any case, I feel like the approach taken by the documentation is
> fundamentally confusing. Architectures don't all speak the same
> language
Agreed.
> How about something like:
I dislike the wording "barrier" and the association between "write" and
"instruction fetch" done in the descriptions below. It leads to think that
this behaves like a memory barrier, when in fact my understanding of
a context synchronizing instruction is that it simply flushes internal
CPU state, which would cause coherency issues if the CPU observes both the
old and then the new code without having this state flushed.
[ Sorry if I take more time to reply and if my replies are a bit more
concise than usual. I'm currently on parental leave, so I have
non-maskable interrupts to attend to. ;-) ]
Thanks,
Mathieu
>
> The SYNC_CORE operation causes all threads in the caller's address
> space (including the caller) to execute an architecture-defined
> barrier operation. membarrier() will ensure that this barrier is
> executed at a time such that all data writes done by the calling
> thread before membarrier() are made visible by the barrier.
> Additional architecture-dependent cache management operations may be
> required to use this for JIT code.
>
> x86: SYNC_CORE executes a barrier that will cause subsequent
> instruction fetches to observe prior writes. Currently this will be a
> "serializing" instruction, but, if future improved CPU documentation
> becomes available and relaxes this requirement, the barrier may
> change. The kernel guarantees that writing new or modified
> instructions to normal memory (and issuing SFENCE if the writes were
> non-temporal) then doing a membarrier SYNC_CORE operation is
> sufficient to cause all threads in the caller's address space to
> execute the new or modified instructions. This is true regardless of
> whether or not those instructions are written at the same virtual
> address from which they are subsequently executed. No additional
> cache management is required on x86.
>
> arm: Something about the cache management syscalls.
>
> arm64: Ditto
>
> powerpc: I have no idea.
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Mathieu Desnoyers @ 2020-12-28 21:09 UTC (permalink / raw)
To: Andy Lutomirski, paulmck, Peter Zijlstra
Cc: Catalin Marinas, Arnd Bergmann, x86, Russell King, ARM Linux,
Nicholas Piggin, linux-kernel, Will Deacon, Paul Mackerras,
stable, linuxppc-dev, linux-arm-kernel
In-Reply-To: <CALCETrVdcn2r2Jvd1=-bM=FQ8KbX4aH-v4ytdojL7r7Nb6k8YQ@mail.gmail.com>
----- On Dec 27, 2020, at 4:36 PM, Andy Lutomirski luto@kernel.org wrote:
[...]
>> You seem to have noticed odd cases on arm64 where this guarantee does not
>> match reality. Where exactly can we find this in the code, and which part
>> of the architecture manual can you point us to which supports your concern ?
>>
>> Based on the notes I have, use of `eret` on aarch64 guarantees a context
>> synchronizing
>> instruction when returning to user-space.
>
> Based on my reading of the manual, ERET on ARM doesn't synchronize
> anything at all. I can't find any evidence that it synchronizes data
> or instructions, and I've seen reports that the CPU will happily
> speculate right past it.
Reading [1] there appears to be 3 kind of context synchronization events:
- Taking an exception,
- Returning from an exception,
- ISB.
This other source [2] adds (search for Context synchronization operation):
- Exit from Debug state
- Executing a DCPS instruction
- Executing a DRPS instruction
"ERET" falls into the second kind of events, and AFAIU should be context
synchronizing. That was confirmed to me by Will Deacon when membarrier
sync-core was implemented for aarch64. If the architecture reference manuals
are wrong, is there an errata ?
As for the algorithm to use on ARMv8 to update instructions, see [2]
B2.3.4 Implication of caches for the application programmer
"Synchronization and coherency issues between data and instruction accesses"
Membarrier only takes care of making sure the "ISB" part of the algorithm can be
done easily and efficiently on multiprocessor systems.
Thanks,
Mathieu
[1] https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers/ISB-in-more-detail
[2] https://montcs.bloomu.edu/Information/ARMv8/ARMv8-A_Architecture_Reference_Manual_(Issue_A.a).pdf
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Andy Lutomirski @ 2020-12-28 21:06 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linuxppc-dev, Catalin Marinas, Arnd Bergmann, paulmck, Jann Horn,
Peter Zijlstra, x86, Russell King, ARM Linux, Nicholas Piggin,
linux-kernel, Paul Mackerras, stable, Andy Lutomirski,
Will Deacon, linux-arm-kernel
In-Reply-To: <1086654515.3607.1609187556216.JavaMail.zimbra@efficios.com>
On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:
>
> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
> > <linux@armlinux.org.uk> wrote:
> >>
> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
> >> > After chatting with rmk about this (but without claiming that any of
> >> > this is his opinion), based on the manpage, I think membarrier()
> >> > currently doesn't really claim to be synchronizing caches? It just
> >> > serializes cores. So arguably if userspace wants to use membarrier()
> >> > to synchronize code changes, userspace should first do the code
> >> > change, then flush icache as appropriate for the architecture, and
> >> > then do the membarrier() to ensure that the old code is unused?
>
> ^ exactly, yes.
>
> >> >
> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
> >> > syscall. That might cause you to end up with two IPIs instead of one
> >> > in total, but we probably don't care _that_ much about extra IPIs on
> >> > 32-bit arm?
>
> This was the original thinking, yes. The cacheflush IPI will flush specific
> regions of code, and the membarrier IPI issues context synchronizing
> instructions.
>
> Architectures with coherent i/d caches don't need the cacheflush step.
There are different levels of coherency -- VIVT architectures may have
differing requirements compared to PIPT, etc.
In any case, I feel like the approach taken by the documentation is
fundamentally confusing. Architectures don't all speak the same
language How about something like:
The SYNC_CORE operation causes all threads in the caller's address
space (including the caller) to execute an architecture-defined
barrier operation. membarrier() will ensure that this barrier is
executed at a time such that all data writes done by the calling
thread before membarrier() are made visible by the barrier.
Additional architecture-dependent cache management operations may be
required to use this for JIT code.
x86: SYNC_CORE executes a barrier that will cause subsequent
instruction fetches to observe prior writes. Currently this will be a
"serializing" instruction, but, if future improved CPU documentation
becomes available and relaxes this requirement, the barrier may
change. The kernel guarantees that writing new or modified
instructions to normal memory (and issuing SFENCE if the writes were
non-temporal) then doing a membarrier SYNC_CORE operation is
sufficient to cause all threads in the caller's address space to
execute the new or modified instructions. This is true regardless of
whether or not those instructions are written at the same virtual
address from which they are subsequently executed. No additional
cache management is required on x86.
arm: Something about the cache management syscalls.
arm64: Ditto
powerpc: I have no idea.
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Mathieu Desnoyers @ 2020-12-28 20:40 UTC (permalink / raw)
To: Russell King, ARM Linux
Cc: linuxppc-dev, Catalin Marinas, Arnd Bergmann, Jann Horn, x86,
linux-kernel, Nicholas Piggin, Paul Mackerras, stable,
Andy Lutomirski, Will Deacon, linux-arm-kernel
In-Reply-To: <20201228202451.GJ1551@shell.armlinux.org.uk>
----- On Dec 28, 2020, at 3:24 PM, Russell King, ARM Linux linux@armlinux.org.uk wrote:
> On Mon, Dec 28, 2020 at 11:44:33AM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
>> <linux@armlinux.org.uk> wrote:
>> >
>> > On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> > > After chatting with rmk about this (but without claiming that any of
>> > > this is his opinion), based on the manpage, I think membarrier()
>> > > currently doesn't really claim to be synchronizing caches? It just
>> > > serializes cores. So arguably if userspace wants to use membarrier()
>> > > to synchronize code changes, userspace should first do the code
>> > > change, then flush icache as appropriate for the architecture, and
>> > > then do the membarrier() to ensure that the old code is unused?
>> > >
>> > > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> > > syscall. That might cause you to end up with two IPIs instead of one
>> > > in total, but we probably don't care _that_ much about extra IPIs on
>> > > 32-bit arm?
>> > >
>> > > For arm64, I believe userspace can flush icache across the entire
>> > > system with some instructions from userspace - "DC CVAU" followed by
>> > > "DSB ISH", or something like that, I think? (See e.g.
>> > > compat_arm_syscall(), the arm64 compat code that implements the 32-bit
>> > > arm cacheflush() syscall.)
>> >
>> > Note that the ARM cacheflush syscall calls flush_icache_user_range()
>> > over the range of addresses that userspace has passed - it's intention
>> > since day one is to support cases where userspace wants to change
>> > executable code.
>> >
>> > It will issue the appropriate write-backs to the data cache (DCCMVAU),
>> > the invalidates to the instruction cache (ICIMVAU), invalidate the
>> > branch target buffer (BPIALLIS or BPIALL as appropriate), and issue
>> > the appropriate barriers (DSB ISHST, ISB).
>> >
>> > Note that neither flush_icache_user_range() nor flush_icache_range()
>> > result in IPIs; cache operations are broadcast across all CPUs (which
>> > is one of the minimums we require for SMP systems.)
>> >
>> > Now, that all said, I think the question that has to be asked is...
>> >
>> > What is the basic purpose of membarrier?
>> >
>> > Is the purpose of it to provide memory barriers, or is it to provide
>> > memory coherence?
>> >
>> > If it's the former and not the latter, then cache flushes are out of
>> > scope, and expecting memory written to be visible to the instruction
>> > stream is totally out of scope of the membarrier interface, whether
>> > or not the writes happen on the same or a different CPU to the one
>> > executing the rewritten code.
>> >
>> > The documentation in the kernel does not seem to describe what it's
>> > supposed to be doing - the only thing I could find is this:
>> > Documentation/features/sched/membarrier-sync-core/arch-support.txt
>> > which describes it as "arch supports core serializing membarrier"
>> > whatever that means.
>> >
>> > Seems to be the standard and usual case of utterly poor to non-existent
>> > documentation within the kernel tree, or even a pointer to where any
>> > useful documentation can be found.
>> >
>> > Reading the membarrier(2) man page, I find nothing in there that talks
>> > about any kind of cache coherency for self-modifying code - it only
>> > seems to be about _barriers_ and nothing more, and barriers alone do
>> > precisely nothing to save you from non-coherent Harvard caches.
>> >
>> > So, either Andy has a misunderstanding, or the man page is wrong, or
>> > my rudimentary understanding of what membarrier is supposed to be
>> > doing is wrong...
>>
>> Look at the latest man page:
>>
>> https://man7.org/linux/man-pages/man2/membarrier.2.html
>>
>> for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE. The result may not be
>> all that enlightening.
>
> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
> In addition to providing the memory ordering guarantees de■
> scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return from
> system call the calling thread has a guarantee that all its run■
> ning thread siblings have executed a core serializing instruc■
> tion. This guarantee is provided only for threads in the same
> process as the calling thread.
>
> The "expedited" commands complete faster than the non-expedited
> ones, they never block, but have the downside of causing extra
> overhead.
>
> A process must register its intent to use the private expedited
> sync core command prior to using it.
>
> This just says that the siblings have executed a serialising
> instruction, in other words a barrier. It makes no claims concerning
> cache coherency - and without some form of cache maintenance, there
> can be no expectation that the I and D streams to be coherent with
> each other.
Right, membarrier is not doing anything wrt I/D caches. On architectures
without coherent caches, users should use other system calls or instructions
provided by the architecture to synchronize the appropriate address ranges.
> This description is also weird in another respect. "guarantee that
> all its running thread siblings have executed a core serializing
> instruction" ... "The expedited commands ... never block".
>
> So, the core executing this call is not allowed to block, but the
> other part indicates that the other CPUs _have_ executed a serialising
> instruction before this call returns... one wonders how that happens
> without blocking. Maybe the CPU spins waiting for completion instead?
Membarrier expedited sync-core issues IPIs to all CPUs running sibling
threads. AFAIR the IPI mechanism uses the "csd lock" which is basically
busy waiting. So it does not "block", it busy-waits.
For completeness of the explanation, other (non-running) threads acting
on the same mm will eventually issue the context synchronizing instruction
before returning to user-space whenever they are scheduled back.
Thanks,
Mathieu
>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Mathieu Desnoyers @ 2020-12-28 20:32 UTC (permalink / raw)
To: Andy Lutomirski, paulmck, Peter Zijlstra
Cc: linuxppc-dev, Catalin Marinas, Arnd Bergmann, Jann Horn, x86,
linux-kernel, Nicholas Piggin, Russell King, ARM Linux,
Paul Mackerras, stable, Will Deacon, linux-arm-kernel
In-Reply-To: <CALCETrVpvrBufrJgXNY=ogtZQLo7zgxQmD7k9eVCFjcdcvarmA@mail.gmail.com>
----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:
> On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
>>
>> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> > After chatting with rmk about this (but without claiming that any of
>> > this is his opinion), based on the manpage, I think membarrier()
>> > currently doesn't really claim to be synchronizing caches? It just
>> > serializes cores. So arguably if userspace wants to use membarrier()
>> > to synchronize code changes, userspace should first do the code
>> > change, then flush icache as appropriate for the architecture, and
>> > then do the membarrier() to ensure that the old code is unused?
^ exactly, yes.
>> >
>> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> > syscall. That might cause you to end up with two IPIs instead of one
>> > in total, but we probably don't care _that_ much about extra IPIs on
>> > 32-bit arm?
This was the original thinking, yes. The cacheflush IPI will flush specific
regions of code, and the membarrier IPI issues context synchronizing
instructions.
Architectures with coherent i/d caches don't need the cacheflush step.
>> >
>> > For arm64, I believe userspace can flush icache across the entire
>> > system with some instructions from userspace - "DC CVAU" followed by
>> > "DSB ISH", or something like that, I think? (See e.g.
>> > compat_arm_syscall(), the arm64 compat code that implements the 32-bit
>> > arm cacheflush() syscall.)
>>
>> Note that the ARM cacheflush syscall calls flush_icache_user_range()
>> over the range of addresses that userspace has passed - it's intention
>> since day one is to support cases where userspace wants to change
>> executable code.
>>
>> It will issue the appropriate write-backs to the data cache (DCCMVAU),
>> the invalidates to the instruction cache (ICIMVAU), invalidate the
>> branch target buffer (BPIALLIS or BPIALL as appropriate), and issue
>> the appropriate barriers (DSB ISHST, ISB).
>>
>> Note that neither flush_icache_user_range() nor flush_icache_range()
>> result in IPIs; cache operations are broadcast across all CPUs (which
>> is one of the minimums we require for SMP systems.)
But the ISB AFAIU is not a broadcast. Based on
https://developer.arm.com/documentation/ddi0406/c/Appendices/Barrier-Litmus-Tests/Cache-and-TLB-maintenance-operations-and-barriers/Instruction-cache-maintenance-operations
"Ensuring the visibility of updates to instructions for a multiprocessor"
the ISB needs to be issued on each processor in a multiprocessor system.
This is where membarrier sync-core comes handy.
>>
>> Now, that all said, I think the question that has to be asked is...
>>
>> What is the basic purpose of membarrier?
For basic membarrier (not sync-core), the purpose is to provide memory
barriers across a given set of threads.
For sync-core membarrier, the purpose is to guarantee context synchronizing
instructions on all threads belonging to the same mm.
>>
>> Is the purpose of it to provide memory barriers, or is it to provide
>> memory coherence?
The purpose has never been to provide any kind of memory coherence, as
this would duplicate what is already achieved by other architecture-specific
system calls.
>>
>> If it's the former and not the latter, then cache flushes are out of
>> scope, and expecting memory written to be visible to the instruction
>> stream is totally out of scope of the membarrier interface, whether
>> or not the writes happen on the same or a different CPU to the one
>> executing the rewritten code.
>>
>> The documentation in the kernel does not seem to describe what it's
>> supposed to be doing - the only thing I could find is this:
>> Documentation/features/sched/membarrier-sync-core/arch-support.txt
>> which describes it as "arch supports core serializing membarrier"
>> whatever that means.
It's described in include/uapi/linux/membarrier.h.
>>
>> Seems to be the standard and usual case of utterly poor to non-existent
>> documentation within the kernel tree, or even a pointer to where any
>> useful documentation can be found.
>>
>> Reading the membarrier(2) man page, I find nothing in there that talks
>> about any kind of cache coherency for self-modifying code - it only
>> seems to be about _barriers_ and nothing more, and barriers alone do
>> precisely nothing to save you from non-coherent Harvard caches.
There is no mention of cache coherency because membarrier does not
do anything wrt cache coherency. We should perhaps explicitly point
it out though.
>>
>> So, either Andy has a misunderstanding, or the man page is wrong, or
>> my rudimentary understanding of what membarrier is supposed to be
>> doing is wrong...
>
> Look at the latest man page:
>
> https://man7.org/linux/man-pages/man2/membarrier.2.html
>
> for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE. The result may not be
> all that enlightening.
As described in the membarrier(2) man page and in include/uapi/linux/membarrier.h,
membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE guarantees core serializing
instructions in addition to the memory ordering guaranteed by MEMBARRIER_CMD_PRIVATE_EXPEDITED.
It does not guarantee anything wrt i/d cache coherency. I initially considered
adding such guarantees when we introduced membarrier sync-core, but decided
against it because it would basically replicate what architectures already
expose to user-space, e.g. flush_icache_user_range on arm32.
So between code modification and allowing other threads to jump to that code,
it should be expected that architectures without coherent i/d cache will need
to flush caches to ensure coherency *and* to issue membarrier to make sure
core serializing instructions will be issued by every thread acting on the
same mm either immediately by means of the IPI, or before they return to
user-space if they do not happen to be currently running when the membarrier
system call is invoked.
Hoping this clarifies things. I suspect we will need to clarify documentation
about what membarrier *does not* guarantee, given that you mistakenly expected
membarrier to take care of cache flushing.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
From: Russell King - ARM Linux admin @ 2020-12-28 20:24 UTC (permalink / raw)
To: Andy Lutomirski
Cc: linuxppc-dev, Catalin Marinas, Arnd Bergmann, Jann Horn, x86,
linux-kernel, Nicholas Piggin, Mathieu Desnoyers, stable,
Paul Mackerras, Will Deacon, linux-arm-kernel
In-Reply-To: <CALCETrVpvrBufrJgXNY=ogtZQLo7zgxQmD7k9eVCFjcdcvarmA@mail.gmail.com>
On Mon, Dec 28, 2020 at 11:44:33AM -0800, Andy Lutomirski wrote:
> On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
> > > After chatting with rmk about this (but without claiming that any of
> > > this is his opinion), based on the manpage, I think membarrier()
> > > currently doesn't really claim to be synchronizing caches? It just
> > > serializes cores. So arguably if userspace wants to use membarrier()
> > > to synchronize code changes, userspace should first do the code
> > > change, then flush icache as appropriate for the architecture, and
> > > then do the membarrier() to ensure that the old code is unused?
> > >
> > > For 32-bit arm, rmk pointed out that that would be the cacheflush()
> > > syscall. That might cause you to end up with two IPIs instead of one
> > > in total, but we probably don't care _that_ much about extra IPIs on
> > > 32-bit arm?
> > >
> > > For arm64, I believe userspace can flush icache across the entire
> > > system with some instructions from userspace - "DC CVAU" followed by
> > > "DSB ISH", or something like that, I think? (See e.g.
> > > compat_arm_syscall(), the arm64 compat code that implements the 32-bit
> > > arm cacheflush() syscall.)
> >
> > Note that the ARM cacheflush syscall calls flush_icache_user_range()
> > over the range of addresses that userspace has passed - it's intention
> > since day one is to support cases where userspace wants to change
> > executable code.
> >
> > It will issue the appropriate write-backs to the data cache (DCCMVAU),
> > the invalidates to the instruction cache (ICIMVAU), invalidate the
> > branch target buffer (BPIALLIS or BPIALL as appropriate), and issue
> > the appropriate barriers (DSB ISHST, ISB).
> >
> > Note that neither flush_icache_user_range() nor flush_icache_range()
> > result in IPIs; cache operations are broadcast across all CPUs (which
> > is one of the minimums we require for SMP systems.)
> >
> > Now, that all said, I think the question that has to be asked is...
> >
> > What is the basic purpose of membarrier?
> >
> > Is the purpose of it to provide memory barriers, or is it to provide
> > memory coherence?
> >
> > If it's the former and not the latter, then cache flushes are out of
> > scope, and expecting memory written to be visible to the instruction
> > stream is totally out of scope of the membarrier interface, whether
> > or not the writes happen on the same or a different CPU to the one
> > executing the rewritten code.
> >
> > The documentation in the kernel does not seem to describe what it's
> > supposed to be doing - the only thing I could find is this:
> > Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > which describes it as "arch supports core serializing membarrier"
> > whatever that means.
> >
> > Seems to be the standard and usual case of utterly poor to non-existent
> > documentation within the kernel tree, or even a pointer to where any
> > useful documentation can be found.
> >
> > Reading the membarrier(2) man page, I find nothing in there that talks
> > about any kind of cache coherency for self-modifying code - it only
> > seems to be about _barriers_ and nothing more, and barriers alone do
> > precisely nothing to save you from non-coherent Harvard caches.
> >
> > So, either Andy has a misunderstanding, or the man page is wrong, or
> > my rudimentary understanding of what membarrier is supposed to be
> > doing is wrong...
>
> Look at the latest man page:
>
> https://man7.org/linux/man-pages/man2/membarrier.2.html
>
> for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE. The result may not be
> all that enlightening.
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
In addition to providing the memory ordering guarantees de■
scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return from
system call the calling thread has a guarantee that all its run■
ning thread siblings have executed a core serializing instruc■
tion. This guarantee is provided only for threads in the same
process as the calling thread.
The "expedited" commands complete faster than the non-expedited
ones, they never block, but have the downside of causing extra
overhead.
A process must register its intent to use the private expedited
sync core command prior to using it.
This just says that the siblings have executed a serialising
instruction, in other words a barrier. It makes no claims concerning
cache coherency - and without some form of cache maintenance, there
can be no expectation that the I and D streams to be coherent with
each other.
This description is also weird in another respect. "guarantee that
all its running thread siblings have executed a core serializing
instruction" ... "The expedited commands ... never block".
So, the core executing this call is not allowed to block, but the
other part indicates that the other CPUs _have_ executed a serialising
instruction before this call returns... one wonders how that happens
without blocking. Maybe the CPU spins waiting for completion instead?
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox