LinuxPPC-Dev Archive on lore.kernel.org


* Re: ppc64le and 32-bit LE userland compatibility
From: Daniel Kolesa @ 2020-06-02  2:03 UTC
  To: Segher Boessenkool, Joseph Myers
  Cc: libc-alpha, eery, musl, Will Springer,
	Palmer Dabbelt via binutils, via libc-dev, linuxppc-dev
In-Reply-To: <20200602014227.GM31009@gate.crashing.org>

On Tue, Jun 2, 2020, at 03:42, Segher Boessenkool wrote:
> Hi Joseph,
> 
> On Mon, Jun 01, 2020 at 09:28:25PM +0000, Joseph Myers wrote:
> > On Fri, 29 May 2020, Will Springer via Binutils wrote:
> > 
> > > Hey all, a couple of us over in #talos-workstation on freenode have been
> > > working on an effort to bring up a Linux PowerPC userland that runs in 32-bit
> > > little-endian mode, aka ppcle. As far as we can tell, no ABI has ever been
> > > designated for this (unless you count the patchset from a decade ago [1]), so
> > > it's pretty much uncharted territory as far as Linux is concerned. We want to
> > > sync up with libc and the relevant kernel folks to establish the best path
> > > forward.
> > 
> > As a general comment on the glibc side of things, if this is considered 
> > like a new port, and it probably is, the same principles that apply to new 
> > ports apply here.
> > 
> > There's a general discussion at 
> > <https://sourceware.org/glibc/wiki/NewPorts>, although much of that is 
> > only applicable when adding new CPU architecture support.  More specific 
> > points include that new 32-bit ports should default to 64-bit time and 
> > file offsets from the start, with no support for 32-bit time or offsets 
> > (meaning that if you want to use this with some kind of library call 
> > translation, the library call translation will need to deal with 
> > corresponding type size conversions).
> 
> Either that, or use the same as BE 32-bit PowerPC Linux, I'd say (it
> won't make things worse, and if it is easier?)  But preferably the
> newer, better, thing of course :-)
> 
> > And a new port should not be added 
> > that uses the IBM long double format.  You can use IEEE binary128 long 
> > double, possibly with an ABI similar to that used on powerpc64le, or can 
> > use long double = double, but should not support IBM long double, and 
> > preferably should only have one long double format rather than using the 
> > glibc support for building with different options resulting in functions 
> > for different long double formats being called.
> 
> You cannot use IEEE QP float ("binary128") here, but more on that in a
> later post.
> 
> (I so very much agree about the problems having more than one long
> double format -- on the other hand, you'll just share it with BE, and
> with the existing powerpcle-linux (sup)port).

Well, it'd be nice to use the opportunity to ditch IBM long double altogether, since it gets in my way in various places. For instance, GCC still can't constant-fold it, which breaks constexpr in C++ and makes it impossible to e.g. enable the runtime/standard library for GDC in GCC, as druntime/phobos rely heavily on constant folding of the `real` type in D, which maps directly to `long double`. There are also assorted libraries and applications in the common userland that don't like the IBM format for one reason or another.

That said, that's also problematic:

1) ppc64le is moving to IEEE 754 binary128, so we're all good there: the baseline mandates VSX, so at least passing the values is fine, which is the important thing (actual binary128 instructions came with ISA 3.0, but that can be detected at runtime, at least in glibc). ibm128 is then handled via symvers/compat, which is fine.

2) ppc64 for now uses IBM 128-bit long double, so it's problematic. I don't care about ELFv1: it's legacy, has tons of drawbacks of its own, and there's a whole legacy ecosystem relying on it. That said, a new ELFv2 port (let's say, with an ld64.so.3 dynamic linker) could easily default to another long double format without worrying about compat. I propose binary64 for this, as it's already used by musl on BE/64 (which allows projects such as `gcompat` to work), is thus already implemented in all toolchains we care about and can easily be flipped on, and is fully compatible with all CPUs that can run ppc64 code, even without VMX/VSX.

3) ppcle would be a new port (let's say ld.so.2), so it could default to a new format; binary64 would be good

4) that leaves ppc32/BE, which would be the only outlier. While I could probably implement compat in much the same way as ppc64le does with binary128 (bump symvers for math symbols, leave older symvers for existing binaries, change defaults), this would diverge from the existing port; glibc/IBM probably won't want to switch this, and while I would definitely like to, maintaining a divergent downstream patchset seems non-ideal. There is some basis for this: glibc used binary64 on ppc32 and ppc64 BE in the past, and the compatibility symbols are still there under the old symvers. I'm not quite sure what to do there, though.
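
All four cases above come down to which `long double` format a given toolchain actually selects. One quick way to check is to look at `LDBL_MANT_DIG` from `<float.h>`. The following is only a minimal sketch and not part of any proposal in this thread: the enum and function names are made up for illustration, and the mantissa widths are the ones GCC reports for each format (53 for binary64, 106 for IBM double-double, 113 for binary128).

```c
#include <assert.h>
#include <float.h>

/* Hypothetical names, for illustration only. */
enum ldbl_format {
    LDBL_FMT_UNKNOWN,
    LDBL_FMT_BINARY64,   /* long double == double */
    LDBL_FMT_X87,        /* 80-bit extended; not a PowerPC format */
    LDBL_FMT_IBM128,     /* IBM double-double */
    LDBL_FMT_BINARY128   /* IEEE quad precision */
};

/* C guarantees long double is at least as precise as double. */
static_assert(LDBL_MANT_DIG >= DBL_MANT_DIG, "long double narrower than double");

/* Classify a format by its mantissa width in bits. */
enum ldbl_format ldbl_format_from_mant_dig(int mant_dig)
{
    switch (mant_dig) {
    case 53:  return LDBL_FMT_BINARY64;
    case 64:  return LDBL_FMT_X87;
    case 106: return LDBL_FMT_IBM128;
    case 113: return LDBL_FMT_BINARY128;
    default:  return LDBL_FMT_UNKNOWN;
    }
}

/* Format the compiler picked for this translation unit. */
enum ldbl_format ldbl_format_of_this_build(void)
{
    return ldbl_format_from_mant_dig(LDBL_MANT_DIG);
}
```

On a PowerPC GCC the result should change with options such as `-mlong-double-64`, `-mabi=ibmlongdouble` and `-mabi=ieeelongdouble`, which makes this a cheap sanity check when flipping the default for a port.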

> 
> 
> Segher
>

Daniel


* Re: ppc64le and 32-bit LE userland compatibility
From: Segher Boessenkool @ 2020-06-02  1:58 UTC
  To: Daniel Kolesa
  Cc: libc-alpha, eery, musl, Will Springer,
	Palmer Dabbelt via binutils, via libc-dev, linuxppc-dev,
	Joseph Myers
In-Reply-To: <c821b608-f14f-4a68-bbec-b7b6c1d8bddc@www.fastmail.com>

On Tue, Jun 02, 2020 at 01:26:37AM +0200, Daniel Kolesa wrote:
> On Mon, Jun 1, 2020, at 23:28, Joseph Myers wrote:
> Are you sure this would be a new port? Glibc already works in this combination, as it seems to me it'd be best if it was just a variant of the existing 32-bit PowerPC port, sharing most conventions besides endianness with the BE port.

That's right.  Except it isn't an "official" existing port, never has
been "officially" supported.

> 128-bit IEEE long double would not work, since that relies on VSX being present (gcc will explicitly complain if it's not). I'd be all for using 64-bit long double, though (musl already does, on all ppc ports).

The current IEEE QP float support requires VSX for its emulation, yes
(possibly even Power8?)  As Mike reminded me today, it also requires
__int128 support, which rules out anything 32-bit currently.  Without
that restriction, we could just make QP float passed in GPRs (use the
ABIs for any struct passed that way), and that'll just work out with
all ABIs, older or not.

> While we're at long double, I'd actually be interested in transitioning the existing big endian ports in Void (64-bit and 32-bit, neither has VSX baseline requirement in my case) to using 64-bit long double, abandoning the IBM format altogether (little endian will transition to 128-bit IEEE long double once it's ready on your side, as that assumes POWER8 baseline which includes VSX).

I recommend new ports that cannot jump to IEEE QP float directly to use
long double == double for the time being, avoiding the extra
complications that IBM double double would bring.  But you'll still have
a transition to IEEE 128 if you ever want to go there.

But if you already use double-double, I don't know if the cost of changing
away from that is worth it now.

> What would be the best way for me to proceed with that? I actually experimented with this, using the old glibc compat symbols from pre-ibm128 times, and I mostly had it working, except I haven't managed to find a way to switch the default symbols to 64-bit ones, which is problematic as linking everything against nldbl_nonshared is fragile and potentially quirky (breaks dlsym, function pointer equality across libraries, etc).

Yup.  "Rebuild the world" works :-/  I don't have any better advice,
nothing you cannot figure out yourself.

> There is also one more thing while we're at this. The 64-bit big endian Void port uses the ELFv2 ABI, even on glibc. This is not officially supported on glibc as far as I can tell, but it does work out of box, without any patching (things in general match little endian then, i.e. ld64.so.2 etc, but they're big endian). Is there any chance of making that support official?

(I don't talk for glibc).

The first thing needed is for "us" to have faith in it.  That starts
with seeing test results for the testsuites!

(Something similar goes for the GCC port -- there is no official support
for BE ELFv2, but of course it does work, and if we get test results we
may keep it that way, hint hint :-) )

Segher


* Re: ppc64le and 32-bit LE userland compatibility
From: Segher Boessenkool @ 2020-06-02  1:42 UTC
  To: Joseph Myers
  Cc: libc-alpha, eery, daniel, musl, Will Springer, binutils, libc-dev,
	linuxppc-dev
In-Reply-To: <alpine.DEB.2.21.2006012119010.11121@digraph.polyomino.org.uk>

Hi Joseph,

On Mon, Jun 01, 2020 at 09:28:25PM +0000, Joseph Myers wrote:
> On Fri, 29 May 2020, Will Springer via Binutils wrote:
> 
> > Hey all, a couple of us over in #talos-workstation on freenode have been
> > working on an effort to bring up a Linux PowerPC userland that runs in 32-bit
> > little-endian mode, aka ppcle. As far as we can tell, no ABI has ever been
> > designated for this (unless you count the patchset from a decade ago [1]), so
> > it's pretty much uncharted territory as far as Linux is concerned. We want to
> > sync up with libc and the relevant kernel folks to establish the best path
> > forward.
> 
> As a general comment on the glibc side of things, if this is considered 
> like a new port, and it probably is, the same principles that apply to new 
> ports apply here.
> 
> There's a general discussion at 
> <https://sourceware.org/glibc/wiki/NewPorts>, although much of that is 
> only applicable when adding new CPU architecture support.  More specific 
> points include that new 32-bit ports should default to 64-bit time and 
> file offsets from the start, with no support for 32-bit time or offsets 
> (meaning that if you want to use this with some kind of library call 
> translation, the library call translation will need to deal with 
> corresponding type size conversions).

Either that, or use the same as BE 32-bit PowerPC Linux, I'd say (it
won't make things worse, and if it is easier?)  But preferably the
newer, better, thing of course :-)

> And a new port should not be added 
> that uses the IBM long double format.  You can use IEEE binary128 long 
> double, possibly with an ABI similar to that used on powerpc64le, or can 
> use long double = double, but should not support IBM long double, and 
> preferably should only have one long double format rather than using the 
> glibc support for building with different options resulting in functions 
> for different long double formats being called.

You cannot use IEEE QP float ("binary128") here, but more on that in a
later post.

(I so very much agree about the problems having more than one long
double format -- on the other hand, you'll just share it with BE, and
with the existing powerpcle-linux (sup)port).


Segher


* Re: [musl] Re: ppc64le and 32-bit LE userland compatibility
From: Segher Boessenkool @ 2020-06-02  1:36 UTC
  To: Daniel Kolesa
  Cc: libc-alpha, eery, musl, Will Springer,
	Palmer Dabbelt via binutils, libc-dev, linuxppc-dev
In-Reply-To: <4205b197-b964-451e-bc41-59b35d1dd233@www.fastmail.com>

On Mon, Jun 01, 2020 at 12:29:56AM +0200, Daniel Kolesa wrote:
> On Sun, May 31, 2020, at 22:42, Segher Boessenkool wrote:
> > > There was just an assumption that LE == powerpc64le in libgo, spotted by 
> > > q66 (daniel@ on the CC). I just pushed the patch to [1].
> > 
> > Please send GCC patches to gcc-patches@ ?
> 
> FWIW, that patch alone is not very useful, we'd need to otherwise patch libgo to recognize a new GOARCH (as right now it's likely to just use 'ppc' which is wrong).

Gotcha.

> That said, we'll get back to you with any patches we have. One I can already think of - we will need to update the dynamic linker name so that it uses ld-musl-powerpcle.so instead of powerpc (musl needs to be updated the same way by adding the subarch variable for the 'le' prefix).

Thanks!  That would be good progress.

> > > > Almost no project that used 32-bit PowerPC in LE mode has sent patches
> > > > to the upstreams.
> > > 
> > > Right, but I have heard concerns from at least one person familiar with 
> > > the ppc kernel about breaking existing users of this arch-endianness 
> > > combo, if any. It seems likely that none of those use upstream, though ^^;
> > 
> > So we don't care, because we *cannot* care.
> 
> Well, that's the reason this thread was opened in the first place - to call out to any potential users, and synchronize with upstreams on a single way forward that all upstreams can agree on, since this effort requires changes in various parts of the stack. We don't want to hog changes locally or otherwise do any changes that would be in conflict with upstream projects, as that would mean needlessly diverging, which only means trouble later on.

Much appreciated!

I don't actually foresee any huge problems -- just lots of hard work ;-)


Segher


* Re: ppc64le and 32-bit LE userland compatibility
From: Daniel Kolesa @ 2020-06-02  0:13 UTC
  To: Joseph Myers
  Cc: libc-alpha, eery, musl, Will Springer,
	Palmer Dabbelt via binutils, via libc-dev, linuxppc-dev
In-Reply-To: <alpine.DEB.2.21.2006012351270.11121@digraph.polyomino.org.uk>

On Tue, Jun 2, 2020, at 01:55, Joseph Myers wrote:
> On Mon, 1 Jun 2020, Joseph Myers wrote:
> 
> > The minimum supported architecture for powerpc64le (POWER8) has VSX.  My 
> > understanding was that the suggestion was for 32-bit userspace to run 
> > under powerpc64le kernels running on POWER8 or later, meaning that such a 
> > 32-bit LE port, and any ABI designed for such a port, can assume VSX is 
> > available.  Or does VSX not work, at the hardware level, for 32-bit 
> > POWER8?  (In which case you could pick another ABI for binary128 argument 
> > passing and return.)
> 
> In fact, my understanding is that the ABI for passing binary128 values in 
> vector registers is perfectly implementable for processors with just VMX 
> (AltiVec) and not VSX.  So if you do want to support binary128 for a new 
> ABI for either 32-bit LE or 32-bit or 64-bit BE, you don't need to require 
> VSX for that ABI, you just need to change any GCC requirement for VSX for 
> binary128 to allow it with VMX when building for your new ABI.

Which still doesn't help us even if true, since we plan to support hardware that doesn't have any kind of vector functionality in the first place (PowerPC G3/G4; for the ELFv2 64-bit BE port, the minimum for binary packages is 970/G5, which does have AltiVec, but building your userland from source without it is also supported, e.g. for POWER5 machines or e5500 SoCs).
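
The back-and-forth above can be condensed into a small decision rule. This is a rough sketch only: the function and enum names are hypothetical, `compiler_allows_vmx_only` stands for the GCC change Joseph suggests (which, per this thread, does not exist yet), and the sketch deliberately ignores the `__int128` restriction on 32-bit that Segher mentions elsewhere in the thread.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum ldbl_choice { LDBL_CHOICE_BINARY64, LDBL_CHOICE_BINARY128 };

/*
 * baseline_has_vmx / baseline_has_vsx: features every CPU the port
 * supports is guaranteed to have.
 * compiler_allows_vmx_only: assumed compiler support for passing
 * binary128 in vector registers with VMX but without VSX.
 */
enum ldbl_choice pick_long_double(int baseline_has_vmx,
                                  int baseline_has_vsx,
                                  int compiler_allows_vmx_only)
{
    /* Every CPU with VSX also has VMX (AltiVec). */
    assert(!baseline_has_vsx || baseline_has_vmx);

    if (baseline_has_vsx)
        return LDBL_CHOICE_BINARY128;
    if (baseline_has_vmx && compiler_allows_vmx_only)
        return LDBL_CHOICE_BINARY128;

    /* No vector unit in the baseline: fall back to long double == double. */
    return LDBL_CHOICE_BINARY64;
}
```

With a G3 or POWER5 baseline (no vector unit at all) the rule always lands on binary64, which is the point being made above; a POWER8 baseline lands on binary128.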

> 
> -- 
> Joseph S. Myers
> joseph@codesourcery.com
>


* Re: ppc64le and 32-bit LE userland compatibility
From: Daniel Kolesa @ 2020-06-02  0:11 UTC
  To: Joseph Myers
  Cc: libc-alpha, eery, musl, Will Springer,
	Palmer Dabbelt via binutils, via libc-dev, linuxppc-dev
In-Reply-To: <alpine.DEB.2.21.2006012329420.11121@digraph.polyomino.org.uk>

On Tue, Jun 2, 2020, at 01:45, Joseph Myers wrote:
> On Tue, 2 Jun 2020, Daniel Kolesa wrote:
> 
> > Are you sure this would be a new port? Glibc already works in this 
> > combination, as it seems to me it'd be best if it was just a variant of 
> > the existing 32-bit PowerPC port, sharing most conventions besides 
> > endianness with the BE port.
> 
> The supported glibc ABIs are listed at 
> <https://sourceware.org/glibc/wiki/ABIList>.  This would be a new ABI, 
> which should have a new ABI-and-architecture-specific dynamic linker name 
> (all new ports are expected to have a unique dynamic linker name for each 
> ABI, to support systems using multiarch directory arrangements), new 
> symbol versions and avoid legacy features such as 32-bit time or offsets 
> or IBM long double.
> 
> > 128-bit IEEE long double would not work, since that relies on VSX being 
> > present (gcc will explicitly complain if it's not). I'd be all for using 
> 
> The minimum supported architecture for powerpc64le (POWER8) has VSX.  My 
> understanding was that the suggestion was for 32-bit userspace to run 
> under powerpc64le kernels running on POWER8 or later, meaning that such a 
> 32-bit LE port, and any ABI designed for such a port, can assume VSX is 
> available.  Or does VSX not work, at the hardware level, for 32-bit 
> POWER8?  (In which case you could pick another ABI for binary128 argument 
> passing and return.)

POWER8 may have VSX (well, actually POWER7 and newer have VSX and can run LE, but glibc does not support this; musl potentially does), but the overall assumption here is that the resulting binaries should eventually not be limited to being just userspace under ppc64le, but should be runnable on a native kernel as well, which should not be limited to any particular baseline other than just PowerPC.

While it should in theory be possible to do IEEE ldbl128 using a different ABI, I don't really see any benefit in this - for one, the baseline hardware doesn't support it at any level, it would mean further complicating the ABI, and it would require explicit support in the compiler, which currently doesn't exist. Using 64-bit long doubles sounds like a much better way out to me.

> 
> > There is also one more thing while we're at this. The 64-bit big endian 
> > Void port uses the ELFv2 ABI, even on glibc. This is not officially 
> > supported on glibc as far as I can tell, but it does work out of box, 
> > without any patching (things in general match little endian then, i.e. 
> > ld64.so.2 etc, but they're big endian). Is there any chance of making 
> > that support official?
> 
> If you want to support ELFv2 for 64-bit big endian in glibc, again that 
> should have a unique dynamic linker name, new symbol versions, only 
> binary128 long double, etc. - avoid all the legacy aspects of the existing 
> ELFv1 port rather than selectively saying that "ELFv1" itself is the only 
> legacy aspect and keeping the others (when it's the others that are 
> actually more problematic in glibc).

Again, the BE port cannot use binary128 long double, at least not with the same ABI as on POWER8, since it runs on all 64-bit PowerPC systems starting with 970 (G5, and potentially even POWER4 if built without AltiVec). Unique dynamic linker names are complicated, since as it is, glibc uses ld64.so.1 for ELFv1, and ld64.so.2 for ELFv2. (On 32-bit PowerPC, it's ld.so.1, and uses the SVR4 ABI, which is related to neither the AIX/ELFv1 nor the ELFv2 ABI.) If we were to introduce new ports, what would those use? ld64.so.3 for BE/v2? ld.so.2 for LE/32-bit? I can see the reason for a new dynamic linker name though (multi-arch setups).
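
To make the multi-arch concern concrete, here is a small sketch that maps an ABI description to the loader names mentioned above and reports when two ABIs would end up with the same name. The struct and function names are hypothetical; only the three existing names (ld.so.1, ld64.so.1, ld64.so.2) come from this thread, and the sketch assumes the unofficial BE ELFv2 and 32-bit LE configurations simply inherit them, as described.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical types, for illustration only. */
enum ppc_endian { PPC_BE, PPC_LE };

struct ppc_abi {
    int bits;               /* 32 or 64 */
    enum ppc_endian endian;
    int elf_abi;            /* 1 or 2; only meaningful when bits == 64 */
};

/* Loader names as described in this thread for the existing glibc ports. */
const char *glibc_ldso_name(struct ppc_abi abi)
{
    assert(abi.bits == 32 || abi.bits == 64);
    if (abi.bits == 32)
        return "ld.so.1";   /* SVR4 ABI, endianness not encoded */
    return abi.elf_abi == 2 ? "ld64.so.2" : "ld64.so.1";
}

/* Two ABIs whose loader names match cannot coexist in one root filesystem. */
int ldso_names_collide(struct ppc_abi a, struct ppc_abi b)
{
    return strcmp(glibc_ldso_name(a), glibc_ldso_name(b)) == 0;
}
```

Under these assumptions, 64-bit LE and 64-bit BE ELFv2 collide on ld64.so.2, and 32-bit BE and 32-bit LE collide on ld.so.1, which is exactly why a unique name per ABI is being asked for.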

However, the effective difference between the ports would be rather minimal, if any, as far as I can see. As I already said, we have a whole glibc/ELFv2/BE system, with nearly all of the existing Linux userland covered by the distro, and there haven't been any issues whatsoever.

> 
> -- 
> Joseph S. Myers
> joseph@codesourcery.com
>

Daniel


* Re: ppc64le and 32-bit LE userland compatibility
From: Joseph Myers @ 2020-06-01 23:55 UTC
  To: Daniel Kolesa
  Cc: libc-alpha, eery, musl, Will Springer,
	Palmer Dabbelt via binutils, via libc-dev, linuxppc-dev
In-Reply-To: <alpine.DEB.2.21.2006012329420.11121@digraph.polyomino.org.uk>

On Mon, 1 Jun 2020, Joseph Myers wrote:

> The minimum supported architecture for powerpc64le (POWER8) has VSX.  My 
> understanding was that the suggestion was for 32-bit userspace to run 
> under powerpc64le kernels running on POWER8 or later, meaning that such a 
> 32-bit LE port, and any ABI designed for such a port, can assume VSX is 
> available.  Or does VSX not work, at the hardware level, for 32-bit 
> POWER8?  (In which case you could pick another ABI for binary128 argument 
> passing and return.)

In fact, my understanding is that the ABI for passing binary128 values in 
vector registers is perfectly implementable for processors with just VMX 
(AltiVec) and not VSX.  So if you do want to support binary128 for a new 
ABI for either 32-bit LE or 32-bit or 64-bit BE, you don't need to require 
VSX for that ABI, you just need to change any GCC requirement for VSX for 
binary128 to allow it with VMX when building for your new ABI.

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: [PATCH 8/8] macintosh/adb-iop: Implement SRQ autopolling
From: Finn Thain @ 2020-06-01 23:49 UTC
  To: Geert Uytterhoeven
  Cc: linux-m68k, linuxppc-dev, Linux Kernel Mailing List,
	Joshua Thompson
In-Reply-To: <CAMuHMdVs8ur9pWWEFbYmPLRgdH67coSSrPO0QE8RqFvKjhgyYg@mail.gmail.com>

On Mon, 1 Jun 2020, Geert Uytterhoeven wrote:

> >
> > Sure, it could be absorbed by both asm/mac_iop.h and 
> > drivers/macintosh/adb-iop.c [...]
> 
> asm/mac_iop.h doesn't include asm/adb_iop.h (at least not in my tree, 
> but perhaps you have plans to change that?), so there's only a single 
> user.

What I meant by "both" was that part of asm/adb_iop.h could be absorbed by 
drivers/macintosh/adb-iop.c and the rest by asm/mac_iop.h. (And some of it 
could be tossed out.) I suspect that much of arch/m68k/include/asm could 
get the same treatment. But I doubt that there is any pay off, because the 
headers rarely change where they relate to hardware characteristics.


* Re: ppc64le and 32-bit LE userland compatibility
From: Joseph Myers @ 2020-06-01 23:45 UTC
  To: Daniel Kolesa
  Cc: libc-alpha, eery, musl, Will Springer,
	Palmer Dabbelt via binutils, via libc-dev, linuxppc-dev
In-Reply-To: <c821b608-f14f-4a68-bbec-b7b6c1d8bddc@www.fastmail.com>

On Tue, 2 Jun 2020, Daniel Kolesa wrote:

> Are you sure this would be a new port? Glibc already works in this 
> combination, as it seems to me it'd be best if it was just a variant of 
> the existing 32-bit PowerPC port, sharing most conventions besides 
> endianness with the BE port.

The supported glibc ABIs are listed at 
<https://sourceware.org/glibc/wiki/ABIList>.  This would be a new ABI, 
which should have a new ABI-and-architecture-specific dynamic linker name 
(all new ports are expected to have a unique dynamic linker name for each 
ABI, to support systems using multiarch directory arrangements), new 
symbol versions and avoid legacy features such as 32-bit time or offsets 
or IBM long double.

> 128-bit IEEE long double would not work, since that relies on VSX being 
> present (gcc will explicitly complain if it's not). I'd be all for using 

The minimum supported architecture for powerpc64le (POWER8) has VSX.  My 
understanding was that the suggestion was for 32-bit userspace to run 
under powerpc64le kernels running on POWER8 or later, meaning that such a 
32-bit LE port, and any ABI designed for such a port, can assume VSX is 
available.  Or does VSX not work, at the hardware level, for 32-bit 
POWER8?  (In which case you could pick another ABI for binary128 argument 
passing and return.)

> There is also one more thing while we're at this. The 64-bit big endian 
> Void port uses the ELFv2 ABI, even on glibc. This is not officially 
> supported on glibc as far as I can tell, but it does work out of box, 
> without any patching (things in general match little endian then, i.e. 
> ld64.so.2 etc, but they're big endian). Is there any chance of making 
> that support official?

If you want to support ELFv2 for 64-bit big endian in glibc, again that 
should have a unique dynamic linker name, new symbol versions, only 
binary128 long double, etc. - avoid all the legacy aspects of the existing 
ELFv1 port rather than selectively saying that "ELFv1" itself is the only 
legacy aspect and keeping the others (when it's the others that are 
actually more problematic in glibc).

-- 
Joseph S. Myers
joseph@codesourcery.com


* Re: ppc64le and 32-bit LE userland compatibility
From: Daniel Kolesa @ 2020-06-01 23:26 UTC
  To: Joseph Myers, Will Springer
  Cc: libc-alpha, eery, musl, Palmer Dabbelt via binutils, via libc-dev,
	linuxppc-dev
In-Reply-To: <alpine.DEB.2.21.2006012119010.11121@digraph.polyomino.org.uk>

On Mon, Jun 1, 2020, at 23:28, Joseph Myers wrote:
> On Fri, 29 May 2020, Will Springer via Binutils wrote:
> 
> > Hey all, a couple of us over in #talos-workstation on freenode have been
> > working on an effort to bring up a Linux PowerPC userland that runs in 32-bit
> > little-endian mode, aka ppcle. As far as we can tell, no ABI has ever been
> > designated for this (unless you count the patchset from a decade ago [1]), so
> > it's pretty much uncharted territory as far as Linux is concerned. We want to
> > sync up with libc and the relevant kernel folks to establish the best path
> > forward.
> 
> As a general comment on the glibc side of things, if this is considered 
> like a new port, and it probably is, the same principles that apply to new 
> ports apply here.
> 
> There's a general discussion at 
> <https://sourceware.org/glibc/wiki/NewPorts>, although much of that is 
> only applicable when adding new CPU architecture support.  More specific 
> points include that new 32-bit ports should default to 64-bit time and 
> file offsets from the start, with no support for 32-bit time or offsets 
> (meaning that if you want to use this with some kind of library call 
> translation, the library call translation will need to deal with 
> corresponding type size conversions).  And a new port should not be added 
> that uses the IBM long double format.  You can use IEEE binary128 long 
> double, possibly with an ABI similar to that used on powerpc64le, or can 
> use long double = double, but should not support IBM long double, and 
> preferably should only have one long double format rather than using the 
> glibc support for building with different options resulting in functions 
> for different long double formats being called.

Are you sure this would be a new port? Glibc already works in this combination, as it seems to me it'd be best if it was just a variant of the existing 32-bit PowerPC port, sharing most conventions besides endianness with the BE port.

128-bit IEEE long double would not work, since that relies on VSX being present (gcc will explicitly complain if it's not). I'd be all for using 64-bit long double, though (musl already does, on all ppc ports).

While we're at long double, I'd actually be interested in transitioning the existing big endian ports in Void (64-bit and 32-bit, neither has VSX baseline requirement in my case) to using 64-bit long double, abandoning the IBM format altogether (little endian will transition to 128-bit IEEE long double once it's ready on your side, as that assumes POWER8 baseline which includes VSX).

What would be the best way for me to proceed with that? I actually experimented with this, using the old glibc compat symbols from pre-ibm128 times, and I mostly had it working, except I haven't managed to find a way to switch the default symbols to 64-bit ones, which is problematic as linking everything against nldbl_nonshared is fragile and potentially quirky (breaks dlsym, function pointer equality across libraries, etc).

There is also one more thing while we're at this. The 64-bit big endian Void port uses the ELFv2 ABI, even on glibc. This is not officially supported on glibc as far as I can tell, but it does work out of box, without any patching (things in general match little endian then, i.e. ld64.so.2 etc, but they're big endian). Is there any chance of making that support official?
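
For anyone wanting to confirm which 64-bit ABI a given build actually targets, GCC and Clang define `_CALL_ELF` (1 for ELFv1, 2 for ELFv2) on 64-bit PowerPC ELF targets. The sketch below relies on that macro and on `__BYTE_ORDER__`; the function names are hypothetical, and on non-PowerPC hosts it simply reports 0.

```c
#include <assert.h>

/* Returns 1 for ELFv1, 2 for ELFv2, or 0 when not building for 64-bit PowerPC. */
int ppc64_elf_abi_version(void)
{
#if defined(__powerpc64__) && defined(_CALL_ELF)
    int version = _CALL_ELF;
#else
    int version = 0;
#endif
    assert(version >= 0 && version <= 2);
    return version;
}

/* Returns 1 only for the big-endian + ELFv2 combination discussed above. */
int is_be_elfv2_build(void)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return ppc64_elf_abi_version() == 2;
#else
    return 0;
#endif
}
```

A big-endian toolchain built or invoked with `-mabi=elfv2` should make `is_be_elfv2_build()` return 1.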

> 
> -- 
> Joseph S. Myers
> joseph@codesourcery.com
>

Daniel


* Re: [musl] Re: ppc64le and 32-bit LE userland compatibility
From: Rich Felker @ 2020-06-01 21:36 UTC
  To: Joseph Myers
  Cc: libc-alpha, eery, daniel, musl, Will Springer, binutils, libc-dev,
	linuxppc-dev
In-Reply-To: <alpine.DEB.2.21.2006012119010.11121@digraph.polyomino.org.uk>

On Mon, Jun 01, 2020 at 09:28:25PM +0000, Joseph Myers wrote:
> On Fri, 29 May 2020, Will Springer via Binutils wrote:
> 
> > Hey all, a couple of us over in #talos-workstation on freenode have been
> > working on an effort to bring up a Linux PowerPC userland that runs in 32-bit
> > little-endian mode, aka ppcle. As far as we can tell, no ABI has ever been
> > designated for this (unless you count the patchset from a decade ago [1]), so
> > it's pretty much uncharted territory as far as Linux is concerned. We want to
> > sync up with libc and the relevant kernel folks to establish the best path
> > forward.
> 
> As a general comment on the glibc side of things, if this is considered 
> like a new port, and it probably is, the same principles that apply to new 
> ports apply here.
> 
> There's a general discussion at 
> <https://sourceware.org/glibc/wiki/NewPorts>, although much of that is 
> only applicable when adding new CPU architecture support.  More specific 
> points include that new 32-bit ports should default to 64-bit time and 
> file offsets from the start, with no support for 32-bit time or offsets 
> (meaning that if you want to use this with some kind of library call 
> translation, the library call translation will need to deal with 
> corresponding type size conversions).  And a new port should not be added 
> that uses the IBM long double format.  You can use IEEE binary128 long 
> double, possibly with an ABI similar to that used on powerpc64le, or can 
> use long double = double, but should not support IBM long double, and 
> preferably should only have one long double format rather than using the 
> glibc support for building with different options resulting in functions 
> for different long double formats being called.

Thanks, these are great points, and the same applies for musl I think.
We always have 64-bit off_t anyway, but new ports should have 64-bit
time_t to begin with rather than defining _REDIR_TIME64.

It's a little bit complicated by the fact that powerpcle would be a
"subarch" for powerpc, and we don't yet have any like that where the
subarchs' time64 statuses differ, but it doesn't look like that should
be hard to do. The arch-specific alltypes.h.in already has access to
endianness knowledge. src/ldso/powerpc/dlsym_time64.S would also need
added preprocessor conditionals.
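
As an illustration of what such a conditional could look like, here is a sketch keyed on the compiler's endianness macros. It is not musl code: the `SKETCH_*` macros and both functions are hypothetical names, and the real headers and `dlsym_time64.S` would use musl's own definitions.

```c
#include <assert.h>

/* Hypothetical macros; musl's real headers use different names. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define SKETCH_SUBARCH_IS_LE 1   /* new LE subarch: 64-bit time_t only */
#else
#define SKETCH_SUBARCH_IS_LE 0   /* existing BE subarch: keeps time32 compat */
#endif

/* Hypothetical policy switch derived from the subarch. */
#define SKETCH_NEEDS_TIME32_COMPAT (!SKETCH_SUBARCH_IS_LE)

/* Runtime probe: inspect the first byte of a known integer. */
int runtime_is_little_endian(void)
{
    const unsigned int probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Cross-check the compile-time decision against the runtime probe. */
void check_endianness_macro(void)
{
    assert(SKETCH_SUBARCH_IS_LE == runtime_is_little_endian());
}
```

The idea is that the time32 compatibility redirects are compiled in only where `SKETCH_NEEDS_TIME32_COMPAT` is nonzero, so the LE subarch starts out with 64-bit `time_t` and nothing else.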

Exemption from this would be open to discussion if there are existing
non-upstream users of powerpcle musl that otherwise complies with ABI
policy except for time64, but I'm not aware of any that aren't
experimental.

Rich


* Re: ppc64le and 32-bit LE userland compatibility
From: Joseph Myers @ 2020-06-01 21:28 UTC
  To: Will Springer
  Cc: libc-alpha, eery, daniel, musl, binutils, libc-dev, linuxppc-dev
In-Reply-To: <2047231.C4sosBPzcN@sheen>

On Fri, 29 May 2020, Will Springer via Binutils wrote:

> Hey all, a couple of us over in #talos-workstation on freenode have been
> working on an effort to bring up a Linux PowerPC userland that runs in 32-bit
> little-endian mode, aka ppcle. As far as we can tell, no ABI has ever been
> designated for this (unless you count the patchset from a decade ago [1]), so
> it's pretty much uncharted territory as far as Linux is concerned. We want to
> sync up with libc and the relevant kernel folks to establish the best path
> forward.

As a general comment on the glibc side of things, if this is considered 
like a new port, and it probably is, the same principles that apply to new 
ports apply here.

There's a general discussion at 
<https://sourceware.org/glibc/wiki/NewPorts>, although much of that is 
only applicable when adding new CPU architecture support.  More specific 
points include that new 32-bit ports should default to 64-bit time and 
file offsets from the start, with no support for 32-bit time or offsets 
(meaning that if you want to use this with some kind of library call 
translation, the library call translation will need to deal with 
corresponding type size conversions).  And a new port should not be added 
that uses the IBM long double format.  You can use IEEE binary128 long 
double, possibly with an ABI similar to that used on powerpc64le, or can 
use long double = double, but should not support IBM long double, and 
preferably should only have one long double format rather than using the 
glibc support for building with different options resulting in functions 
for different long double formats being called.
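
The properties listed above can be checked from a test program on any
given toolchain. The following is a minimal sketch, not part of glibc:
the function and enum names are made up for illustration. It relies on
the standard LDBL_MANT_DIG macro, whose value is 53 when long double is
plain double, 106 for IBM double-double and 113 for IEEE binary128.

```c
/* Request 64-bit file offsets and time where the libc makes it optional. */
#define _FILE_OFFSET_BITS 64
#define _TIME_BITS 64

#include <float.h>
#include <sys/types.h>
#include <time.h>

/* Illustrative names only; none of these come from glibc. */
enum ld_format {
	LD_BINARY64,          /* long double = double */
	LD_X87_EXTENDED,      /* 80-bit x87 format */
	LD_IBM_DOUBLE_DOUBLE, /* the format a new port should avoid */
	LD_BINARY128,         /* IEEE quad precision */
	LD_UNKNOWN
};

/* Classify a long double format by its significand width in bits. */
enum ld_format classify_long_double(int mant_dig)
{
	switch (mant_dig) {
	case 53:  return LD_BINARY64;
	case 64:  return LD_X87_EXTENDED;
	case 106: return LD_IBM_DOUBLE_DOUBLE;
	case 113: return LD_BINARY128;
	default:  return LD_UNKNOWN;
	}
}

/* Non-zero when both time_t and off_t are 64 bits wide in this build. */
int abi_has_64bit_time_and_offsets(void)
{
	return sizeof(time_t) == 8 && sizeof(off_t) == 8;
}
```

Calling classify_long_double(LDBL_MANT_DIG) on the target tells you
which of the formats the compiler and libc were configured for.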

-- 
Joseph S. Myers
joseph@codesourcery.com

* Re: [PATCH v1 3/4] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE
From: Ram Pai @ 2020-06-01 19:05 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: ldufour, cclaudio, kvm-ppc, aneesh.kumar, sukadev, linuxppc-dev,
	bauerman, david
In-Reply-To: <20200601115518.GA31382@in.ibm.com>

On Mon, Jun 01, 2020 at 05:25:18PM +0530, Bharata B Rao wrote:
> On Sat, May 30, 2020 at 07:27:50PM -0700, Ram Pai wrote:
> > H_SVM_INIT_DONE incorrectly assumes that the Ultravisor has explicitly
> > called H_SVM_PAGE_IN for all secure pages.
> 
> I don't think that is quite true. HV doesn't assume anything about
> secure pages by itself.

Yes. Currently, it does not assume anything about secure pages.  But I am
proposing that it should consider all pages (except the shared pages) as
secure pages, when H_SVM_INIT_DONE is called.

In other words, HV should treat all pages, except shared pages, as
secure pages once H_SVM_INIT_DONE is called. This includes pages
added subsequently through memory hotplug.

Yes, the Ultravisor can explicitly request the HV to move the pages
individually, but that slows down the transition significantly.
It takes more than 20 minutes to transition them for an SVM of size 100G.

With this proposed enhancement, the switch completes in a few seconds.

> 
> > These GFNs continue to be
> > normal GFNs associated with normal PFNs; when in fact, these GFNs should
> > have been secure GFNs associated with device PFNs.
> 
> Transition to secure state is driven by SVM/UV and HV just responds to
> hcalls by issuing appropriate uvcalls. SVM/UV is in the best position to
> determine the required pages that need to be moved into secure side.
> HV just responds to it and tracks such pages as device private pages.
> 
> If SVM/UV doesn't get in all the pages to secure side by the time
> of H_SVM_INIT_DONE, the remaining pages are just normal (shared or
> otherwise) pages as far as HV is concerned.  Why should HV assume that
> SVM/UV didn't ask for a few pages and hence push those pages during
> H_SVM_INIT_DONE?

By definition, an SVM is a VM backed by secure pages.
Hence all pages (except shared) must turn secure when a VM switches to SVM.

UV is interested in only certain pages of the VM, which it will
request explicitly through H_SVM_PAGE_IN.  All other pages need not
be paged in through UV_PAGE_IN.  They just need to be switched to
device-pages.

> 
> I think UV should drive the movement of pages into secure side both
> of boot-time SVM memory and hot-plugged memory. HV does memslot
> registration uvcall when new memory is plugged in, UV should explicitly
> get the required pages in at that time instead of expecting HV to drive
> the same.
> 
> > +static int uv_migrate_mem_slot(struct kvm *kvm,
> > +		const struct kvm_memory_slot *memslot)
> > +{
> > +	unsigned long gfn = memslot->base_gfn;
> > +	unsigned long end;
> > +	bool downgrade = false;
> > +	struct vm_area_struct *vma;
> > +	int i, ret = 0;
> > +	unsigned long start = gfn_to_hva(kvm, gfn);
> > +
> > +	if (kvm_is_error_hva(start))
> > +		return H_STATE;
> > +
> > +	end = start + (memslot->npages << PAGE_SHIFT);
> > +
> > +	down_write(&kvm->mm->mmap_sem);
> > +
> > +	mutex_lock(&kvm->arch.uvmem_lock);
> > +	vma = find_vma_intersection(kvm->mm, start, end);
> > +	if (!vma || vma->vm_start > start || vma->vm_end < end) {
> > +		ret = H_STATE;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> > +			  MADV_UNMERGEABLE, &vma->vm_flags);
> > +	downgrade_write(&kvm->mm->mmap_sem);
> > +	downgrade = true;
> > +	if (ret) {
> > +		ret = H_STATE;
> > +		goto out_unlock;
> > +	}
> > +
> > +	for (i = 0; i < memslot->npages; i++, ++gfn) {
> > +		/* skip paged-in pages and shared pages */
> > +		if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, NULL) ||
> > +			kvmppc_gfn_is_uvmem_shared(gfn, kvm))
> > +			continue;
> > +
> > +		start = gfn_to_hva(kvm, gfn);
> > +		end = start + (1UL << PAGE_SHIFT);
> > +		ret = kvmppc_svm_migrate_page(vma, start, end,
> > +			(gfn << PAGE_SHIFT), kvm, PAGE_SHIFT, false);
> > +
> > +		if (ret)
> > +			goto out_unlock;
> > +	}
> 
> Is there a guarantee that the vma you got for the start address remains
> valid for all the addresses till end in a memslot? If not, you should
> re-get the vma for the current address in each iteration I suppose.


mm->mmap_sem is the semaphore that guards the vma, right?  If that
semaphore is held, can the vma change?
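
To make the two questions concrete, here is a toy userspace model, not
kernel code. The toy_* names are hypothetical and the ranges are assumed
to be sorted by address. It separates "can the vma change while the lock
is held" from "does one vma cover the whole range": the second is what
the vm_start/vm_end test in the patch checks before the loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a VMA: a half-open address range [start, end). */
struct toy_vma {
	unsigned long start;
	unsigned long end;
};

/*
 * Return the first range overlapping [start, end), or NULL if none does.
 * This only models the lookup; it is not find_vma_intersection() itself.
 */
const struct toy_vma *toy_find_intersection(const struct toy_vma *v,
					    size_t n, unsigned long start,
					    unsigned long end)
{
	for (size_t i = 0; i < n; i++)
		if (v[i].start < end && start < v[i].end)
			return &v[i];
	return NULL;
}

/* True only when a single range covers all of [start, end). */
bool toy_covers(const struct toy_vma *vma, unsigned long start,
		unsigned long end)
{
	return vma && vma->start <= start && vma->end >= end;
}
```

In this model, a range that straddles two adjacent toy_vma entries fails
toy_covers() even though the lookup succeeds, which is the case where a
per-iteration lookup would be needed instead of a single one up front.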


Thanks for your comments,
RP

* Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
From: Jan Kara @ 2020-06-01 14:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jan Kara, linux-nvdimm, jack, Jeff Moyer, oohall, dan.j.williams,
	Michal Suchánek, linuxppc-dev
In-Reply-To: <2bf026cc-2ed0-70b6-bf99-ecfd0fa3dac4@linux.ibm.com>

On Mon 01-06-20 17:31:50, Aneesh Kumar K.V wrote:
> On 6/1/20 3:39 PM, Jan Kara wrote:
> > On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
> > > On 5/29/20 3:22 PM, Jan Kara wrote:
> > > > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > > > > Thanks Michal. I also missed Jeff in this email thread.
> > > > 
> > > > And I think you'll also need some of the sched maintainers for the prctl
> > > > bits...
> > > > 
> > > > > On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > > > > > Adding Jan
> > > > > > 
> > > > > > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > > > > > With POWER10, architecture is adding new pmem flush and sync instructions.
> > > > > > > The kernel should prevent the usage of MAP_SYNC if applications are not using
> > > > > > > the new instructions on newer hardware.
> > > > > > > 
> > > > > > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
> > > > > > > the usage of MAP_SYNC. The kernel config option is added to allow the user
> > > > > > > to control whether MAP_SYNC should be enabled by default or not.
> > > > > > > 
> > > > > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > ...
> > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > > index 8c700f881d92..d5a9a363e81e 100644
> > > > > > > --- a/kernel/fork.c
> > > > > > > +++ b/kernel/fork.c
> > > > > > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
> > > > > > >     static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> > > > > > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > > > > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > > > > > +#else
> > > > > > > +unsigned long default_map_sync_mask = 0;
> > > > > > > +#endif
> > > > > > > +
> > > > 
> > > > I'm not sure CONFIG is really the right approach here. For a distro that would
> > > > basically mean to disable MAP_SYNC for all PPC kernels unless application
> > > > explicitly uses the right prctl. Shouldn't we rather initialize
> > > > default_map_sync_mask on boot based on whether the CPU we run on requires
> > > > new flush instructions or not? Otherwise the patch looks sensible.
> > > > 
> > > 
> > > yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
> > > But on a virtualized platform there is no easy way to detect that. We could
> > > ideally hook this into the nvdimm driver where we look at the new compat
> > > string ibm,persistent-memory-v2 and then disable MAP_SYNC
> > > if we find a device with the specific value.
> > 
> > Hum, couldn't we set some flag for nvdimm devices with
> > "ibm,persistent-memory-v2" property and then check it during mmap(2) time
> > and when the device has this property and the mmap(2) caller doesn't have
> > the prctl set, we'd disallow MAP_SYNC? That should make things mostly
> > seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
> > devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
> > applications need to be aware of new instructions so this isn't that much
> > additional burden...
> 
> I am not sure application would want to add that much details/knowledge
> about a platform in their code. I was expecting application to do
> 
> #ifdef __ppc64__
>         prctl(MAP_SYNC_ENABLE, 1, 0, 0, 0);
> #endif
>         a = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
>                         MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
> 
> 
> For that code all the complexity that we add w.r.t ibm,persistent-memory-v2
> is not useful. Do you see a value in making all these device specific rather
> than a conditional on  __ppc64__?

Yes, from the application POV the code would look like this plus the
application would use instructions appropriate for POWER10 for flushing
caches...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v8 2/5] seq_buf: Export seq_buf_printf
From: Vaibhav Jain @ 2020-06-01 14:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Santosh Sivaraj, Ira Weiny, linux-nvdimm, Aneesh Kumar K . V,
	Cezary Rojewski, Piotr Maziarz, linux-kernel, Christoph Hellwig,
	Oliver O'Halloran, Borislav Petkov, Dan Williams,
	linuxppc-dev
In-Reply-To: <20200601094842.3cd0cab6@gandalf.local.home>

Steven Rostedt <rostedt@goodmis.org> writes:

> On Mon, 01 Jun 2020 17:31:31 +0530
> Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
>
>> Hi Christoph and Steven,
>> 
>> Have addressed your review comment to update the patch description and
>> title for this patch. Can you please provide your ack to this patch.
>> 
>> 
>
> I thought I already did, but it appears it was a reply to a private email
> you sent to me. I didn't realize it was off list.
>
> Anyway:
>
>  Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Thanks Steven,

I had added your ack to the Resend-v7 of this patch at [1], on which
Christoph Hellwig requested an update of the patch title. Hence I needed
your re-ack for this version of the patch.

[1] : https://lore.kernel.org/linux-nvdimm/20200519190058.257981-3-vaibhav@linux.ibm.com/

>
> -- Steve

Cheers
~ Vaibhav

* Re: [PATCH v8 2/5] seq_buf: Export seq_buf_printf
From: Steven Rostedt @ 2020-06-01 13:48 UTC (permalink / raw)
  To: Vaibhav Jain
  Cc: Santosh Sivaraj, linux-nvdimm, Aneesh Kumar K . V, linuxppc-dev,
	Cezary Rojewski, Piotr Maziarz, Christoph Hellwig,
	Oliver O'Halloran, Borislav Petkov, Dan Williams, Ira Weiny,
	linux-kernel
In-Reply-To: <87367f9eqs.fsf@linux.ibm.com>

On Mon, 01 Jun 2020 17:31:31 +0530
Vaibhav Jain <vaibhav@linux.ibm.com> wrote:

> Hi Christoph and Steven,
> 
> Have addressed your review comment to update the patch description and
> title for this patch. Can you please provide your ack to this patch.
> 
> 

I thought I already did, but it appears it was a reply to a private email
you sent to me. I didn't realize it was off list.

Anyway:

 Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

-- Steve

* [powerpc:next-test 177/198] arch/powerpc/kernel/rtas.c:519:9: error: 'local_paca' undeclared; did you mean 'local_inc'?
From: kbuild test robot @ 2020-06-01 12:45 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: linuxppc-dev, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3066 bytes --]

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
head:   e376ca093587eafd840bb0f9df04090e2a54249c
commit: a3871a8b701613da2a13d6d1c523d0bb29ba62de [177/198] powerpc/rtas: Implement reentrant rtas call
config: powerpc-chrp32_defconfig (attached as .config)
compiler: powerpc-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout a3871a8b701613da2a13d6d1c523d0bb29ba62de
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

arch/powerpc/kernel/rtas.c: In function 'rtas_call_reentrant':
>> arch/powerpc/kernel/rtas.c:519:9: error: 'local_paca' undeclared (first use in this function); did you mean 'local_inc'?
519 |  args = local_paca->rtas_args_reentrant;
|         ^~~~~~~~~~
|         local_inc
arch/powerpc/kernel/rtas.c:519:9: note: each undeclared identifier is reported only once for each function it appears in

vim +519 arch/powerpc/kernel/rtas.c

   486	
   487	/**
   488	 * rtas_call_reentrant() - Used for reentrant rtas calls
   489	 * @token:	Token for desired reentrant RTAS call
   490	 * @nargs:	Number of Input Parameters
   491	 * @nret:	Number of Output Parameters
   492	 * @outputs:	Array of outputs
   493	 * @...:	Inputs for desired RTAS call
   494	 *
   495	 * According to LoPAR documentation, only "ibm,int-on", "ibm,int-off",
   496	 * "ibm,get-xive" and "ibm,set-xive" are currently reentrant.
   497	 * Reentrant calls need their own rtas_args buffer, so not using rtas.args, but
   498	 * PACA one instead.
   499	 *
   500	 * Return:	-1 on error,
   501	 *		First output value of RTAS call if (nret > 0),
   502	 *		0 otherwise,
   503	 */
   504	
   505	int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...)
   506	{
   507		va_list list;
   508		struct rtas_args *args;
   509		unsigned long flags;
   510		int i, ret = 0;
   511	
   512		if (!rtas.entry || token == RTAS_UNKNOWN_SERVICE)
   513			return -1;
   514	
   515		local_irq_save(flags);
   516		preempt_disable();
   517	
   518		/* We use the per-cpu (PACA) rtas args buffer */
 > 519		args = local_paca->rtas_args_reentrant;
   520	
   521		va_start(list, outputs);
   522		va_rtas_call_unlocked(args, token, nargs, nret, list);
   523		va_end(list);
   524	
   525		if (nret > 1 && outputs)
   526			for (i = 0; i < nret - 1; ++i)
   527				outputs[i] = be32_to_cpu(args->rets[i + 1]);
   528	
   529		if (nret > 0)
   530			ret = be32_to_cpu(args->rets[0]);
   531	
   532		local_irq_restore(flags);
   533		preempt_enable();
   534	
   535		return ret;
   536	}
   537	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 18831 bytes --]

* Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
From: Aneesh Kumar K.V @ 2020-06-01 12:20 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Jan Kara, linux-nvdimm, jack, Jeff Moyer, oohall, dan.j.williams,
	linuxppc-dev
In-Reply-To: <20200601120705.GQ25173@kitsune.suse.cz>

On 6/1/20 5:37 PM, Michal Suchánek wrote:
> On Mon, Jun 01, 2020 at 05:31:50PM +0530, Aneesh Kumar K.V wrote:
>> On 6/1/20 3:39 PM, Jan Kara wrote:
>>> On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
>>>> On 5/29/20 3:22 PM, Jan Kara wrote:
>>>>> On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
>>>>>> Thanks Michal. I also missed Jeff in this email thread.
>>>>>
>>>>> And I think you'll also need some of the sched maintainers for the prctl
>>>>> bits...
>>>>>
>>>>>> On 5/29/20 3:03 PM, Michal Suchánek wrote:
>>>>>>> Adding Jan
>>>>>>>
>>>>>>> On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
>>>>>>>> With POWER10, architecture is adding new pmem flush and sync instructions.
>>>>>>>> The kernel should prevent the usage of MAP_SYNC if applications are not using
>>>>>>>> the new instructions on newer hardware.
>>>>>>>>
>>>>>>>> This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
>>>>>>>> the usage of MAP_SYNC. The kernel config option is added to allow the user
>>>>>>>> to control whether MAP_SYNC should be enabled by default or not.
>>>>>>>>
>>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> ...
>>>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>>>> index 8c700f881d92..d5a9a363e81e 100644
>>>>>>>> --- a/kernel/fork.c
>>>>>>>> +++ b/kernel/fork.c
>>>>>>>> @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
>>>>>>>>      static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
>>>>>>>> +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
>>>>>>>> +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
>>>>>>>> +#else
>>>>>>>> +unsigned long default_map_sync_mask = 0;
>>>>>>>> +#endif
>>>>>>>> +
>>>>>
>>>>> I'm not sure CONFIG is really the right approach here. For a distro that would
>>>>> basically mean to disable MAP_SYNC for all PPC kernels unless application
>>>>> explicitly uses the right prctl. Shouldn't we rather initialize
>>>>> default_map_sync_mask on boot based on whether the CPU we run on requires
>>>>> new flush instructions or not? Otherwise the patch looks sensible.
>>>>>
>>>>
>>>> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
>>>> But on a virtualized platform there is no easy way to detect that. We could
>>>> ideally hook this into the nvdimm driver where we look at the new compat
>>>> string ibm,persistent-memory-v2 and then disable MAP_SYNC
>>>> if we find a device with the specific value.
>>>
>>> Hum, couldn't we set some flag for nvdimm devices with
>>> "ibm,persistent-memory-v2" property and then check it during mmap(2) time
>>> and when the device has this property and the mmap(2) caller doesn't have
>>> the prctl set, we'd disallow MAP_SYNC? That should make things mostly
>>> seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
>>> devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
>>> applications need to be aware of new instructions so this isn't that much
>>> additional burden...
>>
>> I am not sure application would want to add that much details/knowledge
>> about a platform in their code. I was expecting application to do
>>
>> #ifdef __ppc64__
>>          prctl(MAP_SYNC_ENABLE, 1, 0, 0, 0);
>> #endif
>>          a = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
>>                          MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
>>
>>
>> For that code all the complexity that we add w.r.t ibm,persistent-memory-v2
>> is not useful. Do you see a value in making all these device specific rather
>> than a conditional on  __ppc64__?

> If the vpmem devices continue to work with the old instruction on
> POWER10 then it makes sense to make this per-device.

vPMEM doesn't have write_cache and hence it is synchronous even without 
using any specific flush instruction. The question is do we want to have
different programming steps when running on vPMEM vs a persistent PMEM 
device on ppc64.

I will work on the device specific ENABLE flag and then we can compare 
the kernel complexity against the added benefit.
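
For comparison purposes, the per-device policy discussed in this thread
reduces to a small truth table. The sketch below is a userspace model
under that reading, not the planned kernel change. The function and
parameter names are hypothetical: "dev_needs_new_flush" stands for a
device advertising "ibm,persistent-memory-v2", and "task_opted_in"
stands for a process that has made the proposed prctl call.

```c
#include <stdbool.h>

/*
 * MAP_SYNC is refused only when the device requires the new flush
 * instructions and the task has not declared that it uses them.
 */
bool toy_map_sync_allowed(bool dev_needs_new_flush, bool task_opted_in)
{
	return !dev_needs_new_flush || task_opted_in;
}
```

Under this model a vPMEM device, which needs no flush instruction, keeps
working for unmodified applications, and only the combination of a new
device with an application that has not opted in is denied.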


-aneesh

* Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
From: Michal Suchánek @ 2020-06-01 12:07 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jan Kara, linux-nvdimm, jack, Jeff Moyer, oohall, dan.j.williams,
	linuxppc-dev
In-Reply-To: <2bf026cc-2ed0-70b6-bf99-ecfd0fa3dac4@linux.ibm.com>

On Mon, Jun 01, 2020 at 05:31:50PM +0530, Aneesh Kumar K.V wrote:
> On 6/1/20 3:39 PM, Jan Kara wrote:
> > On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
> > > On 5/29/20 3:22 PM, Jan Kara wrote:
> > > > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > > > > Thanks Michal. I also missed Jeff in this email thread.
> > > > 
> > > > And I think you'll also need some of the sched maintainers for the prctl
> > > > bits...
> > > > 
> > > > > On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > > > > > Adding Jan
> > > > > > 
> > > > > > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > > > > > With POWER10, architecture is adding new pmem flush and sync instructions.
> > > > > > > The kernel should prevent the usage of MAP_SYNC if applications are not using
> > > > > > > the new instructions on newer hardware.
> > > > > > > 
> > > > > > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
> > > > > > > the usage of MAP_SYNC. The kernel config option is added to allow the user
> > > > > > > to control whether MAP_SYNC should be enabled by default or not.
> > > > > > > 
> > > > > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > ...
> > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > > index 8c700f881d92..d5a9a363e81e 100644
> > > > > > > --- a/kernel/fork.c
> > > > > > > +++ b/kernel/fork.c
> > > > > > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
> > > > > > >     static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> > > > > > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > > > > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > > > > > +#else
> > > > > > > +unsigned long default_map_sync_mask = 0;
> > > > > > > +#endif
> > > > > > > +
> > > > 
> > > > I'm not sure CONFIG is really the right approach here. For a distro that would
> > > > basically mean to disable MAP_SYNC for all PPC kernels unless application
> > > > explicitly uses the right prctl. Shouldn't we rather initialize
> > > > default_map_sync_mask on boot based on whether the CPU we run on requires
> > > > new flush instructions or not? Otherwise the patch looks sensible.
> > > > 
> > > 
> > > yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
> > > But on a virtualized platform there is no easy way to detect that. We could
> > > ideally hook this into the nvdimm driver where we look at the new compat
> > > string ibm,persistent-memory-v2 and then disable MAP_SYNC
> > > if we find a device with the specific value.
> > 
> > Hum, couldn't we set some flag for nvdimm devices with
> > "ibm,persistent-memory-v2" property and then check it during mmap(2) time
> > and when the device has this property and the mmap(2) caller doesn't have
> > the prctl set, we'd disallow MAP_SYNC? That should make things mostly
> > seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
> > devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
> > applications need to be aware of new instructions so this isn't that much
> > additional burden...
> 
> I am not sure application would want to add that much details/knowledge
> about a platform in their code. I was expecting application to do
> 
> #ifdef __ppc64__
>         prctl(MAP_SYNC_ENABLE, 1, 0, 0, 0);
> #endif
>         a = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
>                         MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
> 
> 
> For that code all the complexity that we add w.r.t ibm,persistent-memory-v2
> is not useful. Do you see a value in making all these device specific rather
> than a conditional on  __ppc64__?
If the vpmem devices continue to work with the old instruction on
POWER10 then it makes sense to make this per-device.

Also, adding a message to the kernel log in case the application does not
do the prctl would be helpful for people migrating old code to POWER10.

Thanks

Michal

* Re: [PATCH v8 2/5] seq_buf: Export seq_buf_printf
From: Vaibhav Jain @ 2020-06-01 12:01 UTC (permalink / raw)
  To: Christoph Hellwig, Steven Rostedt
  Cc: Santosh Sivaraj, linux-nvdimm, Aneesh Kumar K . V, linuxppc-dev,
	Cezary Rojewski, Steven Rostedt, Piotr Maziarz, Christoph Hellwig,
	Oliver O'Halloran, Borislav Petkov, Dan Williams, Ira Weiny,
	linux-kernel
In-Reply-To: <20200527041244.37821-3-vaibhav@linux.ibm.com>


Hi Christoph and Steven,

I have addressed your review comment to update the patch description and
title for this patch. Can you please provide your ack for this patch?

Thanks,
~ Vaibhav

Vaibhav Jain <vaibhav@linux.ibm.com> writes:

> 'seq_buf' provides a very useful abstraction for writing to a string
> buffer without needing to worry about it over-flowing. However even
> though the API has been stable for couple of years now its still not
> exported to kernel loadable modules limiting its usage.
>
> Hence this patch proposes update to 'seq_buf.c' to mark
> seq_buf_printf() which is part of the seq_buf API to be exported to
> kernel loadable GPL modules. This symbol will be used in later parts
> of this patch-set to simplify content creation for a sysfs attribute.
>
> Cc: Piotr Maziarz <piotrx.maziarz@linux.intel.com>
> Cc: Cezary Rojewski <cezary.rojewski@intel.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
> Changelog:
>
> v7..v8:
> * Updated the patch title [ Christoph Hellwig ]
> * Updated patch description to replace confusing term 'external kernel
>   modules' to 'kernel loadable modules'.
>
> Resend:
> * Added ack from Steven Rostedt
>
> v6..v7:
> * New patch in the series
> ---
>  lib/seq_buf.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/lib/seq_buf.c b/lib/seq_buf.c
> index 4e865d42ab03..707453f5d58e 100644
> --- a/lib/seq_buf.c
> +++ b/lib/seq_buf.c
> @@ -91,6 +91,7 @@ int seq_buf_printf(struct seq_buf *s, const char *fmt, ...)
>  
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(seq_buf_printf);
>  
>  #ifdef CONFIG_BINARY_PRINTF
>  /**
> -- 
> 2.26.2
>
-- 
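
For readers unfamiliar with the abstraction described in the patch
above, here is a minimal userspace model of a bounded, overflow-tracking
printf buffer. It is a sketch only and not the kernel's lib/seq_buf.c:
the toy_* names are hypothetical, and the only behaviour borrowed from
the kernel API is the convention of returning 0 on success and -1 when
the text does not fit.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model of a sequence buffer: fixed storage plus a fill level. */
struct toy_seq_buf {
	char *buffer;
	size_t size;    /* total capacity in bytes */
	size_t len;     /* bytes used, excluding the terminating NUL */
	int overflowed; /* set once a write did not fit */
};

void toy_seq_buf_init(struct toy_seq_buf *s, char *buf, size_t size)
{
	s->buffer = buf;
	s->size = size;
	s->len = 0;
	s->overflowed = 0;
	if (size)
		buf[0] = '\0';
}

/* Append formatted text; return 0 on success, -1 if it did not fit. */
int toy_seq_buf_printf(struct toy_seq_buf *s, const char *fmt, ...)
{
	va_list ap;
	size_t room;
	int n;

	if (s->overflowed || s->len >= s->size) {
		s->overflowed = 1;
		return -1;
	}

	room = s->size - s->len;
	va_start(ap, fmt);
	n = vsnprintf(s->buffer + s->len, room, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= room) {
		s->buffer[s->len] = '\0'; /* discard the partial write */
		s->overflowed = 1;
		return -1;
	}

	s->len += (size_t)n;
	return 0;
}
```

A caller can issue several toy_seq_buf_printf() calls in a row and check
the overflowed flag once at the end, which is the convenience the patch
description refers to.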

* Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
From: Aneesh Kumar K.V @ 2020-06-01 12:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, jack, Jeff Moyer, oohall, dan.j.williams,
	Michal Suchánek, linuxppc-dev
In-Reply-To: <20200601100925.GC3960@quack2.suse.cz>

On 6/1/20 3:39 PM, Jan Kara wrote:
> On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
>> On 5/29/20 3:22 PM, Jan Kara wrote:
>>> On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
>>>> Thanks Michal. I also missed Jeff in this email thread.
>>>
>>> And I think you'll also need some of the sched maintainers for the prctl
>>> bits...
>>>
>>>> On 5/29/20 3:03 PM, Michal Suchánek wrote:
>>>>> Adding Jan
>>>>>
>>>>> On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
>>>>>> With POWER10, architecture is adding new pmem flush and sync instructions.
>>>>>> The kernel should prevent the usage of MAP_SYNC if applications are not using
>>>>>> the new instructions on newer hardware.
>>>>>>
>>>>>> This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
>>>>>> the usage of MAP_SYNC. The kernel config option is added to allow the user
>>>>>> to control whether MAP_SYNC should be enabled by default or not.
>>>>>>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ...
>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>> index 8c700f881d92..d5a9a363e81e 100644
>>>>>> --- a/kernel/fork.c
>>>>>> +++ b/kernel/fork.c
>>>>>> @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
>>>>>>     static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
>>>>>> +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
>>>>>> +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
>>>>>> +#else
>>>>>> +unsigned long default_map_sync_mask = 0;
>>>>>> +#endif
>>>>>> +
>>>
>>> I'm not sure CONFIG is really the right approach here. For a distro that would
>>> basically mean to disable MAP_SYNC for all PPC kernels unless application
>>> explicitly uses the right prctl. Shouldn't we rather initialize
>>> default_map_sync_mask on boot based on whether the CPU we run on requires
>>> new flush instructions or not? Otherwise the patch looks sensible.
>>>
>>
>> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
>> But on a virtualized platform there is no easy way to detect that. We could
>> ideally hook this into the nvdimm driver where we look at the new compat
>> string ibm,persistent-memory-v2 and then disable MAP_SYNC
>> if we find a device with the specific value.
> 
> Hum, couldn't we set some flag for nvdimm devices with
> "ibm,persistent-memory-v2" property and then check it during mmap(2) time
> and when the device has this property and the mmap(2) caller doesn't have
> the prctl set, we'd disallow MAP_SYNC? That should make things mostly
> seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
> devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
> applications need to be aware of new instructions so this isn't that much
> additional burden...

I am not sure applications would want to add that much detail/knowledge
about a platform in their code. I was expecting applications to do

#ifdef __ppc64__
         prctl(MAP_SYNC_ENABLE, 1, 0, 0, 0);
#endif
         a = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
                         MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);


For that code, all the complexity that we add w.r.t.
ibm,persistent-memory-v2 is not useful. Do you see value in making all
of this device specific rather than conditional on __ppc64__?
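
A slightly fuller sketch of the application side, for reference. The
MAP_SYNC_ENABLE prctl is only a proposal in this RFC and is not defined
in any released header, so the sketch leaves it out and shows just the
existing mmap(2) part. The helper name map_pmem_file() is made up, and
the fallback to a plain MAP_SHARED mapping is an assumption about what
an application might want when MAP_SYNC is refused.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallback values from the generic Linux UAPI headers, for older libcs. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Try a MAP_SYNC mapping first. If the kernel or filesystem refuses it
 * (EOPNOTSUPP, or EINVAL on kernels that predate MAP_SHARED_VALIDATE),
 * fall back to MAP_SHARED. *is_sync is 1 only for a MAP_SYNC mapping;
 * after a fallback the caller must use msync()/fsync() for durability.
 */
void *map_pmem_file(int fd, size_t len, int *is_sync)
{
	void *p;

	*is_sync = 0;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*is_sync = 1;
		return p;
	}

	if (errno != EOPNOTSUPP && errno != EINVAL)
		return MAP_FAILED;

	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

If the proposed prctl were adopted, a ppc64 application would call it
once before the first map_pmem_file() call, as in the snippet above.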


> 
>> With that I am wondering should we even have this patch? Can we expect
>> userspace get updated to use new instruction?.
>>
>> With ppc64 we never had a real persistent memory device available for end
>> user to try. The available persistent memory stack was using vPMEM which was
>> presented as a volatile memory region for which there is no need to use any
>> of the flush instructions. We could safely assume that as we get
>> applications certified/verified for working with pmem device on ppc64, they
>> would all be using the new instructions?
> 
> This is a bit of a gamble... I don't have too much trust in certification /
> verification because only the "big players" may do powerfail testing
> throughout enough that they'd uncover these problems. So the question
> really is: How many apps are out there using MAP_SYNC on ppc64? Hopefully
> not many given the HW didn't ship yet as you wrote but I have no real clue.
> Similarly there's a question: How many app writers will read manual for
> older ppc64 architecture and write apps that won't work reliably on
> POWER10? Again, I have no idea.
> 
> So the prctl would be IMHO a nice safety belt but I'm not 100% certain it
> will be needed...
> 
>

-aneesh

* Re: [PATCH v1 3/4] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE
From: Bharata B Rao @ 2020-06-01 11:55 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, aneesh.kumar, sukadev, linuxppc-dev,
	bauerman, david
In-Reply-To: <1590892071-25549-4-git-send-email-linuxram@us.ibm.com>

On Sat, May 30, 2020 at 07:27:50PM -0700, Ram Pai wrote:
> H_SVM_INIT_DONE incorrectly assumes that the Ultravisor has explicitly
> called H_SVM_PAGE_IN for all secure pages.

I don't think that is quite true. HV doesn't assume anything about
secure pages by itself.

> These GFNs continue to be
> normal GFNs associated with normal PFNs; when in fact, these GFNs should
> have been secure GFNs associated with device PFNs.

Transition to secure state is driven by SVM/UV and HV just responds to
hcalls by issuing appropriate uvcalls. SVM/UV is in the best position to
determine the required pages that need to be moved into secure side.
HV just responds to it and tracks such pages as device private pages.

If SVM/UV doesn't get in all the pages to secure side by the time
of H_SVM_INIT_DONE, the remaining pages are just normal (shared or
otherwise) pages as far as HV is concerned.  Why should HV assume that
SVM/UV didn't ask for a few pages and hence push those pages during
H_SVM_INIT_DONE?

I think UV should drive the movement of pages into secure side both
of boot-time SVM memory and hot-plugged memory. HV does memslot
registration uvcall when new memory is plugged in, UV should explicitly
get the required pages in at that time instead of expecting HV to drive
the same.

> +static int uv_migrate_mem_slot(struct kvm *kvm,
> +		const struct kvm_memory_slot *memslot)
> +{
> +	unsigned long gfn = memslot->base_gfn;
> +	unsigned long end;
> +	bool downgrade = false;
> +	struct vm_area_struct *vma;
> +	int i, ret = 0;
> +	unsigned long start = gfn_to_hva(kvm, gfn);
> +
> +	if (kvm_is_error_hva(start))
> +		return H_STATE;
> +
> +	end = start + (memslot->npages << PAGE_SHIFT);
> +
> +	down_write(&kvm->mm->mmap_sem);
> +
> +	mutex_lock(&kvm->arch.uvmem_lock);
> +	vma = find_vma_intersection(kvm->mm, start, end);
> +	if (!vma || vma->vm_start > start || vma->vm_end < end) {
> +		ret = H_STATE;
> +		goto out_unlock;
> +	}
> +
> +	ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> +			  MADV_UNMERGEABLE, &vma->vm_flags);
> +	downgrade_write(&kvm->mm->mmap_sem);
> +	downgrade = true;
> +	if (ret) {
> +		ret = H_STATE;
> +		goto out_unlock;
> +	}
> +
> +	for (i = 0; i < memslot->npages; i++, ++gfn) {
> +		/* skip paged-in pages and shared pages */
> +		if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, NULL) ||
> +			kvmppc_gfn_is_uvmem_shared(gfn, kvm))
> +			continue;
> +
> +		start = gfn_to_hva(kvm, gfn);
> +		end = start + (1UL << PAGE_SHIFT);
> +		ret = kvmppc_svm_migrate_page(vma, start, end,
> +			(gfn << PAGE_SHIFT), kvm, PAGE_SHIFT, false);
> +
> +		if (ret)
> +			goto out_unlock;
> +	}

Is there a guarantee that the vma you got for the start address remains
valid for all the addresses till end in a memslot? If not, you should
re-get the vma for the current address in each iteration I suppose.

Regards,
Bharata.
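
A userspace sketch of the per-iteration lookup suggested above. This is
a model only, not kernel code: struct range stands in for a VMA, and
find_range() and walk_pages() are hypothetical names introduced for
illustration. It shows the lookup being repeated for each page, so a
range that spans several mappings, or contains a hole, is handled
explicitly rather than assumed away.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VMA: a half-open address range [start, end). */
struct range {
	unsigned long start;
	unsigned long end;
};

/* Return the range containing addr, or NULL if addr is not covered. */
static const struct range *find_range(const struct range *r, size_t n,
				      unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= r[i].start && addr < r[i].end)
			return &r[i];
	return NULL;
}

/*
 * Walk [start, end) one page at a time, looking the range up again for
 * every page instead of reusing the result of the first lookup.
 * Returns the number of pages visited, or -1 if a page is not covered.
 */
static long walk_pages(const struct range *r, size_t n,
		       unsigned long start, unsigned long end,
		       unsigned long page_size)
{
	long visited = 0;

	for (unsigned long addr = start; addr < end; addr += page_size) {
		const struct range *cur = find_range(r, n, addr);

		if (!cur)
			return -1;	/* hole in the address range */
		visited++;
	}
	return visited;
}
```

In the kernel the lookup would have to happen under the appropriate mm
lock; that part is deliberately left out of this model.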

^ permalink raw reply

* Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
From: Jan Kara @ 2020-06-01 10:09 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jan Kara, linux-nvdimm, jack, Jeff Moyer, oohall, dan.j.williams,
	Michal Suchánek, linuxppc-dev
In-Reply-To: <7e8ee9e3-4d4d-e4b9-913b-1c2448adc62a@linux.ibm.com>

On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
> On 5/29/20 3:22 PM, Jan Kara wrote:
> > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > > Thanks Michal. I also missed Jeff in this email thread.
> > 
> > And I think you'll also need some of the sched maintainers for the prctl
> > bits...
> > 
> > > On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > > > Adding Jan
> > > > 
> > > > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > > > With POWER10, architecture is adding new pmem flush and sync instructions.
> > > > > The kernel should prevent the usage of MAP_SYNC if applications are not using
> > > > > the new instructions on newer hardware.
> > > > > 
> > > > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
> > > > > the usage of MAP_SYNC. The kernel config option is added to allow the user
> > > > > to control whether MAP_SYNC should be enabled by default or not.
> > > > > 
> > > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > ...
> > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > index 8c700f881d92..d5a9a363e81e 100644
> > > > > --- a/kernel/fork.c
> > > > > +++ b/kernel/fork.c
> > > > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
> > > > >    static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> > > > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > > > +#else
> > > > > +unsigned long default_map_sync_mask = 0;
> > > > > +#endif
> > > > > +
> > 
> > I'm not sure CONFIG is really the right approach here. For a distro that would
> > basically mean to disable MAP_SYNC for all PPC kernels unless application
> > explicitly uses the right prctl. Shouldn't we rather initialize
> > default_map_sync_mask on boot based on whether the CPU we run on requires
> > new flush instructions or not? Otherwise the patch looks sensible.
> > 
> 
> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
> But on a virtualized platform there is no easy way to detect that. We could
> ideally hook this into the nvdimm driver where we look at the new compat
> string ibm,persistent-memory-v2 and then disable MAP_SYNC
> if we find a device with the specific value.

Hum, couldn't we set some flag for nvdimm devices with
"ibm,persistent-memory-v2" property and then check it during mmap(2) time
and when the device has this property and the mmap(2) caller doesn't have
the prctl set, we'd disallow MAP_SYNC? That should make things mostly
seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
applications need to be aware of new instructions so this isn't that much
additional burden...

> With that I am wondering should we even have this patch? Can we expect
> userspace to get updated to use the new instructions?
> 
> With ppc64 we never had a real persistent memory device available for end
> user to try. The available persistent memory stack was using vPMEM which was
> presented as a volatile memory region for which there is no need to use any
> of the flush instructions. We could safely assume that as we get
> applications certified/verified for working with pmem device on ppc64, they
> would all be using the new instructions?

This is a bit of a gamble... I don't have too much trust in certification /
verification because only the "big players" may do powerfail testing
thoroughly enough that they'd uncover these problems. So the question
really is: How many apps are out there using MAP_SYNC on ppc64? Hopefully
not many given the HW didn't ship yet as you wrote but I have no real clue.
Similarly there's a question: How many app writers will read the manual for
older ppc64 architecture and write apps that won't work reliably on
POWER10? Again, I have no idea.

So the prctl would be IMHO a nice safety belt but I'm not 100% certain it
will be needed...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
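
A minimal sketch of the mmap(2)-time decision described in this message,
written as plain C so the policy can be read in isolation. All names
here (struct pmem_dev, struct task_state, map_sync_allowed) are
hypothetical and do not come from the patch; the sketch assumes one flag
per device (set when the device carries the "ibm,persistent-memory-v2"
property) and one flag per process (set when the application has opted
in through the prctl under discussion).

```c
#include <stdbool.h>

/* Hypothetical per-device state. */
struct pmem_dev {
	bool needs_new_flush;	/* device requires the new flush instructions */
};

/* Hypothetical per-process state. */
struct task_state {
	bool map_sync_opt_in;	/* process declared it uses the new instructions */
};

/*
 * Decide whether a MAP_SYNC mapping may be granted.
 * Devices that do not need the new instructions are never restricted;
 * devices that do need them require an explicit opt-in from the caller.
 */
static bool map_sync_allowed(const struct pmem_dev *dev,
			     const struct task_state *task)
{
	if (!dev->needs_new_flush)
		return true;
	return task->map_sync_opt_in;
}
```

With this shape, existing applications keep working on older devices
without any change, and only the combination of a new device and a
process that has not opted in is refused.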

^ permalink raw reply

* Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
From: Jan Kara @ 2020-06-01  9:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Aneesh Kumar K.V, jack, Jeff Moyer,
	Oliver O'Halloran, Michal Suchánek, linuxppc-dev
In-Reply-To: <CAPcyv4i7k7t8is_6FKAWbWsGHVO0kvj-OqqqJTzw=VS7xtZVvQ@mail.gmail.com>

On Sat 30-05-20 09:35:19, Dan Williams wrote:
> On Sat, May 30, 2020 at 12:18 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
> >
> > On 5/30/20 12:52 AM, Dan Williams wrote:
> > > On Fri, May 29, 2020 at 3:55 AM Aneesh Kumar K.V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > >>
> > >> On 5/29/20 3:22 PM, Jan Kara wrote:
> > >>> Hi!
> > >>>
> > >>> On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > >>>> Thanks Michal. I also missed Jeff in this email thread.
> > >>>
> > >>> And I think you'll also need some of the sched maintainers for the prctl
> > >>> bits...
> > >>>
> > >>>> On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > >>>>> Adding Jan
> > >>>>>
> > >>>>> On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > >>>>>> With POWER10, architecture is adding new pmem flush and sync instructions.
> > >>>>>> The kernel should prevent the usage of MAP_SYNC if applications are not using
> > >>>>>> the new instructions on newer hardware.
> > >>>>>>
> > >>>>>> This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
> > >>>>>> the usage of MAP_SYNC. The kernel config option is added to allow the user
> > >>>>>> to control whether MAP_SYNC should be enabled by default or not.
> > >>>>>>
> > >>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > >>> ...
> > >>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
> > >>>>>> index 8c700f881d92..d5a9a363e81e 100644
> > >>>>>> --- a/kernel/fork.c
> > >>>>>> +++ b/kernel/fork.c
> > >>>>>> @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
> > >>>>>>     static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> > >>>>>> +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > >>>>>> +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > >>>>>> +#else
> > >>>>>> +unsigned long default_map_sync_mask = 0;
> > >>>>>> +#endif
> > >>>>>> +
> > >>>
> > >>> I'm not sure CONFIG is really the right approach here. For a distro that would
> > >>> basically mean to disable MAP_SYNC for all PPC kernels unless application
> > >>> explicitly uses the right prctl. Shouldn't we rather initialize
> > >>> default_map_sync_mask on boot based on whether the CPU we run on requires
> > >>> new flush instructions or not? Otherwise the patch looks sensible.
> > >>>
> > >>
> > >> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t
> > >> POWER10. But on a virtualized platform there is no easy way to detect
> > >> that. We could ideally hook this into the nvdimm driver where we look at
> > >> the new compat string ibm,persistent-memory-v2 and then disable MAP_SYNC
> > >> if we find a device with the specific value.
> > >>
> > >> BTW with the recent changes I posted for the nvdimm driver, older kernel
> > >> won't initialize persistent memory device on newer hardware. Newer
> > >> hardware will present the device to OS with a different device tree
> > >> compat string.
> > >>
> > >> My expectation  w.r.t this patch was, Distro would want to  mark
> > >> CONFIG_ARCH_MAP_SYNC_DISABLE=n based on the different application
> > >> certification.  Otherwise application will have to end up calling the
> > >> prctl(MMF_DISABLE_MAP_SYNC, 0) any way. If that is the case, should this
> > >> be dependent on P10?
> > >>
> > >> With that I am wondering should we even have this patch? Can we expect
> > >> userspace to get updated to use the new instructions?
> > >>
> > >> With ppc64 we never had a real persistent memory device available for
> > >> end user to try. The available persistent memory stack was using vPMEM
> > >> which was presented as a volatile memory region for which there is no
> > >> need to use any of the flush instructions. We could safely assume that
> > >> as we get applications certified/verified for working with pmem device
> > >> on ppc64, they would all be using the new instructions?
> > >
> > > I think prctl is the wrong interface for this. I was thinking a sysfs
> > > interface along the same lines as /sys/block/pmemX/dax/write_cache.
> > > That attribute is toggling DAXDEV_WRITE_CACHE for the determination of
> > > whether the platform or the kernel needs to handle cache flushing
> > > relative to power loss. A similar attribute can be established for
> > > DAXDEV_SYNC, it would simply default to off based on a configuration
> > > time policy, but be dynamically changeable at runtime via sysfs.
> > >
> > > These flags are device properties that affect the kernel and
> > > userspace's handling of persistence.
> > >
> >
> > That will not handle the scenario with multiple applications using the
> > same fsdax mount point where one is updated to use the new instruction
> > and the other is not.
> 
> Right, it needs to be a global setting / flag day to switch from one
> regime to another. Per-process control is a recipe for disaster.

First I'd like to mention that hopefully the concern is mostly theoretical
since as Aneesh wrote above, real persistent memory never shipped for PPC
and so there are very few apps (if any) using the old way to ensure cache
flushing.

But I'd like to understand why you think per-process control is a recipe
for disaster? Because from my POV the sysfs interface you propose is actually
difficult to use in practice. As a distributor, you have a hard time picking
the default because you have a choice between the safe option, which is
going to confuse users because of failing MAP_SYNC, and the unsafe option, where
everyone will be happy until someone loses data because of some ancient
application using the wrong instructions to persist data. Poor experience for
users either way. And when the distro defaults to the "safe option", then the
burden is on the sysadmin to toggle the switch, but how is he supposed to
decide when that is safe? First he has to understand what the problem
actually is, then he has to audit all the applications using pmem whether
they use the new instruction - which is IMO a lot of effort if you have a
couple of applications and practically infeasible if you have more of them.
So IMO the burden should be *on the application* to declare that it is
aware of the new instructions to flush pmem on the platform and only to
such application the kernel should give the trust to use MAP_SYNC mappings.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
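
For context, this is roughly what a "failing MAP_SYNC" looks like from
the application side today. The sketch uses only the existing mmap(2)
interface (MAP_SYNC must be combined with MAP_SHARED_VALIDATE); it does
not use the prctl proposed in this thread. The helper name
map_maybe_sync() is hypothetical, and the fallback #defines are only
there for older libc headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallback definitions for older libc headers (values from the Linux UAPI). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Try a MAP_SYNC mapping first; if the kernel refuses it, fall back to
 * an ordinary MAP_SHARED mapping. *synced reports which one was
 * obtained, so the caller knows whether it still has to use
 * msync()/fsync() for durability.
 */
static void *map_maybe_sync(int fd, size_t len, bool *synced)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

	if (p != MAP_FAILED) {
		*synced = true;
		return p;
	}

	*synced = false;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

An application that handles the refusal this way keeps working under
either default; one that treats the refusal as fatal is the case that
would confuse users.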

^ permalink raw reply

* Re: [PATCH 8/8] macintosh/adb-iop: Implement SRQ autopolling
From: Geert Uytterhoeven @ 2020-06-01  9:12 UTC (permalink / raw)
  To: Finn Thain
  Cc: linux-m68k, linuxppc-dev, Linux Kernel Mailing List,
	Joshua Thompson
In-Reply-To: <alpine.LNX.2.22.394.2006011006080.8@nippy.intranet>

Hi Finn,

On Mon, Jun 1, 2020 at 2:15 AM Finn Thain <fthain@telegraphics.com.au> wrote:
> On Sun, 31 May 2020, Geert Uytterhoeven wrote:
> > On Sun, May 31, 2020 at 1:20 AM Finn Thain <fthain@telegraphics.com.au> wrote:
> > >  arch/m68k/include/asm/adb_iop.h |  1 +
> > >  drivers/macintosh/adb-iop.c     | 32 ++++++++++++++++++++++++++------
> >
> > As this header file is used by a single source file only, perhaps it
> > should just be absorbed by the latter?
>
> Sure, it could be absorbed by both asm/mac_iop.h and
> drivers/macintosh/adb-iop.c but I don't see the point...

asm/mac_iop.h doesn't include asm/adb_iop.h (at least not in my tree,
but perhaps you have plans to change that?), so there's only a single
user.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply
