LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Matthew Wilcox @ 2020-09-20 19:22 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-aio, linux-mips, David Howells, linux-mm, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390, linux-scsi,
	x86, Arnd Bergmann, linux-block, io-uring, linux-arm-kernel,
	Jens Axboe, linux-parisc, netdev, linux-kernel,
	linux-security-module, linux-fsdevel, Andrew Morton, linuxppc-dev
In-Reply-To: <20200920191031.GQ3421308@ZenIV.linux.org.uk>

On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
> IMO it's much saner to mark those and refuse to touch them from io_uring...

Simpler solution is to remove io_uring from the 32-bit syscall list.
If you're a 32-bit process, you don't get to use io_uring.  Would
any real users actually care about that?

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Andy Lutomirski @ 2020-09-20 19:14 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-aio, open list:MIPS, David Howells, Linux-MM, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390,
	Linux SCSI List, X86 ML, Matthew Wilcox, Arnd Bergmann,
	linux-block, io-uring, linux-arm-kernel, Jens Axboe, Parisc List,
	Network Development, LKML, LSM List, Linux FS Devel,
	Andrew Morton, linuxppc-dev
In-Reply-To: <20200920180742.GN3421308@ZenIV.linux.org.uk>

On Sun, Sep 20, 2020 at 11:07 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Sep 20, 2020 at 04:15:10PM +0100, Matthew Wilcox wrote:
> > On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> > > Add a flag to force processing a syscall as a compat syscall.  This is
> > > required so that in_compat_syscall() works for I/O submitted by io_uring
> > > helper threads on behalf of compat syscalls.
> >
> > Al doesn't like this much, but my suggestion is to introduce two new
> > opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> > can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> > code can know what that user pointer is pointing to.
>
> Let's separate two issues:
>         1) compat syscalls want 32bit iovecs.  Nothing to do with the
> drivers, dealt with just fine.
>         2) a few drivers are really fucked in head.  They use different
> *DATA* layouts for reads/writes, depending upon the calling process.
> IOW, if you fork/exec a 32bit binary and your stdin is one of those,
> reads from stdin in parent and child will yield different data layouts.
> On the same struct file.
>         That's what Christoph worries about (/dev/sg he'd mentioned is
> one of those).
>
>         IMO we should simply have that dozen or so of pathological files
> marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
> it describes the userland ABI provided by those.  And it's cast in stone.
>

I wonder if this is really quite cast in stone.  We could also have
FMODE_SHITTY_COMPAT and set that when a file like this is *opened* in
compat mode.  Then that particular struct file would be read and
written using the compat data format.  The change would be
user-visible, but the user that would see it would be very strange
indeed.

I don't have a strong opinion as to whether that is better or worse
than denying io_uring access to these things, but at least it moves
the special case out of io_uring.

--Andy

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Al Viro @ 2020-09-20 19:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-aio, linux-mips, David Howells, linux-mm, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390, linux-scsi,
	x86, Arnd Bergmann, linux-block, io-uring, linux-arm-kernel,
	Jens Axboe, linux-parisc, netdev, linux-kernel,
	linux-security-module, linux-fsdevel, Andrew Morton, linuxppc-dev
In-Reply-To: <20200920190159.GT32101@casper.infradead.org>

On Sun, Sep 20, 2020 at 08:01:59PM +0100, Matthew Wilcox wrote:
> On Sun, Sep 20, 2020 at 07:07:42PM +0100, Al Viro wrote:
> > 	2) a few drivers are really fucked in head.  They use different
> > *DATA* layouts for reads/writes, depending upon the calling process.
> > IOW, if you fork/exec a 32bit binary and your stdin is one of those,
> > reads from stdin in parent and child will yield different data layouts.
> > On the same struct file.
> > 	That's what Christoph worries about (/dev/sg he'd mentioned is
> > one of those).
> > 
> > 	IMO we should simply have that dozen or so of pathological files
> > marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
> > it describes the userland ABI provided by those.  And it's cast in stone.
> > 
> > 	Any in_compat_syscall() in ->read()/->write() instances is an ABI
> > bug, plain and simple.  Some are unfixable for compatibility reasons, but
> > any new caller like that should be a big red flag.
> 
> So an IOCB_COMPAT flag would let us know whether the caller is expecting
> a 32-bit or 64-bit layout?  And io_uring could set it based on the
> ctx->compat flag.

So you want to convert all that crap to a form that would see iocb
(IOW, ->read_iter()/->write_iter())?  And, since quite a few are sysfs
attributes, propagate that through sysfs, changing the method signatures
to match that and either modifying fuckloads of instances or introducing
new special methods for the ones that are sensitive to that crap?

IMO it's much saner to mark those and refuse to touch them from io_uring...

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Matthew Wilcox @ 2020-09-20 19:01 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-aio, linux-mips, David Howells, linux-mm, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390, linux-scsi,
	x86, Arnd Bergmann, linux-block, io-uring, linux-arm-kernel,
	Jens Axboe, linux-parisc, netdev, linux-kernel,
	linux-security-module, linux-fsdevel, Andrew Morton, linuxppc-dev
In-Reply-To: <20200920180742.GN3421308@ZenIV.linux.org.uk>

On Sun, Sep 20, 2020 at 07:07:42PM +0100, Al Viro wrote:
> 	2) a few drivers are really fucked in head.  They use different
> *DATA* layouts for reads/writes, depending upon the calling process.
> IOW, if you fork/exec a 32bit binary and your stdin is one of those,
> reads from stdin in parent and child will yield different data layouts.
> On the same struct file.
> 	That's what Christoph worries about (/dev/sg he'd mentioned is
> one of those).
> 
> 	IMO we should simply have that dozen or so of pathological files
> marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
> it describes the userland ABI provided by those.  And it's cast in stone.
> 
> 	Any in_compat_syscall() in ->read()/->write() instances is an ABI
> bug, plain and simple.  Some are unfixable for compatibility reasons, but
> any new caller like that should be a big red flag.

So an IOCB_COMPAT flag would let us know whether the caller is expecting
a 32-bit or 64-bit layout?  And io_uring could set it based on the
ctx->compat flag.

> 	Current list of those turds:
> /dev/sg (pointer-chasing, generally insane)
> /sys/firmware/efi/vars/*/raw_var (fucked binary structure)
> /sys/firmware/efi/vars/new_var (fucked binary structure)
> /sys/firmware/efi/vars/del_var (fucked binary structure)
> /dev/uhid	(pointer-chasing for one obsolete command)
> /dev/input/event* (timestamps)
> /dev/uinput (timestamps)
> /proc/bus/input/devices (fucked bitmap-to-text representation)
> /sys/class/input/*/capabilities/* (fucked bitmap-to-text representation)

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Al Viro @ 2020-09-20 18:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-aio, linux-mips, David Howells, linux-mm, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390, linux-scsi,
	x86, Arnd Bergmann, linux-block, io-uring, linux-arm-kernel,
	Jens Axboe, linux-parisc, netdev, linux-kernel,
	linux-security-module, linux-fsdevel, Andrew Morton, linuxppc-dev
In-Reply-To: <20200920180742.GN3421308@ZenIV.linux.org.uk>

On Sun, Sep 20, 2020 at 07:07:42PM +0100, Al Viro wrote:

> /proc/bus/input/devices (fucked bitmap-to-text representation)

To illustrate the, er, beauty of that stuff:

; cat32 /proc/bus/input/devices >/tmp/a
; cat /proc/bus/input/devices >/tmp/b
; diff -u /tmp/a /tmp/b|grep '^[-+]'
--- /tmp/a      2020-09-20 14:28:43.442560691 -0400
+++ /tmp/b      2020-09-20 14:28:49.018543230 -0400
-B: KEY=1100f 2902000 8380307c f910f001 feffffdf ffefffff ffffffff fffffffe
+B: KEY=1100f02902000 8380307cf910f001 feffffdfffefffff fffffffffffffffe
-B: KEY=70000 0 0 0 0 0 0 0 0
+B: KEY=70000 0 0 0 0
-B: KEY=420 0 70000 0 0 0 0 0 0 0 0
+B: KEY=420 70000 0 0 0 0
-B: KEY=100000 0 0 0
+B: KEY=10000000000000 0
-B: KEY=4000 0 0 0 0
+B: KEY=4000 0 0
-B: KEY=8000 0 0 0 0 0 1100b 800 2 200000 0 0 0 0
+B: KEY=800000000000 0 0 1100b00000800 200200000 0 0
-B: KEY=3e000b 0 0 0 0 0 0 0
+B: KEY=3e000b00000000 0 0 0
-B: KEY=ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff fffffffe
+B: KEY=ffffffffffffffff ffffffffffffffff ffffffffffffffff fffffffffffffffe
; 
(cat32 being a 32bit binary of cat)
All the differences are due to homegrown bitmap-to-text conversion.

Note that feeding that into a pipe leaves the recepient with a lovely problem -
you can't go by the width of words (they are not zero-padded) and you can't
go by the number of words either - it varies from device to device.

And there's nothing we can do with that - it's a userland ABI, can't be
changed without breaking stuff.  I would prefer to avoid additional examples
like that, TYVM...

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: William Kucharski @ 2020-09-20 15:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-aio, linux-mips, David Howells, linux-mm, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390, linux-scsi,
	x86, Arnd Bergmann, linux-block, Alexander Viro, io-uring,
	linux-arm-kernel, Jens Axboe, linux-parisc, netdev, linux-kernel,
	linux-security-module, linux-fsdevel, Andrew Morton, linuxppc-dev
In-Reply-To: <20200920151510.GS32101@casper.infradead.org>

I really like that as it’s self-documenting and anyone debugging it can see what is actually being used at a glance.

> On Sep 20, 2020, at 09:15, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
>> Add a flag to force processing a syscall as a compat syscall.  This is
>> required so that in_compat_syscall() works for I/O submitted by io_uring
>> helper threads on behalf of compat syscalls.
> 
> Al doesn't like this much, but my suggestion is to introduce two new
> opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> code can know what that user pointer is pointing to.

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Al Viro @ 2020-09-20 18:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-aio, open list:MIPS, David Howells, Linux-MM, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390,
	Linux SCSI List, X86 ML, Arnd Bergmann, linux-block, io-uring,
	linux-arm-kernel, Jens Axboe, Parisc List, Network Development,
	LKML, LSM List, Linux FS Devel, Andrew Morton, linuxppc-dev
In-Reply-To: <CALCETrWj1i-oyfA1rCXsNqdJddK6Vwm=W31YEf=k-OMBTC0vHw@mail.gmail.com>

On Sun, Sep 20, 2020 at 09:59:36AM -0700, Andy Lutomirski wrote:

> As one example, look at __sys_setsockopt().  It's called for the
> native and compat versions, and it contains an in_compat_syscall()
> check.  (This particularly check looks dubious to me, but that's
> another story.)  If this were to be done with equivalent semantics
> without a separate COMPAT_DEFINE_SYSCALL and without
> in_compat_syscall(), there would need to be some indication as to
> whether this is compat or native setsockopt.  There are other
> setsockopt implementations in the net stack with more
> legitimate-seeming uses of in_compat_syscall() that would need some
> other mechanism if in_compat_syscall() were to go away.
> 
> setsockopt is (I hope!) out of scope for io_uring, but the situation
> isn't fundamentally different from read and write.

	Except that setsockopt() had that crap very widespread; for read()
and write() those are very rare exceptions.

	Andy, please RTFS.  Or dig through archives.  The situation
with setsockopt() is *NOT* a good thing - it's (probably) the least
of the evils.  The last thing we need is making that the norm.

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Al Viro @ 2020-09-20 18:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-aio, linux-mips, David Howells, linux-mm, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390, linux-scsi,
	x86, Arnd Bergmann, linux-block, io-uring, linux-arm-kernel,
	Jens Axboe, linux-parisc, netdev, linux-kernel,
	linux-security-module, linux-fsdevel, Andrew Morton, linuxppc-dev
In-Reply-To: <20200920151510.GS32101@casper.infradead.org>

On Sun, Sep 20, 2020 at 04:15:10PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> > Add a flag to force processing a syscall as a compat syscall.  This is
> > required so that in_compat_syscall() works for I/O submitted by io_uring
> > helper threads on behalf of compat syscalls.
> 
> Al doesn't like this much, but my suggestion is to introduce two new
> opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> code can know what that user pointer is pointing to.

Let's separate two issues:
	1) compat syscalls want 32bit iovecs.  Nothing to do with the
drivers, dealt with just fine.
	2) a few drivers are really fucked in head.  They use different
*DATA* layouts for reads/writes, depending upon the calling process.
IOW, if you fork/exec a 32bit binary and your stdin is one of those,
reads from stdin in parent and child will yield different data layouts.
On the same struct file.
	That's what Christoph worries about (/dev/sg he'd mentioned is
one of those).

	IMO we should simply have that dozen or so of pathological files
marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
it describes the userland ABI provided by those.  And it's cast in stone.

	Any in_compat_syscall() in ->read()/->write() instances is an ABI
bug, plain and simple.  Some are unfixable for compatibility reasons, but
any new caller like that should be a big red flag.

	How we import iovec array is none of the drivers' concern; we do
not need to mess with in_compat_syscall() reporting the matching value,
etc. for that.  It's about the instances that want in_compat_syscall() to
decide between the 32bit and 64bit data layouts.  And I believe that
we should simply have them marked as such and rejected by io_uring.  With
any new occurences getting slapped down hard.

	Current list of those turds:
/dev/sg (pointer-chasing, generally insane)
/sys/firmware/efi/vars/*/raw_var (fucked binary structure)
/sys/firmware/efi/vars/new_var (fucked binary structure)
/sys/firmware/efi/vars/del_var (fucked binary structure)
/dev/uhid	(pointer-chasing for one obsolete command)
/dev/input/event* (timestamps)
/dev/uinput (timestamps)
/proc/bus/input/devices (fucked bitmap-to-text representation)
/sys/class/input/*/capabilities/* (fucked bitmap-to-text representation)

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Linus Torvalds @ 2020-09-20 17:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, David Airlie, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Steven Rostedt, Jani Nikula,
	Rodrigo Vivi, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Vineet Gupta, LKML, Arnd Bergmann, Daniel Vetter, Paul Mackerras,
	Andrew Morton, Daniel Bristot de Oliveira, David S. Miller,
	Greentime Hu
In-Reply-To: <CAHk-=wgF-upZVpqJWK=TK7MS9H-Rp1ZxGfOG+dDW=JThtxAzVQ@mail.gmail.com>

On Sun, Sep 20, 2020 at 10:42 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yeah, that looks much easier to explain. Ack.

Btw, one thing that might be a good idea at least initially is to add
a check for p->kmap_ctrl.idx being zero at fork, exit and maybe
syscall return time (but that last one may be too cumbersome to really
worry about).

The kmap_atomic() interface basically has a lot of coverage for leaked
as part of all the "might_sleep()" checks sprinkled around,  The new
kmap_temporary/local/whatever wouldn't have that kind of incidental
debug checking, and any leaked kmap indexes would be rather hard to
debug (much) later when they cause index overflows or whatever.

                Linus

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Linus Torvalds @ 2020-09-20 17:42 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, David Airlie, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Steven Rostedt, Jani Nikula,
	Rodrigo Vivi, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Vineet Gupta, LKML, Arnd Bergmann, Daniel Vetter, Paul Mackerras,
	Andrew Morton, Daniel Bristot de Oliveira, David S. Miller,
	Greentime Hu
In-Reply-To: <87eemwcpnq.fsf@nanos.tec.linutronix.de>

On Sun, Sep 20, 2020 at 10:40 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> I think the more obvious solution is to split the whole exercise:
>
>   schedule()
>      prepare_switch()
>         unmap()
>
>     switch_to()
>
>     finish_switch()
>         map()

Yeah, that looks much easier to explain. Ack.

               Linus

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Thomas Gleixner @ 2020-09-20 17:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, David Airlie, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Steven Rostedt, Jani Nikula,
	Rodrigo Vivi, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Vineet Gupta, LKML, Arnd Bergmann, Daniel Vetter, Paul Mackerras,
	Andrew Morton, Daniel Bristot de Oliveira, David S. Miller,
	Greentime Hu
In-Reply-To: <CAHk-=wgbmwsTOKs23Z=71EBTrULoeaH2U3TNqT2atHEWvkBKdw@mail.gmail.com>

On Sun, Sep 20 2020 at 09:57, Linus Torvalds wrote:
> On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> Btw, looking at the stack code, Ithink your new implementation of it
> is a bit scary:
>
>    static inline int kmap_atomic_idx_push(void)
>    {
>   -       int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
>   +       int idx = current->kmap_ctrl.idx++;
>
> and now that 'current->kmap_ctrl.idx' is not atomic wrt
>
>  (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with
> nesting I think it's fine anyway - the NMI will undo whatever it did)

Right. Nesting should be a non issue, but I don't think we have
kmap_atomic() in NMI context.

>  (b) the prev/next switch
>
> And that (b) part worries me. You do the kmap_switch_temporary() to
> switch the entries, but you do that *separately* from actually
> switching 'current' to the new value.
>
> So kmap_switch_temporary() looks safe, but I don't think it actually
> is. Because while it first unmaps the old entries and then remaps the
> new ones, an interrupt can come in, and at that point it matters what
> is *CURRENT*.
>
> And regardless of whether 'current' is 'prev' or 'next', that
> kmap_switch_temporary() loop may be doing the wrong thing, depending
> on which one had the deeper stack. The interrupt will be using
> whatever "current->kmap_ctrl.idx" is, but that might overwrite entries
> that are in the process of being restored (if current is still 'prev',
> but kmap_switch_temporary() is in the "restore @next's kmaps" pgase),
> or it might stomp on entries that have been pte_clear()'ed by the
> 'prev' thing.

Duh yes. Never thought about that.

> Alternatively, that process counter would need about a hundred lines
> of commentary about exactly why it's safe. Because I don't think it
> is.

I think the more obvious solution is to split the whole exercise:


  schedule()
     prepare_switch()
        unmap()

    switch_to()

    finish_switch()
        map()

That's safe because neither the unmap() nor the map() code changes
kmap_ctrl.idx. So if there is an interrupt coming in between unmap() and
switch_to() then a kmap_local() there will use the next entry. So we
could even do the unmap() with interrupts enabled (preemption disabled).
Same for the map() part.

To explain that we need only a few lines of commentry methinks.

Thanks,

        tglx


^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Thomas Gleixner @ 2020-09-20 17:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, sparclinux, Vincent Chen, Will Deacon, Ard Biesheuvel,
	open list:GENERIC INCLUDE/A..., Vincent Guittot, Herbert Xu,
	X86 ML, Russell King, linux-csky, David Airlie, Mel Gorman, arcml,
	linux-xtensa, Paul McKenney, intel-gfx, linuxppc-dev,
	Steven Rostedt, Linus Torvalds, Jani Nikula, Rodrigo Vivi,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Linux-MM, Vineet Gupta, LKML,
	Arnd Bergmann, Daniel Vetter, Paul Mackerras, Andrew Morton,
	Daniel Bristot de Oliveira, David S. Miller, Greentime Hu
In-Reply-To: <20200920082353.GG438822@phenom.ffwll.local>

On Sun, Sep 20 2020 at 10:23, Daniel Vetter wrote:

> On Sun, Sep 20, 2020 at 08:23:26AM +0200, Thomas Gleixner wrote:
>> On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote:
>> > On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>> >> I think it should be the case, but I want to double check: Will
>> >> copy_*_user be allowed within a kmap_temporary section? This would
>> >> allow us to ditch an absolute pile of slowpaths.
>> >
>> > (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a
>> > page fault you get a short read/write. This looks like it would remove
>> > the need to handle these in a slowpath, since page faults can now be
>> > served in this new kmap_temporary sections. But this sounds too good
>> > to be true, so I'm wondering what I'm missing.
>> 
>> In principle we could allow pagefaults, but not with the currently
>> proposed interface which can be called from any context. Obviously if
>> called from atomic context it can't handle user page faults.
>  
> Yeah that's clear, but does the implemention need to disable pagefaults
> unconditionally?

As I wrote in the other reply it's not required and the final interface
will neither disable preemption nor pagefaults (except for the atomic
wrapper around it).

Thanks,

        tglx

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Linus Torvalds @ 2020-09-20 16:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, David Airlie, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Steven Rostedt, Jani Nikula,
	Rodrigo Vivi, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Vineet Gupta, LKML, Arnd Bergmann, Daniel Vetter, Paul Mackerras,
	Andrew Morton, Daniel Bristot de Oliveira, David S. Miller,
	Greentime Hu
In-Reply-To: <87k0wode9a.fsf@nanos.tec.linutronix.de>

On Sun, Sep 20, 2020 at 1:49 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Actually most usage sites of kmap atomic do not need page faults to be
> disabled at all.

Right. I think the pagefault disabling has (almost) nothing at all to
do with the kmap() itself - it comes from the "atomic" part, not the
"kmap" part.

I say *almost*, because there is one issue that needs some thought:
the amount of kmap nesting.

The kmap_atomic() interface - and your local/temporary/whatever
versions of it - depends very much inherently on being strictly
nesting. In fact, it depends so much on it that maybe that should be
part of the new name?

It's very wrong to do

    addr1 = kmap_atomic();
    addr2 = kmap_atomic();
    ..do something with addr 1..
    kunmap_atomic(addr1);
    .. do something with addr 2..
    kunmap_atomic(addr2);

because the way we allocate the slots is by using a percpu-atomic
inc-return (and we deallocate using dec).

So it's fundamentally a stack.

And that's perfectly fine for page faults - if they do any kmaps,
those will obviously nest.

So the only issue with page faults might be that the stack grows
_larger_. And that might need some thought. We already make the kmap
stack bigger for CONFIG_DEBUG_HIGHMEM, and it's possibly that if we
allow page faults we need to make the kmap stack bigger still.

Btw, looking at the stack code, Ithink your new implementation of it
is a bit scary:

   static inline int kmap_atomic_idx_push(void)
   {
  -       int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
  +       int idx = current->kmap_ctrl.idx++;

and now that 'current->kmap_ctrl.idx' is not atomic wrt

 (a) NMI's (this may be ok, maybe we never do kmaps in NMIs, and with
nesting I think it's fine anyway - the NMI will undo whatever it did)

 (b) the prev/next switch

And that (b) part worries me. You do the kmap_switch_temporary() to
switch the entries, but you do that *separately* from actually
switching 'current' to the new value.

So kmap_switch_temporary() looks safe, but I don't think it actually
is. Because while it first unmaps the old entries and then remaps the
new ones, an interrupt can come in, and at that point it matters what
is *CURRENT*.

And regardless of whether 'current' is 'prev' or 'next', that
kmap_switch_temporary() loop may be doing the wrong thing, depending
on which one had the deeper stack. The interrupt will be using
whatever "current->kmap_ctrl.idx" is, but that might overwrite entries
that are in the process of being restored (if current is still 'prev',
but kmap_switch_temporary() is in the "restore @next's kmaps" pgase),
or it might stomp on entries that have been pte_clear()'ed by the
'prev' thing.

I dunno. The latter may be one of those "it works anyway, it
overwrites things we don't care about", but the former will most
definitely not work.

And it will be completely impossible to debug, because it will depend
on an interrupt that uses kmap_local/atomic/whatever() coming in
_just_ at the right point in the scheduler, and only when the
scheduler has been entered with the right number of kmap entries on
the prev/next stack.

And no developer will ever see this with any amount of debug code
enabled, because it will only hit on legacy platforms that do this
kmap anyway.

So honestly, that code scares me. I think it's buggy. And even if it
"happens to work", it does so for all the wrong reasons, and is very
fragile.

So I would suggest:

 - continue to use an actual per-cpu kmap_atomic_idx

 - make the switching code save the old idx, then unmap the old
entries one by one (while doing the proper "pop" action), and then map
the new entries one by one (while doing the proper "push" action).

which would mean that the only index that is actually ever *USED* is
the percpu one, and it's always up-to-date and pushed/popped for
individual entries, rather than this - imho completely bogus -
optimization where you use "p->kmap_ctrl.idx" directly and very very
unsafely.

Alternatively, that process counter would need about a hundred lines
of commentary about exactly why it's safe. Because I don't think it
is.

                   Linus

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Andy Lutomirski @ 2020-09-20 16:59 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-aio, open list:MIPS, David Howells, Linux-MM, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390,
	Linux SCSI List, X86 ML, Arnd Bergmann, linux-block,
	Andy Lutomirski, io-uring, linux-arm-kernel, Jens Axboe,
	Parisc List, Network Development, LKML, LSM List, Linux FS Devel,
	Andrew Morton, linuxppc-dev
In-Reply-To: <20200920025745.GL3421308@ZenIV.linux.org.uk>

On Sat, Sep 19, 2020 at 7:57 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sat, Sep 19, 2020 at 05:14:41PM -0700, Andy Lutomirski wrote:
>
> > > 2) have you counted the syscalls that do and do not need that?
> >
> > No.
>
> Might be illuminating...
>
> > > 3) how many of those realistically *can* be unified with their
> > > compat counterparts?  [hint: ioctl(2) cannot]
> >
> > There would be no requirement to unify anything.  The idea is that
> > we'd get rid of all the global state flags.
>
> _What_ global state flags?  When you have separate SYSCALL_DEFINE3(ioctl...)
> and COMPAT_SYSCALL_DEFINE3(ioctl...), there's no flags at all, global or
> local.  They only come into the play when you try to share the same function
> for both, right on the top level.

...

>
> > For ioctl, we'd have a new file_operation:
> >
> > long ioctl(struct file *, unsigned int, unsigned long, enum syscall_arch);
> >
> > I'm not saying this is easy, but I think it's possible and the result
> > would be more obviously correct than what we have now.
>
> No, it would not.  Seriously, from time to time a bit of RTFS before grand
> proposals turns out to be useful.

As one example, look at __sys_setsockopt().  It's called for the
native and compat versions, and it contains an in_compat_syscall()
check.  (This particularly check looks dubious to me, but that's
another story.)  If this were to be done with equivalent semantics
without a separate COMPAT_DEFINE_SYSCALL and without
in_compat_syscall(), there would need to be some indication as to
whether this is compat or native setsockopt.  There are other
setsockopt implementations in the net stack with more
legitimate-seeming uses of in_compat_syscall() that would need some
other mechanism if in_compat_syscall() were to go away.

setsockopt is (I hope!) out of scope for io_uring, but the situation
isn't fundamentally different from read and write.

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Arnd Bergmann @ 2020-09-20 16:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-aio, open list:BROADCOM NVRAM DRIVER, David Howells,
	Linux-MM, keyrings, sparclinux, Christoph Hellwig, linux-arch,
	linux-s390, linux-scsi, the arch/x86 maintainers, linux-block,
	Alexander Viro, io-uring, Linux ARM, Jens Axboe, Parisc List,
	Networking, linux-kernel@vger.kernel.org, LSM List,
	Linux FS-devel Mailing List, Andrew Morton, linuxppc-dev
In-Reply-To: <20200920151510.GS32101@casper.infradead.org>

On Sun, Sep 20, 2020 at 5:15 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> > Add a flag to force processing a syscall as a compat syscall.  This is
> > required so that in_compat_syscall() works for I/O submitted by io_uring
> > helper threads on behalf of compat syscalls.
>
> Al doesn't like this much, but my suggestion is to introduce two new
> opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> code can know what that user pointer is pointing to.

How is that different from the current approach of storing the ABI as
a flag in ctx->compat?

     Arnd

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Matthew Wilcox @ 2020-09-20 15:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-aio, linux-mips, David Howells, linux-mm, keyrings,
	sparclinux, linux-arch, linux-s390, linux-scsi, x86,
	Arnd Bergmann, linux-block, Alexander Viro, io-uring,
	linux-arm-kernel, Jens Axboe, linux-parisc, netdev, linux-kernel,
	linux-security-module, linux-fsdevel, Andrew Morton, linuxppc-dev
In-Reply-To: <20200918124533.3487701-2-hch@lst.de>

On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> Add a flag to force processing a syscall as a compat syscall.  This is
> required so that in_compat_syscall() works for I/O submitted by io_uring
> helper threads on behalf of compat syscalls.

Al doesn't like this much, but my suggestion is to introduce two new
opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
can translate IORING_OP_READV to IORING_OP_READV32 and then the core
code can know what that user pointer is pointing to.


^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Al Viro @ 2020-09-20 15:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-aio, open list:BROADCOM NVRAM DRIVER, David Howells,
	Linux-MM, keyrings, sparclinux, Christoph Hellwig, linux-arch,
	linux-s390, linux-scsi, the arch/x86 maintainers, linux-block,
	io-uring, Linux ARM, Jens Axboe, Parisc List, Networking,
	linux-kernel@vger.kernel.org, LSM List,
	Linux FS-devel Mailing List, Andrew Morton, linuxppc-dev
In-Reply-To: <CAK8P3a3QApj3isPu3TkLahArsfb5jaABb7DJ7EKMxey1T1YPbA@mail.gmail.com>

On Sun, Sep 20, 2020 at 03:55:47PM +0200, Arnd Bergmann wrote:
> On Sun, Sep 20, 2020 at 12:09 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
> > > On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > > > Said that, why not provide a variant that would take an explicit
> > > > "is it compat" argument and use it there?  And have the normal
> > > > one pass in_compat_syscall() to that...
> > >
> > > That would help to not introduce a regression with this series yes.
> > > But it wouldn't fix existing bugs when io_uring is used to access
> > > read or write methods that use in_compat_syscall().  One example that
> > > I recently ran into is drivers/scsi/sg.c.
> >
> > So screw such read/write methods - don't use them with io_uring.
> > That, BTW, is one of the reasons I'm sceptical about burying the
> > decisions deep into the callchain - we don't _want_ different
> > data layouts on read/write depending upon the 32bit vs. 64bit
> > caller, let alone the pointer-chasing garbage that is /dev/sg.
> 
> Would it be too late to limit what kind of file descriptors we allow
> io_uring to read/write on?
> 
> If io_uring can get changed to return -EINVAL on trying to
> read/write something other than S_IFREG file descriptors,
> that particular problem space gets a lot simpler, but this
> is of course only possible if nobody actually relies on it yet.

S_IFREG is almost certainly too heavy as a restriction.  Looking through
the stuff sensitive to 32bit/64bit, we seem to have
	* /dev/sg - pointer-chasing horror
	* sysfs files for efivar - different layouts for compat and native,
shitty userland ABI design (
struct efi_variable {
        efi_char16_t  VariableName[EFI_VAR_NAME_LEN/sizeof(efi_char16_t)];
        efi_guid_t    VendorGuid;
        unsigned long DataSize;
        __u8          Data[1024];
        efi_status_t  Status;
        __u32         Attributes;
} __attribute__((packed));
) is the piece of crap in question; 'DataSize' is where the headache comes
from.  Regular files, BTW...
	* uhid - character device, milder pointer-chasing horror.  Trouble
comes from this:
/* Obsolete! Use UHID_CREATE2. */
struct uhid_create_req {
        __u8 name[128];
        __u8 phys[64];
        __u8 uniq[64];
        __u8 __user *rd_data;
        __u16 rd_size;

        __u16 bus;
        __u32 vendor;
        __u32 product;
        __u32 version;
        __u32 country;
} __attribute__((__packed__));
and suggested replacement doesn't do any pointer-chasing (rd_data is an
embedded array in the end of struct uhid_create2_req).
	* evdev, uinput - bitness-sensitive layout, due to timestamps
	* /proc/bus/input/devices - weird crap with printing bitmap, different
_text_ layouts seen by 32bit and 64bit readers.  Binary structures are PITA,
but with sufficient effort you can screw the text just as hard...  Oh, and it's
a regular file.
	* similar in sysfs analogue

And AFAICS, that's it for read/write-related method instances.

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Arnd Bergmann @ 2020-09-20 13:55 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-aio, open list:BROADCOM NVRAM DRIVER, David Howells,
	Linux-MM, keyrings, sparclinux, Christoph Hellwig, linux-arch,
	linux-s390, linux-scsi, the arch/x86 maintainers, linux-block,
	io-uring, Linux ARM, Jens Axboe, Parisc List, Networking,
	linux-kernel@vger.kernel.org, LSM List,
	Linux FS-devel Mailing List, Andrew Morton, linuxppc-dev
In-Reply-To: <20200919220920.GI3421308@ZenIV.linux.org.uk>

On Sun, Sep 20, 2020 at 12:09 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
> > On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > > Said that, why not provide a variant that would take an explicit
> > > "is it compat" argument and use it there?  And have the normal
> > > one pass in_compat_syscall() to that...
> >
> > That would help to not introduce a regression with this series yes.
> > But it wouldn't fix existing bugs when io_uring is used to access
> > read or write methods that use in_compat_syscall().  One example that
> > I recently ran into is drivers/scsi/sg.c.
>
> So screw such read/write methods - don't use them with io_uring.
> That, BTW, is one of the reasons I'm sceptical about burying the
> decisions deep into the callchain - we don't _want_ different
> data layouts on read/write depending upon the 32bit vs. 64bit
> caller, let alone the pointer-chasing garbage that is /dev/sg.

Would it be too late to limit what kind of file descriptors we allow
io_uring to read/write on?

If io_uring can get changed to return -EINVAL on trying to
read/write something other than S_IFREG file descriptors,
that particular problem space gets a lot simpler, but this
is of course only possible if nobody actually relies on it yet.

      Arnd

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Thomas Gleixner @ 2020-09-20  8:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, David Airlie, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Steven Rostedt, Jani Nikula,
	Rodrigo Vivi, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Vineet Gupta, LKML, Arnd Bergmann, Daniel Vetter, Paul Mackerras,
	Andrew Morton, Daniel Bristot de Oliveira, David S. Miller,
	Greentime Hu
In-Reply-To: <87mu1lc5mp.fsf@nanos.tec.linutronix.de>

On Sun, Sep 20 2020 at 08:41, Thomas Gleixner wrote:
> On Sat, Sep 19 2020 at 10:18, Linus Torvalds wrote:
>> Maybe I've missed something.  Is it because the new interface still
>> does "pagefault_disable()" perhaps?
>>
>> But does it even need the pagefault_disable() at all? Yes, the
>> *atomic* one obviously needed it. But why does this new one need to
>> disable page faults?
>
> It disables pagefaults because it can be called from atomic and
> non-atomic context. That was the point to get rid of
>
>          if (preeemptible())
>          	kmap();
>          else
>                 kmap_atomic();
>
> If it does not disable pagefaults, then it's just a lightweight variant
> of kmap() for short lived mappings.

Actually most usage sites of kmap atomic do not need page faults to be
disabled at all. As Daniel pointed out the implicit pagefault disable
enforces copy_from_user_inatomic() even in places which are clearly
preemptible task context.

As we need to look at each instance anyway we can add the PF disable
explicitely to the very few places which actually need it.

Thanks,

        tglx


^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Daniel Vetter @ 2020-09-20  8:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, sparclinux, Vincent Chen, Will Deacon, Ard Biesheuvel,
	open list:GENERIC INCLUDE/A..., Vincent Guittot, Herbert Xu,
	X86 ML, Russell King, linux-csky, David Airlie, Mel Gorman, arcml,
	linux-xtensa, Paul McKenney, intel-gfx, linuxppc-dev,
	Steven Rostedt, Linus Torvalds, Jani Nikula, Rodrigo Vivi,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Linux-MM, Vineet Gupta, LKML,
	Arnd Bergmann, Daniel Vetter, Paul Mackerras, Andrew Morton,
	Daniel Bristot de Oliveira, David S. Miller, Greentime Hu
In-Reply-To: <87pn6hc6g1.fsf@nanos.tec.linutronix.de>

On Sun, Sep 20, 2020 at 08:23:26AM +0200, Thomas Gleixner wrote:
> On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote:
> > On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >> I think it should be the case, but I want to double check: Will
> >> copy_*_user be allowed within a kmap_temporary section? This would
> >> allow us to ditch an absolute pile of slowpaths.
> >
> > (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a
> > page fault you get a short read/write. This looks like it would remove
> > the need to handle these in a slowpath, since page faults can now be
> > served in this new kmap_temporary sections. But this sounds too good
> > to be true, so I'm wondering what I'm missing.
> 
> In principle we could allow pagefaults, but not with the currently
> proposed interface which can be called from any context. Obviously if
> called from atomic context it can't handle user page faults.
 
Yeah that's clear, but does the implemention need to disable pagefaults
unconditionally?

> In theory we could make a variant which does not disable pagefaults, but
> that's what kmap() already provides.

Currently we have a bunch of code which roughly does

	kmap_atomic();
	copy_*_user();
	kunmap_atomic();

	if (short_copy_user) {
		kmap();
		copy_*_user(remaining_stuff);
		kunmap();
	}

And the only reason is that kmap is a notch slower, hence the fastpath. If
we get a kmap which is fast and allows pagefaults (only in contexts that
allow pagefaults already ofc) then we can ditch a pile of kmap users.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Thomas Gleixner @ 2020-09-20  6:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, linux-sparc, Vincent Chen, Will Deacon, Ard Biesheuvel,
	linux-arch, Vincent Guittot, Herbert Xu, the arch/x86 maintainers,
	Russell King, linux-csky, David Airlie, Mel Gorman,
	open list:SYNOPSYS ARC ARCHITECTURE, linux-xtensa, Paul McKenney,
	intel-gfx, linuxppc-dev, Steven Rostedt, Jani Nikula,
	Rodrigo Vivi, Dietmar Eggemann, Linux ARM, Chris Zankel,
	Michal Simek, Thomas Bogendoerfer, Nick Hu, Linux-MM,
	Vineet Gupta, LKML, Arnd Bergmann, Daniel Vetter, Paul Mackerras,
	Andrew Morton, Daniel Bristot de Oliveira, David S. Miller,
	Greentime Hu
In-Reply-To: <CAHk-=wiYGyrFRbA1cc71D2-nc5U9LM9jUJesXGqpPnB7E4X1YQ@mail.gmail.com>

On Sat, Sep 19 2020 at 10:18, Linus Torvalds wrote:
> On Sat, Sep 19, 2020 at 2:50 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> this provides a preemptible variant of kmap_atomic & related
>> interfaces. This is achieved by:
>
> Ack. This looks really nice, even apart from the new capability.
>
> The only thing I really reacted to is that the name doesn't make sense
> to me: "kmap_temporary()" seems a bit odd.

Yeah. Couldn't come up with something useful.

> Particularly for an interface that really is basically meant as a
> better replacement of "kmap_atomic()" (but is perhaps also a better
> replacement for "kmap()").
>
> I think I understand how the name came about: I think the "temporary"
> is there as a distinction from the "longterm" regular kmap(). So I
> think it makes some sense from an internal implementation angle, but I
> don't think it makes a lot of sense from an interface name.
>
> I don't know what might be a better name, but if we want to emphasize
> that it's thread-private and a one-off, maybe "local" would be a
> better naming, and make it distinct from the "global" nature of the
> old kmap() interface?

Right, _local or _thread would be more intuitive.

> However, another solution might be to just use this new preemptible
> "local" kmap(), and remove the old global one entirely. Yes, the old
> global one caches the page table mapping and that sounds really
> efficient and nice. But it's actually horribly horribly bad, because
> it means that we need to use locking for them. Your new "temporary"
> implementation seems to be fundamentally better locking-wise, and only
> need preemption disabling as locking (and is equally fast for the
> non-highmem case).
>
> So I wonder if the single-page TLB flush isn't a better model, and
> whether it wouldn't be a lot simpler to just get rid of the old
> complex kmap() entirely, and replace it with this?
>
> I agree we can't replace the kmap_atomic() version, because maybe
> people depend on the preemption disabling it also implied. But what
> about replacing the non-atomic kmap()?
>
> Maybe I've missed something.  Is it because the new interface still
> does "pagefault_disable()" perhaps?
>
> But does it even need the pagefault_disable() at all? Yes, the
> *atomic* one obviously needed it. But why does this new one need to
> disable page faults?

It disables pagefaults because it can be called from atomic and
non-atomic context. That was the point to get rid of

         if (preeemptible())
         	kmap();
         else
                kmap_atomic();

If it does not disable pagefaults, then it's just a lightweight variant
of kmap() for short lived mappings.

> But apart from that question about naming (and perhaps replacing
> kmap() entirely), I very much like it.

I thought about it, but then I figured that kmap pointers can be
handed to other contexts from the thread which sets up the mapping
because it's 'permanent'.

I'm not sure whether that actually happens, so we'd need to audit all
kmap() users to be sure. If there is no such use case, then we surely
can get of rid of kmap() completely. It's only 300+ instances to stare
at and quite some of them are wrapped into other functions.

Highmem sucks no matter what and the only sane solution is to remove it
completely.

Thanks,

        tglx


^ permalink raw reply

* Re: [patch RFC 00/15] mm/highmem: Provide a preemptible variant of kmap_atomic & friends
From: Thomas Gleixner @ 2020-09-20  6:23 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Juri Lelli, Peter Zijlstra, Sebastian Andrzej Siewior,
	Joonas Lahtinen, dri-devel, linux-mips, Ben Segall, Max Filippov,
	Guo Ren, sparclinux, Vincent Chen, Will Deacon, Ard Biesheuvel,
	open list:GENERIC INCLUDE/A..., Vincent Guittot, Herbert Xu,
	X86 ML, Russell King, linux-csky, David Airlie, Mel Gorman, arcml,
	linux-xtensa, Paul McKenney, intel-gfx, linuxppc-dev,
	Steven Rostedt, Linus Torvalds, Jani Nikula, Rodrigo Vivi,
	Dietmar Eggemann, Linux ARM, Chris Zankel, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, Linux-MM, Vineet Gupta, LKML,
	Arnd Bergmann, Paul Mackerras, Andrew Morton,
	Daniel Bristot de Oliveira, David S. Miller, Greentime Hu
In-Reply-To: <CAKMK7uENFDANQKebS_H0bhHeQRijrp1aVHQqyZPute3KBZ+fVQ@mail.gmail.com>

On Sat, Sep 19 2020 at 12:37, Daniel Vetter wrote:
> On Sat, Sep 19, 2020 at 12:35 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>> I think it should be the case, but I want to double check: Will
>> copy_*_user be allowed within a kmap_temporary section? This would
>> allow us to ditch an absolute pile of slowpaths.
>
> (coffee just kicked in) copy_*_user is ofc allowed, but if you hit a
> page fault you get a short read/write. This looks like it would remove
> the need to handle these in a slowpath, since page faults can now be
> served in this new kmap_temporary sections. But this sounds too good
> to be true, so I'm wondering what I'm missing.

In principle we could allow pagefaults, but not with the currently
proposed interface which can be called from any context. Obviously if
called from atomic context it can't handle user page faults.

In theory we could make a variant which does not disable pagefaults, but
that's what kmap() already provides.

Thanks,

        tglx




^ permalink raw reply

* Re: [PATCH v2] powerpc/tm: Save and restore AMR on treclaim and trechkpt
From: Aneesh Kumar K.V @ 2020-09-20  5:56 UTC (permalink / raw)
  To: Gustavo Romero, linuxppc-dev; +Cc: mikey, gromero
In-Reply-To: <20200919150025.9609-1-gromero@linux.ibm.com>

Gustavo Romero <gromero@linux.ibm.com> writes:

> Althought AMR is stashed in the checkpoint area, currently we don't save
> it to the per thread checkpoint struct after a treclaim and so we don't
> restore it either from that struct when we trechkpt. As a consequence when
> the transaction is later rolled back the kernel space AMR value when the
> trechkpt was done appears in userspace.
>
> That commit saves and restores AMR accordingly on treclaim and trechkpt.
> Since AMR value is also used in kernel space in other functions, it also
> takes care of stashing kernel live AMR into the stack before treclaim and
> before trechkpt, restoring it later, just before returning from tm_reclaim
> and __tm_recheckpoint.
>
> Is also fixes two nonrelated comments about CR and MSR.
>

Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

> Signed-off-by: Gustavo Romero <gromero@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/processor.h |  1 +
>  arch/powerpc/kernel/asm-offsets.c    |  1 +
>  arch/powerpc/kernel/tm.S             | 35 ++++++++++++++++++++++++----
>  3 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
> index ed0d633ab5aa..9f4f6cc033ac 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -220,6 +220,7 @@ struct thread_struct {
>  	unsigned long	tm_tar;
>  	unsigned long	tm_ppr;
>  	unsigned long	tm_dscr;
> +	unsigned long   tm_amr;
>  
>  	/*
>  	 * Checkpointed FP and VSX 0-31 register set.
> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> index 8711c2164b45..c2722ff36e98 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -176,6 +176,7 @@ int main(void)
>  	OFFSET(THREAD_TM_TAR, thread_struct, tm_tar);
>  	OFFSET(THREAD_TM_PPR, thread_struct, tm_ppr);
>  	OFFSET(THREAD_TM_DSCR, thread_struct, tm_dscr);
> +	OFFSET(THREAD_TM_AMR, thread_struct, tm_amr);
>  	OFFSET(PT_CKPT_REGS, thread_struct, ckpt_regs);
>  	OFFSET(THREAD_CKVRSTATE, thread_struct, ckvr_state.vr);
>  	OFFSET(THREAD_CKVRSAVE, thread_struct, ckvrsave);
> diff --git a/arch/powerpc/kernel/tm.S b/arch/powerpc/kernel/tm.S
> index 6ba0fdd1e7f8..2b91f233b05d 100644
> --- a/arch/powerpc/kernel/tm.S
> +++ b/arch/powerpc/kernel/tm.S
> @@ -122,6 +122,13 @@ _GLOBAL(tm_reclaim)
>  	std	r3, STK_PARAM(R3)(r1)
>  	SAVE_NVGPRS(r1)
>  
> +	/*
> +	 * Save kernel live AMR since it will be clobbered by treclaim
> +	 * but can be used elsewhere later in kernel space.
> +	 */
> +	mfspr	r3, SPRN_AMR
> +	std	r3, TM_FRAME_L1(r1)
> +
>  	/* We need to setup MSR for VSX register save instructions. */
>  	mfmsr	r14
>  	mr	r15, r14
> @@ -245,7 +252,7 @@ _GLOBAL(tm_reclaim)
>  	 * but is used in signal return to 'wind back' to the abort handler.
>  	 */
>  
> -	/* ******************** CR,LR,CCR,MSR ********** */
> +	/* ***************** CTR, LR, CR, XER ********** */
>  	mfctr	r3
>  	mflr	r4
>  	mfcr	r5
> @@ -256,7 +263,6 @@ _GLOBAL(tm_reclaim)
>  	std	r5, _CCR(r7)
>  	std	r6, _XER(r7)
>  
> -
>  	/* ******************** TAR, DSCR ********** */
>  	mfspr	r3, SPRN_TAR
>  	mfspr	r4, SPRN_DSCR
> @@ -264,6 +270,10 @@ _GLOBAL(tm_reclaim)
>  	std	r3, THREAD_TM_TAR(r12)
>  	std	r4, THREAD_TM_DSCR(r12)
>  
> +        /* ******************** AMR **************** */
> +        mfspr	r3, SPRN_AMR
> +        std	r3, THREAD_TM_AMR(r12)
> +
>  	/*
>  	 * MSR and flags: We don't change CRs, and we don't need to alter MSR.
>  	 */
> @@ -308,7 +318,9 @@ _GLOBAL(tm_reclaim)
>  	std	r3, THREAD_TM_TFHAR(r12)
>  	std	r4, THREAD_TM_TFIAR(r12)
>  
> -	/* AMR is checkpointed too, but is unsupported by Linux. */
> +	/* Restore kernel live AMR */
> +	ld	r8, TM_FRAME_L1(r1)
> +	mtspr	SPRN_AMR, r8
>  
>  	/* Restore original MSR/IRQ state & clear TM mode */
>  	ld	r14, TM_FRAME_L0(r1)		/* Orig MSR */
> @@ -355,6 +367,13 @@ _GLOBAL(__tm_recheckpoint)
>  	 */
>  	SAVE_NVGPRS(r1)
>  
> +	/*
> +	 * Save kernel live AMR since it will be clobbered for trechkpt
> +	 * but can be used elsewhere later in kernel space.
> +	 */
> +	mfspr	r8, SPRN_AMR
> +	std	r8, TM_FRAME_L0(r1)
> +
>  	/* Load complete register state from ts_ckpt* registers */
>  
>  	addi	r7, r3, PT_CKPT_REGS		/* Thread's ckpt_regs */
> @@ -404,7 +423,7 @@ _GLOBAL(__tm_recheckpoint)
>  
>  restore_gprs:
>  
> -	/* ******************** CR,LR,CCR,MSR ********** */
> +	/* ****************** CTR, LR, XER ************* */
>  	ld	r4, _CTR(r7)
>  	ld	r5, _LINK(r7)
>  	ld	r8, _XER(r7)
> @@ -417,6 +436,10 @@ restore_gprs:
>  	ld	r4, THREAD_TM_TAR(r3)
>  	mtspr	SPRN_TAR,	r4
>  
> +	/* ******************** AMR ******************** */
> +	ld	r4, THREAD_TM_AMR(r3)
> +	mtspr	SPRN_AMR, r4
> +
>  	/* Load up the PPR and DSCR in GPRs only at this stage */
>  	ld	r5, THREAD_TM_DSCR(r3)
>  	ld	r6, THREAD_TM_PPR(r3)
> @@ -509,6 +532,10 @@ restore_gprs:
>  	li	r4, MSR_RI
>  	mtmsrd	r4, 1
>  
> +	/* Restore kernel live AMR */
> +	ld	r8, TM_FRAME_L0(r1)
> +	mtspr	SPRN_AMR, r8
> +
>  	REST_NVGPRS(r1)
>  
>  	addi    r1, r1, TM_FRAME_SIZE
> -- 
> 2.25.1

^ permalink raw reply

* Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
From: Al Viro @ 2020-09-20  2:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-aio, open list:MIPS, David Howells, Linux-MM, keyrings,
	sparclinux, Christoph Hellwig, linux-arch, linux-s390,
	Linux SCSI List, X86 ML, Arnd Bergmann, linux-block, io-uring,
	linux-arm-kernel, Jens Axboe, Parisc List, Network Development,
	LKML, LSM List, Linux FS Devel, Andrew Morton, linuxppc-dev
In-Reply-To: <CALCETrViwOdFia_aX4p4riE8aqop1zoOqVfiQtSAZEzheC+Ozg@mail.gmail.com>

On Sat, Sep 19, 2020 at 05:14:41PM -0700, Andy Lutomirski wrote:

> > 2) have you counted the syscalls that do and do not need that?
> 
> No.

Might be illuminating...

> > 3) how many of those realistically *can* be unified with their
> > compat counterparts?  [hint: ioctl(2) cannot]
> 
> There would be no requirement to unify anything.  The idea is that
> we'd get rid of all the global state flags.

_What_ global state flags?  When you have separate SYSCALL_DEFINE3(ioctl...)
and COMPAT_SYSCALL_DEFINE3(ioctl...), there's no flags at all, global or
local.  They only come into the play when you try to share the same function
for both, right on the top level.

> For ioctl, we'd have a new file_operation:
> 
> long ioctl(struct file *, unsigned int, unsigned long, enum syscall_arch);
> 
> I'm not saying this is easy, but I think it's possible and the result
> would be more obviously correct than what we have now.

No, it would not.  Seriously, from time to time a bit of RTFS before grand
proposals turns out to be useful.

^ permalink raw reply

* [PATCH v1 2/2] usb: dwc2: add support for APM82181 USB OTG
From: Christian Lamparter @ 2020-09-20  0:18 UTC (permalink / raw)
  To: linuxppc-dev, devicetree, linux-usb
  Cc: Minas Harutyunyan, Greg Kroah-Hartman, Rob Herring
In-Reply-To: <a43868b06566f5d959d8cfc4e763bede2885931a.1600560884.git.chunkeey@gmail.com>

adds the specific compatible string for the DWC2 IP found in the APM82181
SoCs. The IP is setup correctly through the auto detection... With the
exception of the AHB Burst Size. The default of GAHBCFG_HBSTLEN_INCR4 of
the "snps,dwc2" can cause a system hang when the USB and SATA is used
concurrently. Because the predecessor (PPC460EX (Canyonlands)) already
had the same problem, this SoC can make use of the existing
dwc2_set_amcc_params() function.

Signed-off-by: Christian Lamparter <chunkeey@gmail.com>
---
 drivers/usb/dwc2/params.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/usb/dwc2/params.c b/drivers/usb/dwc2/params.c
index 8f9d061c4d5f..6d2b9a6c247c 100644
--- a/drivers/usb/dwc2/params.c
+++ b/drivers/usb/dwc2/params.c
@@ -210,6 +210,7 @@ const struct of_device_id dwc2_of_match_table[] = {
 	{ .compatible = "amlogic,meson-g12a-usb",
 	  .data = dwc2_set_amlogic_g12a_params },
 	{ .compatible = "amcc,dwc-otg", .data = dwc2_set_amcc_params },
+	{ .compatible = "apm,apm82181-dwc-otg", .data = dwc2_set_amcc_params },
 	{ .compatible = "st,stm32f4x9-fsotg",
 	  .data = dwc2_set_stm32f4x9_fsotg_params },
 	{ .compatible = "st,stm32f4x9-hsotg" },
-- 
2.28.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox