LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Aleksa Sarai @ 2019-05-15 14:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-ia64, linux-sh, linux-kernel, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, Yann Droneaud, linux-s390,
	linux-arch, linux-xtensa, keescook, arnd, jannh, linux-m68k, viro,
	luto, oleg, tglx, linux-arm-kernel, linux-parisc, torvalds,
	linux-mips, luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515141634.lrc5ynllcmjr64mn@brauner.io>

[-- Attachment #1: Type: text/plain, Size: 805 bytes --]

On 2019-05-15, Christian Brauner <christian@brauner.io> wrote:
> On Wed, May 15, 2019 at 04:00:20PM +0200, Yann Droneaud wrote:
> > Would it be possible to create file descriptor with "restricted"
> > operation ?
> > 
> > - O_RDONLY: waiting for process completion allowed (for example)
> > - O_WRONLY: sending process signal allowed
> 
> Yes, something like this is likely going to be possible in the future.
> We had discussion around this. But mapping this to O_RDONLY and O_WRONLY
> is not the right model. It makes more sense to have specialized flags
> that restrict actions.

Not to mention that the O_* flags have silly values which we shouldn't
replicate in new syscalls IMHO.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Yann Droneaud @ 2019-05-15 14:00 UTC (permalink / raw)
  To: Christian Brauner, jannh, oleg, viro, torvalds, linux-kernel,
	arnd, dhowells
  Cc: linux-ia64, linux-sh, linux-mips, linux-kselftest, sparclinux,
	elena.reshetova, linux-arch, linux-s390, linux-xtensa, keescook,
	linux-m68k, luto, tglx, linux-arm-kernel, linux-parisc, linux-api,
	cyphar, luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515100400.3450-1-christian@brauner.io>

Hi,

Le mercredi 15 mai 2019 à 12:03 +0200, Christian Brauner a écrit :
> 
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..237d18d6ecb8 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct
> pid_namespace *ns)
>  	return idr_get_next(&ns->idr, &nr);
>  }
>  
> +/**
> + * pidfd_open() - Open new pid file descriptor.
> + *
> + * @pid:   pid for which to retrieve a pidfd
> + * @flags: flags to pass
> + *
> + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> + * the process identified by @pid. Currently, the process identified by
> + * @pid must be a thread-group leader. This restriction currently exists
> + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> + * leaders).
> + *

Would it be possible to create file descriptor with "restricted"
operation ?

- O_RDONLY: waiting for process completion allowed (for example)
- O_WRONLY: sending process signal allowed

For example, a process could send over a Unix socket a process a pidfd,
allowing this to only wait for completion, but not sending signal ?

I see the permission check is not done in pidfd_open(), so what prevent
a user from sending a signal to another user owned process ?

If it's in pidfd_send_signal(), then, passing the socket through
SCM_RIGHT won't be useful if the target process is not owned by the
same user, or root.

> + * Return: On success, a cloexec pidfd is returned.
> + *         On error, a negative errno number will be returned.
> + */
> +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> +{
> +	int fd, ret;
> +	struct pid *p;
> +	struct task_struct *tsk;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (pid <= 0)
> +		return -EINVAL;
> +
> +	p = find_get_pid(pid);
> +	if (!p)
> +		return -ESRCH;
> +
> +	rcu_read_lock();
> +	tsk = pid_task(p, PIDTYPE_PID);
> +	if (!tsk)
> +		ret = -ESRCH;
> +	else if (unlikely(!thread_group_leader(tsk)))
> +		ret = -EINVAL;
> +	else
> +		ret = 0;
> +	rcu_read_unlock();
> +
> +	fd = ret ?: pidfd_create(p);
> +	put_pid(p);
> +	return fd;
> +}
> +
>  void __init pid_idr_init(void)
>  {
>  	/* Verify no one has done anything silly: */

Regards.

-- 
Yann Droneaud
OPTEYA



^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Yann Droneaud @ 2019-05-15 15:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-ia64, linux-sh, linux-kernel, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, linux-arch, linux-s390,
	linux-xtensa, keescook, arnd, jannh, linux-m68k, viro, luto, oleg,
	tglx, linux-arm-kernel, linux-parisc, cyphar, torvalds,
	linux-mips, luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515141634.lrc5ynllcmjr64mn@brauner.io>

Hi,

Le mercredi 15 mai 2019 à 16:16 +0200, Christian Brauner a écrit :
> On Wed, May 15, 2019 at 04:00:20PM +0200, Yann Droneaud wrote:
> > Le mercredi 15 mai 2019 à 12:03 +0200, Christian Brauner a écrit :
> > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > index 20881598bdfa..237d18d6ecb8 100644
> > > --- a/kernel/pid.c
> > > +++ b/kernel/pid.c
> > > @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct
> > > pid_namespace *ns)
> > >  	return idr_get_next(&ns->idr, &nr);
> > >  }
> > >  
> > > +/**
> > > + * pidfd_open() - Open new pid file descriptor.
> > > + *
> > > + * @pid:   pid for which to retrieve a pidfd
> > > + * @flags: flags to pass
> > > + *
> > > + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> > > + * the process identified by @pid. Currently, the process identified by
> > > + * @pid must be a thread-group leader. This restriction currently exists
> > > + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> > > + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> > > + * leaders).
> > > + *
> > 
> > Would it be possible to create file descriptor with "restricted"
> > operation ?
> > 
> > - O_RDONLY: waiting for process completion allowed (for example)
> > - O_WRONLY: sending process signal allowed
> 
> Yes, something like this is likely going to be possible in the future.
> We had discussion around this. But mapping this to O_RDONLY and O_WRONLY
> is not the right model. It makes more sense to have specialized flags
> that restrict actions.

Yes, dedicated flags are the way to go. I've used the old open() flags
here as examples as an echo of the O_CLOEXEC flag used in the comment.

> > For example, a process could send over a Unix socket a process a pidfd,
> > allowing this to only wait for completion, but not sending signal ?
> > 
> > I see the permission check is not done in pidfd_open(), so what prevent
> > a user from sending a signal to another user owned process ?
> 
> That's supposed to be possible. You can do the same right now already
> with pids. Tools like LMK need this probably very much.
> Permission checking for signals is done at send time right now.
> And if you can't signal via a pid you can't signal via a pidfd as
> they're both subject to the same permissions checks.
> 

I would have expect it to behave like most other file descriptor,
permission check done at opening time, which allow more privileged
process to open the file descriptor, then pass it to a less privileged
process, or change its own privileged with setuid() and such. Then the
less privileged process can act on behalf of the privileged process
through the file descriptor.

> > If it's in pidfd_send_signal(), then, passing the socket through
> > SCM_RIGHT won't be useful if the target process is not owned by the
> > same user, or root.
> > 

If the permission check is done at sending time, the scenario above
case cannot be implemented.

Sending a pidfd through SCM_RIGHT is only useful if the receiver
process is equally or more privileged than the sender then.

For isolation purpose, I would have expect to be able to give a right
to send a signal to a highly privileged process a specific less
privileged process though Unix socket.

But I can't come up with a specific use case. So I dunno.

Regards.

-- 
Yann Droneaud
OPTEYA



^ permalink raw reply

* Re: [PATCH] crypto: talitos - fix skcipher failure due to wrong output IV
From: Christophe Leroy @ 2019-05-15 18:49 UTC (permalink / raw)
  To: Horia Geanta, Herbert Xu, David S. Miller
  Cc: linuxppc-dev@lists.ozlabs.org, linux-crypto@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <VI1PR0402MB34858D80A15D4B55F64570E398090@VI1PR0402MB3485.eurprd04.prod.outlook.com>



Le 15/05/2019 à 16:05, Horia Geanta a écrit :
> On 5/15/2019 3:29 PM, Christophe Leroy wrote:
>> Selftests report the following:
>>
>> [    2.984845] alg: skcipher: cbc-aes-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
>> [    2.995377] 00000000: 3d af ba 42 9d 9e b4 30 b4 22 da 80 2c 9f ac 41
>> [    3.032673] alg: skcipher: cbc-des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
>> [    3.043185] 00000000: fe dc ba 98 76 54 32 10
>> [    3.063238] alg: skcipher: cbc-3des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
>> [    3.073818] 00000000: 7d 33 88 93 0f 93 b2 42
>>
>> This above dumps show that the actual output IV is indeed the input IV.
>> This is due to the IV not being copied back into the request.
>>
>> This patch fixes that.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> Reviewed-by: Horia Geantă <horia.geanta@nxp.com>

It's missing a Fixes: tag and a Cc: to stable.

I'll resend tomorrow.

> 
> While here, could you please check ecb mode (which by definition does not have
> an IV) is behaving correctly?
> Looking in driver_algs[] list of crypto algorithms supported by talitos,
> ecb(aes,des,3des) are declared with ivsize != 0.

According to /proc/crypto, test are passed for ecb.

Christophe

> 
> Thanks,
> Horia
> 

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Daniel Colascione @ 2019-05-15 17:45 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-ia64, linux-sh, linux-kernel, David Howells,
	open list:KERNEL SELFTEST FRAMEWORK, sparclinux, Linux API,
	elena.reshetova, linux-arch, linux-s390, linux-xtensa, Kees Cook,
	Arnd Bergmann, Jann Horn, linux-m68k, Al Viro, Andy Lutomirski,
	Oleg Nesterov, Thomas Gleixner, linux-arm-kernel, linux-parisc,
	Aleksa Sarai, Linus Torvalds, linux-mips, Andy Lutomirski,
	Eric W. Biederman, linux-alpha, Andrew Morton, linuxppc-dev
In-Reply-To: <20190515100400.3450-1-christian@brauner.io>

On Wed, May 15, 2019 at 3:04 AM Christian Brauner <christian@brauner.io> wrote:
>
> This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> process that is created via traditional fork()/clone() calls that is only
> referenced by a PID:

Thanks for doing this work. I'm really looking forward to this new
approach to process management.

> int pidfd = pidfd_open(1234, 0);
> ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
>
> With the introduction of pidfds through CLONE_PIDFD it is possible to
> created pidfds at process creation time.
> However, a lot of processes get created with traditional PID-based calls
> such as fork() or clone() (without CLONE_PIDFD). For these processes a
> caller can currently not create a pollable pidfd. This is a huge problem
> for Android's low memory killer (LMK) and service managers such as systemd.
> Both are examples of tools that want to make use of pidfds to get reliable
> notification of process exit for non-parents (pidfd polling) and race-free
> signal sending (pidfd_send_signal()). They intend to switch to this API for
> process supervision/management as soon as possible. Having no way to get
> pollable pidfds from PID-only processes is one of the biggest blockers for
> them in adopting this api. With pidfd_open() making it possible to retrieve
> pidfd for PID-based processes we enable them to adopt this api.
>
> In line with Arnd's recent changes to consolidate syscall numbers across
> architectures, I have added the pidfd_open() syscall to all architectures
> at the same time.

I'm glad it's easier now.

>  arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
>  arch/arm64/include/asm/unistd32.h           |  2 +
>  arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
>  arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
>  arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
>  arch/s390/kernel/syscalls/syscall.tbl       |  1 +
>  arch/sh/kernel/syscalls/syscall.tbl         |  1 +
>  arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +

It'd be nice to arrange the system call tables so that we need to
change only one file when adding a new system call.

[Snip system call wiring]

> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -67,6 +67,7 @@ struct pid
>  extern struct pid init_struct_pid;
>
>  extern const struct file_operations pidfd_fops;
> +extern int pidfd_create(struct pid *pid);
>
>  static inline struct pid *get_pid(struct pid *pid)
>  {
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..989055e0b501 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
>                                 struct old_timex32 __user *tx);
>  asmlinkage long sys_syncfs(int fd);
>  asmlinkage long sys_setns(int fd, int nstype);
> +asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
>  asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
>                              unsigned int vlen, unsigned flags);
>  asmlinkage long sys_process_vm_readv(pid_t pid,
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index dee7292e1df6..94a257a93d20 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
>  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
>  #define __NR_io_uring_register 427
>  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> +#define __NR_pidfd_open 428
> +__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
>
>  #undef __NR_syscalls
> -#define __NR_syscalls 428
> +#define __NR_syscalls 429
>
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 737db1828437..980cc1d2b8d4 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
>   * Return: On success, a cloexec pidfd is returned.
>   *         On error, a negative errno number will be returned.
>   */
> -static int pidfd_create(struct pid *pid)
> +int pidfd_create(struct pid *pid)
>  {
>         int fd;
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..237d18d6ecb8 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -38,6 +38,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/proc_ns.h>
>  #include <linux/proc_fs.h>
> +#include <linux/sched/signal.h>
>  #include <linux/sched/task.h>
>  #include <linux/idr.h>
>
> @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
>         return idr_get_next(&ns->idr, &nr);
>  }
>
> +/**
> + * pidfd_open() - Open new pid file descriptor.
> + *
> + * @pid:   pid for which to retrieve a pidfd
> + * @flags: flags to pass
> + *
> + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> + * the process identified by @pid. Currently, the process identified by
> + * @pid must be a thread-group leader. This restriction currently exists
> + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> + * leaders).
> + *
> + * Return: On success, a cloexec pidfd is returned.
> + *         On error, a negative errno number will be returned.
> + */
> +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> +{
> +       int fd, ret;
> +       struct pid *p;
> +       struct task_struct *tsk;
> +
> +       if (flags)
> +               return -EINVAL;

If we support blocking operations on pidfds, we'll want to be able to
put them in non-blocking mode. Does it make sense to accept and ignore
O_NONBLOCK here now?

> +       if (pid <= 0)
> +               return -EINVAL;

WDYT of defining pid == 0 to mean "open myself"?

> +       p = find_get_pid(pid);
> +       if (!p)
> +               return -ESRCH;
> +
> +       rcu_read_lock();
> +       tsk = pid_task(p, PIDTYPE_PID);
> +       if (!tsk)
> +               ret = -ESRCH;
> +       else if (unlikely(!thread_group_leader(tsk)))
> +               ret = -EINVAL;
> +       else
> +               ret = 0;
> +       rcu_read_unlock();
> +
> +       fd = ret ?: pidfd_create(p);
> +       put_pid(p);
> +       return fd;
> +}

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-15 15:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-ia64, linux-sh, linux-mips, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, linux-arch, linux-s390,
	linux-xtensa, keescook, arnd, jannh, linux-m68k, viro, luto, tglx,
	linux-arm-kernel, linux-parisc, cyphar, torvalds, linux-kernel,
	luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515153515.GA20783@redhat.com>

On Wed, May 15, 2019 at 05:35:15PM +0200, Oleg Nesterov wrote:
> On 05/15, Oleg Nesterov wrote:
> >
> > On 05/15, Christian Brauner wrote:
> > >
> > > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > > +{
> > > +	int fd, ret;
> > > +	struct pid *p;
> > > +	struct task_struct *tsk;
> > > +
> > > +	if (flags)
> > > +		return -EINVAL;
> > > +
> > > +	if (pid <= 0)
> > > +		return -EINVAL;
> > > +
> > > +	p = find_get_pid(pid);
> > > +	if (!p)
> > > +		return -ESRCH;
> > > +
> > > +	rcu_read_lock();
> > > +	tsk = pid_task(p, PIDTYPE_PID);
> >
> > You do not need find_get_pid() before rcu_lock and put_pid() at the end.
> > You can just do find_vpid() under rcu_read_lock().
> 
> Ah, sorry. Somehow I forgot you need to call pidfd_create(pid), you can't
> do this under rcu_read_lock().
> 
> So I was wrong, you can't avoid get/put_pid.

Yeah, I haven't made any changes yet.

Christian

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Oleg Nesterov @ 2019-05-15 15:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-ia64, linux-sh, linux-mips, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, linux-arch, linux-s390,
	linux-xtensa, keescook, arnd, jannh, linux-m68k, viro, luto, tglx,
	linux-arm-kernel, linux-parisc, cyphar, torvalds, linux-kernel,
	luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515143857.GB18892@redhat.com>

On 05/15, Oleg Nesterov wrote:
>
> On 05/15, Christian Brauner wrote:
> >
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > +	int fd, ret;
> > +	struct pid *p;
> > +	struct task_struct *tsk;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	if (pid <= 0)
> > +		return -EINVAL;
> > +
> > +	p = find_get_pid(pid);
> > +	if (!p)
> > +		return -ESRCH;
> > +
> > +	rcu_read_lock();
> > +	tsk = pid_task(p, PIDTYPE_PID);
>
> You do not need find_get_pid() before rcu_lock and put_pid() at the end.
> You can just do find_vpid() under rcu_read_lock().

Ah, sorry. Somehow I forgot you need to call pidfd_create(pid), you can't
do this under rcu_read_lock().

So I was wrong, you can't avoid get/put_pid.

Oleg.


^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-15 15:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-ia64, linux-sh, linux-mips, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, linux-arch, linux-s390,
	linux-xtensa, keescook, arnd, jannh, linux-m68k, viro, luto, tglx,
	linux-arm-kernel, linux-parisc, cyphar, torvalds, linux-kernel,
	luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515151912.GE18892@redhat.com>

On Wed, May 15, 2019 at 05:19:13PM +0200, Oleg Nesterov wrote:
> On 05/15, Christian Brauner wrote:
> >
> > On Wed, May 15, 2019 at 04:38:58PM +0200, Oleg Nesterov wrote:
> > >
> > > it seems that you can do a single check
> > >
> > > 	tsk = pid_task(p, PIDTYPE_TGID);
> > > 	if (!tsk)
> > > 		ret = -ESRCH;
> > >
> > > this even looks more correct if we race with exec changing the leader.
> >
> > The logic here being that you can only reach the thread_group leader
> > from struct pid if PIDTYPE_PID == PIDTYPE_TGID for this struct pid?
> 
> Not exactly... it is not that PIDTYPE_PID == PIDTYPE_TGID for this pid,
> struct pid has no "type" or something like this.
> 
> The logic is that pid->tasks[PIDTYPE_XXX] is the list of task which use
> this pid as "XXX" type.
> 
> For example, clone(CLONE_THREAD) creates a pid which has a single non-
> empty list, pid->tasks[PIDTYPE_PID]. This pid can't be used as TGID or
> SID.
> 
> So if pid_task(PIDTYPE_TGID) returns non-NULL we know that this pid was
> used for a group-leader, see copy_process() which does

Ah, this was what I was asking myself when I worked on thread-specific
signal sending. This clarifies quite a lot of things!

Though I wonder how one reliably gets a the PGID or SID from a
PIDTYPE_PID.

> 
> 	if (thread_group_leader(p))
> 		attach_pid(p, PIDTYPE_TGID);
> 
> 
> If we race with exec which changes the leader pid_task(TGID) can return
> the old leader. We do not care, but this means that we should not check
> thread_group_leader().

Nice!

Thank you, Oleg! :)
Christian

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Oleg Nesterov @ 2019-05-15 15:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-ia64, linux-sh, linux-mips, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, linux-arch, linux-s390,
	linux-xtensa, keescook, arnd, jannh, linux-m68k, viro, luto, tglx,
	linux-arm-kernel, linux-parisc, cyphar, torvalds, linux-kernel,
	luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515144927.f2yxyi6w6lhn3xx7@brauner.io>

On 05/15, Christian Brauner wrote:
>
> On Wed, May 15, 2019 at 04:38:58PM +0200, Oleg Nesterov wrote:
> >
> > it seems that you can do a single check
> >
> > 	tsk = pid_task(p, PIDTYPE_TGID);
> > 	if (!tsk)
> > 		ret = -ESRCH;
> >
> > this even looks more correct if we race with exec changing the leader.
>
> The logic here being that you can only reach the thread_group leader
> from struct pid if PIDTYPE_PID == PIDTYPE_TGID for this struct pid?

Not exactly... it is not that PIDTYPE_PID == PIDTYPE_TGID for this pid,
struct pid has no "type" or something like this.

The logic is that pid->tasks[PIDTYPE_XXX] is the list of task which use
this pid as "XXX" type.

For example, clone(CLONE_THREAD) creates a pid which has a single non-
empty list, pid->tasks[PIDTYPE_PID]. This pid can't be used as TGID or
SID.

So if pid_task(PIDTYPE_TGID) returns non-NULL we know that this pid was
used for a group-leader, see copy_process() which does

	if (thread_group_leader(p))
		attach_pid(p, PIDTYPE_TGID);

If we race with exec which changes the leader pid_task(TGID) can return
the old leader. We do not care, but this means that we should not check
thread_group_leader().

Oleg.

^ permalink raw reply

* Re: [PATCH stable 4.4] powerpc/lib: fix book3s/32 boot failure due to code patching
From: Christophe Leroy @ 2019-05-15 15:13 UTC (permalink / raw)
  To: Greg KH
  Cc: erhard_f, Michael Neuling, linux-kernel, stable@vger.kernel.org,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <20190515141604.GB8999@kroah.com>



Le 15/05/2019 à 16:16, Greg KH a écrit :
> On Wed, May 15, 2019 at 01:30:42PM +0000, Christophe Leroy wrote:
>> [Backport of upstream commit b45ba4a51cde29b2939365ef0c07ad34c8321789]
>>
>> On powerpc32, patch_instruction() is called by apply_feature_fixups()
>> which is called from early_init()
>>
>> There is the following note in front of early_init():
>>   * Note that the kernel may be running at an address which is different
>>   * from the address that it was linked at, so we must use RELOC/PTRRELOC
>>   * to access static data (including strings).  -- paulus
>>
>> Therefore init_mem_is_free must be accessed with PTRRELOC()
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=203597
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>>
>> ---
>> Can't apply the upstream commit as such due to several other unrelated stuff
>> like for instance STRICT_KERNEL_RWX which are missing.
>> So instead, using same approach as for commit 252eb55816a6f69ef9464cad303cdb3326cdc61d
>>
>> Removed the Fixes: tag as I don't know yet the commit Id of the fixed commit on 4.4 branch.
>> ---
>>   arch/powerpc/lib/code-patching.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Now added, thanks.
> 

Thanks,

However you took the commit log from the upstream commit, which doesn't 
corresponds exactly to the change being done here and described in the 
backport patch

Christophe

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-15 14:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-ia64, linux-sh, linux-mips, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, linux-arch, linux-s390,
	linux-xtensa, keescook, arnd, jannh, linux-m68k, viro, luto, tglx,
	linux-arm-kernel, linux-parisc, cyphar, torvalds, linux-kernel,
	luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515143857.GB18892@redhat.com>

On Wed, May 15, 2019 at 04:38:58PM +0200, Oleg Nesterov wrote:
> On 05/15, Christian Brauner wrote:
> >
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > +	int fd, ret;
> > +	struct pid *p;
> > +	struct task_struct *tsk;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	if (pid <= 0)
> > +		return -EINVAL;
> > +
> > +	p = find_get_pid(pid);
> > +	if (!p)
> > +		return -ESRCH;
> > +
> > +	rcu_read_lock();
> > +	tsk = pid_task(p, PIDTYPE_PID);
> 
> You do not need find_get_pid() before rcu_lock and put_pid() at the end.
> You can just do find_vpid() under rcu_read_lock().

Will do.

> 
> > +	if (!tsk)
> > +		ret = -ESRCH;
> > +	else if (unlikely(!thread_group_leader(tsk)))
> > +		ret = -EINVAL;
> 
> it seems that you can do a single check
> 
> 	tsk = pid_task(p, PIDTYPE_TGID);
> 	if (!tsk)
> 		ret = -ESRCH;
> 
> this even looks more correct if we race with exec changing the leader.

The logic here being that you can only reach the thread_group leader
from struct pid if PIDTYPE_PID == PIDTYPE_TGID for this struct pid?

Thanks, Oleg.
Christian

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Oleg Nesterov @ 2019-05-15 14:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-ia64, linux-sh, linux-mips, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, linux-arch, linux-s390,
	linux-xtensa, keescook, arnd, jannh, linux-m68k, viro, luto, tglx,
	linux-arm-kernel, linux-parisc, cyphar, torvalds, linux-kernel,
	luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515100400.3450-1-christian@brauner.io>

On 05/15, Christian Brauner wrote:
>
> +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> +{
> +	int fd, ret;
> +	struct pid *p;
> +	struct task_struct *tsk;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (pid <= 0)
> +		return -EINVAL;
> +
> +	p = find_get_pid(pid);
> +	if (!p)
> +		return -ESRCH;
> +
> +	rcu_read_lock();
> +	tsk = pid_task(p, PIDTYPE_PID);

You do not need find_get_pid() before rcu_lock and put_pid() at the end.
You can just do find_vpid() under rcu_read_lock().

> +	if (!tsk)
> +		ret = -ESRCH;
> +	else if (unlikely(!thread_group_leader(tsk)))
> +		ret = -EINVAL;

it seems that you can do a single check

	tsk = pid_task(p, PIDTYPE_TGID);
	if (!tsk)
		ret = -ESRCH;

this even looks more correct if we race with exec changing the leader.

Oleg.


^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-15 14:16 UTC (permalink / raw)
  To: Yann Droneaud
  Cc: linux-ia64, linux-sh, linux-kernel, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, linux-arch, linux-s390,
	linux-xtensa, keescook, arnd, jannh, linux-m68k, viro, luto, oleg,
	tglx, linux-arm-kernel, linux-parisc, cyphar, torvalds,
	linux-mips, luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <4c5ae46657e1931a832def5645db61eb0bf1accd.camel@opteya.com>

On Wed, May 15, 2019 at 04:00:20PM +0200, Yann Droneaud wrote:
> Hi,
> 
> Le mercredi 15 mai 2019 à 12:03 +0200, Christian Brauner a écrit :
> > 
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..237d18d6ecb8 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct
> > pid_namespace *ns)
> >  	return idr_get_next(&ns->idr, &nr);
> >  }
> >  
> > +/**
> > + * pidfd_open() - Open new pid file descriptor.
> > + *
> > + * @pid:   pid for which to retrieve a pidfd
> > + * @flags: flags to pass
> > + *
> > + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> > + * the process identified by @pid. Currently, the process identified by
> > + * @pid must be a thread-group leader. This restriction currently exists
> > + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> > + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> > + * leaders).
> > + *
> 
> Would it be possible to create file descriptor with "restricted"
> operation ?
> 
> - O_RDONLY: waiting for process completion allowed (for example)
> - O_WRONLY: sending process signal allowed

Yes, something like this is likely going to be possible in the future.
We had discussion around this. But mapping this to O_RDONLY and O_WRONLY
is not the right model. It makes more sense to have specialized flags
that restrict actions.

> 
> For example, a process could send over a Unix socket a process a pidfd,
> allowing this to only wait for completion, but not sending signal ?
> 
> I see the permission check is not done in pidfd_open(), so what prevent
> a user from sending a signal to another user owned process ?

That's supposed to be possible. You can do the same right now already
with pids. Tools like LMK need this probably very much.
Permission checking for signals is done at send time right now.
And if you can't signal via a pid you can't signal via a pidfd as
they're both subject to the same permissions checks.

> 
> If it's in pidfd_send_signal(), then, passing the socket through
> SCM_RIGHT won't be useful if the target process is not owned by the
> same user, or root.
> 
> > + * Return: On success, a cloexec pidfd is returned.
> > + *         On error, a negative errno number will be returned.
> > + */
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > +	int fd, ret;
> > +	struct pid *p;
> > +	struct task_struct *tsk;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	if (pid <= 0)
> > +		return -EINVAL;
> > +
> > +	p = find_get_pid(pid);
> > +	if (!p)
> > +		return -ESRCH;
> > +
> > +	rcu_read_lock();
> > +	tsk = pid_task(p, PIDTYPE_PID);
> > +	if (!tsk)
> > +		ret = -ESRCH;
> > +	else if (unlikely(!thread_group_leader(tsk)))
> > +		ret = -EINVAL;
> > +	else
> > +		ret = 0;
> > +	rcu_read_unlock();
> > +
> > +	fd = ret ?: pidfd_create(p);
> > +	put_pid(p);
> > +	return fd;
> > +}
> > +
> >  void __init pid_idr_init(void)
> >  {
> >  	/* Verify no one has done anything silly: */
> 
> Regards.
> 
> -- 
> Yann Droneaud
> OPTEYA
> 
> 

^ permalink raw reply

* Re: [PATCH stable 4.4] powerpc/lib: fix book3s/32 boot failure due to code patching
From: Greg KH @ 2019-05-15 14:16 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: erhard_f, Michael Neuling, linux-kernel, stable@vger.kernel.org,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <71dbc8bdad5da9f6cb0446535fb2a29c68fccf80.1557926850.git.christophe.leroy@c-s.fr>

On Wed, May 15, 2019 at 01:30:42PM +0000, Christophe Leroy wrote:
> [Backport of upstream commit b45ba4a51cde29b2939365ef0c07ad34c8321789]
> 
> On powerpc32, patch_instruction() is called by apply_feature_fixups()
> which is called from early_init()
> 
> There is the following note in front of early_init():
>  * Note that the kernel may be running at an address which is different
>  * from the address that it was linked at, so we must use RELOC/PTRRELOC
>  * to access static data (including strings).  -- paulus
> 
> Therefore init_mem_is_free must be accessed with PTRRELOC()
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=203597
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> 
> ---
> Can't apply the upstream commit as such due to several other unrelated stuff
> like for instance STRICT_KERNEL_RWX which are missing.
> So instead, using same approach as for commit 252eb55816a6f69ef9464cad303cdb3326cdc61d
> 
> Removed the Fixes: tag as I don't know yet the commit Id of the fixed commit on 4.4 branch.
> ---
>  arch/powerpc/lib/code-patching.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Now added, thanks.

greg k-h

^ permalink raw reply

* Re: [PATCH] crypto: talitos - fix skcipher failure due to wrong output IV
From: Horia Geanta @ 2019-05-15 14:05 UTC (permalink / raw)
  To: Christophe Leroy, Herbert Xu, David S. Miller
  Cc: linuxppc-dev@lists.ozlabs.org, linux-crypto@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <a5b0d31d8fc9fc9bc2b69baa5330466090825a39.1557923113.git.christophe.leroy@c-s.fr>

On 5/15/2019 3:29 PM, Christophe Leroy wrote:
> Selftests report the following:
> 
> [    2.984845] alg: skcipher: cbc-aes-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
> [    2.995377] 00000000: 3d af ba 42 9d 9e b4 30 b4 22 da 80 2c 9f ac 41
> [    3.032673] alg: skcipher: cbc-des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
> [    3.043185] 00000000: fe dc ba 98 76 54 32 10
> [    3.063238] alg: skcipher: cbc-3des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
> [    3.073818] 00000000: 7d 33 88 93 0f 93 b2 42
> 
> This above dumps show that the actual output IV is indeed the input IV.
> This is due to the IV not being copied back into the request.
> 
> This patch fixes that.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Horia Geantă <horia.geanta@nxp.com>

While here, could you please check ecb mode (which by definition does not have
an IV) is behaving correctly?
Looking in driver_algs[] list of crypto algorithms supported by talitos,
ecb(aes,des,3des) are declared with ivsize != 0.

Thanks,
Horia

^ permalink raw reply

* [PATCH v2] powerpc: silence a -Wcast-function-type warning in dawr_write_file_bool
From: Mathieu Malaterre @ 2019-05-15 13:51 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Michael Neuling, Mathieu Malaterre, linux-kernel,
	Christoph Hellwig, Paul Mackerras, linuxppc-dev
In-Reply-To: <20190515120942.3812-1-malat@debian.org>

In commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9
option") the following piece of code was added:

   smp_call_function((smp_call_func_t)set_dawr, &null_brk, 0);

Since GCC 8 this triggers the following warning about incompatible
function types:

  arch/powerpc/kernel/hw_breakpoint.c:408:21: error: cast between incompatible function types from 'int (*)(struct arch_hw_breakpoint *)' to 'void (*)(void *)' [-Werror=cast-function-type]

Since the warning is there for a reason, and should not be hidden behind
a cast, provide an intermediate callback function to avoid the warning.

Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option")
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Michael Neuling <mikey@neuling.org>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
---
v2: do not hide warning using a hack

 arch/powerpc/kernel/hw_breakpoint.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
index f70fb89dbf60..969092d84a2f 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -386,6 +386,11 @@ void hw_breakpoint_pmu_read(struct perf_event *bp)
 bool dawr_force_enable;
 EXPORT_SYMBOL_GPL(dawr_force_enable);
 
+static void set_dawr_cb(void *info)
+{
+	set_dawr(info);
+}
+
 static ssize_t dawr_write_file_bool(struct file *file,
 				    const char __user *user_buf,
 				    size_t count, loff_t *ppos)
@@ -405,7 +410,7 @@ static ssize_t dawr_write_file_bool(struct file *file,
 
 	/* If we are clearing, make sure all CPUs have the DAWR cleared */
 	if (!dawr_force_enable)
-		smp_call_function((smp_call_func_t)set_dawr, &null_brk, 0);
+		smp_call_function(set_dawr_cb, &null_brk, 0);
 
 	return rc;
 }
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] powerpc: silence a -Wcast-function-type warning in dawr_write_file_bool
From: Mathieu Malaterre @ 2019-05-15 13:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael Neuling, linuxppc-dev, Paul Mackerras, LKML
In-Reply-To: <20190515131441.GA3072@infradead.org>

Hi Christoph,

On Wed, May 15, 2019 at 3:14 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, May 15, 2019 at 02:09:42PM +0200, Mathieu Malaterre wrote:
> > In commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9
> > option") the following piece of code was added:
> >
> >    smp_call_function((smp_call_func_t)set_dawr, &null_brk, 0);
> >
> > Since GCC 8 this trigger the following warning about incompatible
> > function types:
>
> And the warning is there for a reason, and should not be hidden
> behind a cast.  This should instead be fixed by something like this:

<meme: I have no idea what I am doing>
OK, thanks for the quick feedback, will send a v2 asap.

> diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c
> index da307dd93ee3..a26b67a1be83 100644
> --- a/arch/powerpc/kernel/hw_breakpoint.c
> +++ b/arch/powerpc/kernel/hw_breakpoint.c
> @@ -384,6 +384,12 @@ void hw_breakpoint_pmu_read(struct perf_event *bp)
>  bool dawr_force_enable;
>  EXPORT_SYMBOL_GPL(dawr_force_enable);
>
> +
> +static void set_dawr_cb(void *info)
> +{
> +       set_dawr(info);
> +}
> +
>  static ssize_t dawr_write_file_bool(struct file *file,
>                                     const char __user *user_buf,
>                                     size_t count, loff_t *ppos)
> @@ -403,7 +409,7 @@ static ssize_t dawr_write_file_bool(struct file *file,
>
>         /* If we are clearing, make sure all CPUs have the DAWR cleared */
>         if (!dawr_force_enable)
> -               smp_call_function((smp_call_func_t)set_dawr, &null_brk, 0);
> +               smp_call_function(set_dawr_cb, &null_brk, 0);
>
>         return rc;
>  }

^ permalink raw reply

* [Bug 203609] New: Build error: implicit declaration of function 'cpu_mitigations_off'
From: bugzilla-daemon @ 2019-05-15 13:36 UTC (permalink / raw)
  To: linuxppc-dev

https://bugzilla.kernel.org/show_bug.cgi?id=203609

            Bug ID: 203609
           Summary: Build error: implicit declaration of function
                    'cpu_mitigations_off'
           Product: Platform Specific/Hardware
           Version: 2.5
    Kernel Version: 4.19.43 and 4.14.119
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: PPC-64
          Assignee: platform_ppc-64@kernel-bugs.osdl.org
          Reporter: jason@bluehome.net
        Regression: No

Created attachment 282765
  --> https://bugzilla.kernel.org/attachment.cgi?id=282765&action=edit
Build log

This just showed up in 4.19.43 and 4.14.119. 4.19.42 and 4.14.118 were fine.
I'm building with GCC 8.3 for ppc64el. 4.19.43 and 4.14.119 also build fine for
32- and 64-bit x86.

arch/powerpc/kernel/security.c: In function 'setup_barrier_nospec':
arch/powerpc/kernel/security.c:59:21: error: implicit declaration of function
'cpu_mitigations_off' [-Werror=implicit-function-declaration]
  if (!no_nospec && !cpu_mitigations_off())
                     ^~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
scripts/Makefile.build:303: recipe for target 'arch/powerpc/kernel/security.o'
failed
make[1]: *** [arch/powerpc/kernel/security.o] Error 1
Makefile:1051: recipe for target 'arch/powerpc/kernel' failed
make: *** [arch/powerpc/kernel] Error 2

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply

* Re: [PATCH stable 4.9] powerpc/lib: fix book3s/32 boot failure due to code patching
From: Christophe Leroy @ 2019-05-15 13:31 UTC (permalink / raw)
  To: Greg KH
  Cc: erhard_f, Michael Neuling, linux-kernel, stable@vger.kernel.org,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <20190515130825.GA3794@kroah.com>



Le 15/05/2019 à 15:08, Greg KH a écrit :
> On Wed, May 15, 2019 at 02:35:36PM +0200, Christophe Leroy wrote:
>>
>>
>> Le 15/05/2019 à 10:29, Greg KH a écrit :
>>> On Wed, May 15, 2019 at 06:40:47AM +0000, Christophe Leroy wrote:
>>>> [Backport of upstream commit b45ba4a51cde29b2939365ef0c07ad34c8321789]
>>>>
>>>> On powerpc32, patch_instruction() is called by apply_feature_fixups()
>>>> which is called from early_init()
>>>>
>>>> There is the following note in front of early_init():
>>>>    * Note that the kernel may be running at an address which is different
>>>>    * from the address that it was linked at, so we must use RELOC/PTRRELOC
>>>>    * to access static data (including strings).  -- paulus
>>>>
>>>> Therefore init_mem_is_free must be accessed with PTRRELOC()
>>>>
>>>> Fixes: 1c38a84d4586 ("powerpc: Avoid code patching freed init sections")
>>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=203597
>>>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>>>>
>>>> ---
>>>> Can't apply the upstream commit as such due to several other unrelated stuff
>>>> like for instance STRICT_KERNEL_RWX which are missing.
>>>> So instead, using same approach as for commit 252eb55816a6f69ef9464cad303cdb3326cdc61d
>>>
>>> Now queued up, thanks.
>>>
>>
>> Should go to 4.4 as well since the commit it fixes is now queued for 4.4
>> ([PATCH 4.4 056/266] powerpc: Avoid code patching freed init sections)
> 
> Ok, can someone send me a backport that actually applies there?
> 

Done

Christophe

^ permalink raw reply

* [PATCH stable 4.4] powerpc/lib: fix book3s/32 boot failure due to code patching
From: Christophe Leroy @ 2019-05-15 13:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman, Greg KH,
	erhard_f, Michael Neuling
  Cc: linuxppc-dev, linux-kernel, stable

[Backport of upstream commit b45ba4a51cde29b2939365ef0c07ad34c8321789]

On powerpc32, patch_instruction() is called by apply_feature_fixups()
which is called from early_init()

There is the following note in front of early_init():
 * Note that the kernel may be running at an address which is different
 * from the address that it was linked at, so we must use RELOC/PTRRELOC
 * to access static data (including strings).  -- paulus

Therefore init_mem_is_free must be accessed with PTRRELOC()

Link: https://bugzilla.kernel.org/show_bug.cgi?id=203597
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>

---
Can't apply the upstream commit as such due to several other unrelated stuff
like for instance STRICT_KERNEL_RWX which are missing.
So instead, using same approach as for commit 252eb55816a6f69ef9464cad303cdb3326cdc61d

Removed the Fixes: tag as I don't know yet the commit Id of the fixed commit on 4.4 branch.
---
 arch/powerpc/lib/code-patching.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 2604192c0719..65ce778aee46 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -28,7 +28,7 @@ int patch_instruction(unsigned int *addr, unsigned int instr)
 	int err;
 
 	/* Make sure we aren't patching a freed init section */
-	if (init_mem_is_free && is_init(addr)) {
+	if (*PTRRELOC(&init_mem_is_free) && is_init(addr)) {
 		pr_debug("Skipping init section patching addr: 0x%px\n", addr);
 		return 0;
 	}
-- 
2.13.3


^ permalink raw reply related

* [RFC PATCH 5/5] powerpc/64s/radix: iomap use huge page mappings
From: Nicholas Piggin @ 2019-05-15 13:19 UTC (permalink / raw)
  To: linuxppc-dev, linux-mm; +Cc: Linus Torvalds, Nicholas Piggin
In-Reply-To: <20190515131944.12489-1-npiggin@gmail.com>

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  8 +++
 arch/powerpc/mm/pgtable_64.c                 | 54 +++++++++++++++++---
 2 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 7dede2e34b70..93b8a99df88e 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -274,6 +274,14 @@ extern unsigned long __vmalloc_end;
 #define VMALLOC_START	__vmalloc_start
 #define VMALLOC_END	__vmalloc_end
 
+static inline unsigned int ioremap_max_order(void)
+{
+	if (radix_enabled())
+		return PUD_SHIFT;
+	return 7 + PAGE_SHIFT; /* default from linux/vmalloc.h */
+}
+#define IOREMAP_MAX_ORDER ({ ioremap_max_order();})
+
 extern unsigned long __kernel_virt_start;
 extern unsigned long __kernel_virt_size;
 extern unsigned long __kernel_io_start;
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index d2d976ff8a0e..f660116251e6 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -112,7 +112,7 @@ unsigned long ioremap_bot = IOREMAP_BASE;
  * __ioremap_at - Low level function to establish the page tables
  *                for an IO mapping
  */
-void __iomem *__ioremap_at(phys_addr_t pa, void *ea, unsigned long size, pgprot_t prot)
+static void __iomem * hash__ioremap_at(phys_addr_t pa, void *ea, unsigned long size, pgprot_t prot)
 {
 	unsigned long i;
 
@@ -120,6 +120,50 @@ void __iomem *__ioremap_at(phys_addr_t pa, void *ea, unsigned long size, pgprot_
 	if (pgprot_val(prot) & H_PAGE_4K_PFN)
 		return NULL;
 
+	for (i = 0; i < size; i += PAGE_SIZE)
+		if (map_kernel_page((unsigned long)ea + i, pa + i, prot))
+			return NULL;
+
+	return (void __iomem *)ea;
+}
+
+static int radix__ioremap_page_range(unsigned long addr, unsigned long end,
+		       phys_addr_t phys_addr, pgprot_t prot)
+{
+	while (addr != end) {
+		if (!(addr & ~PUD_MASK) && !(phys_addr & ~PUD_MASK) &&
+				end - addr >= PUD_SIZE) {
+			if (radix__map_kernel_page(addr, phys_addr, prot, PUD_SIZE))
+				return -ENOMEM;
+			addr += PUD_SIZE;
+			phys_addr += PUD_SIZE;
+
+		} else if (!(addr & ~PMD_MASK) && !(phys_addr & ~PMD_MASK) &&
+				end - addr >= PMD_SIZE) {
+			if (radix__map_kernel_page(addr, phys_addr, prot, PMD_SIZE))
+				return -ENOMEM;
+			addr += PMD_SIZE;
+			phys_addr += PMD_SIZE;
+
+		} else {
+			if (radix__map_kernel_page(addr, phys_addr, prot, PAGE_SIZE))
+				return -ENOMEM;
+			addr += PAGE_SIZE;
+			phys_addr += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static void __iomem * radix__ioremap_at(phys_addr_t pa, void *ea, unsigned long size, pgprot_t prot)
+{
+	if (radix__ioremap_page_range((unsigned long)ea, (unsigned long)ea + size, pa, prot))
+		return NULL;
+	return ea;
+}
+
+void __iomem *__ioremap_at(phys_addr_t pa, void *ea, unsigned long size, pgprot_t prot)
+{
 	if ((ea + size) >= (void *)IOREMAP_END) {
 		pr_warn("Outside the supported range\n");
 		return NULL;
@@ -129,11 +173,9 @@ void __iomem *__ioremap_at(phys_addr_t pa, void *ea, unsigned long size, pgprot_
 	WARN_ON(((unsigned long)ea) & ~PAGE_MASK);
 	WARN_ON(size & ~PAGE_MASK);
 
-	for (i = 0; i < size; i += PAGE_SIZE)
-		if (map_kernel_page((unsigned long)ea + i, pa + i, prot))
-			return NULL;
-
-	return (void __iomem *)ea;
+	if (radix_enabled())
+		return radix__ioremap_at(pa, ea, size, prot);
+	return hash__ioremap_at(pa, ea, size, prot);
 }
 
 /**
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH 4/5] powerpc/64s/radix: Enable HAVE_ARCH_HUGE_VMAP
From: Nicholas Piggin @ 2019-05-15 13:19 UTC (permalink / raw)
  To: linuxppc-dev, linux-mm; +Cc: Linus Torvalds, Nicholas Piggin
In-Reply-To: <20190515131944.12489-1-npiggin@gmail.com>

This does not actually enable huge vmap mappings, because powerpc/64
ioremap does not call ioremap_page_range, but it is required before
implementing huge mappings in ioremap, because the generic vunmap code
needs to cope with them.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig                     |  1 +
 arch/powerpc/mm/book3s64/radix_pgtable.c | 93 ++++++++++++++++++++++++
 2 files changed, 94 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d7996cfaceca..ffac84600e0e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -166,6 +166,7 @@ config PPC
 	select GENERIC_STRNLEN_USER
 	select GENERIC_TIME_VSYSCALL
 	select HAVE_ARCH_AUDITSYSCALL
+	select HAVE_ARCH_HUGE_VMAP		if PPC_BOOK3S_64 && PPC_RADIX_MMU
 	select HAVE_ARCH_JUMP_LABEL
 	select HAVE_ARCH_KASAN			if PPC32
 	select HAVE_ARCH_KGDB
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index c9bcf428dd2b..3bc9ade56277 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1122,3 +1122,96 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
 
 	set_pte_at(mm, addr, ptep, pte);
 }
+
+int __init arch_ioremap_pud_supported(void)
+{
+	return radix_enabled();
+}
+
+int __init arch_ioremap_pmd_supported(void)
+{
+	return radix_enabled();
+}
+
+int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
+{
+	return 0;
+}
+
+int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
+{
+	pte_t *ptep = (pte_t *)pud;
+	pte_t new_pud = pfn_pte(__phys_to_pfn(addr), prot);
+
+	set_pte_at(&init_mm, 0 /* radix unused */, ptep, new_pud);
+
+	return 1;
+}
+
+int pud_clear_huge(pud_t *pud)
+{
+	if (pud_huge(*pud)) {
+		pud_clear(pud);
+		return 1;
+	}
+
+	return 0;
+}
+
+int pud_free_pmd_page(pud_t *pud, unsigned long addr)
+{
+	pmd_t *pmd;
+	int i;
+
+	pmd = (pmd_t *)pud_page_vaddr(*pud);
+	pud_clear(pud);
+
+	flush_tlb_kernel_range(addr, addr + PUD_SIZE);
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_none(pmd[i])) {
+			pte_t *pte;
+			pte = (pte_t *)pmd_page_vaddr(pmd[i]);
+
+			pte_free_kernel(&init_mm, pte);
+		}
+	}
+
+	pmd_free(&init_mm, pmd);
+
+	return 1;
+}
+
+int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
+{
+	pte_t *ptep = (pte_t *)pmd;
+	pte_t new_pmd = pfn_pte(__phys_to_pfn(addr), prot);
+
+	set_pte_at(&init_mm, 0 /* radix unused */, ptep, new_pmd);
+
+	return 1;
+}
+
+int pmd_clear_huge(pmd_t *pmd)
+{
+	if (pmd_huge(*pmd)) {
+		pmd_clear(pmd);
+		return 1;
+	}
+
+	return 0;
+}
+
+int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
+{
+	pte_t *pte;
+
+	pte = (pte_t *)pmd_page_vaddr(*pmd);
+	pmd_clear(pmd);
+
+	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+
+	pte_free_kernel(&init_mm, pte);
+
+	return 1;
+}
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH 3/5] mm/vmalloc: Hugepage vmalloc mappings
From: Nicholas Piggin @ 2019-05-15 13:19 UTC (permalink / raw)
  To: linuxppc-dev, linux-mm; +Cc: Linus Torvalds, Nicholas Piggin
In-Reply-To: <20190515131944.12489-1-npiggin@gmail.com>

This appears to help cached git diff performance by about 5% on a
POWER9 (with 32MB dentry cache hash).

  Profiling git diff dTLB misses with a vanilla kernel:

  81.75%  git      [kernel.vmlinux]    [k] __d_lookup_rcu
   7.21%  git      [kernel.vmlinux]    [k] strncpy_from_user
   1.77%  git      [kernel.vmlinux]    [k] find_get_entry
   1.59%  git      [kernel.vmlinux]    [k] kmem_cache_free

            40,168      dTLB-miss
       0.100342754 seconds time elapsed

After this patch (and the subsequent powerpc HUGE_VMAP patches), the
dentry cache hash gets mapped with 2MB pages:

             2,987      dTLB-miss
       0.095933138 seconds time elapsed

elapsed time improvement isn't too scientific but seems consistent,
TLB misses certainly improves an order of magnitude. My laptop
takes a lot of misses here too, so x86 would be interesting to test,
I think it should just work there.

---
 include/linux/vmalloc.h |  1 +
 mm/vmalloc.c            | 87 +++++++++++++++++++++++++++--------------
 2 files changed, 59 insertions(+), 29 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c6eebb839552..029635560306 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -42,6 +42,7 @@ struct vm_struct {
 	unsigned long		size;
 	unsigned long		flags;
 	struct page		**pages;
+	unsigned int		page_shift;
 	unsigned int		nr_pages;
 	phys_addr_t		phys_addr;
 	const void		*caller;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e5e9e1fcac01..c9ba88768bca 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -216,32 +216,34 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
  * Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N]
  */
 static int vmap_page_range_noflush(unsigned long start, unsigned long end,
-				   pgprot_t prot, struct page **pages)
+				   pgprot_t prot, struct page **pages,
+				   unsigned int page_shift)
 {
-	pgd_t *pgd;
-	unsigned long next;
 	unsigned long addr = start;
-	int err = 0;
-	int nr = 0;
+	unsigned int i, nr = (end - start) >> (PAGE_SHIFT + page_shift);
 
-	BUG_ON(addr >= end);
-	pgd = pgd_offset_k(addr);
-	do {
-		next = pgd_addr_end(addr, end);
-		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+	for (i = 0; i < nr; i++) {
+		int err;
+
+		err = ioremap_page_range(addr,
+					addr + (PAGE_SIZE << page_shift),
+					__pa(page_address(pages[i])), prot);
 		if (err)
 			return err;
-	} while (pgd++, addr = next, addr != end);
+
+		addr += PAGE_SIZE << page_shift;
+	}
 
 	return nr;
 }
 
 static int vmap_page_range(unsigned long start, unsigned long end,
-			   pgprot_t prot, struct page **pages)
+			   pgprot_t prot, struct page **pages,
+			   unsigned int page_shift)
 {
 	int ret;
 
-	ret = vmap_page_range_noflush(start, end, prot, pages);
+	ret = vmap_page_range_noflush(start, end, prot, pages, page_shift);
 	flush_cache_vmap(start, end);
 	return ret;
 }
@@ -1189,7 +1191,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro
 		addr = va->va_start;
 		mem = (void *)addr;
 	}
-	if (vmap_page_range(addr, addr + size, prot, pages) < 0) {
+	if (vmap_page_range(addr, addr + size, prot, pages, 0) < 0) {
 		vm_unmap_ram(mem, count);
 		return NULL;
 	}
@@ -1305,7 +1307,7 @@ void __init vmalloc_init(void)
 int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 			     pgprot_t prot, struct page **pages)
 {
-	return vmap_page_range_noflush(addr, addr + size, prot, pages);
+	return vmap_page_range_noflush(addr, addr + size, prot, pages, 0);
 }
 
 /**
@@ -1352,7 +1354,7 @@ int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
 	unsigned long end = addr + get_vm_area_size(area);
 	int err;
 
-	err = vmap_page_range(addr, end, prot, pages);
+	err = vmap_page_range(addr, end, prot, pages, 0);
 
 	return err > 0 ? 0 : err;
 }
@@ -1395,8 +1397,9 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 		return NULL;
 
 	if (flags & VM_IOREMAP)
-		align = 1ul << clamp_t(int, get_count_order_long(size),
-				       PAGE_SHIFT, IOREMAP_MAX_ORDER);
+		align = max(align,
+				1ul << clamp_t(int, get_count_order_long(size),
+				       PAGE_SHIFT, IOREMAP_MAX_ORDER));
 
 	area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
 	if (unlikely(!area))
@@ -1608,7 +1611,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
 			struct page *page = area->pages[i];
 
 			BUG_ON(!page);
-			__free_pages(page, 0);
+			__free_pages(page, area->page_shift);
 		}
 
 		kvfree(area->pages);
@@ -1751,14 +1754,17 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 				 pgprot_t prot, int node)
 {
 	struct page **pages;
+	unsigned long addr = (unsigned long)area->addr;
+	unsigned long size = get_vm_area_size(area);
+	unsigned int page_shift = area->page_shift;
+	unsigned int shift = page_shift + PAGE_SHIFT;
 	unsigned int nr_pages, array_size, i;
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
 	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
 	const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
-					0 :
-					__GFP_HIGHMEM;
+					0 : __GFP_HIGHMEM;
 
-	nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
+	nr_pages = size >> shift;
 	array_size = (nr_pages * sizeof(struct page *));
 
 	area->nr_pages = nr_pages;
@@ -1779,10 +1785,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	for (i = 0; i < area->nr_pages; i++) {
 		struct page *page;
 
-		if (node == NUMA_NO_NODE)
-			page = alloc_page(alloc_mask|highmem_mask);
-		else
-			page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
+		page = alloc_pages_node(node,
+				alloc_mask|highmem_mask, page_shift);
 
 		if (unlikely(!page)) {
 			/* Successfully allocated i pages, free them in __vunmap() */
@@ -1794,8 +1798,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			cond_resched();
 	}
 
-	if (map_vm_area(area, prot, pages))
+	if (vmap_page_range(addr, addr + size, prot, pages, page_shift) < 0)
 		goto fail;
+
 	return area->addr;
 
 fail:
@@ -1832,19 +1837,35 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 	struct vm_struct *area;
 	void *addr;
 	unsigned long real_size = size;
+	unsigned long real_align = align;
+	unsigned long size_per_node;
+	unsigned int shift;
 
 	size = PAGE_ALIGN(size);
 	if (!size || (size >> PAGE_SHIFT) > totalram_pages())
 		goto fail;
 
+	size_per_node = size;
+	if (node == NUMA_NO_NODE)
+		size_per_node /= num_online_nodes();
+	if (size_per_node >= PMD_SIZE)
+		shift = PMD_SHIFT;
+	else
+		shift = PAGE_SHIFT;
+again:
+	align = max(real_align, 1UL << shift);
+	size = ALIGN(real_size, align);
+
 	area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
 				vm_flags, start, end, node, gfp_mask, caller);
 	if (!area)
 		goto fail;
 
+	area->page_shift = shift - PAGE_SHIFT;
+
 	addr = __vmalloc_area_node(area, gfp_mask, prot, node);
 	if (!addr)
-		return NULL;
+		goto fail;
 
 	/*
 	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
@@ -1858,8 +1879,16 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 	return addr;
 
 fail:
-	warn_alloc(gfp_mask, NULL,
+	if (shift == PMD_SHIFT) {
+		shift = PAGE_SHIFT;
+		goto again;
+	}
+
+	if (!area) {
+		/* Warn for area allocation, page allocations already warn */
+		warn_alloc(gfp_mask, NULL,
 			  "vmalloc: allocation failure: %lu bytes", real_size);
+	}
 	return NULL;
 }
 
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH 2/5] mm: large system hash avoid vmap for non-NUMA machines when hashdist
From: Nicholas Piggin @ 2019-05-15 13:19 UTC (permalink / raw)
  To: linuxppc-dev, linux-mm; +Cc: Linus Torvalds, Nicholas Piggin
In-Reply-To: <20190515131944.12489-1-npiggin@gmail.com>

hashdist currently always uses vmalloc when hashdist is true. When
there is only 1 online node and size <= MAX_ORDER, vmalloc can be
avoided.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1683d54d6405..1312d4db5602 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7978,7 +7978,8 @@ void *__init alloc_large_system_hash(const char *tablename,
 			else
 				table = memblock_alloc_raw(size,
 							   SMP_CACHE_BYTES);
-		} else if (get_order(size) >= MAX_ORDER || hashdist) {
+		} else if (get_order(size) >= MAX_ORDER ||
+				(hashdist && num_online_nodes() > 1)) {
 			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
 		} else {
 			/*
-- 
2.20.1


^ permalink raw reply related

* [RFC PATCH 1/5] mm: large system hash use vmalloc for size > MAX_ORDER when !hashdist
From: Nicholas Piggin @ 2019-05-15 13:19 UTC (permalink / raw)
  To: linuxppc-dev, linux-mm; +Cc: Linus Torvalds, Nicholas Piggin

The kernel currently clamps large system hashes to MAX_ORDER when
hashdist is not set, which is rather arbitrary.

vmalloc space is limited on 32-bit machines, but this shouldn't
result in much more used because of small physical memory.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/page_alloc.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59661106da16..1683d54d6405 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7978,7 +7978,7 @@ void *__init alloc_large_system_hash(const char *tablename,
 			else
 				table = memblock_alloc_raw(size,
 							   SMP_CACHE_BYTES);
-		} else if (hashdist) {
+		} else if (get_order(size) >= MAX_ORDER || hashdist) {
 			table = __vmalloc(size, gfp_flags, PAGE_KERNEL);
 		} else {
 			/*
@@ -7986,10 +7986,8 @@ void *__init alloc_large_system_hash(const char *tablename,
 			 * some pages at the end of hash table which
 			 * alloc_pages_exact() automatically does
 			 */
-			if (get_order(size) < MAX_ORDER) {
-				table = alloc_pages_exact(size, gfp_flags);
-				kmemleak_alloc(table, size, 1, gfp_flags);
-			}
+			table = alloc_pages_exact(size, gfp_flags);
+			kmemleak_alloc(table, size, 1, gfp_flags);
 		}
 	} while (!table && size > PAGE_SIZE && --log2qty);
 
-- 
2.20.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox