Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v8 03/16] sched/core: uclamp: Enforce last task's UCLAMP_MAX
From: Suren Baghdasaryan @ 2019-04-17 20:36 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190402104153.25404-4-patrick.bellasi@arm.com>

 Hi Patrick,

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> When a task sleeps it removes its max utilization clamp from its CPU.
> However, the blocked utilization on that CPU can be higher than the max
> clamp value enforced while the task was running. This allows undesired
> CPU frequency increases while a CPU is idle, for example, when another
> CPU on the same frequency domain triggers a frequency update, since
> schedutil can now see the full not clamped blocked utilization of the
> idle CPU.
>
> Fix this by using
>   uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
>     uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)
> to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
> condition.
>

If I understand the intent correctly, you are trying to exclude idle
CPUs from affecting calculations of rq UCLAMP_MAX value. If that is
true I think description can be simplified a bit :) In particular it
took me some time to understand what "blocked utilization" means,
however if it's a widely accepted term then feel free to ignore my
input.

> Don't track any minimum utilization clamps since an idle CPU never
> requires a minimum frequency. The decay of the blocked utilization is
> good enough to reduce the CPU frequency.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
>
> --
> Changes in v8:
>  Message-ID: <20190314170619.rt6yhelj3y6dzypu@e110439-lin>
>  - moved flag reset into uclamp_rq_inc()
> ---
>  kernel/sched/core.c  | 45 ++++++++++++++++++++++++++++++++++++++++----
>  kernel/sched/sched.h |  2 ++
>  2 files changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6e1beae5f348..046f61d33f00 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -754,8 +754,35 @@ static inline unsigned int uclamp_none(int clamp_id)
>         return SCHED_CAPACITY_SCALE;
>  }
>
> +static inline unsigned int
> +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int clamp_value)
> +{
> +       /*
> +        * Avoid blocked utilization pushing up the frequency when we go
> +        * idle (which drops the max-clamp) by retaining the last known
> +        * max-clamp.
> +        */
> +       if (clamp_id == UCLAMP_MAX) {
> +               rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
> +               return clamp_value;
> +       }
> +
> +       return uclamp_none(UCLAMP_MIN);
> +}
> +
> +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
> +                                    unsigned int clamp_value)
> +{
> +       /* Reset max-clamp retention only on idle exit */
> +       if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
> +               return;
> +
> +       WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value);
> +}
> +
>  static inline
> -unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
> +unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
> +                                unsigned int clamp_value)

IMHO the name of uclamp_rq_max_value() is a bit misleading because:
1. It does not imply that it has to be called only when there are no
more runnable tasks on a CPU. This is currently the case because it's
called only from uclamp_rq_dec_id() and only when bucket->tasks==0 but
nothing in the name of this function indicates that it can't be called
from other places.
2. It does not imply that it marks rq UCLAMP_FLAG_IDLE.

>  {
>         struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
>         int bucket_id = UCLAMP_BUCKETS - 1;
> @@ -771,7 +798,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
>         }
>
>         /* No tasks -- default clamp values */
> -       return uclamp_none(clamp_id);
> +       return uclamp_idle_value(rq, clamp_id, clamp_value);
>  }
>
>  /*
> @@ -794,6 +821,8 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
>         bucket = &uc_rq->bucket[uc_se->bucket_id];
>         bucket->tasks++;
>
> +       uclamp_idle_reset(rq, clamp_id, uc_se->value);
> +
>         /*
>          * Local max aggregation: rq buckets always track the max
>          * "requested" clamp value of its RUNNABLE tasks.
> @@ -820,6 +849,7 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
>         struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
>         struct uclamp_se *uc_se = &p->uclamp[clamp_id];
>         struct uclamp_bucket *bucket;
> +       unsigned int bkt_clamp;
>         unsigned int rq_clamp;
>
>         bucket = &uc_rq->bucket[uc_se->bucket_id];
> @@ -848,7 +878,8 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
>                  * there are anymore RUNNABLE tasks refcounting it.
>                  */
>                 bucket->value = uclamp_bucket_base_value(bucket->value);
> -               WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id));
> +               bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> +               WRITE_ONCE(uc_rq->value, bkt_clamp);
>         }
>  }
>
> @@ -861,6 +892,10 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>         for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
>                 uclamp_rq_inc_id(p, rq, clamp_id);
> +
> +       /* Reset clamp idle holding when there is one RUNNABLE task */
> +       if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> +               rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
>  }
>
>  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> @@ -879,8 +914,10 @@ static void __init init_uclamp(void)
>         unsigned int clamp_id;
>         int cpu;
>
> -       for_each_possible_cpu(cpu)
> +       for_each_possible_cpu(cpu) {
>                 memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
> +               cpu_rq(cpu)->uclamp_flags = 0;
> +       }
>
>         for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
>                 struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c3d1ae1e7eec..d8b182f1782c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -880,6 +880,8 @@ struct rq {
>  #ifdef CONFIG_UCLAMP_TASK
>         /* Utilization clamp values based on CPU's RUNNABLE tasks */
>         struct uclamp_rq        uclamp[UCLAMP_CNT] ____cacheline_aligned;
> +       unsigned int            uclamp_flags;
> +#define UCLAMP_FLAG_IDLE 0x01
>  #endif
>
>         struct cfs_rq           cfs;
> --
> 2.20.1
>

^ permalink raw reply

* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: Joseph Myers @ 2019-04-17 21:22 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, hpa, adhemerval.zanella, linux-api, linuxppc-dev
In-Reply-To: <875zrnlakd.fsf@oldenburg2.str.redhat.com>

On Tue, 9 Apr 2019, Florian Weimer wrote:

> diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
> index 52ac6ad484..4cb5e4f0d2 100644
> --- a/sysdeps/unix/sysv/linux/Makefile
> +++ b/sysdeps/unix/sysv/linux/Makefile
> @@ -156,6 +156,7 @@ endif
>  
>  ifeq ($(subdir),termios)
>  sysdep_headers += termio.h
> +tests += tst-termios2
>  endif

This is missing the addition of bits/termios2-struct.h to the installed 
headers.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply

* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: H. Peter Anvin @ 2019-04-17 22:04 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer; +Cc: libc-alpha, linux-api, linuxppc-dev
In-Reply-To: <bfe6744f-7326-0a06-e43c-9f88abce7032@linaro.org>

On 4/15/19 10:22 AM, Adhemerval Zanella wrote:
>>
>> New interfaces are only necessary for the handful of architectures that don't have the speed fields *and* to space to put them in. 
> 
> Based on your WIP, it seems that both sparc and mips could be adapted.
> Do we still have glibc supported architecture that would require compat
> symbols?
> 
>>
>> Using symbol versioning doesn't really help much since the real problem is that struct termios can be passed around in userspace, and the interfaces between user space libraries don't have any versioning. However, my POC code deals with that too by only seeing BOTHER when necessary, so if the structure is extended garbage in the extra fields will be ignored unless new baud rates are in use.
> 
> Yeah, we discussed this earlier and if recall correctly it was not settled
> that all architectures would allow the use to extra space for the new
> fields. It seems the case, which makes the adaptation for termios2 even easier.
> 
> The question I have for kernel side is whether termios2 is fully compatible
> with termios, meaning that if there is conner cases we need to handle in
> userland.
> 

It depends on what you mean with "fully compatible."

Functionality-wise, the termios2 interfaces are a strict superset. There
is not, however, any guarantee that struct kernel_termios2 *contains* a
contiguous binary equivalent to struct kernel_termios (in fact, on most
architectures, it doesn't.)

>>
>> My POC code deals with Alpha as well by falling back to the old interfaces if necessary and possible, otherwise return error.
>>
>> Exporting termios2 to user space feels a bit odd at this stage as it would only be usable as a fallback on old glibc. Call it kernel_termios2 at least. ioctls using struct termios *must* be changed to kernel_termios anyway!
>>
> 
> I still prefer to avoid export it to userland and make it usable through
> default termios, as your wip does.  My understanding is new interfaces 
> should be semantic equal to current one with the only deviation that
> non-standard baudrates will handled as its values.  The only issue I 
> can foresee is if POSIX starts to export new bauds value.
> 

... which will be easy to handle: just define a Bxxxx constant with the
value equal to the baud rate.

	-hhpa

^ permalink raw reply

* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: H. Peter Anvin @ 2019-04-17 22:06 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Adhemerval Zanella, libc-alpha, linux-api, linuxppc-dev
In-Reply-To: <87lg0az2xk.fsf@oldenburg2.str.redhat.com>

On 4/16/19 2:59 AM, Florian Weimer wrote:
> * hpa:
> 
>> Using symbol versioning doesn't really help much since the real
>> problem is that struct termios can be passed around in userspace, and
>> the interfaces between user space libraries don't have any
>> versioning. However, my POC code deals with that too by only seeing
>> BOTHER when necessary, so if the structure is extended garbage in the
>> extra fields will be ignored unless new baud rates are in use.
> 
> That still doesn't solve the problem of changing struct offsets after a
> struct field of type struct termios.
> 
>> Exporting termios2 to user space feels a bit odd at this stage as it
>> would only be usable as a fallback on old glibc. Call it
>> kernel_termios2 at least.
> 
> I'm not sure why we should do that?  The kernel calls it struct termios2
> in its UAPI headers.  If that name is not appropriate, it should be
> changed first in the UAPI headers.
> 

It should; I believe the namespace we ought to use is struct
__kernel_termios[2].

"struct termios" defined in these headers is obviously completely wrong
for userspace.

	-hpa

^ permalink raw reply

* Re: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
From: Suren Baghdasaryan @ 2019-04-17 22:26 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <20190402104153.25404-7-patrick.bellasi@arm.com>

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> The SCHED_DEADLINE scheduling class provides an advanced and formal
> model to define tasks requirements that can translate into proper
> decisions for both task placements and frequencies selections. Other
> classes have a more simplified model based on the POSIX concept of
> priorities.
>
> Such a simple priority based model however does not allow to exploit
> most advanced features of the Linux scheduler like, for example, driving
> frequencies selection via the schedutil cpufreq governor. However, also
> for non SCHED_DEADLINE tasks, it's still interesting to define tasks
> properties to support scheduler decisions.
>
> Utilization clamping exposes to user-space a new set of per-task
> attributes the scheduler can use as hints about the expected/required
> utilization for a task. This allows to implement a "proactive" per-task
> frequency control policy, a more advanced policy than the current one
> based just on "passive" measured task utilization. For example, it's
> possible to boost interactive tasks (e.g. to get better performance) or
> cap background tasks (e.g. to be more energy/thermal efficient).
>
> Introduce a new API to set utilization clamping values for a specified
> task by extending sched_setattr(), a syscall which already allows to
> define task specific properties for different scheduling classes. A new
> pair of attributes allows to specify a minimum and maximum utilization
> the scheduler can consider for a task.
>
> Do that by validating the required clamp values before and then applying
> the required changes using _the_ same pattern already in use for
> __setscheduler(). This ensures that the task is re-enqueued with the new
> clamp values.
>
> Do not allow to change sched class specific params and non class
> specific params (i.e. clamp values) at the same time.  This keeps things
> simple and still works for the most common cases since we are usually
> interested in just one of the two actions.

Sorry, I can't find where you are checking to eliminate the
possibility of simultaneous changes to both sched class specific
params and non class specific params... Am I too tired or they are
indeed missing?

>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
>
> ---
> Changes in v8:
>  Others:
>  - using p->uclamp_req to track clamp values "requested" from userspace
> ---
>  include/linux/sched.h            |  9 ++++
>  include/uapi/linux/sched.h       | 12 ++++-
>  include/uapi/linux/sched/types.h | 66 ++++++++++++++++++++----
>  kernel/sched/core.c              | 87 +++++++++++++++++++++++++++++++-
>  4 files changed, 162 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d8491954e2e1..c2b81a84985b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -585,6 +585,7 @@ struct sched_dl_entity {
>   * @value:             clamp value "assigned" to a se
>   * @bucket_id:         bucket index corresponding to the "assigned" value
>   * @active:            the se is currently refcounted in a rq's bucket
> + * @user_defined:      the requested clamp value comes from user-space
>   *
>   * The bucket_id is the index of the clamp bucket matching the clamp value
>   * which is pre-computed and stored to avoid expensive integer divisions from
> @@ -594,11 +595,19 @@ struct sched_dl_entity {
>   * which can be different from the clamp value "requested" from user-space.
>   * This allows to know a task is refcounted in the rq's bucket corresponding
>   * to the "effective" bucket_id.
> + *
> + * The user_defined bit is set whenever a task has got a task-specific clamp
> + * value requested from userspace, i.e. the system defaults apply to this task
> + * just as a restriction. This allows to relax default clamps when a less
> + * restrictive task-specific value has been requested, thus allowing to
> + * implement a "nice" semantic. For example, a task running with a 20%
> + * default boost can still drop its own boosting to 0%.
>   */
>  struct uclamp_se {
>         unsigned int value              : bits_per(SCHED_CAPACITY_SCALE);
>         unsigned int bucket_id          : bits_per(UCLAMP_BUCKETS);
>         unsigned int active             : 1;
> +       unsigned int user_defined       : 1;
>  };
>  #endif /* CONFIG_UCLAMP_TASK */
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 075c610adf45..d2c65617a4a4 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -53,10 +53,20 @@
>  #define SCHED_FLAG_RECLAIM             0x02
>  #define SCHED_FLAG_DL_OVERRUN          0x04
>  #define SCHED_FLAG_KEEP_POLICY         0x08
> +#define SCHED_FLAG_KEEP_PARAMS         0x10
> +#define SCHED_FLAG_UTIL_CLAMP_MIN      0x20
> +#define SCHED_FLAG_UTIL_CLAMP_MAX      0x40
> +
> +#define SCHED_FLAG_KEEP_ALL    (SCHED_FLAG_KEEP_POLICY | \
> +                                SCHED_FLAG_KEEP_PARAMS)
> +
> +#define SCHED_FLAG_UTIL_CLAMP  (SCHED_FLAG_UTIL_CLAMP_MIN | \
> +                                SCHED_FLAG_UTIL_CLAMP_MAX)
>
>  #define SCHED_FLAG_ALL (SCHED_FLAG_RESET_ON_FORK       | \
>                          SCHED_FLAG_RECLAIM             | \
>                          SCHED_FLAG_DL_OVERRUN          | \
> -                        SCHED_FLAG_KEEP_POLICY)
> +                        SCHED_FLAG_KEEP_ALL            | \
> +                        SCHED_FLAG_UTIL_CLAMP)
>
>  #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
> index 10fbb8031930..c852153ddb0d 100644
> --- a/include/uapi/linux/sched/types.h
> +++ b/include/uapi/linux/sched/types.h
> @@ -9,6 +9,7 @@ struct sched_param {
>  };
>
>  #define SCHED_ATTR_SIZE_VER0   48      /* sizeof first published struct */
> +#define SCHED_ATTR_SIZE_VER1   56      /* add: util_{min,max} */
>
>  /*
>   * Extended scheduling parameters data structure.
> @@ -21,8 +22,33 @@ struct sched_param {
>   * the tasks may be useful for a wide variety of application fields, e.g.,
>   * multimedia, streaming, automation and control, and many others.
>   *
> - * This variant (sched_attr) is meant at describing a so-called
> - * sporadic time-constrained task. In such model a task is specified by:
> + * This variant (sched_attr) allows to define additional attributes to
> + * improve the scheduler knowledge about task requirements.
> + *
> + * Scheduling Class Attributes
> + * ===========================
> + *
> + * A subset of sched_attr attributes specifies the
> + * scheduling policy and relative POSIX attributes:
> + *
> + *  @size              size of the structure, for fwd/bwd compat.
> + *
> + *  @sched_policy      task's scheduling policy
> + *  @sched_nice                task's nice value      (SCHED_NORMAL/BATCH)
> + *  @sched_priority    task's static priority (SCHED_FIFO/RR)
> + *
> + * Certain more advanced scheduling features can be controlled by a
> + * predefined set of flags via the attribute:
> + *
> + *  @sched_flags       for customizing the scheduler behaviour
> + *
> + * Sporadic Time-Constrained Task Attributes
> + * =========================================
> + *
> + * A subset of sched_attr attributes allows to describe a so-called
> + * sporadic time-constrained task.
> + *
> + * In such a model a task is specified by:
>   *  - the activation period or minimum instance inter-arrival time;
>   *  - the maximum (or average, depending on the actual scheduling
>   *    discipline) computation time of all instances, a.k.a. runtime;
> @@ -34,14 +60,8 @@ struct sched_param {
>   * than the runtime and must be completed by time instant t equal to
>   * the instance activation time + the deadline.
>   *
> - * This is reflected by the actual fields of the sched_attr structure:
> + * This is reflected by the following fields of the sched_attr structure:
>   *
> - *  @size              size of the structure, for fwd/bwd compat.
> - *
> - *  @sched_policy      task's scheduling policy
> - *  @sched_flags       for customizing the scheduler behaviour
> - *  @sched_nice                task's nice value      (SCHED_NORMAL/BATCH)
> - *  @sched_priority    task's static priority (SCHED_FIFO/RR)
>   *  @sched_deadline    representative of the task's deadline
>   *  @sched_runtime     representative of the task's runtime
>   *  @sched_period      representative of the task's period
> @@ -53,6 +73,29 @@ struct sched_param {
>   * As of now, the SCHED_DEADLINE policy (sched_dl scheduling class) is the
>   * only user of this new interface. More information about the algorithm
>   * available in the scheduling class file or in Documentation/.
> + *
> + * Task Utilization Attributes
> + * ===========================
> + *
> + * A subset of sched_attr attributes allows to specify the utilization
> + * expected for a task. These attributes allow to inform the scheduler about
> + * the utilization boundaries within which it should schedule the task. These
> + * boundaries are valuable hints to support scheduler decisions on both task
> + * placement and frequency selection.
> + *
> + *  @sched_util_min    represents the minimum utilization
> + *  @sched_util_max    represents the maximum utilization
> + *
> + * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE]. It
> + * represents the percentage of CPU time used by a task when running at the
> + * maximum frequency on the highest capacity CPU of the system. For example, a
> + * 20% utilization task is a task running for 2ms every 10ms at maximum
> + * frequency.
> + *
> + * A task with a min utilization value bigger than 0 is more likely scheduled
> + * on a CPU with a capacity big enough to fit the specified value.
> + * A task with a max utilization value smaller than 1024 is more likely
> + * scheduled on a CPU with no more capacity than the specified value.
>   */
>  struct sched_attr {
>         __u32 size;
> @@ -70,6 +113,11 @@ struct sched_attr {
>         __u64 sched_runtime;
>         __u64 sched_deadline;
>         __u64 sched_period;
> +
> +       /* Utilization hints */
> +       __u32 sched_util_min;
> +       __u32 sched_util_max;
> +
>  };
>
>  #endif /* _UAPI_LINUX_SCHED_TYPES_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 20efb32e1a7e..68aed32e8ec7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1020,6 +1020,50 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
>         return result;
>  }
>
> +static int uclamp_validate(struct task_struct *p,
> +                          const struct sched_attr *attr)
> +{
> +       unsigned int lower_bound = p->uclamp_req[UCLAMP_MIN].value;
> +       unsigned int upper_bound = p->uclamp_req[UCLAMP_MAX].value;
> +
> +       if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> +               lower_bound = attr->sched_util_min;
> +       if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> +               upper_bound = attr->sched_util_max;
> +
> +       if (lower_bound > upper_bound)
> +               return -EINVAL;
> +       if (upper_bound > SCHED_CAPACITY_SCALE)
> +               return -EINVAL;
> +
> +       return 0;
> +}
> +
> +static void __setscheduler_uclamp(struct task_struct *p,
> +                                 const struct sched_attr *attr)
> +{
> +       if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> +               return;
> +
> +       if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> +               unsigned int lower_bound = attr->sched_util_min;
> +
> +               p->uclamp_req[UCLAMP_MIN].user_defined = true;
> +               p->uclamp_req[UCLAMP_MIN].value = lower_bound;
> +               p->uclamp_req[UCLAMP_MIN].bucket_id =
> +                       uclamp_bucket_id(lower_bound);
> +       }
> +
> +       if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> +               unsigned int upper_bound =  attr->sched_util_max;
> +
> +               p->uclamp_req[UCLAMP_MAX].user_defined = true;
> +               p->uclamp_req[UCLAMP_MAX].value = upper_bound;
> +               p->uclamp_req[UCLAMP_MAX].bucket_id =
> +                       uclamp_bucket_id(upper_bound);
> +       }
> +}
> +
>  static void uclamp_fork(struct task_struct *p)
>  {
>         unsigned int clamp_id;
> @@ -1056,6 +1100,13 @@ static void __init init_uclamp(void)
>  #else /* CONFIG_UCLAMP_TASK */
>  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
>  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> +static inline int uclamp_validate(struct task_struct *p,
> +                                 const struct sched_attr *attr)
> +{
> +       return -ENODEV;

ENOSYS might be more appropriate?

> +}
> +static void __setscheduler_uclamp(struct task_struct *p,
> +                                 const struct sched_attr *attr) { }
>  static inline void uclamp_fork(struct task_struct *p) { }
>  static inline void init_uclamp(void) { }
>  #endif /* CONFIG_UCLAMP_TASK */
> @@ -4424,6 +4475,13 @@ static void __setscheduler_params(struct task_struct *p,
>  static void __setscheduler(struct rq *rq, struct task_struct *p,
>                            const struct sched_attr *attr, bool keep_boost)
>  {
> +       /*
> +        * If params can't change scheduling class changes aren't allowed
> +        * either.
> +        */
> +       if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
> +               return;
> +
>         __setscheduler_params(p, attr);
>
>         /*
> @@ -4561,6 +4619,13 @@ static int __sched_setscheduler(struct task_struct *p,
>                         return retval;
>         }
>
> +       /* Update task specific "requested" clamps */
> +       if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) {
> +               retval = uclamp_validate(p, attr);
> +               if (retval)
> +                       return retval;
> +       }
> +
>         /*
>          * Make sure no PI-waiters arrive (or leave) while we are
>          * changing the priority of the task:
> @@ -4590,6 +4655,8 @@ static int __sched_setscheduler(struct task_struct *p,
>                         goto change;
>                 if (dl_policy(policy) && dl_param_changed(p, attr))
>                         goto change;
> +               if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> +                       goto change;
>
>                 p->sched_reset_on_fork = reset_on_fork;
>                 task_rq_unlock(rq, p, &rf);
> @@ -4670,7 +4737,9 @@ static int __sched_setscheduler(struct task_struct *p,
>                 put_prev_task(rq, p);
>
>         prev_class = p->sched_class;
> +
>         __setscheduler(rq, p, attr, pi);
> +       __setscheduler_uclamp(p, attr);
>
>         if (queued) {
>                 /*
> @@ -4846,6 +4915,10 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
>         if (ret)
>                 return -EFAULT;
>
> +       if ((attr->sched_flags & SCHED_FLAG_UTIL_CLAMP) &&
> +           size < SCHED_ATTR_SIZE_VER1)
> +               return -EINVAL;
> +
>         /*
>          * XXX: Do we want to be lenient like existing syscalls; or do we want
>          * to be strict and return an error on out-of-bounds values?
> @@ -4922,10 +4995,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
>         rcu_read_lock();
>         retval = -ESRCH;
>         p = find_process_by_pid(pid);
> -       if (p != NULL)
> -               retval = sched_setattr(p, &attr);
> +       if (likely(p))
> +               get_task_struct(p);
>         rcu_read_unlock();
>
> +       if (likely(p)) {
> +               retval = sched_setattr(p, &attr);
> +               put_task_struct(p);
> +       }
> +
>         return retval;
>  }
>
> @@ -5076,6 +5154,11 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
>         else
>                 attr.sched_nice = task_nice(p);
>
> +#ifdef CONFIG_UCLAMP_TASK
> +       attr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
> +       attr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
> +#endif
> +
>         rcu_read_unlock();
>
>         retval = sched_read_attr(uattr, &attr, size);
> --
> 2.20.1
>

^ permalink raw reply

* Re: [PATCH v8 08/16] sched/core: uclamp: Set default clamps for RT tasks
From: Suren Baghdasaryan @ 2019-04-17 23:07 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <20190402104153.25404-9-patrick.bellasi@arm.com>

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> By default FAIR tasks start without clamps, i.e. neither boosted nor
> capped, and they run at the best frequency matching their utilization
> demand.  This default behavior does not fit RT tasks which instead are
> expected to run at the maximum available frequency, if not otherwise
> required by explicitly capping them.
>
> Enforce the correct behavior for RT tasks by setting util_min to max
> whenever:
>
>  1. the task is switched to the RT class and it does not already have a
>     user-defined clamp value assigned.
>
>  2. an RT task is forked from a parent with RESET_ON_FORK set.
>
> NOTE: utilization clamp values are cross scheduling class attributes and
> thus they are never changed/reset once a value has been explicitly
> defined from user-space.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/core.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bdebdabe9bc4..71c9dd6487b1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1042,6 +1042,28 @@ static int uclamp_validate(struct task_struct *p,
>  static void __setscheduler_uclamp(struct task_struct *p,
>                                   const struct sched_attr *attr)
>  {
> +       unsigned int clamp_id;
> +
> +       /*
> +        * On scheduling class change, reset to default clamps for tasks
> +        * without a task-specific value.
> +        */
> +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> +               struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
> +               unsigned int clamp_value = uclamp_none(clamp_id);
> +
> +               /* Keep using defined clamps across class changes */
> +               if (uc_se->user_defined)
> +                       continue;
> +
> +               /* By default, RT tasks always get 100% boost */
> +               if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> +                       clamp_value = uclamp_none(UCLAMP_MAX);
> +
> +               uc_se->bucket_id = uclamp_bucket_id(clamp_value);
> +               uc_se->value = clamp_value;

Is it possible for p->uclamp_req[UCLAMP_MAX].value to be less than
uclamp_none(UCLAMP_MAX) for this RT task? If that's a possibility then
I think we will end up with a case of p->uclamp_req[UCLAMP_MIN].value
> p->uclamp_req[UCLAMP_MAX].value after these assignments are done.

> +       }
> +
>         if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>                 return;
>
> @@ -1077,6 +1099,10 @@ static void uclamp_fork(struct task_struct *p)
>         for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
>                 unsigned int clamp_value = uclamp_none(clamp_id);
>
> +               /* By default, RT tasks always get 100% boost */
> +               if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> +                       clamp_value = uclamp_none(UCLAMP_MAX);
> +
>                 p->uclamp_req[clamp_id].user_defined = false;
>                 p->uclamp_req[clamp_id].value = clamp_value;
>                 p->uclamp_req[clamp_id].bucket_id = uclamp_bucket_id(clamp_value);
> --
> 2.20.1
>

^ permalink raw reply

* Re: [PATCH v8 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Suren Baghdasaryan @ 2019-04-18  0:12 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <20190402104153.25404-13-patrick.bellasi@arm.com>

On Tue, Apr 2, 2019 at 3:43 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> The cgroup CPU bandwidth controller allows to assign a specified
> (maximum) bandwidth to the tasks of a group. However this bandwidth is
> defined and enforced only on a temporal base, without considering the
> actual frequency a CPU is running on. Thus, the amount of computation
> completed by a task within an allocated bandwidth can be very different
> depending on the actual frequency the CPU is running that task.
> The amount of computation can be affected also by the specific CPU a
> task is running on, especially when running on asymmetric capacity
> systems like Arm's big.LITTLE.
>
> With the availability of schedutil, the scheduler is now able
> to drive frequency selections based on actual task utilization.
> Moreover, the utilization clamping support provides a mechanism to
> bias the frequency selection operated by schedutil depending on
> constraints assigned to the tasks currently RUNNABLE on a CPU.
>
> Giving the mechanisms described above, it is now possible to extend the
> cpu controller to specify the minimum (or maximum) utilization which
> should be considered for tasks RUNNABLE on a cpu.
> This makes it possible to better defined the actual computational
> power assigned to task groups, thus improving the cgroup CPU bandwidth
> controller which is currently based just on time constraints.
>
> Extend the CPU controller with a couple of new attributes util.{min,max}
> which allows to enforce utilization boosting and capping for all the
> tasks in a group. Specifically:
>
> - util.min: defines the minimum utilization which should be considered
>             i.e. the RUNNABLE tasks of this group will run at least at a
>                  minimum frequency which corresponds to the util.min
>                  utilization
>
> - util.max: defines the maximum utilization which should be considered
>             i.e. the RUNNABLE tasks of this group will run up to a
>                  maximum frequency which corresponds to the util.max
>                  utilization
>
> These attributes:
>
> a) are available only for non-root nodes, both on default and legacy
>    hierarchies, while system wide clamps are defined by a generic
>    interface which does not depends on cgroups. This system wide
>    interface enforces constraints on tasks in the root node.
>
> b) enforce effective constraints at each level of the hierarchy which
>    are a restriction of the group requests considering its parent's
>    effective constraints. Root group effective constraints are defined
>    by the system wide interface.
>    This mechanism allows each (non-root) level of the hierarchy to:
>    - request whatever clamp values it would like to get
>    - effectively get only up to the maximum amount allowed by its parent
>
> c) have higher priority than task-specific clamps, defined via
>    sched_setattr(), thus allowing to control and restrict task requests
>
> Add two new attributes to the cpu controller to collect "requested"
> clamp values. Allow that at each non-root level of the hierarchy.
> Validate local consistency by enforcing util.min < util.max.
> Keep it simple by do not caring now about "effective" values computation
> and propagation along the hierarchy.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Tejun Heo <tj@kernel.org>
>
> --
> Changes in v8:
>  Message-ID: <20190214154817.GN50184@devbig004.ftw2.facebook.com>
>  - update changelog description for points b), c) and following paragraph
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  27 +++++
>  init/Kconfig                            |  22 ++++
>  kernel/sched/core.c                     | 142 +++++++++++++++++++++++-
>  kernel/sched/sched.h                    |   6 +
>  4 files changed, 196 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 7bf3f129c68b..47710a77f4fa 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth limit models for
>  normal scheduling policy and absolute bandwidth allocation model for
>  realtime scheduling policy.
>
> +Cycles distribution is based, by default, on a temporal base and it
> +does not account for the frequency at which tasks are executed.
> +The (optional) utilization clamping support allows to enforce a minimum
> +bandwidth, which should always be provided by a CPU, and a maximum bandwidth,
> +which should never be exceeded by a CPU.
> +
>  WARNING: cgroup2 doesn't yet support control of realtime processes and
>  the cpu controller can only be enabled when all RT processes are in
>  the root cgroup.  Be aware that system management software may already
> @@ -974,6 +980,27 @@ All time durations are in microseconds.
>         Shows pressure stall information for CPU. See
>         Documentation/accounting/psi.txt for details.
>
> +  cpu.util.min
> +        A read-write single value file which exists on non-root cgroups.
> +        The default is "0", i.e. no utilization boosting.
> +
> +        The requested minimum utilization in the range [0, 1024].
> +
> +        This interface allows reading and setting minimum utilization clamp
> +        values similar to the sched_setattr(2). This minimum utilization
> +        value is used to clamp the task specific minimum utilization clamp.
> +
> +  cpu.util.max
> +        A read-write single value file which exists on non-root cgroups.
> +        The default is "1024". i.e. no utilization capping
> +
> +        The requested maximum utilization in the range [0, 1024].
> +
> +        This interface allows reading and setting maximum utilization clamp
> +        values similar to the sched_setattr(2). This maximum utilization
> +        value is used to clamp the task specific maximum utilization clamp.
> +
> +
>
>  Memory
>  ------
> diff --git a/init/Kconfig b/init/Kconfig
> index 7439cbf4d02e..33006e8de996 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -877,6 +877,28 @@ config RT_GROUP_SCHED
>
>  endif #CGROUP_SCHED
>
> +config UCLAMP_TASK_GROUP
> +       bool "Utilization clamping per group of tasks"
> +       depends on CGROUP_SCHED
> +       depends on UCLAMP_TASK
> +       default n
> +       help
> +         This feature enables the scheduler to track the clamped utilization
> +         of each CPU based on RUNNABLE tasks currently scheduled on that CPU.
> +
> +         When this option is enabled, the user can specify a min and max
> +         CPU bandwidth which is allowed for each single task in a group.
> +         The max bandwidth allows to clamp the maximum frequency a task
> +         can use, while the min bandwidth allows to define a minimum
> +         frequency a task will always use.
> +
> +         When task group based utilization clamping is enabled, an eventually
> +         specified task-specific clamp value is constrained by the cgroup
> +         specified clamp value. Both minimum and maximum task clamping cannot
> +         be bigger than the corresponding clamping defined at task group level.
> +
> +         If in doubt, say N.
> +
>  config CGROUP_PIDS
>         bool "PIDs controller"
>         help
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 71c9dd6487b1..aeed2dd315cc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1130,8 +1130,12 @@ static void __init init_uclamp(void)
>         /* System defaults allow max clamp values for both indexes */
>         uc_max.value = uclamp_none(UCLAMP_MAX);
>         uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
> -       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
>                 uclamp_default[clamp_id] = uc_max;
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +               root_task_group.uclamp_req[clamp_id] = uc_max;
> +#endif
> +       }
>  }
>
>  #else /* CONFIG_UCLAMP_TASK */
> @@ -6720,6 +6724,19 @@ void ia64_set_curr_task(int cpu, struct task_struct *p)
>  /* task_group_lock serializes the addition/removal of task groups */
>  static DEFINE_SPINLOCK(task_group_lock);
>
> +static inline int alloc_uclamp_sched_group(struct task_group *tg,
> +                                          struct task_group *parent)
> +{
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +       int clamp_id;
> +
> +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +               tg->uclamp_req[clamp_id] = parent->uclamp_req[clamp_id];
> +#endif
> +
> +       return 1;

Looks like you never return anything else neither here nor in the
following patches I think...

> +}
> +
>  static void sched_free_group(struct task_group *tg)
>  {
>         free_fair_sched_group(tg);
> @@ -6743,6 +6760,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>         if (!alloc_rt_sched_group(tg, parent))
>                 goto err;
>
> +       if (!alloc_uclamp_sched_group(tg, parent))
> +               goto err;
> +
>         return tg;
>
>  err:
> @@ -6963,6 +6983,100 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
>                 sched_move_task(task);
>  }
>
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
> +                                 struct cftype *cftype, u64 min_value)
> +{
> +       struct task_group *tg;
> +       int ret = 0;
> +
> +       if (min_value > SCHED_CAPACITY_SCALE)
> +               return -ERANGE;
> +
> +       rcu_read_lock();
> +
> +       tg = css_tg(css);
> +       if (tg == &root_task_group) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +       if (tg->uclamp_req[UCLAMP_MIN].value == min_value)
> +               goto out;
> +       if (tg->uclamp_req[UCLAMP_MAX].value < min_value) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       /* Update tg's "requested" clamp value */
> +       tg->uclamp_req[UCLAMP_MIN].value = min_value;
> +       tg->uclamp_req[UCLAMP_MIN].bucket_id = uclamp_bucket_id(min_value);
> +
> +out:
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
> +                                 struct cftype *cftype, u64 max_value)
> +{
> +       struct task_group *tg;
> +       int ret = 0;
> +
> +       if (max_value > SCHED_CAPACITY_SCALE)
> +               return -ERANGE;
> +
> +       rcu_read_lock();
> +
> +       tg = css_tg(css);
> +       if (tg == &root_task_group) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +       if (tg->uclamp_req[UCLAMP_MAX].value == max_value)
> +               goto out;
> +       if (tg->uclamp_req[UCLAMP_MIN].value > max_value) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       /* Update tg's "requested" clamp value */
> +       tg->uclamp_req[UCLAMP_MAX].value = max_value;
> +       tg->uclamp_req[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
> +
> +out:
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
> +static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
> +                                 enum uclamp_id clamp_id)
> +{
> +       struct task_group *tg;
> +       u64 util_clamp;
> +
> +       rcu_read_lock();
> +       tg = css_tg(css);
> +       util_clamp = tg->uclamp_req[clamp_id].value;
> +       rcu_read_unlock();
> +
> +       return util_clamp;
> +}
> +
> +static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
> +                                struct cftype *cft)
> +{
> +       return cpu_uclamp_read(css, UCLAMP_MIN);
> +}
> +
> +static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
> +                                struct cftype *cft)
> +{
> +       return cpu_uclamp_read(css, UCLAMP_MAX);
> +}
> +#endif /* CONFIG_UCLAMP_TASK_GROUP */
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
>                                 struct cftype *cftype, u64 shareval)
> @@ -7300,6 +7414,18 @@ static struct cftype cpu_legacy_files[] = {
>                 .read_u64 = cpu_rt_period_read_uint,
>                 .write_u64 = cpu_rt_period_write_uint,
>         },
> +#endif
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +       {
> +               .name = "util.min",
> +               .read_u64 = cpu_util_min_read_u64,
> +               .write_u64 = cpu_util_min_write_u64,
> +       },
> +       {
> +               .name = "util.max",
> +               .read_u64 = cpu_util_max_read_u64,
> +               .write_u64 = cpu_util_max_write_u64,
> +       },
>  #endif
>         { }     /* Terminate */
>  };
> @@ -7467,6 +7593,20 @@ static struct cftype cpu_files[] = {
>                 .seq_show = cpu_max_show,
>                 .write = cpu_max_write,
>         },
> +#endif
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +       {
> +               .name = "util.min",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .read_u64 = cpu_util_min_read_u64,
> +               .write_u64 = cpu_util_min_write_u64,
> +       },
> +       {
> +               .name = "util.max",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .read_u64 = cpu_util_max_read_u64,
> +               .write_u64 = cpu_util_max_write_u64,
> +       },
>  #endif
>         { }     /* terminate */
>  };
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6ae3628248eb..b46b6912beba 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -399,6 +399,12 @@ struct task_group {
>  #endif
>
>         struct cfs_bandwidth    cfs_bandwidth;
> +
> +#ifdef CONFIG_UCLAMP_TASK_GROUP
> +       /* Clamp values requested for a task group */
> +       struct uclamp_se        uclamp_req[UCLAMP_CNT];
> +#endif
> +
>  };
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> --
> 2.20.1
>

^ permalink raw reply

* Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps
From: Suren Baghdasaryan @ 2019-04-18  0:51 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: LKML, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra, Tejun Heo,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle
In-Reply-To: <20190402104153.25404-5-patrick.bellasi@arm.com>

On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> Tasks without a user-defined clamp value are considered not clamped
> and by default their utilization can have any value in the
> [0..SCHED_CAPACITY_SCALE] range.
>
> Tasks with a user-defined clamp value are allowed to request any value
> in that range, and the required clamp is unconditionally enforced.
> However, a "System Management Software" could be interested in limiting
> the range of clamp values allowed for all tasks.
>
> Add a privileged interface to define a system default configuration via:
>
>   /proc/sys/kernel/sched_uclamp_util_{min,max}
>
> which works as an unconditional clamp range restriction for all tasks.
>
> With the default configuration, the full SCHED_CAPACITY_SCALE range of
> values is allowed for each clamp index. Otherwise, the task-specific
> clamp is capped by the corresponding system default value.
>
> Do that by tracking, for each task, the "effective" clamp value and
> bucket the task has been refcounted in at enqueue time. This
> allows to lazy aggregate "requested" and "system default" values at
> enqueue time and simplifies refcounting updates at dequeue time.
>
> The cached bucket ids are used to avoid (relatively) more expensive
> integer divisions every time a task is enqueued.
>
> An active flag is used to report when the "effective" value is valid and
> thus the task is actually refcounted in the corresponding rq's bucket.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
>
> ---
> Changes in v8:
>  Message-ID: <20190313201010.GU2482@worktop.programming.kicks-ass.net>
>  - add "requested" values uclamp_se instance beside the existing
>    "effective" values instance
>  - rename uclamp_effective_{get,assign}() into uclamp_eff_{get,set}()
>  - make uclamp_eff_get() return the new "effective" values by copy
> Message-ID: <20190318125844.ajhjpaqlcgxn7qkq@e110439-lin>
>  - run uclamp_fork() code independently from the class being supported.
>    Resetting active flag is not harmful and following patches will add
>    other code which still needs to be executed independently from class
>    support.
> Message-ID: <20190313201342.GV2482@worktop.programming.kicks-ass.net>
>  - add sysctl_sched_uclamp_handler()'s internal mutex to serialize
>    concurrent usages
> ---
>  include/linux/sched.h        |  10 +++
>  include/linux/sched/sysctl.h |  11 +++
>  kernel/sched/core.c          | 131 ++++++++++++++++++++++++++++++++++-
>  kernel/sysctl.c              |  16 +++++
>  4 files changed, 167 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0c0dd7aac8e9..d8491954e2e1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -584,14 +584,21 @@ struct sched_dl_entity {
>   * Utilization clamp for a scheduling entity
>   * @value:             clamp value "assigned" to a se
>   * @bucket_id:         bucket index corresponding to the "assigned" value
> + * @active:            the se is currently refcounted in a rq's bucket
>   *
>   * The bucket_id is the index of the clamp bucket matching the clamp value
>   * which is pre-computed and stored to avoid expensive integer divisions from
>   * the fast path.
> + *
> + * The active bit is set whenever a task has got an "effective" value assigned,
> + * which can be different from the clamp value "requested" from user-space.
> + * This allows to know a task is refcounted in the rq's bucket corresponding
> + * to the "effective" bucket_id.
>   */
>  struct uclamp_se {
>         unsigned int value              : bits_per(SCHED_CAPACITY_SCALE);
>         unsigned int bucket_id          : bits_per(UCLAMP_BUCKETS);
> +       unsigned int active             : 1;
>  };
>  #endif /* CONFIG_UCLAMP_TASK */
>
> @@ -676,6 +683,9 @@ struct task_struct {
>         struct sched_dl_entity          dl;
>
>  #ifdef CONFIG_UCLAMP_TASK
> +       /* Clamp values requested for a scheduling entity */
> +       struct uclamp_se                uclamp_req[UCLAMP_CNT];
> +       /* Effective clamp values used for a scheduling entity */
>         struct uclamp_se                uclamp[UCLAMP_CNT];
>  #endif
>
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 99ce6d728df7..d4f6215ee03f 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
>  extern unsigned int sysctl_sched_rt_period;
>  extern int sysctl_sched_rt_runtime;
>
> +#ifdef CONFIG_UCLAMP_TASK
> +extern unsigned int sysctl_sched_uclamp_util_min;
> +extern unsigned int sysctl_sched_uclamp_util_max;
> +#endif
> +
>  #ifdef CONFIG_CFS_BANDWIDTH
>  extern unsigned int sysctl_sched_cfs_bandwidth_slice;
>  #endif
> @@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int write,
>                 void __user *buffer, size_t *lenp,
>                 loff_t *ppos);
>
> +#ifdef CONFIG_UCLAMP_TASK
> +extern int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> +                                      void __user *buffer, size_t *lenp,
> +                                      loff_t *ppos);
> +#endif
> +
>  extern int sysctl_numa_balancing(struct ctl_table *table, int write,
>                                  void __user *buffer, size_t *lenp,
>                                  loff_t *ppos);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 046f61d33f00..d368ac26b8aa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -733,6 +733,14 @@ static void set_load_weight(struct task_struct *p, bool update_load)
>  }
>
>  #ifdef CONFIG_UCLAMP_TASK
> +/* Max allowed minimum utilization */
> +unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
> +
> +/* Max allowed maximum utilization */
> +unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
> +
> +/* All clamps are required to be less or equal than these values */
> +static struct uclamp_se uclamp_default[UCLAMP_CNT];
>
>  /* Integer rounded range for each bucket */
>  #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
> @@ -801,6 +809,52 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
>         return uclamp_idle_value(rq, clamp_id, clamp_value);
>  }
>
> +/*
> + * The effective clamp bucket index of a task depends on, by increasing
> + * priority:
> + * - the task specific clamp value, when explicitly requested from userspace
> + * - the system default clamp value, defined by the sysadmin
> + */
> +static inline struct uclamp_se
> +uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
> +{
> +       struct uclamp_se uc_req = p->uclamp_req[clamp_id];
> +       struct uclamp_se uc_max = uclamp_default[clamp_id];
> +
> +       /* System default restrictions always apply */
> +       if (unlikely(uc_req.value > uc_max.value))
> +               return uc_max;
> +
> +       return uc_req;
> +}
> +
> +static inline unsigned int
> +uclamp_eff_bucket_id(struct task_struct *p, unsigned int clamp_id)

This function is not used anywhere AFAIKT. uclamp_eff_bucket_id() and
uclamp_eff_value() look very similar, maybe they can be combined into
one function returning struct uclamp_se?

> +{
> +       struct uclamp_se uc_eff;
> +
> +       /* Task currently refcounted: use back-annotated (effective) bucket */
> +       if (p->uclamp[clamp_id].active)
> +               return p->uclamp[clamp_id].bucket_id;
> +
> +       uc_eff = uclamp_eff_get(p, clamp_id);
> +
> +       return uc_eff.bucket_id;
> +}
> +
> +unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
> +{
> +       struct uclamp_se uc_eff;
> +
> +       /* Task currently refcounted: use back-annotated (effective) value */
> +       if (p->uclamp[clamp_id].active)
> +               return p->uclamp[clamp_id].value;
> +
> +       uc_eff = uclamp_eff_get(p, clamp_id);
> +
> +       return uc_eff.value;
> +}
> +
>  /*
>   * When a task is enqueued on a rq, the clamp bucket currently defined by the
>   * task's uclamp::bucket_id is refcounted on that rq. This also immediately
> @@ -818,8 +872,12 @@ static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq,
>         struct uclamp_se *uc_se = &p->uclamp[clamp_id];
>         struct uclamp_bucket *bucket;
>
> +       /* Update task effective clamp */
> +       p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
> +
>         bucket = &uc_rq->bucket[uc_se->bucket_id];
>         bucket->tasks++;
> +       uc_se->active = true;
>
>         uclamp_idle_reset(rq, clamp_id, uc_se->value);
>
> @@ -856,6 +914,7 @@ static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
>         SCHED_WARN_ON(!bucket->tasks);
>         if (likely(bucket->tasks))
>                 bucket->tasks--;
> +       uc_se->active = false;
>
>         /*
>          * Keep "local max aggregation" simple and accept to (possibly)
> @@ -909,8 +968,69 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
>                 uclamp_rq_dec_id(p, rq, clamp_id);
>  }
>
> +int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
> +                               void __user *buffer, size_t *lenp,
> +                               loff_t *ppos)
> +{
> +       int old_min, old_max;
> +       static DEFINE_MUTEX(mutex);
> +       int result;
> +
> +       mutex_lock(&mutex);
> +       old_min = sysctl_sched_uclamp_util_min;
> +       old_max = sysctl_sched_uclamp_util_max;
> +
> +       result = proc_dointvec(table, write, buffer, lenp, ppos);
> +       if (result)
> +               goto undo;
> +       if (!write)
> +               goto done;
> +
> +       if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
> +           sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE) {
> +               result = -EINVAL;
> +               goto undo;
> +       }
> +
> +       if (old_min != sysctl_sched_uclamp_util_min) {
> +               uclamp_default[UCLAMP_MIN].value =
> +                       sysctl_sched_uclamp_util_min;
> +               uclamp_default[UCLAMP_MIN].bucket_id =
> +                       uclamp_bucket_id(sysctl_sched_uclamp_util_min);
> +       }
> +       if (old_max != sysctl_sched_uclamp_util_max) {
> +               uclamp_default[UCLAMP_MAX].value =
> +                       sysctl_sched_uclamp_util_max;
> +               uclamp_default[UCLAMP_MAX].bucket_id =
> +                       uclamp_bucket_id(sysctl_sched_uclamp_util_max);
> +       }
> +
> +       /*
> +        * Updating all the RUNNABLE task is expensive, keep it simple and do
> +        * just a lazy update at each next enqueue time.
> +        */
> +       goto done;
> +
> +undo:
> +       sysctl_sched_uclamp_util_min = old_min;
> +       sysctl_sched_uclamp_util_max = old_max;
> +done:
> +       mutex_unlock(&mutex);
> +
> +       return result;
> +}
> +
> +static void uclamp_fork(struct task_struct *p)
> +{
> +       unsigned int clamp_id;
> +
> +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +               p->uclamp[clamp_id].active = false;
> +}
> +
>  static void __init init_uclamp(void)
>  {
> +       struct uclamp_se uc_max = {};
>         unsigned int clamp_id;
>         int cpu;
>
> @@ -920,16 +1040,23 @@ static void __init init_uclamp(void)
>         }
>
>         for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> -               struct uclamp_se *uc_se = &init_task.uclamp[clamp_id];
> +               struct uclamp_se *uc_se = &init_task.uclamp_req[clamp_id];
>
>                 uc_se->value = uclamp_none(clamp_id);
>                 uc_se->bucket_id = uclamp_bucket_id(uc_se->value);
>         }
> +
> +       /* System defaults allow max clamp values for both indexes */
> +       uc_max.value = uclamp_none(UCLAMP_MAX);
> +       uc_max.bucket_id = uclamp_bucket_id(uc_max.value);
> +       for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id)
> +               uclamp_default[clamp_id] = uc_max;
>  }
>
>  #else /* CONFIG_UCLAMP_TASK */
>  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
>  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_fork(struct task_struct *p) { }
>  static inline void init_uclamp(void) { }
>  #endif /* CONFIG_UCLAMP_TASK */
>
> @@ -2530,6 +2657,8 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>          */
>         p->prio = current->normal_prio;
>
> +       uclamp_fork(p);
> +
>         /*
>          * Revert to default priority/policy on fork if requested.
>          */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 987ae08147bf..72277f09887d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -446,6 +446,22 @@ static struct ctl_table kern_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = sched_rr_handler,
>         },
> +#ifdef CONFIG_UCLAMP_TASK
> +       {
> +               .procname       = "sched_uclamp_util_min",
> +               .data           = &sysctl_sched_uclamp_util_min,
> +               .maxlen         = sizeof(unsigned int),
> +               .mode           = 0644,
> +               .proc_handler   = sysctl_sched_uclamp_handler,
> +       },
> +       {
> +               .procname       = "sched_uclamp_util_max",
> +               .data           = &sysctl_sched_uclamp_util_max,
> +               .maxlen         = sizeof(unsigned int),
> +               .mode           = 0644,
> +               .proc_handler   = sysctl_sched_uclamp_handler,
> +       },
> +#endif
>  #ifdef CONFIG_SCHED_AUTOGROUP
>         {
>                 .procname       = "sched_autogroup_enabled",
> --
> 2.20.1
>

^ permalink raw reply

* Re: [PATCH V32 01/27] Add the ability to lock down access to the running kernel image
From: Daniel Axtens @ 2019-04-18  6:38 UTC (permalink / raw)
  To: Andrew Donnellan, Matthew Garrett, jmorris
  Cc: linux-security-module, linux-kernel, dhowells, linux-api, luto,
	linuxppc-dev, Michael Ellerman, cmr
In-Reply-To: <059c523e-926c-24ee-0935-198031712145@au1.ibm.com>

Hi Andrew,

>> +		If CONFIG_LOCK_DOWN_KERNEL is enabled, the kernel can be
>> +		moved to a more locked down state at runtime by writing to
>> +		this attribute. Valid values are:
>> +
>> +		integrity:
>> +			The kernel will disable functionality that allows
>> +			userland to modify the running kernel image, other
>> +			than through the loading or execution of appropriately
>> +			signed objects.
>> +
>> +		confidentiality:
>> +			The kernel will disable all functionality disabled by
>> +			the integrity mode, but additionally will disable
>> +			features that potentially permit userland to obtain
>> +			confidential information stored within the kernel.
>
> [+ linuxppc, mpe, dja, cmr]
>
> I'm thinking about whether we should lock down the powerpc xmon debug 
> monitor - intuitively, I think the answer is yes if for no other reason 
> than Least Astonishment, when lockdown is enabled you probably don't 
> expect xmon to keep letting you access kernel memory.
>
> Semantically though, xmon is not a userspace process - it's in kernel 
> and reads debug commands/outputs debug data directly from/to the 
> console. Is that a threat vector that this series cares about?

I guess there are 2 ways you could think about lockdown:

 - It adds a security boundary between the kernel and UID 0, so that
   userland cannot compromise the integrity/confidentiality of the
   locked down kernel.

 - It is a bundle of related security boundaries so that the
   integrity/confidentiality of a running, locked down kernel cannot be
   compromised, even by a privileged, physically present user.

You're right that techincally xmon is in the kernel and on the console
rather than in userland, so it doesn't fall within the first concept of
lockdown. But I think usecases for lockdown tend to expect something
more like the second concept.

IOW, lockdown is a trapdoor - once you've locked down a kernel, you
can't get out of lockdown (except by rebooting). xmon would allow you to
get out of the trapdoor, so I think it should be restricted by lockdown.

Regards,
Daniel

>
>
> -- 
> Andrew Donnellan              OzLabs, ADL Canberra
> andrew.donnellan@au1.ibm.com  IBM Australia Limited

^ permalink raw reply

* Re: [PATCH v1 2/4] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-18  9:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190417142253.GH32622@redhat.com>

On Wed, Apr 17, 2019 at 04:22:54PM +0200, Oleg Nesterov wrote:
> On 04/16, Christian Brauner wrote:
> >
> > +	if (clone_flags & CLONE_PIDFD) {
> > +		retval = pidfd_create(pid, &pidfdf);
> > +		if (retval < 0)
> > +			goto bad_fork_free_pid;
> > +		pidfd = retval;
> > +	}
> 
> ...
> 
> > +	if (clone_flags & CLONE_PIDFD) {
> > +		fd_install(pidfd, pidfdf);
> > +		put_user(pidfd, parent_tidptr);
> 
> put_user() can fail, I don't think this error should be silently ignored,
> this can lead to the hard-to-trigger/debug problems.
> 
> Why can't we do put_user-with-check along with pidfd_create() above?

I've moved put_user() right were pidfd_create() is called but I think
then it makes sense to change pidfd_create() to also do the fd_install()
such that the following sequence creates the pidfd, installs it, and
calls put_user() and calls ksys_close() on error. Any objections Oleg?

+       if (clone_flags & CLONE_PIDFD) {
+               retval = pidfd_create(pid);
+               if (retval < 0)
+                       goto bad_fork_free_pid;
+
+               pidfd = retval;
+               retval = put_user(pidfd, parent_tidptr);
+               if (retval)
+                       goto bad_fork_put_pidfd;
+       }

Christian

^ permalink raw reply

* [PATCH v2 0/5]  clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-18 10:18 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, oleg, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	cyphar, joel, dancol, Christian Brauner

Hey,

/* v2 summary */
Move put_user() into copy process before clone's point of no return so
that we can handle put_user() errors as suggested by Oleg. The good news
is that this again allows us to make the patch smaller.

/* v1 summary */
As suggested by Oleg, have pidfds returned in the fourth argument of
clone allowing us to return a pidfd and its pid to the caller at the
same time. This has various advantages:
- callers get the associated pid for the pidfd without additional
  parsing 
  This makes it easier for userspce to get metadata access through
  procfs.
- the type of the return value for clone() remains unchanged
  (was changed to return an fd in the previous iteration)
- pid file descriptor numbering can start at 0 as is customary for
  file descriptors
  (was changed to start at 1 in the previous patchset to not break
   fork()-like error checking when returning pidfds)
- finally, the patchset has gotten smaller

The patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call as previously discussed.

As decided last week [1] Jann and I have refined the implementation of
pidfds as anonymous inodes. Based on last weeks RFC we have only tweaked
documentation and naming, as well as making the sample program how to
get easy metadata access from a pidfd a little cleaner and more paranoid
when checking for errors.
The sample program can also serve as a test for the patchset.

When clone is called with CLONE_PIDFD a pidfd will be returned in the
fourth argument of clone. This is based on an idea from Oleg. It allows
us to return a pidfd and the associated pid to the caller at the same
time.

We have taken care that pidfds are created *after* the fd table has
been unshared to not leak pidfds into child processes.

The actual code for CLONE_PIDFD in patch 2 is completely confined to
fork.c (apart from the CLONE_PIDFD definition of course) and is rather
small and hopefully good to review.

The additional changes listed under David's name in the diffstat below
are here to make anon_inodes available unconditionally. They are needed
for the new mount api and thus for core vfs code in addition to pidfds.
David knows this and he has informed Al that this patch is sent out
here. The changes themselves are rather automatic.

As promised I have also contacted Joel who has sent a patchset to make
pidfds pollable. He has been informed and is happy to port his patchset
once we have moved forward [2].
Jann and I currently plan to target this patchset for inclusion in the 5.2
merge window.

Thanks!
Jann & Christian

[1]: https://lore.kernel.org/lkml/CAHk-=wifyY+XGNW=ZC4MyTHD14w81F8JjQNH-GaGAm2RxZ_S8Q@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/20190411200059.GA75190@google.com/

Christian Brauner (4):
  clone: add CLONE_PIDFD
  signal: use fdget() since we don't allow O_PATH
  signal: support CLONE_PIDFD with pidfd_send_signal
  samples: show race-free pidfd metadata access

David Howells (1):
  Make anon_inodes unconditional

 arch/arm/kvm/Kconfig           |   1 -
 arch/arm64/kvm/Kconfig         |   1 -
 arch/mips/kvm/Kconfig          |   1 -
 arch/powerpc/kvm/Kconfig       |   1 -
 arch/s390/kvm/Kconfig          |   1 -
 arch/x86/Kconfig               |   1 -
 arch/x86/kvm/Kconfig           |   1 -
 drivers/base/Kconfig           |   1 -
 drivers/char/tpm/Kconfig       |   1 -
 drivers/dma-buf/Kconfig        |   1 -
 drivers/gpio/Kconfig           |   1 -
 drivers/iio/Kconfig            |   1 -
 drivers/infiniband/Kconfig     |   1 -
 drivers/vfio/Kconfig           |   1 -
 fs/Makefile                    |   2 +-
 fs/notify/fanotify/Kconfig     |   1 -
 fs/notify/inotify/Kconfig      |   1 -
 include/linux/pid.h            |   2 +
 include/uapi/linux/sched.h     |   1 +
 init/Kconfig                   |  10 ---
 kernel/fork.c                  |  96 ++++++++++++++++++++++++++--
 kernel/signal.c                |  14 +++--
 kernel/sys_ni.c                |   3 -
 samples/Makefile               |   2 +-
 samples/pidfd/Makefile         |   6 ++
 samples/pidfd/pidfd-metadata.c | 112 +++++++++++++++++++++++++++++++++
 26 files changed, 225 insertions(+), 39 deletions(-)
 create mode 100644 samples/pidfd/Makefile
 create mode 100644 samples/pidfd/pidfd-metadata.c

-- 
2.21.0

^ permalink raw reply

* [PATCH v2 1/5] Make anon_inodes unconditional
From: Christian Brauner @ 2019-04-18 10:18 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, oleg, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	cyphar, joel, dancol, Christian Brauner
In-Reply-To: <20190418101841.4476-1-christian@brauner.io>

From: David Howells <dhowells@redhat.com>

Make the anon_inodes facility unconditional so that it can be used by core
VFS code and pidfd code.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[christian@brauner.io: adapt commit message to mention pidfds]
Signed-off-by: Christian Brauner <christian@brauner.io>
---
/* changelog */
v1: patch unchanged
v2: patch unchanged
---
 arch/arm/kvm/Kconfig       |  1 -
 arch/arm64/kvm/Kconfig     |  1 -
 arch/mips/kvm/Kconfig      |  1 -
 arch/powerpc/kvm/Kconfig   |  1 -
 arch/s390/kvm/Kconfig      |  1 -
 arch/x86/Kconfig           |  1 -
 arch/x86/kvm/Kconfig       |  1 -
 drivers/base/Kconfig       |  1 -
 drivers/char/tpm/Kconfig   |  1 -
 drivers/dma-buf/Kconfig    |  1 -
 drivers/gpio/Kconfig       |  1 -
 drivers/iio/Kconfig        |  1 -
 drivers/infiniband/Kconfig |  1 -
 drivers/vfio/Kconfig       |  1 -
 fs/Makefile                |  2 +-
 fs/notify/fanotify/Kconfig |  1 -
 fs/notify/inotify/Kconfig  |  1 -
 init/Kconfig               | 10 ----------
 18 files changed, 1 insertion(+), 27 deletions(-)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 3f5320f46de2..f591026347a5 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -22,7 +22,6 @@ config KVM
 	bool "Kernel-based Virtual Machine (KVM) support"
 	depends on MMU && OF
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select ARM_GIC
 	select ARM_GIC_V3
 	select ARM_GIC_V3_ITS
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a3f85624313e..a67121d419a2 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -23,7 +23,6 @@ config KVM
 	depends on OF
 	select MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_ARCH_TLB_FLUSH_ALL
 	select KVM_MMIO
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index 4528bc9c3cb1..eac25aef21e0 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	depends on MIPS_FP_SUPPORT
 	select EXPORT_UASM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index bfdde04e4905..f53997a8ca62 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -20,7 +20,6 @@ if VIRTUALIZATION
 config KVM
 	bool
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select SRCU
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 767453faacfc..1816ee48eadd 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -21,7 +21,6 @@ config KVM
 	prompt "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
-	select ANON_INODES
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select HAVE_KVM_EVENTFD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ad92419be19..7a70fb58b2d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -44,7 +44,6 @@ config X86
 	#
 	select ACPI_LEGACY_TABLES_LOOKUP	if ACPI
 	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
-	select ANON_INODES
 	select ARCH_32BIT_OFF_T			if X86_32
 	select ARCH_CLOCKSOURCE_DATA
 	select ARCH_CLOCKSOURCE_INIT
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 72fa955f4a15..fc042419e670 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -27,7 +27,6 @@ config KVM
 	depends on X86_LOCAL_APIC
 	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
-	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQFD
 	select IRQ_BYPASS_MANAGER
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 059700ea3521..03f067da12ee 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -174,7 +174,6 @@ source "drivers/base/regmap/Kconfig"
 config DMA_SHARED_BUFFER
 	bool
 	default n
-	select ANON_INODES
 	select IRQ_WORK
 	help
 	  This option enables the framework for buffer-sharing between
diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 536e55d3919f..f3e4bc490cf0 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -157,7 +157,6 @@ config TCG_CRB
 config TCG_VTPM_PROXY
 	tristate "VTPM Proxy Interface"
 	depends on TCG_TPM
-	select ANON_INODES
 	---help---
 	  This driver proxies for an emulated TPM (vTPM) running in userspace.
 	  A device /dev/vtpmx is provided that creates a device pair
diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig
index 2e5a0faa2cb1..3fc9c2efc583 100644
--- a/drivers/dma-buf/Kconfig
+++ b/drivers/dma-buf/Kconfig
@@ -3,7 +3,6 @@ menu "DMABUF options"
 config SYNC_FILE
 	bool "Explicit Synchronization Framework"
 	default n
-	select ANON_INODES
 	select DMA_SHARED_BUFFER
 	---help---
 	  The Sync File Framework adds explicit syncronization via
diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..0f91600c27ae 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -12,7 +12,6 @@ config ARCH_HAVE_CUSTOM_GPIO_H
 
 menuconfig GPIOLIB
 	bool "GPIO Support"
-	select ANON_INODES
 	help
 	  This enables GPIO support through the generic GPIO library.
 	  You only need to enable this, if you also want to enable
diff --git a/drivers/iio/Kconfig b/drivers/iio/Kconfig
index d08aeb41cd07..1dec0fecb6ef 100644
--- a/drivers/iio/Kconfig
+++ b/drivers/iio/Kconfig
@@ -4,7 +4,6 @@
 
 menuconfig IIO
 	tristate "Industrial I/O support"
-	select ANON_INODES
 	help
 	  The industrial I/O subsystem provides a unified framework for
 	  drivers for many different types of embedded sensors using a
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index a1fb840de45d..d318bab25860 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -25,7 +25,6 @@ config INFINIBAND_USER_MAD
 
 config INFINIBAND_USER_ACCESS
 	tristate "InfiniBand userspace access (verbs and CM)"
-	select ANON_INODES
 	depends on MMU
 	---help---
 	  Userspace InfiniBand access support.  This enables the
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9de5ed38da83..3798d77d131c 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -22,7 +22,6 @@ menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64)
-	select ANON_INODES
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/vfio.txt for more details.
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..35945f8139e6 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -25,7 +25,7 @@ obj-$(CONFIG_PROC_FS) += proc_namespace.o
 
 obj-y				+= notify/
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
-obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
+obj-y				+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
index 735bfb2e9190..521dc91d2cb5 100644
--- a/fs/notify/fanotify/Kconfig
+++ b/fs/notify/fanotify/Kconfig
@@ -1,7 +1,6 @@
 config FANOTIFY
 	bool "Filesystem wide access notification"
 	select FSNOTIFY
-	select ANON_INODES
 	select EXPORTFS
 	default n
 	---help---
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index b981fc0c8379..0161c74e76e2 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,6 +1,5 @@
 config INOTIFY_USER
 	bool "Inotify support for userspace"
-	select ANON_INODES
 	select FSNOTIFY
 	default y
 	---help---
diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..be8f97e37a76 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1171,9 +1171,6 @@ config LD_DEAD_CODE_DATA_ELIMINATION
 config SYSCTL
 	bool
 
-config ANON_INODES
-	bool
-
 config HAVE_UID16
 	bool
 
@@ -1378,14 +1375,12 @@ config HAVE_FUTEX_CMPXCHG
 config EPOLL
 	bool "Enable eventpoll support" if EXPERT
 	default y
-	select ANON_INODES
 	help
 	  Disabling this option will cause the kernel to be built without
 	  support for epoll family of system calls.
 
 config SIGNALFD
 	bool "Enable signalfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the signalfd() system call that allows to receive signals
@@ -1395,7 +1390,6 @@ config SIGNALFD
 
 config TIMERFD
 	bool "Enable timerfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the timerfd() system call that allows to receive timer
@@ -1405,7 +1399,6 @@ config TIMERFD
 
 config EVENTFD
 	bool "Enable eventfd() system call" if EXPERT
-	select ANON_INODES
 	default y
 	help
 	  Enable the eventfd() system call that allows to receive both
@@ -1516,7 +1509,6 @@ config KALLSYMS_BASE_RELATIVE
 # syscall, maps, verifier
 config BPF_SYSCALL
 	bool "Enable bpf() system call"
-	select ANON_INODES
 	select BPF
 	select IRQ_WORK
 	default n
@@ -1533,7 +1525,6 @@ config BPF_JIT_ALWAYS_ON
 
 config USERFAULTFD
 	bool "Enable userfaultfd() system call"
-	select ANON_INODES
 	depends on MMU
 	help
 	  Enable the userfaultfd() system call that allows to intercept and
@@ -1600,7 +1591,6 @@ config PERF_EVENTS
 	bool "Kernel performance events and counters"
 	default y if PROFILING
 	depends on HAVE_PERF_EVENTS
-	select ANON_INODES
 	select IRQ_WORK
 	select SRCU
 	help
-- 
2.21.0

^ permalink raw reply related

* [PATCH v2 2/5] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-18 10:18 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, oleg, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	cyphar, joel, dancol, Christian Brauner
In-Reply-To: <20190418101841.4476-1-christian@brauner.io>

This patchset makes it possible to retrieve pid file descriptors at process
creation time by introducing the new flag CLONE_PIDFD to the clone() system
call. Linus originally suggested to implement this as a new flag to clone()
instead of making it a separate system call. As spotted by Linus, there is
exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api. They serve as a simple opaque handle on pids. Logically, this
makes it possible to interpret a pidfd differently, narrowing or widening
the scope of various operations (e.g. signal sending). Thus, a pidfd cannot
just refer to a tgid, but also a tid, or in theory - given appropriate flag
arguments in relevant syscalls - a process group or session. A pidfd does
not represent a privilege. This does not imply it cannot ever be that way
but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the callers pid
namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the fourth
argument of clone. This has the advantage that we can give back the
associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with a
sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be translated
into a helper that would be suitable for inclusion in libc so that users
don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
/* changelog */
v1:
- Oleg Nesterov <oleg@redhat.com>:
  - return pidfd in fourth argument of clone
    This way we can return the pid and the pidfd at the same time to the
    caller and can also start pid file descriptor numbering at 0 as is
    customary for file descriptors.
- Christian Brauner <christian@brauner.io>:
  - update comments to reflect changes based on Oleg's idea
v2:
- Oleg Nesterov <oleg@redhat.com>:
  - move put_user() before clone()'s point of no return so we can handle
    put_user() errors
- Christian Brauner <christian@brauner.io>:
  - change pidfd_create() to also fd_install()
    With Oleg's change it makes sense to do the fd_install() right before
    the moved put_user().
---
 include/linux/pid.h        |  2 +
 include/uapi/linux/sched.h |  1 +
 kernel/fork.c              | 96 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 95 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b6f4ba16065a..3c8ef5a199ca 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
 
 extern struct pid init_struct_pid;
 
+extern const struct file_operations pidfd_fops;
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..ed4ee170bee2 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
 #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
 #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
 #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD	0x00001000	/* set if a pidfd should be placed in parent */
 #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..201aafdac727 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
 
+#include <linux/anon_inodes.h>
 #include <linux/slab.h>
 #include <linux/sched/autogroup.h>
 #include <linux/sched/mm.h>
@@ -21,8 +22,10 @@
 #include <linux/sched/task.h>
 #include <linux/sched/task_stack.h>
 #include <linux/sched/cputime.h>
+#include <linux/seq_file.h>
 #include <linux/rtmutex.h>
 #include <linux/init.h>
+#include <linux/fsnotify.h>
 #include <linux/unistd.h>
 #include <linux/module.h>
 #include <linux/vmalloc.h>
@@ -1662,6 +1665,58 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_RCU */
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+	struct pid *pid = f->private_data;
+
+	seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+	seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+	.release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid:  struct pid that the pidfd will reference
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * been unshared to avoid leaking the pidfd to the new process.
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ *         On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid)
+{
+	int fd;
+
+	fd = anon_inode_getfd("pidfd", &pidfd_fops, get_pid(pid),
+			      O_RDWR | O_CLOEXEC);
+	if (fd < 0)
+		put_pid(pid);
+
+	return fd;
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -1674,13 +1729,14 @@ static __latent_entropy struct task_struct *copy_process(
 					unsigned long clone_flags,
 					unsigned long stack_start,
 					unsigned long stack_size,
+					int __user *parent_tidptr,
 					int __user *child_tidptr,
 					struct pid *pid,
 					int trace,
 					unsigned long tls,
 					int node)
 {
-	int retval;
+	int pidfd = -1, retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
 
@@ -1730,6 +1786,19 @@ static __latent_entropy struct task_struct *copy_process(
 			return ERR_PTR(-EINVAL);
 	}
 
+	/* Pidfds will be returned through parent_tidptr. */
+	if ((clone_flags & (CLONE_PIDFD | CLONE_PARENT_SETTID)) ==
+	    (CLONE_PIDFD | CLONE_PARENT_SETTID))
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * Ensure that we can potentially reuse CLONE_DETACHED for
+	 * CLONE_PIDFD in the future.
+	 */
+	if ((clone_flags & (CLONE_PIDFD | CLONE_DETACHED)) ==
+	    (CLONE_PIDFD | CLONE_DETACHED))
+		return ERR_PTR(-EINVAL);
+
 	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
@@ -1936,6 +2005,22 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
+	/*
+	 * This has to happen after we've potentially unshared the file
+	 * descriptor table (so that the pidfd doesn't leak into the child
+	 * if the fd table isn't shared).
+	 */
+	if (clone_flags & CLONE_PIDFD) {
+		retval = pidfd_create(pid);
+		if (retval < 0)
+			goto bad_fork_free_pid;
+
+		pidfd = retval;
+		retval = put_user(pidfd, parent_tidptr);
+		if (retval)
+			goto bad_fork_put_pidfd;
+	}
+
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
@@ -1996,7 +2081,7 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	retval = cgroup_can_fork(p);
 	if (retval)
-		goto bad_fork_free_pid;
+		goto bad_fork_put_pidfd;
 
 	/*
 	 * From this point on we must avoid any synchronous user-space
@@ -2111,6 +2196,9 @@ static __latent_entropy struct task_struct *copy_process(
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+	if (clone_flags & CLONE_PIDFD)
+		ksys_close(pidfd);
 bad_fork_free_pid:
 	cgroup_threadgroup_change_end(current);
 	if (pid != &init_struct_pid)
@@ -2176,7 +2264,7 @@ static inline void init_idle_pids(struct task_struct *idle)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
+	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
 			    cpu_to_node(cpu));
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
@@ -2223,7 +2311,7 @@ long _do_fork(unsigned long clone_flags,
 			trace = 0;
 	}
 
-	p = copy_process(clone_flags, stack_start, stack_size,
+	p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
 			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
 	add_latent_entropy();
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH v2 3/5] signal: use fdget() since we don't allow O_PATH
From: Christian Brauner @ 2019-04-18 10:18 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, oleg, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	cyphar, joel, dancol, Christian Brauner, Jann Horn
In-Reply-To: <20190418101841.4476-1-christian@brauner.io>

As stated in the original commit for pidfd_send_signal() we don't allow to
signal processes through O_PATH file descriptors since it is semantically
equivalent to a write on the pidfd. We already correctly error out right
now and return EBADF if an O_PATH fd is passed. This is because we use
file->f_op to detect whether a pidfd is passed and O_PATH fds have their
file->f_op set to empty_fops in do_dentry_open() and thus fail the test.
Thus, there is no regression. It's just semantically correct to use fdget()
and return an error right from there instead of taking a reference and
returning an error later.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jann@thejh.net>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
/* changelog */
v1: patch not present
v2:
- Oleg Nesterov <oleg@redhat.com>:
  - split out from following patch since the s/fdget_raw()/fdget()/
    replacement is a fix unrelated to the theme of the following patch
---
 kernel/signal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f98448cf2def..227ba170298e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3581,7 +3581,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	if (flags)
 		return -EINVAL;
 
-	f = fdget_raw(pidfd);
+	f = fdget(pidfd);
 	if (!f.file)
 		return -EBADF;
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH v2 4/5] signal: support CLONE_PIDFD with pidfd_send_signal
From: Christian Brauner @ 2019-04-18 10:18 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, oleg, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	cyphar, joel, dancol, Christian Brauner
In-Reply-To: <20190418101841.4476-1-christian@brauner.io>

Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
patch pidfd_send_signal() becomes independent of procfs. This fullfils the
request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to be
used by users without procfs mounted or even users without procfs support
compiled into the kernel.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
/* changelog */
v1: patch unchanged
v2:
- Oleg Nesterov <oleg@redhat.com>:
  - split s/fdget_raw()/fdget()/ into separate patch as it has nothing to
    do with supporting CLONE_PIDFD
---
 kernel/signal.c | 12 +++++++++---
 kernel/sys_ni.c |  3 ---
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 227ba170298e..cd83cc376767 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3513,7 +3513,6 @@ SYSCALL_DEFINE2(kill, pid_t, pid, int, sig)
 	return kill_something_info(sig, &info, pid);
 }
 
-#ifdef CONFIG_PROC_FS
 /*
  * Verify that the signaler and signalee either are in the same pid namespace
  * or that the signaler's pid namespace is an ancestor of the signalee's pid
@@ -3550,6 +3549,14 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
 	return copy_siginfo_from_user(kinfo, info);
 }
 
+static struct pid *pidfd_to_pid(const struct file *file)
+{
+	if (file->f_op == &pidfd_fops)
+		return file->private_data;
+
+	return tgid_pidfd_to_pid(file);
+}
+
 /**
  * sys_pidfd_send_signal - send a signal to a process through a task file
  *                          descriptor
@@ -3586,7 +3593,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 		return -EBADF;
 
 	/* Is this a pidfd? */
-	pid = tgid_pidfd_to_pid(f.file);
+	pid = pidfd_to_pid(f.file);
 	if (IS_ERR(pid)) {
 		ret = PTR_ERR(pid);
 		goto err;
@@ -3620,7 +3627,6 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	fdput(f);
 	return ret;
 }
-#endif /* CONFIG_PROC_FS */
 
 static int
 do_send_specific(pid_t tgid, pid_t pid, int sig, struct kernel_siginfo *info)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d21f4befaea4..4d9ae5ea6caf 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -167,9 +167,6 @@ COND_SYSCALL(syslog);
 
 /* kernel/sched/core.c */
 
-/* kernel/signal.c */
-COND_SYSCALL(pidfd_send_signal);
-
 /* kernel/sys.c */
 COND_SYSCALL(setregid);
 COND_SYSCALL(setgid);
-- 
2.21.0

^ permalink raw reply related

* [PATCH v2 5/5] samples: show race-free pidfd metadata access
From: Christian Brauner @ 2019-04-18 10:18 UTC (permalink / raw)
  To: torvalds, viro, jannh, dhowells, oleg, linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	cyphar, joel, dancol, Christian Brauner
In-Reply-To: <20190418101841.4476-1-christian@brauner.io>

This is a sample program showing userspace how to get race-free access to
process metadata from a pidfd. It is rather easy to do and userspace can
actually simply reuse code that currently parses a process's status file in
procfs.
The program can easily be extended into a generic helper suitable for
inclusion in a libc to make it even easier for userspace to gain metadata
access.

Since this came up in a discussion since this API is going to be used in
various service managers. A lot of programs will have a whitelist seccomp
filter that returns EPERM for all new syscalls. This means that programs
might get confused if CLONE_PIDFD works but the later pidfd_send_signal()
syscall doesn't. Hence, here's a ahead of time check that
pidfd_send_signal() is supported:

bool pidfd_send_signal_supported()
{
        int procfd = open("/proc/self", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
        if (procfd < 0)
                return false;

        /* pidfd_send_signal() should never fail this test. So it must
         * mean it is not available or blocked by an LSM or seccomp or
         * other. So * fallback to using pids in this case.
         */
        return pidfd_send_signal(procfd, 0, NULL, 0) == 0;
}

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
/* changelog */
v1:
- Christian Brauner <christian@brauner.io>:
  - adapt sample program to changes in how CLONE_PIDFD returns the pidfd
    With Oleg's suggestion we can simplify the program even more.
v2: patch unchanged
---
 samples/Makefile               |   2 +-
 samples/pidfd/Makefile         |   6 ++
 samples/pidfd/pidfd-metadata.c | 112 +++++++++++++++++++++++++++++++++
 3 files changed, 119 insertions(+), 1 deletion(-)
 create mode 100644 samples/pidfd/Makefile
 create mode 100644 samples/pidfd/pidfd-metadata.c

diff --git a/samples/Makefile b/samples/Makefile
index b1142a958811..fadadb1c3b05 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -3,4 +3,4 @@
 obj-$(CONFIG_SAMPLES)	+= kobject/ kprobes/ trace_events/ livepatch/ \
 			   hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/ \
 			   configfs/ connector/ v4l/ trace_printk/ \
-			   vfio-mdev/ statx/ qmi/ binderfs/
+			   vfio-mdev/ statx/ qmi/ binderfs/ pidfd/
diff --git a/samples/pidfd/Makefile b/samples/pidfd/Makefile
new file mode 100644
index 000000000000..0ff97784177a
--- /dev/null
+++ b/samples/pidfd/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+hostprogs-y := pidfd-metadata
+always := $(hostprogs-y)
+HOSTCFLAGS_pidfd-metadata.o += -I$(objtree)/usr/include
+all: pidfd-metadata
diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
new file mode 100644
index 000000000000..bd8456fc4c0e
--- /dev/null
+++ b/samples/pidfd/pidfd-metadata.c
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#ifndef CLONE_PIDFD
+#define CLONE_PIDFD 0x00001000
+#endif
+
+static int do_child(void *args)
+{
+	printf("%d\n", getpid());
+	_exit(EXIT_SUCCESS);
+}
+
+static pid_t pidfd_clone(int flags, int *pidfd)
+{
+	size_t stack_size = 1024;
+	char *stack[1024] = { 0 };
+
+#ifdef __ia64__
+	return __clone2(do_child, stack, stack_size, flags | SIGCHLD, NULL, pidfd);
+#else
+	return clone(do_child, stack + stack_size, flags | SIGCHLD, NULL, pidfd);
+#endif
+}
+
+static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
+					unsigned int flags)
+{
+	return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
+}
+
+static int pidfd_metadata_fd(pid_t pid, int pidfd)
+{
+	int procfd, ret;
+	char path[100];
+
+	snprintf(path, sizeof(path), "/proc/%d", pid);
+	procfd = open(path, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
+	if (procfd < 0) {
+		warn("Failed to open %s\n", path);
+		return -1;
+	}
+
+	/*
+	 * Verify that the pid has not been recycled and our /proc/<pid> handle
+	 * is still valid.
+	 */
+	ret = sys_pidfd_send_signal(pidfd, 0, NULL, 0);
+	if (ret < 0) {
+		switch (errno) {
+		case EPERM:
+			/* Process exists, just not allowed to signal it. */
+			break;
+		default:
+			warn("Failed to signal process\n");
+			close(procfd);
+			procfd = -1;
+		}
+	}
+
+	return procfd;
+}
+
+int main(int argc, char *argv[])
+{
+	int ret = EXIT_FAILURE;
+	char buf[4096] = { 0 };
+	pid_t pid;
+	int pidfd, procfd, statusfd;
+	ssize_t bytes;
+
+	pid = pidfd_clone(CLONE_PIDFD, &pidfd);
+	if (pid < 0)
+		exit(ret);
+
+	procfd = pidfd_metadata_fd(pid, pidfd);
+	close(pidfd);
+	if (procfd < 0)
+		goto out;
+
+	statusfd = openat(procfd, "status", O_RDONLY | O_CLOEXEC);
+	close(procfd);
+	if (statusfd < 0)
+		goto out;
+
+	bytes = read(statusfd, buf, sizeof(buf));
+	if (bytes > 0)
+		bytes = write(STDOUT_FILENO, buf, bytes);
+	close(statusfd);
+	ret = EXIT_SUCCESS;
+
+out:
+	(void)wait(NULL);
+
+	exit(ret);
+}
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: Adhemerval Zanella @ 2019-04-18 11:09 UTC (permalink / raw)
  To: H. Peter Anvin, Florian Weimer; +Cc: libc-alpha, linux-api, linuxppc-dev
In-Reply-To: <8f05dc10-02ea-327f-f984-98b91a0b195d@zytor.com>



On 17/04/2019 19:04, H. Peter Anvin wrote:
> On 4/15/19 10:22 AM, Adhemerval Zanella wrote:
>>>
>>> New interfaces are only necessary for the handful of architectures that don't have the speed fields *and* to space to put them in. 
>>
>> Based on your WIP, it seems that both sparc and mips could be adapted.
>> Do we still have glibc supported architecture that would require compat
>> symbols?
>>
>>>
>>> Using symbol versioning doesn't really help much since the real problem is that struct termios can be passed around in userspace, and the interfaces between user space libraries don't have any versioning. However, my POC code deals with that too by only seeing BOTHER when necessary, so if the structure is extended garbage in the extra fields will be ignored unless new baud rates are in use.
>>
>> Yeah, we discussed this earlier and if recall correctly it was not settled
>> that all architectures would allow the use to extra space for the new
>> fields. It seems the case, which makes the adaptation for termios2 even easier.
>>
>> The question I have for kernel side is whether termios2 is fully compatible
>> with termios, meaning that if there is conner cases we need to handle in
>> userland.
>>
> 
> It depends on what you mean with "fully compatible."
> 
> Functionality-wise, the termios2 interfaces are a strict superset. There
> is not, however, any guarantee that struct kernel_termios2 *contains* a
> contiguous binary equivalent to struct kernel_termios (in fact, on most
> architectures, it doesn't.)

I mean that we can fully implement termios1 using termios2 by adjusting
the termios struct in syscall wrappers.  If it is a superset I think it
is fine.

> 
>>>
>>> My POC code deals with Alpha as well by falling back to the old interfaces if necessary and possible, otherwise return error.
>>>
>>> Exporting termios2 to user space feels a bit odd at this stage as it would only be usable as a fallback on old glibc. Call it kernel_termios2 at least. ioctls using struct termios *must* be changed to kernel_termios anyway!
>>>
>>
>> I still prefer to avoid export it to userland and make it usable through
>> default termios, as your wip does.  My understanding is new interfaces 
>> should be semantic equal to current one with the only deviation that
>> non-standard baudrates will handled as its values.  The only issue I 
>> can foresee is if POSIX starts to export new bauds value.
>>
> 
> ... which will be easy to handle: just define a Bxxxx constant with the
> value equal to the baud rate.
> 
> 	-hhpa

Right.

^ permalink raw reply

* Re: [PATCH v2] moduleparam: Save information about built-in modules in separate file
From: Masahiro Yamada @ 2019-04-18 11:10 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Linux Kernel Mailing List, Linux Kbuild mailing list, linux-api,
	linux-modules, Kirill A . Shutemov, Gleb Fotengauer-Malinovskiy,
	Dmitry V. Levin, Michal Marek, Dmitry Torokhov, Rusty Russell,
	Jessica Yu, Lucas De Marchi
In-Reply-To: <20190406121447.GB4047@localhost.localdomain>

On Sat, Apr 6, 2019 at 9:15 PM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>
> Problem:
>
> When a kernel module is compiled as a separate module, some important
> information about the kernel module is available via .modinfo section of
> the module.  In contrast, when the kernel module is compiled into the
> kernel, that information is not available.
>
> Information about built-in modules is necessary in the following cases:
>
> 1. When it is necessary to find out what additional parameters can be
> passed to the kernel at boot time.
>
> 2. When you need to know which module names and their aliases are in
> the kernel. This is very useful for creating an initrd image.
>
> Proposal:
>
> The proposed patch does not remove .modinfo section with module
> information from the vmlinux at the build time and saves it into a
> separate file after kernel linking. So, the kernel does not increase in
> size and no additional information remains in it. Information is stored
> in the same format as in the separate modules (null-terminated string
> array). Because the .modinfo section is already exported with a separate
> modules, we are not creating a new API.
>
> It can be easily read in the userspace:
>
> $ tr '\0' '\n' < kernel.builtin
> ext4.softdep=pre: crc32c
> ext4.license=GPL
> ext4.description=Fourth Extended Filesystem
> ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
> ext4.alias=fs-ext4
> ext4.alias=ext3
> ext4.alias=fs-ext3
> ext4.alias=ext2
> ext4.alias=fs-ext2
> md_mod.alias=block-major-9-*
> md_mod.alias=md
> md_mod.description=MD RAID framework
> md_mod.license=GPL
> md_mod.parmtype=create_on_open:bool
> md_mod.parmtype=start_dirty_degraded:int
> ...
>
> v2:
>  * Extract modinfo from vmlinux.o as suggested by Masahiro Yamada;
>  * Rename output file to kernel.builtin;

Sorry, I do not get why you renamed
"kernel.builtin.modinfo" to "kernel.builtin".

If you drop "modinfo", we do not understand
what kind information is contained in it.

I think "kernel" and "builtin" have
a quite similar meaning here.

How about "builtin.modinfo" for example?


It is shorter, and it is clear enough
that it contains module_info.



>  * Add MODULE_VERSION to modinfo that is saved to the kernel.builtin;
>  * Fix build warnings on powerpc.
>
> Co-Developed-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
> Signed-off-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org>
> Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> ---


> diff --git a/.gitignore b/.gitignore
> index a20ac26aa2f5..432332fd745e 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -45,6 +45,7 @@
>  *.xz
>  Module.symvers
>  modules.builtin
> +kernel.builtin

This file is generated only in the top of the tree.

Please add '/' prefix and move it to
the "# Top-level generic files" section.




> diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
> index c8cf45362bd6..b914e026ef28 100755
> --- a/scripts/link-vmlinux.sh
> +++ b/scripts/link-vmlinux.sh
> @@ -226,6 +226,10 @@ modpost_link vmlinux.o
>  # modpost vmlinux.o to check for section mismatches
>  ${MAKE} -f "${srctree}/scripts/Makefile.modpost" vmlinux.o
>
> +info MODINFO kernel.builtin
> +"${OBJCOPY}" -j .modinfo -O binary vmlinux.o kernel.builtin
> +chmod 444 kernel.builtin


Could you explain why this "chmod" is necessary?


Thanks.




--
Best Regards
Masahiro Yamada

^ permalink raw reply

* Re: [PATCH v15 1/3] /proc/pid/status: Add support for architecture specific output
From: Thomas Gleixner @ 2019-04-18 13:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Aubrey Li, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Andi Kleen, Tim Chen, Dave Hansen,
	Arjan van de Ven, Alexey Dobriyan, aubrey.li, Linux API, LKML
In-Reply-To: <CALCETrUjF9PBmkzH1J86vw4ZW785DP7FtcT+gcSrx29=BUnjoQ@mail.gmail.com>

On Wed, 17 Apr 2019, Andy Lutomirski wrote:

> On Tue, Apr 16, 2019 at 4:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 16 Apr 2019 14:32:48 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:
> >
> > > The architecture specific information of the running processes could
> > > be useful to the userland. Add support to examine process architecture
> > > specific information externally.
> >
> > The implementation looks just fine to me.  Have you had any feedback on
> > the overall desirability of adding this feature?
> 
> I think I've been the most outspoken, and my not-all-that-strong
> opinion is that I don't really like it.  /proc/PID/status is already a
> bit of a mess, and I don't think we really want it to be a mess that
> is different on different architectures.  Hence my suggestion of
> /proc/PID/x86_status instead.  Or we could do /proc/PID/arch_status, I

arch_status looks like the right thing to do.

Thanks,

	tglx

^ permalink raw reply

* Re: [PATCH v2 2/5] clone: add CLONE_PIDFD
From: Oleg Nesterov @ 2019-04-18 13:12 UTC (permalink / raw)
  To: Christian Brauner
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190418101841.4476-3-christian@brauner.io>

On 04/18, Christian Brauner wrote:
>
> @@ -1674,13 +1729,14 @@ static __latent_entropy struct task_struct *copy_process(
>  					unsigned long clone_flags,
>  					unsigned long stack_start,
>  					unsigned long stack_size,
> +					int __user *parent_tidptr,
>  					int __user *child_tidptr,
>  					struct pid *pid,
>  					int trace,
>  					unsigned long tls,
>  					int node)
>  {
> -	int retval;
> +	int pidfd = -1, retval;

it seems that initialization is unneeded, but this is cosmetic.

I see no technical problems, feel free to add my reviewed-by.

But let me ask a couple of questions...

Why O_CLOEXEC? I am just curious, I do not really care.

Should we allow CLONE_THREAD | CLONE_PIDFD ?

Are you sure we will never need to extend this interface? If not, then perhaps it
make sense to add something like

	if (CLONE_PIDFD) {
		unsigned long not_used_yet;
		if (get_user(not_used_yet, parent_tidptr) ||
		    not_used_yet != 0)
			return -EINVAL;
	}

this way we can easily add more arguments in future or even turn CLONE_PIDFD into
CLONE_MORE_ARGS_IN_PARENT_TIDPTR.

Not that I think this is really good idea, sys_clone2() makes more sense, but still.

Oleg.

^ permalink raw reply

* Re: [PATCH v2 3/5] signal: use fdget() since we don't allow O_PATH
From: Oleg Nesterov @ 2019-04-18 13:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol, Jann Horn
In-Reply-To: <20190418101841.4476-4-christian@brauner.io>

On 04/18, Christian Brauner wrote:
>
> It's just semantically correct to use fdget()
> and return an error right from there instead of taking a reference and
> returning an error later.

agreed, and thanks for your explanations

Acked-by: Oleg Nesterov <oleg@redhat.com>

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Mathieu Desnoyers @ 2019-04-18 13:17 UTC (permalink / raw)
  To: Joseph Myers, Will Deacon
  Cc: carlos, Florian Weimer, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <1770787324.668.1555530989646.JavaMail.zimbra@efficios.com>

----- On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Apr 17, 2019, at 12:17 PM, Joseph Myers joseph@codesourcery.com wrote:
> 
>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>> 
>>> > +/* RSEQ_SIG is a signature required before each abort handler code.
>>> > +
>>> > +   It is a 32-bit value that maps to actual architecture code compiled
>>> > +   into applications and libraries. It needs to be defined for each
>>> > +   architecture. When choosing this value, it needs to be taken into
>>> > +   account that generating invalid instructions may have ill effects on
>>> > +   tools like objdump, and may also have impact on the CPU speculative
>>> > +   execution efficiency in some cases.  */
>>> > +
>>> > +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
>>> 
>>> After further investigation, we should probably do the following
>>> to handle compiling with -mbig-endian on aarch64, which generates
>>> binaries with mixed code vs data endianness (little endian code,
>>> big endian data):
>> 
>> First, the comment on RSEQ_SIG should specify whether it is to be
>> interpreted in the code or the data endianness.
> 
> Right. The signature passed as argument to the rseq registration
> system call needs to be in data endianness (currently exposed kernel
> ABI).
> 
> Ideally for userspace, we want to define a signature in code endianness
> that happens to nicely match specific code patterns.
> 
>> 
>>> For ARM32, the situation is a bit more complex. Only armv6+
>>> generates mixed-endianness code vs data with -mbig-endian.
>>> Prior to armv6, the code and data endianness matches. Therefore,
>>> I plan to #ifdef the reversed endianness handling with:
>>> 
>>> #if __ARM_ARCH >= 6 && __ARM_BIG_ENDIAN
>>> 
>>> on arm32.
>> 
>> That doesn't work well because BE code (.o files) can be built for v5te
>> (for example) and used on a range of different architecture variants with
>> both BE32 and BE8 - the choice between BE32 and BE8 is a link-time choice,
>> not a compile-time choice.  So if the value for Arm is a compile-time
>> constant, it should also work for both BE32 and BE8.
> 
> Good to know! Then we need to be even more careful.
> 
>> 
>> In turn, that suggests to me that RSEQ_SIG should be defined to be a value
>> that is always in the code endianness (and whatever corresponding kernel
>> code handles RSEQ_SIG values should act accordingly on architectures where
>> the two endiannesses can differ).  If the kernel ABI is already fixed in a
>> way that prevents such a definition of RSEQ_SIG semantics as using code
>> endianness, a value should be chosen for Arm that works for both
>> endiannesses.
> 
> It might be tricky to pick up a trap instruction that is a palindrome
> endianness-wise.
> 
>> 
>> (Also, installed glibc headers are supposed to work with older compilers,
>> and support for __ARM_ARCH was only added in GCC 4.8.  Before that you
>> need to test lots of separate macros for different architecture variants
>> to determine a version number.)
> 
> Good point!
> 
> Here is an alternative to the palindrome approach. I'm taking arm32
> as an example:
> 
> * We define RSEQ_SIG_CODE in code endianness, meant to be used with
>  .inst in rseq assembly:
> 
> #define RSEQ_SIG_CODE 0xe7f5def3
> 
> * We define RSEQ_SIG_DATA in data endianness:
> 
> #define RSEQ_SIG_DATA \
>        ({ \
>                int sig; \
>                asm volatile (  "b 2f\n\t" \
>                                ".arm\n\t" \
>                                "1: .inst 0xe7f5def3\n\t" \
>                                "2:\n\t" \
>                                "ldr %[sig], 1b\n\t" \
>                                : [sig] "=r" (sig)); \
>                sig; \
>        })
> 
> Technically, only glibc and early-adopter libraries wishing to
> register rseq need to use RSEQ_SIG_DATA. The RSEQ_SIG_CODE needs
> to be used from inline assembly to create the signatures before
> each abort handler.

The approach above should work for arm32 be8 vs be32 linker weirdness.

For aarch64, I think we can simply do:

/*
 * aarch64 -mbig-endian generates mixed endianness code vs data:
 * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
 * matches code endianness.
 */
#define RSEQ_SIG_CODE   0xd428bc00      /* BRK #0x45E0.  */

#ifdef __ARM_BIG_ENDIAN
#define RSEQ_SIG_DATA   0x00bc28d4      /* BRK #0x45E0.  */
#else
#define RSEQ_SIG_DATA   RSEQ_SIG_CODE
#endif

#define RSEQ_SIG        RSEQ_SIG_DATA

Feedback is most welcome,

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH v2 2/5] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-18 13:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190418131206.GB13701@redhat.com>

On Thu, Apr 18, 2019 at 03:12:07PM +0200, Oleg Nesterov wrote:
> On 04/18, Christian Brauner wrote:
> >
> > @@ -1674,13 +1729,14 @@ static __latent_entropy struct task_struct *copy_process(
> >  					unsigned long clone_flags,
> >  					unsigned long stack_start,
> >  					unsigned long stack_size,
> > +					int __user *parent_tidptr,
> >  					int __user *child_tidptr,
> >  					struct pid *pid,
> >  					int trace,
> >  					unsigned long tls,
> >  					int node)
> >  {
> > -	int retval;
> > +	int pidfd = -1, retval;
> 
> it seems that initialization is unneeded, but this is cosmetic.
> 
> I see no technical problems, feel free to add my reviewed-by.

Thank you!

> 
> 
> But let me ask a couple of questions...

Sure!

> 
> 
> Why O_CLOEXEC? I am just curious, I do not really care.

I think that having the file descriptor O_CLOEXEC by default is a good
security measure in general. Most of the time you do not want to pass a
file descriptor through exec() (apart from 0,1,2) and it is usually more
of an issue when you accidently do it then when you accidently don't. So
if users really care about passing a pidfd they should do so by removing
the O_CLOEXEC flag explicitly.
(New file descriptors should probably all default to that but that's just
my opinion.)
Another thing is that for a pidfds it makes even more sense to have them
cloexec by default. You don't want to *unintentionally* leak an fd that
can be used to operate on a process.

> 
> 
> Should we allow CLONE_THREAD | CLONE_PIDFD ?

I think so, yes. I have thought about this. Yes, due to CLONE_FILES |
CLONE_VM you'd necessarily hand the pidfd to the child but threads are
no security boundary in the first place. So if you have a
non-cooperating thread you very much already have a problem. The
situation is not very much different from passing the tid.
Plus, CLONE_PIDFD can make use of the CLONE_UNDETACHED flag in the
future in contrast to all the other flags. So one could potentially (not
saying we have this planned!) add a flag CLONE_PIDFD_KILL_ON_CLOSE (This
is just an improvised bad name rn.) which would cause the child to kill
itself if it does close(pidfd) on its own pidfd and noone else is
holding a reference at which point you'd hand a suicide switch to a new
thread but nothing else.

> 
> 
> Are you sure we will never need to extend this interface? If not, then perhaps it

Well, we can already make use of CLONE_UNDETACHED with CLONE_PIDFD since
we don't just ignore it. :)

> make sense to add something like
> 
> 	if (CLONE_PIDFD) {
> 		unsigned long not_used_yet;
> 		if (get_user(not_used_yet, parent_tidptr) ||
> 		    not_used_yet != 0)
> 			return -EINVAL;

Oh, very interesting point. Yes, I think this is worth it.

> 	}
> 
> this way we can easily add more arguments in future or even turn CLONE_PIDFD into
> CLONE_MORE_ARGS_IN_PARENT_TIDPTR.
> 
> Not that I think this is really good idea, sys_clone2() makes more sense, but still.

No, you're right about leaving this option open. Even if we don't extend
not allowing garbage to be passed is always a good idea.

Christian

^ permalink raw reply

* Re: [PATCH v2] moduleparam: Save information about built-in modules in separate file
From: Jessica Yu @ 2019-04-18 13:52 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Alexey Gladkov, Linux Kernel Mailing List,
	Linux Kbuild mailing list, linux-api, linux-modules,
	Kirill A . Shutemov, Gleb Fotengauer-Malinovskiy, Dmitry V. Levin,
	Michal Marek, Dmitry Torokhov, Rusty Russell, Lucas De Marchi
In-Reply-To: <CAK7LNATfo0K6kzwobkAZ_u2KnWL3XycJzbr3KPtPF=ePziCuDg@mail.gmail.com>

+++ Masahiro Yamada [18/04/19 20:10 +0900]:
>On Sat, Apr 6, 2019 at 9:15 PM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>>
>> Problem:
>>
>> When a kernel module is compiled as a separate module, some important
>> information about the kernel module is available via .modinfo section of
>> the module.  In contrast, when the kernel module is compiled into the
>> kernel, that information is not available.
>>
>> Information about built-in modules is necessary in the following cases:
>>
>> 1. When it is necessary to find out what additional parameters can be
>> passed to the kernel at boot time.
>>
>> 2. When you need to know which module names and their aliases are in
>> the kernel. This is very useful for creating an initrd image.
>>
>> Proposal:
>>
>> The proposed patch does not remove .modinfo section with module
>> information from the vmlinux at the build time and saves it into a
>> separate file after kernel linking. So, the kernel does not increase in
>> size and no additional information remains in it. Information is stored
>> in the same format as in the separate modules (null-terminated string
>> array). Because the .modinfo section is already exported with a separate
>> modules, we are not creating a new API.
>>
>> It can be easily read in the userspace:
>>
>> $ tr '\0' '\n' < kernel.builtin
>> ext4.softdep=pre: crc32c
>> ext4.license=GPL
>> ext4.description=Fourth Extended Filesystem
>> ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
>> ext4.alias=fs-ext4
>> ext4.alias=ext3
>> ext4.alias=fs-ext3
>> ext4.alias=ext2
>> ext4.alias=fs-ext2
>> md_mod.alias=block-major-9-*
>> md_mod.alias=md
>> md_mod.description=MD RAID framework
>> md_mod.license=GPL
>> md_mod.parmtype=create_on_open:bool
>> md_mod.parmtype=start_dirty_degraded:int
>> ...
>>
>> v2:
>>  * Extract modinfo from vmlinux.o as suggested by Masahiro Yamada;
>>  * Rename output file to kernel.builtin;
>
>Sorry, I do not get why you renamed
>"kernel.builtin.modinfo" to "kernel.builtin".
>
>If you drop "modinfo", we do not understand
>what kind information is contained in it.
>
>I think "kernel" and "builtin" have
>a quite similar meaning here.
>
>How about "builtin.modinfo" for example?
>
>
>It is shorter, and it is clear enough
>that it contains module_info.

I agree that the name kernel.builtin is unclear in what kind of
information it contains. Apologies for not having clarified this in
the previous review.

Since kbuild already produces "modules.order" and "modules.builtin"
files, why not just name it "modules.builtin.modinfo" to keep the
names consistent with what is already there? It clearly conveys that
this file contains modinfo for builtin modules.

I'll leave it up to Lucas to decide if the file format is OK for kmod.
With the modinfo dump, kmod could then decide what to do with the
information, append to modules.alias{,.bin}, etc.

Also, I think this file needs to be documented in Documentation/kbuild/kbuild.txt.

Thanks,

Jessica

^ permalink raw reply

* Re: [PATCH v2 2/5] clone: add CLONE_PIDFD
From: Oleg Nesterov @ 2019-04-18 14:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190418132822.untjt7erfvbbiz7a@brauner.io>

On 04/18, Christian Brauner wrote:
>
> On Thu, Apr 18, 2019 at 03:12:07PM +0200, Oleg Nesterov wrote:
> > Should we allow CLONE_THREAD | CLONE_PIDFD ?
>
> I think so, yes. I have thought about this.

OK, I won't insist. But let me explain why did I ask.

> Yes, due to CLONE_FILES |
> CLONE_VM you'd necessarily hand the pidfd to the child but threads are
> no security boundary in the first place.

No, no, I am not not worried about security. CLONE_PARENT | CLONE_PIDFD
looks more problematic to me, but I see nothing dangerous security-wise..

I agree that CLONE_THREAD | CLONE_PIDFD may be usefule, but I am not sure
we should allow this from the very begining, until we have a "real" use-case.

IIUC, we are going to make it pollable soon. OK, but proc_tgid_base_poll()
(which should be turned into pidfd_poll) simply can't work if pid_task() is
not a group leader. poll(pidfd) will hang forever if pidfd was created by
CLONE_THREAD | CLONE_PIDFD.

Sure, we can (should?) improve pidfd_poll() but this will need more nasty
changes in the core kernel code. Do we really need/want this? Right now it
is not clear to me. Instead, we can simply disallow CLONE_THREAD|CLONE_PIDFD
until we decide that yes, we want to poll sub-threads.

But again, I am fine with CLONE_THREAD | CLONE_PIDFD.

Oleg.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox