Linux Documentation


* Re: [PATCH] Documentation: filesystem: fix "Removed Sysctls" table
From: Dave Chinner @ 2019-07-23 22:21 UTC
  To: Sheriff Esseson
  Cc: Jonathan Corbet, Darrick J. Wong, supporter:XFS FILESYSTEM,
	open list:DOCUMENTATION, open list, skhan, linux-kernel-mentees
In-Reply-To: <20190723145201.GA20658@localhost>

On Tue, Jul 23, 2019 at 03:52:01PM +0100, Sheriff Esseson wrote:
> On Tue, Jul 23, 2019 at 07:42:18AM -0600, Jonathan Corbet wrote:
> > On Tue, 23 Jul 2019 12:48:13 +0100
> > Sheriff Esseson <sheriffesseson@gmail.com> wrote:
> > 
> > > the "Removed Sysctls" section is a table - bring it alive with ReST.
> > > 
> > > Signed-off-by: Sheriff Esseson <sheriffesseson@gmail.com>
> > 
> > So this appears to be identical to the patch you sent three days ago; is
> > there a reason why you are sending it again now?
> > 
> > Thanks,
> > 
> > jon
> 
> Sorry, I thought the patch went unnoticed during the merge window - I could
> not find a response.

The correct thing to do in that case is to reply to the original
patch and ask if it has been looked at. The usual way of doing this
is quoting the commit message and replying with a "Ping?" comment
to bump it back to the top of everyone's mail stacks.

But, again, 3 days is not a long time, people tend to be extremely
busy and might take a few days to get to reviewing non-critical
changes, and people may not even review patches during the merge
window. I'd suggest waiting a week before pinging a patch you've
sent if there's been no response....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v12 01/11] MODSIGN: Export module signature definitions
From: Thiago Jung Bauermann @ 2019-07-23 22:39 UTC
  To: Philipp Rudo
  Cc: Jessica Yu, linux-integrity, linux-security-module, keyrings,
	linux-crypto, linuxppc-dev, linux-doc, linux-kernel, Mimi Zohar,
	Dmitry Kasatkin, James Morris, Serge E. Hallyn, David Howells,
	David Woodhouse, Herbert Xu, David S. Miller, Jonathan Corbet,
	AKASHI, Takahiro, Heiko Carstens, linux-s390
In-Reply-To: <20190705150000.372345b0@laptop-ibm>


Hello Philipp,


Philipp Rudo <prudo@linux.ibm.com> writes:

> Hi Thiago,
>
> On Thu, 04 Jul 2019 15:57:34 -0300
> Thiago Jung Bauermann <bauerman@linux.ibm.com> wrote:
>
>> Hello Philipp,
>> 
>> Philipp Rudo <prudo@linux.ibm.com> writes:
>> 
>> > Hi Thiago,
>> >
>> >
>> > On Thu, 04 Jul 2019 03:42:57 -0300
>> > Thiago Jung Bauermann <bauerman@linux.ibm.com> wrote:
>> >  
>> >> Jessica Yu <jeyu@kernel.org> writes:
>> >>   
>> >> > +++ Thiago Jung Bauermann [27/06/19 23:19 -0300]:    
>> >> >>IMA will use the module_signature format for append signatures, so export
>> >> >>the relevant definitions and factor out the code which verifies that the
>> >> >>appended signature trailer is valid.
>> >> >>
>> >> >>Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
>> >> >>and be able to use mod_check_sig() without having to depend on either
>> >> >>CONFIG_MODULE_SIG or CONFIG_MODULES.
>> >> >>
>> >> >>Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
>> >> >>Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
>> >> >>Cc: Jessica Yu <jeyu@kernel.org>
>> >> >>---
>> >> >> include/linux/module.h           |  3 --
>> >> >> include/linux/module_signature.h | 44 +++++++++++++++++++++++++
>> >> >> init/Kconfig                     |  6 +++-
>> >> >> kernel/Makefile                  |  1 +
>> >> >> kernel/module.c                  |  1 +
>> >> >> kernel/module_signature.c        | 46 ++++++++++++++++++++++++++
>> >> >> kernel/module_signing.c          | 56 +++++---------------------------
>> >> >> scripts/Makefile                 |  2 +-
>> >> >> 8 files changed, 106 insertions(+), 53 deletions(-)
>> >> >>
>> >> >>diff --git a/include/linux/module.h b/include/linux/module.h
>> >> >>index 188998d3dca9..aa56f531cf1e 100644
>> >> >>--- a/include/linux/module.h
>> >> >>+++ b/include/linux/module.h
>> >> >>@@ -25,9 +25,6 @@
>> >> >> #include <linux/percpu.h>
>> >> >> #include <asm/module.h>
>> >> >>
>> >> >>-/* In stripped ARM and x86-64 modules, ~ is surprisingly rare. */
>> >> >>-#define MODULE_SIG_STRING "~Module signature appended~\n"
>> >> >>-    
>> >> >
>> >> > Hi Thiago, apologies for the delay.    
>> >> 
>> >> Hello Jessica, thanks for reviewing the patch!
>> >>   
>> >> > It looks like arch/s390/kernel/machine_kexec_file.c also relies on
>> >> > MODULE_SIG_STRING being defined, so module_signature.h will need to be
>> >> > included there too, otherwise we'll run into a compilation error.    
>> >> 
>> >> Indeed. Thanks for spotting that. The patch below fixes it. It's
>> >> identical to the previous version except for the changes in 
>> >> arch/s390/kernel/machine_kexec_file.c and their description in the
>> >> commit message. I'm also copying some s390 people in this email.  
>> >
>> > to me the s390 part looks good but for one minor nit.  
>> 
>> Thanks for the prompt review!
>> 
>> > In arch/s390/Kconfig KEXEC_VERIFY_SIG currently depends on
>> > SYSTEM_DATA_VERIFICATION. I'd prefer that you update this to the new
>> > MODULE_SIG_FORMAT. It shouldn't make any difference right now, as we don't
>> > use mod_check_sig in our code path. But it could cause problems in the future,
>> > when more code might be shared.  
>> 
>> Makes sense. Here is the updated patch with the Kconfig change.
>> 
>
> The patch looks good now.

Thanks! Can I add your Reviewed-by?

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



* Re: [PATCH v2] Documentation/security-bugs: provide more information about linux-distros
From: Kees Cook @ 2019-07-23 22:23 UTC
  To: Solar Designer
  Cc: Sasha Levin, corbet, will, peterz, gregkh, tyhicks, linux-doc,
	linux-kernel
In-Reply-To: <20190719084215.GA24691@openwall.com>

On Fri, Jul 19, 2019 at 10:42:15AM +0200, Solar Designer wrote:
> - The reporter having been directed to post from elsewhere (and I
> suspect this documentation file) without being aware of list policy.

Perhaps specify "linux-distros@" without a domain, so it's more clear?
Or re-split the Wiki into two pages to avoid confusion?

> - The reporter not mentioning (and sometimes not replying even when
> asked) whether they're also coordinating with security@k.o or whether
> they want someone on linux-distros to help coordinate with security@k.o.
> (Maybe this is something we want to write about here.)

Yeah, that seems useful to include in both places.

> - The Linux kernel bug having been introduced too recently to be of much
> interest to distros.

Right; that'd be good to add as well. I see a lot of panic on twitter,
for example, about bugs that only ever existed in -rc releases.

> > Sending to the distros@ list risks exposing Linux-only flaws to non-Linux
> > distros.
> 
> Right.
> 
> > This has caused leaks in the past
> 
> Do you mean leaks to *BSD security teams or to the public?  I'm not
> aware of past leaks to the public via the non-Linux distros present on
> the distros@ list.  Are you?

I don't know the origin of the leaks, but it only happened when distros@
was used instead of linux-distros@. I think this happened with DirtyCOW,
specifically.

-- 
Kees Cook


* Re: [PATCH v6 3/7] of/platform: Add functional dependency link from DT bindings
From: Rob Herring @ 2019-07-23 22:18 UTC
  To: Saravana Kannan
  Cc: Mark Rutland, Greg Kroah-Hartman, Rafael J. Wysocki, Frank Rowand,
	Jonathan Corbet,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	linux-kernel@vger.kernel.org, David Collins, Android Kernel Team,
	Linux Doc Mailing List
In-Reply-To: <CAGETcx-hCrUvY5whZBihueqqCxmF3oDjFybjmoo3JUu87iiiEw@mail.gmail.com>

On Tue, Jul 23, 2019 at 2:49 PM Saravana Kannan <saravanak@google.com> wrote:
>
> On Tue, Jul 23, 2019 at 11:06 AM Rob Herring <robh+dt@kernel.org> wrote:
> >
> > On Sat, Jul 20, 2019 at 12:17 AM Saravana Kannan <saravanak@google.com> wrote:
> > >
> > > Add device-links after the devices are created (but before they are
> > > probed) by looking at common DT bindings like clocks and
> > > interconnects.
> >
> > The structure now looks a lot better to me. A few minor things below.
>
> Thanks.
>
> > >
> > > Automatically adding device-links for functional dependencies at the
> > > framework level provides the following benefits:
> > >
> > > - Optimizes device probe order and avoids the useless work of
> > >   attempting probes of devices that will not probe successfully
> > >   (because their suppliers aren't present or haven't probed yet).
> > >
> > >   For example, in a commonly available mobile SoC, registering just
> > >   one consumer device's driver at an initcall level earlier than the
> > >   supplier device's driver causes 11 failed probe attempts before the
> > >   consumer device probes successfully. This was with a kernel with all
> > >   the drivers statically compiled in. This problem gets a lot worse if
> > >   all the drivers are loaded as modules without direct symbol
> > >   dependencies.
> > >
> > > - Supplier devices like clock providers, interconnect providers, etc
> > >   need to keep the resources they provide active and at a particular
> > >   state(s) during boot up even if their current set of consumers don't
> > >   request the resource to be active. This is because the rest of the
> > >   consumers might not have probed yet and turning off the resource
> > >   before all the consumers have probed could lead to a hang or
> > >   undesired user experience.
> > >
> > >   Some frameworks (Eg: regulator) handle this today by turning off
> > >   "unused" resources at late_initcall_sync and hoping all the devices
> > >   have probed by then. This is not a valid assumption for systems with
> > >   loadable modules. Other frameworks (Eg: clock) just don't handle
> > >   this due to the lack of a clear signal for when they can turn off
> > >   resources. This leads to downstream hacks to handle cases like this
> > >   that can easily be solved in the upstream kernel.
> > >
> > >   By linking devices before they are probed, we give suppliers a clear
> > >   count of the number of dependent consumers. Once all of the
> > >   consumers are active, the suppliers can turn off the unused
> > >   resources without making assumptions about the number of consumers.
> > >
> > > By default we just add device-links to track "driver presence" (probe
> > > succeeded) of the supplier device. If any other functionality provided
> > > by device-links are needed, it is left to the consumer/supplier
> > > devices to change the link when they probe.
> > >
> > > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > > ---
> > >  .../admin-guide/kernel-parameters.txt         |   5 +
> > >  drivers/of/platform.c                         | 158 ++++++++++++++++++
> > >  2 files changed, 163 insertions(+)
> > >
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index 138f6664b2e2..109b4310844f 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -3141,6 +3141,11 @@
> > >                         This can be set from sysctl after boot.
> > >                         See Documentation/sysctl/vm.txt for details.
> > >
> > > +       of_devlink      [KNL] Make device links from common DT bindings. Useful
> > > +                       for optimizing probe order and making sure resources
> > > +                       aren't turned off before the consumer devices have
> > > +                       probed.
> > > +
> > >         ohci1394_dma=early      [HW] enable debugging via the ohci1394 driver.
> > >                         See Documentation/debugging-via-ohci1394.txt for more
> > >                         info.
> > > diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> > > index 04ad312fd85b..88a2086e26fa 100644
> > > --- a/drivers/of/platform.c
> > > +++ b/drivers/of/platform.c
> > > @@ -509,6 +509,163 @@ int of_platform_default_populate(struct device_node *root,
> > >  }
> > >  EXPORT_SYMBOL_GPL(of_platform_default_populate);
> > >
> > > +bool of_link_is_valid(struct device_node *con, struct device_node *sup)
> > > +{
> > > +       of_node_get(sup);
> > > +       /*
> > > +        * Don't allow linking a device node as a consumer of one of its
> > > +        * descendant nodes. By definition, a child node can't be a functional
> > > +        * dependency for the parent node.
> > > +        */
> > > +       while (sup) {
> > > +               if (sup == con) {
> > > +                       of_node_put(sup);
> > > +                       return false;
> > > +               }
> > > +               sup = of_get_next_parent(sup);
> > > +       }
> > > +       return true;
> > > +}
> > > +
> > > +static int of_link_to_phandle(struct device *dev, struct device_node *sup_np)
> > > +{
> > > +       struct platform_device *sup_dev;
> > > +       u32 dl_flags = DL_FLAG_AUTOPROBE_CONSUMER;
> > > +       int ret = 0;
> > > +
> > > +       /*
> > > +        * Since we are trying to create device links, we need to find
> > > +        * the actual device node that owns this supplier phandle.
> > > +        * Often times it's the same node, but sometimes it can be one
> > > +        * of the parents. So walk up the parent till you find a
> > > +        * device.
> > > +        */
> > > +       while (sup_np && !of_find_property(sup_np, "compatible", NULL))
> > > +               sup_np = of_get_next_parent(sup_np);
> > > +       if (!sup_np)
> > > +               return 0;
> > > +
> > > +       if (!of_link_is_valid(dev->of_node, sup_np)) {
> > > +               of_node_put(sup_np);
> > > +               return 0;
> > > +       }
> > > +       sup_dev = of_find_device_by_node(sup_np);
> > > +       of_node_put(sup_np);
> > > +       if (!sup_dev)
> > > +               return -ENODEV;
> > > +       if (!device_link_add(dev, &sup_dev->dev, dl_flags))
> > > +               ret = -ENODEV;
> > > +       put_device(&sup_dev->dev);
> > > +       return ret;
> > > +}
> > > +
> > > +static struct device_node *parse_prop_cells(struct device_node *np,
> > > +                                           const char *prop, int i,
> >
> > I like 'i' for for loops, but less so for function params. Perhaps
> > 'index' instead like of_parse_phandle_with_args.
>
> Sounds good.
>
> >
> > > +                                           const char *binding,
> > > +                                           const char *cell)
> > > +{
> > > +       struct of_phandle_args sup_args;
> > > +
> > > +       if (!i && strcmp(prop, binding))
> >
> > Why the '!i' test?
>
> To avoid a string comparison for every index. It's kinda wasteful once
> the first index passes.

That's not very obvious and is pretty fragile, even though this is a
static function. Perhaps we should split it into match() and parse()
functions. At least put a comment here explaining what we're doing.
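
For illustration only, a minimal userspace sketch of what a match()/parse()
split could look like. This is not kernel code: struct fake_prop,
binding_matches(), binding_parse() and count_suppliers() are all hypothetical
names, and integer ids stand in for phandles. The point is that the string
comparison happens once per property, and the per-index lookup never needs a
'!i' guard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DT property that holds a phandle list. */
struct fake_prop {
	const char *name;	/* property name, e.g. "clocks" */
	const int *phandles;	/* non-negative supplier ids, illustrative only */
	size_t count;		/* number of entries in phandles[] */
};

/* match(): a single string comparison, done once per property. */
static bool binding_matches(const struct fake_prop *prop, const char *binding)
{
	assert(prop && binding);
	return strcmp(prop->name, binding) == 0;
}

/* parse(): pure index lookup; returns -1 once the list is exhausted. */
static int binding_parse(const struct fake_prop *prop, size_t index)
{
	return index < prop->count ? prop->phandles[index] : -1;
}

/* The caller matches once, then walks every index without re-comparing. */
static size_t count_suppliers(const struct fake_prop *prop, const char *binding)
{
	size_t index = 0;

	if (!binding_matches(prop, binding))
		return 0;
	while (binding_parse(prop, index) >= 0)
		index++;
	return index;
}
```

With this shape the cost argument for the '!i' test goes away, since the
comparison is no longer on the per-index path.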

>
> > > +               return NULL;
> > > +
> > > +       if (of_parse_phandle_with_args(np, binding, cell, i, &sup_args))
> > > +               return NULL;
> > > +
> > > +       return sup_args.np;
> > > +}
> > > +
> > > +static struct device_node *parse_clocks(struct device_node *np,
> > > +                                       const char *prop, int i)
> > > +{
> > > +       return parse_prop_cells(np, prop, i, "clocks", "#clock-cells");
> > > +}
> > > +
> > > +static struct device_node *parse_interconnects(struct device_node *np,
> > > +                                              const char *prop, int i)
> > > +{
> > > +       return parse_prop_cells(np, prop, i, "interconnects",
> > > +                               "#interconnect-cells");
> > > +}
> > > +
> > > +static int strcmp_suffix(const char *str, const char *suffix)
> > > +{
> > > +       unsigned int len, suffix_len;
> > > +
> > > +       len = strlen(str);
> > > +       suffix_len = strlen(suffix);
> > > +       if (len <= suffix_len)
> > > +               return -1;
> > > +       return strcmp(str + len - suffix_len, suffix);
> > > +}
> > > +
> > > +static struct device_node *parse_regulators(struct device_node *np,
> > > +                                           const char *prop, int i)
> > > +{
> > > +       if (i || strcmp_suffix(prop, "-supply"))
> > > +               return NULL;
> > > +
> > > +       return of_parse_phandle(np, prop, 0);
> > > +}
> > > +
> > > +/**
> > > + * struct supplier_bindings - Information for parsing supplier DT binding
> > > + *
> > > + * @parse_prop:                If the function cannot parse the property, return NULL.
> > > + *                     Otherwise, return the phandle listed in the property
> > > + *                     that corresponds to index i.
> > > + */
> > > +struct supplier_bindings {
> > > +       struct device_node *(*parse_prop)(struct device_node *np,
> > > +                                         const char *name, int i);
> > > +};
> > > +
> > > +struct supplier_bindings bindings[] = {
> >
> > static const
>
> Will do.
>
> >
> > > +       { .parse_prop = parse_clocks, },
> > > +       { .parse_prop = parse_interconnects, },
> > > +       { .parse_prop = parse_regulators, },
> > > +       { },
> > > +};
> > > +
> > > +static bool of_link_property(struct device *dev, struct device_node *con_np,
> > > +                            const char *prop)
> > > +{
> > > +       struct device_node *phandle;
> > > +       struct supplier_bindings *s = bindings;
> > > +       unsigned int i = 0;
> > > +       bool done = true;
> > > +
> > > +       while (!i && s->parse_prop) {
> >
> > Using 'i' is a little odd. Perhaps a 'matched' bool would be easier to read.
>
> That's how I wrote it first (locally) and then redid it this way
> because the bool felt very superfluous. I don't think this is that
> hard to understand.

Alright...

> > > +               while ((phandle = s->parse_prop(con_np, prop, i))) {
> > > +                       i++;
> > > +                       if (of_link_to_phandle(dev, phandle))
> > > +                               done = false;
> >
> > Just return here. No point in continuing as 'done' is never set back to true.
>
> Actually, there is a point for this. Say Device-C depends on suppliers
> Device-S1 and Device-S2 and they are listed in DT in that order.
>
> Say, S1 gets populated after late_initcall_sync but S2 probes way
> before that. If I don't continue past a "failed linking" to S1 and
> also link up to S2, then S2 will get a sync_state() callback before C
> is probed. So I have to go through all possible suppliers and link as
> many as possible.
>
> Let me add a comment about this somewhere in the code (probably the
> header that defines the add_links() ops).

Okay, makes sense.
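
For illustration only, a minimal userspace sketch of the behaviour described
above: every supplier is attempted even after one fails, and the overall
result only records whether all links were made. This is not kernel code;
struct fake_supplier, link_to_supplier() and link_all_suppliers() are
hypothetical names, and the 'present' flag is an assumed stand-in for
"the supplier device has been populated".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical supplier record. */
struct fake_supplier {
	bool present;	/* models whether the supplier device exists yet */
	bool linked;	/* set once a link to this supplier has been made */
};

/* Returns 0 on success, -1 if the supplier is not populated yet. */
static int link_to_supplier(struct fake_supplier *sup)
{
	assert(sup);
	if (!sup->present)
		return -1;
	sup->linked = true;
	return 0;
}

/* Attempt every supplier; return true only if all links were made. */
static bool link_all_suppliers(struct fake_supplier *sups, size_t n)
{
	bool done = true;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Keep going after a failure so later suppliers still get linked. */
		if (link_to_supplier(&sups[i]))
			done = false;
	}
	return done;
}
```

In the S1/S2 case, S1 is missing and S2 is present: the call reports "not
done", but S2 is linked anyway, which is what keeps S2 from being treated as
having no pending consumers.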

Rob


* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
From: Dave Chiluk @ 2019-07-23 22:12 UTC
  To: Phil Auld
  Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc
In-Reply-To: <20190723171307.GC2947@lorien.usersys.redhat.com>

Thanks for all the help and testing you provided.  It's good to know
these changes have passed at least some scheduler regression tests.
If it comes to a v7 I'll add the Reviewed-by, otherwise I'll just let
Peter add it.

Will you be handling the backport into the RHEL 8 kernels?  I'll
submit this to Ubuntu and linux-stable once it gets accepted.

Thanks again,


On Tue, Jul 23, 2019 at 12:13 PM Phil Auld <pauld@redhat.com> wrote:
>
> Hi Dave,
>
> On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> > It has been observed, that highly-threaded, non-cpu-bound applications
> > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > periods throttled while simultaneously not consuming the allocated
> > amount of quota. This use case is typical of user-interactive non-cpu
> > bound applications, such as those running in kubernetes or mesos when
> > run on multiple cpu cores.
> >
> > This has been root caused to cpu-local run queue being allocated per cpu
> > bandwidth slices, and then not fully using that slice within the period.
> > At which point the slice and quota expires. This expiration of unused
> > slice results in applications not being able to utilize the quota for
> > which they are allocated.
> >
> > The non-expiration of per-cpu slices was recently fixed by
> > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > condition")'. Prior to that it appears that this had been broken since
> > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> > added the following conditional which resulted in slices never being
> > expired.
> >
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> >       /* extend local deadline, drift is bounded above by 2 ticks */
> >       cfs_rq->runtime_expires += TICK_NSEC;
> >
> > Because this was broken for nearly 5 years, and has recently been fixed
> > and is now being noticed by many users running kubernetes
> > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> > that the mechanisms around expiring runtime should be removed
> > altogether.
> >
> > This allows quota already allocated to per-cpu run-queues to live longer
> > than the period boundary. This allows threads on runqueues that do not
> > use much CPU to continue to use their remaining slice over a longer
> > period of time than cpu.cfs_period_us. However, this helps prevent the
> > above condition of hitting throttling while also not fully utilizing
> > your cpu quota.
> >
> > This theoretically allows a machine to use slightly more than its
> > allotted quota in some periods. This overflow would be bounded by the
> > remaining quota left on each per-cpu runqueue. This is typically no
> > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> > change nothing, as they should theoretically fully utilize all of their
> > quota in each period. For user-interactive tasks as described above this
> > provides a much better user/application experience as their cpu
> > utilization will more closely match the amount they requested when they
> > hit throttling. This means that cpu limits no longer strictly apply per
> > period for non-cpu bound applications, but that they are still accurate
> > over longer timeframes.
> >
> > This greatly improves performance of high-thread-count, non-cpu bound
> > applications with low cfs_quota_us allocation on high-core-count
> > machines. In the case of an artificial testcase (10ms/100ms of quota on
> > 80 CPU machine), this commit resulted in almost 30x performance
> > improvement, while still maintaining correct cpu quota restrictions.
> > That testcase is available at https://github.com/indeedeng/fibtest.
> >
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> > Reviewed-by: Ben Segall <bsegall@google.com>
>
> This still works for me. The documentation reads pretty well, too. Good job.
>
> Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@redhat.com>.
>
> I'll run it through some more tests when I have time. The code is the same
> as the earlier one I tested from what I can see.
>
> Cheers,
> Phil
>
> > ---
> >  Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++-------
> >  kernel/sched/fair.c                   | 72 ++++------------------------------
> >  kernel/sched/sched.h                  |  4 --
> >  3 files changed, 67 insertions(+), 83 deletions(-)
> >
> > diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
> > index 3a90642..9801d6b 100644
> > --- a/Documentation/scheduler/sched-bwc.rst
> > +++ b/Documentation/scheduler/sched-bwc.rst
> > @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
> >  specification of the maximum CPU bandwidth available to a group or hierarchy.
> >
> >  The bandwidth allowed for a group is specified using a quota and period. Within
> > -each given "period" (microseconds), a group is allowed to consume only up to
> > -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> > -group exceeds this limit (for that period), the tasks belonging to its
> > -hierarchy will be throttled and are not allowed to run again until the next
> > -period.
> > -
> > -A group's unused runtime is globally tracked, being refreshed with quota units
> > -above at each period boundary.  As threads consume this bandwidth it is
> > -transferred to cpu-local "silos" on a demand basis.  The amount transferred
> > +each given "period" (microseconds), a task group is allocated up to "quota"
> > +microseconds of CPU time. That quota is assigned to per-cpu run queues in
> > +slices as threads in the cgroup become runnable. Once all quota has been
> > +assigned any additional requests for quota will result in those threads being
> > +throttled. Throttled threads will not be able to run again until the next
> > +period when the quota is replenished.
> > +
> > +A group's unassigned quota is globally tracked, being refreshed back to
> > +cfs_quota units at each period boundary. As threads consume this bandwidth it
> > +is transferred to cpu-local "silos" on a demand basis. The amount transferred
> >  within each of these updates is tunable and described as the "slice".
> >
> >  Management
> > @@ -35,12 +36,12 @@ The default values are::
> >
> >  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
> >  bandwidth restriction in place, such a group is described as an unconstrained
> > -bandwidth group.  This represents the traditional work-conserving behavior for
> > +bandwidth group. This represents the traditional work-conserving behavior for
> >  CFS.
> >
> >  Writing any (valid) positive value(s) will enact the specified bandwidth limit.
> > -The minimum quota allowed for the quota or period is 1ms.  There is also an
> > -upper bound on the period length of 1s.  Additional restrictions exist when
> > +The minimum quota allowed for the quota or period is 1ms. There is also an
> > +upper bound on the period length of 1s. Additional restrictions exist when
> >  bandwidth limits are used in a hierarchical fashion, these are explained in
> >  more detail below.
> >
> > @@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
> >  System wide settings
> >  --------------------
> >  For efficiency run-time is transferred between the global pool and CPU local
> > -"silos" in a batch fashion.  This greatly reduces global accounting pressure
> > -on large systems.  The amount transferred each time such an update is required
> > +"silos" in a batch fashion. This greatly reduces global accounting pressure
> > +on large systems. The amount transferred each time such an update is required
> >  is described as the "slice".
> >
> >  This is tunable via procfs::
> > @@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
> >  In case b) above, even though the child may have runtime remaining it will not
> >  be allowed to until the parent's runtime is refreshed.
> >
> > +CFS Bandwidth Quota Caveats
> > +---------------------------
> > +Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
> > +the slice may be returned to the global pool if all threads on that cpu become
> > +unrunnable. This is configured at compile time by the min_cfs_rq_runtime
> > +variable. This is a performance tweak that helps prevent added contention on
> > +the global lock.
> > +
> > +The fact that cpu-local slices do not expire results in some interesting corner
> > +cases that should be understood.
> > +
> > +For cgroup cpu constrained applications that are cpu limited this is a
> > +relatively moot point because they will naturally consume the entirety of their
> > +quota as well as the entirety of each cpu-local slice in each period. As a
> > +result it is expected that nr_periods roughly equal nr_throttled, and that
> > +cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
> > +
> > +For highly-threaded, non-cpu bound applications this non-expiration nuance
> > +allows applications to briefly burst past their quota limits by the amount of
> > +unused slice on each cpu that the task group is running on (typically at most
> > +1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
> > +applies if quota had been assigned to a cpu and then not fully used or returned
> > +in previous periods. This burst amount will not be transferred between cores.
> > +As a result, this mechanism still strictly limits the task group to quota
> > +average usage, albeit over a longer time window than a single period.  This
> > +also limits the burst ability to no more than 1ms per cpu.  This provides
> > +better more predictable user experience for highly threaded applications with
> > +small quota limits on high core count machines. It also eliminates the
> > +propensity to throttle these applications while simultaneously using less than
> > +quota amounts of cpu. Another way to say this, is that by allowing the unused
> > +portion of a slice to remain valid across periods we have decreased the
> > +possibility of wastefully expiring quota on cpu-local silos that don't need a
> > +full slice's amount of cpu time.
> > +
> > +The interaction between cpu-bound and non-cpu-bound-interactive applications
> > +should also be considered, especially when single core usage hits 100%. If you
> > +gave each of these applications half of a cpu-core and they both got scheduled
> > +on the same CPU it is theoretically possible that the non-cpu bound application
> > +will use up to 1ms additional quota in some periods, thereby preventing the
> > +cpu-bound application from fully using its quota by that same amount. In these
> > +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
> > +decide which application is chosen to run, as they will both be runnable and
> > +have remaining quota. This runtime discrepancy will be made up in the following
> > +periods when the interactive application idles.
> > +
> >  Examples
> >  --------
> >  1. Limit a group to 1 CPU worth of runtime::
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 036be95..00b68f0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >
> >       now = sched_clock_cpu(smp_processor_id());
> >       cfs_b->runtime = cfs_b->quota;
> > -     cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> > -     cfs_b->expires_seq++;
> >  }
> >
> >  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> > @@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >  {
> >       struct task_group *tg = cfs_rq->tg;
> >       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> > -     u64 amount = 0, min_amount, expires;
> > -     int expires_seq;
> > +     u64 amount = 0, min_amount;
> >
> >       /* note: this is a positive sum as runtime_remaining <= 0 */
> >       min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> > @@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >                       cfs_b->idle = 0;
> >               }
> >       }
> > -     expires_seq = cfs_b->expires_seq;
> > -     expires = cfs_b->runtime_expires;
> >       raw_spin_unlock(&cfs_b->lock);
> >
> >       cfs_rq->runtime_remaining += amount;
> > -     /*
> > -      * we may have advanced our local expiration to account for allowed
> > -      * spread between our sched_clock and the one on which runtime was
> > -      * issued.
> > -      */
> > -     if (cfs_rq->expires_seq != expires_seq) {
> > -             cfs_rq->expires_seq = expires_seq;
> > -             cfs_rq->runtime_expires = expires;
> > -     }
> >
> >       return cfs_rq->runtime_remaining > 0;
> >  }
> >
> > -/*
> > - * Note: This depends on the synchronization provided by sched_clock and the
> > - * fact that rq->clock snapshots this value.
> > - */
> > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> > -{
> > -     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> > -
> > -     /* if the deadline is ahead of our clock, nothing to do */
> > -     if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> > -             return;
> > -
> > -     if (cfs_rq->runtime_remaining < 0)
> > -             return;
> > -
> > -     /*
> > -      * If the local deadline has passed we have to consider the
> > -      * possibility that our sched_clock is 'fast' and the global deadline
> > -      * has not truly expired.
> > -      *
> > -      * Fortunately we can check determine whether this the case by checking
> > -      * whether the global deadline(cfs_b->expires_seq) has advanced.
> > -      */
> > -     if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> > -             /* extend local deadline, drift is bounded above by 2 ticks */
> > -             cfs_rq->runtime_expires += TICK_NSEC;
> > -     } else {
> > -             /* global deadline is ahead, expiration has passed */
> > -             cfs_rq->runtime_remaining = 0;
> > -     }
> > -}
> > -
> >  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
> >  {
> >       /* dock delta_exec before expiring quota (as it could span periods) */
> >       cfs_rq->runtime_remaining -= delta_exec;
> > -     expire_cfs_rq_runtime(cfs_rq);
> >
> >       if (likely(cfs_rq->runtime_remaining > 0))
> >               return;
> > @@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> >               resched_curr(rq);
> >  }
> >
> > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> > -             u64 remaining, u64 expires)
> > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
> >  {
> >       struct cfs_rq *cfs_rq;
> >       u64 runtime;
> > @@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> >               remaining -= runtime;
> >
> >               cfs_rq->runtime_remaining += runtime;
> > -             cfs_rq->runtime_expires = expires;
> >
> >               /* we check whether we're throttled above */
> >               if (cfs_rq->runtime_remaining > 0)
> > @@ -4650,7 +4601,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> >   */
> >  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
> >  {
> > -     u64 runtime, runtime_expires;
> > +     u64 runtime;
> >       int throttled;
> >
> >       /* no need to continue the timer with no bandwidth constraint */
> > @@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> >       /* account preceding periods in which throttling occurred */
> >       cfs_b->nr_throttled += overrun;
> >
> > -     runtime_expires = cfs_b->runtime_expires;
> > -
> >       /*
> >        * This check is repeated as we are holding onto the new bandwidth while
> >        * we unthrottle. This can potentially race with an unthrottled group
> > @@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
> >               cfs_b->distribute_running = 1;
> >               raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> >               /* we can't nest cfs_b->lock while distributing bandwidth */
> > -             runtime = distribute_cfs_runtime(cfs_b, runtime,
> > -                                              runtime_expires);
> > +             runtime = distribute_cfs_runtime(cfs_b, runtime);
> >               raw_spin_lock_irqsave(&cfs_b->lock, flags);
> >
> >               cfs_b->distribute_running = 0;
> > @@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >               return;
> >
> >       raw_spin_lock(&cfs_b->lock);
> > -     if (cfs_b->quota != RUNTIME_INF &&
> > -         cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> > +     if (cfs_b->quota != RUNTIME_INF) {
> >               cfs_b->runtime += slack_runtime;
> >
> >               /* we are under rq->lock, defer unthrottling using a timer */
> > @@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >  {
> >       u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> >       unsigned long flags;
> > -     u64 expires;
> >
> >       /* confirm we're still not at a refresh boundary */
> >       raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > @@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >       if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> >               runtime = cfs_b->runtime;
> >
> > -     expires = cfs_b->runtime_expires;
> >       if (runtime)
> >               cfs_b->distribute_running = 1;
> >
> > @@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >       if (!runtime)
> >               return;
> >
> > -     runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> > +     runtime = distribute_cfs_runtime(cfs_b, runtime);
> >
> >       raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > -     if (expires == cfs_b->runtime_expires)
> > -             lsub_positive(&cfs_b->runtime, runtime);
> > +     lsub_positive(&cfs_b->runtime, runtime);
> >       cfs_b->distribute_running = 0;
> >       raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> >  }
> > @@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >
> >       cfs_b->period_active = 1;
> >       overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> > -     cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> > -     cfs_b->expires_seq++;
> >       hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> >  }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 802b1f3..28c16e9 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -335,8 +335,6 @@ struct cfs_bandwidth {
> >       u64                     quota;
> >       u64                     runtime;
> >       s64                     hierarchical_quota;
> > -     u64                     runtime_expires;
> > -     int                     expires_seq;
> >
> >       u8                      idle;
> >       u8                      period_active;
> > @@ -556,8 +554,6 @@ struct cfs_rq {
> >
> >  #ifdef CONFIG_CFS_BANDWIDTH
> >       int                     runtime_enabled;
> > -     int                     expires_seq;
> > -     u64                     runtime_expires;
> >       s64                     runtime_remaining;
> >
> >       u64                     throttled_clock;
> > --
> > 1.8.3.1
> >
>
> --


* Re: [PATCH 2/2] kernel-doc: core-api: Include string.h into core-api
From: Kees Cook @ 2019-07-23 21:28 UTC (permalink / raw)
  To: Joe Perches
  Cc: Linus Torvalds, linux-kernel, Jonathan Corbet, Stephen Kitt,
	Nitin Gote, jannh, kernel-hardening, Rasmus Villemoes,
	Andrew Morton, linux-doc
In-Reply-To: <224a6ebf39955f4107c0c376d66155d970e46733.1563841972.git.joe@perches.com>

On Mon, Jul 22, 2019 at 05:38:16PM -0700, Joe Perches wrote:
> core-api should show all the various string functions including the
> newly added stracpy and stracpy_pad.
> 
> Miscellanea:
> 
> o Update the Returns: value for strscpy
> o fix a defect with %NUL)
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  Documentation/core-api/kernel-api.rst |  3 +++
>  include/linux/string.h                |  5 +++--
>  lib/string.c                          | 10 ++++++----
>  3 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
> index 08af5caf036d..f77de49b1d51 100644
> --- a/Documentation/core-api/kernel-api.rst
> +++ b/Documentation/core-api/kernel-api.rst
> @@ -42,6 +42,9 @@ String Manipulation
>  .. kernel-doc:: lib/string.c
>     :export:
>  
> +.. kernel-doc:: include/linux/string.h
> +   :internal:
> +
>  .. kernel-doc:: mm/util.c
>     :functions: kstrdup kstrdup_const kstrndup kmemdup kmemdup_nul memdup_user
>                 vmemdup_user strndup_user memdup_user_nul
> diff --git a/include/linux/string.h b/include/linux/string.h
> index f80b0973f0e5..329188fffc11 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -515,8 +515,9 @@ static inline void memcpy_and_pad(void *dest, size_t dest_len,
>   * But this can lead to bugs due to typos, or if prefix is a pointer
>   * and not a constant. Instead use str_has_prefix().
>   *
> - * Returns: 0 if @str does not start with @prefix
> -         strlen(@prefix) if @str does start with @prefix
> + * Returns:
> + * * strlen(@prefix) if @str starts with @prefix
> + * * 0 if @str does not start with @prefix
>   */
>  static __always_inline size_t str_has_prefix(const char *str, const char *prefix)
>  {
> diff --git a/lib/string.c b/lib/string.c
> index 461fb620f85f..53582b6dce2a 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -173,8 +173,9 @@ EXPORT_SYMBOL(strlcpy);
>   * doesn't unnecessarily force the tail of the destination buffer to be
>   * zeroed.  If zeroing is desired please use strscpy_pad().
>   *
> - * Return: The number of characters copied (not including the trailing
> - *         %NUL) or -E2BIG if the destination buffer wasn't big enough.
> + * Returns:
> + * * The number of characters copied (not including the trailing %NUL)
> + * * -E2BIG if count is 0.
>   */
>  ssize_t strscpy(char *dest, const char *src, size_t count)
>  {
> @@ -253,8 +254,9 @@ EXPORT_SYMBOL(strscpy);
>   * For full explanation of why you may want to consider using the
>   * 'strscpy' functions please see the function docstring for strscpy().
>   *
> - * Return: The number of characters copied (not including the trailing
> - *         %NUL) or -E2BIG if the destination buffer wasn't big enough.
> + * Returns:
> + * * The number of characters copied (not including the trailing %NUL)
> + * * -E2BIG if count is 0.
>   */
>  ssize_t strscpy_pad(char *dest, const char *src, size_t count)
>  {
> -- 
> 2.15.0
> 
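
For readers following the Returns: changes quoted above, here is a rough
userspace sketch of the strscpy() return convention. sketch_strscpy() is a
hypothetical name and this is not the kernel implementation. It also
assumes, from my reading of lib/string.c, that truncation of the source
yields -E2BIG in addition to the count == 0 case listed in the new text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Hypothetical userspace illustration of the return convention only.
 * Copies at most count - 1 bytes and always NUL-terminates when count > 0.
 */
static ssize_t sketch_strscpy(char *dest, const char *src, size_t count)
{
	size_t i;

	assert(dest && src);

	if (count == 0)
		return -E2BIG;

	for (i = 0; i < count - 1 && src[i] != '\0'; i++)
		dest[i] = src[i];
	dest[i] = '\0';

	/* Stopped because dest filled up rather than src ending: truncated. */
	return src[i] == '\0' ? (ssize_t)i : -E2BIG;
}
```

With a 4-byte buffer, copying "abc" returns 3, while copying "abcd" leaves
"abc" in the buffer and returns -E2BIG.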

-- 
Kees Cook


* Re: [PATCH v6 3/7] of/platform: Add functional dependency link from DT bindings
From: Saravana Kannan @ 2019-07-23 20:48 UTC (permalink / raw)
  To: Rob Herring
  Cc: Mark Rutland, Greg Kroah-Hartman, Rafael J. Wysocki, Frank Rowand,
	Jonathan Corbet,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	linux-kernel@vger.kernel.org, David Collins, Android Kernel Team,
	Linux Doc Mailing List
In-Reply-To: <CAL_JsqK9GTxxxjhhWwqxOW9XERFziu2O71ETV2RhXb7B1WFY2g@mail.gmail.com>

On Tue, Jul 23, 2019 at 11:06 AM Rob Herring <robh+dt@kernel.org> wrote:
>
> On Sat, Jul 20, 2019 at 12:17 AM Saravana Kannan <saravanak@google.com> wrote:
> >
> > Add device-links after the devices are created (but before they are
> > probed) by looking at common DT bindings like clocks and
> > interconnects.
>
> The structure now looks a lot better to me. A few minor things below.

Thanks.

> >
> > Automatically adding device-links for functional dependencies at the
> > framework level provides the following benefits:
> >
> > - Optimizes device probe order and avoids the useless work of
> >   attempting probes of devices that will not probe successfully
> >   (because their suppliers aren't present or haven't probed yet).
> >
> >   For example, in a commonly available mobile SoC, registering just
> >   one consumer device's driver at an initcall level earlier than the
> >   supplier device's driver causes 11 failed probe attempts before the
> >   consumer device probes successfully. This was with a kernel with all
> >   the drivers statically compiled in. This problem gets a lot worse if
> >   all the drivers are loaded as modules without direct symbol
> >   dependencies.
> >
> > - Supplier devices like clock providers, interconnect providers, etc
> >   need to keep the resources they provide active and at a particular
> >   state(s) during boot up even if their current set of consumers don't
> >   request the resource to be active. This is because the rest of the
> >   consumers might not have probed yet and turning off the resource
> >   before all the consumers have probed could lead to a hang or
> >   undesired user experience.
> >
> >   Some frameworks (Eg: regulator) handle this today by turning off
> >   "unused" resources at late_initcall_sync and hoping all the devices
> >   have probed by then. This is not a valid assumption for systems with
> >   loadable modules. Other frameworks (Eg: clock) just don't handle
> >   this due to the lack of a clear signal for when they can turn off
> >   resources. This leads to downstream hacks to handle cases like this
> >   that can easily be solved in the upstream kernel.
> >
> >   By linking devices before they are probed, we give suppliers a clear
> >   count of the number of dependent consumers. Once all of the
> >   consumers are active, the suppliers can turn off the unused
> >   resources without making assumptions about the number of consumers.
> >
> > By default we just add device-links to track "driver presence" (probe
> > succeeded) of the supplier device. If any other functionality provided
> > by device-links are needed, it is left to the consumer/supplier
> > devices to change the link when they probe.
> >
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   5 +
> >  drivers/of/platform.c                         | 158 ++++++++++++++++++
> >  2 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 138f6664b2e2..109b4310844f 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3141,6 +3141,11 @@
> >                         This can be set from sysctl after boot.
> >                         See Documentation/sysctl/vm.txt for details.
> >
> > +       of_devlink      [KNL] Make device links from common DT bindings. Useful
> > +                       for optimizing probe order and making sure resources
> > +                       aren't turned off before the consumer devices have
> > +                       probed.
> > +
> >         ohci1394_dma=early      [HW] enable debugging via the ohci1394 driver.
> >                         See Documentation/debugging-via-ohci1394.txt for more
> >                         info.
> > diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> > index 04ad312fd85b..88a2086e26fa 100644
> > --- a/drivers/of/platform.c
> > +++ b/drivers/of/platform.c
> > @@ -509,6 +509,163 @@ int of_platform_default_populate(struct device_node *root,
> >  }
> >  EXPORT_SYMBOL_GPL(of_platform_default_populate);
> >
> > +bool of_link_is_valid(struct device_node *con, struct device_node *sup)
> > +{
> > +       of_node_get(sup);
> > +       /*
> > +        * Don't allow linking a device node as a consumer of one of its
> > +        * descendant nodes. By definition, a child node can't be a functional
> > +        * dependency for the parent node.
> > +        */
> > +       while (sup) {
> > +               if (sup == con) {
> > +                       of_node_put(sup);
> > +                       return false;
> > +               }
> > +               sup = of_get_next_parent(sup);
> > +       }
> > +       return true;
> > +}
> > +
> > +static int of_link_to_phandle(struct device *dev, struct device_node *sup_np)
> > +{
> > +       struct platform_device *sup_dev;
> > +       u32 dl_flags = DL_FLAG_AUTOPROBE_CONSUMER;
> > +       int ret = 0;
> > +
> > +       /*
> > +        * Since we are trying to create device links, we need to find
> > +        * the actual device node that owns this supplier phandle.
> > +        * Often times it's the same node, but sometimes it can be one
> > +        * of the parents. So walk up the parent till you find a
> > +        * device.
> > +        */
> > +       while (sup_np && !of_find_property(sup_np, "compatible", NULL))
> > +               sup_np = of_get_next_parent(sup_np);
> > +       if (!sup_np)
> > +               return 0;
> > +
> > +       if (!of_link_is_valid(dev->of_node, sup_np)) {
> > +               of_node_put(sup_np);
> > +               return 0;
> > +       }
> > +       sup_dev = of_find_device_by_node(sup_np);
> > +       of_node_put(sup_np);
> > +       if (!sup_dev)
> > +               return -ENODEV;
> > +       if (!device_link_add(dev, &sup_dev->dev, dl_flags))
> > +               ret = -ENODEV;
> > +       put_device(&sup_dev->dev);
> > +       return ret;
> > +}
> > +
> > +static struct device_node *parse_prop_cells(struct device_node *np,
> > +                                           const char *prop, int i,
>
> I like 'i' for for loops, but less so for function params. Perhaps
> 'index' instead like of_parse_phandle_with_args.

Sounds good.

>
> > +                                           const char *binding,
> > +                                           const char *cell)
> > +{
> > +       struct of_phandle_args sup_args;
> > +
> > +       if (!i && strcmp(prop, binding))
>
> Why the '!i' test?

To avoid a string comparison for every index. It's kinda wasteful once
the first index passes.
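
Roughly, as a standalone sketch (sketch_should_parse() and the counter are
made-up names for illustration, not part of the patch), the idea is that
the property name only needs checking once, at index 0:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counter, only here to make the saved comparisons visible. */
static unsigned int sketch_name_compares;

/*
 * Returns 1 if entry 'index' of 'prop' should be parsed as 'binding',
 * 0 if 'prop' is some other property. The caller is assumed to ask for
 * index N + 1 only after index N succeeded, as the inner loop in
 * of_link_property() does.
 */
static int sketch_should_parse(const char *prop, const char *binding,
			       int index)
{
	assert(prop && binding && index >= 0);

	if (index == 0) {
		sketch_name_compares++;
		if (strcmp(prop, binding))
			return 0;
	}
	/* index > 0: the name already matched when index was 0 */
	return 1;
}
```

Walking three entries of a "clocks" property this way costs one strcmp()
instead of three.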

> > +               return NULL;
> > +
> > +       if (of_parse_phandle_with_args(np, binding, cell, i, &sup_args))
> > +               return NULL;
> > +
> > +       return sup_args.np;
> > +}
> > +
> > +static struct device_node *parse_clocks(struct device_node *np,
> > +                                       const char *prop, int i)
> > +{
> > +       return parse_prop_cells(np, prop, i, "clocks", "#clock-cells");
> > +}
> > +
> > +static struct device_node *parse_interconnects(struct device_node *np,
> > +                                              const char *prop, int i)
> > +{
> > +       return parse_prop_cells(np, prop, i, "interconnects",
> > +                               "#interconnect-cells");
> > +}
> > +
> > +static int strcmp_suffix(const char *str, const char *suffix)
> > +{
> > +       unsigned int len, suffix_len;
> > +
> > +       len = strlen(str);
> > +       suffix_len = strlen(suffix);
> > +       if (len <= suffix_len)
> > +               return -1;
> > +       return strcmp(str + len - suffix_len, suffix);
> > +}
> > +
> > +static struct device_node *parse_regulators(struct device_node *np,
> > +                                           const char *prop, int i)
> > +{
> > +       if (i || strcmp_suffix(prop, "-supply"))
> > +               return NULL;
> > +
> > +       return of_parse_phandle(np, prop, 0);
> > +}
> > +
> > +/**
> > + * struct supplier_bindings - Information for parsing supplier DT binding
> > + *
> > + * @parse_prop:                If the function cannot parse the property, return NULL.
> > + *                     Otherwise, return the phandle listed in the property
> > + *                     that corresponds to index i.
> > + */
> > +struct supplier_bindings {
> > +       struct device_node *(*parse_prop)(struct device_node *np,
> > +                                         const char *name, int i);
> > +};
> > +
> > +struct supplier_bindings bindings[] = {
>
> static const

Will do.

>
> > +       { .parse_prop = parse_clocks, },
> > +       { .parse_prop = parse_interconnects, },
> > +       { .parse_prop = parse_regulators, },
> > +       { },
> > +};
> > +
> > +static bool of_link_property(struct device *dev, struct device_node *con_np,
> > +                            const char *prop)
> > +{
> > +       struct device_node *phandle;
> > +       struct supplier_bindings *s = bindings;
> > +       unsigned int i = 0;
> > +       bool done = true;
> > +
> > +       while (!i && s->parse_prop) {
>
> Using 'i' is a little odd. Perhaps a 'matched' bool would be easier to read.

That's how I wrote it first (locally) and then redid it this way
because the bool felt very superfluous. I don't think this is that
hard to understand.

> > +               while ((phandle = s->parse_prop(con_np, prop, i))) {
> > +                       i++;
> > +                       if (of_link_to_phandle(dev, phandle))
> > +                               done = false;
>
> Just return here. No point in continuing as 'done' is never set back to true.

Actually, there is a point for this. Say Device-C depends on suppliers
Device-S1 and Device-S2 and they are listed in DT in that order.

Say S1 gets populated after late_initcall_sync but S2 is probed way
before that. If I stop at the "failed linking" to S1 and never link up
to S2, then S2 will get a sync_state() callback before C is probed. So
I have to go through all possible suppliers and link to as many as
possible.

Let me add a comment about this somewhere in the code (probably the
header that defines the add_links() ops).
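
To make the S1/S2 case concrete, here is a rough userspace sketch of the
"record the failure but keep going" shape. The struct and the sketch_*
helpers are hypothetical stand-ins for illustration, not kernel APIs;
sketch_link() only imitates of_link_to_phandle() failing when the supplier
device does not exist yet.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a supplier device; not a kernel structure. */
struct sketch_supplier {
	bool populated;	/* device exists, so a link can be created */
	bool linked;	/* set once the consumer is linked to it */
};

/* Imitates a link attempt: fails while the supplier is not populated. */
static int sketch_link(struct sketch_supplier *sup)
{
	if (!sup->populated)
		return -ENODEV;
	sup->linked = true;
	return 0;
}

/*
 * Try every supplier even after one fails, and report the failure only
 * at the end, so the suppliers that do exist still get their links.
 */
static int sketch_link_all(struct sketch_supplier *sups, size_t n)
{
	bool done = true;
	size_t i;

	assert(sups || n == 0);

	for (i = 0; i < n; i++)
		if (sketch_link(&sups[i]))
			done = false;

	return done ? 0 : -ENODEV;
}
```

With S1 not populated and S2 populated, the call returns -ENODEV but S2
still ends up linked, which is what keeps S2 from being treated as having
no pending consumers. Returning at the first failure would leave S2
unlinked.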

-Saravana

> > +               }
> > +               s++;
> > +       }
> > +       return done ? 0 : -ENODEV;
> > +}
> > +
> > +static bool of_devlink;
> > +core_param(of_devlink, of_devlink, bool, 0);
> > +
> > +static int of_link_to_suppliers(struct device *dev)
> > +{
> > +       struct property *p;
> > +       bool done = true;
> > +
> > +       if (!of_devlink)
> > +               return 0;
> > +       if (unlikely(!dev->of_node))
> > +               return 0;
> > +
> > +       for_each_property_of_node(dev->of_node, p)
> > +               if (of_link_property(dev, dev->of_node, p->name))
> > +                       done = false;
> > +
> > +       return done ? 0 : -ENODEV;
> > +}
> > +
> >  #ifndef CONFIG_PPC
> >  static const struct of_device_id reserved_mem_matches[] = {
> >         { .compatible = "qcom,rmtfs-mem" },
> > @@ -524,6 +681,7 @@ static int __init of_platform_default_populate_init(void)
> >         if (!of_have_populated_dt())
> >                 return -ENODEV;
> >
> > +       platform_bus_type.add_links = of_link_to_suppliers;
> >         /*
> >          * Handle certain compatibles explicitly, since we don't want to create
> >          * platform_devices for every node in /reserved-memory with a
> > --
> > 2.22.0.657.g960e92d24f-goog
> >


* Re: [PATCH v6 3/7] of/platform: Add functional dependency link from DT bindings
From: Rob Herring @ 2019-07-23 18:06 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Mark Rutland, Greg Kroah-Hartman, Rafael J. Wysocki, Frank Rowand,
	Jonathan Corbet, devicetree, linux-kernel@vger.kernel.org,
	David Collins, Android Kernel Team, Linux Doc Mailing List
In-Reply-To: <20190720061647.234852-4-saravanak@google.com>

On Sat, Jul 20, 2019 at 12:17 AM Saravana Kannan <saravanak@google.com> wrote:
>
> Add device-links after the devices are created (but before they are
> probed) by looking at common DT bindings like clocks and
> interconnects.

The structure now looks a lot better to me. A few minor things below.

>
> Automatically adding device-links for functional dependencies at the
> framework level provides the following benefits:
>
> - Optimizes device probe order and avoids the useless work of
>   attempting probes of devices that will not probe successfully
>   (because their suppliers aren't present or haven't probed yet).
>
>   For example, in a commonly available mobile SoC, registering just
>   one consumer device's driver at an initcall level earlier than the
>   supplier device's driver causes 11 failed probe attempts before the
>   consumer device probes successfully. This was with a kernel with all
>   the drivers statically compiled in. This problem gets a lot worse if
>   all the drivers are loaded as modules without direct symbol
>   dependencies.
>
> - Supplier devices like clock providers, interconnect providers, etc
>   need to keep the resources they provide active and at a particular
>   state(s) during boot up even if their current set of consumers don't
>   request the resource to be active. This is because the rest of the
>   consumers might not have probed yet and turning off the resource
>   before all the consumers have probed could lead to a hang or
>   undesired user experience.
>
>   Some frameworks (Eg: regulator) handle this today by turning off
>   "unused" resources at late_initcall_sync and hoping all the devices
>   have probed by then. This is not a valid assumption for systems with
>   loadable modules. Other frameworks (Eg: clock) just don't handle
>   this due to the lack of a clear signal for when they can turn off
>   resources. This leads to downstream hacks to handle cases like this
>   that can easily be solved in the upstream kernel.
>
>   By linking devices before they are probed, we give suppliers a clear
>   count of the number of dependent consumers. Once all of the
>   consumers are active, the suppliers can turn off the unused
>   resources without making assumptions about the number of consumers.
>
> By default we just add device-links to track "driver presence" (probe
> succeeded) of the supplier device. If any other functionality provided
> by device-links are needed, it is left to the consumer/supplier
> devices to change the link when they probe.
>
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   5 +
>  drivers/of/platform.c                         | 158 ++++++++++++++++++
>  2 files changed, 163 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 138f6664b2e2..109b4310844f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3141,6 +3141,11 @@
>                         This can be set from sysctl after boot.
>                         See Documentation/sysctl/vm.txt for details.
>
> +       of_devlink      [KNL] Make device links from common DT bindings. Useful
> +                       for optimizing probe order and making sure resources
> +                       aren't turned off before the consumer devices have
> +                       probed.
> +
>         ohci1394_dma=early      [HW] enable debugging via the ohci1394 driver.
>                         See Documentation/debugging-via-ohci1394.txt for more
>                         info.
> diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> index 04ad312fd85b..88a2086e26fa 100644
> --- a/drivers/of/platform.c
> +++ b/drivers/of/platform.c
> @@ -509,6 +509,163 @@ int of_platform_default_populate(struct device_node *root,
>  }
>  EXPORT_SYMBOL_GPL(of_platform_default_populate);
>
> +bool of_link_is_valid(struct device_node *con, struct device_node *sup)
> +{
> +       of_node_get(sup);
> +       /*
> +        * Don't allow linking a device node as a consumer of one of its
> +        * descendant nodes. By definition, a child node can't be a functional
> +        * dependency for the parent node.
> +        */
> +       while (sup) {
> +               if (sup == con) {
> +                       of_node_put(sup);
> +                       return false;
> +               }
> +               sup = of_get_next_parent(sup);
> +       }
> +       return true;
> +}
> +
> +static int of_link_to_phandle(struct device *dev, struct device_node *sup_np)
> +{
> +       struct platform_device *sup_dev;
> +       u32 dl_flags = DL_FLAG_AUTOPROBE_CONSUMER;
> +       int ret = 0;
> +
> +       /*
> +        * Since we are trying to create device links, we need to find
> +        * the actual device node that owns this supplier phandle.
> +        * Often times it's the same node, but sometimes it can be one
> +        * of the parents. So walk up the parent till you find a
> +        * device.
> +        */
> +       while (sup_np && !of_find_property(sup_np, "compatible", NULL))
> +               sup_np = of_get_next_parent(sup_np);
> +       if (!sup_np)
> +               return 0;
> +
> +       if (!of_link_is_valid(dev->of_node, sup_np)) {
> +               of_node_put(sup_np);
> +               return 0;
> +       }
> +       sup_dev = of_find_device_by_node(sup_np);
> +       of_node_put(sup_np);
> +       if (!sup_dev)
> +               return -ENODEV;
> +       if (!device_link_add(dev, &sup_dev->dev, dl_flags))
> +               ret = -ENODEV;
> +       put_device(&sup_dev->dev);
> +       return ret;
> +}
> +
> +static struct device_node *parse_prop_cells(struct device_node *np,
> +                                           const char *prop, int i,

I like 'i' for for loops, but less so for function params. Perhaps
'index' instead like of_parse_phandle_with_args.

> +                                           const char *binding,
> +                                           const char *cell)
> +{
> +       struct of_phandle_args sup_args;
> +
> +       if (!i && strcmp(prop, binding))

Why the '!i' test?

> +               return NULL;
> +
> +       if (of_parse_phandle_with_args(np, binding, cell, i, &sup_args))
> +               return NULL;
> +
> +       return sup_args.np;
> +}
> +
> +static struct device_node *parse_clocks(struct device_node *np,
> +                                       const char *prop, int i)
> +{
> +       return parse_prop_cells(np, prop, i, "clocks", "#clock-cells");
> +}
> +
> +static struct device_node *parse_interconnects(struct device_node *np,
> +                                              const char *prop, int i)
> +{
> +       return parse_prop_cells(np, prop, i, "interconnects",
> +                               "#interconnect-cells");
> +}
> +
> +static int strcmp_suffix(const char *str, const char *suffix)
> +{
> +       unsigned int len, suffix_len;
> +
> +       len = strlen(str);
> +       suffix_len = strlen(suffix);
> +       if (len <= suffix_len)
> +               return -1;
> +       return strcmp(str + len - suffix_len, suffix);
> +}
> +
> +static struct device_node *parse_regulators(struct device_node *np,
> +                                           const char *prop, int i)
> +{
> +       if (i || strcmp_suffix(prop, "-supply"))
> +               return NULL;
> +
> +       return of_parse_phandle(np, prop, 0);
> +}
> +
> +/**
> + * struct supplier_bindings - Information for parsing supplier DT binding
> + *
> + * @parse_prop:                If the function cannot parse the property, return NULL.
> + *                     Otherwise, return the phandle listed in the property
> + *                     that corresponds to index i.
> + */
> +struct supplier_bindings {
> +       struct device_node *(*parse_prop)(struct device_node *np,
> +                                         const char *name, int i);
> +};
> +
> +struct supplier_bindings bindings[] = {

static const

> +       { .parse_prop = parse_clocks, },
> +       { .parse_prop = parse_interconnects, },
> +       { .parse_prop = parse_regulators, },
> +       { },
> +};
> +
> +static bool of_link_property(struct device *dev, struct device_node *con_np,
> +                            const char *prop)
> +{
> +       struct device_node *phandle;
> +       struct supplier_bindings *s = bindings;
> +       unsigned int i = 0;
> +       bool done = true;
> +
> +       while (!i && s->parse_prop) {

Using 'i' is a little odd. Perhaps a 'matched' bool would be easier to read.

> +               while ((phandle = s->parse_prop(con_np, prop, i))) {
> +                       i++;
> +                       if (of_link_to_phandle(dev, phandle))
> +                               done = false;

Just return here. No point in continuing as 'done' is never set back to true.

> +               }
> +               s++;
> +       }
> +       return done ? 0 : -ENODEV;
> +}
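
The two comments on this function can be sketched roughly as below. This is
only an illustration: the struct definitions, the toy parsers and the
demo_link_fails knob are stand-ins invented so the snippet compiles on its
own, and they are not the kernel's definitions. It also returns int so the
errno value survives, whereas the posted version is declared bool. Note that
returning on the first failure means any remaining phandles are not linked.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types, for illustration only. */
struct device_node { const char *name; };
struct device { const char *name; };

struct supplier_bindings {
	struct device_node *(*parse_prop)(struct device_node *np,
					  const char *name, int i);
};

static struct device_node demo_supplier = { "clk-supplier" };
static int demo_link_fails;	/* toy knob: make linking fail */

/* Toy parser: understands only "clocks" and exposes one phandle. */
static struct device_node *parse_clocks(struct device_node *np,
					const char *prop, int i)
{
	(void)np;
	if (i || strcmp(prop, "clocks"))
		return NULL;
	return &demo_supplier;
}

/* Toy parser that never recognises the property. */
static struct device_node *parse_nothing(struct device_node *np,
					 const char *prop, int i)
{
	(void)np; (void)prop; (void)i;
	return NULL;
}

/* Toy replacement for the real linking helper. */
static int of_link_to_phandle(struct device *dev, struct device_node *sup)
{
	(void)dev; (void)sup;
	return demo_link_fails ? -ENODEV : 0;
}

static const struct supplier_bindings bindings[] = {
	{ .parse_prop = parse_nothing },
	{ .parse_prop = parse_clocks },
	{ .parse_prop = NULL },		/* sentinel */
};

static int of_link_property(struct device *dev, struct device_node *con_np,
			    const char *prop)
{
	const struct supplier_bindings *s;
	struct device_node *phandle;
	bool matched = false;
	int i;

	/* Stop trying other bindings once one of them parsed the property. */
	for (s = bindings; !matched && s->parse_prop; s++) {
		for (i = 0; (phandle = s->parse_prop(con_np, prop, i)); i++) {
			matched = true;
			/* Bail out on the first supplier we cannot link to. */
			if (of_link_to_phandle(dev, phandle))
				return -ENODEV;
		}
	}
	return 0;
}
```
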
> +
> +static bool of_devlink;
> +core_param(of_devlink, of_devlink, bool, 0);
> +
> +static int of_link_to_suppliers(struct device *dev)
> +{
> +       struct property *p;
> +       bool done = true;
> +
> +       if (!of_devlink)
> +               return 0;
> +       if (unlikely(!dev->of_node))
> +               return 0;
> +
> +       for_each_property_of_node(dev->of_node, p)
> +               if (of_link_property(dev, dev->of_node, p->name))
> +                       done = false;
> +
> +       return done ? 0 : -ENODEV;
> +}
> +
>  #ifndef CONFIG_PPC
>  static const struct of_device_id reserved_mem_matches[] = {
>         { .compatible = "qcom,rmtfs-mem" },
> @@ -524,6 +681,7 @@ static int __init of_platform_default_populate_init(void)
>         if (!of_have_populated_dt())
>                 return -ENODEV;
>
> +       platform_bus_type.add_links = of_link_to_suppliers;
>         /*
>          * Handle certain compatibles explicitly, since we don't want to create
>          * platform_devices for every node in /reserved-memory with a
> --
> 2.22.0.657.g960e92d24f-goog
>


* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
From: Phil Auld @ 2019-07-23 17:13 UTC (permalink / raw)
  To: Dave Chiluk
  Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc
In-Reply-To: <1563900266-19734-2-git-send-email-chiluk+linux@indeed.com>

Hi Dave,

On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
> 
> This has been root caused to cpu-local run queue being allocated per cpu
> bandwidth slices, and then not fully using that slice within the period.
> At which point the slice and quota expires. This expiration of unused
> slice results in applications not being able to utilize the quota for
> which they are allocated.
> 
> The non-expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this had been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
> 
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> 	/* extend local deadline, drift is bounded above by 2 ticks */
> 	cfs_rq->runtime_expires += TICK_NSEC;
> 
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
> 
> This allows quota already allocated to per-cpu run-queues to live longer
> than the period boundary. This allows threads on runqueues that do not
> use much CPU to continue to use their remaining slice over a longer
> period of time than cpu.cfs_period_us. However, this helps prevent the
> above condition of hitting throttling while also not fully utilizing
> your cpu quota.
> 
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods. This overflow would be bounded by the
> remaining quota left on each per-cpu runqueue. This is typically no
> more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> change nothing, as they should theoretically fully utilize all of their
> quota in each period. For user-interactive tasks as described above this
> provides a much better user/application experience as their cpu
> utilization will more closely match the amount they requested when they
> hit throttling. This means that cpu limits no longer strictly apply per
> period for non-cpu bound applications, but that they are still accurate
> over longer timeframes.
> 
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase (10ms/100ms of quota on
> 80 CPU machine), this commit resulted in almost 30x performance
> improvement, while still maintaining correct cpu quota restrictions.
> That testcase is available at https://github.com/indeedeng/fibtest.
> 
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> Reviewed-by: Ben Segall <bsegall@google.com>

This still works for me. The documentation reads pretty well, too. Good job.

Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@redhat.com>.

I'll run it through some more tests when I have time. The code is the same
as the earlier one I tested from what I can see.

Cheers,
Phil

> ---
>  Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++-------
>  kernel/sched/fair.c                   | 72 ++++------------------------------
>  kernel/sched/sched.h                  |  4 --
>  3 files changed, 67 insertions(+), 83 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
> index 3a90642..9801d6b 100644
> --- a/Documentation/scheduler/sched-bwc.rst
> +++ b/Documentation/scheduler/sched-bwc.rst
> @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
>  specification of the maximum CPU bandwidth available to a group or hierarchy.
>  
>  The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> -
> -A group's unused runtime is globally tracked, being refreshed with quota units
> -above at each period boundary.  As threads consume this bandwidth it is
> -transferred to cpu-local "silos" on a demand basis.  The amount transferred
> +each given "period" (microseconds), a task group is allocated up to "quota"
> +microseconds of CPU time. That quota is assigned to per-cpu run queues in
> +slices as threads in the cgroup become runnable. Once all quota has been
> +assigned any additional requests for quota will result in those threads being
> +throttled. Throttled threads will not be able to run again until the next
> +period when the quota is replenished.
> +
> +A group's unassigned quota is globally tracked, being refreshed back to
> +cfs_quota units at each period boundary. As threads consume this bandwidth it
> +is transferred to cpu-local "silos" on a demand basis. The amount transferred
>  within each of these updates is tunable and described as the "slice".
>  
>  Management
> @@ -35,12 +36,12 @@ The default values are::
>  
>  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
>  bandwidth restriction in place, such a group is described as an unconstrained
> -bandwidth group.  This represents the traditional work-conserving behavior for
> +bandwidth group. This represents the traditional work-conserving behavior for
>  CFS.
>  
>  Writing any (valid) positive value(s) will enact the specified bandwidth limit.
> -The minimum quota allowed for the quota or period is 1ms.  There is also an
> -upper bound on the period length of 1s.  Additional restrictions exist when
> +The minimum quota allowed for the quota or period is 1ms. There is also an
> +upper bound on the period length of 1s. Additional restrictions exist when
>  bandwidth limits are used in a hierarchical fashion, these are explained in
>  more detail below.
>  
> @@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
>  System wide settings
>  --------------------
>  For efficiency run-time is transferred between the global pool and CPU local
> -"silos" in a batch fashion.  This greatly reduces global accounting pressure
> -on large systems.  The amount transferred each time such an update is required
> +"silos" in a batch fashion. This greatly reduces global accounting pressure
> +on large systems. The amount transferred each time such an update is required
>  is described as the "slice".
>  
>  This is tunable via procfs::
> @@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
>  In case b) above, even though the child may have runtime remaining it will not
>  be allowed to until the parent's runtime is refreshed.
>  
> +CFS Bandwidth Quota Caveats
> +---------------------------
> +Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
> +the slice may be returned to the global pool if all threads on that cpu become
> +unrunnable. This is configured at compile time by the min_cfs_rq_runtime
> +variable. This is a performance tweak that helps prevent added contention on
> +the global lock.
> +
> +The fact that cpu-local slices do not expire results in some interesting corner
> +cases that should be understood.
> +
> +For cgroup cpu constrained applications that are cpu limited this is a
> +relatively moot point because they will naturally consume the entirety of their
> +quota as well as the entirety of each cpu-local slice in each period. As a
> +result it is expected that nr_periods roughly equal nr_throttled, and that
> +cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
> +
> +For highly-threaded, non-cpu bound applications this non-expiration nuance
> +allows applications to briefly burst past their quota limits by the amount of
> +unused slice on each cpu that the task group is running on (typically at most
> +1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
> +applies if quota had been assigned to a cpu and then not fully used or returned
> +in previous periods. This burst amount will not be transferred between cores.
> +As a result, this mechanism still strictly limits the task group to quota
> +average usage, albeit over a longer time window than a single period.  This
> +also limits the burst ability to no more than 1ms per cpu.  This provides
> +better more predictable user experience for highly threaded applications with
> +small quota limits on high core count machines. It also eliminates the
> +propensity to throttle these applications while simultaneously using less than
> +quota amounts of cpu. Another way to say this, is that by allowing the unused
> +portion of a slice to remain valid across periods we have decreased the
> +possibility of wastefully expiring quota on cpu-local silos that don't need a
> +full slice's amount of cpu time.
> +
> +The interaction between cpu-bound and non-cpu-bound-interactive applications
> +should also be considered, especially when single core usage hits 100%. If you
> +gave each of these applications half of a cpu-core and they both got scheduled
> +on the same CPU it is theoretically possible that the non-cpu bound application
> +will use up to 1ms additional quota in some periods, thereby preventing the
> +cpu-bound application from fully using its quota by that same amount. In these
> +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
> +decide which application is chosen to run, as they will both be runnable and
> +have remaining quota. This runtime discrepancy will be made up in the following
> +periods when the interactive application idles.
> +
>  Examples
>  --------
>  1. Limit a group to 1 CPU worth of runtime::
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 036be95..00b68f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  
>  	now = sched_clock_cpu(smp_processor_id());
>  	cfs_b->runtime = cfs_b->quota;
> -	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> -	cfs_b->expires_seq++;
>  }
>  
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	struct task_group *tg = cfs_rq->tg;
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> -	u64 amount = 0, min_amount, expires;
> -	int expires_seq;
> +	u64 amount = 0, min_amount;
>  
>  	/* note: this is a positive sum as runtime_remaining <= 0 */
>  	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  			cfs_b->idle = 0;
>  		}
>  	}
> -	expires_seq = cfs_b->expires_seq;
> -	expires = cfs_b->runtime_expires;
>  	raw_spin_unlock(&cfs_b->lock);
>  
>  	cfs_rq->runtime_remaining += amount;
> -	/*
> -	 * we may have advanced our local expiration to account for allowed
> -	 * spread between our sched_clock and the one on which runtime was
> -	 * issued.
> -	 */
> -	if (cfs_rq->expires_seq != expires_seq) {
> -		cfs_rq->expires_seq = expires_seq;
> -		cfs_rq->runtime_expires = expires;
> -	}
>  
>  	return cfs_rq->runtime_remaining > 0;
>  }
>  
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> -	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> -	/* if the deadline is ahead of our clock, nothing to do */
> -	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> -		return;
> -
> -	if (cfs_rq->runtime_remaining < 0)
> -		return;
> -
> -	/*
> -	 * If the local deadline has passed we have to consider the
> -	 * possibility that our sched_clock is 'fast' and the global deadline
> -	 * has not truly expired.
> -	 *
> -	 * Fortunately we can check determine whether this the case by checking
> -	 * whether the global deadline(cfs_b->expires_seq) has advanced.
> -	 */
> -	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> -		/* extend local deadline, drift is bounded above by 2 ticks */
> -		cfs_rq->runtime_expires += TICK_NSEC;
> -	} else {
> -		/* global deadline is ahead, expiration has passed */
> -		cfs_rq->runtime_remaining = 0;
> -	}
> -}
> -
>  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  {
>  	/* dock delta_exec before expiring quota (as it could span periods) */
>  	cfs_rq->runtime_remaining -= delta_exec;
> -	expire_cfs_rq_runtime(cfs_rq);
>  
>  	if (likely(cfs_rq->runtime_remaining > 0))
>  		return;
> @@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  		resched_curr(rq);
>  }
>  
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> -		u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  {
>  	struct cfs_rq *cfs_rq;
>  	u64 runtime;
> @@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  		remaining -= runtime;
>  
>  		cfs_rq->runtime_remaining += runtime;
> -		cfs_rq->runtime_expires = expires;
>  
>  		/* we check whether we're throttled above */
>  		if (cfs_rq->runtime_remaining > 0)
> @@ -4650,7 +4601,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>   */
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
>  {
> -	u64 runtime, runtime_expires;
> +	u64 runtime;
>  	int throttled;
>  
>  	/* no need to continue the timer with no bandwidth constraint */
> @@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  	/* account preceding periods in which throttling occurred */
>  	cfs_b->nr_throttled += overrun;
>  
> -	runtime_expires = cfs_b->runtime_expires;
> -
>  	/*
>  	 * This check is repeated as we are holding onto the new bandwidth while
>  	 * we unthrottle. This can potentially race with an unthrottled group
> @@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  		cfs_b->distribute_running = 1;
>  		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
> -		runtime = distribute_cfs_runtime(cfs_b, runtime,
> -						 runtime_expires);
> +		runtime = distribute_cfs_runtime(cfs_b, runtime);
>  		raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  
>  		cfs_b->distribute_running = 0;
> @@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  		return;
>  
>  	raw_spin_lock(&cfs_b->lock);
> -	if (cfs_b->quota != RUNTIME_INF &&
> -	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> +	if (cfs_b->quota != RUNTIME_INF) {
>  		cfs_b->runtime += slack_runtime;
>  
>  		/* we are under rq->lock, defer unthrottling using a timer */
> @@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>  	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
>  	unsigned long flags;
> -	u64 expires;
>  
>  	/* confirm we're still not at a refresh boundary */
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> @@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
>  		runtime = cfs_b->runtime;
>  
> -	expires = cfs_b->runtime_expires;
>  	if (runtime)
>  		cfs_b->distribute_running = 1;
>  
> @@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (!runtime)
>  		return;
>  
> -	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +	runtime = distribute_cfs_runtime(cfs_b, runtime);
>  
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> -	if (expires == cfs_b->runtime_expires)
> -		lsub_positive(&cfs_b->runtime, runtime);
> +	lsub_positive(&cfs_b->runtime, runtime);
>  	cfs_b->distribute_running = 0;
>  	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  }
> @@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  
>  	cfs_b->period_active = 1;
>  	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> -	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> -	cfs_b->expires_seq++;
>  	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 802b1f3..28c16e9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -335,8 +335,6 @@ struct cfs_bandwidth {
>  	u64			quota;
>  	u64			runtime;
>  	s64			hierarchical_quota;
> -	u64			runtime_expires;
> -	int			expires_seq;
>  
>  	u8			idle;
>  	u8			period_active;
> @@ -556,8 +554,6 @@ struct cfs_rq {
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
>  	int			runtime_enabled;
> -	int			expires_seq;
> -	u64			runtime_expires;
>  	s64			runtime_remaining;
>  
>  	u64			throttled_clock;
> -- 
> 1.8.3.1
> 

-- 


* [PATCH v1] coda: Fix typo in the struct CodaCred documentation
From: Andy Shevchenko @ 2019-07-23 16:57 UTC (permalink / raw)
  To: Jan Harkes, coda, codalist, Jonathan Corbet, linux-doc; +Cc: Andy Shevchenko

Documentation mistakenly refers to a different type while explaining
the contents of the struct CodaCred.

Fix the typo in the struct CodaCred description in the documentation.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
 Documentation/filesystems/coda.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt
index 545262c167c3..1711ad48e38a 100644
--- a/Documentation/filesystems/coda.txt
+++ b/Documentation/filesystems/coda.txt
@@ -421,14 +421,14 @@ kernel support.
 
 
   The CodaCred structure defines a variety of user and group ids as
-  they are set for the calling process. The vuid_t and guid_t are 32 bit
+  they are set for the calling process. The vuid_t and vgid_t are 32 bit
   unsigned integers.  It also defines group membership in an array.  On
   Unix the CodaCred has proven sufficient to implement good security
   semantics for Coda but the structure may have to undergo modification
   for the Windows environment when these mature.
 
   struct CodaCred {
-      vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid*/
+      vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid */
       vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */
       vgid_t cr_groups[NGROUPS];        /* Group membership for caller */
   };
-- 
2.20.1



* [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
From: Dave Chiluk @ 2019-07-23 16:44 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc
In-Reply-To: <1563900266-19734-1-git-send-email-chiluk+linux@indeed.com>

It has been observed that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
throttled periods while simultaneously not consuming their allocated
quota. This use case is typical of user-interactive, non-cpu-bound
applications, such as those running in kubernetes or mesos, when run on
multiple cpu cores.
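
As a rough way to quantify the symptom, the throttled percentage can be
derived from the nr_periods and nr_throttled counters reported in the
cgroup's cpu.stat file. The sketch below parses that text; the helper name
throttled_percent() is made up for illustration and is not part of this
patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of this patch): given the text of a
 * cgroup's cpu.stat file, return the percentage of periods in which the
 * group was throttled, or -1.0 if the counters cannot be found.
 */
static double throttled_percent(const char *cpu_stat)
{
	unsigned long long periods = 0, throttled = 0;
	int have_periods = 0, have_throttled = 0;
	const char *line = cpu_stat;

	while (line && *line) {
		/* Each line has the form "<key> <value>". */
		if (sscanf(line, "nr_periods %llu", &periods) == 1)
			have_periods = 1;
		else if (sscanf(line, "nr_throttled %llu", &throttled) == 1)
			have_throttled = 1;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}

	if (!have_periods || !have_throttled || periods == 0)
		return -1.0;
	return 100.0 * (double)throttled / (double)periods;
}
```

The condition described here shows up as a high value from this calculation
while cpuacct.usage grows by well under cfs_quota_us per period.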

This has been root-caused to cpu-local run queues being allocated per-cpu
bandwidth slices and then not fully using those slices within the
period, at which point the slice and quota expire. This expiration of
unused slices results in applications not being able to utilize the
quota they have been allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, has only recently been
fixed, and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577), it is my
opinion that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. Threads on runqueues that do not use much CPU
can then continue to use their remaining slice over a longer period of
time than cpu.cfs_period_us. This helps prevent the above condition of
hitting throttling while not fully utilizing the cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU-bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience, as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu-bound applications, but that they are still accurate
over longer timeframes.
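
To make the bound concrete, here is a back-of-the-envelope sketch. It
assumes, as stated above, that at most min_cfs_rq_runtime (1ms) of unused
runtime is left behind on each cpu the group has run on. The function name
and macros are illustrative only and do not exist in the kernel.

```c
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Assumption taken from the text above: 1ms may stay behind on each cpu. */
#define MIN_CFS_RQ_RUNTIME_NS (1 * NSEC_PER_MSEC)

/*
 * Hypothetical helper: upper bound on the runtime a group could consume
 * in a single period, i.e. the fresh quota plus whatever was left on
 * each per-cpu runqueue from earlier periods.
 */
static uint64_t max_runtime_in_period_ns(uint64_t quota_ns,
					 unsigned int nr_cpus)
{
	return quota_ns + (uint64_t)nr_cpus * MIN_CFS_RQ_RUNTIME_NS;
}
```

For example, a 10ms quota on 80 cpus gives a ceiling of 90ms in one period
under this assumption. That ceiling could only be reached if every cpu had
been handed a slice in an earlier period and left a full 1ms of it unused.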

This greatly improves performance of high-thread-count, non-cpu-bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
an 80 CPU machine), this commit resulted in an almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Reviewed-by: Ben Segall <bsegall@google.com>
---
 Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++-------
 kernel/sched/fair.c                   | 72 ++++------------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 67 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 3a90642..9801d6b 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary. As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis. The amount transferred
 within each of these updates is tunable and described as the "slice".
 
 Management
@@ -35,12 +36,12 @@ The default values are::
 
 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group.  This represents the traditional work-conserving behavior for
+bandwidth group. This represents the traditional work-conserving behavior for
 CFS.
 
 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms.  There is also an
-upper bound on the period length of 1s.  Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms. There is also an
+upper bound on the period length of 1s. Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.
 
@@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion.  This greatly reduces global accounting pressure
-on large systems.  The amount transferred each time such an update is required
+"silos" in a batch fashion. This greatly reduces global accounting pressure
+on large systems. The amount transferred each time such an update is required
 is described as the "slice".
 
 This is tunable via procfs::
@@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.
 
+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable. This is configured at compile time by the min_cfs_rq_runtime
+variable. This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period. As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods. This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period.  This
+also limits the burst ability to no more than 1ms per cpu.  This provides
+better more predictable user experience for highly threaded applications with
+small quota limits on high core count machines. It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu. Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%. If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount. In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota. This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime::
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95..00b68f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 
 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }
 
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;
 
 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);
 
 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}
 
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);
 
 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }
 
-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;
 
 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;
 
 		/* we check whether we're throttled above */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4650,7 +4601,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;
 
 	/* no need to continue the timer with no bandwidth constraint */
@@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;
 
-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
@@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
@@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;
 
-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);
 
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
+	lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 
 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3..28c16e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -335,8 +335,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;
 
 	u8			idle;
 	u8			period_active;
@@ -556,8 +554,6 @@ struct cfs_rq {
 
 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;
 
 	u64			throttled_clock;
-- 
1.8.3.1



* [PATCH v6 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
From: Dave Chiluk @ 2019-07-23 16:44 UTC (permalink / raw)
  To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra,
	Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson,
	Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet,
	linux-doc
In-Reply-To: <1558121424-2914-1-git-send-email-chiluk+linux@indeed.com>

Changelog v6
- Added back missing call to lsub_positive(&cfs_b->runtime, runtime);
- Added Reviewed-by: Ben Segall <bsegall@google.com>
- Fix some grammar in the Documentation, and change some wording.
- Updated documentation due to the .rst change

Changelog v5
- Based on this comment from Ben Segall on v4:
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
I'm resubmitting my v3 patchset, with the requested changes.
- Updated Commit log given review comments
- Update sched-bwc.txt given my new understanding of the slack timer.

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining
when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per
feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.


* Re: [PATCH] Documentation: filesystem: fix "Removed Sysctls" table
From: Sheriff Esseson @ 2019-07-23 14:52 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Darrick J. Wong, supporter:XFS FILESYSTEM, Jonathan Corbet,
	open list:DOCUMENTATION, open list, skhan, linux-kernel-mentees
In-Reply-To: <20190723074218.4532737f@lwn.net>

On Tue, Jul 23, 2019 at 07:42:18AM -0600, Jonathan Corbet wrote:
> On Tue, 23 Jul 2019 12:48:13 +0100
> Sheriff Esseson <sheriffesseson@gmail.com> wrote:
> 
> > the "Removed Sysctls" section is a table - bring it alive with ReST.
> > 
> > Signed-off-by: Sheriff Esseson <sheriffesseson@gmail.com>
> 
> So this appears to be identical to the patch you sent three days ago; is
> there a reason why you are sending it again now?
> 
> Thanks,
> 
> jon

Sorry, I thought the patch went unnoticed during the merge window - I could
not find a response.


* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
From: Joel Fernandes @ 2019-07-23 14:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, carmenjackson, Christian Hansen,
	Colin Ian King, dancol, David Howells, fmayer, joaodias,
	Jonathan Corbet, Kees Cook, Kirill Tkhai, Konstantin Khlebnikov,
	linux-doc, linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport,
	minchan, minchan, namhyung, sspatil, surenb, Thomas Gleixner,
	timmurray, tkjos, Vlastimil Babka, wvw
In-Reply-To: <20190722150639.27641c63b003dd04e187fd96@linux-foundation.org>

On Mon, Jul 22, 2019 at 03:06:39PM -0700, Andrew Morton wrote:
> On Mon, 22 Jul 2019 17:32:04 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> 
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> 
> Well, it's never going to be "accurate" - something could change one
> nanosecond after userspace has read the data...
> 
> Presumably with this approach the data will be "more" accurate.  How
> big a problem has this inaccuracy proven to be in real-world usage?

It has proven to be quite a thorn. But the security issue is the main problem..

> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.

..as mentioned here.

I should have emphasized the security issue more; I will do so in the next
revision.

> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > It ended up simplifying userspace code, solving the security issue
> > mentioned and works quite well. SELinux does not need to be turned off
> > since no pagemap look up is needed.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time.
> > 
> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> > 
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> > 
> > ...
> >
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/page_ext.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/sched/mm.h>
> >  
> >  #define BITMAP_CHUNK_SIZE	sizeof(u64)
> >  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> > @@ -28,15 +29,12 @@
> >   *
> >   * This function tries to get a user memory page by pfn as described above.
> >   */
> 
> Above comment needs updating or moving?
> 
> > -static struct page *page_idle_get_page(unsigned long pfn)
> > +static struct page *page_idle_get_page(struct page *page_in)
> >  {
> >  	struct page *page;
> >  	pg_data_t *pgdat;
> >  
> > -	if (!pfn_valid(pfn))
> > -		return NULL;
> > -
> > -	page = pfn_to_page(pfn);
> > +	page = page_in;
> >  	if (!page || !PageLRU(page) ||
> >  	    !get_page_unless_zero(page))
> >  		return NULL;
> >
> > ...
> >
> > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > +				unsigned long *start, unsigned long *end)
> > +{
> > +	unsigned long max_frame;
> > +
> > +	/* If an mm is not given, assume we want physical frames */
> > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > +
> > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > +		return -EINVAL;
> > +
> > +	*start = pos * BITS_PER_BYTE;
> > +	if (*start >= max_frame)
> > +		return -ENXIO;
> 
> Is said to mean "The system tried to use the device represented by a
> file you specified, and it couldn't find the device.  This can mean that
> the device file was installed incorrectly, or that the physical device
> is missing or not correctly attached to the computer."
> 
> This doesn't seem appropriate in this usage and is hence possibly
> misleading.  Someone whose application fails with ENXIO will be
> scratching their heads.

This actually keeps it consistent with the current code. I refactored that
code a bit and I'm reusing parts of it to keep the line count down. See
page_idle_bitmap_write(), which returns -ENXIO in current upstream.

However, note that I am actually returning 0 if page_idle_get_frames()
returns -ENXIO:

+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret == -ENXIO)
+		return 0;  /* Reads beyond max_pfn do nothing */

The reason I do it this way is that I am using page_idle_get_frames() in both
the old code and the new code. A bit confusing, I know! But it is the cleanest
way I could find to keep this code common.

> > +	*end = *start + count * BITS_PER_BYTE;
> > +	if (*end > max_frame)
> > +		*end = max_frame;
> > +	return 0;
> > +}
> > +
> >
> > ...
> >
> > +static void add_page_idle_list(struct page *page,
> > +			       unsigned long addr, struct mm_walk *walk)
> > +{
> > +	struct page *page_get;
> > +	struct page_node *pn;
> > +	int bit;
> > +	unsigned long frames;
> > +	struct page_idle_proc_priv *priv = walk->private;
> > +	u64 *chunk = (u64 *)priv->buffer;
> > +
> > +	if (priv->write) {
> > +		/* Find whether this page was asked to be marked */
> > +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > +		bit = frames % BITMAP_CHUNK_BITS;
> > +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > +		if (((*chunk >> bit) & 1) == 0)
> > +			return;
> > +	}
> > +
> > +	page_get = page_idle_get_page(page);
> > +	if (!page_get)
> > +		return;
> > +
> > +	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
> 
> I'm not liking this GFP_ATOMIC.  If I'm reading the code correctly,
> userspace can ask for an arbitrarily large number of GFP_ATOMIC
> allocations by doing a large read.  This can potentially exhaust page
> reserves which things like networking Rx interrupts need and can make
> this whole feature less reliable.

Ok, I will look into this more and possibly do the allocation another way.
Spinlocks are held, hence I use GFP_ATOMIC.

thanks,

 - Joel



* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
From: Joel Fernandes @ 2019-07-23 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai,
	Konstantin Khlebnikov, linux-doc, linux-fsdevel, linux-mm,
	Mike Rapoport, minchan, minchan, namhyung, sspatil, surenb,
	Thomas Gleixner, timmurray, tkjos, Vlastimil Babka, wvw,
	linux-api
In-Reply-To: <20190723060525.GA4552@dhcp22.suse.cz>

On Tue, Jul 23, 2019 at 08:05:25AM +0200, Michal Hocko wrote:
> [Cc linux-api - please always do CC this list when introducing a user
>  visible API]

Sorry, will do.

> On Mon 22-07-19 17:32:04, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > It ended up simplifying userspace code, solving the security issue
> > mentioned and works quite well. SELinux does not need to be turned off
> > since no pagemap look up is needed.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time.
> > 
> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> > 
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> 
> I didn't get to read the actual code but the overall idea makes sense to
> me. I can see this being useful for userspace memory management (along
> with remote MADV_PAGEOUT, MADV_COLD).

Thanks.

> Normally I would object that a cumbersome nature of the existing
> interface can be hidden in a userspace but I do agree that rowhammer has
> made this one close to unusable for anything but a privileged process.

Agreed, this is one of the primary motivations for the patch as you said.

> I do not think you can make any argument about accuracy because
> the information will never be accurate. Sure the race window is smaller
> in principle but you can hardly say anything about how much or whether
> at all.

Sure, fair enough. That is why I wasn't beating the drum too much on the
accuracy point. However, this surprisingly does work quite well.

thanks,

 - Joel



* Re: [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches
From: Waiman Long @ 2019-07-23 14:30 UTC (permalink / raw)
  To: peter enderborg, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Alexander Viro, Jonathan Corbet,
	Luis Chamberlain, Kees Cook, Johannes Weiner, Michal Hocko,
	Vladimir Davydov
  Cc: linux-mm, linux-doc, linux-fsdevel, cgroups, linux-kernel,
	Roman Gushchin, Shakeel Butt, Andrea Arcangeli
In-Reply-To: <71ab6307-9484-fdd3-fe6d-d261acf7c4a5@sony.com>

On 7/22/19 8:46 AM, peter enderborg wrote:
> On 7/2/19 8:37 PM, Waiman Long wrote:
>> Currently, a value of '1' is written to /sys/kernel/slab/<slab>/shrink
>> file to shrink the slab by flushing all the per-cpu slabs and free
>> slabs in partial lists. This applies only to the root caches, though.
>>
>> Extends this capability by shrinking all the child memcg caches and
>> the root cache when a value of '2' is written to the shrink sysfs file.
>>
>> On a 4-socket 112-core 224-thread x86-64 system after a parallel kernel
>> build, the amount of memory occupied by slabs before shrinking
>> slabs was:
>>
>>  # grep task_struct /proc/slabinfo
>>  task_struct         7114   7296   7744    4    8 : tunables    0    0
>>  0 : slabdata   1824   1824      0
>>  # grep "^S[lRU]" /proc/meminfo
>>  Slab:            1310444 kB
>>  SReclaimable:     377604 kB
>>  SUnreclaim:       932840 kB
>>
>> After shrinking slabs:
>>
>>  # grep "^S[lRU]" /proc/meminfo
>>  Slab:             695652 kB
>>  SReclaimable:     322796 kB
>>  SUnreclaim:       372856 kB
>>  # grep task_struct /proc/slabinfo
>>  task_struct         2262   2572   7744    4    8 : tunables    0    0
>>  0 : slabdata    643    643      0
>
> What is the time between these measurement points? Should the shrunk
> memory not show up as reclaimable?

In this case, I echoed '2' to all the shrink sysfs files under
/sys/kernel/slab. The purpose of shrinking caches is to reclaim as many
unused memory slabs as possible from all the caches, irrespective of whether
they are reclaimable or not. We do not reclaim any used objects. That is why
we see that the numbers were reduced in both cases.

Cheers,
Longman


* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
From: Joel Fernandes @ 2019-07-23 14:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai,
	Konstantin Khlebnikov, linux-doc, linux-fsdevel, linux-mm,
	Michal Hocko, Mike Rapoport, namhyung, sspatil, surenb,
	Thomas Gleixner, timmurray, tkjos, Vlastimil Babka, wvw
In-Reply-To: <20190723061358.GD128252@google.com>

On Tue, Jul 23, 2019 at 03:13:58PM +0900, Minchan Kim wrote:
> Hi Joel,
> 
> On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> 
> cumbersome: That's the fair tradeoff between idle page tracking and
> clear_refs because idle page tracking could check even though the page
> is not mapped.

It is a fair tradeoff, but it could be made simpler. The userspace code got
reduced by a good amount as well.

> error-prone: What's the error?

In normal Android usage we see that pages sometimes appear not to be idle
even when they really are idle. Reproducing this is a bit unpredictable; it
happens on random occasions. With this new interface, we are seeing this
happen much less often.

> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> 
> What you mean with error is this timing issue?
> Why do you need to be accurate? IOW, accurate is always good but what's
> the scale of the accuracy?

There is a time window between looking up pagemap and checking if page is
idle. Anyway, see below for the primary goals as you asked:

> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> 
> Ah, so the primary goal is to provide convenience interface and it would
> help accuracy, too. IOW, accuracy is not your main goal?

There are a couple of primary goals: security, convenience and also solving
the accuracy/reliability problem we are seeing. Do keep in mind that looking
up the PFN has security implications. The PFN field in pagemap is zeroed if
the user does not have CAP_SYS_ADMIN.

> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pin points code paths which allocates and leaves memory
> > idle for long periods of time.
> 
> So the goal is to detect idle pages with idle memory tracking?

Isn't that what idle memory tracking does?

> It couldn't work well because such idle pages could finally swap out and
> lose every flags of the page descriptor which is working mechanism of
> idle page tracking. It should have named "workingset page tracking",
> not "idle page tracking".

The heap profiler that uses page-idle tracking is not meant to measure the
working set, but to look for pages that are idle for long periods of time.

Thanks for bringing up the swapping corner case..  Perhaps we can improve
the heap profiler to detect this by looking at bits 0-4 in pagemap. While it
is true that we would lose access information during the window, there is a
high likelihood that the page was not accessed which is why it was swapped.
Thoughts?

thanks,

 - Joel



> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> > 
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> > 
> > Cc: vdavydov.dev@gmail.com
> > Cc: Brendan Gregg <bgregg@netflix.com>
> > Cc: kernel-team@android.com
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > ---
> > Internal review -> v1:
> > Fixes from Suren.
> > Corrections to change log, docs (Florian, Sandeep)
> > 
> >  fs/proc/base.c            |   3 +
> >  fs/proc/internal.h        |   1 +
> >  fs/proc/task_mmu.c        |  57 +++++++
> >  include/linux/page_idle.h |   4 +
> >  mm/page_idle.c            | 305 +++++++++++++++++++++++++++++++++-----
> >  5 files changed, 330 insertions(+), 40 deletions(-)
> > 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 77eb628ecc7f..a58dd74606e9 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
> >  	REG("smaps",      S_IRUGO, proc_pid_smaps_operations),
> >  	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
> >  	REG("pagemap",    S_IRUSR, proc_pagemap_operations),
> > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > +	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
> > +#endif
> >  #endif
> >  #ifdef CONFIG_SECURITY
> >  	DIR("attr",       S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index cd0c8d5ce9a1..bc9371880c63 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
> >  extern const struct file_operations proc_pid_smaps_rollup_operations;
> >  extern const struct file_operations proc_clear_refs_operations;
> >  extern const struct file_operations proc_pagemap_operations;
> > +extern const struct file_operations proc_page_idle_operations;
> >  
> >  extern unsigned long task_vsize(struct mm_struct *);
> >  extern unsigned long task_statm(struct mm_struct *,
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 4d2b860dbc3f..11ccc53da38e 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
> >  	.open		= pagemap_open,
> >  	.release	= pagemap_release,
> >  };
> > +
> > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > +static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
> > +				   size_t count, loff_t *ppos)
> > +{
> > +	int ret;
> > +	struct task_struct *tsk = get_proc_task(file_inode(file));
> > +
> > +	if (!tsk)
> > +		return -EINVAL;
> > +	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
> > +	put_task_struct(tsk);
> > +	return ret;
> > +}
> > +
> > +static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
> > +				 size_t count, loff_t *ppos)
> > +{
> > +	int ret;
> > +	struct task_struct *tsk = get_proc_task(file_inode(file));
> > +
> > +	if (!tsk)
> > +		return -EINVAL;
> > +	ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
> > +	put_task_struct(tsk);
> > +	return ret;
> > +}
> > +
> > +static int proc_page_idle_open(struct inode *inode, struct file *file)
> > +{
> > +	struct mm_struct *mm;
> > +
> > +	mm = proc_mem_open(inode, PTRACE_MODE_READ);
> > +	if (IS_ERR(mm))
> > +		return PTR_ERR(mm);
> > +	file->private_data = mm;
> > +	return 0;
> > +}
> > +
> > +static int proc_page_idle_release(struct inode *inode, struct file *file)
> > +{
> > +	struct mm_struct *mm = file->private_data;
> > +
> > +	if (mm)
> > +		mmdrop(mm);
> > +	return 0;
> > +}
> > +
> > +const struct file_operations proc_page_idle_operations = {
> > +	.llseek		= mem_lseek, /* borrow this */
> > +	.read		= proc_page_idle_read,
> > +	.write		= proc_page_idle_write,
> > +	.open		= proc_page_idle_open,
> > +	.release	= proc_page_idle_release,
> > +};
> > +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> > +
> >  #endif /* CONFIG_PROC_PAGE_MONITOR */
> >  
> >  #ifdef CONFIG_NUMA
> > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > index 1e894d34bdce..f1bc2640d85e 100644
> > --- a/include/linux/page_idle.h
> > +++ b/include/linux/page_idle.h
> > @@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
> >  }
> >  #endif /* CONFIG_64BIT */
> >  
> > +ssize_t page_idle_proc_write(struct file *file,
> > +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> > +ssize_t page_idle_proc_read(struct file *file,
> > +	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
> >  #else /* !CONFIG_IDLE_PAGE_TRACKING */
> >  
> >  static inline bool page_is_young(struct page *page)
> > diff --git a/mm/page_idle.c b/mm/page_idle.c
> > index 295512465065..874a60c41fef 100644
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/page_ext.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/sched/mm.h>
> >  
> >  #define BITMAP_CHUNK_SIZE	sizeof(u64)
> >  #define BITMAP_CHUNK_BITS	(BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> > @@ -28,15 +29,12 @@
> >   *
> >   * This function tries to get a user memory page by pfn as described above.
> >   */
> > -static struct page *page_idle_get_page(unsigned long pfn)
> > +static struct page *page_idle_get_page(struct page *page_in)
> >  {
> >  	struct page *page;
> >  	pg_data_t *pgdat;
> >  
> > -	if (!pfn_valid(pfn))
> > -		return NULL;
> > -
> > -	page = pfn_to_page(pfn);
> > +	page = page_in;
> >  	if (!page || !PageLRU(page) ||
> >  	    !get_page_unless_zero(page))
> >  		return NULL;
> > @@ -51,6 +49,15 @@ static struct page *page_idle_get_page(unsigned long pfn)
> >  	return page;
> >  }
> >  
> > +static struct page *page_idle_get_page_pfn(unsigned long pfn)
> > +{
> > +
> > +	if (!pfn_valid(pfn))
> > +		return NULL;
> > +
> > +	return page_idle_get_page(pfn_to_page(pfn));
> > +}
> > +
> >  static bool page_idle_clear_pte_refs_one(struct page *page,
> >  					struct vm_area_struct *vma,
> >  					unsigned long addr, void *arg)
> > @@ -118,6 +125,47 @@ static void page_idle_clear_pte_refs(struct page *page)
> >  		unlock_page(page);
> >  }
> >  
> > +/* Helper to get the start and end frame given a pos and count */
> > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > +				unsigned long *start, unsigned long *end)
> > +{
> > +	unsigned long max_frame;
> > +
> > +	/* If an mm is not given, assume we want physical frames */
> > +	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > +
> > +	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > +		return -EINVAL;
> > +
> > +	*start = pos * BITS_PER_BYTE;
> > +	if (*start >= max_frame)
> > +		return -ENXIO;
> > +
> > +	*end = *start + count * BITS_PER_BYTE;
> > +	if (*end > max_frame)
> > +		*end = max_frame;
> > +	return 0;
> > +}
> > +
> > +static bool page_really_idle(struct page *page)
> > +{
> > +	if (!page)
> > +		return false;
> > +
> > +	if (page_is_idle(page)) {
> > +		/*
> > +		 * The page might have been referenced via a
> > +		 * pte, in which case it is not idle. Clear
> > +		 * refs and recheck.
> > +		 */
> > +		page_idle_clear_pte_refs(page);
> > +		if (page_is_idle(page))
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> >  static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> >  				     struct bin_attribute *attr, char *buf,
> >  				     loff_t pos, size_t count)
> > @@ -125,35 +173,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
> >  	u64 *out = (u64 *)buf;
> >  	struct page *page;
> >  	unsigned long pfn, end_pfn;
> > -	int bit;
> > -
> > -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > -		return -EINVAL;
> > -
> > -	pfn = pos * BITS_PER_BYTE;
> > -	if (pfn >= max_pfn)
> > -		return 0;
> > +	int bit, ret;
> >  
> > -	end_pfn = pfn + count * BITS_PER_BYTE;
> > -	if (end_pfn > max_pfn)
> > -		end_pfn = max_pfn;
> > +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> > +	if (ret == -ENXIO)
> > +		return 0;  /* Reads beyond max_pfn do nothing */
> > +	else if (ret)
> > +		return ret;
> >  
> >  	for (; pfn < end_pfn; pfn++) {
> >  		bit = pfn % BITMAP_CHUNK_BITS;
> >  		if (!bit)
> >  			*out = 0ULL;
> > -		page = page_idle_get_page(pfn);
> > -		if (page) {
> > -			if (page_is_idle(page)) {
> > -				/*
> > -				 * The page might have been referenced via a
> > -				 * pte, in which case it is not idle. Clear
> > -				 * refs and recheck.
> > -				 */
> > -				page_idle_clear_pte_refs(page);
> > -				if (page_is_idle(page))
> > -					*out |= 1ULL << bit;
> > -			}
> > +		page = page_idle_get_page_pfn(pfn);
> > +		if (page && page_really_idle(page)) {
> > +			*out |= 1ULL << bit;
> >  			put_page(page);
> >  		}
> >  		if (bit == BITMAP_CHUNK_BITS - 1)
> > @@ -170,23 +204,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
> >  	const u64 *in = (u64 *)buf;
> >  	struct page *page;
> >  	unsigned long pfn, end_pfn;
> > -	int bit;
> > +	int bit, ret;
> >  
> > -	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > -		return -EINVAL;
> > -
> > -	pfn = pos * BITS_PER_BYTE;
> > -	if (pfn >= max_pfn)
> > -		return -ENXIO;
> > -
> > -	end_pfn = pfn + count * BITS_PER_BYTE;
> > -	if (end_pfn > max_pfn)
> > -		end_pfn = max_pfn;
> > +	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
> > +	if (ret)
> > +		return ret;
> >  
> >  	for (; pfn < end_pfn; pfn++) {
> >  		bit = pfn % BITMAP_CHUNK_BITS;
> >  		if ((*in >> bit) & 1) {
> > -			page = page_idle_get_page(pfn);
> > +			page = page_idle_get_page_pfn(pfn);
> >  			if (page) {
> >  				page_idle_clear_pte_refs(page);
> >  				set_page_idle(page);
> > @@ -224,10 +251,208 @@ struct page_ext_operations page_idle_ops = {
> >  };
> >  #endif
> >  
> > +/*  page_idle tracking for /proc/<pid>/page_idle */
> > +
> > +static DEFINE_SPINLOCK(idle_page_list_lock);
> > +struct list_head idle_page_list;
> > +
> > +struct page_node {
> > +	struct page *page;
> > +	unsigned long addr;
> > +	struct list_head list;
> > +};
> > +
> > +struct page_idle_proc_priv {
> > +	unsigned long start_addr;
> > +	char *buffer;
> > +	int write;
> > +};
> > +
> > +static void add_page_idle_list(struct page *page,
> > +			       unsigned long addr, struct mm_walk *walk)
> > +{
> > +	struct page *page_get;
> > +	struct page_node *pn;
> > +	int bit;
> > +	unsigned long frames;
> > +	struct page_idle_proc_priv *priv = walk->private;
> > +	u64 *chunk = (u64 *)priv->buffer;
> > +
> > +	if (priv->write) {
> > +		/* Find whether this page was asked to be marked */
> > +		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
> > +		bit = frames % BITMAP_CHUNK_BITS;
> > +		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
> > +		if (((*chunk >> bit) & 1) == 0)
> > +			return;
> > +	}
> > +
> > +	page_get = page_idle_get_page(page);
> > +	if (!page_get)
> > +		return;
> > +
> > +	pn = kmalloc(sizeof(*pn), GFP_ATOMIC);
> > +	if (!pn)
> > +		return;
> > +
> > +	pn->page = page_get;
> > +	pn->addr = addr;
> > +	list_add(&pn->list, &idle_page_list);
> > +}
> > +
> > +static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
> > +				    unsigned long end,
> > +				    struct mm_walk *walk)
> > +{
> > +	struct vm_area_struct *vma = walk->vma;
> > +	pte_t *pte;
> > +	spinlock_t *ptl;
> > +	struct page *page;
> > +
> > +	ptl = pmd_trans_huge_lock(pmd, vma);
> > +	if (ptl) {
> > +		if (pmd_present(*pmd)) {
> > +			page = follow_trans_huge_pmd(vma, addr, pmd,
> > +						     FOLL_DUMP|FOLL_WRITE);
> > +			if (!IS_ERR_OR_NULL(page))
> > +				add_page_idle_list(page, addr, walk);
> > +		}
> > +		spin_unlock(ptl);
> > +		return 0;
> > +	}
> > +
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +
> > +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> > +		if (!pte_present(*pte))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, *pte);
> > +		if (page)
> > +			add_page_idle_list(page, addr, walk);
> > +	}
> > +
> > +	pte_unmap_unlock(pte - 1, ptl);
> > +	return 0;
> > +}
> > +
> > +ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
> > +			       size_t count, loff_t *pos,
> > +			       struct task_struct *tsk, int write)
> > +{
> > +	int ret;
> > +	char *buffer;
> > +	u64 *out;
> > +	unsigned long start_addr, end_addr, start_frame, end_frame;
> > +	struct mm_struct *mm = file->private_data;
> > +	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
> > +	struct page_node *cur, *next;
> > +	struct page_idle_proc_priv priv;
> > +	bool walk_error = false;
> > +
> > +	if (!mm || !mmget_not_zero(mm))
> > +		return -EINVAL;
> > +
> > +	if (count > PAGE_SIZE)
> > +		count = PAGE_SIZE;
> > +
> > +	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > +	if (!buffer) {
> > +		ret = -ENOMEM;
> > +		goto out_mmput;
> > +	}
> > +	out = (u64 *)buffer;
> > +
> > +	if (write && copy_from_user(buffer, ubuff, count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
> > +	if (ret)
> > +		goto out;
> > +
> > +	start_addr = (start_frame << PAGE_SHIFT);
> > +	end_addr = (end_frame << PAGE_SHIFT);
> > +	priv.buffer = buffer;
> > +	priv.start_addr = start_addr;
> > +	priv.write = write;
> > +	walk.private = &priv;
> > +	walk.mm = mm;
> > +
> > +	down_read(&mm->mmap_sem);
> > +
> > +	/*
> > +	 * Protects the idle_page_list which is needed because
> > +	 * walk_page_vma() holds ptlock which deadlocks with
> > +	 * page_idle_clear_pte_refs(). So we have to collect all
> > +	 * pages first, and then call page_idle_clear_pte_refs().
> > +	 */
> > +	spin_lock(&idle_page_list_lock);
> > +	ret = walk_page_range(start_addr, end_addr, &walk);
> > +	if (ret)
> > +		walk_error = true;
> > +
> > +	list_for_each_entry_safe(cur, next, &idle_page_list, list) {
> > +		int bit, index;
> > +		unsigned long off;
> > +		struct page *page = cur->page;
> > +
> > +		if (unlikely(walk_error))
> > +			goto remove_page;
> > +
> > +		if (write) {
> > +			page_idle_clear_pte_refs(page);
> > +			set_page_idle(page);
> > +		} else {
> > +			if (page_really_idle(page)) {
> > +				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
> > +				bit = off % BITMAP_CHUNK_BITS;
> > +				index = off / BITMAP_CHUNK_BITS;
> > +				out[index] |= 1ULL << bit;
> > +			}
> > +		}
> > +remove_page:
> > +		put_page(page);
> > +		list_del(&cur->list);
> > +		kfree(cur);
> > +	}
> > +	spin_unlock(&idle_page_list_lock);
> > +
> > +	if (!write && !walk_error)
> > +		ret = copy_to_user(ubuff, buffer, count);
> > +
> > +	up_read(&mm->mmap_sem);
> > +out:
> > +	kfree(buffer);
> > +out_mmput:
> > +	mmput(mm);
> > +	if (!ret)
> > +		ret = count;
> > +	return ret;
> > +
> > +}
> > +
> > +ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
> > +			    size_t count, loff_t *pos, struct task_struct *tsk)
> > +{
> > +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
> > +}
> > +
> > +ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
> > +			     size_t count, loff_t *pos, struct task_struct *tsk)
> > +{
> > +	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
> > +}
> > +
> >  static int __init page_idle_init(void)
> >  {
> >  	int err;
> >  
> > +	INIT_LIST_HEAD(&idle_page_list);
> > +
> >  	err = sysfs_create_group(mm_kobj, &page_idle_attr_group);
> >  	if (err) {
> >  		pr_err("page_idle: register sysfs failed\n");
> > -- 
> > 2.22.0.657.g960e92d24f-goog

^ permalink raw reply

* [PATCH V2 2/2] kernel-doc: core-api: Include string.h into core-api
From: Joe Perches @ 2019-07-23 13:51 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Jonathan Corbet, Stephen Kitt, Kees Cook, Nitin Gote, jannh,
	kernel-hardening, Rasmus Villemoes, Andrew Morton, linux-doc
In-Reply-To: <cover.1563889130.git.joe@perches.com>

core-api should show all the various string functions including the
newly added stracpy and stracpy_pad.

Miscellanea:

o Update the Returns: value for strscpy
o fix a defect with %NUL)

Signed-off-by: Joe Perches <joe@perches.com>
---

V2: Correct return of -E2BIG descriptions

 Documentation/core-api/kernel-api.rst |  3 +++
 include/linux/string.h                |  5 +++--
 lib/string.c                          | 10 ++++++----
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 08af5caf036d..f77de49b1d51 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -42,6 +42,9 @@ String Manipulation
 .. kernel-doc:: lib/string.c
    :export:
 
+.. kernel-doc:: include/linux/string.h
+   :internal:
+
 .. kernel-doc:: mm/util.c
    :functions: kstrdup kstrdup_const kstrndup kmemdup kmemdup_nul memdup_user
                vmemdup_user strndup_user memdup_user_nul
diff --git a/include/linux/string.h b/include/linux/string.h
index 7572cd78cf9f..3cf684db4bc6 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -519,8 +519,9 @@ static inline void memcpy_and_pad(void *dest, size_t dest_len,
  * But this can lead to bugs due to typos, or if prefix is a pointer
  * and not a constant. Instead use str_has_prefix().
  *
- * Returns: 0 if @str does not start with @prefix
-         strlen(@prefix) if @str does start with @prefix
+ * Returns:
+ * * strlen(@prefix) if @str starts with @prefix
+ * * 0 if @str does not start with @prefix
  */
 static __always_inline size_t str_has_prefix(const char *str, const char *prefix)
 {
diff --git a/lib/string.c b/lib/string.c
index 461fb620f85f..f7bc10da4259 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -173,8 +173,9 @@ EXPORT_SYMBOL(strlcpy);
  * doesn't unnecessarily force the tail of the destination buffer to be
  * zeroed.  If zeroing is desired please use strscpy_pad().
  *
- * Return: The number of characters copied (not including the trailing
- *         %NUL) or -E2BIG if the destination buffer wasn't big enough.
+ * Returns:
+ * * The number of characters copied (not including the trailing %NUL)
+ * * -E2BIG if count is 0 or @src was truncated.
  */
 ssize_t strscpy(char *dest, const char *src, size_t count)
 {
@@ -253,8 +254,9 @@ EXPORT_SYMBOL(strscpy);
  * For full explanation of why you may want to consider using the
  * 'strscpy' functions please see the function docstring for strscpy().
  *
- * Return: The number of characters copied (not including the trailing
- *         %NUL) or -E2BIG if the destination buffer wasn't big enough.
+ * Returns:
+ * * The number of characters copied (not including the trailing %NUL)
+ * * -E2BIG if count is 0 or @src was truncated.
  */
 ssize_t strscpy_pad(char *dest, const char *src, size_t count)
 {
-- 
2.15.0
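
For readers following the Returns: wording above, here is a minimal
userspace sketch of the documented return values. The names demo_strscpy()
and demo_str_has_prefix() are hypothetical stand-ins, not the kernel
implementations, and the model ignores details such as the kernel's
word-at-a-time copying.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical model of the documented strscpy() contract:
 * returns the number of characters copied (not including the
 * trailing NUL), or -E2BIG if count is 0 or src was truncated.
 */
static ssize_t demo_strscpy(char *dest, const char *src, size_t count)
{
	size_t i;

	if (count == 0)
		return -E2BIG;

	for (i = 0; i < count; i++) {
		dest[i] = src[i];
		if (src[i] == '\0')
			return (ssize_t)i;
	}

	/* src did not fit: terminate what was copied and report it. */
	dest[count - 1] = '\0';
	return -E2BIG;
}

/*
 * Hypothetical model of the documented str_has_prefix() contract:
 * returns strlen(prefix) if str starts with prefix, 0 otherwise.
 */
static size_t demo_str_has_prefix(const char *str, const char *prefix)
{
	size_t len = strlen(prefix);

	return strncmp(str, prefix, len) == 0 ? len : 0;
}
```

A caller would typically check for a negative return from the copy to
detect truncation, and use the non-zero return of the prefix test as the
offset at which the remainder of the string starts.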


^ permalink raw reply related

* [PATCH V2 0/2] string: Add stracpy and stracpy_pad
From: Joe Perches @ 2019-07-23 13:51 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel
  Cc: Jonathan Corbet, Stephen Kitt, Kees Cook, Nitin Gote, jannh,
	kernel-hardening, Rasmus Villemoes, Andrew Morton, linux-doc

Add more string copy mechanisms to help avoid defects

Joe Perches (2):
  string: Add stracpy and stracpy_pad mechanisms
  kernel-doc: core-api: Include string.h into core-api

 Documentation/core-api/kernel-api.rst |  3 +++
 include/linux/string.h                | 50 +++++++++++++++++++++++++++++++++--
 lib/string.c                          | 10 ++++---
 3 files changed, 57 insertions(+), 6 deletions(-)

-- 
2.15.0


^ permalink raw reply

* Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing
From: Joel Fernandes @ 2019-07-23 13:47 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-kernel, vdavydov.dev, Brendan Gregg, kernel-team,
	Alexey Dobriyan, Al Viro, Andrew Morton, carmenjackson,
	Christian Hansen, Colin Ian King, dancol, David Howells, fmayer,
	joaodias, Jonathan Corbet, Kees Cook, Kirill Tkhai, linux-doc,
	linux-fsdevel, linux-mm, Michal Hocko, Mike Rapoport, minchan,
	minchan, namhyung, sspatil, surenb, Thomas Gleixner, timmurray,
	tkjos, Vlastimil Babka, wvw
In-Reply-To: <8b15dac6-f776-ac9a-8377-ae38f5c9007f@yandex-team.ru>

On Tue, Jul 23, 2019 at 01:10:05PM +0300, Konstantin Khlebnikov wrote:
> On 23.07.2019 11:43, Konstantin Khlebnikov wrote:
> > On 23.07.2019 0:32, Joel Fernandes (Google) wrote:
> > > The page_idle tracking feature currently requires looking up the pagemap
> > > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > > > This is quite cumbersome and can be error-prone too. If something
> > > > changes with the page between accessing the per-PID pagemap and the
> > > > global page_idle bitmap, the information is not accurate. Moreover,
> > > > looking up the PFN from pagemap on Android devices is not supported
> > > > for unprivileged processes: it requires SYS_ADMIN and otherwise gives
> > > > 0 for the PFN.
> > > 
> > > This patch adds support to directly interact with page_idle tracking at
> > > the PID level by introducing a /proc/<pid>/page_idle file. This
> > > eliminates the need for userspace to calculate the mapping of the page.
> > > It follows the exact same semantics as the global
> > > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > > It ended up simplifying userspace code, solving the security issue
> > > mentioned and works quite well. SELinux does not need to be turned off
> > > since no pagemap look up is needed.
> > > 
> > > > In Android, we are using this for the heap profiler (heapprofd) which
> > > > profiles and pinpoints code paths which allocate memory and leave it
> > > > idle for long periods of time.
> > > 
> > > Documentation material:
> > > The idle page tracking API for virtual address indexing using virtual page
> > > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > > except that it uses virtual instead of physical frame numbers.
> > > 
> > > This idle page tracking API can be simpler to use than physical address
> > > indexing, since the pagemap for a process does not need to be looked up
> > > to mark or read a page's idle bit. It is also more accurate than
> > > physical address indexing since in physical address indexing, address
> > > space changes can occur between reading the pagemap and reading the
> > > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > > the duration of the access.
> > 
> > Maybe integrate this into existing interface: /proc/pid/clear_refs and
> > /proc/pid/pagemap ?
> > 
> > > I.e. echo X > /proc/pid/clear_refs clears the reference bits in ptes and
> > > marks pages idle only for pages mapped in this process.
> > > An idle bit in /proc/pid/pagemap then tells whether the page is still idle
> > > in this process. This is faster - we don't need to walk the whole rmap.
> 
> > Moreover, this is so cheap that it could be counted and shown in smaps.
> > Unlike clearing the real access bits, this does not disrupt the memory
> > reclaimer. Killer feature.

I replied to your patch:
https://lore.kernel.org/lkml/20190723134647.GA104199@google.com/T/#med8992e75c32d9c47f95b119d24a43ded36420bc
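
For reference, the virtual indexing described in the quoted documentation
can be sketched from userspace as follows. This assumes the layout used by
the quoted patch (64-bit chunks, one bit per virtual frame, file position
counted in bytes); the struct and function names are hypothetical and the
interface itself is only a proposal at this point.

```c
#include <stdint.h>

#define DEMO_CHUNK_BYTES 8u	/* mirrors BITMAP_CHUNK_SIZE, sizeof(u64) */
#define DEMO_CHUNK_BITS  64u	/* mirrors BITMAP_CHUNK_BITS */

/* Hypothetical result type: where a page's idle bit lives in the file. */
struct demo_idle_pos {
	uint64_t offset;	/* byte offset of the 64-bit chunk */
	unsigned int bit;	/* bit index within that chunk */
};

/*
 * Map a virtual address to its position in the proposed
 * /proc/<pid>/page_idle bitmap. page_shift is the base page shift
 * (for example 12 for 4 KiB pages).
 */
static struct demo_idle_pos demo_vaddr_to_idle_pos(uint64_t vaddr,
						   unsigned int page_shift)
{
	uint64_t vfn = vaddr >> page_shift;	/* virtual frame number */
	struct demo_idle_pos p;

	p.offset = (vfn / DEMO_CHUNK_BITS) * DEMO_CHUNK_BYTES;
	p.bit = (unsigned int)(vfn % DEMO_CHUNK_BITS);
	return p;
}
```

The offset is always a multiple of 8, which matches the alignment check on
pos and count in page_idle_get_frames(). A reader would fetch the 8-byte
chunk at that offset and test the given bit.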


^ permalink raw reply

* Re: [PATCH] Documentation: filesystem: fix "Removed Sysctls" table
From: Jonathan Corbet @ 2019-07-23 13:42 UTC (permalink / raw)
  To: Sheriff Esseson
  Cc: skhan, linux-kernel-mentees, Darrick J. Wong, linux-xfs,
	open list:DOCUMENTATION, open list
In-Reply-To: <20190723114813.GA14870@localhost>

On Tue, 23 Jul 2019 12:48:13 +0100
Sheriff Esseson <sheriffesseson@gmail.com> wrote:

> the "Removed Sysctls" section is a table - bring it alive with ReST.
> 
> Signed-off-by: Sheriff Esseson <sheriffesseson@gmail.com>

So this appears to be identical to the patch you sent three days ago; is
there a reason why you are sending it again now?

Thanks,

jon

^ permalink raw reply

* [PATCH] Documentation/features/locking: update lists
From: Mark Rutland @ 2019-07-23 13:22 UTC (permalink / raw)
  To: linux-kernel; +Cc: Mark Rutland, Jonathan Corbet, linux-doc

The locking feature lists don't match reality as of v5.3-rc1:

* arm64 moved to queued spinlocks in commit:

  c11090474d70590170cf5fa6afe85864ab494b37

  ("arm64: locking: Replace ticket lock implementation with qspinlock")

* xtensa moved to queued spinlocks and rwlocks in commit:

  579afe866f52adcd921272a224ab36733051059c

  ("xtensa: use generic spinlock/rwlock implementation")

* architecture-specific rwsem support was removed in commit:

  46ad0840b1584b92b5ff2cc3ed0b011dd6b8e0f1

  ("locking/rwsem: Remove arch specific rwsem files")

So update the feature lists accordingly, and remove the now redundant
rwsem-optimized list.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
---
 .../locking/queued-rwlocks/arch-support.txt        |  2 +-
 .../locking/queued-spinlocks/arch-support.txt      |  4 +--
 .../locking/rwsem-optimized/arch-support.txt       | 34 ----------------------
 3 files changed, 3 insertions(+), 37 deletions(-)
 delete mode 100644 Documentation/features/locking/rwsem-optimized/arch-support.txt

diff --git a/Documentation/features/locking/queued-rwlocks/arch-support.txt b/Documentation/features/locking/queued-rwlocks/arch-support.txt
index c683da198f31..ee922746a64c 100644
--- a/Documentation/features/locking/queued-rwlocks/arch-support.txt
+++ b/Documentation/features/locking/queued-rwlocks/arch-support.txt
@@ -30,5 +30,5 @@
     |          um: | TODO |
     |   unicore32: | TODO |
     |         x86: |  ok  |
-    |      xtensa: | TODO |
+    |      xtensa: |  ok  |
     -----------------------
diff --git a/Documentation/features/locking/queued-spinlocks/arch-support.txt b/Documentation/features/locking/queued-spinlocks/arch-support.txt
index e3080b82aefd..c52116c1a049 100644
--- a/Documentation/features/locking/queued-spinlocks/arch-support.txt
+++ b/Documentation/features/locking/queued-spinlocks/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | TODO |
+    |       arm64: |  ok  |
     |         c6x: | TODO |
     |        csky: | TODO |
     |       h8300: | TODO |
@@ -30,5 +30,5 @@
     |          um: | TODO |
     |   unicore32: | TODO |
     |         x86: |  ok  |
-    |      xtensa: | TODO |
+    |      xtensa: |  ok  |
     -----------------------
diff --git a/Documentation/features/locking/rwsem-optimized/arch-support.txt b/Documentation/features/locking/rwsem-optimized/arch-support.txt
deleted file mode 100644
index 7521d7500fbe..000000000000
--- a/Documentation/features/locking/rwsem-optimized/arch-support.txt
+++ /dev/null
@@ -1,34 +0,0 @@
-#
-# Feature name:          rwsem-optimized
-#         Kconfig:       !RWSEM_GENERIC_SPINLOCK
-#         description:   arch provides optimized rwsem APIs
-#
-    -----------------------
-    |         arch |status|
-    -----------------------
-    |       alpha: |  ok  |
-    |         arc: | TODO |
-    |         arm: |  ok  |
-    |       arm64: |  ok  |
-    |         c6x: | TODO |
-    |        csky: | TODO |
-    |       h8300: | TODO |
-    |     hexagon: | TODO |
-    |        ia64: |  ok  |
-    |        m68k: | TODO |
-    |  microblaze: | TODO |
-    |        mips: | TODO |
-    |       nds32: | TODO |
-    |       nios2: | TODO |
-    |    openrisc: | TODO |
-    |      parisc: | TODO |
-    |     powerpc: | TODO |
-    |       riscv: | TODO |
-    |        s390: |  ok  |
-    |          sh: |  ok  |
-    |       sparc: |  ok  |
-    |          um: |  ok  |
-    |   unicore32: | TODO |
-    |         x86: |  ok  |
-    |      xtensa: |  ok  |
-    -----------------------
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH 2/2] arm64: tlb: Add boot parameter to disable TLB flush within the same inner shareable domain
From: Catalin Marinas @ 2019-07-23 12:11 UTC (permalink / raw)
  To: Takao Indoh
  Cc: Jonathan Corbet, Will Deacon, linux-doc, linux-kernel,
	linux-arm-kernel, QI Fuli, Takao Indoh
In-Reply-To: <20190617143255.10462-3-indou.takao@jp.fujitsu.com>

On Mon, Jun 17, 2019 at 11:32:55PM +0900, Takao Indoh wrote:
> From: Takao Indoh <indou.takao@fujitsu.com>
> 
> This patch adds a new boot parameter 'disable_tlbflush_is' to disable TLB
> flush within the same inner shareable domain for performance tuning.
> 
> In the case of flush_tlb_mm() *without* this parameter, the TLB entry is
> invalidated by __tlbi(aside1is, asid). With this instruction, all CPUs
> within the same inner shareable domain check whether they have TLB entries
> with this ASID. This causes performance noise, especially in large-scale
> HPC environments with more than a thousand nodes and a low-latency
> interconnect.
> 
> When this new parameter is specified, the TLB entry is invalidated by
> __tlbi(aside1, asid) only on the CPUs specified by mm_cpumask(mm).
> Therefore the TLB flush is done on a minimal set of CPUs and the
> performance problem does not occur.
> 
> Signed-off-by: QI Fuli <qi.fuli@fujitsu.com>
> Signed-off-by: Takao Indoh <indou.takao@fujitsu.com>
[...]
> +void flush_tlb_mm(struct mm_struct *mm)
> +{
> +	if (disable_tlbflush_is)
> +		on_each_cpu_mask(mm_cpumask(mm), ipi_flush_tlb_mm,
> +				 (void *)mm, true);
> +	else
> +		__flush_tlb_mm(mm);
> +}

Could we try instead to call a _nosync variant here when
cpumask_weight() is 1 or the *IS if greater than 1 and avoid the IPI?

Will tried this in the past but because of the task placement after
fork()+execve(), I think we always ended up with a weight of 2 in the
child process. Your first patch "solves" this by flushing the TLBs on
context switch (bar the CnP case). Can you give it a try to see if it
improves things? At least it's a starting point for further
investigation.
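
To make the idea concrete, here is a rough standalone model of the
selection, not kernel code. The enum, the 64-bit mask standing in for
mm_cpumask(), and both function names are hypothetical; treating an empty
mask like the "greater than 1" case is also an assumption made here to stay
conservative.

```c
#include <stdint.h>

/* Hypothetical labels for the two flush flavours discussed above. */
enum demo_flush_kind {
	DEMO_FLUSH_NOSYNC,		/* the _nosync variant */
	DEMO_FLUSH_INNER_SHAREABLE,	/* the broadcast *IS variant */
};

/* Count set bits, standing in for cpumask_weight(). */
static unsigned int demo_cpumask_weight(uint64_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Pick the flush flavour from the mask weight alone, so that no IPI
 * is needed in either case.
 */
static enum demo_flush_kind demo_pick_flush(uint64_t mm_cpumask)
{
	if (demo_cpumask_weight(mm_cpumask) == 1)
		return DEMO_FLUSH_NOSYNC;

	/* Weight 0 or greater than 1: fall back to the broadcast. */
	return DEMO_FLUSH_INNER_SHAREABLE;
}
```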

I fully agree with Will that we don't want two different TLB handling
implementations in the arm64 kernel, and a command line option to choose
between them is even less desirable.

Thanks.

-- 
Catalin

^ permalink raw reply

* Re: [PATCH 1/2] arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm: kill mm_cpumask usage"))
From: Catalin Marinas @ 2019-07-23 11:55 UTC (permalink / raw)
  To: Takao Indoh
  Cc: Jonathan Corbet, Will Deacon, linux-doc, linux-kernel,
	linux-arm-kernel, QI Fuli, Takao Indoh
In-Reply-To: <20190617143255.10462-2-indou.takao@jp.fujitsu.com>

Hi,

I know Will is on the case but just expressing some thoughts of my own.

On Mon, Jun 17, 2019 at 11:32:54PM +0900, Takao Indoh wrote:
> From: Takao Indoh <indou.takao@fujitsu.com>
> 
> mm_cpumask was deleted by the commit 38d96287504a ("arm64: mm: kill
> mm_cpumask usage") because it was not used at that time. Now this is needed
> to find appropriate CPUs for TLB flush, so this patch reverts this commit.
> 
> Signed-off-by: QI Fuli <qi.fuli@fujitsu.com>
> Signed-off-by: Takao Indoh <indou.takao@fujitsu.com>
> ---
>  arch/arm64/include/asm/mmu_context.h | 7 ++++++-
>  arch/arm64/kernel/smp.c              | 6 ++++++
>  arch/arm64/mm/context.c              | 2 ++
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 2da3e478fd8f..21ef11590bcb 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -241,8 +241,13 @@ static inline void
>  switch_mm(struct mm_struct *prev, struct mm_struct *next,
>  	  struct task_struct *tsk)
>  {
> -	if (prev != next)
> +	unsigned int cpu = smp_processor_id();
> +
> +	if (prev != next) {
>  		__switch_mm(next);
> +		cpumask_clear_cpu(cpu, mm_cpumask(prev));
> +		local_flush_tlb_mm(prev);
> +	}

That's not actually a revert, as we've never flushed the TLBs on the
switch_mm() path. Also, this flush is not sufficient on a CnP capable
CPU, since another thread of the same CPU could still have the prev
TTBR0_EL1 value set and load the entries back into the TLB.

-- 
Catalin

^ permalink raw reply

* [PATCH] Documentation: filesystem: fix "Removed Sysctls" table
From: Sheriff Esseson @ 2019-07-23 11:48 UTC (permalink / raw)
  To: skhan
  Cc: linux-kernel-mentees, Darrick J. Wong, linux-xfs, Jonathan Corbet,
	open list:DOCUMENTATION, open list

the "Removed Sysctls" section is a table - bring it alive with ReST.

Signed-off-by: Sheriff Esseson <sheriffesseson@gmail.com>
---
 Documentation/admin-guide/xfs.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index e76665a8f2f2..fb5b39f73059 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -337,11 +337,12 @@ None at present.
 Removed Sysctls
 ===============
 
+=============================	=======
   Name				Removed
-  ----				-------
+=============================	=======
   fs.xfs.xfsbufd_centisec	v4.0
   fs.xfs.age_buffer_centisecs	v4.0
-
+=============================	=======
 
 Error handling
 ==============
-- 
2.22.0


^ permalink raw reply related
