Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period

From: Johannes Weiner <hannes@cmpxchg.org>
To: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	surenb@google.com, brauner@kernel.org, chris@chrisdown.name
Subject: Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period
Date: Wed, 29 Mar 2023 15:05:32 -0400
Message-ID: <ZCSL/OBkpw5EClUu@cmpxchg.org>
In-Reply-To: <CA+CLi1hupf1t52GR-GgvAxhfJeUs6Z59Pmgd4NLmkwiEKTyACQ@mail.gmail.com>

On Wed, Mar 29, 2023 at 08:32:12PM +0200, Domenico Cerasuolo wrote:
> On Wed, Mar 29, 2023 at 6:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Wed, Mar 29, 2023 at 05:33:27PM +0200, Domenico Cerasuolo wrote:
> > > PSI offers 2 mechanisms to get information about a specific resource
> > > pressure. One is reading from /proc/pressure/<resource>, which gives
> > > average pressures aggregated every 2s. The other is creating a pollable
> > > fd for a specific resource and cgroup.
> > >
> > > The trigger creation requires CAP_SYS_RESOURCE, and gives the
> > > possibility to pick specific time window and threshold, spawning an RT
> > > thread to aggregate the data.
> > >
> > > Systemd would like to provide containers the option to monitor pressure
> > > on their own cgroup and sub-cgroups. For example, if systemd launches a
> > > container that itself then launches services, the container should have
> > > the ability to poll() for pressure in individual services. But neither
> > > the container nor the services are privileged.
> > >
> > > This patch implements a mechanism to allow unprivileged users to create
> > > pressure triggers. The difference from privileged trigger creation is
> > > that unprivileged ones must have a time window that's a multiple of 2s.
> > > This is so that we can avoid unrestricted spawning of rt threads, and
> > > use instead the same aggregation mechanism done for the averages, which
> > > runs independently of any triggers.
> > >
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> >
> > Overall it looks good to me. Thanks for adding the comment on the
> > privilege check, it's much easier to understand now.
> >
> > A few nitpicks below:
> >
> > > @@ -151,6 +151,9 @@ struct psi_trigger {
> > >
> > >       /* Deferred event(s) from previous ratelimit window */
> > >       bool pending_event;
> > > +
> > > +     /* Used to differentiate destruction action*/
> > > +     enum psi_aggregators aggregator;
> > >  };
> >
> > The comment is a bit mysterious. How about
> >
> >         /* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
> >
> > > @@ -186,9 +186,9 @@ static void group_init(struct psi_group *group)
> > >               seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > >       group->avg_last_update = sched_clock();
> > >       group->avg_next_update = group->avg_last_update + psi_period;
> > > -     INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > >       mutex_init(&group->avgs_lock);
> > > -     /* Init trigger-related members */
> > > +
> > > +     /* Init rtpoll trigger-related members */
> > >       atomic_set(&group->rtpoll_scheduled, 0);
> > >       mutex_init(&group->rtpoll_trigger_lock);
> > >       INIT_LIST_HEAD(&group->rtpoll_triggers);
> > > @@ -197,6 +197,11 @@ static void group_init(struct psi_group *group)
> > >       init_waitqueue_head(&group->rtpoll_wait);
> > >       timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
> > >       rcu_assign_pointer(group->rtpoll_task, NULL);
> > > +
> > > +     /* Init avg trigger-related members */
> > > +     INIT_LIST_HEAD(&group->avg_triggers);
> > > +     memset(group->avg_nr_triggers, 0, sizeof(group->avg_nr_triggers));
> > > +     INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > >  }
> >
> > Can you move those above the rtpoll inits?
> >
> > It helps with navigating the code and spotting missing inits when the
> > init sequence follows the order of the struct members.
> >
> > > @@ -430,21 +435,25 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> > >       return growth;
> > >  }
> > >
> > > -static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total)
> > > +static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total,
> > > +                                                enum psi_aggregators aggregator)
> > >  {
> > >       struct psi_trigger *t;
> > > -     u64 *total = group->total[PSI_POLL];
> > > +     u64 *total = group->total[aggregator];
> > > +     struct list_head *triggers = aggregator == PSI_AVGS ? &group->avg_triggers
> > > +                     : &group->rtpoll_triggers;
> > > +     u64 *aggregator_total = aggregator == PSI_AVGS ? group->avg_total : group->rtpoll_total;
> > >       *update_total = false;
> >
> > These lines are a bit too long. When the init part gets that long,
> > it's preferable to move it outside of the decl block:
> >
> >         if (aggregator == PSI_AVGS) {
> >                 triggers = &group->avg_triggers;
> >                 aggregator_total = group->avg_total;
> >         } else {
> >                 triggers = &group->rtpoll_triggers;
> >                 aggregator_total = group->rtpoll_total;
> >         }
> >
> > > @@ -1357,22 +1389,26 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > >               u64 period = ULLONG_MAX;
> > >
> > >               list_del(&t->node);
> > > -             group->rtpoll_nr_triggers[t->state]--;
> > > -             if (!group->rtpoll_nr_triggers[t->state])
> > > -                     group->rtpoll_states &= ~(1 << t->state);
> > > -             /* reset min update period for the remaining triggers */
> > > -             list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> > > -                     period = min(period, div_u64(tmp->win.size,
> > > -                                     UPDATES_PER_WINDOW));
> > > -             group->rtpoll_min_period = period;
> > > -             /* Destroy rtpoll_task when the last trigger is destroyed */
> > > -             if (group->rtpoll_states == 0) {
> > > -                     group->rtpoll_until = 0;
> > > -                     task_to_destroy = rcu_dereference_protected(
> > > -                                     group->rtpoll_task,
> > > -                                     lockdep_is_held(&group->rtpoll_trigger_lock));
> > > -                     rcu_assign_pointer(group->rtpoll_task, NULL);
> > > -                     del_timer(&group->rtpoll_timer);
> > > +             if (t->aggregator == PSI_AVGS) {
> > > +                     group->avg_nr_triggers[t->state]--;
> > > +             } else {
> > > +                     group->rtpoll_nr_triggers[t->state]--;
> > > +                     if (!group->rtpoll_nr_triggers[t->state])
> > > +                             group->rtpoll_states &= ~(1 << t->state);
> > > +                     /* reset min update period for the remaining triggers */
> > > +                     list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> > > +                             period = min(period, div_u64(tmp->win.size,
> > > +                                             UPDATES_PER_WINDOW));
> > > +                     group->rtpoll_min_period = period;
> > > +                     /* Destroy rtpoll_task when the last trigger is destroyed */
> > > +                     if (group->rtpoll_states == 0) {
> > > +                             group->rtpoll_until = 0;
> > > +                             task_to_destroy = rcu_dereference_protected(
> > > +                                             group->rtpoll_task,
> > > +                                             lockdep_is_held(&group->rtpoll_trigger_lock));
> > > +                             rcu_assign_pointer(group->rtpoll_task, NULL);
> > > +                             del_timer(&group->rtpoll_timer);
> >
> > These lines are quite long too, we usually shoot for a line length of
> > 80 characters. Can you do
> >
> >                 if (t->aggregator == PSI_AVGS) {
> >                         group->avg_nr_triggers[t->state]--;
> >                         return;
> >                 }
> >
> >                 /* Else, it's an rtpoll trigger */
> >                 group->rtpoll_nr_triggers[t->state]--;
> >                 ...
> >
> I don't think I can return there: the function doesn't end after the else
> branch. Should I use a `goto out` instead to skip the rtpoll code?

You're right, I missed the bottom part beyond the diff.

Looking closer, I think trigger_create and trigger_destroy are
actually buggy. They have to protect against update_triggers(), so
both creation and destruction must take the right lock - avgs_lock or
rtpoll_trigger_lock - before modifying the list. They're both taking
only the rtpoll_trigger_lock right now.

IOW the trigger type distinction needs to be higher up in general.

	if (t->aggregator == PSI_AVGS) {
		mutex_lock(&group->avgs_lock);
		...
	} else {
		mutex_lock(&group->rtpoll_trigger_lock);
		...
	}
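
To make the shape concrete, here is a minimal userspace sketch of that
branching. It is only an illustration under assumptions, not the
kernel code: pthread mutexes stand in for kernel mutexes, and struct
group, struct trigger, trigger_add() and trigger_del() are
hypothetical names invented for this example. The only point is that
each list and counter is touched solely under its own lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Userspace stand-ins; enum values mirror the patch, the rest is hypothetical. */
enum psi_aggregators { PSI_AVGS, PSI_POLL };

struct trigger {
	enum psi_aggregators aggregator;
	struct trigger *next;
};

struct group {
	pthread_mutex_t avgs_lock;		/* guards avg_* members */
	pthread_mutex_t rtpoll_trigger_lock;	/* guards rtpoll_* members */
	struct trigger *avg_triggers;
	struct trigger *rtpoll_triggers;
	unsigned int avg_nr_triggers;
	unsigned int rtpoll_nr_triggers;
};

static void group_init(struct group *g)
{
	pthread_mutex_init(&g->avgs_lock, NULL);
	pthread_mutex_init(&g->rtpoll_trigger_lock, NULL);
	g->avg_triggers = NULL;
	g->rtpoll_triggers = NULL;
	g->avg_nr_triggers = 0;
	g->rtpoll_nr_triggers = 0;
}

/* Unlink t from a singly linked list; the caller holds the list's lock. */
static void list_unlink(struct trigger **head, struct trigger *t)
{
	while (*head && *head != t)
		head = &(*head)->next;
	if (*head)
		*head = t->next;
}

/* The type check comes first, so each list is modified under its own lock. */
static void trigger_add(struct group *g, struct trigger *t)
{
	if (t->aggregator == PSI_AVGS) {
		pthread_mutex_lock(&g->avgs_lock);
		t->next = g->avg_triggers;
		g->avg_triggers = t;
		g->avg_nr_triggers++;
		pthread_mutex_unlock(&g->avgs_lock);
	} else {
		pthread_mutex_lock(&g->rtpoll_trigger_lock);
		t->next = g->rtpoll_triggers;
		g->rtpoll_triggers = t;
		g->rtpoll_nr_triggers++;
		pthread_mutex_unlock(&g->rtpoll_trigger_lock);
	}
}

/* Assumes t was previously added to g with trigger_add(). */
static void trigger_del(struct group *g, struct trigger *t)
{
	if (t->aggregator == PSI_AVGS) {
		pthread_mutex_lock(&g->avgs_lock);
		assert(g->avg_nr_triggers > 0);
		list_unlink(&g->avg_triggers, t);
		g->avg_nr_triggers--;
		pthread_mutex_unlock(&g->avgs_lock);
	} else {
		pthread_mutex_lock(&g->rtpoll_trigger_lock);
		assert(g->rtpoll_nr_triggers > 0);
		list_unlink(&g->rtpoll_triggers, t);
		g->rtpoll_nr_triggers--;
		pthread_mutex_unlock(&g->rtpoll_trigger_lock);
	}
}
```

With the branch at the top, any code that is common to both trigger
types can follow the if/else without a return or goto in the middle.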
