Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period

From: Johannes Weiner <hannes@cmpxchg.org>
To: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	surenb@google.com, brauner@kernel.org, chris@chrisdown.name
Subject: Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period
Date: Wed, 29 Mar 2023 15:05:32 -0400
Message-ID: <ZCSL/OBkpw5EClUu@cmpxchg.org>
In-Reply-To: <CA+CLi1hupf1t52GR-GgvAxhfJeUs6Z59Pmgd4NLmkwiEKTyACQ@mail.gmail.com>

On Wed, Mar 29, 2023 at 08:32:12PM +0200, Domenico Cerasuolo wrote:
> On Wed, Mar 29, 2023 at 6:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Wed, Mar 29, 2023 at 05:33:27PM +0200, Domenico Cerasuolo wrote:
> > > PSI offers 2 mechanisms to get information about a specific resource
> > > pressure. One is reading from /proc/pressure/<resource>, which gives
> > > average pressures aggregated every 2s. The other is creating a pollable
> > > fd for a specific resource and cgroup.
> > >
> > > The trigger creation requires CAP_SYS_RESOURCE, and gives the
> > > > possibility to pick a specific time window and threshold, spawning an RT
> > > thread to aggregate the data.
> > >
> > > Systemd would like to provide containers the option to monitor pressure
> > > on their own cgroup and sub-cgroups. For example, if systemd launches a
> > > container that itself then launches services, the container should have
> > > the ability to poll() for pressure in individual services. But neither
> > > the container nor the services are privileged.
> > >
> > > This patch implements a mechanism to allow unprivileged users to create
> > > pressure triggers. The difference with privileged triggers creation is
> > > that unprivileged ones must have a time window that's a multiple of 2s.
> > > This is so that we can avoid unrestricted spawning of rt threads, and
> > > use instead the same aggregation mechanism done for the averages, which
> > > runs independently of any triggers.
> > >
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
> >
> > Overall it looks good to me. Thanks for adding the comment on the
> > privilege check, it's much easier to understand now.
> >
> > A few nitpicks below:
> >
> > > @@ -151,6 +151,9 @@ struct psi_trigger {
> > >
> > >       /* Deferred event(s) from previous ratelimit window */
> > >       bool pending_event;
> > > +
> > > +     /* Used to differentiate destruction action*/
> > > +     enum psi_aggregators aggregator;
> > >  };
> >
> > The comment is a bit mysterious. How about
> >
> >         /* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
> >
> > > @@ -186,9 +186,9 @@ static void group_init(struct psi_group *group)
> > >               seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > >       group->avg_last_update = sched_clock();
> > >       group->avg_next_update = group->avg_last_update + psi_period;
> > > -     INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > >       mutex_init(&group->avgs_lock);
> > > -     /* Init trigger-related members */
> > > +
> > > +     /* Init rtpoll trigger-related members */
> > >       atomic_set(&group->rtpoll_scheduled, 0);
> > >       mutex_init(&group->rtpoll_trigger_lock);
> > >       INIT_LIST_HEAD(&group->rtpoll_triggers);
> > > @@ -197,6 +197,11 @@ static void group_init(struct psi_group *group)
> > >       init_waitqueue_head(&group->rtpoll_wait);
> > >       timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
> > >       rcu_assign_pointer(group->rtpoll_task, NULL);
> > > +
> > > +     /* Init avg trigger-related members */
> > > +     INIT_LIST_HEAD(&group->avg_triggers);
> > > +     memset(group->avg_nr_triggers, 0, sizeof(group->avg_nr_triggers));
> > > +     INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > >  }
> >
> > Can you move those above the rtpoll inits?
> >
> > It helps with navigating the code and spotting missing inits when the
> > init sequence follows the order of the struct members.
> >
> > > @@ -430,21 +435,25 @@ static u64 window_update(struct psi_window *win,
> > u64 now, u64 value)
> > >       return growth;
> > >  }
> > >
> > > > -static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total)
> > > > +static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total,
> > > > +                                                enum psi_aggregators aggregator)
> > >  {
> > >       struct psi_trigger *t;
> > > -     u64 *total = group->total[PSI_POLL];
> > > +     u64 *total = group->total[aggregator];
> > > > +     struct list_head *triggers = aggregator == PSI_AVGS ? &group->avg_triggers
> > > > +                     : &group->rtpoll_triggers;
> > > > +     u64 *aggregator_total = aggregator == PSI_AVGS ? group->avg_total : group->rtpoll_total;
> > >       *update_total = false;
> >
> > These lines are a bit too long. When the init part gets that long,
> > it's preferable to move it outside of the decl block:
> >
> >         if (aggregator == PSI_AVGS) {
> >                 triggers = &group->avg_triggers;
> >                 aggregator_total = group->avg_total;
> >         } else {
> >                 triggers = &group->rtpoll_triggers;
> >                 aggregator_total = group->rtpoll_total;
> >         }
> >
> > > @@ -1357,22 +1389,26 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > >               u64 period = ULLONG_MAX;
> > >
> > >               list_del(&t->node);
> > > -             group->rtpoll_nr_triggers[t->state]--;
> > > -             if (!group->rtpoll_nr_triggers[t->state])
> > > -                     group->rtpoll_states &= ~(1 << t->state);
> > > -             /* reset min update period for the remaining triggers */
> > > -             list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> > > -                     period = min(period, div_u64(tmp->win.size,
> > > -                                     UPDATES_PER_WINDOW));
> > > -             group->rtpoll_min_period = period;
> > > > -             /* Destroy rtpoll_task when the last trigger is destroyed */
> > > > -             if (group->rtpoll_states == 0) {
> > > > -                     group->rtpoll_until = 0;
> > > > -                     task_to_destroy = rcu_dereference_protected(
> > > > -                                     group->rtpoll_task,
> > > > -                                     lockdep_is_held(&group->rtpoll_trigger_lock));
> > > -                     rcu_assign_pointer(group->rtpoll_task, NULL);
> > > -                     del_timer(&group->rtpoll_timer);
> > > +             if (t->aggregator == PSI_AVGS) {
> > > +                     group->avg_nr_triggers[t->state]--;
> > > +             } else {
> > > +                     group->rtpoll_nr_triggers[t->state]--;
> > > +                     if (!group->rtpoll_nr_triggers[t->state])
> > > +                             group->rtpoll_states &= ~(1 << t->state);
> > > > +                     /* reset min update period for the remaining triggers */
> > > > +                     list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> > > > +                             period = min(period, div_u64(tmp->win.size,
> > > > +                                             UPDATES_PER_WINDOW));
> > > > +                     group->rtpoll_min_period = period;
> > > > +                     /* Destroy rtpoll_task when the last trigger is destroyed */
> > > > +                     if (group->rtpoll_states == 0) {
> > > > +                             group->rtpoll_until = 0;
> > > > +                             task_to_destroy = rcu_dereference_protected(
> > > > +                                             group->rtpoll_task,
> > > > +                                             lockdep_is_held(&group->rtpoll_trigger_lock));
> > > > +                             rcu_assign_pointer(group->rtpoll_task, NULL);
> > > +                             del_timer(&group->rtpoll_timer);
> >
> > These lines are quite long too, we usually shoot for a line length of
> > 80 characters. Can you do
> >
> >                 if (t->aggregator == PSI_AVGS) {
> >                         group->avg_nr_triggers[t->state]--;
> >                         return;
> >                 }
> >
> >                 /* Else, it's an rtpoll trigger */
> >                 group->rtpoll_nr_triggers[t->state]--;
> >                 ...
> >
> Can't return there I think, the function doesn't end after the else branch,
> should I put a `goto out` instead to jump the rtpoll code?

You're right, I missed the bottom part beyond the diff.

Looking closer, I think trigger_create and trigger_destroy are
actually buggy. They have to protect against update_triggers(), so both
creation and destruction must take the right lock - avgs_lock or
rtpoll_trigger_lock - before modifying the list. They're both taking
only the rtpoll_trigger_lock right now.

IOW the trigger type distinction needs to be higher up in general.

	if (t->aggregator == PSI_AVGS) {
		mutex_lock(&group->avgs_lock);
		...
	} else {
		mutex_lock(&group->rtpoll_trigger_lock);
		...
	}
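
For illustration only, here is a minimal user-space sketch of the lock split suggested above. It is not the kernel patch: pthread mutexes stand in for the kernel mutexes, the structs are cut down to the fields needed, and the helper names trigger_lock(), trigger_count(), trigger_attach() and trigger_detach() are hypothetical. Only the enum values and the lock and counter names are borrowed from the discussion.

```c
#include <assert.h>
#include <pthread.h>

/* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
enum psi_aggregators { PSI_AVGS, PSI_POLL };

/* Cut-down stand-in for struct psi_group (assumption, not the real layout) */
struct psi_group {
	pthread_mutex_t avgs_lock;           /* guards avg trigger state */
	pthread_mutex_t rtpoll_trigger_lock; /* guards rtpoll trigger state */
	int avg_nr_triggers;
	int rtpoll_nr_triggers;
};

/* Cut-down stand-in for struct psi_trigger */
struct psi_trigger {
	struct psi_group *group;
	enum psi_aggregators aggregator;
};

/* Pick the lock that the matching aggregation path would hold */
static pthread_mutex_t *trigger_lock(struct psi_trigger *t)
{
	if (t->aggregator == PSI_AVGS)
		return &t->group->avgs_lock;
	return &t->group->rtpoll_trigger_lock;
}

/* Pick the counter that belongs to the trigger's type */
static int *trigger_count(struct psi_trigger *t)
{
	if (t->aggregator == PSI_AVGS)
		return &t->group->avg_nr_triggers;
	return &t->group->rtpoll_nr_triggers;
}

/* Creation side: decide the type first, then modify under its lock */
static void trigger_attach(struct psi_trigger *t)
{
	pthread_mutex_t *lock = trigger_lock(t);

	pthread_mutex_lock(lock);
	(*trigger_count(t))++;
	pthread_mutex_unlock(lock);
}

/* Destruction side: same lock choice, so both paths stay symmetric */
static void trigger_detach(struct psi_trigger *t)
{
	pthread_mutex_t *lock = trigger_lock(t);

	pthread_mutex_lock(lock);
	assert(*trigger_count(t) > 0);
	(*trigger_count(t))--;
	pthread_mutex_unlock(lock);
}
```

The point of the sketch is only that the type check happens before any lock is taken, so creation and destruction always serialize against the aggregator that walks the same list.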

Thread overview: 10+ messages
2023-03-29 15:33 [PATCH v4 0/4] sched/psi: Allow unprivileged PSI polling Domenico Cerasuolo
2023-03-29 15:33 ` [PATCH v4 1/4] sched/psi: rearrange polling code in preparation Domenico Cerasuolo
2023-03-29 15:51   ` Johannes Weiner
2023-03-29 15:33 ` [PATCH v4 2/4] sched/psi: rename existing poll members " Domenico Cerasuolo
2023-03-29 15:53   ` Johannes Weiner
2023-03-29 15:33 ` [PATCH v4 3/4] sched/psi: extract update_triggers side effect Domenico Cerasuolo
2023-03-29 15:54   ` Johannes Weiner
2023-03-29 15:33 ` [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period Domenico Cerasuolo
2023-03-29 16:13   ` Johannes Weiner
     [not found]     ` <CA+CLi1hupf1t52GR-GgvAxhfJeUs6Z59Pmgd4NLmkwiEKTyACQ@mail.gmail.com>
2023-03-29 19:05       ` Johannes Weiner [this message]
