From: Mel Gorman <mgorman@techsingularity.net>
To: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Davidlohr Bueso <dave@stgolabs.net>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: Cgroup memory barrier usage and call frequency from scheduler
Date: Thu, 9 Apr 2020 19:20:55 +0100 [thread overview]
Message-ID: <20200409182055.GM3818@techsingularity.net> (raw)
In-Reply-To: <20200409175621.GA37608@mtj.thefacebook.com>
On Thu, Apr 09, 2020 at 01:56:21PM -0400, Tejun Heo wrote:
> Hello, Mel.
>
> On Thu, Apr 09, 2020 at 04:44:13PM +0100, Mel Gorman wrote:
> > Commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug
> > cgroup_rstat_updated() race window") introduced two full memory
> > barriers to close a race. The one in cgroup_rstat_updated can be
> > called at a high frequency from the scheduler from update_curr ->
> > cgroup_account_cputime. The patch has no cc's, acks or reviews so I'm
> > not sure how closely this was looked at. cgroup_rstat_updated shows up
> > in profiles of netperf UDP_STREAM accounting for about 1% of overhead
>
> Oops, that's pretty high.
>
> > which doesn't sound a lot but that's about the same weight as some of
> > the critical network paths. I have three questions about the patch
> >
> > 1. Why were full barriers used?
>
> Given
>
> A C
> --- ---
> B D
>
> the code is trying to guarantee that either B sees C or D sees A, so it does
> need full ordering.
>
Ok, still not particularly clear given where they are used and how
it's related to updated_children but like you say later it's "kinda
irrelevant" :)
> > 2. Why was it important that the data race be closed when the inaccuracy
> > is temporary?
>
> There was a pending patchset which converted memcg to use rstat and the
> conversion included the event counters which needed to be synchronous (e.g.
> for things like oom kill counts). The patchset didn't make it through due to
> the percpu memory overhead at the time. The memory overhead issue can be
> resolved now but in the meantime memcg got improved in a different way which
> made the rstat conversion not immediately necessary, so it fell through the
> cracks. In retrospect, this patch shouldn't have been committed on its own or
> at least the synchronous and pure state update paths should have been
> separate.
>
Ah, thanks for that explanation.
> > 3. Why is it called from the context of update_curr()?
>
> It's just being callled from the path which udpates sched statistics.
>
> > For 1, the use of a full barrier seems unnecessary when it appears that
> > you could have used a read barrier and a write barrier. The following
> > patch drops the profile overhead to 0.1%
>
> I'm not sure this is correct but that's kinda irrelevant.
>
> > For 2, the changelog says the barriers are necessary because "we plan to use
> > rstat to track counters which need to be accurate". That is a bit vague.
> > Under what circumstances is a transient inaccuracy a serious enough
> > problem to justify additional barriers in the scheduler?
>
> Hope this is explained now.
>
It is.
> > For 3, update_curr() is called from a lot of places, some of which are
> > quite hot -- e.g. task enqueue/dequeue. This is necessary information from
> > the runqueue needs to be preserved. However, it's less clear that the cpu
> > accounting information needs to be up to date on this granularity although
> > it might be related to question 2. Why was the delta_exec not similarly
> > accumulated in cpuacct_change() and defer the hierarchical update to
> > be called from somewhere like entity_tick()? It would need tracking the
> > CPU time at the last update as delta_exec would be lost so it's not very
> > trivial but it does not look like it would be overly complicated.
>
> Most likely historic. The code has been there for a long time and the only
> recent changes were plumbing around them. Nothing in cpuacct needs to be
> per-scheduling-event accurate, so yeah, for the longer term, it'd be a good
> idea to move them out of hot path.
>
Even if it's a future thing it helps me to know the accuracy does not
have to be perfect. It means if it bothers me enough, I can take a shot
at addressing it myself without having to worry that some controller is
broken as a side-effect.
> For now, I'll revert the patch. Nothing in tree needs that right now. If the
> need for synchronous counting comes back later, I'll make that a separate
> path.
>
That's perfect, thanks!
--
Mel Gorman
SUSE Labs
next prev parent reply other threads:[~2020-04-09 18:21 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-09 15:44 Cgroup memory barrier usage and call frequency from scheduler Mel Gorman
2020-04-09 16:49 ` Peter Zijlstra
2020-04-09 17:21 ` Mel Gorman
2020-04-09 17:56 ` Tejun Heo
2020-04-09 18:20 ` Mel Gorman [this message]
2020-04-09 19:27 ` Tejun Heo
[not found] ` <20200409154413.GK3818-3eNAlZScCAx27rWaFMvyedHuzzzSOjJt@public.gmane.org>
2020-04-09 19:08 ` [PATCH cgroup/for-5.7-fixes] Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window" Tejun Heo
2020-04-09 19:08 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200409182055.GM3818@techsingularity.net \
--to=mgorman@techsingularity.net \
--cc=dave@stgolabs.net \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.