From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
To: Roman Gushchin <roman.gushchin-fxUVXftIFDnyG1zEObXtfA@public.gmane.org>
Cc: Alexander Fedorov
<halcien-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>,
Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
Vladimir Davydov
<vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
Muchun Song <songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>,
Sebastian Andrzej Siewior
<bigeasy-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>,
cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
Subject: Re: Possible race in obj_stock_flush_required() vs drain_obj_stock()
Date: Wed, 12 Oct 2022 13:23:11 -0400 [thread overview]
Message-ID: <Y0b3/wGDBL7GaNWJ@cmpxchg.org> (raw)
In-Reply-To: <Yzxc0jzOnAu667F8-+xijCwNIfdq8QK5Xt2JBmg@public.gmane.org>
On Tue, Oct 04, 2022 at 09:18:26AM -0700, Roman Gushchin wrote:
> On Mon, Oct 03, 2022 at 06:01:35PM +0300, Alexander Fedorov wrote:
> > On 03.10.2022 17:27, Michal Hocko wrote:
> > > On Mon 03-10-22 17:09:15, Alexander Fedorov wrote:
> > >> On 03.10.2022 16:32, Michal Hocko wrote:
> > >>> On Mon 03-10-22 15:47:10, Alexander Fedorov wrote:
> > >>>> @@ -3197,17 +3197,30 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
> > >>>> stock->nr_bytes = 0;
> > >>>> }
> > >>>>
> > >>>> - obj_cgroup_put(old);
> > >>>> + /*
> > >>>> + * Clear pointer before freeing memory so that
> > >>>> + * drain_all_stock() -> obj_stock_flush_required()
> > >>>> + * does not see a freed pointer.
> > >>>> + */
> > >>>> stock->cached_objcg = NULL;
> > >>>> + obj_cgroup_put(old);
> > >>>
> > >>> Do we need barrier() or something else to ensure there is no reordering?
> > >>> I am not reallyu sure what kind of barriers are implied by the pcp ref
> > >>> counting.
> > >>
> > >> obj_cgroup_put() -> kfree_rcu() -> synchronize_rcu() should take care
> > >> of this:
> > >
> > > This is a very subtle guarantee. Also it would only apply if this is the
> > > last reference, right?
> >
> > Hmm, yes, for the last reference only, also not sure about pcp ref
> > counter ordering rules for previous references.
> >
> > > Is there any reason to not use
> > > WRITE_ONCE(stock->cached_objcg, NULL);
> > > obj_cgroup_put(old);
> > >
> > > IIRC this should prevent any reordering.
> >
> > Now that I think about it we actually must use WRITE_ONCE everywhere
> > when writing cached_objcg because otherwise compiler might split the
> > pointer-sized store into several smaller-sized ones (store tearing),
> > and obj_stock_flush_required() would read garbage instead of pointer.
> >
> > And thinking about memory barriers, maybe we need them too alongside
> > WRITE_ONCE when setting pointer to non-null value? Otherwise
> > drain_all_stock() -> obj_stock_flush_required() might read old data.
> > Since that's exactly what rcu_assign_pointer() does, it seems
> > that we are going back to using rcu_*() primitives everywhere?
>
> Hm, Idk, I'm still somewhat resistant to the idea of putting rcu primitives,
> but maybe it's the right thing. Maybe instead we should always schedule draining
> on all cpus instead and perform a cpu-local check and bail out if a flush is not
> required? Michal, Johannes, what do you think?
I agree it's overkill.
This is a speculative check, and we don't need any state coherency,
just basic lifetime. READ_ONCE should fully address this problem. That
said, I think the code could be a bit clearer and better documented.
How about the below?
(Nevermind the ifdef, I'm working on removing CONFIG_MEMCG_KMEM
altogether, as it's a really strange way to say !SLOB at this point)
---
From 22855af38b116ec030286975ed2aa06851680296 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Date: Wed, 12 Oct 2022 12:59:07 -0400
Subject: [PATCH] mm: memcontrol: fix NULL deref race condition during cgroup
deletion
Alexander Fedorov reports a race condition between two concurrent
stock draining operations, where the first one clears the stock's obj
pointer between the pointer test and deref of the second. Analysis:
1) First CPU:
css_killed_work_fn() -> mem_cgroup_css_offline() ->
drain_all_stock() -> obj_stock_flush_required()
if (stock->cached_objcg) {
This check sees a non-NULL pointer for *another* CPU's `memcg_stock` instance.
2) Second CPU:
css_free_rwork_fn() -> __mem_cgroup_free() -> free_percpu() ->
obj_cgroup_uncharge() -> drain_obj_stock()
It frees `cached_objcg` pointer in its own `memcg_stock` instance:
struct obj_cgroup *old = stock->cached_objcg;
< ... >
obj_cgroup_put(old);
stock->cached_objcg = NULL;
3) First CPU continues after the 'if' check and re-reads the pointer
again, now it is NULL and dereferencing it leads to kernel panic:
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
struct mem_cgroup *root_memcg)
{
< ... >
if (stock->cached_objcg) {
memcg = obj_cgroup_memcg(stock->cached_objcg);
There is already RCU protection in place to ensure lifetime. Add the
missing READ_ONCE to the cgroup pointers to fix the TOCTOU, and
consolidate and document the speculative code.
Reported-by: Alexander Fedorov <halcien-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
---
mm/memcontrol.c | 44 ++++++++++++++++++++------------------------
1 file changed, 20 insertions(+), 24 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..09ac2f8991ee 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2190,8 +2190,6 @@ static DEFINE_MUTEX(percpu_charge_mutex);
#ifdef CONFIG_MEMCG_KMEM
static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock);
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
- struct mem_cgroup *root_memcg);
static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);
#else
@@ -2199,11 +2197,6 @@ static inline struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
{
return NULL;
}
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
- struct mem_cgroup *root_memcg)
-{
- return false;
-}
static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
{
}
@@ -2339,13 +2332,30 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
struct mem_cgroup *memcg;
bool flush = false;
+ /*
+ * Speculatively check up front if this CPU has any
+ * cached charges that belong to the specified
+ * root_memcg. The state may change from under us -
+ * which is okay, because the draining itself is a
+ * best-effort operation. Just ensure lifetime of
+ * whatever we end up looking at.
+ */
rcu_read_lock();
- memcg = stock->cached;
+ memcg = READ_ONCE(stock->cached);
if (memcg && stock->nr_pages &&
mem_cgroup_is_descendant(memcg, root_memcg))
flush = true;
- else if (obj_stock_flush_required(stock, root_memcg))
- flush = true;
+#ifdef CONFIG_MEMCG_KMEM
+ else {
+ struct obj_cgroup *objcg;
+
+ objcg = READ_ONCE(stock->cached_objcg);
+ if (objcg && stock->nr_bytes &&
+ mem_cgroup_is_descendant(obj_cgroup_memcg(objcg),
+ root_memcg))
+ flush = true;
+ }
+#endif
rcu_read_unlock();
if (flush &&
@@ -3297,20 +3307,6 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
return old;
}
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
- struct mem_cgroup *root_memcg)
-{
- struct mem_cgroup *memcg;
-
- if (stock->cached_objcg) {
- memcg = obj_cgroup_memcg(stock->cached_objcg);
- if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
- return true;
- }
-
- return false;
-}
-
static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
bool allow_uncharge)
{
--
2.37.3
WARNING: multiple messages have this Message-ID (diff)
From: Johannes Weiner <hannes@cmpxchg.org>
To: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alexander Fedorov <halcien@gmail.com>,
Michal Hocko <mhocko@suse.com>,
Shakeel Butt <shakeelb@google.com>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Muchun Song <songmuchun@bytedance.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Possible race in obj_stock_flush_required() vs drain_obj_stock()
Date: Wed, 12 Oct 2022 13:23:11 -0400 [thread overview]
Message-ID: <Y0b3/wGDBL7GaNWJ@cmpxchg.org> (raw)
In-Reply-To: <Yzxc0jzOnAu667F8@P9FQF9L96D.lan>
On Tue, Oct 04, 2022 at 09:18:26AM -0700, Roman Gushchin wrote:
> On Mon, Oct 03, 2022 at 06:01:35PM +0300, Alexander Fedorov wrote:
> > On 03.10.2022 17:27, Michal Hocko wrote:
> > > On Mon 03-10-22 17:09:15, Alexander Fedorov wrote:
> > >> On 03.10.2022 16:32, Michal Hocko wrote:
> > >>> On Mon 03-10-22 15:47:10, Alexander Fedorov wrote:
> > >>>> @@ -3197,17 +3197,30 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
> > >>>> stock->nr_bytes = 0;
> > >>>> }
> > >>>>
> > >>>> - obj_cgroup_put(old);
> > >>>> + /*
> > >>>> + * Clear pointer before freeing memory so that
> > >>>> + * drain_all_stock() -> obj_stock_flush_required()
> > >>>> + * does not see a freed pointer.
> > >>>> + */
> > >>>> stock->cached_objcg = NULL;
> > >>>> + obj_cgroup_put(old);
> > >>>
> > >>> Do we need barrier() or something else to ensure there is no reordering?
> > >>> I am not reallyu sure what kind of barriers are implied by the pcp ref
> > >>> counting.
> > >>
> > >> obj_cgroup_put() -> kfree_rcu() -> synchronize_rcu() should take care
> > >> of this:
> > >
> > > This is a very subtle guarantee. Also it would only apply if this is the
> > > last reference, right?
> >
> > Hmm, yes, for the last reference only, also not sure about pcp ref
> > counter ordering rules for previous references.
> >
> > > Is there any reason to not use
> > > WRITE_ONCE(stock->cached_objcg, NULL);
> > > obj_cgroup_put(old);
> > >
> > > IIRC this should prevent any reordering.
> >
> > Now that I think about it we actually must use WRITE_ONCE everywhere
> > when writing cached_objcg because otherwise compiler might split the
> > pointer-sized store into several smaller-sized ones (store tearing),
> > and obj_stock_flush_required() would read garbage instead of pointer.
> >
> > And thinking about memory barriers, maybe we need them too alongside
> > WRITE_ONCE when setting pointer to non-null value? Otherwise
> > drain_all_stock() -> obj_stock_flush_required() might read old data.
> > Since that's exactly what rcu_assign_pointer() does, it seems
> > that we are going back to using rcu_*() primitives everywhere?
>
> Hm, Idk, I'm still somewhat resistant to the idea of putting rcu primitives,
> but maybe it's the right thing. Maybe instead we should always schedule draining
> on all cpus instead and perform a cpu-local check and bail out if a flush is not
> required? Michal, Johannes, what do you think?
I agree it's overkill.
This is a speculative check, and we don't need any state coherency,
just basic lifetime. READ_ONCE should fully address this problem. That
said, I think the code could be a bit clearer and better documented.
How about the below?
(Nevermind the ifdef, I'm working on removing CONFIG_MEMCG_KMEM
altogether, as it's a really strange way to say !SLOB at this point)
---
From 22855af38b116ec030286975ed2aa06851680296 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 12 Oct 2022 12:59:07 -0400
Subject: [PATCH] mm: memcontrol: fix NULL deref race condition during cgroup
deletion
Alexander Fedorov reports a race condition between two concurrent
stock draining operations, where the first one clears the stock's obj
pointer between the pointer test and deref of the second. Analysis:
1) First CPU:
css_killed_work_fn() -> mem_cgroup_css_offline() ->
drain_all_stock() -> obj_stock_flush_required()
if (stock->cached_objcg) {
This check sees a non-NULL pointer for *another* CPU's `memcg_stock` instance.
2) Second CPU:
css_free_rwork_fn() -> __mem_cgroup_free() -> free_percpu() ->
obj_cgroup_uncharge() -> drain_obj_stock()
It frees `cached_objcg` pointer in its own `memcg_stock` instance:
struct obj_cgroup *old = stock->cached_objcg;
< ... >
obj_cgroup_put(old);
stock->cached_objcg = NULL;
3) First CPU continues after the 'if' check and re-reads the pointer
again, now it is NULL and dereferencing it leads to kernel panic:
static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
struct mem_cgroup *root_memcg)
{
< ... >
if (stock->cached_objcg) {
memcg = obj_cgroup_memcg(stock->cached_objcg);
There is already RCU protection in place to ensure lifetime. Add the
missing READ_ONCE to the cgroup pointers to fix the TOCTOU, and
consolidate and document the speculative code.
Reported-by: Alexander Fedorov <halcien@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/memcontrol.c | 44 ++++++++++++++++++++------------------------
1 file changed, 20 insertions(+), 24 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..09ac2f8991ee 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2190,8 +2190,6 @@ static DEFINE_MUTEX(percpu_charge_mutex);
#ifdef CONFIG_MEMCG_KMEM
static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock);
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
- struct mem_cgroup *root_memcg);
static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages);
#else
@@ -2199,11 +2197,6 @@ static inline struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
{
return NULL;
}
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
- struct mem_cgroup *root_memcg)
-{
- return false;
-}
static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
{
}
@@ -2339,13 +2332,30 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
struct mem_cgroup *memcg;
bool flush = false;
+ /*
+ * Speculatively check up front if this CPU has any
+ * cached charges that belong to the specified
+ * root_memcg. The state may change from under us -
+ * which is okay, because the draining itself is a
+ * best-effort operation. Just ensure lifetime of
+ * whatever we end up looking at.
+ */
rcu_read_lock();
- memcg = stock->cached;
+ memcg = READ_ONCE(stock->cached);
if (memcg && stock->nr_pages &&
mem_cgroup_is_descendant(memcg, root_memcg))
flush = true;
- else if (obj_stock_flush_required(stock, root_memcg))
- flush = true;
+#ifdef CONFIG_MEMCG_KMEM
+ else {
+ struct obj_cgroup *objcg;
+
+ objcg = READ_ONCE(stock->cached_objcg);
+ if (objcg && stock->nr_bytes &&
+ mem_cgroup_is_descendant(obj_cgroup_memcg(objcg),
+ root_memcg))
+ flush = true;
+ }
+#endif
rcu_read_unlock();
if (flush &&
@@ -3297,20 +3307,6 @@ static struct obj_cgroup *drain_obj_stock(struct memcg_stock_pcp *stock)
return old;
}
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
- struct mem_cgroup *root_memcg)
-{
- struct mem_cgroup *memcg;
-
- if (stock->cached_objcg) {
- memcg = obj_cgroup_memcg(stock->cached_objcg);
- if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
- return true;
- }
-
- return false;
-}
-
static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
bool allow_uncharge)
{
--
2.37.3
next prev parent reply other threads:[~2022-10-12 17:23 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-30 14:06 Possible race in obj_stock_flush_required() vs drain_obj_stock() Alexander Fedorov
2022-09-30 14:06 ` Alexander Fedorov
[not found] ` <1664546131660.1777662787.1655319815-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2022-09-30 18:26 ` Roman Gushchin
2022-09-30 18:26 ` Roman Gushchin
[not found] ` <Yzc0yZwDB8GG+4t7-+xijCwNIfdoLQcUKs7qKB+WAnPUfkyWGUBSOeVevoDU@public.gmane.org>
2022-10-01 12:38 ` Alexander Fedorov
2022-10-01 12:38 ` Alexander Fedorov
[not found] ` <b91e75f4-b09c-85aa-c6ad-2364dab9af92-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2022-10-02 16:16 ` Roman Gushchin
2022-10-02 16:16 ` Roman Gushchin
[not found] ` <Yzm5cukBe6IfyAs7-+xijCwNIfdq8QK5Xt2JBmg@public.gmane.org>
2022-10-03 12:47 ` Alexander Fedorov
2022-10-03 12:47 ` Alexander Fedorov
[not found] ` <d3cf9c69-19a1-53f9-cf97-5d40ce5cda44-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2022-10-03 13:32 ` Michal Hocko
2022-10-03 13:32 ` Michal Hocko
[not found] ` <YzrkaKZKYqx+c325-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-10-03 14:09 ` Alexander Fedorov
2022-10-03 14:09 ` Alexander Fedorov
[not found] ` <821923d8-17c3-f1c2-4d6a-5653c88db3e8-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2022-10-03 14:27 ` Michal Hocko
2022-10-03 14:27 ` Michal Hocko
[not found] ` <YzrxNGpf7sSwSWy2-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2022-10-03 15:01 ` Alexander Fedorov
2022-10-03 15:01 ` Alexander Fedorov
[not found] ` <2f9bdffd-062e-a364-90c4-da7f09c95619-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2022-10-04 16:18 ` Roman Gushchin
2022-10-04 16:18 ` Roman Gushchin
[not found] ` <Yzxc0jzOnAu667F8-+xijCwNIfdq8QK5Xt2JBmg@public.gmane.org>
2022-10-12 17:23 ` Johannes Weiner [this message]
2022-10-12 17:23 ` Johannes Weiner
[not found] ` <Y0b3/wGDBL7GaNWJ-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2022-10-12 18:49 ` Roman Gushchin
2022-10-12 18:49 ` Roman Gushchin
[not found] ` <Y0cMMPwE4aus3P9c-+xijCwNIfdoLQcUKs7qKB+WAnPUfkyWGUBSOeVevoDU@public.gmane.org>
2022-10-12 19:18 ` Johannes Weiner
2022-10-12 19:18 ` Johannes Weiner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y0b3/wGDBL7GaNWJ@cmpxchg.org \
--to=hannes-druugvl0lcnafugrpc6u6w@public.gmane.org \
--cc=bigeasy-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org \
--cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
--cc=halcien-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
--cc=linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org \
--cc=mhocko-IBi9RG/b67k@public.gmane.org \
--cc=roman.gushchin-fxUVXftIFDnyG1zEObXtfA@public.gmane.org \
--cc=shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
--cc=songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org \
--cc=vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.