* [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() @ 2018-01-25 0:19 Roman Gushchin 2018-01-25 17:03 ` David Miller 0 siblings, 1 reply; 10+ messages in thread From: Roman Gushchin @ 2018-01-25 0:19 UTC (permalink / raw) To: netdev Cc: linux-kernel, kernel-team, Roman Gushchin, Eric Dumazet, Johannes Weiner, Tejun Heo, David S . Miller We've catched several cgroup css refcounting issues on 4.15-rc7, triggered from different release paths. We've used cgroups v2. I've added a temporarily per-memcg sockmem atomic counter, and found, that we're sometimes falling below 0. It was easy to reproduce, so I was able to bisect the problem. It was introduced by the commit 9f1c2674b328 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()"), which moved the mem_cgroup_sk_alloc() call from the BH context into inet_csk_accept(). The problem is that all the memory allocated before mem_cgroup_sk_alloc() is charged to the socket, but not charged to the memcg. So, when we're releasing the socket, we're uncharging more, than we've charged. Fix this by charging the cgroup by the amount of already allocated memory right after mem_cgroup_sk_alloc() in inet_csk_accept(). Fixes: 9f1c2674b328 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: David S. Miller <davem@davemloft.net> --- net/ipv4/inet_connection_sock.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 4ca46dc08e63..f439162c2ea2 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -434,6 +434,7 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) struct request_sock *req; struct sock *newsk; int error; + long amt; lock_sock(sk); @@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) spin_unlock_bh(&queue->fastopenq.lock); } mem_cgroup_sk_alloc(newsk); + amt = sk_memory_allocated(newsk); + if (amt && newsk->sk_memcg) + mem_cgroup_charge_skmem(newsk->sk_memcg, amt); + out: release_sock(sk); if (req) -- 2.14.3 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() 2018-01-25 0:19 [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() Roman Gushchin @ 2018-01-25 17:03 ` David Miller 2018-01-25 17:15 ` Roman Gushchin 2018-01-31 21:54 ` Roman Gushchin 0 siblings, 2 replies; 10+ messages in thread From: David Miller @ 2018-01-25 17:03 UTC (permalink / raw) To: guro; +Cc: netdev, linux-kernel, kernel-team, edumazet, hannes, tj From: Roman Gushchin <guro@fb.com> Date: Thu, 25 Jan 2018 00:19:11 +0000 > @@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) > spin_unlock_bh(&queue->fastopenq.lock); > } > mem_cgroup_sk_alloc(newsk); > + amt = sk_memory_allocated(newsk); > + if (amt && newsk->sk_memcg) > + mem_cgroup_charge_skmem(newsk->sk_memcg, amt); > + This looks confusing to me. sk_memory_allocated() is the total amount of memory used by all sockets for a particular "struct proto", not just for that specific socket. Maybe I don't understand how this socket memcg stuff works, but it seems like you should be looking instead at how much memory is allocated to this specific socket. Thanks. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() 2018-01-25 17:03 ` David Miller @ 2018-01-25 17:15 ` Roman Gushchin 2018-01-31 21:54 ` Roman Gushchin 1 sibling, 0 replies; 10+ messages in thread From: Roman Gushchin @ 2018-01-25 17:15 UTC (permalink / raw) To: David Miller; +Cc: netdev, linux-kernel, kernel-team, edumazet, hannes, tj On Thu, Jan 25, 2018 at 12:03:02PM -0500, David Miller wrote: > From: Roman Gushchin <guro@fb.com> > Date: Thu, 25 Jan 2018 00:19:11 +0000 > > > @@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) > > spin_unlock_bh(&queue->fastopenq.lock); > > } > > mem_cgroup_sk_alloc(newsk); > > + amt = sk_memory_allocated(newsk); > > + if (amt && newsk->sk_memcg) > > + mem_cgroup_charge_skmem(newsk->sk_memcg, amt); > > + > > This looks confusing to me. > > sk_memory_allocated() is the total amount of memory used by all > sockets for a particular "struct proto", not just for that specific > socket. Oh, I see... > > Maybe I don't understand how this socket memcg stuff works, but it > seems like you should be looking instead at how much memory is > allocated to this specific socket. Yes, this is what I wanted to do originally. Let me find a proper way to do this. Thank you! Roman ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() 2018-01-25 17:03 ` David Miller 2018-01-25 17:15 ` Roman Gushchin @ 2018-01-31 21:54 ` Roman Gushchin 2018-02-01 15:16 ` David Miller 1 sibling, 1 reply; 10+ messages in thread From: Roman Gushchin @ 2018-01-31 21:54 UTC (permalink / raw) To: David Miller; +Cc: netdev, linux-kernel, kernel-team, edumazet, hannes, tj On Thu, Jan 25, 2018 at 12:03:02PM -0500, David Miller wrote: > From: Roman Gushchin <guro@fb.com> > Date: Thu, 25 Jan 2018 00:19:11 +0000 > > > @@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) > > spin_unlock_bh(&queue->fastopenq.lock); > > } > > mem_cgroup_sk_alloc(newsk); > > + amt = sk_memory_allocated(newsk); > > + if (amt && newsk->sk_memcg) > > + mem_cgroup_charge_skmem(newsk->sk_memcg, amt); > > + > > This looks confusing to me. > > sk_memory_allocated() is the total amount of memory used by all > sockets for a particular "struct proto", not just for that specific > socket. > > Maybe I don't understand how this socket memcg stuff works, but it > seems like you should be looking instead at how much memory is > allocated to this specific socket. So, the patch below takes the per-socket charge into account, and it _almost_ works: css leak is weaker by a couple orders of magnitude, but still exists. I believe, the problem is that we need additional synchronization for sk_memcg and sk_forward_alloc fields; and I'm really out of ideas how to do it without heavy artillery like introducing a new field for unaccounted memcg charge. As I can see, we check the sk_memcg field without socket lock; and we do set it from a concurrent context. Most likely, I do miss something... So I really start thinking that reverting 9f1c2674b328 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()") and fixing the original issue differently might be easier and a proper way to go. Does it makes sense? Thanks! -- diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 4ca46dc08e63..287de1501a30 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -476,6 +476,12 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) spin_unlock_bh(&queue->fastopenq.lock); } mem_cgroup_sk_alloc(newsk); + if (mem_cgroup_sockets_enabled && newsk->sk_memcg) { + int amt = sk_mem_pages(newsk->sk_forward_alloc); + if (amt > 0) + mem_cgroup_charge_skmem(newsk->sk_memcg, amt); + } + out: release_sock(sk); if (req) ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() 2018-01-31 21:54 ` Roman Gushchin @ 2018-02-01 15:16 ` David Miller 2018-02-01 20:22 ` Roman Gushchin 0 siblings, 1 reply; 10+ messages in thread From: David Miller @ 2018-02-01 15:16 UTC (permalink / raw) To: guro; +Cc: netdev, linux-kernel, kernel-team, edumazet, hannes, tj From: Roman Gushchin <guro@fb.com> Date: Wed, 31 Jan 2018 21:54:08 +0000 > So I really start thinking that reverting 9f1c2674b328 > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()") > and fixing the original issue differently might be easier > and a proper way to go. Does it makes sense? You'll need to work that out with Eric Dumazet who added the change in question which you think we should revert. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() 2018-02-01 15:16 ` David Miller @ 2018-02-01 20:22 ` Roman Gushchin 2018-02-01 21:17 ` Eric Dumazet 0 siblings, 1 reply; 10+ messages in thread From: Roman Gushchin @ 2018-02-01 20:22 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, netdev, linux-kernel, kernel-team, edumazet, hannes, tj On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote: > From: Roman Gushchin <guro@fb.com> > Date: Wed, 31 Jan 2018 21:54:08 +0000 > > > So I really start thinking that reverting 9f1c2674b328 > > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()") > > and fixing the original issue differently might be easier > > and a proper way to go. Does it makes sense? > > You'll need to work that out with Eric Dumazet who added the > change in question which you think we should revert. Eric, can you, please, provide some details about the use-after-free problem that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()" ? Do you know how to reproduce it? Deferring mem_cgroup_sk_alloc() breaks socket memory accounting and makes it much more fragile in general. So, I wonder, if there are solutions for the use-after-free problem. Thank you! Roman ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() 2018-02-01 20:22 ` Roman Gushchin @ 2018-02-01 21:17 ` Eric Dumazet 2018-02-01 22:55 ` Roman Gushchin 0 siblings, 1 reply; 10+ messages in thread From: Eric Dumazet @ 2018-02-01 21:17 UTC (permalink / raw) To: Roman Gushchin Cc: David S. Miller, netdev, LKML, kernel-team, Johannes Weiner, Tejun Heo On Thu, Feb 1, 2018 at 12:22 PM, Roman Gushchin <guro@fb.com> wrote: > On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote: >> From: Roman Gushchin <guro@fb.com> >> Date: Wed, 31 Jan 2018 21:54:08 +0000 >> >> > So I really start thinking that reverting 9f1c2674b328 >> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()") >> > and fixing the original issue differently might be easier >> > and a proper way to go. Does it makes sense? >> >> You'll need to work that out with Eric Dumazet who added the >> change in question which you think we should revert. > > Eric, > > can you, please, provide some details about the use-after-free problem > that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call > to mem_cgroup_sk_alloc()" ? Do you know how to reproduce it? > > Deferring mem_cgroup_sk_alloc() breaks socket memory accounting > and makes it much more fragile in general. So, I wonder, if there are > solutions for the use-after-free problem. > > Thank you! > > Roman Unfortunately bug is not public (Google-Bug-Id 67556600 for Googlers following this thread ) Our kernel has a debug feature on percpu_ref_get_many() which detects the typical use-after-free problem of doing atomic_long_add(nr, &ref->count); while ref->count is 0, or memory already freed. Bug was serious because css_put() will release the css a second time. Stack trace looked like : Oct 8 00:23:14 lphh23 kernel: [27239.568098] <IRQ> [<ffffffff909d2fb1>] dump_stack+0x4d/0x6c Oct 8 00:23:14 lphh23 kernel: [27239.568108] [<ffffffff906df6e3>] ? cgroup_get+0x43/0x50 Oct 8 00:23:14 lphh23 kernel: [27239.568114] [<ffffffff906f2f35>] warn_slowpath_common+0xac/0xc8 Oct 8 00:23:14 lphh23 kernel: [27239.568117] [<ffffffff906f2f6b>] warn_slowpath_null+0x1a/0x1c Oct 8 00:23:14 lphh23 kernel: [27239.568120] [<ffffffff906df6e3>] cgroup_get+0x43/0x50 Oct 8 00:23:14 lphh23 kernel: [27239.568123] [<ffffffff906e07a4>] cgroup_sk_alloc+0x64/0x90 Oct 8 00:23:14 lphh23 kernel: [27239.568128] [<ffffffff90bd6e91>] sk_clone_lock+0x2d1/0x400 Oct 8 00:23:14 lphh23 kernel: [27239.568134] [<ffffffff90bf2d56>] inet_csk_clone_lock+0x16/0x100 Oct 8 00:23:14 lphh23 kernel: [27239.568138] [<ffffffff90bff163>] tcp_create_openreq_child+0x23/0x600 Oct 8 00:23:14 lphh23 kernel: [27239.568143] [<ffffffff90c1ba8a>] tcp_v6_syn_recv_sock+0x26a/0x8f0 Oct 8 00:23:14 lphh23 kernel: [27239.568146] [<ffffffff90bffbfe>] tcp_check_req+0x1ce/0x440 Oct 8 00:23:14 lphh23 kernel: [27239.568152] [<ffffffff90c6556c>] tcp_v6_rcv+0x9cc/0x22a0 Oct 8 00:23:14 lphh23 kernel: [27239.568155] [<ffffffff90c67cc2>] ? ip6table_mangle_hook+0x42/0x190 Oct 8 00:23:14 lphh23 kernel: [27239.568158] [<ffffffff90c61e5b>] ip6_input+0x1ab/0x400 Oct 8 00:23:14 lphh23 kernel: [27239.568162] [<ffffffff90cd8c0d>] ? ip6_rcv_finish+0x93/0x93 Oct 8 00:23:14 lphh23 kernel: [27239.568165] [<ffffffff90c61a2d>] ipv6_rcv+0x32d/0x5b0 Oct 8 00:23:14 lphh23 kernel: [27239.568167] [<ffffffff90cd8b7a>] ? ip6_fragment+0x965/0x965 Oct 8 00:23:14 lphh23 kernel: [27239.568171] [<ffffffff90c2fd4c>] process_backlog+0x39c/0xc50 Oct 8 00:23:14 lphh23 kernel: [27239.568177] [<ffffffff907be695>] ? ktime_get+0x35/0xa0 Oct 8 00:23:14 lphh23 kernel: [27239.568180] [<ffffffff907bf681>] ? clockevents_program_event+0x81/0x1c0 Oct 8 00:23:14 lphh23 kernel: [27239.568183] [<ffffffff90c2e22e>] net_rx_action+0x10e/0x360 Oct 8 00:23:14 lphh23 kernel: [27239.568190] [<ffffffff906064f1>] __do_softirq+0x151/0x2f5 Oct 8 00:23:14 lphh23 kernel: [27239.568196] [<ffffffff90d101dc>] do_softirq_own_stack+0x1c/0x30 Oct 8 00:23:14 lphh23 kernel: [27239.568197] <EOI> [<ffffffff9079a12b>] __local_bh_enable_ip+0x6b/0xa0 Oct 8 00:23:14 lphh23 kernel: [27239.568203] [<ffffffff90c609c6>] ip6_output+0x326/0x1060 Oct 8 00:23:14 lphh23 kernel: [27239.568206] [<ffffffff90c67d3d>] ? ip6table_mangle_hook+0xbd/0x190 Oct 8 00:23:14 lphh23 kernel: [27239.568209] [<ffffffff90c5f780>] ? inet6_getname+0x130/0x130 Oct 8 00:23:14 lphh23 kernel: [27239.568212] [<ffffffff90c606a0>] ? ip6_finish_output+0xf20/0xf20 Oct 8 00:23:14 lphh23 kernel: [27239.568215] [<ffffffff90cd77a7>] ip6_xmit+0x52d/0x5b6 Oct 8 00:23:14 lphh23 kernel: [27239.568217] [<ffffffff90cd6ffe>] ? ip6_call_ra_chain+0xc9/0xc9 Oct 8 00:23:14 lphh23 kernel: [27239.568220] [<ffffffff90c4483d>] ? tcp_ack+0x60d/0x3290 Oct 8 00:23:14 lphh23 kernel: [27239.568223] [<ffffffff90c67521>] inet6_csk_xmit+0x181/0x2b0 Oct 8 00:23:14 lphh23 kernel: [27239.568225] [<ffffffff90c4bb55>] tcp_send_ack+0x6f5/0xdf0 Oct 8 00:23:14 lphh23 kernel: [27239.568229] [<ffffffff90bf8311>] tcp_rcv_state_process+0x8a1/0x2630 Oct 8 00:23:14 lphh23 kernel: [27239.568231] [<ffffffff90c1c24b>] tcp_v6_do_rcv+0x13b/0x340 Oct 8 00:23:14 lphh23 kernel: [27239.568234] [<ffffffff90c2286c>] release_sock+0xec/0x180 Oct 8 00:23:14 lphh23 kernel: [27239.568237] [<ffffffff90c08b6f>] __inet_stream_connect+0x1ef/0x2f0 Oct 8 00:23:14 lphh23 kernel: [27239.568240] [<ffffffff906d8710>] ? __wake_up_locked_key+0x70/0x70 Oct 8 00:23:14 lphh23 kernel: [27239.568243] [<ffffffff90c08cab>] inet_stream_connect+0x3b/0x60 Oct 8 00:23:14 lphh23 kernel: [27239.568249] [<ffffffff90bd5564>] SYSC_connect+0x84/0xc0 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() 2018-02-01 21:17 ` Eric Dumazet @ 2018-02-01 22:55 ` Roman Gushchin 2018-02-01 23:27 ` Eric Dumazet 0 siblings, 1 reply; 10+ messages in thread From: Roman Gushchin @ 2018-02-01 22:55 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, netdev, LKML, kernel-team, Johannes Weiner, Tejun Heo On Thu, Feb 01, 2018 at 01:17:56PM -0800, Eric Dumazet wrote: > On Thu, Feb 1, 2018 at 12:22 PM, Roman Gushchin <guro@fb.com> wrote: > > On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote: > >> From: Roman Gushchin <guro@fb.com> > >> Date: Wed, 31 Jan 2018 21:54:08 +0000 > >> > >> > So I really start thinking that reverting 9f1c2674b328 > >> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()") > >> > and fixing the original issue differently might be easier > >> > and a proper way to go. Does it makes sense? > >> > >> You'll need to work that out with Eric Dumazet who added the > >> change in question which you think we should revert. > > > > Eric, > > > > can you, please, provide some details about the use-after-free problem > > that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call > > to mem_cgroup_sk_alloc()" ? Do you know how to reproduce it? > > > > Deferring mem_cgroup_sk_alloc() breaks socket memory accounting > > and makes it much more fragile in general. So, I wonder, if there are > > solutions for the use-after-free problem. > > > > Thank you! > > > > Roman > > Unfortunately bug is not public (Google-Bug-Id 67556600 for Googlers > following this thread ) > > Our kernel has a debug feature on percpu_ref_get_many() which detects > the typical use-after-free problem of > doing atomic_long_add(nr, &ref->count); while ref->count is 0, or > memory already freed. > > Bug was serious because css_put() will release the css a second time. > > Stack trace looked like : > > Oct 8 00:23:14 lphh23 kernel: [27239.568098] <IRQ> > [<ffffffff909d2fb1>] dump_stack+0x4d/0x6c > Oct 8 00:23:14 lphh23 kernel: [27239.568108] [<ffffffff906df6e3>] ? > cgroup_get+0x43/0x50 > Oct 8 00:23:14 lphh23 kernel: [27239.568114] [<ffffffff906f2f35>] > warn_slowpath_common+0xac/0xc8 > Oct 8 00:23:14 lphh23 kernel: [27239.568117] [<ffffffff906f2f6b>] > warn_slowpath_null+0x1a/0x1c > Oct 8 00:23:14 lphh23 kernel: [27239.568120] [<ffffffff906df6e3>] > cgroup_get+0x43/0x50 > Oct 8 00:23:14 lphh23 kernel: [27239.568123] [<ffffffff906e07a4>] > cgroup_sk_alloc+0x64/0x90 Hm, that looks strange... It's cgroup_sk_alloc(), not mem_cgroup_sk_alloc(), which was removed by 9f1c2674b328. I thought, that it's css_get() in mem_cgroup_sk_alloc(), which you removed, but the stacktrace you've posted is different. void mem_cgroup_sk_alloc(struct sock *sk) { /* * Socket cloning can throw us here with sk_memcg already * filled. It won't however, necessarily happen from * process context. So the test for root memcg given * the current task's memcg won't help us in this case. * * Respecting the original socket's memcg is a better * decision in this case. */ if (sk->sk_memcg) { BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); >>> css_get(&sk->sk_memcg->css); return; } Is it possible to reproduce the issue on an upstream kernel? Any ideas of what can trigger it? Btw, with the following patch applied (below) and cgroup v2 enabled, the issue, which I'm talking about, can be reproduced in seconds after reboot by doing almost any network activity. Just sshing to a machine is enough. The corresponding warning will be printed to dmesg. What is a proper way to fix the socket memory accounting in this case, what do you think? Thank you! Roman -- diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c51c589..55fb890 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -276,6 +276,8 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; + atomic_t tcpcnt; + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 19eea69..c69ff04 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5623,6 +5623,8 @@ static int memory_stat_show(struct seq_file *m, void *v) seq_printf(m, "workingset_nodereclaim %lu\n", stat[WORKINGSET_NODERECLAIM]); + seq_printf(m, "tcpcnt %d\n", atomic_read(&memcg->tcpcnt)); + return 0; } @@ -6139,6 +6141,8 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { gfp_t gfp_mask = GFP_KERNEL; + atomic_add(nr_pages, &memcg->tcpcnt); + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { struct page_counter *fail; @@ -6171,6 +6175,11 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) */ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) { + int v = atomic_sub_return(nr_pages, &memcg->tcpcnt); + if (v < 0) { + pr_info("@@@ %p %d \n", memcg, v); + } + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { page_counter_uncharge(&memcg->tcpmem, nr_pages); return; ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() 2018-02-01 22:55 ` Roman Gushchin @ 2018-02-01 23:27 ` Eric Dumazet 2018-02-01 23:42 ` Roman Gushchin 0 siblings, 1 reply; 10+ messages in thread From: Eric Dumazet @ 2018-02-01 23:27 UTC (permalink / raw) To: Roman Gushchin Cc: David S. Miller, netdev, LKML, kernel-team, Johannes Weiner, Tejun Heo Well, this memcg stuff is so confusing. My recollection is that we had : https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=d979a39d7242e0601bf9b60e89628fb8ac577179 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=75cb070960ade40fba5de32138390f3c85c90941 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=c0576e3975084d4699b7bfef578613fb8e1144f6 And commit a590b90d472f2c176c140576ee3ab44df7f67839 as well Honestly bug was closed months ago for us, based on stack traces on the wild. No C repro or whatever, but reproducing it would be a matter of having a TCP listener constantly doing a socket()/setsockopt(REUSEADDR)/bind()/listen()/close() in a loop, while connections are attempted to the listening port. On Thu, Feb 1, 2018 at 2:55 PM, Roman Gushchin <guro@fb.com> wrote: > On Thu, Feb 01, 2018 at 01:17:56PM -0800, Eric Dumazet wrote: >> On Thu, Feb 1, 2018 at 12:22 PM, Roman Gushchin <guro@fb.com> wrote: >> > On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote: >> >> From: Roman Gushchin <guro@fb.com> >> >> Date: Wed, 31 Jan 2018 21:54:08 +0000 >> >> >> >> > So I really start thinking that reverting 9f1c2674b328 >> >> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()") >> >> > and fixing the original issue differently might be easier >> >> > and a proper way to go. Does it makes sense? >> >> >> >> You'll need to work that out with Eric Dumazet who added the >> >> change in question which you think we should revert. >> > >> > Eric, >> > >> > can you, please, provide some details about the use-after-free problem >> > that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call >> > to mem_cgroup_sk_alloc()" ? Do you know how to reproduce it? >> > >> > Deferring mem_cgroup_sk_alloc() breaks socket memory accounting >> > and makes it much more fragile in general. So, I wonder, if there are >> > solutions for the use-after-free problem. >> > >> > Thank you! >> > >> > Roman >> >> Unfortunately bug is not public (Google-Bug-Id 67556600 for Googlers >> following this thread ) >> >> Our kernel has a debug feature on percpu_ref_get_many() which detects >> the typical use-after-free problem of >> doing atomic_long_add(nr, &ref->count); while ref->count is 0, or >> memory already freed. >> >> Bug was serious because css_put() will release the css a second time. >> >> Stack trace looked like : >> >> Oct 8 00:23:14 lphh23 kernel: [27239.568098] <IRQ> >> [<ffffffff909d2fb1>] dump_stack+0x4d/0x6c >> Oct 8 00:23:14 lphh23 kernel: [27239.568108] [<ffffffff906df6e3>] ? >> cgroup_get+0x43/0x50 >> Oct 8 00:23:14 lphh23 kernel: [27239.568114] [<ffffffff906f2f35>] >> warn_slowpath_common+0xac/0xc8 >> Oct 8 00:23:14 lphh23 kernel: [27239.568117] [<ffffffff906f2f6b>] >> warn_slowpath_null+0x1a/0x1c >> Oct 8 00:23:14 lphh23 kernel: [27239.568120] [<ffffffff906df6e3>] >> cgroup_get+0x43/0x50 >> Oct 8 00:23:14 lphh23 kernel: [27239.568123] [<ffffffff906e07a4>] >> cgroup_sk_alloc+0x64/0x90 > > Hm, that looks strange... It's cgroup_sk_alloc(), > not mem_cgroup_sk_alloc(), which was removed by 9f1c2674b328. > > I thought, that it's css_get() in mem_cgroup_sk_alloc(), which > you removed, but the stacktrace you've posted is different. > > void mem_cgroup_sk_alloc(struct sock *sk) { > /* > * Socket cloning can throw us here with sk_memcg already > * filled. It won't however, necessarily happen from > * process context. So the test for root memcg given > * the current task's memcg won't help us in this case. > * > * Respecting the original socket's memcg is a better > * decision in this case. > */ > if (sk->sk_memcg) { > BUG_ON(mem_cgroup_is_root(sk->sk_memcg)); >>>> css_get(&sk->sk_memcg->css); > return; > } > > Is it possible to reproduce the issue on an upstream kernel? > Any ideas of what can trigger it? > > Btw, with the following patch applied (below) and cgroup v2 enabled, > the issue, which I'm talking about, can be reproduced in seconds after reboot > by doing almost any network activity. Just sshing to a machine is enough. > The corresponding warning will be printed to dmesg. > > What is a proper way to fix the socket memory accounting in this case, > what do you think? > > Thank you! > > Roman > > -- > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index c51c589..55fb890 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -276,6 +276,8 @@ struct mem_cgroup { > struct list_head event_list; > spinlock_t event_list_lock; > > + atomic_t tcpcnt; > + > struct mem_cgroup_per_node *nodeinfo[0]; > /* WARNING: nodeinfo must be the last member here */ > }; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 19eea69..c69ff04 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -5623,6 +5623,8 @@ static int memory_stat_show(struct seq_file *m, void *v) > seq_printf(m, "workingset_nodereclaim %lu\n", > stat[WORKINGSET_NODERECLAIM]); > > + seq_printf(m, "tcpcnt %d\n", atomic_read(&memcg->tcpcnt)); > + > return 0; > } > > @@ -6139,6 +6141,8 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > { > gfp_t gfp_mask = GFP_KERNEL; > > + atomic_add(nr_pages, &memcg->tcpcnt); > + > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { > struct page_counter *fail; > > @@ -6171,6 +6175,11 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > */ > void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages) > { > + int v = atomic_sub_return(nr_pages, &memcg->tcpcnt); > + if (v < 0) { > + pr_info("@@@ %p %d \n", memcg, v); > + } > + > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) { > page_counter_uncharge(&memcg->tcpmem, nr_pages); > return; ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() 2018-02-01 23:27 ` Eric Dumazet @ 2018-02-01 23:42 ` Roman Gushchin 0 siblings, 0 replies; 10+ messages in thread From: Roman Gushchin @ 2018-02-01 23:42 UTC (permalink / raw) To: Eric Dumazet Cc: David S. Miller, netdev, LKML, kernel-team, Johannes Weiner, Tejun Heo On Thu, Feb 01, 2018 at 03:27:14PM -0800, Eric Dumazet wrote: > Well, this memcg stuff is so confusing. > > My recollection is that we had : > > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=d979a39d7242e0601bf9b60e89628fb8ac577179 > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=75cb070960ade40fba5de32138390f3c85c90941 > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=c0576e3975084d4699b7bfef578613fb8e1144f6 > > And commit a590b90d472f2c176c140576ee3ab44df7f67839 as well > > Honestly bug was closed months ago for us, based on stack traces on the wild. > > No C repro or whatever, but reproducing it would be a matter of > having a TCP listener constantly doing a > socket()/setsockopt(REUSEADDR)/bind()/listen()/close() in a loop, > while connections are attempted to the listening port. Oh, I see... Then I think that we should return memcg_sk_alloc() back to the bh context, where cgroup_sk_alloc() is, and repeat all the tricks to avoid copying dead cgroups/memcg pointers. Do you agree? I'll try to master a patch and reproduce the issue. Thanks! ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2018-02-01 23:42 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-01-25 0:19 [PATCH net] net: memcontrol: charge allocated memory after mem_cgroup_sk_alloc() Roman Gushchin 2018-01-25 17:03 ` David Miller 2018-01-25 17:15 ` Roman Gushchin 2018-01-31 21:54 ` Roman Gushchin 2018-02-01 15:16 ` David Miller 2018-02-01 20:22 ` Roman Gushchin 2018-02-01 21:17 ` Eric Dumazet 2018-02-01 22:55 ` Roman Gushchin 2018-02-01 23:27 ` Eric Dumazet 2018-02-01 23:42 ` Roman Gushchin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).