From: Stanislav Fomichev <sdf@fomichev.me>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: Stanislav Fomichev <sdf@google.com>,
	netdev@vger.kernel.org, bpf@vger.kernel.org, davem@davemloft.net,
	ast@kernel.org, Martin KaFai Lau <kafai@fb.com>,
	Yonghong Song <yhs@fb.com>
Subject: Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()
Date: Tue, 13 Aug 2019 14:28:47 -0700
Message-ID: <20190813212847.GI2820@mini-arch>
In-Reply-To: <2d24378a-73f4-bfa0-dc99-4a0ed761c797@iogearbox.net>
On 08/13, Daniel Borkmann wrote:
> On 8/12/19 7:52 PM, Stanislav Fomichev wrote:
> > On 08/12, Daniel Borkmann wrote:
> > > On 8/9/19 6:10 PM, Stanislav Fomichev wrote:
> > > > Add new helper bpf_sk_storage_clone which optionally clones sk storage
> > > > and call it from sk_clone_lock.
> > > >
> > > > Cc: Martin KaFai Lau <kafai@fb.com>
> > > > Cc: Yonghong Song <yhs@fb.com>
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > [...]
> > > > +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
> > > > +{
> > > > +	struct bpf_sk_storage *new_sk_storage = NULL;
> > > > +	struct bpf_sk_storage *sk_storage;
> > > > +	struct bpf_sk_storage_elem *selem;
> > > > +	int ret;
> > > > +
> > > > +	RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
> > > > +
> > > > +	rcu_read_lock();
> > > > +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
> > > > +
> > > > +	if (!sk_storage || hlist_empty(&sk_storage->list))
> > > > +		goto out;
> > > > +
> > > > +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
> > > > +		struct bpf_sk_storage_elem *copy_selem;
> > > > +		struct bpf_sk_storage_map *smap;
> > > > +		struct bpf_map *map;
> > > > +		int refold;
> > > > +
> > > > +		smap = rcu_dereference(SDATA(selem)->smap);
> > > > +		if (!(smap->map.map_flags & BPF_F_CLONE))
> > > > +			continue;
> > > > +
> > > > +		map = bpf_map_inc_not_zero(&smap->map, false);
> > > > +		if (IS_ERR(map))
> > > > +			continue;
> > > > +
> > > > +		copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
> > > > +		if (!copy_selem) {
> > > > +			ret = -ENOMEM;
> > > > +			bpf_map_put(map);
> > > > +			goto err;
> > > > +		}
> > > > +
> > > > +		if (new_sk_storage) {
> > > > +			selem_link_map(smap, copy_selem);
> > > > +			__selem_link_sk(new_sk_storage, copy_selem);
> > > > +		} else {
> > > > +			ret = sk_storage_alloc(newsk, smap, copy_selem);
> > > > +			if (ret) {
> > > > +				kfree(copy_selem);
> > > > +				atomic_sub(smap->elem_size,
> > > > +					   &newsk->sk_omem_alloc);
> > > > +				bpf_map_put(map);
> > > > +				goto err;
> > > > +			}
> > > > +
> > > > +			new_sk_storage = rcu_dereference(copy_selem->sk_storage);
> > > > +		}
> > > > +		bpf_map_put(map);
> > >
> > > The map get/put combination /under/ RCU read lock seems a bit odd to me, could
> > > you exactly describe the race that this would be preventing?
> > There is a race between sk storage release and sk storage clone.
> > bpf_sk_storage_map_free uses synchronize_rcu to wait for all existing
> > users to finish and the new ones are prevented via map's refcnt being
> > zero; we need to do something like that for the clone.
> > Martin suggested to use bpf_map_inc_not_zero/bpf_map_put.
> > If I read everything correctly, I think without map_inc/map_put we
> > get the following race:
> >
> > CPU0                               CPU1
> >
> > bpf_map_put
> >   bpf_sk_storage_map_free(smap)
> >     synchronize_rcu
> >
> >     // no more users via bpf or
> >     // syscall, but clone
> >     // can still happen
> >
> >     for each (bucket)
> >       selem_unlink
> >         selem_unlink_map(smap)
> >
> >         // adding anything at
> >         // this point to the
> >         // bucket will leak
> >
> >                                    rcu_read_lock
> >                                    tcp_v4_rcv
> >                                      tcp_v4_do_rcv
> >                                        // sk is lockless TCP_LISTEN
> >                                        tcp_v4_cookie_check
> >                                          tcp_v4_syn_recv_sock
> >                                            bpf_sk_storage_clone
> >                                              rcu_dereference(sk->sk_bpf_storage)
> >                                              selem_link_map(smap, copy)
> >                                              // adding new element to the
> >                                              // map -> leak
> >                                    rcu_read_unlock
> >
> >       selem_unlink_sk
> >         sk->sk_bpf_storage = NULL
> >
> >     synchronize_rcu
> >
>
> Makes sense, thanks for clarifying. Perhaps a small comment on top of
> the bpf_map_inc_not_zero() would be great as well, so it's immediately
> clear also from this location when reading the code why this is done.
Sure, no problem. I'll add a comment there similar to the one I have
before synchronize_rcu in bpf_sk_storage_map_free.
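
To make the inc-not-zero idea concrete, here is a minimal userspace
sketch using C11 atomics. It is only a model, not the kernel's
bpf_map_inc_not_zero/bpf_map_put; struct model_map, model_inc_not_zero
and model_put are hypothetical names made up for illustration. The point
is that a clone-like user can take a reference only while the count is
still non-zero, so once teardown has started it backs off instead of
linking a new element:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted map; not struct bpf_map. */
struct model_map {
	atomic_int refcnt;
};

/*
 * Take a reference only if the count is still non-zero. A zero count
 * means the last user is gone and the object is being torn down, so a
 * new user must not touch it.
 */
static bool model_inc_not_zero(struct model_map *map)
{
	int old = atomic_load(&map->refcnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&map->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool model_put(struct model_map *map)
{
	return atomic_fetch_sub(&map->refcnt, 1) == 1;
}
```

In this model, the clone loop would call model_inc_not_zero() before
linking a copied element and model_put() afterwards, and skip the map
when the call returns false.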
> Thanks,
> Daniel