netdev.vger.kernel.org archive mirror
From: Cong Wang <xiyou.wangcong@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	Cong Wang <cong.wang@bytedance.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Dongdong Wang <wangdongdong.6@bytedance.com>
Subject: Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map
Date: Wed, 16 Dec 2020 21:06:43 -0800
Message-ID: <CAM_iQpU4ULPqo60o7CuZqqqdrybkqNd5GNufep57UhBpmMGuPg@mail.gmail.com>
In-Reply-To: <CAADnVQL70bVdms6_D_ep1L2v-OcgXu-9KTtLULQdfCMftLhENQ@mail.gmail.com>

On Tue, Dec 15, 2020 at 6:35 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 15, 2020 at 6:10 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Sure, people also implement CT on native hash map too and timeout
> > with user-space timers. ;)
>
> exactly. what's wrong with that?
> Perfectly fine way to do CT.

Seriously? When we have 8 million entries in a hash map, it is
definitely wrong to purge entries one by one from user-space.

In case you don't believe me, take a look at what cilium CT GC does,
which is precisely to expire entries one by one in user-space:

https://github.com/cilium/cilium/blob/0f57292c0037ee23ba1ca2f9abb113f36a664645/pkg/bpf/map_linux.go#L728
https://github.com/cilium/cilium/blob/master/pkg/maps/ctmap/ctmap.go#L398

and of course what people complained about:

https://github.com/cilium/cilium/issues/5048
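
To make the cost concrete, here is a toy Python model of entry-by-entry
GC. It is only a sketch under stated assumptions: it is not the cilium
code and not a real BPF API. ToyMap and gc_one_by_one are hypothetical
names, and each ToyMap method call stands in for one bpf(2) syscall
(BPF_MAP_GET_NEXT_KEY, BPF_MAP_LOOKUP_ELEM, BPF_MAP_DELETE_ELEM).

```python
class ToyMap:
    """Hypothetical stand-in for a BPF hash map seen from user space.

    Every method call is counted as one syscall. Values are expiry
    timestamps in seconds (an assumption made for this illustration).
    """

    def __init__(self, expiry_by_key):
        self.entries = dict(expiry_by_key)
        # Fixed iteration order, kept even after deletes, so that
        # get_next_key() can resume from any key it has handed out.
        self.order = list(self.entries)
        self.pos = {k: i for i, k in enumerate(self.order)}
        self.syscalls = 0

    def get_next_key(self, key=None):
        """Models BPF_MAP_GET_NEXT_KEY; returns None at the end."""
        self.syscalls += 1
        i = 0 if key is None else self.pos[key] + 1
        while i < len(self.order):
            if self.order[i] in self.entries:
                return self.order[i]
            i += 1
        return None

    def lookup(self, key):
        """Models BPF_MAP_LOOKUP_ELEM."""
        self.syscalls += 1
        return self.entries.get(key)

    def delete(self, key):
        """Models BPF_MAP_DELETE_ELEM."""
        self.syscalls += 1
        self.entries.pop(key, None)


def gc_one_by_one(m, now):
    """Scan the whole map and delete entries whose expiry is <= now."""
    purged = 0
    key = m.get_next_key()
    while key is not None:
        nxt = m.get_next_key(key)  # fetch the successor before deleting
        expiry = m.lookup(key)
        if expiry is not None and expiry <= now:
            m.delete(key)
            purged += 1
        key = nxt
    return purged


if __name__ == "__main__":
    n = 1000
    m = ToyMap({i: i for i in range(n)})
    purged = gc_one_by_one(m, now=n // 2 - 1)
    print(f"entries={n} purged={purged} syscalls={m.syscalls}")
```

In this model a full pass over n entries costs 2n + 1 calls plus one
more per purged entry, whether or not anything has expired. The real
numbers depend on the actual implementation, but the linear scan from
user-space is the point.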

>
> > > Anything extra can be added on top from user space
> > > which can easily copy with 1 sec granularity.
> >
> > The problem is never about granularity, it is about how efficient we can
> > GC. User-space has to scan the whole table one by one, while the kernel
> > can just do this behind the scene with a much lower overhead.
> >
> > Let's say we arm a timer for each entry in user-space, it requires a syscall
> > and locking buckets each time for each entry. Kernel could do it without
> > any additional syscall and batching. Like I said above, we could have
> > millions of entries, so the overhead would be big in this scenario.
>
> and the user space can pick any other implementation instead
> of trivial entry by entry gc with timer.

Unless they don't have to, right? With a timeout implementation in the
kernel, user space does not need to reinvent the wheel.


>
> > > Say the kernel does GC and deletes htab entries.
> > > How user space will know that it's gone? There would need to be
> >
> > By a lookup.
> >
> > > an event sent to user space when entry is being deleted by the kernel.
> > > But then such event will be racy. Instead when timers and expirations
> > > are done by user space everything is in sync.
> >
> > Why there has to be an event?
>
> because when any production worthy implementation moves
> past the prototype stage there is something that user space needs to keep
> as well. Sometimes the bpf map in the kernel is alone.
> But a lot of times there is a user space mirror of the map in c++ or golang
> with the same key where user space keeps extra data.

So... what event does the LRU map send when it evicts another entry
to make room because the map is full?

Thanks.
