From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Martin KaFai Lau <kafai@fb.com>
Cc: netdev@vger.kernel.org, David Miller <davem@davemloft.net>,
Alexei Starovoitov <ast@fb.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH v2 net-next 1/6] bpf: LRU List
Date: Mon, 14 Nov 2016 17:50:37 -0800 [thread overview]
Message-ID: <20161115015035.GC8080@ast-mbp.thefacebook.com> (raw)
In-Reply-To: <1478890511-1346984-2-git-send-email-kafai@fb.com>
On Fri, Nov 11, 2016 at 10:55:06AM -0800, Martin KaFai Lau wrote:
> Introduce bpf_lru_list which will provide LRU capability to
> the bpf_htab in the later patch.
>
> * General Thoughts:
> 1. Target use case. Read is more often than update.
> (i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
> If bpf_prog does a bpf_lookup_elem() first and then an in-place
> update, it still counts as a read operation to the LRU list concern.
> 2. It may be useful to think of it as a LRU cache
> 3. Optimize the read case
> 3.1 No lock in read case
> 3.2 The LRU maintenance is only done during bpf_update_elem()
> 4. If there is a percpu LRU list, it will lose the system-wise LRU
> property. A completely isolated percpu LRU list has the best
> performance but the memory utilization is not ideal considering
> the work load may be imbalance.
> 5. Hence, this patch starts the LRU implementation with a global LRU
> list with batched operations before accessing the global LRU list.
> As a LRU cache, #read >> #update/#insert operations, it will work well.
> 6. There is a local list (for each cpu) which is named
> 'struct bpf_lru_locallist'. This local list is not used to sort
> the LRU property. Instead, the local list is to batch enough
> operations before acquiring the lock of the global LRU list. More
> details on this later.
> 7. In the later patch, it allows a percpu LRU list by specifying a
> map-attribute for scalability reason and for use cases that need to
> prepare for the worst (and pathological) case like DoS attack.
> The percpu LRU list is completely isolated from each other and the
> LRU nodes (including free nodes) cannot be moved across the list. The
> following description is for the global LRU list but mostly applicable
> to the percpu LRU list also.
>
> * Global LRU List:
> 1. It has three sub-lists: active-list, inactive-list and free-list.
> 2. The two list idea, active and inactive, is borrowed from the
> page cache.
> 3. All nodes are pre-allocated and all sit at the free-list (of the
> global LRU list) at the beginning. The pre-allocation reasoning
> is similar to the existing BPF_MAP_TYPE_HASH. However,
> opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
> the LRU map.
>
> * Active/Inactive List (of the global LRU list):
> 1. The active list, as its name says it, maintains the active set of
> the nodes. We can think of it as the working set or more frequently
> accessed nodes. The access frequency is approximated by a ref-bit.
> The ref-bit is set during the bpf_lookup_elem().
> 2. The inactive list, as its name also says it, maintains a less
> active set of nodes. They are the candidates to be removed
> from the bpf_htab when we are running out of free nodes.
> 3. The ordering of these two lists is acting as a rough clock.
> The tail of the inactive list is the older nodes and
> should be released first if the bpf_htab needs free element.
>
> * Rotating the Active/Inactive List (of the global LRU list):
> 1. It is the basic operation to maintain the LRU property of
> the global list.
> 2. The active list is only rotated when the inactive list is running
> low. This idea is similar to the current page cache.
> Inactive running low is currently defined as
> "# of inactive < # of active".
> 3. The active list rotation always starts from the tail. It moves
> node without ref-bit set to the head of the inactive list.
> It moves node with ref-bit set back to the head of the active
> list and then clears its ref-bit.
> 4. The inactive rotation is pretty simply.
> It walks the inactive list and moves the nodes back to the head of
> active list if its ref-bit is set. The ref-bit is cleared after moving
> to the active list.
> If the node does not have ref-bit set, it just leave it as it is
> because it is already in the inactive list.
>
> * Shrinking the Inactive List (of the global LRU list):
> 1. Shrinking is the operation to get free nodes when the bpf_htab is
> full.
> 2. It usually only shrinks the inactive list to get free nodes.
> 3. During shrinking, it will walk the inactive list from the tail,
> delete the nodes without ref-bit set from bpf_htab.
> 4. If no free node found after step (3), it will forcefully get
> one node from the tail of inactive or active list. Forcefully is
> in the sense that it ignores the ref-bit.
>
> * Local List:
> 1. Each CPU has a 'struct bpf_lru_locallist'. The purpose is to
> batch enough operations before acquiring the lock of the
> global LRU.
> 2. A local list has two sub-lists, free-list and pending-list.
> 3. During bpf_update_elem(), it will try to get from the free-list
> of (the current CPU local list).
> 4. If the local free-list is empty, it will acquire from the
> global LRU list. The global LRU list can either satisfy it
> by its global free-list or by shrinking the global inactive
> list. Since we have acquired the global LRU list lock,
> it will try to get at most LOCAL_FREE_TARGET elements
> to the local free list.
> 5. When a new element is added to the bpf_htab, it will
> first sit at the pending-list (of the local list) first.
> The pending-list will be flushed to the global LRU list
> when it needs to acquire free nodes from the global list
> next time.
>
> * Lock Consideration:
> The LRU list has a lock (lru_lock). Each bucket of htab has a
> lock (buck_lock). If both locks need to be acquired together,
> the lock order is always lru_lock -> buck_lock and this only
> happens in the bpf_lru_list.c logic.
>
> In hashtab.c, both locks are not acquired together (i.e. one
> lock is always released first before acquiring another lock).
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
thanks for detailed commit log.
I think it's worth adding it to bpf_lru_list.c as design documentation.
Acked-by: Alexei Starovoitov <ast@kernel.org>
next prev parent reply other threads:[~2016-11-15 1:50 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-11 18:55 [PATCH v2 net-next 0/6] bpf: LRU map Martin KaFai Lau
2016-11-11 18:55 ` [PATCH v2 net-next 1/6] bpf: LRU List Martin KaFai Lau
2016-11-15 1:50 ` Alexei Starovoitov [this message]
2016-11-11 18:55 ` [PATCH v2 net-next 2/6] bpf: Add percpu LRU list Martin KaFai Lau
2016-11-15 1:51 ` Alexei Starovoitov
2016-11-11 18:55 ` [PATCH v2 net-next 3/6] bpf: Refactor codes handling percpu map Martin KaFai Lau
2016-11-14 18:43 ` Alexei Starovoitov
2016-11-11 18:55 ` [PATCH v2 net-next 4/6] bpf: Add BPF_MAP_TYPE_LRU_HASH Martin KaFai Lau
2016-11-14 20:52 ` Alexei Starovoitov
2016-11-11 18:55 ` [PATCH v2 net-next 5/6] bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH Martin KaFai Lau
2016-11-15 1:34 ` Alexei Starovoitov
2016-11-11 18:55 ` [PATCH v2 net-next 6/6] bpf: Add tests for the LRU bpf_htab Martin KaFai Lau
2016-11-15 1:43 ` Alexei Starovoitov
2016-11-14 18:19 ` [PATCH v2 net-next 0/6] bpf: LRU map David Miller
2016-11-15 16:51 ` David Miller
2016-11-15 16:59 ` David Miller
2016-11-15 18:12 ` Martin KaFai Lau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20161115015035.GC8080@ast-mbp.thefacebook.com \
--to=alexei.starovoitov@gmail.com \
--cc=ast@fb.com \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=kafai@fb.com \
--cc=kernel-team@fb.com \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.