From: John Fastabend <john.fastabend@gmail.com>
To: Joe Stringer <joe@isovalent.com>, bpf@vger.kernel.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
ast@kernel.org, corbet@lwn.net, martin.lau@linux.dev,
bagasdotme@gmail.com, maxtram95@gmail.com
Subject: RE: [PATCH bpf-next v3] docs/bpf: Add LRU internals description and graph
Date: Mon, 13 Mar 2023 22:21:36 -0700 [thread overview]
Message-ID: <6410046013834_42581208fd@john.notmuch> (raw)
In-Reply-To: <20230312190600.324573-1-joe@isovalent.com>
Joe Stringer wrote:
> Extend the bpf hashmap docs to include a brief description of the
> internals of the LRU map type (setting appropriate API expectations),
> including the original commit message from Martin and a variant on the
> graph that I had presented during my Linux Plumbers Conference 2022 talk
> on "Pressure feedback for LRU map types"[0].
>
> The node names in the dot file correspond roughly to the functions where
> the logic for those decisions or steps is defined, to help curious
> developers to cross-reference and update this logic if the details of
> the LRU implementation ever differ from this description.
>
> [0]: https://lpc.events/event/16/contributions/1368/
>
> Signed-off-by: Joe Stringer <joe@isovalent.com>
Thanks couple nits below
> ---
> v3: Use standard table syntax
> Replace inline commit message with reference to commit
> Fix incorrect Y/N label for common LRU check
> Rename some dotfile variables to reduce confusion between cases
> Minor wording touchups
> v2: Fix issue that caused initial email submission to fail
> ---
> Documentation/bpf/map_hash.rst | 62 ++++++++
> Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
> 2 files changed, 228 insertions(+)
> create mode 100644 Documentation/bpf/map_lru_hash_update.dot
> diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
> index 8669426264c6..61602ce26561 100644
> --- a/Documentation/bpf/map_hash.rst
> +++ b/Documentation/bpf/map_hash.rst
> @@ -1,5 +1,6 @@
> .. SPDX-License-Identifier: GPL-2.0-only
> .. Copyright (C) 2022 Red Hat, Inc.
> +.. Copyright (C) 2022-2023 Isovalent, Inc.
>
> ===============================================
> BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
> @@ -206,3 +207,64 @@ Userspace walking the map elements from the map declared above:
> cur_key = &next_key;
> }
> }
> +
> +Internals
> +=========
> +
> +This section of the document is targeted at Linux developers and describes
> +aspects of the map implementations that are not considered stable ABI. The
> +following details are subject to change in future versions of the kernel.
> +
> +``BPF_MAP_TYPE_LRU_HASH`` and variants
> +--------------------------------------
> +
> +An LRU hashmap type consists of two properties: Firstly, it is a hash map and
> +hence is indexable by key for constant time lookups. Secondly, when at map
> +capacity, map updates will trigger eviction of old entries based on the age of
> +the elements in a set of lists. Each of these properties may be either global
> +or per-CPU, depending on the map type and flags used to create the map:
> +
> ++------------------------+---------------------------+----------------------------------+
> +| | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
> ++========================+===========================+==================================+
> +| ``BPF_NO_COMMON_LRU`` | Per-CPU LRU, global map | Per-CPU LRU, per-cpu map |
> ++------------------------+---------------------------+----------------------------------+
> +| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map | Global LRU, per-cpu map |
> ++------------------------+---------------------------+----------------------------------+
Above all seems API to me. Maybe move the statement about not considered stable
ABI down here? Something like,
"
The internal details of which entry is evicted and acquiring a new entry
are not considered stable and may change in the future. But the current
impelementation is as follows.
"
Or something like that?
> +
> +Notably, there are various steps that the update algorithm attempts in order to
> +enforce the LRU property which have increasing impacts on other CPUs involved
> +in the following operation attempts:
> +
> +- Attempt to use CPU-local state to batch operations
> +- Attempt to fetch free nodes from global lists
> +- Attempt to pull any node from a global list and remove it from the hashmap
> +- Attempt to pull any node from any CPU's list and remove it from the hashmap
> +
> +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> +may fail to insert the entry into the map if other CPUs are heavily contending
> +on the global hashmap lock.
Similarly this is ABI correct? Probably we can also specify the error code?
Assuming it is just EBUSY or EAGAIN?
> +
> +This algorithm is described visually in the following diagram. See the
> +description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
> +the corresponding operations:
> +
> +.. kernel-figure:: map_lru_hash_update.dot
> + :alt: Diagram outlining the LRU eviction steps taken during map update
> +
> + LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
> + variants
> +
> +Map updates start from the oval in the top right "begin ``bpf_map_update()``"
> +and progress through the graph towards the bottom where the result may be
> +either a successful update or a failure with various error codes. The key in
> +the top right provides indicators for which locks may be involved in specific
> +operations. This is intended as a visual hint for reasoning about how map
> +contention may impact update operations, though the map type and flags may
> +impact the actual contention on those locks, based on the logic described in
> +the table above. For instance, if the map is created with type
> +``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_NO_COMMON_LRU`` then all map
> +properties would be per-cpu.
> +
> +The dot file source for the above figure uses internal kernel function names
> +for the node names in order to make the corresponding logic easier to find.
> diff --git a/Documentation/bpf/map_lru_hash_update.dot b/Documentation/bpf/map_lru_hash_update.dot
> new file mode 100644
> index 000000000000..3a44ebec501e
> --- /dev/null
> +++ b/Documentation/bpf/map_lru_hash_update.dot
> @@ -0,0 +1,166 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +// Copyright (C) 2022-2023 Isovalent, Inc.
> +digraph {
> + node [colorscheme=accent4,style=filled] # Apply colorscheme to all nodes
> + graph [splines=ortho, nodesep=1]
> +
> + subgraph cluster_key {
> + label = "Key\n(locks held during operation)";
> + rankdir = TB;
> +
> + remote_lock [shape=rectangle,fillcolor=4,label="remote CPU LRU lock"]
> + hash_lock [shape=rectangle,fillcolor=3,label="hashtab lock"]
> + lru_lock [shape=rectangle,fillcolor=2,label="LRU lock"]
> + local_lock [shape=rectangle,fillcolor=1,label="local CPU LRU lock"]
> + no_lock [shape=rectangle,label="no locks held"]
> + }
> +
> + begin [shape=oval,label="begin\nbpf_map_update()"]
> +
> + // Nodes below with an 'fn_' prefix are roughly labeled by the C function
> + // names that initiate the corresponding logic in kernel/bpf/bpf_lru_list.c.
> + // Number suffixes and errno suffixes handle subsections of the corresponding
> + // logic in the function as of the writing of this dot.
> +
> + // The following corresponds to bpf_lru_pop_free() for common LRU case.
> + local_freelist_check [shape=diamond,fillcolor=1,
> + label="Local freelist\nnode available?"];
> + // The following corresponds to __local_list_pop_free() for common LRU case.
> + use_local_node [shape=rectangle,
> + label="Use node owned\nby this CPU"]
> +
> + common_lru_check [shape=diamond,
> + label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];
> +
> + fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2,
> + label="Flush local pending,
> + Rotate Global list, move
> + LOCAL_FREE_TARGET
> + from global -> local"]
> + // Also corresponds to:
> + // fn__local_list_flush()
> + // fn_bpf_lru_list_rotate()
> + fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2,
> + label="Able to free\nLOCAL_FREE_TARGET\nnodes?"]
> +
> + fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3,
> + label="Shrink inactive list
> + up to remaining
> + LOCAL_FREE_TARGET
> + (global LRU -> local)"]
> + fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2,
> + label="> 0 entries in\nlocal free list?"]
> + fn___bpf_lru_list_shrink2 [shape=rectangle,fillcolor=2,
> + label="Steal one node from
> + inactive, or if empty,
> + from active global list"]
> + fn___bpf_lru_list_shrink3 [shape=rectangle,fillcolor=3,
> + label="Try to remove\nnode from hashtab"]
> +
> + local_freelist_check2 [shape=diamond,label="Htab removal\nsuccessful?"]
> + common_lru_check2 [shape=diamond,
> + label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];
> +
> + subgraph cluster_remote_lock {
> + label = "Iterate through CPUs\n(start from current)";
> + style = dashed;
> + rankdir=LR;
> +
> + local_freelist_check5 [shape=diamond,fillcolor=4,
> + label="Steal a node from\nper-cpu freelist?"]
> + local_freelist_check6 [shape=rectangle,fillcolor=4,
> + label="Steal a node from
> + (1) Unreferenced pending, or
> + (2) Any pending node"]
> + local_freelist_check7 [shape=rectangle,fillcolor=3,
> + label="Try to remove\nnode from hashtab"]
> + fn_htab_lru_map_update_elem [shape=diamond,
> + label="Stole node\nfrom remote\nCPU?"]
> + fn_htab_lru_map_update_elem2 [shape=diamond,label="Iterated\nall CPUs?"]
> + // Also corresponds to:
> + // use_local_node()
> + // fn__local_list_pop_pending()
> + }
> +
> + fn_bpf_lru_list_pop_free_to_local2 [shape=rectangle,
> + label="Use node that was\nnot recently referenced"]
> + local_freelist_check4 [shape=rectangle,
> + label="Use node that was\nactively referenced\nin global list"]
> + fn_htab_lru_map_update_elem_ENOMEM [shape=oval,label="return -ENOMEM"]
> + fn_htab_lru_map_update_elem3 [shape=rectangle,
> + label="Use node that was\nactively referenced\nin (another?) CPU's cache"]
> + fn_htab_lru_map_update_elem4 [shape=diamond,
> + label="Can lock this\nhashtab bucket?"]
> + fn_htab_lru_map_update_elem5 [shape=rectangle,fillcolor=3,
> + label="Update hashmap\nwith new element"]
> + fn_htab_lru_map_update_elem6 [shape=oval,label="return 0"]
> + fn_htab_lru_map_update_elem_EBUSY [shape=oval,label="return -EBUSY"]
> +
> + begin -> local_freelist_check
> + local_freelist_check -> use_local_node [xlabel="Y"]
> + local_freelist_check -> common_lru_check [xlabel="N"]
> + common_lru_check -> fn_bpf_lru_list_pop_free_to_local [xlabel="Y"]
> + common_lru_check -> fn___bpf_lru_list_shrink_inactive [xlabel="N"]
> + fn_bpf_lru_list_pop_free_to_local -> fn___bpf_lru_node_move_to_free
> + fn___bpf_lru_node_move_to_free ->
> + fn_bpf_lru_list_pop_free_to_local2 [xlabel="Y"]
> + fn___bpf_lru_node_move_to_free ->
> + fn___bpf_lru_list_shrink_inactive [xlabel="N"]
> + fn___bpf_lru_list_shrink_inactive -> fn___bpf_lru_list_shrink
> + fn___bpf_lru_list_shrink -> fn_bpf_lru_list_pop_free_to_local2 [xlabel = "Y"]
> + fn___bpf_lru_list_shrink -> fn___bpf_lru_list_shrink2 [xlabel="N"]
> + fn___bpf_lru_list_shrink2 -> fn___bpf_lru_list_shrink3
> + fn___bpf_lru_list_shrink3 -> local_freelist_check2
> + local_freelist_check2 -> local_freelist_check4 [xlabel = "Y"]
> + local_freelist_check2 -> common_lru_check2 [xlabel = "N"]
> + common_lru_check2 -> local_freelist_check5 [xlabel = "Y"]
> + common_lru_check2 -> fn_htab_lru_map_update_elem_ENOMEM [xlabel = "N"]
> + local_freelist_check5 -> fn_htab_lru_map_update_elem [xlabel = "Y"]
> + local_freelist_check5 -> local_freelist_check6 [xlabel = "N"]
> + local_freelist_check6 -> local_freelist_check7
> + local_freelist_check7 -> fn_htab_lru_map_update_elem
> +
> + fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem3 [xlabel = "Y"]
> + fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem2 [xlabel = "N"]
> + fn_htab_lru_map_update_elem2 ->
> + fn_htab_lru_map_update_elem_ENOMEM [xlabel = "Y"]
> + fn_htab_lru_map_update_elem2 -> local_freelist_check5 [xlabel = "N"]
> + fn_htab_lru_map_update_elem3 -> fn_htab_lru_map_update_elem4
> +
> + use_local_node -> fn_htab_lru_map_update_elem4
> + fn_bpf_lru_list_pop_free_to_local2 -> fn_htab_lru_map_update_elem4
> + local_freelist_check4 -> fn_htab_lru_map_update_elem4
> +
> + fn_htab_lru_map_update_elem4 -> fn_htab_lru_map_update_elem5 [xlabel="Y"]
> + fn_htab_lru_map_update_elem4 ->
> + fn_htab_lru_map_update_elem_EBUSY [xlabel="N"]
> + fn_htab_lru_map_update_elem5 -> fn_htab_lru_map_update_elem6
> +
> + // Create invisible pad nodes to line up various nodes
> + pad0 [style=invis]
> + pad1 [style=invis]
> + pad2 [style=invis]
> + pad3 [style=invis]
> + pad4 [style=invis]
> +
> + // Line up the key with the top of the graph
> + no_lock -> local_lock [style=invis]
> + local_lock -> lru_lock [style=invis]
> + lru_lock -> hash_lock [style=invis]
> + hash_lock -> remote_lock [style=invis]
> + remote_lock -> local_freelist_check5 [style=invis]
> + remote_lock -> fn___bpf_lru_list_shrink [style=invis]
> +
> + // Line up return code nodes at the bottom of the graph
> + fn_htab_lru_map_update_elem -> pad0 [style=invis]
> + pad0 -> pad1 [style=invis]
> + pad1 -> pad2 [style=invis]
> + pad2-> fn_htab_lru_map_update_elem_ENOMEM [style=invis]
> + fn_htab_lru_map_update_elem4 -> pad3 [style=invis]
> + pad3 -> fn_htab_lru_map_update_elem_EBUSY [style=invis]
> +
> + // Reduce diagram width by forcing some nodes to appear above others
> + local_freelist_check4 -> fn_htab_lru_map_update_elem3 [style=invis]
> + common_lru_check2 -> pad4 [style=invis]
> + pad4 -> local_freelist_check5 [style=invis]
> +}
> --
> 2.34.1
>
next prev parent reply other threads:[~2023-03-14 5:21 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-12 19:05 [PATCH bpf-next v3] docs/bpf: Add LRU internals description and graph Joe Stringer
2023-03-13 3:54 ` Bagas Sanjaya
2023-03-13 19:18 ` Joe Stringer
2023-03-14 5:21 ` John Fastabend [this message]
2023-03-16 1:19 ` Joe Stringer
2023-03-14 13:55 ` Maxim Mikityanskiy
2023-03-16 1:22 ` Joe Stringer
2023-03-14 19:31 ` Martin KaFai Lau
2023-03-16 1:54 ` Joe Stringer
2023-03-17 6:04 ` Martin KaFai Lau
2023-04-01 18:57 ` Joe Stringer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6410046013834_42581208fd@john.notmuch \
--to=john.fastabend@gmail.com \
--cc=ast@kernel.org \
--cc=bagasdotme@gmail.com \
--cc=bpf@vger.kernel.org \
--cc=corbet@lwn.net \
--cc=joe@isovalent.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=martin.lau@linux.dev \
--cc=maxtram95@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.