From: Eric Dumazet <eric.dumazet@gmail.com>
To: Yonghong Song <yhs@fb.com>,
ast@fb.com, daniel@iogearbox.net, netdev@vger.kernel.org
Cc: kernel-team@fb.com
Subject: Re: [PATCH bpf-next 1/2] bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map
Date: Mon, 22 Jan 2018 11:28:32 -0800 [thread overview]
Message-ID: <1516649312.3478.13.camel@gmail.com> (raw)
In-Reply-To: <20180118230851.1533009-2-yhs@fb.com>
On Thu, 2018-01-18 at 15:08 -0800, Yonghong Song wrote:
> Current LPM_TRIE map type does not implement MAP_GET_NEXT_KEY
> command. This command is handy when users want to enumerate
> keys. Otherwise, a different map which supports key
> enumeration may be required to store the keys. If the
> map data is sparse and all map data are to be deleted without
> closing file descriptor, using MAP_GET_NEXT_KEY to find
> all keys is much faster than enumerating all key space.
>
> This patch implements MAP_GET_NEXT_KEY command for LPM_TRIE map.
> If user provided key pointer is NULL or the key does not have
> an exact match in the trie, the first key will be returned.
> Otherwise, the next key will be returned.
>
> In this implemenation, key enumeration follows a postorder
> traversal of internal trie. More specific keys
> will be returned first than less specific ones, given
> a sequence of MAP_GET_NEXT_KEY syscalls.
>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
> kernel/bpf/lpm_trie.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 93 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 584e022..d7ea962 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -591,9 +591,100 @@ static void trie_free(struct bpf_map *map)
> raw_spin_unlock(&trie->lock);
> }
>
> -static int trie_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
> {
> - return -ENOTSUPP;
> + struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
> + struct bpf_lpm_trie_key *key = _key, *next_key = _next_key;
> + struct lpm_trie_node *node, *next_node = NULL, *parent;
> + struct lpm_trie_node **node_stack = NULL;
> + struct lpm_trie_node __rcu **root;
> + int err = 0, stack_ptr = -1;
> + unsigned int next_bit;
> + size_t matchlen;
> +
> + /* The get_next_key follows postorder. For the 4 node example in
> + * the top of this file, the trie_get_next_key() returns the following
> + * one after another:
> + * 192.168.0.0/24
> + * 192.168.1.0/24
> + * 192.168.128.0/24
> + * 192.168.0.0/16
> + *
> + * The idea is to return more specific keys before less specific ones.
> + */
> +
> + /* Empty trie */
> + if (!rcu_dereference(trie->root))
> + return -ENOENT;
> +
> + /* For invalid key, find the leftmost node in the trie */
> + if (!key || key->prefixlen > trie->max_prefixlen) {
> + root = &trie->root;
> + goto find_leftmost;
> + }
> +
> + node_stack = kmalloc(trie->max_prefixlen * sizeof(struct lpm_trie_node *),
> + GFP_USER | __GFP_NOWARN);
It is illegal to use a blocking kmalloc() while holding RCU.
CONFIG_DEBUG_ATOMIC_SLEEP=y is your friend
> + if (!node_stack)
> + return -ENOMEM;
> +
> + /* Try to find the exact node for the given key */
> + for (node = rcu_dereference(trie->root); node;) {
> + node_stack[++stack_ptr] = node;
> + matchlen = longest_prefix_match(trie, node, key);
> + if (node->prefixlen != matchlen ||
> + node->prefixlen == key->prefixlen)
> + break;
> +
> + next_bit = extract_bit(key->data, node->prefixlen);
> + node = rcu_dereference(node->child[next_bit]);
> + }
> + if (!node || node->prefixlen != key->prefixlen ||
> + (node->flags & LPM_TREE_NODE_FLAG_IM)) {
> + root = &trie->root;
> + goto find_leftmost;
> + }
> +
> + /* The node with the exactly-matching key has been found,
> + * find the first node in postorder after the matched node.
> + */
> + node = node_stack[stack_ptr];
> + while (stack_ptr > 0) {
> + parent = node_stack[stack_ptr - 1];
> + if (rcu_dereference(parent->child[0]) == node &&
> + rcu_dereference(parent->child[1])) {
> + root = &parent->child[1];
> + goto find_leftmost;
> + }
> + if (!(parent->flags & LPM_TREE_NODE_FLAG_IM)) {
> + next_node = parent;
> + goto do_copy;
> + }
> +
> + node = parent;
> + stack_ptr--;
> + }
> +
> + /* did not find anything */
> + err = -ENOENT;
> + goto free_stack;
> +
> +find_leftmost:
> + /* Find the leftmost non-intermediate node, all intermediate nodes
> + * have exact two children, so this function will never return NULL.
> + */
> + for (node = rcu_dereference(*root); node;) {
> + if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
> + next_node = node;
> + node = rcu_dereference(node->child[0]);
> + }
> +do_copy:
> + next_key->prefixlen = next_node->prefixlen;
> + memcpy((void *)next_key + offsetof(struct bpf_lpm_trie_key, data),
> + next_node->data, trie->data_size);
> +free_stack:
> + kfree(node_stack);
> + return err;
> }
>
> const struct bpf_map_ops trie_map_ops = {
next prev parent reply other threads:[~2018-01-22 19:28 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-01-18 23:08 [PATCH bpf-next 0/2] bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map Yonghong Song
2018-01-18 23:08 ` [PATCH bpf-next 1/2] " Yonghong Song
2018-01-22 19:28 ` Eric Dumazet [this message]
2018-01-23 0:05 ` Yonghong Song
2018-01-26 4:47 ` Eric Dumazet
2018-01-26 16:50 ` Yonghong Song
2018-01-18 23:08 ` [PATCH bpf-next 2/2] tools/bpf: add a testcase for MAP_GET_NEXT_KEY command of " Yonghong Song
2018-01-20 1:07 ` [PATCH bpf-next 0/2] bpf: implement MAP_GET_NEXT_KEY command for " Daniel Borkmann
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1516649312.3478.13.camel@gmail.com \
--to=eric.dumazet@gmail.com \
--cc=ast@fb.com \
--cc=daniel@iogearbox.net \
--cc=kernel-team@fb.com \
--cc=netdev@vger.kernel.org \
--cc=yhs@fb.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).