All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Yonghong Song <yhs@fb.com>,
	ast@fb.com, daniel@iogearbox.net, netdev@vger.kernel.org
Cc: kernel-team@fb.com
Subject: Re: [PATCH bpf-next 1/2] bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map
Date: Mon, 22 Jan 2018 11:28:32 -0800	[thread overview]
Message-ID: <1516649312.3478.13.camel@gmail.com> (raw)
In-Reply-To: <20180118230851.1533009-2-yhs@fb.com>

On Thu, 2018-01-18 at 15:08 -0800, Yonghong Song wrote:
> Current LPM_TRIE map type does not implement MAP_GET_NEXT_KEY
> command. This command is handy when users want to enumerate
> keys. Otherwise, a different map which supports key
> enumeration may be required to store the keys. If the
> map data is sparse and all map data are to be deleted without
> closing file descriptor, using MAP_GET_NEXT_KEY to find
> all keys is much faster than enumerating all key space.
> 
> This patch implements MAP_GET_NEXT_KEY command for LPM_TRIE map.
> If user provided key pointer is NULL or the key does not have
> an exact match in the trie, the first key will be returned.
> Otherwise, the next key will be returned.
> 
> In this implemenation, key enumeration follows a postorder
> traversal of internal trie. More specific keys
> will be returned first than less specific ones, given
> a sequence of MAP_GET_NEXT_KEY syscalls.
> 
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>  kernel/bpf/lpm_trie.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 93 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 584e022..d7ea962 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -591,9 +591,100 @@ static void trie_free(struct bpf_map *map)
>  	raw_spin_unlock(&trie->lock);
>  }
>  
> -static int trie_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +static int trie_get_next_key(struct bpf_map *map, void *_key, void *_next_key)
>  {
> -	return -ENOTSUPP;
> +	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
> +	struct bpf_lpm_trie_key *key = _key, *next_key = _next_key;
> +	struct lpm_trie_node *node, *next_node = NULL, *parent;
> +	struct lpm_trie_node **node_stack = NULL;
> +	struct lpm_trie_node __rcu **root;
> +	int err = 0, stack_ptr = -1;
> +	unsigned int next_bit;
> +	size_t matchlen;
> +
> +	/* The get_next_key follows postorder. For the 4 node example in
> +	 * the top of this file, the trie_get_next_key() returns the following
> +	 * one after another:
> +	 *   192.168.0.0/24
> +	 *   192.168.1.0/24
> +	 *   192.168.128.0/24
> +	 *   192.168.0.0/16
> +	 *
> +	 * The idea is to return more specific keys before less specific ones.
> +	 */
> +
> +	/* Empty trie */
> +	if (!rcu_dereference(trie->root))
> +		return -ENOENT;
> +
> +	/* For invalid key, find the leftmost node in the trie */
> +	if (!key || key->prefixlen > trie->max_prefixlen) {
> +		root = &trie->root;
> +		goto find_leftmost;
> +	}
> +
> +	node_stack = kmalloc(trie->max_prefixlen * sizeof(struct lpm_trie_node *),
> +			     GFP_USER | __GFP_NOWARN);

It is illegal to use a blocking kmalloc() while holding RCU.

CONFIG_DEBUG_ATOMIC_SLEEP=y is your friend

> +	if (!node_stack)
> +		return -ENOMEM;
> +
> +	/* Try to find the exact node for the given key */
> +	for (node = rcu_dereference(trie->root); node;) {
> +		node_stack[++stack_ptr] = node;
> +		matchlen = longest_prefix_match(trie, node, key);
> +		if (node->prefixlen != matchlen ||
> +		    node->prefixlen == key->prefixlen)
> +			break;
> +
> +		next_bit = extract_bit(key->data, node->prefixlen);
> +		node = rcu_dereference(node->child[next_bit]);
> +	}
> +	if (!node || node->prefixlen != key->prefixlen ||
> +	    (node->flags & LPM_TREE_NODE_FLAG_IM)) {
> +		root = &trie->root;
> +		goto find_leftmost;
> +	}
> +
> +	/* The node with the exactly-matching key has been found,
> +	 * find the first node in postorder after the matched node.
> +	 */
> +	node = node_stack[stack_ptr];
> +	while (stack_ptr > 0) {
> +		parent = node_stack[stack_ptr - 1];
> +		if (rcu_dereference(parent->child[0]) == node &&
> +		    rcu_dereference(parent->child[1])) {
> +			root = &parent->child[1];
> +			goto find_leftmost;
> +		}
> +		if (!(parent->flags & LPM_TREE_NODE_FLAG_IM)) {
> +			next_node = parent;
> +			goto do_copy;
> +		}
> +
> +		node = parent;
> +		stack_ptr--;
> +	}
> +
> +	/* did not find anything */
> +	err = -ENOENT;
> +	goto free_stack;
> +
> +find_leftmost:
> +	/* Find the leftmost non-intermediate node, all intermediate nodes
> +	 * have exact two children, so this function will never return NULL.
> +	 */
> +	for (node = rcu_dereference(*root); node;) {
> +		if (!(node->flags & LPM_TREE_NODE_FLAG_IM))
> +			next_node = node;
> +		node = rcu_dereference(node->child[0]);
> +	}
> +do_copy:
> +	next_key->prefixlen = next_node->prefixlen;
> +	memcpy((void *)next_key + offsetof(struct bpf_lpm_trie_key, data),
> +	       next_node->data, trie->data_size);
> +free_stack:
> +	kfree(node_stack);
> +	return err;
>  }
>  
>  const struct bpf_map_ops trie_map_ops = {

  reply	other threads:[~2018-01-22 19:28 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-18 23:08 [PATCH bpf-next 0/2] bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map Yonghong Song
2018-01-18 23:08 ` [PATCH bpf-next 1/2] " Yonghong Song
2018-01-22 19:28   ` Eric Dumazet [this message]
2018-01-23  0:05     ` Yonghong Song
2018-01-26  4:47   ` Eric Dumazet
2018-01-26 16:50     ` Yonghong Song
2018-01-18 23:08 ` [PATCH bpf-next 2/2] tools/bpf: add a testcase for MAP_GET_NEXT_KEY command of " Yonghong Song
2018-01-20  1:07 ` [PATCH bpf-next 0/2] bpf: implement MAP_GET_NEXT_KEY command for " Daniel Borkmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1516649312.3478.13.camel@gmail.com \
    --to=eric.dumazet@gmail.com \
    --cc=ast@fb.com \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=netdev@vger.kernel.org \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.