* [PATCH net-next 0/2] net: store netdevs in an xarray
@ 2023-07-22 1:42 Jakub Kicinski
2023-07-22 1:42 ` [PATCH net-next 1/2] " Jakub Kicinski
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Jakub Kicinski @ 2023-07-22 1:42 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, pabeni, mkubecek, lorenzo, Jakub Kicinski
One of the more annoying developer experience gaps we have in netlink
is iterating over netdevs. It's painful. Add an xarray to make
it trivial.
Jakub Kicinski (2):
net: store netdevs in an xarray
net: convert some netlink netdev iterators to depend on the xarray
include/linux/netdevice.h | 3 ++
include/net/net_namespace.h | 4 +-
net/core/dev.c | 82 ++++++++++++++++++++++++-------------
net/core/netdev-genl.c | 37 ++++-------------
net/ethtool/netlink.c | 56 ++++++-------------------
net/ethtool/tunnels.c | 73 ++++++++++++---------------------
6 files changed, 108 insertions(+), 147 deletions(-)
--
2.41.0
* [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-22 1:42 [PATCH net-next 0/2] net: store netdevs in an xarray Jakub Kicinski
@ 2023-07-22 1:42 ` Jakub Kicinski
2023-07-22 1:47 ` Jakub Kicinski
` (2 more replies)
2023-07-22 1:42 ` [PATCH net-next 2/2] net: convert some netlink netdev iterators to depend on the xarray Jakub Kicinski
2023-07-24 15:28 ` [PATCH net-next 0/2] net: store netdevs in an xarray Simon Horman
2 siblings, 3 replies; 15+ messages in thread
From: Jakub Kicinski @ 2023-07-22 1:42 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, pabeni, mkubecek, lorenzo, Jakub Kicinski
Iterating over the netdev hash table for netlink dumps is hard.
Dumps are done in "chunks", so we need to save the position
after each chunk to know where to restart from. Because
netdevs are stored in a hash table, we remember which bucket
we were in and how many devices we have dumped from it so far.
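The bookkeeping for that looks roughly like the sketch below
(simplified; fill_one() is a stand-in for the per-device fill helper,
the real handlers converted in the next patch do the same dance):

	static int some_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
	{
		struct net *net = sock_net(skb->sk);
		struct net_device *dev;
		int h, idx = 0;

		for (h = cb->args[0]; h < NETDEV_HASHENTRIES; h++, cb->args[1] = 0) {
			idx = 0;
			hlist_for_each_entry(dev, &net->dev_index_head[h], index_hlist) {
				if (idx < cb->args[1])
					goto cont;	/* dumped in a previous chunk */
				if (fill_one(skb, dev) < 0)
					goto out;	/* skb full, resume here later */
	cont:
				idx++;
			}
		}
	out:
		cb->args[0] = h;	/* bucket we stopped in */
		cb->args[1] = idx;	/* devices already dumped from it */
		return skb->len;
	}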
Since we don't hold any locks across the "chunks", devices may
come and go while we're dumping. If that happens we may miss
a device (if it gets deleted from the bucket we were in).
We indicate to user space that this may have happened by setting
NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
if it sees that. Somehow I doubt most user space gets this right...
To illustrate let's look at an example:
System state:
start: # [A, B, C]
del: B # [A, C]
with the hash table we may dump [A, B], missing C completely even
though it existed both before and after the "del B".
Add an xarray and use it to allocate ifindexes. This way we
can iterate ifindexes in order, without the worry that we'll
skip one. We may still generate a dump of a state which "never
existed", for example for a set of values and sequence of ops:
System state:
start: # [A, B]
add: C # [A, C, B]
del: B # [A, C]
we may generate a dump of [A], if C got an index between A and B.
The system has never been in such a state. But I'm 90% sure that's
perfectly fine; the important part is that we can't _miss_ devices
which exist both before and after. User space which wants to mirror
the kernel's state subscribes to notifications and does periodic dumps,
so it will know that C exists from the notification about its creation
or from the next dump (the next dump is _guaranteed_ to include C,
if it doesn't get removed).
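With the xarray the dump loop shrinks to walking ifindexes in order,
the resume position being just the last ifindex (a sketch, with
fill_one() as a stand-in again; the next patch does the real
conversions):

	unsigned long ifindex = cb->args[0];	/* 0 on the first chunk */
	struct net_device *netdev;

	xa_for_each_start(&net->dev_by_index, ifindex, netdev, ifindex) {
		if (fill_one(skb, netdev) < 0)
			break;	/* skb full, this dev is retried next chunk */
	}
	cb->args[0] = ifindex;
	return skb->len;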
To avoid any perf regressions keep the hash table for now. Most
net namespaces have very few devices; microbenchmarking 1M lookups
on Skylake gives the following results (loopback is not counted
toward the number of devs):
#devs | hash | xa | delta
2 | 18.3 | 20.1 | + 9.8%
16 | 18.3 | 20.1 | + 9.5%
64 | 18.3 | 26.3 | +43.8%
128 | 20.4 | 26.3 | +28.6%
256 | 20.0 | 26.4 | +32.1%
1024 | 26.6 | 26.7 | + 0.2%
8192 |541.3 | 33.5 | -93.8%
No surprises since the hash table has 256 entries.
The microbenchmark scans indexes in order; if the pattern is more
random the xarray starts to win already at 512 devices. But that's
a lot of devices, in practice.
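The measurement loop is along the lines of the sketch below
(illustrative only; n_devs and the harness details are assumptions,
not the exact code used):

	u64 start, hash_ns, xa_ns;
	unsigned int i;

	/* under rtnl_lock(); ifindex 1 is loopback, so start at 2 */
	start = ktime_get_ns();
	for (i = 0; i < 1000 * 1000; i++)
		WARN_ON(!__dev_get_by_index(net, 2 + i % n_devs));
	hash_ns = ktime_get_ns() - start;

	start = ktime_get_ns();
	for (i = 0; i < 1000 * 1000; i++)
		WARN_ON(!xa_load(&net->dev_by_index, 2 + i % n_devs));
	xa_ns = ktime_get_ns() - start;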
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
include/net/net_namespace.h | 4 +-
net/core/dev.c | 82 ++++++++++++++++++++++++-------------
2 files changed, 57 insertions(+), 29 deletions(-)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 78beaa765c73..9f6add96de2d 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -42,6 +42,7 @@
#include <linux/idr.h>
#include <linux/skbuff.h>
#include <linux/notifier.h>
+#include <linux/xarray.h>
struct user_namespace;
struct proc_dir_entry;
@@ -69,7 +70,7 @@ struct net {
atomic_t dev_unreg_count;
unsigned int dev_base_seq; /* protected by rtnl_mutex */
- int ifindex;
+ u32 ifindex;
spinlock_t nsid_lock;
atomic_t fnhe_genid;
@@ -110,6 +111,7 @@ struct net {
struct hlist_head *dev_name_head;
struct hlist_head *dev_index_head;
+ struct xarray dev_by_index;
struct raw_notifier_head netdev_chain;
/* Note that @hash_mix can be read millions times per second,
diff --git a/net/core/dev.c b/net/core/dev.c
index 8e7d0cb540cd..3c22a1691dc7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -388,6 +388,8 @@ static void list_netdevice(struct net_device *dev)
hlist_add_head_rcu(&dev->index_hlist,
dev_index_hash(net, dev->ifindex));
write_unlock(&dev_base_lock);
+ /* We reserved the ifindex, this can't fail */
+ WARN_ON(xa_store(&net->dev_by_index, dev->ifindex, dev, GFP_KERNEL));
dev_base_seq_inc(net);
}
@@ -397,8 +399,12 @@ static void list_netdevice(struct net_device *dev)
*/
static void unlist_netdevice(struct net_device *dev, bool lock)
{
+ struct net *net = dev_net(dev);
+
ASSERT_RTNL();
+ xa_erase(&net->dev_by_index, dev->ifindex);
+
/* Unlink dev from the device chain */
if (lock)
write_lock(&dev_base_lock);
@@ -9566,23 +9572,35 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
}
/**
- * dev_new_index - allocate an ifindex
- * @net: the applicable net namespace
+ * dev_index_reserve() - allocate an ifindex in a namespace
+ * @net: the applicable net namespace
+ * @ifindex: requested ifindex, pass %0 to get one allocated
*
- * Returns a suitable unique value for a new device interface
- * number. The caller must hold the rtnl semaphore or the
- * dev_base_lock to be sure it remains unique.
+ * Allocate a ifindex for a new device. Caller must either use the ifindex
+ * to store the device (via list_netdevice()) or call dev_index_release()
+ * to give the index up.
+ *
+ * Return: a suitable unique value for a new device interface number or -errno.
*/
-static int dev_new_index(struct net *net)
+static int dev_index_reserve(struct net *net, u32 ifindex)
{
- int ifindex = net->ifindex;
+ int err;
- for (;;) {
- if (++ifindex <= 0)
- ifindex = 1;
- if (!__dev_get_by_index(net, ifindex))
- return net->ifindex = ifindex;
- }
+ if (!ifindex)
+ err = xa_alloc_cyclic(&net->dev_by_index, &ifindex, NULL,
+ xa_limit_31b, &net->ifindex, GFP_KERNEL);
+ else
+ err = xa_insert(&net->dev_by_index, ifindex, NULL, GFP_KERNEL);
+ if (err)
+ return err;
+
+ return ifindex;
+}
+
+static void dev_index_release(struct net *net, int ifindex)
+{
+ /* Expect only unused indexes, unlist_netdevice() removes the used */
+ WARN_ON(xa_erase(&net->dev_by_index, ifindex));
}
/* Delayed registration/unregisteration */
@@ -10052,11 +10070,10 @@ int register_netdevice(struct net_device *dev)
goto err_uninit;
}
- ret = -EBUSY;
- if (!dev->ifindex)
- dev->ifindex = dev_new_index(net);
- else if (__dev_get_by_index(net, dev->ifindex))
+ ret = dev_index_reserve(net, dev->ifindex);
+ if (ret < 0)
goto err_uninit;
+ dev->ifindex = ret;
/* Transfer changeable features to wanted_features and enable
* software offloads (GSO and GRO).
@@ -10103,7 +10120,7 @@ int register_netdevice(struct net_device *dev)
ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
ret = notifier_to_errno(ret);
if (ret)
- goto err_uninit;
+ goto err_ifindex_release;
ret = netdev_register_kobject(dev);
write_lock(&dev_base_lock);
@@ -10159,6 +10176,8 @@ int register_netdevice(struct net_device *dev)
err_uninit_notify:
call_netdevice_notifiers(NETDEV_PRE_UNINIT, dev);
+err_ifindex_release:
+ dev_index_release(net, dev->ifindex);
err_uninit:
if (dev->netdev_ops->ndo_uninit)
dev->netdev_ops->ndo_uninit(dev);
@@ -11036,9 +11055,19 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
}
/* Check that new_ifindex isn't used yet. */
- err = -EBUSY;
- if (new_ifindex && __dev_get_by_index(net, new_ifindex))
- goto out;
+ if (new_ifindex) {
+ err = dev_index_reserve(net, new_ifindex);
+ if (err < 0)
+ goto out;
+ } else {
+ /* If there is an ifindex conflict assign a new one */
+ err = dev_index_reserve(net, dev->ifindex);
+ if (err == -EBUSY)
+ err = dev_index_reserve(net, 0);
+ if (err < 0)
+ goto out;
+ new_ifindex = err;
+ }
/*
* And now a mini version of register_netdevice unregister_netdevice.
@@ -11066,13 +11095,6 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
rcu_barrier();
new_nsid = peernet2id_alloc(dev_net(dev), net, GFP_KERNEL);
- /* If there is an ifindex conflict assign a new one */
- if (!new_ifindex) {
- if (__dev_get_by_index(net, dev->ifindex))
- new_ifindex = dev_new_index(net);
- else
- new_ifindex = dev->ifindex;
- }
rtmsg_ifinfo_newnet(RTM_DELLINK, dev, ~0U, GFP_KERNEL, &new_nsid,
new_ifindex);
@@ -11250,6 +11272,9 @@ static int __net_init netdev_init(struct net *net)
if (net->dev_index_head == NULL)
goto err_idx;
+ net->ifindex = 1;
+ xa_init_flags(&net->dev_by_index, XA_FLAGS_ALLOC);
+
RAW_INIT_NOTIFIER_HEAD(&net->netdev_chain);
return 0;
@@ -11347,6 +11372,7 @@ static void __net_exit netdev_exit(struct net *net)
{
kfree(net->dev_name_head);
kfree(net->dev_index_head);
+ xa_destroy(&net->dev_by_index);
if (net != &init_net)
WARN_ON_ONCE(!list_empty(&net->dev_base_head));
}
--
2.41.0
* [PATCH net-next 2/2] net: convert some netlink netdev iterators to depend on the xarray
2023-07-22 1:42 [PATCH net-next 0/2] net: store netdevs in an xarray Jakub Kicinski
2023-07-22 1:42 ` [PATCH net-next 1/2] " Jakub Kicinski
@ 2023-07-22 1:42 ` Jakub Kicinski
2023-07-24 15:28 ` [PATCH net-next 0/2] net: store netdevs in an xarray Simon Horman
2 siblings, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2023-07-22 1:42 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, pabeni, mkubecek, lorenzo, Jakub Kicinski
Reap the benefits of easier iteration thanks to the xarray.
Convert just the genetlink ones; those are easier to test.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
include/linux/netdevice.h | 3 ++
net/core/netdev-genl.c | 37 +++++---------------
net/ethtool/netlink.c | 59 ++++++++-----------------------
net/ethtool/tunnels.c | 73 +++++++++++++++------------------------
4 files changed, 52 insertions(+), 120 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3800d0479698..1159ca17e246 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3014,6 +3014,9 @@ extern rwlock_t dev_base_lock; /* Device list lock */
if (netdev_master_upper_dev_get_rcu(slave) == (bond))
#define net_device_entry(lh) list_entry(lh, struct net_device, dev_list)
+#define for_each_netdev_dump(net, d, ifindex) \
+ xa_for_each_start(&(net)->dev_by_index, (ifindex), (d), (ifindex))
+
static inline struct net_device *next_net_device(struct net_device *dev)
{
struct list_head *lh;
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 65ef4867fc49..797c813c7c77 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -101,43 +101,22 @@ int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
{
struct net *net = sock_net(skb->sk);
struct net_device *netdev;
- int idx = 0, s_idx;
- int h, s_h;
- int err;
-
- s_h = cb->args[0];
- s_idx = cb->args[1];
+ int err = 0;
rtnl_lock();
-
- for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
- struct hlist_head *head;
-
- idx = 0;
- head = &net->dev_index_head[h];
- hlist_for_each_entry(netdev, head, index_hlist) {
- if (idx < s_idx)
- goto cont;
- err = netdev_nl_dev_fill(netdev, skb,
- NETLINK_CB(cb->skb).portid,
- cb->nlh->nlmsg_seq, 0,
- NETDEV_CMD_DEV_GET);
- if (err < 0)
- break;
-cont:
- idx++;
- }
+ for_each_netdev_dump(net, netdev, cb->args[0]) {
+ err = netdev_nl_dev_fill(netdev, skb,
+ NETLINK_CB(cb->skb).portid,
+ cb->nlh->nlmsg_seq, 0,
+ NETDEV_CMD_DEV_GET);
+ if (err < 0)
+ break;
}
-
rtnl_unlock();
if (err != -EMSGSIZE)
return err;
- cb->args[1] = idx;
- cb->args[0] = h;
- cb->seq = net->dev_base_seq;
-
return skb->len;
}
diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
index 39a459b0111b..ae344f1b0bbd 100644
--- a/net/ethtool/netlink.c
+++ b/net/ethtool/netlink.c
@@ -252,8 +252,7 @@ int ethnl_multicast(struct sk_buff *skb, struct net_device *dev)
* @ops: request ops of currently processed message type
* @req_info: parsed request header of processed request
* @reply_data: data needed to compose the reply
- * @pos_hash: saved iteration position - hashbucket
- * @pos_idx: saved iteration position - index
+ * @pos_ifindex: saved iteration position - ifindex
*
* These parameters are kept in struct netlink_callback as context preserved
* between iterations. They are initialized by ethnl_default_start() and used
@@ -263,8 +262,7 @@ struct ethnl_dump_ctx {
const struct ethnl_request_ops *ops;
struct ethnl_req_info *req_info;
struct ethnl_reply_data *reply_data;
- int pos_hash;
- int pos_idx;
+ unsigned long pos_ifindex;
};
static const struct ethnl_request_ops *
@@ -490,55 +488,27 @@ static int ethnl_default_dumpit(struct sk_buff *skb,
{
struct ethnl_dump_ctx *ctx = ethnl_dump_context(cb);
struct net *net = sock_net(skb->sk);
- int s_idx = ctx->pos_idx;
- int h, idx = 0;
+ struct net_device *dev;
int ret = 0;
rtnl_lock();
- for (h = ctx->pos_hash; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
- struct hlist_head *head;
- struct net_device *dev;
- unsigned int seq;
+ for_each_netdev_dump(net, dev, ctx->pos_ifindex) {
+ dev_hold(dev);
+ rtnl_unlock();
- head = &net->dev_index_head[h];
+ ret = ethnl_default_dump_one(skb, dev, ctx, cb);
-restart_chain:
- seq = net->dev_base_seq;
- cb->seq = seq;
- idx = 0;
- hlist_for_each_entry(dev, head, index_hlist) {
- if (idx < s_idx)
- goto cont;
- dev_hold(dev);
- rtnl_unlock();
+ rtnl_lock();
+ dev_put(dev);
- ret = ethnl_default_dump_one(skb, dev, ctx, cb);
- dev_put(dev);
- if (ret < 0) {
- if (ret == -EOPNOTSUPP)
- goto lock_and_cont;
- if (likely(skb->len))
- ret = skb->len;
- goto out;
- }
-lock_and_cont:
- rtnl_lock();
- if (net->dev_base_seq != seq) {
- s_idx = idx + 1;
- goto restart_chain;
- }
-cont:
- idx++;
+ if (ret < 0 && ret != -EOPNOTSUPP) {
+ if (likely(skb->len))
+ ret = skb->len;
+ break;
}
-
}
rtnl_unlock();
-out:
- ctx->pos_hash = h;
- ctx->pos_idx = idx;
- nl_dump_check_consistent(cb, nlmsg_hdr(skb));
-
return ret;
}
@@ -584,8 +554,7 @@ static int ethnl_default_start(struct netlink_callback *cb)
ctx->ops = ops;
ctx->req_info = req_info;
ctx->reply_data = reply_data;
- ctx->pos_hash = 0;
- ctx->pos_idx = 0;
+ ctx->pos_ifindex = 0;
return 0;
diff --git a/net/ethtool/tunnels.c b/net/ethtool/tunnels.c
index 67fb414ca859..05f752557b5e 100644
--- a/net/ethtool/tunnels.c
+++ b/net/ethtool/tunnels.c
@@ -212,8 +212,7 @@ int ethnl_tunnel_info_doit(struct sk_buff *skb, struct genl_info *info)
struct ethnl_tunnel_info_dump_ctx {
struct ethnl_req_info req_info;
- int pos_hash;
- int pos_idx;
+ unsigned long ifindex;
};
int ethnl_tunnel_info_start(struct netlink_callback *cb)
@@ -243,56 +242,38 @@ int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
{
struct ethnl_tunnel_info_dump_ctx *ctx = (void *)cb->ctx;
struct net *net = sock_net(skb->sk);
- int s_idx = ctx->pos_idx;
- int h, idx = 0;
+ struct net_device *dev;
int ret = 0;
void *ehdr;
rtnl_lock();
- cb->seq = net->dev_base_seq;
- for (h = ctx->pos_hash; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
- struct hlist_head *head;
- struct net_device *dev;
-
- head = &net->dev_index_head[h];
- idx = 0;
- hlist_for_each_entry(dev, head, index_hlist) {
- if (idx < s_idx)
- goto cont;
-
- ehdr = ethnl_dump_put(skb, cb,
- ETHTOOL_MSG_TUNNEL_INFO_GET_REPLY);
- if (!ehdr) {
- ret = -EMSGSIZE;
- goto out;
- }
-
- ret = ethnl_fill_reply_header(skb, dev, ETHTOOL_A_TUNNEL_INFO_HEADER);
- if (ret < 0) {
- genlmsg_cancel(skb, ehdr);
- goto out;
- }
-
- ctx->req_info.dev = dev;
- ret = ethnl_tunnel_info_fill_reply(&ctx->req_info, skb);
- ctx->req_info.dev = NULL;
- if (ret < 0) {
- genlmsg_cancel(skb, ehdr);
- if (ret == -EOPNOTSUPP)
- goto cont;
- goto out;
- }
- genlmsg_end(skb, ehdr);
-cont:
- idx++;
+ for_each_netdev_dump(net, dev, ctx->ifindex) {
+ ehdr = ethnl_dump_put(skb, cb,
+ ETHTOOL_MSG_TUNNEL_INFO_GET_REPLY);
+ if (!ehdr) {
+ ret = -EMSGSIZE;
+ break;
}
- }
-out:
- rtnl_unlock();
- ctx->pos_hash = h;
- ctx->pos_idx = idx;
- nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+ ret = ethnl_fill_reply_header(skb, dev,
+ ETHTOOL_A_TUNNEL_INFO_HEADER);
+ if (ret < 0) {
+ genlmsg_cancel(skb, ehdr);
+ break;
+ }
+
+ ctx->req_info.dev = dev;
+ ret = ethnl_tunnel_info_fill_reply(&ctx->req_info, skb);
+ ctx->req_info.dev = NULL;
+ if (ret < 0) {
+ genlmsg_cancel(skb, ehdr);
+ if (ret == -EOPNOTSUPP)
+ continue;
+ break;
+ }
+ genlmsg_end(skb, ehdr);
+ }
+ rtnl_unlock();
if (ret == -EMSGSIZE && skb->len)
return skb->len;
--
2.41.0
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-22 1:42 ` [PATCH net-next 1/2] " Jakub Kicinski
@ 2023-07-22 1:47 ` Jakub Kicinski
2023-07-24 8:18 ` Paolo Abeni
2023-07-24 19:09 ` Leon Romanovsky
2 siblings, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2023-07-22 1:47 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, pabeni, mkubecek, lorenzo
On Fri, 21 Jul 2023 18:42:36 -0700 Jakub Kicinski wrote:
> #devs | hash | xa | delta
> 2 | 18.3 | 20.1 | + 9.8%
> 16 | 18.3 | 20.1 | + 9.5%
> 64 | 18.3 | 26.3 | +43.8%
> 128 | 20.4 | 26.3 | +28.6%
> 256 | 20.0 | 26.4 | +32.1%
> 1024 | 26.6 | 26.7 | + 0.2%
> 8192 |541.3 | 33.5 | -93.8%
^^^^^^ ^^^^^^
I failed to clarify that these columns are msecs of runtime.
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-22 1:42 ` [PATCH net-next 1/2] " Jakub Kicinski
2023-07-22 1:47 ` Jakub Kicinski
@ 2023-07-24 8:18 ` Paolo Abeni
2023-07-24 15:41 ` Jakub Kicinski
2023-07-24 19:09 ` Leon Romanovsky
2 siblings, 1 reply; 15+ messages in thread
From: Paolo Abeni @ 2023-07-24 8:18 UTC (permalink / raw)
To: Jakub Kicinski, davem; +Cc: netdev, edumazet, mkubecek, lorenzo
On Fri, 2023-07-21 at 18:42 -0700, Jakub Kicinski wrote:
> Iterating over the netdev hash table for netlink dumps is hard.
> Dumps are done in "chunks" so we need to save the position
> after each chunk, so we know where to restart from. Because
> netdevs are stored in a hash table we remember which bucket
> we were in and how many devices we dumped.
>
> Since we don't hold any locks across the "chunks" - devices may
> come and go while we're dumping. If that happens we may miss
> a device (if device is deleted from the bucket we were in).
> We indicate to user space that this may have happened by setting
> NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
> if it sees that. Somehow I doubt most user space gets this right..
>
> To illustrate let's look at an example:
>
> System state:
> start: # [A, B, C]
> del: B # [A, C]
>
> with the hash table we may dump [A, B], missing C completely even
> tho it existed both before and after the "del B".
>
> Add an xarray and use it to allocate ifindexes. This way we
> can iterate ifindexes in order, without the worry that we'll
> skip one. We may still generate a dump of a state which "never
> existed", for example for a set of values and sequence of ops:
>
> System state:
> start: # [A, B]
> add: C # [A, C, B]
> del: B # [A, C]
>
> we may generate a dump of [A], if C got an index between A and B.
> System has never been in such state. But I'm 90% sure that's perfectly
> fine, important part is that we can't _miss_ devices which exist before
> and after. User space which wants to mirror kernel's state subscribes
> to notifications and does periodic dumps so it will know that C exists
> from the notification about its creation or from the next dump
> (next dump is _guaranteed_ to include C, if it doesn't get removed).
>
> To avoid any perf regressions keep the hash table for now. Most
> net namespaces have very few devices and microbenchmarking 1M lookups
> on Skylake I get the following results (not counting loopback
> to number of devs):
A possibly dumb question: why use an xarray over a plain list? It
looks like the idea is to additionally use the xarray for device lookup,
beyond just dumping?
WRT the above, have you considered instead replacing dev_name_head with
an rhashtable (and adding the mentioned list)?
Cheers,
Paolo
* Re: [PATCH net-next 0/2] net: store netdevs in an xarray
2023-07-22 1:42 [PATCH net-next 0/2] net: store netdevs in an xarray Jakub Kicinski
2023-07-22 1:42 ` [PATCH net-next 1/2] " Jakub Kicinski
2023-07-22 1:42 ` [PATCH net-next 2/2] net: convert some netlink netdev iterators to depend on the xarray Jakub Kicinski
@ 2023-07-24 15:28 ` Simon Horman
2 siblings, 0 replies; 15+ messages in thread
From: Simon Horman @ 2023-07-24 15:28 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, mkubecek, lorenzo
On Fri, Jul 21, 2023 at 06:42:35PM -0700, Jakub Kicinski wrote:
> One of more annoying developer experience gaps we have in netlink
> is iterating over netdevs. It's painful. Add an xarray to make
> it trivial.
>
> Jakub Kicinski (2):
> net: store netdevs in an xarray
> net: convert some netlink netdev iterators to depend on the xarray
Reviewed-by: Simon Horman <simon.horman@corigine.com>
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-24 8:18 ` Paolo Abeni
@ 2023-07-24 15:41 ` Jakub Kicinski
2023-07-24 16:23 ` Paolo Abeni
0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2023-07-24 15:41 UTC (permalink / raw)
To: Paolo Abeni
Cc: davem, netdev, edumazet, mkubecek, lorenzo, Herbert Xu,
Neil Brown
On Mon, 24 Jul 2023 10:18:04 +0200 Paolo Abeni wrote:
> A possibly dumb question: why using an xarray over a plain list?
We need to drop the lock during the walk. So for a list we'd need
to either:
 - add an explicit iteration "cursor", or
 - suffer O(n) for insertion + O(n^2) for restart?
Or is there an easier way to do it?
The cursor is not the worst option, I guess, a bit less intuitive
and harder to clean up on error, but doable?
> It looks like the idea is to additionally use xarray for device lookup
> beyond for dumping?
I was measuring it to find out if we can delete the hash table without
anyone noticing, but it's not really the motivation.
> WRT the above, have you considered instead replacing dev_name_head with
> an rhashtable? (and add the mentioned list)
The main motivation is ease of iteration for netlink dumps.
I'll admit that my understanding of rhashtable is superficial but
I don't think it's possible to do consistent dumps with it. Even
if it supported "cursors" (which I'm not sure it does) a rhash could
rearrange buckets and mess the order up. The comment on
rhltable_walk_enter() says:
* For a completely stable walk you should construct your own data
* structure outside the hash table.
Given that the iteration process is controlled by user space, trying
to constrain an rhashtable that's also used by the fast path to get
consistent dumps seems risky.
Adding Herbert and Neil to keep me honest.
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-24 15:41 ` Jakub Kicinski
@ 2023-07-24 16:23 ` Paolo Abeni
2023-07-24 17:27 ` Jakub Kicinski
0 siblings, 1 reply; 15+ messages in thread
From: Paolo Abeni @ 2023-07-24 16:23 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, mkubecek, lorenzo, Herbert Xu,
Neil Brown
On Mon, 2023-07-24 at 08:41 -0700, Jakub Kicinski wrote:
> On Mon, 24 Jul 2023 10:18:04 +0200 Paolo Abeni wrote:
> > A possibly dumb question: why using an xarray over a plain list?
>
> We need to drop the lock during the walk.
I should have looked more closely at patch 2/2.
> So for a list we'd need
> to either
> - add explicit iteration "cursor" or
Would a cursor require acquiring a netdev reference? If so it looks
problematic (an evil/buggy userspace could keep such a reference held
for an unbounded amount of time).
I agree xarray looks a better solution.
I still have some minor doubts WRT the 'missed device' scenario you
described in the commit message. What if user space is doing
'create the new one before deleting the old one', with the assumption
that at least one of old/new is always reported in dumps? Is that
too bold an assumption?
> I was measuring it to find out if we can delete the hash table without
> anyone noticing, but it's not really the motivation.
Understood. I thought about the rhashtable with the opposite assumption :)
So no need to discuss that option further, I guess.
Cheers,
Paolo
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-24 16:23 ` Paolo Abeni
@ 2023-07-24 17:27 ` Jakub Kicinski
2023-07-24 19:07 ` Jakub Kicinski
0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2023-07-24 17:27 UTC (permalink / raw)
To: Paolo Abeni
Cc: davem, netdev, edumazet, mkubecek, lorenzo, Herbert Xu,
Neil Brown
On Mon, 24 Jul 2023 18:23:03 +0200 Paolo Abeni wrote:
> On Mon, 2023-07-24 at 08:41 -0700, Jakub Kicinski wrote:
> > On Mon, 24 Jul 2023 10:18:04 +0200 Paolo Abeni wrote:
> > > A possibly dumb question: why using an xarray over a plain list?
> >
> > We need to drop the lock during the walk.
>
> I should have looked more closely to patch 2/2.
>
> > So for a list we'd need
> > to either
> > - add explicit iteration "cursor" or
>
> Would a cursor require acquiring a netdev reference? If so it looks
> problematic (an evil/buggy userspace could keep such reference held for
> an unbounded amount of time).
I was thinking we can add a cursor which looks like:
	struct netdev_list_node {
		struct list_head list;
		bool is_a_device;
	};
then iteration can put this struct into its dump state and insert
*itself* into the device list. All iterations will then have to:
	struct netdev_list_node *lnode;

	list_for_each_entry... lnode) {
		/* skip cursors inserted into the list */
		if (!lnode->is_a_device)
			continue;
		netdev = container_of(lnode, ...);
		...
	}
But then we need to add some code to the .start and .stop callbacks
and make sure the "cursor" is removed if someone closes the socket
half way through the dump.
> I agree xarray looks a better solution.
I do think xarray is the simplest solution.
> I still have some minor doubts WRT the 'missed device' scenario you
> described in the commit message. What if the user-space is doing
> 'create the new one before deleting the old one' with the assumption
> that at least one of old/new is always reported in dumps? Is that a too
> bold assumption?
The problem is kinda theoretical in the first place because it assumes
ifindexes got wrapped so that the new netdev comes "before" the old in
the xarray. Which would require adding and removing 2B netdevs, assuming
one add+remove takes 10 usec (impossibly fast), wrapping ifindex would
take 68 years.
And if that's not enough we can make the iteration index ulong
(i.e. something separate from ifindex as ifindex is hardwired to 31b
by uAPI).
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-24 17:27 ` Jakub Kicinski
@ 2023-07-24 19:07 ` Jakub Kicinski
2023-07-25 11:11 ` Paolo Abeni
2023-07-25 17:54 ` Sabrina Dubroca
0 siblings, 2 replies; 15+ messages in thread
From: Jakub Kicinski @ 2023-07-24 19:07 UTC (permalink / raw)
To: Paolo Abeni; +Cc: davem, netdev, edumazet, mkubecek, lorenzo
On Mon, 24 Jul 2023 10:27:41 -0700 Jakub Kicinski wrote:
> > I still have some minor doubts WRT the 'missed device' scenario you
> > described in the commit message. What if the user-space is doing
> > 'create the new one before deleting the old one' with the assumption
> > that at least one of old/new is always reported in dumps? Is that a too
> > bold assumption?
>
> The problem is kinda theoretical in the first place because it assumes
> ifindexes got wrapped so that the new netdev comes "before" the old in
> the xarray. Which would require adding and removing 2B netdevs, assuming
> one add+remove takes 10 usec (impossibly fast), wrapping ifindex would
> take 68 years.
I guess user space can shoot itself in the foot by selecting
a lower index for the new device explicitly.
> And if that's not enough we can make the iteration index ulong
> (i.e. something separate from ifindex as ifindex is hardwired to 31b
> by uAPI).
We can get the create, delete ordering with this or the list, but the
inverse theoretical case of delete, create ordering can't be covered.
A case where the user wants to make sure at most one device is visible.
I'm not sure how much we should care about this. The basic hash table
had the very real problem of hiding devices which were there *before
and after* the dump.
Inconsistent info on devices which were created / deleted *during* the
dump seems to me like something that's best handled with notifications.
I'm not sure whether we should set the inconsistency mark on the dump
when a del/add operation happened in the meantime either, as
the probability that user space will care is minuscule.
LMK what you think.
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-22 1:42 ` [PATCH net-next 1/2] " Jakub Kicinski
2023-07-22 1:47 ` Jakub Kicinski
2023-07-24 8:18 ` Paolo Abeni
@ 2023-07-24 19:09 ` Leon Romanovsky
2 siblings, 0 replies; 15+ messages in thread
From: Leon Romanovsky @ 2023-07-24 19:09 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, mkubecek, lorenzo
On Fri, Jul 21, 2023 at 06:42:36PM -0700, Jakub Kicinski wrote:
> Iterating over the netdev hash table for netlink dumps is hard.
> Dumps are done in "chunks" so we need to save the position
> after each chunk, so we know where to restart from. Because
> netdevs are stored in a hash table we remember which bucket
> we were in and how many devices we dumped.
>
> Since we don't hold any locks across the "chunks" - devices may
> come and go while we're dumping. If that happens we may miss
> a device (if device is deleted from the bucket we were in).
> We indicate to user space that this may have happened by setting
> NLM_F_DUMP_INTR. User space is supposed to dump again (I think)
> if it sees that. Somehow I doubt most user space gets this right..
>
> To illustrate let's look at an example:
>
> System state:
> start: # [A, B, C]
> del: B # [A, C]
>
> with the hash table we may dump [A, B], missing C completely even
> tho it existed both before and after the "del B".
>
> Add an xarray and use it to allocate ifindexes. This way we
> can iterate ifindexes in order, without the worry that we'll
> skip one. We may still generate a dump of a state which "never
> existed", for example for a set of values and sequence of ops:
>
> System state:
> start: # [A, B]
> add: C # [A, C, B]
> del: B # [A, C]
>
> we may generate a dump of [A], if C got an index between A and B.
> System has never been in such state. But I'm 90% sure that's perfectly
> fine, important part is that we can't _miss_ devices which exist before
> and after. User space which wants to mirror kernel's state subscribes
> to notifications and does periodic dumps so it will know that C exists
> from the notification about its creation or from the next dump
> (next dump is _guaranteed_ to include C, if it doesn't get removed).
>
> To avoid any perf regressions keep the hash table for now. Most
> net namespaces have very few devices and microbenchmarking 1M lookups
> on Skylake I get the following results (not counting loopback
> to number of devs):
>
> #devs | hash | xa | delta
> 2 | 18.3 | 20.1 | + 9.8%
> 16 | 18.3 | 20.1 | + 9.5%
> 64 | 18.3 | 26.3 | +43.8%
> 128 | 20.4 | 26.3 | +28.6%
> 256 | 20.0 | 26.4 | +32.1%
> 1024 | 26.6 | 26.7 | + 0.2%
> 8192 |541.3 | 33.5 | -93.8%
>
> No surprises since the hash table has 256 entries.
> The microbenchmark scans indexes in order, if the pattern is more
> random xa starts to win at 512 devices already. But that's a lot
> of devices, in practice.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> include/net/net_namespace.h | 4 +-
> net/core/dev.c | 82 ++++++++++++++++++++++++-------------
> 2 files changed, 57 insertions(+), 29 deletions(-)
<...>
> + if (!ifindex)
> + err = xa_alloc_cyclic(&net->dev_by_index, &ifindex, NULL,
> + xa_limit_31b, &net->ifindex, GFP_KERNEL);
> + else
> + err = xa_insert(&net->dev_by_index, ifindex, NULL, GFP_KERNEL);
> + if (err)
> + return err;
Please note that xa_alloc_cyclic() returns 1 if the allocation
succeeded after wrapping, so the more accurate check is "if (err < 0) ..."
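I.e. roughly (a sketch of the adjusted dev_index_reserve() check):

	if (!ifindex)
		err = xa_alloc_cyclic(&net->dev_by_index, &ifindex, NULL,
				      xa_limit_31b, &net->ifindex, GFP_KERNEL);
	else
		err = xa_insert(&net->dev_by_index, ifindex, NULL, GFP_KERNEL);
	if (err < 0)	/* xa_alloc_cyclic() returns 1 after a wrap-around */
		return err;

	return ifindex;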
Thanks
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-24 19:07 ` Jakub Kicinski
@ 2023-07-25 11:11 ` Paolo Abeni
2023-07-25 16:56 ` Jakub Kicinski
2023-07-25 17:54 ` Sabrina Dubroca
1 sibling, 1 reply; 15+ messages in thread
From: Paolo Abeni @ 2023-07-25 11:11 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: davem, netdev, edumazet, mkubecek, lorenzo
On Mon, 2023-07-24 at 12:07 -0700, Jakub Kicinski wrote:
> On Mon, 24 Jul 2023 10:27:41 -0700 Jakub Kicinski wrote:
> > > I still have some minor doubts WRT the 'missed device' scenario you
> > > described in the commit message. What if the user-space is doing
> > > 'create the new one before deleting the old one' with the assumption
> > > that at least one of old/new is always reported in dumps? Is that a too
> > > bold assumption?
> >
> > The problem is kinda theoretical in the first place because it assumes
> > ifindexes got wrapped so that the new netdev comes "before" the old in
> > the xarray. Which would require adding and removing 2B netdevs, assuming
> > one add+remove takes 10 usec (impossibly fast), wrapping ifindex would
> > take 68 years.
>
> I guess the user space can shoot itself in the foot by selecting
> the lower index for the new device explicitly.
>
> > And if that's not enough we can make the iteration index ulong
> > (i.e. something separate from ifindex as ifindex is hardwired to 31b
> > by uAPI).
>
> We can get the create, delete ordering with this or the list, but the
> inverse theoretical case of delete, create ordering can't be covered.
> A case where user wants to make sure at most one device is visible.
>
> I'm not sure how much we should care about this. The basic hash table
> had the very real problem of hiding devices which were there *before
> and after* the dump.
>
> Inconsistent info on devices which were created / deleted *during* the
> dump seems to me like something that's best handled with notifications.
>
> I'm not sure whether we should set the inconsistency mark on the dump
> when del/add operation happened in the meantime either, as
> the probability that the user space will care is minuscule.
You convinced me the 'missed device' scenario is not very relevant.
The cursor with the dummy placeholder looks error-prone and/or too
invasive to me.
I'm fine with this approach.
Thanks!
Paolo
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-25 11:11 ` Paolo Abeni
@ 2023-07-25 16:56 ` Jakub Kicinski
0 siblings, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2023-07-25 16:56 UTC (permalink / raw)
To: Paolo Abeni; +Cc: davem, netdev, edumazet, mkubecek, lorenzo
On Tue, 25 Jul 2023 13:11:50 +0200 Paolo Abeni wrote:
> > We can get the create, delete ordering with this or the list, but the
> > inverse theoretical case of delete, create ordering can't be covered.
> > A case where user wants to make sure at most one device is visible.
> >
> > I'm not sure how much we should care about this. The basic hash table
> > had the very real problem of hiding devices which were there *before
> > and after* the dump.
> >
> > Inconsistent info on devices which were created / deleted *during* the
> > dump seems to me like something that's best handled with notifications.
> >
> > I'm not sure whether we should set the inconsistency mark on the dump
> > when del/add operation happened in the meantime either, as
> > the probability that the user space will care is minuscule.
>
> You convinced me the 'missed device' scenario is not very relevant.
>
> The cursor with the dummy placeholder looks error-prone and/or too
> invasive to me.
>
> I'm fine with this approach.
I'll post a v2 with the fix Leon pointed out.
But do feel free to change your mind.
It seems like a problem where there's no perfect solution but
there are multiple non-perfect ones, each with its advantages
and disadvantages.
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-24 19:07 ` Jakub Kicinski
2023-07-25 11:11 ` Paolo Abeni
@ 2023-07-25 17:54 ` Sabrina Dubroca
2023-07-25 19:45 ` Jakub Kicinski
1 sibling, 1 reply; 15+ messages in thread
From: Sabrina Dubroca @ 2023-07-25 17:54 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: Paolo Abeni, davem, netdev, edumazet, mkubecek, lorenzo
2023-07-24, 12:07:18 -0700, Jakub Kicinski wrote:
> On Mon, 24 Jul 2023 10:27:41 -0700 Jakub Kicinski wrote:
> > > I still have some minor doubts WRT the 'missed device' scenario you
> > > described in the commit message. What if the user-space is doing
> > > 'create the new one before deleting the old one' with the assumption
> > > that at least one of old/new is always reported in dumps? Is that a too
> > > bold assumption?
> >
> > The problem is kinda theoretical in the first place because it assumes
> > ifindexes got wrapped so that the new netdev comes "before" the old in
> > the xarray. Which would require adding and removing 2B netdevs, assuming
> > one add+remove takes 10 usec (impossibly fast), wrapping ifindex would
> > take 68 years.
>
> I guess the user space can shoot itself in the foot by selecting
> the lower index for the new device explicitly.
Or just moving devices between netns?
Scenario 1:
- create lots of devices in netns A, last ifindex is 100 (call it dummy100)
- netns B isn't creating many devices, last ifindex is 8
- move dummy100 from netns A to netns B, it keeps ifindex=100
- create a device in netns B, it gets ifindex=9 even though we already
have a device with ifindex=100
Scenario 2:
- create lots of devices in netns A, last ifindex is 100
- delete devices with ifindex 10..80 from A
- move a device with ifindex=20 into netns A, it keeps ifindex=20
- we have a new device in netns A with a smaller ifindex than the max
I think with this patch we could still see states that never existed,
if both a low ifindex and then a high ifindex are created during the
dump. We could see the high ifindex without the low one being listed,
if the dump was already past the low id. Or if we delete one at a low
ifindex then create one at a high ifindex, we could see both devices
listed in the dump even though they never existed at the same time.
> > And if that's not enough we can make the iteration index ulong
> > (i.e. something separate from ifindex as ifindex is hardwired to 31b
> > by uAPI).
>
> We can get the create, delete ordering with this or the list, but the
> inverse theoretical case of delete, create ordering can't be covered.
> A case where user wants to make sure at most one device is visible.
>
> I'm not sure how much we should care about this. The basic hash table
> had the very real problem of hiding devices which were there *before
> and after* the dump.
>
> Inconsistent info on devices which were created / deleted *during* the
> dump seems to me like something that's best handled with notifications.
>
> I'm not sure whether we should set the inconsistency mark on the dump
> when del/add operation happened in the meantime either, as
> the probability that the user space will care is minuscule.
The inconsistent dump mark may be more relevant for changes in device
properties than link creation/removal. If the MTU on 2 devices changes
while the dump is running (one low ifindex, one high ifindex), we'll
see the old MTU for the first device and the new MTU for the 2nd. Or
by adding/removing bridge ports while the dump runs, I can make it
look like bridge0 has multiple ports with the same port_no.
I don't know how likely those cases are, but if they happen I think
they'd be more confusing than a missing/extra device.
--
Sabrina
* Re: [PATCH net-next 1/2] net: store netdevs in an xarray
2023-07-25 17:54 ` Sabrina Dubroca
@ 2023-07-25 19:45 ` Jakub Kicinski
0 siblings, 0 replies; 15+ messages in thread
From: Jakub Kicinski @ 2023-07-25 19:45 UTC (permalink / raw)
To: Sabrina Dubroca; +Cc: Paolo Abeni, davem, netdev, edumazet, mkubecek, lorenzo
On Tue, 25 Jul 2023 19:54:43 +0200 Sabrina Dubroca wrote:
> > > And if that's not enough we can make the iteration index ulong
> > > (i.e. something separate from ifindex as ifindex is hardwired to 31b
> > > by uAPI).
> >
> > We can get the create, delete ordering with this or the list, but the
> > inverse theoretical case of delete, create ordering can't be covered.
> > A case where user wants to make sure at most one device is visible.
> >
> > I'm not sure how much we should care about this. The basic hash table
> > had the very real problem of hiding devices which were there *before
> > and after* the dump.
> >
> > Inconsistent info on devices which were created / deleted *during* the
> > dump seems to me like something that's best handled with notifications.
> >
> > I'm not sure whether we should set the inconsistency mark on the dump
> > when del/add operation happened in the meantime either, as
> > the probability that the user space will care is minuscule.
>
> The inconsistent dump mark may be more relevant for changes in device
> properties than link creation/removal. If the MTU on 2 devices changes
> while the dump is running (one low ifindex, one high ifindex), we'll
> see the old MTU for the first device and the new MTU for the 2nd. Or
> by adding/removing bridge ports while the dump runs, I can make it
> look like bridge0 has mulitple ports with the same port_no.
>
> I don't know how likely those cases are, but if they happen I think
> they'd be more confusing than a missing/extra device.
I believe that for netdevs dev_base_seq_inc() is used to indicate
a change. It's only called when listing / unlisting devices, so
changes to device config are already not covered :(
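For reference, the existing pattern in the dump handlers is just:

	/* at the start of a dump chunk */
	cb->seq = net->dev_base_seq;
	...
	/* after filling the skb; sets NLM_F_DUMP_INTR if the sequence
	 * changed since the previous chunk, and it only changes on
	 * list/unlist
	 */
	nl_dump_check_consistent(cb, nlmsg_hdr(skb));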