* [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables
@ 2016-11-16 14:08 Jiri Pirko
2016-11-16 14:08 ` [patch net-next 1/8] ipv4: fib: Export free_fib_info() Jiri Pirko
` (8 more replies)
0 siblings, 9 replies; 27+ messages in thread
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Jiri Pirko <jiri@mellanox.com>
Ido says:
In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.
The fundamental problem with this approach is that upon registration
listeners lose all the information previously sent in the chain and
thus have an incomplete view of the FIB tables, which can result in
packet loss. This patchset fixes that by introducing a new API to dump
the FIB tables.
The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first five patches.
The sixth patch adds the dump callback itself and the next patches call
it from current listeners of the FIB notification chain.
Ido Schimmel (8):
ipv4: fib: Export free_fib_info()
mlxsw: spectrum_router: Implement FIB offload in delayed work
rocker: Implement FIB offload in delayed work
ipv4: fib: Convert FIB notification chain to be atomic
net: ipv4: Send notifications only after removing FIB alias
ipv4: fib: Add an API to request a FIB dump
mlxsw: spectrum_router: Request a dump of FIB tables during init
rocker: Request a dump of FIB tables during init
drivers/net/ethernet/mellanox/mlxsw/core.c | 6 +
drivers/net/ethernet/mellanox/mlxsw/core.h | 1 +
.../net/ethernet/mellanox/mlxsw/spectrum_router.c | 73 ++++++++++--
drivers/net/ethernet/rocker/rocker_main.c | 59 ++++++++--
drivers/net/ethernet/rocker/rocker_ofdpa.c | 1 +
include/net/ip_fib.h | 1 +
net/ipv4/fib_semantics.c | 1 +
net/ipv4/fib_trie.c | 128 +++++++++++++++++++--
8 files changed, 242 insertions(+), 28 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 27+ messages in thread
* [patch net-next 1/8] ipv4: fib: Export free_fib_info()
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
@ 2016-11-16 14:08 ` Jiri Pirko
2016-11-16 14:08 ` [patch net-next 2/8] mlxsw: spectrum_router: Implement FIB offload in delayed work Jiri Pirko
` (7 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Ido Schimmel <idosch@mellanox.com>
The FIB notification chain is going to be converted to an atomic chain,
which means switchdev drivers will have to offload FIB entries in
delayed work, as hardware operations entail sleeping.
However, while the work is queued fib info might be freed, so a
reference must be taken. To release the reference (and potentially free
the fib info) fib_info_put() will be called, which in turn calls
free_fib_info().
Export free_fib_info() so that modules will be able to invoke
fib_info_put().
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
net/ipv4/fib_semantics.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 388d3e2..c1bc1e9 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -234,6 +234,7 @@ void free_fib_info(struct fib_info *fi)
#endif
call_rcu(&fi->rcu, free_fib_info_rcu);
}
+EXPORT_SYMBOL_GPL(free_fib_info);
void fib_release_info(struct fib_info *fi)
{
--
2.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [patch net-next 2/8] mlxsw: spectrum_router: Implement FIB offload in delayed work
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
2016-11-16 14:08 ` [patch net-next 1/8] ipv4: fib: Export free_fib_info() Jiri Pirko
@ 2016-11-16 14:08 ` Jiri Pirko
2016-11-16 14:08 ` [patch net-next 3/8] rocker: " Jiri Pirko
` (6 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Ido Schimmel <idosch@mellanox.com>
FIB offload is currently done in process context with RTNL held, but
we're about to dump the FIB tables in RCU critical section, so we can no
longer sleep.
Instead, defer the operation to process context using delayed work. Make
sure fib info isn't freed while the work is queued by taking a reference
on it and releasing it after the operation is done.
Deferring the operation is valid because the upper layers always assume
the operation was successful. If it's not, then the driver-specific
abort mechanism is called and all routed traffic is directed to slow
path.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
drivers/net/ethernet/mellanox/mlxsw/core.c | 6 ++
drivers/net/ethernet/mellanox/mlxsw/core.h | 1 +
.../net/ethernet/mellanox/mlxsw/spectrum_router.c | 72 +++++++++++++++++++---
3 files changed, 69 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 6004817..7874e30 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1839,6 +1839,12 @@ int mlxsw_core_schedule_dw(struct delayed_work *dwork, unsigned long delay)
}
EXPORT_SYMBOL(mlxsw_core_schedule_dw);
+void mlxsw_core_flush_wq(void)
+{
+ flush_workqueue(mlxsw_wq);
+}
+EXPORT_SYMBOL(mlxsw_core_flush_wq);
+
static int __init mlxsw_core_module_init(void)
{
int err;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h
index c0acc1b..e382ed0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -156,6 +156,7 @@ enum devlink_port_type mlxsw_core_port_type_get(struct mlxsw_core *mlxsw_core,
u8 local_port);
int mlxsw_core_schedule_dw(struct delayed_work *dwork, unsigned long delay);
+void mlxsw_core_flush_wq(void);
#define MLXSW_CONFIG_PROFILE_SWID_COUNT 8
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 683f045..a8011a5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -593,6 +593,14 @@ static void mlxsw_sp_router_fib_flush(struct mlxsw_sp *mlxsw_sp);
static void mlxsw_sp_vrs_fini(struct mlxsw_sp *mlxsw_sp)
{
+ /* At this stage we're guaranteed not to have new incoming
+ * FIB notifications and the work queue is free from FIBs
+ * sitting on top of mlxsw netdevs. However, we can still
+ * have other FIBs queued. Flush the queue before flushing
+ * the device's tables. No need for locks, as we're the only
+ * writer.
+ */
+ mlxsw_core_flush_wq();
mlxsw_sp_router_fib_flush(mlxsw_sp);
kfree(mlxsw_sp->router.vrs);
}
@@ -1948,30 +1956,74 @@ static void __mlxsw_sp_router_fini(struct mlxsw_sp *mlxsw_sp)
kfree(mlxsw_sp->rifs);
}
-static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
- unsigned long event, void *ptr)
+struct mlxsw_sp_fib_event_work {
+ struct fib_entry_notifier_info fen_info;
+ struct mlxsw_sp *mlxsw_sp;
+ struct delayed_work dw;
+ unsigned long event;
+};
+
+static void mlxsw_sp_router_fib_event_work(struct work_struct *work)
{
- struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
- struct fib_entry_notifier_info *fen_info = ptr;
+ struct mlxsw_sp_fib_event_work *fib_work =
+ container_of(work, struct mlxsw_sp_fib_event_work, dw.work);
+ struct mlxsw_sp *mlxsw_sp = fib_work->mlxsw_sp;
int err;
- if (!net_eq(fen_info->info.net, &init_net))
- return NOTIFY_DONE;
-
- switch (event) {
+ /* Protect internal structures from changes */
+ rtnl_lock();
+ switch (fib_work->event) {
case FIB_EVENT_ENTRY_ADD:
- err = mlxsw_sp_router_fib4_add(mlxsw_sp, fen_info);
+ err = mlxsw_sp_router_fib4_add(mlxsw_sp, &fib_work->fen_info);
if (err)
mlxsw_sp_router_fib4_abort(mlxsw_sp);
+ fib_info_put(fib_work->fen_info.fi);
break;
case FIB_EVENT_ENTRY_DEL:
- mlxsw_sp_router_fib4_del(mlxsw_sp, fen_info);
+ mlxsw_sp_router_fib4_del(mlxsw_sp, &fib_work->fen_info);
+ fib_info_put(fib_work->fen_info.fi);
break;
case FIB_EVENT_RULE_ADD: /* fall through */
case FIB_EVENT_RULE_DEL:
mlxsw_sp_router_fib4_abort(mlxsw_sp);
break;
}
+ rtnl_unlock();
+ kfree(fib_work);
+}
+
+/* Called with rcu_read_lock() */
+static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
+ unsigned long event, void *ptr)
+{
+ struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
+ struct mlxsw_sp_fib_event_work *fib_work;
+ struct fib_notifier_info *info = ptr;
+
+ if (!net_eq(info->net, &init_net))
+ return NOTIFY_DONE;
+
+ fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
+ if (WARN_ON(!fib_work))
+ return NOTIFY_BAD;
+
+ INIT_DELAYED_WORK(&fib_work->dw, mlxsw_sp_router_fib_event_work);
+ fib_work->mlxsw_sp = mlxsw_sp;
+ fib_work->event = event;
+
+ switch (event) {
+ case FIB_EVENT_ENTRY_ADD: /* fall through */
+ case FIB_EVENT_ENTRY_DEL:
+ memcpy(&fib_work->fen_info, ptr, sizeof(fib_work->fen_info));
+ /* Take referece on fib_info to prevent it from being
+ * freed while work is queued. Release it afterwards.
+ */
+ atomic_inc(&fib_work->fen_info.fi->fib_clntref);
+ break;
+ }
+
+ mlxsw_core_schedule_dw(&fib_work->dw, 0);
+
return NOTIFY_DONE;
}
--
2.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [patch net-next 3/8] rocker: Implement FIB offload in delayed work
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
2016-11-16 14:08 ` [patch net-next 1/8] ipv4: fib: Export free_fib_info() Jiri Pirko
2016-11-16 14:08 ` [patch net-next 2/8] mlxsw: spectrum_router: Implement FIB offload in delayed work Jiri Pirko
@ 2016-11-16 14:08 ` Jiri Pirko
2016-11-16 14:08 ` [patch net-next 4/8] ipv4: fib: Convert FIB notification chain to be atomic Jiri Pirko
` (5 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Ido Schimmel <idosch@mellanox.com>
Convert rocker to offload FIBs in delayed work in a similar fashion to
mlxsw, which was converted in the previous patch.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
drivers/net/ethernet/rocker/rocker_main.c | 58 +++++++++++++++++++++++++-----
drivers/net/ethernet/rocker/rocker_ofdpa.c | 1 +
2 files changed, 51 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 67df4cf..b80ff12 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2165,28 +2165,70 @@ static const struct switchdev_ops rocker_port_switchdev_ops = {
.switchdev_port_obj_dump = rocker_port_obj_dump,
};
-static int rocker_router_fib_event(struct notifier_block *nb,
- unsigned long event, void *ptr)
+struct rocker_fib_event_work {
+ struct fib_entry_notifier_info fen_info;
+ struct rocker *rocker;
+ struct delayed_work dw;
+ unsigned long event;
+};
+
+static void rocker_router_fib_event_work(struct work_struct *work)
{
- struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
- struct fib_entry_notifier_info *fen_info = ptr;
+ struct rocker_fib_event_work *fib_work =
+ container_of(work, struct rocker_fib_event_work, dw.work);
+ struct rocker *rocker = fib_work->rocker;
int err;
- switch (event) {
+ /* Protect internal structures from changes */
+ rtnl_lock();
+ switch (fib_work->event) {
case FIB_EVENT_ENTRY_ADD:
- err = rocker_world_fib4_add(rocker, fen_info);
+ err = rocker_world_fib4_add(rocker, &fib_work->fen_info);
if (err)
rocker_world_fib4_abort(rocker);
- else
+ fib_info_put(fib_work->fen_info.fi);
break;
case FIB_EVENT_ENTRY_DEL:
- rocker_world_fib4_del(rocker, fen_info);
+ rocker_world_fib4_del(rocker, &fib_work->fen_info);
+ fib_info_put(fib_work->fen_info.fi);
break;
case FIB_EVENT_RULE_ADD: /* fall through */
case FIB_EVENT_RULE_DEL:
rocker_world_fib4_abort(rocker);
break;
}
+ rtnl_unlock();
+ kfree(fib_work);
+}
+
+/* Called with rcu_read_lock() */
+static int rocker_router_fib_event(struct notifier_block *nb,
+ unsigned long event, void *ptr)
+{
+ struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
+ struct rocker_fib_event_work *fib_work;
+
+ fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
+ if (WARN_ON(!fib_work))
+ return NOTIFY_BAD;
+
+ INIT_DELAYED_WORK(&fib_work->dw, rocker_router_fib_event_work);
+ fib_work->rocker = rocker;
+ fib_work->event = event;
+
+ switch (event) {
+ case FIB_EVENT_ENTRY_ADD: /* fall through */
+ case FIB_EVENT_ENTRY_DEL:
+ memcpy(&fib_work->fen_info, ptr, sizeof(fib_work->fen_info));
+ /* Take referece on fib_info to prevent it from being
+ * freed while work is queued. Release it afterwards.
+ */
+ atomic_inc(&fib_work->fen_info.fi->fib_clntref);
+ break;
+ }
+
+ schedule_work(&fib_work->dw.work);
+
return NOTIFY_DONE;
}
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 4ca4613..5d7c738 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -2516,6 +2516,7 @@ static void ofdpa_fini(struct rocker *rocker)
int bkt;
del_timer_sync(&ofdpa->fdb_cleanup_timer);
+ flush_scheduled_work();
spin_lock_irqsave(&ofdpa->flow_tbl_lock, flags);
hash_for_each_safe(ofdpa->flow_tbl, bkt, tmp, flow_entry, entry)
--
2.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [patch net-next 4/8] ipv4: fib: Convert FIB notification chain to be atomic
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
` (2 preceding siblings ...)
2016-11-16 14:08 ` [patch net-next 3/8] rocker: " Jiri Pirko
@ 2016-11-16 14:08 ` Jiri Pirko
2016-11-16 14:09 ` [patch net-next 5/8] net: ipv4: Send notifications only after removing FIB alias Jiri Pirko
` (4 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Jiri Pirko @ 2016-11-16 14:08 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Ido Schimmel <idosch@mellanox.com>
In order not to hold RTNL for long periods of time we're going to dump
the FIB tables using RCU.
Convert the FIB notification chain to be atomic, as we can't block in
RCU critical sections.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
net/ipv4/fib_trie.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 4cff74d..64a51ec 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -84,17 +84,17 @@
#include <trace/events/fib.h>
#include "fib_lookup.h"
-static BLOCKING_NOTIFIER_HEAD(fib_chain);
+static ATOMIC_NOTIFIER_HEAD(fib_chain);
int register_fib_notifier(struct notifier_block *nb)
{
- return blocking_notifier_chain_register(&fib_chain, nb);
+ return atomic_notifier_chain_register(&fib_chain, nb);
}
EXPORT_SYMBOL(register_fib_notifier);
int unregister_fib_notifier(struct notifier_block *nb)
{
- return blocking_notifier_chain_unregister(&fib_chain, nb);
+ return atomic_notifier_chain_unregister(&fib_chain, nb);
}
EXPORT_SYMBOL(unregister_fib_notifier);
@@ -102,7 +102,7 @@ int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
struct fib_notifier_info *info)
{
info->net = net;
- return blocking_notifier_call_chain(&fib_chain, event_type, info);
+ return atomic_notifier_call_chain(&fib_chain, event_type, info);
}
static int call_fib_entry_notifiers(struct net *net,
--
2.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [patch net-next 5/8] net: ipv4: Send notifications only after removing FIB alias
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
` (3 preceding siblings ...)
2016-11-16 14:08 ` [patch net-next 4/8] ipv4: fib: Convert FIB notification chain to be atomic Jiri Pirko
@ 2016-11-16 14:09 ` Jiri Pirko
2016-11-16 14:09 ` [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump Jiri Pirko
` (3 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Jiri Pirko @ 2016-11-16 14:09 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Ido Schimmel <idosch@mellanox.com>
When removing a FIB alias we should send a notification (both to user
space and in kernel) only after the fact, or otherwise we could end up
in problematic situations.
For example, assume we have two tasks:
a) Task A - Removing a FIB alias (fa1) following RTM_DELROUTE
b) Task B - Replaying FIB tables under RCU for listener (l) following
fib_notifier_dump() (to be introduced in the next patch).
Timeline:
t0 - Task A calls FIB_EVENT_ENTRY_DEL for fa1. Will be ignored by
listener l since it's not aware of fa1.
t1 - Task B reaches fa1 in the trie and replays it to listener l using
FIB_EVENT_ENTRY_ADD.
t2 - Task A removes fa1 from the trie.
The above will result in listener l (f.e., some capable device) having
fa1 in its tables whereas fa1 isn't present in the kernel's table
anymore.
If we always send notifications after the fact, then we can avoid such
races. During insertion, if we traversed the trie when fa1 wasn't
present, then we'll eventually get the notification and process it.
During deletion, if we traversed the trie when fa1 was still present,
then the notification will let us know that it actually needs to be
removed.
The above is consistent with insertion of a FIB alias, where
notifications are sent only after the fact.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
net/ipv4/fib_trie.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 64a51ec..2195601 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1569,17 +1569,17 @@ int fib_table_delete(struct net *net, struct fib_table *tb,
if (!fa_to_delete)
return -ESRCH;
+ if (!plen)
+ tb->tb_num_default--;
+
+ fib_remove_alias(t, tp, l, fa_to_delete);
+
call_fib_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, key, plen,
fa_to_delete->fa_info, tos, cfg->fc_type,
tb->tb_id, 0);
rtmsg_fib(RTM_DELROUTE, htonl(key), fa_to_delete, plen, tb->tb_id,
&cfg->fc_nlinfo, 0);
- if (!plen)
- tb->tb_num_default--;
-
- fib_remove_alias(t, tp, l, fa_to_delete);
-
if (fa_to_delete->fa_state & FA_S_ACCESSED)
rt_cache_flush(cfg->fc_nlinfo.nl_net);
@@ -1810,12 +1810,12 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
continue;
}
+ hlist_del_rcu(&fa->fa_list);
call_fib_entry_notifiers(net, FIB_EVENT_ENTRY_DEL,
n->key,
KEYLENGTH - fa->fa_slen,
fi, fa->fa_tos, fa->fa_type,
tb->tb_id, 0);
- hlist_del_rcu(&fa->fa_list);
fib_release_info(fa->fa_info);
alias_free_mem_rcu(fa);
found++;
--
2.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
` (4 preceding siblings ...)
2016-11-16 14:09 ` [patch net-next 5/8] net: ipv4: Send notifications only after removing FIB alias Jiri Pirko
@ 2016-11-16 14:09 ` Jiri Pirko
2016-11-16 14:51 ` Hannes Frederic Sowa
2016-11-16 14:09 ` [patch net-next 7/8] mlxsw: spectrum_router: Request a dump of FIB tables during init Jiri Pirko
` (2 subsequent siblings)
8 siblings, 1 reply; 27+ messages in thread
From: Jiri Pirko @ 2016-11-16 14:09 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Ido Schimmel <idosch@mellanox.com>
Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (f.e., switchdev
drivers) about addition and deletion of routes.
However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.
Solve that by adding an API to request a FIB dump. The dump itself it
done using RCU in order not to starve consumers that need RTNL to make
progress.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
include/net/ip_fib.h | 1 +
net/ipv4/fib_trie.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 109 insertions(+)
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index b9314b4..095a374 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -221,6 +221,7 @@ enum fib_event_type {
FIB_EVENT_RULE_DEL,
};
+void fib_notifier_dump(struct notifier_block *nb);
int register_fib_notifier(struct notifier_block *nb);
int unregister_fib_notifier(struct notifier_block *nb);
int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2195601..9670f3f 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -86,6 +86,58 @@
static ATOMIC_NOTIFIER_HEAD(fib_chain);
+static int call_fib_notifier(struct notifier_block *nb, struct net *net,
+ enum fib_event_type event_type,
+ struct fib_notifier_info *info)
+{
+ info->net = net;
+ return nb->notifier_call(nb, event_type, info);
+}
+
+static void fib_rules_notify(struct net *net, struct notifier_block *nb,
+ enum fib_event_type event_type)
+{
+#ifdef CONFIG_IP_MULTIPLE_TABLES
+ struct fib_notifier_info info;
+
+ if (net->ipv4.fib_has_custom_rules)
+ call_fib_notifier(nb, net, event_type, &info);
+#endif
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+ enum fib_event_type event_type);
+
+static int call_fib_entry_notifier(struct notifier_block *nb, struct net *net,
+ enum fib_event_type event_type, u32 dst,
+ int dst_len, struct fib_info *fi,
+ u8 tos, u8 type, u32 tb_id, u32 nlflags)
+{
+ struct fib_entry_notifier_info info = {
+ .dst = dst,
+ .dst_len = dst_len,
+ .fi = fi,
+ .tos = tos,
+ .type = type,
+ .tb_id = tb_id,
+ .nlflags = nlflags,
+ };
+ return call_fib_notifier(nb, net, event_type, &info.info);
+}
+
+void fib_notifier_dump(struct notifier_block *nb)
+{
+ struct net *net;
+
+ rcu_read_lock();
+ for_each_net_rcu(net) {
+ fib_rules_notify(net, nb, FIB_EVENT_RULE_ADD);
+ fib_notify(net, nb, FIB_EVENT_ENTRY_ADD);
+ }
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(fib_notifier_dump);
+
int register_fib_notifier(struct notifier_block *nb)
{
return atomic_notifier_chain_register(&fib_chain, nb);
@@ -1834,6 +1886,62 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
return found;
}
+static void fib_leaf_notify(struct net *net, struct key_vector *l,
+ struct fib_table *tb, struct notifier_block *nb,
+ enum fib_event_type event_type)
+{
+ struct fib_alias *fa;
+
+ hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
+ struct fib_info *fi = fa->fa_info;
+
+ if (!fi)
+ continue;
+
+ /* local and main table can share the same trie,
+ * so don't notify twice for the same entry.
+ */
+ if (tb->tb_id != fa->tb_id)
+ continue;
+
+ call_fib_entry_notifier(nb, net, event_type, l->key,
+ KEYLENGTH - fa->fa_slen, fi, fa->fa_tos,
+ fa->fa_type, fa->tb_id, 0);
+ }
+}
+
+static void fib_table_notify(struct net *net, struct fib_table *tb,
+ struct notifier_block *nb,
+ enum fib_event_type event_type)
+{
+ struct trie *t = (struct trie *)tb->tb_data;
+ struct key_vector *l, *tp = t->kv;
+ t_key key = 0;
+
+ while ((l = leaf_walk_rcu(&tp, key)) != NULL) {
+ fib_leaf_notify(net, l, tb, nb, event_type);
+
+ key = l->key + 1;
+ /* stop in case of wrap around */
+ if (key < l->key)
+ break;
+ }
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+ enum fib_event_type event_type)
+{
+ unsigned int h;
+
+ for (h = 0; h < FIB_TABLE_HASHSZ; h++) {
+ struct hlist_head *head = &net->ipv4.fib_table_hash[h];
+ struct fib_table *tb;
+
+ hlist_for_each_entry_rcu(tb, head, tb_hlist)
+ fib_table_notify(net, tb, nb, event_type);
+ }
+}
+
static void __trie_free_rcu(struct rcu_head *head)
{
struct fib_table *tb = container_of(head, struct fib_table, rcu);
--
2.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [patch net-next 7/8] mlxsw: spectrum_router: Request a dump of FIB tables during init
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
` (5 preceding siblings ...)
2016-11-16 14:09 ` [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump Jiri Pirko
@ 2016-11-16 14:09 ` Jiri Pirko
2016-11-16 14:09 ` [patch net-next 8/8] rocker: " Jiri Pirko
2016-11-16 16:38 ` [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Alexander Duyck
8 siblings, 0 replies; 27+ messages in thread
From: Jiri Pirko @ 2016-11-16 14:09 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Ido Schimmel <idosch@mellanox.com>
Make sure the device has a complete view of the FIB tables by invoking
their dump during module init.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index a8011a5..46126d9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2048,6 +2048,7 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
register_fib_notifier(&mlxsw_sp->fib_nb);
+ fib_notifier_dump(&mlxsw_sp->fib_nb);
return 0;
err_neigh_init:
--
2.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [patch net-next 8/8] rocker: Request a dump of FIB tables during init
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
` (6 preceding siblings ...)
2016-11-16 14:09 ` [patch net-next 7/8] mlxsw: spectrum_router: Request a dump of FIB tables during init Jiri Pirko
@ 2016-11-16 14:09 ` Jiri Pirko
2016-11-16 16:38 ` [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Alexander Duyck
8 siblings, 0 replies; 27+ messages in thread
From: Jiri Pirko @ 2016-11-16 14:09 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Ido Schimmel <idosch@mellanox.com>
Make sure the device has a complete view of the FIB tables by invoking
their dump during module init.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
drivers/net/ethernet/rocker/rocker_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index b80ff12..6968473 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2806,6 +2806,7 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
rocker->fib_nb.notifier_call = rocker_router_fib_event;
register_fib_notifier(&rocker->fib_nb);
+ fib_notifier_dump(&rocker->fib_nb);
dev_info(&pdev->dev, "Rocker switch with id %*phN\n",
(int)sizeof(rocker->hw.id), &rocker->hw.id);
--
2.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 14:09 ` [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump Jiri Pirko
@ 2016-11-16 14:51 ` Hannes Frederic Sowa
2016-11-16 15:18 ` Ido Schimmel
2016-11-16 16:27 ` David Miller
0 siblings, 2 replies; 27+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-16 14:51 UTC (permalink / raw)
To: Jiri Pirko, netdev
Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
On 16.11.2016 15:09, Jiri Pirko wrote:
> From: Ido Schimmel <idosch@mellanox.com>
>
> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> introduced a new notification chain to notify listeners (f.e., switchdev
> drivers) about addition and deletion of routes.
>
> However, upon registration to the chain the FIB tables can already be
> populated, which means potential listeners will have an incomplete view
> of the tables.
>
> Solve that by adding an API to request a FIB dump. The dump itself it
> done using RCU in order not to starve consumers that need RTNL to make
> progress.
>
> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Have you looked at potential inconsistencies resulting of RCU walking
the table and having concurrent inserts?
I don't see a way around doing a journal like in filesystems somehow,
but maybe the effects are not that severe and it is not a problem after all.
Bye,
Hannes
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 14:51 ` Hannes Frederic Sowa
@ 2016-11-16 15:18 ` Ido Schimmel
2016-11-16 17:35 ` Hannes Frederic Sowa
2016-11-16 16:27 ` David Miller
1 sibling, 1 reply; 27+ messages in thread
From: Ido Schimmel @ 2016-11-16 15:18 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
Hi Hannes,
On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
> On 16.11.2016 15:09, Jiri Pirko wrote:
> > From: Ido Schimmel <idosch@mellanox.com>
> >
> > Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> > introduced a new notification chain to notify listeners (f.e., switchdev
> > drivers) about addition and deletion of routes.
> >
> > However, upon registration to the chain the FIB tables can already be
> > populated, which means potential listeners will have an incomplete view
> > of the tables.
> >
> > Solve that by adding an API to request a FIB dump. The dump itself it
> > done using RCU in order not to starve consumers that need RTNL to make
> > progress.
> >
> > Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> > Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>
> Have you looked at potential inconsistencies resulting of RCU walking
> the table and having concurrent inserts?
Yes. I did try to think about situations in which this approach will
fail, but I could only find problems with concurrent removals, which I
addressed in 5/8. In case of concurrent insertions, even if you missed
the node, you would still get the ENTRY_ADD event to your listener.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 14:51 ` Hannes Frederic Sowa
2016-11-16 15:18 ` Ido Schimmel
@ 2016-11-16 16:27 ` David Miller
2016-11-16 16:42 ` Ido Schimmel
2016-11-16 16:56 ` Hannes Frederic Sowa
1 sibling, 2 replies; 27+ messages in thread
From: David Miller @ 2016-11-16 16:27 UTC (permalink / raw)
To: hannes
Cc: jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz,
roopa, dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 16 Nov 2016 15:51:01 +0100
> I don't see a way around doing a journal like in filesystems somehow,
We really just need a sequence counter incremented for each insert/remove,
and restart the dump from the beginning if it changes mid-dump.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
` (7 preceding siblings ...)
2016-11-16 14:09 ` [patch net-next 8/8] rocker: " Jiri Pirko
@ 2016-11-16 16:38 ` Alexander Duyck
8 siblings, 0 replies; 27+ messages in thread
From: Alexander Duyck @ 2016-11-16 16:38 UTC (permalink / raw)
To: Jiri Pirko
Cc: Netdev, David Miller, idosch, eladr, yotamg, nogahf, arkadis,
Or Gerlitz, roopa, David Ahern, nikolay, Andy Gospodarek,
vivien.didelot, Andrew Lunn, Florian Fainelli, Duyck, Alexander H,
Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
kaber@trash.net
On Wed, Nov 16, 2016 at 6:08 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Ido says:
>
> In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
> by a new FIB notification chain to which modules could register in order
> to be notified about the addition and deletion of FIB entries. The
> motivation for this change was that switchdev drivers need to be able to
> reflect the entire FIB table and not only FIBs configured on top of the
> port netdevs themselves. This is useful in case of in-band management.
>
> The fundamental problem with this approach is that upon registration
> listeners lose all the information previously sent in the chain and
> thus have an incomplete view of the FIB tables, which can result in
> packet loss. This patchset fixes that by introducing a new API to dump
> the FIB tables.
>
> The entire dump process is done under RCU and thus the FIB notification
> chain is converted to be atomic. The listeners are modified accordingly.
> This is done in the first five patches.
>
> The sixth patch adds the dump callback itself and the next patches call
> it from current listeners of the FIB notification chain.
>
> Ido Schimmel (8):
> ipv4: fib: Export free_fib_info()
> mlxsw: spectrum_router: Implement FIB offload in delayed work
> rocker: Implement FIB offload in delayed work
> ipv4: fib: Convert FIB notification chain to be atomic
> net: ipv4: Send notifications only after removing FIB alias
> ipv4: fib: Add an API to request a FIB dump
> mlxsw: spectrum_router: Request a dump of FIB tables during init
> rocker: Request a dump of FIB tables during init
>
> drivers/net/ethernet/mellanox/mlxsw/core.c | 6 +
> drivers/net/ethernet/mellanox/mlxsw/core.h | 1 +
> .../net/ethernet/mellanox/mlxsw/spectrum_router.c | 73 ++++++++++--
> drivers/net/ethernet/rocker/rocker_main.c | 59 ++++++++--
> drivers/net/ethernet/rocker/rocker_ofdpa.c | 1 +
> include/net/ip_fib.h | 1 +
> net/ipv4/fib_semantics.c | 1 +
> net/ipv4/fib_trie.c | 128 +++++++++++++++++++--
> 8 files changed, 242 insertions(+), 28 deletions(-)
>
So for the pieces related to the fib_trie itself this patch series
looks good to me.
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 16:27 ` David Miller
@ 2016-11-16 16:42 ` Ido Schimmel
2016-11-16 16:56 ` Hannes Frederic Sowa
1 sibling, 0 replies; 27+ messages in thread
From: Ido Schimmel @ 2016-11-16 16:42 UTC (permalink / raw)
To: David Miller
Cc: hannes, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
Hi Dave,
On Wed, Nov 16, 2016 at 11:27:35AM -0500, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Wed, 16 Nov 2016 15:51:01 +0100
>
> > I don't see a way around doing a journal like in filesystems somehow,
>
> We really just need a sequence counter incremented for each insert/remove,
> and restart the dump from the beginning if it changes mid-dump.
When you dump the FIB table your listener is already registered to the
chain, so any changes happening mid-dump will be propagated to it. I've
noted this in 5/8 and in my reply to Hannes.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 16:27 ` David Miller
2016-11-16 16:42 ` Ido Schimmel
@ 2016-11-16 16:56 ` Hannes Frederic Sowa
1 sibling, 0 replies; 27+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-16 16:56 UTC (permalink / raw)
To: David Miller
Cc: jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz,
roopa, dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
On 16.11.2016 17:27, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Wed, 16 Nov 2016 15:51:01 +0100
>
>> I don't see a way around doing a journal like in filesystems somehow,
>
> We really just need a sequence counter incremented for each insert/remove,
> and restart the dump from the beginning if it changes mid-dump.
I thought about this at first, too, but became afraid if we could end up
looping endlessly because of routing changes happen constantly while
trying to upload the fib into the hardware.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 15:18 ` Ido Schimmel
@ 2016-11-16 17:35 ` Hannes Frederic Sowa
2016-11-16 18:51 ` Ido Schimmel
0 siblings, 1 reply; 27+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-16 17:35 UTC (permalink / raw)
To: Ido Schimmel
Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
On 16.11.2016 16:18, Ido Schimmel wrote:
> Hi Hannes,
>
> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
>> On 16.11.2016 15:09, Jiri Pirko wrote:
>>> From: Ido Schimmel <idosch@mellanox.com>
>>>
>>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
>>> introduced a new notification chain to notify listeners (f.e., switchdev
>>> drivers) about addition and deletion of routes.
>>>
>>> However, upon registration to the chain the FIB tables can already be
>>> populated, which means potential listeners will have an incomplete view
>>> of the tables.
>>>
>>> Solve that by adding an API to request a FIB dump. The dump itself it
>>> done using RCU in order not to starve consumers that need RTNL to make
>>> progress.
>>>
>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>
>> Have you looked at potential inconsistencies resulting of RCU walking
>> the table and having concurrent inserts?
>
> Yes. I did try to think about situations in which this approach will
> fail, but I could only find problems with concurrent removals, which I
> addressed in 5/8. In case of concurrent insertions, even if you missed
> the node, you would still get the ENTRY_ADD event to your listener.
Theoretically a node could still be installed while the deletion event
fired before registering the notifier. E.g. a synchronize_net before
dumping could help here?
I don't know how you prepare the data structures for inserting in into
the hardware, but if ordering matters, the notifier for a delete event
can be called before the dump installed the fib entry? E.g. you also
don't check what events are already queued in the delayed work queue so
far. I hope I understand the patches correctly?
Thanks,
Hannes
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 17:35 ` Hannes Frederic Sowa
@ 2016-11-16 18:51 ` Ido Schimmel
2016-11-16 19:43 ` Hannes Frederic Sowa
0 siblings, 1 reply; 27+ messages in thread
From: Ido Schimmel @ 2016-11-16 18:51 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
Hi,
On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
> On 16.11.2016 16:18, Ido Schimmel wrote:
> > On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
> >> On 16.11.2016 15:09, Jiri Pirko wrote:
> >>> From: Ido Schimmel <idosch@mellanox.com>
> >>>
> >>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> >>> introduced a new notification chain to notify listeners (f.e., switchdev
> >>> drivers) about addition and deletion of routes.
> >>>
> >>> However, upon registration to the chain the FIB tables can already be
> >>> populated, which means potential listeners will have an incomplete view
> >>> of the tables.
> >>>
> >>> Solve that by adding an API to request a FIB dump. The dump itself it
> >>> done using RCU in order not to starve consumers that need RTNL to make
> >>> progress.
> >>>
> >>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> >>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> >>
> >> Have you looked at potential inconsistencies resulting of RCU walking
> >> the table and having concurrent inserts?
> >
> > Yes. I did try to think about situations in which this approach will
> > fail, but I could only find problems with concurrent removals, which I
> > addressed in 5/8. In case of concurrent insertions, even if you missed
> > the node, you would still get the ENTRY_ADD event to your listener.
>
> Theoretically a node could still be installed while the deletion event
> fired before registering the notifier. E.g. a synchronize_net before
> dumping could help here?
If the deletion event fired for some fib alias, then by 5/8 we are
guaranteed that it was already unlinked from the fib alias list in the
leaf in which it was contained. So, while it's possible we didn't
register our listener in time for the deletion event, we won't traverse
this fib alias while dumping the trie anyway. Did I understand you
correctly?
> I don't know how you prepare the data structures for inserting in into
> the hardware, but if ordering matters, the notifier for a delete event
> can be called before the dump installed the fib entry?
Right. It's possible for the listener to receive a deletion event for a
fib entry it doesn't have, in which case it should just ignore it (as
current listeners do).
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 18:51 ` Ido Schimmel
@ 2016-11-16 19:43 ` Hannes Frederic Sowa
2016-11-16 21:06 ` Ido Schimmel
2016-11-17 13:10 ` Ido Schimmel
0 siblings, 2 replies; 27+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-16 19:43 UTC (permalink / raw)
To: Ido Schimmel
Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
On 16.11.2016 19:51, Ido Schimmel wrote:
> Hi,
>
> On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
>> On 16.11.2016 16:18, Ido Schimmel wrote:
>>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
>>>> On 16.11.2016 15:09, Jiri Pirko wrote:
>>>>> From: Ido Schimmel <idosch@mellanox.com>
>>>>>
>>>>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
>>>>> introduced a new notification chain to notify listeners (f.e., switchdev
>>>>> drivers) about addition and deletion of routes.
>>>>>
>>>>> However, upon registration to the chain the FIB tables can already be
>>>>> populated, which means potential listeners will have an incomplete view
>>>>> of the tables.
>>>>>
>>>>> Solve that by adding an API to request a FIB dump. The dump itself it
>>>>> done using RCU in order not to starve consumers that need RTNL to make
>>>>> progress.
>>>>>
>>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>
>>>> Have you looked at potential inconsistencies resulting of RCU walking
>>>> the table and having concurrent inserts?
>>>
>>> Yes. I did try to think about situations in which this approach will
>>> fail, but I could only find problems with concurrent removals, which I
>>> addressed in 5/8. In case of concurrent insertions, even if you missed
>>> the node, you would still get the ENTRY_ADD event to your listener.
>>
>> Theoretically a node could still be installed while the deletion event
>> fired before registering the notifier. E.g. a synchronize_net before
>> dumping could help here?
>
> If the deletion event fired for some fib alias, then by 5/8 we are
> guaranteed that it was already unlinked from the fib alias list in the
> leaf in which it was contained. So, while it's possible we didn't
> register our listener in time for the deletion event, we won't traverse
> this fib alias while dumping the trie anyway. Did I understand you
> correctly?
>
Theoretically we can have the same problem for insertion:
You receive a delete event from the notifier that is queued up first but
the dump will still see the entry in the fib due to being managed by RCU
(the notifier running on another CPU).
The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
still not strongly ordered against the local fib dump trie walk.
>> I don't know how you prepare the data structures for inserting in into
>> the hardware, but if ordering matters, the notifier for a delete event
>> can be called before the dump installed the fib entry?
>
> Right. It's possible for the listener to receive a deletion event for a
> fib entry it doesn't have, in which case it should just ignore it (as
> current listeners do).
Yep, for this specific case.
Bye,
Hannes
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 19:43 ` Hannes Frederic Sowa
@ 2016-11-16 21:06 ` Ido Schimmel
2016-11-17 13:10 ` Ido Schimmel
1 sibling, 0 replies; 27+ messages in thread
From: Ido Schimmel @ 2016-11-16 21:06 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote:
> On 16.11.2016 19:51, Ido Schimmel wrote:
> > On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
> >> On 16.11.2016 16:18, Ido Schimmel wrote:
> >>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
> >>>> On 16.11.2016 15:09, Jiri Pirko wrote:
> >>>>> From: Ido Schimmel <idosch@mellanox.com>
> >>>>>
> >>>>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> >>>>> introduced a new notification chain to notify listeners (f.e., switchdev
> >>>>> drivers) about addition and deletion of routes.
> >>>>>
> >>>>> However, upon registration to the chain the FIB tables can already be
> >>>>> populated, which means potential listeners will have an incomplete view
> >>>>> of the tables.
> >>>>>
> >>>>> Solve that by adding an API to request a FIB dump. The dump itself it
> >>>>> done using RCU in order not to starve consumers that need RTNL to make
> >>>>> progress.
> >>>>>
> >>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> >>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> >>>>
> >>>> Have you looked at potential inconsistencies resulting of RCU walking
> >>>> the table and having concurrent inserts?
> >>>
> >>> Yes. I did try to think about situations in which this approach will
> >>> fail, but I could only find problems with concurrent removals, which I
> >>> addressed in 5/8. In case of concurrent insertions, even if you missed
> >>> the node, you would still get the ENTRY_ADD event to your listener.
> >>
> >> Theoretically a node could still be installed while the deletion event
> >> fired before registering the notifier. E.g. a synchronize_net before
> >> dumping could help here?
> >
> > If the deletion event fired for some fib alias, then by 5/8 we are
> > guaranteed that it was already unlinked from the fib alias list in the
> > leaf in which it was contained. So, while it's possible we didn't
> > register our listener in time for the deletion event, we won't traverse
> > this fib alias while dumping the trie anyway. Did I understand you
> > correctly?
> >
>
> Theoretically we can have the same problem for insertion:
>
> You receive a delete event from the notifier that is queued up first but
> the dump will still see the entry in the fib due to being managed by RCU
> (the notifier running on another CPU).
>
> The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
> still not strongly ordered against the local fib dump trie walk.
It's pretty late here so I would have to check this out tomorrow
morning. If this is indeed the case (not saying you're wrong, just want
to verify for myself), then I guess 5/8 can be dropped and instead we
should go with Dave's suggestion? I don't see any other way given the
constraints...
Thanks a lot Hannes!
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-16 19:43 ` Hannes Frederic Sowa
2016-11-16 21:06 ` Ido Schimmel
@ 2016-11-17 13:10 ` Ido Schimmel
2016-11-17 14:36 ` Hannes Frederic Sowa
1 sibling, 1 reply; 27+ messages in thread
From: Ido Schimmel @ 2016-11-17 13:10 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
Hi Hannes,
On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote:
> On 16.11.2016 19:51, Ido Schimmel wrote:
> > On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
> >> On 16.11.2016 16:18, Ido Schimmel wrote:
> >>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
> >>>> On 16.11.2016 15:09, Jiri Pirko wrote:
> >>>>> From: Ido Schimmel <idosch@mellanox.com>
> >>>>>
> >>>>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
> >>>>> introduced a new notification chain to notify listeners (f.e., switchdev
> >>>>> drivers) about addition and deletion of routes.
> >>>>>
> >>>>> However, upon registration to the chain the FIB tables can already be
> >>>>> populated, which means potential listeners will have an incomplete view
> >>>>> of the tables.
> >>>>>
> >>>>> Solve that by adding an API to request a FIB dump. The dump itself it
> >>>>> done using RCU in order not to starve consumers that need RTNL to make
> >>>>> progress.
> >>>>>
> >>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> >>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> >>>>
> >>>> Have you looked at potential inconsistencies resulting of RCU walking
> >>>> the table and having concurrent inserts?
> >>>
> >>> Yes. I did try to think about situations in which this approach will
> >>> fail, but I could only find problems with concurrent removals, which I
> >>> addressed in 5/8. In case of concurrent insertions, even if you missed
> >>> the node, you would still get the ENTRY_ADD event to your listener.
> >>
> >> Theoretically a node could still be installed while the deletion event
> >> fired before registering the notifier. E.g. a synchronize_net before
> >> dumping could help here?
> >
> > If the deletion event fired for some fib alias, then by 5/8 we are
> > guaranteed that it was already unlinked from the fib alias list in the
> > leaf in which it was contained. So, while it's possible we didn't
> > register our listener in time for the deletion event, we won't traverse
> > this fib alias while dumping the trie anyway. Did I understand you
> > correctly?
> >
>
> Theoretically we can have the same problem for insertion:
>
> You receive a delete event from the notifier that is queued up first but
> the dump will still see the entry in the fib due to being managed by RCU
> (the notifier running on another CPU).
>
> The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
> still not strongly ordered against the local fib dump trie walk.
OK, so I believe my analysis in 5/8 was wrong. Despite CPU A invoking
fib_remove_alias() in t0 for fa1, it's possible for CPU B doing the fib
dump to see fa1 in t1, which will lead to fa1 being permanently present
in the listener's table.
Given the above, I think Dave's suggestion is the only applicable
solution. Do you agree? Any other suggestions?
Thanks
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-17 13:10 ` Ido Schimmel
@ 2016-11-17 14:36 ` Hannes Frederic Sowa
2016-11-17 16:45 ` David Miller
0 siblings, 1 reply; 27+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-17 14:36 UTC (permalink / raw)
To: Ido Schimmel
Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
On 17.11.2016 14:10, Ido Schimmel wrote:
> Hi Hannes,
>
> On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote:
>> On 16.11.2016 19:51, Ido Schimmel wrote:
>>> On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote:
>>>> On 16.11.2016 16:18, Ido Schimmel wrote:
>>>>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote:
>>>>>> On 16.11.2016 15:09, Jiri Pirko wrote:
>>>>>>> From: Ido Schimmel <idosch@mellanox.com>
>>>>>>>
>>>>>>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
>>>>>>> introduced a new notification chain to notify listeners (f.e., switchdev
>>>>>>> drivers) about addition and deletion of routes.
>>>>>>>
>>>>>>> However, upon registration to the chain the FIB tables can already be
>>>>>>> populated, which means potential listeners will have an incomplete view
>>>>>>> of the tables.
>>>>>>>
>>>>>>> Solve that by adding an API to request a FIB dump. The dump itself it
>>>>>>> done using RCU in order not to starve consumers that need RTNL to make
>>>>>>> progress.
>>>>>>>
>>>>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>>>
>>>>>> Have you looked at potential inconsistencies resulting of RCU walking
>>>>>> the table and having concurrent inserts?
>>>>>
>>>>> Yes. I did try to think about situations in which this approach will
>>>>> fail, but I could only find problems with concurrent removals, which I
>>>>> addressed in 5/8. In case of concurrent insertions, even if you missed
>>>>> the node, you would still get the ENTRY_ADD event to your listener.
>>>>
>>>> Theoretically a node could still be installed while the deletion event
>>>> fired before registering the notifier. E.g. a synchronize_net before
>>>> dumping could help here?
>>>
>>> If the deletion event fired for some fib alias, then by 5/8 we are
>>> guaranteed that it was already unlinked from the fib alias list in the
>>> leaf in which it was contained. So, while it's possible we didn't
>>> register our listener in time for the deletion event, we won't traverse
>>> this fib alias while dumping the trie anyway. Did I understand you
>>> correctly?
>>>
>>
>> Theoretically we can have the same problem for insertion:
>>
>> You receive a delete event from the notifier that is queued up first but
>> the dump will still see the entry in the fib due to being managed by RCU
>> (the notifier running on another CPU).
>>
>> The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is
>> still not strongly ordered against the local fib dump trie walk.
>
> OK, so I believe my analysis in 5/8 was wrong. Despite CPU A invoking
> fib_remove_alias() in t0 for fa1, it's possible for CPU B doing the fib
> dump to see fa1 in t1, which will lead to fa1 being permanently present
> in the listener's table.
Yep. :(
Also the delayed work queue is only partial ordered (only on one CPU),
so you don't know about when an event is processed from the dump and the
notifier. I think you need to create your own workqueue for doing so.
> Given the above, I think Dave's suggestion is the only applicable
> solution. Do you agree? Any other suggestions?
As I wrote before the problem with seqcounter is, that with a quagga
running on top and pushing updates you end up never getting a stable
view, so maybe some logic to abort in case it cannot be loaded after a
few tries would be fine.
The other way is the journal idea I had, which uses an rb-tree with
timestamps as keys (can be lamport timestamps). You insert into the tree
until the dump is finished and use it as queue later to shuffle stuff
into the hardware.
Because you should be able to hold refs to the fa_info or other structs
directly, I don't expect gigantic memory overhead, just for a cell to
store timetamp and pointer (rb_node).
Bye,
Hannes
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-17 14:36 ` Hannes Frederic Sowa
@ 2016-11-17 16:45 ` David Miller
2016-11-17 17:20 ` Hannes Frederic Sowa
0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2016-11-17 16:45 UTC (permalink / raw)
To: hannes
Cc: idosch, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 17 Nov 2016 15:36:48 +0100
> The other way is the journal idea I had, which uses an rb-tree with
> timestamps as keys (can be lamport timestamps). You insert into the tree
> until the dump is finished and use it as queue later to shuffle stuff
> into the hardware.
If you have this "place" where pending inserts are stored, you have
a policy decision to make.
First of all what do other lookups see when there are pending entires?
If, once inserted into the pending queue, you return success to the
inserting entity, then you must make those pending entries visible
to lookups.
If you block the inserting entity, well that doesn't make a lot of
sense. If blocking is a workable solution, then we can just block the
entire insert while this FIB dump to the device driver is happening.
But I am pretty sure the idea of blocking modifications for so long
was considered undesirable.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-17 16:45 ` David Miller
@ 2016-11-17 17:20 ` Hannes Frederic Sowa
2016-11-17 18:16 ` David Miller
2016-11-17 18:32 ` Ido Schimmel
0 siblings, 2 replies; 27+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-17 17:20 UTC (permalink / raw)
To: David Miller
Cc: idosch, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
Hi,
On 17.11.2016 17:45, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Thu, 17 Nov 2016 15:36:48 +0100
>
>> The other way is the journal idea I had, which uses an rb-tree with
>> timestamps as keys (can be lamport timestamps). You insert into the tree
>> until the dump is finished and use it as queue later to shuffle stuff
>> into the hardware.
>
> If you have this "place" where pending inserts are stored, you have
> a policy decision to make.
>
> First of all what do other lookups see when there are pending entires?
I think this is a problem with the current approach already, as the
delayed work queue already postpones the insert for an undecidable
amount of time (and reorders depending on which CPU the entry was
inserted and the fib notifier was called).
For user space queries we would still query the in-kernel table.
> If, once inserted into the pending queue, you return success to the
> inserting entity, then you must make those pending entries visible
> to lookups.
I think this same problem is in this patch set already. The way I
understood it, is, that if a problem during insert emerges, the driver
calls abort and every packet will be send to user space, as the routing
table cannot be offloaded and it won't try it again, Ido?
Probably this is an artifact of the mellanox implementation and we can
implement this differently for other cards, but the schema to abort all
if the modification doesn't work, seems to be fundamental (I think we
require the all-or-nothing approach for now).
> If you block the inserting entity, well that doesn't make a lot of
> sense. If blocking is a workable solution, then we can just block the
> entire insert while this FIB dump to the device driver is happening.
I don't think we should really block (as in kernel-block) at any time.
I was suggesting something like:
rtnl_lock();
synchronize_rcu_expedited(); // barrier, all rounting modifications are
stable and no new can be added due to rtnl_lock
register notifier(); // notifier adds entries also into journal();
rtnl_unlock();
walk_fib_tree_rcu_into_journal();
// walk finished
start_syncing_journal_to_hw(); // if new entries show up we sync them
asap after this point
The journal would need a spin lock to protect its internal state and
order events correctly.
> But I am pretty sure the idea of blocking modifications for so long
> was considered undesirable.
Yes, this is also still my opinion.
Bye,
Hannes
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-17 17:20 ` Hannes Frederic Sowa
@ 2016-11-17 18:16 ` David Miller
2016-11-17 19:03 ` Hannes Frederic Sowa
2016-11-17 18:32 ` Ido Schimmel
1 sibling, 1 reply; 27+ messages in thread
From: David Miller @ 2016-11-17 18:16 UTC (permalink / raw)
To: hannes
Cc: idosch, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 17 Nov 2016 18:20:39 +0100
> Hi,
>
> On 17.11.2016 17:45, David Miller wrote:
>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>
>>> The other way is the journal idea I had, which uses an rb-tree with
>>> timestamps as keys (can be lamport timestamps). You insert into the tree
>>> until the dump is finished and use it as queue later to shuffle stuff
>>> into the hardware.
>>
>> If you have this "place" where pending inserts are stored, you have
>> a policy decision to make.
>>
>> First of all what do other lookups see when there are pending entires?
>
> I think this is a problem with the current approach already, as the
> delayed work queue already postpones the insert for an undecidable
> amount of time (and reorders depending on which CPU the entry was
> inserted and the fib notifier was called).
>
> For user space queries we would still query the in-kernel table.
Ok, I think I might misunderstand something.
What is going into this journal exactly? The actual full software and
hardware insert operation, or just the notification to the hardware
device driver notifiers?
The "lookup" I'm mostly concerned with is the fast path where the
packets being processed actually look up a route.
I do not think we can return success on the insert to the user yet
have the route lookup dataplace not return that route on a lookup.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-17 17:20 ` Hannes Frederic Sowa
2016-11-17 18:16 ` David Miller
@ 2016-11-17 18:32 ` Ido Schimmel
2016-11-17 19:05 ` Hannes Frederic Sowa
1 sibling, 1 reply; 27+ messages in thread
From: Ido Schimmel @ 2016-11-17 18:32 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
andrew, f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji,
kaber
On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
> On 17.11.2016 17:45, David Miller wrote:
> > From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > Date: Thu, 17 Nov 2016 15:36:48 +0100
> >
> >> The other way is the journal idea I had, which uses an rb-tree with
> >> timestamps as keys (can be lamport timestamps). You insert into the tree
> >> until the dump is finished and use it as queue later to shuffle stuff
> >> into the hardware.
> >
> > If you have this "place" where pending inserts are stored, you have
> > a policy decision to make.
> >
> > First of all what do other lookups see when there are pending entires?
>
> I think this is a problem with the current approach already, as the
> delayed work queue already postpones the insert for an undecidable
> amount of time (and reorders depending on which CPU the entry was
> inserted and the fib notifier was called).
The delayed work queue only postpones the insert into the listener's
table, but the entries will be present in the kernel's table as usual.
Therefore, other lookup made on the kernel's table will see the pending
entries.
Note that both listeners that currently call the dump API do that during
their init, before any lookups can be made on their tables. If you think
it's critical, then we can make sure the workqueues are flushed before
the listeners register their netdevs and effectively expose their tables
to lookups.
I'm looking into the reordering issues you mentioned. I belive it's a
valid point.
> For user space queries we would still query the in-kernel table.
Right. No changes here.
> > If, once inserted into the pending queue, you return success to the
> > inserting entity, then you must make those pending entries visible
> > to lookups.
>
> I think this same problem is in this patch set already. The way I
> understood it, is, that if a problem during insert emerges, the driver
> calls abort and every packet will be send to user space, as the routing
> table cannot be offloaded and it won't try it again, Ido?
First of all, this abort mechanism is already in place and isn't part of
this patchset. Secondly, why is this a problem? The all-or-nothing
approach is an hard requirement and current implementation is infinitely
better than previous approach in which the kernel's tables were flushed
upon route insertion failure. It rendered the system unusable. Current
implementation of abort mechanism keeps the system usable, but with
reduced performance.
> Probably this is an artifact of the mellanox implementation and we can
> implement this differently for other cards, but the schema to abort all
> if the modification doesn't work, seems to be fundamental (I think we
> require the all-or-nothing approach for now).
Yes, it's an hard requirement for all implementations. mlxsw implements
it by evicting all routes from its tables and inserting a default route
that traps packets to CPU.
> > If you block the inserting entity, well that doesn't make a lot of
> > sense. If blocking is a workable solution, then we can just block the
> > entire insert while this FIB dump to the device driver is happening.
>
> I don't think we should really block (as in kernel-block) at any time.
>
> I was suggesting something like:
>
> rtnl_lock();
> synchronize_rcu_expedited(); // barrier, all rounting modifications are
> stable and no new can be added due to rtnl_lock
> register notifier(); // notifier adds entries also into journal();
> rtnl_unlock();
> walk_fib_tree_rcu_into_journal();
> // walk finished
> start_syncing_journal_to_hw(); // if new entries show up we sync them
> asap after this point
>
> The journal would need a spin lock to protect its internal state and
> order events correctly.
>
> > But I am pretty sure the idea of blocking modifications for so long
> > was considered undesirable.
>
> Yes, this is also still my opinion.
It was indeed rejected :)
https://marc.info/?l=linux-netdev&m=147794848224884&w=2
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-17 18:16 ` David Miller
@ 2016-11-17 19:03 ` Hannes Frederic Sowa
0 siblings, 0 replies; 27+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-17 19:03 UTC (permalink / raw)
To: David Miller
Cc: idosch, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
On 17.11.2016 19:16, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Thu, 17 Nov 2016 18:20:39 +0100
>
>> Hi,
>>
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
>>>> The other way is the journal idea I had, which uses an rb-tree with
>>>> timestamps as keys (can be lamport timestamps). You insert into the tree
>>>> until the dump is finished and use it as queue later to shuffle stuff
>>>> into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entires?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
>>
>> For user space queries we would still query the in-kernel table.
>
> Ok, I think I might misunderstand something.
>
> What is going into this journal exactly? The actual full software and
> hardware insert operation, or just the notification to the hardware
> device driver notifiers?
The journal is only used as a timely ordered queue for updating the
hardware in correct order.
The enqueue side is the fib notifier only. If no fib notifier is
registered we don't use this code at all (and also don't hit the lock
protecting this journal in fib insertion/deletion path - fast in-kernel
path is untouched - otherwise just a spin_lock already under rtnl_lock
in slow path).
The fib notifier enqueues the packet with a timestamp into this journal
and can also merge entries while they are in the queue, e.g. we got a
delete from the fib notifier but the rcu walk indicated an addition of
the entry, so we can merge that at this point and depending on the
timestamp remove the entry or drop the deletion event.
We start dequeueing the fib entries into the hardware as soon as the rcu
dump is finished, at this point we are up-to-date in the queue with all
events. New events can be added to the journal (with appropriate
locking) during this time, as the queue was once in proper synced state
we stay proper synchronized. We keep up with the queue in steady state
after the dump, so syncing happens ASAP. Maybe we can also drop the
journal then.
Something alike this described queue is implemented here (haven't
checked if it exactly matches the specification, certainly it provides
more features):
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.h
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.c
For this to work the config path needs to add timestamps to the
fib_infos or fib_aliases.
> The "lookup" I'm mostly concerned with is the fast path where the
> packets being processed actually look up a route.
This doesn't change at all. All code will be hidden in a library that
gets attached to the fib notifier, which is configuration code path.
> I do not think we can return success on the insert to the user yet
> have the route lookup dataplace not return that route on a lookup.
We don't change kernel fast path at all.
If we add/delete a route in software and hardware, kernel indicates
success as soon as software has the entry added. It also gets queued up
in the journal. Journal will be lazily processed, if error happens
during that (e.g. hardware signals table full), abort is called and all
packets go to user space ASAP. User space will always show the route as
it is added to in the first place and after the driver called abort also
process the packets accordingly.
I can imagine this can get very complicated. David's approach with a
counter to check for modifications and a limited number of retries
probably works too, especially because the hardware will probably be
initialized before routing daemons start up and will be synced up
hopefully all the time.
So maybe this is over engineered, but I have no idea how long hardware
needs to sync up a e.g. full IPv4 routing table into hardware (if that
is actually the current goal of this).
Bye,
Hannes
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
2016-11-17 18:32 ` Ido Schimmel
@ 2016-11-17 19:05 ` Hannes Frederic Sowa
0 siblings, 0 replies; 27+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-17 19:05 UTC (permalink / raw)
To: Ido Schimmel
Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
andrew, f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji,
kaber
On 17.11.2016 19:32, Ido Schimmel wrote:
> On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
>>>> The other way is the journal idea I had, which uses an rb-tree with
>>>> timestamps as keys (can be lamport timestamps). You insert into the tree
>>>> until the dump is finished and use it as queue later to shuffle stuff
>>>> into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entires?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
>
> The delayed work queue only postpones the insert into the listener's
> table, but the entries will be present in the kernel's table as usual.
> Therefore, other lookup made on the kernel's table will see the pending
> entries.
>
> Note that both listeners that currently call the dump API do that during
> their init, before any lookups can be made on their tables. If you think
> it's critical, then we can make sure the workqueues are flushed before
> the listeners register their netdevs and effectively expose their tables
> to lookups.
I do see routing anyway as a best-effort. ;)
That means in not too long time the hardware needs to be fully
synchronized to the software path and either have all correct entries or
abort must have been called.
> I'm looking into the reordering issues you mentioned. I belive it's a
> valid point.
Thanks!
>> For user space queries we would still query the in-kernel table.
>
> Right. No changes here.
>
>>> If, once inserted into the pending queue, you return success to the
>>> inserting entity, then you must make those pending entries visible
>>> to lookups.
>>
>> I think this same problem is in this patch set already. The way I
>> understood it, is, that if a problem during insert emerges, the driver
>> calls abort and every packet will be send to user space, as the routing
>> table cannot be offloaded and it won't try it again, Ido?
>
> First of all, this abort mechanism is already in place and isn't part of
> this patchset. Secondly, why is this a problem? The all-or-nothing
> approach is an hard requirement and current implementation is infinitely
> better than previous approach in which the kernel's tables were flushed
> upon route insertion failure. It rendered the system unusable. Current
> implementation of abort mechanism keeps the system usable, but with
> reduced performance.
Yes, I argued below that I am toally fine with this approach.
>> Probably this is an artifact of the mellanox implementation and we can
>> implement this differently for other cards, but the schema to abort all
>> if the modification doesn't work, seems to be fundamental (I think we
>> require the all-or-nothing approach for now).
>
> Yes, it's an hard requirement for all implementations. mlxsw implements
> it by evicting all routes from its tables and inserting a default route
> that traps packets to CPU.
Correct, I don't see how fib offloading can do something else, besides
semantically looking at the routing table and figure out where to insert
"exceptions" for subtrees but keep most packets flowing over the hw
directly. But this for another time... ;)
>>> If you block the inserting entity, well that doesn't make a lot of
>>> sense. If blocking is a workable solution, then we can just block the
>>> entire insert while this FIB dump to the device driver is happening.
>>
>> I don't think we should really block (as in kernel-block) at any time.
>>
>> I was suggesting something like:
>>
>> rtnl_lock();
>> synchronize_rcu_expedited(); // barrier, all rounting modifications are
>> stable and no new can be added due to rtnl_lock
>> register notifier(); // notifier adds entries also into journal();
>> rtnl_unlock();
>> walk_fib_tree_rcu_into_journal();
>> // walk finished
>> start_syncing_journal_to_hw(); // if new entries show up we sync them
>> asap after this point
>>
>> The journal would need a spin lock to protect its internal state and
>> order events correctly.
>>
>>> But I am pretty sure the idea of blocking modifications for so long
>>> was considered undesirable.
>>
>> Yes, this is also still my opinion.
>
> It was indeed rejected :)
> https://marc.info/?l=linux-netdev&m=147794848224884&w=2
Bye,
Hannes
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2016-11-17 19:05 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-16 14:08 [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Jiri Pirko
2016-11-16 14:08 ` [patch net-next 1/8] ipv4: fib: Export free_fib_info() Jiri Pirko
2016-11-16 14:08 ` [patch net-next 2/8] mlxsw: spectrum_router: Implement FIB offload in delayed work Jiri Pirko
2016-11-16 14:08 ` [patch net-next 3/8] rocker: " Jiri Pirko
2016-11-16 14:08 ` [patch net-next 4/8] ipv4: fib: Convert FIB notification chain to be atomic Jiri Pirko
2016-11-16 14:09 ` [patch net-next 5/8] net: ipv4: Send notifications only after removing FIB alias Jiri Pirko
2016-11-16 14:09 ` [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump Jiri Pirko
2016-11-16 14:51 ` Hannes Frederic Sowa
2016-11-16 15:18 ` Ido Schimmel
2016-11-16 17:35 ` Hannes Frederic Sowa
2016-11-16 18:51 ` Ido Schimmel
2016-11-16 19:43 ` Hannes Frederic Sowa
2016-11-16 21:06 ` Ido Schimmel
2016-11-17 13:10 ` Ido Schimmel
2016-11-17 14:36 ` Hannes Frederic Sowa
2016-11-17 16:45 ` David Miller
2016-11-17 17:20 ` Hannes Frederic Sowa
2016-11-17 18:16 ` David Miller
2016-11-17 19:03 ` Hannes Frederic Sowa
2016-11-17 18:32 ` Ido Schimmel
2016-11-17 19:05 ` Hannes Frederic Sowa
2016-11-16 16:27 ` David Miller
2016-11-16 16:42 ` Ido Schimmel
2016-11-16 16:56 ` Hannes Frederic Sowa
2016-11-16 14:09 ` [patch net-next 7/8] mlxsw: spectrum_router: Request a dump of FIB tables during init Jiri Pirko
2016-11-16 14:09 ` [patch net-next 8/8] rocker: " Jiri Pirko
2016-11-16 16:38 ` [patch net-next 0/8] ipv4: fib: Allow modules to dump FIB tables Alexander Duyck
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).