* [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases
From: David Ahern @ 2018-05-15 2:51 UTC (permalink / raw)
To: netdev; +Cc: Thomas.Winter, idosch, sharpd, roopa, David Ahern
This patch set fixes a few append and replace uses cases for IPv6 and
adds test cases that codifies the expectations of how append and replace
are expected to work. In paricular it allows a multipath route to have
a dev-only nexthop, something Thomas tried to accomplish with commit
edd7ceb78296 ("ipv6: Allow non-gateway ECMP for IPv6") which had to be
reverted because of breakage, and to replace an existing FIB entry
with a reject route.
There are a number of inconsistent and surprising aspects to the Linux
API for adding, deleting, replacing and changing FIB entries. For example,
with IPv4 NLM_F_APPEND means insert the route after any existing entries
with the same key (prefix + priority + TOS for IPv4) and NLM_F_CREATE
without the append flag inserts the new route before any existing entries.
IPv6 on the other hand attempts to guess whether a new route should be
appended to an existing one, possibly creating a multipath route, or to
add a new entry after any existing ones. This applies to both the 'append'
(NLM_F_CREATE + NLM_F_APPEND) and 'prepend' (NLM_F_CREATE only) cases
meaning for IPv6 the NLM_F_APPEND is basically ignored. This guessing
whether the route should be added to a multipath route (gateway routes)
or inserted after existing entries (non-gateway based routes) means a
multipath route can not have a dev only nexthop (potentially required in
some cases - tunnels or VRF route leaking for example) and route 'replace'
is a bit adhoc treating gateway based routes and dev-only / reject routes
differently.
This has led to frustration with developers working on routing suites
such as FRR where workarounds such as delete and add.
After this patch set there are 2 differences between IPv4 and IPv6:
1. 'ip ro prepend' = NLM_F_CREATE only
IPv4 adds the new route before any existing ones
IPv6 adds new route after any existing ones
2. 'ip ro append' = NLM_F_CREATE|NLM_F_APPEND
IPv4 adds the new route after any existing ones
IPv6 adds the nexthop to existing routes converting to multipath
For the former, there are cases where we want same prefix routes added
after existing ones (e.g., multicast, prefix routes for macvlan when used
for virtual router redundancy). Requiring the APPEND flag to add a new
route to an existing one helps here but is a slight change in behavior
since prepend with gateway routes now create a separate entry.
For the latter IPv6 behavior is preferred - appending a route for the same
prefix and metric to make a multipath route, so really IPv4 not allowing an
existing route to be updated is the limiter. This will be fixed when
nexthops become separate objects - a future patch set.
Thank you to Thomas and Ido for testing earlier versions of this set, and
to Ido for providing an update to the mlxsw driver.
David Ahern (7):
mlxsw: spectrum_router: Add support for route append
net/ipv6: Simplify appending route into multipath route
selftests: fib_tests: Add success-fail counts
selftests: fib_tests: Add command line options
selftests: fib_tests: Add option to pause after each test
selftests: fib_tests: Add ipv6 route add append replace tests
selftests: fib_tests: Add ipv4 route add append replace tests
.../net/ethernet/mellanox/mlxsw/spectrum_router.c | 2 +
include/net/ip6_route.h | 6 -
net/ipv6/ip6_fib.c | 157 +++--
net/ipv6/route.c | 3 +-
tools/testing/selftests/net/fib_tests.sh | 673 ++++++++++++++++++++-
5 files changed, 737 insertions(+), 104 deletions(-)
--
2.11.0
^ permalink raw reply
* [PATCH RFC net-next 1/7] mlxsw: spectrum_router: Add support for route append
From: David Ahern @ 2018-05-15 2:51 UTC (permalink / raw)
To: netdev; +Cc: Thomas.Winter, idosch, sharpd, roopa, David Ahern
In-Reply-To: <20180515025112.16647-1-dsahern@gmail.com>
Handle append for gateway based routes. Dev-only multipath routes will
be handled by a follow on patch.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 8028d221aece..77b2adb29341 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -5725,6 +5725,7 @@ static void mlxsw_sp_router_fib6_event_work(struct work_struct *work)
switch (fib_work->event) {
case FIB_EVENT_ENTRY_REPLACE: /* fall through */
+ case FIB_EVENT_ENTRY_APPEND: /* fall through */
case FIB_EVENT_ENTRY_ADD:
replace = fib_work->event == FIB_EVENT_ENTRY_REPLACE;
err = mlxsw_sp_router_fib6_add(mlxsw_sp,
@@ -5831,6 +5832,7 @@ static void mlxsw_sp_router_fib6_event(struct mlxsw_sp_fib_event_work *fib_work,
switch (fib_work->event) {
case FIB_EVENT_ENTRY_REPLACE: /* fall through */
+ case FIB_EVENT_ENTRY_APPEND: /* fall through */
case FIB_EVENT_ENTRY_ADD: /* fall through */
case FIB_EVENT_ENTRY_DEL:
fen6_info = container_of(info, struct fib6_entry_notifier_info,
--
2.11.0
^ permalink raw reply related
* [PATCH RFC net-next 2/7] net/ipv6: Simplify route replace and appending into multipath route
From: David Ahern @ 2018-05-15 2:51 UTC (permalink / raw)
To: netdev; +Cc: Thomas.Winter, idosch, sharpd, roopa, David Ahern
In-Reply-To: <20180515025112.16647-1-dsahern@gmail.com>
Remove rt6_qualify_for_ecmp which is just guess work. It fails in 2 cases:
1. can not replace a route with a reject route. Existing code appends
a new route instead of replacing the existing one.
2. can not have a multipath route where a leg uses a dev only nexthop
Existing use cases affected by this change:
1. adding a route with existing prefix and metric using NLM_F_CREATE
without NLM_F_APPEND or NLM_F_EXCL (ie., what iproute2 calls
'prepend'). Existing code auto-determines that the new nexthop can
be appended to an existing route to create a multipath route. This
change breaks that by requiring the APPEND flag for the new route
to be added to an existing one. Instead the prepend just adds another
route entry.
2. route replace. Existing code replaces first matching multipath route
if new route is multipath capable and fallback to first matching
non-ECMP route (reject or dev only route) in case one isn't available.
New behavior replaces first matching route. (Thanks to Ido for spotting
this one)
Note: Newer iproute2 is needed to display multipath routes with a dev-only
nexthop. This is due to a bug in iproute2 and parsing nexthops.
Signed-off-by: David Ahern <dsahern@gmail.com>
---
include/net/ip6_route.h | 6 --
net/ipv6/ip6_fib.c | 157 ++++++++++++++++++++++--------------------------
net/ipv6/route.c | 3 +-
3 files changed, 73 insertions(+), 93 deletions(-)
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 4cf1ef935ed9..9e4d0f0aeb6d 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -66,12 +66,6 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr)
(IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL | IPV6_ADDR_LOOPBACK);
}
-static inline bool rt6_qualify_for_ecmp(const struct fib6_info *f6i)
-{
- return (f6i->fib6_flags & (RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC)) ==
- RTF_GATEWAY;
-}
-
void ip6_route_input(struct sk_buff *skb);
struct dst_entry *ip6_route_input_lookup(struct net *net,
struct net_device *dev,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index f0a4262a4789..f57ec3bc12d1 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -927,19 +927,19 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
{
struct fib6_info *leaf = rcu_dereference_protected(fn->leaf,
lockdep_is_held(&rt->fib6_table->tb6_lock));
- struct fib6_info *iter = NULL;
+ struct fib6_info *iter = NULL, *match = NULL;
struct fib6_info __rcu **ins;
- struct fib6_info __rcu **fallback_ins = NULL;
int replace = (info->nlh &&
(info->nlh->nlmsg_flags & NLM_F_REPLACE));
+ int append = (info->nlh &&
+ (info->nlh->nlmsg_flags & NLM_F_APPEND));
int add = (!info->nlh ||
(info->nlh->nlmsg_flags & NLM_F_CREATE));
int found = 0;
- bool rt_can_ecmp = rt6_qualify_for_ecmp(rt);
u16 nlflags = NLM_F_EXCL;
int err;
- if (info->nlh && (info->nlh->nlmsg_flags & NLM_F_APPEND))
+ if (append)
nlflags |= NLM_F_APPEND;
ins = &fn->leaf;
@@ -961,13 +961,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
nlflags &= ~NLM_F_EXCL;
if (replace) {
- if (rt_can_ecmp == rt6_qualify_for_ecmp(iter)) {
- found++;
- break;
- }
- if (rt_can_ecmp)
- fallback_ins = fallback_ins ?: ins;
- goto next_iter;
+ found++;
+ break;
}
if (rt6_duplicate_nexthop(iter, rt)) {
@@ -982,86 +977,67 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
fib6_metric_set(iter, RTAX_MTU, rt->fib6_pmtu);
return -EEXIST;
}
- /* If we have the same destination and the same metric,
- * but not the same gateway, then the route we try to
- * add is sibling to this route, increment our counter
- * of siblings, and later we will add our route to the
- * list.
- * Only static routes (which don't have flag
- * RTF_EXPIRES) are used for ECMPv6.
- *
- * To avoid long list, we only had siblings if the
- * route have a gateway.
- */
- if (rt_can_ecmp &&
- rt6_qualify_for_ecmp(iter))
- rt->fib6_nsiblings++;
+
+ /* first route that matches */
+ if (!match)
+ match = iter;
}
if (iter->fib6_metric > rt->fib6_metric)
break;
-next_iter:
ins = &iter->fib6_next;
}
- if (fallback_ins && !found) {
- /* No ECMP-able route found, replace first non-ECMP one */
- ins = fallback_ins;
- iter = rcu_dereference_protected(*ins,
- lockdep_is_held(&rt->fib6_table->tb6_lock));
- found++;
- }
-
/* Reset round-robin state, if necessary */
if (ins == &fn->leaf)
fn->rr_ptr = NULL;
/* Link this route to others same route. */
- if (rt->fib6_nsiblings) {
- unsigned int fib6_nsiblings;
+ if (append && match) {
struct fib6_info *sibling, *temp_sibling;
- /* Find the first route that have the same metric */
- sibling = leaf;
- while (sibling) {
- if (sibling->fib6_metric == rt->fib6_metric &&
- rt6_qualify_for_ecmp(sibling)) {
- list_add_tail(&rt->fib6_siblings,
- &sibling->fib6_siblings);
- break;
- }
- sibling = rcu_dereference_protected(sibling->fib6_next,
- lockdep_is_held(&rt->fib6_table->tb6_lock));
+ if (rt->fib6_flags & RTF_REJECT) {
+ NL_SET_ERR_MSG(extack,
+ "Can not append a REJECT route");
+ return -EINVAL;
+ } else if (match->fib6_flags & RTF_REJECT) {
+ NL_SET_ERR_MSG(extack,
+ "Can not append to a REJECT route");
+ return -EINVAL;
}
+ rt->fib6_nsiblings = match->fib6_nsiblings;
+ list_add_tail(&rt->fib6_siblings, &match->fib6_siblings);
+ match->fib6_nsiblings++;
+
/* For each sibling in the list, increment the counter of
* siblings. BUG() if counters does not match, list of siblings
* is broken!
*/
- fib6_nsiblings = 0;
list_for_each_entry_safe(sibling, temp_sibling,
- &rt->fib6_siblings, fib6_siblings) {
+ &match->fib6_siblings, fib6_siblings) {
sibling->fib6_nsiblings++;
- BUG_ON(sibling->fib6_nsiblings != rt->fib6_nsiblings);
- fib6_nsiblings++;
+ BUG_ON(sibling->fib6_nsiblings != match->fib6_nsiblings);
}
- BUG_ON(fib6_nsiblings != rt->fib6_nsiblings);
- rt6_multipath_rebalance(temp_sibling);
+
+ rt6_multipath_rebalance(match);
}
/*
* insert node
*/
if (!replace) {
+ enum fib_event_type event;
+
if (!add)
pr_warn("NLM_F_CREATE should be set when creating new route\n");
add:
nlflags |= NLM_F_CREATE;
- err = call_fib6_entry_notifiers(info->nl_net,
- FIB_EVENT_ENTRY_ADD,
- rt, extack);
+ event = append ? FIB_EVENT_ENTRY_APPEND : FIB_EVENT_ENTRY_ADD;
+ err = call_fib6_entry_notifiers(info->nl_net, event, rt,
+ extack);
if (err)
return err;
@@ -1079,7 +1055,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
}
} else {
- int nsiblings;
+ struct fib6_info *tmp;
if (!found) {
if (add)
@@ -1094,48 +1070,57 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt,
if (err)
return err;
+ /* if route being replaced has siblings, set tmp to
+ * last one, otherwise tmp is current route. this is
+ * used to set fib6_next for new route
+ */
+ if (iter->fib6_nsiblings)
+ tmp = list_last_entry(&iter->fib6_siblings,
+ struct fib6_info,
+ fib6_siblings);
+ else
+ tmp = iter;
+
+ /* insert new route */
atomic_inc(&rt->fib6_ref);
rcu_assign_pointer(rt->fib6_node, fn);
- rt->fib6_next = iter->fib6_next;
+ rt->fib6_next = tmp->fib6_next;
rcu_assign_pointer(*ins, rt);
+
if (!info->skip_notify)
inet6_rt_notify(RTM_NEWROUTE, rt, info, NLM_F_REPLACE);
if (!(fn->fn_flags & RTN_RTINFO)) {
info->nl_net->ipv6.rt6_stats->fib_route_nodes++;
fn->fn_flags |= RTN_RTINFO;
}
- nsiblings = iter->fib6_nsiblings;
- iter->fib6_node = NULL;
- fib6_purge_rt(iter, fn, info->nl_net);
- if (rcu_access_pointer(fn->rr_ptr) == iter)
- fn->rr_ptr = NULL;
- fib6_info_release(iter);
- if (nsiblings) {
+ /* delete old route */
+ rt = iter;
+
+ if (rt->fib6_nsiblings) {
+ struct fib6_info *tmp;
+
/* Replacing an ECMP route, remove all siblings */
- ins = &rt->fib6_next;
- iter = rcu_dereference_protected(*ins,
- lockdep_is_held(&rt->fib6_table->tb6_lock));
- while (iter) {
- if (iter->fib6_metric > rt->fib6_metric)
- break;
- if (rt6_qualify_for_ecmp(iter)) {
- *ins = iter->fib6_next;
- iter->fib6_node = NULL;
- fib6_purge_rt(iter, fn, info->nl_net);
- if (rcu_access_pointer(fn->rr_ptr) == iter)
- fn->rr_ptr = NULL;
- fib6_info_release(iter);
- nsiblings--;
- info->nl_net->ipv6.rt6_stats->fib_rt_entries--;
- } else {
- ins = &iter->fib6_next;
- }
- iter = rcu_dereference_protected(*ins,
- lockdep_is_held(&rt->fib6_table->tb6_lock));
+ list_for_each_entry_safe(iter, tmp, &rt->fib6_siblings,
+ fib6_siblings) {
+ iter->fib6_node = NULL;
+ fib6_purge_rt(iter, fn, info->nl_net);
+ if (rcu_access_pointer(fn->rr_ptr) == iter)
+ fn->rr_ptr = NULL;
+ fib6_info_release(iter);
+
+ rt->fib6_nsiblings--;
+ info->nl_net->ipv6.rt6_stats->fib_rt_entries--;
}
- WARN_ON(nsiblings != 0);
}
+
+ WARN_ON(rt->fib6_nsiblings != 0);
+
+ rt->fib6_node = NULL;
+ fib6_purge_rt(rt, fn, info->nl_net);
+ if (rcu_access_pointer(fn->rr_ptr) == rt)
+ fn->rr_ptr = NULL;
+ fib6_info_release(rt);
}
return 0;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index af0416701fb2..f267e320ee2e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3781,7 +3781,7 @@ static struct fib6_info *rt6_multipath_first_sibling(const struct fib6_info *rt)
lockdep_is_held(&rt->fib6_table->tb6_lock));
while (iter) {
if (iter->fib6_metric == rt->fib6_metric &&
- rt6_qualify_for_ecmp(iter))
+ iter->fib6_nsiblings)
return iter;
iter = rcu_dereference_protected(iter->fib6_next,
lockdep_is_held(&rt->fib6_table->tb6_lock));
@@ -4371,6 +4371,7 @@ static int ip6_route_multipath_add(struct fib6_config *cfg,
*/
cfg->fc_nlinfo.nlh->nlmsg_flags &= ~(NLM_F_EXCL |
NLM_F_REPLACE);
+ cfg->fc_nlinfo.nlh->nlmsg_flags |= NLM_F_APPEND;
nhn++;
}
--
2.11.0
^ permalink raw reply related
* [PATCH RFC net-next 3/7] selftests: fib_tests: Add success-fail counts
From: David Ahern @ 2018-05-15 2:51 UTC (permalink / raw)
To: netdev; +Cc: Thomas.Winter, idosch, sharpd, roopa, David Ahern
In-Reply-To: <20180515025112.16647-1-dsahern@gmail.com>
As more tests are added, it is convenient to have a tally at the end.
Signed-off-by: David Ahern <dsahern@gmail.com>
---
tools/testing/selftests/net/fib_tests.sh | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 9164e60d4b66..7e2291161e15 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -18,8 +18,10 @@ log_test()
if [ ${rc} -eq ${expected} ]; then
printf " TEST: %-60s [ OK ]\n" "${msg}"
+ nsuccess=$((nsuccess+1))
else
ret=1
+ nfail=$((nfail+1))
printf " TEST: %-60s [FAIL]\n" "${msg}"
if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
echo
@@ -598,4 +600,9 @@ cleanup &> /dev/null
fib_test
+if [ "$TESTS" != "none" ]; then
+ printf "\nTests passed: %3d\n" ${nsuccess}
+ printf "Tests failed: %3d\n" ${nfail}
+fi
+
exit $ret
--
2.11.0
^ permalink raw reply related
* [PATCH RFC net-next 4/7] selftests: fib_tests: Add command line options
From: David Ahern @ 2018-05-15 2:51 UTC (permalink / raw)
To: netdev; +Cc: Thomas.Winter, idosch, sharpd, roopa, David Ahern
In-Reply-To: <20180515025112.16647-1-dsahern@gmail.com>
Add command line options for controlling pause on fail, controlling
specific tests to run and verbose mode rather than relying on environment
variables.
Signed-off-by: David Ahern <dsahern@gmail.com>
---
tools/testing/selftests/net/fib_tests.sh | 53 ++++++++++++++++++++++++--------
1 file changed, 40 insertions(+), 13 deletions(-)
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 7e2291161e15..8c99f0689efc 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -6,8 +6,9 @@
ret=0
-VERBOSE=${VERBOSE:=0}
-PAUSE_ON_FAIL=${PAUSE_ON_FAIL:=no}
+TESTS="unregister down carrier nexthop"
+VERBOSE=0
+PAUSE_ON_FAIL=no
IP="ip -netns testns"
log_test()
@@ -565,20 +566,36 @@ fib_nexthop_test()
}
################################################################################
-#
+# usage
-fib_test()
+usage()
{
- if [ -n "$TEST" ]; then
- eval $TEST
- else
- fib_unreg_test
- fib_down_test
- fib_carrier_test
- fib_nexthop_test
- fi
+ cat <<EOF
+usage: ${0##*/} OPTS
+
+ -t <test> Test(s) to run (default: all)
+ (options: $TESTS)
+ -p Pause on fail
+ -v verbose mode (show commands and output)
+EOF
}
+################################################################################
+# main
+
+while getopts :t:pPhv o
+do
+ case $o in
+ t) TESTS=$OPTARG;;
+ p) PAUSE_ON_FAIL=yes;;
+ v) VERBOSE=$(($VERBOSE + 1));;
+ h) usage; exit 0;;
+ *) usage; exit 1;;
+ esac
+done
+
+PEER_CMD="ip netns exec ${PEER_NS}"
+
if [ "$(id -u)" -ne 0 ];then
echo "SKIP: Need root privileges"
exit 0
@@ -598,7 +615,17 @@ fi
# start clean
cleanup &> /dev/null
-fib_test
+for t in $TESTS
+do
+ case $t in
+ fib_unreg_test|unregister) fib_unreg_test;;
+ fib_down_test|down) fib_down_test;;
+ fib_carrier_test|carrier) fib_carrier_test;;
+ fib_nexthop_test|nexthop) fib_nexthop_test;;
+
+ help) echo "Test names: $TESTS"; exit 0;;
+ esac
+done
if [ "$TESTS" != "none" ]; then
printf "\nTests passed: %3d\n" ${nsuccess}
--
2.11.0
^ permalink raw reply related
* [PATCH RFC net-next 5/7] selftests: fib_tests: Add option to pause after each test
From: David Ahern @ 2018-05-15 2:51 UTC (permalink / raw)
To: netdev; +Cc: Thomas.Winter, idosch, sharpd, roopa, David Ahern
In-Reply-To: <20180515025112.16647-1-dsahern@gmail.com>
Add option to pause after each test before cleanup is done. Allows
user to do manual inspection or more ad-hoc testing after each test
with the setup in tact.
Signed-off-by: David Ahern <dsahern@gmail.com>
---
tools/testing/selftests/net/fib_tests.sh | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 8c99f0689efc..12b648826151 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -9,6 +9,7 @@ ret=0
TESTS="unregister down carrier nexthop"
VERBOSE=0
PAUSE_ON_FAIL=no
+PAUSE=no
IP="ip -netns testns"
log_test()
@@ -31,6 +32,13 @@ log_test()
[ "$a" = "q" ] && exit 1
fi
fi
+
+ if [ "${PAUSE}" = "yes" ]; then
+ echo
+ echo "hit enter to continue, 'q' to quit"
+ read a
+ [ "$a" = "q" ] && exit 1
+ fi
}
setup()
@@ -576,6 +584,7 @@ usage: ${0##*/} OPTS
-t <test> Test(s) to run (default: all)
(options: $TESTS)
-p Pause on fail
+ -P Pause after each test before cleanup
-v verbose mode (show commands and output)
EOF
}
@@ -588,6 +597,7 @@ do
case $o in
t) TESTS=$OPTARG;;
p) PAUSE_ON_FAIL=yes;;
+ P) PAUSE=yes;;
v) VERBOSE=$(($VERBOSE + 1));;
h) usage; exit 0;;
*) usage; exit 1;;
@@ -596,6 +606,9 @@ done
PEER_CMD="ip netns exec ${PEER_NS}"
+# make sure we don't pause twice
+[ "${PAUSE}" = "yes" ] && PAUSE_ON_FAIL=no
+
if [ "$(id -u)" -ne 0 ];then
echo "SKIP: Need root privileges"
exit 0
--
2.11.0
^ permalink raw reply related
* [PATCH RFC net-next 6/7] selftests: fib_tests: Add ipv6 route add append replace tests
From: David Ahern @ 2018-05-15 2:51 UTC (permalink / raw)
To: netdev; +Cc: Thomas.Winter, idosch, sharpd, roopa, David Ahern
In-Reply-To: <20180515025112.16647-1-dsahern@gmail.com>
Add IPv6 route tests covering add, append and replace permutations.
Assumes the ability to add a basic single path route works; this is
required for example when adding an address to an interface.
$ fib_tests.sh -t ipv6_rt
IPv6 route add / append tests
TEST: Attempt to add duplicate route - gw [ OK ]
TEST: Attempt to add duplicate route - dev only [ OK ]
TEST: Attempt to add duplicate route - reject route [ OK ]
TEST: Add new nexthop for existing prefix [ OK ]
TEST: Append leg to existing route - gw [ OK ]
TEST: Append leg to existing route - dev only [ OK ]
TEST: Append leg to existing route - reject route [ OK ]
TEST: Append leg to existing reject route - gw [ OK ]
TEST: Append leg to existing reject route - dev only [ OK ]
TEST: add multipath route [ OK ]
TEST: Attempt to add duplicate multipath route [ OK ]
TEST: route add with different metrics [ OK ]
TEST: route delete with metric [ OK ]
IPv6 route replace tests
TEST: single path with single path [ OK ]
TEST: single path with multipath [ OK ]
TEST: single path with reject route [ OK ]
TEST: single path with single path via multipath attribute [ OK ]
TEST: invalid nexthop [ OK ]
TEST: single path - replace of non-existent route [ OK ]
TEST: multipath with multipath [ OK ]
TEST: multipath with single path [ OK ]
TEST: multipath with single path via multipath attribute [ OK ]
TEST: multipath with reject route [ OK ]
TEST: multipath - invalid first nexthop [ OK ]
TEST: multipath - invalid second nexthop [ OK ]
TEST: multipath - replace of non-existent route [ OK ]
Signed-off-by: David Ahern <dsahern@gmail.com>
---
tools/testing/selftests/net/fib_tests.sh | 331 ++++++++++++++++++++++++++++++-
1 file changed, 330 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 12b648826151..20cb1d2c4e72 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -6,7 +6,7 @@
ret=0
-TESTS="unregister down carrier nexthop"
+TESTS="unregister down carrier nexthop ipv6_rt"
VERBOSE=0
PAUSE_ON_FAIL=no
PAUSE=no
@@ -574,6 +574,334 @@ fib_nexthop_test()
}
################################################################################
+# Tests on route add and replace
+
+run_cmd()
+{
+ local cmd="$1"
+ local out
+ local stderr="2>/dev/null"
+
+ if [ "$VERBOSE" = "1" ]; then
+ printf " COMMAND: $cmd\n"
+ stderr=
+ fi
+
+ out=$(eval $cmd $stderr)
+ rc=$?
+ if [ "$VERBOSE" = "1" -a -n "$out" ]; then
+ echo " $out"
+ fi
+
+ [ "$VERBOSE" = "1" ] && echo
+
+ return $rc
+}
+
+# add route for a prefix, flushing any existing routes first
+# expected to be the first step of a test
+add_route6()
+{
+ local pfx="$1"
+ local nh="$2"
+ local out
+
+ if [ "$VERBOSE" = "1" ]; then
+ echo
+ echo " ##################################################"
+ echo
+ fi
+
+ run_cmd "$IP -6 ro flush ${pfx}"
+ [ $? -ne 0 ] && exit 1
+
+ out=$($IP -6 ro ls match ${pfx})
+ if [ -n "$out" ]; then
+ echo "Failed to flush routes for prefix used for tests."
+ exit 1
+ fi
+
+ run_cmd "$IP -6 ro add ${pfx} ${nh}"
+ if [ $? -ne 0 ]; then
+ echo "Failed to add initial route for test."
+ exit 1
+ fi
+}
+
+# add initial route - used in replace route tests
+add_initial_route6()
+{
+ add_route6 "2001:db8:104::/64" "$1"
+}
+
+check_route6()
+{
+ local pfx="2001:db8:104::/64"
+ local expected="$1"
+ local out
+ local rc=0
+
+ out=$($IP -6 ro ls match ${pfx} | sed -e 's/ pref medium//')
+ if [ -z "${out}" ]; then
+ if [ "$VERBOSE" = "1" ]; then
+ printf "\nNo route entry found\n"
+ printf "Expected:\n"
+ printf " ${expected}\n"
+ fi
+ return 1
+ fi
+
+ # tricky way to convert output to 1-line without ip's
+ # messy '\'; this drops all extra white space
+ out=$(echo ${out})
+ if [ "${out}" != "${expected}" ]; then
+ rc=1
+ if [ "${VERBOSE}" = "1" ]; then
+ printf " Unexpected route entry. Have:\n"
+ printf " ${out}\n"
+ printf " Expected:\n"
+ printf " ${expected}\n\n"
+ fi
+ fi
+
+ return $rc
+}
+
+route_cleanup()
+{
+ $IP li del red 2>/dev/null
+ $IP li del dummy1 2>/dev/null
+ $IP li del veth1 2>/dev/null
+ $IP li del veth3 2>/dev/null
+
+ cleanup &> /dev/null
+}
+
+route_setup()
+{
+ route_cleanup
+ setup
+
+ [ "${VERBOSE}" = "1" ] && set -x
+ set -e
+
+ $IP li add red up type vrf table 101
+ $IP li add veth1 type veth peer name veth2
+ $IP li add veth3 type veth peer name veth4
+
+ $IP li set veth1 up
+ $IP li set veth3 up
+ $IP li set veth2 vrf red up
+ $IP li set veth4 vrf red up
+ $IP li add dummy1 type dummy
+ $IP li set dummy1 vrf red up
+
+ $IP -6 addr add 2001:db8:101::1/64 dev veth1
+ $IP -6 addr add 2001:db8:101::2/64 dev veth2
+ $IP -6 addr add 2001:db8:103::1/64 dev veth3
+ $IP -6 addr add 2001:db8:103::2/64 dev veth4
+ $IP -6 addr add 2001:db8:104::1/64 dev dummy1
+
+ set +ex
+}
+
+# assumption is that basic add of a single path route works
+# otherwise just adding an address on an interface is broken
+ipv6_rt_add()
+{
+ local rc
+
+ echo
+ echo "IPv6 route add / append tests"
+
+ # route add same prefix - fails with EEXISTS b/c ip adds NLM_F_EXCL
+ add_route6 "2001:db8:104::/64" "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro add 2001:db8:104::/64 via 2001:db8:103::2"
+ log_test $? 2 "Attempt to add duplicate route - gw"
+
+ # route add same prefix - fails with EEXISTS b/c ip adds NLM_F_EXCL
+ add_route6 "2001:db8:104::/64" "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro add 2001:db8:104::/64 dev veth3"
+ log_test $? 2 "Attempt to add duplicate route - dev only"
+
+ # route add same prefix - fails with EEXISTS b/c ip adds NLM_F_EXCL
+ add_route6 "2001:db8:104::/64" "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro add unreachable 2001:db8:104::/64"
+ log_test $? 2 "Attempt to add duplicate route - reject route"
+
+ # iproute2 prepend only sets NLM_F_CREATE
+ # - adds a new route; does NOT convert existing route to ECMP
+ add_route6 "2001:db8:104::/64" "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro prepend 2001:db8:104::/64 via 2001:db8:103::2"
+ check_route6 "2001:db8:104::/64 via 2001:db8:101::2 dev veth1 metric 1024 2001:db8:104::/64 via 2001:db8:103::2 dev veth3 metric 1024"
+ log_test $? 0 "Add new nexthop for existing prefix"
+
+ # route append with same prefix adds a new route
+ # - iproute2 sets NLM_F_CREATE | NLM_F_APPEND
+ add_route6 "2001:db8:104::/64" "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro append 2001:db8:104::/64 via 2001:db8:103::2"
+ check_route6 "2001:db8:104::/64 metric 1024 nexthop via 2001:db8:101::2 dev veth1 weight 1 nexthop via 2001:db8:103::2 dev veth3 weight 1"
+ log_test $? 0 "Append leg to existing route - gw"
+
+ add_route6 "2001:db8:104::/64" "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro append 2001:db8:104::/64 dev veth3"
+ check_route6 "2001:db8:104::/64 metric 1024 nexthop via 2001:db8:101::2 dev veth1 weight 1 nexthop dev veth3 weight 1"
+ log_test $? 0 "Append leg to existing route - dev only"
+
+ add_route6 "2001:db8:104::/64" "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro append unreachable 2001:db8:104::/64"
+ log_test $? 2 "Append leg to existing route - reject route"
+
+ run_cmd "$IP -6 ro flush 2001:db8:104::/64"
+ run_cmd "$IP -6 ro add unreachable 2001:db8:104::/64"
+ run_cmd "$IP -6 ro append 2001:db8:104::/64 via 2001:db8:103::2"
+ log_test $? 2 "Append leg to existing reject route - gw"
+
+ run_cmd "$IP -6 ro flush 2001:db8:104::/64"
+ run_cmd "$IP -6 ro add unreachable 2001:db8:104::/64"
+ run_cmd "$IP -6 ro append 2001:db8:104::/64 dev veth3"
+ log_test $? 2 "Append leg to existing reject route - dev only"
+
+ # insert mpath directly
+ add_route6 "2001:db8:104::/64" "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ check_route6 "2001:db8:104::/64 metric 1024 nexthop via 2001:db8:101::2 dev veth1 weight 1 nexthop via 2001:db8:103::2 dev veth3 weight 1"
+ log_test $? 0 "add multipath route"
+
+ add_route6 "2001:db8:104::/64" "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ run_cmd "$IP -6 ro add 2001:db8:104::/64 nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ log_test $? 2 "Attempt to add duplicate multipath route"
+
+ # insert of a second route without append but different metric
+ add_route6 "2001:db8:104::/64" "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro add 2001:db8:104::/64 via 2001:db8:103::2 metric 512"
+ rc=$?
+ if [ $rc -eq 0 ]; then
+ run_cmd "$IP -6 ro add 2001:db8:104::/64 via 2001:db8:103::3 metric 256"
+ rc=$?
+ fi
+ log_test $rc 0 "route add with different metrics"
+
+ run_cmd "$IP -6 ro del 2001:db8:104::/64 metric 512"
+ rc=$?
+ if [ $rc -eq 0 ]; then
+ check_route6 "2001:db8:104::/64 via 2001:db8:103::3 dev veth3 metric 256 2001:db8:104::/64 via 2001:db8:101::2 dev veth1 metric 1024"
+ rc=$?
+ fi
+ log_test $rc 0 "route delete with metric"
+}
+
+ipv6_rt_replace_single()
+{
+ # single path with single path
+ #
+ add_initial_route6 "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro replace 2001:db8:104::/64 via 2001:db8:103::2"
+ check_route6 "2001:db8:104::/64 via 2001:db8:103::2 dev veth3 metric 1024"
+ log_test $? 0 "single path with single path"
+
+ # single path with multipath
+ #
+ add_initial_route6 "nexthop via 2001:db8:101::2"
+ run_cmd "$IP -6 ro replace 2001:db8:104::/64 nexthop via 2001:db8:101::3 nexthop via 2001:db8:103::2"
+ check_route6 "2001:db8:104::/64 metric 1024 nexthop via 2001:db8:101::3 dev veth1 weight 1 nexthop via 2001:db8:103::2 dev veth3 weight 1"
+ log_test $? 0 "single path with multipath"
+
+ # single path with reject
+ #
+ add_initial_route6 "nexthop via 2001:db8:101::2"
+ run_cmd "$IP -6 ro replace unreachable 2001:db8:104::/64"
+ check_route6 "unreachable 2001:db8:104::/64 dev lo metric 1024"
+ log_test $? 0 "single path with reject route"
+
+ # single path with single path using MULTIPATH attribute
+ #
+ add_initial_route6 "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro replace 2001:db8:104::/64 nexthop via 2001:db8:103::2"
+ check_route6 "2001:db8:104::/64 via 2001:db8:103::2 dev veth3 metric 1024"
+ log_test $? 0 "single path with single path via multipath attribute"
+
+ # route replace fails - invalid nexthop
+ add_initial_route6 "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro replace 2001:db8:104::/64 via 2001:db8:104::2"
+ if [ $? -eq 0 ]; then
+ log_test 0 1 "invalid nexthop"
+ else
+ check_route6 "2001:db8:104::/64 via 2001:db8:101::2 dev veth1 metric 1024"
+ log_test $? 0 "invalid nexthop"
+ fi
+
+ # replace non-existent route
+ # - note use of change versus replace since ip adds NLM_F_CREATE
+ # for replace
+ add_initial_route6 "via 2001:db8:101::2"
+ run_cmd "$IP -6 ro change 2001:db8:105::/64 via 2001:db8:101::2"
+ log_test $? 2 "single path - replace of non-existent route"
+}
+
+ipv6_rt_replace_mpath()
+{
+ # multipath with multipath
+ add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ run_cmd "$IP -6 ro replace 2001:db8:104::/64 nexthop via 2001:db8:101::3 nexthop via 2001:db8:103::3"
+ check_route6 "2001:db8:104::/64 metric 1024 nexthop via 2001:db8:101::3 dev veth1 weight 1 nexthop via 2001:db8:103::3 dev veth3 weight 1"
+ log_test $? 0 "multipath with multipath"
+
+ # multipath with single
+ add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ run_cmd "$IP -6 ro replace 2001:db8:104::/64 via 2001:db8:101::3"
+ check_route6 "2001:db8:104::/64 via 2001:db8:101::3 dev veth1 metric 1024"
+ log_test $? 0 "multipath with single path"
+
+ # multipath with single
+ add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ run_cmd "$IP -6 ro replace 2001:db8:104::/64 nexthop via 2001:db8:101::3"
+ check_route6 "2001:db8:104::/64 via 2001:db8:101::3 dev veth1 metric 1024"
+ log_test $? 0 "multipath with single path via multipath attribute"
+
+ # multipath with reject
+ add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ run_cmd "$IP -6 ro replace unreachable 2001:db8:104::/64"
+ check_route6 "unreachable 2001:db8:104::/64 dev lo metric 1024"
+ log_test $? 0 "multipath with reject route"
+
+ # route replace fails - invalid nexthop 1
+ add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ run_cmd "$IP -6 ro replace 2001:db8:104::/64 nexthop via 2001:db8:111::3 nexthop via 2001:db8:103::3"
+ check_route6 "2001:db8:104::/64 metric 1024 nexthop via 2001:db8:101::2 dev veth1 weight 1 nexthop via 2001:db8:103::2 dev veth3 weight 1"
+ log_test $? 0 "multipath - invalid first nexthop"
+
+ # route replace fails - invalid nexthop 2
+ add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ run_cmd "$IP -6 ro replace 2001:db8:104::/64 nexthop via 2001:db8:101::3 nexthop via 2001:db8:113::3"
+ check_route6 "2001:db8:104::/64 metric 1024 nexthop via 2001:db8:101::2 dev veth1 weight 1 nexthop via 2001:db8:103::2 dev veth3 weight 1"
+ log_test $? 0 "multipath - invalid second nexthop"
+
+ # multipath non-existent route
+ add_initial_route6 "nexthop via 2001:db8:101::2 nexthop via 2001:db8:103::2"
+ run_cmd "$IP -6 ro change 2001:db8:105::/64 nexthop via 2001:db8:101::3 nexthop via 2001:db8:103::3"
+ log_test $? 2 "multipath - replace of non-existent route"
+}
+
+ipv6_rt_replace()
+{
+ echo
+ echo "IPv6 route replace tests"
+
+ ipv6_rt_replace_single
+ ipv6_rt_replace_mpath
+}
+
+ipv6_route_test()
+{
+ route_setup
+
+ ipv6_rt_add
+ ipv6_rt_replace
+
+ route_cleanup
+}
+
+################################################################################
# usage
usage()
@@ -635,6 +963,7 @@ do
fib_down_test|down) fib_down_test;;
fib_carrier_test|carrier) fib_carrier_test;;
fib_nexthop_test|nexthop) fib_nexthop_test;;
+ ipv6_route_test|ipv6_rt) ipv6_route_test;;
help) echo "Test names: $TESTS"; exit 0;;
esac
--
2.11.0
^ permalink raw reply related
* [PATCH RFC net-next 7/7] selftests: fib_tests: Add ipv4 route add append replace tests
From: David Ahern @ 2018-05-15 2:51 UTC (permalink / raw)
To: netdev; +Cc: Thomas.Winter, idosch, sharpd, roopa, David Ahern
In-Reply-To: <20180515025112.16647-1-dsahern@gmail.com>
Add IPv4 route tests covering add, append and replace permutations.
Assumes the ability to add a basic single path route works; this is
required for example when adding an address to an interface.
$ fib_tests.sh -t ipv4_rt
IPv4 route add / append tests
TEST: Attempt to add duplicate route - gw [ OK ]
TEST: Attempt to add duplicate route - dev only [ OK ]
TEST: Attempt to add duplicate route - reject route [ OK ]
TEST: Add new nexthop for existing prefix [ OK ]
TEST: Append leg to existing route - gw [ OK ]
TEST: Append leg to existing route - dev only [ OK ]
TEST: Append leg to existing route - reject route [ OK ]
TEST: Append leg to existing reject route - gw [ OK ]
TEST: Append leg to existing reject route - dev only [ OK ]
TEST: add multipath route [ OK ]
TEST: Attempt to add duplicate multipath route [ OK ]
TEST: route add with different metrics [ OK ]
TEST: route delete with metric [ OK ]
IPv4 route replace tests
TEST: single path with single path [ OK ]
TEST: single path with multipath [ OK ]
TEST: single path with reject route [ OK ]
TEST: single path with single path via multipath attribute [ OK ]
TEST: invalid nexthop [ OK ]
TEST: single path - replace of non-existent route [ OK ]
TEST: multipath with multipath [ OK ]
TEST: multipath with single path [ OK ]
TEST: multipath with single path via multipath attribute [ OK ]
TEST: multipath with reject route [ OK ]
TEST: multipath - invalid first nexthop [ OK ]
TEST: multipath - invalid second nexthop [ OK ]
TEST: multipath - replace of non-existent route [ OK ]
Signed-off-by: David Ahern <dsahern@gmail.com>
---
tools/testing/selftests/net/fib_tests.sh | 277 ++++++++++++++++++++++++++++++-
1 file changed, 276 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 20cb1d2c4e72..19248453a2ba 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -6,7 +6,7 @@
ret=0
-TESTS="unregister down carrier nexthop ipv6_rt"
+TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt"
VERBOSE=0
PAUSE_ON_FAIL=no
PAUSE=no
@@ -702,6 +702,12 @@ route_setup()
$IP -6 addr add 2001:db8:103::2/64 dev veth4
$IP -6 addr add 2001:db8:104::1/64 dev dummy1
+ $IP addr add 172.16.101.1/24 dev veth1
+ $IP addr add 172.16.101.2/24 dev veth2
+ $IP addr add 172.16.103.1/24 dev veth3
+ $IP addr add 172.16.103.2/24 dev veth4
+ $IP addr add 172.16.104.1/24 dev dummy1
+
set +ex
}
@@ -901,6 +907,274 @@ ipv6_route_test()
route_cleanup
}
+# add route for a prefix, flushing any existing routes first
+# expected to be the first step of a test
+add_route()
+{
+ local pfx="$1"
+ local nh="$2"
+ local out
+
+ if [ "$VERBOSE" = "1" ]; then
+ echo
+ echo " ##################################################"
+ echo
+ fi
+
+ run_cmd "$IP ro flush ${pfx}"
+ [ $? -ne 0 ] && exit 1
+
+ out=$($IP ro ls match ${pfx})
+ if [ -n "$out" ]; then
+ echo "Failed to flush routes for prefix used for tests."
+ exit 1
+ fi
+
+ run_cmd "$IP ro add ${pfx} ${nh}"
+ if [ $? -ne 0 ]; then
+ echo "Failed to add initial route for test."
+ exit 1
+ fi
+}
+
+# add initial route - used in replace route tests
+add_initial_route()
+{
+ add_route "172.16.104.0/24" "$1"
+}
+
+check_route()
+{
+ local pfx="172.16.104.0/24"
+ local expected="$1"
+ local out
+ local rc=0
+
+ out=$($IP ro ls match ${pfx})
+ if [ -z "${out}" ]; then
+ if [ "$VERBOSE" = "1" ]; then
+ printf "\nNo route entry found\n"
+ printf "Expected:\n"
+ printf " ${expected}\n"
+ fi
+ return 1
+ fi
+
+ # tricky way to convert output to 1-line without ip's
+ # messy '\'; this drops all extra white space
+ out=$(echo ${out})
+ if [ "${out}" != "${expected}" ]; then
+ rc=1
+ if [ "${VERBOSE}" = "1" ]; then
+ printf " Unexpected route entry. Have:\n"
+ printf " ${out}\n"
+ printf " Expected:\n"
+ printf " ${expected}\n\n"
+ fi
+ fi
+
+ return $rc
+}
+
+# assumption is that basic add of a single path route works
+# otherwise just adding an address on an interface is broken
+ipv4_rt_add()
+{
+ local rc
+
+ echo
+ echo "IPv4 route add / append tests"
+
+ # route add same prefix - fails with EEXISTS b/c ip adds NLM_F_EXCL
+ add_route "172.16.104.0/24" "via 172.16.101.2"
+ run_cmd "$IP ro add 172.16.104.0/24 via 172.16.103.2"
+ log_test $? 2 "Attempt to add duplicate route - gw"
+
+ # route add same prefix - fails with EEXISTS b/c ip adds NLM_F_EXCL
+ add_route "172.16.104.0/24" "via 172.16.101.2"
+ run_cmd "$IP ro add 172.16.104.0/24 dev veth3"
+ log_test $? 2 "Attempt to add duplicate route - dev only"
+
+ # route add same prefix - fails with EEXISTS b/c ip adds NLM_F_EXCL
+ add_route "172.16.104.0/24" "via 172.16.101.2"
+ run_cmd "$IP ro add unreachable 172.16.104.0/24"
+ log_test $? 2 "Attempt to add duplicate route - reject route"
+
+ # iproute2 prepend only sets NLM_F_CREATE
+ # - adds a new route; does NOT convert existing route to ECMP
+ add_route "172.16.104.0/24" "via 172.16.101.2"
+ run_cmd "$IP ro prepend 172.16.104.0/24 via 172.16.103.2"
+ check_route "172.16.104.0/24 via 172.16.103.2 dev veth3 172.16.104.0/24 via 172.16.101.2 dev veth1"
+ log_test $? 0 "Add new nexthop for existing prefix"
+
+ # route append with same prefix adds a new route
+ # - iproute2 sets NLM_F_CREATE | NLM_F_APPEND
+ add_route "172.16.104.0/24" "via 172.16.101.2"
+ run_cmd "$IP ro append 172.16.104.0/24 via 172.16.103.2"
+ check_route "172.16.104.0/24 via 172.16.101.2 dev veth1 172.16.104.0/24 via 172.16.103.2 dev veth3"
+ log_test $? 0 "Append leg to existing route - gw"
+
+ add_route "172.16.104.0/24" "via 172.16.101.2"
+ run_cmd "$IP ro append 172.16.104.0/24 dev veth3"
+ check_route "172.16.104.0/24 via 172.16.101.2 dev veth1 172.16.104.0/24 dev veth3 scope link"
+ log_test $? 0 "Append leg to existing route - dev only"
+
+ add_route "172.16.104.0/24" "via 172.16.101.2"
+ run_cmd "$IP ro append unreachable 172.16.104.0/24"
+ check_route "172.16.104.0/24 via 172.16.101.2 dev veth1 unreachable 172.16.104.0/24"
+ log_test $? 0 "Append leg to existing route - reject route"
+
+ run_cmd "$IP ro flush 172.16.104.0/24"
+ run_cmd "$IP ro add unreachable 172.16.104.0/24"
+ run_cmd "$IP ro append 172.16.104.0/24 via 172.16.103.2"
+ check_route "unreachable 172.16.104.0/24 172.16.104.0/24 via 172.16.103.2 dev veth3"
+ log_test $? 0 "Append leg to existing reject route - gw"
+
+ run_cmd "$IP ro flush 172.16.104.0/24"
+ run_cmd "$IP ro add unreachable 172.16.104.0/24"
+ run_cmd "$IP ro append 172.16.104.0/24 dev veth3"
+ check_route "unreachable 172.16.104.0/24 172.16.104.0/24 dev veth3 scope link"
+ log_test $? 0 "Append leg to existing reject route - dev only"
+
+ # insert mpath directly
+ add_route "172.16.104.0/24" "nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ check_route "172.16.104.0/24 nexthop via 172.16.101.2 dev veth1 weight 1 nexthop via 172.16.103.2 dev veth3 weight 1"
+ log_test $? 0 "add multipath route"
+
+ add_route "172.16.104.0/24" "nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ run_cmd "$IP ro add 172.16.104.0/24 nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ log_test $? 2 "Attempt to add duplicate multipath route"
+
+ # insert of a second route without append but different metric
+ add_route "172.16.104.0/24" "via 172.16.101.2"
+ run_cmd "$IP ro add 172.16.104.0/24 via 172.16.103.2 metric 512"
+ rc=$?
+ if [ $rc -eq 0 ]; then
+ run_cmd "$IP ro add 172.16.104.0/24 via 172.16.103.3 metric 256"
+ rc=$?
+ fi
+ log_test $rc 0 "route add with different metrics"
+
+ run_cmd "$IP ro del 172.16.104.0/24 metric 512"
+ rc=$?
+ if [ $rc -eq 0 ]; then
+ check_route "172.16.104.0/24 via 172.16.101.2 dev veth1 172.16.104.0/24 via 172.16.103.3 dev veth3 metric 256"
+ rc=$?
+ fi
+ log_test $rc 0 "route delete with metric"
+}
+
+ipv4_rt_replace_single()
+{
+ # single path with single path
+ #
+ add_initial_route "via 172.16.101.2"
+ run_cmd "$IP ro replace 172.16.104.0/24 via 172.16.103.2"
+ check_route "172.16.104.0/24 via 172.16.103.2 dev veth3"
+ log_test $? 0 "single path with single path"
+
+ # single path with multipath
+ #
+ add_initial_route "nexthop via 172.16.101.2"
+ run_cmd "$IP ro replace 172.16.104.0/24 nexthop via 172.16.101.3 nexthop via 172.16.103.2"
+ check_route "172.16.104.0/24 nexthop via 172.16.101.3 dev veth1 weight 1 nexthop via 172.16.103.2 dev veth3 weight 1"
+ log_test $? 0 "single path with multipath"
+
+ # single path with reject
+ #
+ add_initial_route "nexthop via 172.16.101.2"
+ run_cmd "$IP ro replace unreachable 172.16.104.0/24"
+ check_route "unreachable 172.16.104.0/24"
+ log_test $? 0 "single path with reject route"
+
+ # single path with single path using MULTIPATH attribute
+ #
+ add_initial_route "via 172.16.101.2"
+ run_cmd "$IP ro replace 172.16.104.0/24 nexthop via 172.16.103.2"
+ check_route "172.16.104.0/24 via 172.16.103.2 dev veth3"
+ log_test $? 0 "single path with single path via multipath attribute"
+
+ # route replace fails - invalid nexthop
+ add_initial_route "via 172.16.101.2"
+ run_cmd "$IP ro replace 172.16.104.0/24 via 2001:db8:104::2"
+ if [ $? -eq 0 ]; then
+ log_test 0 1 "invalid nexthop"
+ else
+ check_route "172.16.104.0/24 via 172.16.101.2 dev veth1"
+ log_test $? 0 "invalid nexthop"
+ fi
+
+ # replace non-existent route
+ # - note use of change versus replace since ip adds NLM_F_CREATE
+ # for replace
+ add_initial_route "via 172.16.101.2"
+ run_cmd "$IP ro change 172.16.105.0/24 via 172.16.101.2"
+ log_test $? 2 "single path - replace of non-existent route"
+}
+
+ipv4_rt_replace_mpath()
+{
+ # multipath with multipath
+ add_initial_route "nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ run_cmd "$IP ro replace 172.16.104.0/24 nexthop via 172.16.101.3 nexthop via 172.16.103.3"
+ check_route "172.16.104.0/24 nexthop via 172.16.101.3 dev veth1 weight 1 nexthop via 172.16.103.3 dev veth3 weight 1"
+ log_test $? 0 "multipath with multipath"
+
+ # multipath with single
+ add_initial_route "nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ run_cmd "$IP ro replace 172.16.104.0/24 via 172.16.101.3"
+ check_route "172.16.104.0/24 via 172.16.101.3 dev veth1"
+ log_test $? 0 "multipath with single path"
+
+ # multipath with single
+ add_initial_route "nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ run_cmd "$IP ro replace 172.16.104.0/24 nexthop via 172.16.101.3"
+ check_route "172.16.104.0/24 via 172.16.101.3 dev veth1"
+ log_test $? 0 "multipath with single path via multipath attribute"
+
+ # multipath with reject
+ add_initial_route "nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ run_cmd "$IP ro replace unreachable 172.16.104.0/24"
+ check_route "unreachable 172.16.104.0/24"
+ log_test $? 0 "multipath with reject route"
+
+ # route replace fails - invalid nexthop 1
+ add_initial_route "nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ run_cmd "$IP ro replace 172.16.104.0/24 nexthop via 172.16.111.3 nexthop via 172.16.103.3"
+ check_route "172.16.104.0/24 nexthop via 172.16.101.2 dev veth1 weight 1 nexthop via 172.16.103.2 dev veth3 weight 1"
+ log_test $? 0 "multipath - invalid first nexthop"
+
+ # route replace fails - invalid nexthop 2
+ add_initial_route "nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ run_cmd "$IP ro replace 172.16.104.0/24 nexthop via 172.16.101.3 nexthop via 172.16.113.3"
+ check_route "172.16.104.0/24 nexthop via 172.16.101.2 dev veth1 weight 1 nexthop via 172.16.103.2 dev veth3 weight 1"
+ log_test $? 0 "multipath - invalid second nexthop"
+
+ # multipath non-existent route
+ add_initial_route "nexthop via 172.16.101.2 nexthop via 172.16.103.2"
+ run_cmd "$IP ro change 172.16.105.0/24 nexthop via 172.16.101.3 nexthop via 172.16.103.3"
+ log_test $? 2 "multipath - replace of non-existent route"
+}
+
+ipv4_rt_replace()
+{
+ echo
+ echo "IPv4 route replace tests"
+
+ ipv4_rt_replace_single
+ ipv4_rt_replace_mpath
+}
+
+ipv4_route_test()
+{
+ route_setup
+
+ ipv4_rt_add
+ ipv4_rt_replace
+
+ route_cleanup
+}
+
################################################################################
# usage
@@ -964,6 +1238,7 @@ do
fib_carrier_test|carrier) fib_carrier_test;;
fib_nexthop_test|nexthop) fib_nexthop_test;;
ipv6_route_test|ipv6_rt) ipv6_route_test;;
+ ipv4_route_test|ipv4_rt) ipv4_route_test;;
help) echo "Test names: $TESTS"; exit 0;;
esac
--
2.11.0
^ permalink raw reply related
* Re: [PATCH] lib/rhashtable: reorder some inititalization sequences
From: David Miller @ 2018-05-15 2:52 UTC (permalink / raw)
To: dave; +Cc: akpm, tgraf, herbert, netdev, linux-kernel, dbueso
In-Reply-To: <20180514151332.31352-1-dave@stgolabs.net>
From: Davidlohr Bueso <dave@stgolabs.net>
Date: Mon, 14 May 2018 08:13:32 -0700
> rhashtable_init() allocates memory at the very end of the
> call, once everything is setup; with the exception of the
> nelems parameter. However, unless the user is doing something
> bogus with params for which -EINVAL is returned, memory
> allocation is the only operation that can trigger the call
> to fail.
>
> Thus move bucket_table_alloc() up such that we fail back to
> the caller asap, instead of doing useless checks. This is
> safe as the the table allocation isn't using the halfly
> setup 'ht' structure and bucket_table_alloc() call chain only
> ends up using the ht->nulls_base member in INIT_RHT_NULLS_HEAD.
>
> Also move the locking initialization down to the end.
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
The user potentially "doing something bogus" is why the most
expensive part of the initialization (the memory allocation)
is done after everything else is validated.
I think it's best to keep things as-is.
^ permalink raw reply
* Re: [PATCH net-next v3 0/8] sctp: refactor sctp_outq_flush
From: David Miller @ 2018-05-15 2:57 UTC (permalink / raw)
To: marcelo.leitner; +Cc: netdev, linux-sctp, nhorman, lucien.xin, vyasevich
In-Reply-To: <cover.1526318522.git.marcelo.leitner@gmail.com>
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Mon, 14 May 2018 14:34:35 -0300
> Currently sctp_outq_flush does many different things and arguably
> unrelated, such as doing transport selection and outq dequeueing.
>
> This patchset refactors it into smaller and more dedicated functions.
> The end behavior should be the same.
>
> The next patchset will rework the function parameters.
>
> Changes since v1:
> - fix build issues on patches 3 and 4, and updated 5 and 8 because of
> it.
>
> Changes since v2:
> - fixed panic if building with just up to patch 3 applied
Series applied, thanks.
^ permalink raw reply
* Re: [RFC bpf-next 06/11] bpf: Add reference tracking to verifier
From: Alexei Starovoitov @ 2018-05-15 3:04 UTC (permalink / raw)
To: Joe Stringer; +Cc: daniel, netdev, ast, john.fastabend, kafai
In-Reply-To: <20180509210709.7201-7-joe@wand.net.nz>
On Wed, May 09, 2018 at 02:07:04PM -0700, Joe Stringer wrote:
> Allow helper functions to acquire a reference and return it into a
> register. Specific pointer types such as the PTR_TO_SOCKET will
> implicitly represent such a reference. The verifier must ensure that
> these references are released exactly once in each path through the
> program.
>
> To achieve this, this commit assigns an id to the pointer and tracks it
> in the 'bpf_func_state', then when the function or program exits,
> verifies that all of the acquired references have been freed. When the
> pointer is passed to a function that frees the reference, it is removed
> from the 'bpf_func_state` and all existing copies of the pointer in
> registers are marked invalid.
>
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> include/linux/bpf_verifier.h | 18 ++-
> kernel/bpf/verifier.c | 295 ++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 292 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 9dcd87f1d322..8dbee360b3ec 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -104,6 +104,11 @@ struct bpf_stack_state {
> u8 slot_type[BPF_REG_SIZE];
> };
>
> +struct bpf_reference_state {
> + int id;
> + int insn_idx; /* allocation insn */
the insn_idx is for more verbose messages, right?
It doesn't seem to affect the safety of algorithm.
Please add a comment to clarify that.
> +};
> +
> /* state of the program:
> * type of all registers and stack info
> */
> @@ -122,7 +127,9 @@ struct bpf_func_state {
> */
> u32 subprogno;
>
> - /* should be second to last. See copy_func_state() */
> + /* The following fields should be last. See copy_func_state() */
> + int acquired_refs;
> + struct bpf_reference_state *refs;
> int allocated_stack;
> struct bpf_stack_state *stack;
> };
> @@ -218,11 +225,16 @@ void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt,
> __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
> const char *fmt, ...);
>
> -static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
> +static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env)
> {
> struct bpf_verifier_state *cur = env->cur_state;
>
> - return cur->frame[cur->curframe]->regs;
> + return cur->frame[cur->curframe];
> +}
> +
> +static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
> +{
> + return cur_func(env)->regs;
> }
>
> int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env);
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f426ebf2b6bf..92b9a5dc465a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1,5 +1,6 @@
> /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> * Copyright (c) 2016 Facebook
> + * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
> *
> * This program is free software; you can redistribute it and/or
> * modify it under the terms of version 2 of the GNU General Public
> @@ -140,6 +141,18 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = {
> *
> * After the call R0 is set to return type of the function and registers R1-R5
> * are set to NOT_INIT to indicate that they are no longer readable.
> + *
> + * The following reference types represent a potential reference to a kernel
> + * resource which, after first being allocated, must be checked and freed by
> + * the BPF program:
> + * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET
> + *
> + * When the verifier sees a helper call return a reference type, it allocates a
> + * pointer id for the reference and stores it in the current function state.
> + * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into
> + * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type
> + * passes through a NULL-check conditional. For the branch wherein the state is
> + * changed to CONST_IMM, the verifier releases the reference.
> */
>
> /* verifier_state + insn_idx are pushed to stack when branch is encountered */
> @@ -229,7 +242,42 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
>
> static bool reg_type_may_be_null(enum bpf_reg_type type)
> {
> - return type == PTR_TO_MAP_VALUE_OR_NULL;
> + return type == PTR_TO_MAP_VALUE_OR_NULL ||
> + type == PTR_TO_SOCKET_OR_NULL;
> +}
> +
> +static bool type_is_refcounted(enum bpf_reg_type type)
> +{
> + return type == PTR_TO_SOCKET;
> +}
> +
> +static bool type_is_refcounted_or_null(enum bpf_reg_type type)
> +{
> + return type == PTR_TO_SOCKET || type == PTR_TO_SOCKET_OR_NULL;
> +}
> +
> +static bool reg_is_refcounted(const struct bpf_reg_state *reg)
> +{
> + return type_is_refcounted(reg->type);
> +}
> +
> +static bool reg_is_refcounted_or_null(const struct bpf_reg_state *reg)
> +{
> + return type_is_refcounted_or_null(reg->type);
> +}
> +
> +static bool arg_type_is_refcounted(enum bpf_arg_type type)
> +{
> + return type == ARG_PTR_TO_SOCKET;
> +}
> +
> +/* Determine whether the function releases some resources allocated by another
> + * function call. The first reference type argument will be assumed to be
> + * released by release_reference().
> + */
> +static bool is_release_function(enum bpf_func_id func_id)
> +{
> + return false;
> }
>
> /* string representation of 'enum bpf_reg_type' */
> @@ -344,6 +392,12 @@ static void print_verifier_state(struct bpf_verifier_env *env,
> if (state->stack[i].slot_type[0] == STACK_ZERO)
> verbose(env, " fp%d=0", (-i - 1) * BPF_REG_SIZE);
> }
> + if (state->acquired_refs && state->refs[0].id) {
> + verbose(env, " refs=%d", state->refs[0].id);
> + for (i = 1; i < state->acquired_refs; i++)
> + if (state->refs[i].id)
> + verbose(env, ",%d", state->refs[i].id);
> + }
> verbose(env, "\n");
> }
>
> @@ -362,6 +416,8 @@ static int copy_##NAME##_state(struct bpf_func_state *dst, \
> sizeof(*src->FIELD) * (src->COUNT / SIZE)); \
> return 0; \
> }
> +/* copy_reference_state() */
> +COPY_STATE_FN(reference, acquired_refs, refs, 1)
> /* copy_stack_state() */
> COPY_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
> #undef COPY_STATE_FN
> @@ -400,6 +456,8 @@ static int realloc_##NAME##_state(struct bpf_func_state *state, int size, \
> state->FIELD = new_##FIELD; \
> return 0; \
> }
> +/* realloc_reference_state() */
> +REALLOC_STATE_FN(reference, acquired_refs, refs, 1)
> /* realloc_stack_state() */
> REALLOC_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
> #undef REALLOC_STATE_FN
> @@ -411,16 +469,89 @@ REALLOC_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE)
> * which realloc_stack_state() copies over. It points to previous
> * bpf_verifier_state which is never reallocated.
> */
> -static int realloc_func_state(struct bpf_func_state *state, int size,
> - bool copy_old)
> +static int realloc_func_state(struct bpf_func_state *state, int stack_size,
> + int refs_size, bool copy_old)
> {
> - return realloc_stack_state(state, size, copy_old);
> + int err = realloc_reference_state(state, refs_size, copy_old);
> + if (err)
> + return err;
> + return realloc_stack_state(state, stack_size, copy_old);
> +}
> +
> +/* Acquire a pointer id from the env and update the state->refs to include
> + * this new pointer reference.
> + * On success, returns a valid pointer id to associate with the register
> + * On failure, returns a negative errno.
> + */
> +static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx)
> +{
> + struct bpf_func_state *state = cur_func(env);
> + int new_ofs = state->acquired_refs;
> + int id, err;
> +
> + err = realloc_reference_state(state, state->acquired_refs + 1, true);
> + if (err)
> + return err;
> + id = ++env->id_gen;
> + state->refs[new_ofs].id = id;
> + state->refs[new_ofs].insn_idx = insn_idx;
I thought that we may avoid this extra 'ref_state' array if we store
'id' into 'aux' array which is one to one to array of instructions
and avoid this expensive reallocs, but then I realized we can go
through the same instruction that returns a pointer to socket
multiple times and every time it needs to be different 'id' and
tracked indepdently, so yeah. All that infra is necessary.
Would be good to document the algorithm a bit more.
> +
> + return id;
> +}
> +
> +/* release function corresponding to acquire_reference_state(). Idempotent. */
> +static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
> +{
> + int i, last_idx;
> +
> + if (!ptr_id)
> + return 0;
> +
> + last_idx = state->acquired_refs - 1;
> + for (i = 0; i < state->acquired_refs; i++) {
> + if (state->refs[i].id == ptr_id) {
> + if (last_idx && i != last_idx)
> + memcpy(&state->refs[i], &state->refs[last_idx],
> + sizeof(*state->refs));
> + memset(&state->refs[last_idx], 0, sizeof(*state->refs));
> + state->acquired_refs--;
> + return 0;
> + }
> + }
> + return -EFAULT;
> +}
> +
> +/* variation on the above for cases where we expect that there must be an
> + * outstanding reference for the specified ptr_id.
> + */
> +static int release_reference_state(struct bpf_verifier_env *env, int ptr_id)
> +{
> + struct bpf_func_state *state = cur_func(env);
> + int err;
> +
> + err = __release_reference_state(state, ptr_id);
> + if (WARN_ON_ONCE(err != 0))
> + verbose(env, "verifier internal error: can't release reference\n");
> + return err;
> +}
> +
> +static int transfer_reference_state(struct bpf_func_state *dst,
> + struct bpf_func_state *src)
> +{
> + int err = realloc_reference_state(dst, src->acquired_refs, false);
> + if (err)
> + return err;
> + err = copy_reference_state(dst, src);
> + if (err)
> + return err;
> + return 0;
> }
>
> static void free_func_state(struct bpf_func_state *state)
> {
> if (!state)
> return;
> + kfree(state->refs);
> kfree(state->stack);
> kfree(state);
> }
> @@ -446,10 +577,14 @@ static int copy_func_state(struct bpf_func_state *dst,
> {
> int err;
>
> - err = realloc_func_state(dst, src->allocated_stack, false);
> + err = realloc_func_state(dst, src->allocated_stack, src->acquired_refs,
> + false);
> + if (err)
> + return err;
> + memcpy(dst, src, offsetof(struct bpf_func_state, acquired_refs));
> + err = copy_reference_state(dst, src);
> if (err)
> return err;
> - memcpy(dst, src, offsetof(struct bpf_func_state, allocated_stack));
> return copy_stack_state(dst, src);
> }
>
> @@ -1019,7 +1154,7 @@ static int check_stack_write(struct bpf_verifier_env *env,
> enum bpf_reg_type type;
>
> err = realloc_func_state(state, round_up(slot + 1, BPF_REG_SIZE),
> - true);
> + state->acquired_refs, true);
> if (err)
> return err;
> /* caller checked that off % size == 0 and -MAX_BPF_STACK <= off < 0,
> @@ -2259,10 +2394,32 @@ static bool check_arg_pair_ok(const struct bpf_func_proto *fn)
> return true;
> }
>
> +static bool check_refcount_ok(const struct bpf_func_proto *fn)
> +{
> + int count = 0;
> +
> + if (arg_type_is_refcounted(fn->arg1_type))
> + count++;
> + if (arg_type_is_refcounted(fn->arg2_type))
> + count++;
> + if (arg_type_is_refcounted(fn->arg3_type))
> + count++;
> + if (arg_type_is_refcounted(fn->arg4_type))
> + count++;
> + if (arg_type_is_refcounted(fn->arg5_type))
> + count++;
> +
> + /* We only support one arg being unreferenced at the moment,
> + * which is sufficient for the helper functions we have right now.
> + */
> + return count <= 1;
> +}
> +
> static int check_func_proto(const struct bpf_func_proto *fn)
> {
> return check_raw_mode_ok(fn) &&
> - check_arg_pair_ok(fn) ? 0 : -EINVAL;
> + check_arg_pair_ok(fn) &&
> + check_refcount_ok(fn) ? 0 : -EINVAL;
> }
>
> /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
> @@ -2295,12 +2452,57 @@ static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
> __clear_all_pkt_pointers(env, vstate->frame[i]);
> }
>
> +static void release_reg_references(struct bpf_verifier_env *env,
> + struct bpf_func_state *state, int id)
> +{
> + struct bpf_reg_state *regs = state->regs, *reg;
> + int i;
> +
> + for (i = 0; i < MAX_BPF_REG; i++)
> + if (regs[i].id == id)
> + mark_reg_unknown(env, regs, i);
> +
> + for_each_spilled_reg(i, state, reg) {
> + if (!reg)
> + continue;
> + if (reg_is_refcounted(reg) && reg->id == id)
> + __mark_reg_unknown(reg);
> + }
> +}
> +
> +/* The pointer with the specified id has released its reference to kernel
> + * resources. Identify all copies of the same pointer and clear the reference.
> + */
> +static int release_reference(struct bpf_verifier_env *env)
> +{
> + struct bpf_verifier_state *vstate = env->cur_state;
> + struct bpf_reg_state *regs = cur_regs(env);
> + int i, ptr_id = 0;
> +
> + for (i = BPF_REG_1; i < BPF_REG_6; i++) {
> + if (reg_is_refcounted(®s[i])) {
> + ptr_id = regs[i].id;
> + break;
> + }
> + }
> + if (WARN_ON_ONCE(!ptr_id)) {
> + /* references must be special pointer types that are checked
> + * against argument requirements for the release function. */
> + verbose(env, "verifier internal error: can't locate refcounted arg\n");
> + return -EFAULT;
> + }
> + for (i = 0; i <= vstate->curframe; i++)
> + release_reg_references(env, vstate->frame[i], ptr_id);
> +
> + return release_reference_state(env, ptr_id);
> +}
> +
> static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> int *insn_idx)
> {
> struct bpf_verifier_state *state = env->cur_state;
> struct bpf_func_state *caller, *callee;
> - int i, subprog, target_insn;
> + int i, err, subprog, target_insn;
>
> if (state->curframe + 1 >= MAX_CALL_FRAMES) {
> verbose(env, "the call stack of %d frames is too deep\n",
> @@ -2338,6 +2540,11 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> state->curframe + 1 /* frameno within this callchain */,
> subprog /* subprog number within this prog */);
>
> + /* Transfer references to the callee */
> + err = transfer_reference_state(callee, caller);
> + if (err)
> + return err;
> +
> /* copy r1 - r5 args that callee can access */
> for (i = BPF_REG_1; i <= BPF_REG_5; i++)
> callee->regs[i] = caller->regs[i];
> @@ -2368,6 +2575,7 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
> struct bpf_verifier_state *state = env->cur_state;
> struct bpf_func_state *caller, *callee;
> struct bpf_reg_state *r0;
> + int err;
>
> callee = state->frame[state->curframe];
> r0 = &callee->regs[BPF_REG_0];
> @@ -2387,6 +2595,11 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
> /* return to the caller whatever r0 had in the callee */
> caller->regs[BPF_REG_0] = *r0;
>
> + /* Transfer references to the caller */
> + err = transfer_reference_state(caller, callee);
> + if (err)
> + return err;
> +
> *insn_idx = callee->callsite + 1;
> if (env->log.level) {
> verbose(env, "returning from callee:\n");
> @@ -2498,6 +2711,15 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
> return err;
> }
>
> + /* If the function is a release() function, mark all copies of the same
> + * pointer as "freed" in all registers and in the stack.
> + */
> + if (is_release_function(func_id)) {
> + err = release_reference(env);
I think this can be improved if check_func_arg() stores ptr_id into meta.
Then this loop
for (i = BPF_REG_1; i < BPF_REG_6; i++) {
if (reg_is_refcounted(®s[i])) {
in release_reference() won't be needed.
Also the macros from the previous patch look ugly, but considering this patch
I guess it's justified. At least I don't see a better way of doing it.
^ permalink raw reply
* Re: [PATCH ghak81 RFC V2 3/5] audit: use inline function to get audit context
From: Richard Guy Briggs @ 2018-05-15 3:05 UTC (permalink / raw)
To: Paul Moore
Cc: Linux-Audit Mailing List, LKML,
Linux NetDev Upstream Mailing List, Netfilter Devel List,
Linux Security Module list, Integrity Measurement Architecture,
SElinux list, Eric Paris, Steve Grubb, Ingo Molnar, David Howells
In-Reply-To: <CAHC9VhQXr76FXPCe8vUWNwa2VshcPg1mZmMY7M30H1F6w30rFA@mail.gmail.com>
On 2018-05-14 17:44, Paul Moore wrote:
> On Sat, May 12, 2018 at 9:58 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > Recognizing that the audit context is an internal audit value, use an
> > access function to retrieve the audit context pointer for the task
> > rather than reaching directly into the task struct to get it.
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> > include/linux/audit.h | 14 ++++++--
> > include/net/xfrm.h | 2 +-
> > kernel/audit.c | 6 ++--
> > kernel/audit_watch.c | 2 +-
> > kernel/auditsc.c | 64 +++++++++++++++++-------------------
> > net/bridge/netfilter/ebtables.c | 2 +-
> > net/core/dev.c | 2 +-
> > net/netfilter/x_tables.c | 2 +-
> > net/netlabel/netlabel_user.c | 2 +-
> > security/integrity/ima/ima_api.c | 2 +-
> > security/integrity/integrity_audit.c | 2 +-
> > security/lsm_audit.c | 2 +-
> > security/selinux/hooks.c | 4 +--
> > security/selinux/selinuxfs.c | 6 ++--
> > security/selinux/ss/services.c | 12 +++----
> > 15 files changed, 64 insertions(+), 60 deletions(-)
>
> Merged, but there was some fuzz due to the missing 1/5 patch and a
> handfull of checkpatch.pl fixes. Please take a look at the commit in
> the audit/next branch and if anything looks awry please send a patch
> to fix it.
Some of that fuzz was due to the two patches (ghak46/47) that went
through the xelinux tree... There will be a merge conflict.
Otherwise, looks ok.
> paul moore
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [PATCH net-next v3 0/3] sctp: Introduce sctp_flush_ctx
From: David Miller @ 2018-05-15 3:15 UTC (permalink / raw)
To: marcelo.leitner; +Cc: netdev, linux-sctp, nhorman, lucien.xin, vyasevich
In-Reply-To: <cover.1526319083.git.marcelo.leitner@gmail.com>
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Mon, 14 May 2018 14:35:17 -0300
> This struct will hold all the context used during the outq flush, so we
> don't have to pass lots of pointers all around.
>
> Checked on x86_64, the compiler inlines all these functions and there is no
> derreference added because of the struct.
>
> This patchset depends on 'sctp: refactor sctp_outq_flush'
>
> Changes since v1:
> - updated to build on top of v2 of 'sctp: refactor sctp_outq_flush'
>
> Changes since v2:
> - fixed a rebase issue which reverted a change in patch 2.
> - rebased on v3 of 'sctp: refactor sctp_outq_flush'
Also applied, thanks Marcelo.
^ permalink raw reply
* Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF
From: Alexei Starovoitov @ 2018-05-15 3:16 UTC (permalink / raw)
To: Joe Stringer; +Cc: Martin KaFai Lau, daniel, netdev, ast, john fastabend
In-Reply-To: <CAOftzPinEyoaaokgsssA809LQ571x_u8hhA1rXw0HjN8a5O-7w@mail.gmail.com>
On Fri, May 11, 2018 at 05:54:33PM -0700, Joe Stringer wrote:
> On 11 May 2018 at 14:41, Martin KaFai Lau <kafai@fb.com> wrote:
> > On Fri, May 11, 2018 at 02:08:01PM -0700, Joe Stringer wrote:
> >> On 10 May 2018 at 22:00, Martin KaFai Lau <kafai@fb.com> wrote:
> >> > On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote:
> >> >> This patch adds a new BPF helper function, sk_lookup() which allows BPF
> >> >> programs to find out if there is a socket listening on this host, and
> >> >> returns a socket pointer which the BPF program can then access to
> >> >> determine, for instance, whether to forward or drop traffic. sk_lookup()
> >> >> takes a reference on the socket, so when a BPF program makes use of this
> >> >> function, it must subsequently pass the returned pointer into the newly
> >> >> added sk_release() to return the reference.
> >> >>
> >> >> By way of example, the following pseudocode would filter inbound
> >> >> connections at XDP if there is no corresponding service listening for
> >> >> the traffic:
> >> >>
> >> >> struct bpf_sock_tuple tuple;
> >> >> struct bpf_sock_ops *sk;
> >> >>
> >> >> populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> >> >> sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
> >> >> if (!sk) {
> >> >> // Couldn't find a socket listening for this traffic. Drop.
> >> >> return TC_ACT_SHOT;
> >> >> }
> >> >> bpf_sk_release(sk, 0);
> >> >> return TC_ACT_OK;
> >> >>
> >> >> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> >> >> ---
> >>
> >> ...
> >>
> >> >> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
> >> >> };
> >> >> #endif
> >> >>
> >> >> +struct sock *
> >> >> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) {
> >> > Would it be possible to have another version that
> >> > returns a sk without taking its refcnt?
> >> > It may have performance benefit.
> >>
> >> Not really. The sockets are not RCU-protected, and established sockets
> >> may be torn down without notice. If we don't take a reference, there's
> >> no guarantee that the socket will continue to exist for the duration
> >> of running the BPF program.
> >>
> >> From what I follow, the comment below has a hidden implication which
> >> is that sockets without SOCK_RCU_FREE, eg established sockets, may be
> >> directly freed regardless of RCU.
> > Right, SOCK_RCU_FREE sk is the one I am concern about.
> > For example, TCP_LISTEN socket does not require taking a refcnt
> > now. Doing a bpf_sk_lookup() may have a rather big
> > impact on handling TCP syn flood. or the usual intention
> > is to redirect instead of passing it up to the stack?
>
> I see, if you're only interested in listen sockets then probably this
> series could be extended with a new flag, eg something like
> BPF_F_SK_FIND_LISTENERS which restricts the set of possible sockets
> found to only listen sockets, then the implementation would call into
> __inet_lookup_listener() instead of inet_lookup(). The presence of
> that flag in the relevant register during CALL instruction would show
> that the verifier should not reference-track the result, then there'd
> need to be a check on the release to ensure that this unreferenced
> socket is never released. Just a thought, completely untested and I
> could still be missing some detail..
>
> That said, I don't completely follow how you would expect to handle
> the traffic for sockets that are already established - the helper
> would no longer find those sockets, so you wouldn't know whether to
> pass the traffic up the stack for established traffic or not.
I think Martin has a valid concern here that if somebody starts using
this helper on the rx traffic the bpf program (via these two new
helpers) will be doing refcnt++ and refcnt-- even for listener
sockets which will cause synflood to suffer.
One can argue that this is bpf author mistake, but without fixes
(and api changes) to the helper the programmer doesn't really have a way
of avoiding this situation.
Also udp sockets don't need refcnt at all.
How about we split this single helper into three:
- bpf_sk_lookup_tcp_established() that will return refcnt-ed socket
and has to be bpf_sk_release()d by the program.
- bpf_sk_lookup_tcp_listener() that doesn't refcnt, since progs
run in rcu.
- bpf_sk_lookup_udp() that also doesn't refcnt.
The logic you want to put into this helper can be easily
replicated with these three helpers and the whole thing will
be much faster.
Thoughts?
^ permalink raw reply
* Re: [RFC bpf-next 11/11] Documentation: Describe bpf reference tracking
From: Alexei Starovoitov @ 2018-05-15 3:19 UTC (permalink / raw)
To: Joe Stringer; +Cc: daniel, netdev, ast, john.fastabend, kafai
In-Reply-To: <20180509210709.7201-12-joe@wand.net.nz>
On Wed, May 09, 2018 at 02:07:09PM -0700, Joe Stringer wrote:
> Signed-off-by: Joe Stringer <joe@wand.net.nz>
> ---
> Documentation/networking/filter.txt | 64 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 64 insertions(+)
>
> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
> index 5032e1263bc9..77be17977bc5 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -1125,6 +1125,14 @@ pointer type. The types of pointers describe their base, as follows:
> PTR_TO_STACK Frame pointer.
> PTR_TO_PACKET skb->data.
> PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden.
> + PTR_TO_SOCKET Pointer to struct bpf_sock_ops, implicitly refcounted.
> + PTR_TO_SOCKET_OR_NULL
> + Either a pointer to a socket, or NULL; socket lookup
> + returns this type, which becomes a PTR_TO_SOCKET when
> + checked != NULL. PTR_TO_SOCKET is reference-counted,
> + so programs must release the reference through the
> + socket release function before the end of the program.
> + Arithmetic on these pointers is forbidden.
> However, a pointer may be offset from this base (as a result of pointer
> arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
> offset'. The former is used when an exactly-known value (e.g. an immediate
> @@ -1168,6 +1176,13 @@ over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting
> pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
> bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
> that pointer are safe.
> +The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
> +to all copies of the pointer returned from a socket lookup. This has similar
> +behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
> +it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
> +represents a reference to the corresponding 'struct sock'. To ensure that the
> +reference is not leaked, it is imperative to NULL-check the reference and in
> +the non-NULL case, and pass the valid reference to the socket release function.
>
> Direct packet access
> --------------------
> @@ -1441,6 +1456,55 @@ Error:
> 8: (7a) *(u64 *)(r0 +0) = 1
> R0 invalid mem access 'imm'
>
> +Program that performs a socket lookup then sets the pointer to NULL without
> +checking it:
> +value:
> + BPF_MOV64_IMM(BPF_REG_2, 0),
> + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
> + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> + BPF_MOV64_IMM(BPF_REG_3, 4),
> + BPF_MOV64_IMM(BPF_REG_4, 0),
> + BPF_MOV64_IMM(BPF_REG_5, 0),
> + BPF_EMIT_CALL(BPF_FUNC_sk_lookup),
> + BPF_MOV64_IMM(BPF_REG_0, 0),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (b7) r2 = 0
> + 1: (63) *(u32 *)(r10 -8) = r2
> + 2: (bf) r2 = r10
> + 3: (07) r2 += -8
> + 4: (b7) r3 = 4
> + 5: (b7) r4 = 0
> + 6: (b7) r5 = 0
> + 7: (85) call bpf_sk_lookup#65
> + 8: (b7) r0 = 0
> + 9: (95) exit
> + Unreleased reference id=1, alloc_insn=7
> +
> +Program that performs a socket lookup but does not NULL-check the returned
> +value:
> + BPF_MOV64_IMM(BPF_REG_2, 0),
> + BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
> + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> + BPF_MOV64_IMM(BPF_REG_3, 4),
> + BPF_MOV64_IMM(BPF_REG_4, 0),
> + BPF_MOV64_IMM(BPF_REG_5, 0),
> + BPF_EMIT_CALL(BPF_FUNC_sk_lookup),
> + BPF_EXIT_INSN(),
> +Error:
> + 0: (b7) r2 = 0
> + 1: (63) *(u32 *)(r10 -8) = r2
> + 2: (bf) r2 = r10
> + 3: (07) r2 += -8
> + 4: (b7) r3 = 4
> + 5: (b7) r4 = 0
> + 6: (b7) r5 = 0
> + 7: (85) call bpf_sk_lookup#65
> + 8: (95) exit
> + Unreleased reference id=1, alloc_insn=7
Nice. Thank you for updating this doc. We haven't touched it in long time.
It probably long overdue for complete overhaul.
^ permalink raw reply
* Re: [PATCH ghak81 RFC V2 3/5] audit: use inline function to get audit context
From: Richard Guy Briggs @ 2018-05-15 3:28 UTC (permalink / raw)
To: Paul Moore
Cc: Linux-Audit Mailing List, LKML,
Linux NetDev Upstream Mailing List, Netfilter Devel List,
Linux Security Module list, Integrity Measurement Architecture,
SElinux list, Eric Paris, Steve Grubb, Ingo Molnar, David Howells
In-Reply-To: <20180515030545.7oiyz33rzoj6sxs5@madcap2.tricolour.ca>
On 2018-05-14 23:05, Richard Guy Briggs wrote:
> On 2018-05-14 17:44, Paul Moore wrote:
> > On Sat, May 12, 2018 at 9:58 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > > Recognizing that the audit context is an internal audit value, use an
> > > access function to retrieve the audit context pointer for the task
> > > rather than reaching directly into the task struct to get it.
> > >
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > ---
> > > include/linux/audit.h | 14 ++++++--
> > > include/net/xfrm.h | 2 +-
> > > kernel/audit.c | 6 ++--
> > > kernel/audit_watch.c | 2 +-
> > > kernel/auditsc.c | 64 +++++++++++++++++-------------------
> > > net/bridge/netfilter/ebtables.c | 2 +-
> > > net/core/dev.c | 2 +-
> > > net/netfilter/x_tables.c | 2 +-
> > > net/netlabel/netlabel_user.c | 2 +-
> > > security/integrity/ima/ima_api.c | 2 +-
> > > security/integrity/integrity_audit.c | 2 +-
> > > security/lsm_audit.c | 2 +-
> > > security/selinux/hooks.c | 4 +--
> > > security/selinux/selinuxfs.c | 6 ++--
> > > security/selinux/ss/services.c | 12 +++----
> > > 15 files changed, 64 insertions(+), 60 deletions(-)
> >
> > Merged, but there was some fuzz due to the missing 1/5 patch and a
> > handfull of checkpatch.pl fixes. Please take a look at the commit in
> > the audit/next branch and if anything looks awry please send a patch
> > to fix it.
>
> Some of that fuzz was due to the two patches (ghak46/47) that went
> through the xelinux tree... There will be a merge conflict.
>
> Otherwise, looks ok.
Spoke too soon, missed one from the new seccomp actions_logged...
Patch pending...
> > paul moore
>
> - RGB
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [PATCH] lib/rhashtable: reorder some inititalization sequences
From: Herbert Xu @ 2018-05-15 3:37 UTC (permalink / raw)
To: David Miller; +Cc: dave, akpm, tgraf, netdev, linux-kernel, dbueso
In-Reply-To: <20180514.225213.1789810083198383905.davem@davemloft.net>
On Mon, May 14, 2018 at 10:52:13PM -0400, David Miller wrote:
> From: Davidlohr Bueso <dave@stgolabs.net>
> Date: Mon, 14 May 2018 08:13:32 -0700
>
> > rhashtable_init() allocates memory at the very end of the
> > call, once everything is setup; with the exception of the
> > nelems parameter. However, unless the user is doing something
> > bogus with params for which -EINVAL is returned, memory
> > allocation is the only operation that can trigger the call
> > to fail.
> >
> > Thus move bucket_table_alloc() up such that we fail back to
> > the caller asap, instead of doing useless checks. This is
> > safe as the the table allocation isn't using the halfly
> > setup 'ht' structure and bucket_table_alloc() call chain only
> > ends up using the ht->nulls_base member in INIT_RHT_NULLS_HEAD.
> >
> > Also move the locking initialization down to the end.
> >
> > Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
>
> The user potentially "doing something bogus" is why the most
> expensive part of the initialization (the memory allocation)
> is done after everything else is validated.
>
> I think it's best to keep things as-is.
I agree.
Thanks,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [RFC v2 bpf-next 8/9] bpf: Provide helper to do lookups in kernel FIB table
From: David Ahern @ 2018-05-15 3:46 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: netdev, borkmann, ast, davem, shm, roopa, brouer, toke,
john.fastabend
In-Reply-To: <4729b693-20d7-dd9e-c48b-be8386ce9bed@gmail.com>
On 4/29/18 7:13 PM, David Ahern wrote:
>
> The idea here is to fast pass packets that fit a supported profile and
> are to be forwarded. Everything else should continue up the stack as it
> has wider capabilities. The helper and XDP programs should make no
> assumptions on what the broader kernel and userspace might be monitoring
> or want to do with packets that can not be forwarded in the fast path.
> This is very similar to hardware forwarding when it punts packets to the
> CPU for control plane assistance.
>
Thinking about this some more and how to return more information to the
bpf program about the FIB lookup.
bpf_fib_lookup struct is 64-bytes. It can not be expanded without
hurting performance. I could do another union on an input parameter and
return flags indicating why the returned index is 0. Something like this:
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 360a1168c353..75591522444c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2314,6 +2314,12 @@ struct bpf_raw_tracepoint_args {
#define BPF_FIB_LOOKUP_DIRECT BIT(0)
#define BPF_FIB_LOOKUP_OUTPUT BIT(1)
+#define BPF_FIB_LKUP_RET_NO_FWD BIT(0) /* pkt is not fwded */
+#define BPF_FIB_LKUP_RET_UNSUPP_LWT BIT(1) /* fwd requires unsupp
encap */
+#define BPF_FIB_LKUP_RET_NO_NHDEV BIT(2) /* nh device does not exist */
+#define BPF_FIB_LKUP_RET_NO_NEIGH BIT(3) /* no neigh entry for nh */
+#define BPF_FIB_LKUP_RET_FRAG_NEEDED BIT(4) /* pkt too big to fwd */
+
struct bpf_fib_lookup {
/* input */
__u8 family; /* network family, AF_INET, AF_INET6, AF_MPLS */
@@ -2325,7 +2331,11 @@ struct bpf_fib_lookup {
/* total length of packet from network header - used for MTU
check */
__u16 tot_len;
- __u32 ifindex; /* L3 device index for lookup */
+
+ union {
+ __u32 ifindex; /* in: L3 device index for lookup */
+ __u32 ret_flags; /* out: BPF_FIB_LOOKUP_RET flags */
+ }
union {
/* inputs to lookup */
Similarly for the fib result, it could be returned with a union on say
family:
union {
__u8 family; /* in: network family, AF_INET, AF_INET6, AF_MPLS */
__u8 rt_type; /* out: FIB lookup route type */
};
Then if the fib result is -EINVAL/-EHOSTUNREACH/-EACCES, rt_type is set
to RTN_BLACKHOLE/RTN_UNREACHABLE/RTN_PROHIBIT allowing the XDP program
to make an informed decision on dropping the packet.
To avoid performance hits on the forwarding path, these return values
would *only* set if the ifindex returned is 0.
^ permalink raw reply related
* mounting NFS on the same host leads to D state
From: maowenan @ 2018-05-15 3:47 UTC (permalink / raw)
To: netdev@vger.kernel.org, jlayton, linux-nfs; +Cc: yangerkun, weiyongjun (A)
Hi,
Recently I have tested NFS and exportfs scenario,
that NFS server and client are in the same host.
And I found mounting NFS filesystm onto the same host
can lead to rpc.mountd and related task become D state.
My kernel version is based on 3.10, and I find 4.15 has the same
appearance.
My test step as below:
1)create dir.
mkdir -p /home/test1 /home/test2
2)share dir /home/test1
echo '/home/test1 localhost(rw,all_squash,anonuid=0,anongid=0)' > /etc/exports
3)exportfs
exportfs -vr || echo "Failed to export /home/test1"
4)mount NFS.
mount localhost:/home/test1 /home/test2 -o vers=3,soft
5)share dir /home/test2
echo '/home/test2 *(rw,all_squash,anonuid=0,anongid=0)' >> /etc/exports
6)exportfs
exportfs -vr
7) list /home/test2
ls /home/test2
then we found ls command is hung, ls and rpc.mountd became "D" state, and after
180 second ls command return.
Another scenario as below:
1)create dir.
mkdir -p /home/test3 /home/test4
2)share dir /home/test3
echo '/home/test3 localhost(rw,sync,no_wdelay,anonuid=0,anongid=0,no_subtree_check)' > /etc/exports
3)exportfs
exportfs -r
4)to see NFS status
showmount -e localhost
5)mount NFS
mount -t nfs4 -o proto=tcp,nolock,soft,timeo=50 localhost:/home/test3 /home/test4
6) stop nfs service,and and check ls task state is D.
service nfs stop
ls /home/test4
ls command is hung and became D state.
I wonder to know is it reasonable about these test scenario because NFS server and
client are in the same host? Since some task went into D state, is there any reason about this?
and is there any patch to fix this issue?
Here is a link to talk about NFS mounting on the same host, https://lwn.net/Articles/595652/
^ permalink raw reply
* Re: [PATCH iproute2] ip: do not drop capabilities if net_admin=i is set
From: Stephen Hemminger @ 2018-05-15 4:10 UTC (permalink / raw)
To: Luca Boccassi; +Cc: netdev, dsahern, luto
In-Reply-To: <20180511123956.5638-1-bluca@debian.org>
On Fri, 11 May 2018 13:39:56 +0100
Luca Boccassi <bluca@debian.org> wrote:
> Users have reported a regression due to ip now dropping capabilities
> unconditionally.
> zerotier-one VPN and VirtualBox use ambient capabilities in their
> binary and then fork out to ip to set routes and links, and this
> does not work anymore.
>
> As a workaround, do not drop caps if CAP_NET_ADMIN (the most common
> capability used by ip) is set with the INHERITABLE flag.
> Users that want ip vrf exec to work do not need to set INHERITABLE,
> which will then only set when the calling program had privileges to
> give itself the ambient capability.
>
> Fixes: ba2fc55b99f8 ("Drop capabilities if not running ip exec vrf with libcap")
>
> Signed-off-by: Luca Boccassi <bluca@debian.org>
Applied
^ permalink raw reply
* [PATCH net] tcp: purge write queue in tcp_connect_init()
From: Eric Dumazet @ 2018-05-15 4:14 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Eric Dumazet, Eric Dumazet, Yuchung Cheng, Neal Cardwell
syzkaller found a reliable way to crash the host, hitting a BUG()
in __tcp_retransmit_skb()
Malicous MSG_FASTOPEN is the root cause. We need to purge write queue
in tcp_connect_init() at the point we init snd_una/write_seq.
This patch also replaces the BUG() by a less intrusive WARN_ON_ONCE()
kernel BUG at net/ipv4/tcp_output.c:2837!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 5276 Comm: syz-executor0 Not tainted 4.17.0-rc3+ #51
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__tcp_retransmit_skb+0x2992/0x2eb0 net/ipv4/tcp_output.c:2837
RSP: 0000:ffff8801dae06ff8 EFLAGS: 00010206
RAX: ffff8801b9fe61c0 RBX: 00000000ffc18a16 RCX: ffffffff864e1a49
RDX: 0000000000000100 RSI: ffffffff864e2e12 RDI: 0000000000000005
RBP: ffff8801dae073a0 R08: ffff8801b9fe61c0 R09: ffffed0039c40dd2
R10: ffffed0039c40dd2 R11: ffff8801ce206e93 R12: 00000000421eeaad
R13: ffff8801ce206d4e R14: ffff8801ce206cc0 R15: ffff8801cd4f4a80
FS: 0000000000000000(0000) GS:ffff8801dae00000(0063) knlGS:00000000096bc900
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000020000000 CR3: 00000001c47b6000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
tcp_retransmit_skb+0x2e/0x250 net/ipv4/tcp_output.c:2923
tcp_retransmit_timer+0xc50/0x3060 net/ipv4/tcp_timer.c:488
tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573
tcp_write_timer+0x111/0x1d0 net/ipv4/tcp_timer.c:593
call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
expire_timers kernel/time/timer.c:1363 [inline]
__run_timers+0x79e/0xc50 kernel/time/timer.c:1666
run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
__do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1d1/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:525 [inline]
smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
net/ipv4/tcp_output.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 383cac0ff0ec059ca7dbc1a6304cc7f8183e008d..d07e34f8e3091144976358674b92458076f92bfb 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2833,8 +2833,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
return -EBUSY;
if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
- if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
- BUG();
+ if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
return -ENOMEM;
}
@@ -3342,6 +3344,7 @@ static void tcp_connect_init(struct sock *sk)
sock_reset_flag(sk, SOCK_DONE);
tp->snd_wnd = 0;
tcp_init_wl(tp, 0);
+ tcp_write_queue_purge(sk);
tp->snd_una = tp->write_seq;
tp->snd_sml = tp->write_seq;
tp->snd_up = tp->write_seq;
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
* Re: [PATCH net-next 2/3] net: qualcomm: rmnet: Add support for ethtool private stats
From: kbuild test robot @ 2018-05-15 4:42 UTC (permalink / raw)
To: Subash Abhinov Kasiviswanathan
Cc: kbuild-all, davem, netdev, Subash Abhinov Kasiviswanathan
In-Reply-To: <1526328527-20026-3-git-send-email-subashab@codeaurora.org>
[-- Attachment #1: Type: text/plain, Size: 2578 bytes --]
Hi Subash,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on net-next/master]
url: https://github.com/0day-ci/linux/commits/Subash-Abhinov-Kasiviswanathan/net-qualcomm-rmnet-Updates-2018-05-14/20180515-115355
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All warnings (new ones prefixed by >>):
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c: In function 'rmnet_map_checksum_downlink_packet':
>> drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:382:2: warning: this 'else' clause does not guard... [-Wmisleading-indentation]
else
^~~~
drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:384:3: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'else'
return -EPROTONOSUPPORT;
^~~~~~
vim +/else +382 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
349
350 /* Validates packet checksums. Function takes a pointer to
351 * the beginning of a buffer which contains the IP payload +
352 * padding + checksum trailer.
353 * Only IPv4 and IPv6 are supported along with TCP & UDP.
354 * Fragmented or tunneled packets are not supported.
355 */
356 int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len)
357 {
358 struct rmnet_priv *priv = netdev_priv(skb->dev);
359 struct rmnet_map_dl_csum_trailer *csum_trailer;
360
361 if (unlikely(!(skb->dev->features & NETIF_F_RXCSUM))) {
362 priv->stats.csum_sw++;
363 return -EOPNOTSUPP;
364 }
365
366 csum_trailer = (struct rmnet_map_dl_csum_trailer *)(skb->data + len);
367
368 if (!csum_trailer->valid) {
369 priv->stats.csum_valid_unset++;
370 return -EINVAL;
371 }
372
373 if (skb->protocol == htons(ETH_P_IP))
374 return rmnet_map_ipv4_dl_csum_trailer(skb, csum_trailer, priv);
375 else if (skb->protocol == htons(ETH_P_IPV6))
376 #if IS_ENABLED(CONFIG_IPV6)
377 return rmnet_map_ipv6_dl_csum_trailer(skb, csum_trailer, priv);
378 #else
379 priv->stats.csum_err_invalid_ip_version++;
380 return -EPROTONOSUPPORT;
381 #endif
> 382 else
383 priv->stats.csum_err_invalid_ip_version++;
384 return -EPROTONOSUPPORT;
385
386 return 0;
387 }
388
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 63034 bytes --]
^ permalink raw reply
* Re: [PATCH v1 1/4] media: rc: introduce BPF_PROG_IR_DECODER
From: Y Song @ 2018-05-15 4:48 UTC (permalink / raw)
To: Sean Young
Cc: linux-media, linux-kernel, Alexei Starovoitov,
Mauro Carvalho Chehab, Daniel Borkmann, netdev, Matthias Reichl,
Devin Heitmueller
In-Reply-To: <32a944171d5c48abf126259595b0088ce3122c91.1526331777.git.sean@mess.org>
On Mon, May 14, 2018 at 2:10 PM, Sean Young <sean@mess.org> wrote:
> Add support for BPF_PROG_IR_DECODER. This type of BPF program can call
> rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report
> that the last key should be repeated.
>
> Signed-off-by: Sean Young <sean@mess.org>
> ---
> drivers/media/rc/Kconfig | 8 +++
> drivers/media/rc/Makefile | 1 +
> drivers/media/rc/ir-bpf-decoder.c | 93 +++++++++++++++++++++++++++++++
> include/linux/bpf_types.h | 3 +
> include/uapi/linux/bpf.h | 16 +++++-
> 5 files changed, 120 insertions(+), 1 deletion(-)
> create mode 100644 drivers/media/rc/ir-bpf-decoder.c
>
> diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig
> index eb2c3b6eca7f..10ad6167d87c 100644
> --- a/drivers/media/rc/Kconfig
> +++ b/drivers/media/rc/Kconfig
> @@ -120,6 +120,14 @@ config IR_IMON_DECODER
> remote control and you would like to use it with a raw IR
> receiver, or if you wish to use an encoder to transmit this IR.
>
> +config IR_BPF_DECODER
> + bool "Enable IR raw decoder using BPF"
> + depends on BPF_SYSCALL
> + depends on RC_CORE=y
> + help
> + Enable this option to make it possible to load custom IR
> + decoders written in BPF.
> +
> endif #RC_DECODERS
>
> menuconfig RC_DEVICES
> diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile
> index 2e1c87066f6c..12e1118430d0 100644
> --- a/drivers/media/rc/Makefile
> +++ b/drivers/media/rc/Makefile
> @@ -5,6 +5,7 @@ obj-y += keymaps/
> obj-$(CONFIG_RC_CORE) += rc-core.o
> rc-core-y := rc-main.o rc-ir-raw.o
> rc-core-$(CONFIG_LIRC) += lirc_dev.o
> +rc-core-$(CONFIG_IR_BPF_DECODER) += ir-bpf-decoder.o
> obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o
> obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o
> obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o
> diff --git a/drivers/media/rc/ir-bpf-decoder.c b/drivers/media/rc/ir-bpf-decoder.c
> new file mode 100644
> index 000000000000..aaa5e208b1a5
> --- /dev/null
> +++ b/drivers/media/rc/ir-bpf-decoder.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// ir-bpf-decoder.c - handles bpf decoders
> +//
> +// Copyright (C) 2018 Sean Young <sean@mess.org>
> +
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include "rc-core-priv.h"
> +
> +/*
> + * BPF interface for raw IR decoder
> + */
> +const struct bpf_prog_ops ir_decoder_prog_ops = {
> +};
> +
> +BPF_CALL_1(bpf_rc_repeat, struct ir_raw_event*, event)
> +{
> + struct ir_raw_event_ctrl *ctrl;
> +
> + ctrl = container_of(event, struct ir_raw_event_ctrl, prev_ev);
> +
> + rc_repeat(ctrl->dev);
> + return 0;
> +}
> +
> +static const struct bpf_func_proto rc_repeat_proto = {
> + .func = bpf_rc_repeat,
> + .gpl_only = true, // rc_repeat is EXPORT_SYMBOL_GPL
> + .ret_type = RET_VOID,
I suggest using RET_INTEGER here since we do return an integer 0.
RET_INTEGER will also make it easy to extend if you want to return
a non-zero value for error code or other reasons.
> + .arg1_type = ARG_PTR_TO_CTX,
> +};
> +
> +BPF_CALL_4(bpf_rc_keydown, struct ir_raw_event*, event, u32, protocol,
> + u32, scancode, u32, toggle)
> +{
> + struct ir_raw_event_ctrl *ctrl;
> +
> + ctrl = container_of(event, struct ir_raw_event_ctrl, prev_ev);
> + rc_keydown(ctrl->dev, protocol, scancode, toggle != 0);
> + return 0;
> +}
> +
> +static const struct bpf_func_proto rc_keydown_proto = {
> + .func = bpf_rc_keydown,
> + .gpl_only = true, // rc_keydown is EXPORT_SYMBOL_GPL
> + .ret_type = RET_VOID,
ditto. RET_INTEGER is preferable.
> + .arg1_type = ARG_PTR_TO_CTX,
> + .arg2_type = ARG_ANYTHING,
> + .arg3_type = ARG_ANYTHING,
> + .arg4_type = ARG_ANYTHING,
> +};
> +
> +static const struct bpf_func_proto *ir_decoder_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> + switch (func_id) {
> + case BPF_FUNC_rc_repeat:
> + return &rc_repeat_proto;
> + case BPF_FUNC_rc_keydown:
> + return &rc_keydown_proto;
> + case BPF_FUNC_map_lookup_elem:
> + return &bpf_map_lookup_elem_proto;
> + case BPF_FUNC_map_update_elem:
> + return &bpf_map_update_elem_proto;
> + case BPF_FUNC_map_delete_elem:
> + return &bpf_map_delete_elem_proto;
> + case BPF_FUNC_ktime_get_ns:
> + return &bpf_ktime_get_ns_proto;
> + case BPF_FUNC_tail_call:
> + return &bpf_tail_call_proto;
> + case BPF_FUNC_get_prandom_u32:
> + return &bpf_get_prandom_u32_proto;
> + default:
> + return NULL;
> + }
> +}
> +
> +static bool ir_decoder_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + const struct bpf_prog *prog,
> + struct bpf_insn_access_aux *info)
> +{
> + if (type == BPF_WRITE)
> + return false;
> + if (off < 0 || off + size > sizeof(struct ir_raw_event))
> + return false;
You probably need more than just checking the boundary.
>From patch #3, the structure is:
struct ir_raw_event {
union {
__u32 duration;
__u32 carrier;
};
__u8 duty_cycle;
unsigned pulse:1;
unsigned reset:1;
unsigned timeout:1;
unsigned carrier_report:1;
};
You would like the memory access to be aligned,
so accessing duration/carrier with 4-byte alignment, and
pulse/reset/timeout/carrier_report 4-byte alignment as well.
You could only allow __u32 access to duration/carrier.
But if you want bpf program to access duration/carrier with
code like (__u16)(ctx->duration), then the compiler may
translate the load to a 2-byte load. You may need to handle
endianness here. You can check net/core/filter.c function
bpf_skb_is_valid_access for some examples.
> +
> + return true;
> +}
> +
> +const struct bpf_verifier_ops ir_decoder_verifier_ops = {
> + .get_func_proto = ir_decoder_func_proto,
> + .is_valid_access = ir_decoder_is_valid_access
> +};
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 2b28fcf6f6ae..ee5355715ee0 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -25,6 +25,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
> #ifdef CONFIG_CGROUP_BPF
> BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> #endif
> +#ifdef CONFIG_IR_BPF_DECODER
> +BPF_PROG_TYPE(BPF_PROG_TYPE_RAWIR_DECODER, ir_decoder)
> +#endif
>
> BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c5ec89732a8d..6ad053e831c0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -137,6 +137,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_SK_MSG,
> BPF_PROG_TYPE_RAW_TRACEPOINT,
> BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
> + BPF_PROG_TYPE_RAWIR_DECODER,
> };
>
> enum bpf_attach_type {
> @@ -755,6 +756,17 @@ union bpf_attr {
> * @addr: pointer to struct sockaddr to bind socket to
> * @addr_len: length of sockaddr structure
> * Return: 0 on success or negative error code
> + *
> + * int bpf_rc_keydown(ctx, protocol, scancode, toggle)
> + * Report decoded scancode with toggle value
> + * @ctx: pointer to ctx
> + * @protocol: decoded protocol
> + * @scancode: decoded scancode
> + * @toggle: set to 1 if button was toggled, else 0
> + *
> + * int bpf_rc_repeat(ctx)
> + * Repeat the last decoded scancode
> + * @ctx: pointer to ctx
The comment format has changed dramatically for
documentation reason. Could you rebase your change
on top of bpf-next tree? You will need to rewrite the above
helper description so tools can generate proper documentation
for them.
> */
> #define __BPF_FUNC_MAPPER(FN) \
> FN(unspec), \
> @@ -821,7 +833,9 @@ union bpf_attr {
> FN(msg_apply_bytes), \
> FN(msg_cork_bytes), \
> FN(msg_pull_data), \
> - FN(bind),
> + FN(bind), \
> + FN(rc_repeat), \
> + FN(rc_keydown),
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> * function eBPF program intends to call
> --
> 2.17.0
>
^ permalink raw reply
* Re: possible deadlock in sk_diag_fill
From: Dmitry Vyukov @ 2018-05-15 5:19 UTC (permalink / raw)
To: Andrei Vagin; +Cc: syzbot, avagin, David Miller, LKML, netdev, syzkaller-bugs
In-Reply-To: <20180514180035.GA20189@outlook.office365.com>
On Mon, May 14, 2018 at 8:00 PM, Andrei Vagin <avagin@virtuozzo.com> wrote:
>> >> Hello,
>> >>
>> >> syzbot found the following crash on:
>> >>
>> >> HEAD commit: c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of git://git.k..
>> >> git tree: upstream
>> >> console output: https://syzkaller.appspot.com/x/log.txt?x=12164c97800000
>> >> kernel config: https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
>> >> dashboard link: https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669
>> >> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>> >> userspace arch: i386
>> >>
>> >> Unfortunately, I don't have any reproducer for this crash yet.
>> >>
>> >> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> >> Reported-by: syzbot+c1872be62e587eae9669@syzkaller.appspotmail.com
>> >>
>> >>
>> >> ======================================================
>> >> WARNING: possible circular locking dependency detected
>> >> 4.17.0-rc3+ #59 Not tainted
>> >> ------------------------------------------------------
>> >> syz-executor1/25282 is trying to acquire lock:
>> >> 000000004fddf743 (&(&u->lock)->rlock/1){+.+.}, at: sk_diag_dump_icons
>> >> net/unix/diag.c:82 [inline]
>> >> 000000004fddf743 (&(&u->lock)->rlock/1){+.+.}, at:
>> >> sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
>> >>
>> >> but task is already holding lock:
>> >> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock
>> >> include/linux/spinlock.h:310 [inline]
>> >> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons
>> >> net/unix/diag.c:64 [inline]
>> >> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_fill.isra.5+0x94e/0x10d0
>> >> net/unix/diag.c:144
>> >>
>> >> which lock already depends on the new lock.
>> >
>> > In the code, we have a comment which explains why it is safe to take this lock
>> >
>> > /*
>> > * The state lock is outer for the same sk's
>> > * queue lock. With the other's queue locked it's
>> > * OK to lock the state.
>> > */
>> > unix_state_lock_nested(req);
>> >
>> > It is a question how to explain this to lockdep.
>>
>> Do I understand it correctly that (&u->lock)->rlock associated with
>> AF_UNIX is locked under rlock-AF_UNIX, and then rlock-AF_UNIX is
>> locked under (&u->lock)->rlock associated with AF_NETLINK? If so, I
>> think we need to split (&u->lock)->rlock by family too, so that we
>> have u->lock-AF_UNIX and u->lock-AF_NETLINK.
>
> I think here is another problem. lockdep woried about
> sk->sk_receive_queue vs unix_sk(s)->lock.
>
> sk_diag_dump_icons() takes sk->sk_receive_queue and then
> unix_sk(s)->lock.
>
> unix_dgram_sendmsg takes unix_sk(sk)->lock and then sk->sk_receive_queue.
>
> sk_diag_dump_icons() takes locks for two different sockets, but
> unix_dgram_sendmsg() takes locks for one socket.
>
> sk_diag_dump_icons
> if (sk->sk_state == TCP_LISTEN) {
> spin_lock(&sk->sk_receive_queue.lock);
> skb_queue_walk(&sk->sk_receive_queue, skb) {
> unix_state_lock_nested(req);
> spin_lock_nested(&unix_sk(s)->lock,
>
>
> unix_dgram_sendmsg
> unix_state_lock(other)
> spin_lock(&unix_sk(s)->lock)
> skb_queue_tail(&other->sk_receive_queue, skb);
> spin_lock_irqsave(&list->lock, flags);
Do you mean the following?
There is socket 1 with state lock (S1) and queue lock (Q2), and socket
2 with state lock (S2) and queue lock (Q2). unix_dgram_sendmsg lock
S1->Q1. And sk_diag_dump_icons locks Q1->S2.
If yes, then this looks pretty much as deadlock. Consider that 2
unix_dgram_sendmsg in 2 different threads lock S1 and S2 respectively.
Now 2 sk_diag_dump_icons in 2 different threads lock Q1 and Q2
respectively. Now sk_diag_dump_icons want to lock S's, and
unix_dgram_sendmsg want to lock Q's. Nobody can proceed.
^ permalink raw reply
* Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases
From: Tobin C. Harding @ 2018-05-15 5:21 UTC (permalink / raw)
To: David Ahern; +Cc: netdev, Thomas.Winter, idosch, sharpd, roopa
In-Reply-To: <20180515025112.16647-1-dsahern@gmail.com>
Hi David,
On Mon, May 14, 2018 at 07:51:05PM -0700, David Ahern wrote:
> This patch set fixes a few append and replace uses cases for IPv6 and
> adds test cases that codifies the expectations of how append and replace
> are expected to work.
Nood question: what commit does this apply on top of please. I
attempted to apply the set on top of net-next
commit (961423f9fcbc Merge branch 'sctp-Introduce-sctp_flush_ctx')
patch 4 and 5 have merge conflicts (I stopped at 5).
thanks,
Tobin.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox