* [PATCH] selftests: bpf: notification about privilege required to run test_kmod.sh testing script
From: Jeffrin Jose T @ 2018-06-21 17:00 UTC (permalink / raw)
To: ast, daniel, shuah; +Cc: netdev, linux-kernel, linux-kselftest, Jeffrin Jose T
The test_kmod.sh script require root privilege for the successful
execution of the test.
This patch is to notify the user about the privilege the script
demands for the successful execution of the test.
Signed-off-by: Jeffrin Jose T (Rajagiri SET) <ahiliation@gmail.com>
---
tools/testing/selftests/bpf/test_kmod.sh | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_kmod.sh b/tools/testing/selftests/bpf/test_kmod.sh
index 35669ccd4d23..378ccc512ad3 100755
--- a/tools/testing/selftests/bpf/test_kmod.sh
+++ b/tools/testing/selftests/bpf/test_kmod.sh
@@ -1,6 +1,15 @@
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+msg="skip all tests:"
+if [ "$(id -u)" != "0" ]; then
+ echo $msg please run this as root >&2
+ exit $ksft_skip
+fi
+
SRC_TREE=../../../../
test_run()
--
2.17.0
^ permalink raw reply related
* [PATCH bpf] tools/bpf: fix test_sockmap failure
From: Yonghong Song @ 2018-06-21 17:02 UTC (permalink / raw)
To: ast, daniel, netdev; +Cc: kernel-team
On one of our production test machine, when running
bpf selftest test_sockmap, I got the following error:
# sudo ./test_sockmap
libbpf: failed to create map (name: 'sock_map'): Operation not permitted
libbpf: failed to load object 'test_sockmap_kern.o'
libbpf: Can't get the 0th fd from program sk_skb1: only -1 instances
......
load_bpf_file: (-1) Operation not permitted
ERROR: (-1) load bpf failed
The error is due to not-big-enough rlimit
struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
The test already includes "bpf_rlimit.h", which sets current
and max rlimit to RLIM_INFINITY. Let us just use it.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
tools/testing/selftests/bpf/test_sockmap.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 05c8cb7..9e78df2 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -1413,18 +1413,12 @@ static int test_suite(void)
int main(int argc, char **argv)
{
- struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
int iov_count = 1, length = 1024, rate = 1;
struct sockmap_options options = {0};
int opt, longindex, err, cg_fd = 0;
char *bpf_file = BPF_SOCKMAP_FILENAME;
int test = PING_PONG;
- if (setrlimit(RLIMIT_MEMLOCK, &r)) {
- perror("setrlimit(RLIMIT_MEMLOCK)");
- return 1;
- }
-
if (argc < 2)
return test_suite();
--
2.9.5
^ permalink raw reply related
* Re: [PATCH v2 bpf-net] bpf: Change bpf_fib_lookup to return lookup status
From: Martin KaFai Lau @ 2018-06-21 17:09 UTC (permalink / raw)
To: dsahern; +Cc: netdev, borkmann, ast, davem, David Ahern
In-Reply-To: <20180621030011.7441-1-dsahern@kernel.org>
On Wed, Jun 20, 2018 at 08:00:11PM -0700, dsahern@kernel.org wrote:
> From: David Ahern <dsahern@gmail.com>
>
> For ACLs implemented using either FIB rules or FIB entries, the BPF
> program needs the FIB lookup status to be able to drop the packet.
> Since the bpf_fib_lookup API has not reached a released kernel yet,
> change the return code to contain an encoding of the FIB lookup
> result and return the nexthop device index in the params struct.
>
> In addition, inform the BPF program of any post FIB lookup reason as
> to why the packet needs to go up the stack.
>
> The fib result for unicast routes must have an egress device, so remove
> the check that it is non-NULL.
Acked-by: Martin KaFai Lau <kafai@fb.com>
>
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---
> v2
> - drop BPF_FIB_LKUP_RET_NO_NHDEV; check in dev in fib result not needed
> - enhance documentation of BPF_FIB_LKUP_RET_ codes
>
> include/uapi/linux/bpf.h | 28 ++++++++++++++----
> net/core/filter.c | 72 ++++++++++++++++++++++++++++++----------------
> samples/bpf/xdp_fwd_kern.c | 8 +++---
> 3 files changed, 74 insertions(+), 34 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6a40d7..b7db3261c62d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1857,7 +1857,8 @@ union bpf_attr {
> * is resolved), the nexthop address is returned in ipv4_dst
> * or ipv6_dst based on family, smac is set to mac address of
> * egress device, dmac is set to nexthop mac address, rt_metric
> - * is set to metric from route (IPv4/IPv6 only).
> + * is set to metric from route (IPv4/IPv6 only), and ifindex
> + * is set to the device index of the nexthop from the FIB lookup.
> *
> * *plen* argument is the size of the passed in struct.
> * *flags* argument can be a combination of one or more of the
> @@ -1873,9 +1874,10 @@ union bpf_attr {
> * *ctx* is either **struct xdp_md** for XDP programs or
> * **struct sk_buff** tc cls_act programs.
> * Return
> - * Egress device index on success, 0 if packet needs to continue
> - * up the stack for further processing or a negative error in case
> - * of failure.
> + * * < 0 if any input argument is invalid
> + * * 0 on success (packet is forwarded, nexthop neighbor exists)
> + * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the
> + * * packet is not forwarded or needs assist from full stack
> *
> * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
> * Description
> @@ -2612,6 +2614,18 @@ struct bpf_raw_tracepoint_args {
> #define BPF_FIB_LOOKUP_DIRECT BIT(0)
> #define BPF_FIB_LOOKUP_OUTPUT BIT(1)
>
> +enum {
> + BPF_FIB_LKUP_RET_SUCCESS, /* lookup successful */
> + BPF_FIB_LKUP_RET_BLACKHOLE, /* dest is blackholed; can be dropped */
> + BPF_FIB_LKUP_RET_UNREACHABLE, /* dest is unreachable; can be dropped */
> + BPF_FIB_LKUP_RET_PROHIBIT, /* dest not allowed; can be dropped */
> + BPF_FIB_LKUP_RET_NOT_FWDED, /* packet is not forwarded */
> + BPF_FIB_LKUP_RET_FWD_DISABLED, /* fwding is not enabled on ingress */
> + BPF_FIB_LKUP_RET_UNSUPP_LWT, /* fwd requires encapsulation */
> + BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */
> + BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */
> +};
> +
> struct bpf_fib_lookup {
> /* input: network family for lookup (AF_INET, AF_INET6)
> * output: network family of egress nexthop
> @@ -2625,7 +2639,11 @@ struct bpf_fib_lookup {
>
> /* total length of packet from network header - used for MTU check */
> __u16 tot_len;
> - __u32 ifindex; /* L3 device index for lookup */
> +
> + /* input: L3 device index for lookup
> + * output: device index from FIB lookup
> + */
> + __u32 ifindex;
>
> union {
> /* inputs to lookup */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index e7f12e9f598c..f8dd8aa89de4 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4073,8 +4073,9 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
> memcpy(params->smac, dev->dev_addr, ETH_ALEN);
> params->h_vlan_TCI = 0;
> params->h_vlan_proto = 0;
> + params->ifindex = dev->ifindex;
>
> - return dev->ifindex;
> + return 0;
> }
> #endif
>
> @@ -4098,7 +4099,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> /* verify forwarding is enabled on this interface */
> in_dev = __in_dev_get_rcu(dev);
> if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
> - return 0;
> + return BPF_FIB_LKUP_RET_FWD_DISABLED;
>
> if (flags & BPF_FIB_LOOKUP_OUTPUT) {
> fl4.flowi4_iif = 1;
> @@ -4123,7 +4124,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>
> tb = fib_get_table(net, tbid);
> if (unlikely(!tb))
> - return 0;
> + return BPF_FIB_LKUP_RET_NOT_FWDED;
>
> err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
> } else {
> @@ -4135,8 +4136,20 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
> }
>
> - if (err || res.type != RTN_UNICAST)
> - return 0;
> + if (err) {
> + /* map fib lookup errors to RTN_ type */
> + if (err == -EINVAL)
> + return BPF_FIB_LKUP_RET_BLACKHOLE;
> + if (err == -EHOSTUNREACH)
> + return BPF_FIB_LKUP_RET_UNREACHABLE;
> + if (err == -EACCES)
> + return BPF_FIB_LKUP_RET_PROHIBIT;
> +
> + return BPF_FIB_LKUP_RET_NOT_FWDED;
> + }
> +
> + if (res.type != RTN_UNICAST)
> + return BPF_FIB_LKUP_RET_NOT_FWDED;
>
> if (res.fi->fib_nhs > 1)
> fib_select_path(net, &res, &fl4, NULL);
> @@ -4144,19 +4157,16 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> if (check_mtu) {
> mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
> if (params->tot_len > mtu)
> - return 0;
> + return BPF_FIB_LKUP_RET_FRAG_NEEDED;
> }
>
> nh = &res.fi->fib_nh[res.nh_sel];
>
> /* do not handle lwt encaps right now */
> if (nh->nh_lwtstate)
> - return 0;
> + return BPF_FIB_LKUP_RET_UNSUPP_LWT;
>
> dev = nh->nh_dev;
> - if (unlikely(!dev))
> - return 0;
> -
> if (nh->nh_gw)
> params->ipv4_dst = nh->nh_gw;
>
> @@ -4166,10 +4176,10 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> * rcu_read_lock_bh is not needed here
> */
> neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst);
> - if (neigh)
> - return bpf_fib_set_fwd_params(params, neigh, dev);
> + if (!neigh)
> + return BPF_FIB_LKUP_RET_NO_NEIGH;
>
> - return 0;
> + return bpf_fib_set_fwd_params(params, neigh, dev);
> }
> #endif
>
> @@ -4190,7 +4200,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>
> /* link local addresses are never forwarded */
> if (rt6_need_strict(dst) || rt6_need_strict(src))
> - return 0;
> + return BPF_FIB_LKUP_RET_NOT_FWDED;
>
> dev = dev_get_by_index_rcu(net, params->ifindex);
> if (unlikely(!dev))
> @@ -4198,7 +4208,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>
> idev = __in6_dev_get_safely(dev);
> if (unlikely(!idev || !net->ipv6.devconf_all->forwarding))
> - return 0;
> + return BPF_FIB_LKUP_RET_FWD_DISABLED;
>
> if (flags & BPF_FIB_LOOKUP_OUTPUT) {
> fl6.flowi6_iif = 1;
> @@ -4225,7 +4235,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
>
> tb = ipv6_stub->fib6_get_table(net, tbid);
> if (unlikely(!tb))
> - return 0;
> + return BPF_FIB_LKUP_RET_NOT_FWDED;
>
> f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
> } else {
> @@ -4238,11 +4248,23 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> }
>
> if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
> - return 0;
> + return BPF_FIB_LKUP_RET_NOT_FWDED;
> +
> + if (unlikely(f6i->fib6_flags & RTF_REJECT)) {
> + switch (f6i->fib6_type) {
> + case RTN_BLACKHOLE:
> + return BPF_FIB_LKUP_RET_BLACKHOLE;
> + case RTN_UNREACHABLE:
> + return BPF_FIB_LKUP_RET_UNREACHABLE;
> + case RTN_PROHIBIT:
> + return BPF_FIB_LKUP_RET_PROHIBIT;
> + default:
> + return BPF_FIB_LKUP_RET_NOT_FWDED;
> + }
> + }
>
> - if (unlikely(f6i->fib6_flags & RTF_REJECT ||
> - f6i->fib6_type != RTN_UNICAST))
> - return 0;
> + if (f6i->fib6_type != RTN_UNICAST)
> + return BPF_FIB_LKUP_RET_NOT_FWDED;
>
> if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
> f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
> @@ -4252,11 +4274,11 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> if (check_mtu) {
> mtu = ipv6_stub->ip6_mtu_from_fib6(f6i, dst, src);
> if (params->tot_len > mtu)
> - return 0;
> + return BPF_FIB_LKUP_RET_FRAG_NEEDED;
> }
>
> if (f6i->fib6_nh.nh_lwtstate)
> - return 0;
> + return BPF_FIB_LKUP_RET_UNSUPP_LWT;
>
> if (f6i->fib6_flags & RTF_GATEWAY)
> *dst = f6i->fib6_nh.nh_gw;
> @@ -4270,10 +4292,10 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
> */
> neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
> ndisc_hashfn, dst, dev);
> - if (neigh)
> - return bpf_fib_set_fwd_params(params, neigh, dev);
> + if (!neigh)
> + return BPF_FIB_LKUP_RET_NO_NEIGH;
>
> - return 0;
> + return bpf_fib_set_fwd_params(params, neigh, dev);
> }
> #endif
>
> diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
> index 6673cdb9f55c..a7e94e7ff87d 100644
> --- a/samples/bpf/xdp_fwd_kern.c
> +++ b/samples/bpf/xdp_fwd_kern.c
> @@ -48,9 +48,9 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
> struct ethhdr *eth = data;
> struct ipv6hdr *ip6h;
> struct iphdr *iph;
> - int out_index;
> u16 h_proto;
> u64 nh_off;
> + int rc;
>
> nh_off = sizeof(*eth);
> if (data + nh_off > data_end)
> @@ -101,7 +101,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
>
> fib_params.ifindex = ctx->ingress_ifindex;
>
> - out_index = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
> + rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
>
> /* verify egress index has xdp support
> * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
> @@ -109,7 +109,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
> * NOTE: without verification that egress index supports XDP
> * forwarding packets are dropped.
> */
> - if (out_index > 0) {
> + if (rc == 0) {
> if (h_proto == htons(ETH_P_IP))
> ip_decrease_ttl(iph);
> else if (h_proto == htons(ETH_P_IPV6))
> @@ -117,7 +117,7 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
>
> memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
> memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
> - return bpf_redirect_map(&tx_port, out_index, 0);
> + return bpf_redirect_map(&tx_port, fib_params.ifindex, 0);
> }
>
> return XDP_PASS;
> --
> 2.11.0
>
^ permalink raw reply
* Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Andrew Lunn @ 2018-06-21 17:11 UTC (permalink / raw)
To: Vadim Pasternak; +Cc: davem, netdev, jiri
In-Reply-To: <1529594883-20619-4-git-send-email-vadimp@mellanox.com>
> New internal API reads the temperature from all the modules, which are
> equipped with the thermal sensor and exposes temperature according to
> the worst measure. All individual temperature values are normalized to
> pre-defined range.
Hi Vadim
Could you explain this normalization process. Why are you not just
expose each sensors temperature in millidegrees C, which is the normal
for HWMON.
Andrew
^ permalink raw reply
* [PATCH] net: ethernet: ti: davinci_cpdma: make function cpdma_desc_pool_create static
From: Colin King @ 2018-06-21 17:16 UTC (permalink / raw)
To: David S . Miller, Florian Fainelli, Ivan Khoronzhuk, linux-omap,
netdev
Cc: kernel-janitors, linux-kernel
From: Colin Ian King <colin.king@canonical.com>
The function cpdma_desc_pool_create is local to the source and does not
need to be in global scope, so make it static.
Cleans up sparse warning:
warning: symbol 'cpdma_desc_pool_create' was not declared. Should it
be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index cdbddf16dd29..4f1267477aa4 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -205,7 +205,7 @@ static void cpdma_desc_pool_destroy(struct cpdma_ctlr *ctlr)
* devices (e.g. cpsw switches) use plain old memory. Descriptor pools
* abstract out these details
*/
-int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
+static int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
{
struct cpdma_params *cpdma_params = &ctlr->params;
struct cpdma_desc_pool *pool;
--
2.17.0
^ permalink raw reply related
* Re: [PATCH] net: ethernet: ti: davinci_cpdma: make function cpdma_desc_pool_create static
From: Grygorii Strashko @ 2018-06-21 17:22 UTC (permalink / raw)
To: Colin King, David S . Miller, Florian Fainelli, Ivan Khoronzhuk,
linux-omap, netdev, netdev
Cc: kernel-janitors, linux-kernel
In-Reply-To: <20180621171645.29734-1-colin.king@canonical.com>
Please, add netdev@vger.kernel.org for the future
On 06/21/2018 12:16 PM, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
>
> The function cpdma_desc_pool_create is local to the source and does not
> need to be in global scope, so make it static.
>
> Cleans up sparse warning:
> warning: symbol 'cpdma_desc_pool_create' was not declared. Should it
> be static?
>
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
> drivers/net/ethernet/ti/davinci_cpdma.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
> index cdbddf16dd29..4f1267477aa4 100644
> --- a/drivers/net/ethernet/ti/davinci_cpdma.c
> +++ b/drivers/net/ethernet/ti/davinci_cpdma.c
> @@ -205,7 +205,7 @@ static void cpdma_desc_pool_destroy(struct cpdma_ctlr *ctlr)
> * devices (e.g. cpsw switches) use plain old memory. Descriptor pools
> * abstract out these details
> */
> -int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
> +static int cpdma_desc_pool_create(struct cpdma_ctlr *ctlr)
> {
> struct cpdma_params *cpdma_params = &ctlr->params;
> struct cpdma_desc_pool *pool;
>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
--
regards,
-grygorii
^ permalink raw reply
* Re: [PATCH net-next 0/2] fixes for ipsec selftests
From: Shannon Nelson @ 2018-06-21 17:25 UTC (permalink / raw)
To: Anders Roxell; +Cc: Networking, David Miller
In-Reply-To: <CADYN=9+DSfu+UN3d8Te71F91ZWxCFUS0RBJLdzO4M0hZUonPiA@mail.gmail.com>
On 6/21/2018 9:56 AM, Anders Roxell wrote:
> On Thu, 21 Jun 2018 at 02:32, Shannon Nelson <shannon.nelson@oracle.com> wrote:
>>
>> On 6/20/2018 4:18 PM, Anders Roxell wrote:
>>> On Thu, 21 Jun 2018 at 00:26, Shannon Nelson <shannon.nelson@oracle.com> wrote:
>>>>
>>>> On 6/20/2018 12:09 PM, Anders Roxell wrote:
>>>>> On Wed, 20 Jun 2018 at 07:42, Shannon Nelson <shannon.nelson@oracle.com> wrote:
>>>>>>
>>>>>> A couple of bad behaviors in the ipsec selftest were pointed out
>>>>>> by Anders Roxell <anders.roxell@linaro.org> and are addressed here.
>>>>>>
>>>>>> Shannon Nelson (2):
>>>>>> selftests: rtnetlink: hide complaint from terminated monitor
>>>>>> selftests: rtnetlink: use a local IP address for IPsec tests
>>>>>>
>>>>>> tools/testing/selftests/net/rtnetlink.sh | 11 +++++++----
>>>>>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.7.4
>>>>>>
>>>>>
>>>>> Hi Shannon,
>>>>>
>>>>> With this patches applied and my config patch.
>>>>>
>>>>> I still get this error when I run the ipsec test:
>>>>>
>>>>> FAIL: can't add fou port 7777, skipping test
>>>>> RTNETLINK answers: Operation not supported
>>>>> FAIL: can't add macsec interface, skipping test
>>>>> RTNETLINK answers: Protocol not supported
>>>>> RTNETLINK answers: No such process
>>>>> RTNETLINK answers: No such process
>>>>> FAIL: ipsec
>>>>
>>>> One of the odd things I noticed about this script is that there really
>>>> aren't any diagnosis messages, just PASS or FAIL. I followed this
>>>> custom when I added the ipsec tests, but I think this is something that
>>>> should change so we can get some idea of what breaks.
>>>>
>>>> I'm curious about the "RTNETLINK answers" messages and where they might
>>>> be coming from, especially "RTNETLINK answers: Protocol not supported".
>>>
>>> I added: "set -x" in the beginning of the rtnetlink.sh script.
>>> + ip x s add proto esp src 10.66.17.140 dst 10.66.17.141 spi 0x07 mode
>>> transport reqid 0x07 replay-window 32 aead 'rfc4106(gcm(aes))'
>>> 0x3132333435
>>> 363738393031323334353664636261 128 sel src 10.66.17.140/24 dst 10.66.17.141/24
>>> RTNETLINK answers: Protocol not supported
>>
>> Okay, so ip didn't like this command...
>>
>>>> What are the XFRM and AES settings in your kernel config - what is the
>>>> output from
>>>> egrep -i "xfrm|_aes" .config
>>>
>>> CONFIG_XFRM=y
>>> CONFIG_XFRM_ALGO=y
>>> CONFIG_XFRM_USER=y
>>> CONFIG_INET_XFRM_MODE_TUNNEL=y
>>> CONFIG_INET6_XFRM_MODE_TRANSPORT=y
>>> CONFIG_INET6_XFRM_MODE_TUNNEL=y
>>> CONFIG_INET6_XFRM_MODE_BEET=y
>>> CONFIG_CRYPTO_AES=y
>>
>> And this is probably why - there seem to be a few config variables
>> missing, including CONFIG_INET_XFRM_MODE_TRANSPORT, which might be why
>> the ip command fails above.
>>
>> Here's what I have in my config:
>> CONFIG_XFRM=y
>> CONFIG_XFRM_OFFLOAD=y
>> CONFIG_XFRM_ALGO=m
>> CONFIG_XFRM_USER=m
>> # CONFIG_XFRM_SUB_POLICY is not set
>> # CONFIG_XFRM_MIGRATE is not set
>> CONFIG_XFRM_STATISTICS=y
>> CONFIG_XFRM_IPCOMP=m
>> CONFIG_INET_XFRM_TUNNEL=m
>> CONFIG_INET_XFRM_MODE_TRANSPORT=m
>> CONFIG_INET_XFRM_MODE_TUNNEL=m
>> CONFIG_INET_XFRM_MODE_BEET=m
>> CONFIG_INET6_XFRM_TUNNEL=m
>> CONFIG_INET6_XFRM_MODE_TRANSPORT=m
>> CONFIG_INET6_XFRM_MODE_TUNNEL=m
>> CONFIG_INET6_XFRM_MODE_BEET=m
>> CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=m
>> CONFIG_SECURITY_NETWORK_XFRM=y
>> CONFIG_CRYPTO_AES=y
>> # CONFIG_CRYPTO_AES_TI is not set
>> CONFIG_CRYPTO_AES_X86_64=m
>> CONFIG_CRYPTO_AES_NI_INTEL=m
>> CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64=m
>> CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64=m
>> CONFIG_CRYPTO_DEV_PADLOCK_AES=m
>>
>> Can I talk you into adding CONFIG_INET_XFRM_MODE_TRANSPORT to your
>> config
>
> Yes you can.
>
>> and trying again?
>
> same issue with CONFIG_INET_XFRM_MODE_TRANSPORT=y
Interesting. I took only CONFIG_INET_XFRM_MODE_TRANSPORT out of my
config and was able to see the "Protocol not supported" message. I'm
not familiar enough with the crypto algorithm setup, but I suspect
there's a combination of the other missing CONFIGs that are needed along
with CONFIG_INET_XFRM_MODE_TRANSPORT.
My knee-jerk reaction voice wants to say this is the test working as
expected, pointing out to us that the kernel config is not up to what it
should be. However, perhaps a better answer is that the test should be
reworked to just skip the rest if it can't set up the expected test
environment, as is done in the macsec case.
So the remaining question then is should the test be marked as failed,
as in the macsec test if it can't set up it's interface, or just skipped?
sln
>
> Cheers,
> Anders
>
^ permalink raw reply
* Re: [PATCH rdma-next 0/2] RoCE ICRC counter
From: Jason Gunthorpe @ 2018-06-21 17:43 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Mark Bloch,
Talat Batheesh, Saeed Mahameed, linux-netdev
In-Reply-To: <20180621123756.32645-1-leon@kernel.org>
On Thu, Jun 21, 2018 at 03:37:54PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
>
> Hi,
>
> This series exposes RoCE ICRC counter through existing RDMA hw_counters
> sysfs interface.
>
> First patch has all HW definitions in mlx5_ifc.h file and second patch is
> actual counter implementation.
The RDMA parts are OK, can you please send me the commit for the mlx5
patch when applied?
Thanks,
Jason
^ permalink raw reply
* Re: [PATCH mlx5-next 1/2] net/mlx5: Add RoCE RX ICRC encapsulated counter
From: Leon Romanovsky @ 2018-06-21 17:53 UTC (permalink / raw)
To: Doug Ledford, Jason Gunthorpe
Cc: RDMA mailing list, Mark Bloch, Talat Batheesh, Saeed Mahameed,
linux-netdev
In-Reply-To: <20180621123756.32645-2-leon@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 1117 bytes --]
On Thu, Jun 21, 2018 at 03:37:55PM +0300, Leon Romanovsky wrote:
> From: Talat Batheesh <talatb@mellanox.com>
>
> Add capability bit in PCAM register and RoCE ICRC error counter
> to PPCNT register.
>
> Signed-off-by: Talat Batheesh <talatb@mellanox.com>
> Reviewed-by: Mark Bloch <markb@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
> include/linux/mlx5/mlx5_ifc.h | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
> index b4302ccb63a6..9e8682489951 100644
> --- a/include/linux/mlx5/mlx5_ifc.h
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -1687,7 +1687,11 @@ struct mlx5_ifc_eth_extended_cntrs_grp_data_layout_bits {
>
> u8 rx_buffer_full_low[0x20];
>
> - u8 reserved_at_1c0[0x600];
> + u8 rx_icrc_encapsulated_high[0x20];
> +
> + u8 rx_icrc_encapsulated_low[0x20];
> +
> + u8 reserved_at_3c0[0x5c0];
reserved_at_3c0 should be reserved_at_200, fixed and applied to mlx5-next.
Commit 0af5107cd0640ee3424e337b492e4b11b450ce28 in mlx5-next.
Thanks
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]
^ permalink raw reply
* [PATCH net v2] cls_flower: fix use after free in flower S/W path
From: Paolo Abeni @ 2018-06-21 18:02 UTC (permalink / raw)
To: netdev
Cc: Jamal Hadi Salim, Cong Wang, Jiri Pirko, Marcelo Ricardo Leitner,
Paul Blakey
If flower filter is created without the skip_sw flag, fl_mask_put()
can race with fl_classify() and we can destroy the mask rhashtable
while a lookup operation is accessing it.
BUG: unable to handle kernel paging request at 00000000000911d1
PGD 0 P4D 0
SMP PTI
CPU: 3 PID: 5582 Comm: vhost-5541 Not tainted 4.18.0-rc1.vanilla+ #1950
Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
RIP: 0010:rht_bucket_nested+0x20/0x60
Code: 31 c8 c1 c1 18 29 c8 c3 66 90 8b 4f 04 ba 01 00 00 00 8b 07 48 8b bf 80 00 00 0
RSP: 0018:ffffafc5cfbb7a48 EFLAGS: 00010206
RAX: 0000000000001978 RBX: ffff9f12dff88a00 RCX: 00000000ffff9f12
RDX: 00000000000911d1 RSI: 0000000000000148 RDI: 0000000000000001
RBP: ffff9f12dff88a00 R08: 000000005f1cc119 R09: 00000000a715fae2
R10: ffffafc5cfbb7aa8 R11: ffff9f1cb4be804e R12: ffff9f1265e13000
R13: 0000000000000000 R14: ffffafc5cfbb7b48 R15: ffff9f12dff88b68
FS: 0000000000000000(0000) GS:ffff9f1d3f0c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000911d1 CR3: 0000001575a94006 CR4: 00000000001626e0
Call Trace:
fl_lookup+0x134/0x140 [cls_flower]
fl_classify+0xf3/0x180 [cls_flower]
tcf_classify+0x78/0x150
__netif_receive_skb_core+0x69e/0xa50
netif_receive_skb_internal+0x42/0xf0
tun_get_user+0xdd5/0xfd0 [tun]
tun_sendmsg+0x52/0x70 [tun]
handle_tx+0x2b3/0x5f0 [vhost_net]
vhost_worker+0xab/0x100 [vhost]
kthread+0xf8/0x130
ret_from_fork+0x35/0x40
Modules linked in: act_mirred act_gact cls_flower vhost_net vhost tap sch_ingress
CR2: 00000000000911d1
Fix the above waiting for a RCU grace period before destroying the
rhashtable: we need to use tcf_queue_work(), as rhashtable_destroy()
must run in process context, as pointed out by Cong Wang.
v1 -> v2: use tcf_queue_work to run rhashtable_destroy().
Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
net/sched/cls_flower.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 2b5be42a9f1c..9e8b26a80fb3 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -66,7 +66,7 @@ struct fl_flow_mask {
struct rhashtable_params filter_ht_params;
struct flow_dissector dissector;
struct list_head filters;
- struct rcu_head rcu;
+ struct rcu_work rwork;
struct list_head list;
};
@@ -203,6 +203,20 @@ static int fl_init(struct tcf_proto *tp)
return rhashtable_init(&head->ht, &mask_ht_params);
}
+static void fl_mask_free(struct fl_flow_mask *mask)
+{
+ rhashtable_destroy(&mask->ht);
+ kfree(mask);
+}
+
+static void fl_mask_free_work(struct work_struct *work)
+{
+ struct fl_flow_mask *mask = container_of(to_rcu_work(work),
+ struct fl_flow_mask, rwork);
+
+ fl_mask_free(mask);
+}
+
static bool fl_mask_put(struct cls_fl_head *head, struct fl_flow_mask *mask,
bool async)
{
@@ -210,12 +224,11 @@ static bool fl_mask_put(struct cls_fl_head *head, struct fl_flow_mask *mask,
return false;
rhashtable_remove_fast(&head->ht, &mask->ht_node, mask_ht_params);
- rhashtable_destroy(&mask->ht);
list_del_rcu(&mask->list);
if (async)
- kfree_rcu(mask, rcu);
+ tcf_queue_work(&mask->rwork, fl_mask_free_work);
else
- kfree(mask);
+ fl_mask_free(mask);
return true;
}
--
2.17.1
^ permalink raw reply related
* RE: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Vadim Pasternak @ 2018-06-21 18:14 UTC (permalink / raw)
To: Andrew Lunn; +Cc: davem@davemloft.net, netdev@vger.kernel.org, jiri@resnulli.us
In-Reply-To: <20180621171120.GA6830@lunn.ch>
> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@lunn.ch]
> Sent: Thursday, June 21, 2018 8:11 PM
> To: Vadim Pasternak <vadimp@mellanox.com>
> Cc: davem@davemloft.net; netdev@vger.kernel.org; jiri@resnulli.us
> Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for
> port temperature reading
>
> > New internal API reads the temperature from all the modules, which are
> > equipped with the thermal sensor and exposes temperature according to
> > the worst measure. All individual temperature values are normalized to
> > pre-defined range.
>
> Hi Vadim
>
> Could you explain this normalization process. Why are you not just expose each
> sensors temperature in millidegrees C, which is the normal for HWMON.
Hi Andrew,
The temperature of each individual module can be obtained
through ethtool.
The worst temperature is necessary for the system cooling
control decision.
Up to 64 SFP/QSFP modules could be connected to the system.
Some of them could cooper modules, which doesn't provide
temperature measurement.
Some of them could be optical modules, providing untrusted
temperature measurement, which could impact thermal
control of the system.
Also optical modules could be from the different vendors, and
this is real situation, when, f.e. one module has the warning and
critical thresholds 75C and 85C, while another 70C and 80C.
In such case the first module temperature 72C is better, then the
second module temperature 71C.
And deltas between warning and critical thresholds, could be
different as well. It could be 5C, 10C, etc.
So, nominal temperature is not the case here, we should know the
"worst" value for the thermal control decision.
Thanks,
Vadim.
>
> Andrew
^ permalink raw reply
* Re: [PATCH net v2] cls_flower: fix use after free in flower S/W path
From: Jiri Pirko @ 2018-06-21 18:16 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev, Jamal Hadi Salim, Cong Wang, Marcelo Ricardo Leitner,
Paul Blakey
In-Reply-To: <fd96de4e9dc358e3982922ae681fdb1b9d8ae72a.1529603970.git.pabeni@redhat.com>
Thu, Jun 21, 2018 at 08:02:16PM CEST, pabeni@redhat.com wrote:
>If flower filter is created without the skip_sw flag, fl_mask_put()
>can race with fl_classify() and we can destroy the mask rhashtable
>while a lookup operation is accessing it.
>
> BUG: unable to handle kernel paging request at 00000000000911d1
> PGD 0 P4D 0
> SMP PTI
> CPU: 3 PID: 5582 Comm: vhost-5541 Not tainted 4.18.0-rc1.vanilla+ #1950
> Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
> RIP: 0010:rht_bucket_nested+0x20/0x60
> Code: 31 c8 c1 c1 18 29 c8 c3 66 90 8b 4f 04 ba 01 00 00 00 8b 07 48 8b bf 80 00 00 0
> RSP: 0018:ffffafc5cfbb7a48 EFLAGS: 00010206
> RAX: 0000000000001978 RBX: ffff9f12dff88a00 RCX: 00000000ffff9f12
> RDX: 00000000000911d1 RSI: 0000000000000148 RDI: 0000000000000001
> RBP: ffff9f12dff88a00 R08: 000000005f1cc119 R09: 00000000a715fae2
> R10: ffffafc5cfbb7aa8 R11: ffff9f1cb4be804e R12: ffff9f1265e13000
> R13: 0000000000000000 R14: ffffafc5cfbb7b48 R15: ffff9f12dff88b68
> FS: 0000000000000000(0000) GS:ffff9f1d3f0c0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000911d1 CR3: 0000001575a94006 CR4: 00000000001626e0
> Call Trace:
> fl_lookup+0x134/0x140 [cls_flower]
> fl_classify+0xf3/0x180 [cls_flower]
> tcf_classify+0x78/0x150
> __netif_receive_skb_core+0x69e/0xa50
> netif_receive_skb_internal+0x42/0xf0
> tun_get_user+0xdd5/0xfd0 [tun]
> tun_sendmsg+0x52/0x70 [tun]
> handle_tx+0x2b3/0x5f0 [vhost_net]
> vhost_worker+0xab/0x100 [vhost]
> kthread+0xf8/0x130
> ret_from_fork+0x35/0x40
> Modules linked in: act_mirred act_gact cls_flower vhost_net vhost tap sch_ingress
> CR2: 00000000000911d1
>
>Fix the above waiting for a RCU grace period before destroying the
>rhashtable: we need to use tcf_queue_work(), as rhashtable_destroy()
>must run in process context, as pointed out by Cong Wang.
>
>v1 -> v2: use tcf_queue_work to run rhashtable_destroy().
>
>Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
>Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
^ permalink raw reply
* Re: Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-21 18:20 UTC (permalink / raw)
To: Cornelia Huck
Cc: Siwei Liu, Samudrala, Sridhar, Alexander Duyck, virtio-dev,
aaron.f.brown, Jiri Pirko, Jakub Kicinski, Netdev, qemu-devel,
virtualization, konrad.wilk, boris.ostrovsky, Joao Martins,
Venu Busireddy, vijay.balakrishna
In-Reply-To: <20180621165913.7e3f4faa.cohuck@redhat.com>
On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> On Wed, 20 Jun 2018 22:48:58 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> > > In any case, I'm not sure anymore why we'd want the extra uuid.
> >
> > It's mostly so we can have e.g. multiple devices with same MAC
> > (which some people seem to want in order to then use
> > then with different containers).
> >
> > But it is also handy for when you assign a PF, since then you
> > can't set the MAC.
> >
>
> OK, so what about the following:
>
> - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> that we have a new uuid field in the virtio-net config space
> - in QEMU, add a property for virtio-net that allows to specify a uuid,
> offer VIRTIO_NET_F_STANDBY_UUID if set
> - when configuring, set the property to the group UUID of the vfio-pci
> device
> - in the guest, use the uuid from the virtio-net device's config space
> if applicable; else, fall back to matching by MAC as done today
>
> That should work for all virtio transports.
True. I'm a bit unhappy that it's virtio net specific though
since down the road I expect we'll have a very similar feature
for scsi (and maybe others).
But we do not have a way to have fields that are portable
both across devices and transports, and I think it would
be a useful addition. How would this work though? Any idea?
--
MST
^ permalink raw reply
* Re: [PATCH] net: nixge: Add __packed attribute to DMA descriptor struct
From: Moritz Fischer @ 2018-06-21 18:30 UTC (permalink / raw)
To: David Miller; +Cc: f.fainelli, mdf, keescook, netdev, linux-kernel
In-Reply-To: <20180620.073750.642289685695664600.davem@davemloft.net>
Hi David,
On Wed, Jun 20, 2018 at 07:37:50AM +0900, David Miller wrote:
> From: Florian Fainelli <f.fainelli@gmail.com>
> Date: Tue, 19 Jun 2018 10:13:55 -0700
>
> > How could padding be inserted given than all of the structure members
> > are naturally aligned (all u32 type). Compiler bug?
>
> Agreed, this looks completely unnecessary.
>
> __packed should only be used when absolutely necessary because using
> it generates less efficient code on some architectures.
Thanks for your input, will fix with the whole series when I submit it.
- Moritz
^ permalink raw reply
* Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Andrew Lunn @ 2018-06-21 18:34 UTC (permalink / raw)
To: Vadim Pasternak, Guenter Roeck
Cc: davem@davemloft.net, netdev@vger.kernel.org, jiri@resnulli.us
In-Reply-To: <HE1PR0502MB37531F5503D85EB153A6C672A2760@HE1PR0502MB3753.eurprd05.prod.outlook.com>
> Hi Andrew,
Adding Guenter Roeck, the HWMON maintainer.
> The temperature of each individual module can be obtained
> through ethtool.
You mean via --module-info?
FYI: I plan to add hwmon support to the kernel SFP code. So if you
ever decide to swap to the kernel SFP code, not your own, the raw
temperatures will be exported.
> The worst temperature is necessary for the system cooling
> control decision.
I would expect the system cooling would understand that.
> Up to 64 SFP/QSFP modules could be connected to the system.
> Some of them could cooper modules, which doesn't provide
> temperature measurement.
SFP modules are hot-plugable. So i would also expect the hwmon devices
to hotplug. If there is no sensor, then there is no hwmon device... If
there is no hwmon device, it plays no part in the thermal control
loop.
> Some of them could be optical modules, providing untrusted
> temperature measurement, which could impact thermal
> control of the system.
Why would you not trust it? Are you saying some modules simply have
broken temperature sensors? Do you have a whitelist/blacklist of
modules?
> Also optical modules could be from the different vendors, and
> this is real situation, when, f.e. one module has the warning and
> critical thresholds 75C and 85C, while another 70C and 80C.
But hwmon exports both the actual temperature and the alarm
temperatures. I would expect the thermal control code to use all this
information when making its decisions, not just the current
temperature.
> So, nominal temperature is not the case here, we should know the
> "worst" value for the thermal control decision.
What it sounds like to me is you are working around problems in the
thermal control by fudging the raw temperatures. That is the wrong
thing to do. hwmon should export the raw data, and you should fix the
thermal control code to use it correctly.
Andrew
^ permalink raw reply
* [PATCH] net: phy: Allow compile test of GPIO consumers if !GPIOLIB
From: Geert Uytterhoeven @ 2018-06-21 18:58 UTC (permalink / raw)
To: Andrew Lunn, Florian Fainelli, David S . Miller
Cc: netdev, linux-kernel, Geert Uytterhoeven
The GPIO subsystem provides dummy GPIO consumer functions if GPIOLIB is
not enabled. Hence drivers that depend on GPIOLIB, but use GPIO consumer
functionality only, can still be compiled if GPIOLIB is not enabled.
Relax the dependency on GPIOLIB if COMPILE_TEST is enabled, where
appropriate.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
---
v3:
- Rebased,
v2:
- Add Acked-by.
---
drivers/net/phy/Kconfig | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 343989f9f9d981e2..ceede09a28459f45 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -92,7 +92,8 @@ config MDIO_CAVIUM
config MDIO_GPIO
tristate "GPIO lib-based bitbanged MDIO buses"
- depends on MDIO_BITBANG && GPIOLIB
+ depends on MDIO_BITBANG
+ depends on GPIOLIB || COMPILE_TEST
---help---
Supports GPIO lib-based MDIO busses.
--
2.17.1
^ permalink raw reply related
* RE: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Vadim Pasternak @ 2018-06-21 19:17 UTC (permalink / raw)
To: Andrew Lunn, Guenter Roeck
Cc: davem@davemloft.net, netdev@vger.kernel.org, jiri@resnulli.us
In-Reply-To: <20180621183440.GA10038@lunn.ch>
> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@lunn.ch]
> Sent: Thursday, June 21, 2018 9:35 PM
> To: Vadim Pasternak <vadimp@mellanox.com>; Guenter Roeck <linux@roeck-
> us.net>
> Cc: davem@davemloft.net; netdev@vger.kernel.org; jiri@resnulli.us
> Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for
> port temperature reading
>
> > Hi Andrew,
>
> Adding Guenter Roeck, the HWMON maintainer.
>
> > The temperature of each individual module can be obtained through
> > ethtool.
>
> You mean via --module-info?
Yes.
>
> FYI: I plan to add hwmon support to the kernel SFP code. So if you ever decide to
> swap to the kernel SFP code, not your own, the raw temperatures will be
> exported.
>
Not sure it'll work for us, since we read SFP/QSFP ports through our SW/FW
interface.
But would be nice if you can provide some reference to this code.
> > The worst temperature is necessary for the system cooling control
> > decision.
>
> I would expect the system cooling would understand that.
>
In thermal zone infrastructure there is one temperature input.
How you can consider 64+ different inputs?
> > Up to 64 SFP/QSFP modules could be connected to the system.
> > Some of them could cooper modules, which doesn't provide temperature
> > measurement.
>
> SFP modules are hot-plugable. So i would also expect the hwmon devices to
> hotplug. If there is no sensor, then there is no hwmon device... If there is no
> hwmon device, it plays no part in the thermal control loop.
>
> > Some of them could be optical modules, providing untrusted temperature
> > measurement, which could impact thermal control of the system.
>
> Why would you not trust it? Are you saying some modules simply have broken
> temperature sensors? Do you have a whitelist/blacklist of modules?
>
We are reading temperature info through the firmware.
In case of "broken" module (module is supposed to be capable of
reading temperature, but returns some non-valid code), we'll get
some error code.
> > Also optical modules could be from the different vendors, and this is
> > real situation, when, f.e. one module has the warning and critical
> > thresholds 75C and 85C, while another 70C and 80C.
>
> But hwmon exports both the actual temperature and the alarm temperatures. I
> would expect the thermal control code to use all this information when making
> its decisions, not just the current temperature.
>
All information is used, but the decision to increase FAN speed is taken
based on the worst measure, which is logical.
> > So, nominal temperature is not the case here, we should know the
> > "worst" value for the thermal control decision.
>
> What it sounds like to me is you are working around problems in the thermal
> control by fudging the raw temperatures. That is the wrong thing to do. hwmon
> should export the raw data, and you should fix the thermal control code to use it
> correctly.
By default we are using kernel step-wise thermal algorithm, considering
all the module and ASIC ambient sensors temperature. This is not working
around. In thermal zone we have one PWM control and cumulative temperature
from the modules and ASIC. And it gives stable and correct results.
>
> Andrew
^ permalink raw reply
* Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Andrew Lunn @ 2018-06-21 19:49 UTC (permalink / raw)
To: Vadim Pasternak
Cc: Guenter Roeck, davem@davemloft.net, netdev@vger.kernel.org,
jiri@resnulli.us
In-Reply-To: <HE1PR0502MB37537B5DCD0D607DFB7C7099A2760@HE1PR0502MB3753.eurprd05.prod.outlook.com>
On Thu, Jun 21, 2018 at 07:17:03PM +0000, Vadim Pasternak wrote:
>
>
> > -----Original Message-----
> > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > Sent: Thursday, June 21, 2018 9:35 PM
> > To: Vadim Pasternak <vadimp@mellanox.com>; Guenter Roeck <linux@roeck-
> > us.net>
> > Cc: davem@davemloft.net; netdev@vger.kernel.org; jiri@resnulli.us
> > Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for
> > port temperature reading
> >
> > > Hi Andrew,
> >
> > Adding Guenter Roeck, the HWMON maintainer.
> >
> > > The temperature of each individual module can be obtained through
> > > ethtool.
> >
> > You mean via --module-info?
>
> Yes.
>
> >
> > FYI: I plan to add hwmon support to the kernel SFP code. So if you ever decide to
> > swap to the kernel SFP code, not your own, the raw temperatures will be
> > exported.
> >
>
> Not sure it'll work for us, since we read SFP/QSFP ports through our SW/FW
> interface.
Can you make fake i2c busses? Pass the i2c transactions to the
firmware?
> But would be nice if you can provide some reference to this code.
drivers/net/phy/sfp.c
>
> > > The worst temperature is necessary for the system cooling control
> > > decision.
> >
> > I would expect the system cooling would understand that.
> >
>
> In thermal zone infrastructure there is one temperature input.
> How you can consider 64+ different inputs?
I've never used the thermal zone code. But i've used boards with 4
sensors spread around it. If the thermal zone code could not support
that, i would be surprised.
[Goes away and reads https://www.kernel.org/doc/Documentation/thermal/sysfs-api.txt]
So it sounds like, one zone is one sensor. So you actually have
hot-plugable zones, up to 64 of them. It also looks like you can bind
a zone to a cooling device. There does not seem to be a 1:1
mapping. So you should be able to bind 64 zones to one fan. Or if you
have multiple fans, bind a zone to the nearest fan.
But as i said, i'm no expert on this. You really should be posting
these patches on the hwmon list and the linux-pm list. The netdev list
does not have the needed specialist. Once Rui Zhang, Eduardo Valentin,
and Guenter Roack have given them Acked-by, David Miller can then
merge them.
Andrew
^ permalink raw reply
* Re: Route fallback issue
From: Julian Anastasov @ 2018-06-21 19:57 UTC (permalink / raw)
To: Grant Taylor; +Cc: Akshat Kakkar, netdev, cronolog+lartc, lartc, Erik Auerswald
In-Reply-To: <0a920d2d-4e97-284b-9aad-54cf75bcb755@spamtrap.tnetconsulting.net>
Hello,
On Wed, 20 Jun 2018, Grant Taylor wrote:
> On 06/20/2018 01:00 PM, Julian Anastasov wrote:
> > You can also try alternative routes.
>
> "Alternative routes"? I can't say as I've heard that description as a
> specific technique / feature / capability before.
>
> Is that it's official name?
I think so
> Where can I find out more about it?
You can search on net. I have some old docs on
these issues, they should be actual:
http://ja.ssi.bg/dgd-usage.txt
> > But as the kernel supports only default alternative routes, you can put them
> > in their own table:
>
> I don't know that that is the case any more.
>
> I was able to issue the following commands without a problem:
>
> # ip route append 192.0.2.128/26 via 192.0.2.62
> # ip route append 192.0.2.128/26 via 192.0.2.126
>
> I crated two network namespaces and had a pair of vEths between them
> (192.0.2.0/26 and 192.0.2.64/26). I added a dummy network to each NetNS
> (192.0.2.128/26 and 192.0.2.192/26).
>
> I ran the following commands while a persistent ping was running from one
> NetNS to the IP on the other's dummy0 interface:
>
> # ip link set ns2b up && ip route append 192.0.2.192/26 via 192.0.2.126 && ip
> link set ns2a down
> (pause and watch things)
> # ip link set ns2a up && ip route append 192.0.2.192/26 via 192.0.2.62 && ip
> link set ns2b down
> (pause and watch things)
>
> I could iterate between the two above commands and pings continued to work.
>
> So, I think that it's now possible to use "alternate routes" (new to me) on
> specific prefixes in addition to the default. Thus there is no longer any
> need for a separate table and the associated IP rule.
Not true. net/ipv4/fib_semantics.c:fib_select_path()
calls fib_select_default() only when prefixlen = 0 (default route).
Otherwise, only the first route will be considered.
fib_select_default() is the function that decides which
nexthop is reachable and whether to contact it. It uses the ARP
state via fib_detect_death(). That is all code that is behind this
feature called "alternative routes": the kernel selects one
based on nexthop's ARP state. Routes with different metric are
considered only when the routes with lower metric are removed.
> I'm running kernel version 4.9.76.
>
> I did go ahead and set net.ipv4.conf.ns2b.ignore_routes_with_linkdown to 1.
>
> for i in /proc/sys/net/ipv4/conf/*/ignore_routes_with_linkdown; do echo 1 >
> $i; done
IIRC, this flag invalidates nexthops depending on
the link state. If your link is always UP it does not help
much. If you rely on user space tool, you can check the state
of the desired hops: device link state, your gateway to
ISP, one or more gateways in the ISP network which you
consider permanent part of the path via this ISP.
> Doing that dropped the number of dropped pings from 60 ~ 90 (1 / second) to 0
> ~ 5 (1 / second). (Rarely, maybe 1 out of 20 flips, would it take upwards of
> 10 pings / seconds.)
>
> > # Alternative routes use same metric!!!
> > ip route append default via 192.168.1.254 dev eno1 table 100
> > ip route append default via 192.168.2.254 dev eno2 table 100
> > ip rule add prio 100 to 172.16.0.0/12 table 100
>
> I did have to "append" the route. I couldn't just "add" the route. When I
> tried to "add" the second route, I got an error about the route already
> existing. Using "append" instead of "add" with everything else the same
> worked just fine.
>
> Note: I did go ahead and remove the single route that was added via "add" and
> used "append" for both.
First route can be created with 'add' but all next
alternative routes can be added only with "append". If you
successfully add them with "add" it means they are not
alternatives to the first one, they are not considered at all.
Regards
^ permalink raw reply
* RE: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Vadim Pasternak @ 2018-06-21 20:02 UTC (permalink / raw)
To: Andrew Lunn
Cc: Guenter Roeck, davem@davemloft.net, netdev@vger.kernel.org,
jiri@resnulli.us
In-Reply-To: <20180621194917.GC10038@lunn.ch>
> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@lunn.ch]
> Sent: Thursday, June 21, 2018 10:49 PM
> To: Vadim Pasternak <vadimp@mellanox.com>
> Cc: Guenter Roeck <linux@roeck-us.net>; davem@davemloft.net;
> netdev@vger.kernel.org; jiri@resnulli.us
> Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for
> port temperature reading
>
> On Thu, Jun 21, 2018 at 07:17:03PM +0000, Vadim Pasternak wrote:
> >
> >
> > > -----Original Message-----
> > > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > > Sent: Thursday, June 21, 2018 9:35 PM
> > > To: Vadim Pasternak <vadimp@mellanox.com>; Guenter Roeck
> > > <linux@roeck- us.net>
> > > Cc: davem@davemloft.net; netdev@vger.kernel.org; jiri@resnulli.us
> > > Subject: Re: [PATCH v0 03/12] mlxsw: core: Add core environment
> > > module for port temperature reading
> > >
> > > > Hi Andrew,
> > >
> > > Adding Guenter Roeck, the HWMON maintainer.
> > >
> > > > The temperature of each individual module can be obtained through
> > > > ethtool.
> > >
> > > You mean via --module-info?
> >
> > Yes.
> >
> > >
> > > FYI: I plan to add hwmon support to the kernel SFP code. So if you
> > > ever decide to swap to the kernel SFP code, not your own, the raw
> > > temperatures will be exported.
> > >
> >
> > Not sure it'll work for us, since we read SFP/QSFP ports through our
> > SW/FW interface.
>
> Can you make fake i2c busses? Pass the i2c transactions to the firmware?
Theoretically yes.
But have well-defined SW/FW interface, working over PCI and FW at
ASIC end implements I2C master.
>
> > But would be nice if you can provide some reference to this code.
>
> drivers/net/phy/sfp.c
>
Thank you.
> >
> > > > The worst temperature is necessary for the system cooling control
> > > > decision.
> > >
> > > I would expect the system cooling would understand that.
> > >
> >
> > In thermal zone infrastructure there is one temperature input.
> > How you can consider 64+ different inputs?
>
> I've never used the thermal zone code. But i've used boards with 4 sensors
> spread around it. If the thermal zone code could not support that, i would be
> surprised.
>
> [Goes away and reads
> https://www.kernel.org/doc/Documentation/thermal/sysfs-api.txt]
>
> So it sounds like, one zone is one sensor. So you actually have hot-plugable
> zones, up to 64 of them. It also looks like you can bind a zone to a cooling
> device. There does not seem to be a 1:1 mapping. So you should be able to bind
> 64 zones to one fan. Or if you have multiple fans, bind a zone to the nearest fan.
>
It means I will have 64 thermal zones for each module (actually for the
some coming new systems will support 128 modules, plus thermal zone
for ASIC ambient temperatures.
And each zone will try to control same PWM.
As I result PWM will be extremely jumpy and non-effective.
> But as i said, i'm no expert on this. You really should be posting these patches on
> the hwmon list and the linux-pm list. The netdev list does not have the needed
> specialist. Once Rui Zhang, Eduardo Valentin, and Guenter Roack have given
> them Acked-by, David Miller can then merge them.
Thanks,
Vadim.
>
> Andrew
^ permalink raw reply
* [PATCH] net: bridge: fix potential null pointer dereference on return from br_port_get_rtnl()
From: Garry McNulty @ 2018-06-21 20:14 UTC (permalink / raw)
To: netdev; +Cc: stephen, davem, jiri, nikolay, bridge, linux-kernel,
Garry McNulty
br_port_get_rtnl() can return NULL if the network device is not a bridge
port (IFF_BRIDGE_PORT flag not set). br_port_slave_changelink() and
br_port_fill_slave_info() callbacks dereference this pointer without
checking. Currently this is not a problem because slave devices always
set this flag. Add null check in case these conditions ever change.
Detected by CoverityScan, CID 1339613 ("Dereference null return value")
Signed-off-by: Garry McNulty <garrmcnu@gmail.com>
---
net/bridge/br_netlink.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 9f5eb05b0373..b3ad135b7157 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -947,13 +947,14 @@ static int br_port_slave_changelink(struct net_device *brdev,
struct netlink_ext_ack *extack)
{
struct net_bridge *br = netdev_priv(brdev);
+ struct net_bridge_port *p = br_port_get_rtnl(dev);
int ret;
- if (!data)
+ if (!data || !p)
return 0;
spin_lock_bh(&br->lock);
- ret = br_setport(br_port_get_rtnl(dev), data);
+ ret = br_setport(p, data);
spin_unlock_bh(&br->lock);
return ret;
@@ -963,7 +964,9 @@ static int br_port_fill_slave_info(struct sk_buff *skb,
const struct net_device *brdev,
const struct net_device *dev)
{
- return br_port_fill_attrs(skb, br_port_get_rtnl(dev));
+ struct net_bridge_port *p = br_port_get_rtnl(dev);
+
+ return p ? br_port_fill_attrs(skb, p) : -EINVAL;
}
static size_t br_port_get_slave_size(const struct net_device *brdev,
--
2.14.4
^ permalink raw reply related
* Re: [PATCH v0 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Andrew Lunn @ 2018-06-21 20:18 UTC (permalink / raw)
To: Vadim Pasternak
Cc: Guenter Roeck, davem@davemloft.net, netdev@vger.kernel.org,
jiri@resnulli.us
In-Reply-To: <HE1PR0502MB3753C8AF457059B3D95FA35CA2760@HE1PR0502MB3753.eurprd05.prod.outlook.com>
> It means I will have 64 thermal zones for each module (actually for the
> some coming new systems will support 128 modules, plus thermal zone
> for ASIC ambient temperatures.
> And each zone will try to control same PWM.
> As I result PWM will be extremely jumpy and non-effective.
Hi Vadim
Please repost to the correct lists, and ask the experts how this
should be done. netdev is not the right place to discuss this.
Andrew
^ permalink raw reply
* Re: [PATCH] optoe: driver to read/write SFP/QSFP EEPROMs
From: Don Bollinger @ 2018-06-21 20:28 UTC (permalink / raw)
To: Andrew Lunn; +Cc: netdev, Florian Fainelli, don
In-Reply-To: <20180621081127.GA31742@lunn.ch>
On Thu, Jun 21, 2018 at 10:11:27AM +0200, Andrew Lunn wrote:
> > I'm trying to figure out how the netdev environment works on large
> > switches. I can't imagine that the kernel network stack is involved in
> > any part of the data plane.
>
> Hi Don
>
> It is involved in the slow path. I.e. packets from the host out
> network ports. BPDU, IGMP, ARP, ND, etc. It can also be involved when
> the hardware is missing features. Also, for communication with the
> host itself.
>
> What is more important is it is the control plane. Want to bridge two
> ports together? You create a software bridge, add the two ports, and
> then offload it to the hardware. The kernel STP code in the software
> bridge then does the state tracking, blocked, learning, forwarding
> etc. Need RSTP? Just run the standard RSTP daemon on the software
> bridge interface.
>
> Basically, use the Linux network stack as is, and offload what you can
> to the hardware. That means you keep all your existing user space
> network tools, daemons, SNMP agents, etc. They all just work, because
> the kernel APIs remain the same, independent of if you have a switch,
> or just a bunch of networks cards.
>
> > Can you point me to any conference slides,
> > or design docs or other documentation that describes a netdev
> > implementation on Trident II switch silicon? Or any other switch that
> > supports 128 x 10G (or 32 x 40G) ports?
>
> Look at past netdev conference. In particular, talks given by
> Mellanox, Cumulus, and Netronome. You can also see there drivers in
Thanks. I found a slide deck from Cumulus at
www.slideshare.net/CumulusNetworks/webinarlinux-networking-is-awesome
I think this connects the dots between our worlds. It turns out that
optoe actually is derived from the Cumulus sff_8436 driver, which they
use to access QSFP devices. It was the best available, but had an
experimental implementation of SFP (didn't work yet). They actually use
the at24 driver for SFP. Optoe is actually architecturally identical to
the Cumulus implementation. It does not use the SFP framework, but it
does interface with their Linux network stack via ethtool, etc. In slide
10 of the deck they explicitly call out device drivers, saying "Innovation
and change here is good."
> drivers/ethernet/{mellonex|netronome}. These devices however tend to
> go for firmware to control the PHYs, not the Linux network stack.
> drivers/net/dsa covers SOHO switches, so up to 10 ports, mostly 1G,
> some 10G ports. There is a lot of industry involved thin this segment,
> trains, planes, busses, plant automation, etc, and some WiFi and STP.
> Switches with DSA drivers make use of Linux to control the PHYs, not
> firmware.
Again, optoe does not control the PHYs. It only access the EEPROMs (on
the PHYs). It does not touch any of the electrical pins. It can provide
the EEPROM access to any component that wants that access, including sfp.c.
> SFP are also slowly starting to enter the consumer market, with
> products like the Clearfog, MACCHIATObin, and there are a few
> industrial boards with SOHO switches with SFP sockets or SFF
> modules. These are what are driving in kernel SFP support.
Got it. I'm targeting a different market, with a different
architecture. In this architecture it makes more sense to separate the
EEPROM access from the IO pins control.
> > Also, I see the sfp.c code. Is there any QSFP code? I'd like to see
> > how that code compares to the sfp.c code.
>
> Currently, none that i know of. SFP+ is the limit so far. Mainly
> because SoC/SOHO switches currently top out at 10G, so SFP+ is
> sufficient.
SFP+ is not sufficient for another market, which is using Linux to
manage larger switches. These switches all have some QSFP ports, many
of them have exclusively QSFP ports. I have some useful code for those
environments.
> > optoe can provide access, through the SFP code, to the rest of the EEPROM
> > address space. It can also provide access to the QSFP EEPROM. I would
> > like to collaborate on the integration, that would fit my objective of
> > making more EEPROM space accessible on more platforms and distros.
> >
> > However, you don't want me to make the changes to SFP myself. I don't
> > have any hardware or OS environment that currently supports that code.
> > The cost and learning curve exceed my resources. I *think* the changes
> > to the SFP code would be small, but I would need someone who understands
> > the code and can test it to actually make and submit the changes.
>
> So i have been thinking about this some more. But given your lack of
> resources, i'm guessing this won't actually work for you. But it is
> probably the correct way forwards.
>
> The basic problem the systems you are talking about is that they don't
> have a network interface per port. So they cannot use ethtool
> --module-info, which IS the defined API to get access to SFP
> data. Adding another API is probably not going to get accepted.
Got it. I don't think I'm adding another API. Note that Cumulus is
using the same architecture as optoe, and providing all the expected
linux services, including ethtool --module-info. They are accessing the
module-info data throug ioctl, which opens the device file provided by
their driver and reads/writes the appropriate location. Optoe works the
same way.
> However, the current ethtool API is ioctl based. The network stack is
> moving away from that, replacing it with netlink sockets. All IP
> configuration is now via netlink. All bridge configuration is by
> netlink, etc. So there is a desire to move ethtool to netlink.
>
> This move makes the API more flexible. By default, you would expect
> the replacement implementation for --module-info to pass an ifindex,
> or maybe an interface name. However, it could be made to optionally
> take an i2c bus number. That could then go direct to the SFP code,
> rather than go via the MAC driver. That would give evil user space,
> proprietary, binary blob drivers access to SFP via the standard
> supported kernel API, using the standard supported kernel SFP driver.
Here's what I have in mind... struct sfp in sfp.c has a read and write:
int (*read) struct sfp *, bool, u8, void *, size_t);
int (*write) struct sfp *, bool, u8, void *, size_t);
These are instantiated with:
sfp->read = sfp_i2c_read;
sfp->write = sfp_i2c_write;
So to insert optoe into this stack, we would need to add an i2c_client
to struct sfp:
struct i2c_client *i2c_client;
We would need to initialize that i2c_client in sfp_i2c_configure:
board_info = alloc_board_info(sfp);
sfp->i2c_client = i2c_new_device(i2c, board_info);
We need to write the brief routine that allocates a struct_i2c_board_info,
and stuffs the necessary data into it. I'm assuming that data can come
from sfp. The data required includes an appropriate name for this device,
and whether it is an SFP or QSFP device. (When QSFP is added to sfp.c,
we can add a flag to struct sfp. Your stack will have to know which it
is anyway to know where the necessary data is in the EEPROM.)
Finally, we replace the body of sfp_i2c_{read, write} with a callback to
optoe. All of the necessary data is already in the parameters to
sfp_i2c_{read, write}.
>
> But that requires you roll up your sleeves and get stuck in to this
> conversion work.
I'm offering an improvement to sfp.c. The improvement is access to more
pages of SFP EEPROM, and access to QSFP EEPROMs. It comes in the form of
a specialized EEPROM driver custom built for {Q}SFP devices. I'm also
offering to help integrate that driver into sfp.c. I can modify optoe
to accomodate sfp.c, I can recommend how to instantiate and call it. I am
not going to be able to spend the time and money required to modify and
test sfp.c. I'm pretty sure you can do it MUCH faster, and MUCH better
than I can.
>
> But you say you work for a fibre module company. Do they produce
> SFP/SFP+ modules? You could get one of the supported consumer boards
> with an SFP/SFP+ socket and test your modules work properly. Build out
Unfortunately, that isn't going to happen on their dime. Their dimes
are running out for this kind of work.
> the SFP code. It has been on my TODO list to add HWMON support for the
> temperature sensors, etc.
Huh. Just read Documentation/hwmon/sysfs-interface. Looks like a good
way to deliver that EEPROM data. Wish I'd found that two years ago when
there were a few more dimes available.
Don
^ permalink raw reply
* Re: [PATCH net] bpf: enforce correct alignment for instructions
From: Daniel Borkmann @ 2018-06-21 21:03 UTC (permalink / raw)
To: Eric Dumazet, David Miller, edumazet; +Cc: netdev, kafai, ast
In-Reply-To: <e1aec35e-957d-c38d-0bc5-d39c20773136@gmail.com>
On 06/21/2018 06:08 AM, Eric Dumazet wrote:
> On 06/20/2018 08:46 PM, David Miller wrote:
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Wed, 20 Jun 2018 17:24:09 -0700
>>
>>> After commit 9facc336876f ("bpf: reject any prog that failed read-only lock")
>>> offsetof(struct bpf_binary_header, image) became 3 instead of 4,
>>> breaking powerpc BPF badly, since instructions need to be word aligned.
>>>
>>> Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>
>> I'll apply this directly, thanks Eric.
>
> Thanks David :)
Sigh, sorry for the breakage, looks like I got fooled by x86 gcc.
struct bpf_binary_header {
u16 pages; /* 0 2 */
u16 locked:1; /* 2:15 2 */
/* XXX 15 bits hole, try to pack */
u8 image[0]; /* 4 0 */
/* size: 4, cachelines: 1, members: 3 */
/* bit holes: 1, sum bit holes: 15 bits */
/* last cacheline: 4 bytes */
};
Thanks Eric!
^ permalink raw reply
* Re: Route fallback issue
From: Grant Taylor @ 2018-06-21 21:08 UTC (permalink / raw)
To: Julian Anastasov
Cc: Akshat Kakkar, netdev, cronolog+lartc, lartc, Erik Auerswald
In-Reply-To: <alpine.LFD.2.20.1806212218070.2159@ja.home.ssi.bg>
[-- Attachment #1: Type: text/plain, Size: 6350 bytes --]
On 06/21/2018 01:57 PM, Julian Anastasov wrote:
> Hello,
Hi.
> I think so
Okay.
I'll do some more digging.
> You can search on net. I have some old docs on these issues, they should
> be actual:
>
> http://ja.ssi.bg/dgd-usage.txt
"DGD" or "Dead Gateway Detection" sounds very familiar. I referenced it
in an earlier reply.
I distinctly remember DGD not behaving satisfactorily years ago. Where
unsatisfactorily was something like 90 seconds (or more) to recover.
Which actually matches what I was getting without the
ignore_routes_with_linkdown=1 setting that David A. mentioned.
With ignore_routes_with_linkdown=1 things behaved much better.
> Not true. net/ipv4/fib_semantics.c:fib_select_path() calls
> fib_select_default() only when prefixlen = 0 (default route).
Okay.... My testing last night disagrees with you. Specifically, I was
able to add a alternate routes to the same prefix, 192.0.2.128/26.
There was not any default gateway configured on any of the NetNSs. So
everything was using routes for locally attacked or the two added via
"ip route append".
What am I misinterpreting? Or where are we otherwise talking past each
other?
> Otherwise, only the first route will be considered.
"only the first route" almost sounds like something akin to Equal Cost
Multi Path.
I was not expecting "alternative routes" to use more than one route at a
time, equally or otherwise. I was wanting for the kernel to fall back
to an alternate route / gateway / path in the event that the one that
was being used became unusable / unreachable.
So what should "Alternative Routes" do? How does this compare /
contract to E.C.M.P. or D.G.D.
> fib_select_default() is the function that decides which nexthop
> is reachable and whether to contact it. It uses the ARP state via
> fib_detect_death(). That is all code that is behind this feature called
> "alternative routes": the kernel selects one based on nexthop's ARP
> state.
Please confirm that you aren't entering / referring to E.C.M.P.
territory when you say "nexthop". I think that you are not, but I want
to ask and be sure, particularly seeing as how things are very closely
related.
It sounds like you're referring to literally the router that is the next
hop in the path. I.e. the device on the other end of the wire.
I'll have to find, read, and try to grok the code to have a better idea.
That being said, it looks like (based on the name) that
fib_select_default() deals with the default route. The testing I did
last night, and positive results, indicate that the kernel did what I
wanted it to do. (See above about D.G.D. vs E.C.M.P.)
So, it seems as if something about alternative routes worked using
non-default routes. I have no way of knowing if it was the code that
we're talking about, or something else that produced the results. Given
the way I did the test (specific prefixes, non-default, routes being
appended with no other routes) worked the way that I would have thought
that a feature that uses alternative routes (or historically D.G.D.)
would have worked.
The following ping works just fine as I bounce interfaces on NS1.
ns2# ping -I 192.0.2.254 192.0.2.129
I can confirm that traffic is moving back and forth between the vEth
links between the NetNSs. Granted, the traffic sticks to one vEth
interface until it goes away.
I can shut down ns2a on NS1 so that ns1a sees loss of link but but stays
up on NS2, and traffic moves to vEth-B.
I can then open up ns2a on NS1 so that ns1a sees link on NS2, and
re-append the route on NS1.
I can then shut down ns2b on NS1 so that ns1b sees loss of link but
stays up on NS2, and traffic moves to vEth-A.
I can then open up ns2b on NS1 so that ns1b sees link on NS2, and
re-append the route on NS1.
NS2 behaves exactly as I would hope. Traffic will move from the down
interface to the remaining up interface. Back and forth, no problem.
I don't know where the disconnect is, but I feel like there is one.
> Routes with different metric are considered only when the routes with
> lower metric are removed.
I agree with the statement. What I question is where metric came into
play here. All of the routes had the same (default) metric. None of
the routes I tested had different metrics.
ns1# ip route show
192.0.2.0/26 dev ns2a proto kernel scope link src 192.0.2.1
192.0.2.64/26 dev ns2b proto kernel scope link src 192.0.2.65
192.0.2.128/26 dev dummy0 proto kernel scope link src 192.0.2.129
192.0.2.192/26 via 192.0.2.62 dev ns2a
192.0.2.192/26 via 192.0.2.126 dev ns2b
ns2# ip route show
192.0.2.0/26 dev ns1a proto kernel scope link src 192.0.2.62
192.0.2.64/26 dev ns1b proto kernel scope link src 192.0.2.126
192.0.2.128/26 via 192.0.2.65 dev ns1b
192.0.2.128/26 via 192.0.2.1 dev ns1a
192.0.2.192/26 dev dummy0 proto kernel scope link src 192.0.2.254
> IIRC, this flag invalidates nexthops depending on the link state. If
> your link is always UP it does not help much.
That's what I gathered. So things like DSL & cable modems or other L2
bridging devices might not drop the link when their circuit drops.
This is also why I asked the follow up questions to David's email.
I want to do some testing to see if fib_multipath_use_neigh alters this
behavior at all. I'm hoping that it will invalidate an alternate route
if the MAC is not resolvable even if the physical link stays up.
Sure, the ARP cache may have a 30 ~ 120 second timeout before triggering
this behavior. But having that timeout and starting to use an
alternative route is considerably better than not using an alternative
route.
> If you rely on user space tool, you can check the state of the desired
> hops: device link state, your gateway to ISP, one or more gateways in the
> ISP network which you consider permanent part of the path via this ISP.
This is what I have thought about doing previously.
> First route can be created with 'add' but all next alternative routes
> can be added only with "append". If you successfully add them with
> "add" it means they are not alternatives to the first one, they are not
> considered at all.
ACK
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3982 bytes --]
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox