* Re: [RFC v3 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-05-10 7:32 UTC (permalink / raw)
To: Tiwei Bie, mst, virtualization, linux-kernel, netdev; +Cc: wexu, jfreimann
In-Reply-To: <20180425051550.24342-4-tiwei.bie@intel.com>
On 2018年04月25日 13:15, Tiwei Bie wrote:
> + /* We're using some buffers from the free list. */
> + vq->vq.num_free -= descs_used;
> +
> + /* Update free pointer */
> + if (indirect) {
> + n = head + 1;
> + if (n >= vq->vring_packed.num) {
> + n = 0;
> + vq->wrap_counter ^= 1;
> + }
> + vq->next_avail_idx = n;
> + } else
> + vq->next_avail_idx = i;
During testing zerocopy (out of order completion), I found driver may
submit two identical buffer id to vhost. So the above code may not work
well.
Consider the case that driver adds 3 buffer and virtqueue size is 8.
a) id = 0,count = 2,next_avail = 2
b) id = 2,count = 4,next_avail = 2
c) id = 4,count = 2,next_avail = 0
if packet b is done before packet a, driver may think buffer id 0 is
available and try to use it if even if the real buffer 0 was not done.
Thanks
^ permalink raw reply
* Re: [bpf-next v3 8/9] bpf: Provide helper to do forwarding lookups in kernel FIB table
From: Jesper Dangaard Brouer @ 2018-05-10 7:31 UTC (permalink / raw)
To: David Ahern
Cc: netdev, borkmann, ast, davem, shm, roopa, toke, john.fastabend,
brouer
In-Reply-To: <20180510033427.20756-9-dsahern@gmail.com>
On Wed, 9 May 2018 20:34:26 -0700
David Ahern <dsahern@gmail.com> wrote:
> Provide a helper for doing a FIB and neighbor lookup in the kernel
> tables from an XDP program. The helper provides a fastpath for forwarding
> packets. If the packet is a local delivery or for any reason is not a
> simple lookup and forward, the packet continues up the stack.
>
> If it is to be forwarded, the forwarding can be done directly if the
> neighbor is already known. If the neighbor does not exist, the first
> few packets go up the stack for neighbor resolution. Once resolved, the
> xdp program provides the fast path.
>
> On successful lookup the nexthop dmac, current device smac and egress
> device index are returned.
>
> The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
> are implemented in this patch. The API includes layer 4 parameters if
> the XDP program chooses to do deep packet inspection to allow compare
> against ACLs implemented as FIB rules.
>
> Header rewrite is left to the XDP program.
>
> The lookup takes 2 flags:
> - BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
> straight to the table associated with the device (expert setting for
> those looking to maximize throughput)
>
> - BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
> Default is an ingress lookup.
>
> Initial performance numbers collected by Jesper, forwarded packets/sec:
>
> Full stack XDP FIB lookup XDP Direct lookup
> IPv4 1,947,969 7,074,156 7,415,333
> IPv6 1,728,000 6,165,504 7,262,720
>
The "Full stack" tests were with netfilter modules unloaded. Default
setting with netfilter conntrack loaded and default Fedora firewall
rules, show around 700Kpps.
> These number are single CPU core forwarding on a Broadwell
> E5-1650 v4 @ 3.60GHz.
>
> Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
This helper is awesome, as it really shows how XDP is meant to work in
concert and cooperate with the existing network stack.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* [PATCH] e1000e: Ignore TSYNCRXCTL when getting I219 clock attributes
From: Benjamin Poirier @ 2018-05-10 7:28 UTC (permalink / raw)
To: Jeff Kirsher
Cc: Keller, Jacob E, Achim Mildenberger, olouvignes, jayanth,
ehabkost, postmodern.mod3, Bart.VanAssche, intel-wired-lan,
netdev, linux-kernel
In-Reply-To: <02874ECE860811409154E81DA85FBB5882D918F3@ORSMSX115.amr.corp.intel.com>
There have been multiple reports of crashes that look like
kernel: RIP: 0010:[<ffffffff8110303f>] timecounter_read+0xf/0x50
[...]
kernel: Call Trace:
kernel: [<ffffffffa0806b0f>] e1000e_phc_gettime+0x2f/0x60 [e1000e]
kernel: [<ffffffffa0806c5d>] e1000e_systim_overflow_work+0x1d/0x80 [e1000e]
kernel: [<ffffffff810992c5>] process_one_work+0x155/0x440
kernel: [<ffffffff81099e16>] worker_thread+0x116/0x4b0
kernel: [<ffffffff8109f422>] kthread+0xd2/0xf0
kernel: [<ffffffff8163184f>] ret_from_fork+0x3f/0x70
These can be traced back to the fact that e1000e_systim_reset() skips the
timecounter_init() call if e1000e_get_base_timinca() returns -EINVAL, which
leads to a null deref in timecounter_read().
Commit 83129b37ef35 ("e1000e: fix systim issues", v4.2-rc1) reworked
e1000e_get_base_timinca() in such a way that it can return -EINVAL for
e1000_pch_spt if the SYSCFI bit is not set in TSYNCRXCTL.
Some experimentation has shown that on I219 (e1000_pch_spt, "MAC: 12")
adapters, the E1000_TSYNCRXCTL_SYSCFI flag is unstable; TSYNCRXCTL reads
sometimes don't have the SYSCFI bit set. Retrying the read shortly after
finds the bit to be set. This was observed at boot (probe) but also link up
and link down.
Moreover, the phc (PTP Hardware Clock) seems to operate normally even after
reads where SYSCFI=0. Therefore, remove this register read and
unconditionally set the clock parameters.
Reported-by: Achim Mildenberger <admin@fph.physik.uni-karlsruhe.de>
Message-Id: <20180425065243.g5mqewg5irkwgwgv@f2>
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1075876
Fixes: 83129b37ef35 ("e1000e: fix systim issues")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
---
drivers/net/ethernet/intel/e1000e/netdev.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index ec4a9759a6f2..3afb1f3b6f91 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3546,15 +3546,12 @@ s32 e1000e_get_base_timinca(struct e1000_adapter *adapter, u32 *timinca)
}
break;
case e1000_pch_spt:
- if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
- /* Stable 24MHz frequency */
- incperiod = INCPERIOD_24MHZ;
- incvalue = INCVALUE_24MHZ;
- shift = INCVALUE_SHIFT_24MHZ;
- adapter->cc.shift = shift;
- break;
- }
- return -EINVAL;
+ /* Stable 24MHz frequency */
+ incperiod = INCPERIOD_24MHZ;
+ incvalue = INCVALUE_24MHZ;
+ shift = INCVALUE_SHIFT_24MHZ;
+ adapter->cc.shift = shift;
+ break;
case e1000_pch_cnp:
if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI) {
/* Stable 24MHz frequency */
--
2.16.3
^ permalink raw reply related
* Re: [bpf-next v3 9/9] samples/bpf: Add example of ipv4 and ipv6 forwarding in XDP
From: Jesper Dangaard Brouer @ 2018-05-10 7:22 UTC (permalink / raw)
To: David Ahern
Cc: netdev, borkmann, ast, davem, shm, roopa, toke, john.fastabend,
brouer, Tariq Toukan
In-Reply-To: <20180510033427.20756-10-dsahern@gmail.com>
On Wed, 9 May 2018 20:34:27 -0700
David Ahern <dsahern@gmail.com> wrote:
> Simple example of fast-path forwarding. It has a serious flaw
> in not verifying the egress device index supports XDP forwarding.
> If the egress device does not packets are dropped.
>
> Take this only as a simple example of fast-path forwarding.
>
> Signed-off-by: David Ahern <dsahern@gmail.com>
> Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
I agree that sample program have this flaw, but it should not stop this
patchset. We need to find a more reliable way of detecting/verifying
that an egress device supports XDP forwarding, from within the BPF prog.
As this sample program hints, we could do a lookup in the devmap, to
get this info(?)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply
* [PATCH v2] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
From: Christophe JAILLET @ 2018-05-10 7:06 UTC (permalink / raw)
To: davem, tariqt
Cc: netdev, linux-rdma, linux-kernel, kernel-janitors,
Christophe JAILLET
If an error occurs, 'mlx4_en_destroy_netdev()' is called.
It then calls 'mlx4_en_free_resources()' which does the needed resources
cleanup.
So, doing some explicit kfree in the error handling path would lead to
some double kfree.
Simplify code to avoid such a case.
Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
v1 -> v2 : rewrite the fix as explained by Tariq Toukan
(this 2nd version may have been posted twice, once without the
v2 tag. PLease ignore the first one)
---
drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index e0adac4a9a19..9670b33fc9b1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3324,12 +3324,11 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
MAX_TX_RINGS, GFP_KERNEL);
if (!priv->tx_ring[t]) {
err = -ENOMEM;
- goto err_free_tx;
+ goto out;
}
priv->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) *
MAX_TX_RINGS, GFP_KERNEL);
if (!priv->tx_cq[t]) {
- kfree(priv->tx_ring[t]);
err = -ENOMEM;
goto out;
}
@@ -3582,11 +3581,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
return 0;
-err_free_tx:
- while (t--) {
- kfree(priv->tx_ring[t]);
- kfree(priv->tx_cq[t]);
- }
out:
mlx4_en_destroy_netdev(dev);
return err;
--
2.17.0
^ permalink raw reply related
* [PATCH] net/mlx4_en: Fix an error handling path in 'mlx4_en_init_netdev()'
From: Christophe JAILLET @ 2018-05-10 7:02 UTC (permalink / raw)
To: davem, tariqt
Cc: netdev, linux-rdma, linux-kernel, kernel-janitors,
Christophe JAILLET
If an error occurs, 'mlx4_en_destroy_netdev()' is called.
It then calls 'mlx4_en_free_resources()' which does the needed resources
cleanup.
So, doing some explicit kfree in the error handling path would lead to
some double kfree.
Simplify code to avoid such a case.
Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index e0adac4a9a19..9670b33fc9b1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3324,12 +3324,11 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
MAX_TX_RINGS, GFP_KERNEL);
if (!priv->tx_ring[t]) {
err = -ENOMEM;
- goto err_free_tx;
+ goto out;
}
priv->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) *
MAX_TX_RINGS, GFP_KERNEL);
if (!priv->tx_cq[t]) {
- kfree(priv->tx_ring[t]);
err = -ENOMEM;
goto out;
}
@@ -3582,11 +3581,6 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
return 0;
-err_free_tx:
- while (t--) {
- kfree(priv->tx_ring[t]);
- kfree(priv->tx_cq[t]);
- }
out:
mlx4_en_destroy_netdev(dev);
return err;
--
2.17.0
^ permalink raw reply related
* [PATCH net-next v2] tcp: Add mark for TIMEWAIT sockets
From: Jon Maxwell @ 2018-05-10 6:53 UTC (permalink / raw)
To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, jmaxwell
This version has some suggestions by Eric Dumazet:
- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
races.
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement.
- Factorize code as sk_fullsock() check is not necessary.
Aidan McGurn from Openwave Mobility systems reported the following bug:
"Marked routing is broken on customer deployment. Its effects are large
increase in Uplink retransmissions caused by the client never receiving
the final ACK to their FINACK - this ACK misses the mark and routes out
of the incorrect route."
Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
sysctl is enabled. But not for TW sockets that had sk->sk_mark set via
setsockopt(SO_MARK..).
Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
Then progate this so that the skb gets sent with the correct mark. Do the same
for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
netfilter rules are still honored.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
---
include/net/inet_timewait_sock.h | 1 +
net/ipv4/ip_output.c | 2 +-
net/ipv4/tcp_ipv4.c | 16 ++++++++++++++--
net/ipv4/tcp_minisocks.c | 1 +
net/ipv6/tcp_ipv6.c | 6 +++++-
5 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..659d8ed5a3bc 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -62,6 +62,7 @@ struct inet_timewait_sock {
#define tw_dr __tw_common.skc_tw_dr
int tw_timeout;
+ __u32 tw_mark;
volatile unsigned char tw_substate;
unsigned char tw_rcv_wscale;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 95adb171f852..b5e21eb198d8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1561,7 +1561,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
oif = skb->skb_iif;
flowi4_init_output(&fl4, oif,
- IP4_REPLY_MARK(net, skb->mark),
+ IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,
RT_TOS(arg->tos),
RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..caf23de88f8a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
struct sock *sk1 = NULL;
#endif
struct net *net;
+ struct sock *ctl_sk;
/* Never send a reset in response to a reset. */
if (th->rst)
@@ -723,11 +724,16 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
- ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+ ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+ if (sk)
+ ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
+ inet_twsk(sk)->tw_mark : sk->sk_mark;
+ ip_send_unicast_reply(ctl_sk,
skb, &TCP_SKB_CB(skb)->header.h4.opt,
ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
&arg, arg.iov[0].iov_len);
+ ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
local_bh_enable();
@@ -759,6 +765,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
} rep;
struct net *net = sock_net(sk);
struct ip_reply_arg arg;
+ struct sock *ctl_sk;
memset(&rep.th, 0, sizeof(struct tcphdr));
memset(&arg, 0, sizeof(arg));
@@ -809,11 +816,16 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
- ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+ ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+ if (sk)
+ ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
+ inet_twsk(sk)->tw_mark : sk->sk_mark;
+ ip_send_unicast_reply(ctl_sk,
skb, &TCP_SKB_CB(skb)->header.h4.opt,
ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
&arg, arg.iov[0].iov_len);
+ ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
local_bh_enable();
}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 57b5468b5139..f867658b4b30 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -263,6 +263,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
struct inet_sock *inet = inet_sk(sk);
tw->tw_transparent = inet->transparent;
+ tw->tw_mark = sk->sk_mark;
tw->tw_rcv_wscale = tp->rx_opt.rcv_wscale;
tcptw->tw_rcv_nxt = tp->rcv_nxt;
tcptw->tw_snd_nxt = tp->snd_nxt;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6d664d83cd16..7d47c2b550a9 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -803,6 +803,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
unsigned int tot_len = sizeof(struct tcphdr);
struct dst_entry *dst;
__be32 *topt;
+ __u32 mark = 0;
if (tsecr)
tot_len += TCPOLEN_TSTAMP_ALIGNED;
@@ -871,7 +872,10 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
fl6.flowi6_oif = oif;
}
- fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark);
+ if (sk)
+ mark = (sk->sk_state == TCP_TIME_WAIT) ?
+ inet_twsk(sk)->tw_mark : sk->sk_mark;
+ fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark) ?: mark;
fl6.fl6_dport = t1->dest;
fl6.fl6_sport = t1->source;
fl6.flowi6_uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
--
2.13.6
^ permalink raw reply related
* Re: net: hang in unregister_netdevice: waiting for lo to become free
From: Dmitry Vyukov @ 2018-05-10 6:46 UTC (permalink / raw)
To: Dan Streetman
Cc: Tommi Rantala, Neil Horman, Xin Long, David Ahern,
Daniel Borkmann, Cong Wang, David Miller, Eric Dumazet,
Willem de Bruijn, Jakub Kicinski, Rasmus Villemoes, netdev, LKML,
Alexey Kuznetsov, Hideaki YOSHIFUJI, syzkaller, Dan Streetman,
Eric W. Biederman, Alexey Kodanev
In-Reply-To: <CALZtONDn_yQb_EeSgbwk0Trzp7TA2uXLc6tuxmmWd0uyqOq5aA@mail.gmail.com>
On Mon, Apr 16, 2018 at 9:42 PM, Dan Streetman <ddstreet@ieee.org> wrote:
>>>>> On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
>>>>> <tommi.t.rantala@nokia.com> wrote:
>>>>>> On 20.02.2018 18:26, Neil Horman wrote:
>>>>>>>
>>>>>>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>>>>>>
>>>>>>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>>>>>>> <tommi.t.rantala@nokia.com> wrote:
>>>>>>>>>
>>>>>>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>>>>>>
>>>>>>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>>>>>>> latest upstream tree.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These two commits are in v4.16-rc1:
>>>>>>>>>
>>>>>>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>>>>>>> Author: Tommi Rantala <tommi.t.rantala@nokia.com>
>>>>>>>>> Date: Mon Feb 5 21:48:14 2018 +0200
>>>>>>>>>
>>>>>>>>> sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>>>>>>> ...
>>>>>>>>> Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>>>>>> Fixes: 0ca50d12f ("sctp: fix src address selection if using
>>>>>>>>> secondary
>>>>>>>>> addresses")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>>>>>>> Author: Alexey Kodanev <alexey.kodanev@oracle.com>
>>>>>>>>> Date: Mon Feb 5 15:10:35 2018 +0300
>>>>>>>>>
>>>>>>>>> sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>>>>>>> ...
>>>>>>>>> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>>>>>>>>> secondary
>>>>>>>>> addresses for ipv6")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I guess we missed something if it's still reproducible.
>>>>>>>>>
>>>>>>>>> I can check it later this week, unless someone else beat me to it.
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Tommi,
>>>>>>>>
>>>>>>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>>>>>>> another one then. But I am still seeing these:
>>>>>>>>
>>>>>>>> [ 58.799130] unregister_netdevice: waiting for lo to become free.
>>>>>>>> Usage count = 4
>>>>>>>> [ 60.847138] unregister_netdevice: waiting for lo to become free.
>>>>>>>> Usage count = 4
>>>>>>>> [ 62.895093] unregister_netdevice: waiting for lo to become free.
>>>>>>>> Usage count = 4
>>>>>>>> [ 64.943103] unregister_netdevice: waiting for lo to become free.
>>>>>>>> Usage count = 4
>>>>>>>>
>>>>>>>> on upstream tree pulled ~12 hours ago.
>>>>>>>>
>>>>>>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>>>>>>> out a
>>>>>>> backtrace if the device name matches "lo". That should tell us
>>>>>>> definitively if
>>>>>>> the problem is in the same location or not
>>>>>>
>>>>>>
>>>>>> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
>>>>>> sent in the first email in this thread:
>>>>>>
>>>>>> With 4.16-rc2 unable to reproduce.
>>>>>>
>>>>>> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
>>>>>> lo to become free. Usage count = 3"
>>>>>>
>>>>>> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
>>>>>> cherry-picked on top, unable to reproduce.
>>>>>>
>>>>>>
>>>>>> Is syzkaller doing something else now to trigger the bug...?
>>>>>> Can you still trigger the bug with the same reproducer?
>>>>>
>>>>> Hi Neil, Tommi,
>>>>>
>>>>> Reviving this old thread about "unregister_netdevice: waiting for lo
>>>>> to become free. Usage count = 3" hangs.
>>>>> I still did not have time to deep dive into what happens there (too
>>>>> many bugs coming from syzbot). But this still actively happens and I
>>>>> suspect accounts to a significant portion of various hang reports,
>>>>> which are quite unpleasant.
>>>>>
>>>>> One idea that could make it all simpler:
>>>>>
>>>>> Is this wait loop in netdev_wait_allrefs() supposed to wait for any
>>>>> prolonged periods of time under any non-buggy conditions? E.g. more
>>>>> than 1-2 minutes?
>>>>> If it only supposed to wait briefly for things that already supposed
>>>>> to be shutting down, and we add a WARNING there after some timeout,
>>>>> then syzbot will report all info how/when it happens, hopefully
>>>>> extracting reproducers, and all the nice things.
>>>>> But this WARNING should not have any false positives under any
>>>>> realistic conditions (e.g. waiting for arrival of remote packets with
>>>>> large timeouts).
>>>>>
>>>>> Looking at some task hung reports, it seems that this code holds some
>>>>> mutexes, takes workqueue thread and prevents any progress with
>>>>> destruction of other devices (and net namespace creation/destruction),
>>>>> so I guess it should not wait for any indefinite periods of time?
>>>>
>>>> I'm working on this currently:
>>>> https://bugs.launchpad.net/ubuntu/zesty/+source/linux/+bug/1711407
>>>>
>>>> I added a summary of what I've found to be the cause (or at least, one
>>>> possible cause) of this:
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/72
>>>>
>>>> I'm working on a patch to work around the main side-effect of this,
>>>> which is hanging while holding the global net mutex. Hangs will still
>>>> happen (e.g. if a dst leaks) but should not affect anything else,
>>>> other than a leak of the dst and its net namespace.
>>>>
>>>> Fixing the dst leaks is important too, of course, but a dst leak (or
>>>> other cause) shouldn't break the entire system.
>>>
>>> Leaking some memory is definitely better than hanging the system.
>>>
>>> So I've made syzkaller to recognize "unregister_netdevice: waiting for
>>> (.*) to become free" as a kernel bug:
>>> https://github.com/google/syzkaller/commit/7a67784ca8bdc3b26cce2f0ec9a40d2dd9ec9396
>>> Unfortunately it does not make it catch these bugs because creating a
>>> net namespace per test is too damn slow, so namespaces are reused for
>>> lots of tests and when/if it's eventually destroyed it's already too
>>> late to find root cause.
>>>
>>> But I've run a one-off experiment with prompt net namespace
>>> destruction and syzkaller was able to easily extract a C reproducer:
>>> https://gist.githubusercontent.com/dvyukov/d571e8fff24e127ca48a8c4790d42bfa/raw/52050e93ba9afbb5126b9d7bb39b7e71a82af016/gistfile1.txt
>>>
>>> On upstream 16e205cf42da1f497b10a4a24f563e6c0d574eec with this config:
>>> https://gist.githubusercontent.com/dvyukov/9663c57443adb21f2795b92ef0829d62/raw/bbea0652e23746096dd56855a28f6c681aebcdee/gistfile1.txt
>>>
>>> this gives me:
>>>
>>> [ 83.183198] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 9
>>> [ 85.231202] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 9
>>> ...
>>> [ 523.511205] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 9
>>> ...
>>>
>>> This is generated from this syzkaller program:
>>>
>>> r0 = socket$inet6(0xa, 0x1, 0x84)
>>> setsockopt$inet6_IPV6_XFRM_POLICY(r0, 0x29, 0x23,
>>> &(0x7f0000000380)={{{@in6=@remote={0xfe, 0x80, [], 0xbb},
>>> @in=@dev={0xac, 0x14, 0x14}, 0x0, 0x0, 0x0, 0x0, 0xa}, {}, {}, 0x0,
>>> 0x0, 0x1}, {{@in=@local={0xac, 0x14, 0x14, 0xaa}, 0x0, 0x32}, 0x0,
>>> @in=@local={0xac, 0x14, 0x14, 0xaa}, 0x3504}}, 0xe8)
>>> bind$inet6(r0, &(0x7f0000000000)={0xa, 0x4e20}, 0x1c)
>>> connect$inet(r0, &(0x7f0000000040)={0x2, 0x4e20, @dev={0xac, 0x14,
>>> 0x14, 0xd}}, 0x10)
>>> syz_emit_ethernet(0x3e, &(0x7f00000001c0)={@local={[0xaa, 0xaa, 0xaa,
>>> 0xaa, 0xaa], 0xaa}, @dev={[0xaa, 0xaa, 0xaa, 0xaa, 0xaa]}, [],
>>> {@ipv6={0x86dd, {0x0, 0x6, "50a09c", 0x8, 0xffffff11, 0x0,
>>> @remote={0xfe, 0x80, [], 0xbb}, @local={0xfe, 0x80, [], 0xaa}, {[],
>>> @udp={0x0, 0x4e20, 0x8}}}}}}, &(0x7f0000000040))
>>>
>>> So this seems to be related to IPv6 and/or xfrm and is potentially
>>> caused by external packets (that syz_emit_ethernet call).
>>
>>
>>
>> Here is another repro which seems to be a different bug (note that it
>> requires fault injection):
>>
>> https://gist.githubusercontent.com/dvyukov/1c56623016cc4c24a69d433c5114ad5b/raw/530478f571b195193101b912aa646948528baa8e/gistfile1.txt
>>
>> Dan, do you mind taking a look at them? Fixing these should eliminate
>> root causes of these hangs/leaks.
>
> Yep I will look at them, thanks for the reproducers.
Hi Dan,
Any updates on this? syzbot is hitting this all the time.
^ permalink raw reply
* [PATCH] net: ipv4: remove define INET_CSK_DEBUG and unnecessary EXPORT_SYMBOL
From: Joe Perches @ 2018-05-10 6:24 UTC (permalink / raw)
To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI
Cc: Li RongQing, Arnaldo Carvalho de Melo, netdev, linux-kernel
INET_CSK_DEBUG is always set and only is used for 2 pr_debug calls.
EXPORT_SYMBOL(inet_csk_timer_bug_msg) is only used by these 2
pr_debug calls and is also unnecessary as the exported string can
be used directly by these calls.
Signed-off-by: Joe Perches <joe@perches.com>
---
include/net/inet_connection_sock.h | 22 ++++------------------
net/ipv4/inet_connection_sock.c | 5 -----
2 files changed, 4 insertions(+), 23 deletions(-)
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ab6667275df..0a6c9e0f2b5a 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -23,8 +23,6 @@
#include <net/inet_sock.h>
#include <net/request_sock.h>
-#define INET_CSK_DEBUG 1
-
/* Cancel timers, when they are not required. */
#undef INET_CSK_CLEAR_TIMERS
@@ -196,10 +194,6 @@ static inline void inet_csk_delack_init(struct sock *sk)
void inet_csk_delete_keepalive_timer(struct sock *sk);
void inet_csk_reset_keepalive_timer(struct sock *sk, unsigned long timeout);
-#ifdef INET_CSK_DEBUG
-extern const char inet_csk_timer_bug_msg[];
-#endif
-
static inline void inet_csk_clear_xmit_timer(struct sock *sk, const int what)
{
struct inet_connection_sock *icsk = inet_csk(sk);
@@ -214,12 +208,9 @@ static inline void inet_csk_clear_xmit_timer(struct sock *sk, const int what)
#ifdef INET_CSK_CLEAR_TIMERS
sk_stop_timer(sk, &icsk->icsk_delack_timer);
#endif
+ } else {
+ pr_debug("inet_csk BUG: unknown timer value\n");
}
-#ifdef INET_CSK_DEBUG
- else {
- pr_debug("%s", inet_csk_timer_bug_msg);
- }
-#endif
}
/*
@@ -232,10 +223,8 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
struct inet_connection_sock *icsk = inet_csk(sk);
if (when > max_when) {
-#ifdef INET_CSK_DEBUG
pr_debug("reset_xmit_timer: sk=%p %d when=0x%lx, caller=%p\n",
sk, what, when, current_text_addr());
-#endif
when = max_when;
}
@@ -249,12 +238,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
icsk->icsk_ack.pending |= ICSK_ACK_TIMER;
icsk->icsk_ack.timeout = jiffies + when;
sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
+ } else {
+ pr_debug("inet_csk BUG: unknown timer value\n");
}
-#ifdef INET_CSK_DEBUG
- else {
- pr_debug("%s", inet_csk_timer_bug_msg);
- }
-#endif
}
static inline unsigned long
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 881ac6d046f2..33a88e045efd 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -27,11 +27,6 @@
#include <net/sock_reuseport.h>
#include <net/addrconf.h>
-#ifdef INET_CSK_DEBUG
-const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
-EXPORT_SYMBOL(inet_csk_timer_bug_msg);
-#endif
-
#if IS_ENABLED(CONFIG_IPV6)
/* match_wildcard == true: IPV6_ADDR_ANY equals to any IPv6 addresses if IPv6
* only, and any IPv4 addresses if not IPv6 only
--
2.15.0
^ permalink raw reply related
* Re: [PATCH net-next v1] tcp: Add mark for TIMEWAIT sockets
From: Jonathan Maxwell @ 2018-05-10 6:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, kuznet, yoshfuji, netdev, linux-kernel, Jon Maxwell
In-Reply-To: <eece810c-b3c8-4307-c03a-1cfdaa7109de@gmail.com>
On Thu, May 10, 2018 at 3:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 05/09/2018 10:21 PM, Jon Maxwell wrote:
>
> ...
>
>> if (th->rst)
>> @@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
>> arg.tos = ip_hdr(skb)->tos;
>> arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
>> local_bh_disable();
>> - ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
>> + ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
>> + if (sk && sk->sk_state == TCP_TIME_WAIT)
>> + ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
>> + else if (sk && sk_fullsock(sk))
>
> I do not believe we could have a non fullsock here ?
>
Okay thanks I'll make these changes to v2.
> A request socket (SYN_RECV state) at this point is not expected.
>
>
> So you can factorize :
>
> if (sk)
> ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
> inet_twsk(sk)->tw_mark : sk->sk_mark;
>
> (same remark for IPv6)
>
>
>> + ctl_sk->sk_mark = sk->sk_mark;
>> + ip_send_unicast_reply(ctl_sk,
>> skb, &TCP_SKB_CB(skb)->header.h4.opt,
>> ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>> &arg, arg.iov[0].iov_len);
>
>
^ permalink raw reply
* Re: [PATCH 2/2] alx: add disable_wol paramenter
From: AceLan Kao @ 2018-05-10 5:58 UTC (permalink / raw)
To: Andrew Lunn
Cc: David Miller, James Cliburn, Chris Snook, rakesh, netdev,
Linux-Kernel@Vger. Kernel. Org, Emily Chien
In-Reply-To: <20180509134543.GF14276@lunn.ch>
Hi Andrew,
We have some machines using Qualcomm Atheros Killer E2400 Gigabit
Ethernet Controller,
but none of them has the unintentional wake up issue.
We're willing to fix it if we encountered the issue, but before we can
do it, we need this feature is supported by the driver.
Taking the feature has been removed for 5 years into account, I doubt
if we still can reproduce this issue,
but again, to verify this issue we need to add back this feature first.
Set WoL disabled by default won't introduce any regression but give
users and developers a chance to fix it.
Best regards,
AceLan Kao.
2018-05-09 21:45 GMT+08:00 Andrew Lunn <andrew@lunn.ch>:
> On Wed, May 09, 2018 at 10:40:13AM +0800, AceLan Kao wrote:
>> Hi,
>>
>> I didn't get any response around one month.
>
> I didn't notice you posting an results of searching for the true
> problem? Please can you point me at an email archive.
>
> Andrew
^ permalink raw reply
* Re: [net-next][PATCH] inet: Use switch case instead of multiple condition checks
From: Joe Perches @ 2018-05-10 5:49 UTC (permalink / raw)
To: Li RongQing, netdev
In-Reply-To: <1525930656-6901-1-git-send-email-lirongqing@baidu.com>
On Thu, 2018-05-10 at 13:37 +0800, Li RongQing wrote:
> inet_csk_reset_xmit_timer uses multiple equality condition checks,
> so it is better to use switch case instead of them
[]
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
[]
> @@ -239,22 +239,31 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
> when = max_when;
> }
>
> - if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
> - what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE ||
> - what == ICSK_TIME_REO_TIMEOUT) {
> + switch (what) {
> + case ICSK_TIME_RETRANS:
> + /* fall through */
> + case ICSK_TIME_PROBE0:
> + /* fall through */
> + case ICSK_TIME_EARLY_RETRANS:
> + /* fall through */
> + case ICSK_TIME_LOSS_PROBE:
> + /* fall through */
> + case ICSK_TIME_REO_TIMEOUT:
Please remove the /* fall through */ lines
and use consecutive case labels like:
case ICSK_TIME_RETRANS:
case ICSK_TIME_PROBE0:
case ICSK_TIME_EARLY_RETRANS:
case ICSK_TIME_LOSS_PROBE:
case ICSK_TIME_REO_TIMEOUT:
^ permalink raw reply
* Re: [net-next][PATCH] inet: Use switch case instead of multiple condition checks
From: Eric Dumazet @ 2018-05-10 5:48 UTC (permalink / raw)
To: Li RongQing, netdev
In-Reply-To: <1525930656-6901-1-git-send-email-lirongqing@baidu.com>
On 05/09/2018 10:37 PM, Li RongQing wrote:
> inet_csk_reset_xmit_timer uses multiple equality condition checks,
> so it is better to use switch case instead of them
>
Why is it better ?
I prefer the concise form, and compiler really should not care.
^ permalink raw reply
* Re: [PATCH net-next v1] tcp: Add mark for TIMEWAIT sockets
From: Eric Dumazet @ 2018-05-10 5:45 UTC (permalink / raw)
To: Jon Maxwell, davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, jmaxwell
In-Reply-To: <20180510052111.17886-1-jmaxwell37@gmail.com>
On 05/09/2018 10:21 PM, Jon Maxwell wrote:
...
> if (th->rst)
> @@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
> arg.tos = ip_hdr(skb)->tos;
> arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
> local_bh_disable();
> - ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
> + ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
> + if (sk && sk->sk_state == TCP_TIME_WAIT)
> + ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
> + else if (sk && sk_fullsock(sk))
I do not believe we could have a non fullsock here ?
A request socket (SYN_RECV state) at this point is not expected.
So you can factorize :
if (sk)
ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
inet_twsk(sk)->tw_mark : sk->sk_mark;
(same remark for IPv6)
> + ctl_sk->sk_mark = sk->sk_mark;
> + ip_send_unicast_reply(ctl_sk,
> skb, &TCP_SKB_CB(skb)->header.h4.opt,
> ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
> &arg, arg.iov[0].iov_len);
^ permalink raw reply
* Re: [PATCH ipsec-next] xfrm: Allow Output Mark to be Updated Using UPDSA
From: Eyal Birger @ 2018-05-10 5:44 UTC (permalink / raw)
To: Nathan Harold; +Cc: netdev, steffen.klassert, tobias
In-Reply-To: <20180509204626.56561-1-nharold@google.com>
Hi Nathan,
On Wed, 9 May 2018 13:46:26 -0700
Nathan Harold <nharold@google.com> wrote:
> Allow UPDSA to change output_mark to permit
> policy separation of packet routing decisions from
> SA keying in systems that use mark-based routing.
>
> In the output_mark, used as a routing and firewall
> mark for outbound packets, is made update-able which
> allows routing decisions to be handled independently
> of keying/SA creation. To maintain consistency with
> other optional attributes, the output mark is only
> updated if sent with a non-zero value. Once set, the
> output mark may not be reset to zero, which ensures
> that updating the SA does not require the mark to
> be re-sent to avoid the value being clobbered.
There is an attempt to extend the 'output_mark' to support the input
direction and masking.
In the proposed implementation, output_mark is converted to type 'struct
xfrm_mark' where the semantics are as follows:
- If mark is given by XFRMA_OUTPUT_MARK (renamed to XFRMA_SET_MARK)
then a new XFRMA_SET_MARK_MASK attribute is consulted to set the mask
value
- if no XFRMA_SET_MARK_MASK attribute is provided, the mask is set to
0xffffffff
Therefore, if the mask value is 0, we can regard the mark as 'not
given'.
My question is, in the context of this patch, it seems that the
"Once set, the output mark may not be reset to zero" restriction may be
lifted in favor of updating the mark only if the new mask is non zero.
Does this make sense to you?
Eyal
^ permalink raw reply
* [net-next][PATCH] inet: Use switch case instead of multiple condition checks
From: Li RongQing @ 2018-05-10 5:37 UTC (permalink / raw)
To: netdev
inet_csk_reset_xmit_timer uses multiple equality condition checks,
so it is better to use switch case instead of them
after this patch, the increased image size is acceptable
Before After
size of net/ipv4/tcp_output.o: 721640 721648
size of vmlinux: 400236400 400236401
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
include/net/inet_connection_sock.h | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ab6667275df..d2e9314cf43d 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -239,22 +239,31 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
when = max_when;
}
- if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
- what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE ||
- what == ICSK_TIME_REO_TIMEOUT) {
+ switch (what) {
+ case ICSK_TIME_RETRANS:
+ /* fall through */
+ case ICSK_TIME_PROBE0:
+ /* fall through */
+ case ICSK_TIME_EARLY_RETRANS:
+ /* fall through */
+ case ICSK_TIME_LOSS_PROBE:
+ /* fall through */
+ case ICSK_TIME_REO_TIMEOUT:
icsk->icsk_pending = what;
icsk->icsk_timeout = jiffies + when;
sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
- } else if (what == ICSK_TIME_DACK) {
+ break;
+ case ICSK_TIME_DACK:
icsk->icsk_ack.pending |= ICSK_ACK_TIMER;
icsk->icsk_ack.timeout = jiffies + when;
sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
- }
+ break;
#ifdef INET_CSK_DEBUG
- else {
+ default:
pr_debug("%s", inet_csk_timer_bug_msg);
- }
+ break;
#endif
+ }
}
static inline unsigned long
--
2.16.2
^ permalink raw reply related
* [PATCH net-next v1] tcp: Add mark for TIMEWAIT sockets
From: Jon Maxwell @ 2018-05-10 5:21 UTC (permalink / raw)
To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, jmaxwell
This version has some suggestions by Eric Dumazet:
- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
races.
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement.
Aidan McGurn from Openwave Mobility systems reported the following bug:
"Marked routing is broken on customer deployment. Its effects are large
increase in Uplink retransmissions caused by the client never receiving
the final ACK to their FINACK - this ACK misses the mark and routes out
of the incorrect route."
Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
sysctl is enabled. But not for TW sockets that had sk->sk_mark set via
setsockopt(SO_MARK..).
Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
Then progate this so that the skb gets sent with the correct mark. Do the same
for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
netfilter rules are still honored.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
---
include/net/inet_timewait_sock.h | 1 +
net/ipv4/ip_output.c | 2 +-
net/ipv4/tcp_ipv4.c | 18 ++++++++++++++++--
net/ipv4/tcp_minisocks.c | 1 +
net/ipv6/tcp_ipv6.c | 7 ++++++-
5 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..659d8ed5a3bc 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -62,6 +62,7 @@ struct inet_timewait_sock {
#define tw_dr __tw_common.skc_tw_dr
int tw_timeout;
+ __u32 tw_mark;
volatile unsigned char tw_substate;
unsigned char tw_rcv_wscale;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 95adb171f852..b5e21eb198d8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1561,7 +1561,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
oif = skb->skb_iif;
flowi4_init_output(&fl4, oif,
- IP4_REPLY_MARK(net, skb->mark),
+ IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,
RT_TOS(arg->tos),
RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..fbee36579c83 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
struct sock *sk1 = NULL;
#endif
struct net *net;
+ struct sock *ctl_sk;
/* Never send a reset in response to a reset. */
if (th->rst)
@@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
- ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+ ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+ else if (sk && sk_fullsock(sk))
+ ctl_sk->sk_mark = sk->sk_mark;
+ ip_send_unicast_reply(ctl_sk,
skb, &TCP_SKB_CB(skb)->header.h4.opt,
ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
&arg, arg.iov[0].iov_len);
+ ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
local_bh_enable();
@@ -759,6 +766,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
} rep;
struct net *net = sock_net(sk);
struct ip_reply_arg arg;
+ struct sock *ctl_sk;
memset(&rep.th, 0, sizeof(struct tcphdr));
memset(&arg, 0, sizeof(arg));
@@ -809,11 +817,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
- ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+ ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+ else if (sk && sk_fullsock(sk))
+ ctl_sk->sk_mark = sk->sk_mark;
+ ip_send_unicast_reply(ctl_sk,
skb, &TCP_SKB_CB(skb)->header.h4.opt,
ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
&arg, arg.iov[0].iov_len);
+ ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
local_bh_enable();
}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 57b5468b5139..f867658b4b30 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -263,6 +263,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
struct inet_sock *inet = inet_sk(sk);
tw->tw_transparent = inet->transparent;
+ tw->tw_mark = sk->sk_mark;
tw->tw_rcv_wscale = tp->rx_opt.rcv_wscale;
tcptw->tw_rcv_nxt = tp->rcv_nxt;
tcptw->tw_snd_nxt = tp->snd_nxt;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6d664d83cd16..9b1ecdc1722e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -803,6 +803,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
unsigned int tot_len = sizeof(struct tcphdr);
struct dst_entry *dst;
__be32 *topt;
+ __u32 mark = 0;
if (tsecr)
tot_len += TCPOLEN_TSTAMP_ALIGNED;
@@ -871,7 +872,11 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
fl6.flowi6_oif = oif;
}
- fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark);
+ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ mark = inet_twsk(sk)->tw_mark;
+ else if (sk && sk_fullsock(sk))
+ mark = sk->sk_mark;
+ fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark) ?: mark;
fl6.fl6_dport = t1->dest;
fl6.fl6_sport = t1->source;
fl6.flowi6_uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
--
2.13.6
^ permalink raw reply related
* Re: [PATCH] can: hi311x: Work around TX complete interrupt erratum
From: Lukas Wunner @ 2018-05-10 4:38 UTC (permalink / raw)
To: Marc Kleine-Budde, Wolfgang Grandegger, linux-can, netdev
Cc: Mathias Duckeck, Akshay Bhat, Casey Fitzpatrick
In-Reply-To: <dc7901c96bed90724bb372f4d3eead82bd22ac37.1525869303.git.lukas@wunner.de>
On Wed, May 09, 2018 at 02:43:43PM +0200, Lukas Wunner wrote:
> When sending packets as fast as possible using "cangen -g 0 -i -x", the
> HI-3110 occasionally latches the interrupt pin high on completion of a
> packet, but doesn't set the TXCPLT bit in the INTF register. The INTF
> register contains 0x00 as if no interrupt has occurred. Even waiting
> for a few milliseconds after the interrupt doesn't help.
>
> Work around this apparent erratum by instead checking the TXMTY bit in
> the STATF register ("TX FIFO empty"). We know that we've queued up a
> packet for transmission if priv->tx_len is nonzero. If the TX FIFO is
> empty, transmission of that packet must have completed.
>
> Note that this is congruent with our handling of received packets, which
> likewise gleans from the STATF register whether a packet is waiting in
> the RX FIFO, instead of looking at the INTF register.
I should have mentioned, to verify the existence of the erratum and the
validity of the patch, you can apply the little debug patch below, then
run "cangen -g 0 -i -x". You should see error messages in dmesg within
a few minutes.
-- >8 --
diff --git a/drivers/net/can/spi/hi311x.c b/drivers/net/can/spi/hi311x.c
index 53e320c..48b5c63 100644
--- a/drivers/net/can/spi/hi311x.c
+++ b/drivers/net/can/spi/hi311x.c
@@ -739,6 +739,9 @@ static irqreturn_t hi3110_can_ist(int irq, void *dev_id)
}
if (priv->tx_len && statf & HI3110_STAT_TXMTY) {
+ if (!(intf & HI3110_INT_TXCPLT))
+ dev_err(&spi->dev, "TX FIFO empty and TX was queued but TXCPLT not set\n");
+
net->stats.tx_packets++;
net->stats.tx_bytes += priv->tx_len - 1;
can_led_event(net, CAN_LED_EVENT_TX);
^ permalink raw reply related
* Re: [PATCH net-next] tcp: Add mark for TIMEWAIT sockets
From: Jonathan Maxwell @ 2018-05-10 4:23 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, kuznet, yoshfuji, netdev, linux-kernel, Jon Maxwell
In-Reply-To: <e408e7b1-cb92-fb98-d9ec-256ba18d6ca7@gmail.com>
On Thu, May 10, 2018 at 1:32 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 05/09/2018 07:07 PM, Jon Maxwell wrote:
>> Aidan McGurn from Openwave Mobility systems reported the following bug:
>>
>> "Marked routing is broken on customer deployment. Its effects are large
>> increase in Uplink retransmissions caused by the client never receiving
>> the final ACK to their FINACK - this ACK misses the mark and routes out
>> of the incorrect route."
>>
>> Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
>> sysctl is enabled. But not for TIME_WAIT sockets where the original socket had
>> sk->sk_mark set via setsockopt(SO_MARK..).
>>
>> Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the
>> original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
>> Then copy this into ctl_sk->sk_mark so that the skb gets sent with the correct
>> mark. Do the same for resets. Give the "fwmark_reflect" sysctl precedence over
>> sk->sk_mark so that netfilter rules are still honored.
>>
>> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
>> ---
>> include/net/inet_timewait_sock.h | 1 +
>> net/ipv4/ip_output.c | 3 ++-
>> net/ipv4/tcp_ipv4.c | 18 ++++++++++++++++--
>> net/ipv4/tcp_minisocks.c | 1 +
>> net/ipv6/tcp_ipv6.c | 8 +++++++-
>> 5 files changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
>> index c7be1ca8e562..659d8ed5a3bc 100644
>> --- a/include/net/inet_timewait_sock.h
>> +++ b/include/net/inet_timewait_sock.h
>> @@ -62,6 +62,7 @@ struct inet_timewait_sock {
>> #define tw_dr __tw_common.skc_tw_dr
>>
>> int tw_timeout;
>> + __u32 tw_mark;
>> volatile unsigned char tw_substate;
>> unsigned char tw_rcv_wscale;
>>
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index 95adb171f852..cca4412dc4cb 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -1539,6 +1539,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
>> struct sk_buff *nskb;
>> int err;
>> int oif;
>> + __u32 mark = IP4_REPLY_MARK(net, skb->mark);
>>
>> if (__ip_options_echo(net, &replyopts.opt.opt, skb, sopt))
>> return;
>> @@ -1561,7 +1562,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
>> oif = skb->skb_iif;
>>
>> flowi4_init_output(&fl4, oif,
>> - IP4_REPLY_MARK(net, skb->mark),
>> + mark ? (mark) : sk->sk_mark,
>
> You can avoid the declaration of mark variable and simply use here :
>
> IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,
>
Thanks for the advice and suggestions Eric. That is more elegant. Will do in v1.
>> RT_TOS(arg->tos),
>> RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
>> ip_reply_arg_flowi_flags(arg),
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index f70586b50838..fbee36579c83 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
>> struct sock *sk1 = NULL;
>> #endif
>> struct net *net;
>> + struct sock *ctl_sk;
>>
>> /* Never send a reset in response to a reset. */
>> if (th->rst)
>> @@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
>> arg.tos = ip_hdr(skb)->tos;
>> arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
>> local_bh_disable();
>> - ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
>> + ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
>> + if (sk && sk->sk_state == TCP_TIME_WAIT)
>> + ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
>> + else if (sk && sk_fullsock(sk))
>> + ctl_sk->sk_mark = sk->sk_mark;
>> + ip_send_unicast_reply(ctl_sk,
>> skb, &TCP_SKB_CB(skb)->header.h4.opt,
>> ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>> &arg, arg.iov[0].iov_len);
>>
>> + ctl_sk->sk_mark = 0;
>> __TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
>> __TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
>> local_bh_enable();
>> @@ -759,6 +766,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
>> } rep;
>> struct net *net = sock_net(sk);
>> struct ip_reply_arg arg;
>> + struct sock *ctl_sk;
>>
>> memset(&rep.th, 0, sizeof(struct tcphdr));
>> memset(&arg, 0, sizeof(arg));
>> @@ -809,11 +817,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
>> arg.tos = tos;
>> arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
>> local_bh_disable();
>> - ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
>> + ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
>> + if (sk && sk->sk_state == TCP_TIME_WAIT)
>> + ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
>> + else if (sk && sk_fullsock(sk))
>> + ctl_sk->sk_mark = sk->sk_mark;
>> + ip_send_unicast_reply(ctl_sk,
>> skb, &TCP_SKB_CB(skb)->header.h4.opt,
>> ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>> &arg, arg.iov[0].iov_len);
>>
>> + ctl_sk->sk_mark = 0;
>> __TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
>> local_bh_enable();
>> }
>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
>> index 57b5468b5139..f867658b4b30 100644
>> --- a/net/ipv4/tcp_minisocks.c
>> +++ b/net/ipv4/tcp_minisocks.c
>> @@ -263,6 +263,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
>> struct inet_sock *inet = inet_sk(sk);
>>
>> tw->tw_transparent = inet->transparent;
>> + tw->tw_mark = sk->sk_mark;
>> tw->tw_rcv_wscale = tp->rx_opt.rcv_wscale;
>> tcptw->tw_rcv_nxt = tp->rcv_nxt;
>> tcptw->tw_snd_nxt = tp->snd_nxt;
>> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
>> index 6d664d83cd16..a6f876125091 100644
>> --- a/net/ipv6/tcp_ipv6.c
>> +++ b/net/ipv6/tcp_ipv6.c
>> @@ -803,6 +803,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
>> unsigned int tot_len = sizeof(struct tcphdr);
>> struct dst_entry *dst;
>> __be32 *topt;
>> + __u32 mark = IP6_REPLY_MARK(net, skb->mark);
>>
>> if (tsecr)
>> tot_len += TCPOLEN_TSTAMP_ALIGNED;
>> @@ -871,11 +872,16 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
>> fl6.flowi6_oif = oif;
>> }
>>
>> - fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark);
>> + if (sk && sk->sk_state == TCP_TIME_WAIT)
>> + ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
>> + else if (sk && sk_fullsock(sk))
>> + ctl_sk->sk_mark = sk->sk_mark;
>
> Unfortunately IPv6 has a single net->ipv6.tcp_sk, shared by all cpus.
>
> So writing ctl_sk->sk_mark is racy on SMP hosts.
>
> I would suggest using a local variable, and not touch ctl_sk->sk_mark
>
Sure I'll use a local variable for that in IPv6 instead and post in v1 as well.
> For consistency, you could do the same for IPv4, even if IPv4 currently uses per-cpu sockets
>
If it's okay I'll stick to ctl_sk->sk_mark for IPv4 as its pulled out
of the ctl_sk again in
ip_send_unicast_reply() and that will avoid having to add another argument.
Regards
Jon
>
>> + fl6.flowi6_mark = mark ? (mark) : ctl_sk->sk_mark;
>> fl6.fl6_dport = t1->dest;
>> fl6.fl6_sport = t1->source;
>> fl6.flowi6_uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
>> security_skb_classify_flow(skb, flowi6_to_flowi(&fl6));
>> + ctl_sk->sk_mark = 0;
>>
>> /* Pass a socket to ip6_dst_lookup either it is for RST
>> * Underlying function will use this to retrieve the network
>>
^ permalink raw reply
* Re: [PATCH v1 iproute2-next 2/3] rdma: print driver resource attributes
From: David Ahern @ 2018-05-10 4:08 UTC (permalink / raw)
To: Steve Wise, leon; +Cc: stephen, netdev, linux-rdma
In-Reply-To: <1a0d146dffb17449aa6d8a6b6d06e865e69226de.1525709213.git.swise@opengridcomputing.com>
On 5/7/18 9:53 AM, Steve Wise wrote:
> @@ -152,7 +153,10 @@ int main(int argc, char **argv)
> pretty_output = true;
> break;
> case 'd':
> - show_details = true;
> + if (show_details)
> + show_driver_details = true;
> + else
> + show_details = true;
> break;
> case 'j':
> json_output = true;
The above change should be reflected in the man page.
Also, the set needs to be respun after I merged master where Stephen
brought in updates to the uapi files.
^ permalink raw reply
* Re: [iproute2-next 1/1] tipc: fixed node and name table listings
From: David Ahern @ 2018-05-10 3:57 UTC (permalink / raw)
To: Jon Maloy, stephen, netdev
Cc: mohan.krishna.ghanta.krishnamurthy, tung.q.nguyen, hoang.h.le,
canh.d.luu, ying.xue, tipc-discussion
In-Reply-To: <1525710766-26122-1-git-send-email-jon.maloy@ericsson.com>
On 5/7/18 10:32 AM, Jon Maloy wrote:
> We make it easier for users to correlate between 128-bit node
> identities and 32-bit node hash by extending the 'node list'
> command to also show the hash value.
>
> We also improve the 'nametable show' command to show the node identity
> instead of the node hash value. Since the former potentially is much
> longer than the latter, we make room for it by eliminating the (to the
> user) irrelevant publication key. We also reorder some of the columns
> so that the node id comes last, since this looks nicer and more logical.
>
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> ---
> tipc/misc.c | 18 ++++++++++++++++++
> tipc/misc.h | 1 +
> tipc/nametable.c | 18 ++++++++++--------
> tipc/node.c | 19 ++++++++-----------
> tipc/peer.c | 4 ++++
> 5 files changed, 41 insertions(+), 19 deletions(-)
>
Hi Jon: I see new warnings with Debian stretch gcc:
tipc
CC misc.o
CC nametable.o
CC node.o
CC peer.o
misc.c: In function ‘hash2nodestr’:
misc.c:128:14: warning: pointer targets in passing argument 1 of
‘nodeid2str’ differ in signedness [-Wpointer-sign]
nodeid2str(nr.node_id, str);
^~
misc.c:100:6: note: expected ‘uint8_t * {aka unsigned char *}’ but
argument is of type ‘char *’
void nodeid2str(uint8_t *id, char *str)
^~~~~~~~~~
^ permalink raw reply
* Re: [iproute-next v2] tipc: Add support to set and get MTU for UDP bearer
From: David Ahern @ 2018-05-10 3:55 UTC (permalink / raw)
To: GhantaKrishnamurthy MohanKrishna, tipc-discussion, jon.maloy,
maloy, ying.xue, netdev, stephen
In-Reply-To: <1525780528-31315-1-git-send-email-mohan.krishna.ghanta.krishnamurthy@ericsson.com>
On 5/8/18 5:55 AM, GhantaKrishnamurthy MohanKrishna wrote:
> In this commit we introduce the ability to set and get
> MTU for UDP media and bearer.
>
> For set and get properties such as tolerance, window and priority,
> we already do:
>
> $ tipc media set PPROPERTY media MEDIA
> $ tipc media get PPROPERTY media MEDIA
>
> $ tipc bearer set OPTION media MEDIA ARGS
> $ tipc bearer get [OPTION] media MEDIA ARGS
>
> The same has been extended for MTU, with an exception to support
> only media type UDP.
>
> Acked-by: Jon Maloy <jon.maloy@ericsson.com>
> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
> ---
> tipc/bearer.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++---------
> tipc/media.c | 24 ++++++++++++++++++++++--
> 2 files changed, 68 insertions(+), 11 deletions(-)
>
applied to iproute2-next.
^ permalink raw reply
* [bpf-next v3 9/9] samples/bpf: Add example of ipv4 and ipv6 forwarding in XDP
From: David Ahern @ 2018-05-10 3:34 UTC (permalink / raw)
To: netdev, borkmann, ast
Cc: davem, shm, roopa, brouer, toke, john.fastabend, David Ahern
In-Reply-To: <20180510033427.20756-1-dsahern@gmail.com>
Simple example of fast-path forwarding. It has a serious flaw
in not verifying the egress device index supports XDP forwarding.
If the egress device does not packets are dropped.
Take this only as a simple example of fast-path forwarding.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
samples/bpf/Makefile | 4 +
samples/bpf/xdp_fwd_kern.c | 115 +++++++++++++++++++++++++
samples/bpf/xdp_fwd_user.c | 136 ++++++++++++++++++++++++++++++
tools/testing/selftests/bpf/bpf_helpers.h | 3 +
4 files changed, 258 insertions(+)
create mode 100644 samples/bpf/xdp_fwd_kern.c
create mode 100644 samples/bpf/xdp_fwd_user.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 8e0c7fb6d7cc..28513d6be1bf 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -46,6 +46,7 @@ hostprogs-y += syscall_tp
hostprogs-y += cpustat
hostprogs-y += xdp_adjust_tail
hostprogs-y += xdpsock
+hostprogs-y += xdp_fwd
# Libbpf dependencies
LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -100,6 +101,7 @@ syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
+xdp_fwd-objs := bpf_load.o $(LIBBPF) xdp_fwd_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -154,6 +156,7 @@ always += syscall_tp_kern.o
always += cpustat_kern.o
always += xdp_adjust_tail_kern.o
always += xdpsock_kern.o
+always += xdp_fwd_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -201,6 +204,7 @@ HOSTLOADLIBES_syscall_tp += -lelf
HOSTLOADLIBES_cpustat += -lelf
HOSTLOADLIBES_xdp_adjust_tail += -lelf
HOSTLOADLIBES_xdpsock += -lelf -pthread
+HOSTLOADLIBES_xdp_fwd += -lelf
# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
# make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
new file mode 100644
index 000000000000..cdf4fc383cc9
--- /dev/null
+++ b/samples/bpf/xdp_fwd_kern.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2017-18 David Ahern <dsahern@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+
+#include "bpf_helpers.h"
+
+#define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFFFFFF)
+
+struct bpf_map_def SEC("maps") tx_port = {
+ .type = BPF_MAP_TYPE_DEVMAP,
+ .key_size = sizeof(int),
+ .value_size = sizeof(int),
+ .max_entries = 64,
+};
+
+static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
+{
+ void *data_end = (void *)(long)ctx->data_end;
+ void *data = (void *)(long)ctx->data;
+ struct bpf_fib_lookup fib_params;
+ struct ethhdr *eth = data;
+ int out_index;
+ u16 h_proto;
+ u64 nh_off;
+
+ nh_off = sizeof(*eth);
+ if (data + nh_off > data_end)
+ return XDP_DROP;
+
+ __builtin_memset(&fib_params, 0, sizeof(fib_params));
+
+ h_proto = eth->h_proto;
+ if (h_proto == htons(ETH_P_IP)) {
+ struct iphdr *iph = data + nh_off;
+
+ if (iph + 1 > data_end)
+ return XDP_DROP;
+
+ fib_params.family = AF_INET;
+ fib_params.tos = iph->tos;
+ fib_params.l4_protocol = iph->protocol;
+ fib_params.sport = 0;
+ fib_params.dport = 0;
+ fib_params.tot_len = ntohs(iph->tot_len);
+ fib_params.ipv4_src = iph->saddr;
+ fib_params.ipv4_dst = iph->daddr;
+ } else if (h_proto == htons(ETH_P_IPV6)) {
+ struct in6_addr *src = (struct in6_addr *) fib_params.ipv6_src;
+ struct in6_addr *dst = (struct in6_addr *) fib_params.ipv6_dst;
+ struct ipv6hdr *iph = data + nh_off;
+
+ if (iph + 1 > data_end)
+ return XDP_DROP;
+
+ fib_params.family = AF_INET6;
+ fib_params.flowlabel = *(__be32 *)iph & IPV6_FLOWINFO_MASK;
+ fib_params.l4_protocol = iph->nexthdr;
+ fib_params.sport = 0;
+ fib_params.dport = 0;
+ fib_params.tot_len = ntohs(iph->payload_len);
+ *src = iph->saddr;
+ *dst = iph->daddr;
+ } else {
+ return XDP_PASS;
+ }
+
+ fib_params.ifindex = ctx->ingress_ifindex;
+
+ out_index = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
+
+ /* verify egress index has xdp support
+ * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
+ * cannot pass map_type 14 into func bpf_map_lookup_elem#1:
+ * NOTE: without verification that egress index supports XDP
+ * forwarding packets are dropped.
+ */
+ if (out_index > 0) {
+ memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
+ memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
+ return bpf_redirect_map(&tx_port, out_index, 0);
+ }
+
+ return XDP_PASS;
+}
+
+SEC("xdp_fwd")
+int xdp_fwd_prog(struct xdp_md *ctx)
+{
+ return xdp_fwd_flags(ctx, 0);
+}
+
+SEC("xdp_fwd_direct")
+int xdp_fwd_direct_prog(struct xdp_md *ctx)
+{
+ return xdp_fwd_flags(ctx, BPF_FIB_LOOKUP_DIRECT);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c
new file mode 100644
index 000000000000..9c6606f57126
--- /dev/null
+++ b/samples/bpf/xdp_fwd_user.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2017-18 David Ahern <dsahern@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/limits.h>
+#include <net/if.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <libgen.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+
+static int do_attach(int idx, int fd, const char *name)
+{
+ int err;
+
+ err = bpf_set_link_xdp_fd(idx, fd, 0);
+ if (err < 0)
+ printf("ERROR: failed to attach program to %s\n", name);
+
+ return err;
+}
+
+static int do_detach(int idx, const char *name)
+{
+ int err;
+
+ err = bpf_set_link_xdp_fd(idx, -1, 0);
+ if (err < 0)
+ printf("ERROR: failed to detach program from %s\n", name);
+
+ return err;
+}
+
+static void usage(const char *prog)
+{
+ fprintf(stderr,
+ "usage: %s [OPTS] interface-list\n"
+ "\nOPTS:\n"
+ " -d detach program\n"
+ " -D direct table lookups (skip fib rules)\n",
+ prog);
+}
+
+int main(int argc, char **argv)
+{
+ char filename[PATH_MAX];
+ int opt, i, idx, err;
+ int prog_id = 0;
+ int attach = 1;
+ int ret = 0;
+
+ while ((opt = getopt(argc, argv, ":dD")) != -1) {
+ switch (opt) {
+ case 'd':
+ attach = 0;
+ break;
+ case 'D':
+ prog_id = 1;
+ break;
+ default:
+ usage(basename(argv[0]));
+ return 1;
+ }
+ }
+
+ if (optind == argc) {
+ usage(basename(argv[0]));
+ return 1;
+ }
+
+ if (attach) {
+ snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+ if (access(filename, O_RDONLY) < 0) {
+ printf("error accessing file %s: %s\n",
+ filename, strerror(errno));
+ return 1;
+ }
+
+ if (load_bpf_file(filename)) {
+ printf("%s", bpf_log_buf);
+ return 1;
+ }
+
+ if (!prog_fd[prog_id]) {
+ printf("load_bpf_file: %s\n", strerror(errno));
+ return 1;
+ }
+ }
+ if (attach) {
+ for (i = 1; i < 64; ++i)
+ bpf_map_update_elem(map_fd[0], &i, &i, 0);
+ }
+
+ for (i = optind; i < argc; ++i) {
+ idx = if_nametoindex(argv[i]);
+ if (!idx)
+ idx = strtoul(argv[i], NULL, 0);
+
+ if (!idx) {
+ fprintf(stderr, "Invalid arg\n");
+ return 1;
+ }
+ if (!attach) {
+ err = do_detach(idx, argv[i]);
+ if (err)
+ ret = err;
+ } else {
+ err = do_attach(idx, prog_fd[prog_id], argv[i]);
+ if (err)
+ ret = err;
+ }
+ }
+
+ return ret;
+}
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 265f8e0e8ada..2375d06c706b 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -103,6 +103,9 @@ static int (*bpf_skb_get_xfrm_state)(void *ctx, int index, void *state,
(void *) BPF_FUNC_skb_get_xfrm_state;
static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
(void *) BPF_FUNC_get_stack;
+static int (*bpf_fib_lookup)(void *ctx, struct bpf_fib_lookup *params,
+ int plen, __u32 flags) =
+ (void *) BPF_FUNC_fib_lookup;
/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
--
2.11.0
^ permalink raw reply related
* [bpf-next v3 8/9] bpf: Provide helper to do forwarding lookups in kernel FIB table
From: David Ahern @ 2018-05-10 3:34 UTC (permalink / raw)
To: netdev, borkmann, ast
Cc: davem, shm, roopa, brouer, toke, john.fastabend, David Ahern
In-Reply-To: <20180510033427.20756-1-dsahern@gmail.com>
Provide a helper for doing a FIB and neighbor lookup in the kernel
tables from an XDP program. The helper provides a fastpath for forwarding
packets. If the packet is a local delivery or for any reason is not a
simple lookup and forward, the packet continues up the stack.
If it is to be forwarded, the forwarding can be done directly if the
neighbor is already known. If the neighbor does not exist, the first
few packets go up the stack for neighbor resolution. Once resolved, the
xdp program provides the fast path.
On successful lookup the nexthop dmac, current device smac and egress
device index are returned.
The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
are implemented in this patch. The API includes layer 4 parameters if
the XDP program chooses to do deep packet inspection to allow compare
against ACLs implemented as FIB rules.
Header rewrite is left to the XDP program.
The lookup takes 2 flags:
- BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
straight to the table associated with the device (expert setting for
those looking to maximize throughput)
- BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
Default is an ingress lookup.
Initial performance numbers collected by Jesper, forwarded packets/sec:
Full stack XDP FIB lookup XDP Direct lookup
IPv4 1,947,969 7,074,156 7,415,333
IPv6 1,728,000 6,165,504 7,262,720
These number are single CPU core forwarding on a Broadwell
E5-1650 v4 @ 3.60GHz.
Signed-off-by: David Ahern <dsahern@gmail.com>
---
include/uapi/linux/bpf.h | 81 +++++++++++++-
net/core/filter.c | 267 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 347 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d615c777b573..02e4112510f8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1828,6 +1828,33 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ *
+ * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
+ * Description
+ * Do FIB lookup in kernel tables using parameters in *params*.
+ * If lookup is successful and result shows packet is to be
+ * forwarded, the neighbor tables are searched for the nexthop.
+ * If successful (ie., FIB lookup shows forwarding and nexthop
+ * is resolved), the nexthop address is returned in ipv4_dst,
+ * ipv6_dst or mpls_out based on family, smac is set to mac
+ * address of egress device, dmac is set to nexthop mac address,
+ * rt_metric is set to metric from route.
+ *
+ * *plen* argument is the size of the passed in struct.
+ * *flags* argument can be one or more BPF_FIB_LOOKUP_ flags:
+ *
+ * **BPF_FIB_LOOKUP_DIRECT** means do a direct table lookup vs
+ * full lookup using FIB rules
+ * **BPF_FIB_LOOKUP_OUTPUT** means do lookup from an egress
+ * perspective (default is ingress)
+ *
+ * *ctx* is either **struct xdp_md** for XDP programs or
+ * **struct sk_buff** tc cls_act programs.
+ *
+ * Return
+ * Egress device index on success, 0 if packet needs to continue
+ * up the stack for further processing or a negative error in case
+ * of failure.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -1898,7 +1925,8 @@ union bpf_attr {
FN(xdp_adjust_tail), \
FN(skb_get_xfrm_state), \
FN(get_stack), \
- FN(skb_load_bytes_relative),
+ FN(skb_load_bytes_relative), \
+ FN(fib_lookup),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -2321,4 +2349,55 @@ struct bpf_raw_tracepoint_args {
__u64 args[0];
};
+/* DIRECT: Skip the FIB rules and go to FIB table associated with device
+ * OUTPUT: Do lookup from egress perspective; default is ingress
+ */
+#define BPF_FIB_LOOKUP_DIRECT BIT(0)
+#define BPF_FIB_LOOKUP_OUTPUT BIT(1)
+
+struct bpf_fib_lookup {
+ /* input */
+ __u8 family; /* network family, AF_INET, AF_INET6, AF_MPLS */
+
+ /* set if lookup is to consider L4 data - e.g., FIB rules */
+ __u8 l4_protocol;
+ __be16 sport;
+ __be16 dport;
+
+ /* total length of packet from network header - used for MTU check */
+ __u16 tot_len;
+ __u32 ifindex; /* L3 device index for lookup */
+
+ union {
+ /* inputs to lookup */
+ __u8 tos; /* AF_INET */
+ __be32 flowlabel; /* AF_INET6 */
+
+ /* output: metric of fib result */
+ __u32 rt_metric;
+ };
+
+ union {
+ __be32 mpls_in;
+ __be32 ipv4_src;
+ __u32 ipv6_src[4]; /* in6_addr; network order */
+ };
+
+ /* input to bpf_fib_lookup, *dst is destination address.
+ * output: bpf_fib_lookup sets to gateway address
+ */
+ union {
+ /* return for MPLS lookups */
+ __be32 mpls_out[4]; /* support up to 4 labels */
+ __be32 ipv4_dst;
+ __u32 ipv6_dst[4]; /* in6_addr; network order */
+ };
+
+ /* output */
+ __be16 h_vlan_proto;
+ __be16 h_vlan_TCI;
+ __u8 smac[6]; /* ETH_ALEN */
+ __u8 dmac[6]; /* ETH_ALEN */
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index 0baa715e4699..ca60d2872da4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -60,6 +60,10 @@
#include <net/xfrm.h>
#include <linux/bpf_trace.h>
#include <net/xdp_sock.h>
+#include <linux/inetdevice.h>
+#include <net/ip_fib.h>
+#include <net/flow.h>
+#include <net/arp.h>
/**
* sk_filter_trim_cap - run a packet through a socket filter
@@ -4032,6 +4036,265 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
};
#endif
+#if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
+static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
+ const struct neighbour *neigh,
+ const struct net_device *dev)
+{
+ memcpy(params->dmac, neigh->ha, ETH_ALEN);
+ memcpy(params->smac, dev->dev_addr, ETH_ALEN);
+ params->h_vlan_TCI = 0;
+ params->h_vlan_proto = 0;
+
+ return dev->ifindex;
+}
+#endif
+
+#if IS_ENABLED(CONFIG_INET)
+static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
+ u32 flags)
+{
+ struct in_device *in_dev;
+ struct neighbour *neigh;
+ struct net_device *dev;
+ struct fib_result res;
+ struct fib_nh *nh;
+ struct flowi4 fl4;
+ int err;
+
+ dev = dev_get_by_index_rcu(net, params->ifindex);
+ if (unlikely(!dev))
+ return -ENODEV;
+
+ /* verify forwarding is enabled on this interface */
+ in_dev = __in_dev_get_rcu(dev);
+ if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
+ return 0;
+
+ if (flags & BPF_FIB_LOOKUP_OUTPUT) {
+ fl4.flowi4_iif = 1;
+ fl4.flowi4_oif = params->ifindex;
+ } else {
+ fl4.flowi4_iif = params->ifindex;
+ fl4.flowi4_oif = 0;
+ }
+ fl4.flowi4_tos = params->tos & IPTOS_RT_MASK;
+ fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
+ fl4.flowi4_flags = 0;
+
+ fl4.flowi4_proto = params->l4_protocol;
+ fl4.daddr = params->ipv4_dst;
+ fl4.saddr = params->ipv4_src;
+ fl4.fl4_sport = params->sport;
+ fl4.fl4_dport = params->dport;
+
+ if (flags & BPF_FIB_LOOKUP_DIRECT) {
+ u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
+ struct fib_table *tb;
+
+ tb = fib_get_table(net, tbid);
+ if (unlikely(!tb))
+ return 0;
+
+ err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
+ } else {
+ fl4.flowi4_mark = 0;
+ fl4.flowi4_secid = 0;
+ fl4.flowi4_tun_key.tun_id = 0;
+ fl4.flowi4_uid = sock_net_uid(net, NULL);
+
+ err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
+ }
+
+ if (err || res.type != RTN_UNICAST)
+ return 0;
+
+ if (res.fi->fib_nhs > 1)
+ fib_select_path(net, &res, &fl4, NULL);
+
+ nh = &res.fi->fib_nh[res.nh_sel];
+
+ /* do not handle lwt encaps right now */
+ if (nh->nh_lwtstate)
+ return 0;
+
+ dev = nh->nh_dev;
+ if (unlikely(!dev))
+ return 0;
+
+ if (nh->nh_gw)
+ params->ipv4_dst = nh->nh_gw;
+
+ params->rt_metric = res.fi->fib_priority;
+
+ /* xdp and cls_bpf programs are run in RCU-bh so
+ * rcu_read_lock_bh is not needed here
+ */
+ neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst);
+ if (neigh)
+ return bpf_fib_set_fwd_params(params, neigh, dev);
+
+ return 0;
+}
+#endif
+
+#if IS_ENABLED(CONFIG_IPV6)
+static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
+ u32 flags)
+{
+ struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
+ struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
+ struct neighbour *neigh;
+ struct net_device *dev;
+ struct inet6_dev *idev;
+ struct fib6_info *f6i;
+ struct flowi6 fl6;
+ int strict = 0;
+ int oif;
+
+ /* link local addresses are never forwarded */
+ if (rt6_need_strict(dst) || rt6_need_strict(src))
+ return 0;
+
+ dev = dev_get_by_index_rcu(net, params->ifindex);
+ if (unlikely(!dev))
+ return -ENODEV;
+
+ idev = __in6_dev_get_safely(dev);
+ if (unlikely(!idev || !net->ipv6.devconf_all->forwarding))
+ return 0;
+
+ if (flags & BPF_FIB_LOOKUP_OUTPUT) {
+ fl6.flowi6_iif = 1;
+ oif = fl6.flowi6_oif = params->ifindex;
+ } else {
+ oif = fl6.flowi6_iif = params->ifindex;
+ fl6.flowi6_oif = 0;
+ strict = RT6_LOOKUP_F_HAS_SADDR;
+ }
+ fl6.flowlabel = params->flowlabel;
+ fl6.flowi6_scope = 0;
+ fl6.flowi6_flags = 0;
+ fl6.mp_hash = 0;
+
+ fl6.flowi6_proto = params->l4_protocol;
+ fl6.daddr = *dst;
+ fl6.saddr = *src;
+ fl6.fl6_sport = params->sport;
+ fl6.fl6_dport = params->dport;
+
+ if (flags & BPF_FIB_LOOKUP_DIRECT) {
+ u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
+ struct fib6_table *tb;
+
+ tb = ipv6_stub->fib6_get_table(net, tbid);
+ if (unlikely(!tb))
+ return 0;
+
+ f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
+ } else {
+ fl6.flowi6_mark = 0;
+ fl6.flowi6_secid = 0;
+ fl6.flowi6_tun_key.tun_id = 0;
+ fl6.flowi6_uid = sock_net_uid(net, NULL);
+
+ f6i = ipv6_stub->fib6_lookup(net, oif, &fl6, strict);
+ }
+
+ if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
+ return 0;
+
+ if (unlikely(f6i->fib6_flags & RTF_REJECT ||
+ f6i->fib6_type != RTN_UNICAST))
+ return 0;
+
+ if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
+ f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
+ fl6.flowi6_oif, NULL,
+ strict);
+
+ if (f6i->fib6_nh.nh_lwtstate)
+ return 0;
+
+ if (f6i->fib6_flags & RTF_GATEWAY)
+ *dst = f6i->fib6_nh.nh_gw;
+
+ dev = f6i->fib6_nh.nh_dev;
+ params->rt_metric = f6i->fib6_metric;
+
+ /* xdp and cls_bpf programs are run in RCU-bh so rcu_read_lock_bh is
+ * not needed here. Can not use __ipv6_neigh_lookup_noref here
+ * because we need to get nd_tbl via the stub
+ */
+ neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
+ ndisc_hashfn, dst, dev);
+ if (neigh)
+ return bpf_fib_set_fwd_params(params, neigh, dev);
+
+ return 0;
+}
+#endif
+
+BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
+ struct bpf_fib_lookup *, params, int, plen, u32, flags)
+{
+ if (plen < sizeof(*params))
+ return -EINVAL;
+
+ switch (params->family) {
+#if IS_ENABLED(CONFIG_INET)
+ case AF_INET:
+ return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
+ flags);
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+ case AF_INET6:
+ return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
+ flags);
+#endif
+ }
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
+ .func = bpf_xdp_fib_lookup,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_PTR_TO_MEM,
+ .arg3_type = ARG_CONST_SIZE,
+ .arg4_type = ARG_ANYTHING,
+};
+
+BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
+ struct bpf_fib_lookup *, params, int, plen, u32, flags)
+{
+ if (plen < sizeof(*params))
+ return -EINVAL;
+
+ switch (params->family) {
+#if IS_ENABLED(CONFIG_INET)
+ case AF_INET:
+ return bpf_ipv4_fib_lookup(dev_net(skb->dev), params, flags);
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+ case AF_INET6:
+ return bpf_ipv6_fib_lookup(dev_net(skb->dev), params, flags);
+#endif
+ }
+ return -ENOTSUPP;
+}
+
+static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
+ .func = bpf_skb_fib_lookup,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_PTR_TO_MEM,
+ .arg3_type = ARG_CONST_SIZE,
+ .arg4_type = ARG_ANYTHING,
+};
+
static const struct bpf_func_proto *
bpf_base_func_proto(enum bpf_func_id func_id)
{
@@ -4181,6 +4444,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_skb_get_xfrm_state:
return &bpf_skb_get_xfrm_state_proto;
#endif
+ case BPF_FUNC_fib_lookup:
+ return &bpf_skb_fib_lookup_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -4206,6 +4471,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_xdp_redirect_map_proto;
case BPF_FUNC_xdp_adjust_tail:
return &bpf_xdp_adjust_tail_proto;
+ case BPF_FUNC_fib_lookup:
+ return &bpf_xdp_fib_lookup_proto;
default:
return bpf_base_func_proto(func_id);
}
--
2.11.0
^ permalink raw reply related
* [bpf-next v3 7/9] net/ipv6: Add fib lookup stubs for use in bpf helper
From: David Ahern @ 2018-05-10 3:34 UTC (permalink / raw)
To: netdev, borkmann, ast
Cc: davem, shm, roopa, brouer, toke, john.fastabend, David Ahern
In-Reply-To: <20180510033427.20756-1-dsahern@gmail.com>
Add stubs to retrieve a handle to an IPv6 FIB table, fib6_get_table,
a stub to do a lookup in a specific table, fib6_table_lookup, and
a stub for a full route lookup.
The stubs are needed for core bpf code to handle the case when the
IPv6 module is not builtin.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
---
include/net/addrconf.h | 14 ++++++++++++++
net/ipv6/addrconf_core.c | 33 ++++++++++++++++++++++++++++++++-
net/ipv6/af_inet6.c | 6 +++++-
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 8312cc25a3af..ff766ab207e0 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -223,6 +223,20 @@ struct ipv6_stub {
const struct in6_addr *addr);
int (*ipv6_dst_lookup)(struct net *net, struct sock *sk,
struct dst_entry **dst, struct flowi6 *fl6);
+
+ struct fib6_table *(*fib6_get_table)(struct net *net, u32 id);
+ struct fib6_info *(*fib6_lookup)(struct net *net, int oif,
+ struct flowi6 *fl6, int flags);
+ struct fib6_info *(*fib6_table_lookup)(struct net *net,
+ struct fib6_table *table,
+ int oif, struct flowi6 *fl6,
+ int flags);
+ struct fib6_info *(*fib6_multipath_select)(const struct net *net,
+ struct fib6_info *f6i,
+ struct flowi6 *fl6, int oif,
+ const struct sk_buff *skb,
+ int strict);
+
void (*udpv6_encap_enable)(void);
void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr *daddr,
const struct in6_addr *solicited_addr,
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 32b564dfd02a..2fe754fd4f5e 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -134,8 +134,39 @@ static int eafnosupport_ipv6_dst_lookup(struct net *net, struct sock *u1,
return -EAFNOSUPPORT;
}
+static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id)
+{
+ return NULL;
+}
+
+static struct fib6_info *
+eafnosupport_fib6_table_lookup(struct net *net, struct fib6_table *table,
+ int oif, struct flowi6 *fl6, int flags)
+{
+ return NULL;
+}
+
+static struct fib6_info *
+eafnosupport_fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
+ int flags)
+{
+ return NULL;
+}
+
+static struct fib6_info *
+eafnosupport_fib6_multipath_select(const struct net *net, struct fib6_info *f6i,
+ struct flowi6 *fl6, int oif,
+ const struct sk_buff *skb, int strict)
+{
+ return f6i;
+}
+
const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
- .ipv6_dst_lookup = eafnosupport_ipv6_dst_lookup,
+ .ipv6_dst_lookup = eafnosupport_ipv6_dst_lookup,
+ .fib6_get_table = eafnosupport_fib6_get_table,
+ .fib6_table_lookup = eafnosupport_fib6_table_lookup,
+ .fib6_lookup = eafnosupport_fib6_lookup,
+ .fib6_multipath_select = eafnosupport_fib6_multipath_select,
};
EXPORT_SYMBOL_GPL(ipv6_stub);
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d0af96e0d109..50de8b0d4f70 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -889,7 +889,11 @@ static struct pernet_operations inet6_net_ops = {
static const struct ipv6_stub ipv6_stub_impl = {
.ipv6_sock_mc_join = ipv6_sock_mc_join,
.ipv6_sock_mc_drop = ipv6_sock_mc_drop,
- .ipv6_dst_lookup = ip6_dst_lookup,
+ .ipv6_dst_lookup = ip6_dst_lookup,
+ .fib6_get_table = fib6_get_table,
+ .fib6_table_lookup = fib6_table_lookup,
+ .fib6_lookup = fib6_lookup,
+ .fib6_multipath_select = fib6_multipath_select,
.udpv6_encap_enable = udpv6_encap_enable,
.ndisc_send_na = ndisc_send_na,
.nd_tbl = &nd_tbl,
--
2.11.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox