Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] netfilter: x_tables: avoid stack-out-of-bounds read in xt_copy_counters_from_user
From: Pablo Neira Ayuso @ 2017-10-06 13:05 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Eric Dumazet, Jozsef Kadlecsik, netfilter-devel, netdev,
	Willem de Bruijn
In-Reply-To: <20171005095644.GB27522@breakpoint.cc>

On Thu, Oct 05, 2017 at 11:56:44AM +0200, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > syzkaller reports an out of bound read in strlcpy(), triggered
> > by xt_copy_counters_from_user()
> > 
> > Fix this by using memcpy(), then forcing a zero byte at the last position
> > of the destination, as Florian did for the non COMPAT code.
> > 
> > Fixes: d7591f0c41ce ("netfilter: x_tables: introduce and use xt_copy_counters_from_user")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > ---
> >  net/netfilter/x_tables.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index c83a3b5e1c6c2a91b713b6681a794bd79ab3fa08..d8571f4142080a3c121fc90f0b52d81ee9df6712 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -892,7 +892,7 @@ void *xt_copy_counters_from_user(const void __user *user, unsigned int len,
> >  		if (copy_from_user(&compat_tmp, user, sizeof(compat_tmp)) != 0)
> >  			return ERR_PTR(-EFAULT);
> >  
> > -		strlcpy(info->name, compat_tmp.name, sizeof(info->name));
> > +		memcpy(info->name, compat_tmp.name, sizeof(info->name) - 1);
> 
> Argh, right, compat_tmp.name might not be 0 terminated :-/
> 
> Acked-by: Florian Westphal <fw@strlen.de>

Applied to nf, thanks.

^ permalink raw reply

* [PATCH 4/4] tcp: avoid noref dst leak on input path
From: Paolo Abeni @ 2017-10-06 12:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul E. McKenney, Josh Triplett, Steven Rostedt, David S. Miller,
	Eric Dumazet, Hannes Frederic Sowa, netdev
In-Reply-To: <cover.1507294365.git.pabeni@redhat.com>

Enabling CONFIG_RCU_NOREF_DEBUG gives the following splat when
processing tcp packets:

           to-be-untracked noref entity ffff942cb71ea300 not found in cache
           ------------[ cut here ]------------
           WARNING: CPU: 24 PID: 178 at kernel/rcu/noref_debug.c:54 rcu_track_noref+0xa4/0xf0
           Modules linked in: intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd iTCO_wdt ipmi_ssif mei_me iTCO_vendor_support mei dcdbas lpc_ich ipmi_si mxm_wmi sg pcspkr ipmi_devintf ipmi_msghandler acpi_power_meter shpchp wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb drm ixgbe mdio crc32c_intel ahci ptp i2c_algo_bit libahci pps_core i2c_core libata dca dm_mirror dm_region_hash dm_log dm_mod
           CPU: 24 PID: 178 Comm: ksoftirqd/24 Not tainted 4.14.0-rc1.noref_route+ #1610
           Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
           task: ffff940e48300000 task.stack: ffffaec406a20000
           RIP: 0010:rcu_track_noref+0xa4/0xf0
           RSP: 0018:ffffaec406a238e0 EFLAGS: 00010246
           RAX: 0000000000000040 RBX: 0000000000000000 RCX: 0000000000000002
           RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000292
           RBP: ffffaec406a238e0 R08: 0000000000000000 R09: 0000000000000000
           R10: 0000000000000001 R11: 0000000000000003 R12: 0000000000000000
           R13: ffff942cb5110000 R14: 000000000000fe88 R15: ffff942cb1a20200
           FS:  0000000000000000(0000) GS:ffff942cbee00000(0000) knlGS:0000000000000000
           CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
           CR2: 00007febc072d140 CR3: 0000001feebd6002 CR4: 00000000003606e0
           DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
           DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
           Call Trace:
            tcp_data_queue+0x82a/0xce0
            tcp_rcv_established+0x283/0x570
            tcp_v4_do_rcv+0x102/0x1e0
            tcp_v4_rcv+0xa9e/0xc10
            ip_local_deliver_finish+0x128/0x380
            ? ip_local_deliver_finish+0x41/0x380
            ip_local_deliver+0x74/0x230
            ip_rcv_finish+0x105/0x5e0
            ip_rcv+0x2a7/0x540
            __netif_receive_skb_core+0x3b9/0xe10
            ? netif_receive_skb_internal+0x40/0x390
            __netif_receive_skb+0x18/0x60
            netif_receive_skb_internal+0x8d/0x390
            ? netif_receive_skb_internal+0x40/0x390
            napi_gro_complete+0x127/0x1d0
            ? napi_gro_complete+0x2a/0x1d0
            napi_gro_flush+0x5f/0x80
            napi_complete_done+0x96/0x100
            ixgbe_poll+0x5f8/0x7c0 [ixgbe]
            net_rx_action+0x27d/0x520
            __do_softirq+0xd1/0x4f5
            run_ksoftirqd+0x25/0x70
            smpboot_thread_fn+0x11a/0x1f0
            kthread+0x155/0x190
            ? sort_range+0x30/0x30
            ? kthread_create_on_node+0x70/0x70
            ret_from_fork+0x2a/0x40
           Code: e9 83 c2 01 48 83 c0 18 83 fa 07 75 ef 80 3d b0 e5 ff 00 00 75 d2 48 c7 c7 50 54 07 9d 31 c0 c6 05 9e e5 ff 00 01 e8 ef af fe ff <0f> ff 5d c3 80 3d 8f e5 ff 00 00 75 b0 48 c7 c7 00 54 07 9d 31

we must clear the skb dst before enqueuing the skb somewhere,
but currently the dst entry for packets delivered
via tcp_rcv_established() -> tcp_queue_rcv() is not cleared.

Fix it by adding the explicit drop in tcp_queue_rcv() and moving
the current skb_dst_drop() just before the other enqueuing
operation, do avoid unneeded double skb_dst_drop() for some
path.

The leak itself is not harmful, because the tcp recvmsg() code
should not access such info.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/tcp_input.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c5d7656beeee..bf4e17edfe7a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4422,6 +4422,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		return;
 	}
 
+	/* drop the -possibly noref - dst before delivery the skb to ofo tree */
+	skb_dst_drop(skb);
+
 	/* Stash tstamp to avoid being stomped on by rbnode */
 	if (TCP_SKB_CB(skb)->has_rxtstamp)
 		TCP_SKB_CB(skb)->swtstamp = skb->tstamp;
@@ -4560,6 +4563,7 @@ static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int
 				  skb, fragstolen)) ? 1 : 0;
 	tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
 	if (!eaten) {
+		skb_dst_drop(skb);
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
 		skb_set_owner_r(skb, sk);
 	}
@@ -4626,7 +4630,6 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		__kfree_skb(skb);
 		return;
 	}
-	skb_dst_drop(skb);
 	__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
 	tcp_ecn_accept_cwr(tp, skb);
-- 
2.13.6

^ permalink raw reply related

* [PATCH 3/4] ipv4: drop unneeded and misleading RCU lock in ip_route_input_noref()
From: Paolo Abeni @ 2017-10-06 12:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul E. McKenney, Josh Triplett, Steven Rostedt, David S. Miller,
	Eric Dumazet, Hannes Frederic Sowa, netdev
In-Reply-To: <cover.1507294365.git.pabeni@redhat.com>

Enabling CONFIG_RCU_NOREF_DEBUG gives the following splat on the
first ingress IPv4 packet:

       1 noref entities escaped an RCU section, nesting 259, leaked noref list  ffff8edcefb1dc00
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 0 at kernel/rcu/noref_debug.c:87 __rcu_check_noref+0xf8/0x100
       Modules linked in: intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd mei_me ipmi_ssif sg mei mxm_wmi iTCO_wdt iTCO_vendor_support dcdbas lpc_ich ipmi_si pcspkr ipmi_devintf ipmi_msghandler shpchp wmi acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb drm ixgbe mdio ahci ptp crc32c_intel i2c_algo_bit libahci pps_core i2c_core libata dca dm_mirror dm_region_hash dm_log dm_mod
       CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc1.noref_3+ #1609
       Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
       task: ffffffffae019500 task.stack: ffffffffae000000
       RIP: 0010:__rcu_check_noref+0x7b/0xd0
       RSP: 0018:ffff900afbe03b30 EFLAGS: 00010246
       RAX: 0000000000000034 RBX: ffff900afbfd2500 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000202
       RBP: ffff900afbe03b48 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 00000000c388883e R12: 0000000000000103
       R13: 000000001d24100a R14: ffff900af13e0000 R15: 0000000000000000
       FS:  0000000000000000(0000) GS:ffff900afbe00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00005589ce876398 CR3: 0000001ff52f9005 CR4: 00000000003606f0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        <IRQ>
        ip_route_input_noref+0xa6/0x150
        ? ip_route_input_noref+0x5/0x150
        ip_rcv_finish+0x78/0x5e0
        ip_rcv+0x2a7/0x540
        ? packet_rcv+0x52/0x450
        __netif_receive_skb_core+0x3b9/0xe10
        ? netif_receive_skb_internal+0x40/0x390
        __netif_receive_skb+0x18/0x60
        netif_receive_skb_internal+0x8d/0x390
        ? netif_receive_skb_internal+0x40/0x390
        napi_gro_receive+0x15c/0x1f0
        igb_clean_rx_irq+0x36d/0x7f0 [igb]
        igb_poll+0x303/0x780 [igb]
        ? save_stack_trace+0x1b/0x20
        ? __lock_acquire+0xcf2/0x11c0
        ? net_rx_action+0xb4/0x520
        net_rx_action+0x27d/0x520
        __do_softirq+0xd1/0x4f5
        irq_exit+0xfb/0x110
        do_IRQ+0x67/0x120
        common_interrupt+0xa7/0xa7
        </IRQ>
       RIP: 0010:cpuidle_enter_state+0xd0/0x360
       RSP: 0018:ffffffffae003df8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff79
       RAX: ffffffffae019500 RBX: ffffd7d67c604400 RCX: 0000000000000000
       RDX: ffffffffae019500 RSI: 0000000000000001 RDI: ffffffffae019500
       RBP: ffffffffae003e30 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000018 R12: 0000000000000004
       R13: 0000000000000000 R14: ffffd7d67c604400 R15: 00000009c8eafd50
        ? cpuidle_enter_state+0xc9/0x360
        cpuidle_enter+0x17/0x20
        call_cpuidle+0x23/0x40
        do_idle+0x183/0x200
        cpu_startup_entry+0x73/0x80
        rest_init+0xc3/0xd0
        start_kernel+0x4f7/0x518
        ? set_init_arg+0x5a/0x5a
        x86_64_start_reservations+0x24/0x26
        x86_64_start_kernel+0x6f/0x72
        secondary_startup_64+0xa5/0xa5
       Code: f6 75 07 5b 41 5c 41 5d 5d c3 80 3d eb e4 ff 00 00 75 1a 44 89 e2 48 c7 c7 88 54 e7 ad 31 c0 c6 05 d6 e4 ff 00 01 e8 28 af fe ff <0f> ff 41 bd 07 00 00 00 48 8b 33 48 85 f6 74 06 44 39 63 10 74

The rcu protection in ip_route_input_noref() is unneeded and
misleading: the caller still needs to acquire and retain the rcu
lock until the skb - carrying a noref dst on successful return -
is either dropped or the relevant dst is forced to a ref-counted
version.

This change just drops the unneeded lock.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/route.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 94d4cd2d5ea4..5a6ca1f16d3f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2069,14 +2069,9 @@ int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 			 u8 tos, struct net_device *dev)
 {
 	struct fib_result res;
-	int err;
 
 	tos &= IPTOS_RT_MASK;
-	rcu_read_lock();
-	err = ip_route_input_rcu(skb, daddr, saddr, tos, dev, &res);
-	rcu_read_unlock();
-
-	return err;
+	return ip_route_input_rcu(skb, daddr, saddr, tos, dev, &res);
 }
 EXPORT_SYMBOL(ip_route_input_noref);
 
-- 
2.13.6

^ permalink raw reply related

* [PATCH 2/4] net: use RCU noref infrastructure to track dst noref
From: Paolo Abeni @ 2017-10-06 12:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul E. McKenney, Josh Triplett, Steven Rostedt, David S. Miller,
	Eric Dumazet, Hannes Frederic Sowa, netdev
In-Reply-To: <cover.1507294365.git.pabeni@redhat.com>

We can now ensure noref dsts do no escape the relevant RCU
section. This does not introduce any functional changes, it adds
the relevant debug code enabled by CONFIG_RCU_NOREF_DEBUG.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h | 1 +
 include/net/dst.h      | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 72299ef00061..5df77c8a0319 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -905,6 +905,7 @@ static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
 static inline void skb_dst_set_noref(struct sk_buff *skb, struct dst_entry *dst)
 {
 	WARN_ON(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
+	rcu_track_noref(skb, dst, true);
 	skb->_skb_refdst = (unsigned long)dst | SKB_DST_NOREF;
 }
 
diff --git a/include/net/dst.h b/include/net/dst.h
index 06a6765da074..6bc6fefe178e 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -294,6 +294,8 @@ static inline void refdst_drop(unsigned long refdst)
 static inline void skb_dst_drop(struct sk_buff *skb)
 {
 	if (skb->_skb_refdst) {
+		if (skb->_skb_refdst & SKB_DST_NOREF)
+			rcu_track_noref(skb, skb_dst(skb), false);
 		refdst_drop(skb->_skb_refdst);
 		skb->_skb_refdst = 0UL;
 	}
@@ -304,6 +306,8 @@ static inline void __skb_dst_copy(struct sk_buff *nskb, unsigned long refdst)
 	nskb->_skb_refdst = refdst;
 	if (!(nskb->_skb_refdst & SKB_DST_NOREF))
 		dst_clone(skb_dst(nskb));
+	else
+		rcu_track_noref(nskb, skb_dst(nskb), true);
 }
 
 static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb)
@@ -335,6 +339,7 @@ static inline void skb_dst_force(struct sk_buff *skb)
 		struct dst_entry *dst = skb_dst(skb);
 
 		WARN_ON(!rcu_read_lock_held());
+		rcu_track_noref(skb, dst, false);
 		if (!dst_hold_safe(dst))
 			dst = NULL;
 
-- 
2.13.6

^ permalink raw reply related

* [PATCH 1/4] rcu: introduce noref debug
From: Paolo Abeni @ 2017-10-06 12:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul E. McKenney, Josh Triplett, Steven Rostedt, David S. Miller,
	Eric Dumazet, Hannes Frederic Sowa, netdev
In-Reply-To: <cover.1507294365.git.pabeni@redhat.com>

We currently lack a debugging infrastructure to ensure that
long-lived noref dst are properly handled - namely dropped
or converted to a refcounted version before escaping the current
RCU section.

This changeset implements such infra, tracking the noref pointer
on a per CPU store and checking that all noref tracked at any
given RCU nesting level are cleared before leaving such RCU
section.

Each noref entity is identified using the entity pointer and an
additional, optional, key/opaque pointer. This is needed to cope
with usage case scenario where the same noref entity is stored
in different places (e.g. noref dst on skb_clone()).

To keep the patch/implementation simple RCU_NOREF_DEBUG depends
on PREEMPT_RCU=n in Kconfig.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/rcupdate.h | 11 ++++++
 kernel/rcu/Kconfig.debug | 15 ++++++++
 kernel/rcu/Makefile      |  1 +
 kernel/rcu/noref_debug.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 116 insertions(+)
 create mode 100644 kernel/rcu/noref_debug.c

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index de50d8a4cf41..20c1ce08e3eb 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -60,6 +60,15 @@ void call_rcu_sched(struct rcu_head *head, rcu_callback_t func);
 void synchronize_sched(void);
 void rcu_barrier_tasks(void);
 
+#ifdef CONFIG_RCU_NOREF_DEBUG
+void rcu_track_noref(void *key, void *noref, bool track);
+void __rcu_check_noref(void);
+
+#else
+static inline void rcu_track_noref(void *key, void *noref, bool add) { }
+static inline void __rcu_check_noref(void) { }
+#endif
+
 #ifdef CONFIG_PREEMPT_RCU
 
 void __rcu_read_lock(void);
@@ -85,6 +94,7 @@ static inline void __rcu_read_lock(void)
 
 static inline void __rcu_read_unlock(void)
 {
+	__rcu_check_noref();
 	if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
 		preempt_enable();
 }
@@ -723,6 +733,7 @@ static inline void rcu_read_unlock_bh(void)
 			 "rcu_read_unlock_bh() used illegally while idle");
 	rcu_lock_release(&rcu_bh_lock_map);
 	__release(RCU_BH);
+	__rcu_check_noref();
 	local_bh_enable();
 }
 
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 0ec7d1d33a14..6c7f52a3e809 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -68,6 +68,21 @@ config RCU_TRACE
 	  Say Y here if you want to enable RCU tracing
 	  Say N if you are unsure.
 
+config RCU_NOREF_DEBUG
+	bool "Debugging for RCU-protected noref entries"
+	depends on PREEMPT_RCU=n
+	select PREEMPT_COUNT
+	default n
+	help
+	  In case of long lasting rcu_read_lock sections this debug
+	  feature enables one to ensure that no rcu managed dereferenced
+	  data leaves the locked section. One use case is the tracking
+	  of dst_entries in struct sk_buff ->dst, which is used to pass
+	  the dst_entry around in the whole stack while under RCU lock.
+
+	  Say Y here if you want to enable noref RCU debugging
+	  Say N if you are unsure.
+
 config RCU_EQS_DEBUG
 	bool "Provide debugging asserts for adding NO_HZ support to an arch"
 	depends on DEBUG_KERNEL
diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
index 13c0fc852767..c67d7c65c582 100644
--- a/kernel/rcu/Makefile
+++ b/kernel/rcu/Makefile
@@ -11,3 +11,4 @@ obj-$(CONFIG_TREE_RCU) += tree.o
 obj-$(CONFIG_PREEMPT_RCU) += tree.o
 obj-$(CONFIG_TINY_RCU) += tiny.o
 obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
+obj-$(CONFIG_RCU_NOREF_DEBUG) += noref_debug.o
\ No newline at end of file
diff --git a/kernel/rcu/noref_debug.c b/kernel/rcu/noref_debug.c
new file mode 100644
index 000000000000..ae2e104b11d6
--- /dev/null
+++ b/kernel/rcu/noref_debug.c
@@ -0,0 +1,89 @@
+#include <linux/bug.h>
+#include <linux/percpu.h>
+#include <linux/skbuff.h>
+#include <linux/bitops.h>
+
+#define NOREF_MAX 7
+struct noref_entry {
+	void *noref;
+	void *key;
+	int nesting;
+};
+
+struct noref_cache {
+	struct noref_entry store[NOREF_MAX];
+};
+
+static DEFINE_PER_CPU(struct noref_cache, per_cpu_noref_cache);
+
+static int noref_cache_lookup(struct noref_cache *cache, void *noref, void *key)
+{
+	int i;
+
+	for (i = 0; i < NOREF_MAX; ++i)
+		if (cache->store[i].noref == noref &&
+		    cache->store[i].key == key)
+			return i;
+
+	return -1;
+}
+
+void rcu_track_noref(void *key, void *noref, bool track)
+{
+	struct noref_cache *cache = this_cpu_ptr(&per_cpu_noref_cache);
+	int i;
+
+	if (track) {
+		/* find the first empty slot */
+		i = noref_cache_lookup(cache, NULL, 0);
+		if (i < 0) {
+			WARN_ONCE(1, "can't find empty an slot to track a noref"
+				  " noref tracking will be inaccurate");
+			return;
+		}
+
+		cache->store[i].noref = noref;
+		cache->store[i].key = key;
+		cache->store[i].nesting = preempt_count();
+		return;
+	}
+
+	i = noref_cache_lookup(cache, noref, key);
+	if (i == -1) {
+		WARN_ONCE(1, "to-be-untracked noref entity %p not found in "
+			  "cache\n", noref);
+		return;
+	}
+
+	cache->store[i].noref = NULL;
+	cache->store[i].key = NULL;
+	cache->store[i].nesting = 0;
+}
+EXPORT_SYMBOL_GPL(rcu_track_noref);
+
+void __rcu_check_noref(void)
+{
+	struct noref_cache *cache = this_cpu_ptr(&per_cpu_noref_cache);
+	char *cur, buf[strlen("0xffffffffffffffff") * NOREF_MAX + 1];
+	int nesting = preempt_count();
+	int i, ret, cnt = 0;
+
+	cur = buf;
+	for (i = 0; i < NOREF_MAX; ++i) {
+		if (!cache->store[i].noref ||
+		    cache->store[i].nesting != nesting)
+			continue;
+
+		cnt++;
+		ret = sprintf(cur, " %p", cache->store[i].noref);
+		if (ret >= 0)
+			cur += ret;
+		cache->store[i].noref = NULL;
+		cache->store[i].key = NULL;
+		cache->store[i].nesting = 0;
+	}
+
+	WARN_ONCE(cnt, "%d noref entities escaped an RCU section, "
+		  "nesting %d, leaked noref list %s", cnt, nesting, buf);
+}
+EXPORT_SYMBOL_GPL(__rcu_check_noref);
-- 
2.13.6

^ permalink raw reply related

* [PATCH 0/4] RCU: introduce noref debug
From: Paolo Abeni @ 2017-10-06 12:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Paul E. McKenney, Josh Triplett, Steven Rostedt, David S. Miller,
	Eric Dumazet, Hannes Frederic Sowa, netdev

The networking subsystem is currently using some kind of long-lived
RCU-protected, references to avoid the overhead of full book-keeping.

Such references - skb_dst() noref - are stored inside the skbs and can be
moved across relevant slices of the network stack, with the users
being in charge of properly clearing the relevant skb - or properly refcount
the related dst references - before the skb escapes the RCU section.

We currently don't have any deterministic debug infrastructure to check
the dst noref usages - and the introduction of others noref artifact is
currently under discussion.

This series tries to tackle the above introducing an RCU debug infrastructure
aimed at spotting incorrect noref pointer usage, in patch one. The
infrastructure is small and must be explicitly enabled via a newly introduced
build option.

Patch two uses such infrastructure to track dst noref usage in the networking
stack.

Patch 3 and 4 are bugfixes for small buglet found running this infrastructure
on basic scenarios.

Paolo Abeni (4):
  rcu: introduce noref debug
  net: use RCU noref infrastructure to track dst noref
  ipv4: drop unneeded and misleading RCU lock in ip_route_input_noref()
  tcp: avoid noref dst leak on input path

 include/linux/rcupdate.h | 11 ++++++
 include/linux/skbuff.h   |  1 +
 include/net/dst.h        |  5 +++
 kernel/rcu/Kconfig.debug | 15 ++++++++
 kernel/rcu/Makefile      |  1 +
 kernel/rcu/noref_debug.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/route.c         |  7 +---
 net/ipv4/tcp_input.c     |  5 ++-
 8 files changed, 127 insertions(+), 7 deletions(-)
 create mode 100644 kernel/rcu/noref_debug.c

-- 
2.13.6

^ permalink raw reply

* Re: [PATCH 3/3] ARM: dts: gr-peach: Add ETHER pin group
From: jacopo mondi @ 2017-10-06 12:24 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Geert Uytterhoeven, Chris Brandt, f.fainelli, netdev
In-Reply-To: <20171005164826.GL13247@lunn.ch>

Hi Andrew,
   thanks for the suggestion

On Thu, Oct 05, 2017 at 06:48:26PM +0200, Andrew Lunn wrote:
> On Thu, Oct 05, 2017 at 05:42:39PM +0200, jacopo mondi wrote:
> > Hi Andrew,

[snip]

> > > Hi Jocopo
> > >
> > > So what is this reset resetting?
> > >
> > > The MAC?
> > > The PHY?
> >
> > The reset line goes from our SoC to LAN8710A PHY chip external reset pin.
>
> So yes, this is a PHY property, and should be in the PHY node.
>
> Documentation/devicetree/bindings/net/mdio.txt does not apply here
> anyway. That is for an MDIO binding. This node is an ethernet MAC.
>
> So your binding whats to look something like
>
>         ether: ethernet@e8203000 {
>                 compatible = "renesas,ether-r7s72100";
>                 reg = <0xe8203000 0x800>,
>                       <0xe8204800 0x200>;
>                 interrupts = <GIC_SPI 327 IRQ_TYPE_LEVEL_HIGH>;
>                 clocks = <&mstp7_clks R7S72100_CLK_ETHER>;
>                 power-domains = <&cpg_clocks>;
>                 phy-mode = "mii";
> 		          phy-handle = <&phy0>;
>                 #address-cells = <1>;
>                 #size-cells = <0>;
>
>                 mdio: bus-bus {
>                       #address-cells = <1>;
>                       #size-cells = <0>;
>
>                       phy0: ethernet-phy@1 {
>                             reg = <1>;

Why reg = <1> ?
Shouldn't this be 0, or even better with no reg property at all?

        mdio: bus-bus {
              phy-0 {
                  reset-gpios = <&port4 2 GPIO_ACTIVE_LOW>;
                  reset-delay-us = <5>;
              };
        };

Thanks
   j

>                             reset-gpios = <&port4 2 GPIO_ACTIVE_LOW>;
>                             reset-delay-us = <5>;
>                       };
>                 };
>         };
>
> 	Andrew

^ permalink raw reply

* Re: [net-next V4 PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation
From: Jesper Dangaard Brouer @ 2017-10-06 12:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, pavel.odintsov,
	Jason Wang, mchan, John Fastabend, peter.waskiewicz.jr,
	Daniel Borkmann, Alexei Starovoitov, Andy Gospodarek, brouer
In-Reply-To: <59D607F3.6090306@iogearbox.net>

On Thu, 05 Oct 2017 12:22:43 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/04/2017 02:03 PM, Jesper Dangaard Brouer wrote:
> [...]
> >   static int cpu_map_kthread_run(void *data)
> >   {
> >   	struct bpf_cpu_map_entry *rcpu = data;
> >
> >   	set_current_state(TASK_INTERRUPTIBLE);
> >   	while (!kthread_should_stop()) {
> > +		unsigned int processed = 0, drops = 0;
> >   		struct xdp_pkt *xdp_pkt;
> >
> > -		schedule();
> > -		/* Do work */
> > -		while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
> > -			/* For now just "refcnt-free" */
> > -			page_frag_free(xdp_pkt);
> > +		/* Release CPU reschedule checks */
> > +		if (__ptr_ring_empty(rcpu->queue)) {
> > +			schedule();
> > +		} else {
> > +			cond_resched();
> > +		}
> > +
> > +		/* Process packets in rcpu->queue */
> > +		local_bh_disable();
> > +		/*
> > +		 * The bpf_cpu_map_entry is single consumer, with this
> > +		 * kthread CPU pinned. Lockless access to ptr_ring
> > +		 * consume side valid as no-resize allowed of queue.
> > +		 */
> > +		while ((xdp_pkt = __ptr_ring_consume(rcpu->queue))) {
> > +			struct sk_buff *skb;
> > +			int ret;
> > +
> > +			skb = cpu_map_build_skb(rcpu, xdp_pkt);
> > +			if (!skb) {
> > +				page_frag_free(xdp_pkt);
> > +				continue;
> > +			}
> > +
> > +			/* Inject into network stack */
> > +			ret = netif_receive_skb_core(skb);  
> 
> Don't we need to hold RCU read lock for above netif_receive_skb_core()?

Yes, I guess we do, but I'll place it in netif_receive_skb_core() before
invoking __netif_receive_skb_core(), like netif_receive_skb() does
around __netif_receive_skb().

It looks like the RCU section protects:
 	rx_handler = rcu_dereference(skb->dev->rx_handler);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] ipv6: gso: fix payload length when gso_size is zero
From: Alexey Kodanev @ 2017-10-06 12:16 UTC (permalink / raw)
  To: Duyck, Alexander H, netdev@vger.kernel.org
  Cc: davem@davemloft.net, steffen.klassert@secunet.com
In-Reply-To: <1507229878.2098.56.camel@intel.com>

On 10/05/2017 09:58 PM, Duyck, Alexander H wrote:
> On Thu, 2017-10-05 at 20:06 +0300, Alexey Kodanev wrote:
>> When gso_size reset to zero for the tail segment in skb_segment(), later
>> in ipv6_gso_segment(), we will get incorrect payload_len for that segment.
>> inet_gso_segment() already has a check for gso_size before calculating
>> payload so fixing only IPv6 part.
>>
>> The issue was found with LTP vxlan & gre tests over ixgbe NIC.
>>
>> Fixes: 07b26c9454a2 ("gso: Support partial splitting at the frag_list pointer")
>> Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
>> ---
>>  net/ipv6/ip6_offload.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
>> index cdb3728..4a87f94 100644
>> --- a/net/ipv6/ip6_offload.c
>> +++ b/net/ipv6/ip6_offload.c
>> @@ -105,7 +105,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
>>  
>>  	for (skb = segs; skb; skb = skb->next) {
>>  		ipv6h = (struct ipv6hdr *)(skb_mac_header(skb) + nhoff);
>> -		if (gso_partial)
>> +		if (gso_partial && skb_is_gso(skb))
>>  			payload_len = skb_shinfo(skb)->gso_size +
>>  				      SKB_GSO_CB(skb)->data_offset +
>>  				      skb->head - (unsigned char *)(ipv6h + 1);
> So looking over this change it looks good to me. I'm just wondering if
> you have looked at the code in __skb_udp_tunnel_segment or
> gre_gso_segment? It seems like if you needed this change here you
> should need to make similar changes to those functions as well. I'm
> wondering if we just aren't seeing issues due to the segments already
> being MSS sized before being handed to us for segmentation.

Right, it can happen in __skb_udp_tunnel_segment as well. I wasn't able
to reproduce with gre, it looks like it doesn't go to that part of code,
skipping it, possibly, on gso_type & SKB_GSO_GRE_CSUM check, though the
NIC has tx-gre-csum-segmentation enabled...

I'll send new version after testing completed.

Thanks,
Alexey

^ permalink raw reply

* Re: [net-next V4 PATCH 2/5] bpf: XDP_REDIRECT enable use of cpumap
From: Jesper Dangaard Brouer @ 2017-10-06 12:01 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, pavel.odintsov,
	Jason Wang, mchan, John Fastabend, peter.waskiewicz.jr,
	Daniel Borkmann, Alexei Starovoitov, Andy Gospodarek, brouer
In-Reply-To: <20171006131748.75185f65@redhat.com>

On Fri, 6 Oct 2017 13:17:48 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > > -int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp)
> > > +int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
> > > +		    struct net_device *dev_rx)
> > >   {
> > >   	struct xdp_pkt *xdp_pkt;
> > >   	int headroom;
> > > @@ -505,7 +506,7 @@ int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp)
> > >   	xdp_pkt = xdp->data_hard_start;
> > >   	xdp_pkt->data = xdp->data;
> > >   	xdp_pkt->len  = xdp->data_end - xdp->data;
> > > -	xdp_pkt->headroom = headroom;
> > > +	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);    
> > 
> > (Just a note, bit confusing that first two patches add and extend
> >   this, and only in the third you add the xdp->data_meta handling,
> >   makes it harder to review at least.)  
> 
> Sorry.  This is a left-overs from rebasing and measuring the cost of
> transferring only the pointer to the page, and remote put_page().
> And your xdp->data_meta, happen basically while my patches was in-flight.
> 
> I'll move this one-line back to patch 2, to spreading over too many
> patches.

I instead choose to move the creation of cpu_map_enqueue() into this
patch, but in a more simple version stating explicit that this is only
seen as a void pointer enqueue.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH 2/3 v2] net: phy: DP83822 initial driver submission
From: Dan Murphy @ 2017-10-06 12:00 UTC (permalink / raw)
  To: Andrew Lunn, Woojung.Huh; +Cc: f.fainelli, netdev, afd
In-Reply-To: <20171004235307.GD16612@lunn.ch>

Andrew

On 10/04/2017 06:53 PM, Andrew Lunn wrote:
> On Wed, Oct 04, 2017 at 10:44:36PM +0000, Woojung.Huh@microchip.com wrote:
>>> +static int dp83822_suspend(struct phy_device *phydev)
>>> +{
>>> +	int value;
>>> +
>>> +	mutex_lock(&phydev->lock);
>>> +	value = phy_read_mmd(phydev, DP83822_DEVADDR,
>>> MII_DP83822_WOL_CFG);
>>> +	mutex_unlock(&phydev->lock);
> 
>> Would we need mutex to access phy_read_mmd()?
>> phy_read_mmd() has mdio_lock for indirect access.
> 
> Hi Woojung
> 
> The mdio lock is not sufficient. It protects against two mdio
> accesses. But here we need to protect against two phy operations.
> There is a danger something else tries to access the phy during
> suspend.
> 
>>> +	if (!(value & DP83822_WOL_EN))
>>> +		genphy_suspend(phydev);
> 
> Releasing the lock before calling genphy_suspend() is not so nice.
> Maybe add a version which assumes the lock has already been taken?
> 

The marvell driver does not take a lock and calls genphy_suspend/resume
so I am wondering if this driver needs to take a lock.

The at803x needs to take the lock because it does not call into the genphy
functions.

I will submit a new version with the lock removed.

Dan

>       Andrew
> 


-- 
------------------
Dan Murphy

^ permalink raw reply

* Re: [net-next V4 PATCH 2/5] bpf: XDP_REDIRECT enable use of cpumap
From: Jesper Dangaard Brouer @ 2017-10-06 11:17 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, pavel.odintsov,
	Jason Wang, mchan, John Fastabend, peter.waskiewicz.jr,
	Daniel Borkmann, Alexei Starovoitov, Andy Gospodarek, brouer
In-Reply-To: <59D60505.2040004@iogearbox.net>


On Thu, 05 Oct 2017 12:10:13 +0200 Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/04/2017 02:03 PM, Jesper Dangaard Brouer wrote:
> > This patch connects cpumap to the xdp_do_redirect_map infrastructure.
> >
> > Still no SKB allocation are done yet.  The XDP frames are transferred
> > to the other CPU, but they are simply refcnt decremented on the remote
> > CPU.  This served as a good benchmark for measuring the overhead of
> > remote refcnt decrement.  If driver page recycle cache is not
> > efficient then this, exposes a bottleneck in the page allocator.
> >
> > A shout-out to MST's ptr_ring, which is the secret behind is being so
> > efficient to transfer memory pointers between CPUs, without constantly
> > bouncing cache-lines between CPUs.
> >
> > V3: Handle !CONFIG_BPF_SYSCALL pointed out by kbuild test robot.
> >
> > V4: Make Generic-XDP aware of cpumap type, but don't allow redirect yet,
> >   as implementation require a separate upstream discussion.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>  
> [...]
> > diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> > index ae8e29352261..4926a9971f90 100644
> > --- a/kernel/bpf/cpumap.c
> > +++ b/kernel/bpf/cpumap.c
> > @@ -493,7 +493,8 @@ static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_pkt *xdp_pkt)
> >   	return 0;
> >   }
> >
> > -int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp)
> > +int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
> > +		    struct net_device *dev_rx)
> >   {
> >   	struct xdp_pkt *xdp_pkt;
> >   	int headroom;
> > @@ -505,7 +506,7 @@ int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp)
> >   	xdp_pkt = xdp->data_hard_start;
> >   	xdp_pkt->data = xdp->data;
> >   	xdp_pkt->len  = xdp->data_end - xdp->data;
> > -	xdp_pkt->headroom = headroom;
> > +	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);  
> 
> (Just a note, bit confusing that first two patches add and extend
>   this, and only in the third you add the xdp->data_meta handling,
>   makes it harder to review at least.)

Sorry.  This is a left-overs from rebasing and measuring the cost of
transferring only the pointer to the page, and remote put_page().
And your xdp->data_meta, happen basically while my patches was in-flight.

I'll move this one-line back to patch 2, to spreading over too many
patches.

> [...]
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 9b6e7e84aafd..dbf2ae071108 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2521,10 +2521,36 @@ static int __bpf_tx_xdp(struct net_device *dev,
> >   	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
> >   	if (err)
> >   		return err;
> > -	if (map)
> > +	dev->netdev_ops->ndo_xdp_flush(dev);
> > +	return 0;
> > +}
> > +
> > +static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
> > +			    struct bpf_map *map,
> > +			    struct xdp_buff *xdp,
> > +			    u32 index)
> > +{
> > +	int err;
> > +
> > +	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
> > +		struct net_device *dev = fwd;
> > +
> > +		if (!dev->netdev_ops->ndo_xdp_xmit)
> > +			return -EOPNOTSUPP;
> > +
> > +		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
> > +		if (err)
> > +			return err;
> >   		__dev_map_insert_ctx(map, index);
> > -	else
> > -		dev->netdev_ops->ndo_xdp_flush(dev);
> > +
> > +	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP) {
> > +		struct bpf_cpu_map_entry *rcpu = fwd;
> > +
> > +		err = cpu_map_enqueue(rcpu, xdp, dev_rx);
> > +		if (err)
> > +			return err;
> > +		__cpu_map_insert_ctx(map, index);
> > +	}
> >   	return 0;
> >   }
> >
> > @@ -2534,11 +2560,33 @@ void xdp_do_flush_map(void)
> >   	struct bpf_map *map = ri->map_to_flush;
> >
> >   	ri->map_to_flush = NULL;
> > -	if (map)
> > -		__dev_map_flush(map);
> > +	if (map) {
> > +		switch (map->map_type) {
> > +		case BPF_MAP_TYPE_DEVMAP:
> > +			__dev_map_flush(map);
> > +			break;
> > +		case BPF_MAP_TYPE_CPUMAP:
> > +			__cpu_map_flush(map);
> > +			break;
> > +		default:
> > +			break;
> > +		}
> > +	}
> >   }
> >   EXPORT_SYMBOL_GPL(xdp_do_flush_map);
> >
> > +static void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
> > +{
> > +	switch (map->map_type) {
> > +	case BPF_MAP_TYPE_DEVMAP:
> > +		return __dev_map_lookup_elem(map, index);
> > +	case BPF_MAP_TYPE_CPUMAP:
> > +		return __cpu_map_lookup_elem(map, index);
> > +	default:
> > +		return NULL;
> > +	}  
> 
> Should we just have a callback and instead of the above use
> map->ptr_lookup_elem() (or however we name it) ... lot of it
> is pretty much the same logic as with devmap.

We could extend struct bpf_map *map with such a callback, I was just
afraid that this would be too invasive.

Performance wise, I don't thinks will hurt too much.
http://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html


> > +}
> > +
> >   static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
> >   				   unsigned long aux)
> >   {
> > @@ -2551,8 +2599,8 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
> >   	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
> >   	unsigned long map_owner = ri->map_owner;
> >   	struct bpf_map *map = ri->map;
> > -	struct net_device *fwd = NULL;
> >   	u32 index = ri->ifindex;
> > +	void *fwd = NULL;
> >   	int err;
> >
> >   	ri->ifindex = 0;  
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net v2 3/9] net/mac89x0: Fix and modernize log messages
From: Finn Thain @ 2017-10-06 11:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel
In-Reply-To: <20171005.210842.1348457661309929796.davem@davemloft.net>

On Thu, 5 Oct 2017, David Miller wrote:

> From: Finn Thain <fthain@telegraphics.com.au>
> Date: Thu, 5 Oct 2017 21:11:05 -0400 (EDT)
> 
> > Fix misplaced newlines in conditional log messages.
> 
> Please don't do this, the way the author formatted the strings was 
> intentional, they intended to print out:
> 
> 	NAME: cs89%c0%s rev %c found at %#8lx IRQ %d ADDR %pM
> 
> But now you are splitting it into multiple lines.

Right.

> Also, you're printing the IRQ information after register_netdev() which 
> is bad.  As soon as register_netdev() is called, the driver's
> ->open() routine can be invoked, and during which time some
> log messages could be emitted during that operation.
> 
> And that would cut the probe messages up.
> 

Yes and no. The thing is, "IRQ %d" isn't really a "probe message" and 
doesn't need to be logged at all: the IRQ is entirely fixed. Actually the 
same is true for the macmace driver. There is value in this information 
but it can be found in /proc/interrupts so I'd happily drop the "IRQ %d" 
portion from these log messages.

> I know how you got to this state, you saw a reference to dev->name 
> before it had a real value.  You just removed the "eth%d" string 
> entirely.  And since you removed the dev->name reference, you had no 
> reason to move log messages after register_netdev() at all.
> 

Not quite. I used the "MAC %pM, IRQ %d" style for consistency with other 
NIC drivers. Though consistency in itself may be insufficient 
justification. More importantly, I wanted the MAC address logged together 
with the actual interface name. That's how I arrived at this code.

> Anyways, you can also see the intention of the author here becuase they 
> have _explicit_ leading newlines in the error path messages that come 
> after the inital probe printk.
> 

Of course. I do understand the existing code. And the code actually 
reflects the intentions of the author of the ISA driver. Having the IRQ 
logged could be really valuable to the typical ISA card user but this 
platform is not ISA.

> The real way to fix the early dev->name reference is to replace it with 
> a dev_info() call and have it use the struct device name rather than the 
> netdev device one.
> 

This driver only runs on machines with one expansion slot (called a "Comm 
Slot"). So I figured that either pr_info() or printk(KERN_INFO ...) would 
do just fine here (always has done). I did consider dev_info() but I don't 
see the benefit. I'm probably missing something, so would you elaborate 
please?

BTW I've also used pr_info() elsewhere in this series in platform drivers. 
It's not yet clear to me whether the mac89x0 driver should ultimately bind 
to a platform device or a nubus device: comm slot cards are a bit of 
each.

> Again, I think you really shouldn't be making these small weird changes 
> to these old drivers.
> 

These are weird changes befitting a weird platform.

I understand your reluctance to touch legacy drivers, but my intention is 
not change for the sake of change. There is bitrot here. Sometimes that's 
due to the rest of the kernel having changed and sometimes it's due to 
actual damage of the kind you seem to fear. I'm trying to address both.

-- 

^ permalink raw reply

* Re: [net-next,v2] ip_gre: check packet length and mtu correctly in erspan tx
From: Xin Long @ 2017-10-06 11:03 UTC (permalink / raw)
  To: William Tu; +Cc: network dev, David Laight
In-Reply-To: <1507230432-56495-1-git-send-email-u9012063@gmail.com>

On Fri, Oct 6, 2017 at 3:07 AM, William Tu <u9012063@gmail.com> wrote:
> Similarly to early patch for erspan_xmit(), the ARPHDR_ETHER device
> is the length of the whole ether packet.  So skb->len should subtract
> the dev->hard_header_len.
>
> Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
> Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
> Signed-off-by: William Tu <u9012063@gmail.com>
> Cc: Xin Long <lucien.xin@gmail.com>
> Cc: David Laight <David.Laight@aculab.com>
> ---
> v1->v2:
> use addition to avoid overflow
> fix pskb_trim size
> ---
>  net/ipv4/ip_gre.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index b279c325c7f6..fb95f68d6e53 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -579,8 +579,8 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
>         if (gre_handle_offloads(skb, false))
>                 goto err_free_rt;
>
> -       if (skb->len > dev->mtu) {
> -               pskb_trim(skb, dev->mtu);
> +       if (skb->len > dev->mtu + dev->hard_header_len) {
> +               pskb_trim(skb, dev->mtu + dev->hard_header_len);
>                 truncate = true;
>         }
>
> @@ -731,8 +731,8 @@ static netdev_tx_t erspan_xmit(struct sk_buff *skb,
>         if (skb_cow_head(skb, dev->needed_headroom))
>                 goto free_skb;
>
> -       if (skb->len - dev->hard_header_len > dev->mtu) {
> -               pskb_trim(skb, dev->mtu);
> +       if (skb->len > dev->mtu + dev->hard_header_len) {
> +               pskb_trim(skb, dev->mtu + dev->hard_header_len);
>                 truncate = true;
>         }
>
> --
> 2.7.4
>
Reviewed-by: Xin Long <lucien.xin@gmail.com>

^ permalink raw reply

* [PATCH] bnx2x: Use pci_ari_enabled() instead of local copy
From: Bjorn Helgaas @ 2017-10-06 11:00 UTC (permalink / raw)
  To: Ariel Elior; +Cc: netdev, everest-linux-l2, linux-kernel

From: Bjorn Helgaas <bhelgaas@google.com>

Use pci_ari_enabled() from the PCI core instead of the identical local copy
bnx2x_ari_enabled().  No functional change intended.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c |    7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index 9ca994d0bab6..3591077a5f6b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -1074,11 +1074,6 @@ static void bnx2x_vf_set_bars(struct bnx2x *bp, struct bnx2x_virtf *vf)
 	}
 }
 
-static int bnx2x_ari_enabled(struct pci_dev *dev)
-{
-	return dev->bus->self && dev->bus->self->ari_enabled;
-}
-
 static int
 bnx2x_get_vf_igu_cam_info(struct bnx2x *bp)
 {
@@ -1212,7 +1207,7 @@ int bnx2x_iov_init_one(struct bnx2x *bp, int int_mode_param,
 
 	err = -EIO;
 	/* verify ari is enabled */
-	if (!bnx2x_ari_enabled(bp->pdev)) {
+	if (!pci_ari_enabled(bp->pdev->bus)) {
 		BNX2X_ERR("ARI not supported (check pci bridge ARI forwarding), SRIOV can not be enabled\n");
 		return 0;
 	}

^ permalink raw reply related

* Re: [net-next V4 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Jesper Dangaard Brouer @ 2017-10-06 10:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, pavel.odintsov,
	Jason Wang, mchan, John Fastabend, peter.waskiewicz.jr,
	Daniel Borkmann, Alexei Starovoitov, Andy Gospodarek, brouer,
	Tobias Klauser
In-Reply-To: <59D5FDFF.5040002@iogearbox.net>

On Thu, 05 Oct 2017 11:40:15 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/04/2017 02:03 PM, Jesper Dangaard Brouer wrote:
> [...]
> > +#define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
> > +struct xdp_bulk_queue {
> > +	void *q[CPU_MAP_BULK_SIZE];
> > +	unsigned int count;
> > +};
> > +
> > +/* Struct for every remote "destination" CPU in map */
> > +struct bpf_cpu_map_entry {
> > +	u32 cpu;    /* kthread CPU and map index */
> > +	int map_id; /* Back reference to map */  
> 
> map_id is not used here if I read it correctly? We should
> then remove it.

It is actually used in a later patch. Notice, there is no unused
members in the final patch.  I did consider adding back in the later
patch, but it was annoying to during the devel and split-up patch
phase, as it creates conflicts when I move between the different
patches, that need to modify this struct.  Thus, I choose to keep the
end-struct in this cpumap-base-patch.  If you insist, I can go though
the patch-stack and carefully introduce changes to the struct in steps?


> > +	u32 qsize;  /* Redundant queue size for map lookup */
> > +
> > +	/* XDP can run multiple RX-ring queues, need __percpu enqueue store */
> > +	struct xdp_bulk_queue __percpu *bulkq;
> > +
> > +	/* Queue with potential multi-producers, and single-consumer kthread */
> > +	struct ptr_ring *queue;
> > +	struct task_struct *kthread;
> > +	struct work_struct kthread_stop_wq;
> > +
> > +	atomic_t refcnt; /* Control when this struct can be free'ed */
> > +	struct rcu_head rcu;
> > +};
> > +
> > +struct bpf_cpu_map {
> > +	struct bpf_map map;
> > +	/* Below members specific for map type */
> > +	struct bpf_cpu_map_entry **cpu_map;
> > +	unsigned long __percpu *flush_needed;
> > +};
> > +
> > +static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
> > +			     struct xdp_bulk_queue *bq);  
> 
> Could we avoid forward declaration?

This forward declaration helps me avoid mixing the enqueue code with
the bpf-map code.  I like that all the enqueue code is located after
the struct bpf_map_ops cpu_map_ops deceleration.  If you insist, I can
reorder the code?

 
> > +static u64 cpu_map_bitmap_size(const union bpf_attr *attr)
> > +{
> > +	return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
> > +}
> > +
> > +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> > +{
> > +	struct bpf_cpu_map *cmap;
> > +	u64 cost;
> > +	int err;
> > +
> > +	/* check sanity of attributes */
> > +	if (attr->max_entries == 0 || attr->key_size != 4 ||
> > +	    attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	cmap = kzalloc(sizeof(*cmap), GFP_USER);
> > +	if (!cmap)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/* mandatory map attributes */
> > +	cmap->map.map_type = attr->map_type;
> > +	cmap->map.key_size = attr->key_size;
> > +	cmap->map.value_size = attr->value_size;
> > +	cmap->map.max_entries = attr->max_entries;
> > +	cmap->map.map_flags = attr->map_flags;
> > +	cmap->map.numa_node = bpf_map_attr_numa_node(attr);
> > +
> > +	/* Pre-limit array size based on NR_CPUS, not final CPU check */
> > +	if (cmap->map.max_entries > NR_CPUS)
> > +		return ERR_PTR(-E2BIG);
> > +
> > +	/* make sure page count doesn't overflow */
> > +	cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *);
> > +	cost += cpu_map_bitmap_size(attr) * num_possible_cpus();
> > +	if (cost >= U32_MAX - PAGE_SIZE)
> > +		goto free_cmap;
> > +	cmap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	/* if map size is larger than memlock limit, reject it early */
> > +	err = bpf_map_precharge_memlock(cmap->map.pages);
> > +	if (err)
> > +		goto free_cmap;  
> 
> Given this is almost the same as devmap and touches user land reporting,
> we should probably go and do the same as in 582db7e0c4c2 ("bpf: devmap:
> pass on return value of bpf_map_precharge_memlock").

I guess we have to do this... even-though I absolutely HATE that this
will return -EPERM which user land will see as "Operation not permitted".

Even-though I know this, I got confused and spend several hours hunting
the wrong kind of error:
 https://github.com/netoptimizer/prototype-kernel/commit/cf4694792bf9807f48dc174a149ab0805133184a

Even the iovisor/BCC tool have a work-around for this ambiguous ABI
that we provide, and explicitly call their work-around a "hack":

https://github.com/iovisor/bcc/blob/2e20494f63a/src/cc/libbpf.c#L366-L373

 if (ret < 0 && errno == EPERM) {
    // When EPERM is returned, two reasons are possible:
    //  1. user has no permissions for bpf()
    //  2. user has insufficent rlimit for locked memory
    // Unfortunately, there is no api to inspect the current usage of locked
    // mem for the user, so an accurate calculation of how much memory to lock
    // for this new program is difficult to calculate. As a hack, bump the limit
    // to unlimited. If program load fails again, return the error.

 
> > +	/* A per cpu bitfield with a bit per possible CPU in map  */
> > +	cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr),
> > +					    __alignof__(unsigned long));
> > +	if (!cmap->flush_needed)
> > +		goto free_cmap;
> > +
> > +	/* Alloc array for possible remote "destination" CPUs */
> > +	cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries *
> > +					   sizeof(struct bpf_cpu_map_entry *),
> > +					   cmap->map.numa_node);
> > +	if (!cmap->cpu_map)
> > +		goto free_cmap;
> > +
> > +	return &cmap->map;
> > +free_cmap:
> > +	free_percpu(cmap->flush_needed);
> > +	kfree(cmap);
> > +	return ERR_PTR(-ENOMEM);
> > +}
> > +
> > +void __cpu_map_queue_destructor(void *ptr)
> > +{
> > +	/* For now, just catch this as an error */
> > +	if (!ptr)
> > +		return;
> > +	pr_err("ERROR: %s() cpu_map queue was not empty\n", __func__);  
> 
> Can you elaborate on this "for now" condition? Is this a race
> when kthread doesn't consume queue on thread exit, or should it
> be impossible to trigger (should it then be replaced with a
> 'if (WARN_ON_ONCE(ptr)) page_frag_free(ptr)' and a more elaborate
> comment)?

The "for now" is an old comment while developing and testing this.
In this final state of the patchset it _should_ not be possible to
trigger this situation.  I like your idea of replacing it with a
WARN_ON_ONCE.  (as it might be good to keep in some form, as it would
catch is someone changing the code which breaks the RCU+WQ+kthread
tear-down procedure).


> > +	page_frag_free(ptr);
> > +}
> > +
> > +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> > +{
> > +	if (atomic_dec_and_test(&rcpu->refcnt)) {
> > +		/* The queue should be empty at this point */
> > +		ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
> > +		kfree(rcpu->queue);
> > +		kfree(rcpu);
> > +	}
> > +}
> > +
> > +static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> > +{
> > +	atomic_inc(&rcpu->refcnt);
> > +}
> > +
> > +/* called from workqueue, to workaround syscall using preempt_disable */
> > +static void cpu_map_kthread_stop(struct work_struct *work)
> > +{
> > +	struct bpf_cpu_map_entry *rcpu;
> > +
> > +	rcpu = container_of(work, struct bpf_cpu_map_entry, kthread_stop_wq);
> > +	synchronize_rcu(); /* wait for flush in __cpu_map_entry_free() */
> > +	kthread_stop(rcpu->kthread); /* calls put_cpu_map_entry */
> > +}
> > +
> > +static int cpu_map_kthread_run(void *data)
> > +{
> > +	struct bpf_cpu_map_entry *rcpu = data;
> > +
> > +	set_current_state(TASK_INTERRUPTIBLE);
> > +	while (!kthread_should_stop()) {
> > +		struct xdp_pkt *xdp_pkt;
> > +
> > +		schedule();
> > +		/* Do work */
> > +		while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
> > +			/* For now just "refcnt-free" */
> > +			page_frag_free(xdp_pkt);
> > +		}
> > +		__set_current_state(TASK_INTERRUPTIBLE);
> > +	}
> > +	put_cpu_map_entry(rcpu);
> > +
> > +	__set_current_state(TASK_RUNNING);
> > +	return 0;
> > +}
> > +
> > +struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, int map_id)
> > +{
> > +	gfp_t gfp = GFP_ATOMIC|__GFP_NOWARN;
> > +	struct bpf_cpu_map_entry *rcpu;
> > +	int numa, err;
> > +
> > +	/* Have map->numa_node, but choose node of redirect target CPU */
> > +	numa = cpu_to_node(cpu);
> > +
> > +	rcpu = kzalloc_node(sizeof(*rcpu), gfp, numa);
> > +	if (!rcpu)
> > +		return NULL;
> > +
> > +	/* Alloc percpu bulkq */
> > +	rcpu->bulkq = __alloc_percpu_gfp(sizeof(*rcpu->bulkq),
> > +					 sizeof(void *), gfp);
> > +	if (!rcpu->bulkq)
> > +		goto fail;
> > +
> > +	/* Alloc queue */
> > +	rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa);
> > +	if (!rcpu->queue)
> > +		goto fail;
> > +
> > +	err = ptr_ring_init(rcpu->queue, qsize, gfp);
> > +	if (err)
> > +		goto fail;
> > +	rcpu->qsize = qsize;
> > +
> > +	/* Setup kthread */
> > +	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
> > +					       "cpumap/%d/map:%d", cpu, map_id);
> > +	if (IS_ERR(rcpu->kthread))
> > +		goto fail;  
> 
> What about ptr_ring_cleanup() when we fail here?

Thanks for catching this ... added.

> > +
> > +	/* Make sure kthread runs on a single CPU */
> > +	kthread_bind(rcpu->kthread, cpu);
> > +	wake_up_process(rcpu->kthread);
> > +
> > +	get_cpu_map_entry(rcpu); /* 1-refcnt for being in cmap->cpu_map[] */
> > +	get_cpu_map_entry(rcpu); /* 1-refcnt for kthread */
> > +
> > +	return rcpu;
> > +
> > +fail:   /* Hint: free API detect NULL values */
> > +	free_percpu(rcpu->bulkq);
> > +	kfree(rcpu->queue);
> > +	kfree(rcpu);
> > +	return NULL;
> > +}
> > +
> > +void __cpu_map_entry_free(struct rcu_head *rcu)
> > +{
> > +	struct bpf_cpu_map_entry *rcpu;
> > +	int cpu;
> > +
> > +	/* This cpu_map_entry have been disconnected from map and one
> > +	 * RCU graze-period have elapsed.  Thus, XDP cannot queue any
> > +	 * new packets and cannot change/set flush_needed that can
> > +	 * find this entry.
> > +	 */
> > +	rcpu = container_of(rcu, struct bpf_cpu_map_entry, rcu);
> > +
> > +	/* Flush remaining packets in percpu bulkq */
> > +	for_each_online_cpu(cpu) {
> > +		struct xdp_bulk_queue *bq = per_cpu_ptr(rcpu->bulkq, cpu);
> > +
> > +		/* No concurrent bq_enqueue can run at this point */
> > +		bq_flush_to_queue(rcpu, bq);
> > +	}
> > +	free_percpu(rcpu->bulkq);
> > +	/* Cannot kthread_stop() here, last put free rcpu resources */
> > +	put_cpu_map_entry(rcpu);
> > +}
> > +
> > +/* After xchg pointer to bpf_cpu_map_entry, use the call_rcu() to
> > + * ensure any driver rcu critical sections have completed, but this
> > + * does not guarantee a flush has happened yet. Because driver side
> > + * rcu_read_lock/unlock only protects the running XDP program.  The
> > + * atomic xchg and NULL-ptr check in __cpu_map_flush() makes sure a
> > + * pending flush op doesn't fail.
> > + *
> > + * The bpf_cpu_map_entry is still used by the kthread, and there can
> > + * still be pending packets (in queue and percpu bulkq).  A refcnt
> > + * makes sure to last user (kthread_stop vs. call_rcu) free memory
> > + * resources.
> > + *
> > + * The rcu callback __cpu_map_entry_free flush remaining packets in
> > + * percpu bulkq to queue.  Due to caller map_delete_elem() disable
> > + * preemption, cannot call kthread_stop() to make sure queue is empty.
> > + * Instead a work_queue is started for stopping kthread,
> > + * cpu_map_kthread_stop, which waits for an RCU graze period before
> > + * stopping kthread, emptying the queue.
> > + */
> > +void __cpu_map_entry_replace(struct bpf_cpu_map *cmap,
> > +			     u32 key_cpu, struct bpf_cpu_map_entry *rcpu)
> > +{
> > +	struct bpf_cpu_map_entry *old_rcpu;
> > +
> > +	old_rcpu = xchg(&cmap->cpu_map[key_cpu], rcpu);
> > +	if (old_rcpu) {
> > +		call_rcu(&old_rcpu->rcu, __cpu_map_entry_free);
> > +		INIT_WORK(&old_rcpu->kthread_stop_wq, cpu_map_kthread_stop);
> > +		schedule_work(&old_rcpu->kthread_stop_wq);
> > +	}
> > +}
> > +
> > +int cpu_map_delete_elem(struct bpf_map *map, void *key)
> > +{
> > +	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
> > +	u32 key_cpu = *(u32 *)key;
> > +
> > +	if (key_cpu >= map->max_entries)
> > +		return -EINVAL;
> > +
> > +	/* notice caller map_delete_elem() use preempt_disable() */
> > +	__cpu_map_entry_replace(cmap, key_cpu, NULL);
> > +	return 0;
> > +}
> > +
> > +int cpu_map_update_elem(struct bpf_map *map, void *key, void *value,
> > +				u64 map_flags)
> > +{
> > +	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
> > +	struct bpf_cpu_map_entry *rcpu;
> > +
> > +	/* Array index key correspond to CPU number */
> > +	u32 key_cpu = *(u32 *)key;
> > +	/* Value is the queue size */
> > +	u32 qsize = *(u32 *)value;
> > +
> > +	/* Make sure CPU is a valid possible cpu */
> > +	if (!cpu_possible(key_cpu))
> > +		return -ENODEV;
> > +
> > +	if (unlikely(map_flags > BPF_EXIST))
> > +		return -EINVAL;
> > +	if (unlikely(key_cpu >= cmap->map.max_entries))
> > +		return -E2BIG;
> > +	if (unlikely(map_flags == BPF_NOEXIST))
> > +		return -EEXIST;
> > +	if (unlikely(qsize > 16384)) /* sanity limit on qsize */
> > +		return -EOVERFLOW;
> > +
> > +	if (qsize == 0) {
> > +		rcpu = NULL; /* Same as deleting */
> > +	} else {
> > +		/* Updating qsize cause re-allocation of bpf_cpu_map_entry */
> > +		rcpu = __cpu_map_entry_alloc(qsize, key_cpu, map->id);
> > +		if (!rcpu)
> > +			return -ENOMEM;
> > +	}
> > +	rcu_read_lock();
> > +	__cpu_map_entry_replace(cmap, key_cpu, rcpu);
> > +	rcu_read_unlock();
> > +	return 0;  
> 
> You need to update verifier such that this function cannot be called
> out of an BPF program,

In the example BPF program, I do a lookup into the map, but only to
verify that an entry exist (I don't look at the value).  I would like
to support such usage.


> otherwise it would be possible under full RCU
> read context, which is explicitly avoided here and also it would otherwise
> be allowed for other maps of different type as well, which needs to
> be avoided.

Sorry, I don't follow this.

 
> > +}
> > +
> > +void cpu_map_free(struct bpf_map *map)
> > +{
> > +	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
> > +	int cpu;
> > +	u32 i;
> > +
> > +	/* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
> > +	 * so the bpf programs (can be more than one that used this map) were
> > +	 * disconnected from events. Wait for outstanding critical sections in
> > +	 * these programs to complete. The rcu critical section only guarantees
> > +	 * no further "XDP/bpf-side" reads against bpf_cpu_map->cpu_map.
> > +	 * It does __not__ ensure pending flush operations (if any) are
> > +	 * complete.
> > +	 */
> > +	synchronize_rcu();
> > +
> > +	/* To ensure all pending flush operations have completed wait for flush
> > +	 * bitmap to indicate all flush_needed bits to be zero on _all_ cpus.
> > +	 * Because the above synchronize_rcu() ensures the map is disconnected
> > +	 * from the program we can assume no new bits will be set.
> > +	 */
> > +	for_each_online_cpu(cpu) {
> > +		unsigned long *bitmap = per_cpu_ptr(cmap->flush_needed, cpu);
> > +
> > +		while (!bitmap_empty(bitmap, cmap->map.max_entries))
> > +			cond_resched();
> > +	}
> > +
> > +	/* For cpu_map the remote CPUs can still be using the entries
> > +	 * (struct bpf_cpu_map_entry).
> > +	 */
> > +	for (i = 0; i < cmap->map.max_entries; i++) {
> > +		struct bpf_cpu_map_entry *rcpu;
> > +
> > +		rcpu = READ_ONCE(cmap->cpu_map[i]);
> > +		if (!rcpu)
> > +			continue;
> > +
> > +		/* bq flush and cleanup happens after RCU graze-period */
> > +		__cpu_map_entry_replace(cmap, i, NULL); /* call_rcu */
> > +	}
> > +	free_percpu(cmap->flush_needed);
> > +	bpf_map_area_free(cmap->cpu_map);
> > +	kfree(cmap);
> > +}
> > +
> > +struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
> > +{
> > +	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
> > +	struct bpf_cpu_map_entry *rcpu;
> > +
> > +	if (key >= map->max_entries)
> > +		return NULL;
> > +
> > +	rcpu = READ_ONCE(cmap->cpu_map[key]);
> > +	return rcpu;
> > +}
> > +
> > +static void *cpu_map_lookup_elem(struct bpf_map *map, void *key)
> > +{
> > +	struct bpf_cpu_map_entry *rcpu =
> > +		__cpu_map_lookup_elem(map, *(u32 *)key);
> > +
> > +	return rcpu ? &rcpu->qsize : NULL;  
> 
> The qsize doesn't seem used anywhere else besides here, but you
> probably should update verifier such that this cannot be called
> out of the BPF program, which could then mangle qsize value.

It is true that the BPF prog can modify this qsize value, but it's not
the authoritative value, so it doesn't really matter.

As I said above, I do want to do a lookup from a BPF program.  To allow
to BPF program to know, if an entry is valid, else it will blindly
send to a cpu destination.  Maybe bpf_prog's just have to use a
map-on-the-side to coordinate this(?), but then a sysadm modifying the
real cpumap will be invisible to the program.


Maybe we should just disable BPF-progs from reading this in the first
iteration?  It would allow for more advanced usage schemes later..

One crazy idea is to have the cpu_map_lookup_elem() return if any
packets are in-flight on this cpu-queue. (Making it easier to avoid OoO
packets, when switching target CPU, but it can also be implemented by
the BPF-programmer herself via maps, although via some extra atomic
cost).


> > +}
> > +
> > +static int cpu_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> > +{
> > +	struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map);
> > +	u32 index = key ? *(u32 *)key : U32_MAX;
> > +	u32 *next = next_key;
> > +
> > +	if (index >= cmap->map.max_entries) {
> > +		*next = 0;
> > +		return 0;
> > +	}
> > +
> > +	if (index == cmap->map.max_entries - 1)
> > +		return -ENOENT;
> > +	*next = index + 1;
> > +	return 0;
> > +}


I would have liked to have implemented next_key so it only returned the
next valid cpu_entry, and used it as a simple round-robin scheduler.
But AFAIK the next_key function is not allowed from bpf_prog's, right?


> > +
> > +const struct bpf_map_ops cpu_map_ops = {
> > +	.map_alloc		= cpu_map_alloc,
> > +	.map_free		= cpu_map_free,
> > +	.map_delete_elem	= cpu_map_delete_elem,
> > +	.map_update_elem	= cpu_map_update_elem,
> > +	.map_lookup_elem	= cpu_map_lookup_elem,
> > +	.map_get_next_key	= cpu_map_get_next_key,
> > +};
> > +
> > +static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
> > +			     struct xdp_bulk_queue *bq)
> > +{
> > +	struct ptr_ring *q;
> > +	int i;
> > +
> > +	if (unlikely(!bq->count))
> > +		return 0;
> > +
> > +	q = rcpu->queue;
> > +	spin_lock(&q->producer_lock);
> > +
> > +	for (i = 0; i < bq->count; i++) {
> > +		void *xdp_pkt = bq->q[i];
> > +		int err;
> > +
> > +		err = __ptr_ring_produce(q, xdp_pkt);
> > +		if (err) {
> > +			/* Free xdp_pkt */
> > +			page_frag_free(xdp_pkt);
> > +		}
> > +	}
> > +	bq->count = 0;
> > +	spin_unlock(&q->producer_lock);
> > +
> > +	return 0;
> > +}
> > +
[...]


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH 2/2] mac80211: only remove AP VLAN frames from TXQ
From: Toke Høiland-Jørgensen @ 2017-10-06 10:30 UTC (permalink / raw)
  To: Johannes Berg, linux-wireless-u79uwXL29TY76Z2rM5mHXA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Michał Kazior, Johannes Berg
In-Reply-To: <20171006095333.26335-2-johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>

Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org> writes:

> From: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
> When removing an AP VLAN interface, mac80211 currently purges
> the entire TXQ for the AP interface. Fix this by using the FQ
> API introduced in the previous patch to filter frames.
>
> Signed-off-by: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Acked-by: Toke Høiland-Jørgensen <toke-LJ9M9ZcSy1A@public.gmane.org>

^ permalink raw reply

* Re: [PATCH 1/2] fq: support filtering a given tin
From: Toke Høiland-Jørgensen @ 2017-10-06 10:30 UTC (permalink / raw)
  To: Johannes Berg, linux-wireless-u79uwXL29TY76Z2rM5mHXA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, Michał Kazior, Johannes Berg
In-Reply-To: <20171006095333.26335-1-johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>

Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org> writes:

> From: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
> Add to the FQ API a way to filter a given tin, in order to
> remove frames that fulfil certain criteria according to a
> filter function.
>
> This will be used by mac80211 to remove frames belonging to
> an AP VLAN interface that's being removed.
>
> Signed-off-by: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Acked-by: Toke Høiland-Jørgensen <toke-LJ9M9ZcSy1A@public.gmane.org>

^ permalink raw reply

* [PATCH 2/2] mac80211: only remove AP VLAN frames from TXQ
From: Johannes Berg @ 2017-10-06  9:53 UTC (permalink / raw)
  To: linux-wireless
  Cc: Toke Høiland-Jørgensen, netdev, Michał Kazior,
	Johannes Berg
In-Reply-To: <20171006095333.26335-1-johannes@sipsolutions.net>

From: Johannes Berg <johannes.berg@intel.com>

When removing an AP VLAN interface, mac80211 currently purges
the entire TXQ for the AP interface. Fix this by using the FQ
API introduced in the previous patch to filter frames.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
---
 net/mac80211/ieee80211_i.h |  2 ++
 net/mac80211/iface.c       | 25 +++----------------------
 net/mac80211/tx.c          | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 9675814f64db..68f874e73561 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -2009,6 +2009,8 @@ void ieee80211_txq_init(struct ieee80211_sub_if_data *sdata,
 			struct txq_info *txq, int tid);
 void ieee80211_txq_purge(struct ieee80211_local *local,
 			 struct txq_info *txqi);
+void ieee80211_txq_remove_vlan(struct ieee80211_local *local,
+			       struct ieee80211_sub_if_data *sdata);
 void ieee80211_send_auth(struct ieee80211_sub_if_data *sdata,
 			 u16 transaction, u16 auth_alg, u16 status,
 			 const u8 *extra, size_t extra_len, const u8 *bssid,
diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
index 2619daa29961..13b16f90e1cf 100644
--- a/net/mac80211/iface.c
+++ b/net/mac80211/iface.c
@@ -793,9 +793,7 @@ static int ieee80211_open(struct net_device *dev)
 static void ieee80211_do_stop(struct ieee80211_sub_if_data *sdata,
 			      bool going_down)
 {
-	struct ieee80211_sub_if_data *txq_sdata = sdata;
 	struct ieee80211_local *local = sdata->local;
-	struct fq *fq = &local->fq;
 	unsigned long flags;
 	struct sk_buff *skb, *tmp;
 	u32 hw_reconf_flags = 0;
@@ -939,9 +937,6 @@ static void ieee80211_do_stop(struct ieee80211_sub_if_data *sdata,
 
 	switch (sdata->vif.type) {
 	case NL80211_IFTYPE_AP_VLAN:
-		txq_sdata = container_of(sdata->bss,
-					 struct ieee80211_sub_if_data, u.ap);
-
 		mutex_lock(&local->mtx);
 		list_del(&sdata->u.vlan.list);
 		mutex_unlock(&local->mtx);
@@ -998,8 +993,6 @@ static void ieee80211_do_stop(struct ieee80211_sub_if_data *sdata,
 		skb_queue_purge(&sdata->skb_queue);
 	}
 
-	sdata->bss = NULL;
-
 	spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
 	for (i = 0; i < IEEE80211_MAX_QUEUES; i++) {
 		skb_queue_walk_safe(&local->pending[i], skb, tmp) {
@@ -1012,22 +1005,10 @@ static void ieee80211_do_stop(struct ieee80211_sub_if_data *sdata,
 	}
 	spin_unlock_irqrestore(&local->queue_stop_reason_lock, flags);
 
-	if (txq_sdata->vif.txq) {
-		struct txq_info *txqi = to_txq_info(txq_sdata->vif.txq);
+	if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN)
+		ieee80211_txq_remove_vlan(local, sdata);
 
-		/*
-		 * FIXME FIXME
-		 *
-		 * We really shouldn't purge the *entire* txqi since that
-		 * contains frames for the other AP_VLANs (and possibly
-		 * the AP itself) as well, but there's no API in FQ now
-		 * to be able to filter.
-		 */
-
-		spin_lock_bh(&fq->lock);
-		ieee80211_txq_purge(local, txqi);
-		spin_unlock_bh(&fq->lock);
-	}
+	sdata->bss = NULL;
 
 	if (local->open_count == 0)
 		ieee80211_clear_tx_pending(local);
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 94826680cf2b..7b8154474b9e 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1396,6 +1396,40 @@ static void ieee80211_txq_enqueue(struct ieee80211_local *local,
 		       fq_flow_get_default_func);
 }
 
+static bool fq_vlan_filter_func(struct fq *fq, struct fq_tin *tin,
+				struct fq_flow *flow, struct sk_buff *skb,
+				void *data)
+{
+	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
+
+	return info->control.vif == data;
+}
+
+void ieee80211_txq_remove_vlan(struct ieee80211_local *local,
+			       struct ieee80211_sub_if_data *sdata)
+{
+	struct fq *fq = &local->fq;
+	struct txq_info *txqi;
+	struct fq_tin *tin;
+	struct ieee80211_sub_if_data *ap;
+
+	if (WARN_ON(sdata->vif.type != NL80211_IFTYPE_AP_VLAN))
+		return;
+
+	ap = container_of(sdata->bss, struct ieee80211_sub_if_data, u.ap);
+
+	if (!ap->vif.txq)
+		return;
+
+	txqi = to_txq_info(ap->vif.txq);
+	tin = &txqi->tin;
+
+	spin_lock_bh(&fq->lock);
+	fq_tin_filter(fq, tin, fq_vlan_filter_func, &sdata->vif,
+		      fq_skb_free_func);
+	spin_unlock_bh(&fq->lock);
+}
+
 void ieee80211_txq_init(struct ieee80211_sub_if_data *sdata,
 			struct sta_info *sta,
 			struct txq_info *txqi, int tid)
-- 
2.14.2

^ permalink raw reply related

* [PATCH 1/2] fq: support filtering a given tin
From: Johannes Berg @ 2017-10-06  9:53 UTC (permalink / raw)
  To: linux-wireless-u79uwXL29TY76Z2rM5mHXA
  Cc: Toke Høiland-Jørgensen, netdev-u79uwXL29TY76Z2rM5mHXA,
	Michał Kazior, Johannes Berg

From: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Add to the FQ API a way to filter a given tin, in order to
remove frames that fulfil certain criteria according to a
filter function.

This will be used by mac80211 to remove frames belonging to
an AP VLAN interface that's being removed.

Signed-off-by: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 include/net/fq.h      |  7 +++++
 include/net/fq_impl.h | 72 ++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/include/net/fq.h b/include/net/fq.h
index 6d8521a30c5c..ac944a686840 100644
--- a/include/net/fq.h
+++ b/include/net/fq.h
@@ -90,6 +90,13 @@ typedef void fq_skb_free_t(struct fq *,
 			   struct fq_flow *,
 			   struct sk_buff *);
 
+/* Return %true to filter (drop) the frame. */
+typedef bool fq_skb_filter_t(struct fq *,
+			     struct fq_tin *,
+			     struct fq_flow *,
+			     struct sk_buff *,
+			     void *);
+
 typedef struct fq_flow *fq_flow_get_default_t(struct fq *,
 					      struct fq_tin *,
 					      int idx,
diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
index 4e6131cd3f43..8b237e4afee6 100644
--- a/include/net/fq_impl.h
+++ b/include/net/fq_impl.h
@@ -12,24 +12,22 @@
 
 /* functions that are embedded into includer */
 
-static struct sk_buff *fq_flow_dequeue(struct fq *fq,
-				       struct fq_flow *flow)
+static void fq_adjust_removal(struct fq *fq,
+			      struct fq_flow *flow,
+			      struct sk_buff *skb)
 {
 	struct fq_tin *tin = flow->tin;
-	struct fq_flow *i;
-	struct sk_buff *skb;
-
-	lockdep_assert_held(&fq->lock);
-
-	skb = __skb_dequeue(&flow->queue);
-	if (!skb)
-		return NULL;
 
 	tin->backlog_bytes -= skb->len;
 	tin->backlog_packets--;
 	flow->backlog -= skb->len;
 	fq->backlog--;
 	fq->memory_usage -= skb->truesize;
+}
+
+static void fq_rejigger_backlog(struct fq *fq, struct fq_flow *flow)
+{
+	struct fq_flow *i;
 
 	if (flow->backlog == 0) {
 		list_del_init(&flow->backlogchain);
@@ -43,6 +41,21 @@ static struct sk_buff *fq_flow_dequeue(struct fq *fq,
 		list_move_tail(&flow->backlogchain,
 			       &i->backlogchain);
 	}
+}
+
+static struct sk_buff *fq_flow_dequeue(struct fq *fq,
+				       struct fq_flow *flow)
+{
+	struct sk_buff *skb;
+
+	lockdep_assert_held(&fq->lock);
+
+	skb = __skb_dequeue(&flow->queue);
+	if (!skb)
+		return NULL;
+
+	fq_adjust_removal(fq, flow, skb);
+	fq_rejigger_backlog(fq, flow);
 
 	return skb;
 }
@@ -188,6 +201,45 @@ static void fq_tin_enqueue(struct fq *fq,
 	}
 }
 
+static void fq_flow_filter(struct fq *fq,
+			   struct fq_flow *flow,
+			   fq_skb_filter_t filter_func,
+			   void *filter_data,
+			   fq_skb_free_t free_func)
+{
+	struct fq_tin *tin = flow->tin;
+	struct sk_buff *skb, *tmp;
+
+	lockdep_assert_held(&fq->lock);
+
+	skb_queue_walk_safe(&flow->queue, skb, tmp) {
+		if (!filter_func(fq, tin, flow, skb, filter_data))
+			continue;
+
+		__skb_unlink(skb, &flow->queue);
+		fq_adjust_removal(fq, flow, skb);
+		free_func(fq, tin, flow, skb);
+	}
+
+	fq_rejigger_backlog(fq, flow);
+}
+
+static void fq_tin_filter(struct fq *fq,
+			  struct fq_tin *tin,
+			  fq_skb_filter_t filter_func,
+			  void *filter_data,
+			  fq_skb_free_t free_func)
+{
+	struct fq_flow *flow;
+
+	lockdep_assert_held(&fq->lock);
+
+	list_for_each_entry(flow, &tin->new_flows, flowchain)
+		fq_flow_filter(fq, flow, filter_func, filter_data, free_func);
+	list_for_each_entry(flow, &tin->old_flows, flowchain)
+		fq_flow_filter(fq, flow, filter_func, filter_data, free_func);
+}
+
 static void fq_flow_reset(struct fq *fq,
 			  struct fq_flow *flow,
 			  fq_skb_free_t free_func)
-- 
2.14.2

^ permalink raw reply related

* Re: Fw: [Bug 197099] New: Kernel panic in interrupt [l2tp_ppp]
From: James Chapman @ 2017-10-06  9:52 UTC (permalink / raw)
  To: SviMik; +Cc: netdev, Guillaume Nault
In-Reply-To: <CA++DawbbfQq1ZGB_x5t4a9mhkkptnDfcDxtCXgHYfGpY=8q83g@mail.gmail.com>

On 6 October 2017 at 05:45, SviMik <svimik@gmail.com> wrote:
> 2017-10-04 10:49 GMT+03:00 James Chapman <jchapman@katalix.com>:
>> On 3 October 2017 at 08:27, James Chapman <jchapman@katalix.com> wrote:
>>> For capturing complete oops messages, have you tried setting up
>>> netconsole? You might also find the full text in the syslog on reboot.
>
> Why, thank you! You've just told me that Santa Claus exists :)

You're welcome. Heh, my wife says I have a few more grey hairs and I
don't shave as often as I should. :)

> I've set up netconsole on 93 of my servers, and hope starting from
> tomorrow I'll have more pretty kernel panic reports, and get them even
> from servers where I had never had a chance to capture the console
> before.
>
>>> It's interesting that you are seeing l2tp issues since switching to
>>> 4.x kernels. Are you able to try earlier kernels to find the latest
>>> version that works? I'm curious whether things broke at v3.15.
>
> I'll try, but it will take some time to grab enough statistics. The
> bug is relatively rare, only few panics per day on the whole bunch of
> 93 servers.
>
>> It's possible that this may be fixed by a patch that is already
>> upstream and merged for v4.14. The fix is from Guillaume Nault:
>>
>> f3c66d4 l2tp: prevent creation of sessions on terminated tunnels
>>
>> If it's possible that the L2TP server may try to create a session in a
>> tunnel that is being closed, this bug would be exposed.
>>
>> Guillaume's fix isn't yet pushed to stable releases. Are you able to
>> try a v4.14-rc build?
>
> Sorry, I'm not skilled enough to build a kernel for CentOS on my own.
> Will wait till it appears in elrepo. The latest version there is
> currently 4.13.5. Meanwhile I'll try to switch to 3.10 and see how it
> works.

No problem. Please keep us updated. If Guillaume's fix in v4.14
prevents the l2tp crashes in your systems, I'd like to push it out to
stable releases. I have been trying to reproduce the problem here but
have had no luck so far. My guess is that your l2tp servers have a
large ppp population and are handling a lot of traffic. Until we have
evidence that Guillaume's patch resolves this problem, it's harder to
justify pushing it out to stable.

> I have also captured few more kernel panics in the last few days.
> Please see if they are related to this bug:
> http://svimik.com/hdmmsk1kp2.png
> http://svimik.com/hdmmsk1kp3.png
> http://svimik.com/hdmmsk1kp4.png
> http://svimik.com/hdmmsk2kp6.png

Thanks. None of these are related to this bug but it looks like p3, p4
and p6 are all in the networking code. It might be worth opening
separate threads for these. A full oops capture with netconsole would
likely get more attention though.

To check whether the oops is related to this bug yourself, please
check for text that contains "l2tp_xmit_skb" before posting it to this
thread.

^ permalink raw reply

* Re: [PATCH v2] net/mac80211/mesh_plink: Convert timers to use timer_setup()
From: Johannes Berg @ 2017-10-06  9:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: David S. Miller, linux-wireless, netdev, Thomas Gleixner,
	linux-kernel
In-Reply-To: <20171005173910.GA72335@beast>

On Thu, 2017-10-05 at 10:39 -0700, Kees Cook wrote:
> In preparation for unconditionally passing the struct timer_list
> pointer to
> all timer callbacks, switch to using the new timer_setup() and
> from_timer()
> to pass the timer pointer explicitly. This requires adding a pointer
> back
> to the sta_info since container_of() can't resolve the sta_info.
> 
Applied, thanks. (The change did land in net-next, and I merged that in
to merge this)

johannes

^ permalink raw reply

* RE: [PATCH net-next] ip_gre: check packet length and mtu correctly in erspan_fb_xmit
From: David Laight @ 2017-10-06  9:38 UTC (permalink / raw)
  To: 'William Tu'; +Cc: netdev@vger.kernel.org, Xin Long
In-Reply-To: <CALDO+SZPmF_UG9BsZKD2FzJUbx0PcUYKrOpp4Xta4Wufezhh+g@mail.gmail.com>

From: William Tu
> Sent: 05 October 2017 22:21
...
> >> -     if (skb->len > dev->mtu) {
> >> +     if (skb->len - dev->hard_header_len > dev->mtu) {
> >
> > Can you guarantee that skb->len > dev_hard_header_len?
> > It is probably safer to check skb->len > dev->hard_header_len + dev->mtu
> > since that addition isn't going to overflow.
> Sure, I will fix it.
> 
> >
> >>               pskb_trim(skb, dev->mtu);
> >>               truncate = true;
> >
> > Is that pskb_trim() now truncating to the correct size?
> 
> You're right, now I should truncate to (dev->mtu + dev_hard_header_len)

It might be worth caching that length in the dev structure
to avoid the arithmetic on every packet.

	David


^ permalink raw reply

* Re: [net-next V4 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP
From: Jesper Dangaard Brouer @ 2017-10-06  9:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, jakub.kicinski, Michael S. Tsirkin, pavel.odintsov,
	Jason Wang, mchan, John Fastabend, peter.waskiewicz.jr,
	Daniel Borkmann, Andy Gospodarek, brouer
In-Reply-To: <20171004190201.5no5mrmkko43cvv2@ast-mbp>

On Wed, 4 Oct 2017 12:02:02 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Oct 04, 2017 at 02:03:45PM +0200, Jesper Dangaard Brouer wrote:
> > The 'cpumap' is primary used as a backend map for XDP BPF helper
> > call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> > 
> > This patch implement the main part of the map.  It is not connected to
> > the XDP redirect system yet, and no SKB allocation are done yet.
> > 
> > The main concern in this patch is to ensure the datapath can run
> > without any locking.  This adds complexity to the setup and tear-down
> > procedure, which assumptions are extra carefully documented in the
> > code comments.
> > 
> > V2:
> >  - make sure array isn't larger than NR_CPUS
> >  - make sure CPUs added is a valid possible CPU
> > 
> > V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>  
> ...
> > +static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> > +{
> > +	struct bpf_cpu_map *cmap;
> > +	u64 cost;
> > +	int err;
> > +
> > +	/* check sanity of attributes */
> > +	if (attr->max_entries == 0 || attr->key_size != 4 ||
> > +	    attr->value_size != 4 || attr->map_flags & ~BPF_F_NUMA_NODE)
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	cmap = kzalloc(sizeof(*cmap), GFP_USER);
> > +	if (!cmap)
> > +		return ERR_PTR(-ENOMEM);  
> 
> just noticed that there is nothing here nor in DEVMAP/SOCKMAP
> that prevents unpriv user to create them.
> I'm not sure it was intentional for DEVMAP/SOCKMAP.
> For CPUMAP I'd suggest to restrict it to root, since it
> suppose to operate with XDP only which is root anyway.
> Note, lpm and lru maps are cap_sys_admin only already.

I agree.  Have restricted this in V5

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: BUG in free_netdev() on ppp link deletion
From: Guillaume Nault @ 2017-10-06  8:57 UTC (permalink / raw)
  To: Beniamino Galvani
  Cc: linux-ppp, netdev, Paul Mackerras, David Ahern, Gao Feng
In-Reply-To: <20171006080902.GA11223@tp>

On Fri, Oct 06, 2017 at 10:09:03AM +0200, Beniamino Galvani wrote:
> On Thu, Oct 05, 2017 at 04:55:03PM +0200, Guillaume Nault wrote:
> > Sorry for the delay, I've followed a few complicated dead ends before
> > getting to this simple and rather obvious fix.
> > 
> > Can you try this patch?
> > 
> > [..]
> 
> The patch solves the issue, thanks.
> 
Thanks, I'm going to do some more tests and submit it formally.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox