Netdev List

* [PATCH] netfilter: conntrack: fix calculation of next bucket number in early_drop
From: Vasily Khoruzhick @ 2018-10-25  3:48 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	David S. Miller, netfilter-devel, coreteam, netdev, linux-kernel,
	Dmitry Safonov
  Cc: Vasily Khoruzhick, stable

If there's no entry to drop in the bucket that corresponds to the hash,
early_drop() should look for one in other buckets. But since it increments
the hash instead of the bucket number, it actually looks in the same bucket 8
times: hsize is 16k by default (14 bits) and hash is a 32-bit value, so
reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
most cases.
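
The arithmetic above can be checked in userspace. The following is a minimal
sketch, assuming the multiply-and-shift definition of reciprocal_scale() from
the kernel headers; probe_by_hash() and probe_by_bucket() are hypothetical
names used only for this illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the kernel's reciprocal_scale(): map val into [0, ep_ro). */
uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Old behaviour (hypothetical helper): bump the hash, then rescale.
 * Returns the bucket probed on iteration i.
 */
uint32_t probe_by_hash(uint32_t hash, uint32_t hsize, unsigned int i)
{
	assert(hsize > 0);
	return reciprocal_scale(hash + i, hsize);
}

/* Fixed behaviour (hypothetical helper): scale once, then step the
 * bucket index with wrap-around at hsize.
 */
uint32_t probe_by_bucket(uint32_t hash, uint32_t hsize, unsigned int i)
{
	assert(hsize > 0);
	return (uint32_t)(((uint64_t)reciprocal_scale(hash, hsize) + i) % hsize);
}
```

With hsize = 16384 the scaling reduces to hash >> 18, so 2^18 consecutive hash
values share one bucket and eight increments of the hash almost never leave it.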

Fix it by incrementing the bucket number instead of the hash, and rename
_hash to hash and the local hash to bucket to avoid future confusion.

Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic")
Cc: <stable@vger.kernel.org> # v4.7+
Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
---
 net/netfilter/nf_conntrack_core.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index ca1168d67fac..a04af246b184 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1073,19 +1073,22 @@ static unsigned int early_drop_list(struct net *net,
 	return drops;
 }
 
-static noinline int early_drop(struct net *net, unsigned int _hash)
+static noinline int early_drop(struct net *net, unsigned int hash)
 {
 	unsigned int i;
 
 	for (i = 0; i < NF_CT_EVICTION_RANGE; i++) {
 		struct hlist_nulls_head *ct_hash;
-		unsigned int hash, hsize, drops;
+		unsigned int bucket, hsize, drops;
 
 		rcu_read_lock();
 		nf_conntrack_get_ht(&ct_hash, &hsize);
-		hash = reciprocal_scale(_hash++, hsize);
+		if (!i)
+			bucket = reciprocal_scale(hash, hsize);
+		else
+			bucket = (bucket + 1) % hsize;
 
-		drops = early_drop_list(net, &ct_hash[hash]);
+		drops = early_drop_list(net, &ct_hash[bucket]);
 		rcu_read_unlock();
 
 		if (drops) {
-- 
2.19.1

^ permalink raw reply related

* Regression in 4.19 net/phy/realtek: garbled sysfs output
From: Holger Hoffstätte @ 2018-10-24 19:36 UTC (permalink / raw)
  To: Netdev, Jassi Brar, David S. Miller

Hi,

Since 4.19 r8169 depends on phylib:

$lsmod | grep r8169
r8169                  81920  0
libphy                 57344  2 r8169,realtek

Unfortunately this now gives me the following sysfs error:

$cd /sys/module/realtek/drivers
$ls -l
ls: cannot access 'mdio_bus:RTL8201F 10/100Mbps Ethernet': No such file or directory
total 0
lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8201CP Ethernet' -> '../../../bus/mdio_bus/drivers/RTL8201CP Ethernet'
l????????? ? ?    ?    ?            ? 'mdio_bus:RTL8201F 10/100Mbps Ethernet'
lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8211 Gigabit Ethernet' -> '../../../bus/mdio_bus/drivers/RTL8211 Gigabit Ethernet'
[..]

Apparently the forward slash in "10/100Mbps Ethernet" is interpreted as a
directory separator that leads nowhere. The name was introduced in commit
513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").

Would it be acceptable to change the name simply to "RTL8201F Ethernet"?
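
The failure mode can be illustrated with a small userspace check. This is only
a sketch: name_is_single_path_component() is a hypothetical helper, not an
existing kernel or phylib function, and it merely tests whether a name could
serve as one path component.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: a name is usable as a single path component
 * only if it is non-empty and contains no '/' separator.
 */
bool name_is_single_path_component(const char *name)
{
	assert(name != NULL);
	return name[0] != '\0' && strchr(name, '/') == NULL;
}
```

Under this check "RTL8201F 10/100Mbps Ethernet" is rejected, while
"RTL8201F Ethernet" passes.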

thanks,
Holger

^ permalink raw reply

* Re: Fw: [Bug 201423] New: eth0: hw csum failure
From: Andre Tomt @ 2018-10-24 19:41 UTC (permalink / raw)
  To: Eric Dumazet, Eric Dumazet
  Cc: Stephen Hemminger, netdev, rossi.f, Dimitris Michailidis
In-Reply-To: <e2c4e4a2-5e51-4df0-a34f-8a24b67ef55f@tomt.net>

On 21.10.2018 15:34, Andre Tomt wrote:
> On 20.10.2018 00:25, Eric Dumazet wrote:
>> On 10/19/2018 02:58 PM, Eric Dumazet wrote:
>>> On 10/16/2018 06:00 AM, Eric Dumazet wrote:
>>>> On Mon, Oct 15, 2018 at 11:30 PM Andre Tomt <andre@tomt.net> wrote:
>>>>> I've seen similar on several systems with mlx4 cards when using 
>>>>> 4.18.x -
>>>>> that is hw csum failure followed by some backtrace.
>>>>>
>>>>> Only seems to happen on systems dealing with quite a bit of UDP.
>>>>>
>>>>
>>>> Strange, because mlx4 on IPv6+UDP should not use CHECKSUM_COMPLETE,
>>>> but CHECKSUM_UNNECESSARY
>>>>
>>>> It would be nice to track this a bit further, maybe by providing the
>>>> full packet content.
>>>>
> <snip>
>>>
>>> As a matter of fact Dimitris found the issue in the patch and is 
>>> working on a fix involving csum_block_sub()
>>>
>>> The problem comes from trimming an odd number of bytes.
>>
>> More exactly, trimming bytes starting at an odd offset.
> 
> No hw csum failures here since I deployed Dimitris' fix on top of 4.18.16 
> 32 hours ago.
> 
> Thanks
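
For readers following the odd-offset remark quoted above, here is a minimal
userspace sketch of why the offset matters when removing a trimmed block from
a ones'-complement sum. It assumes 16-bit big-endian words; csum_fold16(),
csum16() and csum16_block_sub() are hypothetical names for this illustration
and are not the kernel's csum_block_sub() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones'-complement sum. */
uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones'-complement sum over big-endian 16-bit words. An odd trailing
 * byte is treated as the high byte of a word whose low byte is zero.
 */
uint16_t csum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum = csum_fold16(sum + (((uint32_t)buf[i] << 8) | buf[i + 1]));
	if (len & 1)
		sum = csum_fold16(sum + ((uint32_t)buf[len - 1] << 8));
	return (uint16_t)sum;
}

/* Remove the sum of a block that started at byte 'offset' of the data
 * covered by 'total'. At an odd offset the block's bytes sit in the
 * opposite halves of each word, so its sum is byte-swapped first.
 */
uint16_t csum16_block_sub(uint16_t total, uint16_t block, size_t offset)
{
	if (offset & 1)
		block = (uint16_t)((block << 8) | (block >> 8));
	return csum_fold16((uint32_t)total + (uint16_t)~block);
}
```

For the bytes 01 02 03 04 05 06, trimming the last three bytes (offset 3)
yields the sum of the first three bytes only when the swap is applied; treating
the offset as even gives a different value.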

It eventually showed up again with mlx4, on 4.18.16 + fix and also on 
4.19. I still do not have a useful packet capture.

It is running a torrent client serving up various linux distributions.

> [116116.994519] p0xe0: hw csum failure
> [116116.994550] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.19.0-1 #1
> [116116.994551] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
> [116116.994555] Call Trace:
> [116116.994558]  <IRQ>
> [116116.994567]  dump_stack+0x5c/0x7b
> [116116.994574]  __skb_gro_checksum_complete+0x9a/0xa0
> [116116.994580]  udp6_gro_receive+0x211/0x290
> [116116.994585]  ipv6_gro_receive+0x1b1/0x3a0
> [116116.994588]  dev_gro_receive+0x3a0/0x620
> [116116.994590]  ? __build_skb+0x25/0xe0
> [116116.994592]  napi_gro_frags+0xa8/0x220
> [116116.994598]  mlx4_en_process_rx_cq+0xa01/0xb40 [mlx4_en]
> [116116.994611]  ? mlx4_cq_completion+0x23/0x70 [mlx4_core]
> [116116.994621]  ? mlx4_eq_int+0x373/0xc80 [mlx4_core]
> [116116.994629]  mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en]
> [116116.994635]  net_rx_action+0xe0/0x2e0
> [116116.994641]  __do_softirq+0xd8/0x2ff
> [116116.994646]  irq_exit+0xbd/0xd0
> [116116.994650]  do_IRQ+0x85/0xd0
> [116116.994656]  common_interrupt+0xf/0xf
> [116116.994659]  </IRQ>
> [116116.994665] RIP: 0010:cpuidle_enter_state+0xb3/0x310
> [116116.994668] Code: 31 ff e8 e0 e0 bb ff 45 84 f6 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 3f 02 00 00 31 ff e8 64 cc c0 ff fb 66 0f 1f 44 00 00 <4c> 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7
> [116116.994669] RSP: 0018:ffff924a0635bea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda
> [116116.994671] RAX: ffff9016ffb60fc0 RBX: 0000699b9835d616 RCX: 000000000000001f
> [116116.994673] RDX: 0000699b9835d616 RSI: 00000000229837f7 RDI: 0000000000000000
> [116116.994674] RBP: 0000000000000001 R08: 0000000000000002 R09: 0000000000020840
> [116116.994675] R10: ffff924a0635be88 R11: 0000000000000367 R12: ffff9016ffb69aa8
> [116116.994676] R13: ffffffffa50ac638 R14: 0000000000000000 R15: 0000699b981c63b9
> [116116.994680]  ? cpuidle_enter_state+0x90/0x310
> [116116.994685]  do_idle+0x1d0/0x240
> [116116.994687]  cpu_startup_entry+0x5f/0x70
> [116116.994690]  start_secondary+0x185/0x1a0
> [116116.994693]  secondary_startup_64+0xa4/0xb0
> [116116.994709] p0xe0: hw csum failure
> [116116.994739] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.19.0-1 #1
> [116116.994740] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
> [116116.994741] Call Trace:
> [116116.994743]  <IRQ>
> [116116.994746]  dump_stack+0x5c/0x7b
> [116116.994751]  __skb_checksum_complete+0xb8/0xd0
> [116116.994755]  __udp6_lib_rcv+0xa0e/0xa20
> [116116.994764]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
> [116116.994768]  ? nft_do_chain_inet+0x7a/0xd0 [nf_tables]
> [116116.994771]  ip6_input_finish+0xc0/0x460
> [116116.994774]  ip6_input+0x2b/0x90
> [116116.994776]  ? ip6_make_skb+0x1b0/0x1b0
> [116116.994778]  ipv6_rcv+0x54/0xb0
> [116116.994781]  __netif_receive_skb_one_core+0x42/0x50
> [116116.994784]  netif_receive_skb_internal+0x24/0xb0
> [116116.994786]  napi_gro_frags+0x171/0x220
> [116116.994790]  mlx4_en_process_rx_cq+0xa01/0xb40 [mlx4_en]
> [116116.994798]  ? mlx4_cq_completion+0x23/0x70 [mlx4_core]
> [116116.994803]  ? mlx4_eq_int+0x373/0xc80 [mlx4_core]
> [116116.994806]  mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en]
> [116116.994808]  net_rx_action+0xe0/0x2e0
> [116116.994810]  __do_softirq+0xd8/0x2ff
> [116116.994812]  irq_exit+0xbd/0xd0
> [116116.994814]  do_IRQ+0x85/0xd0
> [116116.994816]  common_interrupt+0xf/0xf
> [116116.994818]  </IRQ>
> [116116.994821] RIP: 0010:cpuidle_enter_state+0xb3/0x310
> [116116.994823] Code: 31 ff e8 e0 e0 bb ff 45 84 f6 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 3f 02 00 00 31 ff e8 64 cc c0 ff fb 66 0f 1f 44 00 00 <4c> 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7
> [116116.994824] RSP: 0018:ffff924a0635bea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda
> [116116.994825] RAX: ffff9016ffb60fc0 RBX: 0000699b9835d616 RCX: 000000000000001f
> [116116.994826] RDX: 0000699b9835d616 RSI: 00000000229837f7 RDI: 0000000000000000
> [116116.994827] RBP: 0000000000000001 R08: 0000000000000002 R09: 0000000000020840
> [116116.994828] R10: ffff924a0635be88 R11: 0000000000000367 R12: ffff9016ffb69aa8
> [116116.994829] R13: ffffffffa50ac638 R14: 0000000000000000 R15: 0000699b981c63b9
> [116116.994832]  ? cpuidle_enter_state+0x90/0x310
> [116116.994835]  do_idle+0x1d0/0x240
> [116116.994837]  cpu_startup_entry+0x5f/0x70
> [116116.994838]  start_secondary+0x185/0x1a0
> [116116.994840]  secondary_startup_64+0xa4/0xb0

^ permalink raw reply

* [PATCH net 0/4] net: Fixups for recent dump filtering changes
From: David Ahern @ 2018-10-24 19:58 UTC (permalink / raw)
  To: netdev, davem; +Cc: lirongqing, David Ahern

From: David Ahern <dsahern@gmail.com>

Li RongQing noted that tgt_net is leaked in ipv4 due to the recent change
to handle address dumps for a specific device. The report also applies to
ipv6 and other error paths. Patches 1 and 2 fix those leaks.

Patch 3 stops route dumps from erroring out when dumping across address
families and a table id is given. This is needed in preparation for
patch 4.

Patch 4 updates rtnl_dump_all to handle a failure in one of the dumpit
functions. At the moment, if an address dump returns an error, the dump-all
loop breaks but the error is dropped. The result can be that neither data
nor an error is returned, leaving the user wondering about the addresses.

Patches were tested with a modified iproute2 to add invalid data to the
dump request causing each specific failure path to be hit in addition
to positive testing that it works as it should when given valid data.

David Ahern (4):
  net/ipv4: Put target net when address dump fails due to bad attributes
  net/ipv6: Put target net when address dump fails due to bad attributes
  net: Don't return invalid table id error when dumping all families
  net: rtnl_dump_all needs to propagate error from dumpit function

 include/net/ip_fib.h    |  1 +
 net/core/rtnetlink.c    |  6 ++++--
 net/ipv4/devinet.c      | 13 ++++++++-----
 net/ipv4/fib_frontend.c |  4 ++++
 net/ipv4/ipmr.c         |  3 +++
 net/ipv6/addrconf.c     | 14 ++++++++------
 net/ipv6/ip6_fib.c      |  3 +++
 net/ipv6/ip6mr.c        |  3 +++
 8 files changed, 34 insertions(+), 13 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net 2/4] net/ipv6: Put target net when address dump fails due to bad attributes
From: David Ahern @ 2018-10-24 19:59 UTC (permalink / raw)
  To: netdev, davem; +Cc: lirongqing, David Ahern
In-Reply-To: <20181024195902.17479-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

If tgt_net is set based on IFA_TARGET_NETNSID attribute in the dump
request, make sure all error paths call put_net.
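
The pattern can be sketched in userspace as follows. This is a simplified
model under stated assumptions: struct fake_net, fake_get_net(), fake_put_net()
and fake_dump_addr() are hypothetical stand-ins, and unlike the real handler
the reference is taken unconditionally here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted network namespace. */
struct fake_net {
	int refcnt;
};

void fake_get_net(struct fake_net *net) { net->refcnt++; }
void fake_put_net(struct fake_net *net) { net->refcnt--; }

/* Returns 'len' on success or a negative errno. Every exit path goes
 * through the put_net label, so the reference is never leaked.
 */
int fake_dump_addr(struct fake_net *net, int req_valid, int dev_exists, int len)
{
	int err = 0;

	assert(net != NULL);
	fake_get_net(net);	/* models resolving the target netns */

	if (!req_valid) {
		err = -EINVAL;
		goto put_net;
	}
	if (!dev_exists) {
		err = -ENODEV;
		goto put_net;
	}

	/* ... the actual dump would fill the buffer here ... */

put_net:
	fake_put_net(net);
	return err < 0 ? err : len;
}
```

Routing both failures through one label keeps the release in a single place,
which is the same shape the patch gives the real function.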

Fixes: 6371a71f3a3b ("net/ipv6: Add support for dumping addresses for a specific device")
Fixes: ed6eff11790a ("net/ipv6: Update inet6_dump_addr for strict data checking")
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/addrconf.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 45b84dd5c4eb..7eb09c86fa13 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5089,23 +5089,25 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 	struct net_device *dev;
 	struct inet6_dev *idev;
 	struct hlist_head *head;
+	int err = 0;
 
 	s_h = cb->args[0];
 	s_idx = idx = cb->args[1];
 	s_ip_idx = cb->args[2];
 
 	if (cb->strict_check) {
-		int err;
-
 		err = inet6_valid_dump_ifaddr_req(nlh, &fillargs, &tgt_net,
 						  skb->sk, cb);
 		if (err < 0)
-			return err;
+			goto put_tgt_net;
 
+		err = 0;
 		if (fillargs.ifindex) {
 			dev = __dev_get_by_index(tgt_net, fillargs.ifindex);
-			if (!dev)
-				return -ENODEV;
+			if (!dev) {
+				err = -ENODEV;
+				goto put_tgt_net;
+			}
 			idev = __in6_dev_get(dev);
 			if (idev) {
 				err = in6_dump_addrs(idev, skb, cb, s_ip_idx,
@@ -5144,7 +5146,7 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 	if (fillargs.netnsid >= 0)
 		put_net(tgt_net);
 
-	return skb->len;
+	return err < 0 ? err : skb->len;
 }
 
 static int inet6_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
-- 
2.11.0

^ permalink raw reply related

* [PATCH net 1/4] net/ipv4: Put target net when address dump fails due to bad attributes
From: David Ahern @ 2018-10-24 19:58 UTC (permalink / raw)
  To: netdev, davem; +Cc: lirongqing, David Ahern
In-Reply-To: <20181024195902.17479-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

If tgt_net is set based on IFA_TARGET_NETNSID attribute in the dump
request, make sure all error paths call put_net.

Fixes: 5fcd266a9f64 ("net/ipv4: Add support for dumping addresses for a specific device")
Fixes: c33078e3dfb1 ("net/ipv4: Update inet_dump_ifaddr for strict data checking")
Reported-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv4/devinet.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 63d5b58fbfdb..9250b309c742 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1761,7 +1761,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	struct net_device *dev;
 	struct in_device *in_dev;
 	struct hlist_head *head;
-	int err;
+	int err = 0;
 
 	s_h = cb->args[0];
 	s_idx = idx = cb->args[1];
@@ -1771,12 +1771,15 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 		err = inet_valid_dump_ifaddr_req(nlh, &fillargs, &tgt_net,
 						 skb->sk, cb);
 		if (err < 0)
-			return err;
+			goto put_tgt_net;
 
+		err = 0;
 		if (fillargs.ifindex) {
 			dev = __dev_get_by_index(tgt_net, fillargs.ifindex);
-			if (!dev)
-				return -ENODEV;
+			if (!dev) {
+				err = -ENODEV;
+				goto put_tgt_net;
+			}
 
 			in_dev = __in_dev_get_rtnl(dev);
 			if (in_dev) {
@@ -1821,7 +1824,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	if (fillargs.netnsid >= 0)
 		put_net(tgt_net);
 
-	return skb->len;
+	return err < 0 ? err : skb->len;
 }
 
 static void rtmsg_ifa(int event, struct in_ifaddr *ifa, struct nlmsghdr *nlh,
-- 
2.11.0

^ permalink raw reply related

* [PATCH net 3/4] net: Don't return invalid table id error when dumping all families
From: David Ahern @ 2018-10-24 19:59 UTC (permalink / raw)
  To: netdev, davem; +Cc: lirongqing, David Ahern
In-Reply-To: <20181024195902.17479-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

When doing a route dump across all address families, do not error out
if the table does not exist. This allows a route dump for AF_UNSPEC
with a table id that may only exist for some of the families.

Do return the table does not exist error if dumping routes for a
specific family and the table does not exist.
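
The two cases can be modelled with a small sketch. The struct and function
below are hypothetical reductions for illustration; only the
dump_all_families flag and the -ENOENT result are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced version of the dump filter state. */
struct dump_filter {
	unsigned int table_id;
	bool dump_all_families;
};

/* Models one per-family dump handler. 'skb_len' is the data already
 * queued, 'dumped' is what this family would add if its table exists.
 */
int dump_one_family(const struct dump_filter *f, bool table_exists,
		    int skb_len, int dumped)
{
	assert(f != NULL);
	if (f->table_id && !table_exists) {
		if (f->dump_all_families)
			return skb_len;	/* nothing to add, not an error */
		return -ENOENT;		/* specific family: report it */
	}
	return skb_len + dumped;
}
```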

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip_fib.h    | 1 +
 net/ipv4/fib_frontend.c | 4 ++++
 net/ipv4/ipmr.c         | 3 +++
 net/ipv6/ip6_fib.c      | 3 +++
 net/ipv6/ip6mr.c        | 3 +++
 5 files changed, 14 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e8d9456bf36e..c5969762a8f4 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -226,6 +226,7 @@ struct fib_dump_filter {
 	u32			table_id;
 	/* filter_set is an optimization that an entry is set */
 	bool			filter_set;
+	bool			dump_all_families;
 	unsigned char		protocol;
 	unsigned char		rt_type;
 	unsigned int		flags;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 5bf653f36911..6df95be96311 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -829,6 +829,7 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 		return -EINVAL;
 	}
 
+	filter->dump_all_families = (rtm->rtm_family == AF_UNSPEC);
 	filter->flags    = rtm->rtm_flags;
 	filter->protocol = rtm->rtm_protocol;
 	filter->rt_type  = rtm->rtm_type;
@@ -899,6 +900,9 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	if (filter.table_id) {
 		tb = fib_get_table(net, filter.table_id);
 		if (!tb) {
+			if (filter.dump_all_families)
+				return skb->len;
+
 			NL_SET_ERR_MSG(cb->extack, "ipv4: FIB table does not exist");
 			return -ENOENT;
 		}
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 7a3e2acda94c..a6defbec4f1b 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2542,6 +2542,9 @@ static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 
 		mrt = ipmr_get_table(sock_net(skb->sk), filter.table_id);
 		if (!mrt) {
+			if (filter.dump_all_families)
+				return skb->len;
+
 			NL_SET_ERR_MSG(cb->extack, "ipv4: MR table does not exist");
 			return -ENOENT;
 		}
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 2a058b408a6a..1b8bc008b53b 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -620,6 +620,9 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 	if (arg.filter.table_id) {
 		tb = fib6_get_table(net, arg.filter.table_id);
 		if (!tb) {
+			if (arg.filter.dump_all_families)
+				return skb->len;
+
 			NL_SET_ERR_MSG_MOD(cb->extack, "FIB table does not exist");
 			return -ENOENT;
 		}
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index c3317ffb09eb..e2ea691e42c6 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -2473,6 +2473,9 @@ static int ip6mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 
 		mrt = ip6mr_get_table(sock_net(skb->sk), filter.table_id);
 		if (!mrt) {
+			if (filter.dump_all_families)
+				return skb->len;
+
 			NL_SET_ERR_MSG_MOD(cb->extack, "MR table does not exist");
 			return -ENOENT;
 		}
-- 
2.11.0

^ permalink raw reply related

* [PATCH net 4/4] net: rtnl_dump_all needs to propagate error from dumpit function
From: David Ahern @ 2018-10-24 19:59 UTC (permalink / raw)
  To: netdev, davem; +Cc: lirongqing, David Ahern
In-Reply-To: <20181024195902.17479-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

If an address, route or netconf dump request is sent for AF_UNSPEC, then
rtnl_dump_all is used to do the dump across all address families. If one
of the dumpit functions fails (e.g., invalid attributes in the dump
request) then rtnl_dump_all needs to propagate that error so the user
gets an appropriate response instead of just getting no data.
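
A userspace sketch of the intended behaviour, under the assumption that data
already queued takes precedence over a later error. struct fake_skb, dumpit_fn,
dump_ok(), dump_bad_request() and dump_all() are hypothetical names for this
illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the skb being filled by the dump. */
struct fake_skb {
	int len;
};

typedef int (*dumpit_fn)(struct fake_skb *skb);

/* A handler that queues 16 bytes and succeeds. */
int dump_ok(struct fake_skb *skb)
{
	skb->len += 16;
	return skb->len;
}

/* A handler that rejects the request without queuing anything. */
int dump_bad_request(struct fake_skb *skb)
{
	(void)skb;
	return -EINVAL;
}

/* Run each per-family handler; stop at the first failure and report
 * it unless some data has already been queued.
 */
int dump_all(struct fake_skb *skb, const dumpit_fn *fns, size_t n)
{
	int ret = 0;
	size_t i;

	assert(skb != NULL && fns != NULL);
	for (i = 0; i < n; i++) {
		ret = fns[i](skb);
		if (ret < 0)
			break;
	}
	return skb->len ? skb->len : ret;
}
```

If the first handler fails, the caller now sees -EINVAL instead of an empty
reply; if a handler fails after data was queued, the queued length is returned.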

Fixes: effe67926624 ("net: Enable kernel side filtering of route dumps")
Fixes: 5fcd266a9f64 ("net/ipv4: Add support for dumping addresses for a specific device")
Fixes: 6371a71f3a3b ("net/ipv6: Add support for dumping addresses for a specific device")
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/core/rtnetlink.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 0958c7be2c22..f679c7a7d761 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3333,6 +3333,7 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 	int idx;
 	int s_idx = cb->family;
 	int type = cb->nlh->nlmsg_type - RTM_BASE;
+	int ret = 0;
 
 	if (s_idx == 0)
 		s_idx = 1;
@@ -3365,12 +3366,13 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 			cb->prev_seq = 0;
 			cb->seq = 0;
 		}
-		if (dumpit(skb, cb))
+		ret = dumpit(skb, cb);
+		if (ret < 0)
 			break;
 	}
 	cb->family = idx;
 
-	return skb->len;
+	return skb->len ? : ret;
 }
 
 struct sk_buff *rtmsg_ifinfo_build_skb(int type, struct net_device *dev,
-- 
2.11.0

^ permalink raw reply related

* [PATCH bpf 0/7] Batch of direct packet access fixes for BPF
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann

Several fixes to get direct packet access in order on the verifier
side. Also a test suite fix to run cg_skb as unpriv, and an improvement
to make direct packet write less error-prone in the future.

Thanks!

Daniel Borkmann (7):
  bpf: fix test suite to enable all unpriv program types
  bpf: disallow direct packet access for unpriv in cg_skb
  bpf: fix direct packet access for flow dissector progs
  bpf: fix cg_skb types to hint access type in may_access_direct_pkt_data
  bpf: fix direct packet write into pop/peek helpers
  bpf: fix leaking uninitialized memory on pop/peek helpers
  bpf: make direct packet write unclone more robust

 kernel/bpf/helpers.c                        |  2 --
 kernel/bpf/queue_stack_maps.c               |  2 ++
 kernel/bpf/verifier.c                       | 13 ++++++++++---
 net/core/filter.c                           | 17 +++++++++++++++++
 tools/testing/selftests/bpf/test_verifier.c | 15 +++++++++++++--
 5 files changed, 42 insertions(+), 7 deletions(-)

-- 
2.9.5

^ permalink raw reply

* [PATCH bpf 1/7] bpf: fix test suite to enable all unpriv program types
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Given that BPF_PROG_TYPE_CGROUP_SKB programs are also valid in an
unprivileged setting, let's not omit these tests and potentially
have issues slip through the cracks. Make this more obvious by
adding a small test_as_unpriv() helper.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_verifier.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 769d68a..8e1a79d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4891,6 +4891,8 @@ static struct bpf_test tests[] = {
 			BPF_EXIT_INSN(),
 		},
 		.result = ACCEPT,
+		.result_unpriv = REJECT,
+		.errstr_unpriv = "R3 pointer comparison prohibited",
 		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
 	},
 	{
@@ -5146,6 +5148,7 @@ static struct bpf_test tests[] = {
 		.fixup_cgroup_storage = { 1 },
 		.result = REJECT,
 		.errstr = "get_local_storage() doesn't support non-zero flags",
+		.errstr_unpriv = "R2 leaks addr into helper function",
 		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
 	},
 	{
@@ -5261,6 +5264,7 @@ static struct bpf_test tests[] = {
 		.fixup_percpu_cgroup_storage = { 1 },
 		.result = REJECT,
 		.errstr = "get_local_storage() doesn't support non-zero flags",
+		.errstr_unpriv = "R2 leaks addr into helper function",
 		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
 	},
 	{
@@ -14050,6 +14054,13 @@ static void get_unpriv_disabled()
 	fclose(fd);
 }
 
+static bool test_as_unpriv(struct bpf_test *test)
+{
+	return !test->prog_type ||
+	       test->prog_type == BPF_PROG_TYPE_SOCKET_FILTER ||
+	       test->prog_type == BPF_PROG_TYPE_CGROUP_SKB;
+}
+
 static int do_test(bool unpriv, unsigned int from, unsigned int to)
 {
 	int i, passes = 0, errors = 0, skips = 0;
@@ -14060,10 +14071,10 @@ static int do_test(bool unpriv, unsigned int from, unsigned int to)
 		/* Program types that are not supported by non-root we
 		 * skip right away.
 		 */
-		if (!test->prog_type && unpriv_disabled) {
+		if (test_as_unpriv(test) && unpriv_disabled) {
 			printf("#%d/u %s SKIP\n", i, test->descr);
 			skips++;
-		} else if (!test->prog_type) {
+		} else if (test_as_unpriv(test)) {
 			if (!unpriv)
 				set_admin(false);
 			printf("#%d/u %s ", i, test->descr);
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 3/7] bpf: fix direct packet access for flow dissector progs
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Petar Penkov
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Commit d58e468b1112 ("flow_dissector: implements flow dissector BPF
hook") added direct packet access for skbs in the may_access_direct_pkt_data()
function, which enables both read and write access to skb->data. This
is buggy because without a prologue generator such as bpf_unclone_prologue()
we would allow writing into cloned skbs. The original intention might have
been to allow only read access, where no prologue is needed (similar to how
flow_dissector_func_proto() only enables bpf_skb_load_bytes()), therefore
this patch restricts it to read-only.
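
The read-only versus read/write split can be sketched as below. The enums are
hypothetical reductions for illustration and the function ignores the helper
metadata that the real verifier also consults.

```c
#include <assert.h>
#include <stdbool.h>

enum access_type { ACCESS_READ, ACCESS_WRITE };

/* Hypothetical, reduced set of program types for illustration only. */
enum prog_type {
	PROG_SOCKET_FILTER,	/* no direct packet access */
	PROG_FLOW_DISSECTOR,	/* direct read only */
	PROG_SCHED_CLS,		/* direct read and write */
};

bool may_access_pkt_data(enum prog_type type, enum access_type t)
{
	switch (type) {
	/* Read-only types: reject writes, then share the common path. */
	case PROG_FLOW_DISSECTOR:
		if (t == ACCESS_WRITE)
			return false;
		/* fallthrough */
	/* Read + write types. */
	case PROG_SCHED_CLS:
		return true;
	default:
		return false;
	}
}
```

Moving a type from the lower group to the upper group is all it takes to drop
its write access while keeping reads.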

Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Petar Penkov <ppenkov@google.com>
---
 kernel/bpf/verifier.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 98fa0be..b0cc8f2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1387,21 +1387,23 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 				       enum bpf_access_type t)
 {
 	switch (env->prog->type) {
+	/* Program types only with direct read access go here! */
 	case BPF_PROG_TYPE_LWT_IN:
 	case BPF_PROG_TYPE_LWT_OUT:
 	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
-		/* dst_input() and dst_output() can't write for now */
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		if (t == BPF_WRITE)
 			return false;
 		/* fallthrough */
+
+	/* Program types with direct read + write access go here! */
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
 	case BPF_PROG_TYPE_XDP:
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
-	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		if (meta)
 			return meta->pkt_access;
 
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 4/7] bpf: fix cg_skb types to hint access type in may_access_direct_pkt_data
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Song Liu
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Commit b39b5f411dcf ("bpf: add cg_skb_is_valid_access for
BPF_PROG_TYPE_CGROUP_SKB") added direct packet access for skbs in
cg_skb program types, however the allowed access type was not added to
the may_access_direct_pkt_data() helper. Therefore the latter always
returns false. This is not directly an issue; it just means that writes
are unconditionally disabled (which is correct), but so are reads.
The latter is relevant in this function when BPF helpers may read direct
packet data, which is then unconditionally disabled. Fix it by properly
adding BPF_PROG_TYPE_CGROUP_SKB to may_access_direct_pkt_data().

Fixes: b39b5f411dcf ("bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b0cc8f2..5fc9a65 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1393,6 +1393,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+	case BPF_PROG_TYPE_CGROUP_SKB:
 		if (t == BPF_WRITE)
 			return false;
 		/* fallthrough */
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 5/7] bpf: fix direct packet write into pop/peek helpers
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Mauricio Vasquez B
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Commit f1a2e44a3aec ("bpf: add queue and stack maps") probably just
copy-pasted .pkt_access for bpf_map_{pop,peek}_elem() helpers, but
this is buggy in this context since it would allow writes into cloned
skbs which is invalid. Therefore, disable .pkt_access for the two.

Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 kernel/bpf/helpers.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index ab0d5e3..a74972b 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -99,7 +99,6 @@ BPF_CALL_2(bpf_map_pop_elem, struct bpf_map *, map, void *, value)
 const struct bpf_func_proto bpf_map_pop_elem_proto = {
 	.func		= bpf_map_pop_elem,
 	.gpl_only	= false,
-	.pkt_access	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_CONST_MAP_PTR,
 	.arg2_type	= ARG_PTR_TO_UNINIT_MAP_VALUE,
@@ -113,7 +112,6 @@ BPF_CALL_2(bpf_map_peek_elem, struct bpf_map *, map, void *, value)
 const struct bpf_func_proto bpf_map_peek_elem_proto = {
 	.func		= bpf_map_pop_elem,
 	.gpl_only	= false,
-	.pkt_access	= true,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_CONST_MAP_PTR,
 	.arg2_type	= ARG_PTR_TO_UNINIT_MAP_VALUE,
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 2/7] bpf: disallow direct packet access for unpriv in cg_skb
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Song Liu
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Commit b39b5f411dcf ("bpf: add cg_skb_is_valid_access for
BPF_PROG_TYPE_CGROUP_SKB") added support for returning pkt pointers
for direct packet access. Given this program type is allowed for both
unprivileged and privileged users, we shouldn't allow unprivileged
ones to use it; among other reasons, this avoids any potential
speculation on the packet test itself. Thus guard this for
root only.
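
As a rough model of such a guard, the sketch below gates the packet-pointer
fields of a context on a privilege flag. struct fake_ctx is a hypothetical
layout, not the real struct __sk_buff, and the 'privileged' parameter stands
in for the capability check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced context layout for illustration only. */
struct fake_ctx {
	uint32_t len;
	uint32_t mark;
	uint32_t data;		/* packet start pointer */
	uint32_t data_end;	/* packet end pointer */
};

bool fake_ctx_is_valid_access(size_t off, size_t size, bool privileged)
{
	/* Reject empty or out-of-bounds accesses. */
	if (size == 0 || off > sizeof(struct fake_ctx) ||
	    size > sizeof(struct fake_ctx) - off)
		return false;

	/* data and data_end are the last fields, so any access reaching
	 * past the start of 'data' touches a packet pointer.
	 */
	if (off + size > offsetof(struct fake_ctx, data) && !privileged)
		return false;

	return true;
}
```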

Fixes: b39b5f411dcf ("bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
---
 net/core/filter.c                           | 6 ++++++
 tools/testing/selftests/bpf/test_verifier.c | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 35c6933..3fdddfa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5496,7 +5496,13 @@ static bool cg_skb_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, data_meta):
 	case bpf_ctx_range(struct __sk_buff, flow_keys):
 		return false;
+	case bpf_ctx_range(struct __sk_buff, data):
+	case bpf_ctx_range(struct __sk_buff, data_end):
+		if (!capable(CAP_SYS_ADMIN))
+			return false;
+		break;
 	}
+
 	if (type == BPF_WRITE) {
 		switch (off) {
 		case bpf_ctx_range(struct __sk_buff, mark):
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 8e1a79d..36f3d30 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -4892,7 +4892,7 @@ static struct bpf_test tests[] = {
 		},
 		.result = ACCEPT,
 		.result_unpriv = REJECT,
-		.errstr_unpriv = "R3 pointer comparison prohibited",
+		.errstr_unpriv = "invalid bpf_context access off=76 size=4",
 		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
 	},
 	{
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf 6/7] bpf: fix leaking uninitialized memory on pop/peek helpers
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Mauricio Vasquez B
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Commit f1a2e44a3aec ("bpf: add queue and stack maps") added helpers
with ARG_PTR_TO_UNINIT_MAP_VALUE. Meaning, the helper is supposed to
fill the map value buffer with data instead of reading from it like
in other helpers such as map update. However, given the buffer is
allowed to be uninitialized (since we fill it in the helper anyway),
it also means that the helper is obliged to wipe the memory in case
of an error in order to not allow for leaking uninitialized memory.
Given pop/peek are both handled inside __{stack,queue}_map_get(),
let's wipe it there in the error case, that is, on an empty stack/queue.

Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
---
 kernel/bpf/queue_stack_maps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index 12a93fb..8bbd72d 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -122,6 +122,7 @@ static int __queue_map_get(struct bpf_map *map, void *value, bool delete)
 	raw_spin_lock_irqsave(&qs->lock, flags);

 	if (queue_stack_map_is_empty(qs)) {
+		memset(value, 0, qs->map.value_size);
 		err = -ENOENT;
 		goto out;
 	}
@@ -151,6 +152,7 @@ static int __stack_map_get(struct bpf_map *map, void *value, bool delete)
 	raw_spin_lock_irqsave(&qs->lock, flags);

 	if (queue_stack_map_is_empty(qs)) {
+		memset(value, 0, qs->map.value_size);
 		err = -ENOENT;
 		goto out;
 	}
-- 
2.9.5

^ permalink raw reply related
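
The wipe-on-error rule described in patch 6/7 can be sketched in plain
userspace C. This is only an illustration: struct demo_queue and
demo_queue_peek() are hypothetical names and not part of the kernel; the
one thing taken from the patch is the memset() of the caller's buffer
before returning -ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a fixed-size value queue. */
struct demo_queue {
	size_t value_size;	/* bytes per element */
	size_t count;		/* elements currently stored */
	unsigned char slot[64];	/* storage for one element, for brevity */
};

/* Copy the head element into value, or zero value and return -ENOENT. */
static int demo_queue_peek(const struct demo_queue *q, void *value)
{
	assert(q->value_size <= sizeof(q->slot));

	if (q->count == 0) {
		/* The caller's buffer may be uninitialized, so wipe it:
		 * no stale bytes may be handed back on the error path.
		 */
		memset(value, 0, q->value_size);
		return -ENOENT;
	}

	memcpy(value, q->slot, q->value_size);
	return 0;
}
```

The design point is that the helper, not the caller, owns the contents of
the output buffer on every return path.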

* [PATCH bpf 7/7] bpf: make direct packet write unclone more robust
From: Daniel Borkmann @ 2018-10-24 20:05 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20181024200549.8516-1-daniel@iogearbox.net>

Given this seems to be quite fragile and can easily slip through the
cracks, let's make direct packet write more robust by requiring that
future program types which allow for such write must provide a prologue
callback. In case of XDP and sk_msg it's a noop, thus add a generic noop
handler there. The latter starts out with NULL data/data_end unconditionally
when sg pages are shared.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c |  6 +++++-
 net/core/filter.c     | 11 +++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5fc9a65..171a2c8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5709,7 +5709,11 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 	bool is_narrower_load;
 	u32 target_size;
 
-	if (ops->gen_prologue) {
+	if (ops->gen_prologue || env->seen_direct_write) {
+		if (!ops->gen_prologue) {
+			verbose(env, "bpf verifier is misconfigured\n");
+			return -EINVAL;
+		}
 		cnt = ops->gen_prologue(insn_buf, env->seen_direct_write,
 					env->prog);
 		if (cnt >= ARRAY_SIZE(insn_buf)) {
diff --git a/net/core/filter.c b/net/core/filter.c
index 3fdddfa..cd648d0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5644,6 +5644,15 @@ static bool sock_filter_is_valid_access(int off, int size,
 					       prog->expected_attach_type);
 }
 
+static int bpf_noop_prologue(struct bpf_insn *insn_buf, bool direct_write,
+			     const struct bpf_prog *prog)
+{
+	/* Neither direct read nor direct write requires any preliminary
+	 * action.
+	 */
+	return 0;
+}
+
 static int bpf_unclone_prologue(struct bpf_insn *insn_buf, bool direct_write,
 				const struct bpf_prog *prog, int drop_verdict)
 {
@@ -7210,6 +7219,7 @@ const struct bpf_verifier_ops xdp_verifier_ops = {
 	.get_func_proto		= xdp_func_proto,
 	.is_valid_access	= xdp_is_valid_access,
 	.convert_ctx_access	= xdp_convert_ctx_access,
+	.gen_prologue		= bpf_noop_prologue,
 };
 
 const struct bpf_prog_ops xdp_prog_ops = {
@@ -7308,6 +7318,7 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
 	.get_func_proto		= sk_msg_func_proto,
 	.is_valid_access	= sk_msg_is_valid_access,
 	.convert_ctx_access	= sk_msg_convert_ctx_access,
+	.gen_prologue		= bpf_noop_prologue,
 };
 
 const struct bpf_prog_ops sk_msg_prog_ops = {
-- 
2.9.5

^ permalink raw reply related
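
The check added to convert_ctx_accesses() in patch 7/7 can be sketched
outside the kernel as follows. struct demo_ops, demo_noop_prologue() and
demo_run_prologue() are hypothetical names with simplified signatures; only
the control flow (reject a direct write when no prologue callback exists)
mirrors the hunk above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the verifier ops table. */
struct demo_ops {
	int (*gen_prologue)(bool direct_write);
};

/* Program types that need no preliminary action register a noop. */
static int demo_noop_prologue(bool direct_write)
{
	(void)direct_write;
	return 0;
}

/* Returns 0 and stores the emitted instruction count in *cnt, or
 * -EINVAL when a direct write was seen but no prologue is registered.
 */
static int demo_run_prologue(const struct demo_ops *ops,
			     bool seen_direct_write, int *cnt)
{
	*cnt = 0;
	if (ops->gen_prologue || seen_direct_write) {
		if (!ops->gen_prologue)
			return -EINVAL;	/* misconfigured program type */
		*cnt = ops->gen_prologue(seen_direct_write);
	}
	return 0;
}
```

With this shape, forgetting the callback on a new program type that allows
direct writes fails loudly at verification time instead of silently skipping
the unclone step.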

* Re: Regression in 4.19 net/phy/realtek: garbled sysfs output
From: Andrew Lunn @ 2018-10-24 20:12 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: Netdev, Jassi Brar, David S. Miller
In-Reply-To: <5e989478-0d33-cd05-efa7-fe3ec54856ab@applied-asynchrony.com>

On Wed, Oct 24, 2018 at 09:36:02PM +0200, Holger Hoffstätte wrote:
> Hi,
> 
> Since 4.19 r8169 depends on phylib:
> 
> $lsmod | grep r8169
> r8169                  81920  0
> libphy                 57344  2 r8169,realtek
> 
> Unfortunately this now gives me the following sysfs error:
> 
> $cd /sys/module/realtek/drivers
> $ls -l
> ls: cannot access 'mdio_bus:RTL8201F 10/100Mbps Ethernet': No such file or directory
> total 0
> lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8201CP Ethernet' -> '../../../bus/mdio_bus/drivers/RTL8201CP Ethernet'
> l????????? ? ?    ?    ?            ? 'mdio_bus:RTL8201F 10/100Mbps Ethernet'
> lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8211 Gigabit Ethernet' -> '../../../bus/mdio_bus/drivers/RTL8211 Gigabit Ethernet'
> [..]
> 
> Apparently the forward slash in "10/100Mbps Ethernet" is interpreted as
> directory separator that leads nowhere, and was introduced in commit
> 513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").
> 
> Would it be acceptable to change the name simply to "RTL8201F Ethernet"?

Hi Holger

Or use "RTL8201F Fast Ethernet"

I wonder if other drivers have similar problems?

davicom.c:      .name           = "Davicom DM9161B/C",
intel-xway.c:           .name           = "Intel XWAY PHY11G (PEF 7071/PEF 7072) v1.3",
intel-xway.c:           .name           = "Intel XWAY PHY11G (PEF 7071/PEF 7072) v1.4",
intel-xway.c:           .name           = "Intel XWAY PHY11G (PEF 7071/PEF 7072) v1.5 / v1.6",
intel-xway.c:           .name           = "Intel XWAY PHY22F (PEF 7061) v1.5 / v1.6",
smsc.c:	 .name	       = "SMSC LAN8710/LAN8720",

	 Andrew

^ permalink raw reply
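
A driver name ends up as a single path component in sysfs, so a '/' in it
breaks the symlink, as the ls output in the report shows. A minimal sketch
of a check for that, assuming a hypothetical helper
demo_name_is_sysfs_safe() that does not exist in phylib:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if name can serve as one path component: non-empty, no '/'. */
static bool demo_name_is_sysfs_safe(const char *name)
{
	assert(name != NULL);
	return name[0] != '\0' && strchr(name, '/') == NULL;
}
```

Run against the names listed above, it would flag "RTL8201F 10/100Mbps
Ethernet" and "SMSC LAN8710/LAN8720" and accept "RTL8201F Fast Ethernet".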

* Re: [PATCH] netfilter: conntrack: fix calculation of next bucket number in early_drop
From: Dmitry Safonov @ 2018-10-25  4:52 UTC (permalink / raw)
  To: Vasily Khoruzhick, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, David S. Miller, netfilter-devel, coreteam,
	netdev, linux-kernel
  Cc: stable
In-Reply-To: <20181025034853.23573-1-vasilykh@arista.com>

On 10/25/18 4:48 AM, Vasily Khoruzhick wrote:
> If there's no entry to drop in bucket that corresponds to the hash,
> early_drop() should look for it in other buckets. But since it increments
> hash instead of bucket number, it actually looks in the same bucket 8
> times: hsize is 16k by default (14 bits) and hash is 32-bit value, so
> reciprocal_scale(hash, hsize) returns the same value for hash..hash+7 in
> most cases.
> 
> Fix it by increasing bucket number instead of hash and rename _hash
> to bucket to avoid future confusion.
> 
> Fixes: 3e86638e9a0b ("netfilter: conntrack: consider ct netns in early_drop logic")
> Cc: <stable@vger.kernel.org> # v4.7+
> Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>

Nice work!

Reviewed-by: Dmitry Safonov <dima@arista.com>

> ---
>   net/netfilter/nf_conntrack_core.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index ca1168d67fac..a04af246b184 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1073,19 +1073,22 @@ static unsigned int early_drop_list(struct net *net,
>   	return drops;
>   }
>   
> -static noinline int early_drop(struct net *net, unsigned int _hash)
> +static noinline int early_drop(struct net *net, unsigned int hash)
>   {
>   	unsigned int i;
>   
>   	for (i = 0; i < NF_CT_EVICTION_RANGE; i++) {
>   		struct hlist_nulls_head *ct_hash;
> -		unsigned int hash, hsize, drops;
> +		unsigned int bucket, hsize, drops;
>   
>   		rcu_read_lock();
>   		nf_conntrack_get_ht(&ct_hash, &hsize);
> -		hash = reciprocal_scale(_hash++, hsize);
> +		if (!i)
> +			bucket = reciprocal_scale(hash, hsize);
> +		else
> +			bucket = (bucket + 1) % hsize;
>   
> -		drops = early_drop_list(net, &ct_hash[hash]);
> +		drops = early_drop_list(net, &ct_hash[bucket]);
>   		rcu_read_unlock();
>   
>   		if (drops) {
> 

-- 
           Dima

^ permalink raw reply
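
The effect described in the commit message can be reproduced in userspace.
The sketch below uses the same multiply-and-shift arithmetic as
reciprocal_scale(); the demo_* names are hypothetical, and the counting
helpers exist only to compare the old probing (increment the hash) with the
new one (increment the bucket).

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as reciprocal_scale(): map val into [0, ep_ro). */
static uint32_t demo_reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Old behaviour: rescale hash, hash + 1, ... and count bucket changes. */
static unsigned int demo_buckets_old(uint32_t hash, uint32_t hsize,
				     unsigned int range)
{
	unsigned int i, distinct = 0;
	uint32_t bucket, prev = 0;

	assert(hsize > 0);
	for (i = 0; i < range; i++) {
		bucket = demo_reciprocal_scale(hash + i, hsize);
		if (i == 0 || bucket != prev)
			distinct++;
		prev = bucket;
	}
	return distinct;
}

/* New behaviour: scale once, then step to the next bucket with wrap. */
static unsigned int demo_buckets_new(uint32_t hash, uint32_t hsize,
				     unsigned int range)
{
	unsigned int i, distinct = 0;
	uint32_t bucket = 0, prev = 0;

	assert(hsize > 0);
	for (i = 0; i < range; i++) {
		if (!i)
			bucket = demo_reciprocal_scale(hash, hsize);
		else
			bucket = (bucket + 1) % hsize;
		if (i == 0 || bucket != prev)
			distinct++;
		prev = bucket;
	}
	return distinct;
}
```

With hsize = 16384 the scaled value is simply hash >> 18, so adding up to 7
to the hash changes the bucket only when the low 18 bits overflow.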

* Re: [PATCH ghak90 (was ghak32) V4 03/10] audit: log container info of syscalls
From: Paul Moore @ 2018-10-24 20:55 UTC (permalink / raw)
  To: rgb
  Cc: containers, linux-api, linux-audit, linux-fsdevel, linux-kernel,
	netdev, netfilter-devel, ebiederm, luto, carlos, dhowells, viro,
	simo, Eric Paris, Serge Hallyn
In-Reply-To: <20181024151439.lavhanabsyxdrdvo@madcap2.tricolour.ca>

On Wed, Oct 24, 2018 at 11:15 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-10-19 19:16, Paul Moore wrote:
> > On Sun, Aug 5, 2018 at 4:32 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > Create a new audit record AUDIT_CONTAINER to document the audit
> > > container identifier of a process if it is present.
> > >
> > > Called from audit_log_exit(), syscalls are covered.
> > >
> > > A sample raw event:
> > > type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
> > > type=CWD msg=audit(1519924845.499:257): cwd="/root"
> > > type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
> > > type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
> > > type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
> > > type=CONTAINER msg=audit(1519924845.499:257): op=task contid=123458
> > >
> > > See: https://github.com/linux-audit/audit-kernel/issues/90
> > > See: https://github.com/linux-audit/audit-userspace/issues/51
> > > See: https://github.com/linux-audit/audit-testsuite/issues/64
> > > See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > Acked-by: Serge Hallyn <serge@hallyn.com>
> > > Acked-by: Steve Grubb <sgrubb@redhat.com>
> > > ---
> > >  include/linux/audit.h      |  7 +++++++
> > >  include/uapi/linux/audit.h |  1 +
> > >  kernel/audit.c             | 24 ++++++++++++++++++++++++
> > >  kernel/auditsc.c           |  3 +++
> > >  4 files changed, 35 insertions(+)
> >
> > ...
> >
> > > @@ -2045,6 +2045,30 @@ void audit_log_session_info(struct audit_buffer *ab)
> > >         audit_log_format(ab, " auid=%u ses=%u", auid, sessionid);
> > >  }
> > >
> > > +/*
> > > + * audit_log_contid - report container info
> > > + * @tsk: task to be recorded
> > > + * @context: task or local context for record
> > > + * @op: contid string description
> > > + */
> > > +int audit_log_contid(struct task_struct *tsk,
> > > +                            struct audit_context *context, char *op)
> > > +{
> > > +       struct audit_buffer *ab;
> > > +
> > > +       if (!audit_contid_set(tsk))
> > > +               return 0;
> > > +       /* Generate AUDIT_CONTAINER record with container ID */
> > > +       ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER);
> > > +       if (!ab)
> > > +               return -ENOMEM;
> > > +       audit_log_format(ab, "op=%s contid=%llu",
> > > +                        op, audit_get_contid(tsk));
> > > +       audit_log_end(ab);
> > > +       return 0;
> > > +}
> > > +EXPORT_SYMBOL(audit_log_contid);
> >
> > As discussed in the previous iteration of the patch, I prefer
> > AUDIT_CONTAINER_ID here over AUDIT_CONTAINER.  If you feel strongly
> > about keeping it as-is with AUDIT_CONTAINER I suppose I could live
> > with that, but it isn't my first choice.
>
> I don't have a strong opinion on this one, mildly preferring the shorter
> one only because it is shorter.

We already have multiple AUDIT_CONTAINER* record types, so it seems as
though we should use "AUDIT_CONTAINER" as a prefix of sorts, rather
than a type itself.

> > However, I do care about the "op" field in this record.  It just
> > doesn't make any sense; the way you are using it it is more of a
> > context field than an operations field, and even then why is the
> > context important from a logging and/or security perspective?  Drop it
> > please.
>
> I'll rename it to whatever you like.  I'd suggest "ref=".  The reason I
> think it is important is there are multiple sources that aren't always
> obvious from the other records to which it is associated.  In the case
> of ptrace and signals, there can be many target tasks listed (OBJ_PID)
> with no other way to distinguish the matching audit container identifier
> records all for one event.  This is in addition to the default syscall
> container identifier record.  I'm not currently happy with the text
> content to link the two, but that should be solvable (most obvious is
> target PID).  Throwing away this information seems shortsighted.

It would be helpful if you could generate real audit events
demonstrating the problems you are describing, as well as a more
standard syscall event, so we can discuss some possible solutions.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH net] net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
From: David Ahern @ 2018-10-24 20:58 UTC (permalink / raw)
  To: netdev, davem; +Cc: David Ahern

From: David Ahern <dsahern@gmail.com>

The intent of ip6_route_check_nh_onlink is to make sure the gateway
given for an onlink route is not actually on a connected route for
a different interface (e.g., 2001:db8:1::/64 is on dev eth1 and then
an onlink route has a via 2001:db8:1::1 dev eth2). If the gateway
lookup hits the default route then it most likely will be a different
interface than the onlink route which is ok.

Update ip6_route_check_nh_onlink to disregard the device mismatch
if the gateway lookup hits the default route. Turns out the existing
onlink tests are passing because there is no default route or it is
an unreachable default, so update the onlink tests to have a default
route other than unreachable.

Fixes: fc1e64e1092f6 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/ipv6/route.c                                |  2 ++
 tools/testing/selftests/net/fib-onlink-tests.sh | 14 +++++++-------
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e3226284e480..2a7423c39456 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2745,6 +2745,8 @@ static int ip6_route_check_nh_onlink(struct net *net,
 	grt = ip6_nh_lookup_table(net, cfg, gw_addr, tbid, 0);
 	if (grt) {
 		if (!grt->dst.error &&
+		    /* ignore match if it is the default route */
+		    grt->from && !ipv6_addr_any(&grt->from->fib6_dst.addr) &&
 		    (grt->rt6i_flags & flags || dev != grt->dst.dev)) {
 			NL_SET_ERR_MSG(extack,
 				       "Nexthop has invalid gateway or device mismatch");
diff --git a/tools/testing/selftests/net/fib-onlink-tests.sh b/tools/testing/selftests/net/fib-onlink-tests.sh
index 3991ad1a368d..864f865eee55 100755
--- a/tools/testing/selftests/net/fib-onlink-tests.sh
+++ b/tools/testing/selftests/net/fib-onlink-tests.sh
@@ -167,8 +167,8 @@ setup()
 	# add vrf table
 	ip li add ${VRF} type vrf table ${VRF_TABLE}
 	ip li set ${VRF} up
-	ip ro add table ${VRF_TABLE} unreachable default
-	ip -6 ro add table ${VRF_TABLE} unreachable default
+	ip ro add table ${VRF_TABLE} unreachable default metric 8192
+	ip -6 ro add table ${VRF_TABLE} unreachable default metric 8192
 
 	# create test interfaces
 	ip li add ${NETIFS[p1]} type veth peer name ${NETIFS[p2]}
@@ -185,20 +185,20 @@ setup()
 	for n in 1 3 5 7; do
 		ip li set ${NETIFS[p${n}]} up
 		ip addr add ${V4ADDRS[p${n}]}/24 dev ${NETIFS[p${n}]}
-		ip addr add ${V6ADDRS[p${n}]}/64 dev ${NETIFS[p${n}]}
+		ip addr add ${V6ADDRS[p${n}]}/64 dev ${NETIFS[p${n}]} nodad
 	done
 
 	# move peer interfaces to namespace and add addresses
 	for n in 2 4 6 8; do
 		ip li set ${NETIFS[p${n}]} netns ${PEER_NS} up
 		ip -netns ${PEER_NS} addr add ${V4ADDRS[p${n}]}/24 dev ${NETIFS[p${n}]}
-		ip -netns ${PEER_NS} addr add ${V6ADDRS[p${n}]}/64 dev ${NETIFS[p${n}]}
+		ip -netns ${PEER_NS} addr add ${V6ADDRS[p${n}]}/64 dev ${NETIFS[p${n}]} nodad
 	done
 
-	set +e
+	ip -6 ro add default via ${V6ADDRS[p3]/::[0-9]/::64}
+	ip -6 ro add table ${VRF_TABLE} default via ${V6ADDRS[p7]/::[0-9]/::64}
 
-	# let DAD complete - assume default of 1 probe
-	sleep 1
+	set +e
 }
 
 cleanup()
-- 
2.11.0

^ permalink raw reply related
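
The condition added to ip6_route_check_nh_onlink() can be read as a small
predicate. The sketch below is a userspace approximation: the demo_* names
and the flattened boolean parameters are hypothetical, and the route is
reduced to its destination address. Only the shape of the test, skipping the
mismatch check when the gateway lookup landed on the all-zero (default)
destination, follows the hunk above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the 16-byte IPv6 address is all zeroes (::). */
static bool demo_addr_any(const uint8_t addr[16])
{
	size_t i;

	for (i = 0; i < 16; i++)
		if (addr[i])
			return false;
	return true;
}

/* True if the onlink nexthop should be rejected. */
static bool demo_onlink_rejected(bool lookup_error, bool has_from,
				 const uint8_t dst[16],
				 bool bad_flags, bool dev_mismatch)
{
	return !lookup_error &&
	       /* ignore the match if it is the default route */
	       has_from && !demo_addr_any(dst) &&
	       (bad_flags || dev_mismatch);
}
```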

* Re: Regression in 4.19 net/phy/realtek: garbled sysfs output
From: David Miller @ 2018-10-24 20:59 UTC (permalink / raw)
  To: holger; +Cc: netdev, jaswinder.singh, hkallweit1
In-Reply-To: <5e989478-0d33-cd05-efa7-fe3ec54856ab@applied-asynchrony.com>

From: Holger Hoffstätte <holger@applied-asynchrony.com>
Date: Wed, 24 Oct 2018 21:36:02 +0200

Adding Heiner to CC:

> Since 4.19 r8169 depends on phylib:
> 
> $lsmod | grep r8169
> r8169                  81920  0
> libphy                 57344  2 r8169,realtek
> 
> Unfortunately this now gives me the following sysfs error:
> 
> $cd /sys/module/realtek/drivers
> $ls -l
> ls: cannot access 'mdio_bus:RTL8201F 10/100Mbps Ethernet': No such
> file or directory
> total 0
> lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8201CP Ethernet' ->
> '../../../bus/mdio_bus/drivers/RTL8201CP Ethernet'
> l????????? ? ?  ?  ?  ? 'mdio_bus:RTL8201F 10/100Mbps Ethernet'
> lrwxrwxrwx 1 root root 0 Oct 24 21:09 'mdio_bus:RTL8211 Gigabit
> Ethernet' -> '../../../bus/mdio_bus/drivers/RTL8211 Gigabit Ethernet'
> [..]
> 
> Apparently the forward slash in "10/100Mbps Ethernet" is interpreted
> as
> directory separator that leads nowhere, and was introduced in commit
> 513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").
> 
> Would it be acceptable to change the name simply to "RTL8201F
> Ethernet"?

^ permalink raw reply

* Re: [PATCH net 0/4] net: Fixups for recent dump filtering changes
From: David Miller @ 2018-10-24 21:07 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, lirongqing, dsahern
In-Reply-To: <20181024195902.17479-1-dsahern@kernel.org>

From: David Ahern <dsahern@kernel.org>
Date: Wed, 24 Oct 2018 12:58:58 -0700

> Li RongQing noted that tgt_net is leaked in ipv4 due to the recent change
> to handle address dumps for a specific device. The report also applies to
> ipv6 and other error paths. Patches 1 and 2 fix those leaks.
> 
> Patch 3 stops route dumps from erroring out when dumping across address
> families and a table id is given. This is needed in preparation for
> patch 4.
> 
> Patch 4 updates the rtnl_dump_all to handle a failure in one of the dumpit
> functions. At the moment, if an address dump returns an error the dump all
> loop breaks but the error is dropped. The result can be no data is returned
> and no error either leaving the user wondering about the addresses.
> 
> Patches were tested with a modified iproute2 to add invalid data to the
> dump request causing each specific failure path to be hit in addition
> to positive testing that it works as it should when given valid data.

Series applied, thanks David.

^ permalink raw reply

* Re: [PATCH net v2] net: udp: fix handling of CHECKSUM_COMPLETE packets
From: David Miller @ 2018-10-24 21:21 UTC (permalink / raw)
  To: stranche; +Cc: eric.dumazet, netdev, samanthakumar, edumazet
In-Reply-To: <1540332271-15564-1-git-send-email-stranche@codeaurora.org>

From: Sean Tranchetti <stranche@codeaurora.org>
Date: Tue, 23 Oct 2018 16:04:31 -0600

> Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
> incorrect for any packet that has an incorrect checksum value.
> 
> udp4/6_csum_init() will both make a call to
> __skb_checksum_validate_complete() to initialize/validate the csum
> field when receiving a CHECKSUM_COMPLETE packet. When this packet
> fails validation, skb->csum will be overwritten with the pseudoheader
> checksum so the packet can be fully validated by software, but the
> skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
> the stack can later warn the user about their hardware spewing bad
> checksums. Unfortunately, leaving the SKB in this state can cause
> problems later on in the checksum calculation.
> 
> Since the packet is still marked as CHECKSUM_COMPLETE,
> udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
> from skb->csum instead of adding it, leaving us with a garbage value
> in that field. Once we try to copy the packet to userspace in the
> udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
> to checksum the packet data and add it in the garbage skb->csum value
> to perform our final validation check.
> 
> Since the value we're validating is not the proper checksum, it's possible
> that the folded value could come out to 0, causing us not to drop the
> packet. Instead, we believe that the packet was checksummed incorrectly
> by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
> to warn the user with netdev_rx_csum_fault(skb->dev);
> 
> Unfortunately, since this is the UDP path, skb->dev has been overwritten
> by skb->dev_scratch and is no longer a valid pointer, so we end up
> reading invalid memory.

Just want to say that it has always been complicated in this area due to
the fact that we do this deferral of final checksum validation to when we
copy the packet into userspace.  For example, poll() needs to do special
things, etc.

Because we have to make it seem as if we dropped the packet with a bad
checksum from the point of view of what the user sees during recvmsg()
and poll() calls.  But until we do that checksum validation, we don't
know exactly what the situation is.

> This patch addresses this problem in two ways:
> 	1) Do not use the dev pointer when calling netdev_rx_csum_fault()
> 	   from skb_copy_and_csum_datagram_msg(). Since this gets called
> 	   from the UDP path where skb->dev has been overwritten, we have
> 	   no way of knowing if the pointer is still valid. Also for the
> 	   sake of consistency with the other uses of
> 	   netdev_rx_csum_fault(), don't attempt to call it if the
> 	   packet was checksummed by software.
> 
> 	2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
> 	   If we receive a packet that's CHECKSUM_COMPLETE that fails
> 	   verification (i.e. skb->csum_valid == 0), check who performed
> 	   the calculation. It's possible that the checksum was done in
> 	   software by the network stack earlier (such as Netfilter's
> 	   CONNTRACK module), and if that says the checksum is bad,
> 	   we can drop the packet immediately instead of waiting until
> 	   we try and copy it to userspace. Otherwise, we need to
> 	   mark the SKB as CHECKSUM_NONE, since the skb->csum field
> 	   no longer contains the full packet checksum after the
> 	   call to __skb_checksum_validate_complete().
> 
> Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")

Can't count on my hands how many regressions are a result of that change and
its subtle side effects. :-/

> Fixes: c84d949057ca ("udp: copy skb->truesize in the first cache line")
> Cc: Sam Kumar <samanthakumar@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>

Applied and queued up for -stable, thank you.

^ permalink raw reply
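
For readers unfamiliar with the "folded value" mentioned in the report: the
Internet checksum folds a 32-bit one's-complement accumulator to 16 bits
and complements it, and validation succeeds when the result is 0. A minimal
userspace sketch of that fold (demo_csum_fold() is a hypothetical name, not
the kernel's csum_fold()):

```c
#include <stdint.h>

/* Fold a 32-bit one's-complement sum to 16 bits and complement it. */
static uint16_t demo_csum_fold(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);	/* add carries back in */
	sum = (sum & 0xffff) + (sum >> 16);	/* a second carry is possible */
	return (uint16_t)~sum;
}
```

Any accumulator that folds to 0xffff validates, which is why a garbage
skb->csum value can occasionally pass the final check by chance.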

* Re: [PATCH] r8169: Add new device ID support
From: David Miller @ 2018-10-24 21:22 UTC (permalink / raw)
  To: shawn.lin; +Cc: nic_swsd, netdev, hkallweit1
In-Reply-To: <1540345607-110155-1-git-send-email-shawn.lin@rock-chips.com>

From: Shawn Lin <shawn.lin@rock-chips.com>
Date: Wed, 24 Oct 2018 09:46:47 +0800

> It's found my r8169 ethernet card at hand has a device ID
> of 0x0000 which wasn't on the list of rtl8169_pci_tbl. Add
> a new entry to make it work:
> 
> [2.165785] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [2.165863] r8169 0000:01:00.0: enabling device (0000 -> 0003)
> [2.167110] r8169 0000:01:00.0 eth0: RTL8168c/8111c at 0xffffff80089be000,
> 00:e0:4c:21:00:17, XID 1c4000c0 IRQ 208
> [2.167128] r8169 0000:01:00.0 eth0: jumbo features [frames: 6128
> bytes, tx checksumming: ko]
> 
> [root@rk1808:/]# lspci
> 00:00.0 Class 0604: 1d87:1808
> 01:00.0 Class 0200: 10ec:0000
> 
> Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>

I'm still not terribly confident in this change, a device ID of zero is
really unusual.

Heiner, what do you think?

^ permalink raw reply

* Re: [PATCH net-next 1/3] net/sock: factor out dequeue/peek with offset code
From: Alexei Starovoitov @ 2018-10-24 21:23 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, David S. Miller, Eric Dumazet, kafai, daniel
In-Reply-To: <4c94ee8fe77a51d61927bfff46441abc15172193.camel@redhat.com>

On Tue, Oct 23, 2018 at 09:28:03AM +0200, Paolo Abeni wrote:
> Hi,
> 
> On Mon, 2018-10-22 at 21:49 -0700, Alexei Starovoitov wrote:
> > On Mon, May 15, 2017 at 11:01:42AM +0200, Paolo Abeni wrote:
> > > And update __sk_queue_drop_skb() to work on the specified queue.
> > > This will help the udp protocol to use an additional private
> > > rx queue in a later patch.
> > > 
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > ---
> > >  include/linux/skbuff.h |  7 ++++
> > >  include/net/sock.h     |  4 +--
> > >  net/core/datagram.c    | 90 ++++++++++++++++++++++++++++----------------------
> > >  3 files changed, 60 insertions(+), 41 deletions(-)
> > > 
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index a098d95..bfc7892 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
> > >  
> > >  int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
> > >  				const struct sk_buff *skb);
> > > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > > +					  struct sk_buff_head *queue,
> > > +					  unsigned int flags,
> > > +					  void (*destructor)(struct sock *sk,
> > > +							   struct sk_buff *skb),
> > > +					  int *peeked, int *off, int *err,
> > > +					  struct sk_buff **last);
> > >  struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
> > >  					void (*destructor)(struct sock *sk,
> > >  							   struct sk_buff *skb),
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 66349e4..49d226f 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer,
> > >  
> > >  void sk_stop_timer(struct sock *sk, struct timer_list *timer);
> > >  
> > > -int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
> > > -			unsigned int flags,
> > > +int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
> > > +			struct sk_buff *skb, unsigned int flags,
> > >  			void (*destructor)(struct sock *sk,
> > >  					   struct sk_buff *skb));
> > >  int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
> > > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > > index db1866f2..a4592b4 100644
> > > --- a/net/core/datagram.c
> > > +++ b/net/core/datagram.c
> > > @@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb)
> > >  	return skb;
> > >  }
> > >  
> > > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > > +					  struct sk_buff_head *queue,
> > > +					  unsigned int flags,
> > > +					  void (*destructor)(struct sock *sk,
> > > +							   struct sk_buff *skb),
> > > +					  int *peeked, int *off, int *err,
> > > +					  struct sk_buff **last)
> > > +{
> > > +	struct sk_buff *skb;
> > > +
> > > +	*last = queue->prev;
> > 
> > this refactoring changed the behavior.
> > Now queue->prev is returned as last.
> > Whereas it was *last = queue before.
> > 
> > > +	skb_queue_walk(queue, skb) {
> > 
> > and *last = skb assignment is gone too.
> > 
> > Was this intentional ? 
> 
> Yes.
> 
> > Is this the right behavior?
> 
> I think so. queue->prev is the last skb in the queue. With the old
> code,   __skb_try_recv_datagram(), when returning NULL, used the
> instructions you quoted to overall set 'last' to the last skb in the
> queue. We did not use 'last' elsewhere. So overall this just reduce the
> number of instructions inside the loop. (unless I'm missing something).

Right. On the second glance it does appear to be correct.

> Are you experiencing any specific issues due to the mentioned commit?

yes.

Just like what Baoyou Xie reported https://lore.kernel.org/patchwork/patch/962802/
we're hitting infinite loop in __skb_recv_datagram() on 4.11 kernel.
and different, but also buggy, behavior on the net-next.

In particular __skb_try_recv_datagram() returns immediately,
because skb_queue_empty() is true (sk->sk_receive_queue.next == &sk->sk_receive_queue)

But __skb_wait_for_more_packets() also returns immediately
because if (sk->sk_receive_queue.prev != skb) is also true.

There is a linked list corruption in sk_receive_queue.

list->next == list, but list->prev still points to valid skb.
Before your commit we had
*last = queue;
and we had this infinite loop I described above.
After your commit
*last = queue->prev;
which assigns buggy pointer into *last, but that accidentally
makes if (sk->sk_receive_queue.prev != skb) to be false
and __skb_wait_for_more_packets() goes into schedule_timeout().
Eventually bad things happen too, but in the different spot.

The corruption is somehow related to netlink_recvmsg() just like in that
Baoyou Xie report.

The typical stack trace is
__skb_wait_for_more_packets+0x64/0x140
? skb_gro_receive+0x310/0x310
__skb_recv_datagram+0x5c/0xa0
skb_recv_datagram+0x31/0x40
netlink_recvmsg+0x51/0x3c0
? sock_write_iter+0xf8/0x110
SYSC_recvfrom+0x116/0x190

We didn't figure out a way to reproduce it yet.
kasan didn't help.
The way netlink socket pushes skbs into sk_receive_queue and drains it
all looks correct. We thought it could be MSG_PEEK related, but
skb->users refcnting also looks correct.

If anyone has any ideas what things to try, I'm all ears.

^ permalink raw reply
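
The corrupted state described above (list->next == list while list->prev
still points to an skb) can be modelled with a bare circular list. The
sketch is userspace-only and the demo_* names are hypothetical; the two
predicates stand in for the skb_queue_empty() test and for the
sk_receive_queue.prev != skb test quoted in the message.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_node {
	struct demo_node *next, *prev;
};

/* Emptiness test that consults only ->next. */
static bool demo_queue_empty(const struct demo_node *head)
{
	return head->next == head;
}

/* Wait-path test: true means "the tail moved, do not sleep". */
static bool demo_tail_changed(const struct demo_node *head,
			      const struct demo_node *last)
{
	return head->prev != last;
}

/* Build the reported corruption: next == head, prev is stale. */
static void demo_corrupt(struct demo_node *head, struct demo_node *stale)
{
	head->next = head;
	head->prev = stale;
}
```

With last set to the queue head (the old code) the receive path sees an
empty queue and the wait path sees a changed tail, so neither blocks and
the caller spins. With last set to queue->prev (after the refactoring) the
tail test is false and the caller goes to sleep instead.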
