Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 0/9] Netfilter fixes for net
From: David Miller @ 2018-06-13 21:05 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20180613105700.12894-1-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Wed, 13 Jun 2018 12:56:51 +0200

> The following patchset contains Netfilter patches for your net tree:
> 
> 1) Fix NULL pointer dereference from nf_nat_decode_session() if NAT is
>    not loaded, from Prashant Bhole.
> 
> 2) Fix socket extension module autoload.
> 
> 3) Don't bogusly reject sets with the NFT_SET_EVAL flag set on from
>    the dynset extension.
> 
> 4) Fix races with nf_tables module removal and netns exit path,
>    patches from Florian Westphal.
> 
> 5) Don't hit BUG_ON if jumpstack goes too deep, instead hit
>    WARN_ON_ONCE, from Taehee Yoo.
> 
> 6) Another NULL pointer dereference from ctnetlink, again if NAT is
>    not loaded, from Florian Westphal.
> 
> 7) Fix x_tables match list corruption in xt_connmark module removal
>    path, also from Florian.
> 
> 8) nf_conncount doesn't properly deal with conntrack zones, hence
>    garbage collector may get rid of entries in a different zone.
>    From Yi-Hung Wei.
> 
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thank you.

^ permalink raw reply

* Re: [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure
From: Florian Westphal @ 2018-06-13 20:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Florian Westphal, netfilter-devel, ast, daniel, netdev,
	David S. Miller, ecree
In-Reply-To: <20180613004324.x2xdyj2qlbkkpccy@ast-mbp.dhcp.thefacebook.com>

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Tue, Jun 12, 2018 at 11:28:12AM +0200, Florian Westphal wrote:
> > I think its important user(space) can see which rules are jitted, and
> > which ebpf prog corresponds to which rule(s), using an expression as
> > container allows to re-use existing nft config plane code to serialze
> > this via netlink attributes.
> 
> In my mind it would be all or nothing. I don't think it helps
> to convert some rules and not all.

Ok.  Still, even in that case I think it would be good if we'd be able to tell
userspace the ebpf program id that corresponds to the ruleset.

> > Step 1: 1:1 mapping, an nft rule has at most one ebpf prog.
> > Step 2: figure out how to handle maps, sets, and how to cope with
> >         not-yet-translateable expressions
> > Step 3: m:n mapping: kernel provides adjacent rules to the UMH for
> >         jitting.  Example: user appends rules a, b, c.  UMH creates
> > 	single ebpf prog from a/b/c.
> >       	nft-pseudo-expression replaces a/b/c in the
> > 	packet path, original rules a/b/c are linked from the pseudo
> > 	expression for tracking.  If user deletes rule b, we provide
> > 	a/c to UMH to create new epbf prog that replaces new
> > 	sequence a/c.
> > Step 4: always provide entire future base chain and all reachable chains
> >         to the umh.  Ideally all of it is replaced by single program.

[..]

> > Does that make sense to you?
> > 
> > If you see this as flawed, please let me know, but as I have no idea
> > how to resolve these issues going from 0 to 4 makes no sense to me.
>
> I think the challenge is how to implement 4 without doing step 1, right?

Yes.

> imo doing such 1:1 (single rule to single bpf prog) translation does not
> help to break hard problem into smaller pieces. Such 1:1 is great
> for prototype, but not to land upstream.
> For the same reasons in bpfilter we did single iptable rule to single
> bpf prog translation, but such code doesn't belong in upstream tree,
> since it's not a scalable approach.
[..]

> > Okay, but without any idea how to consider existing expressions,
> > sets, maps etc. I'm not sure it makes sense to work on that at this
> > point.
> 
> I think sets and ipset (in case of iptables) fit well into trie model.

Yes, but thats going to be a lot of effort to handle properly
without breaking (or replacing) userland plumbing.

For nft we could aim for full-translation for the ingress hook
initially as that takes stateful filering out of the picture (ingress
occurs before conntrack).

We could also ignore sets for now and only deal with anonymous sets (they
are immutable and data stored in such sets can be made available to
UMH).

I can rework the RFC to emit "future table" to UMH instead of
individual rules, but I don't know yet when i will have time to work on
it again.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH v2] hv_netvsc: Add per-cpu ethtool stats for netvsc
From: Eric Dumazet @ 2018-06-13 20:57 UTC (permalink / raw)
  To: Yidong Ren, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	David S. Miller, devel, netdev, linux-kernel
In-Reply-To: <20180613193608.444-1-yidren@linuxonhyperv.com>



On 06/13/2018 12:36 PM, Yidong Ren wrote:
> From: Yidong Ren <yidren@microsoft.com>
> 
> This patch implements following ethtool stats fields for netvsc:
> cpu<n>_tx/rx_packets/bytes
> cpu<n>_vf_tx/rx_packets/bytes

...
>  
> +	pcpu_sum = alloc_percpu(struct netvsc_ethtool_pcpu_stats);
> +	netvsc_get_pcpu_stats(dev, pcpu_sum);
> +	for_each_present_cpu(cpu) {
> +		struct netvsc_ethtool_pcpu_stats *this_sum =
> +			per_cpu_ptr(pcpu_sum, cpu);
> +		for (j = 0; j < ARRAY_SIZE(pcpu_stats); j++)
> +			data[i++] = *(u64 *)((void *)this_sum
> +					     + pcpu_stats[j].offset);
> +	}
> +	free_percpu(pcpu_sum);
>


Using alloc_percpu() / free_percpu() for a short section of code makes no sense.

You actually want to allocate memory local to this cpu, possibly in one chunk,
not spread all over the places.

kvmalloc(nr_cpu_ids * sizeof(struct netvsc_ethtool_pcpu_stats))  should be really better,
since it would most of the time be satisfied by a single kmalloc()

^ permalink raw reply

* [BUG] net: stmmac: socfpga ethernet no longer working on linux-next
From: Dinh Nguyen @ 2018-06-13 20:46 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, clabbe, joabreu, Dinh Nguyen, Marek Vasut

Hi,

The stmmac ethernet has stopped working in linux-next and linus/master
branch(v4.17-11782-gbe779f03d563)

It appears that the stmmac ethernet has stopped working after these 2 commits:

4dbbe8dde848 net: stmmac: Add support for U32 TC filter using Flexible RX Parser
5f0456b43140 net: stmmac: Implement logic to automatically select HW Interface

If I move to this commit "565020aaeebf net: stmmac: Disable ACS
Feature for GMAC >= 4", then the stmmac works again on SoCFPGA.

I was following this thread:
https://www.spinics.net/lists/netdev/msg502858.html

Was wondering if there was a patch to fix dwmac-sun8i that the socfpga
platform needs as well?

Thanks,
Dinh

^ permalink raw reply

* Re: [PATCH net] net: qcom/emac: Add missing of_node_put()
From: Timur Tabi @ 2018-06-13 20:45 UTC (permalink / raw)
  To: YueHaibing, davem; +Cc: linux-kernel, netdev, Hemanth Puranik
In-Reply-To: <20180611130345.15172-1-yuehaibing@huawei.com>

On 06/11/2018 08:03 AM, YueHaibing wrote:
> Add missing of_node_put() call for device node returned by
> of_parse_phandle().
> 
> Signed-off-by: YueHaibing<yuehaibing@huawei.com>

Acked-by: Timur Tabi <timur@codeaurora.org>

This seems legit.  The comment for of_find_device_by_node() that says 
the np needs to be released was added after the code was written, so 
it's possible that I didn't know at the time that this was a requirement.

However, I no longer have the ability to test EMAC on device tree 
platforms, so I can't verify this code.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply

* [PATCH v2] hv_netvsc: Add per-cpu ethtool stats for netvsc
From: Yidong Ren @ 2018-06-13 19:36 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	David S. Miller, devel, netdev, linux-kernel

From: Yidong Ren <yidren@microsoft.com>

This patch implements following ethtool stats fields for netvsc:
cpu<n>_tx/rx_packets/bytes
cpu<n>_vf_tx/rx_packets/bytes

Corresponding per-cpu counters exist in current code. Exposing these
counters will help troubleshooting performance issues.

Signed-off-by: Yidong Ren <yidren@microsoft.com>
---
Changes in v2:
  - Remove cpp style comment
  - Resubmit after freeze

 drivers/net/hyperv/hyperv_net.h |  11 +++++
 drivers/net/hyperv/netvsc_drv.c | 104 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 23304ac..c825353 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -873,6 +873,17 @@ struct netvsc_ethtool_stats {
 	unsigned long wake_queue;
 };
 
+struct netvsc_ethtool_pcpu_stats {
+	u64     rx_packets;
+	u64     rx_bytes;
+	u64     tx_packets;
+	u64     tx_bytes;
+	u64     vf_rx_packets;
+	u64     vf_rx_bytes;
+	u64     vf_tx_packets;
+	u64     vf_tx_bytes;
+};
+
 struct netvsc_vf_pcpu_stats {
 	u64     rx_packets;
 	u64     rx_bytes;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 7b18a8c..6803aae 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1105,6 +1105,66 @@ static void netvsc_get_vf_stats(struct net_device *net,
 	}
 }
 
+static void netvsc_get_pcpu_stats(struct net_device *net,
+				  struct netvsc_ethtool_pcpu_stats
+					__percpu *pcpu_tot)
+{
+	struct net_device_context *ndev_ctx = netdev_priv(net);
+	struct netvsc_device *nvdev = rcu_dereference_rtnl(ndev_ctx->nvdev);
+	int i;
+
+	/* fetch percpu stats of vf */
+	for_each_possible_cpu(i) {
+		const struct netvsc_vf_pcpu_stats *stats =
+			per_cpu_ptr(ndev_ctx->vf_stats, i);
+		struct netvsc_ethtool_pcpu_stats *this_tot =
+			per_cpu_ptr(pcpu_tot, i);
+		unsigned int start;
+
+		do {
+			start = u64_stats_fetch_begin_irq(&stats->syncp);
+			this_tot->vf_rx_packets = stats->rx_packets;
+			this_tot->vf_tx_packets = stats->tx_packets;
+			this_tot->vf_rx_bytes = stats->rx_bytes;
+			this_tot->vf_tx_bytes = stats->tx_bytes;
+		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
+		this_tot->rx_packets = this_tot->vf_rx_packets;
+		this_tot->tx_packets = this_tot->vf_tx_packets;
+		this_tot->rx_bytes   = this_tot->vf_rx_bytes;
+		this_tot->tx_bytes   = this_tot->vf_tx_bytes;
+	}
+
+	/* fetch percpu stats of netvsc */
+	for (i = 0; i < nvdev->num_chn; i++) {
+		const struct netvsc_channel *nvchan = &nvdev->chan_table[i];
+		const struct netvsc_stats *stats;
+		struct netvsc_ethtool_pcpu_stats *this_tot =
+			per_cpu_ptr(pcpu_tot, nvchan->channel->target_cpu);
+		u64 packets, bytes;
+		unsigned int start;
+
+		stats = &nvchan->tx_stats;
+		do {
+			start = u64_stats_fetch_begin_irq(&stats->syncp);
+			packets = stats->packets;
+			bytes = stats->bytes;
+		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
+
+		this_tot->tx_bytes	+= bytes;
+		this_tot->tx_packets	+= packets;
+
+		stats = &nvchan->rx_stats;
+		do {
+			start = u64_stats_fetch_begin_irq(&stats->syncp);
+			packets = stats->packets;
+			bytes = stats->bytes;
+		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
+
+		this_tot->rx_bytes	+= bytes;
+		this_tot->rx_packets	+= packets;
+	}
+}
+
 static void netvsc_get_stats64(struct net_device *net,
 			       struct rtnl_link_stats64 *t)
 {
@@ -1202,6 +1262,23 @@ static const struct {
 	{ "rx_no_memory", offsetof(struct netvsc_ethtool_stats, rx_no_memory) },
 	{ "stop_queue", offsetof(struct netvsc_ethtool_stats, stop_queue) },
 	{ "wake_queue", offsetof(struct netvsc_ethtool_stats, wake_queue) },
+}, pcpu_stats[] = {
+	{ "cpu%u_rx_packets",
+		offsetof(struct netvsc_ethtool_pcpu_stats, rx_packets) },
+	{ "cpu%u_rx_bytes",
+		offsetof(struct netvsc_ethtool_pcpu_stats, rx_bytes) },
+	{ "cpu%u_tx_packets",
+		offsetof(struct netvsc_ethtool_pcpu_stats, tx_packets) },
+	{ "cpu%u_tx_bytes",
+		offsetof(struct netvsc_ethtool_pcpu_stats, tx_bytes) },
+	{ "cpu%u_vf_rx_packets",
+		offsetof(struct netvsc_ethtool_pcpu_stats, vf_rx_packets) },
+	{ "cpu%u_vf_rx_bytes",
+		offsetof(struct netvsc_ethtool_pcpu_stats, vf_rx_bytes) },
+	{ "cpu%u_vf_tx_packets",
+		offsetof(struct netvsc_ethtool_pcpu_stats, vf_tx_packets) },
+	{ "cpu%u_vf_tx_bytes",
+		offsetof(struct netvsc_ethtool_pcpu_stats, vf_tx_bytes) },
 }, vf_stats[] = {
 	{ "vf_rx_packets", offsetof(struct netvsc_vf_pcpu_stats, rx_packets) },
 	{ "vf_rx_bytes",   offsetof(struct netvsc_vf_pcpu_stats, rx_bytes) },
@@ -1213,6 +1290,9 @@ static const struct {
 #define NETVSC_GLOBAL_STATS_LEN	ARRAY_SIZE(netvsc_stats)
 #define NETVSC_VF_STATS_LEN	ARRAY_SIZE(vf_stats)
 
+/* statistics per queue (rx/tx packets/bytes) */
+#define NETVSC_PCPU_STATS_LEN (num_present_cpus() * ARRAY_SIZE(pcpu_stats))
+
 /* 4 statistics per queue (rx/tx packets/bytes) */
 #define NETVSC_QUEUE_STATS_LEN(dev) ((dev)->num_chn * 4)
 
@@ -1228,6 +1308,7 @@ static int netvsc_get_sset_count(struct net_device *dev, int string_set)
 	case ETH_SS_STATS:
 		return NETVSC_GLOBAL_STATS_LEN
 			+ NETVSC_VF_STATS_LEN
+			+ NETVSC_PCPU_STATS_LEN
 			+ NETVSC_QUEUE_STATS_LEN(nvdev);
 	default:
 		return -EINVAL;
@@ -1242,9 +1323,10 @@ static void netvsc_get_ethtool_stats(struct net_device *dev,
 	const void *nds = &ndc->eth_stats;
 	const struct netvsc_stats *qstats;
 	struct netvsc_vf_pcpu_stats sum;
+	struct netvsc_ethtool_pcpu_stats __percpu *pcpu_sum;
 	unsigned int start;
 	u64 packets, bytes;
-	int i, j;
+	int i, j, cpu;
 
 	if (!nvdev)
 		return;
@@ -1256,6 +1338,17 @@ static void netvsc_get_ethtool_stats(struct net_device *dev,
 	for (j = 0; j < NETVSC_VF_STATS_LEN; j++)
 		data[i++] = *(u64 *)((void *)&sum + vf_stats[j].offset);
 
+	pcpu_sum = alloc_percpu(struct netvsc_ethtool_pcpu_stats);
+	netvsc_get_pcpu_stats(dev, pcpu_sum);
+	for_each_present_cpu(cpu) {
+		struct netvsc_ethtool_pcpu_stats *this_sum =
+			per_cpu_ptr(pcpu_sum, cpu);
+		for (j = 0; j < ARRAY_SIZE(pcpu_stats); j++)
+			data[i++] = *(u64 *)((void *)this_sum
+					     + pcpu_stats[j].offset);
+	}
+	free_percpu(pcpu_sum);
+
 	for (j = 0; j < nvdev->num_chn; j++) {
 		qstats = &nvdev->chan_table[j].tx_stats;
 
@@ -1283,7 +1376,7 @@ static void netvsc_get_strings(struct net_device *dev, u32 stringset, u8 *data)
 	struct net_device_context *ndc = netdev_priv(dev);
 	struct netvsc_device *nvdev = rtnl_dereference(ndc->nvdev);
 	u8 *p = data;
-	int i;
+	int i, cpu;
 
 	if (!nvdev)
 		return;
@@ -1300,6 +1393,13 @@ static void netvsc_get_strings(struct net_device *dev, u32 stringset, u8 *data)
 			p += ETH_GSTRING_LEN;
 		}
 
+		for_each_present_cpu(cpu) {
+			for (i = 0; i < ARRAY_SIZE(pcpu_stats); i++) {
+				sprintf(p, pcpu_stats[i].name, cpu);
+				p += ETH_GSTRING_LEN;
+			}
+		}
+
 		for (i = 0; i < nvdev->num_chn; i++) {
 			sprintf(p, "tx_queue_%u_packets", i);
 			p += ETH_GSTRING_LEN;
-- 
2.7.4

^ permalink raw reply related

* Re: KASAN: slab-out-of-bounds Read in bpf_skb_vlan_push
From: syzbot @ 2018-06-13 18:49 UTC (permalink / raw)
  To: ast, daniel, davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <26c434ee-0a0a-fbba-282c-dabddfac652e@iogearbox.net>

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger  
crash:

Reported-and-tested-by:  
syzbot+76de61614cb1abdd73fc@syzkaller.appspotmail.com

Tested on:

commit:         be779f03d563 Merge tag 'kbuild-v4.18-2' of git://git.kerne..
git tree:       upstream
kernel config:  https://syzkaller.appspot.com/x/.config?x=68d8eba98e3f8e88
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Note: testing is done by a robot and is best-effort only.

^ permalink raw reply

* Re: [PATCH ethtool 1/6] ethtool: fix uninitialized return value
From: John W. Linville @ 2018-06-13 18:31 UTC (permalink / raw)
  To: Ivan Vecera; +Cc: netdev
In-Reply-To: <20180608092010.13041-1-cera@cera.cz>

On Fri, Jun 08, 2018 at 11:20:05AM +0200, Ivan Vecera wrote:
> Fixes: b0fe96d ("Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY downshift")
> Signed-off-by: Ivan Vecera <cera@cera.cz>

LGTM -- I have queued the series for the next release, including
extending the commit IDs...

Thanks!

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: KASAN: slab-out-of-bounds Read in bpf_skb_vlan_push
From: Daniel Borkmann @ 2018-06-13 18:42 UTC (permalink / raw)
  To: syzbot, ast, davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <0000000000000c58c2056e8a3a27@google.com>

On 06/13/2018 08:34 PM, syzbot wrote:
> Hello,
> 
> syzbot has tested the proposed patch and the reproducer did not trigger crash:
> 
> Reported-and-tested-by: syzbot+76de61614cb1abdd73fc@syzkaller.appspotmail.com
> 
> Tested on:
> 
> commit:         be779f03d563 Merge tag 'kbuild-v4.18-2' of git://git.kerne..
> git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/--
> kernel config:  https://syzkaller.appspot.com/x/.config?x=68d8eba98e3f8e88
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> 
> Note: testing is done by a robot and is best-effort only.

#syz fix: bpf: reject passing modified ctx to helper functions

^ permalink raw reply

* Re: KASAN: slab-out-of-bounds Read in bpf_skb_vlan_push
From: syzbot @ 2018-06-13 18:34 UTC (permalink / raw)
  To: ast, daniel, davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <b3af0f35-b0f8-859d-4f8f-b919d35ebaaa@iogearbox.net>

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger  
crash:

Reported-and-tested-by:  
syzbot+76de61614cb1abdd73fc@syzkaller.appspotmail.com

Tested on:

commit:         be779f03d563 Merge tag 'kbuild-v4.18-2' of git://git.kerne..
git tree:        
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/--
kernel config:  https://syzkaller.appspot.com/x/.config?x=68d8eba98e3f8e88
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Note: testing is done by a robot and is best-effort only.

^ permalink raw reply

* Re: KASAN: slab-out-of-bounds Read in bpf_skb_vlan_push
From: Daniel Borkmann @ 2018-06-13 18:15 UTC (permalink / raw)
  To: syzbot; +Cc: ast, davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000402477056e89f067@google.com>

On 06/13/2018 08:13 PM, syzbot wrote:
>> On 06/13/2018 06:17 PM, syzbot wrote:
>>> Hello,
> 
>>> syzbot found the following crash on:
> 
>>> HEAD commit:    75d4e704fa8d netdev-FAQ: clarify DaveM's position for stab..
>>> git tree:       bpf-next
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1754783f800000
>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=a601a80fec461d44
>>> dashboard link: https://syzkaller.appspot.com/bug?extid=76de61614cb1abdd73fc
>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=12c1e1bf800000
> 
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+76de61614cb1abdd73fc@syzkaller.appspotmail.com
> 
>>> IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
>>> IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
>>> 8021q: adding VLAN 0 to HW filter on device team0
>>> 8021q: adding VLAN 0 to HW filter on device team0
>>> ==================================================================
>>> BUG: KASAN: slab-out-of-bounds in skb_at_tc_ingress include/net/sch_generic.h:535 [inline]
>>> BUG: KASAN: slab-out-of-bounds in bpf_push_mac_rcsum net/core/filter.c:1625 [inline]
>>> BUG: KASAN: slab-out-of-bounds in ____bpf_skb_vlan_push net/core/filter.c:2446 [inline]
>>> BUG: KASAN: slab-out-of-bounds in bpf_skb_vlan_push+0x6b7/0x720 net/core/filter.c:2437
>>> Read of size 5 at addr ffff8801b77347d0 by task syz-executor6/6529
> 
>> Should be fixed already by:
> 
>>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=58990d1ff3f7896ee341030e9a7c2e4002570683
> 
> 
>> #syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
> want 2 args (repo, branch), got 1

Fair enough ... assumed default would have been master. ;-)

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

^ permalink raw reply

* Re: 4b66af2d("af_key: Always verify length of provided sadb_key")
From: Greg KH @ 2018-06-13 18:13 UTC (permalink / raw)
  To: Zubin Mithra; +Cc: linux-netdev, groeck, stable
In-Reply-To: <20180613180252.GA34929@zsmcros.c.googlers.com>

On Wed, Jun 13, 2018 at 02:02:54PM -0400, Zubin Mithra wrote:
> Hello,
> 
> Syzkaller has reported a crash here[1] for a slab OOB read in pfkey_add.
> 
> Could the following patch be applied to stable kernels for 4.14, 4.4, 3.18, 3.14, 3.10 and 3.8?
> 
> 4b66af2d("af_key: Always verify length of provided sadb_key")
> 
> [1] https://syzkaller.appspot.com/bug?id=26cb120b31cd24d984fc16da67f50fb375c432a7

Now queued up, thanks.

greg k-h

^ permalink raw reply

* Re: Re: KASAN: slab-out-of-bounds Read in bpf_skb_vlan_push
From: syzbot @ 2018-06-13 18:13 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: ast, daniel, davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <b3af0f35-b0f8-859d-4f8f-b919d35ebaaa@iogearbox.net>

> On 06/13/2018 06:17 PM, syzbot wrote:
>> Hello,

>> syzbot found the following crash on:

>> HEAD commit:    75d4e704fa8d netdev-FAQ: clarify DaveM's position for  
>> stab..
>> git tree:       bpf-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=1754783f800000
>> kernel config:   
>> https://syzkaller.appspot.com/x/.config?x=a601a80fec461d44
>> dashboard link:  
>> https://syzkaller.appspot.com/bug?extid=76de61614cb1abdd73fc
>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>> syzkaller  
>> repro:https://syzkaller.appspot.com/x/repro.syz?x=12c1e1bf800000

>> IMPORTANT: if you fix the bug, please add the following tag to the  
>> commit:
>> Reported-by: syzbot+76de61614cb1abdd73fc@syzkaller.appspotmail.com

>> IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
>> IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
>> 8021q: adding VLAN 0 to HW filter on device team0
>> 8021q: adding VLAN 0 to HW filter on device team0
>> ==================================================================
>> BUG: KASAN: slab-out-of-bounds in skb_at_tc_ingress  
>> include/net/sch_generic.h:535 [inline]
>> BUG: KASAN: slab-out-of-bounds in bpf_push_mac_rcsum  
>> net/core/filter.c:1625 [inline]
>> BUG: KASAN: slab-out-of-bounds in ____bpf_skb_vlan_push  
>> net/core/filter.c:2446 [inline]
>> BUG: KASAN: slab-out-of-bounds in bpf_skb_vlan_push+0x6b7/0x720  
>> net/core/filter.c:2437
>> Read of size 5 at addr ffff8801b77347d0 by task syz-executor6/6529

> Should be fixed already by:

>     
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=58990d1ff3f7896ee341030e9a7c2e4002570683


> #syz test:  
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

want 2 args (repo, branch), got 1


^ permalink raw reply

* Re: KASAN: slab-out-of-bounds Read in bpf_skb_vlan_push
From: Daniel Borkmann @ 2018-06-13 18:13 UTC (permalink / raw)
  To: syzbot, ast, davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000219be6056e8850fa@google.com>

On 06/13/2018 06:17 PM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    75d4e704fa8d netdev-FAQ: clarify DaveM's position for stab..
> git tree:       bpf-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1754783f800000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=a601a80fec461d44
> dashboard link: https://syzkaller.appspot.com/bug?extid=76de61614cb1abdd73fc
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=12c1e1bf800000
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+76de61614cb1abdd73fc@syzkaller.appspotmail.com
> 
> IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
> IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
> 8021q: adding VLAN 0 to HW filter on device team0
> 8021q: adding VLAN 0 to HW filter on device team0
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in skb_at_tc_ingress include/net/sch_generic.h:535 [inline]
> BUG: KASAN: slab-out-of-bounds in bpf_push_mac_rcsum net/core/filter.c:1625 [inline]
> BUG: KASAN: slab-out-of-bounds in ____bpf_skb_vlan_push net/core/filter.c:2446 [inline]
> BUG: KASAN: slab-out-of-bounds in bpf_skb_vlan_push+0x6b7/0x720 net/core/filter.c:2437
> Read of size 5 at addr ffff8801b77347d0 by task syz-executor6/6529

Should be fixed already by:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=58990d1ff3f7896ee341030e9a7c2e4002570683


#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

^ permalink raw reply

* 4b66af2d("af_key: Always verify length of provided sadb_key")
From: Zubin Mithra @ 2018-06-13 18:02 UTC (permalink / raw)
  To: linux-netdev; +Cc: groeck, stable

Hello,

Syzkaller has reported a crash here[1] for a slab OOB read in pfkey_add.

Could the following patch be applied to stable kernels for 4.14, 4.4, 3.18, 3.14, 3.10 and 3.8?

4b66af2d("af_key: Always verify length of provided sadb_key")

[1] https://syzkaller.appspot.com/bug?id=26cb120b31cd24d984fc16da67f50fb375c432a7

Thanks,
- Zubin

^ permalink raw reply

* Re: [bpf PATCH] bpf: selftest fix for sockmap
From: John Fastabend @ 2018-06-13 17:52 UTC (permalink / raw)
  To: Daniel Borkmann, ast; +Cc: netdev
In-Reply-To: <0e4b873d-7e2d-7683-c7cb-0a0112d123a5@iogearbox.net>

On 06/12/2018 05:31 PM, Daniel Borkmann wrote:
> On 06/11/2018 08:47 PM, John Fastabend wrote:
>> In selftest test_maps the sockmap test case attempts to add a socket
>> in listening state to the sockmap. This is no longer a valid operation
>> so it fails as expected. However, the test wrongly reports this as an
>> error now. Fix the test to avoid adding sockets in listening state.
>>
>> Fixes: 945ae430aa44 ("bpf: sockmap only allow ESTABLISHED sock state")
>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> 
> (fyi, discussed with John that this will be enrolled into the set of
>  fixes he has pending for bpf since the test is related to the one
>  restricting to ESTABLISHED state.)
> 

Patch part of this series now,

https://patchwork.ozlabs.org/project/netdev/list/?series=49988

^ permalink raw reply

* [bpf PATCH 6/6] bpf: selftest remove attempts to add LISTEN sockets to sockmap
From: John Fastabend @ 2018-06-13 17:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180613174202.4264.59970.stgit@john-Precision-Tower-5810>

In selftest test_maps the sockmap test case attempts to add a socket
in listening state to the sockmap. This is no longer a valid operation
so it fails as expected. However, the test wrongly reports this as an
error now. Fix the test to avoid adding sockets in listening state.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/testing/selftests/bpf/test_maps.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 6c25334..9fed5f0 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -564,7 +564,7 @@ static void test_sockmap(int tasks, void *data)
 	}
 
 	/* Test update without programs */
-	for (i = 0; i < 6; i++) {
+	for (i = 2; i < 6; i++) {
 		err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
 		if (err) {
 			printf("Failed noprog update sockmap '%i:%i'\n",
@@ -727,7 +727,7 @@ static void test_sockmap(int tasks, void *data)
 	}
 
 	/* Test map update elem afterwards fd lives in fd and map_fd */
-	for (i = 0; i < 6; i++) {
+	for (i = 2; i < 6; i++) {
 		err = bpf_map_update_elem(map_fd_rx, &i, &sfd[i], BPF_ANY);
 		if (err) {
 			printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",

^ permalink raw reply related

* [bpf PATCH 5/6] bpf: sockhash, add release routine
From: John Fastabend @ 2018-06-13 17:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180613174202.4264.59970.stgit@john-Precision-Tower-5810>

Add map_release_uref pointer to hashmap ops. This was dropped when
original sockhash code was ported into bpf-next before initial
commit.

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 91c7b47..57fa2fa 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2516,6 +2516,7 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
 	.map_get_next_key = sock_hash_get_next_key,
 	.map_update_elem = sock_hash_update_elem,
 	.map_delete_elem = sock_hash_delete_elem,
+	.map_release_uref = sock_map_release,
 };
 
 static bool bpf_is_valid_sock(struct bpf_sock_ops_kern *ops)

^ permalink raw reply related

* [bpf PATCH 4/6] bpf: sockmap, tcp_disconnect to listen transition
From: John Fastabend @ 2018-06-13 17:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180613174202.4264.59970.stgit@john-Precision-Tower-5810>

After adding checks to ensure TCP is in ESTABLISHED state when a
sock is added we need to also ensure that user does not transition
through tcp_disconnect() and back into ESTABLISHED state without
sockmap removing the sock.

To do this add unhash hook and remove sock from map there.

Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   67 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 50 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 2e848cd..91c7b47 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -130,6 +130,7 @@ struct smap_psock {
 
 	struct proto *sk_proto;
 	void (*save_close)(struct sock *sk, long timeout);
+	void (*save_unhash)(struct sock *sk);
 	void (*save_data_ready)(struct sock *sk);
 	void (*save_write_space)(struct sock *sk);
 };
@@ -141,6 +142,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
 			    int offset, size_t size, int flags);
 static void bpf_tcp_close(struct sock *sk, long timeout);
+static void bpf_tcp_unhash(struct sock *sk);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
 {
@@ -182,6 +184,7 @@ static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
 {
 	prot[SOCKMAP_BASE]			= *base;
 	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
+	prot[SOCKMAP_BASE].unhash		= bpf_tcp_unhash;
 	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
 	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
 
@@ -215,6 +218,7 @@ static int bpf_tcp_init(struct sock *sk)
 	}
 
 	psock->save_close = sk->sk_prot->close;
+	psock->save_unhash = sk->sk_prot->unhash;
 	psock->sk_proto = sk->sk_prot;
 
 	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
@@ -302,28 +306,12 @@ struct smap_psock_map_entry *psock_map_pop(struct sock *sk,
 	return e;
 }
 
-static void bpf_tcp_close(struct sock *sk, long timeout)
+static void bpf_tcp_remove(struct sock *sk, struct smap_psock *psock)
 {
-	void (*close_fun)(struct sock *sk, long timeout);
 	struct smap_psock_map_entry *e;
 	struct sk_msg_buff *md, *mtmp;
-	struct smap_psock *psock;
 	struct sock *osk;
 
-	rcu_read_lock();
-	psock = smap_psock_sk(sk);
-	if (unlikely(!psock)) {
-		rcu_read_unlock();
-		return sk->sk_prot->close(sk, timeout);
-	}
-
-	/* The psock may be destroyed anytime after exiting the RCU critial
-	 * section so by the time we use close_fun the psock may no longer
-	 * be valid. However, bpf_tcp_close is called with the sock lock
-	 * held so the close hook and sk are still valid.
-	 */
-	close_fun = psock->save_close;
-
 	if (psock->cork) {
 		free_start_sg(psock->sock, psock->cork);
 		kfree(psock->cork);
@@ -378,6 +366,51 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 		}
 		e = psock_map_pop(sk, psock);
 	}
+}
+
+static void bpf_tcp_unhash(struct sock *sk)
+{
+	void (*unhash_fun)(struct sock *sk);
+	struct smap_psock *psock;
+
+	rcu_read_lock();
+	psock = smap_psock_sk(sk);
+	if (unlikely(!psock)) {
+		rcu_read_unlock();
+		return sk->sk_prot->unhash(sk);
+	}
+
+	/* The psock may be destroyed anytime after exiting the RCU critial
+	 * section so by the time we use close_fun the psock may no longer
+	 * be valid. However, bpf_tcp_close is called with the sock lock
+	 * held so the close hook and sk are still valid.
+	 */
+	unhash_fun = psock->save_unhash;
+	bpf_tcp_remove(sk, psock);
+	rcu_read_unlock();
+	unhash_fun(sk);
+
+}
+
+static void bpf_tcp_close(struct sock *sk, long timeout)
+{
+	void (*close_fun)(struct sock *sk, long timeout);
+	struct smap_psock *psock;
+
+	rcu_read_lock();
+	psock = smap_psock_sk(sk);
+	if (unlikely(!psock)) {
+		rcu_read_unlock();
+		return sk->sk_prot->close(sk, timeout);
+	}
+
+	/* The psock may be destroyed anytime after exiting the RCU critial
+	 * section so by the time we use close_fun the psock may no longer
+	 * be valid. However, bpf_tcp_close is called with the sock lock
+	 * held so the close hook and sk are still valid.
+	 */
+	close_fun = psock->save_close;
+	bpf_tcp_remove(sk, psock);
 	rcu_read_unlock();
 	close_fun(sk, timeout);
 }

^ permalink raw reply related

* [bpf PATCH 3/6] bpf: sockhash fix omitted bucket lock in sock_close
From: John Fastabend @ 2018-06-13 17:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180613174202.4264.59970.stgit@john-Precision-Tower-5810>

First in tcp_close, reduce scope of sk_callback_lock() the lock is
only needed for protecting smap_release_sock() the ingress and cork
lists are protected by sock lock. Having the lock in wider scope is
harmless but may confuse the reader who may infer it is in fact
needed.

Next, in sock_hash_delete_elem() the pattern is as follows,

  sock_hash_delete_elem()
     [...]
     spin_lock(bucket_lock)
     l = lookup_elem_raw()
     if (l)
        hlist_del_rcu()
        write_lock(sk_callback_lock)
         .... destroy psock ...
        write_unlock(sk_callback_lock)
     spin_unlock(bucket_lock)

The ordering is necessary because we only know the {p}sock after
dereferencing the hash table which we can't do unless we have the
bucket lock held. Once we have the bucket lock and the psock element
it is deleted from the hashmap to ensure any other path doing a lookup
will fail. Finally, the refcnt is decremented and if zero the psock
is destroyed.

In parallel with the above (or free'ing the map) a tcp close event
may trigger tcp_close(). Which at the moment omits the bucket lock
altogether (oops!) where the flow looks like this,

  bpf_tcp_close()
     [...]
     write_lock(sk_callback_lock)
     for each psock->maps // list of maps this sock is part of
         hlist_del_rcu(ref_hash_node);
         .... destroy psock ...
     write_unlock(sk_callback_lock)

Obviously, and demonstrated by syzbot, this is broken because
we can have multiple threads deleting entries via hlist_del_rcu().

To fix this we might be tempted to wrap the hlist operation in a
bucket lock but that would create a lock inversion problem. In
summary to follow locking rules maps needs the sk_callback_lock but we
need the bucket lock to do the hlist_del_rcu. To resolve the lock
inversion problem note that when bpf_tcp_close is called no updates
can happen in parallel, due to ESTABLISH state check in update logic,
so pop the head of the list repeatedly and remove the reference until
no more are left. If a delete happens in parallel from the BPF API
that is OK as well because it will do a similar action, lookup the
sock in the map/hash, delete it from the map/hash, and dec the refcnt.
We check for this case before doing a destroy on the psock to ensure
we don't have two threads tearing down a psock. The new logic is
as follows,

  bpf_tcp_close()
  e = psock_map_pop(psock->maps) // done with sk_callback_lock
  bucket_lock() // lock hash list bucket
  l = lookup_elem_raw(head, hash, key, key_size);
  if (l) {
     //only get here if elmnt was not already removed
     hlist_del_rcu()
     ... destroy psock...
  }
  bucket_unlock()

And finally for all the above to work add missing sk_callback_lock
around smap_list_remove in sock_hash_ctx_update_elem(). Otherwise
delete and update may corrupt maps list.

(As an aside the sk_callback_lock serves two purposes. The
 first, is to update the sock callbacks sk_data_ready, sk_write_space,
 etc. The second is to protect the psock 'maps' list. The 'maps' list
 is used to (as shown above) to delete all map/hash references to a
 sock when the sock is closed)

(If we did not have the ESTABLISHED state guarantee from tcp_close
 then we could not ensure completion because updates could happen
 forever and pin thread in delete loop.)

Reported-by: syzbot+0ce137753c78f7b6acc1@syzkaller.appspotmail.com
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |  112 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 80 insertions(+), 32 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index e5bab52..2e848cd 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -258,16 +258,54 @@ static void bpf_tcp_release(struct sock *sk)
 	rcu_read_unlock();
 }
 
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
+					 u32 hash, void *key, u32 key_size)
+{
+	struct htab_elem *l;
+
+	hlist_for_each_entry_rcu(l, head, hash_node) {
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+	}
+
+	return NULL;
+}
+
+static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &__select_bucket(htab, hash)->head;
+}
+
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	atomic_dec(&htab->count);
 	kfree_rcu(l, rcu);
 }
 
+struct smap_psock_map_entry *psock_map_pop(struct sock *sk,
+					   struct smap_psock *psock)
+{
+	struct smap_psock_map_entry *e;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	e = list_first_entry_or_null(&psock->maps,
+				     struct smap_psock_map_entry,
+				     list);
+	if (e)
+		list_del(&e->list);
+	write_unlock_bh(&sk->sk_callback_lock);
+	return e;
+}
+
 static void bpf_tcp_close(struct sock *sk, long timeout)
 {
 	void (*close_fun)(struct sock *sk, long timeout);
-	struct smap_psock_map_entry *e, *tmp;
+	struct smap_psock_map_entry *e;
 	struct sk_msg_buff *md, *mtmp;
 	struct smap_psock *psock;
 	struct sock *osk;
@@ -286,7 +324,6 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 	 */
 	close_fun = psock->save_close;
 
-	write_lock_bh(&sk->sk_callback_lock);
 	if (psock->cork) {
 		free_start_sg(psock->sock, psock->cork);
 		kfree(psock->cork);
@@ -299,20 +336,48 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
 		kfree(md);
 	}
 
-	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
+	/* Sock is in TCP_CLOSE state so any concurrent adds or updates will be
+	 * blocked by ESTABLISHED check. However, tcp_close() + delete + free
+	 * can all run at the same time. If a tcp_close + delete happens each
+	 * code path will remove the entry for the map/hash before deleting it.
+	 * In the map case a xchg and then check to verify we have a sk protects
+	 * two paths from tearing down the same object. For hash map we lock the
+	 * bucket and remove the object from the hash map before destroying to
+	 * ensure that only one reference exists. By pulling object off the head
+	 * of the list with (with sk_callback_lock) if multiple deleters are
+	 * running we avoid duplicate references.
+	 */
+	e = psock_map_pop(sk, psock);
+	while (e) {
 		if (e->entry) {
 			osk = cmpxchg(e->entry, sk, NULL);
 			if (osk == sk) {
-				list_del(&e->list);
 				smap_release_sock(psock, sk);
 			}
 		} else {
-			hlist_del_rcu(&e->hash_link->hash_node);
-			smap_release_sock(psock, e->hash_link->sk);
-			free_htab_elem(e->htab, e->hash_link);
+			struct htab_elem *link = e->hash_link;
+			struct hlist_head *head;
+			struct htab_elem *l;
+			struct bucket *b;
+
+			b = __select_bucket(e->htab, link->hash);
+			head = &b->head;
+			raw_spin_lock_bh(&b->lock);
+			l = lookup_elem_raw(head,
+					    link->hash, link->key,
+					    e->htab->elem_size);
+			/* If another thread deleted this object skip deletion.
+			 * The refcnt on psock may or may not be zero.
+			 */
+			if (l) {
+				hlist_del_rcu(&e->hash_link->hash_node);
+				smap_release_sock(psock, e->hash_link->sk);
+				free_htab_elem(e->htab, e->hash_link);
+			}
+			raw_spin_unlock_bh(&b->lock);
 		}
+		e = psock_map_pop(sk, psock);
 	}
-	write_unlock_bh(&sk->sk_callback_lock);
 	rcu_read_unlock();
 	close_fun(sk, timeout);
 }
@@ -2085,16 +2150,6 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
-{
-	return &htab->buckets[hash & (htab->n_buckets - 1)];
-}
-
-static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
-{
-	return &__select_bucket(htab, hash)->head;
-}
-
 static void sock_hash_free(struct bpf_map *map)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
@@ -2111,10 +2166,13 @@ static void sock_hash_free(struct bpf_map *map)
 	 */
 	rcu_read_lock();
 	for (i = 0; i < htab->n_buckets; i++) {
-		struct hlist_head *head = select_bucket(htab, i);
+		struct bucket *b = __select_bucket(htab, i);
+		struct hlist_head *head;
 		struct hlist_node *n;
 		struct htab_elem *l;
 
+		raw_spin_lock_bh(&b->lock);
+		head = &b->head;
 		hlist_for_each_entry_safe(l, n, head, hash_node) {
 			struct sock *sock = l->sk;
 			struct smap_psock *psock;
@@ -2134,6 +2192,7 @@ static void sock_hash_free(struct bpf_map *map)
 			write_unlock_bh(&sock->sk_callback_lock);
 			kfree(l);
 		}
+		raw_spin_unlock_bh(&b->lock);
 	}
 	rcu_read_unlock();
 	bpf_map_area_free(htab->buckets);
@@ -2164,19 +2223,6 @@ static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
 	return l_new;
 }
 
-static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
-					 u32 hash, void *key, u32 key_size)
-{
-	struct htab_elem *l;
-
-	hlist_for_each_entry_rcu(l, head, hash_node) {
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
-			return l;
-	}
-
-	return NULL;
-}
-
 static inline u32 htab_map_hash(const void *key, u32 key_len)
 {
 	return jhash(key, key_len, 0);
@@ -2308,8 +2354,10 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		psock = smap_psock_sk(l_old->sk);
 
 		hlist_del_rcu(&l_old->hash_node);
+		write_lock_bh(&l_old->sk->sk_callback_lock);
 		smap_list_remove(psock, NULL, l_old);
 		smap_release_sock(psock, l_old->sk);
+		write_unlock_bh(&l_old->sk->sk_callback_lock);
 		free_htab_elem(htab, l_old);
 	}
 	raw_spin_unlock_bh(&b->lock);

^ permalink raw reply related

* [bpf PATCH 2/6] bpf: sockmap only allow ESTABLISHED sock state
From: John Fastabend @ 2018-06-13 17:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180613174202.4264.59970.stgit@john-Precision-Tower-5810>

Per the note in the TLS ULP (which is actually a generic statement
regarding ULPs)

 /* The TLS ulp is currently supported only for TCP sockets
  * in ESTABLISHED state.
  * Supporting sockets in LISTEN state will require us
  * to modify the accept implementation to clone rather then
  * share the ulp context.
  */

After this patch we only allow socks that are in ESTABLISHED state or
are being added via a sock_ops event that is transitioning into an
ESTABLISHED state. By allowing sock_ops events we allow users to
manage sockmaps directly from sock ops programs. The two supported
sock_ops ops are BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB and
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB.

Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'.

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index f6dd4cd..e5bab52 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1976,8 +1976,12 @@ static int sock_map_update_elem(struct bpf_map *map,
 		return -EINVAL;
 	}
 
+	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
+	 * state.
+	 */
 	if (skops.sk->sk_type != SOCK_STREAM ||
-	    skops.sk->sk_protocol != IPPROTO_TCP) {
+	    skops.sk->sk_protocol != IPPROTO_TCP ||
+	    skops.sk->sk_state != TCP_ESTABLISHED) {
 		fput(socket->file);
 		return -EOPNOTSUPP;
 	}
@@ -2338,6 +2342,16 @@ static int sock_hash_update_elem(struct bpf_map *map,
 		return -EINVAL;
 	}
 
+	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
+	 * state.
+	 */
+	if (skops.sk->sk_type != SOCK_STREAM ||
+	    skops.sk->sk_protocol != IPPROTO_TCP ||
+	    skops.sk->sk_state != TCP_ESTABLISHED) {
+		fput(socket->file);
+		return -EOPNOTSUPP;
+	}
+
 	err = sock_hash_ctx_update_elem(&skops, map, key, flags);
 	fput(socket->file);
 	return err;
@@ -2423,10 +2437,19 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
 	.map_delete_elem = sock_hash_delete_elem,
 };
 
+static bool bpf_is_valid_sock(struct bpf_sock_ops_kern *ops)
+{
+	return ops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB ||
+	       ops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
+}
+
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
 	   struct bpf_map *, map, void *, key, u64, flags)
 {
 	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (!bpf_is_valid_sock(bpf_sock))
+		return -EOPNOTSUPP;
 	return sock_map_ctx_update_elem(bpf_sock, map, key, flags);
 }
 
@@ -2445,6 +2468,9 @@ struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
 	   struct bpf_map *, map, void *, key, u64, flags)
 {
 	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (!bpf_is_valid_sock(bpf_sock))
+		return -EOPNOTSUPP;
 	return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
 }
 

^ permalink raw reply related

* [bpf PATCH 1/6] bpf: sockmap, fix crash when ipv6 sock is added
From: John Fastabend @ 2018-06-13 17:49 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev
In-Reply-To: <20180613174202.4264.59970.stgit@john-Precision-Tower-5810>

This fixes a crash where we assign tcp_prot to IPv6 sockets instead
of tcpv6_prot.

Previously we overwrote the sk->prot field with tcp_prot even in the
AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
are used. Further, only allow ESTABLISHED connections to join the
map per note in TLS ULP,

   /* The TLS ulp is currently supported only for TCP sockets
    * in ESTABLISHED state.
    * Supporting sockets in LISTEN state will require us
    * to modify the accept implementation to clone rather then
    * share the ulp context.
    */

Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
crashing case here.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 kernel/bpf/sockmap.c |   58 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 48 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 52a91d8..f6dd4cd 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -140,6 +140,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
 static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
 			    int offset, size_t size, int flags);
+static void bpf_tcp_close(struct sock *sk, long timeout);
 
 static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
 {
@@ -161,7 +162,42 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
 	return !empty;
 }
 
-static struct proto tcp_bpf_proto;
+enum {
+	SOCKMAP_IPV4,
+	SOCKMAP_IPV6,
+	SOCKMAP_NUM_PROTS,
+};
+
+enum {
+	SOCKMAP_BASE,
+	SOCKMAP_TX,
+	SOCKMAP_NUM_CONFIGS,
+};
+
+static struct proto *saved_tcpv6_prot;
+static DEFINE_MUTEX(tcpv6_prot_mutex);
+static struct proto bpf_tcp_prots[SOCKMAP_NUM_PROTS][SOCKMAP_NUM_CONFIGS];
+static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
+			 struct proto *base)
+{
+	prot[SOCKMAP_BASE]			= *base;
+	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
+	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
+	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
+
+	prot[SOCKMAP_TX]			= prot[SOCKMAP_BASE];
+	prot[SOCKMAP_TX].sendmsg		= bpf_tcp_sendmsg;
+	prot[SOCKMAP_TX].sendpage		= bpf_tcp_sendpage;
+}
+
+static void update_sk_prot(struct sock *sk, struct smap_psock *psock)
+{
+	int family = sk->sk_family == AF_INET6 ? SOCKMAP_IPV6 : SOCKMAP_IPV4;
+	int conf = psock->bpf_tx_msg ? SOCKMAP_TX : SOCKMAP_BASE;
+
+	sk->sk_prot = &bpf_tcp_prots[family][conf];
+}
+
 static int bpf_tcp_init(struct sock *sk)
 {
 	struct smap_psock *psock;
@@ -181,14 +217,17 @@ static int bpf_tcp_init(struct sock *sk)
 	psock->save_close = sk->sk_prot->close;
 	psock->sk_proto = sk->sk_prot;
 
-	if (psock->bpf_tx_msg) {
-		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
-		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
-		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
-		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
+	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
+	if (sk->sk_family == AF_INET6 &&
+	    unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) {
+		mutex_lock(&tcpv6_prot_mutex);
+		if (likely(sk->sk_prot != saved_tcpv6_prot)) {
+			build_protos(bpf_tcp_prots[SOCKMAP_IPV6], sk->sk_prot);
+			smp_store_release(&saved_tcpv6_prot, sk->sk_prot);
+		}
+		mutex_unlock(&tcpv6_prot_mutex);
 	}
-
-	sk->sk_prot = &tcp_bpf_proto;
+	update_sk_prot(sk, psock);
 	rcu_read_unlock();
 	return 0;
 }
@@ -1111,8 +1150,7 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
 
 static int bpf_tcp_ulp_register(void)
 {
-	tcp_bpf_proto = tcp_prot;
-	tcp_bpf_proto.close = bpf_tcp_close;
+	build_protos(bpf_tcp_prots[SOCKMAP_IPV4], &tcp_prot);
 	/* Once BPF TX ULP is registered it is never unregistered. It
 	 * will be in the ULP list for the lifetime of the system. Doing
 	 * duplicate registers is not a problem.

^ permalink raw reply related

* [bpf PATCH 0/6] BPF fixes for sockhash
From: John Fastabend @ 2018-06-13 17:49 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev

This addresses two syzbot issues that lead to identifing (by Eric and
Wei) a class of bugs where we don't correctly check for IPv4/v6
sockets and their associated state. The second issue was a locking
error in sockhash.

The first 2 patches address handling IPv4 correctly and then ensuring
that only sockets in ESTABLISHED state can be added. There is then a
follow up fix (patch4) to fix the other issue Eric noted, namely that
we depend on sockets to call tcp_close to remove them from the map.
However, we missed that a socket can transition through
tcp_disconnect() and never call tcp_close() missing our hook. To
resolve this implement the unhash hook which is also called from the
tcp_disconnect() flow.

The other issue syzbot found that the tcp_close() handler missed
locking the hash bucket lock which could result in corrupting the
sockhash bucket list if delete and close ran at the same time. To
fix this we had to restructure the tcp_close() lock handling. This is
done in patch 3.

Finally, during review I noticed the release handler was ommitted
from the upstream code (patch 5) due to an incorrect merge conflict
fix when I ported the code to latest bpf-next before submitting. And
then patch 6 fixes up selftests for the above.

The tcp_disconnect() catch also appears to be missing in kTLS so
a follow up patch will need to address that as well.

---

John Fastabend (6):
      bpf: sockmap, fix crash when ipv6 sock is added
      bpf: sockmap only allow ESTABLISHED sock state
      bpf: sockhash fix omitted bucket lock in sock_close
      bpf: sockmap, tcp_disconnect to listen transition
      bpf: sockhash, add release routine
      bpf: selftest remove attempts to add LISTEN sockets to sockmap

 kernel/bpf/sockmap.c                    |  266 ++++++++++++++++++++++++-------
 tools/testing/selftests/bpf/test_maps.c |    4 
 2 files changed, 208 insertions(+), 62 deletions(-)

^ permalink raw reply

* Re: [RFC PATCH RESEND] tcp: avoid F-RTO if SACK and timestamps are disabled
From: Eric Dumazet @ 2018-06-13 17:48 UTC (permalink / raw)
  To: Yuchung Cheng, Michal Kubecek
  Cc: netdev, Eric Dumazet, Ilpo Jarvinen, linux-kernel
In-Reply-To: <CAK6E8=eCOLU9AX0+bSrOg_UYBm1mFxrGT=ybksba9B0OUfp7jg@mail.gmail.com>



On 06/13/2018 10:32 AM, Yuchung Cheng wrote:
> On Wed, Jun 13, 2018 at 9:55 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>
>> When F-RTO algorithm (RFC 5682) is used on connection without both SACK and
>> timestamps (either because of (mis)configuration or because the other
>> endpoint does not advertise them), specific pattern loss can make RTO grow
>> exponentially until the sender is only able to send one packet per two
>> minutes (TCP_RTO_MAX).

>> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> 
> Thanks for the patch (and packedrill test)! 

Yes, thanks a lot Michal.

Signed-off-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [RFC PATCH 02/12] xen/manage: introduce helper function to know the on-going suspend mode
From: Balbir Singh @ 2018-06-13 17:41 UTC (permalink / raw)
  To: Anchal Agarwal
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT), boris.ostrovsky,
	konrad.wilk, roger.pau, netdev, jgross, xen-devel,
	linux-kernel@vger.kernel.org, kamatam, Frank van der Linden,
	vallish, guruanb, eduval, Rafael J. Wysocki, Pavel Machek,
	Len Brown, linux-pm, cyberax
In-Reply-To: <20180612205619.28156-3-anchalag@amazon.com>

On Wed, Jun 13, 2018 at 6:56 AM, Anchal Agarwal <anchalag@amazon.com> wrote:
> From: Munehisa Kamata <kamatam@amazon.com>
>
> Introduce simple functions which help to know the on-going suspend mode
> so that other Xen-related code can behave differently according to the
> current suspend mode.

I'd squash this patch with the previous, the previous one just left
too many open questions about how suspend mode is used. Looks like
suspend mode is a state, but I am not sure if valid transitions are
defined/checked?

Balbir Singh.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox