Netdev List
 help / color / mirror / Atom feed
* Re: [net] Revert "net: core: maybe return -EEXIST in __dev_alloc_name"
From: Michael Ellerman @ 2017-12-22  4:22 UTC (permalink / raw)
  To: Rasmus Villemoes, Michael Ellerman
  Cc: Jouni Malinen, netdev@vger.kernel.org, Johannes Berg,
	linuxppc-dev@lists.ozlabs.org, Johannes Berg
In-Reply-To: <87efnpnf8b.fsf@rasmusvillemoes.dk>

Rasmus Villemoes <linux@rasmusvillemoes.dk> writes:
> On Tue, Dec 19 2017, Michael Ellerman <michael@concordia.ellerman.id.au> wrote:
>>> From: Johannes Berg <johannes.berg@intel.com>
>>> 
>>> This reverts commit d6f295e9def0; some userspace (in the case
>>
>> This revert seems to have broken networking on one of my powerpc
>> machines, according to git bisect.
>>
>> The symptom is DHCP fails and I don't get a link, I didn't dig any
>> further than that. I can if it's helpful.
>>
>> I think the problem is that 87c320e51519 ("net: core: dev_get_valid_name
>> is now the same as dev_alloc_name_ns") only makes sense while
>> d6f295e9def0 remains in the tree.
>
> I'm sorry about all of this, I really didn't think there would be such
> consequences of changing an errno return. Indeed, d6f29 was preparation
> for unifying the two functions that do the exact same thing (and how we
> ever got into that situation is somewhat unclear), except for
> their behaviour in the case the requested name already exists. So one of
> the two interfaces had to change its return value, and as I wrote, I
> thought EEXIST was the saner choice when an explicit name (no %d) had
> been requested.

No worries.

>> ie. before the entire series, dev_get_valid_name() would return EEXIST,
>> and that was retained when 87c320e51519 was merged, but now that
>> d6f295e9def0 has been reverted dev_get_valid_name() is returning ENFILE.
>>
>> I can get the network up again if I also revert 87c320e51519 ("net:
>> core: dev_get_valid_name is now the same as dev_alloc_name_ns"), or with
>> the gross patch below.
>
> I don't think changing -ENFILE to -EEXIST would be right either, since
> dev_get_valid_name() used to be able to return both (-EEXIST in the case
> where there's no %d, -ENFILE in the case where we end up calling
> dev_alloc_name_ns()). If anything, we could do the check for the old
> -EEXIST condition first, and then call dev_alloc_name_ns(). But I'm also
> fine with reverting.

Yeah I think a revert would be best, given it's nearly rc5.

My userspace is not exotic AFAIK, just debian something, so presumably
this will affect other people too.

cheers

^ permalink raw reply

* Re: BUG: unable to handle kernel NULL pointer dereference in sctp_stream_free
From: Xin Long @ 2017-12-22  5:31 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: syzbot, davem, LKML, linux-sctp, network dev, Neil Horman,
	syzkaller-bugs, Vlad Yasevich
In-Reply-To: <20171221131333.GO6122@localhost.localdomain>

On Thu, Dec 21, 2017 at 9:13 PM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Wed, Dec 20, 2017 at 12:51:01PM -0800, syzbot wrote:
>
> from the log:
> [   89.451366] FAULT_INJECTION: forcing a failure.^M
> [   89.451366] name failslab, interval 1, probability 0, space 0,
> times 0^M
> [   89.451374] CPU: 0 PID: 17287 Comm: syz-executor2 Not tainted
> +4.15.0-rc3-next-20171214+ #67^M
> [   89.451377] Hardware name: Google Google Compute Engine/Google
> Compute Engine, BIOS
> +Google 01/01/2011^M
> [   89.451380] Call Trace:^M
> [   89.451395]  dump_stack+0xe9/0x14b^M
> [   89.451408]  should_fail+0x1e5/0x220^M
> [   89.451419]  should_failslab+0x73/0x90^M
> [   89.451428]  __kmalloc+0x63/0x730^M
> [   89.451439]  ? rcu_read_lock_sched_held+0x74/0x80^M
> [   89.451446]  ? __kmalloc+0x4ac/0x730^M
> [   89.451452]  ? sctp_stream_alloc_in+0x2f/0x100^M
> [   89.451464]  sctp_stream_alloc_in+0x2f/0x100^M
> [   89.451473]  sctp_stream_init+0xfa/0x140^M
> [   89.451485]  sctp_process_init+0x676/0xc50^M
>
> this is what caused the panic later, because in the error path we free
> out but don't zero outcnt. This patch should fix it. Can you please
> try it? Thanks
>
> ----8<---
>
> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> index 06b644dd858c..50ab09029f00 100644
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -184,6 +184,7 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
>         sched->free(stream);
>         kfree(stream->out);
>         stream->out = NULL;
> +       stream->outcnt = 0;
>  out:
>         return ret;
>  }

In case it can't be verified due to no reproducer yet, I modified some
code in sctp_stream_init() to confirm Marcelo's deduction:
-       i = sctp_stream_alloc_in(stream, incnt, gfp);
+       i = 1;
        if (i) {
                ret = -ENOMEM;
                goto free;

And got the same call trace as the mail:

[  301.488065] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000008
[  301.488618] IP: sctp_stream_free+0x2c/0x60 [sctp]
[  301.488928] PGD 59a3b067 P4D 59a3b067 PUD 5994e067 PMD 0
[  301.489372] Oops: 0000 [#1] SMP
[...]
[  301.497647] Call Trace:
[  301.497812]  <IRQ>
[  301.497955]  sctp_association_free+0xb8/0x210 [sctp]
[  301.498306]  sctp_sf_do_5_1B_init+0x1c4/0x360 [sctp]
[  301.498654]  sctp_do_sm+0x9a/0x2d0 [sctp]
[  301.498921]  ? sctp_has_association+0x130/0x130 [sctp]
[  301.499301]  ? kernel_text_address+0xba/0xe0
[  301.499615]  ? check_usage_backwards+0x88/0x150
[  301.499911]  ? __lock_acquire+0x280/0x1080
[  301.500200]  ? sctp_endpoint_lookup_assoc+0x95/0x140 [sctp]
[  301.500593]  sctp_endpoint_bh_rcv+0x11e/0x220 [sctp]
[  301.500923]  sctp_rcv+0x9f5/0xbe0 [sctp]

And Marcelo's patch could fix it.

Since the "free:" part only works for if (i), maybe the patch can also do:
        if (i) {
                sched->free(stream);
                kfree(stream->out);
                stream->out = NULL;
                stream->outcnt = 0;

                ret = -ENOMEM;
                goto out;
        }

and remove the "free:" path.

^ permalink raw reply

* Re: [PATCH net] RDS: Check cmsg_len before dereferencing CMSG_DATA
From: santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA @ 2017-12-22  6:58 UTC (permalink / raw)
  To: Avinash Repaka, David S. Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	rds-devel-N0ozoZBvEnrZJqsBc5GL+g,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1513916224-9445-1-git-send-email-avinash.repaka-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

On 12/21/17 8:17 PM, Avinash Repaka wrote:
> RDS currently doesn't check if the length of the control message is
> large enough to hold the required data, before dereferencing the control
> message data. This results in following crash:
> 
> BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013
> [inline]
> BUG: KASAN: stack-out-of-bounds in rds_sendmsg+0x1f02/0x1f90
> net/rds/send.c:1066
> Read of size 8 at addr ffff8801c928fb70 by task syzkaller455006/3157
> 
> CPU: 0 PID: 3157 Comm: syzkaller455006 Not tainted 4.15.0-rc3+ #161
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:17 [inline]
>   dump_stack+0x194/0x257 lib/dump_stack.c:53
>   print_address_description+0x73/0x250 mm/kasan/report.c:252
>   kasan_report_error mm/kasan/report.c:351 [inline]
>   kasan_report+0x25b/0x340 mm/kasan/report.c:409
>   __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>   rds_rdma_bytes net/rds/send.c:1013 [inline]
>   rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066
>   sock_sendmsg_nosec net/socket.c:628 [inline]
>   sock_sendmsg+0xca/0x110 net/socket.c:638
>   ___sys_sendmsg+0x320/0x8b0 net/socket.c:2018
>   __sys_sendmmsg+0x1ee/0x620 net/socket.c:2108
>   SYSC_sendmmsg net/socket.c:2139 [inline]
>   SyS_sendmmsg+0x35/0x60 net/socket.c:2134
>   entry_SYSCALL_64_fastpath+0x1f/0x96
> RIP: 0033:0x43fe49
> RSP: 002b:00007fffbe244ad8 EFLAGS: 00000217 ORIG_RAX: 0000000000000133
> RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fe49
> RDX: 0000000000000001 RSI: 000000002020c000 RDI: 0000000000000003
> RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004017b0
> R13: 0000000000401840 R14: 0000000000000000 R15: 0000000000000000
> 
> To fix this, we verify that the cmsg_len is large enough to hold the
> data to be read, before proceeding further.
> 
> Reported-by: syzbot <syzkaller-bugs-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org>
> Signed-off-by: Avinash Repaka <avinash.repaka-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> ---
Thanks !!

Acked-by: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* suspicious RCU usage in net/wireless/util.c:778
From: Dominik Brodowski @ 2017-12-22  7:20 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 2137 bytes --]

Dear all,

once the (wifi) link becomes ready, the following warning is emitted on
mainline (v4.15-rc4-202-gead68f216110) on my notebook:

[   22.770422] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready

[   22.772364] =============================
[   22.772369] WARNING: suspicious RCU usage
[   22.772375] 4.15.0-rc4+ #5 Not tainted
[   22.772380] -----------------------------
[   22.772386] /home/brodo/local/kernel/git/linux/net/wireless/util.c:778 suspicious rcu_dereference_check() usage!
[   22.772391] 
[   22.772397] 
[   22.772402] 4 locks held by wpa_supplicant/774:
[   22.772407]  #0:  (cb_lock){++++}, at: [<00000000276dc3a0>] genl_rcv+0x15/0x40
[   22.772437]  #1:  (genl_mutex){+.+.}, at: [<0000000024d83eb3>] genl_rcv_msg+0x7a/0x90
[   22.772463]  #2:  (rtnl_mutex){+.+.}, at: [<000000009de25a59>] nl80211_pre_doit+0xe9/0x190
[   22.772489]  #3:  (&wdev->mtx){+.+.}, at: [<0000000089bf2cfd>] nl80211_send_iface+0x317/0x8d0
[   22.772516] 
[   22.772522] CPU: 3 PID: 774 Comm: wpa_supplicant Not tainted 4.15.0-rc4+ #5
[   22.772528] Hardware name: Dell Inc. XPS 13 9343/0TM99H, BIOS A11 12/08/2016
[   22.772532] Call Trace:
[   22.772544]  dump_stack+0x67/0x95
[   22.772553]  ieee80211_bss_get_ie+0x66/0x70
[   22.772562]  nl80211_send_iface+0x344/0x8d0
[   22.772585]  nl80211_get_interface+0x4b/0xa0
[   22.772598]  genl_family_rcv_msg+0x32e/0x3f0
[   22.772607]  ? preempt_count_sub+0x92/0xd0
[   22.772645]  genl_rcv_msg+0x47/0x90
[   22.772652]  ? genl_family_rcv_msg+0x3f0/0x3f0
[   22.772661]  netlink_rcv_skb+0x8a/0x120
[   22.772677]  genl_rcv+0x24/0x40
[   22.772684]  netlink_unicast+0x174/0x1f0
[   22.772698]  netlink_sendmsg+0x386/0x3d0
[   22.772719]  sock_sendmsg+0x2d/0x40
[   22.772728]  ___sys_sendmsg+0x2a7/0x300
[   22.772748]  ? netlink_sendmsg+0x13d/0x3d0
[   22.772791]  ? __sys_sendmsg+0x67/0xb0
[   22.772797]  __sys_sendmsg+0x67/0xb0
[   22.772822]  entry_SYSCALL_64_fastpath+0x18/0x85

This warning wasn't present in 4.15. Despite of it, networking seems to
work fine. Nonetheless, the code seems to need a bugfix.

Thanks,
	Dominik

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH net 2/2] tipc: fix tipc_mon_delete() oops in tipc_enable_bearer() error path
From: Tommi Rantala @ 2017-12-22  7:35 UTC (permalink / raw)
  To: netdev; +Cc: tipc-discussion
In-Reply-To: <20171222073517.8773-1-tommi.t.rantala@nokia.com>

Calling tipc_mon_delete() before the monitor has been created will oops.
This can happen in tipc_enable_bearer() error path if tipc_disc_create()
fails.

[   48.589074] BUG: unable to handle kernel paging request at 0000000000001008
[   48.590266] IP: tipc_mon_delete+0xea/0x270 [tipc]
[   48.591223] PGD 1e60c5067 P4D 1e60c5067 PUD 1eb0cf067 PMD 0
[   48.592230] Oops: 0000 [#1] SMP KASAN
[   48.595610] CPU: 5 PID: 1199 Comm: tipc Tainted: G    B            4.15.0-rc4-pc64-dirty #5
[   48.597176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[   48.598489] RIP: 0010:tipc_mon_delete+0xea/0x270 [tipc]
[   48.599347] RSP: 0018:ffff8801d827f668 EFLAGS: 00010282
[   48.600705] RAX: ffff8801ee813f00 RBX: 0000000000000204 RCX: 0000000000000000
[   48.602183] RDX: 1ffffffff1de6a75 RSI: 0000000000000297 RDI: 0000000000000297
[   48.604373] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff1dd1533
[   48.605607] R10: ffffffff8eafbb05 R11: fffffbfff1dd1534 R12: 0000000000000050
[   48.607082] R13: dead000000000200 R14: ffffffff8e73f310 R15: 0000000000001020
[   48.608228] FS:  00007fc686484800(0000) GS:ffff8801f5540000(0000) knlGS:0000000000000000
[   48.610189] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   48.611459] CR2: 0000000000001008 CR3: 00000001dda70002 CR4: 00000000003606e0
[   48.612759] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   48.613831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   48.615038] Call Trace:
[   48.615635]  tipc_enable_bearer+0x415/0x5e0 [tipc]
[   48.620623]  tipc_nl_bearer_enable+0x1ab/0x200 [tipc]
[   48.625118]  genl_family_rcv_msg+0x36b/0x570
[   48.631233]  genl_rcv_msg+0x5a/0xa0
[   48.631867]  netlink_rcv_skb+0x1cc/0x220
[   48.636373]  genl_rcv+0x24/0x40
[   48.637306]  netlink_unicast+0x29c/0x350
[   48.639664]  netlink_sendmsg+0x439/0x590
[   48.642014]  SYSC_sendto+0x199/0x250
[   48.649912]  do_syscall_64+0xfd/0x2c0
[   48.650651]  entry_SYSCALL64_slow_path+0x25/0x25
[   48.651843] RIP: 0033:0x7fc6859848e3
[   48.652539] RSP: 002b:00007ffd25dff938 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[   48.654003] RAX: ffffffffffffffda RBX: 00007ffd25dff990 RCX: 00007fc6859848e3
[   48.655303] RDX: 0000000000000054 RSI: 00007ffd25dff990 RDI: 0000000000000003
[   48.656512] RBP: 00007ffd25dff980 R08: 00007fc685c35fc0 R09: 000000000000000c
[   48.657697] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000d13010
[   48.658840] R13: 00007ffd25e009c0 R14: 0000000000000000 R15: 0000000000000000
[   48.662972] RIP: tipc_mon_delete+0xea/0x270 [tipc] RSP: ffff8801d827f668
[   48.664073] CR2: 0000000000001008
[   48.664576] ---[ end trace e811818d54d5ce88 ]---

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
---
 net/tipc/monitor.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c
index 8e884ed06d4b..32dc33a94bc7 100644
--- a/net/tipc/monitor.c
+++ b/net/tipc/monitor.c
@@ -642,9 +642,13 @@ void tipc_mon_delete(struct net *net, int bearer_id)
 {
 	struct tipc_net *tn = tipc_net(net);
 	struct tipc_monitor *mon = tipc_monitor(net, bearer_id);
-	struct tipc_peer *self = get_self(net, bearer_id);
+	struct tipc_peer *self;
 	struct tipc_peer *peer, *tmp;
 
+	if (!mon)
+		return;
+
+	self = get_self(net, bearer_id);
 	write_lock_bh(&mon->lock);
 	tn->monitors[bearer_id] = NULL;
 	list_for_each_entry_safe(peer, tmp, &self->list, list) {
-- 
2.14.3


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

^ permalink raw reply related

* [PATCH net 1/2] tipc: error path leak fixes in tipc_enable_bearer()
From: Tommi Rantala @ 2017-12-22  7:35 UTC (permalink / raw)
  To: netdev; +Cc: Jon Maloy, Ying Xue, tipc-discussion, Tommi Rantala

Fix memory leak in tipc_enable_bearer() if enable_media() fails, and
cleanup with bearer_disable() if tipc_mon_create() fails.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
---
 net/tipc/bearer.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index 47ec121574ce..c8001471da6c 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -324,6 +324,7 @@ static int tipc_enable_bearer(struct net *net, const char *name,
 	if (res) {
 		pr_warn("Bearer <%s> rejected, enable failure (%d)\n",
 			name, -res);
+		kfree(b);
 		return -EINVAL;
 	}
 
@@ -347,8 +348,10 @@ static int tipc_enable_bearer(struct net *net, const char *name,
 	if (skb)
 		tipc_bearer_xmit_skb(net, bearer_id, skb, &b->bcast_addr);
 
-	if (tipc_mon_create(net, bearer_id))
+	if (tipc_mon_create(net, bearer_id)) {
+		bearer_disable(net, b);
 		return -ENOMEM;
+	}
 
 	pr_info("Enabled bearer <%s>, discovery domain %s, priority %u\n",
 		name,
-- 
2.14.3

^ permalink raw reply related

* Re: linux-next: build failure after merge of the net-next tree
From: Ido Schimmel @ 2017-12-22  7:51 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, Networking, Linux-Next Mailing List,
	Linux Kernel Mailing List, Ido Schimmel, Eric Dumazet
In-Reply-To: <20171222114519.612087d4@canb.auug.org.au>

Hi Stephen,

On Fri, Dec 22, 2017 at 11:45:19AM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the net-next tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:

[...]

> I have added the following merge fix patch for today (I am guessing
> a bit here):
> 
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Fri, 22 Dec 2017 11:25:13 +1100
> Subject: [PATCH] ipv6: fix up for "ipv6: Move dst->from into struct rt6_info"
> 
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
> ---
>  net/ipv6/route.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 4efaac956f0c..2490280b3394 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -4321,9 +4321,8 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
>  		goto errout;
>  	}
>  
> -	if (fibmatch && rt->dst.from) {
> -		struct rt6_info *ort = container_of(rt->dst.from,
> -						    struct rt6_info, dst);
> +	if (fibmatch && rt->from) {
> +		struct rt6_info *ort = rt->from;

Your merge fix is correct. Thanks!

>  
>  		dst_hold(&ort->dst);
>  		ip6_rt_put(rt);

^ permalink raw reply

* Re: [PATCH net] rtnetlink: fix struct net reference leak
From: Nicolas Dichtel @ 2017-12-22  8:11 UTC (permalink / raw)
  To: Craig Gallek, David Miller, Jiri Benc; +Cc: netdev, Jason A . Donenfeld
In-Reply-To: <20171221221832.89616-1-kraigatgoog@gmail.com>

Le 21/12/2017 à 23:18, Craig Gallek a écrit :
> From: Craig Gallek <kraig@google.com>
> 
> The below referenced commit extended the RTM_GETLINK interface to
> allow querying by netns id.  The netnsid property was previously
> defined as a signed integer, but this patch assumes that the user
> always passes a positive integer.  syzkaller discovered this problem
> by setting a negative netnsid and then calling the get-link path
> in a tight loop.  This surprisingly quickly overflows the reference
> count on the associated struct net, potentially destroying it.  When the
> default namespace is used, the machine crashes in strange and interesting
> ways.
> 
> Unfortunately, this is not easy to reproduce with just the ip tool
> as it enforces unsigned integer parsing despite the interface interpeting
> the NETNSID attribute as signed.
> 
> I'm not sure why this attribute is signed in the first place, but
> the first commit that introduced it (6621dd29eb9b) is in v4.15-rc4,
> so I assume it's too late to change.
A valid (assigned) nsid is always >= 0.

> 
> This patch removes the positive netns id assumption, but adds another
> assumption that the netns id 0 is always the 'self' identifying id (for
> which an additional struct net reference is not necessary).
We cannot make this assumption, this is wrong. nsids may be automatically
allocated by the kernel, and it starts by 0.
The current netns can be identify by NETNSA_NSID_NOT_ASSIGNED, ie -1.


Regards,
Nicolas

^ permalink raw reply

* Re: [PATCH v3 next-queue 08/10] ixgbe: process the Tx ipsec offload
From: Yanjun Zhu @ 2017-12-22  8:24 UTC (permalink / raw)
  To: Shannon Nelson, intel-wired-lan, jeffrey.t.kirsher
  Cc: steffen.klassert, sowmini.varadhan, netdev
In-Reply-To: <1513728002-7643-9-git-send-email-shannon.nelson@oracle.com>



On 2017/12/20 8:00, Shannon Nelson wrote:
> If the skb has a security association referenced in the skb, then
> set up the Tx descriptor with the ipsec offload bits.  While we're
> here, we fix an oddly named field in the context descriptor struct.
>
> v3: added ifdef CONFIG_XFRM_OFFLOAD check around call to ixgbe_ipsec_tx()
>
> v2: use ihl != 5
>      move the ixgbe_ipsec_tx() call to near the call to ixgbe_tso()
>      drop the ipsec packet if the tx offload setup fails
>      simplify the ixgbe_ipsec_tx() parameters by using 'first'
>      leave out the ixgbe_tso() changes since we don't support TSO
>         with ipsec yet.
>
> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
> ---
>   drivers/net/ethernet/intel/ixgbe/ixgbe.h       | 10 +++-
>   drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 79 ++++++++++++++++++++++++++
>   drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c   |  4 +-
>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 26 +++++++--
>   drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  2 +-
>   5 files changed, 112 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index a094b23..3d2b7bf 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -171,10 +171,11 @@ enum ixgbe_tx_flags {
>   	IXGBE_TX_FLAGS_CC	= 0x08,
>   	IXGBE_TX_FLAGS_IPV4	= 0x10,
>   	IXGBE_TX_FLAGS_CSUM	= 0x20,
> +	IXGBE_TX_FLAGS_IPSEC	= 0x40,
>   
>   	/* software defined flags */
> -	IXGBE_TX_FLAGS_SW_VLAN	= 0x40,
> -	IXGBE_TX_FLAGS_FCOE	= 0x80,
> +	IXGBE_TX_FLAGS_SW_VLAN	= 0x80,
> +	IXGBE_TX_FLAGS_FCOE	= 0x100,
>   };
>   
>   /* VLAN info */
> @@ -1014,6 +1015,8 @@ void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter);
>   void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
>   		    union ixgbe_adv_rx_desc *rx_desc,
>   		    struct sk_buff *skb);
> +int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring, struct ixgbe_tx_buffer *first,
> +		   struct ixgbe_ipsec_tx_data *itd);
>   #else
>   static inline void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter) { };
>   static inline void ixgbe_stop_ipsec_offload(struct ixgbe_adapter *adapter) { };
> @@ -1021,5 +1024,8 @@ static inline void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter) { };
>   static inline void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
>   				  union ixgbe_adv_rx_desc *rx_desc,
>   				  struct sk_buff *skb) { };
> +static inline int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
> +				 struct ixgbe_tx_buffer *first,
> +				 struct ixgbe_ipsec_tx_data *itd) { return 0; };
>   #endif /* CONFIG_XFRM_OFFLOAD */
>   #endif /* _IXGBE_H_ */
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> index a9b8f5c..c2fd2ac 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> @@ -695,12 +695,91 @@ static void ixgbe_ipsec_del_sa(struct xfrm_state *xs)
>   	}
>   }
>   
> +/**
> + * ixgbe_ipsec_offload_ok - can this packet use the xfrm hw offload
> + * @skb: current data packet
> + * @xs: pointer to transformer state struct
> + **/
> +static bool ixgbe_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
> +{
> +	if (xs->props.family == AF_INET) {
> +		/* Offload with IPv4 options is not supported yet */
> +		if (ip_hdr(skb)->ihl != 5)
> +			return false;
> +	} else {
> +		/* Offload with IPv6 extension headers is not support yet */
> +		if (ipv6_ext_hdr(ipv6_hdr(skb)->nexthdr))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
>   static const struct xfrmdev_ops ixgbe_xfrmdev_ops = {
>   	.xdo_dev_state_add = ixgbe_ipsec_add_sa,
>   	.xdo_dev_state_delete = ixgbe_ipsec_del_sa,
> +	.xdo_dev_offload_ok = ixgbe_ipsec_offload_ok,
>   };
>   
>   /**
> + * ixgbe_ipsec_tx - setup Tx flags for ipsec offload
> + * @tx_ring: outgoing context
> + * @first: current data packet
> + * @itd: ipsec Tx data for later use in building context descriptor
> + **/
> +int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
> +		   struct ixgbe_tx_buffer *first,
> +		   struct ixgbe_ipsec_tx_data *itd)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(tx_ring->netdev);
> +	struct ixgbe_ipsec *ipsec = adapter->ipsec;
> +	struct xfrm_state *xs;
> +	struct tx_sa *tsa;
> +
> +	if (!first->skb->sp->len) {
Hi, Nelson

The function ixgbe_ipsec_tx is called in tx fastpath. Can we add 
unlikely as below:
if (unlikely(!first->skb->sp->len)) ?

If I am wrong, please correct me.

Thanks a lot.
Zhu Yanjun
> +		netdev_err(tx_ring->netdev, "%s: no xfrm state len = %d\n",
> +			   __func__, first->skb->sp->len);
> +		return 0;
> +	}
> +
> +	xs = xfrm_input_state(first->skb);
> +	if (!xs) {
if (unlikely(!xs)) {
> +		netdev_err(tx_ring->netdev, "%s: no xfrm_input_state() xs = %p\n",
> +			   __func__, xs);
> +		return 0;
> +	}
> +
> +	itd->sa_idx = xs->xso.offload_handle - IXGBE_IPSEC_BASE_TX_INDEX;
> +	if (itd->sa_idx > IXGBE_IPSEC_MAX_SA_COUNT) {
if (unlikely(itd->sa_idx > IXGBE_IPSEC_MAX_SA_COUNT))
> +		netdev_err(tx_ring->netdev, "%s: bad sa_idx=%d handle=%lu\n",
> +			   __func__, itd->sa_idx, xs->xso.offload_handle);
> +		return 0;
> +	}
> +
> +	tsa = &ipsec->tx_tbl[itd->sa_idx];
> +	if (!tsa->used) {
if (unlikely(!tsa->used)) {
> +		netdev_err(tx_ring->netdev, "%s: unused sa_idx=%d\n",
> +			   __func__, itd->sa_idx);
> +		return 0;
> +	}
> +
> +	first->tx_flags |= IXGBE_TX_FLAGS_IPSEC | IXGBE_TX_FLAGS_CC;
> +
> +	itd->flags = 0;
> +	if (xs->id.proto == IPPROTO_ESP) {
> +		itd->flags |= IXGBE_ADVTXD_TUCMD_IPSEC_TYPE_ESP |
> +			      IXGBE_ADVTXD_TUCMD_L4T_TCP;
> +		if (first->protocol == htons(ETH_P_IP))
> +			itd->flags |= IXGBE_ADVTXD_TUCMD_IPV4;
> +		itd->trailer_len = xs->props.trailer_len;
> +	}
> +	if (tsa->encrypt)
> +		itd->flags |= IXGBE_ADVTXD_TUCMD_IPSEC_ENCRYPT_EN;
> +
> +	return 1;
> +}
> +
> +/**
>    * ixgbe_ipsec_rx - decode ipsec bits from Rx descriptor
>    * @rx_ring: receiving ring
>    * @rx_desc: receive data descriptor
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
> index f1bfae0..d7875b3 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
> @@ -1261,7 +1261,7 @@ void ixgbe_clear_interrupt_scheme(struct ixgbe_adapter *adapter)
>   }
>   
>   void ixgbe_tx_ctxtdesc(struct ixgbe_ring *tx_ring, u32 vlan_macip_lens,
> -		       u32 fcoe_sof_eof, u32 type_tucmd, u32 mss_l4len_idx)
> +		       u32 fceof_saidx, u32 type_tucmd, u32 mss_l4len_idx)
>   {
>   	struct ixgbe_adv_tx_context_desc *context_desc;
>   	u16 i = tx_ring->next_to_use;
> @@ -1275,7 +1275,7 @@ void ixgbe_tx_ctxtdesc(struct ixgbe_ring *tx_ring, u32 vlan_macip_lens,
>   	type_tucmd |= IXGBE_TXD_CMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT;
>   
>   	context_desc->vlan_macip_lens	= cpu_to_le32(vlan_macip_lens);
> -	context_desc->seqnum_seed	= cpu_to_le32(fcoe_sof_eof);
> +	context_desc->fceof_saidx	= cpu_to_le32(fceof_saidx);
>   	context_desc->type_tucmd_mlhl	= cpu_to_le32(type_tucmd);
>   	context_desc->mss_l4len_idx	= cpu_to_le32(mss_l4len_idx);
>   }
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 0ee1e5e..6b9a9b2 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -7756,10 +7756,12 @@ static inline bool ixgbe_ipv6_csum_is_sctp(struct sk_buff *skb)
>   }
>   
>   static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
> -			  struct ixgbe_tx_buffer *first)
> +			  struct ixgbe_tx_buffer *first,
> +			  struct ixgbe_ipsec_tx_data *itd)
>   {
>   	struct sk_buff *skb = first->skb;
>   	u32 vlan_macip_lens = 0;
> +	u32 fceof_saidx = 0;
>   	u32 type_tucmd = 0;
>   
>   	if (skb->ip_summed != CHECKSUM_PARTIAL) {
> @@ -7800,7 +7802,12 @@ static void ixgbe_tx_csum(struct ixgbe_ring *tx_ring,
>   	vlan_macip_lens |= skb_network_offset(skb) << IXGBE_ADVTXD_MACLEN_SHIFT;
>   	vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;
>   
> -	ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, 0, type_tucmd, 0);
> +	if (first->tx_flags & IXGBE_TX_FLAGS_IPSEC) {
> +		fceof_saidx |= itd->sa_idx;
> +		type_tucmd |= itd->flags | itd->trailer_len;
> +	}
> +
> +	ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, fceof_saidx, type_tucmd, 0);
>   }
>   
>   #define IXGBE_SET_FLAG(_input, _flag, _result) \
> @@ -7843,11 +7850,16 @@ static void ixgbe_tx_olinfo_status(union ixgbe_adv_tx_desc *tx_desc,
>   					IXGBE_TX_FLAGS_CSUM,
>   					IXGBE_ADVTXD_POPTS_TXSM);
>   
> -	/* enble IPv4 checksum for TSO */
> +	/* enable IPv4 checksum for TSO */
>   	olinfo_status |= IXGBE_SET_FLAG(tx_flags,
>   					IXGBE_TX_FLAGS_IPV4,
>   					IXGBE_ADVTXD_POPTS_IXSM);
>   
> +	/* enable IPsec */
> +	olinfo_status |= IXGBE_SET_FLAG(tx_flags,
> +					IXGBE_TX_FLAGS_IPSEC,
> +					IXGBE_ADVTXD_POPTS_IPSEC);
> +
>   	/*
>   	 * Check Context must be set if Tx switch is enabled, which it
>   	 * always is for case where virtual functions are running
> @@ -8306,6 +8318,7 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
>   	u32 tx_flags = 0;
>   	unsigned short f;
>   	u16 count = TXD_USE_COUNT(skb_headlen(skb));
> +	struct ixgbe_ipsec_tx_data ipsec_tx = { 0 };
>   	__be16 protocol = skb->protocol;
>   	u8 hdr_len = 0;
>   
> @@ -8410,11 +8423,16 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
>   	}
>   
>   #endif /* IXGBE_FCOE */
> +
> +#ifdef CONFIG_XFRM_OFFLOAD
> +	if (skb->sp && !ixgbe_ipsec_tx(tx_ring, first, &ipsec_tx))
> +		goto out_drop;
> +#endif
>   	tso = ixgbe_tso(tx_ring, first, &hdr_len);
>   	if (tso < 0)
>   		goto out_drop;
>   	else if (!tso)
> -		ixgbe_tx_csum(tx_ring, first);
> +		ixgbe_tx_csum(tx_ring, first, &ipsec_tx);
>   
>   	/* add the ATR filter if ATR is on */
>   	if (test_bit(__IXGBE_TX_FDIR_INIT_DONE, &tx_ring->state))
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
> index 3df0763..0ac725fa 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
> @@ -2856,7 +2856,7 @@ union ixgbe_adv_rx_desc {
>   /* Context descriptors */
>   struct ixgbe_adv_tx_context_desc {
>   	__le32 vlan_macip_lens;
> -	__le32 seqnum_seed;
> +	__le32 fceof_saidx;
>   	__le32 type_tucmd_mlhl;
>   	__le32 mss_l4len_idx;
>   };

^ permalink raw reply

* Re: [PATCH net] phylink: ensure the PHY interface mode is appropriately set
From: Florian Fainelli @ 2017-12-22  8:31 UTC (permalink / raw)
  To: Russell King, Andrew Lunn; +Cc: netdev
In-Reply-To: <E1eRng0-0006g9-MZ@rmk-PC.armlinux.org.uk>

Le 12/20/17 à 15:21, Russell King a écrit :
> When setting the ethtool settings, ensure that the validated PHY
> interface mode is propagated to the current link settings, so that
> 2500BaseX can be selected.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
-- 
Florian

^ permalink raw reply

* Re: [PATCH net] phylink: ensure AN is enabled
From: Florian Fainelli @ 2017-12-22  8:32 UTC (permalink / raw)
  To: Russell King, Andrew Lunn; +Cc: netdev
In-Reply-To: <E1eRng6-0006gG-0t@rmk-PC.armlinux.org.uk>

Le 12/20/17 à 15:21, Russell King a écrit :
> Ensure that we mark AN as enabled at boot time, rather than leaving
> it disabled.  This is noticable if your SFP module is fiber, and you
> it supports faster speeds than 1G with 2.5G support in place.

(not worth a respin IMHO), s/you //?

> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
-- 
Florian

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Siwei Liu @ 2017-12-22  8:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stephen Hemminger, David Miller, sridhar.samudrala, mst, netdev,
	virtualization, Alexander Duyck, jesse.brandeburg
In-Reply-To: <20171220205204.4853f1d1@cakuba.netronome.com>

On Wed, Dec 20, 2017 at 8:52 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
> On Wed, 20 Dec 2017 18:16:30 -0800, Siwei Liu wrote:
>> > The plan is to remove the delay and do the naming in the kernel.
>> > This was suggested by Lennart since udev is only doing naming policy
>> > because kernel names were not repeatable.
>> >
>> > This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
>> >
>> > Patch is pending.
>>
>> While it's good to show VF with specific naming to indicate
>> enslavement, I wonder wouldn't it be better to hide this netdev at all
>> from the user space? IMHO this extra device is useless when being
>> enslaved and we may delegate controls (e.g. ethtool) over to the
>> para-virtual device instead? That way it's possible to eliminate the
>> possibility of additional udev setup or modification?
>>
>> I'm not sure if this  is consistent with Windows guest or not, but I
>> don't find it _Linux_ user friendly that ethtool doesn't work on the
>> composite interface any more, and I have to end up with finding out
>> the correct enslaved VF I must operate on.
>
> Hiding "low level" netdevs comes up from time to time, and is more
> widely applicable than just to VF bonds.  We should find a generic
> solution to that problem.

Wholeheartedly agreed.

Be it a generic virtual bond or virtio-net specific one, there should
be some common code between netvsc and virtio for this type of work.
For avoiding duplicated bugs, consistent (Linux) user experience,
future code refactoring/management, and whatever...

One thing worth to note is that, unlike the Hyper-V netvsc backend
there's currently no equivalent (fine-grained) Linux ndo_* driver
interface that is able to move around MAC address/VLAN filters at
run-time specifically. The OID_RECEIVE_FILTER_MOVE_FILTER request I
mean. That translates to one substantial difference in VF
plumbing/unplumbing sequence: you cannot move the MAC address around
to paravirt device until VF is fully unplugged out of the guest OS. I
don't know what backend changes to be proposed for virtio-net as
helper, but the common code needs to work with both flavors of data
path switching backends and do its job correctly.

Regards,
-Siwei

^ permalink raw reply

* [PATCH 4.9 058/104] net: ipconfig: fix ic_close_devs() use-after-free
From: Greg Kroah-Hartman @ 2017-12-22  8:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Mark Rutland, Alexey Kuznetsov,
	David S. Miller, Hideaki YOSHIFUJI, James Morris, Patrick McHardy,
	netdev, Sasha Levin
In-Reply-To: <20171222084609.262099650@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Mark Rutland <mark.rutland@arm.com>


[ Upstream commit ffefb6f4d6ad699a2b5484241bc46745a53235d0 ]

Our chosen ic_dev may be anywhere in our list of ic_devs, and we may
free it before attempting to close others. When we compare d->dev and
ic_dev->dev, we're potentially dereferencing memory returned to the
allocator. This causes KASAN to scream for each subsequent ic_dev we
check.

As there's a 1-1 mapping between ic_devs and netdevs, we can instead
compare d and ic_dev directly, which implicitly handles the !ic_dev
case, and avoids the use-after-free. The ic_dev pointer may be stale,
but we will not dereference it.

Original splat:

[    6.487446] ==================================================================
[    6.494693] BUG: KASAN: use-after-free in ic_close_devs+0xc4/0x154 at addr ffff800367efa708
[    6.503013] Read of size 8 by task swapper/0/1
[    6.507452] CPU: 5 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc3-00002-gda42158 #8
[    6.514993] Hardware name: AppliedMicro Mustang/Mustang, BIOS 3.05.05-beta_rc Jan 27 2016
[    6.523138] Call trace:
[    6.525590] [<ffff200008094778>] dump_backtrace+0x0/0x570
[    6.530976] [<ffff200008094d08>] show_stack+0x20/0x30
[    6.536017] [<ffff200008bee928>] dump_stack+0x120/0x188
[    6.541231] [<ffff20000856d5e4>] kasan_object_err+0x24/0xa0
[    6.546790] [<ffff20000856d924>] kasan_report_error+0x244/0x738
[    6.552695] [<ffff20000856dfec>] __asan_report_load8_noabort+0x54/0x80
[    6.559204] [<ffff20000aae86ac>] ic_close_devs+0xc4/0x154
[    6.564590] [<ffff20000aaedbac>] ip_auto_config+0x2ed4/0x2f1c
[    6.570321] [<ffff200008084b04>] do_one_initcall+0xcc/0x370
[    6.575882] [<ffff20000aa31de8>] kernel_init_freeable+0x5f8/0x6c4
[    6.581959] [<ffff20000a16df00>] kernel_init+0x18/0x190
[    6.587171] [<ffff200008084710>] ret_from_fork+0x10/0x40
[    6.592468] Object at ffff800367efa700, in cache kmalloc-128 size: 128
[    6.598969] Allocated:
[    6.601324] PID = 1
[    6.603427]  save_stack_trace_tsk+0x0/0x418
[    6.607603]  save_stack_trace+0x20/0x30
[    6.611430]  kasan_kmalloc+0xd8/0x188
[    6.615087]  ip_auto_config+0x8c4/0x2f1c
[    6.619002]  do_one_initcall+0xcc/0x370
[    6.622832]  kernel_init_freeable+0x5f8/0x6c4
[    6.627178]  kernel_init+0x18/0x190
[    6.630660]  ret_from_fork+0x10/0x40
[    6.634223] Freed:
[    6.636233] PID = 1
[    6.638334]  save_stack_trace_tsk+0x0/0x418
[    6.642510]  save_stack_trace+0x20/0x30
[    6.646337]  kasan_slab_free+0x88/0x178
[    6.650167]  kfree+0xb8/0x478
[    6.653131]  ic_close_devs+0x130/0x154
[    6.656875]  ip_auto_config+0x2ed4/0x2f1c
[    6.660875]  do_one_initcall+0xcc/0x370
[    6.664705]  kernel_init_freeable+0x5f8/0x6c4
[    6.669051]  kernel_init+0x18/0x190
[    6.672534]  ret_from_fork+0x10/0x40
[    6.676098] Memory state around the buggy address:
[    6.680880]  ffff800367efa600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    6.688078]  ffff800367efa680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[    6.695276] >ffff800367efa700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    6.702469]                       ^
[    6.705952]  ffff800367efa780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[    6.713149]  ffff800367efa800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    6.720343] ==================================================================
[    6.727536] Disabling lock debugging due to kernel taint

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 net/ipv4/ipconfig.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -306,7 +306,7 @@ static void __init ic_close_devs(void)
 	while ((d = next)) {
 		next = d->next;
 		dev = d->dev;
-		if ((!ic_dev || dev != ic_dev->dev) && !netdev_uses_dsa(dev)) {
+		if (d != ic_dev && !netdev_uses_dsa(dev)) {
 			pr_debug("IP-Config: Downing %s\n", dev->name);
 			dev_change_flags(dev, d->flags);
 		}

^ permalink raw reply

* Re: [QUESTION] Doubt about NAPI_GRO_CB(skb)->is_atomic in tcpv4 gro process
From: Yunsheng Lin @ 2017-12-22  8:49 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev@vger.kernel.org, davem@davemloft.net, linuxarm@huawei.com,
	yuxiaowu, wzhen.wang, Xuehuahu
In-Reply-To: <CAKgT0UeMOJ9N5veeXk2BY14w41kVkDNsbAF50cqyb708DMPtRA@mail.gmail.com>

Hi, Alexander

On 2017/12/22 0:29, Alexander Duyck wrote:
> On Thu, Dec 21, 2017 at 1:16 AM, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>> Hi, Alexander
>>
>> On 2017/12/21 0:24, Alexander Duyck wrote:
>>> On Wed, Dec 20, 2017 at 1:09 AM, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>> Hi, all
>>>>         I have some doubt about NAPI_GRO_CB(skb)->is_atomic when
>>>> analyzing the tcpv4 gro process:
>>>>
>>>> Firstly we set NAPI_GRO_CB(skb)->is_atomic to 1 in dev_gro_receive:
>>>> https://elixir.free-electrons.com/linux/v4.15-rc4/source/net/core/dev.c#L4838
>>>>
>>>> And then in inet_gro_receive, we check the NAPI_GRO_CB(skb)->is_atomic
>>>> before setting NAPI_GRO_CB(skb)->is_atomic according to IP_DF bit in the ip header:
>>>> https://elixir.free-electrons.com/linux/v4.15-rc4/source/net/ipv4/af_inet.c#L1319
>>>>
>>>> struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff *skb)
>>>> {
>>>> .....................
>>>>         for (p = *head; p; p = p->next) {
>>>> ........................
>>>>
>>>>                 /* If the previous IP ID value was based on an atomic
>>>>                  * datagram we can overwrite the value and ignore it.
>>>>                  */
>>>>                 if (NAPI_GRO_CB(skb)->is_atomic)                      //we check it here
>>>>                         NAPI_GRO_CB(p)->flush_id = flush_id;
>>>>                 else
>>>>                         NAPI_GRO_CB(p)->flush_id |= flush_id;
>>>>         }
>>>>
>>>>         NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));  //we set it here
>>>>         NAPI_GRO_CB(skb)->flush |= flush;
>>>>         skb_set_network_header(skb, off);
>>>> ................................
>>>> }
>>>>
>>>> My question is whether we should check the NAPI_GRO_CB(skb)->is_atomic or NAPI_GRO_CB(p)->is_atomic?
>>>> If we should check NAPI_GRO_CB(skb)->is_atomic, then maybe it is unnecessary because it is alway true.
>>>> If we should check NAPI_GRO_CB(p)->is_atomic, maybe there is a bug here.
>>>>
>>>> So what is the logic here? I am just start analyzing the gro, maybe I miss something obvious here.
>>>
>>> The logic there is to address the multiple IP header case where there
>>> are 2 or more IP headers due to things like VXLAN or GRE tunnels. So
>>> what will happen is that an outer IP header will end up being sent
>>> with DF not set and will clear the is_atomic value then we want to OR
>>> in the next header that is applied. It defaults to assignment on
>>> is_atomic because the first IP header will encounter flush_id with no
>>> previous configuration occupying it.
>>
>> I see your point now.
>>
>> But for the same flow of tunnels packet, the outer and inner ip header must
>> have the same fixed id or increment id?
>>
>> For example, if we have a flow of tunnels packet which has fixed id in outer
>> header and increment id in inner header(the inner header does have DF flag set):

Sorry, a typo error here. I meant the inner header does *not* have DF flag set here.

>>
>> 1. For the first packet, NAPI_GRO_CB(skb)->is_atomic will be set to zero when
>> inet_gro_receive is processing the inner ip header.
>>
>> 2. For the second packet, when inet_gro_receive is processing the outer ip header
>> which has a fixed id, NAPI_GRO_CB(p)->is_atomic is zero according to [1], so
>> NAPI_GRO_CB(p)->flush_id will be set to 0xFFFF, then the second packet will not
>> be merged to first packet in tcp_gro_receive.
> 
> I'm not sure how valid your case here is. The is_atomic is only really
> meant to apply to the inner-most header.

For the new skb, NAPI_GRO_CB(skb)->is_atomic is indeed applied to the
inner-most header.

What about the NAPI_GRO_CB(p)->is_atomic, p is the same skb flow already
merged by gro.

Let me try if I understand it correctly:
when there is only one skb merged in p, then NAPI_GRO_CB(p)->is_atomic
is set according to the first skb' inner-most ip DF flags.

When the second skb comes, and inet_gro_receive is processing the
outer-most ip, for the below code, NAPI_GRO_CB(p)->is_atomic
is for first skb's inner-most ip DF flags, and "iph->frag_off & htons(IP_DF)"
is for second skb' outer-most ip DF flags.
Why don't we use the first skb's outer-most ip DF flags here? I think it is
more logical to use first skb's outer-most ip DF flags here. But the first
skb's outer-most ip DF flags is lost when we get here, right?

		if (!NAPI_GRO_CB(p)->is_atomic ||
		    !(iph->frag_off & htons(IP_DF))) {
			flush_id ^= NAPI_GRO_CB(p)->count;
			flush_id = flush_id ? 0xFFFF : 0;
		}


 In the case of TCP the
> inner-most header should almost always have the DF bit set which means
> the inner-most is almost always atomic.
> 
>> I thought outer ip header could have a fixed id while inner ip header could
>> have a increment id. Do I miss something here?
> 
> You have it backwards. The innermost will have DF bit set so it can be
> fixed, the outer-most will in many cases not since it is usually UDP
> and as such it will likely need to increment.

Actually I got it backwards because that is what intel do when TSOing.

Below is quoted from 710 datasheet in page 1052:
The Identification field (in the IP header in non tunneled packets or the inner IP header in
tunneled packets) is taken from the TSO header in the first segment and then it is increased by
one for each transmitted segment.
The Identification field (in the external IP header) is taken from the TSO header in the first
segment and then it is either increased by one for each transmitted segment or remains
constant depending on the EIP_NOINC setting in the transmit context descriptor.

If I understand it correctly, intel is doing oppositely as pointed out by you.
Am I miss something here?

Refer:
https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf
> 
>>>
>>> The part I am not sure about is if we should be using assignment for
>>> is_atomic or using an "&=" to clear the bit and leave it cleared.
>>
>> I am not sure I understood you here. is_atomic is a bit field, why do you
>> want to use "&="?
> 
> Actually that was my mind kind of wandering. It has been a while since
> I looked at this code and the use of &= wouldn't be appropriate since
> is_atomic should only apply to the innermost header.
> 
> Basically the only acceptable combinations for is_atomic and flush_id
> are false with 0, or true with 1. We can't have a fixed outer header
> value if DF is not set.
> 
> Hope that helps to clarify things.

Yes, Your reqly has helped a lot. But I still have some doubt above,
please help me understand it.
I hope this does not take much of your time.
Thanks very much.

Yunsheng Lin

> 
> - Alex
> 
> .
> 

^ permalink raw reply

* [PATCH net,stable 1/1] net: fec: unmap the xmit buffer that are not transferred by DMA
From: Fugang Duan @ 2017-12-22  9:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, fugang.duan

The enet IP only support 32 bit, it will use swiotlb buffer to do dma
mapping when xmit buffer DMA memory address is bigger than 4G in i.MX
platform. After stress suspend/resume test, it will print out:

log:
[12826.352864] fec 5b040000.ethernet: swiotlb buffer is full (sz: 191 bytes)
[12826.359676] DMA: Out of SW-IOMMU space for 191 bytes at device 5b040000.ethernet
[12826.367110] fec 5b040000.ethernet eth0: Tx DMA memory map failed

The issue is that the ready xmit buffers that are dma mapped but DMA still
don't copy them into fifo, once MAC restart, these DMA buffers are not unmapped.
So it should check the dma mapping buffer and unmap them.

Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
---
 drivers/net/ethernet/freescale/fec_main.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 2d1b065..e17d10b 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -818,6 +818,12 @@ static void fec_enet_bd_init(struct net_device *dev)
 		for (i = 0; i < txq->bd.ring_size; i++) {
 			/* Initialize the BD for every fragment in the page. */
 			bdp->cbd_sc = cpu_to_fec16(0);
+			if (bdp->cbd_bufaddr &&
+			    !IS_TSO_HEADER(txq, fec32_to_cpu(bdp->cbd_bufaddr)))
+				dma_unmap_single(&fep->pdev->dev,
+						 fec32_to_cpu(bdp->cbd_bufaddr),
+						 fec16_to_cpu(bdp->cbd_datlen),
+						 DMA_TO_DEVICE);
 			if (txq->tx_skbuff[i]) {
 				dev_kfree_skb_any(txq->tx_skbuff[i]);
 				txq->tx_skbuff[i] = NULL;
-- 
1.9.1

^ permalink raw reply related

* RE: [patch iproute2 v2] tc: add -bs option for batch mode
From: Chris Mi @ 2017-12-22  9:27 UTC (permalink / raw)
  To: David Ahern, netdev@vger.kernel.org; +Cc: gerlitz.or@gmail.com
In-Reply-To: <dee42804-e4ca-7159-fce3-9a999c29b874@gmail.com>

> -----Original Message-----
> From: David Ahern [mailto:dsahern@gmail.com]
> Sent: Friday, December 22, 2017 6:04 AM
> To: Chris Mi <chrism@mellanox.com>; netdev@vger.kernel.org
> Cc: gerlitz.or@gmail.com
> Subject: Re: [patch iproute2 v2] tc: add -bs option for batch mode
> 
> On 12/20/17 2:26 AM, Chris Mi wrote:
> > Currently in tc batch mode, only one command is read from the batch
> > file and sent to kernel to process. With this patch, we can accumulate
> > several commands before sending to kernel. The batch size is specified
> > using option -bs or -batchsize.
> >
> > To accumulate the commands in tc, we allocate an array of struct iovec.
> > If batchsize is bigger than 1 and we haven't accumulated enough
> > commands, rtnl_talk() will return without actually sending the message.
> > One exception is that there is no more command in the batch file.
> >
> > But please note that kernel still processes the requests one by one.
> > To process the requests in parallel in kernel is another effort.
> > The time we're saving in this patch is the user mode and kernel mode
> > context switch. So this patch works on top of the current kernel.
> >
> > Using the following script in kernel, we can generate 1,000,000 rules.
> >         tools/testing/selftests/tc-testing/tdc_batch.py
> >
> > Without this patch, 'tc -b $file' exection time is:
> >
> > real    0m14.916s
> > user    0m6.808s
> > sys     0m8.046s
> >
> > With this patch, 'tc -b $file -bs 10' exection time is:
> >
> > real    0m12.286s
> > user    0m5.903s
> > sys     0m6.312s
> >
> > The insertion rate is improved more than 10%.
> >
> > Signed-off-by: Chris Mi <chrism@mellanox.com>
> > ---
> >  include/libnetlink.h |  6 ++++
> >  include/utils.h      |  4 +++
> >  lib/libnetlink.c     | 25 ++++++++++-----
> >  lib/utils.c          | 20 ++++++++++++
> >  man/man8/tc.8        |  5 +++
> >  tc/m_action.c        | 62 +++++++++++++++++++++++++++---------
> >  tc/tc.c              | 24 ++++++++++++--
> >  tc/tc_common.h       |  3 ++
> >  tc/tc_filter.c       | 88 ++++++++++++++++++++++++++++++++++++---------
> -------
> >  9 files changed, 186 insertions(+), 51 deletions(-)
> >
> > diff --git a/include/libnetlink.h b/include/libnetlink.h index
> > a4d83b9e..775f830b 100644
> > --- a/include/libnetlink.h
> > +++ b/include/libnetlink.h
> > @@ -13,6 +13,8 @@
> >  #include <linux/netconf.h>
> >  #include <arpa/inet.h>
> >
> > +#define MSG_IOV_MAX 256
> > +
> >  struct rtnl_handle {
> >  	int			fd;
> >  	struct sockaddr_nl	local;
> > @@ -93,6 +95,10 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
> >  			void *arg, __u16 nc_flags);
> >  #define rtnl_dump_filter(rth, filter, arg) \
> >  	rtnl_dump_filter_nc(rth, filter, arg, 0)
> > +
> > +extern int msg_iov_index;
> > +extern int batch_size;
> > +
> >  int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
> >  	      struct nlmsghdr **answer)
> >  	__attribute__((warn_unused_result));
> > diff --git a/include/utils.h b/include/utils.h index
> > d3895d56..113a8c31 100644
> > --- a/include/utils.h
> > +++ b/include/utils.h
> > @@ -235,6 +235,10 @@ void print_nlmsg_timestamp(FILE *fp, const struct
> > nlmsghdr *n);
> >
> >  extern int cmdlineno;
> >  ssize_t getcmdline(char **line, size_t *len, FILE *in);
> > +
> > +extern int cmdlinetotal;
> > +void setcmdlinetotal(const char *name);
> > +
> >  int makeargs(char *line, char *argv[], int maxargs);  int
> > inet_get_addr(const char *src, __u32 *dst, struct in6_addr *dst6);
> >
> > diff --git a/lib/libnetlink.c b/lib/libnetlink.c index
> > 00e6ce0c..7ff32d64 100644
> > --- a/lib/libnetlink.c
> > +++ b/lib/libnetlink.c
> > @@ -24,6 +24,7 @@
> >  #include <sys/uio.h>
> >
> >  #include "libnetlink.h"
> > +#include "utils.h"
> >
> >  #ifndef SOL_NETLINK
> >  #define SOL_NETLINK 270
> > @@ -581,6 +582,10 @@ static void rtnl_talk_error(struct nlmsghdr *h,
> struct nlmsgerr *err,
> >  		strerror(-err->error));
> >  }
> >
> > +static struct iovec msg_iov[MSG_IOV_MAX]; int msg_iov_index; int
> > +batch_size = 1;
> > +
> >  static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
> >  		       struct nlmsghdr **answer,
> >  		       bool show_rtnl_err, nl_ext_ack_fn_t errfn) @@ -589,23
> > +594,29 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr
> *n,
> >  	unsigned int seq;
> >  	struct nlmsghdr *h;
> >  	struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
> > -	struct iovec iov = {
> > -		.iov_base = n,
> > -		.iov_len = n->nlmsg_len
> > -	};
> > +	struct iovec *iov = &msg_iov[msg_iov_index];
> > +	char *buf;
> > +
> > +	iov->iov_base = n;
> > +	iov->iov_len = n->nlmsg_len;
> > +
> >  	struct msghdr msg = {
> >  		.msg_name = &nladdr,
> >  		.msg_namelen = sizeof(nladdr),
> > -		.msg_iov = &iov,
> > -		.msg_iovlen = 1,
> > +		.msg_iov = msg_iov,
> > +		.msg_iovlen = ++msg_iov_index,
> >  	};
> > -	char *buf;
> >
> >  	n->nlmsg_seq = seq = ++rtnl->seq;
> >
> >  	if (answer == NULL)
> >  		n->nlmsg_flags |= NLM_F_ACK;
> >
> > +	msg_iov_index %= batch_size;
> > +	if (msg_iov_index != 0 && batch_size > 1 && cmdlineno !=
> cmdlinetotal &&
> > +	    (n->nlmsg_type == RTM_NEWTFILTER || n->nlmsg_type ==
> RTM_NEWACTION))
> > +		return 3;
> > +
> >  	status = sendmsg(rtnl->fd, &msg, 0);
> >  	if (status < 0) {
> >  		perror("Cannot talk to rtnetlink");
> 
> I have a fair idea of why you did it this way, but relying on global variables and
> magic return codes is not a great solution.
> 
> Why not refactor rtnl_talk -- move the sendmsg piece to a helper that takes
> an iov or a msg as input arg. Then have the tc code piece together the iov and
> call rtnl_talk. If batch_size == 1 it calls rtnl_talk; > 1 it puts together the iov
> and hands it off.
Thanks for your suggestion. I think it is better to add a new API, like rtnl_talk_msg.
It will take a struct msghdr * as input arg instead of a struct nlmsghdr *.
The caller should prepare for the msghdr, like tc_filter_modify.
After this changes, libnetlink needn't to know anything about the batching.

rtnl_talk will send a single iov like before. Existing users has no impact at all.
rtnl_talk_msg will send multiple iovs.

^ permalink raw reply

* Re: [PATCH net-next 2/2] net/sched: act_csum: don't use spinlock in the fast path
From: Davide Caratti @ 2017-12-22  9:28 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, jiri, netdev
In-Reply-To: <20171213.162328.1374458006305010879.davem@davemloft.net>

On Wed, 2017-12-13 at 16:23 -0500, David Miller wrote:
> From: Davide Caratti <dcaratti@redhat.com>
> Date: Wed, 13 Dec 2017 10:48:38 +0100
> 
> > Then, in the data path, use READ_ONCE() to
> > read those values, to avoid lock contention among multiple readers.
>  ...
> > @@ -544,14 +543,12 @@ static int tcf_csum(struct sk_buff *skb, const struct tc_action *a,
> >  
> >       tcf_lastuse_update(&p->tcf_tm);
> >       bstats_cpu_update(this_cpu_ptr(p->common.cpu_bstats), skb);
> > -     spin_lock(&p->tcf_lock);
> > -     action = p->tcf_action;
> > -     update_flags = p->update_flags;
> > -     spin_unlock(&p->tcf_lock);
> >  
> > +     action = READ_ONCE(p->tcf_action);
> >       if (unlikely(action == TC_ACT_SHOT))
> >               goto drop;
> >  
> > +     update_flags = READ_ONCE(p->update_flags);
> >       switch (tc_skb_protocol(skb)) {
> >       case cpu_to_be16(ETH_P_IP):
> >               if (!tcf_csum_ipv4(skb, update_flags))

hi David, thank you for replying!

> That's not why the lock is here.
> 
> We must read both action and flags atomically so that they are consistent
> with eachother.
> 
> We must never use action from one configuration change and flags from
> yet another.

I was (erroneously) assuming that such behavior was acceptable, since it's
present almost in all other TC actions, even those where tcf_lock is used.
But agree, it's better not to introduce a race in a place where it's not
present.

> Find a way to load both of these values with a single cpu load, then you
> can legally remove the lock.

act_tunnel_key seems a good example for this, I will send a v2 soon.

-- 
davide

^ permalink raw reply

* Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"
From: Jiri Pirko @ 2017-12-22  9:35 UTC (permalink / raw)
  To: Cong Wang
  Cc: John Fastabend, David Miller, Jakub Kicinski,
	Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <CAM_iQpVkV0YbOaSfmXb_=OpzpxepmxkXHKibKUcmhELZBM-Abg@mail.gmail.com>

Thu, Dec 21, 2017 at 09:59:56PM CET, xiyou.wangcong@gmail.com wrote:
>On Thu, Dec 21, 2017 at 12:39 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Why just moving qdisc_free to rcu is not enough? It would resolve this
>> issue and also avoid using synchronize net. Something like:
>
>If you mean Jakub's issue, apparently not:
>https://www.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/RCUCBordering.html
>
>Jiri, you have to use a rcu barrier to wait for a rcu callback, not
>queuing another rcu callback, the ordering is simply NOT guaranteed.

Sure. But the ordering does not matter in this case. You just need to
make sure the reader does not see freed qdisc struct.

>
>What's more importantly, you already have one rcu barrier in the
>same function. Why keep believing you don't need it?

The rcu barrier is there for a different reason. It is there to avoid
reuse of old miniq that readers still might see in scenario
miniq1
miniq2
miniq1 again

^ permalink raw reply

* Re: [Patch net] net_sched: fix a missing rcu barrier in mini_qdisc_pair_swap()
From: Jiri Pirko @ 2017-12-22  9:38 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Jiri Pirko, John Fastabend
In-Reply-To: <CAM_iQpVyycftiydUi3VJsZ1QMj45dWDvkmsOquT10JqF-zrwkw@mail.gmail.com>

Thu, Dec 21, 2017 at 09:54:06PM CET, xiyou.wangcong@gmail.com wrote:
>On Thu, Dec 21, 2017 at 11:01 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> On Thu, Dec 21, 2017 at 1:03 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>
>>>
>>> But again, we don't we just free qdisc in call_rcu and avoid the
>>> barrier?
>>
>>
>> Non-sense again. Why qdisc code should be adjusted for your
>> miniq code? It is your own responsibility to take care of this shit.
>> Don't spread it out of minq.
>
>Also, in case you believe call_rcu to free qdisc is queued after
>the call_rcu in miniq, you are wrong again:

I certainly don't :) Not sure why you think I do.

>
>https://www.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/RCUCBordering.html
>
>The rcu callbacks don't guarantee FIFO ordering.

^ permalink raw reply

* Re: [PATCH RFC 08/18] r8168: add basic phylib support
From: Andrew Lunn @ 2017-12-22  9:44 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Realtek linux nic maintainers, Chun-Hao Lin, David Miller,
	netdev@vger.kernel.org
In-Reply-To: <8d117cab-2221-2cc4-ff0a-891f4a60dc92@gmail.com>

On Thu, Dec 21, 2017 at 09:50:25PM +0100, Heiner Kallweit wrote:

Hi Heiner

This code looks good. Just a few minor comments.

> +static int r8168_phy_connect(struct rtl8168_private *tp)
> +{
> +	struct phy_device *phydev;
> +	phy_interface_t phy_mode;
> +	int ret;
> +
> +	phy_mode = tp->mii.supports_gmii ? PHY_INTERFACE_MODE_GMII :
> +		   PHY_INTERFACE_MODE_MII;
> +
> +	phydev = phy_find_first(tp->mii_bus);
> +	if (!phydev)
> +		return -ENODEV;
> +
> +	ret = phy_connect_direct(tp->dev, phydev, r8168_phylink_handler,
> +				 phy_mode);
> +	if (ret)
> +		return ret;
> +
> +	phy_attached_info(phydev);
> +
> +	if (!tp->mii.supports_gmii && (phydev->supported &
> +	    (SUPPORTED_1000baseT_Half | SUPPORTED_1000baseT_Full))) {
> +		netif_info(tp, probe, tp->dev, "Restrict PHY to 100Mbit because MAC doesn't support 1GBit");
> +		phy_set_max_speed(phydev, SPEED_100);
> +		ret = genphy_restart_aneg(phydev);
> +	}

You might be able to put this higher up, after phy_first_first, and
skip the genphy_restart_aneg(). When phy_start() is called i think it
will perform an aneg, so there is no need to do it here.

> +
> +	return ret;
> +}
> +
>  static void rtl_hw_init_8168g(struct rtl8168_private *tp)
>  {
>  	void __iomem *ioaddr = tp->mmio_addr;
> @@ -8212,10 +8284,18 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	if (!tp->counters)
>  		return -ENOMEM;
>  
> -	rc = register_netdev(dev);
> +	rc = r8168_mdio_register(tp);
>  	if (rc < 0)
>  		return rc;
>  
> +	rc = register_netdev(dev);
> +	if (rc < 0)
> +		goto err_mdio_unregister;
> +
> +	rc = r8168_phy_connect(tp);
> +	if (rc < 0)
> +		goto err_netdev_unregister;
> +
>  	pci_set_drvdata(pdev, dev);

This is the wrong order. Everything should be setup before you call
register_netdev(). As soon as that is called, things can start
happening with the device, so it is dangerous not to have the phy
connected, nor drvdata set. This is an old bug, but now would be a
good time to fix it.

     Andrew

^ permalink raw reply

* [PATCH 4/8] xfrm: Fix stack-out-of-bounds read on socket policy lookup.
From: Steffen Klassert @ 2017-12-22  9:44 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20171222094501.23345-1-steffen.klassert@secunet.com>

When we do tunnel or beet mode, we pass saddr and daddr from the
template to xfrm_state_find(), this is ok. On transport mode,
we pass the addresses from the flowi, assuming that the IP
addresses (and address family) don't change during transformation.
This assumption is wrong in the IPv4 mapped IPv6 case, packet
is IPv4 and template is IPv6.

Fix this by catching address family missmatches of the policy
and the flow already before we do the lookup.

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_policy.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 9542975eb2f9..038ec68f6901 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1168,9 +1168,15 @@ static struct xfrm_policy *xfrm_sk_policy_lookup(const struct sock *sk, int dir,
  again:
 	pol = rcu_dereference(sk->sk_policy[dir]);
 	if (pol != NULL) {
-		bool match = xfrm_selector_match(&pol->selector, fl, family);
+		bool match;
 		int err = 0;
 
+		if (pol->family != family) {
+			pol = NULL;
+			goto out;
+		}
+
+		match = xfrm_selector_match(&pol->selector, fl, family);
 		if (match) {
 			if ((sk->sk_mark & pol->mark.m) != pol->mark.v) {
 				pol = NULL;
-- 
2.14.1

^ permalink raw reply related

* [PATCH 1/8] xfrm: check id proto in validate_tmpl()
From: Steffen Klassert @ 2017-12-22  9:44 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20171222094501.23345-1-steffen.klassert@secunet.com>

From: Cong Wang <xiyou.wangcong@gmail.com>

syzbot reported a kernel warning in xfrm_state_fini(), which
indicates that we have entries left in the list
net->xfrm.state_all whose proto is zero. And
xfrm_id_proto_match() doesn't consider them as a match with
IPSEC_PROTO_ANY in this case.

Proto with value 0 is probably not a valid value, at least
verify_newsa_info() doesn't consider it valid either.

This patch fixes it by checking the proto value in
validate_tmpl() and rejecting invalid ones, like what iproute2
does in xfrm_xfrmproto_getbyname().

Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_user.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 983b0233767b..c2cfcc6fdb34 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1445,6 +1445,21 @@ static int validate_tmpl(int nr, struct xfrm_user_tmpl *ut, u16 family)
 		default:
 			return -EINVAL;
 		}
+
+		switch (ut[i].id.proto) {
+		case IPPROTO_AH:
+		case IPPROTO_ESP:
+		case IPPROTO_COMP:
+#if IS_ENABLED(CONFIG_IPV6)
+		case IPPROTO_ROUTING:
+		case IPPROTO_DSTOPTS:
+#endif
+		case IPSEC_PROTO_ANY:
+			break;
+		default:
+			return -EINVAL;
+		}
+
 	}
 
 	return 0;
-- 
2.14.1

^ permalink raw reply related

* [PATCH 2/8] xfrm: fix XFRMA_OUTPUT_MARK policy entry
From: Steffen Klassert @ 2017-12-22  9:44 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20171222094501.23345-1-steffen.klassert@secunet.com>

From: Michal Kubecek <mkubecek@suse.cz>

This seems to be an obvious typo, NLA_U32 is type of the attribute, not its
(minimal) length.

Fixes: 077fbac405bf ("net: xfrm: support setting an output mark.")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index c2cfcc6fdb34..ff58c37469d6 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2485,7 +2485,7 @@ static const struct nla_policy xfrma_policy[XFRMA_MAX+1] = {
 	[XFRMA_PROTO]		= { .type = NLA_U8 },
 	[XFRMA_ADDRESS_FILTER]	= { .len = sizeof(struct xfrm_address_filter) },
 	[XFRMA_OFFLOAD_DEV]	= { .len = sizeof(struct xfrm_user_offload) },
-	[XFRMA_OUTPUT_MARK]	= { .len = NLA_U32 },
+	[XFRMA_OUTPUT_MARK]	= { .type = NLA_U32 },
 };
 
 static const struct nla_policy xfrma_spd_policy[XFRMA_SPD_MAX+1] = {
-- 
2.14.1

^ permalink raw reply related

* [PATCH 5/8] xfrm: fix xfrm_do_migrate() with AEAD e.g(AES-GCM)
From: Steffen Klassert @ 2017-12-22  9:44 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20171222094501.23345-1-steffen.klassert@secunet.com>

From: Antony Antony <antony@phenome.org>

copy geniv when cloning the xfrm state.

x->geniv was not copied to the new state and migration would fail.

xfrm_do_migrate
  ..
  xfrm_state_clone()
   ..
   ..
   esp_init_aead()
   crypto_alloc_aead()
    crypto_alloc_tfm()
     crypto_find_alg() return EAGAIN and failed

Signed-off-by: Antony Antony <antony@phenome.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_state.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 1f5cee2269af..88d0a563e141 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1344,6 +1344,7 @@ static struct xfrm_state *xfrm_state_clone(struct xfrm_state *orig,
 
 	if (orig->aead) {
 		x->aead = xfrm_algo_aead_clone(orig->aead);
+		x->geniv = orig->geniv;
 		if (!x->aead)
 			goto error;
 	}
-- 
2.14.1

^ permalink raw reply related

* [PATCH 6/8] xfrm: Fix stack-out-of-bounds with misconfigured transport mode policies.
From: Steffen Klassert @ 2017-12-22  9:44 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, Steffen Klassert, netdev
In-Reply-To: <20171222094501.23345-1-steffen.klassert@secunet.com>

On policies with a transport mode template, we pass the addresses
from the flowi to xfrm_state_find(), assuming that the IP addresses
(and address family) don't change during transformation.

Unfortunately our policy template validation is not strict enough.
It is possible to configure policies with transport mode template
where the address family of the template does not match the selectors
address family. This lead to stack-out-of-bound reads because
we compare arddesses of the wrong family. Fix this by refusing
such a configuration, address family can not change on transport
mode.

We use the assumption that, on transport mode, the first templates
address family must match the address family of the policy selector.
Subsequent transport mode templates must mach the address family of
the previous template.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_user.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index ff58c37469d6..bdb48e5dba04 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -1419,11 +1419,14 @@ static void copy_templates(struct xfrm_policy *xp, struct xfrm_user_tmpl *ut,
 
 static int validate_tmpl(int nr, struct xfrm_user_tmpl *ut, u16 family)
 {
+	u16 prev_family;
 	int i;
 
 	if (nr > XFRM_MAX_DEPTH)
 		return -EINVAL;
 
+	prev_family = family;
+
 	for (i = 0; i < nr; i++) {
 		/* We never validated the ut->family value, so many
 		 * applications simply leave it at zero.  The check was
@@ -1435,6 +1438,12 @@ static int validate_tmpl(int nr, struct xfrm_user_tmpl *ut, u16 family)
 		if (!ut[i].family)
 			ut[i].family = family;
 
+		if ((ut[i].mode == XFRM_MODE_TRANSPORT) &&
+		    (ut[i].family != prev_family))
+			return -EINVAL;
+
+		prev_family = ut[i].family;
+
 		switch (ut[i].family) {
 		case AF_INET:
 			break;
-- 
2.14.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox