Netdev List
 help / color / mirror / Atom feed
* Re: net: GPF in eth_header
From: Eric Dumazet @ 2016-11-29 16:15 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: syzkaller, Dmitry Vyukov, David Miller, Tom Herbert,
	Alexander Duyck, Hannes Frederic Sowa, Jiri Benc, Sabrina Dubroca,
	netdev, LKML
In-Reply-To: <CAAeHK+zLO=Gv33TqS6tGTTC=QUYO-HOPgSB2oS1SBOAyJzPDTQ@mail.gmail.com>

On Tue, 2016-11-29 at 16:31 +0100, Andrey Konovalov wrote:
=
> The issue is not with skb_network_offset(), but with __skb_pull()
> using skb_network_offset() as an argument.
> 

No. The issue can happen with _any_ __skb_pull() with a 'negative'
argument, on 64bit arches.

skb_network_offset() is only one of the many cases this could happen if
a bug is added at some random place, including memory corruption from
a different kernel layer, or buggy hardware.

> I'm not sure what would be the beast way to fix this, to add a check
> before every __skb_pull(skb_network_offset()), to fix __skb_pull() to
> work with signed ints, to add BUG_ON()'s in __skb_pull, or something
> else.
> 
> What I meant is that you fixed this very instance of the bug, and I'm
> pointing out that a similar one might hit us again.

As I said, adding a check in skb_network_offset() would not be generic
enough.

Sure, we can be proactive and add tests everywhere in the kernel, but we
also want to keep it reasonably fast.

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-29 16:19 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161129063106.GB84990-wPcXA7LoDC+1XWohqUldA0EOCMrvLtNR@public.gmane.org>

On Mon, Nov 28, 2016 at 10:31:06PM -0800, Vishwanathapura, Niranjana wrote:
> On Fri, Nov 25, 2016 at 12:05:09PM -0700, Jason Gunthorpe wrote:
> >On Thu, Nov 24, 2016 at 06:13:50PM -0800, Vishwanathapura, Niranjana wrote:
> >
> >>In order to be truely device independent the hfi_vnic ULP should not depend
> >>on a device exported symbol. Instead device should register its functions
> >>with the ULP. Hence the approaches a) and b).
> >
> >It is not device independent, it is hard linked to hfi1, just like our
> >other multi-component drivers.. So don't worry about that.
> >
> 
> We would like to keep the design clean and avoid any tight coupling here
> (our original design in this series tackled these).
> Any strong reason not to go with a) or b) ?

You are not making a subsystem. Don't overcomplicate things. A
multi-part device device can just directly link.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] tun: Use netif_receive_skb instead of netif_rx
From: Michael S. Tsirkin @ 2016-11-29 16:20 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Herbert Xu, David S . Miller, Jason Wang, Eric Dumazet,
	Paolo Abeni, Soheil Hassas Yeganeh, Markus Elfring, Mike Rapoport,
	netdev, linux-kernel, Dmitry Vyukov, Kostya Serebryany, syzkaller
In-Reply-To: <1480433136-7922-1-git-send-email-andreyknvl@google.com>

On Tue, Nov 29, 2016 at 04:25:36PM +0100, Andrey Konovalov wrote:
> This patch changes tun.c to call netif_receive_skb instead of netif_rx
> when a packet is received. The difference between the two is that netif_rx
> queues the packet into the backlog, and netif_receive_skb proccesses the
> packet in the current context.
> 
> This patch is required for syzkaller [1] to collect coverage from packet
> receive paths, when a packet being received through tun (syzkaller collects
> coverage per process in the process context).
> 
> A similar patch was introduced back in 2010 [2, 3], but the author found
> out that the patch doesn't help with the task he had in mind (for cgroups
> to shape network traffic based on the original process) and decided not to
> go further with it. The main concern back then was about possible stack
> exhaustion with 4K stacks, but CONFIG_4KSTACKS was removed and stacks are
> 8K now.
> 
> [1] https://github.com/google/syzkaller
> 
> [2] https://www.spinics.net/lists/netdev/thrd440.html#130570
> 
> [3] https://www.spinics.net/lists/netdev/msg130570.html
> 
> Signed-off-by: Andrey Konovalov <andreyknvl@google.com>

This was on my list of things to investigate ever since
8k stack default went in. Thanks for looking into this!

I note that there are still seem to exit 3 architectures that do have
CONFIG_4KSTACKS.

How about a wrapper that does netif_rx_ni with CONFIG_4KSTACKS.


> ---
>  drivers/net/tun.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8093e39..4b56e91 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1304,7 +1304,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>  	skb_probe_transport_header(skb, 0);
>  
>  	rxhash = skb_get_hash(skb);
> -	netif_rx_ni(skb);
> +	local_bh_disable();
> +	netif_receive_skb(skb);
> +	local_bh_enable();
>  
>  	stats = get_cpu_ptr(tun->pcpu_stats);
>  	u64_stats_update_begin(&stats->syncp);
> -- 
> 2.8.0.rc3.226.g39d4020

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-29 16:21 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161129062938.GA84990-wPcXA7LoDC+1XWohqUldA0EOCMrvLtNR@public.gmane.org>

On Mon, Nov 28, 2016 at 10:29:38PM -0800, Vishwanathapura, Niranjana wrote:
> On Thu, Nov 24, 2016 at 09:15:45AM -0700, Jason Gunthorpe wrote:
> >>And will move the hfi_vnic module under
> >>???drivers/infiniband/ulp/hfi_vnic???.
> >
> >I would prefer drivers/net/ethernet
> >
> >This is clearly not a ULP since it doesn't use verbs.
> >
> 
> I understand it is not using verbs, but the control path (ib_device client)
> is using verbs (IB MAD).
> Our prefernce is to keep it somewhere under drivers/infiniband. Summarizing
> reasons again here,
> 
> - VNIC control driver (ib_device client) is an IB MAD agent.
> - It is purly a software construct, encapsualtes ethernet packets in
> Omni-path packet and depends on hfi1 driver here for HW access.

Is the majority of the code MAD focused or net stack focused?

I'm not sure it matters, it isn't like we can review Intel's
proprietary mad stuff anyhow. :\

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 12/13] net: ethernet: ti: cpts: calc mult and shift from refclk freq
From: Grygorii Strashko @ 2016-11-29 16:22 UTC (permalink / raw)
  To: Richard Cochran
  Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
	linux-omap, Rob Herring, devicetree, Murali Karicheri,
	Wingman Kwok, John Stultz, Thomas Gleixner
In-Reply-To: <20161129103453.GJ3110@localhost.localdomain>



On 11/29/2016 04:34 AM, Richard Cochran wrote:
> On Mon, Nov 28, 2016 at 05:03:36PM -0600, Grygorii Strashko wrote:
>> +static void cpts_calc_mult_shift(struct cpts *cpts)
>> +{
>> +	u64 frac, maxsec, ns;
>> +	u32 freq, mult, shift;
>> +
>> +	freq = clk_get_rate(cpts->refclk);
>> +
>> +	/* Calc the maximum number of seconds which we can run before
>> +	 * wrapping around.
>> +	 */
>> +	maxsec = cpts->cc.mask;
>> +	do_div(maxsec, freq);
>> +	if (maxsec > 600 && cpts->cc.mask > UINT_MAX)
>> +		maxsec = 600;
> 
> The reason for this test is not obvious.  Why check cc.mask against
> UINT_MAX?  Please use the comment to explain it.
> 

Yeah. This is copy paste from __clocksource_update_freq_scale(), but
I'm going to limit it to 10 sec for now, because otherwise it will result in too small
mult in case of 64bit counter.

if (maxsec > 10)
	maxsec = 10;

-- 
regards,
-grygorii

^ permalink raw reply

* Re: [net-next] neigh: remove duplicate check for same neigh
From: David Ahern @ 2016-11-29 16:22 UTC (permalink / raw)
  To: Zhang Shengju, netdev
In-Reply-To: <1480407771-12884-1-git-send-email-zhangshengju@cmss.chinamobile.com>

On 11/29/16 1:22 AM, Zhang Shengju wrote:
> @@ -2285,20 +2295,15 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
>  	rcu_read_lock_bh();
>  	nht = rcu_dereference_bh(tbl->nht);
>  
> -	for (h = s_h; h < (1 << nht->hash_shift); h++) {
> -		if (h > s_h)
> -			s_idx = 0;
> +	for (h = s_h; h < (1 << nht->hash_shift); h++, s_idx = 0) {
>  		for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
>  		     n != NULL;
> -		     n = rcu_dereference_bh(n->next)) {
> -			if (!net_eq(dev_net(n->dev), net))
> -				continue;
> -			if (neigh_ifindex_filtered(n->dev, filter_idx))
> +		     n = rcu_dereference_bh(n->next), idx++) {

incrementing idx here ...

> +			if (idx < s_idx || !net_eq(dev_net(n->dev), net))
>  				continue;
> -			if (neigh_master_filtered(n->dev, filter_master_idx))
> +			if (neigh_dump_filtered(n->dev, filter_idx,
> +						filter_master_idx))
>  				continue;
> -			if (idx < s_idx)
> -				goto next;
>  			if (neigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
>  					    cb->nlh->nlmsg_seq,
>  					    RTM_NEWNEIGH,

... causes a missed entry when an error happens here

idx is only incremented when a neigh has been successfully added to the skb.


Your intention is to save a few cycles by making idx an absolute index within the hash bucket and jumping to the next entry to be dumped without evaluating any filters per entry.

So why not keep it simple and do this:

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 2ae929f9bd06..2e49a85e696a 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2291,14 +2291,12 @@ static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
                for (n = rcu_dereference_bh(nht->hash_buckets[h]), idx = 0;
                     n != NULL;
                     n = rcu_dereference_bh(n->next)) {
-                       if (!net_eq(dev_net(n->dev), net))
-                               continue;
-                       if (neigh_ifindex_filtered(n->dev, filter_idx))
-                               continue;
-                       if (neigh_master_filtered(n->dev, filter_master_idx))
-                               continue;
                        if (idx < s_idx)
                                goto next;
+                       if (!net_eq(dev_net(n->dev), net) ||
+                           neigh_ifindex_filtered(n->dev, filter_idx) ||
+                           neigh_master_filtered(n->dev, filter_master_idx))
+                               goto next;
                        if (neigh_fill_info(skb, n, NETLINK_CB(cb->skb).portid,
                                            cb->nlh->nlmsg_seq,
                                            RTM_NEWNEIGH,

^ permalink raw reply related

* [PATCH net-next] net: can: usb: kvaser_usb: fix spelling mistake of "outstanding"
From: Colin King @ 2016-11-29 16:27 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde, Jimmy Assarsson,
	Wolfram Sang, David S . Miller, linux-can, netdev
  Cc: linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Trivial fix to spelling mistake "oustanding" to "outstanding" in
comment and dev_dbg message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/can/usb/kvaser_usb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/can/usb/kvaser_usb.c b/drivers/net/can/usb/kvaser_usb.c
index d51e0c4..18cc529 100644
--- a/drivers/net/can/usb/kvaser_usb.c
+++ b/drivers/net/can/usb/kvaser_usb.c
@@ -459,7 +459,7 @@ struct kvaser_usb {
 	struct usb_endpoint_descriptor *bulk_in, *bulk_out;
 	struct usb_anchor rx_submitted;
 
-	/* @max_tx_urbs: Firmware-reported maximum number of oustanding,
+	/* @max_tx_urbs: Firmware-reported maximum number of outstanding,
 	 * not yet ACKed, transmissions on this device. This value is
 	 * also used as a sentinel for marking free tx contexts.
 	 */
@@ -2027,7 +2027,7 @@ static int kvaser_usb_probe(struct usb_interface *intf,
 		((dev->fw_version >> 16) & 0xff),
 		(dev->fw_version & 0xffff));
 
-	dev_dbg(&intf->dev, "Max oustanding tx = %d URBs\n", dev->max_tx_urbs);
+	dev_dbg(&intf->dev, "Max outstanding tx = %d URBs\n", dev->max_tx_urbs);
 
 	err = kvaser_usb_get_card_info(dev);
 	if (err) {
-- 
2.10.2


^ permalink raw reply related

* RE: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Hefty, Sean @ 2016-11-29 16:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Vishwanathapura, Niranjana
  Cc: Weiny, Ira, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Dalessandro, Dennis
In-Reply-To: <20161129161950.GB742-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

> You are not making a subsystem. Don't overcomplicate things. A
> multi-part device device can just directly link.

The VNIC may be usable over multiple generations of HFIs, but I don't know if that is the intent.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: netlink: GPF in sock_sndtimeo
From: Richard Guy Briggs @ 2016-11-29 16:48 UTC (permalink / raw)
  To: Cong Wang
  Cc: Dmitry Vyukov, David Miller, Johannes Berg, Florian Westphal,
	Eric Dumazet, Herbert Xu, netdev, LKML, syzkaller
In-Reply-To: <CAM_iQpVeLvfYV+1jX1ZKOntZim4roof4=>

On 2016-11-26 17:11, Cong Wang wrote:
> On Sat, Nov 26, 2016 at 7:44 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> > Hello,

Eric, thanks for the heads up on this.  (more below...)

> > The following program triggers GPF in sock_sndtimeo:
> > https://gist.githubusercontent.com/dvyukov/c19cadd309791cf5cb9b2bf936d3f48d/raw/1743ba0211079a5465d039512b427bc6b59b1a76/gistfile1.txt
> >
> > On commit 16ae16c6e5616c084168740990fc508bda6655d4 (Nov 24).
> >
> > general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> > Dumping ftrace buffer:
> >    (ftrace buffer empty)
> > Modules linked in:
> > CPU: 1 PID: 19950 Comm: syz-executor Not tainted 4.9.0-rc5+ #54
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > task: ffff88002a0d0840 task.stack: ffff880036920000
> > RIP: 0010:[<ffffffff86cb35e1>]  [<     inline     >] sock_sndtimeo
> > include/net/sock.h:2075
> > RIP: 0010:[<ffffffff86cb35e1>]  [<ffffffff86cb35e1>]
> > netlink_unicast+0xe1/0x730 net/netlink/af_netlink.c:1232
> > RSP: 0018:ffff880036926f68  EFLAGS: 00010202
> > RAX: 0000000000000068 RBX: ffff880036927000 RCX: ffffc900021d0000
> > RDX: 0000000000000d63 RSI: 00000000024000c0 RDI: 0000000000000340
> > RBP: ffff880036927028 R08: ffffed0006ea7aab R09: ffffed0006ea7aab
> > R10: 0000000000000001 R11: ffffed0006ea7aaa R12: dffffc0000000000
> > R13: 0000000000000000 R14: ffff880035de3400 R15: ffff880035de3400
> > FS:  00007f90a2fc7700(0000) GS:ffff88003ed00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000006de0c0 CR3: 0000000035de6000 CR4: 00000000000006e0
> > Stack:
> >  ffff880035de3400 ffffffff819f02a1 1ffff10006d24df4 0000000000000004
> >  00004db400000014 ffff880036926fd8 ffffffff00000000 0000000041b58ab3
> >  ffffffff89653c11 ffffffff86cb3500 ffffffff819f0345 ffff880035de3400
> > Call Trace:
> >  [<     inline     >] audit_replace kernel/audit.c:817
> >  [<ffffffff816c34b9>] audit_receive_msg+0x22c9/0x2ce0 kernel/audit.c:894
> >  [<     inline     >] audit_receive_skb kernel/audit.c:1120
> >  [<ffffffff816c40ac>] audit_receive+0x1dc/0x360 kernel/audit.c:1133
> >  [<     inline     >] netlink_unicast_kernel net/netlink/af_netlink.c:1214
> >  [<ffffffff86cb3a14>] netlink_unicast+0x514/0x730 net/netlink/af_netlink.c:1240
> >  [<ffffffff86cb46d4>] netlink_sendmsg+0xaa4/0xe50 net/netlink/af_netlink.c:1786
> >  [<     inline     >] sock_sendmsg_nosec net/socket.c:621
> >  [<ffffffff86a6d54f>] sock_sendmsg+0xcf/0x110 net/socket.c:631
> >  [<ffffffff86a6d8bb>] sock_write_iter+0x32b/0x620 net/socket.c:829
> >  [<     inline     >] new_sync_write fs/read_write.c:499
> >  [<ffffffff81a6f24e>] __vfs_write+0x4fe/0x830 fs/read_write.c:512
> >  [<ffffffff81a70cf5>] vfs_write+0x175/0x4e0 fs/read_write.c:560
> >  [<     inline     >] SYSC_write fs/read_write.c:607
> >  [<ffffffff81a75180>] SyS_write+0x100/0x240 fs/read_write.c:599
> >  [<ffffffff81009a24>] do_syscall_64+0x2f4/0x940 arch/x86/entry/common.c:280
> >  [<ffffffff88149e8d>] entry_SYSCALL64_slow_path+0x25/0x25
> > Code: fe 4c 89 f7 e8 31 16 ff ff 8b 8d 70 ff ff ff 49 89 c7 31 c0 85
> > c9 75 25 e8 7d 4a a3 fa 49 8d bd 40 03 00 00 48 89 f8 48 c1 e8 03 <42>
> > 80 3c 20 00 0f 85 3a 06 00 00 49 8b 85 40 03 00 00 4c 8d 73
> > RIP  [<     inline     >] sock_sndtimeo include/net/sock.h:2075
> > RIP  [<ffffffff86cb35e1>] netlink_unicast+0xe1/0x730
> > net/netlink/af_netlink.c:1232
> >  RSP <ffff880036926f68>
> > ---[ end trace 8383a15fba6fdc59 ]---
> 
> It is racy on audit_sock, especially on the netns exit path.

I think that is the only place it is racy.  The other places audit_sock
is set is when the socket failure has just triggered a reset.

Is there a notifier callback for failed or reaped sockets?

> Could the following patch help a little bit? Also, I don't see how the
> synchronize_net() here could sync with netlink rcv path, since unlike
> packets from wire, netlink messages are not handled in BH context
> nor I see any RCU taken on rcv path.

Thanks Cong, this looks like it could help.  I'll have a closer look.

> diff --git a/kernel/audit.c b/kernel/audit.c
> index f1ca116..20bc79e 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1167,10 +1167,13 @@ static void __net_exit audit_net_exit(struct net *net)
>  {
>         struct audit_net *aunet = net_generic(net, audit_net_id);
>         struct sock *sock = aunet->nlsk;
> +
> +       mutex_lock(&audit_cmd_mutex);
>         if (sock == audit_sock) {
>                 audit_pid = 0;
>                 audit_sock = NULL;
>         }
> +       mutex_unlock(&audit_cmd_mutex);
> 
>         RCU_INIT_POINTER(aunet->nlsk, NULL);
>         synchronize_net();

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH net] bpf: fix states equal logic for varlen access
From: Alexei Starovoitov @ 2016-11-29 16:49 UTC (permalink / raw)
  To: Josef Bacik; +Cc: davem, netdev, daniel, ast, jannh
In-Reply-To: <db0d1d81-b497-4f95-48c1-87f5f52a906e@fb.com>

On Tue, Nov 29, 2016 at 09:45:33AM -0500, Josef Bacik wrote:
> On 11/28/2016 10:04 PM, Alexei Starovoitov wrote:
> >On Mon, Nov 28, 2016 at 02:44:10PM -0500, Josef Bacik wrote:
> >>If we have a branch that looks something like this
> >>
> >>int foo = map->value;
> >>if (condition) {
> >>  foo += blah;
> >>} else {
> >>  foo = bar;
> >>}
> >>map->array[foo] = baz;
> >>
> >>We will incorrectly assume that the !condition branch is equal to the condition
> >>branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
> >>adjust this logic to only do this if we didn't do a varlen access after we
> >>processed the !condition branch, otherwise we have different ranges and need to
> >>check the other branch as well.
> >>
> >>Signed-off-by: Josef Bacik <jbacik@fb.com>
> >>---
> >> kernel/bpf/verifier.c | 10 ++++++++--
> >> 1 file changed, 8 insertions(+), 2 deletions(-)
> >>
> >>diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >>index 89f787c..2c8a688 100644
> >>--- a/kernel/bpf/verifier.c
> >>+++ b/kernel/bpf/verifier.c
> >>@@ -2478,6 +2478,7 @@ static bool states_equal(struct bpf_verifier_env *env,
> >> {
> >> 	struct bpf_reg_state *rold, *rcur;
> >> 	int i;
> >>+	bool map_access = env->varlen_map_value_access;
> >
> >that's a bit misleading name for the variable.
> >Pls call it varlen_map_access.
> >
> >> 	for (i = 0; i < MAX_BPF_REG; i++) {
> >> 		rold = &old->regs[i];
> >>@@ -2489,12 +2490,17 @@ static bool states_equal(struct bpf_verifier_env *env,
> >> 		/* If the ranges were not the same, but everything else was and
> >> 		 * we didn't do a variable access into a map then we are a-ok.
> >> 		 */
> >>-		if (!env->varlen_map_value_access &&
> >>+		if (!map_access &&
> >> 		    rold->type == rcur->type && rold->imm == rcur->imm)
> >
> >just noticed that this one is missing comparing rold->id == rcur->id
> >
> 
> Do you want me to fix that here?  I'll fix up the rest of the stuff, and
> Daniels things as well.  Thanks,

Nevermind. Comparing 'id' is not needed in net, only in net-next.

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-29 16:50 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Vishwanathapura, Niranjana, Weiny, Ira, Doug Ledford,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
	Dalessandro, Dennis
In-Reply-To: <1828884A29C6694DAF28B7E6B8A82373AB0B9B46@ORSMSX109.amr.corp.intel.com>

On Tue, Nov 29, 2016 at 04:44:37PM +0000, Hefty, Sean wrote:
> > You are not making a subsystem. Don't overcomplicate things. A
> > multi-part device device can just directly link.
> 
> The VNIC may be usable over multiple generations of HFIs, but I
> don't know if that is the intent.

If Intel wants to build a HFI subystem within RDMA with multiple
drivers then sure, but they are not there yet, and we don't even know
what that could look like. So it is better to leave it simple for now.

Jason

^ permalink raw reply

* Re: bpf debug info
From: Alexei Starovoitov @ 2016-11-29 17:01 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, Brenden Blanco, Thomas Graf, Wangnan, He Kuang,
	kernel-team
In-Reply-To: <583D9AA4.8050601@iogearbox.net>

On Tue, Nov 29, 2016 at 04:11:32PM +0100, Daniel Borkmann wrote:
> On 11/29/2016 07:42 AM, Alexei Starovoitov wrote:
> [...]
> >The support for debug information in BPF was recently added to llvm.
> >
> >In order to use it recompile bpf programs with the following patch
> >in samples/bpf/Makefile
> >@@ -155,4 +155,4 @@ $(obj)/%.o: $(src)/%.c
> >         $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
> >                 -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
> >                 -Wno-compare-distinct-pointer-types \
> >-               -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
> >+               -O2 -emit-llvm -g -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
> >
> >and compiled .o files can be consumed by standard llvm-objdump utility.
> >
> >$ llvm-objdump -S -no-show-raw-insn samples/bpf/xdp1_kern.o
> >xdp1_kern.o:    file format ELF64-BPF
> >
> >Disassembly of section xdp1:
> >xdp_prog1:
> >; {
> >        0:       r2 = *(u32 *)(r1 + 4)
> >; void *data = (void *)(long)ctx->data;
> >        8:       r1 = *(u32 *)(r1 + 0)
> >; if (data + nh_off > data_end)
> >       10:       r3 = r1
> >       18:       r3 += 14
> >       20:       if r3 > r2 goto 55
> >; h_proto = eth->h_proto;
> >       28:       r3 = *(u8 *)(r1 + 12)
> >       30:       r4 = *(u8 *)(r1 + 13)
> >       38:       r4 <<= 8
> >       40:       r4 |= r3
> >; if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
> >       48:       if r4 == 43144 goto 2
> >       50:       r3 = 14
> >       58:       if r4 != 129 goto 5
> >
> >LBB0_3:
> >; if (data + nh_off > data_end)
> >       60:       r3 = r1
> >       68:       r3 += 18
> >       70:       if r3 > r2 goto 45
> >       78:       r3 = 18
> >; h_proto = vhdr->h_vlan_encapsulated_proto;
> >       80:       r4 = *(u16 *)(r1 + 16)
> >
> >LBB0_5:
> >       88:       r5 = r4
> >       90:       r5 &= 65535
> >; if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
> >       98:       if r5 == 43144 goto 1
> >       a0:       if r5 != 129 goto 9
> >
> >Notice that 'clang -S -o a.s' output and llvm-objdump disassembler
> >were changed to use kernel verifier style, so now it should be easier
> >to see what's going on.
> 
> Sounds really useful, is that scheduled for llvm 3.10 release?

llvm 4.0 :)

> That debugging info is stored in dwarf format into the obj, right?

right.

> Would be nice if also pahole could work on bpf object files.

yeah. pahole need to be taught to recognize bpf e_machine type and relocations.

> >The main advantage of debug info is that verifier error messages
> >are now easier to correlate to original C code.
> 
> Does that mean that the old -S output format is not available anymore?
> Personally, I liked that one better tbh, was hoping someone would have
> added asm parsing for it, so compilation from .S to .o also works.

we can still do a proper assembler form .s into .o
llvm infra is flexible enough. I can point somebody on what places
inside llvm bpf backend to extend to add full parsing of .s

> >If I try to run samples/bpf/test_cls_bpf.sh the verifier will complain:
> >R0=imm0,min_value=0,max_value=0 R1=pkt(id=0,off=0,r=42) R2=pkt_end
> >112: (0f) r4 += r3
> >113: (0f) r1 += r4
> >114: (b7) r0 = 2
> >115: (69) r2 = *(u16 *)(r1 +2)
> >invalid access to packet, off=2 size=2, R1(id=3,off=0,r=0)
> >
> >Now multiply 115 * 8 and convert to hex. This is address 0x398 in llvm-objdump:
> >; struct udphdr *udp = data + tp_off;
> >      388:       r1 += r4
> >      390:       r0 = 2
> >; if (udp->dest == htons(DEFAULT_PKTGEN_UDP_PORT) ||
> >      398:       r2 = *(u16 *)(r1 + 2)
> >      3a0:       if r2 == 2304 goto 16
> >
> >Now it's clear which line of C code is causing the verifier to reject.
> [...]
> 
> Could llvm-objdump switch line numbering for bpf same way as verifier
> output, so mapping step is not really needed?

you mean that llvm-objdump to print 113,114,115 ?
I guess it's doable. Will give it a try.

> So the BPF_COMMENT pseudo insn will get stripped away from the insn array
> after verification step, so we don't need to hold/account for this mem?

that was the idea, but if we keep src in there, we might want to keep
it for some future 'dump program' debugging step?

> I assume in it's ->imm member it will just hold offset into text blob?

yes. 

> Given that the generated verifier log can already become huge nowadays up
> to a point where less than 4k insns prog fails to load due to reaching max
> kernel allowed verifier buffer size, is the plan to only dump C code for
> last few lines or for everything?

right. that's a good point.

^ permalink raw reply

* Re: [PATCH v2 net-next 10/11] qede: Add basic XDP support
From: Jakub Kicinski @ 2016-11-29 17:10 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Yuval Mintz, davem, netdev, alexei.starovoitov
In-Reply-To: <583DA362.7010702@iogearbox.net>

On Tue, 29 Nov 2016 16:48:50 +0100, Daniel Borkmann wrote:
> On 11/29/2016 03:47 PM, Yuval Mintz wrote:
> > Add support for the ndo_xdp callback. This patch would support XDP_PASS,
> > XDP_DROP and XDP_ABORTED commands.
> >
> > This also adds a per Rx queue statistic which counts number of packets
> > which didn't reach the stack [due to XDP].
> >
> > Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com>  
> [...]
> > @@ -1560,6 +1593,7 @@ static int qede_rx_process_cqe(struct qede_dev *edev,
> >   			       struct qede_fastpath *fp,
> >   			       struct qede_rx_queue *rxq)
> >   {
> > +	struct bpf_prog *xdp_prog = READ_ONCE(rxq->xdp_prog);
> >   	struct eth_fast_path_rx_reg_cqe *fp_cqe;
> >   	u16 len, pad, bd_cons_idx, parse_flag;
> >   	enum eth_rx_cqe_type cqe_type;
> > @@ -1596,6 +1630,11 @@ static int qede_rx_process_cqe(struct qede_dev *edev,
> >   	len = le16_to_cpu(fp_cqe->len_on_first_bd);
> >   	pad = fp_cqe->placement_offset;
> >
> > +	/* Run eBPF program if one is attached */
> > +	if (xdp_prog)
> > +		if (!qede_rx_xdp(edev, fp, rxq, xdp_prog, bd, fp_cqe))
> > +			return 1;
> > +  
> 
> You also need to wrap this under rcu_read_lock() (at least I haven't seen
> it in your patches) for same reasons as stated in 326fe02d1ed6 ("net/mlx4_en:
> protect ring->xdp_prog with rcu_read_lock"), as otherwise xdp_prog could
> disappear underneath you. mlx4 and nfp does it correctly, looks like mlx5
> doesn't.

My understanding was that Yuval is always doing full stop()/start() so
there should be no RX packets in flight while the XDP prog is being
changed.  But thinking about it again, perhaps is worth adding the
optimization to forego the full qede_reload() in qede_xdp_set() if there
is a program already loaded and just do the xchg()+put() (and add RCU
protection on the fast path)?

^ permalink raw reply

* Re: [PATCH v3 net-next 1/2] net: ethernet: slicoss: add slicoss gigabit ethernet driver
From: Florian Fainelli @ 2016-11-29 17:14 UTC (permalink / raw)
  To: Lino Sanfilippo, davem, charrer, liodot, gregkh, andrew
  Cc: devel, netdev, linux-kernel
In-Reply-To: <6b7de79d-b149-9888-2d0f-2acee5c2905e@gmx.de>

On 11/28/2016 01:41 PM, Lino Sanfilippo wrote:
> The problem is that the HW does not provide a tx completion index. Instead we have to 
> iterate the status descriptors until we get an invalid idx which indicates that there 
> are no further tx descriptors done for now. I am afraid that if we do not limit the 
> number of descriptors processed in the tx completion handler, a continuous transmission 
> of frames could keep the loop in xmit_complete() run endlessly. I dont know if this 
> can actually happen but I wanted to make sure that this is avoided.

OK, it might be a good idea to put that comment somewhere around the tx
completion handler to understand why it is bounded with a specific value.

> 
>> [snip]
>>
>>> +	while (slic_get_free_rx_descs(rxq) > SLIC_MAX_REQ_RX_DESCS) {
>>> +		skb = alloc_skb(maplen + ALIGN_MASK, gfp);
>>> +		if (!skb)
>>> +			break;
>>> +
>>> +		paddr = dma_map_single(&sdev->pdev->dev, skb->data, maplen,
>>> +				       DMA_FROM_DEVICE);
>>> +		if (dma_mapping_error(&sdev->pdev->dev, paddr)) {
>>> +			netdev_err(dev, "mapping rx packet failed\n");
>>> +			/* drop skb */
>>> +			dev_kfree_skb_any(skb);
>>> +			break;
>>> +		}
>>> +		/* ensure head buffer descriptors are 256 byte aligned */
>>> +		offset = 0;
>>> +		misalign = paddr & ALIGN_MASK;
>>> +		if (misalign) {
>>> +			offset = SLIC_RX_BUFF_ALIGN - misalign;
>>> +			skb_reserve(skb, offset);
>>> +		}
>>> +		/* the HW expects dma chunks for descriptor + frame data */
>>> +		desc = (struct slic_rx_desc *)skb->data;
>>> +		memset(desc, 0, sizeof(*desc));
>>
>> Do you really need to zero-out the prepending RX descriptor? Are not you
>> missing a write barrier here?
> 
> Indeed, it should be sufficient to make sure that the bit SLIC_IRHDDR_SVALID is not set.
> I will adjust it. 
> Concerning the write barrier: You mean a wmb() before slic_write() to ensure that the zeroing
>  of the status desc is done before the descriptor is passed to the HW, right?

Correct, that's what I meant here.

> 
> 
>> [snip]
>>
>>> +
>>> +		dma_sync_single_for_cpu(&sdev->pdev->dev,
>>> +					dma_unmap_addr(buff, map_addr),
>>> +					buff->addr_offset + sizeof(*desc),
>>> +					DMA_FROM_DEVICE);
>>> +
>>> +		status = le32_to_cpu(desc->status);
>>> +		if (!(status & SLIC_IRHDDR_SVALID))
>>> +			break;
>>> +
>>> +		buff->skb = NULL;
>>> +
>>> +		dma_unmap_single(&sdev->pdev->dev,
>>> +				 dma_unmap_addr(buff, map_addr),
>>> +				 dma_unmap_len(buff, map_len),
>>> +				 DMA_FROM_DEVICE);
>>
>> This is potentially inefficient, you already did a cache invalidation
>> for the RX descriptor here, you could be more efficient with just
>> invalidating the packet length, minus the descriptor length.
>>
> 
> I am not sure I understand: We have to unmap the complete dma area, no matter if we synced
> part of it before, dont we? AFAIK a dma sync is different from unmapping dma, or do I miss
> something?

Sorry, I was not very clear, what I meant is that you can allocate and
do the initial dma_map_single() of your RX skbs during ndo_open(), and
then, in your RX path, you can only do dma_sync_single_for_cpu() twice
(once for the RX descriptor status, second time for the actual packet
contents), and when you return the SKB to the HW, do a
dma_sync_single_for_device(). The advantage of doing that, is that if
your cache operations are slow, you only do them on exactly packet
length, and not the actual RX buffer size (e.g: 2KB).
-- 
Florian

^ permalink raw reply

* [Patch net-next] audit: remove useless synchronize_net()
From: Cong Wang @ 2016-11-29 17:14 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Richard Guy Briggs

netlink kernel socket is protected by refcount, not RCU.
Its rcv path is neither protected by RCU. So the synchronize_net()
is just pointless.

Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 kernel/audit.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 92c463d..67b9fbd8 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1172,9 +1172,8 @@ static void __net_exit audit_net_exit(struct net *net)
 		audit_sock = NULL;
 	}
 
-	RCU_INIT_POINTER(aunet->nlsk, NULL);
-	synchronize_net();
 	netlink_kernel_release(sock);
+	aunet->nlsk = NULL;
 }
 
 static struct pernet_operations audit_net_ops __net_initdata = {
-- 
2.1.0

^ permalink raw reply related

* [RFC PATCH net-next v2] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Lebrun @ 2016-11-29 17:15 UTC (permalink / raw)
  To: netdev; +Cc: David Lebrun

When multiple nexthops are available for a given route, the routing engine
chooses a nexthop by computing the flow hash through get_hash_from_flowi6
and by taking that value modulo the number of nexthops. The resulting value
indexes the nexthop to select. This method causes issues when a new nexthop
is added or one is removed (e.g. link failure). In that case, the number
of nexthops changes and potentially all the flows get re-routed to another
nexthop.

This patch implements a consistent hash method to select the nexthop in
case of ECMP. The idea is to generate K slices (or intervals) for each
route with multiple nexthops. The nexthops are randomly assigned to those
slices, in a uniform manner. The number K is configurable through a sysctl
net.ipv6.route.ecmp_slices and is always an exponent of 2. To select the
nexthop, the algorithm takes the flow hash and computes an index which is
the flow hash modulo K. As K = 2^x, the modulo can be computed using a
simple binary AND operation (idx = hash & (K - 1)). The resulting index
references the selected nexthop. The lookup time complexity is thus O(1).

When a nexthop is added, it steals K/N slices from the other nexthops,
where N is the new number of nexthops. The slices are stolen randomly and
uniformly from the other nexthops. When a nexthop is removed, the orphan
slices are randomly reassigned to the other nexthops.

The number of slices for a route also fixes the maximum number of nexthops
possible for that route.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
---
 include/net/ip6_fib.h    |  25 +++++++
 include/net/netns/ipv6.h |   1 +
 net/ipv6/ip6_fib.c       | 174 ++++++++++++++++++++++++++++++++++++++++++++++-
 net/ipv6/route.c         |  58 ++++++++--------
 4 files changed, 229 insertions(+), 29 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index a74e2aa..29594cc 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -93,6 +93,30 @@ struct rt6key {
 
 struct fib6_table;
 
+/* Multipath nexthop.
+ * @nh is the actual nexthop
+ * @nslices is the number of slices assigned to this nexthop
+ */
+struct rt6_multi_nh {
+	struct rt6_info *nh;
+	unsigned int	nslices;
+};
+
+/* Multipath routes map.
+ * @nhs is an array of available nexthops
+ * @size is the number of elements in @nhs
+ * @nslices is the number of slices and the number of elements in @index
+ *	    and is always in the form 2^x to prevent using a division for
+ *	    the computation of the modulo.
+ * @index is an array mapping a slice index to a nexthop index in @nhs
+ */
+struct rt6_multi_map {
+	struct rt6_multi_nh	*nhs;
+	unsigned int		size;
+	unsigned int		nslices;
+	__u8			*index;
+};
+
 struct rt6_info {
 	struct dst_entry		dst;
 
@@ -113,6 +137,7 @@ struct rt6_info {
 	 */
 	struct list_head		rt6i_siblings;
 	unsigned int			rt6i_nsiblings;
+	struct rt6_multi_map		*rt6i_nh_map;
 
 	atomic_t			rt6i_ref;
 
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index de7745e..4967846 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -36,6 +36,7 @@ struct netns_sysctl_ipv6 {
 	int idgen_retries;
 	int idgen_delay;
 	int flowlabel_state_ranges;
+	int ip6_rt_ecmp_slices;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index ef54852..b6b1895 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -727,6 +727,166 @@ static void fib6_purge_rt(struct rt6_info *rt, struct fib6_node *fn,
 	}
 }
 
+static void fib6_mp_free(struct rt6_info *rt)
+{
+	struct rt6_multi_map *nh_map = rt->rt6i_nh_map;
+	struct rt6_info *sibling;
+
+	if (nh_map) {
+		list_for_each_entry(sibling, &rt->rt6i_siblings,
+				    rt6i_siblings) {
+			sibling->rt6i_nh_map = NULL;
+		}
+
+		rt->rt6i_nh_map = NULL;
+
+		kfree(nh_map->nhs);
+		kfree(nh_map->index);
+		kfree(nh_map);
+	}
+}
+
+static int fib6_mp_extend(struct net *net, struct rt6_info *sref,
+			  struct rt6_info *rt)
+{
+	struct rt6_multi_map *nh_map = sref->rt6i_nh_map;
+	struct rt6_multi_nh *tmp_nhs, *old_nhs;
+	unsigned int kslices;
+	unsigned int i, j;
+
+	if (!nh_map) {
+		__u8 *index;
+
+		nh_map = kmalloc(sizeof(*nh_map), GFP_ATOMIC);
+		if (!nh_map)
+			return -ENOMEM;
+
+		nh_map->size = 1;
+		nh_map->nslices = (1 << net->ipv6.sysctl.ip6_rt_ecmp_slices);
+
+		tmp_nhs = kmalloc(sizeof(*tmp_nhs), GFP_ATOMIC);
+		if (!tmp_nhs) {
+			kfree(nh_map);
+			return -ENOMEM;
+		}
+
+		tmp_nhs[0].nh = sref;
+		tmp_nhs[0].nslices = nh_map->nslices;
+		nh_map->nhs = tmp_nhs;
+
+		index = kmalloc_array(nh_map->nslices, sizeof(*index),
+				      GFP_ATOMIC);
+		if (!index) {
+			kfree(nh_map->nhs);
+			kfree(nh_map);
+			return -ENOMEM;
+		}
+
+		for (i = 0; i < nh_map->nslices; i++)
+			index[i] = 0;
+
+		nh_map->index = index;
+		sref->rt6i_nh_map = nh_map;
+	}
+
+	if (nh_map->size == nh_map->nslices)
+		return -ENOBUFS;
+
+	nh_map->size++;
+
+	tmp_nhs = kmalloc_array(nh_map->size, sizeof(*tmp_nhs), GFP_ATOMIC);
+	if (!tmp_nhs)
+		return -ENOMEM;
+
+	for (i = 0; i < nh_map->size - 1; i++)
+		tmp_nhs[i] = nh_map->nhs[i];
+
+	tmp_nhs[nh_map->size - 1].nh = rt;
+	tmp_nhs[nh_map->size - 1].nslices = 0;
+
+	kslices = nh_map->nslices / nh_map->size;
+
+	/* Loop over nexthops and steal a random slice until we have kslices for
+	 * the new nexthop. This algorithm ensures that flows are taken as
+	 * equally as possible from current nexthops.
+	 *
+	 * At each iteration, @j is the index in tmp_nhs of the nexthop
+	 * to steal from and @idx is the index of the slice to steal
+	 * among @j's slices.
+	 */
+	for (i = 0, j = 0; i < kslices; i++) {
+		unsigned int idx, k = 0;
+
+		if (tmp_nhs[j].nslices == 1)
+			continue;
+
+		idx = (prandom_u32() % tmp_nhs[j].nslices) + 1;
+		do {
+			if (nh_map->index[k++] == j)
+				idx--;
+		} while (idx);
+
+		nh_map->index[k - 1] = nh_map->size - 1;
+		tmp_nhs[nh_map->size - 1].nslices++;
+		tmp_nhs[j].nslices--;
+
+		j = (j + 1) % (nh_map->size - 1);
+	}
+
+	WARN_ON(tmp_nhs[nh_map->size - 1].nslices != kslices);
+
+	old_nhs = nh_map->nhs;
+	nh_map->nhs = tmp_nhs;
+	kfree(old_nhs);
+
+	rt->rt6i_nh_map = nh_map;
+
+	return 0;
+}
+
+static int fib6_mp_shrink(struct rt6_info *sref, struct rt6_info *rt)
+{
+	struct rt6_multi_map *nh_map = sref->rt6i_nh_map;
+	struct rt6_multi_nh *tmp_nhs, *old_nhs;
+	unsigned int i, j, idx = 0;
+
+	tmp_nhs = kmalloc_array(nh_map->size - 1, sizeof(*tmp_nhs), GFP_ATOMIC);
+	if (!tmp_nhs)
+		return -ENOMEM;
+
+	for (i = 0, j = 0; i < nh_map->size; i++) {
+		if (nh_map->nhs[i].nh != rt)
+			tmp_nhs[j++] = nh_map->nhs[i];
+		else
+			idx = i;
+	}
+
+	nh_map->size--;
+	old_nhs = nh_map->nhs;
+	nh_map->nhs = tmp_nhs;
+	kfree(old_nhs);
+
+	rt->rt6i_nh_map = NULL;
+
+	/* Update the slices index. For each slice mapping to the removed
+	 * nexthop, assign another random nexthop. For each slice mapping
+	 * to a nexthop that was after the removed nexthop, decrement the
+	 * nexthop index to reflect the shift. Nexthops that were before
+	 * the removed nexthop are unaffected.
+	 */
+	for (i = 0; i < nh_map->nslices; i++) {
+		if (nh_map->index[i] == idx) {
+			j = prandom_u32() % nh_map->size;
+			nh_map->index[i] = j;
+			nh_map->nhs[j].nslices++;
+		} else if (nh_map->index[i] > idx) {
+			nh_map->index[i]--;
+		}
+	}
+
+	return 0;
+}
+
 /*
  *	Insert routing information in a node.
  */
@@ -837,6 +997,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 			}
 			sibling = sibling->dst.rt6_next;
 		}
+		err = fib6_mp_extend(info->nl_net, sibling, rt);
+		if (err < 0)
+			return err;
 		/* For each sibling in the list, increment the counter of
 		 * siblings. BUG() if counters does not match, list of siblings
 		 * is broken!
@@ -900,6 +1063,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 			fn->fn_flags |= RTN_RTINFO;
 		}
 		nsiblings = iter->rt6i_nsiblings;
+		if (nsiblings)
+			fib6_mp_free(iter);
 		fib6_purge_rt(iter, fn, info->nl_net);
 		rt6_release(iter);
 
@@ -1407,13 +1572,20 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp,
 
 	/* Remove this entry from other siblings */
 	if (rt->rt6i_nsiblings) {
-		struct rt6_info *sibling, *next_sibling;
+		struct rt6_info *sibling, *next_sibling, *sref;
+
+		sref = list_first_entry(&rt->rt6i_siblings, struct rt6_info,
+					rt6i_siblings);
 
 		list_for_each_entry_safe(sibling, next_sibling,
 					 &rt->rt6i_siblings, rt6i_siblings)
 			sibling->rt6i_nsiblings--;
 		rt->rt6i_nsiblings = 0;
 		list_del_init(&rt->rt6i_siblings);
+		if (!sref->rt6i_nsiblings)
+			fib6_mp_free(sref);
+		else
+			fib6_mp_shrink(sref, rt);
 	}
 
 	/* Adjust walkers */
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b317bb1..e9f13e0 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -427,39 +427,27 @@ static bool rt6_check_expired(const struct rt6_info *rt)
 	return false;
 }
 
-/* Multipath route selection:
- *   Hash based function using packet header and flowlabel.
- * Adapted from fib_info_hashfn()
- */
-static int rt6_info_hash_nhsfn(unsigned int candidate_count,
-			       const struct flowi6 *fl6)
-{
-	return get_hash_from_flowi6(fl6) % candidate_count;
-}
-
 static struct rt6_info *rt6_multipath_select(struct rt6_info *match,
 					     struct flowi6 *fl6, int oif,
 					     int strict)
 {
-	struct rt6_info *sibling, *next_sibling;
-	int route_choosen;
+	struct rt6_multi_map *nh_map = match->rt6i_nh_map;
+	__u32 hash = get_hash_from_flowi6(fl6);
+	unsigned int slice, idx;
+	struct rt6_info *res;
 
-	route_choosen = rt6_info_hash_nhsfn(match->rt6i_nsiblings + 1, fl6);
-	/* Don't change the route, if route_choosen == 0
-	 * (siblings does not include ourself)
-	 */
-	if (route_choosen)
-		list_for_each_entry_safe(sibling, next_sibling,
-				&match->rt6i_siblings, rt6i_siblings) {
-			route_choosen--;
-			if (route_choosen == 0) {
-				if (rt6_score_route(sibling, oif, strict) < 0)
-					break;
-				match = sibling;
-				break;
-			}
-		}
-	return match;
+	WARN_ON(!nh_map);
+	if (!nh_map)
+		return match;
+
+	slice = hash & (nh_map->nslices - 1);
+	idx = nh_map->index[slice];
+	res = nh_map->nhs[idx].nh;
+
+	if (rt6_score_route(res, oif, strict) < 0)
+		res = match;
+
+	return res;
 }
 
 /*
@@ -3550,6 +3538,9 @@ int ipv6_sysctl_rtcache_flush(struct ctl_table *ctl, int write,
 	return 0;
 }
 
+static int one = 1;
+static int thirtyone = 31;
+
 struct ctl_table ipv6_route_table_template[] = {
 	{
 		.procname	=	"flush",
@@ -3621,6 +3612,15 @@ struct ctl_table ipv6_route_table_template[] = {
 		.mode		=	0644,
 		.proc_handler	=	proc_dointvec_ms_jiffies,
 	},
+	{
+		.procname	=	"ecmp_slices",
+		.data		=	&init_net.ipv6.sysctl.ip6_rt_ecmp_slices,
+		.maxlen		=	sizeof(int),
+		.mode		=	0644,
+		.proc_handler	=	proc_dointvec_minmax,
+		.extra1		=	&one,
+		.extra2		=	&thirtyone,
+	},
 	{ }
 };
 
@@ -3644,6 +3644,7 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
 		table[7].data = &net->ipv6.sysctl.ip6_rt_mtu_expires;
 		table[8].data = &net->ipv6.sysctl.ip6_rt_min_advmss;
 		table[9].data = &net->ipv6.sysctl.ip6_rt_gc_min_interval;
+		table[10].data = &net->ipv6.sysctl.ip6_rt_ecmp_slices;
 
 		/* Don't export sysctls to unprivileged users */
 		if (net->user_ns != &init_user_ns)
@@ -3707,6 +3708,7 @@ static int __net_init ip6_route_net_init(struct net *net)
 	net->ipv6.sysctl.ip6_rt_gc_elasticity = 9;
 	net->ipv6.sysctl.ip6_rt_mtu_expires = 10*60*HZ;
 	net->ipv6.sysctl.ip6_rt_min_advmss = IPV6_MIN_MTU - 20 - 40;
+	net->ipv6.sysctl.ip6_rt_ecmp_slices = 5;
 
 	net->ipv6.ip6_rt_gc_expire = 30*HZ;
 
-- 
2.7.3

^ permalink raw reply related

* Re: [PATCH net] openvswitch: Fix skb leak in IPv6 reassembly.
From: Pravin Shelar @ 2016-11-29 17:21 UTC (permalink / raw)
  To: Daniele Di Proietto
  Cc: Linux Kernel Network Developers, Florian Westphal, Joe Stringer
In-Reply-To: <CAExiqTKTbgOaXLjHMuvne__9hyNLT5kN72fOrEg2CF3VgFg3Cw@mail.gmail.com>

On Tue, Nov 29, 2016 at 12:54 AM, Daniele Di Proietto
<diproiettod@ovn.org> wrote:
>
>
> 2016-11-28 23:39 GMT-08:00 Pravin Shelar <pshelar@ovn.org>:
>>
>> On Mon, Nov 28, 2016 at 3:43 PM, Daniele Di Proietto
>> <diproiettod@ovn.org> wrote:
>> > If nf_ct_frag6_gather() returns an error other than -EINPROGRESS, it
>> > means that we still have a reference to the skb.  We should free it
>> > before returning from handle_fragments, as stated in the comment above.
>> >
>> > Fixes: daaa7d647f81 ("netfilter: ipv6: avoid nf_iterate recursion")
>> > CC: Florian Westphal <fw@strlen.de>
>> > CC: Pravin B Shelar <pshelar@ovn.org>
>> > CC: Joe Stringer <joe@ovn.org>
>> > Signed-off-by: Daniele Di Proietto <diproiettod@ovn.org>
>> > ---
>> >  net/openvswitch/conntrack.c | 5 ++++-
>> >  1 file changed, 4 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> > index 31045ef..fecefa2 100644
>> > --- a/net/openvswitch/conntrack.c
>> > +++ b/net/openvswitch/conntrack.c
>> > @@ -370,8 +370,11 @@ static int handle_fragments(struct net *net, struct
>> > sw_flow_key *key,
>> >                 skb_orphan(skb);
>> >                 memset(IP6CB(skb), 0, sizeof(struct inet6_skb_parm));
>> >                 err = nf_ct_frag6_gather(net, skb, user);
>> > -               if (err)
>> > +               if (err) {
>> > +                       if (err != -EINPROGRESS)
>> > +                               kfree_skb(skb);
>> >                         return err;
>> > +               }
>> >
>>
>> This fixes the code. But the patch is adding yet another skb-kfree in
>> conntrack code. we could simplify it by reusing error handling in
>> do_execute_actions().
>> If you think that is too complicated for stable branch, I am fine with
>> this patch going in as it is.
>
>
> I thought it was simpler to handle this here rather than changing the
> interface of
> handle_fragments() and ovs_ct_execute(). Also this is more similar to the v4
> case,
> where ip_defrag() frees the skb in case of error.
>
> What do you think?

I agree this is simple enough for stable branch. I will check if ct
action error handling could be simplified bit.

Acked-by: Pravin B Shelar <pshelar@ovn.org>

^ permalink raw reply

* Re: [PATCH v2 0/6] Add QLogic FastLinQ iSCSI (qedi) driver.
From: Martin K. Petersen @ 2016-11-29 17:21 UTC (permalink / raw)
  To: Manish Rangankar
  Cc: martin.petersen, lduncan, cleech, linux-scsi, netdev,
	QLogic-Storage-Upstream, Yuval.Mintz
In-Reply-To: <1478588223-16183-1-git-send-email-manish.rangankar@cavium.com>

>>>>> "Manish" == Manish Rangankar <manish.rangankar@cavium.com> writes:

Manish> This series introduces hardware offload iSCSI initiator driver
Manish> for the 41000 Series Converged Network Adapters (579xx chip) by
Manish> Qlogic. The overall driver design includes a common module
Manish> ('qed') and protocol specific dependent modules ('qedi' for
Manish> iSCSI).

The SCSI portions look reasonably sane to me.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* [PATCH net][v2] bpf: fix states equal logic for varlen access
From: Josef Bacik @ 2016-11-29 17:27 UTC (permalink / raw)
  To: davem, netdev, ast, jannh, daniel, kernel-team

If we have a branch that looks something like this

int foo = map->value;
if (condition) {
  foo += blah;
} else {
  foo = bar;
}
map->array[foo] = baz;

We will incorrectly assume that the !condition branch is equal to the condition
branch as the register for foo will be UNKNOWN_VALUE in both cases.  We need to
adjust this logic to only do this if we didn't do a varlen access after we
processed the !condition branch, otherwise we have different ranges and need to
check the other branch as well.

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
v1->v2:
- renamed and moved varlen_map_access variable.
- dropped the extra () in the second if statement.
- added the Fixes and Reported-by tag.

 kernel/bpf/verifier.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6a93615..8199821 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2454,6 +2454,7 @@ static bool states_equal(struct bpf_verifier_env *env,
 			 struct bpf_verifier_state *old,
 			 struct bpf_verifier_state *cur)
 {
+	bool varlen_map_access = env->varlen_map_value_access;
 	struct bpf_reg_state *rold, *rcur;
 	int i;
 
@@ -2467,12 +2468,17 @@ static bool states_equal(struct bpf_verifier_env *env,
 		/* If the ranges were not the same, but everything else was and
 		 * we didn't do a variable access into a map then we are a-ok.
 		 */
-		if (!env->varlen_map_value_access &&
+		if (!varlen_map_access &&
 		    rold->type == rcur->type && rold->imm == rcur->imm)
 			continue;
 
+		/* If we didn't map access then again we don't care about the
+		 * mismatched range values and it's ok if our old type was
+		 * UNKNOWN and we didn't go to a NOT_INIT'ed reg.
+		 */
 		if (rold->type == NOT_INIT ||
-		    (rold->type == UNKNOWN_VALUE && rcur->type != NOT_INIT))
+		    (!varlen_map_access && rold->type == UNKNOWN_VALUE &&
+		     rcur->type != NOT_INIT))
 			continue;
 
 		if (rold->type == PTR_TO_PACKET && rcur->type == PTR_TO_PACKET &&
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next] bpf: add test for the verifier equal logic bug
From: Josef Bacik @ 2016-11-29 17:35 UTC (permalink / raw)
  To: davem, netdev, ast, jannh, daniel, kernel-team

This is a test to verify that

bpf: fix states equal logic for varlen access

actually fixed the problem.  The problem was if the register we added to our map
register was UNKNOWN in both the false and true branches and the only thing that
changed was the range then we'd incorrectly assume that the true branch was
valid, which it really wasnt.  This tests this case and properly fails without
my fix in place and passes with it in place.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 tools/testing/selftests/bpf/test_verifier.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3c4a1fb..5da2e9d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -2660,6 +2660,29 @@ static struct bpf_test tests[] = {
 		.result = ACCEPT,
 		.prog_type = BPF_PROG_TYPE_SCHED_CLS
 	},
+	{
+		"invalid map access from else condition",
+		.insns = {
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 6),
+			BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_1, MAX_ENTRIES-1, 1),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 1),
+			BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 2),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+			BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, offsetof(struct test_val, foo)),
+			BPF_EXIT_INSN(),
+		},
+		.fixup_map2 = { 3 },
+		.errstr = "R0 unbounded memory access, make sure to bounds check any array access into a map",
+		.result = REJECT,
+		.errstr_unpriv = "R0 pointer arithmetic prohibited",
+		.result_unpriv = REJECT,
+	},
 };
 
 static int probe_filter_length(const struct bpf_insn *fp)
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next v2 1/1] driver: ipvlan: Use NF_IP_PRI_LAST as hook priority instead of INT_MAX
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-11-29 17:36 UTC (permalink / raw)
  To: Gao Feng; +Cc: David Miller, Eric Dumazet, linux-netdev
In-Reply-To: <CA+6hz4qXYzC-mEfTqa5HOf9GtL1S1Q=rcz_CtFW5mbvUtj7niA@mail.gmail.com>

On Mon, Nov 28, 2016 at 5:14 PM, Gao Feng <fgao@ikuai8.com> wrote:
>
> Hi Mahesh,
>
> On Tue, Nov 29, 2016 at 3:26 AM, Mahesh Bandewar (महेश बंडेवार)
> <maheshb@google.com> wrote:
> > On Sun, Nov 27, 2016 at 3:18 AM,  <fgao@ikuai8.com> wrote:
> >> From: Gao Feng <fgao@ikuai8.com>
> >>
> >> It is better to use NF_IP_PRI_LAST instead of INT_MAX as hook priority.
> >> The former is good at readability and easier to maintain.
> >>
> > This IPvlan hook has to be "absolute" last hook and at this moment
> > NF_IP_PRI_LAST is set as INT_MAX so it's not altering anything.
>
> Yes. It is same now.
> So I prefer to use NF_IP_PRI_LAST than INT_MAX.
> Because the nf_hook_ops belongs to the netfilter module. i think the
> ipvlan codes should follow its rule.
> Since netfilter has defined some specific priority enum value, why
> don't we follow it?
>
Making changes only to IPvlan is problem-prone as I have explained earlier.

> >
> > If for whatever reasons the value of NF_IP_PRI_LAST changes, there
> > could be random IPvlan failure. Since that possibility cannot be
> > denied and there are several places INT_MAX is still used as hook
> > priority, I don't see any gain in having this patch in fact there
> > could be future (possible) downside.
>
> If the netfilter module changed the value of NF_IP_PRI_LAST, it may
> decrease it and add one check for the hook priority.
> As a result, the ipvlan may fail to register because of invalid priority.
>
> When use INT_MAX not NF_IP_PRI_LAST, there is one assumption that the
> hook priority is never changed.
> I think it is not good as two different modules.
>
> Regards
> Feng
>
> >
> >> Signed-off-by: Gao Feng <fgao@ikuai8.com>
> >> ---
> >>  v2: Add the lost header file. It is added in local but not in v1 patch
> >>  v1: Inital patch
> >>
> >>  drivers/net/ipvlan/ipvlan_main.c | 5 +++--
> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> >> index ab90b22..01c7446 100644
> >> --- a/drivers/net/ipvlan/ipvlan_main.c
> >> +++ b/drivers/net/ipvlan/ipvlan_main.c
> >> @@ -7,6 +7,7 @@
> >>   *
> >>   */
> >>
> >> +#include "linux/netfilter_ipv4.h"
> >>  #include "ipvlan.h"
> >>
> >>  static u32 ipvl_nf_hook_refcnt = 0;
> >> @@ -16,13 +17,13 @@
> >>                 .hook     = ipvlan_nf_input,
> >>                 .pf       = NFPROTO_IPV4,
> >>                 .hooknum  = NF_INET_LOCAL_IN,
> >> -               .priority = INT_MAX,
> >> +               .priority = NF_IP_PRI_LAST,
> >>         },
> >>         {
> >>                 .hook     = ipvlan_nf_input,
> >>                 .pf       = NFPROTO_IPV6,
> >>                 .hooknum  = NF_INET_LOCAL_IN,
> >> -               .priority = INT_MAX,
> >> +               .priority = NF_IP_PRI_LAST,
> >>         },
> >>  };
> >>
> >> --
> >> 1.9.1
> >>
> >>
>
>

^ permalink raw reply

* Re: [PATCH v2 net-next 10/11] qede: Add basic XDP support
From: Mintz, Yuval @ 2016-11-29 17:51 UTC (permalink / raw)
  To: Jakub Kicinski, Daniel Borkmann
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	alexei.starovoitov@gmail.com
In-Reply-To: <20161129171020.6b8b552d@jkicinski-Precision-T1700>

> > You also need to wrap this under rcu_read_lock() (at least I haven't seen
> > it in your patches) for same reasons as stated in 326fe02d1ed6 ("net/mlx4_en:
> > protect ring->xdp_prog with rcu_read_lock"), as otherwise xdp_prog could
> > disappear underneath you. mlx4 and nfp does it correctly, looks like mlx5
> > doesn't.

> My understanding was that Yuval is always doing full stop()/start() so
> there should be no RX packets in flight while the XDP prog is being
> changed.  But thinking about it again, perhaps is worth adding the
> optimization to forego the full qede_reload() in qede_xdp_set() if there
> is a program already loaded and just do the xchg()+put() (and add RCU
> protection on the fast path)?

Yeps. That the current state of the code.
I'll surely pursue this later, but I don't think all this added complexity
is required for the initial submission.

BTW, one of the problems [or perhaps 'lack of motivation' is a better term]
I had with the program switching scenario was that there was no sample
application that did it.
If it's really an interesting [basic] scenario, perhaps it's worthy to add
a sample user app. that will repeatedly switch the attached eBPF?

^ permalink raw reply

* Re: [PATCH net-netx] net: lan78xx: add LAN7801 MAC-only support
From: Florian Fainelli @ 2016-11-29 17:54 UTC (permalink / raw)
  To: Woojung.Huh, davem, andrew; +Cc: netdev, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D4096E268@CHN-SV-EXMX02.mchp-main.com>

On 11/28/2016 12:03 PM, Woojung.Huh@microchip.com wrote:
> From: Woojung Huh <woojung.huh@microchip.com>
> 
> Add LAN7801 MAC-only configuration support.
> 
> Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
> ---
>  drivers/net/usb/Kconfig   |   5 +++
>  drivers/net/usb/lan78xx.c | 111 +++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/net/usb/lan78xx.h |  14 ++++++
>  3 files changed, 128 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
> index cdde590..3dd490f5 100644
> --- a/drivers/net/usb/Kconfig
> +++ b/drivers/net/usb/Kconfig
> @@ -114,6 +114,11 @@ config USB_LAN78XX
>  	help
>  	  This option adds support for Microchip LAN78XX based USB 2
>  	  & USB 3 10/100/1000 Ethernet adapters.
> +	  LAN7800 : USB 3 to 10/100/1000 Ethernet adapter
> +	  LAN7850 : USB 2 to 10/100/1000 Ethernet adapter
> +	  LAN7801 : USB 3 to 10/100/1000 Ethernet adapter (MAC only)
> +
> +	  Proper PHY driver is required for LAN7801.
>  
>  	  To compile this driver as a module, choose M here: the
>  	  module will be called lan78xx.
> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> index 0c459e9..08f2895 100644
> --- a/drivers/net/usb/lan78xx.c
> +++ b/drivers/net/usb/lan78xx.c
> @@ -40,7 +40,7 @@
>  #define DRIVER_AUTHOR	"WOOJUNG HUH <woojung.huh@microchip.com>"
>  #define DRIVER_DESC	"LAN78XX USB 3.0 Gigabit Ethernet Devices"
>  #define DRIVER_NAME	"lan78xx"
> -#define DRIVER_VERSION	"1.0.5"
> +#define DRIVER_VERSION	"1.0.6"
>  
>  #define TX_TIMEOUT_JIFFIES		(5 * HZ)
>  #define THROTTLE_JIFFIES		(HZ / 8)
> @@ -67,6 +67,7 @@
>  #define LAN78XX_USB_VENDOR_ID		(0x0424)
>  #define LAN7800_USB_PRODUCT_ID		(0x7800)
>  #define LAN7850_USB_PRODUCT_ID		(0x7850)
> +#define LAN7801_USB_PRODUCT_ID		(0x7801)
>  #define LAN78XX_EEPROM_MAGIC		(0x78A5)
>  #define LAN78XX_OTP_MAGIC		(0x78F3)
>  
> @@ -400,6 +401,21 @@ struct lan78xx_net {
>  	struct irq_domain_data	domain_data;
>  };
>  
> +/* define external phy id */
> +#define	PHY_LAN8835			(0x0007C130)
> +#define	PHY_KSZ9031RNX			(0x00221620)
> +
> +/* phyid : masked external phy id
> + * pre_config : if needed, configure MAC and/or external PHY
> + *		such as irq pin mux and RGMII timing.
> + */
> +struct ext_phy_config_table {
> +	int phyid;
> +	void (*pre_config)(struct lan78xx_net *dev,
> +			   struct phy_device *phydev,
> +			   phy_interface_t *interface);
> +};
> +
>  /* use ethtool to change the level for any given device */
>  static int msg_level = -1;
>  module_param(msg_level, int, 0);
> @@ -1697,6 +1713,7 @@ static int lan78xx_mdiobus_read(struct mii_bus *bus, int phy_id, int idx)
>  done:
>  	mutex_unlock(&dev->phy_mutex);
>  	usb_autopm_put_interface(dev->intf);
> +
>  	return ret;
>  }
>  
> @@ -1759,6 +1776,10 @@ static int lan78xx_mdio_init(struct lan78xx_net *dev)
>  		/* set to internal PHY id */
>  		dev->mdiobus->phy_mask = ~(1 << 1);
>  		break;
> +	case ID_REV_CHIP_ID_7801_:
> +		/* scan thru PHYAD[2..0] */
> +		dev->mdiobus->phy_mask = ~(0xFF);
> +		break;
>  	}
>  
>  	ret = mdiobus_register(dev->mdiobus);
> @@ -1933,11 +1954,58 @@ static void lan78xx_remove_irq_domain(struct lan78xx_net *dev)
>  	dev->domain_data.irqdomain = NULL;
>  }
>  
> +static void lan8835_pre_config(struct lan78xx_net *dev,
> +			       struct phy_device *phydev,
> +			       phy_interface_t *interface)
> +{
> +	int buf;
> +	int ret;
> +
> +	/* LED2/PME_N/IRQ_N/RGMII_ID pin to IRQ_N mode */
> +	buf = phy_read_mmd_indirect(phydev, 32784, 3);
> +	buf &= ~0x1800;
> +	buf |= 0x0800;
> +	phy_write_mmd_indirect(phydev, 32784, 3, buf);

Using decimal numbers for register addresses is a bit unusual.

> +
> +	/* RGMII MAC TXC Delay Enable */
> +	ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
> +				MAC_RGMII_ID_TXC_DELAY_EN_);
> +
> +	/* RGMII TX DLL Tune Adjust */
> +	ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
> +
> +	*interface = PHY_INTERFACE_MODE_RGMII_TXID;
> +}
> +
> +static void ksz9031rnx_pre_config(struct lan78xx_net *dev,
> +				  struct phy_device *phydev,
> +				  phy_interface_t *interface)
> +{
> +	/* Micrel9301RNX PHY configuration */
> +	/* RGMII Control Signal Pad Skew */
> +	phy_write_mmd_indirect(phydev, 4, 2, 0x0077);
> +	/* RGMII RX Data Pad Skew */
> +	phy_write_mmd_indirect(phydev, 5, 2, 0x7777);
> +	/* RGMII RX Clock Pad Skew */
> +	phy_write_mmd_indirect(phydev, 8, 2, 0x1FF);
> +
> +	*interface = PHY_INTERFACE_MODE_RGMII_RXID;
> +}

This should really belong in the respective PHY drivers for these PHYs,
is there a particular reason you decided to do this here?

-- 
Florian

^ permalink raw reply

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Add 88E6176 device tree support
From: Vivien Didelot @ 2016-11-29 17:54 UTC (permalink / raw)
  To: Uwe Kleine-König, Andrew Lunn
  Cc: Rob Herring, Frank Rowand, Andreas Färber, netdev,
	linux-arm-kernel, Michal Hrusecki, Tomas Hlavacek,
	Bed??icha Ko??atu, Florian Fainelli, linux-kernel, devicetree
In-Reply-To: <2c59cc79-b6dc-9920-1725-a7785ff3b6bf@kleine-koenig.org>

Hi,

Uwe Kleine-König <uwe@kleine-koenig.org> writes:

> Also it seems wrong to write "marvell,mv88e6085" (only) if I know the
> hardware is really a "marvell,mv88e6176".

I agree. It might be complex for a user to dig into the driver in order
to figure out how the switch ID is read and which compatible to choose.

I've sent a patch to change this https://lkml.org/lkml/2016/6/8/1198 but
Andrew had a stronger opinion on compatible strings, which makes sense.

>> Linus has said he does not like ARM devices because of all the busses
>> which are not enumerable. Here we have a device which with a little
>> bit of help we can enumerate. So we should. 
>
> If you write
>
> 	compatible = "marvell,mv88e6176", "marvell,mv88e6085";
>
> you can still enumerate in the same way as before.

So we don't break the existing DTS files, I like this.

The driver already prints info about the detected switch. Instead of
failing at probe, which seems against the notion of compatible and
breaks the existing behavior, it could report the eventual mismatch?

We have examples for both usage, still I don't know what the best
practices are. My _preference_ would go with enumerating them all.

Thanks,

        Vivien

^ permalink raw reply

* [PATCH] net: ipv4: Don't crash if passing a null sk to ip_rt_update_pmtu.
From: Lorenzo Colitti @ 2016-11-29 17:56 UTC (permalink / raw)
  To: netdev; +Cc: davem, erezsh, Lorenzo Colitti

Commit e2d118a1cb5e ("net: inet: Support UID-based routing in IP
protocols.") made __build_flow_key call sock_net(sk) to determine
the network namespace of the passed-in socket. This crashes if sk
is NULL.

Fix this by getting the network namespace from the skb instead.

Reported-by: Erez Shitrit <erezsh@dev.mellanox.co.il>
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d37fc6f..6402d74 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -531,13 +531,14 @@ static void __build_flow_key(const struct net *net, struct flowi4 *fl4,
 static void build_skb_flow_key(struct flowi4 *fl4, const struct sk_buff *skb,
 			       const struct sock *sk)
 {
+	const struct net *net = dev_net(skb->dev);
 	const struct iphdr *iph = ip_hdr(skb);
 	int oif = skb->dev->ifindex;
 	u8 tos = RT_TOS(iph->tos);
 	u8 prot = iph->protocol;
 	u32 mark = skb->mark;
 
-	__build_flow_key(sock_net(sk), fl4, sk, iph, oif, tos, prot, mark, 0);
+	__build_flow_key(net, fl4, sk, iph, oif, tos, prot, mark, 0);
 }
 
 static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox