Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v2 1/5] net: aquantia: Ethtool based ring size configuration
From: Jakub Kicinski @ 2018-06-04 23:00 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: David S . Miller, netdev, David Arcari, Pavel Belous,
	Anton Mikaev
In-Reply-To: <5f24d3f28fe0014c33fde9a20af547cd3b9f8d9d.1528150073.git.igor.russkikh@aquantia.com>

On Tue,  5 Jun 2018 01:30:15 +0300, Igor Russkikh wrote:
> @@ -158,6 +158,8 @@ static void aq_nic_service_timer_cb(struct timer_list *t)
>  	int ctimer = AQ_CFG_SERVICE_TIMER_INTERVAL;
>  	int err = 0;
>  
> +	mutex_lock(&self->aq_mutex);
> +
>  	if (aq_utils_obj_test(&self->flags, AQ_NIC_FLAGS_IS_NOT_READY))
>  		goto err_exit;
>  
> @@ -175,6 +177,7 @@ static void aq_nic_service_timer_cb(struct timer_list *t)
>  		ctimer = max(ctimer / 2, 1);
>  
>  err_exit:
> +	mutex_unlock(&self->aq_mutex);
>  	mod_timer(&self->service_timer, jiffies + ctimer);
>  }
>  

This looks like a timer callback from the prototype, I don't think you
can take mutexes in timer callbacks.

^ permalink raw reply

* Re: [PATCH net-next v11 04/10] netdev: cavium: octeon: Add Octeon III BGX Ports
From: Andrew Lunn @ 2018-06-04 23:07 UTC (permalink / raw)
  To: Steven J. Hill; +Cc: netdev, Carlos Munoz
In-Reply-To: <1528149617-8964-5-git-send-email-steven.hill@cavium.com>

> +static int bgx_port_get_qlm_speed(struct bgx_port_priv	*priv, int qlm)
> +{
> +	enum lane_mode lmode;
> +	u64 data;
> +
> +	data = oct_csr_read(GSER_LANE_MODE(priv->node, qlm));
> +	lmode = data & 0xf;
> +
> +	switch (lmode) {
> +	case R_25G_REFCLK100:
> +		return 2500;
> +	case R_5G_REFCLK100:
> +		return 5000;
> +	case R_8G_REFCLK100:
> +		return 8000;
> +	case R_125G_REFCLK15625_KX:
> +		return 1250;
> +	case R_3125G_REFCLK15625_XAUI:
> +		return 3125;
> +	case R_103125G_REFCLK15625_KR:
> +		return 10312;
> +	case R_125G_REFCLK15625_SGMII:
> +		return 1250;
> +	case R_5G_REFCLK15625_QSGMII:
> +		return 5000;
> +	case R_625G_REFCLK15625_RXAUI:
> +		return 6250;
> +	case R_25G_REFCLK125:
> +		return 2500;
> +	case R_5G_REFCLK125:
> +		return 5000;
> +	case R_8G_REFCLK125:
> +		return 8000;
> +	default:
> +		return 0;
> +	}


> +	struct port_status status;
> +	int speed;
> +
> +	/* The simulator always uses a 1Gbps full duplex port */
> +	if (octeon_is_simulation()) {
> +		status.link = 1;
> +		status.duplex = DUPLEX_FULL;
> +		status.speed = 1000;
> +	} else {
> +		/* Use the qlm speed */
> +		speed = bgx_port_get_qlm_speed(priv, priv->qlm);
> +		status.link = 1;
> +		status.duplex = DUPLEX_FULL;
> +		status.speed = speed * 8 / 10;
> +	}

Hi Steve

That looks like it gives some odd speeds.

2500 * 8 / 10 = 2Gbps
5000 * 8 / 10 = 4Gbps
8000 * 8 / 10 = 6.4Gbps
10312 * 8 /10 = 8.249Gbps

Is this correct. We are more used to 10/100/1000/2.5G/5G/10G/40G/100G.

   Andrew

^ permalink raw reply

* Re: [bpf-next PATCH] bpf: sockmap, fix crash when ipv6 sock is added
From: John Fastabend @ 2018-06-04 23:08 UTC (permalink / raw)
  To: Daniel Borkmann, edumazet, ast; +Cc: netdev
In-Reply-To: <ce73fcc1-15c4-7384-10dd-aeb5fbc159ae@iogearbox.net>

On 06/04/2018 12:59 PM, Daniel Borkmann wrote:
> On 06/04/2018 05:21 PM, John Fastabend wrote:
>> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
>> of tcpv6_prot.
>>
>> Previously we overwrote the sk->prot field with tcp_prot even in the
>> AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
>> are used. Further, only allow ESTABLISHED connections to join the
>> map per note in TLS ULP,
>>
>>    /* The TLS ulp is currently supported only for TCP sockets
>>     * in ESTABLISHED state.
>>     * Supporting sockets in LISTEN state will require us
>>     * to modify the accept implementation to clone rather then
>>     * share the ulp context.
>>     */
>>
>> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
>> 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
>> crashing case here.
>>
>> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
>> Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> Signed-off-by: Wei Wang <weiwan@google.com>
> 
> Applied to bpf-next, thanks everyone!
> 

Thanks Daniel, this has the unfortunate side-effect though of
making it hard to add sockets transitioning from LISTEN into
ESTABLISHED states to a sockmap. Before this patch we could add
sockets from the sock_ops event BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
to a sock{map|hash}. However, this is before a socket is in established
state so risked crashing and wasn't valid at all per this thread. So
I believe its correct to block this action, seeing it will crash a
system in many (most!) cases.

That said we still would like to support pushing sockets into a
sock{map|hash} in this case. I thought about adding a new hook but
we already have a few sock op hooks in the TCP stack so its too bad we
don't have one that fires after the ESTABLISHED state has transitioned.
Right now I'm looking into seeing if the following would have any
issues,

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2206,9 +2206,6 @@ void tcp_set_state(struct sock *sk, int state)
        BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
        BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);

-       if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
-               tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
-
        switch (state) {
        case TCP_ESTABLISHED:
                if (oldstate != TCP_ESTABLISHED)
@@ -2234,6 +2231,9 @@ void tcp_set_state(struct sock *sk, int state)
         */
        inet_sk_state_store(sk, state);

+       if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+               tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
+
 #ifdef STATE_TRACE
        SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n", sk, statename[oldstate], statename[state]);
 #endif

This would change the call hook slightly, moving it to after the state
change. However unless the unhash is some how visible from the bpf program
I don't think it should impact existing BPF programs.

Thanks,
John

^ permalink raw reply

* Re: [PATCH net] ipmr: fix error path when mr_table_alloc fails
From: kbuild test robot @ 2018-06-05  7:29 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: kbuild-all, netdev, Sabrina Dubroca, Eric Dumazet,
	Nikolay Aleksandrov, Yuval Mintz, Ivan Vecera
In-Reply-To: <ffc3366697c8d8789cbe314b6944b910eceb38e7.1528112758.git.sd@queasysnail.net>

[-- Attachment #1: Type: text/plain, Size: 3404 bytes --]

Hi Sabrina,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:    https://github.com/0day-ci/linux/commits/Sabrina-Dubroca/ipmr-fix-error-path-when-mr_table_alloc-fails/20180605-060837
config: x86_64-randconfig-x006-201822 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   In file included from arch/x86/include/asm/current.h:5:0,
                    from include/linux/sched.h:12,
                    from include/linux/uaccess.h:5,
                    from net/ipv6/ip6mr.c:19:
   net/ipv6/ip6mr.c: In function 'ip6_mroute_setsockopt':
>> include/linux/compiler.h:177:26: warning: 'mrt' may be used uninitialized in this function [-Wmaybe-uninitialized]
     case 8: *(__u64 *)res = *(volatile __u64 *)p; break;  \
                             ^
   net/ipv6/ip6mr.c:1751:20: note: 'mrt' was declared here
      struct mr_table *mrt;
                       ^~~
--
   In file included from arch/x86/include/asm/current.h:5:0,
                    from include/linux/sched.h:12,
                    from include/linux/uaccess.h:5,
                    from net//ipv6/ip6mr.c:19:
   net//ipv6/ip6mr.c: In function 'ip6_mroute_setsockopt':
>> include/linux/compiler.h:177:26: warning: 'mrt' may be used uninitialized in this function [-Wmaybe-uninitialized]
     case 8: *(__u64 *)res = *(volatile __u64 *)p; break;  \
                             ^
   net//ipv6/ip6mr.c:1751:20: note: 'mrt' was declared here
      struct mr_table *mrt;
                       ^~~

vim +/mrt +177 include/linux/compiler.h

230fa253 Christian Borntraeger 2014-11-25  170  
d976441f Andrey Ryabinin       2015-10-19  171  #define __READ_ONCE_SIZE						\
d976441f Andrey Ryabinin       2015-10-19  172  ({									\
d976441f Andrey Ryabinin       2015-10-19  173  	switch (size) {							\
d976441f Andrey Ryabinin       2015-10-19  174  	case 1: *(__u8 *)res = *(volatile __u8 *)p; break;		\
d976441f Andrey Ryabinin       2015-10-19  175  	case 2: *(__u16 *)res = *(volatile __u16 *)p; break;		\
d976441f Andrey Ryabinin       2015-10-19  176  	case 4: *(__u32 *)res = *(volatile __u32 *)p; break;		\
d976441f Andrey Ryabinin       2015-10-19 @177  	case 8: *(__u64 *)res = *(volatile __u64 *)p; break;		\
d976441f Andrey Ryabinin       2015-10-19  178  	default:							\
d976441f Andrey Ryabinin       2015-10-19  179  		barrier();						\
d976441f Andrey Ryabinin       2015-10-19  180  		__builtin_memcpy((void *)res, (const void *)p, size);	\
d976441f Andrey Ryabinin       2015-10-19  181  		barrier();						\
d976441f Andrey Ryabinin       2015-10-19  182  	}								\
d976441f Andrey Ryabinin       2015-10-19  183  })
d976441f Andrey Ryabinin       2015-10-19  184  

:::::: The code at line 177 was first introduced by commit
:::::: d976441f44bc5d48635d081d277aa76556ffbf8b compiler, atomics, kasan: Provide READ_ONCE_NOCHECK()

:::::: TO: Andrey Ryabinin <aryabinin@virtuozzo.com>
:::::: CC: Ingo Molnar <mingo@kernel.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29370 bytes --]

^ permalink raw reply

* Re: [bpf-next PATCH] bpf: sockmap, fix crash when ipv6 sock is added
From: Daniel Borkmann @ 2018-06-04 23:20 UTC (permalink / raw)
  To: John Fastabend, edumazet, ast; +Cc: netdev
In-Reply-To: <60fbb84e-e334-9f53-ced4-170bc2dd21df@gmail.com>

On 06/05/2018 01:08 AM, John Fastabend wrote:
> On 06/04/2018 12:59 PM, Daniel Borkmann wrote:
>> On 06/04/2018 05:21 PM, John Fastabend wrote:
>>> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
>>> of tcpv6_prot.
>>>
>>> Previously we overwrote the sk->prot field with tcp_prot even in the
>>> AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
>>> are used. Further, only allow ESTABLISHED connections to join the
>>> map per note in TLS ULP,
>>>
>>>    /* The TLS ulp is currently supported only for TCP sockets
>>>     * in ESTABLISHED state.
>>>     * Supporting sockets in LISTEN state will require us
>>>     * to modify the accept implementation to clone rather then
>>>     * share the ulp context.
>>>     */
>>>
>>> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
>>> 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
>>> crashing case here.
>>>
>>> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
>>> Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>> Signed-off-by: Wei Wang <weiwan@google.com>
>>
>> Applied to bpf-next, thanks everyone!
> 
> Thanks Daniel, this has the unfortunate side-effect though of
> making it hard to add sockets transitioning from LISTEN into
> ESTABLISHED states to a sockmap. Before this patch we could add
> sockets from the sock_ops event BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
> to a sock{map|hash}. However, this is before a socket is in established
> state so risked crashing and wasn't valid at all per this thread. So
> I believe its correct to block this action, seeing it will crash a
> system in many (most!) cases.
> 
> That said we still would like to support pushing sockets into a
> sock{map|hash} in this case. I thought about adding a new hook but
> we already have a few sock op hooks in the TCP stack so its too bad we
> don't have one that fires after the ESTABLISHED state has transitioned.
> Right now I'm looking into seeing if the following would have any
> issues,
> 
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2206,9 +2206,6 @@ void tcp_set_state(struct sock *sk, int state)
>         BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
>         BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
>  
> -       if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
> -               tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
> -
>         switch (state) {
>         case TCP_ESTABLISHED:
>                 if (oldstate != TCP_ESTABLISHED)
> @@ -2234,6 +2231,9 @@ void tcp_set_state(struct sock *sk, int state)
>          */
>         inet_sk_state_store(sk, state);
>  
> +       if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
> +               tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
> +
>  #ifdef STATE_TRACE
>         SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n", sk, statename[oldstate], statename[state]);
>  #endif
> 
> This would change the call hook slightly, moving it to after the state
> change. However unless the unhash is some how visible from the bpf program
> I don't think it should impact existing BPF programs.

Hmm, the current fix also breaks compilation when IPv6 is compiled out, so I had
to take it out for now. :-( I think this needs similar workaround as in kTLS case
in tls_init(). Given this and your above seen side-effect, lets respin all with a
clean fix.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
head:   4e1f687e302835e45e2f296392f21cfeb5671303
commit: 4e1f687e302835e45e2f296392f21cfeb5671303 [3/3] bpf: sockmap, fix crash when ipv6 sock is added
config: i386-randconfig-a0-06041847 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        git checkout 4e1f687e302835e45e2f296392f21cfeb5671303
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   kernel/bpf/sockmap.o: In function `bpf_tcp_ulp_register':
>> kernel/bpf/sockmap.c:1127: undefined reference to `tcpv6_prot'
>> kernel/bpf/sockmap.c:1127: undefined reference to `tcpv6_prot'

vim +1127 kernel/bpf/sockmap.c

  1122	
  1123	static int bpf_tcp_ulp_register(void)
  1124	{
  1125		tcp_bpf_proto = tcp_prot;
  1126		tcp_bpf_proto.close = bpf_tcp_close;
> 1127		tcpv6_bpf_proto = tcpv6_prot;
  1128		tcpv6_bpf_proto.close = bpf_tcp_close;
  1129		/* Once BPF TX ULP is registered it is never unregistered. It
  1130		 * will be in the ULP list for the lifetime of the system. Doing
  1131		 * duplicate registers is not a problem.
  1132		 */
  1133		return tcp_register_ulp(&bpf_tcp_ulp_ops);
  1134	}
  1135	

Thanks,
Daniel

^ permalink raw reply

* AF_XDP. Was: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04
From: Alexei Starovoitov @ 2018-06-04 23:32 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Björn Töpel, Karlsson, Magnus,
	Alexei Starovoitov, Daniel Borkmann, Or Gerlitz, Jeff Kirsher,
	Netdev
In-Reply-To: <CAKgT0UdekKLETPDg0Qy58bckFZTST1vXUkOSCz40CoZ1sC-=KA@mail.gmail.com>

On Mon, Jun 04, 2018 at 03:02:31PM -0700, Alexander Duyck wrote:
> On Mon, Jun 4, 2018 at 2:27 PM, David Miller <davem@davemloft.net> wrote:
> > From: Or Gerlitz <gerlitz.or@gmail.com>
> > Date: Tue, 5 Jun 2018 00:11:35 +0300
> >
> >> Just to make sure, is the AF_XDP ZC (Zero Copy) UAPI going to be
> >> merged for this window -- AFAIU from [1], it's still under
> >> examination/development/research for non Intel HWs, am I correct or
> >> this is going to get in now?
> >
> > All of the pending AF_XDP changes will be merged this merge window.
> >
> > I think Intel folks need to review things as fast as possible because
> > I pretty much refuse to revert the series or disable it in Kconfig at
> > this point.
> >
> > Thank you.
> 
> My understanding of things is that the current AF_XDP patches were
> going to be updated to have more of a model agnostic API such that
> they would work for either the "typewriter" mode or the descriptor
> ring based approach. The current plan was to have the zero copy
> patches be a follow-on after the vendor agnostic API bits in the
> descriptors and such had been sorted out. I believe you guys have the
> descriptor fixes already right?
> 
> In my opinion the i40e code isn't mature enough yet to really go into
> anything other than maybe net-next in a couple weeks. We are going to
> need a while to get adequate testing in order to flush out all the
> bugs and performance regressions we are likely to see coming out of
> this change.

I think the work everyone did in this release cycle increased my confidence
that the way descriptors are defined and the rest of uapi are stable enough
and i40e zero copy bits can land in the next release without uapi changes.
In that sense even if we merge i40e parts now, the other nic vendors
will be in the same situation and may find things that they would like
to improve in uapi.
So I propose we merge the first 7 patches of the last series now and
let 3 remaining i40e patches go via intel trees for the next release.
In the mean time other NIC vendors should start actively working
on AF_XDP support as well.
If somehow uapi would need tweaks, we can still do minor adjustments
since 4.18 won't be released for ~10 weeks.

^ permalink raw reply

* Re: [PATCH net-next v2 1/5] net: aquantia: Ethtool based ring size configuration
From: Igor Russkikh @ 2018-06-04 23:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S . Miller, netdev, David Arcari, Pavel Belous,
	Anton Mikaev
In-Reply-To: <20180604160024.7a8c7296@cakuba.netronome.com>


>> +	mutex_lock(&self->aq_mutex);
>> +
>>  	if (aq_utils_obj_test(&self->flags, AQ_NIC_FLAGS_IS_NOT_READY))
>>  		goto err_exit;
>>  
>> @@ -175,6 +177,7 @@ static void aq_nic_service_timer_cb(struct timer_list *t)
>>  		ctimer = max(ctimer / 2, 1);
>>  
>>  err_exit:
>> +	mutex_unlock(&self->aq_mutex);
>>  	mod_timer(&self->service_timer, jiffies + ctimer);
>>  }
>>  
> 
> This looks like a timer callback from the prototype, I don't think you
> can take mutexes in timer callbacks.

True as well. Eventually, think we may just get rid of mutex inside of this callback.

Mutex then will only serve to prevent possible parallel `ethtool -G` collisions from happening.

BR, Igor

^ permalink raw reply

* Re: [PATCH bpf-next v2 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events
From: Daniel Borkmann @ 2018-06-04 23:42 UTC (permalink / raw)
  To: Jakub Kicinski, Toke Høiland-Jørgensen
  Cc: netdev, Alexei Starovoitov
In-Reply-To: <20180604152324.6e1115a2@cakuba.netronome.com>

On 06/05/2018 12:26 AM, Jakub Kicinski wrote:
> On Mon, 04 Jun 2018 18:33:56 +0200, Toke Høiland-Jørgensen wrote:
>> This adds two new helper functions to trace_helpers that supports polling
>> multiple perf file descriptors for events. These are used to the XDP
>> perf_event_output example, which needs to work with one perf fd per CPU.
>>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> 
> Did you take a look at tools/bpf/bpftool/map_perf_ring.c ?
> 
> I think the ability to poll multiple FDs could be generally useful and
> therefore better add it to libbpf.c than
> tools/testing/selftests/bpf/trace_helpers.c?  I'm not 100% sure myself...

I think for it to land in libbpf this code needs to be more generalized
as it is right now and allowing for more flexibility like pinning RB
processing threads to CPUs, poll handling, etc.

^ permalink raw reply

* Re: [PATCH bpf-next v2 2/2] samples/bpf: Add xdp_sample_pkts example
From: Daniel Borkmann @ 2018-06-04 23:43 UTC (permalink / raw)
  To: Jakub Kicinski, Toke Høiland-Jørgensen; +Cc: netdev
In-Reply-To: <20180604153205.72678576@cakuba.netronome.com>

On 06/05/2018 12:32 AM, Jakub Kicinski wrote:
> On Mon, 04 Jun 2018 18:33:56 +0200, Toke Høiland-Jørgensen wrote:
>> +	if (load_bpf_file(filename)) {
> 
> Would you mind using libbpf instead of bpf_load.o?  I converted some
> samples in be5bca44aa6b ("samples: bpf: convert some XDP samples from
> bpf_load to libbpf"), it's pretty straight forward.  Maybe we can kill
> bpf_load.o one day :)

Agreed, we should only be using libbpf going forward.

^ permalink raw reply

* [PATCH net-next] net: metrics: add proper netlink validation
From: Eric Dumazet @ 2018-06-04 23:46 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet, David Ahern

Before using nla_get_u32(), better make sure the attribute
is of the proper size.

Code recently was changed, but bug has been there from beginning
of git.

BUG: KMSAN: uninit-value in rtnetlink_put_metrics+0x553/0x960 net/core/rtnetlink.c:746
CPU: 1 PID: 14139 Comm: syz-executor6 Not tainted 4.17.0-rc5+ #103
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x149/0x260 mm/kmsan/kmsan.c:1084
 __msan_warning_32+0x6e/0xc0 mm/kmsan/kmsan_instr.c:686
 rtnetlink_put_metrics+0x553/0x960 net/core/rtnetlink.c:746
 fib_dump_info+0xc42/0x2190 net/ipv4/fib_semantics.c:1361
 rtmsg_fib+0x65f/0x8c0 net/ipv4/fib_semantics.c:419
 fib_table_insert+0x2314/0x2b50 net/ipv4/fib_trie.c:1287
 inet_rtm_newroute+0x210/0x340 net/ipv4/fib_frontend.c:779
 rtnetlink_rcv_msg+0xa32/0x1560 net/core/rtnetlink.c:4646
 netlink_rcv_skb+0x378/0x600 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4664
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x1678/0x1750 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg net/socket.c:639 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2117
 __sys_sendmsg net/socket.c:2155 [inline]
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
 do_syscall_64+0x152/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x455a09
RSP: 002b:00007faae5fd8c68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007faae5fd96d4 RCX: 0000000000455a09
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000013
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000005d0 R14: 00000000006fdc20 R15: 0000000000000000

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
 kmsan_save_stack mm/kmsan/kmsan.c:294 [inline]
 kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:685
 __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:529
 fib_convert_metrics net/ipv4/fib_semantics.c:1056 [inline]
 fib_create_info+0x2d46/0x9dc0 net/ipv4/fib_semantics.c:1150
 fib_table_insert+0x3e4/0x2b50 net/ipv4/fib_trie.c:1146
 inet_rtm_newroute+0x210/0x340 net/ipv4/fib_frontend.c:779
 rtnetlink_rcv_msg+0xa32/0x1560 net/core/rtnetlink.c:4646
 netlink_rcv_skb+0x378/0x600 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4664
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x1678/0x1750 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg net/socket.c:639 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2117
 __sys_sendmsg net/socket.c:2155 [inline]
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
 do_syscall_64+0x152/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:189
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:315
 kmsan_slab_alloc+0x10/0x20 mm/kmsan/kmsan.c:322
 slab_post_alloc_hook mm/slab.h:446 [inline]
 slab_alloc_node mm/slub.c:2753 [inline]
 __kmalloc_node_track_caller+0xb32/0x11b0 mm/slub.c:4395
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cb/0x9e0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:988 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
 netlink_sendmsg+0x76e/0x1350 net/netlink/af_netlink.c:1876
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg net/socket.c:639 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2117
 __sys_sendmsg net/socket.c:2155 [inline]
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
 do_syscall_64+0x152/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: a919525ad832 ("net: Move fib_convert_metrics to metrics file")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: David Ahern <dsahern@gmail.com>
---
 net/ipv4/fib_semantics.c | 2 ++
 net/ipv4/metrics.c       | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 6608db23f54b6afdac0455650b47d64b1b22b255..9a890be8a0265edb78da225a82e2cac120f2150f 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -717,6 +717,8 @@ bool fib_metrics_match(struct fib_config *cfg, struct fib_info *fi)
 			nla_strlcpy(tmp, nla, sizeof(tmp));
 			val = tcp_ca_get_key_by_name(fi->fib_net, tmp, &ecn_ca);
 		} else {
+			if (nla_len(nla) != sizeof(u32)
+				return false;
 			val = nla_get_u32(nla);
 		}
 
diff --git a/net/ipv4/metrics.c b/net/ipv4/metrics.c
index 5121c6475e6b0e9a9a158d4cee473f52cd4d8efe..04311f7067e2e9e3dafb89aa4f8e30dab0fde854 100644
--- a/net/ipv4/metrics.c
+++ b/net/ipv4/metrics.c
@@ -32,6 +32,8 @@ int ip_metrics_convert(struct net *net, struct nlattr *fc_mx, int fc_mx_len,
 			if (val == TCP_CA_UNSPEC)
 				return -EINVAL;
 		} else {
+			if (nla_len(nla) != sizeof(u32))
+				return -EINVAL;
 			val = nla_get_u32(nla);
 		}
 		if (type == RTAX_ADVMSS && val > 65535 - 40)
-- 
2.17.1.1185.g55be947832-goog

^ permalink raw reply related

* Re: [PATCH] [net-next, wrong] make BPFILTER_UMH depend on X86
From: Alexei Starovoitov @ 2018-06-04 23:51 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Masahiro Yamada, David S. Miller, Alexei Starovoitov,
	Linux Kbuild mailing list, Networking, Linux Kernel Mailing List
In-Reply-To: <CAK8P3a1MSsL1vhV7Y98wCnP6NzK+OhqYpkoNuRm2J5PkJEK8=g@mail.gmail.com>

On Fri, Jun 01, 2018 at 05:20:12PM +0200, Arnd Bergmann wrote:
> On Thu, May 31, 2018 at 3:42 AM, Masahiro Yamada
> <yamada.masahiro@socionext.com> wrote:
> > 2018-05-31 0:17 GMT+09:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
> >> On Mon, May 28, 2018 at 05:31:01PM +0200, Arnd Bergmann wrote:
> 
> > Hmm.
> > For cross-compiling, we set 'ARCH' via the environment variable or the
> > command line.
> >
> > ARCH is not explicitly set, the top-level Makefile sets it to $(SUBARCH)
> >
> >
> > ARCH ?= $(SUBARCH)
> >
> >
> > Maybe, we can assume the native build if $(ARCH) and $(SUBARCH) are the same?
> >
> 
> SUBARCH is also used with a special meaning for arch/um where we build
> with ARCH=um SUBARCH=x86, either on native (x86) or cross builds.
> 
> 
> So doing that would still work in most but not all cases.
> 
> What is the reason for using HOSTCC rather than CC anyway? I think
> the correct way to do this would be to check if CC is able to link binaries
> and disallow the option if it's not.

that's a great idea. Let's do that.

> Don't we already do something like that for tools/testing/selftest which
> also needs to generate binaries with CC?

I couldn't find such makefile magic. Can you please help me with this?

^ permalink raw reply

* Re: [PATCH net-next] net: metrics: add proper netlink validation
From: David Ahern @ 2018-06-04 23:54 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <20180604234601.261823-1-edumazet@google.com>

On 6/4/18 4:46 PM, Eric Dumazet wrote:
> Before using nla_get_u32(), better make sure the attribute
> is of the proper size.
> 
> Code recently was changed, but bug has been there from beginning
> of git.
> 
...
> 
> Fixes: a919525ad832 ("net: Move fib_convert_metrics to metrics file")

That commit just moved the code from 1 file to another. The previous
commit id is 6cf9dfd3bd62e, but it just moved code to a helper. The
originating commit id for the ip_metrics_convert bug is:

ea697639992d9 ("net: tcp: add RTAX_CC_ALGO fib handling")


> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Cc: David Ahern <dsahern@gmail.com>
> ---
>  net/ipv4/fib_semantics.c | 2 ++
>  net/ipv4/metrics.c       | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 6608db23f54b6afdac0455650b47d64b1b22b255..9a890be8a0265edb78da225a82e2cac120f2150f 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -717,6 +717,8 @@ bool fib_metrics_match(struct fib_config *cfg, struct fib_info *fi)
>  			nla_strlcpy(tmp, nla, sizeof(tmp));
>  			val = tcp_ca_get_key_by_name(fi->fib_net, tmp, &ecn_ca);
>  		} else {
> +			if (nla_len(nla) != sizeof(u32)
> +				return false;
>  			val = nla_get_u32(nla);
>  		}
>  
> diff --git a/net/ipv4/metrics.c b/net/ipv4/metrics.c
> index 5121c6475e6b0e9a9a158d4cee473f52cd4d8efe..04311f7067e2e9e3dafb89aa4f8e30dab0fde854 100644
> --- a/net/ipv4/metrics.c
> +++ b/net/ipv4/metrics.c
> @@ -32,6 +32,8 @@ int ip_metrics_convert(struct net *net, struct nlattr *fc_mx, int fc_mx_len,
>  			if (val == TCP_CA_UNSPEC)
>  				return -EINVAL;
>  		} else {
> +			if (nla_len(nla) != sizeof(u32))
> +				return -EINVAL;
>  			val = nla_get_u32(nla);
>  		}
>  		if (type == RTAX_ADVMSS && val > 65535 - 40)
> 

^ permalink raw reply

* Re: [PATCH net-next] net: metrics: add proper netlink validation
From: Eric Dumazet @ 2018-06-04 23:58 UTC (permalink / raw)
  To: David Ahern, Eric Dumazet, David S . Miller; +Cc: netdev
In-Reply-To: <abaa5381-157f-b66d-5f57-58436908bea4@gmail.com>

On 06/04/2018 04:54 PM, David Ahern wrote:
> On 6/4/18 4:46 PM, Eric Dumazet wrote:
>> Before using nla_get_u32(), better make sure the attribute
>> is of the proper size.
>>
>> Code recently was changed, but bug has been there from beginning
>> of git.
>>
> ...
>>
>> Fixes: a919525ad832 ("net: Move fib_convert_metrics to metrics file")
> 
> That commit just moved the code from 1 file to another. The previous
> commit id is 6cf9dfd3bd62e, but it just moved code to a helper. The
> originating commit id for the ip_metrics_convert bug is:
> 

Please read what I wrote.

I simply wanted to warn stable teams that your this patch is based on recent tree,
but bug has been there forever.

The Fixes: tag might help them to cook proper backports, thats is all.

A Fixes: tag does not blame the code, it simply gives some hints.

> ea697639992d9 ("net: tcp: add RTAX_CC_ALGO fib handling")
> 

This patch has not added any bug, it was there already.

I can put a (long) list of tags, but ultimately the bug has been there forever.

^ permalink raw reply

* Re: [PATCH net-next v11 03/10] netdev: cavium: octeon: Add Octeon III BGX Ethernet Nexus
From: Andrew Lunn @ 2018-06-05  0:10 UTC (permalink / raw)
  To: Steven J. Hill; +Cc: netdev, Carlos Munoz
In-Reply-To: <1528149617-8964-4-git-send-email-steven.hill@cavium.com>

> +		/* Connect to PKI/PKO */
> +		data = oct_csr_read(BGX_CMR_CONFIG(numa_node, interface, port));
> +		if (is_mix)
> +			data |= BIT(11);
> +		else
> +			data &= ~BIT(11);
> +		oct_csr_write(data, BGX_CMR_CONFIG(numa_node, interface, port));
> +

Hi Steven

This driver has quite a lot of magic BIT macros. Can you add some
#defines with useful names?

	 Thanks
		Andrew

^ permalink raw reply

* Re: [PATCH net-next v11 04/10] netdev: cavium: octeon: Add Octeon III BGX Ports
From: Andrew Lunn @ 2018-06-05  0:55 UTC (permalink / raw)
  To: Steven J. Hill; +Cc: netdev, Carlos Munoz
In-Reply-To: <1528149617-8964-5-git-send-email-steven.hill@cavium.com>

> +	if (status.link) {
> +		/* Always full duplex */
> +		status.duplex = DUPLEX_FULL;
> +
> +		/* Speed */
> +		speed = bgx_port_get_qlm_speed(priv, priv->qlm);
> +		data = oct_csr_read(BGX_CMR_CONFIG(priv->node, priv->bgx,
> +						   priv->index));
> +		switch ((data >> 8) & 7) {
> +		default:
> +		case 1:
> +			speed = (speed * 8 + 5) / 10;
> +			lanes = 4;
> +			break;

Hi Steven

Here you add 5, which you did not in the other function dealing with
speed...

> +		priv->phydev = of_phy_connect(netdev, priv->phy_np,
> +					      bgx_port_adjust_link, 0,
> +					      PHY_INTERFACE_MODE_SGMII);
> +		if (!priv->phydev)
> +			return -ENODEV;
> +
> +		netif_carrier_off(netdev);
> +
> +		if (priv->phydev)
> +			phy_start_aneg(priv->phydev);
> +	}

If you are using phylib, you should not need to make calls to
netif_carrier_*(). The phylib will do it for you.

Why hard code passing PHY_INTERFACE_MODE_SGMII? You also support
RGMII? It would be better to use of_get_phy_mode().

> +int bgx_port_change_mtu(struct net_device *netdev, int new_mtu)
> +{
> +	struct bgx_port_priv *priv = bgx_port_netdev2priv(netdev);
> +	int max_frame;
> +
> +	if (new_mtu < 60 || new_mtu > 65392) {
> +		netdev_warn(netdev, "Maximum MTU supported is 65392\n");
> +		return -EINVAL;
> +	}

The core can check this for you, if you tell it the MAX and Min.

    Andrew

^ permalink raw reply

* Re: [PATCH 10/18] rhashtable: remove rhashtable_walk_peek()
From: NeilBrown @ 2018-06-05  1:00 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Herbert Xu, Thomas Graf, Linux Kernel Network Developers, LKML,
	Tom Herbert
In-Reply-To: <CALx6S35sf4oj=x73SfhUxi1e9-u4Stxgr9BAj5E7KKmbufntwg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2673 bytes --]

On Mon, Jun 04 2018, Tom Herbert wrote:

>>
>> Maybe a useful way forward would be for you to write documentation for
>> the rhashtable_walk_peek() interface which correctly describes what it
>> does and how it is used.  Given that, I can implement that interface
>> with the stability improvements that I'm working on.
>>
>
> Here's how it's documented currently:
>
> "rhashtable_walk_peek - Return the next object but don't advance the iterator"
>
> I don't see what is incorrect about that.

rhashtable_walk_next is documented:

 * rhashtable_walk_next - Return the next object and advance the iterator

So it seems reasonable to assume that you get the same object, no matter
which one you call.  Yet this is not (necessarily) the case.

>                                           Peek returns the next object
> in the walk, however does not move the iterator past that object, so
> sucessive calls to peek return the same object. In other words it's a
> way to inspect the next object but not "consume" it. This is what is
> needed when netlink returns in the middle of a walk. The last object
> retrieved from the table may not have been processed completely, so it
> needs to be the first one processed on the next invocation to netlink.

I completely agree with this last sentence.
We typically need to process the last object retrieved.  This could also
be described as the previously retrieved object.
So rhashtable_walk_last() and rhashtable_walk_prev() might both be
suitable names, though each is open to misinterpretation.
I fail to see how the "last object retrieved" could be the same as
"the next object" which rhashtable_walk_peek claims to return.

>
> This is also easily distinguishable from
>
> "rhashtable_walk_next - Return the next object and advance the iterator"
>
> Where the only difference is that peek and walk is that, walk advances
> the iterator and peek does not. Hence why "peek" is a descriptive name
> for what is happening.

Maybe if we step back and go for a totally different API.

We could change rhashtable_walk_start() to return the object that was
current (most recently returned) when rhashtable_walk_stop() was called,
if it is still in the table, otherwise it returns NULL (as it would the
first time it was called).
This loses the option for rhashtable_walk_start() to return -EAGAIN, but
I would rather than rhashtable_walk_next() were the only thing to return
that.

The only time it really makes sense to call rhashtable_walk_peek() is
immediately after rhashtable_walk_start(), and this change would make
that fact clear in the API.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04
From: David Miller @ 2018-06-05  1:09 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180604175644.24293-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon,  4 Jun 2018 10:56:32 -0700

> This series contains a smorgasbord of updates to documentation, e1000e,
> igb, ixgbe, ixgbevf and i40e.
 ...
> The following are changes since commit 8284fd4cb85577eecca024fe1e7a35b39ed0f3f5:
>   Merge branch 'selftests-net-various'
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH 10/18] rhashtable: remove rhashtable_walk_peek()
From: NeilBrown @ 2018-06-05  1:24 UTC (permalink / raw)
  To: Tom Herbert, Tom Herbert
  Cc: Herbert Xu, Thomas Graf, Linux Kernel Network Developers, LKML
In-Reply-To: <CAPDqMeo0hV+-ijYPKhpzpHWVovYrS7s22tBfUZTfvXvNv_qpXw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 7501 bytes --]

On Mon, Jun 04 2018, Tom Herbert wrote:

> On Mon, Jun 4, 2018 at 2:31 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Sun, Jun 3, 2018 at 7:09 PM, NeilBrown <neilb@suse.com> wrote:
>>> On Sun, Jun 03 2018, Tom Herbert wrote:
>>>
>>>> On Sun, Jun 3, 2018 at 5:30 PM, NeilBrown <neilb@suse.com> wrote:
>>>>> On Sat, Jun 02 2018, Herbert Xu wrote:
>>>>>
>>>>>> On Fri, Jun 01, 2018 at 02:44:09PM +1000, NeilBrown wrote:
>>>>>>> This function has a somewhat confused behavior that is not properly
>>>>>>> described by the documentation.
>>>>>>> Sometimes is returns the previous object, sometimes it returns the
>>>>>>> next one.
>>>>>>> Sometimes it changes the iterator, sometimes it doesn't.
>>>>>>>
>>>>>>> This function is not currently used and is not worth keeping, so
>>>>>>> remove it.
>>>>>>>
>>>>>>> A future patch will introduce a new function with a
>>>>>>> simpler interface which can meet the same need that
>>>>>>> this was added for.
>>>>>>>
>>>>>>> Signed-off-by: NeilBrown <neilb@suse.com>
>>>>>>
>>>>>> Please keep Tom Herbert in the loop.  IIRC he had an issue with
>>>>>> this patch.
>>>>>
>>>>> Yes you are right - sorry for forgetting to add Tom.
>>>>>
>>>>> My understanding of where this issue stands is that Tom raised issue and
>>>>> asked for clarification, I replied, nothing further happened.
>>>>>
>>>>> It summary, my position is that:
>>>>> - most users of my new rhashtable_walk_prev() will use it like
>>>>>    rhasthable_talk_prev() ?: rhashtable_walk_next()
>>>>>   which is close to what rhashtable_walk_peek() does
>>>>> - I know of no use-case that could not be solved if we only had
>>>>>   the combined operation
>>>>> - BUT it is hard to document the combined operation, as it really
>>>>>   does two things.  If it is hard to document, then it might be
>>>>>   hard to understand.
>>>>>
>>>>> So provide the most understandable/maintainable solution, I think
>>>>> we should provide rhashtable_walk_prev() as a separate interface.
>>>>>
>>>> I'm still missing why requiring two API operations instead of one is
>>>> simpler or easier to document. Also, I disagree that
>>>> rhashtable_walk_peek does two things-- it just does one which is to
>>>> return the current element in the walk without advancing to the next
>>>> one. The fact that the iterator may or may not move is immaterial in
>>>> the API, that is an implementation detail. In fact, it's conceivable
>>>> that we might completely reimplement this someday such that the
>>>> iterator works completely differently implementation semantics but the
>>>> API doesn't change. Also the naming in your proposal is confusing,
>>>> we'd have operations to get the previous, and the next next object--
>>>> so the user may ask where's the API to get the current object in the
>>>> walk? The idea that we get it by first trying to get the previous
>>>> object, and then if that fails getting the next object seems
>>>> counterintuitive.
>>>
>>> To respond to your points out of order:
>>>
>>> - I accept that "rhashtable_walk_prev" is not a perfect name.  It
>>>   suggests a stronger symmetry with rhasthable_walk_next than actually
>>>   exist.  I cannot think of a better name, but I think the
>>>   description "Return the previously returned object if it is
>>>   still in the table" is clear and simple and explains the name.
>>>   I'm certainly open to suggestions for a better name.
>>>
>>> - I don't think it is meaningful to talk about a "current" element in a
>>>   table where asynchronous insert/remove is to be expected.
>>>   The best we can hope for is a "current location" is the sequence of
>>>   objects in the table - a location which is after some objects and
>>>   before all others.  rhashtable_walk_next() returns the next object
>>>   after the current location, and advances the location pointer past
>>>   that object.
>>>   rhashtable_walk_prev() *doesn't* return the previous object in the
>>>   table.  It returns the previously returned object. ("previous" in
>>>   time, but not in space, if you like).
>>>
>>> - rhashtable_walk_peek() currently does one of two different things.
>>>   It either returns the previously returned object (iter->p) if that
>>>   is still in the table, or it find the next object, steps over it, and
>>>   returns it.
>>>
>>> - I would like to suggest that when an API acts on a iterator object,
>>>   the question of whether or not the iterator is advanced *must* be a
>>>   fundamental question, not one that might change from time to time.
>>>
>>> Maybe a useful way forward would be for you to write documentation for
>>> the rhashtable_walk_peek() interface which correctly describes what it
>>> does and how it is used.  Given that, I can implement that interface
>>> with the stability improvements that I'm working on.
>>>
>>
>> Here's how it's documented currently:
>>
>> "rhashtable_walk_peek - Return the next object but don't advance the iterator"
>>
>> I don't see what is incorrect about that. Peek returns the next object
>> in the walk, however does not move the iterator past that object, so
>> sucessive calls to peek return the same object. In other words it's a
>> way to inspect the next object but not "consume" it. This is what is
>> needed when netlink returns in the middle of a walk. The last object
>> retrieved from the table may not have been processed completely, so it
>> needs to be the first one processed on the next invocation to netlink.
>>
>> This is also easily distinguishable from
>>
>> "rhashtable_walk_next - Return the next object and advance the iterator"
>>
>> Where the only difference is that peek and walk is that, walk advances
>> the iterator and peek does not. Hence why "peek" is a descriptive name
>> for what is happening.
>>
>
> btw, we are using rhashtable_walk_peek with ILA code that hasn't been
> upstreamed yet. I'll (re)post the patches shortly, this demonstates
> why we need the peek functionality. If you think that
> rhashtable_walk_peek is nothing more than an inline that does "return
> rhashtable_walk_prev(iter) ? : rhashtable_walk_next(iter);" then maybe
> we could redefine rhashtable_walk_peek to be that. But, then I'll ask
> what the use case is for rhashtable_walk_prev as a standalone
> function? We created rhashtable_walk_peek for the netlink walk problem
> and I don't think any of the related use cases would ever call
> rhashtable_walk_prev without the rhashtable_walk_next fallback.

I think I did describe my particular use-case for rhashtable_walk_prev()
before, but it probably got buried in other things.

Lustre has a debugfs file which lists all the cached pages in all the
objects which are stored in a hash table.
It might get part-way through listing the cached pages for one object,
and then have to take a break (rhashtable_walk_stop()) until more output
space is available.
When it restarts, it needs to pick up where it left off.
If the same object is still in the cache, it needs to start at the page
address that it was up to.  If the object is gone and it has to move on
to the next object, then it needs to start at page zero.
So I would do something like:

 rhashtable_walk_start(hashtab, &state->riter);
 obj = rhashtable_walk_prev(&state->riter);
 addr = state->addr;
 if (!obj) {
    obj = rhashtable_walk_next(state->riter);
    addr = 0;
 }

 then repeatedly format(obj,addr) and incrment addr or
 (if past end-of-object) call rhashtable_walk_next and addr=0.

 then
  addr->addr = addr - 1;
  rhashtable_walk_stop();


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [PATCH net-next v2] net: ipv6: Generate random IID for addresses on RAWIP devices
From: Subash Abhinov Kasiviswanathan @ 2018-06-05  1:26 UTC (permalink / raw)
  To: davem, netdev, yoshfuji; +Cc: Subash Abhinov Kasiviswanathan, Sean Tranchetti

RAWIP devices such as rmnet do not have a hardware address and
instead require the kernel to generate a random IID for the
IPv6 addresses.

Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>

---
v1->v2: Yoshfuji suggested to update the I/G and G/L bit.
Similar functionality is already implemented by addrconf_ifid_ip6tnl()
so use it.
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 4 ++++
 net/ipv6/addrconf.c                             | 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index cb02e1a..b9a7548 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -221,6 +221,10 @@ void rmnet_vnd_setup(struct net_device *rmnet_dev)
 
 	rmnet_dev->needs_free_netdev = true;
 	rmnet_dev->ethtool_ops = &rmnet_ethtool_ops;
+
+	/* This perm addr will be used as interface identifier by IPv6 */
+	rmnet_dev->addr_assign_type = NET_ADDR_RANDOM;
+	eth_random_addr(rmnet_dev->perm_addr);
 }
 
 /* Exposed API */
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f09afc2..5596d87 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2251,6 +2251,7 @@ static int ipv6_generate_eui64(u8 *eui, struct net_device *dev)
 		return addrconf_ifid_ieee1394(eui, dev);
 	case ARPHRD_TUNNEL6:
 	case ARPHRD_IP6GRE:
+	case ARPHRD_RAWIP:
 		return addrconf_ifid_ip6tnl(eui, dev);
 	}
 	return -1;
@@ -3286,7 +3287,8 @@ static void addrconf_dev_config(struct net_device *dev)
 	    (dev->type != ARPHRD_IP6GRE) &&
 	    (dev->type != ARPHRD_IPGRE) &&
 	    (dev->type != ARPHRD_TUNNEL) &&
-	    (dev->type != ARPHRD_NONE)) {
+	    (dev->type != ARPHRD_NONE) &&
+	    (dev->type != ARPHRD_RAWIP)) {
 		/* Alas, we support only Ethernet autoconfiguration. */
 		return;
 	}
-- 
1.9.1

^ permalink raw reply related

* [PATCH net-next v2] net: qualcomm: rmnet: Fix use after free while sending command ack
From: Subash Abhinov Kasiviswanathan @ 2018-06-05  1:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: Subash Abhinov Kasiviswanathan

When sending an ack to a command packet, the skb is still referenced
after it is sent to the real device. Since the real device could
free the skb, the device pointer would be invalid.
Also, remove an unnecessary variable.

Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>

---
v1->v2: Rebase change on net-next instead as mentioned by David.
Also remove an unnecessary variable.
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
index 56a93df..3ee8ae9 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
@@ -67,7 +67,7 @@ static void rmnet_map_send_ack(struct sk_buff *skb,
 			       struct rmnet_port *port)
 {
 	struct rmnet_map_control_command *cmd;
-	int xmit_status;
+	struct net_device *dev = skb->dev;
 
 	if (port->data_format & RMNET_FLAGS_INGRESS_MAP_CKSUMV4)
 		skb_trim(skb,
@@ -78,9 +78,9 @@ static void rmnet_map_send_ack(struct sk_buff *skb,
 	cmd = RMNET_MAP_GET_CMD_START(skb);
 	cmd->cmd_type = type & 0x03;
 
-	netif_tx_lock(skb->dev);
-	xmit_status = skb->dev->netdev_ops->ndo_start_xmit(skb, skb->dev);
-	netif_tx_unlock(skb->dev);
+	netif_tx_lock(dev);
+	dev->netdev_ops->ndo_start_xmit(skb, dev);
+	netif_tx_unlock(dev);
 }
 
 /* Process MAP command frame and send N/ACK message as appropriate. Message cmd
-- 
1.9.1

^ permalink raw reply related

* Re: AF_XDP. Was: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04
From: Alexander Duyck @ 2018-06-05  1:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Miller, Björn Töpel, Karlsson, Magnus,
	Alexei Starovoitov, Daniel Borkmann, Or Gerlitz, Jeff Kirsher,
	Netdev
In-Reply-To: <20180604233224.hjuh2hcbsnsn2lwr@ast-mbp>

On Mon, Jun 4, 2018 at 4:32 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Jun 04, 2018 at 03:02:31PM -0700, Alexander Duyck wrote:
>> On Mon, Jun 4, 2018 at 2:27 PM, David Miller <davem@davemloft.net> wrote:
>> > From: Or Gerlitz <gerlitz.or@gmail.com>
>> > Date: Tue, 5 Jun 2018 00:11:35 +0300
>> >
>> >> Just to make sure, is the AF_XDP ZC (Zero Copy) UAPI going to be
>> >> merged for this window -- AFAIU from [1], it's still under
>> >> examination/development/research for non Intel HWs, am I correct or
>> >> this is going to get in now?
>> >
>> > All of the pending AF_XDP changes will be merged this merge window.
>> >
>> > I think Intel folks need to review things as fast as possible because
>> > I pretty much refuse to revert the series or disable it in Kconfig at
>> > this point.
>> >
>> > Thank you.
>>
>> My understanding of things is that the current AF_XDP patches were
>> going to be updated to have more of a model agnostic API such that
>> they would work for either the "typewriter" mode or the descriptor
>> ring based approach. The current plan was to have the zero copy
>> patches be a follow-on after the vendor agnostic API bits in the
>> descriptors and such had been sorted out. I believe you guys have the
>> descriptor fixes already right?
>>
>> In my opinion the i40e code isn't mature enough yet to really go into
>> anything other than maybe net-next in a couple weeks. We are going to
>> need a while to get adequate testing in order to flush out all the
>> bugs and performance regressions we are likely to see coming out of
>> this change.
>
> I think the work everyone did in this release cycle increased my confidence
> that the way descriptors are defined and the rest of uapi are stable enough
> and i40e zero copy bits can land in the next release without uapi changes.
> In that sense even if we merge i40e parts now, the other nic vendors
> will be in the same situation and may find things that they would like
> to improve in uapi.
> So I propose we merge the first 7 patches of the last series now and
> let 3 remaining i40e patches go via intel trees for the next release.
> In the mean time other NIC vendors should start actively working
> on AF_XDP support as well.
> If somehow uapi would need tweaks, we can still do minor adjustments
> since 4.18 won't be released for ~10 weeks.
>

That works for me. Actually I think patch 11 can probably be included
as well since that is just sample code and could probably be used by
whatever drivers end up implementing this.

Thanks.

- Alex

^ permalink raw reply

* Re: [PATCH net-next] net: phy: broadcom: Enable 125 MHz clock on LED4 pin for BCM54612E by default.
From: Florian Fainelli @ 2018-06-05  1:45 UTC (permalink / raw)
  To: Kun Yi, davem
  Cc: netdev, Avi.Fishman, tali.perry, tomer.maimon, benjaminfair,
	rlippert
In-Reply-To: <20180604201704.238472-1-kunyi@google.com>

Le 06/04/18 à 13:17, Kun Yi a écrit :
> BCM54612E have 4 multi-functional LED pins that can be configured
> through register setting; the LED4 pin can be configured to a 125MHz
> reference clock output by setting the spare register. Since the dedicated
> CLK125 reference clock pin is not brought out on the 48-Pin MLP, the LED4
> pin is the only pin to provide such function in this package, and therefore
> it is beneficial to just enable the reference clock by default.

Checked the data sheet and this appears to be absolutely correct:

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* [RFC PATCH] net: aquantia: hw_atl_utils_mpi_set_state() can be static
From: kbuild test robot @ 2018-06-05 10:22 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: kbuild-all, David S . Miller, netdev, David Arcari, Pavel Belous,
	Igor Russkikh
In-Reply-To: <cea8a6dadd0ddd48fd08c3c3b244fd4db149bf50.1527596210.git.igor.russkikh@aquantia.com>


Fixes: 45c5c36aa288 ("net: aquantia: Improve adapter init/deinit logic")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
---
 hw_atl_utils.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
index 9d0a96d..3d60a48 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
@@ -533,8 +533,8 @@ int hw_atl_utils_mpi_set_speed(struct aq_hw_s *self, u32 speed)
 	return 0;
 }
 
-int hw_atl_utils_mpi_set_state(struct aq_hw_s *self,
-			       enum hal_atl_utils_fw_state_e state)
+static int hw_atl_utils_mpi_set_state(struct aq_hw_s *self,
+				      enum hal_atl_utils_fw_state_e state)
 {
 	int err = 0;
 	u32 transaction_id = 0;

^ permalink raw reply related

* Re: [PATCH net-next 2/5] net: aquantia: Improve adapter init/deinit logic
From: kbuild test robot @ 2018-06-05 10:22 UTC (permalink / raw)
  To: Igor Russkikh
  Cc: kbuild-all, David S . Miller, netdev, David Arcari, Pavel Belous,
	Igor Russkikh
In-Reply-To: <cea8a6dadd0ddd48fd08c3c3b244fd4db149bf50.1527596210.git.igor.russkikh@aquantia.com>

Hi Igor,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Igor-Russkikh/net-aquantia-Ethtool-based-ring-size-configuration/20180601-044445
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c:525:5: sparse: symbol 'hw_atl_utils_mpi_set_speed' was not declared. Should it be static?
>> drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c:536:5: sparse: symbol 'hw_atl_utils_mpi_set_state' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCH net-next] net/mlx5e: Make function mlx5e_change_rep_mtu() static
From: Wei Yongjun @ 2018-06-05  2:42 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, Adi Nissim
  Cc: Wei Yongjun, netdev, linux-rdma, kernel-janitors

Fixes the following sparse warning:

drivers/net/ethernet/mellanox/mlx5/core/en_rep.c:903:5: warning:
 symbol 'mlx5e_change_rep_mtu' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 3857f22..57987f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -900,7 +900,7 @@ int mlx5e_get_offload_stats(int attr_id, const struct net_device *dev,
 	.switchdev_port_attr_get	= mlx5e_attr_get,
 };
 
-int mlx5e_change_rep_mtu(struct net_device *netdev, int new_mtu)
+static int mlx5e_change_rep_mtu(struct net_device *netdev, int new_mtu)
 {
 	return mlx5e_change_mtu(netdev, new_mtu, NULL);
 }

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox