Netdev List
 help / color / mirror / Atom feed
* Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Ingo Molnar @ 2018-03-22  9:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, David Laight, Rahul Lakkireddy, x86@kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	mingo@redhat.com, hpa@zytor.com, davem@davemloft.net,
	akpm@linux-foundation.org, ganeshgr@chelsio.com,
	nirranjan@chelsio.com, indranil@chelsio.com, Andy Lutomirski,
	Peter Zijlstra, Fenghua Yu, Eric Biggers
In-Reply-To: <CA+55aFyK6uJopBjZGyeDz0O+qzCac6AwPkJyOpuaT69inPgVgw@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And the real worry is things like AVX-512 etc, which is exactly when
> things like "save and restore one ymm register" will quite likely
> clear the upper bits of the zmm register.

Yeah, I think the only valid save/restore pattern is to 100% correctly enumerate 
the width of the vector registers, and use full width instructions.

Using partial registers, even though it's possible in some cases is probably a bad 
idea not just due to most instructions auto-zeroing the upper portion to reduce 
false dependencies, but also because 'mixed' use of partial and full register 
access is known to result in penalties on a wide range of Intel CPUs, at least 
according to the Agner PDFs. On AMD CPUs there's no penalty.

So what I think could be done at best is to define a full register save/restore 
API, which falls back to XSAVE*/XRSTOR* if we don't have the routines for the 
native vector register width. (I.e. if old kernel is used on very new CPU.)

Note that the actual AVX code could still use partial width, it's the save/restore 
primitives that has to handle full width registers.

> And yes, we can have some statically patched code that takes that into account, 
> and saves the whole zmm register when AVX512 is on, but the whole *point* of the 
> dynamic XSAVES thing is actually that Intel wants to be able enable new 
> user-space features without having to wait for OS support. Literally. That's why 
> and how it was designed.

This aspect wouldn't be hurt AFAICS: to me it appears that due to glibc using 
vector instructions in its memset() the AVX bits get used early on and to the 
maximum, so the XINUSE for them is set for every task.

The optionality of other XSAVE based features like MPX wouldn't be hurt if the 
kernel only uses vector registers.

> And saving a couple of zmm registers is actually pretty hard. They're big. Do 
> you want to allocate 128 bytes of stack space, preferably 64-byte aligned, for a 
> save area? No. So now it needs to be some kind of per-thread (or maybe per-CPU, 
> if we're willing to continue to not preempt) special save area too.

Hm, that's indeed a nasty complication:

 - While a single 128 bytes slot might work - in practice at least two vector
   registers are needed to have enough parallelism and hide latencies.

 - &current->thread.fpu.state.xsave is available almost all the time: with our 
   current 'direct' FPU context switching code the only time there's live data in
   &current->thread.fpu is when the task is not running. But it's not IRQ-safe.

We could probably allow irq save/restore sections to use it, as 
local_irq_save()/restore() is still *much* faster than a 1-1.5K FPU context 
save/restore pattern.

But I was hoping for a less restrictive model ... :-/

To have a better model and avoid the local_irq_save()/restore we could perhaps 
change the IRQ model to have a per IRQ 'current' value (we have separate IRQ 
stacks already), but that's quite a bit of work to transform all code that 
operates on the interrupted task (scheduler and timer code).

But it's work that would be useful for other reasons as well.

With such a separation in place &current->thread.fpu.state.xsave would become a 
generic, natural vector register save area.

> And even then, it doesn't solve the real worry of "maybe there will be odd 
> interactions with future extensions that we don't even know of".

Yes, that's true, but I think we could avoid these dangers by using CPU model 
based enumeration. The cost would be that vector ops would only be available on 
new CPU models after an explicit opt-in. In many cases it will be a single new 
constant to an existing switch() statement, easily backported as well.

> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?

Ok, so that's not what I'd use it for, I'd use it:

 - Speed up existing AVX (crypto, RAID) routines for smaller buffer sizes.
   Right now the XSAVE*+XRSTOR* cost is significant:

     x86/fpu: Cost of: XSAVE   insn:   104 cycles
     x86/fpu: Cost of: XRSTOR  insn:    80 cycles

   ... and that's with just 128-bit AVX and a ~0.8K XSAVE area. The Agner PDF 
   lists Skylake XSAVE+XRSTOR costs at 107+122 cycles, plus there's probably a
   significant amount of L1 cache churn caused by XSAVE/XRSTOR.

   Most of the relevant vector instructions have a single cycle cost
   on the other hand.

 - To use vector ops in bulk, well-aligned memcpy(), which in many workloads
   is a fair chunk of all memset() activity. A usage profile on a typical system:

            galatea:~> cat /proc/sched_debug  | grep hist | grep -E '[[:digit:]]{4,}$' | grep '0\]'
            hist[0x0000]:    1514272
            hist[0x0010]:    1905248
            hist[0x0020]:      99471
            hist[0x0030]:     343309
            hist[0x0040]:     177874
            hist[0x0080]:     190052
            hist[0x00a0]:       5258
            hist[0x00b0]:       2387
            hist[0x00c0]:       6975
            hist[0x00d0]:       5872
            hist[0x0100]:       3229
            hist[0x0140]:       4813
            hist[0x0160]:       9323
            hist[0x0200]:      12540
            hist[0x0230]:      37488
            hist[0x1000]:      17136
            hist[0x1d80]:     225199

   First column is length of the area copied, the column is usage count.

   To do this I wouldn't complicate the main memset() interface in any way to 
   branch it off to vector ops, I'd isolate specific memcpy()'s and memset()s
   (such as page table copying and page clearing) and use the simpler
   vector register based primitives there.

   For example we have clear_page() which is used by GFP_ZERO and by other places
   is implemented on modern x86 CPUs as:

     ENTRY(clear_page_erms)
        movl $4096,%ecx
        xorl %eax,%eax
        rep stosb
        ret

   While for such buffer sizes the enhanced-REP string instructions are supposed 
   to be slightly faster than 128-bit AVX ops, for such exact page granular ops 
   I'm pretty sure 256-bit (and 512-bit) vector ops are faster.

 - For page granular memset/memcpy it would also be interesting to investigate 
   whether non-temporal, cache-preserving vector ops for such known-large bulk 
   ops, such as VMOVNTQA, are beneficial in certain circumstances.

   On x86 there's only a single non-temporal instruction to GP registers: 
   MOVNTI, and for stores only.

   The vector instructions space is a lot richer in that regard, allowing 
   non-temporal loads as well which utilize fill buffers to move chunks of memory 
   into vector registers.

   Random example: in do_cow_fault() we use copy_user_highpage() to copy the page, 
   which uses copy_user_page() -> copy_page(), which uses:

     ENTRY(copy_page)
        ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
        movl    $4096/8, %ecx
        rep     movsq
        ret

   But in this COW copy case it's pretty obvious that we shouldn't keep the 
   _source_ page in cache. So we could use non-temporal load, which appear to make 
   a difference on more recent uarchs even on write-back memory ranges:

      https://stackoverflow.com/questions/40096894/do-current-x86-architectures-support-non-temporal-loads-from-normal-memory

   See the final graph in that entry and now the 'NT load' variant results in the 
   best execution time in the 4K case - and this is a limited benchmark that
   doesn't measure the lower cache eviction pressure by NT loads.

   ( The store part is probably better done into the cache, not just due to the
     SFENCE cost (which is relatively low at 40 cycles), but because it's probably
     beneficial to prime the cache with a freshly COW-ed page, it's going to get
     used in the near future once we return from the fault. )

   etc.

 - But more broadly, if we open up vector ops for smaller buffer sizes as well
   then I think other kernel code would start using them as well:

 - I think the BPF JIT, whose byte code machine languge is used by an
   increasing number of kernel subsystems, could benefit from having vector ops.
   It would possibly allow the handling of floating point types.

 - We could consider implementing vector ops based copy-to-user and copy-from-user 
   primitives as well, for cases where we know that the dominant usage pattern is 
   for larger, well-aligned chunks of memory.

 - Maybe we could introduce a floating point library (which falls back to a C 
   implementation) and simplify scheduler math. We go to ridiculous lengths to 
   maintain precision across a wide range of parameters, essentially implementing 
   128-bit fixed point math. Even 32-bit floating point math would possibly be 
   better than that, even if it was done via APIs.

etc.: I think the large vector processor available in modern x86 CPUs could be 
utilized by the kernel as well for various purposes.

But I think that's only worth doing if vector registers and their save areas are 
easily accessibly and the accesses are fundamentally IRQ safe.

> Yes, yes, I can find an Intel white-paper that talks about setting WC and then 
> using xmm and ymm instructions to write a single 64-byte burst over PCIe, and I 
> assume that is where the code and idea came from. But I don't actually see any 
> reason why a burst of 8 regular quad-word bytes wouldn't cause a 64-byte burst 
> write too.

Yeah, I'm not too convinced about the wide readq/writeq usecase either, I just 
used the opportunity to outline these (very vague) plans about utilizing vector 
instructions more broadly within the kernel.

> So as far as I can tell, there are basically *zero* upsides, and a lot of 
> potential downsides.

I agree about the potential downsides and I think most of them can be addressed 
adequately - and I think my list of upsides above is potentially significant, 
especially once we have lightweight APIs to utilize individual vector registers 
without having to do a full save/restore of ~1K large vector register context.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 1/2] bpf: Remove struct bpf_verifier_env argument from print_bpf_insn
From: Daniel Borkmann @ 2018-03-22  9:34 UTC (permalink / raw)
  To: Jiri Olsa, Quentin Monnet; +Cc: Jiri Olsa, Alexei Starovoitov, lkml, netdev
In-Reply-To: <20180321183749.GF2707@krava>

On 03/21/2018 07:37 PM, Jiri Olsa wrote:
> On Wed, Mar 21, 2018 at 05:25:33PM +0000, Quentin Monnet wrote:
>> 2018-03-21 16:02 UTC+0100 ~ Jiri Olsa <jolsa@kernel.org>
>>> We use print_bpf_insn in user space (bpftool and soon perf),
>>> so it'd be nice to keep it generic and strip it off the kernel
>>> struct bpf_verifier_env argument.
>>>
>>> This argument can be safely removed, because its users can
>>> use the struct bpf_insn_cbs::private_data to pass it.
>>>
>>> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
>>> ---
>>>  kernel/bpf/disasm.c   | 52 +++++++++++++++++++++++++--------------------------
>>>  kernel/bpf/disasm.h   |  5 +----
>>>  kernel/bpf/verifier.c |  6 +++---
>>>  3 files changed, 30 insertions(+), 33 deletions(-)
>>
>> [...]
>>
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index c6eff108aa99..9f27d3fa7259 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>>> @@ -202,8 +202,7 @@ EXPORT_SYMBOL_GPL(bpf_verifier_log_write);
>>>   * generic for symbol export. The function was renamed, but not the calls in
>>>   * the verifier to avoid complicating backports. Hence the alias below.
>>>   */
>>> -static __printf(2, 3) void verbose(struct bpf_verifier_env *env,
>>> -				   const char *fmt, ...)
>>> +static __printf(2, 3) void verbose(void *private_data, const char *fmt, ...)
>>>  	__attribute__((alias("bpf_verifier_log_write")));
>>
>> Just as a note, verbose() will be aliased to a function whose prototype
>> differs (bpf_verifier_log_write() still expects a struct
>> bpf_verifier_env as its first argument). I am not so familiar with
>> function aliases, could this change be a concern?
> 
> yea, but as it was pointer for pointer switch I did not
> see any problem with that.. I'll check more

Ok, holding off for now until we have clarification. Other option could also
be to make it void *private_data everywhere and for the kernel writer then
do struct bpf_verifier_env *env = private_data.

^ permalink raw reply

* Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Ingo Molnar @ 2018-03-22  9:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Thomas Gleixner, David Laight, Rahul Lakkireddy,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, mingo@redhat.com, hpa@zytor.com,
	davem@davemloft.net, akpm@linux-foundation.org,
	ganeshgr@chelsio.com, nirranjan@chelsio.com, indranil@chelsio.com,
	Peter Zijlstra, Fenghua Yu, Eric Biggers
In-Reply-To: <CALCETrXQRLEh_oQ5O-u31HT8AC0Gyb=_qwYJe2D0uyG07fNcoA@mail.gmail.com>


* Andy Lutomirski <luto@kernel.org> wrote:

> On Wed, Mar 21, 2018 at 6:32 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> >> And even if you ignore that "maintenance problems down the line" issue
> >> ("we can fix them when they happen") I don't want to see games like
> >> this, because I'm pretty sure it breaks the optimized xsave by tagging
> >> the state as being dirty.
> >
> > That's true - and it would penalize the context switch cost of the affected task
> > for the rest of its lifetime, as I don't think there's much that clears XINUSE
> > other than a FINIT, which is rarely done by user-space.
> >
> >> So no. Don't use vector stuff in the kernel. It's not worth the pain.
> >
> > I agree, but:
> >
> >> The *only* valid use is pretty much crypto, and even there it has had issues.
> >> Benchmarks use big arrays and/or dense working sets etc to "prove" how good the
> >> vector version is, and then you end up in situations where it's used once per
> >> fairly small packet for an interrupt, and it's actually much worse than doing it
> >> by hand.
> >
> > That's mainly because the XSAVE/XRESTOR done by kernel_fpu_begin()/end() is so
> > expensive, so this argument is somewhat circular.
> 
> If we do the deferred restore, then the XSAVE/XRSTOR happens at most
> once per kernel entry, which isn't so bad IMO.  Also, with PTI, kernel
> entries are already so slow that this will be mostly in the noise :(

For performance/scalability work we should just ignore the PTI overhead: it 
doesn't exist on AMD CPUs and Intel has announced Meltdown-fixed CPUs, to be 
released later this year:

   https://www.anandtech.com/show/12533/intel-spectre-meltdown

By the time any kernel changes we are talking about today get to distros and users 
the newest hardware won't have the Meltdown bug.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH V2 net-next 07/14] net/tls: Support TLS device offload with IPv6
From: Sergei Shtylyov @ 2018-03-22  9:36 UTC (permalink / raw)
  To: Saeed Mahameed, David S. Miller
  Cc: netdev, Dave Watson, Boris Pismenny, Ilya Lesokhin
In-Reply-To: <20180321210146.22537-8-saeedm@mellanox.com>

Hello!

On 3/22/2018 12:01 AM, Saeed Mahameed wrote:

> From: Ilya Lesokhin <ilyal@mellanox.com>
> 
> Previously get_netdev_for_sock worked only with IPv4.
> 
> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

    Only stylistic comments...

> ---
>   net/tls/tls_device.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 48 insertions(+), 1 deletion(-)
> 
> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> index e623280ea019..c35fc107d9c5 100644
> --- a/net/tls/tls_device.c
> +++ b/net/tls/tls_device.c
> @@ -34,6 +34,11 @@
>   #include <net/inet_common.h>
>   #include <linux/highmem.h>
>   #include <linux/netdevice.h>
> +#include <net/addrconf.h>
> +#include <net/flow.h>
> +#include <linux/ipv6.h>
> +#include <net/dst.h>
> +#include <linux/security.h>
>   
>   #include <net/tls.h>
>   #include <crypto/aead.h>
> @@ -99,13 +104,55 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
>   	spin_unlock_irqrestore(&tls_device_lock, flags);
>   }
>   
> +static inline struct net_device *ipv6_get_netdev(struct sock *sk)
> +{
> +	struct net_device *dev = NULL;
> +#if IS_ENABLED(CONFIG_IPV6)
> +	struct inet_sock *inet = inet_sk(sk);
> +	struct ipv6_pinfo *np = inet6_sk(sk);
> +	struct flowi6 _fl6, *fl6 = &_fl6;
> +	struct dst_entry *dst;
> +
> +	memset(fl6, 0, sizeof(*fl6));
> +	fl6->flowi6_proto = sk->sk_protocol;
> +	fl6->daddr = sk->sk_v6_daddr;
> +	fl6->saddr = np->saddr;
> +	fl6->flowlabel = np->flow_label;
> +	IP6_ECN_flow_xmit(sk, fl6->flowlabel);
> +	fl6->flowi6_oif = sk->sk_bound_dev_if;
> +	fl6->flowi6_mark = sk->sk_mark;
> +	fl6->fl6_sport = inet->inet_sport;
> +	fl6->fl6_dport = inet->inet_dport;
> +	fl6->flowi6_uid = sk->sk_uid;
> +	security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
> +
> +	if (ipv6_stub->ipv6_dst_lookup(sock_net(sk), sk, &dst, fl6) < 0)
> +		return NULL;
> +
> +	dev = dst->dev;
> +	dev_hold(dev);
> +	dst_release(dst);
> +
> +#endif

    I think the above empty line should be outside #if as you need an empty 
line between the declaration and other statements.

> +	return dev;
> +}
> +
>   /* We assume that the socket is already connected */
>   static struct net_device *get_netdev_for_sock(struct sock *sk)
>   {
>   	struct inet_sock *inet = inet_sk(sk);
>   	struct net_device *netdev = NULL;
>   
> -	netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> +	if (sk->sk_family == AF_INET)
> +		netdev = dev_get_by_index(sock_net(sk),
> +					  inet->cork.fl.flowi_oif);
> +	else if (sk->sk_family == AF_INET6) {

    Need {} in the 1st *if* branch since you have it in the 2nd.

> +		netdev = ipv6_get_netdev(sk);
> +		if (!netdev && !sk->sk_ipv6only &&
> +		    ipv6_addr_type(&sk->sk_v6_daddr) == IPV6_ADDR_MAPPED)
> +			netdev = dev_get_by_index(sock_net(sk),
> +						  inet->cork.fl.flowi_oif);
> +	}
>   
>   	return netdev;
>   }

MBR, Sergei

^ permalink raw reply

* Re: [PATCH v2 bpf-next 5/8] bpf: introduce BPF_RAW_TRACEPOINT
From: Daniel Borkmann @ 2018-03-22  9:43 UTC (permalink / raw)
  To: Alexei Starovoitov, davem
  Cc: torvalds, peterz, rostedt, netdev, kernel-team, linux-api
In-Reply-To: <20180321185448.2806324-6-ast@fb.com>

On 03/21/2018 07:54 PM, Alexei Starovoitov wrote:
[...]
> @@ -546,6 +556,53 @@ extern void ftrace_profile_free_filter(struct perf_event *event);
>  void perf_trace_buf_update(void *record, u16 type);
>  void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
>  
> +void bpf_trace_run1(struct bpf_prog *prog, u64 arg1);
> +void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2);
> +void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3);
> +void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4);
> +void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5);
> +void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6);
> +void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7);
> +void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		    u64 arg8);
> +void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		    u64 arg8, u64 arg9);
> +void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10);
> +void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11);
> +void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12);
> +void bpf_trace_run13(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +		     u64 arg13);
> +void bpf_trace_run14(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +		     u64 arg13, u64 arg14);
> +void bpf_trace_run15(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +		     u64 arg13, u64 arg14, u64 arg15);
> +void bpf_trace_run16(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +		     u64 arg13, u64 arg14, u64 arg15, u64 arg16);
> +void bpf_trace_run17(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +		     u64 arg13, u64 arg14, u64 arg15, u64 arg16, u64 arg17);
>  void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
>  			       struct trace_event_call *call, u64 count,
>  			       struct pt_regs *regs, struct hlist_head *head,
[...]
> @@ -896,3 +976,206 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
>  
>  	return ret;
>  }
> +
> +static __always_inline
> +void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
> +{
> +	rcu_read_lock();
> +	preempt_disable();
> +	(void) BPF_PROG_RUN(prog, args);
> +	preempt_enable();
> +	rcu_read_unlock();
> +}
> +
> +#define EVAL1(FN, X) FN(X)
> +#define EVAL2(FN, X, Y...) FN(X) EVAL1(FN, Y)
> +#define EVAL3(FN, X, Y...) FN(X) EVAL2(FN, Y)
> +#define EVAL4(FN, X, Y...) FN(X) EVAL3(FN, Y)
> +#define EVAL5(FN, X, Y...) FN(X) EVAL4(FN, Y)
> +#define EVAL6(FN, X, Y...) FN(X) EVAL5(FN, Y)
> +
> +#define COPY(X) args[X - 1] = arg##X;
> +
> +void bpf_trace_run1(struct bpf_prog *prog, u64 arg1)
> +{
> +	u64 args[1];
> +
> +	EVAL1(COPY, 1);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run1);
> +void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2)
> +{
> +	u64 args[2];
> +
> +	EVAL2(COPY, 1, 2);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run2);
> +void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3)
> +{
> +	u64 args[3];
> +
> +	EVAL3(COPY, 1, 2, 3);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run3);
> +void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4)
> +{
> +	u64 args[4];
> +
> +	EVAL4(COPY, 1, 2, 3, 4);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run4);
> +void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5)
> +{
> +	u64 args[5];
> +
> +	EVAL5(COPY, 1, 2, 3, 4, 5);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run5);
> +void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6)
> +{
> +	u64 args[6];
> +
> +	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run6);
> +void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7)
> +{
> +	u64 args[7];
> +
> +	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
> +	EVAL1(COPY, 7);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run7);
> +void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		    u64 arg8)
> +{
> +	u64 args[8];
> +
> +	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
> +	EVAL2(COPY, 7, 8);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run8);
> +void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		    u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		    u64 arg8, u64 arg9)
> +{
> +	u64 args[9];
> +
> +	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
> +	EVAL3(COPY, 7, 8, 9);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run9);
> +void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10)
> +{
> +	u64 args[10];
> +
> +	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
> +	EVAL4(COPY, 7, 8, 9, 10);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run10);
> +void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11)
> +{
> +	u64 args[11];
> +
> +	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
> +	EVAL5(COPY, 7, 8, 9, 10, 11);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run11);
> +void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12)
> +{
> +	u64 args[12];
> +
> +	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
> +	EVAL6(COPY, 7, 8, 9, 10, 11, 12);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run12);
> +void bpf_trace_run17(struct bpf_prog *prog, u64 arg1, u64 arg2,
> +		     u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7,
> +		     u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12,
> +		     u64 arg13, u64 arg14, u64 arg15, u64 arg16, u64 arg17)
> +{
> +	u64 args[17];
> +
> +	EVAL6(COPY, 1, 2, 3, 4, 5, 6);
> +	EVAL6(COPY, 7, 8, 9, 10, 11, 12);
> +	EVAL5(COPY, 13, 14, 15, 16, 17);
> +	__bpf_trace_run(prog, args);
> +}
> +EXPORT_SYMBOL_GPL(bpf_trace_run17);

Would be nice if we could generate all these above via macro, e.g. when we define
a hard upper limit for max number of tracepoint args anyway, so this gets automatically
adjusted as well. Maybe some of the logic from BPF_CALL_*() macros could be borrowed
for this purpose.

Thanks,
Daniel

^ permalink raw reply

* [PATCH net-next v3 0/5] Rework ip_ra_chain protection
From: Kirill Tkhai @ 2018-03-22  9:44 UTC (permalink / raw)
  To: davem, yoshfuji, edumazet, yanhaishuang, nikolay, yotamg, soheil,
	avagin, nicolas.dichtel, ebiederm, fw, roman.kapl, netdev,
	xiyou.wangcong, dvyukov, andreyknvl, lkp, ktkhai

Commit 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control"
made rtnl_lock() be used in raw_close(). This function is called
on every RAW socket destruction, so that rtnl_mutex is taken
every time. This scales very sadly. I observe cleanup_net()
spending a lot of time in rtnl_lock() and raw_close() is one
of the biggest rtnl user (since we have percpu net->ipv4.icmp_sk).

This patchset reworks the locking: reverts the problem commit
and its descendant, and introduces rtnl-independent locking.
This may have a continuation, and someone may work on killing
rtnl_lock() in mrtsock_destruct() in the future.

Thanks,
Kirill

---
v3: Change patches order: [2/5] and [3/5].
v2: Fix sparse warning [4/5], as reported by kbuild test robot.

---

Kirill Tkhai (5):
      net: Revert "ipv4: get rid of ip_ra_lock"
      net: Move IP_ROUTER_ALERT out of lock_sock(sk)
      net: Revert "ipv4: fix a deadlock in ip_ra_control"
      net: Make ip_ra_chain per struct net
      net: Replace ip_ra_lock with per-net mutex


 include/net/ip.h         |   13 +++++++++++--
 include/net/netns/ipv4.h |    2 ++
 net/core/net_namespace.c |    1 +
 net/ipv4/ip_input.c      |    5 ++---
 net/ipv4/ip_sockglue.c   |   34 +++++++++++++---------------------
 net/ipv4/ipmr.c          |   11 +++++++++--
 net/ipv4/raw.c           |    2 --
 7 files changed, 38 insertions(+), 30 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

^ permalink raw reply

* [PATCH net-next v3 1/5] net: Revert "ipv4: get rid of ip_ra_lock"
From: Kirill Tkhai @ 2018-03-22  9:45 UTC (permalink / raw)
  To: davem, yoshfuji, edumazet, yanhaishuang, nikolay, yotamg, soheil,
	avagin, nicolas.dichtel, ebiederm, fw, roman.kapl, netdev,
	xiyou.wangcong, dvyukov, andreyknvl, lkp, ktkhai
In-Reply-To: <152171176936.18202.11912079579606814167.stgit@localhost.localdomain>

This reverts commit ba3f571d5dde. The commit was made
after 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control",
and killed ip_ra_lock, which became useless after rtnl_lock()
made used to destroy every raw ipv4 socket. This scales
very bad, and next patch in series reverts 1215e51edad1.
ip_ra_lock will be used again.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/ipv4/ip_sockglue.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 74c962b9b09c..be7c3b71914d 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -334,6 +334,7 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
    sent to multicast group to reach destination designated router.
  */
 struct ip_ra_chain __rcu *ip_ra_chain;
+static DEFINE_SPINLOCK(ip_ra_lock);
 
 
 static void ip_ra_destroy_rcu(struct rcu_head *head)
@@ -355,17 +356,21 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
 	new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
+	spin_lock_bh(&ip_ra_lock);
 	for (rap = &ip_ra_chain;
-	     (ra = rtnl_dereference(*rap)) != NULL;
+	     (ra = rcu_dereference_protected(*rap,
+			lockdep_is_held(&ip_ra_lock))) != NULL;
 	     rap = &ra->next) {
 		if (ra->sk == sk) {
 			if (on) {
+				spin_unlock_bh(&ip_ra_lock);
 				kfree(new_ra);
 				return -EADDRINUSE;
 			}
 			/* dont let ip_call_ra_chain() use sk again */
 			ra->sk = NULL;
 			RCU_INIT_POINTER(*rap, ra->next);
+			spin_unlock_bh(&ip_ra_lock);
 
 			if (ra->destructor)
 				ra->destructor(sk);
@@ -379,14 +384,17 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 			return 0;
 		}
 	}
-	if (!new_ra)
+	if (!new_ra) {
+		spin_unlock_bh(&ip_ra_lock);
 		return -ENOBUFS;
+	}
 	new_ra->sk = sk;
 	new_ra->destructor = destructor;
 
 	RCU_INIT_POINTER(new_ra->next, ra);
 	rcu_assign_pointer(*rap, new_ra);
 	sock_hold(sk);
+	spin_unlock_bh(&ip_ra_lock);
 
 	return 0;
 }

^ permalink raw reply related

* [PATCH net-next v3 2/5] net: Move IP_ROUTER_ALERT out of lock_sock(sk)
From: Kirill Tkhai @ 2018-03-22  9:45 UTC (permalink / raw)
  To: davem, yoshfuji, edumazet, yanhaishuang, nikolay, yotamg, soheil,
	avagin, nicolas.dichtel, ebiederm, fw, roman.kapl, netdev,
	xiyou.wangcong, dvyukov, andreyknvl, lkp, ktkhai
In-Reply-To: <152171176936.18202.11912079579606814167.stgit@localhost.localdomain>

ip_ra_control() does not need sk_lock. Who are the another
users of ip_ra_chain? ip_mroute_setsockopt() doesn't take
sk_lock, while parallel IP_ROUTER_ALERT syscalls are
synchronized by ip_ra_lock. So, we may move this command
out of sk_lock.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/ipv4/ip_sockglue.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index be7c3b71914d..dcbf6afe27e7 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -647,6 +647,8 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 
 	/* If optlen==0, it is equivalent to val == 0 */
 
+	if (optname == IP_ROUTER_ALERT)
+		return ip_ra_control(sk, val ? 1 : 0, NULL);
 	if (ip_mroute_opt(optname))
 		return ip_mroute_setsockopt(sk, optname, optval, optlen);
 
@@ -1157,9 +1159,6 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 			goto e_inval;
 		inet->mc_all = val;
 		break;
-	case IP_ROUTER_ALERT:
-		err = ip_ra_control(sk, val ? 1 : 0, NULL);
-		break;
 
 	case IP_FREEBIND:
 		if (optlen < 1)

^ permalink raw reply related

* [PATCH net-next v3 3/5] net: Revert "ipv4: fix a deadlock in ip_ra_control"
From: Kirill Tkhai @ 2018-03-22  9:45 UTC (permalink / raw)
  To: davem, yoshfuji, edumazet, yanhaishuang, nikolay, yotamg, soheil,
	avagin, nicolas.dichtel, ebiederm, fw, roman.kapl, netdev,
	xiyou.wangcong, dvyukov, andreyknvl, lkp, ktkhai
In-Reply-To: <152171176936.18202.11912079579606814167.stgit@localhost.localdomain>

This reverts commit 1215e51edad1.
Since raw_close() is used on every RAW socket destruction,
the changes made by 1215e51edad1 scale sadly. This clearly
seen on endless unshare(CLONE_NEWNET) test, and cleanup_net()
kwork spends a lot of time waiting for rtnl_lock() introduced
by this commit.

Previous patch moved IP_ROUTER_ALERT out of rtnl_lock(),
so we revert this patch.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/ipv4/ip_sockglue.c |    1 -
 net/ipv4/ipmr.c        |   11 +++++++++--
 net/ipv4/raw.c         |    2 --
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index dcbf6afe27e7..bf5f44b27b7e 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -594,7 +594,6 @@ static bool setsockopt_needs_rtnl(int optname)
 	case MCAST_LEAVE_GROUP:
 	case MCAST_LEAVE_SOURCE_GROUP:
 	case MCAST_UNBLOCK_SOURCE:
-	case IP_ROUTER_ALERT:
 		return true;
 	}
 	return false;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index d752a70855d8..f6be5db16da2 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1399,7 +1399,7 @@ static void mrtsock_destruct(struct sock *sk)
 	struct net *net = sock_net(sk);
 	struct mr_table *mrt;
 
-	ASSERT_RTNL();
+	rtnl_lock();
 	ipmr_for_each_table(mrt, net) {
 		if (sk == rtnl_dereference(mrt->mroute_sk)) {
 			IPV4_DEVCONF_ALL(net, MC_FORWARDING)--;
@@ -1411,6 +1411,7 @@ static void mrtsock_destruct(struct sock *sk)
 			mroute_clean_tables(mrt, false);
 		}
 	}
+	rtnl_unlock();
 }
 
 /* Socket options and virtual interface manipulation. The whole
@@ -1475,8 +1476,13 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval,
 		if (sk != rcu_access_pointer(mrt->mroute_sk)) {
 			ret = -EACCES;
 		} else {
+			/* We need to unlock here because mrtsock_destruct takes
+			 * care of rtnl itself and we can't change that due to
+			 * the IP_ROUTER_ALERT setsockopt which runs without it.
+			 */
+			rtnl_unlock();
 			ret = ip_ra_control(sk, 0, NULL);
-			goto out_unlock;
+			goto out;
 		}
 		break;
 	case MRT_ADD_VIF:
@@ -1588,6 +1594,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval,
 	}
 out_unlock:
 	rtnl_unlock();
+out:
 	return ret;
 }
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 54648d20bf0f..720bef7da2f6 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -711,9 +711,7 @@ static void raw_close(struct sock *sk, long timeout)
 	/*
 	 * Raw sockets may have direct kernel references. Kill them.
 	 */
-	rtnl_lock();
 	ip_ra_control(sk, 0, NULL);
-	rtnl_unlock();
 
 	sk_common_release(sk);
 }

^ permalink raw reply related

* [PATCH net-next v3 4/5] net: Make ip_ra_chain per struct net
From: Kirill Tkhai @ 2018-03-22  9:45 UTC (permalink / raw)
  To: davem, yoshfuji, edumazet, yanhaishuang, nikolay, yotamg, soheil,
	avagin, nicolas.dichtel, ebiederm, fw, roman.kapl, netdev,
	xiyou.wangcong, dvyukov, andreyknvl, lkp, ktkhai
In-Reply-To: <152171176936.18202.11912079579606814167.stgit@localhost.localdomain>

This is optimization, which makes ip_call_ra_chain()
iterate less sockets to find the sockets it's looking for.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/net/ip.h         |   13 +++++++++++--
 include/net/netns/ipv4.h |    1 +
 net/ipv4/ip_input.c      |    5 ++---
 net/ipv4/ip_sockglue.c   |   15 ++-------------
 4 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index fe63ba95d12b..d53b5a9eae34 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -91,6 +91,17 @@ static inline int inet_sdif(struct sk_buff *skb)
 	return 0;
 }
 
+/* Special input handler for packets caught by router alert option.
+   They are selected only by protocol field, and then processed likely
+   local ones; but only if someone wants them! Otherwise, router
+   not running rsvpd will kill RSVP.
+
+   It is user level problem, what it will make with them.
+   I have no idea, how it will masquearde or NAT them (it is joke, joke :-)),
+   but receiver should be enough clever f.e. to forward mtrace requests,
+   sent to multicast group to reach destination designated router.
+ */
+
 struct ip_ra_chain {
 	struct ip_ra_chain __rcu *next;
 	struct sock		*sk;
@@ -101,8 +112,6 @@ struct ip_ra_chain {
 	struct rcu_head		rcu;
 };
 
-extern struct ip_ra_chain __rcu *ip_ra_chain;
-
 /* IP flags. */
 #define IP_CE		0x8000		/* Flag: "Congestion"		*/
 #define IP_DF		0x4000		/* Flag: "Don't Fragment"	*/
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 382bfd7583cf..97d7ee6667c7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -49,6 +49,7 @@ struct netns_ipv4 {
 #endif
 	struct ipv4_devconf	*devconf_all;
 	struct ipv4_devconf	*devconf_dflt;
+	struct ip_ra_chain __rcu *ra_chain;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
 	struct fib_rules_ops	*rules_ops;
 	bool			fib_has_custom_rules;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 57fc13c6ab2b..7582713dd18f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -159,7 +159,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
 	struct net_device *dev = skb->dev;
 	struct net *net = dev_net(dev);
 
-	for (ra = rcu_dereference(ip_ra_chain); ra; ra = rcu_dereference(ra->next)) {
+	for (ra = rcu_dereference(net->ipv4.ra_chain); ra; ra = rcu_dereference(ra->next)) {
 		struct sock *sk = ra->sk;
 
 		/* If socket is bound to an interface, only report
@@ -167,8 +167,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
 		 */
 		if (sk && inet_sk(sk)->inet_num == protocol &&
 		    (!sk->sk_bound_dev_if ||
-		     sk->sk_bound_dev_if == dev->ifindex) &&
-		    net_eq(sock_net(sk), net)) {
+		     sk->sk_bound_dev_if == dev->ifindex)) {
 			if (ip_is_fragment(ip_hdr(skb))) {
 				if (ip_defrag(net, skb, IP_DEFRAG_CALL_RA_CHAIN))
 					return true;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index bf5f44b27b7e..f36d35fe924b 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -322,18 +322,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 	return 0;
 }
 
-
-/* Special input handler for packets caught by router alert option.
-   They are selected only by protocol field, and then processed likely
-   local ones; but only if someone wants them! Otherwise, router
-   not running rsvpd will kill RSVP.
-
-   It is user level problem, what it will make with them.
-   I have no idea, how it will masquearde or NAT them (it is joke, joke :-)),
-   but receiver should be enough clever f.e. to forward mtrace requests,
-   sent to multicast group to reach destination designated router.
- */
-struct ip_ra_chain __rcu *ip_ra_chain;
 static DEFINE_SPINLOCK(ip_ra_lock);
 
 
@@ -350,6 +338,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 {
 	struct ip_ra_chain *ra, *new_ra;
 	struct ip_ra_chain __rcu **rap;
+	struct net *net = sock_net(sk);
 
 	if (sk->sk_type != SOCK_RAW || inet_sk(sk)->inet_num == IPPROTO_RAW)
 		return -EINVAL;
@@ -357,7 +346,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 	new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
 	spin_lock_bh(&ip_ra_lock);
-	for (rap = &ip_ra_chain;
+	for (rap = &net->ipv4.ra_chain;
 	     (ra = rcu_dereference_protected(*rap,
 			lockdep_is_held(&ip_ra_lock))) != NULL;
 	     rap = &ra->next) {

^ permalink raw reply related

* [PATCH net-next v3 5/5] net: Replace ip_ra_lock with per-net mutex
From: Kirill Tkhai @ 2018-03-22  9:45 UTC (permalink / raw)
  To: davem, yoshfuji, edumazet, yanhaishuang, nikolay, yotamg, soheil,
	avagin, nicolas.dichtel, ebiederm, fw, roman.kapl, netdev,
	xiyou.wangcong, dvyukov, andreyknvl, lkp, ktkhai
In-Reply-To: <152171176936.18202.11912079579606814167.stgit@localhost.localdomain>

Since ra_chain is per-net, we may use per-net mutexes
to protect them in ip_ra_control(). This improves
scalability.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/net/netns/ipv4.h |    1 +
 net/core/net_namespace.c |    1 +
 net/ipv4/ip_sockglue.c   |   15 ++++++---------
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 97d7ee6667c7..8491bc9c86b1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -50,6 +50,7 @@ struct netns_ipv4 {
 	struct ipv4_devconf	*devconf_all;
 	struct ipv4_devconf	*devconf_dflt;
 	struct ip_ra_chain __rcu *ra_chain;
+	struct mutex		ra_mutex;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
 	struct fib_rules_ops	*rules_ops;
 	bool			fib_has_custom_rules;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index c340d5cfbdec..95ba2c53bd9a 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -301,6 +301,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	net->user_ns = user_ns;
 	idr_init(&net->netns_ids);
 	spin_lock_init(&net->nsid_lock);
+	mutex_init(&net->ipv4.ra_mutex);
 
 	list_for_each_entry(ops, &pernet_list, list) {
 		error = ops_init(ops, net);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index f36d35fe924b..5ad2d8ed3a3f 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -322,9 +322,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
 	return 0;
 }
 
-static DEFINE_SPINLOCK(ip_ra_lock);
-
-
 static void ip_ra_destroy_rcu(struct rcu_head *head)
 {
 	struct ip_ra_chain *ra = container_of(head, struct ip_ra_chain, rcu);
@@ -345,21 +342,21 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
 	new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
-	spin_lock_bh(&ip_ra_lock);
+	mutex_lock(&net->ipv4.ra_mutex);
 	for (rap = &net->ipv4.ra_chain;
 	     (ra = rcu_dereference_protected(*rap,
-			lockdep_is_held(&ip_ra_lock))) != NULL;
+			lockdep_is_held(&net->ipv4.ra_mutex))) != NULL;
 	     rap = &ra->next) {
 		if (ra->sk == sk) {
 			if (on) {
-				spin_unlock_bh(&ip_ra_lock);
+				mutex_unlock(&net->ipv4.ra_mutex);
 				kfree(new_ra);
 				return -EADDRINUSE;
 			}
 			/* dont let ip_call_ra_chain() use sk again */
 			ra->sk = NULL;
 			RCU_INIT_POINTER(*rap, ra->next);
-			spin_unlock_bh(&ip_ra_lock);
+			mutex_unlock(&net->ipv4.ra_mutex);
 
 			if (ra->destructor)
 				ra->destructor(sk);
@@ -374,7 +371,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 		}
 	}
 	if (!new_ra) {
-		spin_unlock_bh(&ip_ra_lock);
+		mutex_unlock(&net->ipv4.ra_mutex);
 		return -ENOBUFS;
 	}
 	new_ra->sk = sk;
@@ -383,7 +380,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 	RCU_INIT_POINTER(new_ra->next, ra);
 	rcu_assign_pointer(*rap, new_ra);
 	sock_hold(sk);
-	spin_unlock_bh(&ip_ra_lock);
+	mutex_unlock(&net->ipv4.ra_mutex);
 
 	return 0;
 }

^ permalink raw reply related

* Re: [PATCH net-next v3 0/5] Rework ip_ra_chain protection
From: Kirill Tkhai @ 2018-03-22  9:49 UTC (permalink / raw)
  To: davem, yoshfuji, edumazet, yanhaishuang, nikolay, yotamg, soheil,
	avagin, nicolas.dichtel, ebiederm, fw, roman.kapl, netdev,
	xiyou.wangcong, dvyukov, andreyknvl, lkp
In-Reply-To: <152171176936.18202.11912079579606814167.stgit@localhost.localdomain>

On 22.03.2018 12:44, Kirill Tkhai wrote:
> Commit 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control"
> made rtnl_lock() be used in raw_close(). This function is called
> on every RAW socket destruction, so that rtnl_mutex is taken
> every time. This scales very sadly. I observe cleanup_net()
> spending a lot of time in rtnl_lock() and raw_close() is one
> of the biggest rtnl user (since we have percpu net->ipv4.icmp_sk).
> 
> This patchset reworks the locking: reverts the problem commit
> and its descendant, and introduces rtnl-independent locking.
> This may have a continuation, and someone may work on killing
> rtnl_lock() in mrtsock_destruct() in the future.
> 
> Thanks,
> Kirill
> 
> ---
> v3: Change patches order: [2/5] and [3/5].
> v2: Fix sparse warning [4/5], as reported by kbuild test robot.
> 
> ---
> 
> Kirill Tkhai (5):
>       net: Revert "ipv4: get rid of ip_ra_lock"
>       net: Move IP_ROUTER_ALERT out of lock_sock(sk)
>       net: Revert "ipv4: fix a deadlock in ip_ra_control"
>       net: Make ip_ra_chain per struct net
>       net: Replace ip_ra_lock with per-net mutex
> 
> 
>  include/net/ip.h         |   13 +++++++++++--
>  include/net/netns/ipv4.h |    2 ++
>  net/core/net_namespace.c |    1 +
>  net/ipv4/ip_input.c      |    5 ++---
>  net/ipv4/ip_sockglue.c   |   34 +++++++++++++---------------------
>  net/ipv4/ipmr.c          |   11 +++++++++--
>  net/ipv4/raw.c           |    2 --
>  7 files changed, 38 insertions(+), 30 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

JFI: I used the below program to test:

#define _GNU_SOURCE
#include <sys/socket.h>
#include <netinet/in.h>
#include <sys/types.h>
#include <linux/mroute.h>
#include <sched.h>

int main()
{
        int sk, v, i = 0;

        if (unshare(CLONE_NEWNET)) {
                perror("unshare");
                return 1;
        }
        sk = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
        if (sk < 0) {
                perror("socket");
                return 1;
        }
        for (i = 0; i < 3; i++)
                fork();

        while (1) {
                setsockopt(sk, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v));
                setsockopt(sk, IPPROTO_IP, MRT_DONE, (void *)&v, sizeof(v));
                v = (i++)%2;
                setsockopt(sk, IPPROTO_IP, IP_ROUTER_ALERT, (void *)&v, sizeof(v));
        }

        return 0;
}

^ permalink raw reply

* RE: DTS for our Configuration
From: Alayev Michael @ 2018-03-22  9:55 UTC (permalink / raw)
  To: 'andrew@lunn.ch'
  Cc: Efter Yoram, Dror Alon, 'netdev@vger.kernel.org',
	Margalit Ofer
In-Reply-To: <48F7D4389F30BA4383F214EE802BA47101706555D3@EXS10.iai.co.il>

Hi Andrew,

>I think this is a problem with the macb driver. To me, it looks like you are going to have to >make some changes to the driver to make this work. Normally the MDIO bus children are >placed within a container node, often called 'mdio-bus' or simply 'mdio'. See for example >Documentation/devicetree/bindings/net/fsl-fec.txt.ן¿½ The macb driver does not do this. It >passed the main DT node of the device to of_mdiobus_register(). It then walks all the >children assuming they devices on the MDIO bus. But the first child it finds is the 'fixed-link'. >This is not supposed to be a child of the bus, which is why it goes wrong.

As you understand, I prefer not to change the driver. 
Is there a way for me to bypass this issue?
Can I use other property than 'fixed-link'?

>Please include the full panic details. The stack trace can be very useful.
Please, see below:

You also asked for the linux log of your suggested dts:
libphy: Fixed MDIO Bus: probed
CAN device driver interface
libphy: MACB_mii_bus: probed
mdio_bus e000b000.ethernet-ffffffff: /amba/ethernet@e000b000/fixed-link has invalid PHY address
mv88e6085 e000b000.ethernet-ffffffff:1d: switch 0xa10 detected: Marvell 88E6390X, revision 1
libphy: mv88e6xxx SMI: probed
mv88e6085 e000b000.ethernet-ffffffff:1c: switch 0xa10 detected: Marvell 88E6390X, revision 1
libphy: mv88e6xxx SMI: probed
mv88e6085: probe of e000b000.ethernet-ffffffff:1c failed with error -16
mdio_bus e000b000.ethernet-ffffffff: scan phy fixed-link at address 1
Unable to handle kernel NULL pointer dereference at virtual address 00000004
pgd = c0004000
[00000004] *pgd=00000000
Internal error: Oops - BUG: 17 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.14.0-xilinx #15
Hardware name: Xilinx Zynq Platform
task: df43b840 task.stack: df43c000
PC is at dsa_unregister_switch+0x10/0x48
LR is at dsa_unregister_switch+0x10/0x48
pc : [<c05aad48>]ן¿½ן¿½ן¿½ lr : [<c05aad48>]ן¿½ן¿½ן¿½ psr: 60000013
sp : df43dd58ן¿½ ip : 00000000ן¿½ fp : 00000000
r10: 00000000ן¿½ r9 : dd009c78ן¿½ r8 : 00000034
r7 : dd0d1434ן¿½ r6 : c091dd68ן¿½ r5 : 00000000ן¿½ r4 : dd0d1210
r3 : df43b840ן¿½ r2 : 00000000ן¿½ r1 : 00000000ן¿½ r0 : c0936cdc
Flags: nZCvן¿½ IRQs onן¿½ FIQs onן¿½ Mode SVC_32ן¿½ ISA ARMן¿½ Segment none
Control: 18c5387dן¿½ Table: 1f65c04aן¿½ DAC: 00000051
Process swapper/0 (pid: 1, stack limit = 0xdf43c210)
Stack: (0xdf43dd58 to 0xdf43e000)
dd40:ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ dd0d1210 00000000
dd60: c091dd68 c0418ac0 dd0d1400 00000000 c091dd68 c04127ac dd0d1400 c03b72b8
dd80: dd0d1400 df4ff530 c091c448 dd0d1460 dfbeb3c4 c03b6420 dd0d1400 dd0d145c
dda0: dd009c78 c03b3acc dd009e8c a0000013 dd1d2800 dd009e80 dd009e8c dd0d1400
ddc0: dd0d1400 dd009e84 dd009e8c c0412868 dd009c00 c0412624 ffffffed 00000001
dde0: dd009c00 dfbeb67c dfbeb3c4 c04bafac 00000000 c04b378c dfbecb5c 0000001c
de00: c06ffa6d df77f000 df77c000 df77c4c0 dfbeb3c4 df524a10 00000000 c0423104
de20: ffffffff df43de5c 00000001 c0229cd4 dfbeb3c4 c0421054 00000001 00000002
de40: 00000001 00000001 dd1c8898 dd1c9cc0 dd1c9c40 dd1c9bc0 00000000 00000000
de60: 00000001 00000001 c091ddc8 c0422974 df524a10 c091ddc8 00000000 00000000
de80: c091ddc8 00000000 00000000 c03b84d4 df524a10 c09523dc c09523e0 c03b6e84
dea0: df524a10 df524a44 c091ddc8 c0918a28 00000000 c0938000 c083383c c03b7080
dec0: 00000000 c091ddc8 c03b7000 c03b5790 df472f58 df4edb34 c091ddc8 00000000
dee0: dd1c5300 c03b64f8 c06ff8d2 c06ff8d3 00000000 c091ddc8 c081aea4 000000a8
df00: c083be5c c03b7808 00000006 c081aea4 000000a8 c0101900 00000000 df43df24
df20: 00000000 00000000 00000000 c075d548 000000a8 00000006 00000006 00000000
df40: cccccccd 00000000 00000000 c0938000 c083383c 00000006 c0833830 000000a8
df60: 00000006 c0833834 000000a8 c083be5c c0938000 c0800d40 00000006 00000006
df80: 00000000 c0800594 00000000 c05e1e64 00000000 00000000 00000000 00000000
dfa0: 00000000 c05e1e6c 00000000 c0106fd0 00000000 00000000 00000000 00000000
dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[<c05aad48>] (dsa_unregister_switch) from [<c0418ac0>] (mv88e6xxx_remove+0x1c/0x68)
[<c0418ac0>] (mv88e6xxx_remove) from [<c04127ac>] (mdio_remove+0x18/0x28)
[<c04127ac>] (mdio_remove) from [<c03b72b8>] (device_release_driver_internal+0x128/0x1d0)
[<c03b72b8>] (device_release_driver_internal) from [<c03b6420>] (bus_remove_device+0xcc/0xdc)
[<c03b6420>] (bus_remove_device) from [<c03b3acc>] (device_del+0x1bc/0x258)
[<c03b3acc>] (device_del) from [<c0412868>] (mdio_device_remove+0xc/0x18)
[<c0412868>] (mdio_device_remove) from [<c0412624>] (mdiobus_unregister+0x40/0x74)
[<c0412624>] (mdiobus_unregister) from [<c04bafac>] (of_mdiobus_register+0x234/0x254)
[<c04bafac>] (of_mdiobus_register) from [<c0423104>] (macb_probe+0x790/0xb88)
[<c0423104>] (macb_probe) from [<c03b84d4>] (platform_drv_probe+0x50/0xa0)
[<c03b84d4>] (platform_drv_probe) from [<c03b6e84>] (driver_probe_device+0x13c/0x2b8)
[<c03b6e84>] (driver_probe_device) from [<c03b7080>] (__driver_attach+0x80/0xa4)
[<c03b7080>] (__driver_attach) from [<c03b5790>] (bus_for_each_dev+0x68/0x8c)
[<c03b5790>] (bus_for_each_dev) from [<c03b64f8>] (bus_add_driver+0xc8/0x1dc)
[<c03b64f8>] (bus_add_driver) from [<c03b7808>] (driver_register+0x9c/0xe0)
[<c03b7808>] (driver_register) from [<c0101900>] (do_one_initcall+0xa8/0x11c)
[<c0101900>] (do_one_initcall) from [<c0800d40>] (kernel_init_freeable+0x10c/0x1cc)
[<c0800d40>] (kernel_init_freeable) from [<c05e1e6c>] (kernel_init+0x8/0x10c)
[<c05e1e6c>] (kernel_init) from [<c0106fd0>] (ret_from_fork+0x14/0x24)
Code: e92d4070 e1a05000 e59f0034 eb00ea7e (e5954004)
---[ end trace f6d20ab8f9ad8edb ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

CPU0: stopping
CPU: 0 PID: 0 Comm: swapper/0 Tainted: Gן¿½ן¿½ן¿½ן¿½ן¿½ Dן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ 4.14.0-xilinx #15
Hardware name: Xilinx Zynq Platform
[<c010e7b4>] (unwind_backtrace) from [<c010a9b0>] (show_stack+0x10/0x14)
[<c010a9b0>] (show_stack) from [<c05d1950>] (dump_stack+0x80/0xa0)
[<c05d1950>] (dump_stack) from [<c010cf20>] (ipi_cpu_stop+0x3c/0x70)
[<c010cf20>] (ipi_cpu_stop) from [<c010d720>] (handle_IPI+0x64/0x84)
[<c010d720>] (handle_IPI) from [<c01013f8>] (gic_handle_irq+0x7c/0x98)
[<c01013f8>] (gic_handle_irq) from [<c010b40c>] (__irq_svc+0x6c/0xa8)
Exception stack(0xc0901f48 to 0xc0901f90)
1f40:ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ן¿½ 00000001 00000000 00000000 c0116800 00000000 00000000
1f60: c0900000 c08441f8 c0901fa0 c0833a30 00000000 00000000 1f388000 c0901f98
1f80: c01079e0 c01079e4 60000013 ffffffff
[<c010b40c>] (__irq_svc) from [<c01079e4>] (arch_cpu_idle+0x2c/0x38)
[<c01079e4>] (arch_cpu_idle) from [<c01490fc>] (do_idle+0xd0/0x198)
[<c01490fc>] (do_idle) from [<c01492fc>] (cpu_startup_entry+0x18/0x1c)
[<c01492fc>] (cpu_startup_entry) from [<c0800bd4>] (start_kernel+0x308/0x368)
[<c0800bd4>] (start_kernel) from [<0000807c>] (0x807c)
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b


BR,
Michael


***********************************************************************************************

Please consider the environment before printing this email !
The information contained in this communication is proprietary to Israel Aerospace Industries Ltd. and/or third parties, may contain confidential or privileged information, and is intended only for the use of the intended addressee thereof.
If you are not the intended addressee, please be aware that any use, disclosure, distribution and/or copying of this communication is strictly prohibited. If you receive this communication in error, please notify the sender immediately and delete it from your computer. 
Thank you.

Visit us at:   www.iai.co.il

^ permalink raw reply

* RE: [PATCH v4 11/17] bnx2x: Eliminate duplicate barriers on weakly-ordered archs
From: Kalluru, Sudarsana @ 2018-03-22 10:10 UTC (permalink / raw)
  To: Sinan Kaya, netdev@vger.kernel.org, timur@codeaurora.org,
	sulrich@codeaurora.org
  Cc: linux-arm-msm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Elior, Ariel,
	Dept-Eng Everest Linux L2, linux-kernel@vger.kernel.org
In-Reply-To: <1521513753-7325-12-git-send-email-okaya@codeaurora.org>

-----Original Message-----
From: Sinan Kaya [mailto:okaya@codeaurora.org] 
Sent: 20 March 2018 08:12
To: netdev@vger.kernel.org; timur@codeaurora.org; sulrich@codeaurora.org
Cc: linux-arm-msm@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Sinan Kaya <okaya@codeaurora.org>; Elior, Ariel <Ariel.Elior@cavium.com>; Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>; linux-kernel@vger.kernel.org
Subject: [PATCH v4 11/17] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

Code includes wmb() followed by writel(). writel() already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the register write.

Since code already has an explicit barrier call, changing writel() to writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h       |  9 ++++++++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  | 21 +++++++++++----------  drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c |  2 +-  drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c  |  2 +-
 5 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 352beff..ac38db9 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -166,6 +166,12 @@ do {						\
 #define REG_RD8(bp, offset)		readb(REG_ADDR(bp, offset))
 #define REG_RD16(bp, offset)		readw(REG_ADDR(bp, offset))
 
+#define REG_WR_RELAXED(bp, offset, val)	writel_relaxed((u32)val,\
+						       REG_ADDR(bp, offset))
+
+#define REG_WR16_RELAXED(bp, offset, val) \
+	writew_relaxed((u16)val, REG_ADDR(bp, offset))
+
 #define REG_WR(bp, offset, val)		writel((u32)val, REG_ADDR(bp, offset))
 #define REG_WR8(bp, offset, val)	writeb((u8)val, REG_ADDR(bp, offset))
 #define REG_WR16(bp, offset, val)	writew((u16)val, REG_ADDR(bp, offset))
@@ -760,7 +766,8 @@ struct bnx2x_fastpath {  #endif  #define DOORBELL(bp, cid, val) \
 	do { \
-		writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \
+		writel_relaxed((u32)(val),\
+				bp->doorbells + (bp->db_size * (cid))); \
 	} while (0)
 
 /* TX CSUM helpers */
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index a5265e1..a8ce5c5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -522,8 +522,8 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
 	wmb();
 
 	for (i = 0; i < sizeof(rx_prods)/4; i++)
-		REG_WR(bp, fp->ustorm_rx_prods_offset + i*4,
-		       ((u32 *)&rx_prods)[i]);
+		REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4,
+			       ((u32 *)&rx_prods)[i]);
 
 	mmiowb(); /* keep prod updates ordered */
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af..2dea1b6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -1608,8 +1608,8 @@ static void bnx2x_hc_int_enable(struct bnx2x *bp)
 		} else
 			val = 0xffff;
 
-		REG_WR(bp, HC_REG_TRAILING_EDGE_0 + port*8, val);
-		REG_WR(bp, HC_REG_LEADING_EDGE_0 + port*8, val);
+		REG_WR_RELAXED(bp, HC_REG_TRAILING_EDGE_0 + port * 8, val);
+		REG_WR_RELAXED(bp, HC_REG_LEADING_EDGE_0 + port * 8, val);
 	}
 
 	/* Make sure that interrupts are indeed enabled from here on */ @@ -1672,8 +1672,8 @@ static void bnx2x_igu_int_enable(struct bnx2x *bp)
 	} else
 		val = 0xffff;
 
-	REG_WR(bp, IGU_REG_TRAILING_EDGE_LATCH, val);
-	REG_WR(bp, IGU_REG_LEADING_EDGE_LATCH, val);
+	REG_WR_RELAXED(bp, IGU_REG_TRAILING_EDGE_LATCH, val);
+	REG_WR_RELAXED(bp, IGU_REG_LEADING_EDGE_LATCH, val);
 
 	/* Make sure that interrupts are indeed enabled from here on */
 	mmiowb();
@@ -3817,8 +3817,8 @@ static void bnx2x_sp_prod_update(struct bnx2x *bp)
 	 */
 	mb();
 
-	REG_WR16(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
-		 bp->spq_prod_idx);
+	REG_WR16_RELAXED(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
+			 bp->spq_prod_idx);
 	mmiowb();
 }
 
@@ -7761,7 +7761,7 @@ void bnx2x_igu_clear_sb_gen(struct bnx2x *bp, u8 func, u8 idu_sb_id, bool is_pf)
 	barrier();
 	DP(NETIF_MSG_HW, "write 0x%08x to IGU(via GRC) addr 0x%x\n",
 			  ctl, igu_addr_ctl);
-	REG_WR(bp, igu_addr_ctl, ctl);
+	REG_WR_RELAXED(bp, igu_addr_ctl, ctl);
 	mmiowb();
 	barrier();
 
@@ -9720,13 +9720,14 @@ static void bnx2x_process_kill_chip_reset(struct bnx2x *bp, bool global)
 	barrier();
 	mmiowb();
 
-	REG_WR(bp, GRCBASE_MISC + MISC_REGISTERS_RESET_REG_2_SET,
-	       reset_mask2 & (~stay_reset2));
+	REG_WR_RELAXED(bp, GRCBASE_MISC + MISC_REGISTERS_RESET_REG_2_SET,
+		       reset_mask2 & (~stay_reset2));
 
 	barrier();
 	mmiowb();
 
-	REG_WR(bp, GRCBASE_MISC + MISC_REGISTERS_RESET_REG_1_SET, reset_mask1);
+	REG_WR_RELAXED(bp, GRCBASE_MISC + MISC_REGISTERS_RESET_REG_1_SET,
+		       reset_mask1);
 	mmiowb();
 }
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index ffa7959..40e55d8 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -105,7 +105,7 @@ static void bnx2x_vf_igu_ack_sb(struct bnx2x *bp, struct bnx2x_virtf *vf,
 
 	DP(NETIF_MSG_HW, "write 0x%08x to IGU(via GRC) addr 0x%x\n",
 	   ctl, igu_addr_ctl);
-	REG_WR(bp, igu_addr_ctl, ctl);
+	REG_WR_RELAXED(bp, igu_addr_ctl, ctl);
 	mmiowb();
 	barrier();
 }
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index 76a4668..3b2f1bd 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -170,7 +170,7 @@ static int bnx2x_send_msg2pf(struct bnx2x *bp, u8 *done, dma_addr_t msg_mapping)
 	wmb();
 
 	/* Trigger the PF FW */
-	writeb(1, &zone_data->trigger.vf_pf_channel.addr_valid);
+	writeb_relaxed(1, &zone_data->trigger.vf_pf_channel.addr_valid);
 
 	/* Wait for PF to complete */
 	while ((tout >= 0) && (!*done)) {
--
2.7.4

Acked-by: Sudarsana Kalluru <Sudarsana.Kalluru@cavium.com>

^ permalink raw reply related

* [PATCH nf] netfilter: drop template ct when conntrack is skipped.
From: Paolo Abeni @ 2018-03-22 10:08 UTC (permalink / raw)
  To: netdev
  Cc: Pablo Neira Ayuso, Florian Westphal, netfilter-devel, coreteam,
	syzbot+0346441ae0545cfcea3a, syzkaller-bugs

The ipv4 nf_ct code currently skips the nf_conntrak_in() call
for fragmented packets. As a results later matches/target can end
up manipulating template ct entry instead of 'real' ones.

Exploiting the above, syzbot found a way to trigger the following
splat:

WARNING: CPU: 1 PID: 4242 at net/netfilter/xt_cluster.c:55
xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 4242 Comm: syzkaller027971 Not tainted 4.16.0-rc2+ #243
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x24d lib/dump_stack.c:53
  panic+0x1e4/0x41c kernel/panic.c:183
  __warn+0x1dc/0x200 kernel/panic.c:547
  report_bug+0x211/0x2d0 lib/bug.c:184
  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
  fixup_bug arch/x86/kernel/traps.c:247 [inline]
  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
  invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957
RIP: 0010:xt_cluster_hash net/netfilter/xt_cluster.c:55 [inline]
RIP: 0010:xt_cluster_mt+0x6c1/0x840 net/netfilter/xt_cluster.c:127
RSP: 0018:ffff8801d2f6f2d0 EFLAGS: 00010293
RAX: ffff8801af700540 RBX: 0000000000000000 RCX: ffffffff84a2d1e1
RDX: 0000000000000000 RSI: ffff8801d2f6f478 RDI: ffff8801cafd336a
RBP: ffff8801d2f6f2e8 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801b03b3d18
R13: ffff8801cafd3300 R14: dffffc0000000000 R15: ffff8801d2f6f478
  ipt_do_table+0xa91/0x19b0 net/ipv4/netfilter/ip_tables.c:296
  iptable_filter_hook+0x65/0x80 net/ipv4/netfilter/iptable_filter.c:41
  nf_hook_entry_hookfn include/linux/netfilter.h:120 [inline]
  nf_hook_slow+0xba/0x1a0 net/netfilter/core.c:483
  nf_hook include/linux/netfilter.h:243 [inline]
  NF_HOOK include/linux/netfilter.h:286 [inline]
  raw_send_hdrinc.isra.17+0xf39/0x1880 net/ipv4/raw.c:432
  raw_sendmsg+0x14cd/0x26b0 net/ipv4/raw.c:669
  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
  sock_sendmsg_nosec net/socket.c:629 [inline]
  sock_sendmsg+0xca/0x110 net/socket.c:639
  SYSC_sendto+0x361/0x5c0 net/socket.c:1748
  SyS_sendto+0x40/0x50 net/socket.c:1716
  do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x441b49
RSP: 002b:00007ffff5ca8b18 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000441b49
RDX: 0000000000000030 RSI: 0000000020ff7000 RDI: 0000000000000003
RBP: 00000000006cc018 R08: 000000002066354c R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000403470
R13: 0000000000403500 R14: 0000000000000000 R15: 0000000000000000
Dumping ftrace buffer:
    (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

Instead of adding checks for template ct on every target/match
manipulating skb->_nfct, simply drop the template ct when skipping
nf_conntrack_in().

Reported-and-tested-by: syzbot+0346441ae0545cfcea3a@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index b50721d9d30e..9db988f9a4d7 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -154,8 +154,20 @@ static unsigned int ipv4_conntrack_local(void *priv,
 					 struct sk_buff *skb,
 					 const struct nf_hook_state *state)
 {
-	if (ip_is_fragment(ip_hdr(skb))) /* IP_NODEFRAG setsockopt set */
+	if (ip_is_fragment(ip_hdr(skb))) { /* IP_NODEFRAG setsockopt set */
+		enum ip_conntrack_info ctinfo;
+		struct nf_conn *tmpl;
+
+		tmpl = nf_ct_get(skb, &ctinfo);
+		if (tmpl && nf_ct_is_template(tmpl)) {
+			/* when skipping ct, clear templates to avoid fooling
+			 * later targets/matches
+			 */
+			skb->_nfct = 0;
+			nf_ct_put(tmpl);
+		}
 		return NF_ACCEPT;
+	}
 
 	return nf_conntrack_in(state->net, PF_INET, state->hook, skb);
 }
-- 
2.14.3

^ permalink raw reply related

* Re: [bug, bisected] pfifo_fast causes packet reordering
From: Jakob Unterwurzacher @ 2018-03-22 10:16 UTC (permalink / raw)
  To: John Fastabend, Dave Taht
  Cc: netdev, linux-kernel, David S. Miller, linux-can@vger.kernel.org,
	Martin Elshuber
In-Reply-To: <ffdc6ca9-bf64-5514-bac9-7b107f9b443b@gmail.com>

On 21.03.18 21:52, John Fastabend wrote:
> Can you try this,
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index d4907b5..1e596bd 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -30,6 +30,7 @@ struct qdisc_rate_table {
>   enum qdisc_state_t {
>          __QDISC_STATE_SCHED,
>          __QDISC_STATE_DEACTIVATED,
> +       __QDISC_STATE_RUNNING,
>   };
> [...]

Tested, looks good. No OOO observed, no side effects observed, iperf 
numbers on Gigabit Ethernet look the same.

Thanks,
Jakob

^ permalink raw reply

* Re: [PATCH 1/2] r8169: reinstate ALDPS for power saving
From: Kai-Heng Feng @ 2018-03-22 10:19 UTC (permalink / raw)
  To: Hayes Wang
  Cc: nic_swsd, Linux Netdev List, Linux Kernel Mailing List, romieu,
	jrg.otte, David Miller
In-Reply-To: <75F177BF-B016-402B-93B3-62D635F5F9D5@canonical.com>

Kai Heng Feng <kai.heng.feng@canonical.com> wrote:

> Hopefully Hayes (or Realtek) can shed more lights on the issue. Apparently
> ALDPS and ASPM for r8169 is enabled in different commercial products, just
> not in Linux mainline.

Hayes and Realtek folks,

How do we make this patch going forward?
Do you find the root cause that make this patch got reverted?

I guess ALDPS is no longer needed after commit a92a08499b1f ("r8169:  
improve runtime pm in general and suspend unused ports"), now the device  
gets runtime suspended when link is down.

OTOH, ASPM is still quite useful though. When it's enabled, it can save 1W  
power usage, which is quite substantial for a laptop.

So, I'd like to hear your feedback and make ASPM for r8169 eventually gets  
upstreamed.

Kai-Heng

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] dt: bindings: add new dt entries for brcmfmac
From: Ulf Hansson @ 2018-03-22 10:29 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Arend van Spriel, Florian Fainelli, Alexey Roslyakov, Andrew Lunn,
	Rob Herring, Mark Rutland, Franky Lin, Hante Meuleman,
	Chi-Hsien Lin, Wright Feng, netdev, linux-wireless, devicetree,
	Linux Kernel Mailing List, brcm80211-dev-list.pdl,
	brcm80211-dev-list
In-Reply-To: <87po3zxe9n.fsf@kamboji.qca.qualcomm.com>

On 20 March 2018 at 10:55, Kalle Valo <kvalo@codeaurora.org> wrote:
> Arend van Spriel <arend.vanspriel@broadcom.com> writes:
>
>>>> If I get it right, you mean something like this:
>>>>
>>>> mmc3: mmc@1c12000 {
>>>> ...
>>>>          broken-sg-support;
>>>>          sd-head-align = 4;
>>>>          sd-sgentry-align = 512;
>>>>
>>>>          brcmf: wifi@1 {
>>>>                  ...
>>>>          };
>>>> };
>>>>
>>>> Where dt: bindings documentation for these entries should reside?
>>>> In generic MMC bindings? Well, this is the very special case and
>>>> mmc-linux maintainer will unlikely to accept these changes.
>>>> Also, extra kernel code modification might be required. It could make
>>>> quite trivial change much more complex.
>>>
>>> If the MMC maintainers are not copied on this patch series, it will
>>> likely be hard for them to identify this patch series and chime in...
>>
>> The main question is whether this is indeed a "very special case" as
>> Alexey claims it to be or that it is likely to be applicable to other
>> device and host combinations as you are suggesting.
>>
>> If these properties are imposed by the host or host controller it
>> would make sense to have these in the mmc bindings.
>
> BTW, last year we were discussing something similar (I mean related to
> alignment requirements) with ath10k SDIO patches and at the time the
> patch submitter was proposing to have a bounce buffer in ath10k to
> workaround that. I don't remember the details anymore, they are on the
> ath10k mailing list archive if anyone is curious to know, but I would
> not be surprised if they are similar as here. So there might be a need
> to solve this in a generic way (but not sure of course as I haven't
> checked the details).

I re-call something about these as well, here are the patches. Perhaps
I should pick some of them up...

https://patchwork.kernel.org/patch/10123137/
https://patchwork.kernel.org/patch/10123139/
https://patchwork.kernel.org/patch/10123141/
https://patchwork.kernel.org/patch/10123143/

Kind regards
Uffe

^ permalink raw reply

* Drop count for VLAN tagged packets when interface is in promiscuous mode
From: Mikael Arvids @ 2018-03-22 10:11 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi,

I have questions regarding how packet drops are counted in net/core/dev.c.

We open a raw socket (with ETH_P_ALL) in promiscuous mode to capture all packets we receive from a mirrored port on a switch, and in order to ensure that we are not missing any packets we check the rx_dropped statistics on the interface (in addition to the PACKET_STATISTICS on the socket).
Under certain circumstances we could see dropped packets on the interface, even though we were not missing any packets in the capture. After some investigation we concluded that the drop counter were incremented for VLAN tagged PTP packets (ether_type 0x88f7), even though these were captured on the raw socket.

It turns out that packets with an unknown VLAN tag and ether_type other than IP (0x0800) and ARP (0x0806) will increment the drop counter, even when those packets have been processed (by the raw socket). Is this intended?

We have currently patched net/core/dev.c to not increment the drop counter when deliver_skb has been called for the vlan packets, which solves our particular case, but I'm wondering if there could be a more generic solution to this?

diff --git a/components/linux-kernel/xilinx-v2016.3/net/core/dev.c b/components/linux-kernel/xilinx-v2016.3/net/core/dev.c 
index 5c925ac..9d04a1c 100644                                                                                              
--- a/components/linux-kernel/xilinx-v2016.3/net/core/dev.c                                                                
+++ b/components/linux-kernel/xilinx-v2016.3/net/core/dev.c                                                                
@@ -4028,6 +4028,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)                            
        bool deliver_exact = false;                                                                                        
        int ret = NET_RX_DROP;                                                                                             
        __be16 type;                                                                                                       
+       bool prevent_drop_cnt_inc = false;                                                                                 
                                                                                                                           
        net_timestamp_check(!netdev_tstamp_prequeue, skb);                                                                 
                                                                                                                           
@@ -4098,6 +4099,7 @@ ncls:                                                                                                
                if (pt_prev) {                                                                                             
                        ret = deliver_skb(skb, pt_prev, orig_dev);                                                         
                        pt_prev = NULL;                                                                                    
+                       prevent_drop_cnt_inc = true;                                                                       
                }                                                                                                          
                if (vlan_do_receive(&skb))                                                                                 
                        goto another_round;                                                                                
@@ -4160,8 +4162,10 @@ ncls:                                                                                               
                        ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);                                             
        } else {                                                                                                           
 drop:                                                                                                                     
-               if (!deliver_exact)                                                                                        
-                       atomic_long_inc(&skb->dev->rx_dropped);                                                            
+               if (!deliver_exact) {                                                                                      
+                       if (!prevent_drop_cnt_inc)                                                                         
+                               atomic_long_inc(&skb->dev->rx_dropped);                                                    
+               }                                                                                                          
                else                                                                                                       
                        atomic_long_inc(&skb->dev->rx_nohandler);                                                          
                kfree_skb(skb);                                                                                            

Best regards,
Mikael Arvids
***************************************************************
Consider the environment before printing this message.

To read the Companies' Information and Confidentiality Notice, follow this link:
https://www.autoliv.com/autoliv-email-disclaimer
***************************************************************

^ permalink raw reply related

* RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: David Laight @ 2018-03-22 10:35 UTC (permalink / raw)
  To: 'Linus Torvalds', Ingo Molnar
  Cc: Thomas Gleixner, Rahul Lakkireddy, x86@kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	mingo@redhat.com, hpa@zytor.com, davem@davemloft.net,
	akpm@linux-foundation.org, ganeshgr@chelsio.com,
	nirranjan@chelsio.com, indranil@chelsio.com, Andy Lutomirski,
	Peter Zijlstra, Fenghua Yu, Eric Biggers
In-Reply-To: <CA+55aFyK6uJopBjZGyeDz0O+qzCac6AwPkJyOpuaT69inPgVgw@mail.gmail.com>

From: Sent: 21 March 2018 18:16
> To: Ingo Molnar
...
> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?
> 
> Yes, yes, I can find an Intel white-paper that talks about setting WC
> and then using xmm and ymm instructions to write a single 64-byte
> burst over PCIe, and I assume that is where the code and idea came
> from. But I don't actually see any reason why a burst of 8 regular
> quad-word bytes wouldn't cause a 64-byte burst write too.

The big gain is from wide PCIe reads, not writes.
Writes to uncached locations (without WC) are almost certainly required
to generate the 'obvious' PCIe TLP, otherwise things are likely to break.

> So right now this is for _one_ odd rdma controller, with absolutely
> _zero_ performance numbers, and a very high likelihood that it does
> not matter in the least.
> 
> And if there are some atomicity concerns ("we need to guarantee a
> single atomic access for race conditions with the hardware"), they are
> probably bogus and misplaced anyway, since
> 
>  (a) we can't guarantee that AVX2 exists in the first place

Any code would need to be in memcpy_fromio(), not in every driver that
might benefit.
Then fallback code can be used if the registers aren't available.

>  (b) we can't guarantee that %ymm register write will show up on any
> bus as a single write transaction anyway

Misaligned 8 byte accesses generate a single PCIe TLP.
I can look at what happens for AVX2 transfers later.
I've got code that mmap()s PCIe addresses into user space, and can
look at the TLP (indirectly through tracing on an fpga target).
Just need to set something up that uses AVX copies.

> So as far as I can tell, there are basically *zero* upsides, and a lot
> of potential downsides.

There are some upsides.
I'll do a performance measurement for reads.

	David


^ permalink raw reply

* RE: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write
From: David Laight @ 2018-03-22 10:48 UTC (permalink / raw)
  To: 'Linus Torvalds', Alexander Duyck
  Cc: Rahul Lakkireddy, Thomas Gleixner, x86@kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	mingo@redhat.com, hpa@zytor.com, davem@davemloft.net,
	akpm@linux-foundation.org, Ganesh GR, Nirranjan Kirubaharan,
	Indranil Choudhury
In-Reply-To: <CA+55aFydu4+MZ=-5L7wDr2LvNH6Dbmi4qLnfHAAOrHKZCb5wdA@mail.gmail.com>

From: Linus Torvalds
> Sent: 22 March 2018 01:27
> On Tue, Mar 20, 2018 at 7:42 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > Instead of framing this as an enhanced version of the read/write ops
> > why not look at replacing or extending something like the
> > memcpy_fromio or memcpy_toio operations?
> 
> Yes, doing something like "memcpy_fromio_avx()" is much more
> palatable, in that it works like the crypto functions do - if you do
> big chunks, the "kernel_fpu_begin/end()" isn't nearly the issue it can
> be otherwise.
> 
> Note that we definitely have seen hardware that *depends* on the
> regular memcpy_fromio()" not doing big reads. I don't know how
> hardware people screw it up, but it's clearly possible.

I wonder if that hardware works with the current kernel on recent cpus?
I bet it doesn't like the byte accesses that get generated either.

> So it really needs to be an explicitly named function that basically a
> driver can use to say "my hardware really likes big aligned accesses"
> and explicitly ask for some AVX version if possible.

For x86 being able to request a copy done as 'rep movsx' (for each x)
would be useful.
For io copies the cost of the memory access is probably much smaller
that the io access, so really fancy copies are unlikely make much
difference unless the width of the io access changes.

	David


^ permalink raw reply

* [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
From: Jiri Pirko @ 2018-03-22 10:55 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, jakub.kicinski, mlxsw, andrew, vivien.didelot,
	f.fainelli, michael.chan, ganeshgr, saeedm, simon.horman,
	pieter.jansenvanvuuren, john.hurley, dirk.vandermerwe,
	alexander.h.duyck, ogerlitz, dsahern, vijaya.guvva,
	satananda.burla, raghu.vatsavayi, felix.manlunas, gospo,
	sathya.perla, vasundhara-v.volam, tariqt, eranbe,
	jeffrey.t.kirsher

From: Jiri Pirko <jiri@mellanox.com>

This patchset resolves 2 issues we have right now:
1) There are many netdevices / ports in the system, for port, pf, vf
   represenatation but the user has no way to see which is which
2) The ndo_get_phys_port_name is implemented in each driver separatelly,
   which may lead to inconsistent names between drivers.

This patchset introduces port flavours which should address the first
problem. I'm testing this with Netronome nfp hardware. When the user
has 2 physical ports, 1 pf, and 4 vfs, he should see something like this:
# devlink port
pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
pci/0000:05:00.0/268435456: type eth netdev eth0 flavour physical number 0
pci/0000:05:00.0/268435460: type eth netdev enp5s0np1 flavour physical number 1
pci/0000:05:00.0/536875008: type eth netdev eth2 flavour pf_rep number 536875008
pci/0000:05:00.0/536870912: type eth netdev eth1 flavour vf_rep number 0
pci/0000:05:00.0/536870976: type eth netdev eth3 flavour vf_rep number 1
pci/0000:05:00.0/536871040: type eth netdev eth4 flavour vf_rep number 2
pci/0000:05:00.0/536871104: type eth netdev eth5 flavour vf_rep number 3

The indexes are weird numbers now. That needs to be fixed. Also, netdev
renaming does not work correctly for me now for some reason.
Also, there is one extra port that I don't understand what
is the purpose for it - something nfp specific perhaps.

The desired output should look like this:
# devlink port
pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
pci/0000:05:00.0/1: type eth netdev enp5s0np1 flavour physical number 1
pci/0000:05:00.0/2: type eth netdev enp5s0npf0 flavour pf_rep number 0
pci/0000:05:00.0/3: type eth netdev enp5s0nvf0 flavour vf_rep number 0
pci/0000:05:00.0/4: type eth netdev enp5s0nvf1 flavour vf_rep number 1
pci/0000:05:00.0/5: type eth netdev enp5s0nvf2 flavour vf_rep number 2
pci/0000:05:00.0/6: type eth netdev enp5s0nvf3 flavour vf_rep number 3

As you can see, the netdev names are generated according to the flavour
and port number. In case the port is split, the split subnumber is also
included.

I tested this for mlxsw and nfp. I have no way to test this on DSA hw,
I would really appretiate DSA guys to test this. Thanks!

Jiri Pirko (12):
  devlink: introduce devlink_port_attrs_set
  devlink: extend attrs_set for setting port flavours
  devlink: introduce a helper to generate physical port names
  dsa: set devlink port attrs for dsa ports
  dsa: use devlink helper to generate physical port name
  mlxsw: use devlink helper to generate physical port name
  nfp: flower: fix error path during representor creation
  nfp: set eth_id for representors to avoid port index conflict
  nfp: register devlink port for VF/PF representors
  nfp: flower: create port for flower vnic
  nfp: use devlink helper to generate physical port name
  nfp: flower: set sysfs link to device for representors

 drivers/net/ethernet/mellanox/mlxsw/core.c        | 18 ++++-
 drivers/net/ethernet/mellanox/mlxsw/core.h        |  5 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c    | 21 ++----
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c    | 11 +--
 drivers/net/ethernet/netronome/nfp/flower/main.c  | 17 ++++-
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c  | 45 +++++++++--
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 19 ++++-
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.h |  1 +
 drivers/net/ethernet/netronome/nfp/nfp_port.c     | 30 +-------
 include/net/devlink.h                             | 32 ++++++--
 include/uapi/linux/devlink.h                      | 22 ++++++
 net/core/devlink.c                                | 92 ++++++++++++++++++++---
 net/dsa/dsa2.c                                    | 23 ++++++
 net/dsa/slave.c                                   |  6 +-
 14 files changed, 252 insertions(+), 90 deletions(-)

-- 
2.14.3

^ permalink raw reply

* [patch net-next RFC 01/12] devlink: introduce devlink_port_attrs_set
From: Jiri Pirko @ 2018-03-22 10:55 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, jakub.kicinski, mlxsw, andrew, vivien.didelot,
	f.fainelli, michael.chan, ganeshgr, saeedm, simon.horman,
	pieter.jansenvanvuuren, john.hurley, dirk.vandermerwe,
	alexander.h.duyck, ogerlitz, dsahern, vijaya.guvva,
	satananda.burla, raghu.vatsavayi, felix.manlunas, gospo,
	sathya.perla, vasundhara-v.volam, tariqt, eranbe,
	jeffrey.t.kirsher
In-Reply-To: <20180322105522.8186-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Change existing setter for split port information into more generic
attrs setter. Alongside with that, allow to set port number and subport
number for split ports.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c       |  7 ++--
 drivers/net/ethernet/mellanox/mlxsw/core.h       |  3 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   |  4 +--
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c   |  2 +-
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c | 39 +++++++++++++++-----
 include/net/devlink.h                            | 20 +++++++----
 include/uapi/linux/devlink.h                     |  3 ++
 net/core/devlink.c                               | 46 ++++++++++++++++++------
 8 files changed, 93 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 3529b545675d..dc924d5fb3b7 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1713,15 +1713,16 @@ EXPORT_SYMBOL(mlxsw_core_port_fini);
 
 void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port,
 			     void *port_driver_priv, struct net_device *dev,
-			     bool split, u32 split_group)
+			     u32 port_number, bool split,
+			     u32 split_port_subnumber)
 {
 	struct mlxsw_core_port *mlxsw_core_port =
 					&mlxsw_core->ports[local_port];
 	struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port;
 
 	mlxsw_core_port->port_driver_priv = port_driver_priv;
-	if (split)
-		devlink_port_split_set(devlink_port, split_group);
+	devlink_port_attrs_set(devlink_port, port_number,
+			       split, split_port_subnumber);
 	devlink_port_type_eth_set(devlink_port, dev);
 }
 EXPORT_SYMBOL(mlxsw_core_port_eth_set);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 5ddafd74dc00..10589abae67f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -201,7 +201,8 @@ int mlxsw_core_port_init(struct mlxsw_core *mlxsw_core, u8 local_port);
 void mlxsw_core_port_fini(struct mlxsw_core *mlxsw_core, u8 local_port);
 void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port,
 			     void *port_driver_priv, struct net_device *dev,
-			     bool split, u32 split_group);
+			     u32 port_number, bool split,
+			     u32 split_port_subnumber);
 void mlxsw_core_port_ib_set(struct mlxsw_core *mlxsw_core, u8 local_port,
 			    void *port_driver_priv);
 void mlxsw_core_port_clear(struct mlxsw_core *mlxsw_core, u8 local_port,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index a120602bca26..59be0bf14127 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -2927,8 +2927,8 @@ static int mlxsw_sp_port_create(struct mlxsw_sp *mlxsw_sp, u8 local_port,
 	}
 
 	mlxsw_core_port_eth_set(mlxsw_sp->core, mlxsw_sp_port->local_port,
-				mlxsw_sp_port, dev, mlxsw_sp_port->split,
-				module);
+				mlxsw_sp_port, dev, module + 1,
+				mlxsw_sp_port->split, lane / width);
 	mlxsw_core_schedule_dw(&mlxsw_sp_port->periodic_hw_stats.update_dw, 0);
 	return 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index f3c29bbf07e2..eddfcef320f1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -1149,7 +1149,7 @@ static int __mlxsw_sx_port_eth_create(struct mlxsw_sx *mlxsw_sx, u8 local_port,
 	}
 
 	mlxsw_core_port_eth_set(mlxsw_sx->core, mlxsw_sx_port->local_port,
-				mlxsw_sx_port, dev, false, 0);
+				mlxsw_sx_port, dev, module + 1, false, 0);
 	mlxsw_sx->ports[local_port] = mlxsw_sx_port;
 	return 0;
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
index eb0fc614673d..3c0f0560f834 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
@@ -162,22 +162,45 @@ const struct devlink_ops nfp_devlink_ops = {
 	.eswitch_mode_get	= nfp_devlink_eswitch_mode_get,
 };
 
-int nfp_devlink_port_register(struct nfp_app *app, struct nfp_port *port)
+static int nfp_devlink_port_attrs_set(struct nfp_port *port)
 {
 	struct nfp_eth_table_port eth_port;
+	int ret;
+
+	switch (port->type) {
+	case NFP_PORT_PHYS_PORT:
+		rtnl_lock();
+		ret = nfp_devlink_fill_eth_port(port, &eth_port);
+		rtnl_unlock();
+		if (ret)
+			return ret;
+
+		devlink_port_attrs_set(&port->dl_port, eth_port.label_port,
+				       eth_port.is_split,
+				       eth_port.label_subport);
+		break;
+	case NFP_PORT_PF_PORT:
+		devlink_port_attrs_set(&port->dl_port, port->pf_id, false, 0);
+		break;
+	case NFP_PORT_VF_PORT:
+		devlink_port_attrs_set(&port->dl_port, port->vf_id, false, 0);
+		break;
+	default:
+		break;
+	}
+	return 0;
+}
+
+int nfp_devlink_port_register(struct nfp_app *app, struct nfp_port *port)
+{
 	struct devlink *devlink;
 	int ret;
 
-	rtnl_lock();
-	ret = nfp_devlink_fill_eth_port(port, &eth_port);
-	rtnl_unlock();
+	devlink_port_type_eth_set(&port->dl_port, port->netdev);
+	ret = nfp_devlink_port_attrs_set(port);
 	if (ret)
 		return ret;
 
-	devlink_port_type_eth_set(&port->dl_port, port->netdev);
-	if (eth_port.is_split)
-		devlink_port_split_set(&port->dl_port, eth_port.label_port);
-
 	devlink = priv_to_devlink(app->pf);
 
 	return devlink_port_register(devlink, &port->dl_port, port->eth_id);
diff --git a/include/net/devlink.h b/include/net/devlink.h
index c83125ad20ff..29c3bc260a3e 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -35,6 +35,13 @@ struct devlink {
 	char priv[0] __aligned(NETDEV_ALIGN);
 };
 
+struct devlink_port_attrs {
+	bool set;
+	u32 port_number; /* same value as "split group" */
+	bool split;
+	u32 split_subport_number;
+};
+
 struct devlink_port {
 	struct list_head list;
 	struct devlink *devlink;
@@ -43,8 +50,7 @@ struct devlink_port {
 	enum devlink_port_type type;
 	enum devlink_port_type desired_type;
 	void *type_dev;
-	bool split;
-	u32 split_group;
+	struct devlink_port_attrs attrs;
 };
 
 struct devlink_sb_pool_info {
@@ -373,8 +379,9 @@ void devlink_port_type_eth_set(struct devlink_port *devlink_port,
 void devlink_port_type_ib_set(struct devlink_port *devlink_port,
 			      struct ib_device *ibdev);
 void devlink_port_type_clear(struct devlink_port *devlink_port);
-void devlink_port_split_set(struct devlink_port *devlink_port,
-			    u32 split_group);
+void devlink_port_attrs_set(struct devlink_port *devlink_port,
+			    u32 port_number, bool split,
+			    u32 split_subport_number);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
@@ -468,8 +475,9 @@ static inline void devlink_port_type_clear(struct devlink_port *devlink_port)
 {
 }
 
-static inline void devlink_port_split_set(struct devlink_port *devlink_port,
-					  u32 split_group)
+static inline void devlink_port_attrs_set(struct devlink_port *devlink_port,
+					  u32 port_number, bool split,
+					  u32 split_subport_number)
 {
 }
 
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 1df65a4c2044..15b031a5ee7a 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -224,6 +224,9 @@ enum devlink_attr {
 	DEVLINK_ATTR_DPIPE_TABLE_RESOURCE_ID,	/* u64 */
 	DEVLINK_ATTR_DPIPE_TABLE_RESOURCE_UNITS,/* u64 */
 
+	DEVLINK_ATTR_PORT_NUMBER,		/* u32 */
+	DEVLINK_ATTR_PORT_SPLIT_SUBPORT_NUMBER,	/* u32 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index f23e5ed7c90f..b0fca9644722 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -453,6 +453,25 @@ static void devlink_notify(struct devlink *devlink, enum devlink_command cmd)
 				msg, 0, DEVLINK_MCGRP_CONFIG, GFP_KERNEL);
 }
 
+static int devlink_nl_port_attrs_put(struct sk_buff *msg,
+				     struct devlink_port *devlink_port)
+{
+	struct devlink_port_attrs *attrs = &devlink_port->attrs;
+
+	if (!attrs->set)
+		return 0;
+	if (nla_put_u32(msg, DEVLINK_ATTR_PORT_NUMBER, attrs->port_number))
+		return -EMSGSIZE;
+	if (!attrs->split)
+		return 0;
+	if (nla_put_u32(msg, DEVLINK_ATTR_PORT_SPLIT_GROUP, attrs->port_number))
+		return -EMSGSIZE;
+	if (nla_put_u32(msg, DEVLINK_ATTR_PORT_SPLIT_SUBPORT_NUMBER,
+			attrs->split_subport_number))
+		return -EMSGSIZE;
+	return 0;
+}
+
 static int devlink_nl_port_fill(struct sk_buff *msg, struct devlink *devlink,
 				struct devlink_port *devlink_port,
 				enum devlink_command cmd, u32 portid,
@@ -492,9 +511,7 @@ static int devlink_nl_port_fill(struct sk_buff *msg, struct devlink *devlink,
 				   ibdev->name))
 			goto nla_put_failure;
 	}
-	if (devlink_port->split &&
-	    nla_put_u32(msg, DEVLINK_ATTR_PORT_SPLIT_GROUP,
-			devlink_port->split_group))
+	if (devlink_nl_port_attrs_put(msg, devlink_port))
 		goto nla_put_failure;
 
 	genlmsg_end(msg, hdr);
@@ -2972,19 +2989,28 @@ void devlink_port_type_clear(struct devlink_port *devlink_port)
 EXPORT_SYMBOL_GPL(devlink_port_type_clear);
 
 /**
- *	devlink_port_split_set - Set port is split
+ *	devlink_port_attrs_set - Set port attributes
  *
  *	@devlink_port: devlink port
- *	@split_group: split group - identifies group split port is part of
+ *	@port_number: number of the port that is facing user, for example
+ *	              the front panel port number
+ *	@split: indicates if this is split port
+ *	@split_subport_number: if the port is split, this is the number
+ *	                       of subport.
  */
-void devlink_port_split_set(struct devlink_port *devlink_port,
-			    u32 split_group)
+void devlink_port_attrs_set(struct devlink_port *devlink_port,
+			    u32 port_number, bool split,
+			    u32 split_subport_number)
 {
-	devlink_port->split = true;
-	devlink_port->split_group = split_group;
+	struct devlink_port_attrs *attrs = &devlink_port->attrs;
+
+	attrs->set = true;
+	attrs->port_number = port_number;
+	attrs->split = split;
+	attrs->split_subport_number = split_subport_number;
 	devlink_port_notify(devlink_port, DEVLINK_CMD_PORT_NEW);
 }
-EXPORT_SYMBOL_GPL(devlink_port_split_set);
+EXPORT_SYMBOL_GPL(devlink_port_attrs_set);
 
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
-- 
2.14.3

^ permalink raw reply related

* [patch net-next RFC 03/12] devlink: introduce a helper to generate physical port names
From: Jiri Pirko @ 2018-03-22 10:55 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, jakub.kicinski, mlxsw, andrew, vivien.didelot,
	f.fainelli, michael.chan, ganeshgr, saeedm, simon.horman,
	pieter.jansenvanvuuren, john.hurley, dirk.vandermerwe,
	alexander.h.duyck, ogerlitz, dsahern, vijaya.guvva,
	satananda.burla, raghu.vatsavayi, felix.manlunas, gospo,
	sathya.perla, vasundhara-v.volam, tariqt, eranbe,
	jeffrey.t.kirsher
In-Reply-To: <20180322105522.8186-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Each driver implements physical port name generation by itself. However
as devlink has all needed info, it can easily do the job for all its
users. So implement this helper in devlink.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/devlink.h |  9 +++++++++
 net/core/devlink.c    | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 900295afc521..2e5bfe7723b4 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -384,6 +384,8 @@ void devlink_port_attrs_set(struct devlink_port *devlink_port,
 			    enum devlink_port_flavour flavour,
 			    u32 port_number, bool split,
 			    u32 split_subport_number);
+int devlink_port_get_phys_port_name(struct devlink_port *devlink_port,
+				    char *name, size_t len);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
@@ -484,6 +486,13 @@ static inline void devlink_port_attrs_set(struct devlink_port *devlink_port,
 {
 }
 
+static inline int
+devlink_port_get_phys_port_name(struct devlink_port *devlink_port,
+				char *name, size_t len)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int devlink_sb_register(struct devlink *devlink,
 				      unsigned int sb_index, u32 size,
 				      u16 ingress_pools_count,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 782476a1ff8f..4ba69383ab58 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -3017,6 +3017,45 @@ void devlink_port_attrs_set(struct devlink_port *devlink_port,
 }
 EXPORT_SYMBOL_GPL(devlink_port_attrs_set);
 
+int devlink_port_get_phys_port_name(struct devlink_port *devlink_port,
+				    char *name, size_t len)
+{
+	struct devlink_port_attrs *attrs = &devlink_port->attrs;
+	int n = 0;
+
+	if (!attrs->set)
+		return -EOPNOTSUPP;
+
+	switch (attrs->flavour) {
+	case DEVLINK_PORT_FLAVOUR_PHYSICAL:
+		if (!attrs->split)
+			n = snprintf(name, len, "p%u", attrs->port_number);
+		else
+			n = snprintf(name, len, "p%us%u", attrs->port_number,
+				     attrs->split_subport_number);
+		break;
+	case DEVLINK_PORT_FLAVOUR_PF_REP:
+		n = snprintf(name, len, "pf%d", attrs->port_number);
+		break;
+	case DEVLINK_PORT_FLAVOUR_VF_REP:
+		n = snprintf(name, len, "vf%d", attrs->port_number);
+		break;
+	case DEVLINK_PORT_FLAVOUR_CPU:
+	case DEVLINK_PORT_FLAVOUR_DSA:
+		/* As CPU and DSA ports do not have a netdevice associated
+		 * case should not ever happen.
+		 */
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
+	if (n >= len)
+		return -EINVAL;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(devlink_port_get_phys_port_name);
+
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
 			u32 size, u16 ingress_pools_count,
 			u16 egress_pools_count, u16 ingress_tc_count,
-- 
2.14.3

^ permalink raw reply related

* [patch net-next RFC 02/12] devlink: extend attrs_set for setting port flavours
From: Jiri Pirko @ 2018-03-22 10:55 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, jakub.kicinski, mlxsw, andrew, vivien.didelot,
	f.fainelli, michael.chan, ganeshgr, saeedm, simon.horman,
	pieter.jansenvanvuuren, john.hurley, dirk.vandermerwe,
	alexander.h.duyck, ogerlitz, dsahern, vijaya.guvva,
	satananda.burla, raghu.vatsavayi, felix.manlunas, gospo,
	sathya.perla, vasundhara-v.volam, tariqt, eranbe,
	jeffrey.t.kirsher
In-Reply-To: <20180322105522.8186-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@mellanox.com>

Devlink ports can have specific flavour according to the purpose of use.
This patch extend attrs_set so the driver can say which flavour port
has. Initial flavours are:
physical, pf_rep, vf_rep, cpu, dsa
User can query this to see right away what is the purpose of each port.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c       |  4 ++--
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c | 12 +++++++++---
 include/net/devlink.h                            |  3 +++
 include/uapi/linux/devlink.h                     | 19 +++++++++++++++++++
 net/core/devlink.c                               |  5 +++++
 5 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index dc924d5fb3b7..0b6e646fed75 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -1721,8 +1721,8 @@ void mlxsw_core_port_eth_set(struct mlxsw_core *mlxsw_core, u8 local_port,
 	struct devlink_port *devlink_port = &mlxsw_core_port->devlink_port;
 
 	mlxsw_core_port->port_driver_priv = port_driver_priv;
-	devlink_port_attrs_set(devlink_port, port_number,
-			       split, split_port_subnumber);
+	devlink_port_attrs_set(devlink_port, DEVLINK_PORT_FLAVOUR_PHYSICAL,
+			       port_number, split, split_port_subnumber);
 	devlink_port_type_eth_set(devlink_port, dev);
 }
 EXPORT_SYMBOL(mlxsw_core_port_eth_set);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
index 3c0f0560f834..e3a46faaadc6 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
@@ -175,15 +175,21 @@ static int nfp_devlink_port_attrs_set(struct nfp_port *port)
 		if (ret)
 			return ret;
 
-		devlink_port_attrs_set(&port->dl_port, eth_port.label_port,
+		devlink_port_attrs_set(&port->dl_port,
+				       DEVLINK_PORT_FLAVOUR_PHYSICAL,
+				       eth_port.label_port,
 				       eth_port.is_split,
 				       eth_port.label_subport);
 		break;
 	case NFP_PORT_PF_PORT:
-		devlink_port_attrs_set(&port->dl_port, port->pf_id, false, 0);
+		devlink_port_attrs_set(&port->dl_port,
+				       DEVLINK_PORT_FLAVOUR_PF_REP,
+				       port->pf_id, false, 0);
 		break;
 	case NFP_PORT_VF_PORT:
-		devlink_port_attrs_set(&port->dl_port, port->vf_id, false, 0);
+		devlink_port_attrs_set(&port->dl_port,
+				       DEVLINK_PORT_FLAVOUR_VF_REP,
+				       port->vf_id, false, 0);
 		break;
 	default:
 		break;
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 29c3bc260a3e..900295afc521 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -37,6 +37,7 @@ struct devlink {
 
 struct devlink_port_attrs {
 	bool set;
+	enum devlink_port_flavour flavour;
 	u32 port_number; /* same value as "split group" */
 	bool split;
 	u32 split_subport_number;
@@ -380,6 +381,7 @@ void devlink_port_type_ib_set(struct devlink_port *devlink_port,
 			      struct ib_device *ibdev);
 void devlink_port_type_clear(struct devlink_port *devlink_port);
 void devlink_port_attrs_set(struct devlink_port *devlink_port,
+			    enum devlink_port_flavour flavour,
 			    u32 port_number, bool split,
 			    u32 split_subport_number);
 int devlink_sb_register(struct devlink *devlink, unsigned int sb_index,
@@ -476,6 +478,7 @@ static inline void devlink_port_type_clear(struct devlink_port *devlink_port)
 }
 
 static inline void devlink_port_attrs_set(struct devlink_port *devlink_port,
+					  enum devlink_port_flavour flavour,
 					  u32 port_number, bool split,
 					  u32 split_subport_number)
 {
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 15b031a5ee7a..74d0e620059b 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -132,6 +132,24 @@ enum devlink_eswitch_encap_mode {
 	DEVLINK_ESWITCH_ENCAP_MODE_BASIC,
 };
 
+enum devlink_port_flavour {
+	DEVLINK_PORT_FLAVOUR_PHYSICAL, /* Any kind of a port physically
+					* facing the user.
+					*/
+	DEVLINK_PORT_FLAVOUR_PF_REP, /* Port represents a SR-IOV physical
+				      *	function counterpart port of
+				      *	embedded switch.
+				      */
+	DEVLINK_PORT_FLAVOUR_VF_REP, /* Port represents a SR-IOV virtual
+				      *	function counterpart port of
+				      *	embedded switch.
+				      */
+	DEVLINK_PORT_FLAVOUR_CPU, /* CPU port */
+	DEVLINK_PORT_FLAVOUR_DSA, /* Distributed switch architecture
+				   * interconnect port.
+				   */
+};
+
 enum devlink_attr {
 	/* don't change the order or add anything between, this is ABI! */
 	DEVLINK_ATTR_UNSPEC,
@@ -224,6 +242,7 @@ enum devlink_attr {
 	DEVLINK_ATTR_DPIPE_TABLE_RESOURCE_ID,	/* u64 */
 	DEVLINK_ATTR_DPIPE_TABLE_RESOURCE_UNITS,/* u64 */
 
+	DEVLINK_ATTR_PORT_FLAVOUR,		/* u16 */
 	DEVLINK_ATTR_PORT_NUMBER,		/* u32 */
 	DEVLINK_ATTR_PORT_SPLIT_SUBPORT_NUMBER,	/* u32 */
 
diff --git a/net/core/devlink.c b/net/core/devlink.c
index b0fca9644722..782476a1ff8f 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -460,6 +460,8 @@ static int devlink_nl_port_attrs_put(struct sk_buff *msg,
 
 	if (!attrs->set)
 		return 0;
+	if (nla_put_u16(msg, DEVLINK_ATTR_PORT_FLAVOUR, attrs->flavour))
+		return -EMSGSIZE;
 	if (nla_put_u32(msg, DEVLINK_ATTR_PORT_NUMBER, attrs->port_number))
 		return -EMSGSIZE;
 	if (!attrs->split)
@@ -2992,6 +2994,7 @@ EXPORT_SYMBOL_GPL(devlink_port_type_clear);
  *	devlink_port_attrs_set - Set port attributes
  *
  *	@devlink_port: devlink port
+ *	@flavour: flavour of the port
  *	@port_number: number of the port that is facing user, for example
  *	              the front panel port number
  *	@split: indicates if this is split port
@@ -2999,12 +3002,14 @@ EXPORT_SYMBOL_GPL(devlink_port_type_clear);
  *	                       of subport.
  */
 void devlink_port_attrs_set(struct devlink_port *devlink_port,
+			    enum devlink_port_flavour flavour,
 			    u32 port_number, bool split,
 			    u32 split_subport_number)
 {
 	struct devlink_port_attrs *attrs = &devlink_port->attrs;
 
 	attrs->set = true;
+	attrs->flavour = flavour;
 	attrs->port_number = port_number;
 	attrs->split = split;
 	attrs->split_subport_number = split_subport_number;
-- 
2.14.3

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox