Netdev List
 help / color / mirror / Atom feed
* [PATCH btf v2 0/3] Add BTF types deduplication algorithm
From: Andrii Nakryiko @ 2019-02-05  1:29 UTC (permalink / raw)
  To: netdev, ast, yhs, daniel, acme, kernel-team, kafai, ecree,
	andrii.nakryiko
  Cc: Andrii Nakryiko

This patch series adds BTF deduplication algorithm to libbpf. This algorithm
allows to take BTF type information containing duplicate per-compilation unit
information and reduce it to equivalent set of BTF types with no duplication without
loss of information. It also deduplicates strings and removes those strings that
are not referenced from any BTF type (and line information in .BTF.ext section,
if any).

Algorithm also resolves struct/union forward declarations into concrete BTF types
across multiple compilation units to facilitate better deduplication ratio. If
undesired, this resolution can be disabled through specifying corresponding options.

When applied to BTF data emitted by pahole's DWARF->BTF converter, it reduces
the overall size of .BTF section by about 65x, from about 112MB to 1.75MB, leaving
only 29247 out of initial 3073497 BTF type descriptors.

Algorithm with minor differences and preliminary results before FUNC/FUNC_PROTO
support is also described more verbosely at:
https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html

v1->v2:
- rebase on latest bpf-next
- err_log/elog -> pr_debug
- btf__dedup, btf__get_strings, btf__get_nr_types listed under 0.0.2 version

Andrii Nakryiko (3):
  btf: extract BTF type size calculation
  btf: add BTF types deduplication algorithm
  selftests/btf: add initial BTF dedup tests

 tools/lib/bpf/btf.c                    | 1851 +++++++++++++++++++++++-
 tools/lib/bpf/btf.h                    |   10 +
 tools/lib/bpf/libbpf.map               |    3 +
 tools/testing/selftests/bpf/test_btf.c |  535 ++++++-
 4 files changed, 2332 insertions(+), 67 deletions(-)

-- 
2.17.1


^ permalink raw reply

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
From: Martin Lau @ 2019-02-05  0:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev@vger.kernel.org, Alexei Starovoitov, Kernel Team,
	Lawrence Brakmo
In-Reply-To: <51fe0012-e294-078a-4fc4-6151f8b55195@iogearbox.net>

On Mon, Feb 04, 2019 at 11:33:28PM +0100, Daniel Borkmann wrote:
> Hi Martin,
> 
> On 02/01/2019 08:03 AM, Martin KaFai Lau wrote:
> > In kernel, it is common to check "!skb->sk && sk_fullsock(skb->sk)"
> > before accessing the fields in sock.  For example, in __netdev_pick_tx:
> > 
> > static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
> > 			    struct net_device *sb_dev)
> > {
> > 	/* ... */
> > 
> > 	struct sock *sk = skb->sk;
> > 
> > 		if (queue_index != new_index && sk &&
> > 		    sk_fullsock(sk) &&
> > 		    rcu_access_pointer(sk->sk_dst_cache))
> > 			sk_tx_queue_set(sk, new_index);
> > 
> > 	/* ... */
> > 
> > 	return queue_index;
> > }
> > 
> > This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
> > where a few of the convert_ctx_access() in filter.c has already been
> > accessing the skb->sk sock_common's fields,
> > e.g. sock_ops_convert_ctx_access().
> > 
> > "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
> > Some of the fileds in "bpf_sock" will not be directly
> > accessible through the "__sk_buff->sk" pointer.  It is limited
> > by the new "bpf_sock_common_is_valid_access()".
> > e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
> >      are not allowed.
> > 
> > The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
> > can be used to get a sk with all accessible fields in "bpf_sock".
> > This helper is added to both cg_skb and sched_(cls|act).
> > 
> > int cg_skb_foo(struct __sk_buff *skb) {
> > 	struct bpf_sock *sk;
> > 	__u32 family;
> > 
> > 	sk = skb->sk;
> > 	if (!sk)
> > 		return 1;
> > 
> > 	sk = bpf_sk_fullsock(sk);
> > 	if (!sk)
> > 		return 1;
> > 
> > 	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
> > 		return 1;
> > 
> > 	/* some_traffic_shaping(); */
> > 
> > 	return 1;
> > }
> > 
> > (1) The sk is read only
> > 
> > (2) There is no new "struct bpf_sock_common" introduced.
> > 
> > (3) Future kernel sock's members could be added to bpf_sock only
> >     instead of repeatedly adding at multiple places like currently
> >     in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
> > 
> > (4) After "sk = skb->sk", the reg holding sk is in type
> >     PTR_TO_SOCK_COMMON_OR_NULL.
> > 
> > (5) After bpf_sk_fullsock(), the return type will be in type
> >     PTR_TO_SOCKET_OR_NULL which is the same as the return type of
> >     bpf_sk_lookup_xxx().
> > 
> >     However, bpf_sk_fullsock() does not take refcnt.  The
> >     acquire_reference_state() is only depending on the return type now.
> >     To avoid it, a new is_acquire_function() is checked before calling
> >     acquire_reference_state().
> 
> Bit unfortunate that a helper like bpf_sk_fullsock() would be needed, after
> all this is more of an implementation detail which we would expose here to
> the developer.
> 
> Is there a specific reason why fetching skb->sk couldn't already be of the
> type PTR_TO_SOCKET_OR_NULL such that the bpf_sk_fullsock() step wouldn't be
> needed and most logic we have today could already be reused (modulo refcnt
> avoidance)?
Not all running context has a fullsock (PTR_TO_SOCKET_OR_NULL).

Based on how sk_to_full_sk() is used (e.g. in bpf_get_socket_uid()),
not sure a sk (e.g. tw sock) can always be traced back to a full sk.

In term of the patch implementation, it is not much difference.  It is a bit
simplier without bpf_sk_fullsock() and PTR_TO_SOCK_COMMON(_OR_NULL) but
not a lot.  I have tried both.

The "fullsock" has already been exposed in another form.
e.g. In sock_ops, the tcp_sock fields is not read if it is not a fullsock
while other sock_common fields will still be available.  The bpf_prog
can test the sock_ops->is_fullsock for what to do.

> 
> In particular, do you need the skb->sk without the full-sk part somewhere
> (e.g. in tw socks)? Why not doing something like sk_to_full_sk() inside the
> helper or even better as BPF ctx rewrite upon skb->sk to fetch the full sk
> parent where you could also access remaining bpf_sock fields?
I am thinking more on what if the bpf_prog only needs the fields from
sock_common (e.g. the src/dst ip/port) and skb already has
other needed info (e.g. protocol/mark/priority).
Enforing skb->sk must be a fullsock will unnecessarily limit those
bpf_prog from seeing all skb.

A "struct bpf_common_sock" could be added instead vs a bpf_sk_fullsock()
tester.  I think having one "struct bpf_sock" is better and less confusing.

Later, for the running context that is sure to have a fullsock,
skb->sk can directly have PTR_TO_SOCKET_OR_NULL instead of
PTR_TO_SOCK_COMMON_OR_NULL.

Thanks,
Martin

> 
> This could then also be plugged into bpf_tcp_sock() given this needs to be
> full sk anyway.

^ permalink raw reply

* Re: [PATCH v2] net: dsa: mv88e6xxx: Revise irq setup ordering
From: John David Anglin @ 2019-02-05  0:38 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <20190204231410.GG3397@lunn.ch>

On 2019-02-04 6:14 p.m., Andrew Lunn wrote:
> On Mon, Feb 04, 2019 at 04:59:13PM -0500, John David Anglin wrote:
>> This change fixes a race condition in the setup of hardware irqs and the
>> code enabling PHY link detection in the mv88e6xxx driver.
>>
>> This race was observed on the espressobin board where the GPIO interrupt
>> controller only supports edge interrupts.  If the INTn output pin goes low
>> before the GPIO interrupt is enabled, PHY link interrupts are not detected.
>>
>> With this change, we
>> 1) force INTn high by clearing all interrupt enables in global 1 control 1,
>> 2) setup the hardware irq, and then
>> 3) perform the remaining common setup.
>>
>> This simplifies the setup and allows some unnecessary code to be removed.
> Hi Dave
>
> I took a closer look now. I don't actually see why the current code is
> wrong.
The problem is INTn can go low before the interrupt handler for it is
registered and enabled.
As a result, interrupts never occur if link happens to come up before
the interrupt handler
completes being enabled.
>
> mv88e6xxx_g1_irq_setup() calls mv88e6xxx_g1_irq_setup_common() and
> then registers the interrupt handler.
>
> mv88e6xxx_g1_irq_setup_common() does what you want, it masks all
> interrupts in the hardware and clears any pending interrupts which can
> be cleared.
>
> The change you made is actually dangerous. As soon as you request the
> interrupt, it is live, it can fire, and call
> mv88e6xxx_g1_irq_thread_work(). That needs the irq domain. But the
> change you made defers the creating of the domain until after the
> interrupt is registered. So we can de-refernece a NULL pointer in the
> interrupt handler.
This can't happen.  The domain is setup immediately after registering
the GPIO interrupt.
The interrupt can't fire until one of the enables is set.  These are set
by mv88e6xxx_g2_irq_setup(),
mv88e6xxx_g1_atu_prob_irq_setup() and
mv88e6xxx_g1_vtu_prob_irq_setup().  These irqs
are setup after mv88e6xxx_g1_irq_setup()/mv88e6xxx_irq_poll_setup() is
called.  Thus, the
irq domain is setup before the GPIO interrupt can fire.

I have tested both hardware and polled interrupts using espressobin with
v4.20.6 and networking
starts correctly.

Dave

-- 
John David Anglin  dave.anglin@bell.net


^ permalink raw reply

* Re: [PATCH bpf-next] libbpf: fix libbpf_print
From: Yonghong Song @ 2019-02-05  0:37 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev@vger.kernel.org
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net
In-Reply-To: <20190205002055.80759-1-sdf@google.com>



On 2/4/19 4:20 PM, Stanislav Fomichev wrote:
> With the recent print rework we now have the following problem:
> pr_{warning,info,debug} expand to __pr which calls libbpf_print.
> libbpf_print does va_start and calls __libbpf_pr with va_list argument.
> In __base_pr we again do va_start. Because the next argument is a
> va_list, we don't get correct pointer to the argument (and print noting
> in my case, I don't know why it doesn't crash tbh).
> 
> Fix this by changing libbpf_print_fn_t signature to accept va_list and
> remove unneeded calls to va_start in the existing users.
> 
> Alternatively, this can we solved by exporting __libbpf_pr and
> changing __pr macro to (and killing libbpf_print):
> {
> 	if (__libbpf_pr)
> 		__libbpf_pr(level, "libbpf: " fmt, ##__VA_ARGS__)
> }
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

It is my mistake. My early version did passed correctly and later
on I made some changes and did not test properly. Thanks for the fix!

Acked-by: Yonghong Song <yhs@fb.com>


> ---
>   tools/lib/bpf/libbpf.c                         | 14 ++++----------
>   tools/lib/bpf/libbpf.h                         |  3 +--
>   tools/perf/util/bpf-loader.c                   | 10 ++--------
>   tools/testing/selftests/bpf/test_btf.c         | 13 ++-----------
>   tools/testing/selftests/bpf/test_libbpf_open.c | 10 ++--------
>   tools/testing/selftests/bpf/test_progs.c       | 10 ++--------
>   6 files changed, 13 insertions(+), 47 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 84ca6c2bea91..47969aa0faf8 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -54,22 +54,16 @@
>   
>   #define __printf(a, b)	__attribute__((format(printf, a, b)))
>   
> -__printf(2, 3)
> -static int __base_pr(enum libbpf_print_level level, const char *format, ...)
> +static int __base_pr(enum libbpf_print_level level, const char *format,
> +		     va_list args)
>   {
> -	va_list args;
> -	int err;
> -
>   	if (level == LIBBPF_DEBUG)
>   		return 0;
>   
> -	va_start(args, format);
> -	err = vfprintf(stderr, format, args);
> -	va_end(args);
> -	return err;
> +	return vfprintf(stderr, format, args);
>   }
>   
> -static __printf(2, 3) libbpf_print_fn_t __libbpf_pr = __base_pr;
> +static libbpf_print_fn_t __libbpf_pr = __base_pr;
>   
>   void libbpf_set_print(libbpf_print_fn_t fn)
>   {
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 19dbc1bed960..69a7c25eaccc 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -54,8 +54,7 @@ enum libbpf_print_level {
>   };
>   
>   typedef int (*libbpf_print_fn_t)(enum libbpf_print_level level,
> -				 const char *, ...)
> -	__attribute__((format(printf, 2, 3)));
> +				 const char *, va_list ap);
>   
>   LIBBPF_API void libbpf_set_print(libbpf_print_fn_t fn);
>   
> diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
> index 38afdbe6a9e0..037d8ff6a634 100644
> --- a/tools/perf/util/bpf-loader.c
> +++ b/tools/perf/util/bpf-loader.c
> @@ -25,15 +25,9 @@
>   #include "c++/clang-c.h"
>   
>   static int libbpf_perf_print(enum libbpf_print_level level __attribute__((unused)),
> -			      const char *fmt, ...)
> +			      const char *fmt, va_list args)
>   {
> -	va_list args;
> -	int ret;
> -
> -	va_start(args, fmt);
> -	ret = veprintf(1, verbose, pr_fmt(fmt), args);
> -	va_end(args);
> -	return ret;
> +	return veprintf(1, verbose, pr_fmt(fmt), args);
>   }
>   
>   struct bpf_prog_priv {
> diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
> index aebaeff5a5a0..5afab823ffbe 100644
> --- a/tools/testing/selftests/bpf/test_btf.c
> +++ b/tools/testing/selftests/bpf/test_btf.c
> @@ -52,19 +52,10 @@ static int count_result(int err)
>   	return err;
>   }
>   
> -#define __printf(a, b)	__attribute__((format(printf, a, b)))
> -
> -__printf(2, 3)
>   static int __base_pr(enum libbpf_print_level level __attribute__((unused)),
> -		     const char *format, ...)
> +		     const char *format, va_list args)
>   {
> -	va_list args;
> -	int err;
> -
> -	va_start(args, format);
> -	err = vfprintf(stderr, format, args);
> -	va_end(args);
> -	return err;
> +	return vfprintf(stderr, format, args);
>   }
>   
>   #define BTF_INFO_ENC(kind, kind_flag, vlen)			\
> diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
> index b9ff3bf76544..1909ecf4d999 100644
> --- a/tools/testing/selftests/bpf/test_libbpf_open.c
> +++ b/tools/testing/selftests/bpf/test_libbpf_open.c
> @@ -36,19 +36,13 @@ static void usage(char *argv[])
>   
>   static bool debug = 0;
>   static int libbpf_debug_print(enum libbpf_print_level level,
> -			      const char *fmt, ...)
> +			      const char *fmt, va_list args)
>   {
> -	va_list args;
> -	int ret;
> -
>   	if (level == LIBBPF_DEBUG && !debug)
>   		return 0;
>   
> -	va_start(args, fmt);
>   	fprintf(stderr, "[%d] ", level);
> -	ret = vfprintf(stderr, fmt, args);
> -	va_end(args);
> -	return ret;
> +	return vfprintf(stderr, fmt, args);
>   }
>   
>   #define EXIT_FAIL_LIBBPF EXIT_FAILURE
> diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> index 55d05102e7bf..c52bd90fbb34 100644
> --- a/tools/testing/selftests/bpf/test_progs.c
> +++ b/tools/testing/selftests/bpf/test_progs.c
> @@ -1785,18 +1785,12 @@ static void test_task_fd_query_tp(void)
>   }
>   
>   static int libbpf_debug_print(enum libbpf_print_level level,
> -			      const char *format, ...)
> +			      const char *format, va_list args)
>   {
> -	va_list args;
> -	int ret;
> -
>   	if (level == LIBBPF_DEBUG)
>   		return 0;
>   
> -	va_start(args, format);
> -	ret = vfprintf(stderr, format, args);
> -	va_end(args);
> -	return ret;
> +	return vfprintf(stderr, format, args);
>   }
>   
>   static void test_reference_tracking()
> 

^ permalink raw reply

* Re: [PATCH btf 0/3] Add BTF types deduplication algorithm
From: Daniel Borkmann @ 2019-02-05  0:26 UTC (permalink / raw)
  To: Andrii Nakryiko, netdev, ast, yhs, kafai, acme, kernel-team
In-Reply-To: <20190131065837.3380288-1-andriin@fb.com>

Hi Andrii,

On 01/31/2019 07:58 AM, Andrii Nakryiko wrote:
> This patch series adds BTF deduplication algorithm to libbpf. This algorithm
> allows to take BTF type information containing duplicate per-compilation unit
> information and reduce it to equivalent set of BTF types with no duplication without
> loss of information. It also deduplicates strings and removes those strings that
> are not referenced from any BTF type (and line information in .BTF.ext section,
> if any).
> 
> Algorithm also resolves struct/union forward declarations into concrete BTF types
> across multiple compilation units to facilitate better deduplication ratio. If
> undesired, this resolution can be disabled through specifying corresponding options.
> 
> When applied to BTF data emitted by pahole's DWARF->BTF converter, it reduces
> the overall size of .BTF section by about 65x, from about 112MB to 1.75MB, leaving
> only 29247 out of initial 3073497 BTF type descriptors.
> 
> Algorithm with minor differences and preliminary results before FUNC/FUNC_PROTO
> support is also described more verbosely at:
> https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html
> 
> Andrii Nakryiko (3):
>   btf: extract BTF type size calculation
>   btf: add BTF types deduplication algorithm
>   selftests/btf: add initial BTF dedup tests
> 
>  tools/lib/bpf/btf.c                    | 1851 +++++++++++++++++++++++-
>  tools/lib/bpf/btf.h                    |   11 +
>  tools/lib/bpf/libbpf.map               |    3 +
>  tools/testing/selftests/bpf/test_btf.c |  535 ++++++-
>  4 files changed, 2333 insertions(+), 67 deletions(-)

This would need a proper rebase for bpf-next so that it applies w/o
bigger conflicts. Please also change the libbpf.map and place the newly
exported functions under LIBBPF_0.0.2 (as everything under 0.0.1 was from
prior released kernel).

Thanks,
Daniel

^ permalink raw reply

* [PATCH bpf-next] libbpf: fix libbpf_print
From: Stanislav Fomichev @ 2019-02-05  0:20 UTC (permalink / raw)
  To: netdev; +Cc: yhs, davem, ast, daniel, Stanislav Fomichev

With the recent print rework we now have the following problem:
pr_{warning,info,debug} expand to __pr which calls libbpf_print.
libbpf_print does va_start and calls __libbpf_pr with va_list argument.
In __base_pr we again do va_start. Because the next argument is a
va_list, we don't get correct pointer to the argument (and print noting
in my case, I don't know why it doesn't crash tbh).

Fix this by changing libbpf_print_fn_t signature to accept va_list and
remove unneeded calls to va_start in the existing users.

Alternatively, this can we solved by exporting __libbpf_pr and
changing __pr macro to (and killing libbpf_print):
{
	if (__libbpf_pr)
		__libbpf_pr(level, "libbpf: " fmt, ##__VA_ARGS__)
}

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/lib/bpf/libbpf.c                         | 14 ++++----------
 tools/lib/bpf/libbpf.h                         |  3 +--
 tools/perf/util/bpf-loader.c                   | 10 ++--------
 tools/testing/selftests/bpf/test_btf.c         | 13 ++-----------
 tools/testing/selftests/bpf/test_libbpf_open.c | 10 ++--------
 tools/testing/selftests/bpf/test_progs.c       | 10 ++--------
 6 files changed, 13 insertions(+), 47 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 84ca6c2bea91..47969aa0faf8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -54,22 +54,16 @@
 
 #define __printf(a, b)	__attribute__((format(printf, a, b)))
 
-__printf(2, 3)
-static int __base_pr(enum libbpf_print_level level, const char *format, ...)
+static int __base_pr(enum libbpf_print_level level, const char *format,
+		     va_list args)
 {
-	va_list args;
-	int err;
-
 	if (level == LIBBPF_DEBUG)
 		return 0;
 
-	va_start(args, format);
-	err = vfprintf(stderr, format, args);
-	va_end(args);
-	return err;
+	return vfprintf(stderr, format, args);
 }
 
-static __printf(2, 3) libbpf_print_fn_t __libbpf_pr = __base_pr;
+static libbpf_print_fn_t __libbpf_pr = __base_pr;
 
 void libbpf_set_print(libbpf_print_fn_t fn)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 19dbc1bed960..69a7c25eaccc 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -54,8 +54,7 @@ enum libbpf_print_level {
 };
 
 typedef int (*libbpf_print_fn_t)(enum libbpf_print_level level,
-				 const char *, ...)
-	__attribute__((format(printf, 2, 3)));
+				 const char *, va_list ap);
 
 LIBBPF_API void libbpf_set_print(libbpf_print_fn_t fn);
 
diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index 38afdbe6a9e0..037d8ff6a634 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -25,15 +25,9 @@
 #include "c++/clang-c.h"
 
 static int libbpf_perf_print(enum libbpf_print_level level __attribute__((unused)),
-			      const char *fmt, ...)
+			      const char *fmt, va_list args)
 {
-	va_list args;
-	int ret;
-
-	va_start(args, fmt);
-	ret = veprintf(1, verbose, pr_fmt(fmt), args);
-	va_end(args);
-	return ret;
+	return veprintf(1, verbose, pr_fmt(fmt), args);
 }
 
 struct bpf_prog_priv {
diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index aebaeff5a5a0..5afab823ffbe 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -52,19 +52,10 @@ static int count_result(int err)
 	return err;
 }
 
-#define __printf(a, b)	__attribute__((format(printf, a, b)))
-
-__printf(2, 3)
 static int __base_pr(enum libbpf_print_level level __attribute__((unused)),
-		     const char *format, ...)
+		     const char *format, va_list args)
 {
-	va_list args;
-	int err;
-
-	va_start(args, format);
-	err = vfprintf(stderr, format, args);
-	va_end(args);
-	return err;
+	return vfprintf(stderr, format, args);
 }
 
 #define BTF_INFO_ENC(kind, kind_flag, vlen)			\
diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
index b9ff3bf76544..1909ecf4d999 100644
--- a/tools/testing/selftests/bpf/test_libbpf_open.c
+++ b/tools/testing/selftests/bpf/test_libbpf_open.c
@@ -36,19 +36,13 @@ static void usage(char *argv[])
 
 static bool debug = 0;
 static int libbpf_debug_print(enum libbpf_print_level level,
-			      const char *fmt, ...)
+			      const char *fmt, va_list args)
 {
-	va_list args;
-	int ret;
-
 	if (level == LIBBPF_DEBUG && !debug)
 		return 0;
 
-	va_start(args, fmt);
 	fprintf(stderr, "[%d] ", level);
-	ret = vfprintf(stderr, fmt, args);
-	va_end(args);
-	return ret;
+	return vfprintf(stderr, fmt, args);
 }
 
 #define EXIT_FAIL_LIBBPF EXIT_FAILURE
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 55d05102e7bf..c52bd90fbb34 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1785,18 +1785,12 @@ static void test_task_fd_query_tp(void)
 }
 
 static int libbpf_debug_print(enum libbpf_print_level level,
-			      const char *format, ...)
+			      const char *format, va_list args)
 {
-	va_list args;
-	int ret;
-
 	if (level == LIBBPF_DEBUG)
 		return 0;
 
-	va_start(args, format);
-	ret = vfprintf(stderr, format, args);
-	va_end(args);
-	return ret;
+	return vfprintf(stderr, format, args);
 }
 
 static void test_reference_tracking()
-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply related

* [net-next][PATCH 2/5] rds: rdma: add consumer reject
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

For legacy protocol version incompatibility with non linux RDS,
consumer reject reason being used to convey it to peer. But the
choice of reject reason value as '1' was really poor.

Anyway for interoperability reasons with shipping products,
it needs to be supported. For any future versions, properly
encoded reject reason should to be used.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 net/rds/ib_cm.c          |  6 ++++--
 net/rds/rdma_transport.c | 12 ++++++++++++
 net/rds/rdma_transport.h |  6 ++++++
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 0eeae09..a1c3ad3 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -734,8 +734,10 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 
 	/* Check whether the remote protocol version matches ours. */
 	version = rds_ib_protocol_compatible(event, isv6);
-	if (!version)
+	if (!version) {
+		err = RDS_RDMA_REJ_INCOMPAT;
 		goto out;
+	}
 
 	dp = event->param.conn.private_data;
 	if (isv6) {
@@ -851,7 +853,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 	if (conn)
 		mutex_unlock(&conn->c_cm_lock);
 	if (err)
-		rdma_reject(cm_id, NULL, 0);
+		rdma_reject(cm_id, &err, sizeof(int));
 	return destroy;
 }
 
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 6b0f57c..63cbc6b 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -51,6 +51,8 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 	struct rds_connection *conn = cm_id->context;
 	struct rds_transport *trans;
 	int ret = 0;
+	int *err;
+	u8 len;
 
 	rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 		 event->event, rdma_event_msg(event->event));
@@ -106,8 +108,18 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 		break;
 
 	case RDMA_CM_EVENT_REJECTED:
+		if (!conn)
+			break;
+		err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
+		if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
+			pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
+				&conn->c_laddr, &conn->c_faddr);
+			conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
+			rds_conn_drop(conn);
+		}
 		rdsdebug("Connection rejected: %s\n",
 			 rdma_reject_msg(cm_id, event->status));
+		break;
 		/* FALLTHROUGH */
 	case RDMA_CM_EVENT_ADDR_ERROR:
 	case RDMA_CM_EVENT_ROUTE_ERROR:
diff --git a/net/rds/rdma_transport.h b/net/rds/rdma_transport.h
index 200d313..bfafd4a 100644
--- a/net/rds/rdma_transport.h
+++ b/net/rds/rdma_transport.h
@@ -11,6 +11,12 @@
 
 #define RDS_RDMA_RESOLVE_TIMEOUT_MS     5000
 
+/* Below reject reason is for legacy interoperability issue with non-linux
+ * RDS endpoints where older version incompatibility is conveyed via value 1.
+ * For future version(s), proper encoded reject reason should be be used.
+ */
+#define RDS_RDMA_REJ_INCOMPAT		1
+
 int rds_rdma_conn_connect(struct rds_connection *conn);
 int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 			      struct rdma_cm_event *event);
-- 
1.9.1


^ permalink raw reply related

* [net-next][PATCH 0/5] rds: add tos support
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar

RDS applications make use of tos to classify database traffic.
This feature has been used in shipping products from 2.6.32 based
kernels. Its tied with RDS v4.1 protocol version and the compatibility
gets negotiated as part of connections setup.

Patchset keeps full backward compatibility using existing connection
negotiation scheme. Currently the feature is exploited by RDMA
transport and for TCP transport the user tos values are mapped to
same default class (0).

For RDMA transports, RDMA CM service type API is used to
set up different SL(service lanes) and the IB fabric is configured
for tos mapping using Subnet Manager(SL to VL mappings).
Similarly for ROCE fabric, user priority is mapped with different
DSCP code points which are associated with different switch queues
in the fabric.

The original code was developed by Bang Nguyen in downstream kernel back in
2.6.32 kernel days and it has evolved significantly over period of time.

Thanks to Yanjun for doing testing with various combinations of host like
v3.1<->v4.1, v4.1.<->v3.1, v4.1 upstream to shipping v4.1 etc etc

Patchset is also available on below git tree.

The following changes since commit cc7335786f7278d66bdcf96d3d411edfcb01be51:

  socket: fix for Add SO_TIMESTAMP[NS]_NEW (2019-02-03 20:36:11 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_net-next-5.1/rds-tos-v4

for you to fetch changes up to fd261ce6a30e01ad67c416e2c67e263024b3a6f9:

  rds: rdma: update rdma transport for tos (2019-02-04 14:59:13 -0800)

----------------------------------------------------------------

Santosh Shilimkar (5):
  rds: make v3.1 as compat version
  rds: rdma: add consumer reject
  rds: add type of service(tos) infrastructure
  rds: add transport specific tos_map hook
  rds: rdma: update rdma transport for tos

 include/uapi/linux/rds.h | 11 ++++++++
 net/rds/af_rds.c         | 37 ++++++++++++++++++++++++-
 net/rds/connection.c     | 21 ++++++++------
 net/rds/ib.c             | 11 ++++++++
 net/rds/ib.h             |  4 ++-
 net/rds/ib_cm.c          | 72 +++++++++++++++++++++++++++---------------------
 net/rds/ib_recv.c        |  4 +--
 net/rds/ib_send.c        |  5 ++--
 net/rds/rdma_transport.c | 14 ++++++++++
 net/rds/rdma_transport.h |  6 ++++
 net/rds/rds.h            | 14 ++++++++--
 net/rds/recv.c           |  1 +
 net/rds/send.c           |  7 +++--
 net/rds/tcp.c            |  8 ++++++
 net/rds/tcp_listen.c     |  2 +-
 net/rds/threads.c        |  1 +
 16 files changed, 166 insertions(+), 52 deletions(-)

-- 
1.9.1


^ permalink raw reply

* [net-next][PATCH 4/5] rds: add transport specific tos_map hook
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

RDMA transport maps user tos to underline virtual lanes(VL)
for IB or DSCP values. RDMA CM transport abstract thats for
RDS. TCP transport makes use of default priority 0 and maps
all user tos values to it.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 net/rds/af_rds.c | 10 ++++++----
 net/rds/ib.c     | 10 ++++++++++
 net/rds/rds.h    |  1 +
 net/rds/tcp.c    |  7 +++++++
 4 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 9045158..d6cc97f 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -255,16 +255,18 @@ static __poll_t rds_poll(struct file *file, struct socket *sock,
 static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 {
 	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
-	rds_tos_t tos;
+	rds_tos_t utos, tos = 0;
 
 	switch (cmd) {
 	case SIOCRDSSETTOS:
-		if (get_user(tos, (rds_tos_t __user *)arg))
+		if (get_user(utos, (rds_tos_t __user *)arg))
 			return -EFAULT;
 
 		if (rs->rs_transport &&
-		    rs->rs_transport->t_type == RDS_TRANS_TCP)
-			tos = 0;
+		    rs->rs_transport->get_tos_map)
+			tos = rs->rs_transport->get_tos_map(utos);
+		else
+			return -ENOIOCTLCMD;
 
 		spin_lock_bh(&rds_sock_lock);
 		if (rs->rs_tos || rs->rs_conn) {
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 21b6588..2da9b75 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -515,6 +515,15 @@ void rds_ib_exit(void)
 	rds_ib_mr_exit();
 }
 
+static u8 rds_ib_get_tos_map(u8 tos)
+{
+	/* 1:1 user to transport map for RDMA transport.
+	 * In future, if custom map is desired, hook can export
+	 * user configurable map.
+	 */
+	return tos;
+}
+
 struct rds_transport rds_ib_transport = {
 	.laddr_check		= rds_ib_laddr_check,
 	.xmit_path_complete	= rds_ib_xmit_path_complete,
@@ -537,6 +546,7 @@ struct rds_transport rds_ib_transport = {
 	.sync_mr		= rds_ib_sync_mr,
 	.free_mr		= rds_ib_free_mr,
 	.flush_mrs		= rds_ib_flush_mrs,
+	.get_tos_map		= rds_ib_get_tos_map,
 	.t_owner		= THIS_MODULE,
 	.t_name			= "infiniband",
 	.t_unloading		= rds_ib_is_unloading,
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 7e52b92..0d8f67c 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -574,6 +574,7 @@ struct rds_transport {
 	void (*free_mr)(void *trans_private, int invalidate);
 	void (*flush_mrs)(void);
 	bool (*t_unloading)(struct rds_connection *conn);
+	u8 (*get_tos_map)(u8 tos);
 };
 
 /* Bind hash table key length.  It is the sum of the size of a struct
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index eb68519..fd26941 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -453,6 +453,12 @@ static void rds_tcp_destroy_conns(void)
 
 static void rds_tcp_exit(void);
 
+static u8 rds_tcp_get_tos_map(u8 tos)
+{
+	/* all user tos mapped to default 0 for TCP transport */
+	return 0;
+}
+
 struct rds_transport rds_tcp_transport = {
 	.laddr_check		= rds_tcp_laddr_check,
 	.xmit_path_prepare	= rds_tcp_xmit_path_prepare,
@@ -467,6 +473,7 @@ struct rds_transport rds_tcp_transport = {
 	.inc_free		= rds_tcp_inc_free,
 	.stats_info_copy	= rds_tcp_stats_info_copy,
 	.exit			= rds_tcp_exit,
+	.get_tos_map		= rds_tcp_get_tos_map,
 	.t_owner		= THIS_MODULE,
 	.t_name			= "tcp",
 	.t_type			= RDS_TRANS_TCP,
-- 
1.9.1


^ permalink raw reply related

* [net-next][PATCH 1/5] rds: make v3.1 as compat version
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

Mark RDSv3.1 as compat version and add v4.1 version macro's.
Subsequent patches enable TOS(Type of Service) feature which is
tied with v4.1 for RDMA transport.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 net/rds/connection.c |  1 +
 net/rds/ib_cm.c      | 40 +++++++++++++++++++++++-----------------
 net/rds/rds.h        |  4 ++++
 net/rds/threads.c    |  1 +
 4 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 3bd2f4a..1ab14b6 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -139,6 +139,7 @@ static void __rds_conn_path_init(struct rds_connection *conn,
 	atomic_set(&cp->cp_state, RDS_CONN_DOWN);
 	cp->cp_send_gen = 0;
 	cp->cp_reconnect_jiffies = 0;
+	cp->cp_conn->c_proposed_version = RDS_PROTOCOL_VERSION;
 	INIT_DELAYED_WORK(&cp->cp_send_w, rds_send_worker);
 	INIT_DELAYED_WORK(&cp->cp_recv_w, rds_recv_worker);
 	INIT_DELAYED_WORK(&cp->cp_conn_w, rds_connect_worker);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index bfbb31f..0eeae09 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -133,23 +133,24 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 		rds_ib_set_flow_control(conn, be32_to_cpu(credit));
 	}
 
-	if (conn->c_version < RDS_PROTOCOL(3, 1)) {
-		pr_notice("RDS/IB: Connection <%pI6c,%pI6c> version %u.%u no longer supported\n",
-			  &conn->c_laddr, &conn->c_faddr,
-			  RDS_PROTOCOL_MAJOR(conn->c_version),
-			  RDS_PROTOCOL_MINOR(conn->c_version));
-		set_bit(RDS_DESTROY_PENDING, &conn->c_path[0].cp_flags);
-		rds_conn_destroy(conn);
-		return;
-	} else {
-		pr_notice("RDS/IB: %s conn connected <%pI6c,%pI6c> version %u.%u%s\n",
-			  ic->i_active_side ? "Active" : "Passive",
-			  &conn->c_laddr, &conn->c_faddr,
-			  RDS_PROTOCOL_MAJOR(conn->c_version),
-			  RDS_PROTOCOL_MINOR(conn->c_version),
-			  ic->i_flowctl ? ", flow control" : "");
+	if (conn->c_version < RDS_PROTOCOL_VERSION) {
+		if (conn->c_version != RDS_PROTOCOL_COMPAT_VERSION) {
+			pr_notice("RDS/IB: Connection <%pI6c,%pI6c> version %u.%u no longer supported\n",
+				  &conn->c_laddr, &conn->c_faddr,
+				  RDS_PROTOCOL_MAJOR(conn->c_version),
+				  RDS_PROTOCOL_MINOR(conn->c_version));
+			rds_conn_destroy(conn);
+			return;
+		}
 	}
 
+	pr_notice("RDS/IB: %s conn connected <%pI6c,%pI6c> version %u.%u%s\n",
+		  ic->i_active_side ? "Active" : "Passive",
+		  &conn->c_laddr, &conn->c_faddr,
+		  RDS_PROTOCOL_MAJOR(conn->c_version),
+		  RDS_PROTOCOL_MINOR(conn->c_version),
+		  ic->i_flowctl ? ", flow control" : "");
+
 	atomic_set(&ic->i_cq_quiesce, 0);
 
 	/* Init rings and fill recv. this needs to wait until protocol
@@ -184,6 +185,7 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 					    NULL);
 	}
 
+	conn->c_proposed_version = conn->c_version;
 	rds_connect_complete(conn);
 }
 
@@ -667,6 +669,9 @@ static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event, bool isv6)
 		version = RDS_PROTOCOL_3_0;
 		while ((common >>= 1) != 0)
 			version++;
+	} else if (RDS_PROTOCOL_COMPAT_VERSION ==
+		   RDS_PROTOCOL(major, minor)) {
+		version = RDS_PROTOCOL_COMPAT_VERSION;
 	} else {
 		if (isv6)
 			printk_ratelimited(KERN_NOTICE "RDS: Connection from %pI6c using incompatible protocol version %u.%u\n",
@@ -861,7 +866,7 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id, bool isv6)
 
 	/* If the peer doesn't do protocol negotiation, we must
 	 * default to RDSv3.0 */
-	rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0);
+	rds_ib_set_protocol(conn, RDS_PROTOCOL_VERSION);
 	ic->i_flowctl = rds_ib_sysctl_flow_control;	/* advertise flow control */
 
 	ret = rds_ib_setup_qp(conn);
@@ -870,7 +875,8 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id, bool isv6)
 		goto out;
 	}
 
-	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp, RDS_PROTOCOL_VERSION,
+	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp,
+				  conn->c_proposed_version,
 				  UINT_MAX, UINT_MAX, isv6);
 	ret = rdma_connect(cm_id, &conn_param);
 	if (ret)
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 4ffe100..660023f 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -19,10 +19,13 @@
  */
 #define RDS_PROTOCOL_3_0	0x0300
 #define RDS_PROTOCOL_3_1	0x0301
+#define RDS_PROTOCOL_4_0	0x0400
+#define RDS_PROTOCOL_4_1	0x0401
 #define RDS_PROTOCOL_VERSION	RDS_PROTOCOL_3_1
 #define RDS_PROTOCOL_MAJOR(v)	((v) >> 8)
 #define RDS_PROTOCOL_MINOR(v)	((v) & 255)
 #define RDS_PROTOCOL(maj, min)	(((maj) << 8) | min)
+#define RDS_PROTOCOL_COMPAT_VERSION	RDS_PROTOCOL_3_1
 
 /* The following ports, 16385, 18634, 18635, are registered with IANA as
  * the ports to be used for RDS over TCP and UDP.  Currently, only RDS over
@@ -151,6 +154,7 @@ struct rds_connection {
 	struct rds_cong_map	*c_fcong;
 
 	/* Protocol version */
+	unsigned int		c_proposed_version;
 	unsigned int		c_version;
 	possible_net_t		c_net;
 
diff --git a/net/rds/threads.c b/net/rds/threads.c
index e64f9e4..32dc50f 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -93,6 +93,7 @@ void rds_connect_path_complete(struct rds_conn_path *cp, int curr)
 		queue_delayed_work(rds_wq, &cp->cp_recv_w, 0);
 	}
 	rcu_read_unlock();
+	cp->cp_conn->c_proposed_version = RDS_PROTOCOL_VERSION;
 }
 EXPORT_SYMBOL_GPL(rds_connect_path_complete);
 
-- 
1.9.1


^ permalink raw reply related

* [net-next][PATCH 5/5] rds: rdma: update rdma transport for tos
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

For RDMA transports, RDS TOS is an extension of IB QoS(Annex A13)
to provide clients the ability to segregate traffic flows for
different type of data. RDMA CM abstract it for ULPs using
rdma_set_service_type(). Internally, each traffic flow is
represented by a connection with all of its independent resources
like that of a normal connection, and is differentiated by
service type. In other words, there can be multiple qp connections
between an IP pair and each supports a unique service type.

The feature has been added from RDSv4.1 onwards and supports
rolling upgrades. RDMA connection metadata also carries the tos
information to set up SL on end to end context. The original
code was developed by Bang Nguyen in downstream kernel back in
2.6.32 kernel days and it has evolved over period of time.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 net/rds/ib.h             |  4 +++-
 net/rds/ib_cm.c          | 32 +++++++++++++++++---------------
 net/rds/ib_recv.c        |  4 ++--
 net/rds/ib_send.c        |  5 +++--
 net/rds/rdma_transport.c |  1 +
 net/rds/send.c           |  5 +++--
 6 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 71ff356..752f922 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -67,7 +67,9 @@ struct rds_ib_conn_priv_cmn {
 	u8			ricpc_protocol_major;
 	u8			ricpc_protocol_minor;
 	__be16			ricpc_protocol_minor_mask;	/* bitmask */
-	__be32			ricpc_reserved1;
+	u8			ricpc_dp_toss;
+	u8			ripc_reserved1;
+	__be16			ripc_reserved2;
 	__be64			ricpc_ack_seq;
 	__be32			ricpc_credit;	/* non-zero enables flow ctl */
 };
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 70518e3..66c6eb5 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -144,9 +144,9 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 		}
 	}
 
-	pr_notice("RDS/IB: %s conn connected <%pI6c,%pI6c> version %u.%u%s\n",
+	pr_notice("RDS/IB: %s conn connected <%pI6c,%pI6c,%d> version %u.%u%s\n",
 		  ic->i_active_side ? "Active" : "Passive",
-		  &conn->c_laddr, &conn->c_faddr,
+		  &conn->c_laddr, &conn->c_faddr, conn->c_tos,
 		  RDS_PROTOCOL_MAJOR(conn->c_version),
 		  RDS_PROTOCOL_MINOR(conn->c_version),
 		  ic->i_flowctl ? ", flow control" : "");
@@ -222,6 +222,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
 			    cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
 			dp->ricp_v6.dp_ack_seq =
 			    cpu_to_be64(rds_ib_piggyb_ack(ic));
+			dp->ricp_v6.dp_cmn.ricpc_dp_toss = conn->c_tos;
 
 			conn_param->private_data = &dp->ricp_v6;
 			conn_param->private_data_len = sizeof(dp->ricp_v6);
@@ -236,6 +237,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
 			    cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
 			dp->ricp_v4.dp_ack_seq =
 			    cpu_to_be64(rds_ib_piggyb_ack(ic));
+			dp->ricp_v4.dp_cmn.ricpc_dp_toss = conn->c_tos;
 
 			conn_param->private_data = &dp->ricp_v4;
 			conn_param->private_data_len = sizeof(dp->ricp_v4);
@@ -391,10 +393,9 @@ static void rds_ib_qp_event_handler(struct ib_event *event, void *data)
 		rdma_notify(ic->i_cm_id, IB_EVENT_COMM_EST);
 		break;
 	default:
-		rdsdebug("Fatal QP Event %u (%s) "
-			"- connection %pI6c->%pI6c, reconnecting\n",
-			event->event, ib_event_msg(event->event),
-			&conn->c_laddr, &conn->c_faddr);
+		rdsdebug("Fatal QP Event %u (%s) - connection %pI6c->%pI6c, reconnecting\n",
+			 event->event, ib_event_msg(event->event),
+			 &conn->c_laddr, &conn->c_faddr);
 		rds_conn_drop(conn);
 		break;
 	}
@@ -662,11 +663,11 @@ static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event, bool isv6)
 
 	/* Even if len is crap *now* I still want to check it. -ASG */
 	if (event->param.conn.private_data_len < data_len || major == 0)
-		return RDS_PROTOCOL_3_0;
+		return RDS_PROTOCOL_4_0;
 
 	common = be16_to_cpu(mask) & RDS_IB_SUPPORTED_PROTOCOLS;
-	if (major == 3 && common) {
-		version = RDS_PROTOCOL_3_0;
+	if (major == 4 && common) {
+		version = RDS_PROTOCOL_4_0;
 		while ((common >>= 1) != 0)
 			version++;
 	} else if (RDS_PROTOCOL_COMPAT_VERSION ==
@@ -778,15 +779,16 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 		daddr6 = &d_mapped_addr;
 	}
 
-	rdsdebug("saddr %pI6c daddr %pI6c RDSv%u.%u lguid 0x%llx fguid "
-		 "0x%llx\n", saddr6, daddr6,
-		 RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version),
+	rdsdebug("saddr %pI6c daddr %pI6c RDSv%u.%u lguid 0x%llx fguid 0x%llx, tos:%d\n",
+		 saddr6, daddr6, RDS_PROTOCOL_MAJOR(version),
+		 RDS_PROTOCOL_MINOR(version),
 		 (unsigned long long)be64_to_cpu(lguid),
-		 (unsigned long long)be64_to_cpu(fguid));
+		 (unsigned long long)be64_to_cpu(fguid), dp_cmn->ricpc_dp_toss);
 
 	/* RDS/IB is not currently netns aware, thus init_net */
 	conn = rds_conn_create(&init_net, daddr6, saddr6,
-			       &rds_ib_transport, 0, GFP_KERNEL, ifindex);
+			       &rds_ib_transport, dp_cmn->ricpc_dp_toss,
+			       GFP_KERNEL, ifindex);
 	if (IS_ERR(conn)) {
 		rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn));
 		conn = NULL;
@@ -868,7 +870,7 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id, bool isv6)
 
 	/* If the peer doesn't do protocol negotiation, we must
 	 * default to RDSv3.0 */
-	rds_ib_set_protocol(conn, RDS_PROTOCOL_VERSION);
+	rds_ib_set_protocol(conn, RDS_PROTOCOL_4_1);
 	ic->i_flowctl = rds_ib_sysctl_flow_control;	/* advertise flow control */
 
 	ret = rds_ib_setup_qp(conn);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 2f16146..d395eec 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -986,9 +986,9 @@ void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic,
 	} else {
 		/* We expect errors as the qp is drained during shutdown */
 		if (rds_conn_up(conn) || rds_conn_connecting(conn))
-			rds_ib_conn_error(conn, "recv completion on <%pI6c,%pI6c> had status %u (%s), disconnecting and reconnecting\n",
+			rds_ib_conn_error(conn, "recv completion on <%pI6c,%pI6c, %d> had status %u (%s), disconnecting and reconnecting\n",
 					  &conn->c_laddr, &conn->c_faddr,
-					  wc->status,
+					  conn->c_tos, wc->status,
 					  ib_wc_status_msg(wc->status));
 	}
 
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 4e0c36a..09c46f2 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -305,8 +305,9 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 
 	/* We expect errors as the qp is drained during shutdown */
 	if (wc->status != IB_WC_SUCCESS && rds_conn_up(conn)) {
-		rds_ib_conn_error(conn, "send completion on <%pI6c,%pI6c> had status %u (%s), disconnecting and reconnecting\n",
-				  &conn->c_laddr, &conn->c_faddr, wc->status,
+		rds_ib_conn_error(conn, "send completion on <%pI6c,%pI6c,%d> had status %u (%s), disconnecting and reconnecting\n",
+				  &conn->c_laddr, &conn->c_faddr,
+				  conn->c_tos, wc->status,
 				  ib_wc_status_msg(wc->status));
 	}
 }
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index e37f915..46bce83 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -83,6 +83,7 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 		break;
 
 	case RDMA_CM_EVENT_ADDR_RESOLVED:
+		rdma_set_service_type(cm_id, conn->c_tos);
 		/* XXX do we need to clean up if this fails? */
 		ret = rdma_resolve_route(cm_id,
 					 RDS_RDMA_RESOLVE_TIMEOUT_MS);
diff --git a/net/rds/send.c b/net/rds/send.c
index c555e12..166dd57 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1277,12 +1277,13 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 
 	/* rds_conn_create has a spinlock that runs with IRQ off.
 	 * Caching the conn in the socket helps a lot. */
-	if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr)) {
+	if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr) &&
+	    rs->rs_tos == rs->rs_conn->c_tos) {
 		conn = rs->rs_conn;
 	} else {
 		conn = rds_conn_create_outgoing(sock_net(sock->sk),
 						&rs->rs_bound_addr, &daddr,
-						rs->rs_transport, 0,
+						rs->rs_transport, rs->rs_tos,
 						sock->sk->sk_allocation,
 						scope_id);
 		if (IS_ERR(conn)) {
-- 
1.9.1


^ permalink raw reply related

* [net-next][PATCH 3/5] rds: add type of service(tos) infrastructure
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

RDS Service type (TOS) is user-defined and needs to be configured
via RDS IOCTL interface. It must be set before initiating any
traffic and once set the TOS can not be changed. All out-going
traffic from the socket will be associated with its TOS.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 include/uapi/linux/rds.h | 11 +++++++++++
 net/rds/af_rds.c         | 35 ++++++++++++++++++++++++++++++++++-
 net/rds/connection.c     | 20 +++++++++++---------
 net/rds/ib.c             |  1 +
 net/rds/ib_cm.c          |  2 +-
 net/rds/rdma_transport.c |  1 +
 net/rds/rds.h            |  9 +++++++--
 net/rds/recv.c           |  1 +
 net/rds/send.c           |  6 +++---
 net/rds/tcp.c            |  1 +
 net/rds/tcp_listen.c     |  2 +-
 11 files changed, 72 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 8b73cb6..5d0f76c 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -69,6 +69,12 @@
 #define RDS_TRANS_COUNT	3
 #define	RDS_TRANS_NONE	(~0)
 
+/* IOCTLS commands for SOL_RDS */
+#define SIOCRDSSETTOS		(SIOCPROTOPRIVATE)
+#define SIOCRDSGETTOS		(SIOCPROTOPRIVATE + 1)
+
+typedef __u8	rds_tos_t;
+
 /*
  * Control message types for SOL_RDS.
  *
@@ -149,6 +155,7 @@ struct rds_info_connection {
 	__be32		faddr;
 	__u8		transport[TRANSNAMSIZ];		/* null term ascii */
 	__u8		flags;
+	__u8		tos;
 } __attribute__((packed));
 
 struct rds6_info_connection {
@@ -171,6 +178,7 @@ struct rds_info_message {
 	__be16		lport;
 	__be16		fport;
 	__u8		flags;
+	__u8		tos;
 } __attribute__((packed));
 
 struct rds6_info_message {
@@ -214,6 +222,7 @@ struct rds_info_tcp_socket {
 	__u32           last_sent_nxt;
 	__u32           last_expected_una;
 	__u32           last_seen_una;
+	__u8		tos;
 } __attribute__((packed));
 
 struct rds6_info_tcp_socket {
@@ -240,6 +249,7 @@ struct rds_info_rdma_connection {
 	__u32		max_send_sge;
 	__u32		rdma_mr_max;
 	__u32		rdma_mr_size;
+	__u8		tos;
 };
 
 struct rds6_info_rdma_connection {
@@ -253,6 +263,7 @@ struct rds6_info_rdma_connection {
 	__u32		max_send_sge;
 	__u32		rdma_mr_max;
 	__u32		rdma_mr_size;
+	__u8		tos;
 };
 
 /* RDS message Receive Path Latency points */
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 65571a6..9045158 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -254,7 +254,38 @@ static __poll_t rds_poll(struct file *file, struct socket *sock,
 
 static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 {
-	return -ENOIOCTLCMD;
+	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
+	rds_tos_t tos;
+
+	switch (cmd) {
+	case SIOCRDSSETTOS:
+		if (get_user(tos, (rds_tos_t __user *)arg))
+			return -EFAULT;
+
+		if (rs->rs_transport &&
+		    rs->rs_transport->t_type == RDS_TRANS_TCP)
+			tos = 0;
+
+		spin_lock_bh(&rds_sock_lock);
+		if (rs->rs_tos || rs->rs_conn) {
+			spin_unlock_bh(&rds_sock_lock);
+			return -EINVAL;
+		}
+		rs->rs_tos = tos;
+		spin_unlock_bh(&rds_sock_lock);
+		break;
+	case SIOCRDSGETTOS:
+		spin_lock_bh(&rds_sock_lock);
+		tos = rs->rs_tos;
+		spin_unlock_bh(&rds_sock_lock);
+		if (put_user(tos, (rds_tos_t __user *)arg))
+			return -EFAULT;
+		break;
+	default:
+		return -ENOIOCTLCMD;
+	}
+
+	return 0;
 }
 
 static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval,
@@ -650,6 +681,8 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
 	spin_lock_init(&rs->rs_rdma_lock);
 	rs->rs_rdma_keys = RB_ROOT;
 	rs->rs_rx_traces = 0;
+	rs->rs_tos = 0;
+	rs->rs_conn = NULL;
 
 	spin_lock_bh(&rds_sock_lock);
 	list_add_tail(&rs->rs_item, &rds_sock_list);
diff --git a/net/rds/connection.c b/net/rds/connection.c
index 1ab14b6..7ea134f 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -84,7 +84,7 @@ static struct rds_connection *rds_conn_lookup(struct net *net,
 					      const struct in6_addr *laddr,
 					      const struct in6_addr *faddr,
 					      struct rds_transport *trans,
-					      int dev_if)
+					      u8 tos, int dev_if)
 {
 	struct rds_connection *conn, *ret = NULL;
 
@@ -92,6 +92,7 @@ static struct rds_connection *rds_conn_lookup(struct net *net,
 		if (ipv6_addr_equal(&conn->c_faddr, faddr) &&
 		    ipv6_addr_equal(&conn->c_laddr, laddr) &&
 		    conn->c_trans == trans &&
+		    conn->c_tos == tos &&
 		    net == rds_conn_net(conn) &&
 		    conn->c_dev_if == dev_if) {
 			ret = conn;
@@ -160,7 +161,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 						const struct in6_addr *laddr,
 						const struct in6_addr *faddr,
 						struct rds_transport *trans,
-						gfp_t gfp,
+						gfp_t gfp, u8 tos,
 						int is_outgoing,
 						int dev_if)
 {
@@ -172,7 +173,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
 
 	rcu_read_lock();
-	conn = rds_conn_lookup(net, head, laddr, faddr, trans, dev_if);
+	conn = rds_conn_lookup(net, head, laddr, faddr, trans, tos, dev_if);
 	if (conn &&
 	    conn->c_loopback &&
 	    conn->c_trans != &rds_loop_transport &&
@@ -206,6 +207,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	conn->c_isv6 = !ipv6_addr_v4mapped(laddr);
 	conn->c_faddr = *faddr;
 	conn->c_dev_if = dev_if;
+	conn->c_tos = tos;
 
 #if IS_ENABLED(CONFIG_IPV6)
 	/* If the local address is link local, set c_bound_if to be the
@@ -298,7 +300,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 		struct rds_connection *found;
 
 		found = rds_conn_lookup(net, head, laddr, faddr, trans,
-					dev_if);
+					tos, dev_if);
 		if (found) {
 			struct rds_conn_path *cp;
 			int i;
@@ -333,10 +335,10 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 struct rds_connection *rds_conn_create(struct net *net,
 				       const struct in6_addr *laddr,
 				       const struct in6_addr *faddr,
-				       struct rds_transport *trans, gfp_t gfp,
-				       int dev_if)
+				       struct rds_transport *trans, u8 tos,
+				       gfp_t gfp, int dev_if)
 {
-	return __rds_conn_create(net, laddr, faddr, trans, gfp, 0, dev_if);
+	return __rds_conn_create(net, laddr, faddr, trans, gfp, tos, 0, dev_if);
 }
 EXPORT_SYMBOL_GPL(rds_conn_create);
 
@@ -344,9 +346,9 @@ struct rds_connection *rds_conn_create_outgoing(struct net *net,
 						const struct in6_addr *laddr,
 						const struct in6_addr *faddr,
 						struct rds_transport *trans,
-						gfp_t gfp, int dev_if)
+						u8 tos, gfp_t gfp, int dev_if)
 {
-	return __rds_conn_create(net, laddr, faddr, trans, gfp, 1, dev_if);
+	return __rds_conn_create(net, laddr, faddr, trans, gfp, tos, 1, dev_if);
 }
 EXPORT_SYMBOL_GPL(rds_conn_create_outgoing);
 
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9d7b758..21b6588 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -301,6 +301,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection *conn,
 
 	iinfo->src_addr = conn->c_laddr.s6_addr32[3];
 	iinfo->dst_addr = conn->c_faddr.s6_addr32[3];
+	iinfo->tos = conn->c_tos;
 
 	memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
 	memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index a1c3ad3..70518e3 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -786,7 +786,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 
 	/* RDS/IB is not currently netns aware, thus init_net */
 	conn = rds_conn_create(&init_net, daddr6, saddr6,
-			       &rds_ib_transport, GFP_KERNEL, ifindex);
+			       &rds_ib_transport, 0, GFP_KERNEL, ifindex);
 	if (IS_ERR(conn)) {
 		rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn));
 		conn = NULL;
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 63cbc6b..e37f915 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -115,6 +115,7 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 			pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
 				&conn->c_laddr, &conn->c_faddr);
 			conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
+			conn->c_tos = 0;
 			rds_conn_drop(conn);
 		}
 		rdsdebug("Connection rejected: %s\n",
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 660023f..7e52b92 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -158,6 +158,9 @@ struct rds_connection {
 	unsigned int		c_version;
 	possible_net_t		c_net;
 
+	/* TOS */
+	u8			c_tos;
+
 	struct list_head	c_map_item;
 	unsigned long		c_map_queued;
 
@@ -652,6 +655,7 @@ struct rds_sock {
 	u8			rs_rx_traces;
 	u8			rs_rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
 	struct rds_msg_zcopy_queue rs_zcookie_queue;
+	u8			rs_tos;
 };
 
 static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk)
@@ -760,13 +764,14 @@ struct rds_sock *rds_find_bound(const struct in6_addr *addr, __be16 port,
 struct rds_connection *rds_conn_create(struct net *net,
 				       const struct in6_addr *laddr,
 				       const struct in6_addr *faddr,
-				       struct rds_transport *trans, gfp_t gfp,
+				       struct rds_transport *trans,
+				       u8 tos, gfp_t gfp,
 				       int dev_if);
 struct rds_connection *rds_conn_create_outgoing(struct net *net,
 						const struct in6_addr *laddr,
 						const struct in6_addr *faddr,
 						struct rds_transport *trans,
-						gfp_t gfp, int dev_if);
+						u8 tos, gfp_t gfp, int dev_if);
 void rds_conn_shutdown(struct rds_conn_path *cpath);
 void rds_conn_destroy(struct rds_connection *conn);
 void rds_conn_drop(struct rds_connection *conn);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 6bb6b16..853de48 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -782,6 +782,7 @@ void rds_inc_info_copy(struct rds_incoming *inc,
 
 	minfo.seq = be64_to_cpu(inc->i_hdr.h_sequence);
 	minfo.len = be32_to_cpu(inc->i_hdr.h_len);
+	minfo.tos = inc->i_conn->c_tos;
 
 	if (flip) {
 		minfo.laddr = daddr;
diff --git a/net/rds/send.c b/net/rds/send.c
index fd8b687..c555e12 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1277,12 +1277,12 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 
 	/* rds_conn_create has a spinlock that runs with IRQ off.
 	 * Caching the conn in the socket helps a lot. */
-	if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr))
+	if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr)) {
 		conn = rs->rs_conn;
-	else {
+	} else {
 		conn = rds_conn_create_outgoing(sock_net(sock->sk),
 						&rs->rs_bound_addr, &daddr,
-						rs->rs_transport,
+						rs->rs_transport, 0,
 						sock->sk->sk_allocation,
 						scope_id);
 		if (IS_ERR(conn)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index c16f0a3..eb68519 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -267,6 +267,7 @@ static void rds_tcp_tc_info(struct socket *rds_sock, unsigned int len,
 		tsinfo.last_sent_nxt = tc->t_last_sent_nxt;
 		tsinfo.last_expected_una = tc->t_last_expected_una;
 		tsinfo.last_seen_una = tc->t_last_seen_una;
+		tsinfo.tos = tc->t_cpath->cp_conn->c_tos;
 
 		rds_info_copy(iter, &tsinfo, sizeof(tsinfo));
 	}
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index c12203f..810a3a4 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -200,7 +200,7 @@ int rds_tcp_accept_one(struct socket *sock)
 
 	conn = rds_conn_create(sock_net(sock->sk),
 			       my_addr, peer_addr,
-			       &rds_tcp_transport, GFP_KERNEL, dev_if);
+			       &rds_tcp_transport, 0, GFP_KERNEL, dev_if);
 
 	if (IS_ERR(conn)) {
 		ret = PTR_ERR(conn);
-- 
1.9.1


^ permalink raw reply related

* Re: [PATCH v1] net: dsa: qca8k: implement DT-based ports <-> phy translation
From: Christian Lamparter @ 2019-02-04 23:55 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Florian Fainelli, Vivien Didelot
In-Reply-To: <20190204222641.GE3397@lunn.ch>

Hello Andrew and Florian.

I concated both replies into this post.

On Monday, February 4, 2019 11:26:41 PM CET Andrew Lunn wrote:
> On Mon, Feb 04, 2019 at 10:35:55PM +0100, Christian Lamparter wrote:
> > The QCA8337 enumerates 5 PHYs on the MDC/MDIO access: PHY0-PHY4.
> > Based on the System Block Diagram in Section 1.2 of the
> > QCA8337's datasheet. These PHYs are internally connected
> > to MACs of PORT 1 - PORT 5.
> 
> Hi Christian
> 
> Is the off-by-one the problem here?
> 
Yes, this was triggered by a MII_PHYSID1 and MII_PHYSID2 read for a 
non-existent PHY at address >0x5< coming from dsa_register_switch()
during boot.

I added a WARN_ON to qca8k_phy_read() to see from where the reads are
coming from:

[    6.218168] WARNING: CPU: 0 PID: 21 at qca8k_phy_read+0x3c/0x68
[    6.224062] Modules linked in:
[    6.227107] CPU: 0 PID: 21 Comm: kworker/0:1 Tainted: G        W         4.19.16 #0
[    6.234732] Workqueue: events deferred_probe_work_func
[    6.239849] NIP:  c0308528 LR: c0308528 CTR: c0257d30
[    6.244878] REGS: cf485b80 TRAP: 0700   Tainted: G        W          (4.19.16)
[    6.252064] MSR:  00029000 <CE,EE,ME>  CR: 28002222  XER: 00000000
[    6.258224]
[    6.258224] GPR00: c0308528 cf485c30 cf47c000 00000005 00000000 00000215 c09d61e4 c09d3de0
[    6.258224] GPR08: 00021000 c09bfb00 c09bfb00 00000007 24002444 00000000 c003f81c cf466060
[    6.258224] GPR16: ffffffff cf633f60 cf633f6c 00000001 cf633f40 c0589f7c 006080c0 c09bede8
[    6.258224] GPR24: cf633f40 00000000 00000000 fffff000 00000003 cdf21790 00000003 00000005
[    6.292952] NIP [c0308528] qca8k_phy_read+0x3c/0x68
[    6.297808] LR [c0308528] qca8k_phy_read+0x3c/0x68
[    6.302574] Call Trace:
[    6.305013] [cf485c30] [c0308528] qca8k_phy_read+0x3c/0x68 (unreliable)
[    6.311602] [cf485c50] [c0300530] mdiobus_read+0x6c/0x9c
[    6.316894] [cf485c70] [c02ffdcc] get_phy_device+0x188/0x204
[    6.322527] [cf485cd0] [c0300740] mdiobus_scan+0x20/0x160
[    6.327901] [cf485d00] [c0300a3c] __mdiobus_register+0x1bc/0x2a8
[    6.333884] [cf485d30] [c047e690] dsa_register_switch+0x6a0/0x904
[    6.339954] [cf485db0] [c0300dd8] mdio_probe+0x40/0x88
[    6.345070] [cf485dc0] [c026947c] really_probe+0x168/0x300
[    6.350530] [cf485df0] [c0269b44] driver_probe_device+0x410/0x460
[    6.356594] [cf485e10] [c02672cc] bus_for_each_drv+0x5c/0xc0
[    6.362229] [cf485e40] [c0269270] __device_attach+0xa8/0x144
[    6.367862] [cf485e70] [c0268430] bus_probe_device+0x3c/0xc0
[    6.373495] [cf485e90] [c02689dc] deferred_probe_work_func+0x70/0xac
[    6.379821] [cf485eb0] [c003a2c4] process_one_work+0x25c/0x3b0
[    6.385633] [cf485ed0] [c003a708] worker_thread+0x2f0/0x434
[    6.391180] [cf485f10] [c003f950] kthread+0x134/0x138
[    6.396209] [cf485f40] [c000d23c] ret_from_kernel_thread+0x14/0x1c
[    6.402357] Instruction dump:
[    6.405313] 93c10018 7cbe2b78 93e1001c 7c9f2378 93a10014 83a30018 4bfffe4d 3d20c058
[    6.413027] 7c651b78 7fe4fb78 3869c970 4bd4e35d <0fe00000> 80010024 7fc5f378 807d0004
[    6.420916] ---[ end trace 00aeb6767b21cd36 ]---

If I disable port@5 (see qca8k.txt example)
<https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/net/dsa/qca8k.txt#L103>

then problem went away (but of course, this makes the WAN port useless).

When I looked more into it, I started to notice that the mdiobus_scan
started from address >1< (not 0). It was skipping PHY0 (which does exist!)
and then the extract from qca8k.txt suddenly made sense:

|The integrated switch subnode should be specified according to the binding
|described in dsa/dsa.txt. As the QCA8K switches do not have a N:N mapping of
|port and PHY id, each subnode describing a port needs to have a valid phandle
|referencing the internal PHY connected to it. The CPU port of this switch is
|always port 0.

From what I can tell, the slave mdio port-numbers (i.e 0 - 5/6 on the qca8k)
are being used directly as phy-addresses on the slave-bus. And since the port0
is a cpu port it gets skipped when the ds->phy_mii_mask is created in
dsa_switch_setup():

| ds->phys_mii_mask |= dsa_user_ports(ds);  // 0x3E

<https://elixir.bootlin.com/linux/v5.0-rc5/source/net/dsa/dsa2.c#L350>

However, ds->phys_mii_mask is used to calculate the slave's phy_mask in
dsa_slave_mii_bus_init()

|ds->slave_mii_bus->phy_mask = ~ds->phys_mii_mask; // ~0x3E

<https://github.com/torvalds/linux/blob/master/net/dsa/slave.c#L61>

which is then later used by __mdiobus_register() to scan for the
supposedly existing PHY at 0x1 - 0x5.

|	for (i = 0; i < PHY_MAX_ADDR; i++) {
|		if ((bus->phy_mask & (1 << i)) == 0) {
|			struct phy_device *phydev;
|
|			phydev = mdiobus_scan(bus, i);
|			if (IS_ERR(phydev) && (PTR_ERR(phydev) != -ENODEV)) {
|				err = PTR_ERR(phydev);
|				goto error;
|			}
|		}
|	}

So, I'm looking for a way to get this "-1" somewhere and this version
was the best justification I came up with. Because as Florian said,
this is supposed to work for different drivers as well.

---

On Monday, February 4, 2019 11:11:41 PM CET Florian Fainelli wrote:
> On 2/4/19 1:35 PM, Christian Lamparter wrote:
> > The QCA8337 enumerates 5 PHYs on the MDC/MDIO access: PHY0-PHY4.
> > Based on the System Block Diagram in Section 1.2 of the
> > QCA8337's datasheet. These PHYs are internally connected
> > to MACs of PORT 1 - PORT 5. However, neither qca8k's slave
> > mdio access functions qca8k_phy_read()/qca8k_phy_write()
> > nor the dsa framework is set up for that.
> > 
> > This version of the patch uses the existing phy-handle
> > properties of each specified DSA Port in the DT to map
> > each PORT/MAC to its exposed PHY on the MDIO bus. This
> > is supported by the current binding document qca8k.txt
> > as well.
> 
> I don't think you should have to do any of this translation, because you
> can do a couple of things with DSA/Device Tree:
> 
> - you can not provide a phy-handle property at all, in which case, the
> core DSA layer assumes that the PHY is part of the switch's internal
> MDIO bus which is implictly created by dsa_slave_mii_bus_create()
> 
> - you can specify a phy-handle property and then the PHY device tree
> node can be placed pretty much anywhere in Device Tree, including on a
> separate MDIO bus Device Tre node which is "external" to the switch
> 
> In either case, the PHY device's MDIO bus parent and its address are
> taken care of by drivers/of/of_mdio.c. You can look at mx88e6xxx for how
> it deals with its internal vs. external MDIO bus controller and that
> driver is used on a wide variety of cconfiguration.

Hm, this sounds to me like the qca8k driver might be cheating a bit. Though,
I can't really tell, I found "stub" translation routines in the mv88e6060
driver called mv88e6060_port_to_phy_addr(). But it just checks the range.
I think the issue here is that qca8k_phy_read/write don't actually operate
on the "internal" mdio-bus of the switch. Instead the mdiobus_read/write
in qca8k_phy_read/write operates on the provided mdio-bus (by the 
ethernet-mac or gpio-mdio, etc...). But let's see what else can be done 
(maybe qca8k_phy_read/write() is just wrong and should be dropped?).

Regards,
Christian



^ permalink raw reply

* Re: [PATCH mlx5-next 12/12] net/mlx5: Set ODP SRQ support in firmware
From: Saeed Mahameed @ 2019-02-04 23:54 UTC (permalink / raw)
  To: jgg@ziepe.ca, leon@kernel.org
  Cc: Majd Dibbiny, Moni Shoua, linux-rdma@vger.kernel.org,
	netdev@vger.kernel.org, dledford@redhat.com
In-Reply-To: <20190204212319.GE10237@ziepe.ca>

On Mon, 2019-02-04 at 14:23 -0700, Jason Gunthorpe wrote:
> On Sun, Feb 03, 2019 at 11:03:11AM +0200, Leon Romanovsky wrote:
> > On Thu, Jan 31, 2019 at 04:28:44PM -0700, Jason Gunthorpe wrote:
> > > On Tue, Jan 22, 2019 at 08:48:51AM +0200, Leon Romanovsky wrote:
> > > > From: Moni Shoua <monis@mellanox.com>
> > > > 
> > > > To avoid compatibility issue with older kernels the firmware
> > > > doesn't
> > > > allow SRQ to work with ODP unless kernel asks for it.
> > > > 
> > > > Signed-off-by: Moni Shoua <monis@mellanox.com>
> > > > Reviewed-by: Majd Dibbiny <majd@mellanox.com>
> > > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > > >  .../net/ethernet/mellanox/mlx5/core/main.c    | 53
> > > > +++++++++++++++++++
> > > >  include/linux/mlx5/device.h                   |  3 ++
> > > >  include/linux/mlx5/mlx5_ifc.h                 |  1 +
> > > >  3 files changed, 57 insertions(+)
> > > > 
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > > > b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > > > index be81b319b0dc..b3a76df0cf6c 100644
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > > > @@ -459,6 +459,53 @@ static int handle_hca_cap_atomic(struct
> > > > mlx5_core_dev *dev)
> > > >  	return err;
> > > >  }
> > > > 
> > > > +static int handle_hca_cap_odp(struct mlx5_core_dev *dev)
> > > > +{
> > > > +	void *set_ctx;
> > > > +	void *set_hca_cap;
> > > > +	int set_sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
> > > > +	int err;
> > > > +
> > > > +	if (!MLX5_CAP_GEN(dev, pg))
> > > > +		return 0;
> > > 
> > > Should a
> > > 
> > >     if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
> > >              return 0;
> > > 
> > > Be here?
> > 
> > We had similar discussion in mlx5_ib main.c, but here we are
> > talking
> > about mlx5_core code, which from my point of view should represent
> > the
> > real HW capabilities without relation to kernel compilation mode.
> 
> This switch is to tell the FW that the mlx5_ib module supports the
> new
> protocol - so having it in core code at all is really weird. I assume
> there is some startup sequence reason?
> 

Yes, sadly this must be in startup, set_hca_cap requests must come
prior to init_hca command.

> Since the modularity is already wrecked it seems like an odd
> reason not to add the if..
> 

Agree, even better, let's compile out the whole function. I would even
consider having a separate file in mlx5/core for IB related start-up
procedures :).

> Jason

^ permalink raw reply

* Re: [PATCH mlx5-next 12/12] net/mlx5: Set ODP SRQ support in firmware
From: Saeed Mahameed @ 2019-02-04 23:47 UTC (permalink / raw)
  To: Jason Gunthorpe, leon@kernel.org, dledford@redhat.com
  Cc: Majd Dibbiny, Moni Shoua, Leon Romanovsky,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190122064851.6032-13-leon@kernel.org>

On Tue, 2019-01-22 at 08:48 +0200, Leon Romanovsky wrote:
> From: Moni Shoua <monis@mellanox.com>
> 
> To avoid compatibility issue with older kernels the firmware doesn't
> allow SRQ to work with ODP unless kernel asks for it.
> 
> Signed-off-by: Moni Shoua <monis@mellanox.com>
> Reviewed-by: Majd Dibbiny <majd@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/main.c    | 53
> +++++++++++++++++++
>  include/linux/mlx5/device.h                   |  3 ++
>  include/linux/mlx5/mlx5_ifc.h                 |  1 +
>  3 files changed, 57 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index be81b319b0dc..b3a76df0cf6c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -459,6 +459,53 @@ static int handle_hca_cap_atomic(struct
> mlx5_core_dev *dev)
>  	return err;
>  }
>  
> +static int handle_hca_cap_odp(struct mlx5_core_dev *dev)
> +{
> +	void *set_ctx;
> +	void *set_hca_cap;
> +	int set_sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
> +	int err;
> +

reversed xmas tree.

> +	if (!MLX5_CAP_GEN(dev, pg))
> +		return 0;
> +
> +	err = mlx5_core_get_caps(dev, MLX5_CAP_ODP);
> +	if (err)
> +		return err;
> +
> +	/**
> +	 * If all bits are cleared we shouldn't try to set it
> +	 * or we might fail while trying to access a reserved bit.
> +	 */

"set them" not "set it" ? 

to me this is a redundant comment, the code is self explanatory.

> +	if (!(MLX5_CAP_ODP_MAX(dev, ud_odp_caps.srq_receive) ||
> +	      MLX5_CAP_ODP_MAX(dev, rc_odp_caps.srq_receive) ||
> +	      MLX5_CAP_ODP_MAX(dev, xrc_odp_caps.srq_receive)))
> +		return 0;
> +
> +	set_ctx = kzalloc(set_sz, GFP_KERNEL);
> +	if (!set_ctx)
> +		return -ENOMEM;
> +
> +	set_hca_cap = MLX5_ADDR_OF(set_hca_cap_in, set_ctx,
> capability);
> +	memcpy(set_hca_cap, dev->caps.hca_cur[MLX5_CAP_ODP],
> +	       MLX5_ST_SZ_BYTES(odp_cap));
> +
> +	/* set ODP SRQ support for RC/UD and XRC transports */
> +	MLX5_SET(odp_cap, set_hca_cap, ud_odp_caps.srq_receive,
> +		 (MLX5_CAP_ODP_MAX(dev, ud_odp_caps.srq_receive)));
> +
> +	MLX5_SET(odp_cap, set_hca_cap, rc_odp_caps.srq_receive,
> +		 (MLX5_CAP_ODP_MAX(dev, rc_odp_caps.srq_receive)));
> +
> +	MLX5_SET(odp_cap, set_hca_cap, xrc_odp_caps.srq_receive,
> +		 (MLX5_CAP_ODP_MAX(dev, xrc_odp_caps.srq_receive)));
> +

Redundant parentheses around the 3rd parameter. 

> +	err = set_caps(dev, set_ctx, set_sz,
> MLX5_SET_HCA_CAP_OP_MOD_ODP);
> +
> +	kfree(set_ctx);
> +	return err;
> +}
> +
>  static int handle_hca_cap(struct mlx5_core_dev *dev)
>  {
>  	void *set_ctx = NULL;
> @@ -926,6 +973,12 @@ static int mlx5_load_one(struct mlx5_core_dev
> *dev, struct mlx5_priv *priv,
>  		goto reclaim_boot_pages;
>  	}
>  
> +	err = handle_hca_cap_odp(dev);
> +	if (err) {
> +		dev_err(&pdev->dev, "handle_hca_cap_odp failed\n");
> +		goto reclaim_boot_pages;
> +	}
> +
>  	err = mlx5_satisfy_startup_pages(dev, 0);
>  	if (err) {
>  		dev_err(&pdev->dev, "failed to allocate init pages\n");
> diff --git a/include/linux/mlx5/device.h
> b/include/linux/mlx5/device.h
> index 8c4a820bd4c1..0845a227a7b2 100644
> --- a/include/linux/mlx5/device.h
> +++ b/include/linux/mlx5/device.h
> @@ -1201,6 +1201,9 @@ enum mlx5_qcam_feature_groups {
>  #define MLX5_CAP_ODP(mdev, cap)\
>  	MLX5_GET(odp_cap, mdev->caps.hca_cur[MLX5_CAP_ODP], cap)
>  
> +#define MLX5_CAP_ODP_MAX(mdev, cap)\
> +	MLX5_GET(odp_cap, mdev->caps.hca_max[MLX5_CAP_ODP], cap)
> +
>  #define MLX5_CAP_VECTOR_CALC(mdev, cap) \
>  	MLX5_GET(vector_calc_cap, \
>  		 mdev->caps.hca_cur[MLX5_CAP_VECTOR_CALC], cap)
> diff --git a/include/linux/mlx5/mlx5_ifc.h
> b/include/linux/mlx5/mlx5_ifc.h
> index 5407db8ba8e1..c5c679390fbd 100644
> --- a/include/linux/mlx5/mlx5_ifc.h
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -72,6 +72,7 @@ enum {
>  
>  enum {
>  	MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE        = 0x0,
> +	MLX5_SET_HCA_CAP_OP_MOD_ODP                   = 0x2,
>  	MLX5_SET_HCA_CAP_OP_MOD_ATOMIC                = 0x3,
>  };
>  

^ permalink raw reply

* [PATCH net-next 2/2] net: marvell: mvpp2: fix lack of link interrupts
From: Russell King @ 2019-02-04 23:35 UTC (permalink / raw)
  To: Antoine Tenart, Maxime Chevallier; +Cc: Baruch Siach, David S. Miller, netdev

Sven Auhagen reports that if he changes a SFP+ module for a SFP module
on the Macchiatobin Single Shot, the link does not come back up.  For
Sven, it is as easy as:

- Insert a SFP+ module connected, and use ping6 to verify link is up.
- Remove SFP+ module
- Insert SFP 1000base-X module use ping6 to verify link is up: Link
  up event did not trigger and the link is down

but that doesn't show the problem for me.  Locally, this has been
reproduced by:

- Boot with no modules.
- Insert SFP+ module, confirm link is up.
- Replace module with 25000base-X module.  Confirm link is up.
- Set remote end down, link is reported as dropped at both ends.
- Set remote end up, link is reported up at remote end, but not local
  end due to lack of link interrupt.

Fix this by setting up both GMAC and XLG interrupts for port 0, but
only unmasking the appropriate interrupt according to the current mode
set in the mac_config() method.  However, only do the mask/unmask
dance when we are really changing the link mode to avoid missing any
link interrupts.

Tested-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
Possibly needs backporting to stable kernels, but would conflict with
patch 1. That could be solved by re-ordering the patches.

 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index c54ed19dcdae..d8974c446f8e 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -1133,7 +1133,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
 {
 	u32 val;
 
-	if (phy_interface_mode_is_rgmii(port->phy_interface) ||
+	if (port->phylink ||
+	    phy_interface_mode_is_rgmii(port->phy_interface) ||
 	    phy_interface_mode_is_8023z(port->phy_interface) ||
 	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		val = readl(port->base + MVPP22_GMAC_INT_MASK);
@@ -4635,6 +4636,7 @@ static void mvpp2_mac_config(struct net_device *dev, unsigned int mode,
 			     const struct phylink_link_state *state)
 {
 	struct mvpp2_port *port = netdev_priv(dev);
+	bool change_interface = port->phy_interface != state->interface;
 
 	/* Check for invalid configuration */
 	if (state->interface == PHY_INTERFACE_MODE_10GKR && port->gop_id != 0) {
@@ -4644,14 +4646,16 @@ static void mvpp2_mac_config(struct net_device *dev, unsigned int mode,
 
 	/* Make sure the port is disabled when reconfiguring the mode */
 	mvpp2_port_disable(port);
+	if (change_interface) {
+		mvpp22_gop_mask_irq(port);
 
-	if (port->priv->hw_version == MVPP22 &&
-	    port->phy_interface != state->interface) {
-		port->phy_interface = state->interface;
+		if (port->priv->hw_version == MVPP22) {
+			port->phy_interface = state->interface;
 
-		/* Reconfigure the serdes lanes */
-		phy_power_off(port->comphy);
-		mvpp22_mode_reconfigure(port);
+			/* Reconfigure the serdes lanes */
+			phy_power_off(port->comphy);
+			mvpp22_mode_reconfigure(port);
+		}
 	}
 
 	/* mac (re)configuration */
@@ -4665,6 +4669,9 @@ static void mvpp2_mac_config(struct net_device *dev, unsigned int mode,
 	if (port->priv->hw_version == MVPP21 && port->flags & MVPP2_F_LOOPBACK)
 		mvpp2_port_loopback_set(port, state);
 
+	if (change_interface)
+		mvpp22_gop_unmask_irq(port);
+
 	mvpp2_port_enable(port);
 }
 
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 1/2] net: marvell: mvpp2: use phy_interface_mode_is_8023z() helper
From: Russell King @ 2019-02-04 23:35 UTC (permalink / raw)
  To: Antoine Tenart, Maxime Chevallier; +Cc: Baruch Siach, David S. Miller, netdev

Use the phy_interface_mode_is_8023z() helper for detecting interface
modes that use 802.3z serial encoding.  This is equivalent to testing
for both 1000base-X and 2500base-X.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 38 ++++++++++---------------
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index f1dab0b55769..c54ed19dcdae 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -1090,9 +1090,8 @@ static void mvpp22_gop_unmask_irq(struct mvpp2_port *port)
 	u32 val;
 
 	if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-	    port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-	    port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
+	    phy_interface_mode_is_8023z(port->phy_interface) ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		/* Enable the GMAC link status irq for this port */
 		val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
 		val |= MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
@@ -1122,9 +1121,8 @@ static void mvpp22_gop_mask_irq(struct mvpp2_port *port)
 	}
 
 	if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-	    port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-	    port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
+	    phy_interface_mode_is_8023z(port->phy_interface) ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
 		val &= ~MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
 		writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK);
@@ -1136,9 +1134,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
 	u32 val;
 
 	if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-	    port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-	    port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
+	    phy_interface_mode_is_8023z(port->phy_interface) ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		val = readl(port->base + MVPP22_GMAC_INT_MASK);
 		val |= MVPP22_GMAC_INT_MASK_LINK_STAT;
 		writel(val, port->base + MVPP22_GMAC_INT_MASK);
@@ -1259,9 +1256,8 @@ static void mvpp2_port_loopback_set(struct mvpp2_port *port,
 	else
 		val &= ~MVPP2_GMAC_GMII_LB_EN_MASK;
 
-	if (port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-	    port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    port->phy_interface == PHY_INTERFACE_MODE_2500BASEX)
+	if (phy_interface_mode_is_8023z(port->phy_interface) ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII)
 		val |= MVPP2_GMAC_PCS_LB_EN_MASK;
 	else
 		val &= ~MVPP2_GMAC_PCS_LB_EN_MASK;
@@ -2487,9 +2483,8 @@ static irqreturn_t mvpp2_link_status_isr(int irq, void *dev_id)
 				link = true;
 		}
 	} else if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-		   port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-		   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-		   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
+		   phy_interface_mode_is_8023z(port->phy_interface) ||
+		   port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		val = readl(port->base + MVPP22_GMAC_INT_STAT);
 		if (val & MVPP22_GMAC_INT_STAT_LINK) {
 			event = true;
@@ -4581,8 +4576,7 @@ static void mvpp2_gmac_config(struct mvpp2_port *port, unsigned int mode,
 	ctrl0 &= ~MVPP2_GMAC_PORT_TYPE_MASK;
 	ctrl2 &= ~(MVPP2_GMAC_PORT_RESET_MASK | MVPP2_GMAC_PCS_ENABLE_MASK);
 
-	if (state->interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    state->interface == PHY_INTERFACE_MODE_2500BASEX) {
+	if (phy_interface_mode_is_8023z(state->interface)) {
 		/* 1000BaseX and 2500BaseX ports cannot negotiate speed nor can
 		 * they negotiate duplex: they are always operating with a fixed
 		 * speed of 1000/2500Mbps in full duplex, so force 1000/2500
@@ -4602,9 +4596,8 @@ static void mvpp2_gmac_config(struct mvpp2_port *port, unsigned int mode,
 	if (phylink_test(state->advertising, Asym_Pause))
 		an |= MVPP2_GMAC_FC_ADV_ASM_EN;
 
-	if (state->interface == PHY_INTERFACE_MODE_SGMII ||
-	    state->interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    state->interface == PHY_INTERFACE_MODE_2500BASEX) {
+	if (phy_interface_mode_is_8023z(state->interface) ||
+	    state->interface == PHY_INTERFACE_MODE_SGMII) {
 		an |= MVPP2_GMAC_IN_BAND_AUTONEG;
 		ctrl2 |= MVPP2_GMAC_INBAND_AN_MASK | MVPP2_GMAC_PCS_ENABLE_MASK;
 
@@ -4665,9 +4658,8 @@ static void mvpp2_mac_config(struct net_device *dev, unsigned int mode,
 	if (state->interface == PHY_INTERFACE_MODE_10GKR)
 		mvpp2_xlg_config(port, mode, state);
 	else if (phy_interface_mode_is_rgmii(state->interface) ||
-		 state->interface == PHY_INTERFACE_MODE_SGMII ||
-		 state->interface == PHY_INTERFACE_MODE_1000BASEX ||
-		 state->interface == PHY_INTERFACE_MODE_2500BASEX)
+		 phy_interface_mode_is_8023z(state->interface) ||
+		 state->interface == PHY_INTERFACE_MODE_SGMII)
 		mvpp2_gmac_config(port, mode, state);
 
 	if (port->priv->hw_version == MVPP21 && port->flags & MVPP2_F_LOOPBACK)
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH net-next v2 1/1] net/mlx5: Fix code style issue in mlx driver
From: Saeed Mahameed @ 2019-02-04 23:26 UTC (permalink / raw)
  To: Tonghao Zhang; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <1548797004-45137-1-git-send-email-xiangxia.m.yue@gmail.com>

On Thu, Jan 31, 2019 at 5:20 PM <xiangxia.m.yue@gmail.com> wrote:
>
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
>
> Add the tab before '}' and keep the code style consistent.
>
> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>

Dave, you can take this patch to net-next.

Thanks,
Saeed.

^ permalink raw reply

* Re: [PATCH v2] net: dsa: mv88e6xxx: Revise irq setup ordering
From: Andrew Lunn @ 2019-02-04 23:14 UTC (permalink / raw)
  To: John David Anglin; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <824d011b-3692-69c3-5e2c-58e950a80abf@bell.net>

On Mon, Feb 04, 2019 at 04:59:13PM -0500, John David Anglin wrote:
> This change fixes a race condition in the setup of hardware irqs and the
> code enabling PHY link detection in the mv88e6xxx driver.
> 
> This race was observed on the espressobin board where the GPIO interrupt
> controller only supports edge interrupts.  If the INTn output pin goes low
> before the GPIO interrupt is enabled, PHY link interrupts are not detected.
> 
> With this change, we
> 1) force INTn high by clearing all interrupt enables in global 1 control 1,
> 2) setup the hardware irq, and then
> 3) perform the remaining common setup.
> 
> This simplifies the setup and allows some unnecessary code to be removed.

Hi Dave

I took a closer look now. I don't actually see why the current code is
wrong.

mv88e6xxx_g1_irq_setup() calls mv88e6xxx_g1_irq_setup_common() and
then registers the interrupt handler.

mv88e6xxx_g1_irq_setup_common() does what you want, it masks all
interrupts in the hardware and clears any pending interrupts which can
be cleared.

The change you made is actually dangerous. As soon as you request the
interrupt, it is live, it can fire, and call
mv88e6xxx_g1_irq_thread_work(). That needs the irq domain. But the
change you made defers the creating of the domain until after the
interrupt is registered. So we can de-refernece a NULL pointer in the
interrupt handler.

	  Andrew

^ permalink raw reply

* Re: [PATCH 1/3 net-next] net: phy: aquantia: improve setting speed and duplex in aqr_read_status
From: Heiner Kallweit @ 2019-02-04 23:06 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, David Miller, Nikita Yushchenko,
	netdev@vger.kernel.org
In-Reply-To: <20190204222348.GD3397@lunn.ch>

On 04.02.2019 23:23, Andrew Lunn wrote:
>> I'd like to use standard registers wherever possible. This patch is
>> meant as a quick win to improve what we do already in aqr_read_status.
>> Once we have a generic c45 read_status function we should switch to it.
> 
> Hi Heiner
> 
> I don't see much point in adding code which we know we are soon going
> to replace. Just replace it.
> 
OK, let me have a closer look at the other patches you sent me.
To test them I need to get my DTU running first. And I need to check
what happens if certain standard registers don't report what they
should and how to deal with this. E.g. the AQCS109 according to the
datasheet reports in the speed ability register that it is 10G-capable,
what it is not.

>> However I assume that information like interface mode we still have
>> to read from vendor registers.
> 
> For the Aquantia PHY, yes. It appears the Marvell PHY does not have
> any registers which indicate this, so it uses heuristics based on the
> link speed.
> 
>     Andrew
> 
Heiner

^ permalink raw reply

* Re: [PATCH] net: phylink: dsa: mv88e6xxx: Revise irq setup ordering
From: Andrew Lunn @ 2019-02-04 22:47 UTC (permalink / raw)
  To: John David Anglin; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <c536ef45-0d5a-5d7e-d9f2-4ff030a34eb9@bell.net>

On Mon, Feb 04, 2019 at 04:38:53PM -0500, John David Anglin wrote:
> On 2019-02-04 3:19 p.m., Andrew Lunn wrote:
> > The IRQ core would do this if it was needed.
> >
> > How many other irq thread work functions can you point to which do
> > something similar?
> This is comment for handle_edge_irq:
> 
> /**
>  *    handle_edge_irq - edge type IRQ handler
>  *    @desc:    the interrupt description structure for this irq
>  *
>  *    Interrupt occures on the falling and/or rising edge of a hardware
>  *    signal. The occurrence is latched into the irq controller hardware
>  *    and must be acked in order to be reenabled. After the ack another
>  *    interrupt can happen on the same source even before the first one
>  *    is handled by the associated event handler. If this happens it
>  *    might be necessary to disable (mask) the interrupt depending on the
>  *    controller hardware. This requires to reenable the interrupt inside
>  *    of the loop which handles the interrupts which have arrived while
>  *    the handler was running. If all pending interrupts are handled, the
>  *    loop is left.
>  */
> 
> As can be seen, the above comment suggests that it may be necessary to
> disable (mask) interrupt
> as I proposed.

Hi Dave

This comment is describing what handle_edge_irq() actually does. Read
the code. It does not say anything about that the handling thread
function should do.

	 Andrew

^ permalink raw reply

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
From: Daniel Borkmann @ 2019-02-04 22:33 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, kernel-team, Lawrence Brakmo
In-Reply-To: <20190201070356.4148323-1-kafai@fb.com>

Hi Martin,

On 02/01/2019 08:03 AM, Martin KaFai Lau wrote:
> In kernel, it is common to check "!skb->sk && sk_fullsock(skb->sk)"
> before accessing the fields in sock.  For example, in __netdev_pick_tx:
> 
> static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
> 			    struct net_device *sb_dev)
> {
> 	/* ... */
> 
> 	struct sock *sk = skb->sk;
> 
> 		if (queue_index != new_index && sk &&
> 		    sk_fullsock(sk) &&
> 		    rcu_access_pointer(sk->sk_dst_cache))
> 			sk_tx_queue_set(sk, new_index);
> 
> 	/* ... */
> 
> 	return queue_index;
> }
> 
> This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
> where a few of the convert_ctx_access() in filter.c has already been
> accessing the skb->sk sock_common's fields,
> e.g. sock_ops_convert_ctx_access().
> 
> "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
> Some of the fileds in "bpf_sock" will not be directly
> accessible through the "__sk_buff->sk" pointer.  It is limited
> by the new "bpf_sock_common_is_valid_access()".
> e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
>      are not allowed.
> 
> The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
> can be used to get a sk with all accessible fields in "bpf_sock".
> This helper is added to both cg_skb and sched_(cls|act).
> 
> int cg_skb_foo(struct __sk_buff *skb) {
> 	struct bpf_sock *sk;
> 	__u32 family;
> 
> 	sk = skb->sk;
> 	if (!sk)
> 		return 1;
> 
> 	sk = bpf_sk_fullsock(sk);
> 	if (!sk)
> 		return 1;
> 
> 	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
> 		return 1;
> 
> 	/* some_traffic_shaping(); */
> 
> 	return 1;
> }
> 
> (1) The sk is read only
> 
> (2) There is no new "struct bpf_sock_common" introduced.
> 
> (3) Future kernel sock's members could be added to bpf_sock only
>     instead of repeatedly adding at multiple places like currently
>     in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
> 
> (4) After "sk = skb->sk", the reg holding sk is in type
>     PTR_TO_SOCK_COMMON_OR_NULL.
> 
> (5) After bpf_sk_fullsock(), the return type will be in type
>     PTR_TO_SOCKET_OR_NULL which is the same as the return type of
>     bpf_sk_lookup_xxx().
> 
>     However, bpf_sk_fullsock() does not take refcnt.  The
>     acquire_reference_state() is only depending on the return type now.
>     To avoid it, a new is_acquire_function() is checked before calling
>     acquire_reference_state().

Bit unfortunate that a helper like bpf_sk_fullsock() would be needed, after
all this is more of an implementation detail which we would expose here to
the developer.

Is there a specific reason why fetching skb->sk couldn't already be of the
type PTR_TO_SOCKET_OR_NULL such that the bpf_sk_fullsock() step wouldn't be
needed and most logic we have today could already be reused (modulo refcnt
avoidance)?

In particular, do you need the skb->sk without the full-sk part somewhere
(e.g. in tw socks)? Why not doing something like sk_to_full_sk() inside the
helper or even better as BPF ctx rewrite upon skb->sk to fetch the full sk
parent where you could also access remaining bpf_sock fields?

This could then also be plugged into bpf_tcp_sock() given this needs to be
full sk anyway.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH v1] net: dsa: qca8k: implement DT-based ports <-> phy translation
From: Andrew Lunn @ 2019-02-04 22:26 UTC (permalink / raw)
  To: Christian Lamparter; +Cc: netdev, Florian Fainelli, Vivien Didelot
In-Reply-To: <20190204213555.26054-1-chunkeey@gmail.com>

On Mon, Feb 04, 2019 at 10:35:55PM +0100, Christian Lamparter wrote:
> The QCA8337 enumerates 5 PHYs on the MDC/MDIO access: PHY0-PHY4.
> Based on the System Block Diagram in Section 1.2 of the
> QCA8337's datasheet. These PHYs are internally connected
> to MACs of PORT 1 - PORT 5.

Hi Christian

Is the off-by-one the problem here?

Thanks
	Andrew

^ permalink raw reply

* Re: [PATCH 1/3 net-next] net: phy: aquantia: improve setting speed and duplex in aqr_read_status
From: Andrew Lunn @ 2019-02-04 22:23 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Florian Fainelli, David Miller, Nikita Yushchenko,
	netdev@vger.kernel.org
In-Reply-To: <d6ceeb7c-7502-7332-ce6e-b95b4d9b2be1@gmail.com>

> I'd like to use standard registers wherever possible. This patch is
> meant as a quick win to improve what we do already in aqr_read_status.
> Once we have a generic c45 read_status function we should switch to it.

Hi Heiner

I don't see much point in adding code which we know we are soon going
to replace. Just replace it.

> However I assume that information like interface mode we still have
> to read from vendor registers.

For the Aquantia PHY, yes. It appears the Marvell PHY does not have
any registers which indicate this, so it uses heuristics based on the
link speed.

    Andrew

^ permalink raw reply

* Re: [PATCH bpf-next] selftests/bpf: use localhost in tcp_{server,client}.py
From: Daniel Borkmann @ 2019-02-04 22:16 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev; +Cc: davem, ast
In-Reply-To: <20190204184319.177504-1-sdf@google.com>

On 02/04/2019 07:43 PM, Stanislav Fomichev wrote:
> Bind and connect to localhost. There is no reason for this test to
> use non-localhost interface. This lets us run this test in a network
> namespace.
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Applied, thanks!

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox