Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net/mlx5e: Return in default case statement in tx_post_resync_params
From: Leon Romanovsky @ 2019-07-10  4:24 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Nick Desaulniers, Saeed Mahameed, David S. Miller, Boris Pismenny,
	netdev, linux-rdma, LKML, clang-built-linux
In-Reply-To: <20190709231024.GA61953@archlinux-threadripper>

On Tue, Jul 09, 2019 at 04:10:24PM -0700, Nathan Chancellor wrote:
> On Tue, Jul 09, 2019 at 03:44:59PM -0700, Nick Desaulniers wrote:
> > On Mon, Jul 8, 2019 at 4:13 PM Nathan Chancellor
> > <natechancellor@gmail.com> wrote:
> > >
> > > clang warns:
> > >
> > > drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:251:2:
> > > warning: variable 'rec_seq_sz' is used uninitialized whenever switch
> > > default is taken [-Wsometimes-uninitialized]
> > >         default:
> > >         ^~~~~~~
> > > drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:255:46: note:
> > > uninitialized use occurs here
> > >         skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
> > >                                                     ^~~~~~~~~~
> > > drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:239:16: note:
> > > initialize the variable 'rec_seq_sz' to silence this warning
> > >         u16 rec_seq_sz;
> > >                       ^
> > >                        = 0
> > > 1 warning generated.
> > >
> > > This case statement was clearly designed to be one that should not be
> > > hit during runtime because of the WARN_ON statement so just return early
> > > to prevent copying uninitialized memory up into rn_be.
> > >
> > > Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
> > > Link: https://github.com/ClangBuiltLinux/linux/issues/590
> > > Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> > > index 3f5f4317a22b..5c08891806f0 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> > > @@ -250,6 +250,7 @@ tx_post_resync_params(struct mlx5e_txqsq *sq,
> > >         }
> > >         default:
> > >                 WARN_ON(1);
> > > +               return;
> > >         }
> >
> > hmm...a switch statement with a single case is a code smell.  How
> > about a single conditional with early return?  Then the "meat" of the
> > happy path doesn't need an additional level of indentation.
> > --
> > Thanks,
> > ~Nick Desaulniers
>
> I assume that the reason for this is there may be other cipher types
> added in the future? I suppose the maintainers can give more clarity to
> that.

Our devices supports extra ciphers, for example TLS_CIPHER_AES_GCM_256.
So I assume this was the reason for switch<->case, but because such
implementation doesn't exist in any driver, I recommend to rewrite the
code to have "if" statement and return early.

>
> Furthermore, if they want the switch statements to remain, it looks like
> fill_static_params_ctx also returns in the default statement so it seems
> like this is the right fix.
>
> Cheers,
> Nathan

^ permalink raw reply

* Re: [bpf PATCH v2 2/6] bpf: tls fix transition through disconnect with close
From: John Fastabend @ 2019-07-10  3:39 UTC (permalink / raw)
  To: Jakub Kicinski, John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <20190709194525.0d4c15a6@cakuba.netronome.com>

Jakub Kicinski wrote:
> On Mon, 08 Jul 2019 19:14:05 +0000, John Fastabend wrote:
> > @@ -287,6 +313,27 @@ static void tls_sk_proto_cleanup(struct sock *sk,
> >  #endif
> >  }
> >  
> > +static void tls_sk_proto_unhash(struct sock *sk)
> > +{
> > +	struct inet_connection_sock *icsk = inet_csk(sk);
> > +	long timeo = sock_sndtimeo(sk, 0);
> > +	struct tls_context *ctx;
> > +
> > +	if (unlikely(!icsk->icsk_ulp_data)) {
> 
> Is this for when sockmap is stacked on top of TLS and TLS got removed
> without letting sockmap know?

Right its a pattern I used on the sockmap side and put here. But
I dropped the patch to let sockmap stack on top of TLS because
it was more than a fix IMO. We could probably drop this check on
the other hand its harmless.
> 
> > +		if (sk->sk_prot->unhash)
> > +			sk->sk_prot->unhash(sk);
> > +	}
> > +
> > +	ctx = tls_get_ctx(sk);
> > +	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
> > +		tls_sk_proto_cleanup(sk, ctx, timeo);
> > +	icsk->icsk_ulp_data = NULL;
> 
> I think close only starts checking if ctx is NULL in patch 6.
> Looks like some chunks of ctx checking/clearing got spread to
> patch 1 and some to patch 6.

Yeah, I thought the patches were easier to read this way but
maybe not. Could add something in the commit log.

> 
> > +	tls_ctx_free_wq(ctx);
> > +
> > +	if (ctx->unhash)
> > +		ctx->unhash(sk);
> > +}
> > +
> >  static void tls_sk_proto_close(struct sock *sk, long timeout)
> >  {
> >  	struct tls_context *ctx = tls_get_ctx(sk);
> 



^ permalink raw reply

* Re: [bpf PATCH v2 6/6] bpf: sockmap/tls, close can race with map free
From: John Fastabend @ 2019-07-10  3:33 UTC (permalink / raw)
  To: Jakub Kicinski, John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <20190709193846.62f0a2c7@cakuba.netronome.com>

Jakub Kicinski wrote:
> On Mon, 08 Jul 2019 19:15:18 +0000, John Fastabend wrote:
> > @@ -352,15 +354,18 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
> >  	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
> >  		goto skip_tx_cleanup;
> >  
> > -	sk->sk_prot = ctx->sk_proto;
> >  	tls_sk_proto_cleanup(sk, ctx, timeo);
> >  
> >  skip_tx_cleanup:
> > +	write_lock_bh(&sk->sk_callback_lock);
> > +	icsk->icsk_ulp_data = NULL;
> 
> Is ulp_data pointer now supposed to be updated under the
> sk_callback_lock?

Yes otherwise it can race with tls_update(). I didn't remove the
ulp pointer null set from tcp_ulp.c though. Could be done in this
patch or as a follow up.
> 
> > +	if (sk->sk_prot->close == tls_sk_proto_close)
> > +		sk->sk_prot = ctx->sk_proto;
> > +	write_unlock_bh(&sk->sk_callback_lock);
> >  	release_sock(sk);
> >  	if (ctx->rx_conf == TLS_SW)
> >  		tls_sw_release_strp_rx(ctx);
> > -	sk_proto_close(sk, timeout);
> > -
> > +	ctx->sk_proto_close(sk, timeout);
> >  	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
> >  	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
> >  		tls_ctx_free(ctx);

^ permalink raw reply

* Re: [bpf PATCH v2 0/6] bpf: sockmap/tls fixes
From: John Fastabend @ 2019-07-10  3:28 UTC (permalink / raw)
  To: Jakub Kicinski, John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <20190709192145.473d2d80@cakuba.netronome.com>

Jakub Kicinski wrote:
> On Tue, 9 Jul 2019 17:04:59 -0700, Jakub Kicinski wrote:
> > On Tue, 09 Jul 2019 08:40:14 -0700, John Fastabend wrote:
> > > Jakub Kicinski wrote:  
> > > > Looks like strparser is not done'd for offload?    
> > > 
> > > Right so if rx_conf != TLS_SW then the hardware needs to do
> > > the strparser functionality.  
> > 
> > Can I just take a stab at fixing the HW part?
> > 
> > Can I rebase this onto net-next?  There are a few patches from net
> > missing in the bpf tree.
> 
> I think I fixed patch 1 for offload, I need to test it a little more
> and I'll send it back to you. In the meantime, let me ask some
> questions about the other two :)

Great thanks. When your ready push it back and I'll retest in
my setup.

^ permalink raw reply

* [PATCH v3 0/3] kernel/notifier.c: avoid duplicate registration
From: Xiaoming Ni @ 2019-07-10  3:09 UTC (permalink / raw)
  To: adobriyan, akpm, anna.schumaker, arjan, bfields, chuck.lever,
	davem, gregkh, jlayton, luto, mingo, Nadia.Derbey, paulmck,
	semen.protsenko, stable, stern, tglx, torvalds, trond.myklebust,
	viresh.kumar, vvs
  Cc: alex.huangjianhui, dylix.dailei, nixiaoming, linux-kernel,
	linux-nfs, netdev

Registering the same notifier to a hook repeatedly can cause the hook
list to form a ring or lose other members of the list.

case1: An infinite loop in notifier_chain_register() can cause soft lockup
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_chain_register(&test_notifier_list, &test2);

case2: An infinite loop in notifier_chain_register() can cause soft lockup
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_call_chain(&test_notifier_list, 0, NULL);

case3: lose other hook test2
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_chain_register(&test_notifier_list, &test2);
        atomic_notifier_chain_register(&test_notifier_list, &test1);

case4: Unregister returns 0, but the hook is still in the linked list,
        and it is not really registered. If you call notifier_call_chain
        after ko is unloaded, it will trigger oops. if the system is
       	configured with softlockup_panic and the same hook is repeatedly
       	registered on the panic_notifier_list, it will cause a loop panic.

so. need add a check in in notifier_chain_register() to avoid duplicate
registration

v1:
* use notifier_chain_cond_register replace notifier_chain_register

v2:
* Add a check in notifier_chain_register() to avoid duplicate registration
* remove notifier_chain_cond_register() to avoid duplicate code 
* remove blocking_notifier_chain_cond_register() to avoid duplicate code

v3:
* Add a cover letter.

Xiaoming Ni (3):
  kernel/notifier.c: avoid duplicate registration
  kernel/notifier.c: remove notifier_chain_cond_register()
  kernel/notifier.c: remove blocking_notifier_chain_cond_register()

 include/linux/notifier.h |  4 ----
 kernel/notifier.c        | 41 +++--------------------------------------
 net/sunrpc/rpc_pipe.c    |  2 +-
 3 files changed, 4 insertions(+), 43 deletions(-)

-- 
1.8.5.6


^ permalink raw reply

* [PATCH v3 1/3] kernel/notifier.c: avoid duplicate registration
From: Xiaoming Ni @ 2019-07-10  3:09 UTC (permalink / raw)
  To: adobriyan, akpm, anna.schumaker, arjan, bfields, chuck.lever,
	davem, gregkh, jlayton, luto, mingo, Nadia.Derbey, paulmck,
	semen.protsenko, stable, stern, tglx, torvalds, trond.myklebust,
	viresh.kumar, vvs
  Cc: alex.huangjianhui, dylix.dailei, nixiaoming, linux-kernel,
	linux-nfs, netdev

Registering the same notifier to a hook repeatedly can cause the hook
list to form a ring or lose other members of the list.

case1: An infinite loop in notifier_chain_register() can cause soft lockup
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_chain_register(&test_notifier_list, &test2);

case2: An infinite loop in notifier_chain_register() can cause soft lockup
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_call_chain(&test_notifier_list, 0, NULL);

case3: lose other hook test2
        atomic_notifier_chain_register(&test_notifier_list, &test1);
        atomic_notifier_chain_register(&test_notifier_list, &test2);
        atomic_notifier_chain_register(&test_notifier_list, &test1);

case4: Unregister returns 0, but the hook is still in the linked list,
        and it is not really registered. If you call notifier_call_chain
        after ko is unloaded, it will trigger oops.

If the system is configured with softlockup_panic and the same
hook is repeatedly registered on the panic_notifier_list, it
will cause a loop panic.

Add a check in notifier_chain_register() to avoid duplicate registration

Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
---
 kernel/notifier.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/notifier.c b/kernel/notifier.c
index d9f5081..30bedb8 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -23,7 +23,10 @@ static int notifier_chain_register(struct notifier_block **nl,
 		struct notifier_block *n)
 {
 	while ((*nl) != NULL) {
-		WARN_ONCE(((*nl) == n), "double register detected");
+		if (unlikely((*nl) == n)) {
+			WARN(1, "double register detected");
+			return 0;
+		}
 		if (n->priority > (*nl)->priority)
 			break;
 		nl = &((*nl)->next);
-- 
1.8.5.6


^ permalink raw reply related

* [PATCH v3 3/3] kernel/notifier.c: remove blocking_notifier_chain_cond_register()
From: Xiaoming Ni @ 2019-07-10  3:09 UTC (permalink / raw)
  To: adobriyan, akpm, anna.schumaker, arjan, bfields, chuck.lever,
	davem, gregkh, jlayton, luto, mingo, Nadia.Derbey, paulmck,
	semen.protsenko, stable, stern, tglx, torvalds, trond.myklebust,
	viresh.kumar, vvs
  Cc: alex.huangjianhui, dylix.dailei, nixiaoming, linux-kernel,
	linux-nfs, netdev

blocking_notifier_chain_cond_register() does not consider
system_booting state, which is the only difference between this
function and blocking_notifier_cain_register(). This can be a bug
and is a piece of duplicate code.

Delete blocking_notifier_chain_cond_register()

Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
---
 include/linux/notifier.h |  4 ----
 kernel/notifier.c        | 23 -----------------------
 net/sunrpc/rpc_pipe.c    |  2 +-
 3 files changed, 1 insertion(+), 28 deletions(-)

diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index 0096a05..0189476 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -150,10 +150,6 @@ extern int raw_notifier_chain_register(struct raw_notifier_head *nh,
 extern int srcu_notifier_chain_register(struct srcu_notifier_head *nh,
 		struct notifier_block *nb);
 
-extern int blocking_notifier_chain_cond_register(
-		struct blocking_notifier_head *nh,
-		struct notifier_block *nb);
-
 extern int atomic_notifier_chain_unregister(struct atomic_notifier_head *nh,
 		struct notifier_block *nb);
 extern int blocking_notifier_chain_unregister(struct blocking_notifier_head *nh,
diff --git a/kernel/notifier.c b/kernel/notifier.c
index e3d221f..63d7501 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -221,29 +221,6 @@ int blocking_notifier_chain_register(struct blocking_notifier_head *nh,
 EXPORT_SYMBOL_GPL(blocking_notifier_chain_register);
 
 /**
- *	blocking_notifier_chain_cond_register - Cond add notifier to a blocking notifier chain
- *	@nh: Pointer to head of the blocking notifier chain
- *	@n: New entry in notifier chain
- *
- *	Adds a notifier to a blocking notifier chain, only if not already
- *	present in the chain.
- *	Must be called in process context.
- *
- *	Currently always returns zero.
- */
-int blocking_notifier_chain_cond_register(struct blocking_notifier_head *nh,
-		struct notifier_block *n)
-{
-	int ret;
-
-	down_write(&nh->rwsem);
-	ret = notifier_chain_register(&nh->head, n);
-	up_write(&nh->rwsem);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(blocking_notifier_chain_cond_register);
-
-/**
  *	blocking_notifier_chain_unregister - Remove notifier from a blocking notifier chain
  *	@nh: Pointer to head of the blocking notifier chain
  *	@n: Entry to remove from notifier chain
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 126d314..1287f80 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -50,7 +50,7 @@
 
 int rpc_pipefs_notifier_register(struct notifier_block *nb)
 {
-	return blocking_notifier_chain_cond_register(&rpc_pipefs_notifier_list, nb);
+	return blocking_notifier_chain_register(&rpc_pipefs_notifier_list, nb);
 }
 EXPORT_SYMBOL_GPL(rpc_pipefs_notifier_register);
 
-- 
1.8.5.6


^ permalink raw reply related

* [PATCH v3 2/3] kernel/notifier.c: remove notifier_chain_cond_register()
From: Xiaoming Ni @ 2019-07-10  3:09 UTC (permalink / raw)
  To: adobriyan, akpm, anna.schumaker, arjan, bfields, chuck.lever,
	davem, gregkh, jlayton, luto, mingo, Nadia.Derbey, paulmck,
	semen.protsenko, stable, stern, tglx, torvalds, trond.myklebust,
	viresh.kumar, vvs
  Cc: alex.huangjianhui, dylix.dailei, nixiaoming, linux-kernel,
	linux-nfs, netdev

The only difference between notifier_chain_cond_register() and
notifier_chain_register() is the lack of warning hints for duplicate
registrations.
Consider using notifier_chain_register() instead of
notifier_chain_cond_register() to avoid duplicate code

Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
---
 kernel/notifier.c | 17 +----------------
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/kernel/notifier.c b/kernel/notifier.c
index 30bedb8..e3d221f 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -36,21 +36,6 @@ static int notifier_chain_register(struct notifier_block **nl,
 	return 0;
 }
 
-static int notifier_chain_cond_register(struct notifier_block **nl,
-		struct notifier_block *n)
-{
-	while ((*nl) != NULL) {
-		if ((*nl) == n)
-			return 0;
-		if (n->priority > (*nl)->priority)
-			break;
-		nl = &((*nl)->next);
-	}
-	n->next = *nl;
-	rcu_assign_pointer(*nl, n);
-	return 0;
-}
-
 static int notifier_chain_unregister(struct notifier_block **nl,
 		struct notifier_block *n)
 {
@@ -252,7 +237,7 @@ int blocking_notifier_chain_cond_register(struct blocking_notifier_head *nh,
 	int ret;
 
 	down_write(&nh->rwsem);
-	ret = notifier_chain_cond_register(&nh->head, n);
+	ret = notifier_chain_register(&nh->head, n);
 	up_write(&nh->rwsem);
 	return ret;
 }
-- 
1.8.5.6


^ permalink raw reply related

* Re: [bpf PATCH v2 2/6] bpf: tls fix transition through disconnect with close
From: Jakub Kicinski @ 2019-07-10  2:45 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <156261324561.31108.14410711674221391677.stgit@ubuntu3-kvm1>

On Mon, 08 Jul 2019 19:14:05 +0000, John Fastabend wrote:
> @@ -287,6 +313,27 @@ static void tls_sk_proto_cleanup(struct sock *sk,
>  #endif
>  }
>  
> +static void tls_sk_proto_unhash(struct sock *sk)
> +{
> +	struct inet_connection_sock *icsk = inet_csk(sk);
> +	long timeo = sock_sndtimeo(sk, 0);
> +	struct tls_context *ctx;
> +
> +	if (unlikely(!icsk->icsk_ulp_data)) {

Is this for when sockmap is stacked on top of TLS and TLS got removed
without letting sockmap know?

> +		if (sk->sk_prot->unhash)
> +			sk->sk_prot->unhash(sk);
> +	}
> +
> +	ctx = tls_get_ctx(sk);
> +	if (ctx->tx_conf == TLS_SW || ctx->rx_conf == TLS_SW)
> +		tls_sk_proto_cleanup(sk, ctx, timeo);
> +	icsk->icsk_ulp_data = NULL;

I think close only starts checking if ctx is NULL in patch 6.
Looks like some chunks of ctx checking/clearing got spread to
patch 1 and some to patch 6.

> +	tls_ctx_free_wq(ctx);
> +
> +	if (ctx->unhash)
> +		ctx->unhash(sk);
> +}
> +
>  static void tls_sk_proto_close(struct sock *sk, long timeout)
>  {
>  	struct tls_context *ctx = tls_get_ctx(sk);


^ permalink raw reply

* Re: [bpf PATCH v2 6/6] bpf: sockmap/tls, close can race with map free
From: Jakub Kicinski @ 2019-07-10  2:38 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <156261331866.31108.6405316261950259075.stgit@ubuntu3-kvm1>

On Mon, 08 Jul 2019 19:15:18 +0000, John Fastabend wrote:
> @@ -352,15 +354,18 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
>  	if (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)
>  		goto skip_tx_cleanup;
>  
> -	sk->sk_prot = ctx->sk_proto;
>  	tls_sk_proto_cleanup(sk, ctx, timeo);
>  
>  skip_tx_cleanup:
> +	write_lock_bh(&sk->sk_callback_lock);
> +	icsk->icsk_ulp_data = NULL;

Is ulp_data pointer now supposed to be updated under the
sk_callback_lock?

> +	if (sk->sk_prot->close == tls_sk_proto_close)
> +		sk->sk_prot = ctx->sk_proto;
> +	write_unlock_bh(&sk->sk_callback_lock);
>  	release_sock(sk);
>  	if (ctx->rx_conf == TLS_SW)
>  		tls_sw_release_strp_rx(ctx);
> -	sk_proto_close(sk, timeout);
> -
> +	ctx->sk_proto_close(sk, timeout);
>  	if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW &&
>  	    ctx->tx_conf != TLS_HW_RECORD && ctx->rx_conf != TLS_HW_RECORD)
>  		tls_ctx_free(ctx);


^ permalink raw reply

* Re: [bpf PATCH v2 6/6] bpf: sockmap/tls, close can race with map free
From: Jakub Kicinski @ 2019-07-10  2:36 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <156261331866.31108.6405316261950259075.stgit@ubuntu3-kvm1>

On Mon, 08 Jul 2019 19:15:18 +0000, John Fastabend wrote:
> @@ -836,22 +841,39 @@ static int tls_init(struct sock *sk)

There is a goto out above this which has to be turned into return 0;
if out now releases the lock.

>  	if (sk->sk_state != TCP_ESTABLISHED)
>  		return -ENOTSUPP;
>  
> +	tls_build_proto(sk);
> +
>  	/* allocate tls context */
> +	write_lock_bh(&sk->sk_callback_lock);
>  	ctx = create_ctx(sk);
>  	if (!ctx) {
>  		rc = -ENOMEM;
>  		goto out;
>  	}
>  
> -	tls_build_proto(sk);
>  	ctx->tx_conf = TLS_BASE;
>  	ctx->rx_conf = TLS_BASE;
>  	ctx->sk_proto = sk->sk_prot;
>  	update_sk_prot(sk, ctx);
>  out:
> +	write_unlock_bh(&sk->sk_callback_lock);
>  	return rc;
>  }

^ permalink raw reply

* Re: [PATCH bpf-next v3] virtio_net: add XDP meta data support
From: Jason Wang @ 2019-07-10  2:30 UTC (permalink / raw)
  To: Daniel Borkmann, Yuya Kusakabe
  Cc: ast, davem, hawk, jakub.kicinski, john.fastabend, kafai, mst,
	netdev, songliubraving, yhs
In-Reply-To: <116cdb35-57b3-e2fe-ef8a-05cc6a1afbbe@iogearbox.net>


On 2019/7/10 上午4:03, Daniel Borkmann wrote:
> On 07/09/2019 05:04 AM, Jason Wang wrote:
>> On 2019/7/9 上午6:38, Daniel Borkmann wrote:
>>> On 07/02/2019 04:11 PM, Yuya Kusakabe wrote:
>>>> On 7/2/19 5:33 PM, Jason Wang wrote:
>>>>> On 2019/7/2 下午4:16, Yuya Kusakabe wrote:
>>>>>> This adds XDP meta data support to both receive_small() and
>>>>>> receive_mergeable().
>>>>>>
>>>>>> Fixes: de8f3a83b0a0 ("bpf: add meta pointer for direct access")
>>>>>> Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com>
>>>>>> ---
>>>>>> v3:
>>>>>>     - fix preserve the vnet header in receive_small().
>>>>>> v2:
>>>>>>     - keep copy untouched in page_to_skb().
>>>>>>     - preserve the vnet header in receive_small().
>>>>>>     - fix indentation.
>>>>>> ---
>>>>>>     drivers/net/virtio_net.c | 45 +++++++++++++++++++++++++++-------------
>>>>>>     1 file changed, 31 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>> index 4f3de0ac8b0b..03a1ae6fe267 100644
>>>>>> --- a/drivers/net/virtio_net.c
>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>> @@ -371,7 +371,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>>                        struct receive_queue *rq,
>>>>>>                        struct page *page, unsigned int offset,
>>>>>>                        unsigned int len, unsigned int truesize,
>>>>>> -                   bool hdr_valid)
>>>>>> +                   bool hdr_valid, unsigned int metasize)
>>>>>>     {
>>>>>>         struct sk_buff *skb;
>>>>>>         struct virtio_net_hdr_mrg_rxbuf *hdr;
>>>>>> @@ -393,7 +393,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>>         else
>>>>>>             hdr_padded_len = sizeof(struct padded_vnet_hdr);
>>>>>>     -    if (hdr_valid)
>>>>>> +    if (hdr_valid && !metasize)
>>>>>>             memcpy(hdr, p, hdr_len);
>>>>>>           len -= hdr_len;
>>>>>> @@ -405,6 +405,11 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>>             copy = skb_tailroom(skb);
>>>>>>         skb_put_data(skb, p, copy);
>>>>>>     +    if (metasize) {
>>>>>> +        __skb_pull(skb, metasize);
>>>>>> +        skb_metadata_set(skb, metasize);
>>>>>> +    }
>>>>>> +
>>>>>>         len -= copy;
>>>>>>         offset += copy;
>>>>>>     @@ -644,6 +649,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>>>>>         unsigned int delta = 0;
>>>>>>         struct page *xdp_page;
>>>>>>         int err;
>>>>>> +    unsigned int metasize = 0;
>>>>>>           len -= vi->hdr_len;
>>>>>>         stats->bytes += len;
>>>>>> @@ -683,10 +689,13 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>>>>>               xdp.data_hard_start = buf + VIRTNET_RX_PAD + vi->hdr_len;
>>>>>>             xdp.data = xdp.data_hard_start + xdp_headroom;
>>>>>> -        xdp_set_data_meta_invalid(&xdp);
>>>>>>             xdp.data_end = xdp.data + len;
>>>>>> +        xdp.data_meta = xdp.data;
>>>>>>             xdp.rxq = &rq->xdp_rxq;
>>>>>>             orig_data = xdp.data;
>>>>>> +        /* Copy the vnet header to the front of data_hard_start to avoid
>>>>>> +         * overwriting by XDP meta data */
>>>>>> +        memcpy(xdp.data_hard_start - vi->hdr_len, xdp.data - vi->hdr_len, vi->hdr_len);
>>> I'm not fully sure if I'm following this one correctly, probably just missing
>>> something. Isn't the vnet header based on how we set up xdp.data_hard_start
>>> earlier already in front of it? Wouldn't we copy invalid data from xdp.data -
>>> vi->hdr_len into the vnet header at that point (given there can be up to 256
>>> bytes of headroom between the two)? If it's relative to xdp.data and headroom
>>> is >0, then BPF prog could otherwise mangle this; something doesn't add up to
>>> me here. Could you clarify? Thx
>> Vnet headr sits just in front of xdp.data not xdp.data_hard_start. So it could be overwrote by metadata, that's why we need a copy here.
> For the current code, you can adjust the xdp.data with a positive/negative offset
> already via bpf_xdp_adjust_head() helper. If vnet headr sits just in front of
> xdp.data, couldn't this be overridden today as well then? Anyway, just wondering
> how this is handled differently?


We will invalidate the vnet header in this case. But for the case of 
metadata adjustment without header adjustment, we want to seek a way to 
preserve that.

Thanks


>
> Thanks,
> Daniel

^ permalink raw reply

* Re: [RFC v2] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2019-07-10  2:26 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: Alex Williamson, mst, maxime.coquelin, linux-kernel, kvm,
	virtualization, netdev, dan.daly, cunming.liang, zhihong.wang,
	idos, Rob Miller, Ariel Adam
In-Reply-To: <20190709063317.GA29300@___>


On 2019/7/9 下午2:33, Tiwei Bie wrote:
> On Tue, Jul 09, 2019 at 10:50:38AM +0800, Jason Wang wrote:
>> On 2019/7/8 下午2:16, Tiwei Bie wrote:
>>> On Fri, Jul 05, 2019 at 08:49:46AM -0600, Alex Williamson wrote:
>>>> On Thu, 4 Jul 2019 14:21:34 +0800
>>>> Tiwei Bie <tiwei.bie@intel.com> wrote:
>>>>> On Thu, Jul 04, 2019 at 12:31:48PM +0800, Jason Wang wrote:
>>>>>> On 2019/7/3 下午9:08, Tiwei Bie wrote:
>>>>>>> On Wed, Jul 03, 2019 at 08:16:23PM +0800, Jason Wang wrote:
>>>>>>>> On 2019/7/3 下午7:52, Tiwei Bie wrote:
>>>>>>>>> On Wed, Jul 03, 2019 at 06:09:51PM +0800, Jason Wang wrote:
>>>>>>>>>> On 2019/7/3 下午5:13, Tiwei Bie wrote:
>>>>>>>>>>> Details about this can be found here:
>>>>>>>>>>>
>>>>>>>>>>> https://lwn.net/Articles/750770/
>>>>>>>>>>>
>>>>>>>>>>> What's new in this version
>>>>>>>>>>> ==========================
>>>>>>>>>>>
>>>>>>>>>>> A new VFIO device type is introduced - vfio-vhost. This addressed
>>>>>>>>>>> some comments from here:https://patchwork.ozlabs.org/cover/984763/
>>>>>>>>>>>
>>>>>>>>>>> Below is the updated device interface:
>>>>>>>>>>>
>>>>>>>>>>> Currently, there are two regions of this device: 1) CONFIG_REGION
>>>>>>>>>>> (VFIO_VHOST_CONFIG_REGION_INDEX), which can be used to setup the
>>>>>>>>>>> device; 2) NOTIFY_REGION (VFIO_VHOST_NOTIFY_REGION_INDEX), which
>>>>>>>>>>> can be used to notify the device.
>>>>>>>>>>>
>>>>>>>>>>> 1. CONFIG_REGION
>>>>>>>>>>>
>>>>>>>>>>> The region described by CONFIG_REGION is the main control interface.
>>>>>>>>>>> Messages will be written to or read from this region.
>>>>>>>>>>>
>>>>>>>>>>> The message type is determined by the `request` field in message
>>>>>>>>>>> header. The message size is encoded in the message header too.
>>>>>>>>>>> The message format looks like this:
>>>>>>>>>>>
>>>>>>>>>>> struct vhost_vfio_op {
>>>>>>>>>>> 	__u64 request;
>>>>>>>>>>> 	__u32 flags;
>>>>>>>>>>> 	/* Flag values: */
>>>>>>>>>>>       #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
>>>>>>>>>>> 	__u32 size;
>>>>>>>>>>> 	union {
>>>>>>>>>>> 		__u64 u64;
>>>>>>>>>>> 		struct vhost_vring_state state;
>>>>>>>>>>> 		struct vhost_vring_addr addr;
>>>>>>>>>>> 	} payload;
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> The existing vhost-kernel ioctl cmds are reused as the message
>>>>>>>>>>> requests in above structure.
>>>>>>>>>> Still a comments like V1. What's the advantage of inventing a new protocol?
>>>>>>>>> I'm trying to make it work in VFIO's way..
>>>>>>>>>> I believe either of the following should be better:
>>>>>>>>>>
>>>>>>>>>> - using vhost ioctl,  we can start from SET_VRING_KICK/SET_VRING_CALL and
>>>>>>>>>> extend it with e.g notify region. The advantages is that all exist userspace
>>>>>>>>>> program could be reused without modification (or minimal modification). And
>>>>>>>>>> vhost API hides lots of details that is not necessary to be understood by
>>>>>>>>>> application (e.g in the case of container).
>>>>>>>>> Do you mean reusing vhost's ioctl on VFIO device fd directly,
>>>>>>>>> or introducing another mdev driver (i.e. vhost_mdev instead of
>>>>>>>>> using the existing vfio_mdev) for mdev device?
>>>>>>>> Can we simply add them into ioctl of mdev_parent_ops?
>>>>>>> Right, either way, these ioctls have to be and just need to be
>>>>>>> added in the ioctl of the mdev_parent_ops. But another thing we
>>>>>>> also need to consider is that which file descriptor the userspace
>>>>>>> will do the ioctl() on. So I'm wondering do you mean let the
>>>>>>> userspace do the ioctl() on the VFIO device fd of the mdev
>>>>>>> device?
>>>>>> Yes.
>>>>> Got it! I'm not sure what's Alex opinion on this. If we all
>>>>> agree with this, I can do it in this way.
>>>>>
>>>>>> Is there any other way btw?
>>>>> Just a quick thought.. Maybe totally a bad idea. I was thinking
>>>>> whether it would be odd to do non-VFIO's ioctls on VFIO's device
>>>>> fd. So I was wondering whether it's possible to allow binding
>>>>> another mdev driver (e.g. vhost_mdev) to the supported mdev
>>>>> devices. The new mdev driver, vhost_mdev, can provide similar
>>>>> ways to let userspace open the mdev device and do the vhost ioctls
>>>>> on it. To distinguish with the vfio_mdev compatible mdev devices,
>>>>> the device API of the new vhost_mdev compatible mdev devices
>>>>> might be e.g. "vhost-net" for net?
>>>>>
>>>>> So in VFIO case, the device will be for passthru directly. And
>>>>> in VHOST case, the device can be used to accelerate the existing
>>>>> virtualized devices.
>>>>>
>>>>> How do you think?
>>>> VFIO really can't prevent vendor specific ioctls on the device file
>>>> descriptor for mdevs, but a) we'd want to be sure the ioctl address
>>>> space can't collide with ioctls we'd use for vfio defined purposes and
>>>> b) maybe the VFIO user API isn't what you want in the first place if
>>>> you intend to mostly/entirely ignore the defined ioctl set and replace
>>>> them with your own.  In the case of the latter, you're also not getting
>>>> the advantages of the existing VFIO userspace code, so why expose a
>>>> VFIO device at all.
>>> Yeah, I totally agree.
>>
>> I guess the original idea is to reuse the VFIO DMA/IOMMU API for this. Then
>> we have the chance to reuse vfio codes in qemu for dealing with e.g vIOMMU.
> Yeah, you are right. We have several choices here:
>
> #1. We expose a VFIO device, so we can reuse the VFIO container/group
>      based DMA API and potentially reuse a lot of VFIO code in QEMU.
>
>      But in this case, we have two choices for the VFIO device interface
>      (i.e. the interface on top of VFIO device fd):
>
>      A) we may invent a new vhost protocol (as demonstrated by the code
>         in this RFC) on VFIO device fd to make it work in VFIO's way,
>         i.e. regions and irqs.
>
>      B) Or as you proposed, instead of inventing a new vhost protocol,
>         we can reuse most existing vhost ioctls on the VFIO device fd
>         directly. There should be no conflicts between the VFIO ioctls
>         (type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
>
> #2. Instead of exposing a VFIO device, we may expose a VHOST device.
>      And we will introduce a new mdev driver vhost-mdev to do this.
>      It would be natural to reuse the existing kernel vhost interface
>      (ioctls) on it as much as possible. But we will need to invent
>      some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
>      choice, but it's too heavy and doesn't support vIOMMU by itself).
>
> I'm not sure which one is the best choice we all want..
> Which one (#1/A, #1/B, or #2) would you prefer?


#2 looks better. One concern is that we may end up with similar API as 
what VFIO does. And I do see some new RFC for VFIO to add more DMA API.

Consider it was still in the stage of RFC, does it make sense if we try 
this way with some sample parents?


>
>>
>>>> The mdev interface does provide a general interface for creating and
>>>> managing virtual devices, vfio-mdev is just one driver on the mdev
>>>> bus.  Parav (Mellanox) has been doing work on mdev-core to help clean
>>>> out vfio-isms from the interface, aiui, with the intent of implementing
>>>> another mdev bus driver for using the devices within the kernel.
>>> Great to know this! I found below series after some searching:
>>>
>>> https://lkml.org/lkml/2019/3/8/821
>>>
>>> In above series, the new mlx5_core mdev driver will do the probe
>>> by calling mlx5_get_core_dev() first on the parent device of the
>>> mdev device. In vhost_mdev, maybe we can also keep track of all
>>> the compatible mdev devices and use this info to do the probe.
>>
>> I don't get why this is needed. My understanding is if we want to go this
>> way, there're actually two parts. 1) Vhost mdev that implements the device
>> managements and vhost ioctl. 2) Vhost it self, which can accept mdev fd as
>> it backend through VHOST_NET_SET_BACKEND.
> I think with vhost-mdev (or with vfio-mdev if we agree to do vhost
> ioctls on vfio device fd directly), we don't need to open /dev/vhost-net
> (and there is no VHOST_NET_SET_BACKEND needed) at all. Either way,
> after getting the fd of the mdev, we just need to do vhost ioctls
> on it directly.


The reason I ask is that vhost-net is designed to not tied to any kind 
of backend. So it's better to have a single place to deal with ioctl. 
But it's not must.

Thanks


>
>>
>>> But we also need a way to allow vfio_mdev driver to distinguish
>>> and reject the incompatible mdev devices.
>>
>> One issue for this series is that it doesn't consider DMA isolation at all.
>>
>>
>>>> It
>>>> seems like this vhost-mdev driver might be similar, using mdev but not
>>>> necessarily vfio-mdev to expose devices.  Thanks,
>>> Yeah, I also think so!
>>
>> I've cced some driver developers for their inputs. I think we need a sample
>> parent drivers in the next version for us to understand the full picture.
>>
>>
>> Thanks
>>
>>
>>> Thanks!
>>> Tiwei
>>>
>>>> Alex

^ permalink raw reply

* Re: [bpf PATCH v2 0/6] bpf: sockmap/tls fixes
From: Jakub Kicinski @ 2019-07-10  2:21 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <20190709170459.387bced6@cakuba.netronome.com>

On Tue, 9 Jul 2019 17:04:59 -0700, Jakub Kicinski wrote:
> On Tue, 09 Jul 2019 08:40:14 -0700, John Fastabend wrote:
> > Jakub Kicinski wrote:  
> > > Looks like strparser is not done'd for offload?    
> > 
> > Right so if rx_conf != TLS_SW then the hardware needs to do
> > the strparser functionality.  
> 
> Can I just take a stab at fixing the HW part?
> 
> Can I rebase this onto net-next?  There are a few patches from net
> missing in the bpf tree.

I think I fixed patch 1 for offload, I need to test it a little more
and I'll send it back to you. In the meantime, let me ask some
questions about the other two :)

^ permalink raw reply

* Re: [EXT] [PATCH net-next 16/16] qlge: Refill empty buffer queues from wq
From: Benjamin Poirier @ 2019-07-10  1:18 UTC (permalink / raw)
  To: Manish Chopra; +Cc: GR-Linux-NIC-Dev, netdev@vger.kernel.org
In-Reply-To: <DM6PR18MB2697EC53399F214EC3DFC4ABABFD0@DM6PR18MB2697.namprd18.prod.outlook.com>

On 2019/06/27 14:18, Manish Chopra wrote:
> > -----Original Message-----
> > From: Benjamin Poirier <bpoirier@suse.com>
> > Sent: Monday, June 17, 2019 1:19 PM
> > To: Manish Chopra <manishc@marvell.com>; GR-Linux-NIC-Dev <GR-Linux-
> > NIC-Dev@marvell.com>; netdev@vger.kernel.org
> > Subject: [EXT] [PATCH net-next 16/16] qlge: Refill empty buffer queues from
> > wq
> > 
> > External Email
> > 
> > ----------------------------------------------------------------------
> > When operating at mtu 9000, qlge does order-1 allocations for rx buffers in
> > atomic context. This is especially unreliable when free memory is low or
> > fragmented. Add an approach similar to commit 3161e453e496 ("virtio: net
> > refill on out-of-memory") to qlge so that the device doesn't lock up if there
> > are allocation failures.
> > 
[...]
> > +
> > +static void ql_update_buffer_queues(struct rx_ring *rx_ring, gfp_t gfp,
> > +				    unsigned long delay)
> > +{
> > +	bool sbq_fail, lbq_fail;
> > +
> > +	sbq_fail = !!qlge_refill_bq(&rx_ring->sbq, gfp);
> > +	lbq_fail = !!qlge_refill_bq(&rx_ring->lbq, gfp);
> > +
> > +	/* Minimum number of buffers needed to be able to receive at least
> > one
> > +	 * frame of any format:
> > +	 * sbq: 1 for header + 1 for data
> > +	 * lbq: mtu 9000 / lb size
> > +	 * Below this, the queue might stall.
> > +	 */
> > +	if ((sbq_fail && QLGE_BQ_HW_OWNED(&rx_ring->sbq) < 2) ||
> > +	    (lbq_fail && QLGE_BQ_HW_OWNED(&rx_ring->lbq) <
> > +	     DIV_ROUND_UP(9000, LARGE_BUFFER_MAX_SIZE)))
> > +		/* Allocations can take a long time in certain cases (ex.
> > +		 * reclaim). Therefore, use a workqueue for long-running
> > +		 * work items.
> > +		 */
> > +		queue_delayed_work_on(smp_processor_id(),
> > system_long_wq,
> > +				      &rx_ring->refill_work, delay);
> >  }
> > 
> 
> This is probably going to mess up when at the interface load time (qlge_open()) allocation failure occurs, in such cases we don't really want to re-try allocations
> using refill_work but rather simply fail the interface load.

Why would you want to turn a recoverable failure into a fatal failure?

In case of allocation failure at ndo_open time, allocations are retried
later from a workqueue. Meanwhile, the device can use the available rx
buffers (if any could be allocated at all).

> Just to make sure here in such cases it shouldn't lead to kernel panic etc. while completing qlge_open() and
> leaving refill_work executing in background. Or probably handle such allocation failures from the napi context and schedule refill_work from there.
> 

I've just tested allocation failures at open time and didn't find
problems; with mtu 9000, using bcc, for example:
tools/inject.py -P 0.5 -c 100 alloc_page "should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) (order == 1) => qlge_refill_bq()"

What exact scenario do you have in mind that's going to lead to
problems? Please try it out and describe it precisely.

^ permalink raw reply

* RE: [PATCH net-next] Name NICs based on vmbus offer and enable async probe by default
From: Haiyang Zhang @ 2019-07-10  1:04 UTC (permalink / raw)
  To: David Miller
  Cc: sashal@kernel.org, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	olaf@aepfle.de, vkuznets, linux-kernel@vger.kernel.org
In-Reply-To: <20190709.172936.1666884223446806217.davem@davemloft.net>



> -----Original Message-----
> From: linux-hyperv-owner@vger.kernel.org <linux-hyperv-
> owner@vger.kernel.org> On Behalf Of David Miller
> Sent: Tuesday, July 9, 2019 8:30 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: sashal@kernel.org; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; KY Srinivasan <kys@microsoft.com>; Stephen
> Hemminger <sthemmin@microsoft.com>; olaf@aepfle.de; vkuznets
> <vkuznets@redhat.com>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next] Name NICs based on vmbus offer and enable
> async probe by default
> 
> 
> The net-next tree, if you are reading netdev today, has been closed.
I will re-submit when the tree re-opened. 
Thanks,
- Haiyang

^ permalink raw reply

* RE: [PATCH net-next] Name NICs based on vmbus offer and enable async probe by default
From: Haiyang Zhang @ 2019-07-10  1:03 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: sashal@kernel.org, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	olaf@aepfle.de, vkuznets, davem@davemloft.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190709164714.70774c92@hermes.lan>



> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, July 9, 2019 7:47 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: sashal@kernel.org; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; KY Srinivasan <kys@microsoft.com>; Stephen
> Hemminger <sthemmin@microsoft.com>; olaf@aepfle.de; vkuznets
> <vkuznets@redhat.com>; davem@davemloft.net; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH net-next] Name NICs based on vmbus offer and enable
> async probe by default
> 
> On Tue, 9 Jul 2019 22:56:30 +0000
> Haiyang Zhang <haiyangz@microsoft.com> wrote:
> 
> > -				VRSS_CHANNEL_MAX);
> > +	if (probe_type == PROBE_PREFER_ASYNCHRONOUS) {
> > +		snprintf(name, IFNAMSIZ, "eth%d", dev->channel->dev_num);
> 
> What about PCI passthrough or VF devices that are also being probed and
> consuming the ethN names.  Won't there be a collision?

VF usually shows up a few seconds later than the synthetic NIC. Faster probing
will reduce the probability of collision.
Even if a collision happens, the code below will re-register the NIC with "eth%d":
+	if (ret == -EEXIST) {
+		pr_info("NIC name %s exists, request another name.\n",
+			net->name);
+		strlcpy(net->name, "eth%d", IFNAMSIZ);
+		ret = register_netdevice(net);
+	}

Thanks,
- Haiyang

^ permalink raw reply

* Re: [PATCH iproute2 v2 1/2] tc: added mask parameter in skbedit action
From: Stephen Hemminger @ 2019-07-10  0:32 UTC (permalink / raw)
  To: Roman Mashak; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <1562601978-3611-1-git-send-email-mrv@mojatatu.com>

On Mon,  8 Jul 2019 12:06:17 -0400
Roman Mashak <mrv@mojatatu.com> wrote:

> Add 32-bit missing mask attribute in iproute2/tc, which has been long
> supported by the kernel side.
> 
> v2: print value in hex with print_hex() as suggested by Stephen Hemminger.
> 
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>

Both patches applied

^ permalink raw reply

* Re: [PATCH iproute2] ip-route: fix json formatting for metrics
From: Stephen Hemminger @ 2019-07-10  0:29 UTC (permalink / raw)
  To: Andrea Claudi; +Cc: netdev, dsahern
In-Reply-To: <f1535e547aa6da8216ca2a0da7c06b645a132929.1562578533.git.aclaudi@redhat.com>

On Mon,  8 Jul 2019 11:36:42 +0200
Andrea Claudi <aclaudi@redhat.com> wrote:

> Setting metrics for routes currently lead to non-parsable
> json output. For example:
> 
> $ ip link add type dummy
> $ ip route add 192.168.2.0 dev dummy0 metric 100 mtu 1000 rto_min 3
> $ ip -j route | jq
> parse error: ':' not as part of an object at line 1, column 319
> 
> Fixing this opening a json object in the metrics array and using
> print_string() instead of fprintf().
> 
> This is the output for the above commands applying this patch:
> 
> $ ip -j route | jq
> [
>   {
>     "dst": "192.168.2.0",
>     "dev": "dummy0",
>     "scope": "link",
>     "metric": 100,
>     "flags": [],
>     "metrics": [
>       {
>         "mtu": 1000,
>         "rto_min": 3
>       }
>     ]
>   }
> ]
> 
> Fixes: 663c3cb23103f ("iproute: implement JSON and color output")
> Fixes: 968272e791710 ("iproute: refactor metrics print")
> Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
> ---

Applied, thanks

^ permalink raw reply

* Re: [PATCH net-next] Name NICs based on vmbus offer and enable async probe by default
From: David Miller @ 2019-07-10  0:29 UTC (permalink / raw)
  To: haiyangz
  Cc: sashal, linux-hyperv, netdev, kys, sthemmin, olaf, vkuznets,
	linux-kernel
In-Reply-To: <1562712932-79936-1-git-send-email-haiyangz@microsoft.com>


The net-next tree, if you are reading netdev today, has been closed.

^ permalink raw reply

* Re: [bpf PATCH v2 0/6] bpf: sockmap/tls fixes
From: Jakub Kicinski @ 2019-07-10  0:04 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev, edumazet, bpf
In-Reply-To: <5d24b55e8b868_3b162ae67af425b43e@john-XPS-13-9370.notmuch>

On Tue, 09 Jul 2019 08:40:14 -0700, John Fastabend wrote:
> Jakub Kicinski wrote:
> > Looks like strparser is not done'd for offload?  
> 
> Right so if rx_conf != TLS_SW then the hardware needs to do
> the strparser functionality.

Can I just take a stab at fixing the HW part?

Can I rebase this onto net-next?  There are a few patches from net
missing in the bpf tree.


^ permalink raw reply

* Re: [PATCH net-next] Name NICs based on vmbus offer and enable async probe by default
From: Stephen Hemminger @ 2019-07-09 23:47 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: sashal@kernel.org, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, KY Srinivasan, Stephen Hemminger,
	olaf@aepfle.de, vkuznets, davem@davemloft.net,
	linux-kernel@vger.kernel.org
In-Reply-To: <1562712932-79936-1-git-send-email-haiyangz@microsoft.com>

On Tue, 9 Jul 2019 22:56:30 +0000
Haiyang Zhang <haiyangz@microsoft.com> wrote:

> -				VRSS_CHANNEL_MAX);
> +	if (probe_type == PROBE_PREFER_ASYNCHRONOUS) {
> +		snprintf(name, IFNAMSIZ, "eth%d", dev->channel->dev_num);

What about PCI passthrough or VF devices that are also being probed and
consuming the ethN names.  Won't there be a collision?

^ permalink raw reply

* Re: [PATCH iproute2] utils: don't match empty strings as prefixes
From: Matteo Croce @ 2019-07-09 23:18 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, David Ahern
In-Reply-To: <20190709143758.695a65bc@hermes.lan>

On Tue, Jul 9, 2019 at 11:38 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Tue,  9 Jul 2019 22:40:40 +0200
> Matteo Croce <mcroce@redhat.com> wrote:
>
> > iproute has an utility function which checks if a string is a prefix for
> > another one, to allow use of abbreviated commands, e.g. 'addr' or 'a'
> > instead of 'address'.
> >
> > This routine unfortunately considers an empty string as prefix
> > of any pattern, leading to undefined behaviour when an empty
> > argument is passed to ip:
> >
> >     # ip ''
> >     1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
> >         link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> >         inet 127.0.0.1/8 scope host lo
> >            valid_lft forever preferred_lft forever
> >         inet6 ::1/128 scope host
> >            valid_lft forever preferred_lft forever
> >
> >     # tc ''
> >     qdisc noqueue 0: dev lo root refcnt 2
> >
> >     # ip address add 192.0.2.0/24 '' 198.51.100.1 dev dummy0
> >     # ip addr show dev dummy0
> >     6: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
> >         link/ether 02:9d:5e:e9:3f:c0 brd ff:ff:ff:ff:ff:ff
> >         inet 192.0.2.0/24 brd 198.51.100.1 scope global dummy0
> >            valid_lft forever preferred_lft forever
> >
> > Rewrite matches() so it takes care of an empty input, and doesn't
> > scan the input strings three times: the actual implementation
> > does 2 strlen and a memcpy to accomplish the same task.
> >
> > Signed-off-by: Matteo Croce <mcroce@redhat.com>
> > ---
> >  include/utils.h |  2 +-
> >  lib/utils.c     | 14 +++++++++-----
> >  2 files changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/utils.h b/include/utils.h
> > index 927fdc17..f4d12abb 100644
> > --- a/include/utils.h
> > +++ b/include/utils.h
> > @@ -198,7 +198,7 @@ int nodev(const char *dev);
> >  int check_ifname(const char *);
> >  int get_ifname(char *, const char *);
> >  const char *get_ifname_rta(int ifindex, const struct rtattr *rta);
> > -int matches(const char *arg, const char *pattern);
> > +int matches(const char *prefix, const char *string);
> >  int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits);
> >  int inet_addr_match_rta(const inet_prefix *m, const struct rtattr *rta);
> >
> > diff --git a/lib/utils.c b/lib/utils.c
> > index be0f11b0..73ce19bb 100644
> > --- a/lib/utils.c
> > +++ b/lib/utils.c
> > @@ -887,13 +887,17 @@ const char *get_ifname_rta(int ifindex, const struct rtattr *rta)
> >       return name;
> >  }
> >
> > -int matches(const char *cmd, const char *pattern)
> > +/* Check if 'prefix' is a non empty prefix of 'string' */
> > +int matches(const char *prefix, const char *string)
> >  {
> > -     int len = strlen(cmd);
> > +     if (!*prefix)
> > +             return 1;
> > +     while(*string && *prefix == *string) {
> > +             prefix++;
> > +             string++;
> > +     }
> >
> > -     if (len > strlen(pattern))
> > -             return -1;
> > -     return memcmp(pattern, cmd, len);
> > +     return *prefix;
> >  }
> >
> >  int inet_addr_match(const inet_prefix *a, const inet_prefix *b, int bits)
>
> ERROR: space required before the open parenthesis '('
> #134: FILE: lib/utils.c:895:
> +       while(*string && *prefix == *string) {
>
> total: 1 errors, 1 warnings, 30 lines checked
>
> The empty prefix string is a bug and should not be allowed.
> Also return value should be same as old code (yours isn't).
>
>
>

The old return value was the difference between the first pair of
bytes, according to the memcmp manpage.
All calls only checks if the matches() return value is 0 or not 0:

iproute2$ git grep 'matches(' |grep -v -e '== 0' -e '= 0' -e '!matches('
include/utils.h:int matches(const char *prefix, const char *string);
include/xtables.h:extern void xtables_register_matches(struct
xtables_match *, unsigned int);
lib/color.c:    if (matches(dup, "-color"))
lib/utils.c:int matches(const char *prefix, const char *string)
tc/tc.c:                if (matches(argv[0], iter->c))

Is it a problem if it returns a non negative value for non matching strings?

Regards,


--
Matteo Croce
per aspera ad upstream

^ permalink raw reply

* Re: [PATCH] net/mlx5e: Return in default case statement in tx_post_resync_params
From: Nathan Chancellor @ 2019-07-09 23:10 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Boris Pismenny,
	netdev, linux-rdma, LKML, clang-built-linux
In-Reply-To: <CAKwvOdkYdNiKorJAKHZ7LTfk9eOpMqe6F4QSmJWQ=-YNuPAyrw@mail.gmail.com>

On Tue, Jul 09, 2019 at 03:44:59PM -0700, Nick Desaulniers wrote:
> On Mon, Jul 8, 2019 at 4:13 PM Nathan Chancellor
> <natechancellor@gmail.com> wrote:
> >
> > clang warns:
> >
> > drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:251:2:
> > warning: variable 'rec_seq_sz' is used uninitialized whenever switch
> > default is taken [-Wsometimes-uninitialized]
> >         default:
> >         ^~~~~~~
> > drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:255:46: note:
> > uninitialized use occurs here
> >         skip_static_post = !memcmp(rec_seq, &rn_be, rec_seq_sz);
> >                                                     ^~~~~~~~~~
> > drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c:239:16: note:
> > initialize the variable 'rec_seq_sz' to silence this warning
> >         u16 rec_seq_sz;
> >                       ^
> >                        = 0
> > 1 warning generated.
> >
> > This case statement was clearly designed to be one that should not be
> > hit during runtime because of the WARN_ON statement so just return early
> > to prevent copying uninitialized memory up into rn_be.
> >
> > Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support")
> > Link: https://github.com/ClangBuiltLinux/linux/issues/590
> > Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> > index 3f5f4317a22b..5c08891806f0 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> > @@ -250,6 +250,7 @@ tx_post_resync_params(struct mlx5e_txqsq *sq,
> >         }
> >         default:
> >                 WARN_ON(1);
> > +               return;
> >         }
> 
> hmm...a switch statement with a single case is a code smell.  How
> about a single conditional with early return?  Then the "meat" of the
> happy path doesn't need an additional level of indentation.
> -- 
> Thanks,
> ~Nick Desaulniers

I assume that the reason for this is there may be other cipher types
added in the future? I suppose the maintainers can give more clarity to
that.

Furthermore, if they want the switch statements to remain, it looks like
fill_static_params_ctx also returns in the default statement so it seems
like this is the right fix.

Cheers,
Nathan

^ permalink raw reply

* [PATCH net-next] Name NICs based on vmbus offer and enable async probe by default
From: Haiyang Zhang @ 2019-07-09 22:56 UTC (permalink / raw)
  To: sashal@kernel.org, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org
  Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, olaf@aepfle.de,
	vkuznets, davem@davemloft.net, linux-kernel@vger.kernel.org

Previously the async probing caused NIC naming in random order.

The patch adds a dev_num field in vmbus channel structure. It’s assigned
to the first available number when the channel is offered. So netvsc can
use it for NIC naming based on channel offer sequence. Now we re-enable
the async probing mode by default for faster probing.

Also added a modules parameter, probe_type, to set sync probing mode if
a user wants to.

Fixes: af0a5646cb8d ("use the new async probing feature for the hyperv drivers")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/hv/channel_mgmt.c       | 46 +++++++++++++++++++++++++++++++++++++++--
 drivers/net/hyperv/netvsc_drv.c | 33 ++++++++++++++++++++++++++---
 include/linux/hyperv.h          |  4 ++++
 3 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index addcef5..ab7c05b 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -304,6 +304,8 @@ bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp,
 
 EXPORT_SYMBOL_GPL(vmbus_prep_negotiate_resp);
 
+#define HV_DEV_NUM_INVALID (-1)
+
 /*
  * alloc_channel - Allocate and initialize a vmbus channel object
  */
@@ -315,6 +317,8 @@ static struct vmbus_channel *alloc_channel(void)
 	if (!channel)
 		return NULL;
 
+	channel->dev_num = HV_DEV_NUM_INVALID;
+
 	spin_lock_init(&channel->lock);
 	init_completion(&channel->rescind_event);
 
@@ -533,6 +537,42 @@ static void vmbus_add_channel_work(struct work_struct *work)
 }
 
 /*
+ * Get the first available device number of its type, then
+ * record it in the channel structure.
+ */
+static void hv_set_devnum(struct vmbus_channel *newchannel)
+{
+	struct vmbus_channel *channel;
+	unsigned int i = 0;
+	bool found;
+
+	BUG_ON(!mutex_is_locked(&vmbus_connection.channel_mutex));
+
+	/* Only HV_NIC uses this number for now */
+	if (hv_get_dev_type(newchannel) != HV_NIC)
+		return;
+
+next:
+	found = false;
+
+	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
+		if (i == channel->dev_num &&
+		    guid_equal(&channel->offermsg.offer.if_type,
+			       &newchannel->offermsg.offer.if_type)) {
+			found = true;
+			break;
+		}
+	}
+
+	if (found) {
+		i++;
+		goto next;
+	}
+
+	newchannel->dev_num = i;
+}
+
+/*
  * vmbus_process_offer - Process the offer by creating a channel/device
  * associated with this offer
  */
@@ -561,10 +601,12 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel)
 		}
 	}
 
-	if (fnew)
+	if (fnew) {
+		hv_set_devnum(newchannel);
+
 		list_add_tail(&newchannel->listentry,
 			      &vmbus_connection.chn_list);
-	else {
+	} else {
 		/*
 		 * Check to see if this is a valid sub-channel.
 		 */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index afdcc56..af53690 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -57,6 +57,10 @@
 module_param(debug, int, 0444);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
+static unsigned int probe_type __ro_after_init = PROBE_PREFER_ASYNCHRONOUS;
+module_param(probe_type, uint, 0444);
+MODULE_PARM_DESC(probe_type, "Probe type: 1=async(default), 2=sync");
+
 static LIST_HEAD(netvsc_dev_list);
 
 static void netvsc_change_rx_flags(struct net_device *net, int change)
@@ -2233,10 +2237,19 @@ static int netvsc_probe(struct hv_device *dev,
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device_info *device_info = NULL;
 	struct netvsc_device *nvdev;
+	char name[IFNAMSIZ];
 	int ret = -ENOMEM;
 
-	net = alloc_etherdev_mq(sizeof(struct net_device_context),
-				VRSS_CHANNEL_MAX);
+	if (probe_type == PROBE_PREFER_ASYNCHRONOUS) {
+		snprintf(name, IFNAMSIZ, "eth%d", dev->channel->dev_num);
+		net = alloc_netdev_mqs(sizeof(struct net_device_context), name,
+				       NET_NAME_ENUM, ether_setup,
+				       VRSS_CHANNEL_MAX, VRSS_CHANNEL_MAX);
+	} else {
+		net = alloc_etherdev_mq(sizeof(struct net_device_context),
+					VRSS_CHANNEL_MAX);
+	}
+
 	if (!net)
 		goto no_net;
 
@@ -2323,6 +2336,14 @@ static int netvsc_probe(struct hv_device *dev,
 		net->max_mtu = ETH_DATA_LEN;
 
 	ret = register_netdevice(net);
+
+	if (ret == -EEXIST) {
+		pr_info("NIC name %s exists, request another name.\n",
+			net->name);
+		strlcpy(net->name, "eth%d", IFNAMSIZ);
+		ret = register_netdevice(net);
+	}
+
 	if (ret != 0) {
 		pr_err("Unable to register netdev.\n");
 		goto register_failed;
@@ -2407,7 +2428,7 @@ static int netvsc_remove(struct hv_device *dev)
 	.probe = netvsc_probe,
 	.remove = netvsc_remove,
 	.driver = {
-		.probe_type = PROBE_FORCE_SYNCHRONOUS,
+		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
 	},
 };
 
@@ -2473,6 +2494,12 @@ static int __init netvsc_drv_init(void)
 	}
 	netvsc_ring_bytes = ring_size * PAGE_SIZE;
 
+	if (probe_type != PROBE_PREFER_ASYNCHRONOUS)
+		probe_type = PROBE_FORCE_SYNCHRONOUS;
+
+	netvsc_drv.driver.probe_type = probe_type;
+	pr_info("probe_type: %u\n", probe_type);
+
 	ret = vmbus_driver_register(&netvsc_drv);
 	if (ret)
 		return ret;
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 6256cc3..12fc5ea 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -841,6 +841,10 @@ struct vmbus_channel {
 	 */
 	struct vmbus_channel *primary_channel;
 	/*
+	 * Used for device naming based on channel offer sequence.
+	 */
+	int dev_num;
+	/*
 	 * Support per-channel state for use by vmbus drivers.
 	 */
 	void *per_channel_state;
-- 
1.8.3.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox