Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [bpf PATCH 4/6] bpf: sockmap, tcp_disconnect to listen transition
From: John Fastabend @ 2018-06-14  5:48 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: ast, daniel, netdev
In-Reply-To: <20180614005601.bsjijlsr7smpngde@kafai-mbp>

On 06/13/2018 05:56 PM, Martin KaFai Lau wrote:
> On Wed, Jun 13, 2018 at 10:50:14AM -0700, John Fastabend wrote:
>> After adding checks to ensure TCP is in ESTABLISHED state when a
>> sock is added we need to also ensure that user does not transition
>> through tcp_disconnect() and back into ESTABLISHED state without
>> sockmap removing the sock.
>>
>> To do this add unhash hook and remove sock from map there.
> In bpf_tcp_init():
>         sk->sk_prot = &tcp_bpf_proto;
> 
> I may have missed a lock while reading sockmap.c.
> Is it possible that tcp_disconnect() is being called while
> the above assignment is also being done (e.g. through BPF_MAP_UPDATE_ELEM)?
> The same situation go for the ESTABLISHED check.
> 

Right because ESTABLISHED is checked without any locking its
possible that the state changes during the update (from userspce
BPF_MAP_UPDATE, from sock_ops program it is locked). I have
the below patch on my tree now, I was thinking to send it as
a follow on but on second thought it likely makes more sense
as part of the patch that adds the ESTABLISHED check.

Also after the below the sk_callback lock used to protect
psock->maps is becoming increasingly pointless it allows the
delete and map free ops to be called without taking the full
sock lock. It might be time to just drop it in bpf-next and
use the sock lock in the delete cases. The more annoying part
will be the delete will have to have different userspace and
bpf program helpers so we know when we need the lock.

--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2074,17 +2074,20 @@ static int sock_map_update_elem(struct bpf_map *map,
                return -EINVAL;
        }

+       lock_sock(skops.sk);
        /* ULPs are currently supported only for TCP sockets in ESTABLISHED
         * state.
         */
        if (skops.sk->sk_type != SOCK_STREAM ||
            skops.sk->sk_protocol != IPPROTO_TCP ||
            skops.sk->sk_state != TCP_ESTABLISHED) {
-               fput(socket->file);
-               return -EOPNOTSUPP;
+               err = -EOPNOTSUPP;
+               goto out;
        }

        err = sock_map_ctx_update_elem(&skops, map, key, flags);
+out:
+       release_sock(skops.sk);
        fput(socket->file);
        return err;
 }
@@ -2423,17 +2426,20 @@ static int sock_hash_update_elem(struct bpf_map
*map,
                return -EINVAL;
        }

+       lock_sock(skops.sk);
        /* ULPs are currently supported only for TCP sockets in ESTABLISHED
         * state.
         */
        if (skops.sk->sk_type != SOCK_STREAM ||
            skops.sk->sk_protocol != IPPROTO_TCP ||
            skops.sk->sk_state != TCP_ESTABLISHED) {
-               fput(socket->file);
-               return -EOPNOTSUPP;
+               err = -EOPNOTSUPP;
+               goto out;
        }

        err = sock_hash_ctx_update_elem(&skops, map, key, flags);
+out:
+       release_sock(skops.sk);
        fput(socket->file);
        return err;

^ permalink raw reply

* Re: [PATCH iproute2-next v2] ip-xfrm: Add support for OUTPUT_MARK
From: Subash Abhinov Kasiviswanathan @ 2018-06-14  5:09 UTC (permalink / raw)
  To: David Ahern; +Cc: Lorenzo Colitti, netdev, Stephen Hemminger, Steffen Klassert
In-Reply-To: <107348e5-3035-8d07-59c0-95a84c4f8222@gmail.com>

> any reason to put output-mark on its own line? Why not
> 	mark 0x10000/0x3ffff output-mark 0x20000
> 

Hi David

I will move it to the same line in v3.

> is the documentation clear on the difference between mark and 
> output-mark?

Lorenzo has described the differences in detail in the kernel commit I
had listed in the commit text of this patch. Pasting from there -

The output mark differs from the existing xfrm mark in two ways:

1. The xfrm mark is used to match xfrm policies and states, while
    the xfrm output mark is used to set the mark (and influence
    the routing) of the packets emitted by those states.
2. The existing mark is constrained to be a subset of the bits of
    the originating socket or transformed packet, but the output
    mark is arbitrary and depends only on the state.

The use of a separate mark provides additional flexibility. For
example:

- A packet subject to two transforms (e.g., transport mode inside
   tunnel mode) can have two different output marks applied to it,
   one for the transport mode SA and one for the tunnel mode SA.
- On a system where socket marks determine routing, the packets
   emitted by an IPsec tunnel can be routed based on a mark that
   is determined by the tunnel, not by the marks of the
   unencrypted packets.
- Support for setting the output marks can be introduced without
   breaking any existing setups that employ both mark-based
   routing and xfrm tunnel mode. Simply changing the code to use
   the xfrm mark for routing output packets could xfrm mark could
   change behaviour in a way that breaks these setups.

> 
>> 
>>> @@ -61,6 +61,7 @@ static void usage(void)
>>>         fprintf(stderr, "        [ flag FLAG-LIST ] [ sel SELECTOR ] 
>>> [ LIMIT-LIST ] [ encap ENCAP ]\n");
>>>         fprintf(stderr, "        [ coa ADDR[/PLEN] ] [ ctx CTX ] [ 
>>> extra-flag EXTRA-FLAG-LIST ]\n");
>>>         fprintf(stderr, "        [ offload [dev DEV] dir DIR ]\n");
>>> +       fprintf(stderr, "        [ output-mark OUTPUT-MARK]\n");
>> 
>> Nit: I think you want a space between OUTPUT-MARK and ].
> 
> yes.
> 

I will update this as well.

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

* Re: [PATCH bpf v2] xdp: Fix handling of devmap in generic XDP
From: Y Song @ 2018-06-14  4:36 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev,
	Jesper Dangaard Brouer
In-Reply-To: <3b976fba-cfa2-3541-e823-89879ce56ca5@lab.ntt.co.jp>

On Wed, Jun 13, 2018 at 8:40 PM, Toshiaki Makita
<makita.toshiaki@lab.ntt.co.jp> wrote:
> On 2018/06/14 11:56, Y Song wrote:
> ...
>>> @@ -586,6 +589,15 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>>>         return 0;
>>>  }
>>>
>>> +struct sk_buff;
>>> +
>>> +static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
>>> +                                          struct sk_buff *skb,
>>> +                                          struct bpf_prog *xdp_prog)
>>> +{
>>> +       return 0;
>>
>> should you return an error code here?
>
> My understanding is that this function never be called if
> CONFIG_BPF_SYSCALL is not set so any value is OK.
> I aligned it with other functions of devmap, specifically
> dev_map_enqueue() which returns 0.
>
>>
>>> +}
>>> +
>>>  static inline
>>>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
>>>  {
>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>> index 45fc0f5..8ddff1f 100644
>>> --- a/include/linux/filter.h
>>> +++ b/include/linux/filter.h
>>> @@ -19,6 +19,7 @@
>>>  #include <linux/cryptohash.h>
>>>  #include <linux/set_memory.h>
>>>  #include <linux/kallsyms.h>
>>> +#include <linux/if_vlan.h>
>>>
>>>  #include <net/sch_generic.h>
>>>
>>> @@ -786,6 +787,21 @@ static inline bool bpf_dump_raw_ok(void)
>>>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>>>                                        const struct bpf_insn *patch, u32 len);
>>>
>>> +static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
>>> +                                          struct net_device *fwd)
>>
>> Previously this function is only used in filter.c and now it is only
>> used in devmap.c. Maybe this function should be in devmap.c
>> until it will be used cross different files?
>
> This function is also called from xdp_do_generic_redirect() in
> net/core/filter.c, so I can't move it to devmap.c.

Thanks for the explanation. Make sense.
Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply

* Re: [iproute2-next v3 2/2] tipc: JSON support for tipc link printouts
From: David Ahern @ 2018-06-14  3:47 UTC (permalink / raw)
  To: Hoang Le, netdev, tipc-discussion
In-Reply-To: <1528770749-10415-2-git-send-email-hoang.h.le@dektech.com.au>

On 6/11/18 8:32 PM, Hoang Le wrote:
> Add json output support for tipc link command
> 

...

> 
> Acked-by: Jon Maloy <jon.maloy@ericsson.com>
> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
> ---
>  man/man8/tipc-link.8 |  10 ++
>  tipc/link.c          | 414 +++++++++++++++++++++++++++++++++------------------
>  2 files changed, 282 insertions(+), 142 deletions(-)

applied to iproute2-next. Thanks

^ permalink raw reply

* Re: [iproute2-next v3 1/2] tipc: JSON support for showing nametable
From: David Ahern @ 2018-06-14  3:47 UTC (permalink / raw)
  To: Hoang Le, netdev, tipc-discussion
In-Reply-To: <1528770749-10415-1-git-send-email-hoang.h.le@dektech.com.au>

On 6/11/18 8:32 PM, Hoang Le wrote:
> Add json output support for nametable show
> 
> Example output:
> $tipc -j -p nametable show
> 
> [ {
>         "type": 0,
>         "lower": 16781313,
>         "upper": 16781313,
>         "scope": "zone",
>         "port": 0,
>         "node": ""
>     },{
>         "type": 0,
>         "lower": 16781416,
>         "upper": 16781416,
>         "scope": "cluster",
>         "port": 0,
>         "node": ""
>     } ]
> 
> v2:
>     Replace variable 'json_flag' by 'json' declared in include/utils.h
>     Add new parameter '-pretty' to support pretty output
> 
> v3:
>     Update manual page
> 
> Acked-by: Jon Maloy <jon.maloy@ericsson.com>
> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
> ---
>  man/man8/tipc-nametable.8 | 10 ++++++++++
>  man/man8/tipc.8           |  9 +++++++++
>  tipc/nametable.c          | 31 ++++++++++++++++++++++---------
>  tipc/tipc.c               | 20 +++++++++++++++++++-
>  4 files changed, 60 insertions(+), 10 deletions(-)
> 

applied to iproute2-next. Thanks

^ permalink raw reply

* [PATCHv2 net-next] sctp: add support for SCTP_REUSE_PORT sockopt
From: Xin Long @ 2018-06-14  3:44 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: Marcelo Ricardo Leitner, Neil Horman, Michael Tuexen, davem

This feature is actually already supported by sk->sk_reuse which can be
set by socket level opt SO_REUSEADDR. But it's not working exactly as
RFC6458 demands in section 8.1.27, like:

  - This option only supports one-to-one style SCTP sockets
  - This socket option must not be used after calling bind()
    or sctp_bindx().

Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
work in linux.

To separate it from the socket level version, this patch adds 'reuse' in
sctp_sock and it works pretty much as sk->sk_reuse, but with some extra
setup limitations that are needed when it is being enabled.

"It should be noted that the behavior of the socket-level socket option
to reuse ports and/or addresses for SCTP sockets is unspecified", so it
leaves SO_REUSEADDR as is for the compatibility.

Note that the name SCTP_REUSE_PORT is kind of confusing, it is identical
to SO_REUSEADDR with some extra restriction, so here it uses 'reuse' in
sctp_sock instead of 'reuseport'. As for sk->sk_reuseport support for
SCTP, it will be added in another patch.

Thanks to Neil to make this clear.

v1->v2:
  - add sctp_sk->reuse to separate it from the socket level version.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/sctp/structs.h |  1 +
 include/uapi/linux/sctp.h  |  1 +
 net/sctp/socket.c          | 62 ++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index dbe1b91..5764a3d 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -220,6 +220,7 @@ struct sctp_sock {
 	__u32 adaptation_ind;
 	__u32 pd_point;
 	__u16	nodelay:1,
+		reuse:1,
 		disable_fragments:1,
 		v4mapped:1,
 		frag_interleave:1,
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index b64d583..c02986a 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -100,6 +100,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_RECVNXTINFO	33
 #define SCTP_DEFAULT_SNDINFO	34
 #define SCTP_AUTH_DEACTIVATE_KEY	35
+#define SCTP_REUSE_PORT		36
 
 /* Internal Socket Options. Some of the sctp library functions are
  * implemented using these socket options.
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index d20f7ad..d71d224 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4169,6 +4169,28 @@ static int sctp_setsockopt_interleaving_supported(struct sock *sk,
 	return retval;
 }
 
+static int sctp_setsockopt_reuse_port(struct sock *sk, char __user *optval,
+				      unsigned int optlen)
+{
+	int val;
+
+	if (!sctp_style(sk, TCP))
+		return -EOPNOTSUPP;
+
+	if (sctp_sk(sk)->ep->base.bind_addr.port)
+		return -EFAULT;
+
+	if (optlen < sizeof(int))
+		return -EINVAL;
+
+	if (get_user(val, (int __user *)optval))
+		return -EFAULT;
+
+	sctp_sk(sk)->reuse = !!val;
+
+	return 0;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -4363,6 +4385,9 @@ static int sctp_setsockopt(struct sock *sk, int level, int optname,
 		retval = sctp_setsockopt_interleaving_supported(sk, optval,
 								optlen);
 		break;
+	case SCTP_REUSE_PORT:
+		retval = sctp_setsockopt_reuse_port(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -7196,6 +7221,26 @@ static int sctp_getsockopt_interleaving_supported(struct sock *sk, int len,
 	return retval;
 }
 
+static int sctp_getsockopt_reuse_port(struct sock *sk, int len,
+				      char __user *optval,
+				      int __user *optlen)
+{
+	int val;
+
+	if (len < sizeof(int))
+		return -EINVAL;
+
+	len = sizeof(int);
+	val = sctp_sk(sk)->reuse;
+	if (put_user(len, optlen))
+		return -EFAULT;
+
+	if (copy_to_user(optval, &val, len))
+		return -EFAULT;
+
+	return 0;
+}
+
 static int sctp_getsockopt(struct sock *sk, int level, int optname,
 			   char __user *optval, int __user *optlen)
 {
@@ -7391,6 +7436,9 @@ static int sctp_getsockopt(struct sock *sk, int level, int optname,
 		retval = sctp_getsockopt_interleaving_supported(sk, len, optval,
 								optlen);
 		break;
+	case SCTP_REUSE_PORT:
+		retval = sctp_getsockopt_reuse_port(sk, len, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -7428,6 +7476,7 @@ static struct sctp_bind_bucket *sctp_bucket_create(
 
 static long sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 {
+	bool reuse = (sk->sk_reuse || sctp_sk(sk)->reuse);
 	struct sctp_bind_hashbucket *head; /* hash list */
 	struct sctp_bind_bucket *pp;
 	unsigned short snum;
@@ -7500,13 +7549,11 @@ static long sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 		 * used by other socket (pp->owner not empty); that other
 		 * socket is going to be sk2.
 		 */
-		int reuse = sk->sk_reuse;
 		struct sock *sk2;
 
 		pr_debug("%s: found a possible match\n", __func__);
 
-		if (pp->fastreuse && sk->sk_reuse &&
-			sk->sk_state != SCTP_SS_LISTENING)
+		if (pp->fastreuse && reuse && sk->sk_state != SCTP_SS_LISTENING)
 			goto success;
 
 		/* Run through the list of sockets bound to the port
@@ -7524,7 +7571,7 @@ static long sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 			ep2 = sctp_sk(sk2)->ep;
 
 			if (sk == sk2 ||
-			    (reuse && sk2->sk_reuse &&
+			    (reuse && (sk2->sk_reuse || sctp_sk(sk2)->reuse) &&
 			     sk2->sk_state != SCTP_SS_LISTENING))
 				continue;
 
@@ -7548,12 +7595,12 @@ static long sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 	 * SO_REUSEADDR on this socket -sk-).
 	 */
 	if (hlist_empty(&pp->owner)) {
-		if (sk->sk_reuse && sk->sk_state != SCTP_SS_LISTENING)
+		if (reuse && sk->sk_state != SCTP_SS_LISTENING)
 			pp->fastreuse = 1;
 		else
 			pp->fastreuse = 0;
 	} else if (pp->fastreuse &&
-		(!sk->sk_reuse || sk->sk_state == SCTP_SS_LISTENING))
+		   (!reuse || sk->sk_state == SCTP_SS_LISTENING))
 		pp->fastreuse = 0;
 
 	/* We are set, so fill up all the data in the hash table
@@ -7684,7 +7731,7 @@ int sctp_inet_listen(struct socket *sock, int backlog)
 		err = 0;
 		sctp_unhash_endpoint(ep);
 		sk->sk_state = SCTP_SS_CLOSED;
-		if (sk->sk_reuse)
+		if (sk->sk_reuse || sctp_sk(sk)->reuse)
 			sctp_sk(sk)->bind_hash->fastreuse = 1;
 		goto out;
 	}
@@ -8549,6 +8596,7 @@ void sctp_copy_sock(struct sock *newsk, struct sock *sk,
 	newsk->sk_no_check_tx = sk->sk_no_check_tx;
 	newsk->sk_no_check_rx = sk->sk_no_check_rx;
 	newsk->sk_reuse = sk->sk_reuse;
+	sctp_sk(newsk)->reuse = sp->reuse;
 
 	newsk->sk_shutdown = sk->sk_shutdown;
 	newsk->sk_destruct = sctp_destruct_sock;
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH bpf v2] xdp: Fix handling of devmap in generic XDP
From: Toshiaki Makita @ 2018-06-14  3:40 UTC (permalink / raw)
  To: Y Song; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev,
	Jesper Dangaard Brouer
In-Reply-To: <CAH3MdRX7NdAdmEp-zmdR1tY6QZTm3yDx7j+wHoFxT4soOM=zRQ@mail.gmail.com>

On 2018/06/14 11:56, Y Song wrote:
...
>> @@ -586,6 +589,15 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>>         return 0;
>>  }
>>
>> +struct sk_buff;
>> +
>> +static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
>> +                                          struct sk_buff *skb,
>> +                                          struct bpf_prog *xdp_prog)
>> +{
>> +       return 0;
> 
> should you return an error code here?

My understanding is that this function never be called if
CONFIG_BPF_SYSCALL is not set so any value is OK.
I aligned it with other functions of devmap, specifically
dev_map_enqueue() which returns 0.

> 
>> +}
>> +
>>  static inline
>>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
>>  {
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 45fc0f5..8ddff1f 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -19,6 +19,7 @@
>>  #include <linux/cryptohash.h>
>>  #include <linux/set_memory.h>
>>  #include <linux/kallsyms.h>
>> +#include <linux/if_vlan.h>
>>
>>  #include <net/sch_generic.h>
>>
>> @@ -786,6 +787,21 @@ static inline bool bpf_dump_raw_ok(void)
>>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>>                                        const struct bpf_insn *patch, u32 len);
>>
>> +static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
>> +                                          struct net_device *fwd)
> 
> Previously this function is only used in filter.c and now it is only
> used in devmap.c. Maybe this function should be in devmap.c
> until it will be used cross different files?

This function is also called from xdp_do_generic_redirect() in
net/core/filter.c, so I can't move it to devmap.c.

Thanks,
Toshiaki Makita

^ permalink raw reply

* Re: [PATCH iproute2-next v2] ip-xfrm: Add support for OUTPUT_MARK
From: David Ahern @ 2018-06-14  3:39 UTC (permalink / raw)
  To: Lorenzo Colitti, Subash Abhinov Kasiviswanathan
  Cc: netdev, Stephen Hemminger, Steffen Klassert
In-Reply-To: <CAKD1Yr119qtuabrPL=MbVGeUgRykTeqCmjvCb1buAQU2yZCKjw@mail.gmail.com>

On 6/12/18 9:14 PM, Lorenzo Colitti wrote:
> On Wed, Jun 13, 2018 at 3:48 AM Subash Abhinov Kasiviswanathan
> <subashab@codeaurora.org> wrote:
>>
>> src 192.168.1.1 dst 192.168.1.2
>>         proto esp spi 0x00004321 reqid 0 mode tunnel
>>         replay-window 0 flag af-unspec
>>         mark 0x10000/0x3ffff
>>         output-mark 0x20000
> 
> Nit: I don't know what guarantees we provide (if any) that the output
> format of "ip xfrm state" does not change except to add new lines at
> the end. Personally, I feel that an app or script that depends on
> "auth-trunc" (or anything else, really) being on the line immediately
> after "mark" is brittle and should be fixed. This is particularly true
> since in general between the mark and the encryption there might be an
> auth-trunc line, or an auth line, or neither. As such, adding this
> line here seems OK to me.

any reason to put output-mark on its own line? Why not
	mark 0x10000/0x3ffff output-mark 0x20000

is the documentation clear on the difference between mark and output-mark?

> 
>> @@ -61,6 +61,7 @@ static void usage(void)
>>         fprintf(stderr, "        [ flag FLAG-LIST ] [ sel SELECTOR ] [ LIMIT-LIST ] [ encap ENCAP ]\n");
>>         fprintf(stderr, "        [ coa ADDR[/PLEN] ] [ ctx CTX ] [ extra-flag EXTRA-FLAG-LIST ]\n");
>>         fprintf(stderr, "        [ offload [dev DEV] dir DIR ]\n");
>> +       fprintf(stderr, "        [ output-mark OUTPUT-MARK]\n");
> 
> Nit: I think you want a space between OUTPUT-MARK and ].

yes.

> 
> Other than that,
> 
> Acked-by: Lorenzo Colitti <lorenzo@google.com>
> 

^ permalink raw reply

* Re: [PATCH bpf v2] xdp: Fix handling of devmap in generic XDP
From: Y Song @ 2018-06-14  2:56 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev,
	Jesper Dangaard Brouer
In-Reply-To: <1528942062-2353-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

On Wed, Jun 13, 2018 at 7:07 PM, Toshiaki Makita
<makita.toshiaki@lab.ntt.co.jp> wrote:
> Commit 67f29e07e131 ("bpf: devmap introduce dev_map_enqueue") changed
> the return value type of __devmap_lookup_elem() from struct net_device *
> to struct bpf_dtab_netdev * but forgot to modify generic XDP code
> accordingly.
> Thus generic XDP incorrectly used struct bpf_dtab_netdev where struct
> net_device is expected, then skb->dev was set to invalid value.
>
> v2:
> - Fix compiler warning without CONFIG_BPF_SYSCALL.
>
> Fixes: 67f29e07e131 ("bpf: devmap introduce dev_map_enqueue")
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
>  include/linux/bpf.h    | 12 ++++++++++++
>  include/linux/filter.h | 16 ++++++++++++++++
>  kernel/bpf/devmap.c    | 14 ++++++++++++++
>  net/core/filter.c      | 21 ++++-----------------
>  4 files changed, 46 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 995c3b1..7df32a3 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -488,12 +488,15 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
>
>  /* Map specifics */
>  struct xdp_buff;
> +struct sk_buff;
>
>  struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
>  void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
>  void __dev_map_flush(struct bpf_map *map);
>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>                     struct net_device *dev_rx);
> +int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
> +                            struct bpf_prog *xdp_prog);
>
>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
>  void __cpu_map_insert_ctx(struct bpf_map *map, u32 index);
> @@ -586,6 +589,15 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>         return 0;
>  }
>
> +struct sk_buff;
> +
> +static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
> +                                          struct sk_buff *skb,
> +                                          struct bpf_prog *xdp_prog)
> +{
> +       return 0;

should you return an error code here?

> +}
> +
>  static inline
>  struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
>  {
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 45fc0f5..8ddff1f 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -19,6 +19,7 @@
>  #include <linux/cryptohash.h>
>  #include <linux/set_memory.h>
>  #include <linux/kallsyms.h>
> +#include <linux/if_vlan.h>
>
>  #include <net/sch_generic.h>
>
> @@ -786,6 +787,21 @@ static inline bool bpf_dump_raw_ok(void)
>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>                                        const struct bpf_insn *patch, u32 len);
>
> +static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
> +                                          struct net_device *fwd)

Previously this function is only used in filter.c and now it is only
used in devmap.c. Maybe this function should be in devmap.c
until it will be used cross different files?

> +{
> +       unsigned int len;
> +
> +       if (unlikely(!(fwd->flags & IFF_UP)))
> +               return -ENETDOWN;
> +
> +       len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
> +       if (skb->len > len)
> +               return -EMSGSIZE;
> +
> +       return 0;
> +}
> +
>  /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
>   * same cpu context. Further for best results no more than a single map
>   * for the do_redirect/do_flush pair should be used. This limitation is
> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index a7cc7b3..642c97f 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -345,6 +345,20 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>         return bq_enqueue(dst, xdpf, dev_rx);
>  }
>
> +int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
> +                            struct bpf_prog *xdp_prog)
> +{
> +       int err;
> +
> +       err = __xdp_generic_ok_fwd_dev(skb, dst->dev);
> +       if (unlikely(err))
> +               return err;
> +       skb->dev = dst->dev;
> +       generic_xdp_tx(skb, xdp_prog);
> +
> +       return 0;
> +}
> +
>  static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
>  {
>         struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key);
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3d9ba7e..e7f12e9 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3214,20 +3214,6 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>
> -static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
> -{
> -       unsigned int len;
> -
> -       if (unlikely(!(fwd->flags & IFF_UP)))
> -               return -ENETDOWN;
> -
> -       len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
> -       if (skb->len > len)
> -               return -EMSGSIZE;
> -
> -       return 0;
> -}
> -
>  static int xdp_do_generic_redirect_map(struct net_device *dev,
>                                        struct sk_buff *skb,
>                                        struct xdp_buff *xdp,
> @@ -3256,10 +3242,11 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
>         }
>
>         if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
> -               if (unlikely((err = __xdp_generic_ok_fwd_dev(skb, fwd))))
> +               struct bpf_dtab_netdev *dst = fwd;
> +
> +               err = dev_map_generic_redirect(dst, skb, xdp_prog);
> +               if (unlikely(err))
>                         goto err;
> -               skb->dev = fwd;
> -               generic_xdp_tx(skb, xdp_prog);
>         } else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
>                 struct xdp_sock *xs = fwd;
>
> --
> 1.8.3.1
>
>

^ permalink raw reply

* Re: [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Eric Dumazet @ 2018-06-14  2:30 UTC (permalink / raw)
  To: Saeed Mahameed, Tariq Toukan, Martin KaFai Lau; +Cc: netdev, Eric Dumazet
In-Reply-To: <20180614005309.17357-1-saeedm@mellanox.com>



On 06/13/2018 05:53 PM, Saeed Mahameed wrote:
> When a new rx packet arrives, the rx path will decide whether to reuse
> the same page or not according to the current rx frag page offset and
> frag size, i.e:
> release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
> 
> Martin debugged this and he showed that this can cause mlx4 XDP to reuse
> buffers when it shouldn't.
> 
> Using frag_info->frag_size is wrong since frag size is always the port
> mtu and the frag stride might be larger, especially in XDP case where
> frag_stride == PAGE_SIZE.

Hmmm... why frag_size is not PAGE_SIZE for XDP then ?

This patch is a bit strange, since we do :

u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES);

frags->page_offset += sz_align;
release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;

So the @release condition is really to have enough space from frags->page_offset
to hold up to frag_info->frag_size bytes.

(NIC wont push more than frag_info->frag_size bytes, regardless of how big is out frag_stride )

Your patch would prevent a page being reused if say the amount of available bytes is 1536,
but frag_stride = 2048


I would suggest a patch like the following instead.

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5edd0cf2999cbde37b3404aafd700584696af940..83d6b17b102bc9a22bfd8e68863d079f38670501 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1082,7 +1082,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
         * This only works when num_frags == 1.
         */
        if (priv->tx_ring_num[TX_XDP]) {
-               priv->frag_info[0].frag_size = eff_mtu;
+               priv->frag_info[0].frag_size = PAGE_SIZE;
                /* This will gain efficient xdp frame recycling at the
                 * expense of more costly truesize accounting
                 */

^ permalink raw reply related

* [PATCH bpf v2] xdp: Fix handling of devmap in generic XDP
From: Toshiaki Makita @ 2018-06-14  2:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer

Commit 67f29e07e131 ("bpf: devmap introduce dev_map_enqueue") changed
the return value type of __devmap_lookup_elem() from struct net_device *
to struct bpf_dtab_netdev * but forgot to modify generic XDP code
accordingly.
Thus generic XDP incorrectly used struct bpf_dtab_netdev where struct
net_device is expected, then skb->dev was set to invalid value.

v2:
- Fix compiler warning without CONFIG_BPF_SYSCALL.

Fixes: 67f29e07e131 ("bpf: devmap introduce dev_map_enqueue")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/linux/bpf.h    | 12 ++++++++++++
 include/linux/filter.h | 16 ++++++++++++++++
 kernel/bpf/devmap.c    | 14 ++++++++++++++
 net/core/filter.c      | 21 ++++-----------------
 4 files changed, 46 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 995c3b1..7df32a3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -488,12 +488,15 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
 
 /* Map specifics */
 struct xdp_buff;
+struct sk_buff;
 
 struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
 void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
 void __dev_map_flush(struct bpf_map *map);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
+			     struct bpf_prog *xdp_prog);
 
 struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key);
 void __cpu_map_insert_ctx(struct bpf_map *map, u32 index);
@@ -586,6 +589,15 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+struct sk_buff;
+
+static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
+					   struct sk_buff *skb,
+					   struct bpf_prog *xdp_prog)
+{
+	return 0;
+}
+
 static inline
 struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key)
 {
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45fc0f5..8ddff1f 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -19,6 +19,7 @@
 #include <linux/cryptohash.h>
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
+#include <linux/if_vlan.h>
 
 #include <net/sch_generic.h>
 
@@ -786,6 +787,21 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
+static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
+					   struct net_device *fwd)
+{
+	unsigned int len;
+
+	if (unlikely(!(fwd->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
+	if (skb->len > len)
+		return -EMSGSIZE;
+
+	return 0;
+}
+
 /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
  * same cpu context. Further for best results no more than a single map
  * for the do_redirect/do_flush pair should be used. This limitation is
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index a7cc7b3..642c97f 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -345,6 +345,20 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return bq_enqueue(dst, xdpf, dev_rx);
 }
 
+int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
+			     struct bpf_prog *xdp_prog)
+{
+	int err;
+
+	err = __xdp_generic_ok_fwd_dev(skb, dst->dev);
+	if (unlikely(err))
+		return err;
+	skb->dev = dst->dev;
+	generic_xdp_tx(skb, xdp_prog);
+
+	return 0;
+}
+
 static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key);
diff --git a/net/core/filter.c b/net/core/filter.c
index 3d9ba7e..e7f12e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3214,20 +3214,6 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
-static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
-{
-	unsigned int len;
-
-	if (unlikely(!(fwd->flags & IFF_UP)))
-		return -ENETDOWN;
-
-	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
-	if (skb->len > len)
-		return -EMSGSIZE;
-
-	return 0;
-}
-
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3256,10 +3242,11 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 	}
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
-		if (unlikely((err = __xdp_generic_ok_fwd_dev(skb, fwd))))
+		struct bpf_dtab_netdev *dst = fwd;
+
+		err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		if (unlikely(err))
 			goto err;
-		skb->dev = fwd;
-		generic_xdp_tx(skb, xdp_prog);
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
 		struct xdp_sock *xs = fwd;
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: David Miller @ 2018-06-14  2:05 UTC (permalink / raw)
  To: nhorman; +Cc: lucien.xin, netdev, linux-sctp, marcelo.leitner, eric.dumazet
In-Reply-To: <20180614004643.GA12409@neilslaptop.think-freely.org>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 13 Jun 2018 20:46:43 -0400

> Do you have any performance numbers to compare with and without this
> patch?  Adding a function like this implies that any fixes that go
> into skb_gro_receive now need to be evaluated for this function too,
> which means theres an implied overhead in maintaining it.  If we're
> signing up for that, I'd like to know that theres a significant
> performance benefit.

Neil, I asked Xin and Marcelo to do this.

There is no reason for GSO code to use a GRO helper.

And this is, in particular, blocking some skb_gro_receive() surgery
I plan to perform.

^ permalink raw reply

* Re: [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse
From: Maciej Żenczykowski @ 2018-06-14  1:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrey Vagin, David Miller, netdev
In-Reply-To: <CANn89iJ6Yj8VPBQHsJcEcJUePRJz6NpZrL3OenSfiYccKDbDFA@mail.gmail.com>

> Hi Andrey
>
> This commit was reverted, do we still need  this patch ?

I think it still makes things easier to understand...

^ permalink raw reply

* Re: [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: Xin Long @ 2018-06-14  1:21 UTC (permalink / raw)
  To: Neil Horman
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, davem,
	Eric Dumazet
In-Reply-To: <20180614004643.GA12409@neilslaptop.think-freely.org>

On Thu, Jun 14, 2018 at 8:46 AM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Thu, Jun 14, 2018 at 07:37:02AM +0800, Xin Long wrote:
>> Now sctp GSO uses skb_gro_receive() to append the data into head
>> skb frag_list. However it actually only needs very few code from
>> skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
>> of its members are not needed here.
>>
>> This patch is to add sctp_packet_gso_append() to build GSO frames
>> instead of skb_gro_receive(), and it would avoid many unnecessary
>> checks and make the code clearer.
>>
>> Note that sctp will use page frags instead of frag_list to build
>> GSO frames in another patch. But it may take time, as sctp's GSO
>> frames may have different size. skb_segment() can only split it
>> into the frags with the same size, which would break the border
>> of sctp chunks.
>>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> Do you have any performance numbers to compare with and without this patch?
> Adding a function like this implies that any fixes that go into skb_gro_receive
> now need to be evaluated for this function too, which means theres an implied
> overhead in maintaining it.  If we're signing up for that, I'd like to know that
> theres a significant performance benefit.
Hi Neil,

I don't think there's a noticeable performance benefit since it's
just avoided some checks and variables settings.

The new function makes SCTP GSO code clearer and readable,
as skb_gro_receive() should only be used in the GRO code paths,
it's confusing in sctp tx path.

We're doing this, actually because skb_gro_receive() is being
changed now, it would not be suitable for SCTP GSO, see:
https://www.spinics.net/lists/netdev/msg507716.html

And also, we believe page frags will be used to replace frag_list
to build SCTP GSO frames soon. After that, this function will also
be dropped.

^ permalink raw reply

* Re: [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse
From: Maciej Żenczykowski @ 2018-06-14  1:10 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: David S. Miller, Linux NetDev, Eric Dumazet
In-Reply-To: <20180614005606.1057-1-avagin@openvz.org>

>  #define SK_NO_REUSE    0
>  #define SK_CAN_REUSE   1

since it's a boolean now these should go away too I believe.

should there simply/also be a separate privileged socket option to
set/get force reuse?

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-14  1:02 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, Jason Wang, Alexander Duyck, virtio-dev,
	qemu-devel, Jiri Pirko, Jakub Kicinski, Netdev, Brandeburg, Jesse,
	virtualization, aaron.f.brown
In-Reply-To: <b9c88515-cf41-2a9a-078e-9c9e5adbbf14@intel.com>

On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>>
>> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>>>>
>>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>>>>>
>>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>>>>>
>>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>>>>>>> device to
>>>>>>> act as a standby for another device with the same MAC address.
>>>>>>>
>>>>>>> I tested this with a small change to the patch to mark the STANDBY
>>>>>>> feature 'true'
>>>>>>> by default as i am using libvirt to start the VMs.
>>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>>>>>>> via libvirt
>>>>>>> XML file?
>>>>>>>
>>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>>
>>>>>> So I do not think we can commit to this interface: we
>>>>>> really need to control visibility of the primary device.
>>>>>
>>>>> The problem is legacy guest won't use primary device at all if we do
>>>>> this.
>>>>
>>>> And that's by design - I think it's the only way to ensure the
>>>> legacy guest isn't confused.
>>>
>>> Yes. I think so. But i am not sure if Qemu is the right place to control
>>> the visibility
>>> of the primary device. The primary device may not be specified as an
>>> argument to Qemu. It
>>> may be plugged in later.
>>> The cloud service provider is providing a feature that enables low
>>> latency datapath and live
>>> migration capability.
>>> A tenant can use this feature only if he is running a VM that has
>>> virtio-net with failover support.
>>
>> Well live migration is there already. The new feature is low latency
>> data path.
>
>
> we get live migration with just virtio.  But I meant live migration with VF
> as
> primary device.
>
>>
>> And it's the guest that needs failover support not the VM.
>
>
> Isn't guest and VM synonymous?
>
>
>>
>>
>>> I think Qemu should check if guest virtio-net supports this feature and
>>> provide a mechanism for
>>> an upper layer indicating if the STANDBY feature is successfully
>>> negotiated or not.
>>> The upper layer can then decide if it should hot plug a VF with the same
>>> MAC and manage the 2 links.
>>> If VF is successfully hot plugged, virtio-net link should be disabled.
>>
>> Did you even talk to upper layer management about it?
>> Just list the steps they need to do and you will see
>> that's a lot of machinery to manage by the upper layer.
>>
>> What do we gain in flexibility? As far as I can see the
>> only gain is some resources saved for legacy VMs.
>>
>> That's not a lot as tenant of the upper layer probably already has
>> at least a hunch that it's a new guest otherwise
>> why bother specifying the feature at all - you
>> save even more resources without it.
>>
>
> I am not all that familiar with how Qemu manages network devices. If we can
> do all the
> required management of the primary/standby devices within Qemu, that is
> definitely a better
> approach without upper layer involvement.

Right. I would imagine in the extreme case the upper layer doesn't
have to be involved at all if QEMU manages all hot plug/unplug logic.
The management tool can supply passthrough device and virtio with the
same group UUID, QEMU auto-manages the presence of the primary, and
hot plug the device as needed before or after the migration.

-Siwei

>
>
>>
>>
>>>>> How about control the visibility of standby device?
>>>>>
>>>>> Thanks
>>>>
>>>> standy the always there to guarantee no downtime.
>>>>
>>>>>> However just for testing purposes, we could add a non-stable
>>>>>> interface "x-standby" with the understanding that as any
>>>>>> x- prefix it's unstable and will be changed down the road,
>>>>>> likely in the next release.
>>>>>>
>>>>>>
>>>>>>> ---
>>>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>>     2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>>> index 90502fca7c..38b3140670 100644
>>>>>>> --- a/hw/net/virtio-net.c
>>>>>>> +++ b/hw/net/virtio-net.c
>>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>>                          true),
>>>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed,
>>>>>>> SPEED_UNKNOWN),
>>>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features,
>>>>>>> VIRTIO_NET_F_STANDBY,
>>>>>>> +                      false),
>>>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>>>     };
>>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h
>>>>>>> b/include/standard-headers/linux/virtio_net.h
>>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>>> @@ -57,6 +57,9 @@
>>>>>>>                                          * Steering */
>>>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23       /* Set MAC address */
>>>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for
>>>>>>> another device
>>>>>>> +                                         * with the same MAC.
>>>>>>> +                                         */
>>>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63        /* Device set
>>>>>>> linkspeed and duplex */
>>>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
>>>>>>> --
>>>>>>> 2.14.3
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>
>

^ permalink raw reply

* Re: [bpf PATCH 4/6] bpf: sockmap, tcp_disconnect to listen transition
From: Martin KaFai Lau @ 2018-06-14  0:56 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, netdev
In-Reply-To: <20180613175014.4264.84395.stgit@john-Precision-Tower-5810>

On Wed, Jun 13, 2018 at 10:50:14AM -0700, John Fastabend wrote:
> After adding checks to ensure TCP is in ESTABLISHED state when a
> sock is added we need to also ensure that user does not transition
> through tcp_disconnect() and back into ESTABLISHED state without
> sockmap removing the sock.
> 
> To do this add unhash hook and remove sock from map there.
In bpf_tcp_init():
        sk->sk_prot = &tcp_bpf_proto;

I may have missed a lock while reading sockmap.c.
Is it possible that tcp_disconnect() is being called while
the above assignment is also being done (e.g. through BPF_MAP_UPDATE_ELEM)?
The same situation go for the ESTABLISHED check.

> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  kernel/bpf/sockmap.c |   67 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 50 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 2e848cd..91c7b47 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -130,6 +130,7 @@ struct smap_psock {
>  
>  	struct proto *sk_proto;
>  	void (*save_close)(struct sock *sk, long timeout);
> +	void (*save_unhash)(struct sock *sk);
>  	void (*save_data_ready)(struct sock *sk);
>  	void (*save_write_space)(struct sock *sk);
>  };
> @@ -141,6 +142,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
>  			    int offset, size_t size, int flags);
>  static void bpf_tcp_close(struct sock *sk, long timeout);
> +static void bpf_tcp_unhash(struct sock *sk);
>  
>  static inline struct smap_psock *smap_psock_sk(const struct sock *sk)
>  {
> @@ -182,6 +184,7 @@ static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS],
>  {
>  	prot[SOCKMAP_BASE]			= *base;
>  	prot[SOCKMAP_BASE].close		= bpf_tcp_close;
> +	prot[SOCKMAP_BASE].unhash		= bpf_tcp_unhash;
>  	prot[SOCKMAP_BASE].recvmsg		= bpf_tcp_recvmsg;
>  	prot[SOCKMAP_BASE].stream_memory_read	= bpf_tcp_stream_read;
>  
> @@ -215,6 +218,7 @@ static int bpf_tcp_init(struct sock *sk)
>  	}
>  
>  	psock->save_close = sk->sk_prot->close;
> +	psock->save_unhash = sk->sk_prot->unhash;
>  	psock->sk_proto = sk->sk_prot;
>  
>  	/* Build IPv6 sockmap whenever the address of tcpv6_prot changes */
> @@ -302,28 +306,12 @@ struct smap_psock_map_entry *psock_map_pop(struct sock *sk,
>  	return e;
>  }
>  
> -static void bpf_tcp_close(struct sock *sk, long timeout)
> +static void bpf_tcp_remove(struct sock *sk, struct smap_psock *psock)
>  {
> -	void (*close_fun)(struct sock *sk, long timeout);
>  	struct smap_psock_map_entry *e;
>  	struct sk_msg_buff *md, *mtmp;
> -	struct smap_psock *psock;
>  	struct sock *osk;
>  
> -	rcu_read_lock();
> -	psock = smap_psock_sk(sk);
> -	if (unlikely(!psock)) {
> -		rcu_read_unlock();
> -		return sk->sk_prot->close(sk, timeout);
> -	}
> -
> -	/* The psock may be destroyed anytime after exiting the RCU critial
> -	 * section so by the time we use close_fun the psock may no longer
> -	 * be valid. However, bpf_tcp_close is called with the sock lock
> -	 * held so the close hook and sk are still valid.
> -	 */
> -	close_fun = psock->save_close;
> -
>  	if (psock->cork) {
>  		free_start_sg(psock->sock, psock->cork);
>  		kfree(psock->cork);
> @@ -378,6 +366,51 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
>  		}
>  		e = psock_map_pop(sk, psock);
>  	}
> +}
> +
> +static void bpf_tcp_unhash(struct sock *sk)
> +{
> +	void (*unhash_fun)(struct sock *sk);
> +	struct smap_psock *psock;
> +
> +	rcu_read_lock();
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock)) {
> +		rcu_read_unlock();
> +		return sk->sk_prot->unhash(sk);
> +	}
> +
> +	/* The psock may be destroyed anytime after exiting the RCU critial
> +	 * section so by the time we use close_fun the psock may no longer
> +	 * be valid. However, bpf_tcp_close is called with the sock lock
> +	 * held so the close hook and sk are still valid.
> +	 */
> +	unhash_fun = psock->save_unhash;
> +	bpf_tcp_remove(sk, psock);
> +	rcu_read_unlock();
> +	unhash_fun(sk);
> +
> +}
> +
> +static void bpf_tcp_close(struct sock *sk, long timeout)
> +{
> +	void (*close_fun)(struct sock *sk, long timeout);
> +	struct smap_psock *psock;
> +
> +	rcu_read_lock();
> +	psock = smap_psock_sk(sk);
> +	if (unlikely(!psock)) {
> +		rcu_read_unlock();
> +		return sk->sk_prot->close(sk, timeout);
> +	}
> +
> +	/* The psock may be destroyed anytime after exiting the RCU critial
> +	 * section so by the time we use close_fun the psock may no longer
> +	 * be valid. However, bpf_tcp_close is called with the sock lock
> +	 * held so the close hook and sk are still valid.
> +	 */
> +	close_fun = psock->save_close;
> +	bpf_tcp_remove(sk, psock);
>  	rcu_read_unlock();
>  	close_fun(sk, timeout);
>  }
> 

^ permalink raw reply

* [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse
From: Andrei Vagin @ 2018-06-14  0:56 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Andrei Vagin, Maciej Żenczykowski, Eric Dumazet

The commit f396922d862a added a check to not allow changing
SO_REUSEADDR/SO_REUSEPORT on bound sockets. First, it doesn't
take into account that TCP_REPAIR changes SO_REUSEADDR. Second, now it
is impossible to restore a socket state and set SO_REUSEADDR,
because the kernel always sets SO_REUSEADDR into zero after disabling
the repair mode.

Currently, sk_reuse can have three values: SK_NO_REUSE, SK_CAN_REUSE,
SK_FORCE_REUSE. SK_CAN_REUSE is set by SOL_REUSEADDR.  SK_FORCE_REUSE is
used to ignore bind conflicts for sockets in the repair mode.

This patch makes sk->sk_reuse back into a boolean and adds
sk->sk_force_reuse to track SK_FORCE_REUSE separatly.

In this case, the tcp_repair mode doesn't affect sk_reuse and
it's value can be set before switching a socketn into the repair mode.

Fixes: f396922d862a ("net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
---
 include/net/sock.h              | 13 ++++---------
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/tcp.c                  |  4 ++--
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b75419eafe..8ad19286ab9e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -130,6 +130,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
+ *	@skc_force_reuse: ignore bind conflicts
  *	@skc_reuseport: %SO_REUSEPORT setting
  *	@skc_bound_dev_if: bound device index if != 0
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
@@ -174,7 +175,8 @@ struct sock_common {
 
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse:4;
+	unsigned char		skc_reuse:1;
+	unsigned char		skc_force_reuse:1;
 	unsigned char		skc_reuseport:1;
 	unsigned char		skc_ipv6only:1;
 	unsigned char		skc_net_refcnt:1;
@@ -339,6 +341,7 @@ struct sock {
 #define sk_family		__sk_common.skc_family
 #define sk_state		__sk_common.skc_state
 #define sk_reuse		__sk_common.skc_reuse
+#define sk_force_reuse		__sk_common.skc_force_reuse
 #define sk_reuseport		__sk_common.skc_reuseport
 #define sk_ipv6only		__sk_common.skc_ipv6only
 #define sk_net_refcnt		__sk_common.skc_net_refcnt
@@ -502,16 +505,8 @@ enum sk_pacing {
 #define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
 #define rcu_assign_sk_user_data(sk, ptr)	rcu_assign_pointer(__sk_user_data((sk)), ptr)
 
-/*
- * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
- * or not whether his port will be reused by someone else. SK_FORCE_REUSE
- * on a socket means that the socket will reuse everybody else's port
- * without looking at the other's sk_reuse value.
- */
-
 #define SK_NO_REUSE	0
 #define SK_CAN_REUSE	1
-#define SK_FORCE_REUSE	2
 
 int sk_set_peek_off(struct sock *sk, int val);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 33a88e045efd..2ac1c591b60c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -306,7 +306,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 		goto fail_unlock;
 tb_found:
 	if (!hlist_empty(&tb->owners)) {
-		if (sk->sk_reuse == SK_FORCE_REUSE)
+		if (sk->sk_force_reuse)
 			goto success;
 
 		if ((tb->fastreuse > 0 && reuse) ||
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2741953adaba..70bfdd5a2fc4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2810,11 +2810,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			err = -EPERM;
 		else if (val == 1) {
 			tp->repair = 1;
-			sk->sk_reuse = SK_FORCE_REUSE;
+			sk->sk_force_reuse = 1;
 			tp->repair_queue = TCP_NO_QUEUE;
 		} else if (val == 0) {
 			tp->repair = 0;
-			sk->sk_reuse = SK_NO_REUSE;
+			sk->sk_force_reuse = 0;
 			tcp_send_window_probe(sk);
 		} else
 			err = -EINVAL;
-- 
2.17.0

^ permalink raw reply related

* [net RFC] net/mlx4_en: Use frag stride in crossing page boundary condition
From: Saeed Mahameed @ 2018-06-14  0:53 UTC (permalink / raw)
  To: Tariq Toukan, Martin KaFai Lau; +Cc: netdev, Saeed Mahameed, Eric Dumazet

When a new rx packet arrives, the rx path will decide whether to reuse
the same page or not according to the current rx frag page offset and
frag size, i.e:
release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;

Martin debugged this and he showed that this can cause mlx4 XDP to reuse
buffers when it shouldn't.

Using frag_info->frag_size is wrong since frag size is always the port
mtu and the frag stride might be larger, especially in XDP case where
frag_stride == PAGE_SIZE.

In XDP there is an assumption to have a page per packet and reuse can
break such assumption and might cause packet data corruptions, since in 
XDP frag_offset will always reset to the beginning of the page when
refilling the rx buffer.

Fix this by using the stride size rather than frag size in "release"
condition evaluation.

For non XDP setup this will yield the same behavior since frag_stride
already equals to ALIGN(frag_size, SMP_CACHE_BYTES) and on XDP setup the
"release" condition will always be true as it is supposed to be since
frag_stride == PAGE_SIZE.

Fixes: 34db548bfb95 ("mlx4: add page recycling in receive path")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reported-by: Martin KaFai Lau <kafai@fb.com>
CC: Eric Dumazet <edumazet@google.com>

---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 5c613c6663da..f63dde0288b7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -504,7 +504,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
 			u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES);

 			frags->page_offset += sz_align;
-			release = frags->page_offset + frag_info->frag_size > PAGE_SIZE;
+			release = frags->page_offset + frag_info->frag_stride > PAGE_SIZE;
 		}
 		if (release) {
 			dma_unmap_page(priv->ddev, dma, PAGE_SIZE, priv->dma_dir);
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: Neil Horman @ 2018-06-14  0:46 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, davem,
	eric.dumazet
In-Reply-To: <d41bb7dda9b5c176d5c0a23a8705744f49fcb570.1528933022.git.lucien.xin@gmail.com>

On Thu, Jun 14, 2018 at 07:37:02AM +0800, Xin Long wrote:
> Now sctp GSO uses skb_gro_receive() to append the data into head
> skb frag_list. However it actually only needs very few code from
> skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
> of its members are not needed here.
> 
> This patch is to add sctp_packet_gso_append() to build GSO frames
> instead of skb_gro_receive(), and it would avoid many unnecessary
> checks and make the code clearer.
> 
> Note that sctp will use page frags instead of frag_list to build
> GSO frames in another patch. But it may take time, as sctp's GSO
> frames may have different size. skb_segment() can only split it
> into the frags with the same size, which would break the border
> of sctp chunks.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
Do you have any performance numbers to compare with and without this patch?
Adding a function like this implies that any fixes that go into skb_gro_receive
now need to be evaluated for this function too, which means theres an implied
overhead in maintaining it.  If we're signing up for that, I'd like to know that
theres a significant performance benefit.

Neil

^ permalink raw reply

* Re: [PATCH bpf] xdp: Fix handling of devmap in generic XDP
From: Toshiaki Makita @ 2018-06-14  0:33 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Alexei Starovoitov, Daniel Borkmann, netdev,
	Jesper Dangaard Brouer
In-Reply-To: <201806131752.fqIMeuHk%fengguang.wu@intel.com>

On 2018/06/13 18:27, kbuild test robot wrote:
> Hi Toshiaki,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on bpf/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Toshiaki-Makita/xdp-Fix-handling-of-devmap-in-generic-XDP/20180613-161204
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
> config: i386-randconfig-a1-201823 (attached as .config)
> compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All warnings (new ones prefixed by >>):
> 
>    In file included from net//bpf/test_run.c:7:0:
>>> include/linux/bpf.h:594:16: warning: 'struct sk_buff' declared inside parameter list
>             struct bpf_prog *xdp_prog)
>                    ^
>>> include/linux/bpf.h:594:16: warning: its scope is only this definition or declaration, which is probably not what you want
> 
> vim +594 include/linux/bpf.h
> 
>    591	
>    592	static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
>    593						   struct sk_buff *skb,
>  > 594						   struct bpf_prog *xdp_prog)
>    595	{
>    596		return 0;
>    597	}
>    598	

Ugh I did build test for the entire tree with CONFIG_BPF_SYSCALL, and
net/core and kernel/bpf without CONFIG_BPF_SYSCALL but not net/bpf.
I'll make sure to test the entire tree next. will send v2.

-- 
Toshiaki Makita

^ permalink raw reply

* Re: [PATCH net-queue] i40e: Fix incorrect skb reserved size on rx
From: Toshiaki Makita @ 2018-06-14  0:25 UTC (permalink / raw)
  To: Daniel Borkmann, Jeff Kirsher; +Cc: John Fastabend, intel-wired-lan, netdev
In-Reply-To: <8ec63217-2aab-b240-ce95-530e5f1c1f15@iogearbox.net>

On 2018/06/13 18:06, Daniel Borkmann wrote:
> On 06/13/2018 10:08 AM, Toshiaki Makita wrote:
>> i40e_build_skb() reserves I40E_SKB_PAD + (xdp->data -
>> xdp->data_hard_start) but obviously I40E_SKB_PAD is unnecessary here
>> and mac_header/data feilds in skb becomes incorrect, and breaks normal
>> skb receive path as well as XDP receive path.
>>
>> Fixes: cc5b114dcf98 ("bpf, i40e: add meta data support")
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> Thanks Toshiaki, I sent a complete fix yesterday here:
> 
> https://lkml.org/lkml/2018/6/12/843

Oh I was not aware of it. Thanks, let's go with your patch.

-- 
Toshiaki Makita

^ permalink raw reply

* Re: [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: Marcelo Ricardo Leitner @ 2018-06-13 23:42 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, Neil Horman, davem, eric.dumazet
In-Reply-To: <d41bb7dda9b5c176d5c0a23a8705744f49fcb570.1528933022.git.lucien.xin@gmail.com>

On Thu, Jun 14, 2018 at 07:37:02AM +0800, Xin Long wrote:
> Now sctp GSO uses skb_gro_receive() to append the data into head
> skb frag_list. However it actually only needs very few code from
> skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
> of its members are not needed here.
> 
> This patch is to add sctp_packet_gso_append() to build GSO frames
> instead of skb_gro_receive(), and it would avoid many unnecessary
> checks and make the code clearer.
> 
> Note that sctp will use page frags instead of frag_list to build
> GSO frames in another patch. But it may take time, as sctp's GSO
> frames may have different size. skb_segment() can only split it
> into the frags with the same size, which would break the border
> of sctp chunks.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

I cannot test it, but it looks good to me. Thanks Xin

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

> ---
>  include/net/sctp/structs.h |  5 +++++
>  net/sctp/output.c          | 28 ++++++++++++++++++----------
>  2 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index ebf809e..dbe1b91 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -1133,6 +1133,11 @@ struct sctp_input_cb {
>  };
>  #define SCTP_INPUT_CB(__skb)	((struct sctp_input_cb *)&((__skb)->cb[0]))
>  
> +struct sctp_output_cb {
> +	struct sk_buff *last;
> +};
> +#define SCTP_OUTPUT_CB(__skb)	((struct sctp_output_cb *)&((__skb)->cb[0]))
> +
>  static inline const struct sk_buff *sctp_gso_headskb(const struct sk_buff *skb)
>  {
>  	const struct sctp_chunk *chunk = SCTP_INPUT_CB(skb)->chunk;
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index e672dee..7f849b0 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -409,6 +409,21 @@ static void sctp_packet_set_owner_w(struct sk_buff *skb, struct sock *sk)
>  	refcount_inc(&sk->sk_wmem_alloc);
>  }
>  
> +static void sctp_packet_gso_append(struct sk_buff *head, struct sk_buff *skb)
> +{
> +	if (SCTP_OUTPUT_CB(head)->last == head)
> +		skb_shinfo(head)->frag_list = skb;
> +	else
> +		SCTP_OUTPUT_CB(head)->last->next = skb;
> +	SCTP_OUTPUT_CB(head)->last = skb;
> +
> +	head->truesize += skb->truesize;
> +	head->data_len += skb->len;
> +	head->len += skb->len;
> +
> +	__skb_header_release(skb);
> +}
> +
>  static int sctp_packet_pack(struct sctp_packet *packet,
>  			    struct sk_buff *head, int gso, gfp_t gfp)
>  {
> @@ -422,7 +437,7 @@ static int sctp_packet_pack(struct sctp_packet *packet,
>  
>  	if (gso) {
>  		skb_shinfo(head)->gso_type = sk->sk_gso_type;
> -		NAPI_GRO_CB(head)->last = head;
> +		SCTP_OUTPUT_CB(head)->last = head;
>  	} else {
>  		nskb = head;
>  		pkt_size = packet->size;
> @@ -503,15 +518,8 @@ static int sctp_packet_pack(struct sctp_packet *packet,
>  					 &packet->chunk_list);
>  		}
>  
> -		if (gso) {
> -			if (skb_gro_receive(&head, nskb)) {
> -				kfree_skb(nskb);
> -				return 0;
> -			}
> -			if (WARN_ON_ONCE(skb_shinfo(head)->gso_segs >=
> -					 sk->sk_gso_max_segs))
> -				return 0;
> -		}
> +		if (gso)
> +			sctp_packet_gso_append(head, nskb);
>  
>  		pkt_count++;
>  	} while (!list_empty(&packet->chunk_list));
> -- 
> 2.1.0
>

^ permalink raw reply

* [PULL v2] vhost, virtio
From: Michael S. Tsirkin @ 2018-06-13 23:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: ohad, kevin, kvm, mst, netdev, linux-remoteproc, linux-kernel,
	stable, bjorn.andersson, syzbot+87cfa083e727a224754b,
	virtualization

Page hints are reworked - I dropped them for now.

The following changes since commit 29dcea88779c856c7dc92040a0c01233263101d4:

  Linux 4.17 (2018-06-03 14:15:21 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 2eb98105f8c7f4b867f7f714a998f5b8c1bb009b:

  virtio: update the comments for transport features (2018-06-12 04:59:29 +0300)

----------------------------------------------------------------
virtio, vhost: features, fixes

VF support for virtio.
DMA barriers for virtio strong barriers.
Bugfixes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

----------------------------------------------------------------
Michael S. Tsirkin (2):
      virtio_ring: switch to dma_XX barriers for rpmsg
      vhost: fix info leak due to uninitialized memory

Tiwei Bie (2):
      virtio_pci: support enabling VFs
      virtio: update the comments for transport features

 drivers/vhost/vhost.c              |  3 +++
 drivers/virtio/virtio_pci_common.c | 30 ++++++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_modern.c | 14 ++++++++++++++
 include/linux/virtio_ring.h        |  4 ++--
 include/uapi/linux/virtio_config.h | 16 ++++++++++++----
 5 files changed, 61 insertions(+), 6 deletions(-)

^ permalink raw reply

* [PATCH net-next] sctp: define sctp_packet_gso_append to build GSO frames
From: Xin Long @ 2018-06-13 23:37 UTC (permalink / raw)
  To: network dev, linux-sctp
  Cc: Marcelo Ricardo Leitner, Neil Horman, davem, eric.dumazet

Now sctp GSO uses skb_gro_receive() to append the data into head
skb frag_list. However it actually only needs very few code from
skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
of its members are not needed here.

This patch is to add sctp_packet_gso_append() to build GSO frames
instead of skb_gro_receive(), and it would avoid many unnecessary
checks and make the code clearer.

Note that sctp will use page frags instead of frag_list to build
GSO frames in another patch. But it may take time, as sctp's GSO
frames may have different size. skb_segment() can only split it
into the frags with the same size, which would break the border
of sctp chunks.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/sctp/structs.h |  5 +++++
 net/sctp/output.c          | 28 ++++++++++++++++++----------
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index ebf809e..dbe1b91 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1133,6 +1133,11 @@ struct sctp_input_cb {
 };
 #define SCTP_INPUT_CB(__skb)	((struct sctp_input_cb *)&((__skb)->cb[0]))
 
+struct sctp_output_cb {
+	struct sk_buff *last;
+};
+#define SCTP_OUTPUT_CB(__skb)	((struct sctp_output_cb *)&((__skb)->cb[0]))
+
 static inline const struct sk_buff *sctp_gso_headskb(const struct sk_buff *skb)
 {
 	const struct sctp_chunk *chunk = SCTP_INPUT_CB(skb)->chunk;
diff --git a/net/sctp/output.c b/net/sctp/output.c
index e672dee..7f849b0 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -409,6 +409,21 @@ static void sctp_packet_set_owner_w(struct sk_buff *skb, struct sock *sk)
 	refcount_inc(&sk->sk_wmem_alloc);
 }
 
+static void sctp_packet_gso_append(struct sk_buff *head, struct sk_buff *skb)
+{
+	if (SCTP_OUTPUT_CB(head)->last == head)
+		skb_shinfo(head)->frag_list = skb;
+	else
+		SCTP_OUTPUT_CB(head)->last->next = skb;
+	SCTP_OUTPUT_CB(head)->last = skb;
+
+	head->truesize += skb->truesize;
+	head->data_len += skb->len;
+	head->len += skb->len;
+
+	__skb_header_release(skb);
+}
+
 static int sctp_packet_pack(struct sctp_packet *packet,
 			    struct sk_buff *head, int gso, gfp_t gfp)
 {
@@ -422,7 +437,7 @@ static int sctp_packet_pack(struct sctp_packet *packet,
 
 	if (gso) {
 		skb_shinfo(head)->gso_type = sk->sk_gso_type;
-		NAPI_GRO_CB(head)->last = head;
+		SCTP_OUTPUT_CB(head)->last = head;
 	} else {
 		nskb = head;
 		pkt_size = packet->size;
@@ -503,15 +518,8 @@ static int sctp_packet_pack(struct sctp_packet *packet,
 					 &packet->chunk_list);
 		}
 
-		if (gso) {
-			if (skb_gro_receive(&head, nskb)) {
-				kfree_skb(nskb);
-				return 0;
-			}
-			if (WARN_ON_ONCE(skb_shinfo(head)->gso_segs >=
-					 sk->sk_gso_max_segs))
-				return 0;
-		}
+		if (gso)
+			sctp_packet_gso_append(head, nskb);
 
 		pkt_count++;
 	} while (!list_empty(&packet->chunk_list));
-- 
2.1.0

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox