Netdev List


* [RFC PATCH net] tcp: allow to use TCP Fastopen with MSG_ZEROCOPY
From: Alexey Kodanev @ 2018-04-03 12:43 UTC
  To: netdev; +Cc: Willem de Bruijn, Eric Dumazet, David Miller, Alexey Kodanev

With TCP Fastopen we can have the following cases, which could also
use MSG_ZEROCOPY flag with send() and sendto():

* sendto() + MSG_FASTOPEN flag, sk state can be in TCP_CLOSE at
  the start of tcp_sendmsg()

* set socket option TCP_FASTOPEN_CONNECT, then connect()
  and send(), sk state in TCP_SYN_SENT

Currently, both cases with tcp_sendmsg() and the MSG_ZEROCOPY flag result
in an EINVAL error, because of the check for the TCP_ESTABLISHED sk state
at the beginning of tcp_sendmsg().

Handling both cases would require two more checks there:
!tp->fastopen_connect and !(flags & MSG_FASTOPEN). Instead, it looks like
we could remove the original check altogether, since this is an unlikely
event. That way tcp_sendmsg() without TFO should fail with EPIPE in
sk_stream_wait_connect(), as it did before MSG_ZEROCOPY was introduced
there, and the TFO cases should work smoothly.

Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
---

Is there something I've overlooked that prevents us from doing this, so
that this type of error with sendto() + TFO should be handled in
userspace instead?

 net/ipv4/tcp.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9225610..768f02c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1193,11 +1193,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	flags = msg->msg_flags;

 	if (flags & MSG_ZEROCOPY && size) {
-		if (sk->sk_state != TCP_ESTABLISHED) {
-			err = -EINVAL;
-			goto out_err;
-		}
-
 		skb = tcp_write_queue_tail(sk);
 		uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
 		if (!uarg) {
-- 
1.8.3.1

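As a rough illustration of the two options weighed in the patch
description above, the following is a small userspace model, not kernel
code. The names demo_sock, demo_state and DEMO_MSG_FASTOPEN are
hypothetical stand-ins for struct sock, sk->sk_state and MSG_FASTOPEN.
Variant A is one reading of "two more checks"; variant B models the
removal that the RFC patch actually makes.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; the values are illustrative. */
enum demo_state { DEMO_CLOSE, DEMO_SYN_SENT, DEMO_ESTABLISHED };

#define DEMO_MSG_FASTOPEN 0x1u	/* stand-in for MSG_FASTOPEN */

struct demo_sock {
	enum demo_state state;	/* models sk->sk_state */
	bool fastopen_connect;	/* models tp->fastopen_connect */
};

/*
 * Variant A: keep the state check but exempt the two TFO cases.
 * Returns -EINVAL when a zerocopy send would be rejected, 0 otherwise.
 */
int zc_check_with_tfo_exemptions(const struct demo_sock *sk,
				 unsigned int flags)
{
	if (sk->state != DEMO_ESTABLISHED &&
	    !sk->fastopen_connect &&
	    !(flags & DEMO_MSG_FASTOPEN))
		return -EINVAL;
	return 0;
}

/*
 * Variant B: no state check at this point. Per the patch description,
 * a non-TFO send on an unconnected socket is expected to fail later
 * with EPIPE, which this model does not reproduce.
 */
int zc_check_removed(const struct demo_sock *sk, unsigned int flags)
{
	(void)sk;
	(void)flags;
	return 0;
}
```

Both variants let the two TFO cases through. They differ only in where a
non-TFO send on an unconnected socket is rejected.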

* Re: WARNING in add_uevent_var
From: Johannes Berg @ 2018-04-03 12:34 UTC
  To: syzbot, davem, linux-kernel, linux-wireless, netdev,
	syzkaller-bugs
In-Reply-To: <000000000000a010b80568d75018@google.com>

On Sun, 2018-04-01 at 23:01 -0700, syzbot wrote:

> So far this crash happened 5 times on net-next, upstream.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6614377067184128
> 

Huh, fun. Looks like you're basically creating a new HWSIM radio with an
insanely long name (4k!) and nothing stops you, until we try to generate
an rfkill instance which sends a uevent and only has a 2k buffer for the
environment variables, where we put the name ...

But yeah, we should probably limit the phy name to something sane, I'll
pick 128 and send a patch.

johannes

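The kind of limit mentioned in the reply above could look roughly like
the sketch below. It is not the actual fix, which is not part of this
thread excerpt. The function name and the choice to count 128 bytes
excluding the terminating NUL are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PHY_NAME_MAX 128	/* the limit suggested in the reply */

/*
 * Returns 0 if `name` is NUL-terminated within `buf_len` bytes and no
 * longer than DEMO_PHY_NAME_MAX characters, -EINVAL otherwise.
 */
int demo_check_phy_name(const char *name, size_t buf_len)
{
	const char *end = memchr(name, '\0', buf_len);

	if (!end)			/* not terminated inside the buffer */
		return -EINVAL;
	if ((size_t)(end - name) > DEMO_PHY_NAME_MAX)
		return -EINVAL;
	return 0;
}
```

Rejecting the name at creation time keeps the oversized string from ever
reaching the fixed-size uevent environment buffer described above.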

* Re: [PATCH] vhost-net: add limitation of sent packets for tx polling
From: haibinzhang(张海斌) @ 2018-04-03 12:29 UTC
  To: Michael S. Tsirkin
  Cc: Jason Wang, kvm@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	lidongchen(陈立东),
	yunfangtai(台运方)


>On Tue, Apr 03, 2018 at 08:08:26AM +0000, haibinzhang wrote:
>> handle_tx will delay rx for a long time when tx busy polling udp packets
>> with small length(e.g. 1byte udp payload), because setting VHOST_NET_WEIGHT
>> takes into account only sent-bytes but no single packet length.
>> 
>> Tests were done between two Virtual Machines using netperf(UDP_STREAM, len=1),
>> then another machine pinged the client. Result shows as follow:
>> 
>> Packet#       Ping-Latency(ms)
>>               min     avg     max
>> Origin      3.319  18.489  57.503
>> 64          1.643   2.021   2.552
>> 128         1.825   2.600   3.224
>> 256         1.997   2.710   4.295
>> 512*        1.860   3.171   4.631
>> 1024        2.002   4.173   9.056
>> 2048        2.257   5.650   9.688
>> 4096        2.093   8.508  15.943
>> 
>> 512 is selected, which is multi-VRING_SIZE
>
>There's no guarantee vring size is 256.
>
>Could you pls try with a different tx ring size?
>
>I suspect we want:
>
>#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
>
>
>> and close to VHOST_NET_WEIGHT/MTU.
>
>Puzzled by this part.  Does tweaking MTU change anything?

The Ethernet MTU is 1500, so VHOST_NET_WEIGHT/MTU is 0x80000/1500, about 350.
With a packet limit below 350, sent-bytes cannot reach VHOST_NET_WEIGHT in one
handle_tx call even with 1500-byte frames, so the packet limit must be larger
than 350. 512 meets this condition and is also a multiple of the default
VRING_SIZE.
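
The arithmetic above can be checked with a few lines of C. This is only
a worked example; pkts_to_reach_weight() is a hypothetical helper, and
the weight value is the one quoted in the patch.

```c
#include <stddef.h>

#define VHOST_NET_WEIGHT 0x80000	/* byte budget quoted in the patch */

/* Smallest number of frames whose total length reaches `weight` bytes. */
size_t pkts_to_reach_weight(size_t weight, size_t frame_len)
{
	return (weight + frame_len - 1) / frame_len;	/* round up */
}
```

For 1500-byte frames this gives 350, so a packet limit of 512 leaves the
byte budget as the first limit to trigger for full-size frames.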

>
>> To evaluate this change, another tests were done using netperf(RR, TX) between
>> two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. Result as follow
>> does not show obvious changes:
>> 
>> TCP_RR
>> 
>> size/sessions/+thu%/+normalize%
>>    1/       1/  -7%/        -2%
>>    1/       4/  +1%/         0%
>>    1/       8/  +1%/        -2%
>>   64/       1/  -6%/         0%
>>   64/       4/   0%/        +2%
>>   64/       8/   0%/         0%
>>  256/       1/  -3%/        -4%
>>  256/       4/  +3%/        +4%
>>  256/       8/  +2%/         0%
>> 
>> UDP_RR
>> 
>> size/sessions/+thu%/+normalize%
>>    1/       1/  -5%/        +1%
>>    1/       4/  +4%/        +1%
>>    1/       8/  -1%/        -1%
>>   64/       1/  -2%/        -3%
>>   64/       4/  -5%/        -1%
>>   64/       8/   0%/        -1%
>>  256/       1/  +7%/        +1%
>>  256/       4/  +1%/        +1%
>>  256/       8/  +2%/        +2%
>> 
>> TCP_STREAM
>> 
>> size/sessions/+thu%/+normalize%
>>   64/       1/   0%/        -3%
>>   64/       4/  +3%/        -1%
>>   64/       8/  +9%/        -4%
>>  256/       1/  +1%/        -4%
>>  256/       4/  -1%/        -1%
>>  256/       8/  +7%/        +5%
>>  512/       1/  +1%/         0%
>>  512/       4/  +1%/        -1%
>>  512/       8/  +7%/        -5%
>> 1024/       1/   0%/        -1%
>> 1024/       4/  +3%/         0%
>> 1024/       8/  +8%/        +5%
>> 2048/       1/  +2%/        +2%
>> 2048/       4/  +1%/         0%
>> 2048/       8/  -2%/         0%
>> 4096/       1/  -2%/         0%
>> 4096/       4/  +2%/         0%
>> 4096/       8/  +9%/        -2%
>> 
>> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
>> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
>> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>> ---
>>  drivers/vhost/net.c | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 8139bc70ad7d..13a23f3f3ea4 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>>   * Using this limit prevents one virtqueue from starving others. */
>>  #define VHOST_NET_WEIGHT 0x80000
>>  
>> +/* Max number of packets transferred before requeueing the job.
>> + * Using this limit prevents one virtqueue from starving rx. */
>> +#define VHOST_NET_PKT_WEIGHT 512
>> +
>>  /* MAX number of TX used buffers for outstanding zerocopy */
>>  #define VHOST_MAX_PEND 128
>>  #define VHOST_GOODCOPY_LEN 256
>> @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
>>  	struct socket *sock;
>>  	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>>  	bool zcopy, zcopy_used;
>> +	int sent_pkts = 0;
>>  
>>  	mutex_lock(&vq->mutex);
>>  	sock = vq->private_data;
>> @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
>>  		else
>>  			vhost_zerocopy_signal_used(net, vq);
>>  		vhost_net_tx_packet(net);
>> -		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>> +		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
>> +		    unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
>>  			vhost_poll_queue(&vq->poll);
>>  			break;
>>  		}
>> -- 
>> 2.12.3
>> 



* Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device
From: Michael S. Tsirkin @ 2018-04-03 12:25 UTC
  To: Si-Wei Liu
  Cc: jiri, stephen, alexander.h.duyck, davem, jesse.brandeburg,
	kubakici, jasowang, sridhar.samudrala, netdev, virtualization,
	virtio-dev
In-Reply-To: <1522573990-5242-2-git-send-email-si-wei.liu@oracle.com>

On Sun, Apr 01, 2018 at 05:13:08AM -0400, Si-Wei Liu wrote:
> @@ -896,6 +898,68 @@ void qmp_device_del(const char *id, Error **errp)
>      }
>  }
>  
> +int pci_get_busdevfn_by_id(const char *id, uint16_t *busnr,
> +                           uint16_t *devfn, Error **errp)
> +{
> +    uint16_t busnum = 0, slot = 0, func = 0;
> +    const char *pc, *pd, *pe;
> +    Error *local_err = NULL;
> +    ObjectClass *class;
> +    char value[1024];
> +    BusState *bus;
> +    uint64_t u64;
> +
> +    if (!(pc = strchr(id, ':'))) {
> +        error_setg(errp, "Invalid id: backup=%s, "
> +                   "correct format should be backup="
> +                   "'<bus-id>:<slot>[.<function>]'", id);
> +        return -1;
> +    }
> +    get_opt_name(value, sizeof(value), id, ':');
> +    if (pc != id + 1) {
> +        bus = qbus_find(value, errp);
> +        if (!bus)
> +            return -1;
> +
> +        class = object_get_class(OBJECT(bus));
> +        if (class != object_class_by_name(TYPE_PCI_BUS) &&
> +            class != object_class_by_name(TYPE_PCIE_BUS)) {
> +            error_setg(errp, "%s is not a device on pci bus", id);
> +            return -1;
> +        }
> +        busnum = (uint16_t)pci_bus_num(PCI_BUS(bus));
> +    }

pci_bus_num is almost always a bug if not done within
a context of a PCI host, bridge, etc.

In particular this will not DTRT if run before guest assigns bus
numbers.


> +
> +    if (!devfn)
> +        goto out;
> +
> +    pd = strchr(pc, '.');
> +    pe = get_opt_name(value, sizeof(value), pc + 1, '.');
> +    if (pe != pc + 1) {
> +        parse_option_number("slot", value, &u64, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return -1;
> +        }
> +        slot = (uint16_t)u64;
> +    }
> +    if (pd && *(pd + 1) != '\0') {
> +        parse_option_number("function", pd, &u64, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return -1;
> +        }
> +        func = (uint16_t)u64;
> +    }
> +
> +out:
> +    if (busnr)
> +        *busnr = busnum;
> +    if (devfn)
> +        *devfn = ((slot & 0x1F) << 3) | (func & 0x7);
> +    return 0;
> +}
> +
>  BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
>  {
>      DeviceState *dev;
> -- 
> 1.8.3.1


* Re: [RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden
From: Michael S. Tsirkin @ 2018-04-03 12:20 UTC
  To: Si-Wei Liu
  Cc: jiri, stephen, alexander.h.duyck, davem, jesse.brandeburg,
	kubakici, jasowang, sridhar.samudrala, netdev, virtualization,
	virtio-dev
In-Reply-To: <1522573990-5242-4-git-send-email-si-wei.liu@oracle.com>

On Sun, Apr 01, 2018 at 05:13:10AM -0400, Si-Wei Liu wrote:
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index aa40664..0827b7e 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -80,6 +80,8 @@ struct virtio_net_config {
>  	__u16 max_virtqueue_pairs;
>  	/* Default maximum transmit unit advice */
>  	__u16 mtu;
> +	/* Device at bus:slot.function backed up by virtio_net */
> +	__u16 bsf2backup;
>  } __attribute__((packed));

I'm not sure this is a good interface.  This isn't unique even on some
PCI systems, not to speak of non-PCI ones.

>  /*
> -- 
> 1.8.3.1

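To make the size constraint behind the review comment above concrete,
here is a small sketch of packing bus:slot.function into 16 bits. The
devfn expression is the one from the quoted QEMU patch in this thread.
The bus-in-high-byte layout of demo_bsf() is an assumption, since the
encoding of bsf2backup is not shown in the quoted hunk.

```c
#include <stdint.h>

/* Slot (5 bits) and function (3 bits), as in the quoted QEMU code. */
uint8_t demo_devfn(unsigned int slot, unsigned int func)
{
	return (uint8_t)(((slot & 0x1F) << 3) | (func & 0x7));
}

/* Hypothetical layout: bus number in the high byte, devfn in the low. */
uint16_t demo_bsf(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)(((uint16_t)bus << 8) | devfn);
}
```

All 16 bits are used by bus, slot and function, so the field has no room
for anything else that might be needed to tell two devices apart. One
possible example, assumed here rather than stated in the review, is a
PCI domain (segment) number.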

* [PATCH] kernel/bpf/syscall: fix warning defined but not used
From: Anders Roxell @ 2018-04-03 12:09 UTC
  To: ast, daniel; +Cc: netdev, linux-kernel, Anders Roxell

There will be a build warning -Wunused-function if CONFIG_CGROUP_BPF
isn't defined, since the only user is inside #ifdef CONFIG_CGROUP_BPF:
kernel/bpf/syscall.c:1229:12: warning: ‘bpf_prog_attach_check_attach_type’
    defined but not used [-Wunused-function]
 static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This patch moves the function bpf_prog_attach_check_attach_type inside
the #ifdef CONFIG_CGROUP_BPF block.

Fixes: 5e43f899b03a ("bpf: Check attach type at prog load time")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
---
 kernel/bpf/syscall.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7457f2676c6d..56f49557adda 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1226,18 +1226,6 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
 	}
 }
 
-static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
-					     enum bpf_attach_type attach_type)
-{
-	switch (prog->type) {
-	case BPF_PROG_TYPE_CGROUP_SOCK:
-	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
-		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
-	default:
-		return 0;
-	}
-}
-
 /* last field in 'union bpf_attr' used by this command */
 #define	BPF_PROG_LOAD_LAST_FIELD expected_attach_type
 
@@ -1465,6 +1453,18 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 
 #ifdef CONFIG_CGROUP_BPF
 
+static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
+					     enum bpf_attach_type attach_type)
+{
+	switch (prog->type) {
+	case BPF_PROG_TYPE_CGROUP_SOCK:
+	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
+		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+	default:
+		return 0;
+	}
+}
+
 #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
 
 static int sockmap_get_from_fd(const union bpf_attr *attr,
-- 
2.16.2

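The pattern behind the patch above, reduced to a standalone sketch:
a static helper is kept in the same #ifdef as its only caller, so that
it is never defined without being used. DEMO_FEATURE and both function
names are hypothetical stand-ins, not kernel symbols.

```c
#include <errno.h>

/* DEMO_FEATURE is a hypothetical stand-in for CONFIG_CGROUP_BPF. */
#ifdef DEMO_FEATURE

/* The only caller is demo_attach() below, inside the same #ifdef. */
static int demo_check_type(int attach_type, int expected_type)
{
	return attach_type == expected_type ? 0 : -EINVAL;
}

int demo_attach(int attach_type, int expected_type)
{
	return demo_check_type(attach_type, expected_type);
}

#else

/* Feature compiled out: no helper is defined, so none is left unused. */
int demo_attach(int attach_type, int expected_type)
{
	(void)attach_type;
	(void)expected_type;
	return -EINVAL;
}

#endif
```

Compiling this with -Wall, both with and without -DDEMO_FEATURE, should
not trigger -Wunused-function. Moving demo_check_type() above the #ifdef
would bring the warning back in the build without DEMO_FEATURE.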

* Re: [PATCH] vhost-net: add limitation of sent packets for tx polling
From: Michael S. Tsirkin @ 2018-04-03 11:59 UTC
  To: haibinzhang(张海斌)
  Cc: Jason Wang, kvm@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	lidongchen(陈立东),
	yunfangtai(台运方)
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC4627@EXMBX-SZMAIL011.tencent.com>

On Tue, Apr 03, 2018 at 08:08:26AM +0000, haibinzhang(张海斌) wrote:
> handle_tx will delay rx for a long time when tx busy polling udp packets
> with small length(e.g. 1byte udp payload), because setting VHOST_NET_WEIGHT
> takes into account only sent-bytes but no single packet length.
> 
> Tests were done between two Virtual Machines using netperf(UDP_STREAM, len=1),
> then another machine pinged the client. Result shows as follow:
> 
> Packet#       Ping-Latency(ms)
>               min     avg     max
> Origin      3.319  18.489  57.503
> 64          1.643   2.021   2.552
> 128         1.825   2.600   3.224
> 256         1.997   2.710   4.295
> 512*        1.860   3.171   4.631
> 1024        2.002   4.173   9.056
> 2048        2.257   5.650   9.688
> 4096        2.093   8.508  15.943
> 
> 512 is selected, which is multi-VRING_SIZE

There's no guarantee vring size is 256.

Could you pls try with a different tx ring size?

I suspect we want:

#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)


> and close to VHOST_NET_WEIGHT/MTU.

Puzzled by this part.  Does tweaking MTU change anything?

> To evaluate this change, another tests were done using netperf(RR, TX) between
> two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. Result as follow
> does not show obvious changes:
> 
> TCP_RR
> 
> size/sessions/+thu%/+normalize%
>    1/       1/  -7%/        -2%
>    1/       4/  +1%/         0%
>    1/       8/  +1%/        -2%
>   64/       1/  -6%/         0%
>   64/       4/   0%/        +2%
>   64/       8/   0%/         0%
>  256/       1/  -3%/        -4%
>  256/       4/  +3%/        +4%
>  256/       8/  +2%/         0%
> 
> UDP_RR
> 
> size/sessions/+thu%/+normalize%
>    1/       1/  -5%/        +1%
>    1/       4/  +4%/        +1%
>    1/       8/  -1%/        -1%
>   64/       1/  -2%/        -3%
>   64/       4/  -5%/        -1%
>   64/       8/   0%/        -1%
>  256/       1/  +7%/        +1%
>  256/       4/  +1%/        +1%
>  256/       8/  +2%/        +2%
> 
> TCP_STREAM
> 
> size/sessions/+thu%/+normalize%
>   64/       1/   0%/        -3%
>   64/       4/  +3%/        -1%
>   64/       8/  +9%/        -4%
>  256/       1/  +1%/        -4%
>  256/       4/  -1%/        -1%
>  256/       8/  +7%/        +5%
>  512/       1/  +1%/         0%
>  512/       4/  +1%/        -1%
>  512/       8/  +7%/        -5%
> 1024/       1/   0%/        -1%
> 1024/       4/  +3%/         0%
> 1024/       8/  +8%/        +5%
> 2048/       1/  +2%/        +2%
> 2048/       4/  +1%/         0%
> 2048/       8/  -2%/         0%
> 4096/       1/  -2%/         0%
> 4096/       4/  +2%/         0%
> 4096/       8/  +9%/        -2%
> 
> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> ---
>  drivers/vhost/net.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 8139bc70ad7d..13a23f3f3ea4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
>  
> +/* Max number of packets transferred before requeueing the job.
> + * Using this limit prevents one virtqueue from starving rx. */
> +#define VHOST_NET_PKT_WEIGHT 512
> +
>  /* MAX number of TX used buffers for outstanding zerocopy */
>  #define VHOST_MAX_PEND 128
>  #define VHOST_GOODCOPY_LEN 256
> @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
>  	struct socket *sock;
>  	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>  	bool zcopy, zcopy_used;
> +	int sent_pkts = 0;
>  
>  	mutex_lock(&vq->mutex);
>  	sock = vq->private_data;
> @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
>  		else
>  			vhost_zerocopy_signal_used(net, vq);
>  		vhost_net_tx_packet(net);
> -		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> +		if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> +		    unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
>  			break;
>  		}
> -- 
> 2.12.3
> 

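A toy simulation of the loop exit condition discussed above, using the
per-queue weight suggested in the review, (vq)->num * 2, in place of the
fixed 512. This is a userspace model under simplifying assumptions:
every packet has the same length, and demo_vq and demo_tx_pass() are
hypothetical names rather than vhost structures.

```c
#include <stddef.h>

#define DEMO_NET_WEIGHT 0x80000u	/* byte budget, as in the patch */

struct demo_vq {
	unsigned int num;		/* ring size, models (vq)->num */
};

/* Packet budget as suggested in the review: twice the ring size. */
#define DEMO_NET_PKT_WEIGHT(vq) ((vq)->num * 2)

/*
 * Simulate one handle_tx()-style pass over `pending` packets of `len`
 * bytes each. Returns how many packets are sent before the pass stops,
 * either because a budget is exhausted or the queue is drained.
 */
size_t demo_tx_pass(const struct demo_vq *vq, size_t pending, size_t len)
{
	size_t total_len = 0, sent_pkts = 0, done = 0;

	while (done < pending) {
		total_len += len;
		done++;
		if (total_len >= DEMO_NET_WEIGHT ||
		    ++sent_pkts >= DEMO_NET_PKT_WEIGHT(vq))
			break;	/* the real code requeues the job here */
	}
	return done;
}
```

With a ring size of 256, a flood of 1-byte packets stops after 512
packets, while 1500-byte packets still stop at the byte budget after
350. With a ring size of 1024 the packet budget scales to 2048.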

* [PATCH net v6 4/4] ipv6: udp: set dst cache for a connected sk if current not valid
From: Alexey Kodanev @ 2018-04-03 12:00 UTC
  To: netdev; +Cc: Eric Dumazet, Martin KaFai Lau, David Miller, Alexey Kodanev
In-Reply-To: <1522756810-24985-1-git-send-email-alexey.kodanev@oracle.com>

A new RTF_CACHE route can be created between the ip6_sk_dst_lookup_flow()
and ip6_dst_store() calls in udpv6_sendmsg(), when sending a datagram
results in an ICMPV6_PKT_TOOBIG error:

    udp_v6_send_skb(), for example with vti6 tunnel:
        vti6_xmit(), get ICMPV6_PKT_TOOBIG error
            skb_dst_update_pmtu(), can create a RTF_CACHE clone
            icmpv6_send()
    ...
    udpv6_err()
        ip6_sk_update_pmtu()
           ip6_update_pmtu(), can create a RTF_CACHE clone
           ...
           ip6_datagram_dst_update()
                ip6_dst_store()

After commit 33c162a980fe ("ipv6: datagram: Update dst cache of
a connected datagram sk during pmtu update"), the UDPv6 error handler
can update the socket's dst cache, but this can happen before the update
at the end of udpv6_sendmsg(), which prevents subsequent udpv6_sendmsg()
calls from getting the new dst cache.

In order to fix it, save dst in a connected socket only if the current
socket's dst cache is invalid.

The previous patch prepared ip6_sk_dst_lookup_flow() to do that with
the new argument, and this patch enables it in udpv6_sendmsg().

Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update")
Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
---
 net/ipv6/udp.c | 21 ++-------------------
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4aa50ea..9b74092 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-	dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p, false);
+	dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p, connected);
 	if (IS_ERR(dst)) {
 		err = PTR_ERR(dst);
 		dst = NULL;
@@ -1314,7 +1314,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		err = PTR_ERR(skb);
 		if (!IS_ERR_OR_NULL(skb))
 			err = udp_v6_send_skb(skb, &fl6);
-		goto release_dst;
+		goto out;
 	}
 
 	lock_sock(sk);
@@ -1348,23 +1348,6 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		err = np->recverr ? net_xmit_errno(err) : 0;
 	release_sock(sk);
 
-release_dst:
-	if (dst) {
-		if (connected) {
-			ip6_dst_store(sk, dst,
-				      ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
-				      &sk->sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
-				      ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
-				      &np->saddr :
-#endif
-				      NULL);
-		} else {
-			dst_release(dst);
-		}
-		dst = NULL;
-	}
-
 out:
 	dst_release(dst);
 	fl6_sock_release(flowlabel);
-- 
1.8.3.1

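The ordering problem described in the commit message above can be
modelled in userspace as follows. This is a deliberately simplified
sketch: reference counting and locking are ignored, and demo_dst,
demo_sock and the demo_* functions are hypothetical stand-ins for the
kernel structures and for the old and new udpv6_sendmsg() flows.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_dst { int id; bool valid; };
struct demo_sock { struct demo_dst *cache; };

/* Models ip6_sk_dst_lookup_flow() with the new 'connected' argument. */
struct demo_dst *demo_lookup(struct demo_sock *sk, struct demo_dst *fresh,
			     bool connected)
{
	if (sk->cache && sk->cache->valid)
		return sk->cache;	/* cached dst is still usable */
	if (connected)
		sk->cache = fresh;	/* store only when cache is invalid */
	return fresh;
}

/* Old flow: look up, send, then store unconditionally at the end. */
void demo_send_old(struct demo_sock *sk, struct demo_dst *looked_up,
		   struct demo_dst *from_err_handler)
{
	/* While sending, the error handler may store a new clone. */
	if (from_err_handler)
		sk->cache = from_err_handler;
	/* release_dst: overwrites the handler's update with the old dst. */
	sk->cache = looked_up;
}

/* New flow: the store happens inside the lookup, before sending. */
void demo_send_new(struct demo_sock *sk, struct demo_dst *fresh,
		   struct demo_dst *from_err_handler)
{
	(void)demo_lookup(sk, fresh, true);
	if (from_err_handler)
		sk->cache = from_err_handler;
	/* No store after sending, so the handler's update survives. */
}
```

In the old flow the socket ends up caching the dst it looked up before
the error, while in the new flow it keeps the clone stored by the error
handler.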

* [PATCH net v6 3/4] ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()
From: Alexey Kodanev @ 2018-04-03 12:00 UTC
  To: netdev; +Cc: Eric Dumazet, Martin KaFai Lau, David Miller, Alexey Kodanev
In-Reply-To: <1522756810-24985-1-git-send-email-alexey.kodanev@oracle.com>

This should make it consistent with ip6_sk_dst_lookup_flow(),
which accepts the new 'connected' parameter of type bool.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
---
 net/ipv6/udp.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 5bc102b2..4aa50ea 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1097,10 +1097,10 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	struct dst_entry *dst;
 	struct ipcm6_cookie ipc6;
 	int addr_len = msg->msg_namelen;
+	bool connected = false;
 	int ulen = len;
 	int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
 	int err;
-	int connected = 0;
 	int is_udplite = IS_UDPLITE(sk);
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
 	struct sockcm_cookie sockc;
@@ -1222,7 +1222,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		fl6.fl6_dport = inet->inet_dport;
 		daddr = &sk->sk_v6_daddr;
 		fl6.flowlabel = np->flow_label;
-		connected = 1;
+		connected = true;
 	}
 
 	if (!fl6.flowi6_oif)
@@ -1252,7 +1252,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		}
 		if (!(opt->opt_nflen|opt->opt_flen))
 			opt = NULL;
-		connected = 0;
+		connected = false;
 	}
 	if (!opt) {
 		opt = txopt_get(np);
@@ -1274,11 +1274,11 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	final_p = fl6_update_dst(&fl6, opt, &final);
 	if (final_p)
-		connected = 0;
+		connected = false;
 
 	if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr)) {
 		fl6.flowi6_oif = np->mcast_oif;
-		connected = 0;
+		connected = false;
 	} else if (!fl6.flowi6_oif)
 		fl6.flowi6_oif = np->ucast_oif;
 
-- 
1.8.3.1


* [PATCH net v6 2/4] ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
From: Alexey Kodanev @ 2018-04-03 12:00 UTC
  To: netdev; +Cc: Eric Dumazet, Martin KaFai Lau, David Miller, Alexey Kodanev
In-Reply-To: <1522756810-24985-1-git-send-email-alexey.kodanev@oracle.com>

Add a 'connected' parameter to ip6_sk_dst_lookup_flow() and update
the cache only if ip6_sk_dst_check() returns NULL and the socket
is connected.

The function is used as before; the new behavior for UDP sockets
in udpv6_sendmsg() will be enabled in the next patch.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
---
 include/net/ipv6.h    |  3 ++-
 net/ipv6/ip6_output.c | 15 ++++++++++++---
 net/ipv6/ping.c       |  2 +-
 net/ipv6/udp.c        |  2 +-
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8606c91..1d416f2 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -977,7 +977,8 @@ int ip6_dst_lookup(struct net *net, struct sock *sk, struct dst_entry **dst,
 struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 *fl6,
 				      const struct in6_addr *final_dst);
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-					 const struct in6_addr *final_dst);
+					 const struct in6_addr *final_dst,
+					 bool connected);
 struct dst_entry *ip6_blackhole_route(struct net *net,
 				      struct dst_entry *orig_dst);
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a8a9195..46ea7b6 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1105,23 +1105,32 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 *fl6,
  *	@sk: socket which provides the dst cache and route info
  *	@fl6: flow to lookup
  *	@final_dst: final destination address for ipsec lookup
+ *	@connected: whether @sk is connected or not
  *
  *	This function performs a route lookup on the given flow with the
  *	possibility of using the cached route in the socket if it is valid.
  *	It will take the socket dst lock when operating on the dst cache.
  *	As a result, this function can only be used in process context.
  *
+ *	In addition, for a connected socket, cache the dst in the socket
+ *	if the current cache is not valid.
+ *
  *	It returns a valid dst pointer on success, or a pointer encoded
  *	error code.
  */
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-					 const struct in6_addr *final_dst)
+					 const struct in6_addr *final_dst,
+					 bool connected)
 {
 	struct dst_entry *dst = sk_dst_check(sk, inet6_sk(sk)->dst_cookie);
 
 	dst = ip6_sk_dst_check(sk, dst, fl6);
-	if (!dst)
-		dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+	if (dst)
+		return dst;
+
+	dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+	if (connected && !IS_ERR(dst))
+		ip6_sk_dst_store_flow(sk, dst_clone(dst), fl6);
 
 	return dst;
 }
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index d12c55d..746eeae 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -121,7 +121,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	ipc6.tclass = np->tclass;
 	fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-	dst = ip6_sk_dst_lookup_flow(sk, &fl6,  daddr);
+	dst = ip6_sk_dst_lookup_flow(sk, &fl6, daddr, false);
 	if (IS_ERR(dst))
 		return PTR_ERR(dst);
 	rt = (struct rt6_info *) dst;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..5bc102b2 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-	dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p);
+	dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p, false);
 	if (IS_ERR(dst)) {
 		err = PTR_ERR(dst);
 		dst = NULL;
-- 
1.8.3.1


* [PATCH net v6 1/4] ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
From: Alexey Kodanev @ 2018-04-03 12:00 UTC
  To: netdev; +Cc: Eric Dumazet, Martin KaFai Lau, David Miller, Alexey Kodanev
In-Reply-To: <1522756810-24985-1-git-send-email-alexey.kodanev@oracle.com>

Move the commonly used ip6_dst_store() pattern into a separate
function, ip6_sk_dst_store_flow(), which checks the addresses
for equality using the flow information before saving them.

There are no functional changes in this patch. In addition, the new
function will be used in the next patch, in ip6_sk_dst_lookup_flow().

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
---
 include/net/ip6_route.h |  3 +++
 net/ipv6/datagram.c     |  9 +--------
 net/ipv6/route.c        | 17 +++++++++++++++++
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index ac0866b..abec280 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -210,6 +210,9 @@ static inline void ip6_dst_store(struct sock *sk, struct dst_entry *dst,
 #endif
 }
 
+void ip6_sk_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+			   const struct flowi6 *fl6);
+
 static inline bool ipv6_unicast_destination(const struct sk_buff *skb)
 {
 	struct rt6_info *rt = (struct rt6_info *) skb_dst(skb);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a9f7eca..8f6a391 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -106,14 +106,7 @@ int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr)
 		}
 	}
 
-	ip6_dst_store(sk, dst,
-		      ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
-		      &sk->sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
-		      ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
-		      &np->saddr :
-#endif
-		      NULL);
+	ip6_sk_dst_store_flow(sk, dst, &fl6);
 
 out:
 	fl6_sock_release(flowlabel);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b0d5c64..b14008e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2153,6 +2153,23 @@ void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
 }
 EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);
 
+void ip6_sk_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+			   const struct flowi6 *fl6)
+{
+#ifdef CONFIG_IPV6_SUBTREES
+	struct ipv6_pinfo *np = inet6_sk(sk);
+#endif
+
+	ip6_dst_store(sk, dst,
+		      ipv6_addr_equal(&fl6->daddr, &sk->sk_v6_daddr) ?
+		      &sk->sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+		      ipv6_addr_equal(&fl6->saddr, &np->saddr) ?
+		      &np->saddr :
+#endif
+		      NULL);
+}
+
 /* Handle redirects */
 struct ip6rd_flowi {
 	struct flowi6 fl6;
-- 
1.8.3.1


* [PATCH net v6 0/4] ipv6: udp: set dst cache for a connected sk if current not valid
From: Alexey Kodanev @ 2018-04-03 12:00 UTC
  To: netdev; +Cc: Eric Dumazet, Martin KaFai Lau, David Miller, Alexey Kodanev

A new RTF_CACHE route can be created, and the socket's dst cache
updated, between the calls below in udpv6_sendmsg() when sending
a datagram results in an ICMPV6_PKT_TOOBIG error:

   dst = ip6_sk_dst_lookup_flow(...)
   ...
release_dst:
    if (dst) {
        if (connected) {
            ip6_dst_store(sk, dst)

As a result, the socket's new dst cache is reset to the old one at
"release_dst:".

The first three patches prepare the code to store dst cache
with ip6_sk_dst_lookup_flow():

  * the first patch adds the ip6_sk_dst_store_flow() function with
    the commonly used source and destination address checks using
    the flow information.

  * the second patch adds a new argument to ip6_sk_dst_lookup_flow()
    and the ability to store dst in the socket's cache. Also, the two
    users of the function are updated without enabling the new
    behavior: ping_v6_sendmsg() and udpv6_sendmsg().

  * the third patch makes the 'connected' variable in udpv6_sendmsg()
    consistent with ip6_sk_dst_lookup_flow() by changing its type
    from int to bool.

The last patch contains the actual fix: it removes the sk dst cache
update at the end of udpv6_sendmsg() and lets ip6_sk_dst_lookup_flow()
do it instead.

v6: * use bool type for a new parameter in ip6_sk_dst_lookup_flow()
    * add one more patch to convert 'connected' variable in
      udpv6_sendmsg() from int to bool type. If it shouldn't be
      here I will resend it when the net-next is opened.

v5: * relocate ip6_sk_dst_store_flow() to net/ipv6/route.c and
      rename ip6_dst_store_flow() to ip6_sk_dst_store_flow() as
      suggested by Martin

v4: * fix the build error in ip6_dst_store_flow() reported by
      kbuild test robot due to missing checks for CONFIG_IPV6: add
      new function to ip6_output.c instead of ip6_route.h
    * add 'const' to struct flowi6 in ip6_dst_store_flow()
    * minor commit messages fixes

v3: * instead of moving ip6_dst_store() above udp_v6_send_skb(),
      update socket's dst cache inside ip6_sk_dst_lookup_flow()
      if the current one is invalid
    * the issue is not reproduced in 4.1, only starting from 4.2. Add
      one more 'Fixes:' commit that creates new RTF_CACHE route.
      Though, it is also mentioned in the first one

Alexey Kodanev (4):
  ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
  ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
  ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()
  ipv6: udp: set dst cache for a connected sk if current not valid

 include/net/ip6_route.h |  3 +++
 include/net/ipv6.h      |  3 ++-
 net/ipv6/datagram.c     |  9 +--------
 net/ipv6/ip6_output.c   | 15 ++++++++++++---
 net/ipv6/ping.c         |  2 +-
 net/ipv6/route.c        | 17 +++++++++++++++++
 net/ipv6/udp.c          | 31 +++++++------------------------
 7 files changed, 43 insertions(+), 37 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: possible deadlock in skb_queue_tail
From: Kirill Tkhai @ 2018-04-03 11:42 UTC (permalink / raw)
  To: Dmitry Vyukov, Ingo Molnar
  Cc: syzbot, David Miller, David Herrmann, Denys Vlasenko,
	David Windsor, elena.reshetova, ishkamiel, Kees Cook, LKML,
	matthew, Mateusz Jurczyk, netdev, syzkaller-bugs, Al Viro, xemul
In-Reply-To: <CACT4Y+Zb4Y1=H8Vtd1d+ULKu-P3ftud=SgFqSAfu9SE6Lu3LMA@mail.gmail.com>

On 03.04.2018 14:25, Dmitry Vyukov wrote:
> On Tue, Apr 3, 2018 at 11:50 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 02.04.2018 12:20, syzbot wrote:
>>> Hello,
>>>
>>> syzbot hit the following crash on net-next commit
>>> 06b19fe9a6df7aaa423cd8404ebe5ac9ec4b2960 (Sun Apr 1 03:37:33 2018 +0000)
>>> Merge branch 'chelsio-inline-tls'
>>> syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=6b495100f17ca8554ab9
>>>
>>> Unfortunately, I don't have any reproducer for this crash yet.
>>> Raw console output: https://syzkaller.appspot.com/x/log.txt?id=6218830443446272
>>> Kernel config: https://syzkaller.appspot.com/x/.config?id=3327544840960562528
>>> compiler: gcc (GCC) 7.1.1 20170620
>>>
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+6b495100f17ca8554ab9@syzkaller.appspotmail.com
>>> It will help syzbot understand when the bug is fixed. See footer for details.
>>> If you forward the report, please keep this part and the footer.
>>>
>>>
>>> ======================================================
>>> WARNING: possible circular locking dependency detected
>>> 4.16.0-rc6+ #290 Not tainted
>>> ------------------------------------------------------
>>> syz-executor7/20971 is trying to acquire lock:
>>>  (&af_unix_sk_receive_queue_lock_key){+.+.}, at: [<00000000271ef0d8>] skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>>
>>> but task is already holding lock:
>>>  (&(&u->lock)->rlock/1){+.+.}, at: [<000000004e725e14>] unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
>>>
>>> which lock already depends on the new lock.
>>>
>>>
>>> the existing dependency chain (in reverse order) is:
>>>
>>> -> #1 (&(&u->lock)->rlock/1){+.+.}:
>>>        _raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>>>        sk_diag_dump_icons net/unix/diag.c:82 [inline]
>>>        sk_diag_fill.isra.4+0xa52/0xfe0 net/unix/diag.c:144
>>>        sk_diag_dump net/unix/diag.c:178 [inline]
>>>        unix_diag_dump+0x400/0x4f0 net/unix/diag.c:206
>>>        netlink_dump+0x492/0xcf0 net/netlink/af_netlink.c:2221
>>>        __netlink_dump_start+0x4ec/0x710 net/netlink/af_netlink.c:2318
>>>        netlink_dump_start include/linux/netlink.h:214 [inline]
>>>        unix_diag_handler_dump+0x3e7/0x750 net/unix/diag.c:307
>>>        __sock_diag_cmd net/core/sock_diag.c:230 [inline]
>>>        sock_diag_rcv_msg+0x204/0x360 net/core/sock_diag.c:261
>>>        netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2443
>>>        sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>>>        netlink_unicast_kernel net/netlink/af_netlink.c:1307 [inline]
>>>        netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1333
>>>        netlink_sendmsg+0xa4a/0xe80 net/netlink/af_netlink.c:1896
>>>        sock_sendmsg_nosec net/socket.c:629 [inline]
>>>        sock_sendmsg+0xca/0x110 net/socket.c:639
>>>        sock_write_iter+0x31a/0x5d0 net/socket.c:908
>>>        call_write_iter include/linux/fs.h:1782 [inline]
>>>        new_sync_write fs/read_write.c:469 [inline]
>>>        __vfs_write+0x684/0x970 fs/read_write.c:482
>>>        vfs_write+0x189/0x510 fs/read_write.c:544
>>>        SYSC_write fs/read_write.c:589 [inline]
>>>        SyS_write+0xef/0x220 fs/read_write.c:581
>>>        do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>>        entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> -> #0 (&af_unix_sk_receive_queue_lock_key){+.+.}:
>>>        lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>>>        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>>        _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>>>        skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>>        unix_dgram_sendmsg+0xa30/0x1610 net/unix/af_unix.c:1807
>>>        sock_sendmsg_nosec net/socket.c:629 [inline]
>>>        sock_sendmsg+0xca/0x110 net/socket.c:639
>>>        ___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
>>>        __sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
>>>        SYSC_sendmmsg net/socket.c:2168 [inline]
>>>        SyS_sendmmsg+0x35/0x60 net/socket.c:2163
>>>        do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>>        entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> sk_diag_dump_icons() dumps only sockets in TCP_LISTEN state.
>> TCP_LISTEN state may be assigned in only one place in net/unix/af_unix.c:
>> it's unix_listen(). The function is applied to stream and seqpacket
>> socket types.
>>
>> It can't be stream because of the second stack, and it can't be seqpacket
>> either, as I don't think gcc can inline unix_seqpacket_sendmsg()
>> in such a way that we don't see it in the stack.
>>
>> So this looks like a false positive to me.
>>
>> Kirill
> 
> Do you mean that these &(&u->lock)->rlock/1 referenced in 2 stacks are
> always different?

In these 2 particular stacks they have to be different.

But we may encounter other stacks, where stream or seqpacket
functions are used instead of unix_dgram_sendmsg(), and
those may be true positives.
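
A rough sketch of why the report can appear even when the two u->lock
instances differ: lockdep records ordering between lock classes, not
between individual locks, so two sockets' u->lock fall into the same
class. The model below is only an illustration under that assumption;
struct model_lockdep, model_acquire() and the class numbering are
hypothetical and far simpler than the real validator.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NCLASSES 4	/* arbitrary small bound for the sketch */

/* dep[a][b] is true once "b was taken while holding a" has been seen. */
struct model_lockdep {
	bool dep[MODEL_NCLASSES][MODEL_NCLASSES];
};

/* Depth-first search over the recorded class dependencies. */
static bool model_reachable(const struct model_lockdep *ld, int from, int to,
			    bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MODEL_NCLASSES; next++)
		if (ld->dep[from][next] && !seen[next] &&
		    model_reachable(ld, next, to, seen))
			return true;
	return false;
}

/* Records held -> acquired; returns true if that edge closes a cycle. */
bool model_acquire(struct model_lockdep *ld, int held, int acquired)
{
	bool seen[MODEL_NCLASSES] = { false };
	bool cycle;

	assert(held >= 0 && held < MODEL_NCLASSES);
	assert(acquired >= 0 && acquired < MODEL_NCLASSES);

	/* A path acquired -> ... -> held means the new edge forms a loop. */
	cycle = model_reachable(ld, acquired, held, seen);
	ld->dep[held][acquired] = true;
	return cycle;
}
```

Taking class 0 as u->lock and class 1 as the receive queue lock, the
diag stack records 1 -> 0 and the dgram sendmsg stack then records
0 -> 1, which the model flags as a cycle regardless of which socket
each lock belonged to.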

Kirill
 
> +Ingo for lockdep false positive
> Do we need some kind of annotation here?
> 
> 
>>> other info that might help us debug this:
>>>
>>>  Possible unsafe locking scenario:
>>>
>>>        CPU0                    CPU1
>>>        ----                    ----
>>>   lock(&(&u->lock)->rlock/1);
>>>                                lock(&af_unix_sk_receive_queue_lock_key);
>>>                                lock(&(&u->lock)->rlock/1);
>>>   lock(&af_unix_sk_receive_queue_lock_key);
>>>
>>>  *** DEADLOCK ***
>>>
>>> 1 lock held by syz-executor7/20971:
>>>  #0:  (&(&u->lock)->rlock/1){+.+.}, at: [<000000004e725e14>] unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
>>>
>>> stack backtrace:
>>> CPU: 0 PID: 20971 Comm: syz-executor7 Not tainted 4.16.0-rc6+ #290
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>>> Call Trace:
>>>  __dump_stack lib/dump_stack.c:17 [inline]
>>>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>>>  print_circular_bug.isra.38+0x2cd/0x2dc kernel/locking/lockdep.c:1223
>>>  check_prev_add kernel/locking/lockdep.c:1863 [inline]
>>>  check_prevs_add kernel/locking/lockdep.c:1976 [inline]
>>>  validate_chain kernel/locking/lockdep.c:2417 [inline]
>>>  __lock_acquire+0x30a8/0x3e00 kernel/locking/lockdep.c:3431
>>>  lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>>>  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>>  _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>>>  skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>>  unix_dgram_sendmsg+0xa30/0x1610 net/unix/af_unix.c:1807
>>>  sock_sendmsg_nosec net/socket.c:629 [inline]
>>>  sock_sendmsg+0xca/0x110 net/socket.c:639
>>>  ___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
>>>  __sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
>>>  SYSC_sendmmsg net/socket.c:2168 [inline]
>>>  SyS_sendmmsg+0x35/0x60 net/socket.c:2163
>>>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>> RIP: 0033:0x455269
>>> RSP: 002b:00007f71ffad6c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
>>> RAX: ffffffffffffffda RBX: 00007f71ffad76d4 RCX: 0000000000455269
>>> RDX: 04924924924924f4 RSI: 0000000020000200 RDI: 0000000000000016
>>> RBP: 000000000072bf58 R08: 0000000000000000 R09: 0000000000000000
>>> R10: 00000000200000d4 R11: 0000000000000246 R12: 00000000ffffffff
>>> R13: 00000000000004ca R14: 00000000006f9390 R15: 0000000000000001
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: sync thread started: state = BACKUP, mcast_ifn = bcsh0, syncid = 0, id = 0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>> IPVS: Unknown mcast interface: bcsh0
>>>
>>>
>>> ---
>>> This bug is generated by a dumb bot. It may contain errors.
>>> See https://goo.gl/tpsmEJ for details.
>>> Direct all questions to syzkaller@googlegroups.com.
>>>
>>> syzbot will keep track of this bug report.
>>> If you forgot to add the Reported-by tag, once the fix for this bug is merged
>>> into any tree, please reply to this email with:
>>> #syz fix: exact-commit-title
>>> To mark this as a duplicate of another syzbot report, please reply with:
>>> #syz dup: exact-subject-of-another-report
>>> If it's a one-off invalid bug report, please reply with:
>>> #syz invalid
>>> Note: if the crash happens again, it will cause creation of a new bug report.
>>> Note: all commands must start from beginning of the line in the email body.
>>

^ permalink raw reply

* Re: [PATCH net-next 02/12] clk: sunxi-ng: r40: export a regmap to access the GMAC register
From: Maxime Ripard @ 2018-04-03 11:36 UTC (permalink / raw)
  To: Chen-Yu Tsai
  Cc: Icenowy Zheng, Michael Turquette, Stephen Boyd,
	Giuseppe Cavallaro, Rob Herring, Mark Rutland, Mark Brown,
	linux-arm-kernel, linux-clk, devicetree, netdev, Corentin Labbe
In-Reply-To: <CAGb2v65c0wFKecrNtLhJvP71iMPxio4xJ835hD91GdwtFZKXBQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3668 bytes --]

On Tue, Apr 03, 2018 at 05:58:05PM +0800, Chen-Yu Tsai wrote:
> On Tue, Apr 3, 2018 at 5:54 PM, Icenowy Zheng <icenowy@aosc.io> wrote:
> >
> >
> > On April 3, 2018, 5:53:08 PM GMT+08:00, Chen-Yu Tsai <wens@csie.org> wrote:
> >>On Tue, Apr 3, 2018 at 5:50 PM, Maxime Ripard
> >><maxime.ripard@bootlin.com> wrote:
> >>> On Tue, Apr 03, 2018 at 11:48:45AM +0200, Maxime Ripard wrote:
> >>>> On Tue, Mar 20, 2018 at 03:15:02PM +0800, Chen-Yu Tsai wrote:
> >>>> > On Mon, Mar 19, 2018 at 5:31 AM, Maxime Ripard
> >>>> > <maxime.ripard@bootlin.com> wrote:
> >>>> > > On Sat, Mar 17, 2018 at 05:28:47PM +0800, Chen-Yu Tsai wrote:
> >>>> > >> From: Icenowy Zheng <icenowy@aosc.io>
> >>>> > >>
> >>>> > >> There's a GMAC configuration register, which exists on
> >>A64/A83T/H3/H5 in
> >>>> > >> the syscon part, in the CCU of R40 SoC.
> >>>> > >>
> >>>> > >> Export a regmap of the CCU.
> >>>> > >>
> >>>> > >> Read access is not restricted to all registers, but only the
> >>GMAC
> >>>> > >> register is allowed to be written.
> >>>> > >>
> >>>> > >> Signed-off-by: Icenowy Zheng <icenowy@aosc.io>
> >>>> > >> Signed-off-by: Chen-Yu Tsai <wens@csie.org>
> >>>> > >
> >>>> > > Gah, this is crazy. I'm really starting to regret letting that
> >>syscon
> >>>> > > in in the first place...
> >>>> >
> >>>> > IMHO syscon is really a better fit. It's part of the glue layer
> >>and
> >>>> > most other dwmac user platforms treat it as such and use a syscon.
> >>>> > Plus the controls encompass delays (phase), inverters (polarity),
> >>>> > and even signal routing. It's not really just a group of clock
> >>controls,
> >>>> > like what we poorly modeled for A20/A31. I think that was really a
> >>>> > mistake.
> >>>> >
> >>>> > As I mentioned in the cover letter, a slightly saner approach
> >>would
> >>>> > be to let drivers add custom syscon entries, which would then
> >>require
> >>>> > less custom plumbing.
> >>>>
> >>>> A syscon is convenient, sure, but it also bypasses any abstraction
> >>>> layer we have everywhere else, which means that we'll have to
> >>maintain
> >>>> the register layout in each and every driver that uses it.
> >>>>
> >>>> So far, it's only been the GMAC, but it can also be others (the SRAM
> >>>> controller comes to my mind), and then, if there's any difference in
> >>>> the design in a future SoC, we'll have to maintain that in the GMAC
> >>>> driver as well.
> >>>
> >>> I guess I forgot to say something, I'm fine with using a syscon we
> >>> already have.
> >>>
> >>> I'm just questioning if merging any other driver using one is the
> >>> right move.
> >>
> >>Right. So in this case, we are not actually going through the syscon
> >>API. Rather we are exporting a regmap whose properties we actually
> >>define. If it makes it more acceptable to you, we could map just
> >>the GMAC register in the new regmap, and also have it named. This
> >>is all plumbing within the kernel so the device tree stays the same.
> >
> > I think my driver has already restricted the write permission
> > only to GMAC register.
> 
> Correct, but it still maps the entire region out, which means the
> consumer needs to know which offset to use. Maxime is saying this
> is something that is troublesome to maintain. So my proposal was
> to create a regmap with a base at the GMAC register offset. That
> way, the consumer doesn't need to use an offset to access it.

I guess this is something we can keep in mind if it gets out of
control, yes.
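
For reference, the "regmap with a base at the GMAC register offset"
idea quoted above can be modeled as a window over a larger register
block. This is a minimal sketch under stated assumptions and not the
kernel regmap API: struct model_window, model_window_read() and
model_window_write() are hypothetical names, and the offsets used with
them are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a register window; not the kernel regmap API. */
struct model_window {
	uint32_t *regs;	/* backing array for the whole register block */
	size_t base;	/* byte offset of the window inside the block */
	size_t size;	/* window length in bytes */
};

/* Translates a window-relative offset; returns NULL if out of range. */
static uint32_t *model_window_reg(const struct model_window *w, size_t off)
{
	assert(w != NULL && w->regs != NULL);
	if (off % sizeof(uint32_t) != 0)
		return NULL;
	if (off >= w->size || w->size - off < sizeof(uint32_t))
		return NULL;
	return &w->regs[(w->base + off) / sizeof(uint32_t)];
}

/* Returns 0 on success, -1 if the offset falls outside the window. */
int model_window_write(const struct model_window *w, size_t off, uint32_t val)
{
	uint32_t *reg = model_window_reg(w, off);

	if (reg == NULL)
		return -1;
	*reg = val;
	return 0;
}

/* Returns 0 on success, -1 on a bad offset or a NULL result pointer. */
int model_window_read(const struct model_window *w, size_t off, uint32_t *val)
{
	const uint32_t *reg = model_window_reg(w, off);

	if (reg == NULL || val == NULL)
		return -1;
	*val = *reg;
	return 0;
}
```

A consumer handed a one-register window always uses offset 0 and
cannot reach the rest of the block, so the register layout stays with
whoever creates the window.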

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: WARNING: refcount bug in should_fail
From: Dmitry Vyukov @ 2018-04-03 11:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Tetsuo Handa, syzbot, syzkaller-bugs,
	linux-fsdevel, Linux-MM, netdev
In-Reply-To: <20180403052009.GH30522@ZenIV.linux.org.uk>

On Tue, Apr 3, 2018 at 7:20 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote:
>
>> FWIW, I'm going through the ->kill_sb() instances, fixing that sort
>> of bugs (most of them preexisting, but I should've checked instead
>> of assuming that everything's fine).  Will push out later tonight.
>
> OK, see vfs.git#for-linus.  Caught: 4 old bugs (allocation failure
> in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp.
> and double-dput in late failure exit in rpc_fill_super())
> and 5 regressions from register_shrinker() failure recovery.

Nice!

^ permalink raw reply

* Re: possible deadlock in skb_queue_tail
From: Dmitry Vyukov @ 2018-04-03 11:25 UTC (permalink / raw)
  To: Kirill Tkhai, Ingo Molnar
  Cc: syzbot, David Miller, David Herrmann, Denys Vlasenko,
	David Windsor, elena.reshetova, ishkamiel, Kees Cook, LKML,
	matthew, Mateusz Jurczyk, netdev, syzkaller-bugs, Al Viro, xemul
In-Reply-To: <06c79d3f-3f28-7f1e-9431-66c18149c9e6@virtuozzo.com>

On Tue, Apr 3, 2018 at 11:50 AM, Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 02.04.2018 12:20, syzbot wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> 06b19fe9a6df7aaa423cd8404ebe5ac9ec4b2960 (Sun Apr 1 03:37:33 2018 +0000)
>> Merge branch 'chelsio-inline-tls'
>> syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=6b495100f17ca8554ab9
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>> Raw console output: https://syzkaller.appspot.com/x/log.txt?id=6218830443446272
>> Kernel config: https://syzkaller.appspot.com/x/.config?id=3327544840960562528
>> compiler: gcc (GCC) 7.1.1 20170620
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+6b495100f17ca8554ab9@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for details.
>> If you forward the report, please keep this part and the footer.
>>
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 4.16.0-rc6+ #290 Not tainted
>> ------------------------------------------------------
>> syz-executor7/20971 is trying to acquire lock:
>>  (&af_unix_sk_receive_queue_lock_key){+.+.}, at: [<00000000271ef0d8>] skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>
>> but task is already holding lock:
>>  (&(&u->lock)->rlock/1){+.+.}, at: [<000000004e725e14>] unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
>>
>> which lock already depends on the new lock.
>>
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (&(&u->lock)->rlock/1){+.+.}:
>>        _raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>>        sk_diag_dump_icons net/unix/diag.c:82 [inline]
>>        sk_diag_fill.isra.4+0xa52/0xfe0 net/unix/diag.c:144
>>        sk_diag_dump net/unix/diag.c:178 [inline]
>>        unix_diag_dump+0x400/0x4f0 net/unix/diag.c:206
>>        netlink_dump+0x492/0xcf0 net/netlink/af_netlink.c:2221
>>        __netlink_dump_start+0x4ec/0x710 net/netlink/af_netlink.c:2318
>>        netlink_dump_start include/linux/netlink.h:214 [inline]
>>        unix_diag_handler_dump+0x3e7/0x750 net/unix/diag.c:307
>>        __sock_diag_cmd net/core/sock_diag.c:230 [inline]
>>        sock_diag_rcv_msg+0x204/0x360 net/core/sock_diag.c:261
>>        netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2443
>>        sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>>        netlink_unicast_kernel net/netlink/af_netlink.c:1307 [inline]
>>        netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1333
>>        netlink_sendmsg+0xa4a/0xe80 net/netlink/af_netlink.c:1896
>>        sock_sendmsg_nosec net/socket.c:629 [inline]
>>        sock_sendmsg+0xca/0x110 net/socket.c:639
>>        sock_write_iter+0x31a/0x5d0 net/socket.c:908
>>        call_write_iter include/linux/fs.h:1782 [inline]
>>        new_sync_write fs/read_write.c:469 [inline]
>>        __vfs_write+0x684/0x970 fs/read_write.c:482
>>        vfs_write+0x189/0x510 fs/read_write.c:544
>>        SYSC_write fs/read_write.c:589 [inline]
>>        SyS_write+0xef/0x220 fs/read_write.c:581
>>        do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>        entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> -> #0 (&af_unix_sk_receive_queue_lock_key){+.+.}:
>>        lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>>        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>        _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>>        skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>        unix_dgram_sendmsg+0xa30/0x1610 net/unix/af_unix.c:1807
>>        sock_sendmsg_nosec net/socket.c:629 [inline]
>>        sock_sendmsg+0xca/0x110 net/socket.c:639
>>        ___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
>>        __sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
>>        SYSC_sendmmsg net/socket.c:2168 [inline]
>>        SyS_sendmmsg+0x35/0x60 net/socket.c:2163
>>        do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>        entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> sk_diag_dump_icons() dumps only sockets in TCP_LISTEN state.
> TCP_LISTEN state may be assigned in only one place in net/unix/af_unix.c:
> it's unix_listen(). The function is applied to stream and seqpacket
> socket types.
>
> It can't be stream because of the second stack, and it can't be seqpacket
> either, as I don't think gcc can inline unix_seqpacket_sendmsg()
> in such a way that we don't see it in the stack.
>
> So this looks like a false positive to me.
>
> Kirill

Do you mean that these &(&u->lock)->rlock/1 referenced in 2 stacks are
always different?

+Ingo for lockdep false positive
Do we need some kind of annotation here?


>> other info that might help us debug this:
>>
>>  Possible unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(&(&u->lock)->rlock/1);
>>                                lock(&af_unix_sk_receive_queue_lock_key);
>>                                lock(&(&u->lock)->rlock/1);
>>   lock(&af_unix_sk_receive_queue_lock_key);
>>
>>  *** DEADLOCK ***
>>
>> 1 lock held by syz-executor7/20971:
>>  #0:  (&(&u->lock)->rlock/1){+.+.}, at: [<000000004e725e14>] unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
>>
>> stack backtrace:
>> CPU: 0 PID: 20971 Comm: syz-executor7 Not tainted 4.16.0-rc6+ #290
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>>  print_circular_bug.isra.38+0x2cd/0x2dc kernel/locking/lockdep.c:1223
>>  check_prev_add kernel/locking/lockdep.c:1863 [inline]
>>  check_prevs_add kernel/locking/lockdep.c:1976 [inline]
>>  validate_chain kernel/locking/lockdep.c:2417 [inline]
>>  __lock_acquire+0x30a8/0x3e00 kernel/locking/lockdep.c:3431
>>  lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>>  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>  _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>>  skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>  unix_dgram_sendmsg+0xa30/0x1610 net/unix/af_unix.c:1807
>>  sock_sendmsg_nosec net/socket.c:629 [inline]
>>  sock_sendmsg+0xca/0x110 net/socket.c:639
>>  ___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
>>  __sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
>>  SYSC_sendmmsg net/socket.c:2168 [inline]
>>  SyS_sendmmsg+0x35/0x60 net/socket.c:2163
>>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>> RIP: 0033:0x455269
>> RSP: 002b:00007f71ffad6c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
>> RAX: ffffffffffffffda RBX: 00007f71ffad76d4 RCX: 0000000000455269
>> RDX: 04924924924924f4 RSI: 0000000020000200 RDI: 0000000000000016
>> RBP: 000000000072bf58 R08: 0000000000000000 R09: 0000000000000000
>> R10: 00000000200000d4 R11: 0000000000000246 R12: 00000000ffffffff
>> R13: 00000000000004ca R14: 00000000006f9390 R15: 0000000000000001
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: sync thread started: state = BACKUP, mcast_ifn = bcsh0, syncid = 0, id = 0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>> IPVS: Unknown mcast interface: bcsh0
>>
>>
>> ---
>> This bug is generated by a dumb bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for details.
>> Direct all questions to syzkaller@googlegroups.com.
>>
>> syzbot will keep track of this bug report.
>> If you forgot to add the Reported-by tag, once the fix for this bug is merged
>> into any tree, please reply to this email with:
>> #syz fix: exact-commit-title
>> To mark this as a duplicate of another syzbot report, please reply with:
>> #syz dup: exact-subject-of-another-report
>> If it's a one-off invalid bug report, please reply with:
>> #syz invalid
>> Note: if the crash happens again, it will cause creation of a new bug report.
>> Note: all commands must start from beginning of the line in the email body.
>

^ permalink raw reply

* Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-04-03 11:16 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, David Lebrun, Daniel Borkmann
In-Reply-To: <20180330230326.ol5f2nmucfjdkop4@ast-mbp.dhcp.thefacebook.com>

2018-03-31 1:03 GMT+02:00 Alexei Starovoitov <alexei.starovoitov@gmail.com>:
>
> On Fri, Mar 23, 2018 at 10:15:59AM +0000, Mathieu Xhonneux wrote:
> > As of Linux 4.14, it is possible to define advanced local processing for
> > IPv6 packets with a Segment Routing Header through the seg6local LWT
> > infrastructure. This LWT implements the network programming principles
> > defined in the IETF “SRv6 Network Programming” draft.
> >
> > The implemented operations are generic, and it would be very interesting to
> > be able to implement user-specific seg6local actions, without having to
> > modify the kernel directly. To do so, this patchset adds an End.BPF action
> > to seg6local, powered by some specific Segment Routing-related helpers,
> > which provide SR functionalities that can be applied on the packet. This
> > BPF hook would then allow to implement specific actions at native kernel
> > speed such as OAM features, advanced SR SDN policies, SRv6 actions like
> > Segment Routing Header (SRH) encapsulation depending on the content of
> > the packet, etc ...
> >
> > This patchset is divided in 5 patches, whose main features are :
> >
> > - A new seg6local action End.BPF with the corresponding new BPF program
> >   type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
> >   passed to the LWT seg6local through netlink, the same way as the LWT
> >   BPF hook operates.
> > - 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/
> >   shrink a SRH and apply on a packet some of the generic SRv6 actions.
> > - 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through
> >   encapsulation (via IPv6 encapsulation or inlining if the packet contains
> >   already an IPv6 header).
> >
> > As this patchset adds a new LWT BPF hook, I took into account the result of
> > the discussions when the LWT BPF infrastructure got merged. Hence, the
> > seg6local BPF hook doesn’t allow write access to skb->data directly, only
> > the SRH can be modified through specific helpers, which ensures that the
> > integrity of the packet is maintained.
> > More details are available in the related patches messages.
> >
> > The performances of this BPF hook have been assessed with the BPF JIT
> > enabled on an Intel Xeon X3440 processor with 4 cores and 8 threads
> > clocked at 2.53 GHz. No throughput losses are noted with the seg6local
> > BPF hook when the BPF program does nothing (440kpps). Adding a 8-bytes
> > TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
> > drops the throughput to 410kpps, and inlining a SRH via
> > bpf_lwt_seg6_action drops the throughput to 420kpps.
> > All throughputs are stable.
> >
> > Any comments on the patchset are welcome.
>
> I've looked through the patches and everything looks very good.
> Feel free to resubmit without RFC tag.
Thanks, I will do this as soon as net-next opens.

>
> In patch 2 I was a bit concerned that:
> +       struct seg6_bpf_srh_state *srh_state = (struct seg6_bpf_srh_state *)
> +                                              &skb->cb;
> would not collide with other users of skb->cb, but it seems the way
> the hook is placed such usage should always be valid.
> Would be good to add a comment describing the situation.
Yes, it's indeed a little hack, but this should be OK since the IPv6 layer does
not use the cb field. Another solution would be to create a new field in
__sk_buff but it's more cumbersome.
I will add a comment.
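
To make the cb usage concrete, here is a minimal user-space sketch of
keeping private per-packet state in a control buffer. Everything in it
is an assumption for illustration: struct model_skb, struct
model_srh_state and its fields, and the 48-byte size are stand-ins and
not the definitions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_CB_SIZE 48	/* assumed to match the size of skb->cb */

/* Hypothetical stand-in for an skb; only the control buffer is modeled. */
struct model_skb {
	unsigned char cb[MODEL_CB_SIZE];
};

/* Hypothetical per-packet state; the fields are illustrative only. */
struct model_srh_state {
	uint16_t hdrlen;
	bool valid;
};

/* Fail the build, not the packet path, if the state outgrows the buffer. */
_Static_assert(sizeof(struct model_srh_state) <= MODEL_CB_SIZE,
	       "private state must fit in cb");

/*
 * The kernel casts &skb->cb directly (it is built with
 * -fno-strict-aliasing). Portable C copies in and out instead.
 */
void model_state_store(struct model_skb *skb, const struct model_srh_state *st)
{
	assert(skb != NULL && st != NULL);
	memcpy(skb->cb, st, sizeof(*st));
}

void model_state_load(const struct model_skb *skb, struct model_srh_state *st)
{
	assert(skb != NULL && st != NULL);
	memcpy(st, skb->cb, sizeof(*st));
}
```

The state survives only as long as no other layer writes to the same
bytes between store and load, which is the property the comment in the
patch should spell out.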

>
> Looks like somewhat odd 'End.BPF' name comes from similar names in SRv6 draft.
> Do you plan to disclose such End.BPF action in the draft as well?
This is something I've discussed with David Lebrun (the author of the Segment
Routing implementation). There's no plan to disclose an End.BPF action as-is
in the draft, since eBPF is really specific to Linux, and David doesn't mind not
having a 1:1 mapping between the actions of the draft and the implemented
ones. Writing "End.BPF" instead of just "bpf" is important to indicate that the
action will advance to the next segment by itself, like all other End actions.
One could imagine adding later a T.BPF action (a transit action), whose SID
wouldn't have to be a segment, but that could still e.g. add/edit/delete TLVs.

>
> Thanks
>

^ permalink raw reply

* RE: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
From: Razvan Stefanescu @ 2018-04-03 11:12 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Arnd Bergmann, gregkh, Laurentiu Tudor, Linux Kernel Mailing List,
	Stuart Yoder, Ruxandra Ioana Ciocoi Radulescu, Roy Pledge,
	Networking, Ioana Ciornei
In-Reply-To: <20180402134441.GB10520@lunn.ch>

Hello Andrew,

> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@lunn.ch]
> Sent: Monday, April 2, 2018 4:45 PM
> To: Ioana Ciornei <ioana.ciornei@nxp.com>
> Cc: Arnd Bergmann <arnd@arndb.de>; gregkh
> <gregkh@linuxfoundation.org>; Laurentiu Tudor
> <laurentiu.tudor@nxp.com>; Linux Kernel Mailing List <linux-
> kernel@vger.kernel.org>; Stuart Yoder <stuyoder@gmail.com>; Ruxandra
> Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>; Razvan Stefanescu
> <razvan.stefanescu@nxp.com>; Roy Pledge <roy.pledge@nxp.com>;
> Networking <netdev@vger.kernel.org>
> Subject: Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
> 
> Hi Ioana
> 
> > The commands listed above are for creating/destroying DPAA2 objects
> > in Management Complex and not for runtime configuration where
> > standard userspace tools are used.
> 
> Please can you explain why this is not just plumbing inside a
> switchdev driver?
> 
> The hardware has a number of physical ports. So on probe, i would
> expect it to create a DPMAC, DPNI, and DPIO for each port, and a linux
> netdev. From then on, standard tools are all that are needed. The
> switchdev driver can create a l2 switch object when the user uses the
> ip link add name br0 type bridge. It can then connect the switch
> object to the DPNI when the user adds an interface to the switch, etc.
> 

I'll chime in, since you mentioned a switchdev driver.

DPAA2 offers several object-based abstractions for modeling network-related
devices (interfaces, L2 Ethernet switch) or accelerators
(DPSECI - crypto and DPDCEI - compression), the latter not upstreamed yet.
They are modeled using various low-level resources (e.g. queues,
classification tables, physical ports) and have multiple configuration and
interconnectivity options, managed by the Management Complex.
Resources are limited and are only used when the objects need them,
to accommodate more configurations and usage scenarios.

Some of the objects have a 1-to-1 correspondence to physical resources
(e.g. DPMACs to physical ports), while others (like DPNIs and DPSW)
can be seen as a collection of the mentioned resources. The types and 
number of such objects are not predetermined.

When the board boots up, none of them exist yet. Restool allows a user to
define the system topology, by providing a way to dynamically create, destroy
and interconnect these objects.

After an object is created, it is presented on the fsl-mc bus. A driver
is loaded to implement the required kernel interfaces specific to that object
type. The kernel can boot first, and the DPAA2 objects are added afterwards,
as the user requires.

As you mentioned DPMACs: objects of this type can be connected only to a DPNI
(a network-interface-like object) or to a DPSW (L2 Ethernet switch) port.
Likewise, a DPNI can have only one connection (to a DPMAC, a DPSW port or
another DPNI object).

Here are several examples of valid connection types:
  * DPMAC <----> DPNI (standard network i/f corresponding to a physical port)
  * DPMAC <----> DPSW (physical port in a switch)
  * DPNI <----> DPSW (virtual network interface connected to a switch port)
  * DPNI <----> DPNI

In the last case, the two DPNIs will not be connected to any physical
port, but can be used as a point-to-point connection between two virtual
machines, for instance.

So, it is not possible to connect a DPNI to a DPSW after it was connected
to a DPMAC. The DPNI-DPMAC pair would have to be disconnected and the
DPMAC reconnected to the switch. The DPNI interface that is no longer
connected to a DPMAC would be destroyed, and any addition/deletion of
a DPNI/DPMAC interface to a switch port would trigger a re-configuration
of the entire switch.
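
The connection rules above can be sketched in a few lines of userspace C.
This is only an illustrative model, not the fsl-mc or restool API: the
enum, struct and function names are made up for the example, and it
assumes that every endpoint (including a DPMAC and a DPSW port) holds at
most one connection at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Object types named in this mail; the enum itself is hypothetical. */
enum dpaa2_type { DPMAC, DPNI, DPSW_PORT };

struct dpaa2_obj {
	enum dpaa2_type type;
	struct dpaa2_obj *peer;	/* NULL while unconnected */
};

/* True if the two types form one of the four valid pairs listed above. */
static bool dpaa2_types_compatible(enum dpaa2_type a, enum dpaa2_type b)
{
	if (a == DPMAC)
		return b == DPNI || b == DPSW_PORT;
	if (b == DPMAC)
		return a == DPNI || a == DPSW_PORT;
	/* No DPMAC involved: DPNI<->DPNI and DPNI<->DPSW are allowed. */
	return a == DPNI || b == DPNI;
}

/* Returns 0 on success, -1 if the pair is invalid or an end is in use. */
static int dpaa2_connect(struct dpaa2_obj *a, struct dpaa2_obj *b)
{
	assert(a != NULL && b != NULL);
	if (a == b || !dpaa2_types_compatible(a->type, b->type))
		return -1;
	if (a->peer || b->peer)
		return -1;	/* an existing link must be removed first */
	a->peer = b;
	b->peer = a;
	return 0;
}

/* Break the link on both ends, if there is one. */
static void dpaa2_disconnect(struct dpaa2_obj *a)
{
	assert(a != NULL);
	if (a->peer) {
		a->peer->peer = NULL;
		a->peer = NULL;
	}
}
```

With this model, moving a physical port from a plain interface into a
switch means calling dpaa2_disconnect() on the DPNI-DPMAC pair first and
only then connecting the DPMAC to a switch port.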

Best regards,
Razvan Stefanescu


* [net-next V9 PATCH 16/16] xdp: transition into using xdp_frame for ndo_xdp_xmit
From: Jesper Dangaard Brouer @ 2018-04-03 11:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152275360298.1026.10333759008401281682.stgit@firesoul>

Change the ndo_xdp_xmit API to take a struct xdp_frame instead of a struct
xdp_buff.  This brings xdp_return_frame and ndo_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.
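
As a rough illustration of why the frame is queue-able while the buff is
not, here is a userspace sketch of the conversion step.  It is a
simplified model, not the kernel's convert_to_xdp_frame(): the struct and
function names are invented, and metadata and memory-model info are left
out.  The point is that the frame descriptor is written into the packet's
own headroom, so it stays valid after the (typically on-stack) buff is
gone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct xdp_buff (hypothetical layout). */
struct buf_model {
	uint8_t *data_hard_start;	/* start of the packet memory area */
	uint8_t *data;			/* start of packet payload */
	uint8_t *data_end;		/* end of packet payload */
};

/* Simplified stand-in for struct xdp_frame (hypothetical layout). */
struct frame_model {
	uint8_t *data;
	uint16_t len;
	uint16_t headroom;
};

/*
 * Store the frame descriptor at the top of the packet memory.
 * Returns NULL when the headroom cannot hold the descriptor, which is
 * the case callers in the patch handle with an error return.
 */
static struct frame_model *to_frame_model(const struct buf_model *buf)
{
	struct frame_model *f;
	size_t headroom;

	assert(buf->data >= buf->data_hard_start);
	assert(buf->data_end >= buf->data);

	headroom = (size_t)(buf->data - buf->data_hard_start);
	if (headroom < sizeof(*f))
		return NULL;

	f = (struct frame_model *)buf->data_hard_start;
	f->data = buf->data;
	f->len = (uint16_t)(buf->data_end - buf->data);
	f->headroom = (uint16_t)(headroom - sizeof(*f));
	return f;
}
```

The sketch assumes data_hard_start is suitably aligned for the
descriptor; a pointer to the returned frame can then be placed on a ring
and consumed later on another CPU.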

V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   30 ++++++++++++++-----------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +++++++++---------
 drivers/net/tun.c                             |   19 ++++++++++------
 drivers/net/virtio_net.c                      |   24 ++++++++++++--------
 include/linux/netdevice.h                     |    4 ++-
 net/core/filter.c                             |   17 +++++++++++++-
 7 files changed, 72 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index c8bf4d35fdea..87fb27ab9c24 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2203,9 +2203,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 #define I40E_XDP_CONSUMED 1
 #define I40E_XDP_TX 2
 
-static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
+static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 			      struct i40e_ring *xdp_ring);
 
+static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
+				 struct i40e_ring *xdp_ring)
+{
+	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!xdpf))
+		return I40E_XDP_CONSUMED;
+
+	return i40e_xmit_xdp_ring(xdpf, xdp_ring);
+}
+
 /**
  * i40e_run_xdp - run an XDP program
  * @rx_ring: Rx ring being processed
@@ -2233,7 +2244,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 		break;
 	case XDP_TX:
 		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
-		result = i40e_xmit_xdp_ring(xdp, xdp_ring);
+		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
 		break;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
@@ -3480,21 +3491,14 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
  * @xdp: data to transmit
  * @xdp_ring: XDP Tx ring
  **/
-static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
+static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 			      struct i40e_ring *xdp_ring)
 {
 	u16 i = xdp_ring->next_to_use;
 	struct i40e_tx_buffer *tx_bi;
 	struct i40e_tx_desc *tx_desc;
-	struct xdp_frame *xdpf;
+	u32 size = xdpf->len;
 	dma_addr_t dma;
-	u32 size;
-
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return I40E_XDP_CONSUMED;
-
-	size = xdpf->len;
 
 	if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
 		xdp_ring->tx_stats.tx_busy++;
@@ -3684,7 +3688,7 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
  *
  * Returns Zero if sent, else an error code
  **/
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct i40e_netdev_priv *np = netdev_priv(dev);
 	unsigned int queue_index = smp_processor_id();
@@ -3697,7 +3701,7 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return -ENXIO;
 
-	err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
+	err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
 	if (err != I40E_XDP_TX)
 		return -ENOSPC;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 857b1d743c8d..4bf318b8be85 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -511,7 +511,7 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
 void i40e_detect_recover_hung(struct i40e_vsi *vsi);
 int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
 void i40e_xdp_flush(struct net_device *dev);
 
 /**
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 4f2864165723..0daccaf72a30 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2262,7 +2262,7 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring *rx_ring,
 #define IXGBE_XDP_TX 2
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp);
+			       struct xdp_frame *xdpf);
 
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 				     struct ixgbe_ring *rx_ring,
@@ -2270,6 +2270,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 {
 	int err, result = IXGBE_XDP_PASS;
 	struct bpf_prog *xdp_prog;
+	struct xdp_frame *xdpf;
 	u32 act;
 
 	rcu_read_lock();
@@ -2283,7 +2284,12 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		result = ixgbe_xmit_xdp_ring(adapter, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf)) {
+			result = IXGBE_XDP_CONSUMED;
+			break;
+		}
+		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
 		break;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
@@ -8344,20 +8350,15 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 }
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
-	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
@@ -10010,7 +10011,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
-static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
 	struct ixgbe_ring *ring;
@@ -10026,7 +10027,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (unlikely(!ring))
 		return -ENXIO;
 
-	err = ixgbe_xmit_xdp_ring(adapter, xdp);
+	err = ixgbe_xmit_xdp_ring(adapter, xdpf);
 	if (err != IXGBE_XDP_TX)
 		return -ENOSPC;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a6a74e896430..46ac5dd79fa3 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1294,18 +1294,13 @@ static const struct net_device_ops tun_netdev_ops = {
 	.ndo_get_stats64	= tun_net_get_stats64,
 };
 
-static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int tun_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
 {
 	struct tun_struct *tun = netdev_priv(dev);
-	struct xdp_frame *frame;
 	struct tun_file *tfile;
 	u32 numqueues;
 	int ret = 0;
 
-	frame = convert_to_xdp_frame(xdp);
-	if (unlikely(!frame))
-		return -EOVERFLOW;
-
 	rcu_read_lock();
 
 	numqueues = READ_ONCE(tun->numqueues);
@@ -1329,6 +1324,16 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	return ret;
 }
 
+static int tun_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return tun_xdp_xmit(dev, frame);
+}
+
 static void tun_xdp_flush(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
@@ -1676,7 +1681,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		case XDP_TX:
 			get_page(alloc_frag->page);
 			alloc_frag->offset += buflen;
-			if (tun_xdp_xmit(tun->dev, &xdp))
+			if (tun_xdp_tx(tun->dev, &xdp))
 				goto err_redirect;
 			tun_xdp_flush(tun->dev);
 			rcu_read_unlock();
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ab3d7cbc4c49..01694e26f03e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -416,10 +416,10 @@ static void virtnet_xdp_flush(struct net_device *dev)
 }
 
 static int __virtnet_xdp_xmit(struct virtnet_info *vi,
-			      struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
-	struct xdp_frame *xdpf, *xdpf_sent;
+	struct xdp_frame *xdpf_sent;
 	struct send_queue *sq;
 	unsigned int len;
 	unsigned int qp;
@@ -432,10 +432,6 @@ static int __virtnet_xdp_xmit(struct virtnet_info *vi,
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
 		xdp_return_frame(xdpf_sent);
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	/* virtqueue want to use data area in-front of packet */
 	if (unlikely(xdpf->metasize > 0))
 		return -EOPNOTSUPP;
@@ -459,7 +455,7 @@ static int __virtnet_xdp_xmit(struct virtnet_info *vi,
 	return 0;
 }
 
-static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct receive_queue *rq = vi->rq;
@@ -472,7 +468,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (!xdp_prog)
 		return -ENXIO;
 
-	return __virtnet_xdp_xmit(vi, xdp);
+	return __virtnet_xdp_xmit(vi, xdpf);
 }
 
 static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
@@ -569,6 +565,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
 		struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset;
+		struct xdp_frame *xdpf;
 		struct xdp_buff xdp;
 		void *orig_data;
 		u32 act;
@@ -611,7 +608,10 @@ static struct sk_buff *receive_small(struct net_device *dev,
 			delta = orig_data - xdp.data;
 			break;
 		case XDP_TX:
-			err = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			err = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(err)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				goto err_xdp;
@@ -702,6 +702,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
+		struct xdp_frame *xdpf;
 		struct page *xdp_page;
 		struct xdp_buff xdp;
 		void *data;
@@ -766,7 +767,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			}
 			break;
 		case XDP_TX:
-			err = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			err = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(err)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				if (unlikely(xdp_page != page))
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503ea81a..14e0777ffcfb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1165,7 +1165,7 @@ struct dev_ifalias {
  *	This function is used to set or query state related to XDP on the
  *	netdevice and manage BPF offload. See definition of
  *	enum bpf_netdev_command for details.
- * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
+ * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdp);
  *	This function is used to submit a XDP packet for transmit on a
  *	netdevice.
  * void (*ndo_xdp_flush)(struct net_device *dev);
@@ -1356,7 +1356,7 @@ struct net_device_ops {
 	int			(*ndo_bpf)(struct net_device *dev,
 					   struct netdev_bpf *bpf);
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
-						struct xdp_buff *xdp);
+						struct xdp_frame *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index d31aff93270d..3bb0cb98a9be 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2749,13 +2749,18 @@ static int __bpf_tx_xdp(struct net_device *dev,
 			struct xdp_buff *xdp,
 			u32 index)
 {
+	struct xdp_frame *xdpf;
 	int err;
 
 	if (!dev->netdev_ops->ndo_xdp_xmit) {
 		return -EOPNOTSUPP;
 	}
 
-	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 	if (err)
 		return err;
 	dev->netdev_ops->ndo_xdp_flush(dev);
@@ -2771,11 +2776,19 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
 		struct net_device *dev = fwd;
+		struct xdp_frame *xdpf;
 
 		if (!dev->netdev_ops->ndo_xdp_xmit)
 			return -EOPNOTSUPP;
 
-		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf))
+			return -EOVERFLOW;
+
+		/* TODO: move to inside map code instead, for bulk support
+		 * err = dev_map_enqueue(dev, xdp);
+		 */
+		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);


* [net-next V9 PATCH 15/16] xdp: transition into using xdp_frame for return API
From: Jesper Dangaard Brouer @ 2018-04-03 11:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152275360298.1026.10333759008401281682.stgit@firesoul>

Changing the xdp_return_frame() API to take a struct xdp_frame as its
argument seems like a natural choice. But there are some subtle performance
details here that need extra care, which makes this a deliberate choice.

De-referencing xdp_frame on a remote CPU during DMA-TX completion
changes the cache-line to the "Shared" state. Later, when the page is
reused for RX, this xdp_frame cache-line is written, which changes the
state to "Modified".

This situation already happens (naturally) for virtio_net, tun and
cpumap, as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-reference xdp_frame.

Only the ixgbe driver had an optimization that let it avoid the
de-reference of xdp_frame.  The driver already has a TX-ring queue,
which (in case of remote DMA-TX completion) has to be transferred
between CPUs anyhow.  In this data area, we stored a struct
xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.

To compensate for losing this optimization, a prefetchw is used to tell
the cache coherency protocol about our access pattern.  My benchmarks
show that this prefetchw is enough to compensate in the ixgbe driver.
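
For readers outside the kernel, the access pattern can be approximated in
userspace with the GCC/Clang builtin __builtin_prefetch(), whose second
argument set to 1 requests a prefetch in preparation for a write.  This
is only a sketch of the idea, not the driver code: struct meta_model and
the function name are hypothetical, and whether the hint helps depends on
the CPU and on how much work sits between the prefetch and the store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor that will be written at the area start. */
struct meta_model {
	void *data;
	unsigned int len;
};

/*
 * Hint that the start of the packet area is about to be written, then
 * write the descriptor there.  In a real RX path other work would sit
 * between the hint and the store, hiding the cache-line transfer.
 */
static void write_meta_with_prefetch(void *hard_start, void *data,
				     unsigned int len)
{
	struct meta_model m = { data, len };

	assert(hard_start != NULL);
	__builtin_prefetch(hard_start, 1, 3);	/* rw=1: prepare for write */

	/* ... unrelated per-packet work would go here ... */

	memcpy(hard_start, &m, sizeof(m));
}
```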

V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c     |    5 ++---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h        |    4 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   17 +++++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    1 +
 drivers/net/tun.c                               |    4 ++--
 drivers/net/virtio_net.c                        |    2 +-
 include/net/xdp.h                               |    2 +-
 kernel/bpf/cpumap.c                             |    6 +++---
 net/core/xdp.c                                  |    4 +++-
 9 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 96c54cbfb1f9..c8bf4d35fdea 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -638,8 +638,7 @@ static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring,
 		if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB)
 			kfree(tx_buffer->raw_buf);
 		else if (ring_is_xdp(ring))
-			xdp_return_frame(tx_buffer->xdpf->data,
-					 &tx_buffer->xdpf->mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 		if (dma_unmap_len(tx_buffer, len))
@@ -842,7 +841,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 		/* free the skb/XDP data */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buf->xdpf->data, &tx_buf->xdpf->mem);
+			xdp_return_frame(tx_buf->xdpf);
 		else
 			napi_consume_skb(tx_buf->skb, napi_budget);
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index abb5248e917e..7dd5038cfcc4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -241,8 +241,7 @@ struct ixgbe_tx_buffer {
 	unsigned long time_stamp;
 	union {
 		struct sk_buff *skb;
-		/* XDP uses address ptr on irq_clean */
-		void *data;
+		struct xdp_frame *xdpf;
 	};
 	unsigned int bytecount;
 	unsigned short gso_segs;
@@ -250,7 +249,6 @@ struct ixgbe_tx_buffer {
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
 	u32 tx_flags;
-	struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f10904ec2172..4f2864165723 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1216,7 +1216,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 		/* free the skb */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -2386,6 +2386,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			xdp.data_hard_start = xdp.data -
 					      ixgbe_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
+			prefetchw(xdp.data_hard_start); /* xdp_frame write */
 
 			skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
 		}
@@ -5797,7 +5798,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
 
 		/* Free all the Tx ring sk_buffs */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8348,16 +8349,21 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
+	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	len = xdp->data_end - xdp->data;
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
 		return IXGBE_XDP_CONSUMED;
 
-	dma = dma_map_single(ring->dev, xdp->data, len, DMA_TO_DEVICE);
+	dma = dma_map_single(ring->dev, xdpf->data, len, DMA_TO_DEVICE);
 	if (dma_mapping_error(ring->dev, dma))
 		return IXGBE_XDP_CONSUMED;
 
@@ -8372,8 +8378,7 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 
 	dma_unmap_len_set(tx_buffer, len, len);
 	dma_unmap_addr_set(tx_buffer, dma, dma);
-	tx_buffer->data = xdp->data;
-	tx_buffer->xdp_mem = xdp->rxq->mem;
+	tx_buffer->xdpf = xdpf;
 
 	tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index f42436d7f2d9..7bbf0db27a01 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -890,6 +890,7 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 
 	dma_sync_single_range_for_cpu(rq->pdev, di->addr, wi->offset,
 				      frag_size, DMA_FROM_DEVICE);
+	prefetchw(va); /* xdp_frame data area */
 	prefetch(data);
 	wi->offset += frag_size;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b52d69801b2d..a6a74e896430 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -663,7 +663,7 @@ void tun_ptr_free(void *ptr)
 	if (tun_is_xdp_frame(ptr)) {
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		__skb_array_destroy_skb(ptr);
 	}
@@ -2189,7 +2189,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
 		ret = tun_put_user_xdp(tun, tfile, xdpf, to);
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		struct sk_buff *skb = ptr;
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 42d338fe9a8d..ab3d7cbc4c49 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -430,7 +430,7 @@ static int __virtnet_xdp_xmit(struct virtnet_info *vi,
 
 	/* Free up any pending old buffers before queueing new ones. */
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
-		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
+		xdp_return_frame(xdpf_sent);
 
 	xdpf = convert_to_xdp_frame(xdp);
 	if (unlikely(!xdpf))
diff --git a/include/net/xdp.h b/include/net/xdp.h
index d0ee437753dc..137ad5f9f40f 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -103,7 +103,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem);
+void xdp_return_frame(struct xdp_frame *xdpf);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index bcdc4dea5ce7..c95b04ec103e 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -219,7 +219,7 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 
 	while ((xdpf = ptr_ring_consume(ring)))
 		if (WARN_ON_ONCE(xdpf))
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 }
 
 static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
@@ -275,7 +275,7 @@ static int cpu_map_kthread_run(void *data)
 
 			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
-				xdp_return_frame(xdpf->data, &xdpf->mem);
+				xdp_return_frame(xdpf);
 				continue;
 			}
 
@@ -578,7 +578,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 		err = __ptr_ring_produce(q, xdpf);
 		if (err) {
 			drops++;
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 		}
 		processed++;
 	}
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 33e382afbd95..0c86b53a3a63 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -308,9 +308,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+void xdp_return_frame(struct xdp_frame *xdpf)
 {
+	struct xdp_mem_info *mem = &xdpf->mem;
 	struct xdp_mem_allocator *xa;
+	void *data = xdpf->data;
 	struct page *page;
 
 	switch (mem->type) {


* [net-next V9 PATCH 14/16] mlx5: use page_pool for xdp_return_frame call
From: Jesper Dangaard Brouer @ 2018-04-03 11:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152275360298.1026.10333759008401281682.stgit@firesoul>

This patch shows how it is possible to have both the driver-local page
cache, which uses an elevated refcnt for "catching"/avoiding that SKB
put_page returns the page through the page allocator, and at the
same time have pages returned to the page_pool from
ndo_xdp_xmit DMA completion.

The performance improvement for XDP_REDIRECT in this patch is really
good.  Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() do per-frame operations of both
rhashtable ID-lookup and locked return into the (page_pool) ptr_ring.
(The plan is to remove these per-frame operations in a followup
patchset).

The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap).  The target/maximum
capability of ixgbe is 13Mpps (on this HW setup).

Before this patch for mlx5, XDP redirected frames were returned via
the page allocator.  The single flow performance was 6Mpps, and if I
started two flows the collective performance dropped to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).

Two test scenarios need to be covered for the xdp_return_frame API:
DMA-TX completion running on the same CPU, or cross-CPU free/return.
Results were same-CPU=10Mpps and cross-CPU=12Mpps.  This is very
close to our 13Mpps max target.
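
To put the packet rates quoted in this description into perspective, it
can help to convert them into a per-packet time budget: at R Mpps each
packet has 1000/R nanoseconds.  The helper below is just that arithmetic
(the function name is made up for the example).  It gives roughly 76.9 ns
at the 13Mpps target, 83.3 ns at 12Mpps, 100 ns at 10Mpps, 166.7 ns at
6Mpps and 250 ns at 4Mpps.

```c
#include <assert.h>

/* Per-packet time budget in nanoseconds for a rate given in Mpps. */
static double ns_per_packet(double mpps)
{
	assert(mpps > 0.0);
	/* 1 Mpps = 1e6 packets/sec, 1 sec = 1e9 ns => 1000 ns at 1 Mpps */
	return 1000.0 / mpps;
}
```

Seen this way, the gap between the 12Mpps cross-CPU result and the
13Mpps target is about 6.4 ns per packet.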

The reason the max target isn't reached in the cross-CPU test is likely
the RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing).  It is also planned to remove this unnecessary DMA
unmap in a later patchset.

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes to not return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
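
The "only ERR_PTR, never NULL" convention mentioned in the V2 notes can be
illustrated with a small userspace re-implementation of the idiom.  This
is a sketch, not the kernel's <linux/err.h>: the model_* helpers, the
4095 limit as written here, and pool_model_create() are stand-ins made up
for the example.  It assumes, as on common platforms, that the top 4095
addresses are never valid object pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the range reserved for error codes. */
static int model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct pool_model {
	unsigned int size;
};

static struct pool_model pool_storage;

/* Hypothetical constructor: returns a valid pool or an error pointer. */
static struct pool_model *pool_model_create(unsigned int size)
{
	if (size == 0)
		return model_err_ptr(-EINVAL);
	pool_storage.size = size;
	return &pool_storage;
}

/* Caller side: one check covers every failure, no separate NULL test. */
static int setup_queue(unsigned int size, struct pool_model **out)
{
	struct pool_model *p;

	assert(out != NULL);
	p = pool_model_create(size);
	if (model_is_err(p)) {
		*out = NULL;	/* do not leave an error pointer stored */
		return (int)model_ptr_err(p);
	}
	*out = p;
	return 0;
}
```

Clearing the stored pointer on failure mirrors what the patch does with
rq->page_pool, so later cleanup code can test it with a plain NULL check.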

V5: Updated patch desc

V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
 - Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
 - Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
 - Correct handling if page_pool_create fails for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    3 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   41 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   16 ++++++--
 3 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 1a05d1072c5e..3317a4da87cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,8 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+struct page_pool;
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
@@ -534,6 +536,7 @@ struct mlx5e_rq {
 	unsigned int           hw_mtu;
 	struct mlx5e_xdpsq     xdpsq;
 	DECLARE_BITMAP(flags, 8);
+	struct page_pool      *page_pool;
 
 	/* control */
 	struct mlx5_wq_ctrl    wq_ctrl;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 13c1e61258a7..d0f2cd86ef32 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -35,6 +35,7 @@
 #include <linux/mlx5/fs.h>
 #include <net/vxlan.h>
 #include <linux/bpf.h>
+#include <net/page_pool.h>
 #include "eswitch.h"
 #include "en.h"
 #include "en_tc.h"
@@ -389,10 +390,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 			  struct mlx5e_rq_param *rqp,
 			  struct mlx5e_rq *rq)
 {
+	struct page_pool_params pp_params = { 0 };
 	struct mlx5_core_dev *mdev = c->mdev;
 	void *rqc = rqp->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
-	u32 byte_count;
+	u32 byte_count, pool_size;
 	int npages;
 	int wq_sz;
 	int err;
@@ -432,9 +434,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
 	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
+	pool_size = 1 << params->log_rq_mtu_frames;
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+
+		pool_size = pool_size * MLX5_MPWRQ_PAGES_PER_WQE;
 		rq->post_wqes = mlx5e_post_rx_mpwqes;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
@@ -512,13 +517,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
-	/* This must only be activate for order-0 pages */
-	if (rq->xdp_prog) {
-		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
-						 MEM_TYPE_PAGE_ORDER0, NULL);
-		if (err)
-			goto err_rq_wq_destroy;
+	/* Create a page_pool and register it with rxq */
+	pp_params.order     = rq->buff.page_order;
+	pp_params.flags     = 0; /* No-internal DMA mapping in page_pool */
+	pp_params.pool_size = pool_size;
+	pp_params.nid       = cpu_to_node(c->cpu);
+	pp_params.dev       = c->pdev;
+	pp_params.dma_dir   = rq->buff.map_dir;
+
+	/* page_pool can be used even when there is no rq->xdp_prog,
+	 * given page_pool does not handle DMA mapping there is no
+	 * required state to clear. And page_pool gracefully handle
+	 * elevated refcnt.
+	 */
+	rq->page_pool = page_pool_create(&pp_params);
+	if (IS_ERR(rq->page_pool)) {
+		if (rq->wq_type != MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
+			kfree(rq->wqe.frag_info);
+		err = PTR_ERR(rq->page_pool);
+		rq->page_pool = NULL;
+		goto err_rq_wq_destroy;
 	}
+	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+					 MEM_TYPE_PAGE_POOL, rq->page_pool);
+	if (err)
+		goto err_rq_wq_destroy;
 
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
@@ -556,6 +579,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy(rq->page_pool);
 	mlx5_wq_destroy(&rq->wq_ctrl);
 
 	return err;
@@ -569,6 +594,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 		bpf_prog_put(rq->xdp_prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy(rq->page_pool);
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 0e24be05907f..f42436d7f2d9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -37,6 +37,7 @@
 #include <linux/bpf_trace.h>
 #include <net/busy_poll.h>
 #include <net/ip6_checksum.h>
+#include <net/page_pool.h>
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -221,7 +222,7 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 	if (mlx5e_rx_cache_get(rq, dma_info))
 		return 0;
 
-	dma_info->page = dev_alloc_pages(rq->buff.page_order);
+	dma_info->page = page_pool_dev_alloc_pages(rq->page_pool);
 	if (unlikely(!dma_info->page))
 		return -ENOMEM;
 
@@ -246,11 +247,16 @@ static void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle)
 {
-	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
-		return;
+	if (likely(recycle)) {
+		if (mlx5e_rx_cache_put(rq, dma_info))
+			return;
 
-	mlx5e_page_dma_unmap(rq, dma_info);
-	put_page(dma_info->page);
+		mlx5e_page_dma_unmap(rq, dma_info);
+		page_pool_recycle_direct(rq->page_pool, dma_info->page);
+	} else {
+		mlx5e_page_dma_unmap(rq, dma_info);
+		put_page(dma_info->page);
+	}
 }
 
 static inline bool mlx5e_page_reuse(struct mlx5e_rq *rq,

* [net-next V9 PATCH 13/16] xdp: allow page_pool as an allocator type in xdp_return_frame
From: Jesper Dangaard Brouer @ 2018-04-03 11:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152275360298.1026.10333759008401281682.stgit@firesoul>

New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.

The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed).  For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the driver's
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver has called xdp_rxq_info_unreg(), it is the
driver's responsibility to free the page_pool, but with an RCU free
call.  This is done easily via the page_pool helper
page_pool_destroy() (which avoids touching any driver code during the
RCU callback, which could happen after the driver has been unloaded).
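
The ordering described above can be sketched in userspace C.  This is
only a model under stated assumptions: every demo_* name is a
hypothetical stand-in that records call order, not the kernel API,
and the RCU deferral done by page_pool_destroy() is reduced to a
comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins (hypothetical names) that only record call order. */
static char demo_log[128];

static void demo_log_call(const char *name)
{
	if (demo_log[0])
		strcat(demo_log, ",");
	strcat(demo_log, name);
}

struct demo_pool { int live; };
struct demo_rxq  { struct demo_pool *allocator; };

static struct demo_pool demo_pool_storage;

static struct demo_pool *demo_page_pool_create(void)
{
	demo_log_call("create");
	demo_pool_storage.live = 1;
	return &demo_pool_storage;
}

static int demo_reg_mem_model(struct demo_rxq *rxq, struct demo_pool *pool)
{
	if (!pool)
		return -1; /* the patch returns -EINVAL for a NULL page_pool */
	demo_log_call("reg_mem_model");
	rxq->allocator = pool;
	return 0;
}

static void demo_rxq_unreg(struct demo_rxq *rxq)
{
	demo_log_call("unreg");
	rxq->allocator = NULL;
}

static void demo_page_pool_destroy(struct demo_pool *pool)
{
	/* The real helper defers the final free with call_rcu(). */
	demo_log_call("destroy");
	pool->live = 0;
}

/* Driver-side setup: the driver creates the pool, then registers it. */
static int demo_rq_setup(struct demo_rxq *rxq, struct demo_pool **pool)
{
	*pool = demo_page_pool_create();
	return demo_reg_mem_model(rxq, *pool);
}

/* Driver-side teardown: unregister first, then destroy the pool. */
static void demo_rq_teardown(struct demo_rxq *rxq, struct demo_pool **pool)
{
	demo_rxq_unreg(rxq);
	if (*pool) {
		demo_page_pool_destroy(*pool);
		*pool = NULL;
	}
}
```

The point of the symmetry is that the same party (the driver) owns
both ends of the pool's lifetime; the xdp core only holds a reference
between registration and unregistration.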

V8: address issues found by kbuild test robot
 - Address sparse "should be static" warnings
 - Allow xdp.o to be compiled without page_pool.o

V9: Remove inline from .c file, compiler knows best

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/page_pool.h |   14 +++++++++++
 include/net/xdp.h       |    3 ++
 net/core/xdp.c          |   60 ++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 65 insertions(+), 12 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 1fe77db59518..c79087153148 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -117,7 +117,12 @@ void __page_pool_put_page(struct page_pool *pool,
 
 static inline void page_pool_put_page(struct page_pool *pool, struct page *page)
 {
+	/* When page_pool isn't compiled-in, net/core/xdp.c doesn't
+	 * allow registering MEM_TYPE_PAGE_POOL, but shield linker.
+	 */
+#ifdef CONFIG_PAGE_POOL
 	__page_pool_put_page(pool, page, false);
+#endif
 }
 /* Very limited use-cases allow recycle direct */
 static inline void page_pool_recycle_direct(struct page_pool *pool,
@@ -126,4 +131,13 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
 	__page_pool_put_page(pool, page, true);
 }
 
+static inline bool is_page_pool_compiled_in(void)
+{
+#ifdef CONFIG_PAGE_POOL
+	return true;
+#else
+	return false;
+#endif
+}
+
 #endif /* _NET_PAGE_POOL_H */
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 5f67c62540aa..d0ee437753dc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -36,6 +36,7 @@
 enum xdp_mem_type {
 	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
+	MEM_TYPE_PAGE_POOL,
 	MEM_TYPE_MAX,
 };
 
@@ -44,6 +45,8 @@ struct xdp_mem_info {
 	u32 id;
 };
 
+struct page_pool;
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 8b2cb79b5de0..33e382afbd95 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -8,6 +8,7 @@
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/rhashtable.h>
+#include <net/page_pool.h>
 
 #include <net/xdp.h>
 
@@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht;
 
 struct xdp_mem_allocator {
 	struct xdp_mem_info mem;
-	void *allocator;
+	union {
+		void *allocator;
+		struct page_pool *page_pool;
+	};
 	struct rhash_head node;
 	struct rcu_head rcu;
 };
@@ -74,7 +78,9 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
 	/* Allow this ID to be reused */
 	ida_simple_remove(&mem_id_pool, xa->mem.id);
 
-	/* TODO: Depending on allocator type/pointer free resources */
+	/* Notice, driver is expected to free the *allocator,
+	 * e.g. page_pool, and MUST also use RCU free.
+	 */
 
 	/* Poison memory */
 	xa->mem.id = 0xFFFF;
@@ -225,6 +231,17 @@ static int __mem_id_cyclic_get(gfp_t gfp)
 	return id;
 }
 
+static bool __is_supported_mem_type(enum xdp_mem_type type)
+{
+	if (type == MEM_TYPE_PAGE_POOL)
+		return is_page_pool_compiled_in();
+
+	if (type >= MEM_TYPE_MAX)
+		return false;
+
+	return true;
+}
+
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum xdp_mem_type type, void *allocator)
 {
@@ -238,13 +255,16 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 		return -EFAULT;
 	}
 
-	if (type >= MEM_TYPE_MAX)
-		return -EINVAL;
+	if (!__is_supported_mem_type(type))
+		return -EOPNOTSUPP;
 
 	xdp_rxq->mem.type = type;
 
-	if (!allocator)
+	if (!allocator) {
+		if (type == MEM_TYPE_PAGE_POOL)
+			return -EINVAL; /* Setup time check page_pool req */
 		return 0;
+	}
 
 	/* Delay init of rhashtable to save memory if feature isn't used */
 	if (!mem_id_init) {
@@ -290,15 +310,31 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
-	if (mem->type == MEM_TYPE_PAGE_SHARED) {
+	struct xdp_mem_allocator *xa;
+	struct page *page;
+
+	switch (mem->type) {
+	case MEM_TYPE_PAGE_POOL:
+		rcu_read_lock();
+		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+		page = virt_to_head_page(data);
+		if (xa)
+			page_pool_put_page(xa->page_pool, page);
+		else
+			put_page(page);
+		rcu_read_unlock();
+		break;
+	case MEM_TYPE_PAGE_SHARED:
 		page_frag_free(data);
-		return;
-	}
-
-	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
-		struct page *page = virt_to_page(data); /* Assumes order0 page*/
-
+		break;
+	case MEM_TYPE_PAGE_ORDER0:
+		page = virt_to_page(data); /* Assumes order0 page*/
 		put_page(page);
+		break;
+	default:
+		/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
+		break;
 	}
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);

* [net-next V9 PATCH 12/16] page_pool: refurbish version of page_pool code
From: Jesper Dangaard Brouer @ 2018-04-03 11:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152275360298.1026.10333759008401281682.stgit@firesoul>

The ndo_xdp_xmit API needs a fast page recycle mechanism for returning
pages at DMA-TX completion time.  It must have good cross-CPU
performance, given that DMA-TX completion can happen on a remote CPU.

Refurbish my page_pool code, which was presented[1] at MM-summit 2016.
Adapt the page_pool code so it no longer depends on the page allocator
or on integration into struct page.  The DMA mapping feature is kept,
even though it will not be activated/used in this patchset.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
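
The two-level recycle scheme in this patch (a small allocation-side
cache plus a return ring that remote CPUs can feed) can be modelled
in userspace C.  This is a simplified sketch, not the patch's code:
the toy_* names and the small sizes are assumptions, malloc() stands
in for the page allocator, and all locking and softirq checks are
left out.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_CACHE_SIZE   4 /* stands in for PP_ALLOC_CACHE_SIZE */
#define TOY_CACHE_REFILL 2 /* stands in for PP_ALLOC_CACHE_REFILL */
#define TOY_RING_SIZE    8

struct toy_pool {
	void *cache[TOY_CACHE_SIZE];	/* allocation-side stack */
	unsigned int cache_count;
	void *ring[TOY_RING_SIZE];	/* FIFO of returned buffers */
	unsigned int ring_head, ring_count;
	unsigned int slow_allocs;	/* times we fell back to malloc() */
};

static int toy_ring_produce(struct toy_pool *p, void *buf)
{
	if (p->ring_count == TOY_RING_SIZE)
		return -1;
	p->ring[(p->ring_head + p->ring_count) % TOY_RING_SIZE] = buf;
	p->ring_count++;
	return 0;
}

static void *toy_ring_consume(struct toy_pool *p)
{
	void *buf;

	if (!p->ring_count)
		return NULL;
	buf = p->ring[p->ring_head];
	p->ring_head = (p->ring_head + 1) % TOY_RING_SIZE;
	p->ring_count--;
	return buf;
}

static void *toy_pool_get(struct toy_pool *p)
{
	void *buf;

	/* Fast path: pop from the allocation-side cache */
	if (p->cache_count)
		return p->cache[--p->cache_count];

	/* Slower path: refill the cache from the return ring */
	while (p->cache_count < TOY_CACHE_REFILL &&
	       (buf = toy_ring_consume(p)))
		p->cache[p->cache_count++] = buf;
	if (p->cache_count)
		return p->cache[--p->cache_count];

	/* Slow path: nothing recycled, do a real allocation */
	p->slow_allocs++;
	return malloc(64);
}

static void toy_pool_put(struct toy_pool *p, void *buf, int allow_direct)
{
	/* Direct recycle is only for the owner of the allocation side */
	if (allow_direct && p->cache_count < TOY_CACHE_SIZE) {
		p->cache[p->cache_count++] = buf;
		return;
	}
	if (toy_ring_produce(p, buf) == 0)
		return;
	free(buf); /* ring full: give the buffer back to the system */
}
```

In this model a put with allow_direct == 0 corresponds to a return
from a remote CPU, which can only go through the ring, while
allow_direct == 1 corresponds to the XDP_DROP style recycle from the
RX path itself.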

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes, don't return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.

V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
  e.g. remove a WARN as pointed out by Tariq
  e.g. quicker fallback if ptr_ring is empty.

V5: Fixed SPDX license as pointed out by Alexei

V6: Adjustments requested by Eric Dumazet
 - Adjust ____cacheline_aligned_in_smp usage/placement
 - Move rcu_head in struct page_pool
 - Free pages quicker on destroy, minimize resources delayed by an RCU period
 - Remove code for forward/backward compat ABI interface

V8: Issues found by kbuild test robot
 - Address sparse "should be static" warnings
 - Only compile+link when a driver uses/selects page_pool;
   mlx5 selects CONFIG_PAGE_POOL, although it is first used two
   patches later

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig |    1 
 include/net/page_pool.h                         |  129 +++++++++
 net/Kconfig                                     |    3 
 net/core/Makefile                               |    1 
 net/core/page_pool.c                            |  317 +++++++++++++++++++++++
 5 files changed, 451 insertions(+)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index c032319f1cb9..12257034131e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -30,6 +30,7 @@ config MLX5_CORE_EN
 	bool "Mellanox Technologies ConnectX-4 Ethernet support"
 	depends on NETDEVICES && ETHERNET && INET && PCI && MLX5_CORE
 	depends on IPV6=y || IPV6=n || MLX5_CORE=m
+	select PAGE_POOL
 	default n
 	---help---
 	  Ethernet support in Mellanox Technologies ConnectX-4 NIC.
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
new file mode 100644
index 000000000000..1fe77db59518
--- /dev/null
+++ b/include/net/page_pool.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool.h
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+
+/**
+ * DOC: page_pool allocator
+ *
+ * This page_pool allocator is optimized for the XDP mode that
+ * uses one-frame-per-page, but have fallbacks that act like the
+ * regular page allocator APIs.
+ *
+ * Basic use involve replacing alloc_pages() calls with the
+ * page_pool_alloc_pages() call.  Drivers should likely use
+ * page_pool_dev_alloc_pages() replacing dev_alloc_pages().
+ *
+ * If page_pool handles DMA mapping (use page->private), then API user
+ * is responsible for invoking page_pool_put_page() once.  In-case of
+ * elevated refcnt, the DMA state is released, assuming other users of
+ * the page will eventually call put_page().
+ *
+ * If no DMA mapping is done, then it can act as shim-layer that
+ * fall-through to alloc_page.  As no state is kept on the page, the
+ * regular put_page() call is sufficient.
+ */
+#ifndef _NET_PAGE_POOL_H
+#define _NET_PAGE_POOL_H
+
+#include <linux/mm.h> /* Needed by ptr_ring */
+#include <linux/ptr_ring.h>
+#include <linux/dma-direction.h>
+
+#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */
+#define PP_FLAG_ALL	PP_FLAG_DMA_MAP
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark is related to the network
+ * use-case.  The NAPI budget is 64 packets.  After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects, is due to XDP_DROP use-case.  As
+ * XDP_DROP allows the opportunity to recycle objects directly into
+ * this array, as it shares the same softirq/NAPI protection.  If
+ * cache is already full (or partly full) then the XDP_DROP recycles
+ * would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE	128
+#define PP_ALLOC_CACHE_REFILL	64
+struct pp_alloc_cache {
+	u32 count;
+	void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+struct page_pool_params {
+	unsigned int	flags;
+	unsigned int	order;
+	unsigned int	pool_size;
+	int		nid;  /* NUMA node id to allocate pages from */
+	struct device	*dev; /* device, for DMA pre-mapping purposes */
+	enum dma_data_direction dma_dir; /* DMA mapping direction */
+};
+
+struct page_pool {
+	struct rcu_head rcu;
+	struct page_pool_params p;
+
+	/*
+	 * Data structure for allocation side
+	 *
+	 * Drivers allocation side usually already perform some kind
+	 * of resource protection.  Piggyback on this protection, and
+	 * require driver to protect allocation side.
+	 *
+	 * For NIC drivers this means, allocate a page_pool per
+	 * RX-queue. As the RX-queue is already protected by
+	 * Softirq/BH scheduling and napi_schedule. NAPI schedule
+	 * guarantee that a single napi_struct will only be scheduled
+	 * on a single CPU (see napi_schedule).
+	 */
+	struct pp_alloc_cache alloc ____cacheline_aligned_in_smp;
+
+	/* Data structure for storing recycled pages.
+	 *
+	 * Returning/freeing pages is more complicated synchronization
+	 * wise, because free's can happen on remote CPUs, with no
+	 * association with allocation resource.
+	 *
+	 * Use ptr_ring, as it separates consumer and producer
+	 * efficiently, in a way that doesn't bounce cache-lines.
+	 *
+	 * TODO: Implement bulk return pages into this structure.
+	 */
+	struct ptr_ring ring;
+};
+
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+	return page_pool_alloc_pages(pool, gfp);
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params);
+
+void page_pool_destroy(struct page_pool *pool);
+
+/* Never call this directly, use helpers below */
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct);
+
+static inline void page_pool_put_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_put_page(pool, page, false);
+}
+/* Very limited use-cases allow recycle direct */
+static inline void page_pool_recycle_direct(struct page_pool *pool,
+					    struct page *page)
+{
+	__page_pool_put_page(pool, page, true);
+}
+
+#endif /* _NET_PAGE_POOL_H */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..6fa1a4493b8c 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -423,6 +423,9 @@ config MAY_USE_DEVLINK
 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
 	  devlink is a loadable module and the driver using it is built-in.
 
+config PAGE_POOL
+       bool
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..7080417f8bc8 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -14,6 +14,7 @@ obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			fib_notifier.o xdp.o
 
 obj-y += net-sysfs.o
+obj-$(CONFIG_PAGE_POOL) += page_pool.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
 obj-$(CONFIG_NET_PKTGEN) += pktgen.o
 obj-$(CONFIG_NETPOLL) += netpoll.o
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
new file mode 100644
index 000000000000..68bf07206744
--- /dev/null
+++ b/net/core/page_pool.c
@@ -0,0 +1,317 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool.c
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include <net/page_pool.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include <linux/page-flags.h>
+#include <linux/mm.h> /* for __put_page() */
+
+static int page_pool_init(struct page_pool *pool,
+			  const struct page_pool_params *params)
+{
+	unsigned int ring_qsize = 1024; /* Default */
+
+	memcpy(&pool->p, params, sizeof(pool->p));
+
+	/* Validate only known flags were used */
+	if (pool->p.flags & ~(PP_FLAG_ALL))
+		return -EINVAL;
+
+	if (pool->p.pool_size)
+		ring_qsize = pool->p.pool_size;
+
+	/* Sanity limit mem that can be pinned down */
+	if (ring_qsize > 32768)
+		return -E2BIG;
+
+	/* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+	 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,
+	 * which is the XDP_TX use-case.
+	 */
+	if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
+	    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
+		return -EINVAL;
+
+	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
+		return -ENOMEM;
+
+	return 0;
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params)
+{
+	struct page_pool *pool;
+	int err = 0;
+
+	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+	if (!pool)
+		return ERR_PTR(-ENOMEM);
+
+	err = page_pool_init(pool, params);
+	if (err < 0) {
+		pr_warn("%s() gave up with errno %d\n", __func__, err);
+		kfree(pool);
+		return ERR_PTR(err);
+	}
+	return pool;
+}
+EXPORT_SYMBOL(page_pool_create);
+
+/* fast path */
+static struct page *__page_pool_get_cached(struct page_pool *pool)
+{
+	struct ptr_ring *r = &pool->ring;
+	struct page *page;
+
+	/* Quicker fallback, avoid locks when ring is empty */
+	if (__ptr_ring_empty(r))
+		return NULL;
+
+	/* Test for safe-context, caller should provide this guarantee */
+	if (likely(in_serving_softirq())) {
+		if (likely(pool->alloc.count)) {
+			/* Fast-path */
+			page = pool->alloc.cache[--pool->alloc.count];
+			return page;
+		}
+		/* Slower-path: Alloc array empty, time to refill
+		 *
+		 * Open-coded bulk ptr_ring consumer.
+		 *
+		 * Discussion: the ring consumer lock is not really
+		 * needed due to the softirq/NAPI protection, but
+		 * later need the ability to reclaim pages on the
+		 * ring. Thus, keeping the locks.
+		 */
+		spin_lock(&r->consumer_lock);
+		while ((page = __ptr_ring_consume(r))) {
+			if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+				break;
+			pool->alloc.cache[pool->alloc.count++] = page;
+		}
+		spin_unlock(&r->consumer_lock);
+		return page;
+	}
+
+	/* Slow-path: Get page from locked ring queue */
+	page = ptr_ring_consume(&pool->ring);
+	return page;
+}
+
+/* slow path */
+noinline
+static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
+						 gfp_t _gfp)
+{
+	struct page *page;
+	gfp_t gfp = _gfp;
+	dma_addr_t dma;
+
+	/* We could always set __GFP_COMP, and avoid this branch, as
+	 * prep_new_page() can handle order-0 with __GFP_COMP.
+	 */
+	if (pool->p.order)
+		gfp |= __GFP_COMP;
+
+	/* FUTURE development:
+	 *
+	 * Current slow-path essentially falls back to single page
+	 * allocations, which doesn't improve performance.  This code
+	 * need bulk allocation support from the page allocator code.
+	 */
+
+	/* Cache was empty, do real allocation */
+	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+	if (!page)
+		return NULL;
+
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		goto skip_dma_map;
+
+	/* Setup DMA mapping: use page->private for DMA-addr
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page(pool->p.dev, page, 0,
+			   (PAGE_SIZE << pool->p.order),
+			   pool->p.dma_dir);
+	if (dma_mapping_error(pool->p.dev, dma)) {
+		put_page(page);
+		return NULL;
+	}
+	set_page_private(page, dma); /* page->private = dma; */
+
+skip_dma_map:
+	/* A page that was just alloc'ed should/must have refcnt 1. */
+	return page;
+}
+
+/* For using page_pool replace: alloc_pages() API calls, but provide
+ * synchronization guarantee for allocation side.
+ */
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	struct page *page;
+
+	/* Fast-path: Get a page from cache */
+	page = __page_pool_get_cached(pool);
+	if (page)
+		return page;
+
+	/* Slow-path: cache empty, do real allocation */
+	page = __page_pool_alloc_pages_slow(pool, gfp);
+	return page;
+}
+EXPORT_SYMBOL(page_pool_alloc_pages);
+
+/* Cleanup page_pool state from page */
+static void __page_pool_clean_page(struct page_pool *pool,
+				   struct page *page)
+{
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		return;
+
+	/* DMA unmap */
+	dma_unmap_page(pool->p.dev, page_private(page),
+		       PAGE_SIZE << pool->p.order, pool->p.dma_dir);
+	set_page_private(page, 0);
+}
+
+/* Return a page to the page allocator, cleaning up our state */
+static void __page_pool_return_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+	/* An optimization would be to call __free_pages(page, pool->p.order)
+	 * knowing page is not part of page-cache (thus avoiding a
+	 * __page_cache_release() call).
+	 */
+}
+
+static bool __page_pool_recycle_into_ring(struct page_pool *pool,
+				   struct page *page)
+{
+	int ret;
+	/* BH protection not needed if current is serving softirq */
+	if (in_serving_softirq())
+		ret = ptr_ring_produce(&pool->ring, page);
+	else
+		ret = ptr_ring_produce_bh(&pool->ring, page);
+
+	return (ret == 0) ? true : false;
+}
+
+/* Only allow direct recycling in special circumstances, into the
+ * alloc side cache.  E.g. during RX-NAPI processing for XDP_DROP use-case.
+ *
+ * Caller must provide appropriate safe context.
+ */
+static bool __page_pool_recycle_direct(struct page *page,
+				       struct page_pool *pool)
+{
+	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
+		return false;
+
+	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	pool->alloc.cache[pool->alloc.count++] = page;
+	return true;
+}
+
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct)
+{
+	/* This allocator is optimized for the XDP mode that uses
+	 * one-frame-per-page, but have fallbacks that act like the
+	 * regular page allocator APIs.
+	 *
+	 * refcnt == 1 means page_pool owns page, and can recycle it.
+	 */
+	if (likely(page_ref_count(page) == 1)) {
+		/* Read barrier done in page_ref_count / READ_ONCE */
+
+		if (allow_direct && in_serving_softirq())
+			if (__page_pool_recycle_direct(page, pool))
+				return;
+
+		if (!__page_pool_recycle_into_ring(pool, page)) {
+			/* Cache full, fallback to free pages */
+			__page_pool_return_page(pool, page);
+		}
+		return;
+	}
+	/* Fallback/non-XDP mode: API user have elevated refcnt.
+	 *
+	 * Many drivers split up the page into fragments, and some
+	 * want to keep doing this to save memory and do refcnt based
+	 * recycling. Support this use case too, to ease drivers
+	 * switching between XDP/non-XDP.
+	 *
+	 * In-case page_pool maintains the DMA mapping, API user must
+	 * call page_pool_put_page once.  In this elevated refcnt
+	 * case, the DMA is unmapped/released, as driver is likely
+	 * doing refcnt based recycle tricks, meaning another process
+	 * will be invoking put_page.
+	 */
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+}
+EXPORT_SYMBOL(__page_pool_put_page);
+
+static void __page_pool_empty_ring(struct page_pool *pool)
+{
+	struct page *page;
+
+	/* Empty recycle ring */
+	while ((page = ptr_ring_consume(&pool->ring))) {
+		/* Verify the refcnt invariant of cached pages */
+		if (!(page_ref_count(page) == 1))
+			pr_crit("%s() page_pool refcnt %d violation\n",
+				__func__, page_ref_count(page));
+
+		__page_pool_return_page(pool, page);
+	}
+}
+
+static void __page_pool_destroy_rcu(struct rcu_head *rcu)
+{
+	struct page_pool *pool;
+
+	pool = container_of(rcu, struct page_pool, rcu);
+
+	WARN(pool->alloc.count, "API usage violation");
+
+	__page_pool_empty_ring(pool);
+	ptr_ring_cleanup(&pool->ring, NULL);
+	kfree(pool);
+}
+
+/* Cleanup and release resources */
+void page_pool_destroy(struct page_pool *pool)
+{
+	struct page *page;
+
+	/* Empty alloc cache, assume caller made sure this is
+	 * no-longer in use, and page_pool_alloc_pages() cannot be
+	 * called concurrently.
+	 */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
+
+	/* No more consumers should exist, but producers could still
+	 * be in-flight.
+	 */
+	__page_pool_empty_ring(pool);
+
+	/* An xdp_mem_allocator can still ref page_pool pointer */
+	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
+}
+EXPORT_SYMBOL(page_pool_destroy);

* [net-next V9 PATCH 11/16] xdp: rhashtable with allocator ID to pointer mapping
From: Jesper Dangaard Brouer @ 2018-04-03 11:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152275360298.1026.10333759008401281682.stgit@firesoul>

Use the IDA infrastructure for getting a cyclic increasing ID number
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable to create the ID to
pointer lookup table, because this is faster.
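
As an illustration of what "cyclic increasing" means here, the
following userspace C sketch hands out IDs upward from the last
position and only wraps when the top of the range is reached.  It is
a toy bitmap, not the IDA; the toy_* names and the tiny range are
assumptions (the patch uses MEM_ID_MIN 1 and MEM_ID_MAX 0xFFFE).

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ID_MIN 1
#define TOY_ID_MAX 4 /* the patch uses 0xFFFE */

static bool toy_id_used[TOY_ID_MAX + 1];
static int toy_id_next = TOY_ID_MIN;

/* Return a free ID in [TOY_ID_MIN, TOY_ID_MAX], searching upward from
 * the last position and wrapping once; return -1 if all IDs are taken.
 */
static int toy_id_cyclic_get(void)
{
	int span = TOY_ID_MAX - TOY_ID_MIN + 1;
	int i;

	for (i = 0; i < span; i++) {
		int id = TOY_ID_MIN + (toy_id_next - TOY_ID_MIN + i) % span;

		if (!toy_id_used[id]) {
			toy_id_used[id] = true;
			toy_id_next = (id == TOY_ID_MAX) ? TOY_ID_MIN : id + 1;
			return id;
		}
	}
	return -1;
}

/* Allow this ID to be handed out again */
static void toy_id_put(int id)
{
	toy_id_used[id] = false;
}
```

A released ID is not handed out again right away; it only comes back
after the counter has wrapped, which reduces the chance that a stale
ID still carried by an in-flight frame matches a newly registered
allocator.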

The problem being solved here is that the xdp_rxq_info pointer
(stored in xdp_buff) cannot be used directly, as its guaranteed
lifetime is too short.  The info is needed on a (potentially) remote
CPU at DMA-TX completion time.  The xdp_mem_info is stored in the
xdp_frame when it is converted from an xdp_buff, which is sufficient
for the simple page refcnt based recycle schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer.  The removal and validity of the
allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().
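
The idea of carrying an ID instead of a pointer can be sketched in
userspace C.  This is a model only: a fixed array stands in for the
rhashtable, the toy_* names are hypothetical, and the RCU protection
of the lookup is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_IDS 8

struct toy_mem_info {
	unsigned int type;
	unsigned int id;	/* 0 means "no allocator registered" */
};

/* Stand-in for the rhashtable: maps ID -> allocator pointer */
static void *toy_id_table[TOY_MAX_IDS];

static int toy_reg(struct toy_mem_info *mem, void *allocator)
{
	unsigned int id;

	for (id = 1; id < TOY_MAX_IDS; id++) {
		if (!toy_id_table[id]) {
			toy_id_table[id] = allocator;
			mem->id = id;
			return 0;
		}
	}
	return -1; /* no free ID */
}

static void toy_unreg(struct toy_mem_info *mem)
{
	if (mem->id == 0)
		return;
	toy_id_table[mem->id] = NULL;
	mem->id = 0;
}

/* A frame only carries a copy of toy_mem_info, never the pointer.
 * The lookup returns NULL once the allocator has been unregistered.
 */
static void *toy_lookup(const struct toy_mem_info *mem)
{
	if (mem->id == 0 || mem->id >= TOY_MAX_IDS)
		return NULL;
	return toy_id_table[mem->id];
}
```

Because the frame holds only a small copy of the ID, an unregistered
allocator shows up as a failed lookup rather than a dangling pointer,
and the caller can fall back to a plain page release.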

It is up for debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.

V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
  XDP_REDIRECT, even though it's not strictly necessary when
  allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch

V8: Address sparse "should be static" warnings (from kbuild test robot)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    9 +
 drivers/net/tun.c                             |    6 +
 drivers/net/virtio_net.c                      |    7 +
 include/net/xdp.h                             |   14 --
 net/core/xdp.c                                |  223 ++++++++++++++++++++++++-
 5 files changed, 241 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0bfe6cf2bf8b..f10904ec2172 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6370,7 +6370,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	struct device *dev = rx_ring->dev;
 	int orig_node = dev_to_node(dev);
 	int ring_node = -1;
-	int size;
+	int size, err;
 
 	size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count;
 
@@ -6407,6 +6407,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 			     rx_ring->queue_index) < 0)
 		goto err;
 
+	err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
+					 MEM_TYPE_PAGE_SHARED, NULL);
+	if (err) {
+		xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+		goto err;
+	}
+
 	rx_ring->xdp_prog = adapter->xdp_prog;
 
 	return 0;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 714735c6d3ff..b52d69801b2d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -847,6 +847,12 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 				       tun->dev, tfile->queue_index);
 		if (err < 0)
 			goto out;
+		err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&tfile->xdp_rxq);
+			goto out;
+		}
 		err = 0;
 	}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f50e1ad81ad4..42d338fe9a8d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1305,6 +1305,13 @@ static int virtnet_open(struct net_device *dev)
 		if (err < 0)
 			return err;
 
+		err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
+			return err;
+		}
+
 		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 		virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
 	}
diff --git a/include/net/xdp.h b/include/net/xdp.h
index ea3773f94f65..5f67c62540aa 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -41,6 +41,7 @@ enum xdp_mem_type {
 
 struct xdp_mem_info {
 	u32 type; /* enum xdp_mem_type, but known size type */
+	u32 id;
 };
 
 struct xdp_rxq_info {
@@ -99,18 +100,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-static inline
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
-{
-	if (mem->type == MEM_TYPE_PAGE_SHARED)
-		page_frag_free(data);
-
-	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
-		struct page *page = virt_to_page(data); /* Assumes order0 page*/
-
-		put_page(page);
-	}
-}
+void xdp_return_frame(void *data, struct xdp_mem_info *mem);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 7e6b3545277d..8b2cb79b5de0 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -5,6 +5,9 @@
  */
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <linux/rhashtable.h>
 
 #include <net/xdp.h>
 
@@ -13,6 +16,99 @@
 #define REG_STATE_UNREGISTERED	0x2
 #define REG_STATE_UNUSED	0x3
 
+static DEFINE_IDA(mem_id_pool);
+static DEFINE_MUTEX(mem_id_lock);
+#define MEM_ID_MAX 0xFFFE
+#define MEM_ID_MIN 1
+static int mem_id_next = MEM_ID_MIN;
+
+static bool mem_id_init; /* false */
+static struct rhashtable *mem_id_ht;
+
+struct xdp_mem_allocator {
+	struct xdp_mem_info mem;
+	void *allocator;
+	struct rhash_head node;
+	struct rcu_head rcu;
+};
+
+static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
+{
+	const u32 *k = data;
+	const u32 key = *k;
+
+	BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_mem_allocator, mem.id)
+		     != sizeof(u32));
+
+	/* Use cyclic increasing ID as direct hash key, see rht_bucket_index */
+	return key << RHT_HASH_RESERVED_SPACE;
+}
+
+static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg,
+			  const void *ptr)
+{
+	const struct xdp_mem_allocator *xa = ptr;
+	u32 mem_id = *(u32 *)arg->key;
+
+	return xa->mem.id != mem_id;
+}
+
+static const struct rhashtable_params mem_id_rht_params = {
+	.nelem_hint = 64,
+	.head_offset = offsetof(struct xdp_mem_allocator, node),
+	.key_offset  = offsetof(struct xdp_mem_allocator, mem.id),
+	.key_len = FIELD_SIZEOF(struct xdp_mem_allocator, mem.id),
+	.max_size = MEM_ID_MAX,
+	.min_size = 8,
+	.automatic_shrinking = true,
+	.hashfn    = xdp_mem_id_hashfn,
+	.obj_cmpfn = xdp_mem_id_cmp,
+};
+
+static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
+{
+	struct xdp_mem_allocator *xa;
+
+	xa = container_of(rcu, struct xdp_mem_allocator, rcu);
+
+	/* Allow this ID to be reused */
+	ida_simple_remove(&mem_id_pool, xa->mem.id);
+
+	/* TODO: Depending on allocator type/pointer free resources */
+
+	/* Poison memory */
+	xa->mem.id = 0xFFFF;
+	xa->mem.type = 0xF0F0;
+	xa->allocator = (void *)0xDEAD9001;
+
+	kfree(xa);
+}
+
+static void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
+{
+	struct xdp_mem_allocator *xa;
+	int id = xdp_rxq->mem.id;
+	int err;
+
+	if (id == 0)
+		return;
+
+	mutex_lock(&mem_id_lock);
+
+	xa = rhashtable_lookup(mem_id_ht, &id, mem_id_rht_params);
+	if (!xa) {
+		mutex_unlock(&mem_id_lock);
+		return;
+	}
+
+	err = rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params);
+	WARN_ON(err);
+
+	call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
+
+	mutex_unlock(&mem_id_lock);
+}
+
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 {
 	/* Simplify driver cleanup code paths, allow unreg "unused" */
@@ -21,8 +117,14 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 
 	WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
 
+	__xdp_rxq_info_unreg_mem_model(xdp_rxq);
+
 	xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
 	xdp_rxq->dev = NULL;
+
+	/* Reset mem info to defaults */
+	xdp_rxq->mem.id = 0;
+	xdp_rxq->mem.type = 0;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
 
@@ -72,20 +174,131 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
 
+static int __mem_id_init_hash_table(void)
+{
+	struct rhashtable *rht;
+	int ret;
+
+	if (unlikely(mem_id_init))
+		return 0;
+
+	rht = kzalloc(sizeof(*rht), GFP_KERNEL);
+	if (!rht)
+		return -ENOMEM;
+
+	ret = rhashtable_init(rht, &mem_id_rht_params);
+	if (ret < 0) {
+		kfree(rht);
+		return ret;
+	}
+	mem_id_ht = rht;
+	smp_mb(); /* mutex lock should provide enough pairing */
+	mem_id_init = true;
+
+	return 0;
+}
+
+/* Allocate a cyclic ID that maps to allocator pointer.
+ * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
+ *
+ * Caller must lock mem_id_lock.
+ */
+static int __mem_id_cyclic_get(gfp_t gfp)
+{
+	int retries = 1;
+	int id;
+
+again:
+	id = ida_simple_get(&mem_id_pool, mem_id_next, MEM_ID_MAX, gfp);
+	if (id < 0) {
+		if (id == -ENOSPC) {
+			/* Cyclic allocator, reset next id */
+			if (retries--) {
+				mem_id_next = MEM_ID_MIN;
+				goto again;
+			}
+		}
+		return id; /* errno */
+	}
+	mem_id_next = id + 1;
+
+	return id;
+}
+
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum xdp_mem_type type, void *allocator)
 {
+	struct xdp_mem_allocator *xdp_alloc;
+	gfp_t gfp = GFP_KERNEL;
+	int id, errno, ret;
+	void *ptr;
+
+	if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
+		WARN(1, "Missing register, driver bug");
+		return -EFAULT;
+	}
+
 	if (type >= MEM_TYPE_MAX)
 		return -EINVAL;
 
 	xdp_rxq->mem.type = type;
 
-	if (allocator)
-		return -EOPNOTSUPP;
+	if (!allocator)
+		return 0;
+
+	/* Delay init of rhashtable to save memory if feature isn't used */
+	if (!mem_id_init) {
+		mutex_lock(&mem_id_lock);
+		ret = __mem_id_init_hash_table();
+		mutex_unlock(&mem_id_lock);
+		if (ret < 0) {
+			WARN_ON(1);
+			return ret;
+		}
+	}
+
+	xdp_alloc = kzalloc(sizeof(*xdp_alloc), gfp);
+	if (!xdp_alloc)
+		return -ENOMEM;
+
+	mutex_lock(&mem_id_lock);
+	id = __mem_id_cyclic_get(gfp);
+	if (id < 0) {
+		errno = id;
+		goto err;
+	}
+	xdp_rxq->mem.id = id;
+	xdp_alloc->mem  = xdp_rxq->mem;
+	xdp_alloc->allocator = allocator;
+
+	/* Insert allocator into ID lookup table */
+	ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node);
+	if (IS_ERR(ptr)) {
+		errno = PTR_ERR(ptr);
+		goto err;
+	}
+
+	mutex_unlock(&mem_id_lock);
 
-	/* TODO: Allocate an ID that maps to allocator pointer
-	 * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
-	 */
 	return 0;
+err:
+	mutex_unlock(&mem_id_lock);
+	kfree(xdp_alloc);
+	return errno;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
+
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+	if (mem->type == MEM_TYPE_PAGE_SHARED) {
+		page_frag_free(data);
+		return;
+	}
+
+	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+		struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+		put_page(page);
+	}
+}
+EXPORT_SYMBOL_GPL(xdp_return_frame);

^ permalink raw reply related
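
For readers following the ID handling in net/core/xdp.c above, the cyclic
allocation done by __mem_id_cyclic_get() can be sketched in plain userspace C.
This is only an illustration under stated assumptions, not the kernel code:
the bool array stands in for the IDA, and the names id_pool,
id_pool_get_from(), id_pool_cyclic_get(), id_pool_put() and the tiny ID_MAX
are all made up for the example. The search starts at the last handed-out ID
plus one and restarts from the minimum exactly once when the upper range is
exhausted, so a just-released ID is not reused immediately.

```c
#include <errno.h>
#include <stdbool.h>

#define ID_MIN 1
#define ID_MAX 8	/* exclusive upper bound, kept tiny for the sketch */

/* Hypothetical stand-in for the IDA plus the "next" cursor */
struct id_pool {
	bool used[ID_MAX];
	int next;	/* where the next search starts */
};

static void id_pool_init(struct id_pool *p)
{
	for (int i = 0; i < ID_MAX; i++)
		p->used[i] = false;
	p->next = ID_MIN;
}

/* Lowest free ID in [start, ID_MAX), or -ENOSPC if that range is full */
static int id_pool_get_from(struct id_pool *p, int start)
{
	for (int id = start; id < ID_MAX; id++) {
		if (!p->used[id]) {
			p->used[id] = true;
			return id;
		}
	}
	return -ENOSPC;
}

/* Cyclic get: on -ENOSPC, reset the cursor once and retry from ID_MIN */
static int id_pool_cyclic_get(struct id_pool *p)
{
	int retries = 1;
	int id;

again:
	id = id_pool_get_from(p, p->next);
	if (id < 0) {
		if (id == -ENOSPC && retries--) {
			p->next = ID_MIN;
			goto again;
		}
		return id; /* errno */
	}
	p->next = id + 1;
	return id;
}

/* Release an ID so a later search can hand it out again */
static void id_pool_put(struct id_pool *p, int id)
{
	if (id >= ID_MIN && id < ID_MAX)
		p->used[id] = false;
}
```

With a fresh pool, taking two IDs, releasing the first and asking again
yields 3 rather than 1; ID 1 is only handed out again after the search
wraps around.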

* [net-next V9 PATCH 10/16] mlx5: register a memory model when XDP is enabled
From: Jesper Dangaard Brouer @ 2018-04-03 11:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152275360298.1026.10333759008401281682.stgit@firesoul>

Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enables a different memory model, thus activating another code path
in the xdp_return_frame API.

V2: Fixed issues pointed out by Tariq.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0aab3afc6885..13c1e61258a7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -512,6 +512,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
+	/* This must only be activated for order-0 pages */
+	if (rq->xdp_prog) {
+		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+						 MEM_TYPE_PAGE_ORDER0, NULL);
+		if (err)
+			goto err_rq_wq_destroy;
+	}
+
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
 

^ permalink raw reply related
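
The "another code path" mentioned in the commit message is the per-type
branch in xdp_return_frame(). A minimal userspace sketch of that dispatch
follows, assuming stand-in names throughout: mem_type_sketch,
mem_info_sketch, free_stats and return_frame_sketch() are hypothetical, the
counters replace the real page_frag_free() and put_page() calls, and the
error return for an unknown type is an addition for the sketch (the kernel
function returns void).

```c
#include <errno.h>

/* Stand-ins for MEM_TYPE_PAGE_SHARED / MEM_TYPE_PAGE_ORDER0 */
enum mem_type_sketch {
	SKETCH_PAGE_SHARED = 0,
	SKETCH_PAGE_ORDER0,
	SKETCH_MEM_TYPE_MAX,
};

/* Mirrors the shape of struct xdp_mem_info: a type plus an ID */
struct mem_info_sketch {
	unsigned int type;
	unsigned int id;
};

/* Counts which release path was taken instead of freeing real pages */
struct free_stats {
	int frag_frees;	/* page_frag_free() path */
	int page_puts;	/* put_page() path */
};

/* Pick the release path from the memory type recorded for the frame */
static int return_frame_sketch(struct free_stats *st,
			       const struct mem_info_sketch *mem)
{
	switch (mem->type) {
	case SKETCH_PAGE_SHARED:
		st->frag_frees++;
		return 0;
	case SKETCH_PAGE_ORDER0:
		st->page_puts++;
		return 0;
	default:
		return -EINVAL;	/* unknown memory type */
	}
}
```

A queue registered as order-0, as mlx5 does in this patch when an XDP
program is attached, would take the second branch; a queue left at the
default type takes the first.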
