Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
From: Michael S. Tsirkin @ 2018-04-16 15:52 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: nhorman, netdev, virtualization, linux-sctp, Vladislav Yasevich
In-Reply-To: <f5655ab8-f24f-a5a5-d000-2bbb90ecc552@redhat.com>

On Mon, Apr 16, 2018 at 09:45:48AM -0400, Vlad Yasevich wrote:
> On 04/11/2018 06:49 PM, Michael S. Tsirkin wrote:
> > On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
> >> To support SCTP checksum offloading, we need to add a new feature
> >> to virtio_net, so we can negotiate support between the hypervisor
> >> and the guest.
> >>
> >> The signalling to the guest that an alternate checksum needs to
> >> be used is done via a new flag in the virtio_net_hdr.  If the
> >> flag is set, the host will know to perform an alternate checksum
> >> calculation, which right now is only CRC32c.
> >>
> >> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> >> ---
> >>  drivers/net/virtio_net.c        | 11 ++++++++---
> >>  include/linux/virtio_net.h      |  6 ++++++
> >>  include/uapi/linux/virtio_net.h |  2 ++

Could you pls include a virtio TC mailing list on uapi changes?

> >>  3 files changed, 16 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 7b187ec..b601294 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>  	/* Do we support "hardware" checksums? */
> >>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
> >>  		/* This opens up the world of extra features. */
> >> -		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> >> +		netdev_features_t sctp = 0;
> >> +
> >> +		if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
> >> +			sctp |= NETIF_F_SCTP_CRC;
> >> +
> >> +		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
> >>  		if (csum)
> >> -			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> >> +			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
> >>  
> >>  		if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
> >>  			dev->hw_features |= NETIF_F_TSO
> >> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
> >>  };
> >>  
> >>  #define VIRTNET_FEATURES \
> >> -	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
> >> +	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \
> >>  	VIRTIO_NET_F_MAC, \
> >>  	VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
> >>  	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \
> >> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >> index f144216..2e7a64a 100644
> >> --- a/include/linux/virtio_net.h
> >> +++ b/include/linux/virtio_net.h
> >> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
> >>  
> >>  		if (!skb_partial_csum_set(skb, start, off))
> >>  			return -EINVAL;
> >> +
> >> +		if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> >> +			skb->csum_not_inet = 1;
> >>  	}
> >>  
> >>  	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> >> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
> >>  		hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
> >>  	} /* else everything is zero */
> >>  
> >> +	if (skb->csum_not_inet)
> >> +		hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
> >> +
> >>  	return 0;
> >>  }
> >>  
> >> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> >> index 5de6ed3..3f279c8 100644
> >> --- a/include/uapi/linux/virtio_net.h
> >> +++ b/include/uapi/linux/virtio_net.h
> >> @@ -36,6 +36,7 @@
> >>  #define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
> >>  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
> >>  #define VIRTIO_NET_F_MTU	3	/* Initial MTU advice */
> >> +#define VIRTIO_NET_F_SCTP_CSUM  4	/* SCTP checksum offload support */
> >>  #define VIRTIO_NET_F_MAC	5	/* Host has given MAC address. */
> >>  #define VIRTIO_NET_F_GUEST_TSO4	7	/* Guest can handle TSOv4 in. */
> >>  #define VIRTIO_NET_F_GUEST_TSO6	8	/* Guest can handle TSOv6 in. */
> > 
> > Is this a guest or a host checksum? We should differenciate between the
> > two.
> 
> I suppose this is HOST checksum, since it behaves like VIRTIO_NET_F_CSUM only for
> SCTP.  I couldn't find the use for the GUEST side flag, since there technically
> isn't any validations and there is no additional mappings from VIRTIO flag to a
> NETIF flag.
> 
> If the feature is negotiated, the guest ends up generating partially check-summed
> packets, and the host turns on appropriate flags on it's side.   The host will
> also make sure the checksum if fixed up if the guest doesn't support it.
> So 1 flag is currently all that is needed.
> 
> -vlad

Fine, but let's include HOST in there to make things clear.
Also I think we should use a high flag, like 62.

> > 
> > 
> >> @@ -101,6 +102,7 @@ struct virtio_net_config {
> >>  struct virtio_net_hdr_v1 {
> >>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
> >>  #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
> >> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4       /* Checksum is not inet */
> >>  	__u8 flags;
> >>  #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
> >>  #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */



> >> -- 
> >> 2.9.5

^ permalink raw reply

* [PATCH] ipv6: remove unnecessary check in addrconf_prefix_rcv_add_addr()
From: Lorenzo Bianconi @ 2018-04-16 15:52 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <cover.1523893788.git.lorenzo.bianconi@redhat.com>

Remove unnecessary check on update_lft variable in
addrconf_prefix_rcv_add_addr routine since it is always set to 0.
Moreover remove update_lft re-initialization to 0

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
---
 net/ipv6/addrconf.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index dffa38004c13..b2c0175125db 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2529,7 +2529,6 @@ int addrconf_prefix_rcv_add_addr(struct net *net, struct net_device *dev,
 		if (IS_ERR_OR_NULL(ifp))
 			return -1;
 
-		update_lft = 0;
 		create = 1;
 		spin_lock_bh(&ifp->lock);
 		ifp->flags |= IFA_F_MANAGETEMPADDR;
@@ -2551,7 +2550,7 @@ int addrconf_prefix_rcv_add_addr(struct net *net, struct net_device *dev,
 			stored_lft = ifp->valid_lft - (now - ifp->tstamp) / HZ;
 		else
 			stored_lft = 0;
-		if (!update_lft && !create && stored_lft) {
+		if (!create && stored_lft) {
 			const u32 minimum_lft = min_t(u32,
 				stored_lft, MIN_VALID_LIFETIME);
 			valid_lft = max(valid_lft, minimum_lft);
-- 
2.14.3

^ permalink raw reply related

* Re: Cavium Octeon III network driver.
From: Florian Fainelli @ 2018-04-16 15:53 UTC (permalink / raw)
  To: Steven J. Hill, netdev
In-Reply-To: <af781f7a-c38c-9f1d-f3b7-b5ad5a667ccf@cavium.com>

On 04/16/2018 08:29 AM, Steven J. Hill wrote:
> On 04/14/2018 07:08 PM, Florian Fainelli wrote:
>>
>> net-next tree is currently closed, but once it opens back up, you would
>> likely want to resubmit those patches. Last I remember they were ready
>> to go.
>>
> The announcement appears on this list for when it is open, correct?

Correct, and the announcement was just made that net-next is open.
-- 
Florian

^ permalink raw reply

* Re: [PATCH] ipv6: remove unnecessary check in addrconf_prefix_rcv_add_addr()
From: Lorenzo Bianconi @ 2018-04-16 15:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev
In-Reply-To: <cd402c41ec4a3fb61702ae00115e54cfada4ba1d.1523893788.git.lorenzo.bianconi@redhat.com>

> Remove unnecessary check on update_lft variable in
> addrconf_prefix_rcv_add_addr routine since it is always set to 0.
> Moreover remove update_lft re-initialization to 0
>
> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> ---
>  net/ipv6/addrconf.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index dffa38004c13..b2c0175125db 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -2529,7 +2529,6 @@ int addrconf_prefix_rcv_add_addr(struct net *net, struct net_device *dev,
>                 if (IS_ERR_OR_NULL(ifp))
>                         return -1;
>
> -               update_lft = 0;
>                 create = 1;
>                 spin_lock_bh(&ifp->lock);
>                 ifp->flags |= IFA_F_MANAGETEMPADDR;
> @@ -2551,7 +2550,7 @@ int addrconf_prefix_rcv_add_addr(struct net *net, struct net_device *dev,
>                         stored_lft = ifp->valid_lft - (now - ifp->tstamp) / HZ;
>                 else
>                         stored_lft = 0;
> -               if (!update_lft && !create && stored_lft) {
> +               if (!create && stored_lft) {
>                         const u32 minimum_lft = min_t(u32,
>                                 stored_lft, MIN_VALID_LIFETIME);
>                         valid_lft = max(valid_lft, minimum_lft);
> --
> 2.14.3
>

I forgot 'net-next' tag in the subject. Dave should I send a v2?

Regards,
Lorenzo

^ permalink raw reply

* BUG: DSA blocks (some?) multicast traffic
From: Russell King - ARM Linux @ 2018-04-16 16:00 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot; +Cc: netdev

Hi,

Yesterday, I noticed that one of my platforms had become unreachable
over IPv6.  This platform is connected to a Clearfog, which uses
the Marvell 88e6176 switch.

The setup is:

               (rest of network)
                 | | | | | | |
    Source ---- Netgear switch ---- Clearfog switch ---- Target
                                    ^             ^      ^
                                   lan1          lan5   eth2
                                  port 4        port 0

Both the source and target could ping the IPv6 link-local address on
the clearfog, and IPv4 traffic was fine.

The source was attempting to ping the target, and the target had been
attempting to mount a NFS share over IPv6 (and failing.)

Statistics from the Clearfog switch - port 0 is the target, port 4 is
the netgear switch, port 5 is the CPU port:

     Statistic   Port  0  Port  1  Port  2  Port  3  Port  4  Port  5  Port  6
 in_multicasts:    27670        0        0    14552    78196     1674        0
out_multicasts:      439        0      331      332      331   114860      334

As you can see, very few multicast packets have been sent out any of
the ports.

Running tcpdump on the source machine:

21:31:48.805654 xx:xx:xx:15:37:dd > 33:33:ff:00:00:05,
	ethertype IPv6 (0x86dd), length 86:
	fd8f:xxxx:xxxx:xxxx:222:68ff:fe15:37dd > ff02::1:ff00:5:
	ICMP6, neighbor solicitation, who has
	fd8f:xxxx:xxxx:xxxx:200:ff:fe00:5, length 32

tcpdump on the clearfog for port 4:

21:31:48.807640 xx:xx:xx:15:37:dd > 33:33:ff:00:00:05,
	ethertype IPv6 (0x86dd), length 86:
	fd8f:xxxx:xxxx:xxxx:222:68ff:fe15:37dd > ff02::1:ff00:5:
	ICMP6, neighbor solicitation, who has
	fd8f:xxxx:xxxx:xxxx:200:ff:fe00:5, length 32

The port connected to the target doesn't see these multicast messages
which are intended for it.  However, I am able to see messages
multicasted from the target, which is trying to solicit a different
machine on my network:

21:33:22.394653 00:00:00:00:00:05 > 33:33:ff:10:1b:e6,
	ethertype IPv6 (0x86dd), length 86:
	fd8f:xxxx:xxxx:xxxx:200:ff:fe00:5 > ff02::1:ff10:1be6:
	ICMP6, neighbor solicitation, who has
	fd8f:xxxx:xxxx:xxxx:214:fdff:fe10:1be6, length 32

but these neighbour solicitations are also not forwarded to through
the switch to the rest of the network.

Tweaking the DSA switch port register 4 (which had a value of 0x0433)
to 0x043b resolved the issue - multicast traffic started to egress
these ports, and the neighbour solicitations then worked.

I thought maybe this was a 4.16 regression, but it seems my other
clearfog has a similar register setup, so I'm not sure.  I'm pretty
sure that it used to work at some point, as I'm a heavy user of IPv6
internally, and I'm quite certain that I would have noticed a failure
such as this.

The setup on the target is eth2 is part of a Linux bridge device.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* Re: [PATCH net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: Yafang Shao @ 2018-04-16 16:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Alexey Kuznetsov, yoshfuji, Song Liu, netdev, LKML
In-Reply-To: <d2ab1647-c832-c982-5952-ab7a415f7c76@gmail.com>

On Mon, Apr 16, 2018 at 11:43 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 04/16/2018 08:33 AM, Yafang Shao wrote:
>> tcp_rcv_space_adjust is called every time data is copied to user space,
>> introducing a tcp tracepoint for which could show us when the packet is
>> copied to user.
>> This could help us figure out whether there's latency in user process.
>>
>> When a tcp packet arrives, tcp_rcv_established() will be called and with
>> the existed tracepoint tcp_probe we could get the time when this packet
>> arrives.
>> Then this packet will be copied to user, and tcp_rcv_space_adjust will
>> be called and with this new introduced tracepoint we could get the time
>> when this packet is copied to user.
>>
>>       arrives time : user process time    => latency caused by user
>>       tcp_probe      tcp_rcv_space_adjust
>>
>> Hence in the prink message, sk is printed as a key to connect these two
>> tracepoints.
>>
>
> socket pointer is not a key.
>
> TCP sockets can be reused pretty fast after free.
>
> I suggest you go for cookie instead, this is an unique 64bit identifier.
> ( sock_gen_cookie() for details )
>
>

Thanks for your explanation.
Will have a try.

Thanks
Yafang

^ permalink raw reply

* Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2
From: Alexander Duyck @ 2018-04-16 16:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jesper Dangaard Brouer, xdp-newbies@vger.kernel.org,
	netdev@vger.kernel.org, Christoph Hellwig, David Woodhouse,
	William Tu, Björn Töpel, Karlsson, Magnus,
	Arnaldo Carvalho de Melo
In-Reply-To: <20180416122706.GA20624@infradead.org>

On Mon, Apr 16, 2018 at 5:27 AM, Christoph Hellwig <hch@infradead.org> wrote:
> Can you try the following hack which avoids indirect calls entirely
> for the fast path direct mapping case?
>
> ---
> From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Mon, 16 Apr 2018 14:18:14 +0200
> Subject: dma-mapping: bypass dma_ops for direct mappings
>
> Reportedly the retpoline mitigation for spectre causes huge penalties
> for indirect function calls.  This hack bypasses the dma_ops mechanism
> for simple direct mappings.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/device.h      |  1 +
>  include/linux/dma-mapping.h | 53 +++++++++++++++++++++++++++----------
>  lib/dma-direct.c            |  4 +--
>  3 files changed, 42 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 0059b99e1f25..725eec4c6653 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -990,6 +990,7 @@ struct device {
>         bool                    offline_disabled:1;
>         bool                    offline:1;
>         bool                    of_node_reused:1;
> +       bool                    is_dma_direct:1;
>  };
>
>  static inline struct device *kobj_to_dev(struct kobject *kobj)
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index f8ab1c0f589e..c5d384ae25d6 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -223,6 +223,13 @@ static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
>  }
>  #endif
>
> +/* do not use directly! */
> +dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
> +               unsigned long offset, size_t size, enum dma_data_direction dir,
> +               unsigned long attrs);
> +int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
> +               int nents, enum dma_data_direction dir, unsigned long attrs);
> +
>  static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
>                                               size_t size,
>                                               enum dma_data_direction dir,
> @@ -232,9 +239,13 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
>         dma_addr_t addr;
>
>         BUG_ON(!valid_dma_direction(dir));
> -       addr = ops->map_page(dev, virt_to_page(ptr),
> -                            offset_in_page(ptr), size,
> -                            dir, attrs);
> +       if (dev->is_dma_direct) {
> +               addr = dma_direct_map_page(dev, virt_to_page(ptr),
> +                               offset_in_page(ptr), size, dir, attrs);
> +       } else {
> +               addr = ops->map_page(dev, virt_to_page(ptr),
> +                               offset_in_page(ptr), size, dir, attrs);
> +       }
>         debug_dma_map_page(dev, virt_to_page(ptr),
>                            offset_in_page(ptr), size,
>                            dir, addr, true);

I'm not sure if I am really a fan of trying to solve this in this way.
It seems like this is going to be optimizing the paths for one case at
the detriment of others. Historically mapping and unmapping has always
been expensive, especially in the case of IOMMU enabled environments.
I would much rather see us focus on having swiotlb_dma_ops replaced
with dma_direct_ops in the cases where the device can access all of
physical memory.

> @@ -249,7 +260,7 @@ static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
>         const struct dma_map_ops *ops = get_dma_ops(dev);
>
>         BUG_ON(!valid_dma_direction(dir));
> -       if (ops->unmap_page)
> +       if (!dev->is_dma_direct && ops->unmap_page)

If I understand correctly this is only needed for the swiotlb case and
not the dma_direct case. It would make much more sense to just
overwrite the dev->dma_ops pointer with dma_direct_ops to address all
of the sync and unmap cases.

>                 ops->unmap_page(dev, addr, size, dir, attrs);
>         debug_dma_unmap_page(dev, addr, size, dir, true);
>  }
> @@ -266,7 +277,10 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>         int ents;
>
>         BUG_ON(!valid_dma_direction(dir));
> -       ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +       if (dev->is_dma_direct)
> +               ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
> +       else
> +               ents = ops->map_sg(dev, sg, nents, dir, attrs);
>         BUG_ON(ents < 0);
>         debug_dma_map_sg(dev, sg, nents, ents, dir);
>
> @@ -281,7 +295,7 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg
>
>         BUG_ON(!valid_dma_direction(dir));
>         debug_dma_unmap_sg(dev, sg, nents, dir);
> -       if (ops->unmap_sg)
> +       if (!dev->is_dma_direct && ops->unmap_sg)
>                 ops->unmap_sg(dev, sg, nents, dir, attrs);
>  }
>
> @@ -295,7 +309,10 @@ static inline dma_addr_t dma_map_page_attrs(struct device *dev,
>         dma_addr_t addr;
>
>         BUG_ON(!valid_dma_direction(dir));
> -       addr = ops->map_page(dev, page, offset, size, dir, attrs);
> +       if (dev->is_dma_direct)
> +               addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
> +       else
> +               addr = ops->map_page(dev, page, offset, size, dir, attrs);
>         debug_dma_map_page(dev, page, offset, size, dir, addr, false);
>
>         return addr;
> @@ -309,7 +326,7 @@ static inline void dma_unmap_page_attrs(struct device *dev,
>         const struct dma_map_ops *ops = get_dma_ops(dev);
>
>         BUG_ON(!valid_dma_direction(dir));
> -       if (ops->unmap_page)
> +       if (!dev->is_dma_direct && ops->unmap_page)
>                 ops->unmap_page(dev, addr, size, dir, attrs);
>         debug_dma_unmap_page(dev, addr, size, dir, false);
>  }
> @@ -356,7 +373,7 @@ static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
>         const struct dma_map_ops *ops = get_dma_ops(dev);
>
>         BUG_ON(!valid_dma_direction(dir));
> -       if (ops->sync_single_for_cpu)
> +       if (!dev->is_dma_direct && ops->sync_single_for_cpu)
>                 ops->sync_single_for_cpu(dev, addr, size, dir);
>         debug_dma_sync_single_for_cpu(dev, addr, size, dir);
>  }
> @@ -368,7 +385,7 @@ static inline void dma_sync_single_for_device(struct device *dev,
>         const struct dma_map_ops *ops = get_dma_ops(dev);
>
>         BUG_ON(!valid_dma_direction(dir));
> -       if (ops->sync_single_for_device)
> +       if (!dev->is_dma_direct && ops->sync_single_for_device)
>                 ops->sync_single_for_device(dev, addr, size, dir);
>         debug_dma_sync_single_for_device(dev, addr, size, dir);
>  }
> @@ -382,7 +399,7 @@ static inline void dma_sync_single_range_for_cpu(struct device *dev,
>         const struct dma_map_ops *ops = get_dma_ops(dev);
>
>         BUG_ON(!valid_dma_direction(dir));
> -       if (ops->sync_single_for_cpu)
> +       if (!dev->is_dma_direct && ops->sync_single_for_cpu)
>                 ops->sync_single_for_cpu(dev, addr + offset, size, dir);
>         debug_dma_sync_single_range_for_cpu(dev, addr, offset, size, dir);
>  }
> @@ -396,7 +413,7 @@ static inline void dma_sync_single_range_for_device(struct device *dev,
>         const struct dma_map_ops *ops = get_dma_ops(dev);
>
>         BUG_ON(!valid_dma_direction(dir));
> -       if (ops->sync_single_for_device)
> +       if (!dev->is_dma_direct && ops->sync_single_for_device)
>                 ops->sync_single_for_device(dev, addr + offset, size, dir);
>         debug_dma_sync_single_range_for_device(dev, addr, offset, size, dir);
>  }
> @@ -408,7 +425,7 @@ dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
>         const struct dma_map_ops *ops = get_dma_ops(dev);
>
>         BUG_ON(!valid_dma_direction(dir));
> -       if (ops->sync_sg_for_cpu)
> +       if (!dev->is_dma_direct && ops->sync_sg_for_cpu)
>                 ops->sync_sg_for_cpu(dev, sg, nelems, dir);
>         debug_dma_sync_sg_for_cpu(dev, sg, nelems, dir);
>  }
> @@ -420,7 +437,7 @@ dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
>         const struct dma_map_ops *ops = get_dma_ops(dev);
>
>         BUG_ON(!valid_dma_direction(dir));
> -       if (ops->sync_sg_for_device)
> +       if (!dev->is_dma_direct && ops->sync_sg_for_device)
>                 ops->sync_sg_for_device(dev, sg, nelems, dir);
>         debug_dma_sync_sg_for_device(dev, sg, nelems, dir);
>
> @@ -600,6 +617,8 @@ static inline int dma_supported(struct device *dev, u64 mask)
>         return ops->dma_supported(dev, mask);
>  }
>
> +extern const struct dma_map_ops swiotlb_dma_ops;
> +
>  #ifndef HAVE_ARCH_DMA_SET_MASK
>  static inline int dma_set_mask(struct device *dev, u64 mask)
>  {
> @@ -609,6 +628,12 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
>         dma_check_mask(dev, mask);
>
>         *dev->dma_mask = mask;
> +       if (dev->dma_ops == &dma_direct_ops ||
> +           (dev->dma_ops == &swiotlb_dma_ops &&
> +            mask == DMA_BIT_MASK(64)))
> +               dev->is_dma_direct = true;
> +       else
> +               dev->is_dma_direct = false;

So I am not sure this will work on x86. If I am not mistaken I believe
dev->dma_ops is normally not set and instead the default dma
operations are pulled via get_arch_dma_ops which returns the global
dma_ops pointer.

What you may want to consider as an alternative would be to look at
modifying drivers that are using the swiotlb so that you could just
overwrite the dev->dma_ops with the dma_direct_ops in the cases where
the hardware can support accessing all of physical hardware and where
we aren't forcing the use of the bounce buffers in the swiotlb.

Then for the above code you only have to worry about the map calls,
and you could just do a check against the dma_direct_ops pointer
instead of having to add a new flag.

>         return 0;
>  }
>  #endif
> diff --git a/lib/dma-direct.c b/lib/dma-direct.c
> index c0bba30fef0a..3deb8666974b 100644
> --- a/lib/dma-direct.c
> +++ b/lib/dma-direct.c
> @@ -120,7 +120,7 @@ void dma_direct_free(struct device *dev, size_t size, void *cpu_addr,
>                 free_pages((unsigned long)cpu_addr, page_order);
>  }
>
> -static dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
> +dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
>                 unsigned long offset, size_t size, enum dma_data_direction dir,
>                 unsigned long attrs)
>  {
> @@ -131,7 +131,7 @@ static dma_addr_t dma_direct_map_page(struct device *dev, struct page *page,
>         return dma_addr;
>  }
>
> -static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
> +int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
>                 int nents, enum dma_data_direction dir, unsigned long attrs)
>  {
>         int i;
> --
> 2.17.0
>
>

^ permalink raw reply

* [PATCH net-next 1/1] tc-testing: add sample action tests
From: Roman Mashak @ 2018-04-16 16:06 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri, Roman Mashak

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 .../tc-testing/tc-tests/actions/sample.json        | 588 +++++++++++++++++++++
 1 file changed, 588 insertions(+)
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/actions/sample.json

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/sample.json b/tools/testing/selftests/tc-testing/tc-tests/actions/sample.json
new file mode 100644
index 000000000000..3aca33c00039
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/sample.json
@@ -0,0 +1,588 @@
+[
+    {
+        "id": "9784",
+        "name": "Add valid sample action with mandatory arguments",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 10 group 1 index 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action sample index 2",
+        "matchPattern": "action order [0-9]+: sample rate 1/10 group 1.*index 2 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "5c91",
+        "name": "Add valid sample action with mandatory arguments and continue control action",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 700 group 2 continue index 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action sample index 2",
+        "matchPattern": "action order [0-9]+: sample rate 1/700 group 2 continue.*index 2 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "334b",
+        "name": "Add valid sample action with mandatory arguments and drop control action",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 10000 group 11 drop index 22",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample rate 1/10000 group 11 drop.*index 22 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "da69",
+        "name": "Add valid sample action with mandatory arguments and reclassify control action",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 20000 group 72 reclassify index 100",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample rate 1/20000 group 72 reclassify.*index 100 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "13ce",
+        "name": "Add valid sample action with mandatory arguments and pipe control action",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 20 group 2 pipe index 100",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample rate 1/20 group 2 pipe.*index 100 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "1886",
+        "name": "Add valid sample action with mandatory arguments and jump control action",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 700 group 25 jump 4 index 200",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action sample index 200",
+        "matchPattern": "action order [0-9]+: sample rate 1/700 group 25 jump 4.*index 200 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "b6d4",
+        "name": "Add sample action with mandatory arguments and invalid control action",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 200000 group 52 foo index 1",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample rate 1/200000 group 52 foo.*index 1 ref",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "a874",
+        "name": "Add invalid sample action without mandatory arguments",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample index 1",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample.*index 1 ref",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "ac01",
+        "name": "Add invalid sample action without mandatory argument rate",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample group 10 index 1",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample.*group 10.*index 1 ref",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "4203",
+        "name": "Add invalid sample action without mandatory argument group",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 100 index 10",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action sample index 10",
+        "matchPattern": "action order [0-9]+: sample rate 1/100.*index 10 ref",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "14a7",
+        "name": "Add invalid sample action without mandatory argument group",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 100 index 10",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action sample index 10",
+        "matchPattern": "action order [0-9]+: sample rate 1/100.*index 10 ref",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "8f2e",
+        "name": "Add valid sample action with trunc argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 1024 group 4 trunc 1024 index 10",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action sample index 10",
+        "matchPattern": "action order [0-9]+: sample rate 1/1024 group 4 trunc_size 1024 pipe.*index 10 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "45f8",
+        "name": "Add sample action with maximum rate argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 4294967295 group 4 index 10",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action sample index 10",
+        "matchPattern": "action order [0-9]+: sample rate 1/4294967295 group 4 pipe.*index 10 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "ad0c",
+        "name": "Add sample action with maximum trunc argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 16000 group 4 trunc 4294967295 index 10",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action sample index 10",
+        "matchPattern": "action order [0-9]+: sample rate 1/16000 group 4 trunc_size 4294967295 pipe.*index 10 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "83a9",
+        "name": "Add sample action with maximum group argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 4294 group 4294967295 index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action sample index 1",
+        "matchPattern": "action order [0-9]+: sample rate 1/4294 group 4294967295 pipe.*index 1 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "ed27",
+        "name": "Add sample action with invalid rate argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 4294967296 group 4 index 10",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action sample index 10",
+        "matchPattern": "action order [0-9]+: sample rate 1/4294967296 group 4 pipe.*index 10 ref",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "2eae",
+        "name": "Add sample action with invalid group argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 4098 group 5294967299 continue index 1",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action sample index 1",
+        "matchPattern": "action order [0-9]+: sample rate 1/4098 group 5294967299 continue.*index 1 ref",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "6ff3",
+        "name": "Add sample action with invalid trunc size",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 1024 group 4 trunc 112233445566 index 11",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action sample index 11",
+        "matchPattern": "action order [0-9]+: sample rate 1/1024 group 4 trunc_size 112233445566.*index 11 ref",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "2b2a",
+        "name": "Add sample action with invalid index",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 1024 group 4 index 5294967299",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action sample index 5294967299",
+        "matchPattern": "action order [0-9]+: sample rate 1/1024 group 4 pipe.*index 5294967299 ref",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "dee2",
+        "name": "Add sample action with maximum allowed index",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 1024 group 4 index 4294967295",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action sample index 4294967295",
+        "matchPattern": "action order [0-9]+: sample rate 1/1024 group 4 pipe.*index 4294967295 ref",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "560e",
+        "name": "Add sample action with cookie",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action sample rate 1024 group 4 index 45 cookie aabbccdd",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action sample index 45",
+        "matchPattern": "action order [0-9]+: sample rate 1/1024 group 4 pipe.*index 45.*cookie aabbccdd",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "704a",
+        "name": "Replace existing sample action with new rate argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ],
+            "$TC actions add action sample rate 1024 group 4 index 4"
+        ],
+        "cmdUnderTest": "$TC actions replace action sample rate 2048 group 4 index 4",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample rate 1/2048 group 4 pipe.*index 4",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "60eb",
+        "name": "Replace existing sample action with new group argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ],
+            "$TC actions add action sample rate 1024 group 4 index 4"
+        ],
+        "cmdUnderTest": "$TC actions replace action sample rate 1024 group 7 index 4",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample rate 1/1024 group 7 pipe.*index 4",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "2cce",
+        "name": "Replace existing sample action with new trunc argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ],
+            "$TC actions add action sample rate 1024 group 4 trunc 48 index 4"
+        ],
+        "cmdUnderTest": "$TC actions replace action sample rate 1024 group 7 trunc 64 index 4",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample rate 1/1024 group 7 trunc_size 64 pipe.*index 4",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    },
+    {
+        "id": "59d1",
+        "name": "Replace existing sample action with new control argument",
+        "category": [
+            "actions",
+            "sample"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action sample",
+                0,
+                1,
+                255
+            ],
+            "$TC actions add action sample rate 1024 group 4 reclassify index 4"
+        ],
+        "cmdUnderTest": "$TC actions replace action sample rate 1024 group 7 pipe index 4",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action sample",
+        "matchPattern": "action order [0-9]+: sample rate 1/1024 group 7 pipe.*index 4",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action sample"
+        ]
+    }
+]
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
From: David Miller @ 2018-04-16 16:51 UTC (permalink / raw)
  To: yanjun.zhu; +Cc: tariqt, netdev, linux-rdma, haakon.bugge
In-Reply-To: <1523840527-22746-1-git-send-email-yanjun.zhu@oracle.com>

From: Zhu Yanjun <yanjun.zhu@oracle.com>
Date: Sun, 15 Apr 2018 21:02:07 -0400

> While a faulty cable is used or HCA firmware error, HCA device will
> be offline. When the driver is accessing this offline device, the
> following call trace will pop out.
 ...
> In the above call trace, the function mlx4_cmd_poll calls the function
> mlx4_cmd_post to access the HCA while HCA is offline. Then mlx4_cmd_post
> returns an error -EIO. Per -EIO, the function mlx4_cmd_poll calls
> mlx4_cmd_reset_flow to reset HCA. And the above call trace pops out.
> 
> This is not reasonable. Since HCA device is offline when it is being
> accessed, it should not be reset again.
> 
> In this patch, since HCA is offline, the function mlx4_cmd_post returns
> an error -EINVAL. Per -EINVAL, the function mlx4_cmd_poll directly returns
> instead of resetting HCA.
> 
> CC: Srinivas Eeda <srinivas.eeda@oracle.com>
> CC: Junxiao Bi <junxiao.bi@oracle.com>
> Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Tariq, I'm assuming you'll take this in and send it to me later.

Thanks.

^ permalink raw reply

* Re: [PATCH net] net: Fix one possible memleak in ip_setup_cork
From: David Miller @ 2018-04-16 16:58 UTC (permalink / raw)
  To: gfree.wind; +Cc: kuznet, netdev
In-Reply-To: <1523845005-6353-2-git-send-email-gfree.wind@vip.163.com>

From: gfree.wind@vip.163.com
Date: Mon, 16 Apr 2018 10:16:45 +0800

> From: Gao Feng <gfree.wind@vip.163.com>
> 
> It would allocate memory in this function when the cork->opt is NULL. But
> the memory isn't freed if failed in the latter rt check, and return error
> directly. It causes the memleak if its caller is ip_make_skb which also
> doesn't free the cork->opt when meet a error.
> 
> Now move the rt check ahead to avoid the memleak.
> 
> Signed-off-by: Gao Feng <gfree.wind@vip.163.com>

Looks good, applied and queued up for -stable.

I guess in the other code paths, ip_flush_pending_frames() or similar
would clean up the in-sock cork information.

^ permalink raw reply

* Re: [PATCH] ipv6: remove unnecessary check in addrconf_prefix_rcv_add_addr()
From: David Miller @ 2018-04-16 17:00 UTC (permalink / raw)
  To: lorenzo.bianconi; +Cc: netdev
In-Reply-To: <CAJ0CqmVBOk9rctqxoVEvO2yP_++yMDu7xpMYf-PRcqcJaQ=Zag@mail.gmail.com>

From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Date: Mon, 16 Apr 2018 17:56:33 +0200

> I forgot 'net-next' tag in the subject. Dave should I send a v2?

Not necessary.

^ permalink raw reply

* Re: Stack smash error in latest version of iproute2's 'ip'
From: Stephen Hemminger @ 2018-04-16 17:00 UTC (permalink / raw)
  To: Paul Claessen; +Cc: netdev
In-Reply-To: <1590297271.174659.1523897880294@email.1and1.com>

On Mon, 16 Apr 2018 12:58:00 -0400 (EDT)
Paul Claessen <paul@claessen.com> wrote:

> Greetings,
> 
> I'm getting a "*** stack smashing detected ***: ip terminated" error when using 'ip macsec show'.
> 
> This happens on Ubuntu 16.04.4 (Xemian) on an AMD64 PC.
> ip version is 180402
> 
> Screen output:
> 
> paul@paulux:~$ ip -V
> ip utility, iproute2-ss180402
> 
> paul@paulux:~$ ip a
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
> valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
> valid_lft forever preferred_lft forever
> 2: enp8s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
> link/ether 00:21:9b:49:5c:aa brd ff:ff:ff:ff:ff:ff
> 3: enp1s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
> link/ether a0:36:9f:20:37:fc brd ff:ff:ff:ff:ff:ff
> 4: enp1s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
> link/ether a0:36:9f:20:37:fe brd ff:ff:ff:ff:ff:ff
> 5: wlxf4ec38845800: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
> link/ether f4:ec:38:84:58:00 brd ff:ff:ff:ff:ff:ff
> inet 10.205.213.28/8 brd 10.255.255.255 scope global dynamic wlxf4ec38845800
> valid_lft 59124sec preferred_lft 59124sec
> inet6 fe80::ea99:1f1b:9ff:1d57/64 scope link
> valid_lft forever preferred_lft forever
> 6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
> link/ether 02:42:76:15:77:09 brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
> valid_lft forever preferred_lft forever
> 7: macsec0@enp8s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1468 qdisc noqueue state UNKNOWN group default qlen 1000
> link/ether 00:21:9b:49:5c:aa brd ff:ff:ff:ff:ff:ff
> inet 192.168.100.1/24 scope global macsec0
> valid_lft forever preferred_lft forever
> inet6 fe80::221:9bff:fe49:5caa/64 scope link
> valid_lft forever preferred_lft forever
> 
> 
> 
> paul@paulux:~$ ip macsec show
> 7: macsec0: protect on validate strict sc off sa off encrypt on send_sci on end_station off scb off replay off
> cipher suite: GCM-AES-128, using ICV length 16
> TXSC: 000000005caa0001 on SA 0
> 0: PN 45547, state on, key 00000000000000000000000000000000
> RXSC: 00000000778c0001, state on
> 0: PN 3976, state on, key 01000000000000000000000000000000
> *** stack smashing detected ***: ip terminated
> Aborted (core dumped)
> 
> 
> Kind regards,
> 
> ~ Paul Claessen
> 
> 

Probably something in the new code to do JSON output.

^ permalink raw reply

* Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
From: Michael S. Tsirkin @ 2018-04-16 17:07 UTC (permalink / raw)
  To: Vladislav Yasevich
  Cc: netdev, linux-sctp, virtualization, jasowang, nhorman,
	Vladislav Yasevich
In-Reply-To: <20180402134006.10111-2-vyasevic@redhat.com>

On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
> To support SCTP checksum offloading, we need to add a new feature
> to virtio_net, so we can negotiate support between the hypervisor
> and the guest.
> 
> The signalling to the guest that an alternate checksum needs to
> be used is done via a new flag in the virtio_net_hdr.  If the
> flag is set, the host will know to perform an alternate checksum
> calculation, which right now is only CRC32c.
> 
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> ---
>  drivers/net/virtio_net.c        | 11 ++++++++---
>  include/linux/virtio_net.h      |  6 ++++++
>  include/uapi/linux/virtio_net.h |  2 ++
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b187ec..b601294 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	/* Do we support "hardware" checksums? */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
>  		/* This opens up the world of extra features. */
> -		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> +		netdev_features_t sctp = 0;
> +
> +		if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
> +			sctp |= NETIF_F_SCTP_CRC;
> +
> +		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>  		if (csum)
> -			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> +			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
>  
>  		if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
>  			dev->hw_features |= NETIF_F_TSO
> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
>  };
>  
>  #define VIRTNET_FEATURES \
> -	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
> +	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \
>  	VIRTIO_NET_F_MAC, \
>  	VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
>  	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index f144216..2e7a64a 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
>  
>  		if (!skb_partial_csum_set(skb, start, off))
>  			return -EINVAL;
> +
> +		if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> +			skb->csum_not_inet = 1;
>  	}
>  
>  	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {

Looks like we need a feature flag to tell host guest is able to handle this flag
in incoming packets ...


> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
>  		hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
>  	} /* else everything is zero */

Is comment above still correct?

>  
> +	if (skb->csum_not_inet)
> +		hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
> +

... and a separate flag to tell guest it's ok to set this flag in
outgoing packets.


Also - this chunk looks like a hack fixing up a value we
have set incorrectly.

Specifically this clears all flags except VIRTIO_NET_HDR_F_CSUM_NOT_INET.
Why do this at all? Do we really want to clear e.g. NEEDS_CSUM? I think we do not
since csum_start, csum_offset are set.

Also - what will ever set this flag in the header?


>  	return 0;
>  }
>  
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed3..3f279c8 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -36,6 +36,7 @@
>  #define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
>  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
>  #define VIRTIO_NET_F_MTU	3	/* Initial MTU advice */
> +#define VIRTIO_NET_F_SCTP_CSUM  4	/* SCTP checksum offload support */
>  #define VIRTIO_NET_F_MAC	5	/* Host has given MAC address. */
>  #define VIRTIO_NET_F_GUEST_TSO4	7	/* Guest can handle TSOv4 in. */
>  #define VIRTIO_NET_F_GUEST_TSO6	8	/* Guest can handle TSOv6 in. */
> @@ -101,6 +102,7 @@ struct virtio_net_config {
>  struct virtio_net_hdr_v1 {
>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
>  #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4       /* Checksum is not inet */

Not very informative. Instead, please specify what is it actually.


>  	__u8 flags;
>  #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
>  #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
> -- 
> 2.9.5

^ permalink raw reply

* Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
From: Michael S. Tsirkin @ 2018-04-16 17:09 UTC (permalink / raw)
  To: Vlad Yasevich
  Cc: Vladislav Yasevich, netdev, linux-sctp, virtualization, jasowang,
	nhorman
In-Reply-To: <f5655ab8-f24f-a5a5-d000-2bbb90ecc552@redhat.com>

On Mon, Apr 16, 2018 at 09:45:48AM -0400, Vlad Yasevich wrote:
> On 04/11/2018 06:49 PM, Michael S. Tsirkin wrote:
> > On Mon, Apr 02, 2018 at 09:40:02AM -0400, Vladislav Yasevich wrote:
> >> To support SCTP checksum offloading, we need to add a new feature
> >> to virtio_net, so we can negotiate support between the hypervisor
> >> and the guest.
> >>
> >> The signalling to the guest that an alternate checksum needs to
> >> be used is done via a new flag in the virtio_net_hdr.  If the
> >> flag is set, the host will know to perform an alternate checksum
> >> calculation, which right now is only CRC32c.
> >>
> >> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> >> ---
> >>  drivers/net/virtio_net.c        | 11 ++++++++---
> >>  include/linux/virtio_net.h      |  6 ++++++
> >>  include/uapi/linux/virtio_net.h |  2 ++
> >>  3 files changed, 16 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 7b187ec..b601294 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>  	/* Do we support "hardware" checksums? */
> >>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
> >>  		/* This opens up the world of extra features. */
> >> -		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> >> +		netdev_features_t sctp = 0;
> >> +
> >> +		if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
> >> +			sctp |= NETIF_F_SCTP_CRC;
> >> +
> >> +		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
> >>  		if (csum)
> >> -			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
> >> +			dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
> >>  
> >>  		if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
> >>  			dev->hw_features |= NETIF_F_TSO
> >> @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
> >>  };
> >>  
> >>  #define VIRTNET_FEATURES \
> >> -	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
> >> +	VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \
> >>  	VIRTIO_NET_F_MAC, \
> >>  	VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
> >>  	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \
> >> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >> index f144216..2e7a64a 100644
> >> --- a/include/linux/virtio_net.h
> >> +++ b/include/linux/virtio_net.h
> >> @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
> >>  
> >>  		if (!skb_partial_csum_set(skb, start, off))
> >>  			return -EINVAL;
> >> +
> >> +		if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
> >> +			skb->csum_not_inet = 1;
> >>  	}
> >>  
> >>  	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> >> @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
> >>  		hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
> >>  	} /* else everything is zero */
> >>  
> >> +	if (skb->csum_not_inet)
> >> +		hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
> >> +
> >>  	return 0;
> >>  }
> >>  
> >> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> >> index 5de6ed3..3f279c8 100644
> >> --- a/include/uapi/linux/virtio_net.h
> >> +++ b/include/uapi/linux/virtio_net.h
> >> @@ -36,6 +36,7 @@
> >>  #define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
> >>  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
> >>  #define VIRTIO_NET_F_MTU	3	/* Initial MTU advice */
> >> +#define VIRTIO_NET_F_SCTP_CSUM  4	/* SCTP checksum offload support */
> >>  #define VIRTIO_NET_F_MAC	5	/* Host has given MAC address. */
> >>  #define VIRTIO_NET_F_GUEST_TSO4	7	/* Guest can handle TSOv4 in. */
> >>  #define VIRTIO_NET_F_GUEST_TSO6	8	/* Guest can handle TSOv6 in. */
> > 
> > Is this a guest or a host checksum? We should differenciate between the
> > two.
> 
> I suppose this is HOST checksum, since it behaves like VIRTIO_NET_F_CSUM only for
> SCTP.  I couldn't find the use for the GUEST side flag, since there technically
> isn't any validations and there is no additional mappings from VIRTIO flag to a
> NETIF flag.
> 
> If the feature is negotiated, the guest ends up generating partially check-summed
> packets, and the host turns on appropriate flags on it's side.   The host will
> also make sure the checksum if fixed up if the guest doesn't support it.
> So 1 flag is currently all that is needed.
> 
> -vlad

I see code handling VIRTIO_NET_HDR_F_CSUM_NOT_INET on RX side.  Host
needs to know whether it's ok/worth it to set this flag, too.

> > 
> > 
> >> @@ -101,6 +102,7 @@ struct virtio_net_config {
> >>  struct virtio_net_hdr_v1 {
> >>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
> >>  #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
> >> +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4       /* Checksum is not inet */
> >>  	__u8 flags;
> >>  #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
> >>  #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
> >> -- 
> >> 2.9.5

^ permalink raw reply

* Re: [PATCH net-next 4/5] tun: Add support for SCTP checksum offload
From: Michael S. Tsirkin @ 2018-04-16 17:12 UTC (permalink / raw)
  To: Vladislav Yasevich
  Cc: netdev, linux-sctp, virtualization, jasowang, nhorman,
	Vladislav Yasevich
In-Reply-To: <20180402134006.10111-5-vyasevic@redhat.com>

On Mon, Apr 02, 2018 at 09:40:05AM -0400, Vladislav Yasevich wrote:
> Adds a new tun offload flag to allow for SCTP checksum offload.
> The flag has to be set by the user and defaults to "no offload".
> 
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>

When would user set this flag? Wouldn't that be when
userspace is ready to get SCTP packets without a checksum?
Seems to be this is an indication that when userspace
is qemu running a guest, said guest needs to communicate
the new ability to qemu.

> ---
>  drivers/net/tun.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index a1ba262..263bcbe 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2719,6 +2719,11 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
>  		arg &= ~TUN_F_UFO;
>  	}
>  
> +	if (arg & TUN_F_SCTP_CSUM) {
> +		features |= NETIF_F_SCTP_CRC;
> +		arg &= ~TUN_F_SCTP_CSUM;
> +	}
> +
>  	/* This gives the user a way to test for new features in future by
>  	 * trying to set them. */
>  	if (arg)
> -- 
> 2.9.5

^ permalink raw reply

* Re: [PATCH net-next 5/5] macvlan/macvtap: Add support for SCTP checksum offload.
From: Michael S. Tsirkin @ 2018-04-16 17:14 UTC (permalink / raw)
  To: Vladislav Yasevich
  Cc: netdev, linux-sctp, virtualization, jasowang, nhorman,
	Vladislav Yasevich
In-Reply-To: <20180402134006.10111-6-vyasevic@redhat.com>

On Mon, Apr 02, 2018 at 09:40:06AM -0400, Vladislav Yasevich wrote:
> Since we now have support for software CRC32c offload, turn it on
> for macvlan and macvtap devices so that guests can take advantage
> of offload SCTP checksums to the host or host hardware.
> 
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> ---
>  drivers/net/macvlan.c       | 5 +++--
>  drivers/net/tap.c           | 8 +++++---
>  include/uapi/linux/if_tun.h | 1 +
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index 725f4b4..646b730 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -834,7 +834,7 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
>  
>  #define ALWAYS_ON_OFFLOADS \
>  	(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE | \
> -	 NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL)
> +	 NETIF_F_GSO_ROBUST | NETIF_F_GSO_ENCAP_ALL | NETIF_F_SCTP_CRC)
>  
>  #define ALWAYS_ON_FEATURES (ALWAYS_ON_OFFLOADS | NETIF_F_LLTX)
>  
> @@ -842,7 +842,8 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
>  	(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST | \
>  	 NETIF_F_GSO | NETIF_F_TSO | NETIF_F_LRO | \
>  	 NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM | \
> -	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
> +	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER | \
> +	 NETIF_F_SCTP_CRC)
>  
>  #define MACVLAN_STATE_MASK \
>  	((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 9b6cb78..2c8512b 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -369,8 +369,7 @@ rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
>  		 *	  check, we either support them all or none.
>  		 */
>  		if (skb->ip_summed == CHECKSUM_PARTIAL &&
> -		    !(features & NETIF_F_CSUM_MASK) &&
> -		    skb_checksum_help(skb))
> +		    skb_csum_hwoffload_help(skb, features))
>  			goto drop;
>  		if (ptr_ring_produce(&q->ring, skb))
>  			goto drop;
> @@ -945,6 +944,9 @@ static int set_offload(struct tap_queue *q, unsigned long arg)
>  		}
>  	}
>  
> +	if (arg & TUN_F_SCTP_CSUM)
> +		feature_mask |= NETIF_F_SCTP_CRC;
> +
>  	/* tun/tap driver inverts the usage for TSO offloads, where
>  	 * setting the TSO bit means that the userspace wants to
>  	 * accept TSO frames and turning it off means that user space

Does not the above comment apply to the new flag too?  It seems that
it's value should be inverted for macvtap, and isn't, here.


> @@ -1077,7 +1079,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
>  	case TUNSETOFFLOAD:
>  		/* let the user check for future flags */
>  		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
> -			    TUN_F_TSO_ECN | TUN_F_UFO))
> +			    TUN_F_TSO_ECN | TUN_F_UFO | TUN_F_SCTP_CSUM))
>  			return -EINVAL;
>  
>  		rtnl_lock();
> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> index ee432cd..c3bb282 100644
> --- a/include/uapi/linux/if_tun.h
> +++ b/include/uapi/linux/if_tun.h
> @@ -86,6 +86,7 @@
>  #define TUN_F_TSO6	0x04	/* I can handle TSO for IPv6 packets */
>  #define TUN_F_TSO_ECN	0x08	/* I can handle TSO with ECN bits. */
>  #define TUN_F_UFO	0x10	/* I can handle UFO packets */
> +#define TUN_F_SCTP_CSUM 0x20	/* I can handle SCTP checksum offload */
>  
>  /* Protocol info prepended to the packets (when IFF_NO_PI is not set) */
>  #define TUN_PKT_STRIP	0x0001

Doesn't this belong in the previous patch?

> 2.9.5

^ permalink raw reply

* [PATCH net-next 0/5] tcp: add zero copy receive
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet

This patch series add mmap() support to TCP sockets for RX zero copy.

While tcp_mmap() patch itself is quite small (~100 LOC), optimal support
for asynchronous mmap() required better SO_RCVLOWAT behavior, and a
test program to demonstrate how mmap() on TCP sockets can be used.

Note that mmap() (and associated munmap()) calls are adding more
pressure on per-process VM semaphore, so might not show benefit
for processus with high number of threads.

Eric Dumazet (5):
  tcp: fix SO_RCVLOWAT and RCVBUF autotuning
  tcp: fix delayed acks behavior for SO_RCVLOWAT
  tcp: avoid extra wakeups for SO_RCVLOWAT users
  tcp: implement mmap() for zero copy receive
  selftests: net: add tcp_mmap program

 include/linux/net.h                    |   1 +
 include/net/tcp.h                      |   4 +
 net/core/sock.c                        |   5 +-
 net/ipv4/af_inet.c                     |   3 +-
 net/ipv4/tcp.c                         | 138 ++++++++
 net/ipv4/tcp_input.c                   |  22 +-
 net/ipv6/af_inet6.c                    |   3 +-
 tools/testing/selftests/net/Makefile   |   2 +
 tools/testing/selftests/net/tcp_mmap.c | 437 +++++++++++++++++++++++++
 9 files changed, 608 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/net/tcp_mmap.c

-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply

* [PATCH net-next 1/5] tcp: fix SO_RCVLOWAT and RCVBUF autotuning
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

Applications might use SO_RCVLOWAT on TCP socket hoping to receive
one [E]POLLIN event only when a given amount of bytes are ready in socket
receive queue.

Problem is that receive autotuning is not aware of this constraint,
meaning sk_rcvbuf might be too small to allow all bytes to be stored.

Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
can override the default setsockopt(SO_RCVLOWAT) behavior.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/net.h |  1 +
 include/net/tcp.h   |  1 +
 net/core/sock.c     |  5 ++++-
 net/ipv4/af_inet.c  |  1 +
 net/ipv4/tcp.c      | 21 +++++++++++++++++++++
 net/ipv6/af_inet6.c |  1 +
 6 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 2248a052061d8aeb0ae08d233f181f09cba6384b..6554d3ba4396b3df49acac934ad16eeb71a695f4 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -197,6 +197,7 @@ struct proto_ops {
 					   int offset, size_t size, int flags);
 	int		(*sendmsg_locked)(struct sock *sk, struct msghdr *msg,
 					  size_t size);
+	int		(*set_rcvlowat)(struct sock *sk, int val);
 };
 
 #define DECLARE_SOCKADDR(type, dst, src)	\
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9c9b3768b350abfd51776563d220d5e97ca9da69..b2318242cad89176d3c2c027affd4db3c2549ff4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -402,6 +402,7 @@ void tcp_set_keepalive(struct sock *sk, int val);
 void tcp_syn_ack_timeout(const struct request_sock *req);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		int flags, int *addr_len);
+int tcp_set_rcvlowat(struct sock *sk, int val);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx,
 		       int estab, struct tcp_fastopen_cookie *foc);
diff --git a/net/core/sock.c b/net/core/sock.c
index 6444525f610cf8039516744ad26aec58485b9b8a..b2c3db169ca1892c4d624fc5e30af12f4eed0adb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -905,7 +905,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	case SO_RCVLOWAT:
 		if (val < 0)
 			val = INT_MAX;
-		sk->sk_rcvlowat = val ? : 1;
+		if (sock->ops->set_rcvlowat)
+			ret = sock->ops->set_rcvlowat(sk, val);
+		else
+			sk->sk_rcvlowat = val ? : 1;
 		break;
 
 	case SO_RCVTIMEO:
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eaed0367e669aec7635b3cc41de4ece63bb018ec..f5c562aaef3522519bcf1ae37782a7e14e278723 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1006,6 +1006,7 @@ const struct proto_ops inet_stream_ops = {
 	.compat_getsockopt = compat_sock_common_getsockopt,
 	.compat_ioctl	   = inet_compat_ioctl,
 #endif
+	.set_rcvlowat	   = tcp_set_rcvlowat,
 };
 EXPORT_SYMBOL(inet_stream_ops);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bccc4c2700870b8c7ff592a6bd27acebd9bc6471..0abd8d1d3d1d4f0bd6e2762c8a2b862ecf31e4ae 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1701,6 +1701,27 @@ int tcp_peek_len(struct socket *sock)
 }
 EXPORT_SYMBOL(tcp_peek_len);
 
+/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
+int tcp_set_rcvlowat(struct sock *sk, int val)
+{
+	sk->sk_rcvlowat = val ? : 1;
+	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+		return 0;
+
+	/* val comes from user space and might be close to INT_MAX */
+	val <<= 1;
+	if (val < 0)
+		val = INT_MAX;
+
+	val = min(val, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
+	if (val > sk->sk_rcvbuf) {
+		sk->sk_rcvbuf = val;
+		tcp_sk(sk)->window_clamp = tcp_win_from_space(sk, val);
+	}
+	return 0;
+}
+EXPORT_SYMBOL(tcp_set_rcvlowat);
+
 static void tcp_update_recv_tstamps(struct sk_buff *skb,
 				    struct scm_timestamping *tss)
 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 8da0b513f1882b39be4fa72a8233d702ae9ec53b..e70d59fb26e16ace1eb484d23964946092a2cd57 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -590,6 +590,7 @@ const struct proto_ops inet6_stream_ops = {
 	.compat_setsockopt = compat_sock_common_setsockopt,
 	.compat_getsockopt = compat_sock_common_getsockopt,
 #endif
+	.set_rcvlowat	   = tcp_set_rcvlowat,
 };
 
 const struct proto_ops inet6_dgram_ops = {
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH net-next 2/5] tcp: fix delayed acks behavior for SO_RCVLOWAT
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

We should not delay acks if there are not enough bytes
in receive queue to satisfy SO_RCVLOWAT.

Since [E]POLLIN event is not going to be generated, there is little
hope for a delayed ack to be useful.

In fact, delaying ACK prevents sender from completing
the transfer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 367def6ddeda950db841c0b9ccec98787e19e728..d854363a43875e98adbeea72c3434afb06f0f2b4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5026,9 +5026,12 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	    /* More than one full frame received... */
 	if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
 	     /* ... and right edge of window advances far enough.
-	      * (tcp_recvmsg() will send ACK otherwise). Or...
+	      * (tcp_recvmsg() will send ACK otherwise).
+	      * If application uses SO_RCVLOWAT, we want send ack now if
+	      * we have not received enough bytes to satisfy the condition.
 	      */
-	     __tcp_select_window(sk) >= tp->rcv_wnd) ||
+	    (tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
+	     __tcp_select_window(sk) >= tp->rcv_wnd)) ||
 	    /* We ACK each frame or... */
 	    tcp_in_quickack_mode(sk) ||
 	    /* We have out of order data. */
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH net-next 3/5] tcp: avoid extra wakeups for SO_RCVLOWAT users
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
generated when enough bytes are available in receive queue, after
David change (commit c7004482e8dc "tcp: Respect SO_RCVLOWAT in tcp_poll().")

But TCP still calls sk->sk_data_ready() for each chunk added in receive
queue, meaning thread is awaken, and goes back to sleep shortly after.

Tested:

tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB

-> Should get ~2 wakeups (c-switches) per MB, regardless of how many
(tiny or big) packets were received.

High speed (mostly full size GRO packets)

received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
  cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches

received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
  cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches

Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)

received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
  cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h    |  1 +
 net/ipv4/tcp.c       |  4 ++++
 net/ipv4/tcp_input.c | 15 +++++++++++++--
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b2318242cad89176d3c2c027affd4db3c2549ff4..0ee85c47c185afcb8e1017d59e02313cb5df78ec 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -403,6 +403,7 @@ void tcp_syn_ack_timeout(const struct request_sock *req);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		int flags, int *addr_len);
 int tcp_set_rcvlowat(struct sock *sk, int val);
+void tcp_data_ready(struct sock *sk);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx,
 		       int estab, struct tcp_fastopen_cookie *foc);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0abd8d1d3d1d4f0bd6e2762c8a2b862ecf31e4ae..c768d306b65714bb8740c60110c43042508af6b7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1705,6 +1705,10 @@ EXPORT_SYMBOL(tcp_peek_len);
 int tcp_set_rcvlowat(struct sock *sk, int val)
 {
 	sk->sk_rcvlowat = val ? : 1;
+
+	/* Check if we need to signal EPOLLIN right now */
+	tcp_data_ready(sk);
+
 	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
 		return 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d854363a43875e98adbeea72c3434afb06f0f2b4..f93687f97d805732f1093d55a402638c4290700a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4576,6 +4576,17 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 
 }
 
+void tcp_data_ready(struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	int avail = tp->rcv_nxt - tp->copied_seq;
+
+	if (avail < sk->sk_rcvlowat && !sock_flag(sk, SOCK_DONE))
+		return;
+
+	sk->sk_data_ready(sk);
+}
+
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -4633,7 +4644,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		if (eaten > 0)
 			kfree_skb_partial(skb, fragstolen);
 		if (!sock_flag(sk, SOCK_DEAD))
-			sk->sk_data_ready(sk);
+			tcp_data_ready(sk);
 		return;
 	}
 
@@ -5434,7 +5445,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 no_ack:
 			if (eaten)
 				kfree_skb_partial(skb, fragstolen);
-			sk->sk_data_ready(sk);
+			tcp_data_ready(sk);
 			return;
 		}
 	}
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH net-next 4/5] tcp: implement mmap() for zero copy receive
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

Some networks can make sure TCP payload can exactly fit 4KB pages,
with well chosen MSS/MTU and architectures.

Implement mmap() system call so that applications can avoid
copying data without complex splice() games.

Note that a successful mmap( X bytes) on TCP socket is consuming
bytes, as if recvmsg() has been done. (tp->copied += X)

Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read only.

If tcp_mmap() finds data that is not a full page, or a patch of
urgent data, -EINVAL is returned, no bytes are consumed.

Application must fallback to recvmsg() to read the problematic sequence.

mmap() wont block,  regardless of socket being in blocking or
non-blocking mode. If not enough bytes are in receive queue,
mmap() would return -EAGAIN, or -EIO if socket is in a state
where no other bytes can be added into receive queue.

An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
to efficiently use mmap()

On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.

Tested:

mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168  (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)

Without mmap() (tcp_mmap -s)

received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
  cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
  cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
  cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
  cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches

With mmap() on receiver (tcp_mmap -s -z)

received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
  cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
  cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
  cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
  cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h   |   2 +
 net/ipv4/af_inet.c  |   2 +-
 net/ipv4/tcp.c      | 113 ++++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/af_inet6.c |   2 +-
 4 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0ee85c47c185afcb8e1017d59e02313cb5df78ec..833154e3df173ea41aa16dd1ec739a175c679c5c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -404,6 +404,8 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		int flags, int *addr_len);
 int tcp_set_rcvlowat(struct sock *sk, int val);
 void tcp_data_ready(struct sock *sk);
+int tcp_mmap(struct file *file, struct socket *sock,
+	     struct vm_area_struct *vma);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx,
 		       int estab, struct tcp_fastopen_cookie *foc);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f5c562aaef3522519bcf1ae37782a7e14e278723..3ebf599cebaea4926decc1aad7274b12ec7e1566 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -994,7 +994,7 @@ const struct proto_ops inet_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = inet_recvmsg,
-	.mmap		   = sock_no_mmap,
+	.mmap		   = tcp_mmap,
 	.sendpage	   = inet_sendpage,
 	.splice_read	   = tcp_splice_read,
 	.read_sock	   = tcp_read_sock,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c768d306b65714bb8740c60110c43042508af6b7..438fbca96cd3100d722e1bd8bcc6f49624495a21 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1726,6 +1726,119 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_set_rcvlowat);
 
+/* When user wants to mmap X pages, we first need to perform the mapping
+ * before freeing any skbs in receive queue, otherwise user would be unable
+ * to fallback to standard recvmsg(). This happens if some data in the
+ * requested block is not exactly fitting in a page.
+ *
+ * We only support order-0 pages for the moment.
+ * mmap() on TCP is very strict, there is no point
+ * trying to accommodate with pathological layouts.
+ */
+int tcp_mmap(struct file *file, struct socket *sock,
+	     struct vm_area_struct *vma)
+{
+	unsigned long size = vma->vm_end - vma->vm_start;
+	unsigned int nr_pages = size >> PAGE_SHIFT;
+	struct page **pages_array = NULL;
+	u32 seq, len, offset, nr = 0;
+	struct sock *sk = sock->sk;
+	const skb_frag_t *frags;
+	struct tcp_sock *tp;
+	struct sk_buff *skb;
+	int ret;
+
+	if (vma->vm_pgoff || !nr_pages)
+		return -EINVAL;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+	/* TODO: Maybe the following is not needed if pages are COW */
+	vma->vm_flags &= ~VM_MAYWRITE;
+
+	lock_sock(sk);
+
+	ret = -ENOTCONN;
+	if (sk->sk_state == TCP_LISTEN)
+		goto out;
+
+	sock_rps_record_flow(sk);
+
+	if (tcp_inq(sk) < size) {
+		ret = sock_flag(sk, SOCK_DONE) ? -EIO : -EAGAIN;
+		goto out;
+	}
+	tp = tcp_sk(sk);
+	seq = tp->copied_seq;
+	/* Abort if urgent data is in the area */
+	if (unlikely(tp->urg_data)) {
+		u32 urg_offset = tp->urg_seq - seq;
+
+		ret = -EINVAL;
+		if (urg_offset < size)
+			goto out;
+	}
+	ret = -ENOMEM;
+	pages_array = kvmalloc_array(nr_pages, sizeof(struct page *),
+				     GFP_KERNEL);
+	if (!pages_array)
+		goto out;
+	skb = tcp_recv_skb(sk, seq, &offset);
+	ret = -EINVAL;
+skb_start:
+	/* We do not support anything not in page frags */
+	offset -= skb_headlen(skb);
+	if ((int)offset < 0)
+		goto out;
+	if (skb_has_frag_list(skb))
+		goto out;
+	len = skb->data_len - offset;
+	frags = skb_shinfo(skb)->frags;
+	while (offset) {
+		if (frags->size > offset)
+			goto out;
+		offset -= frags->size;
+		frags++;
+	}
+	while (nr < nr_pages) {
+		if (len) {
+			if (len < PAGE_SIZE)
+				goto out;
+			if (frags->size != PAGE_SIZE || frags->page_offset)
+				goto out;
+			pages_array[nr++] = skb_frag_page(frags);
+			frags++;
+			len -= PAGE_SIZE;
+			seq += PAGE_SIZE;
+			continue;
+		}
+		skb = skb->next;
+		offset = seq - TCP_SKB_CB(skb)->seq;
+		goto skb_start;
+	}
+	/* OK, we have a full set of pages ready to be inserted into vma */
+	for (nr = 0; nr < nr_pages; nr++) {
+		ret = vm_insert_page(vma, vma->vm_start + (nr << PAGE_SHIFT),
+				     pages_array[nr]);
+		if (ret)
+			goto out;
+	}
+	/* operation is complete, we can 'consume' all skbs */
+	tp->copied_seq = seq;
+	tcp_rcv_space_adjust(sk);
+
+	/* Clean up data we have read: This will do ACK frames. */
+	tcp_recv_skb(sk, seq, &offset);
+	tcp_cleanup_rbuf(sk, size);
+
+	ret = 0;
+out:
+	release_sock(sk);
+	kvfree(pages_array);
+	return ret;
+}
+EXPORT_SYMBOL(tcp_mmap);
+
 static void tcp_update_recv_tstamps(struct sk_buff *skb,
 				    struct scm_timestamping *tss)
 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index e70d59fb26e16ace1eb484d23964946092a2cd57..2c694912df2e77b414de5cc2aa43e2ec59286836 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -579,7 +579,7 @@ const struct proto_ops inet6_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet_sendmsg,		/* ok		*/
 	.recvmsg	   = inet_recvmsg,		/* ok		*/
-	.mmap		   = sock_no_mmap,
+	.mmap		   = tcp_mmap,
 	.sendpage	   = inet_sendpage,
 	.sendmsg_locked    = tcp_sendmsg_locked,
 	.sendpage_locked   = tcp_sendpage_locked,
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH net-next 5/5] selftests: net: add tcp_mmap program
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

This is a reference program showing how mmap() can be used
on TCP flows to implement receive zero copy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 tools/testing/selftests/net/Makefile   |   2 +
 tools/testing/selftests/net/tcp_mmap.c | 437 +++++++++++++++++++++++++
 2 files changed, 439 insertions(+)
 create mode 100644 tools/testing/selftests/net/tcp_mmap.c

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 785fc18a16b4701f3ef875b60648726750b0cd26..23e725f4305eae6bc23fe705c8d3fe262110395a 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -8,9 +8,11 @@ TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetl
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
+TEST_GEN_FILES += tcp_mmap
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 
 include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
+$(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_mmap.c b/tools/testing/selftests/net/tcp_mmap.c
new file mode 100644
index 0000000000000000000000000000000000000000..dea342fe6f4e88b5709d2ac37b2fc9a2a320bf44
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_mmap.c
@@ -0,0 +1,437 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Eric Dumazet (edumazet@google.com)
+ *
+ * Reference program demonstrating tcp mmap() usage,
+ * and SO_RCVLOWAT hints for receiver.
+ *
+ * Note : NIC with header split is needed to use mmap() on TCP :
+ * Each incoming frame must be a multiple of PAGE_SIZE bytes of TCP payload.
+ *
+ * How to use on loopback interface :
+ *
+ *  ifconfig lo mtu 61512  # 15*4096 + 40 (ipv6 header) + 32 (TCP with TS option header)
+ *  tcp_mmap -s -z &
+ *  tcp_mmap -H ::1 -z
+ *
+ *  Or leave default lo mtu, but use -M option to set TCP_MAXSEG option to (4096 + 12)
+ *      (4096 : page size on x86, 12: TCP TS option length)
+ *  tcp_mmap -s -z -M $((4096+12)) &
+ *  tcp_mmap -H ::1 -z -M $((4096+12))
+ *
+ * Note: -z option on sender uses MSG_ZEROCOPY, which forces a copy when packets go through loopback interface.
+ *       We might use sendfile() instead, but really this test program is about mmap(), for receivers ;)
+ *
+ * $ ./tcp_mmap -s &                                 # Without mmap()
+ * $ for i in {1..4}; do ./tcp_mmap -H ::1 -z ; done
+ * received 32768 MB (0 % mmap'ed) in 14.1157 s, 19.4732 Gbit
+ *   cpu usage user:0.057 sys:7.815, 240.234 usec per MB, 65531 c-switches
+ * received 32768 MB (0 % mmap'ed) in 14.6833 s, 18.7204 Gbit
+ *  cpu usage user:0.043 sys:8.103, 248.596 usec per MB, 65524 c-switches
+ * received 32768 MB (0 % mmap'ed) in 11.143 s, 24.6682 Gbit
+ *   cpu usage user:0.044 sys:6.576, 202.026 usec per MB, 65519 c-switches
+ * received 32768 MB (0 % mmap'ed) in 14.9056 s, 18.4413 Gbit
+ *   cpu usage user:0.036 sys:8.193, 251.129 usec per MB, 65530 c-switches
+ * $ kill %1   # kill tcp_mmap server
+ *
+ * $ ./tcp_mmap -s -z &                              # With mmap()
+ * $ for i in {1..4}; do ./tcp_mmap -H ::1 -z ; done
+ * received 32768 MB (99.9939 % mmap'ed) in 6.73792 s, 40.7956 Gbit
+ *   cpu usage user:0.045 sys:2.827, 87.6465 usec per MB, 65532 c-switches
+ * received 32768 MB (99.9939 % mmap'ed) in 7.26732 s, 37.8238 Gbit
+ *   cpu usage user:0.037 sys:3.087, 95.3369 usec per MB, 65532 c-switches
+ * received 32768 MB (99.9939 % mmap'ed) in 7.61661 s, 36.0893 Gbit
+ *   cpu usage user:0.046 sys:3.559, 110.016 usec per MB, 65529 c-switches
+ * received 32768 MB (99.9939 % mmap'ed) in 7.43764 s, 36.9577 Gbit
+ *   cpu usage user:0.035 sys:3.467, 106.873 usec per MB, 65530 c-switches
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+#define _GNU_SOURCE
+#include <pthread.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <error.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <sys/resource.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <errno.h>
+#include <time.h>
+#include <sys/time.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <arpa/inet.h>
+#include <poll.h>
+
+#ifndef MSG_ZEROCOPY
+#define MSG_ZEROCOPY    0x4000000
+#endif
+
+#define FILE_SZ (1UL << 35)
+static int cfg_family = AF_INET6;
+static socklen_t cfg_alen = sizeof(struct sockaddr_in6);
+static int cfg_port = 8787;
+
+static int rcvbuf; /* Default: autotuning.  Can be set with -r <integer> option */
+static int sndbuf; /* Default: autotuning.  Can be set with -w <integer> option */
+static int zflg; /* zero copy option. (MSG_ZEROCOPY for sender, mmap() for receiver */
+static int xflg; /* hash received data (simple xor) (-h option) */
+static int keepflag; /* -k option: receiver shall keep all received file in memory (no munmap() calls) */
+
+static int chunk_size  = 512*1024;
+
+unsigned long htotal;
+
+static inline void prefetch(const void *x)
+{
+#if defined(__x86_64__)
+	asm volatile("prefetcht0 %P0" : : "m" (*(const char *)x));
+#endif
+}
+
+void hash_zone(void *zone, unsigned int length)
+{
+	unsigned long temp = htotal;
+
+	while (length >= 8*sizeof(long)) {
+		prefetch(zone + 384);
+		temp ^= *(unsigned long *)zone;
+		temp ^= *(unsigned long *)(zone + sizeof(long));
+		temp ^= *(unsigned long *)(zone + 2*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 3*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 4*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 5*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 6*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 7*sizeof(long));
+		zone += 8*sizeof(long);
+		length -= 8*sizeof(long);
+	}
+	while (length >= 1) {
+		temp ^= *(unsigned char *)zone;
+		zone += 1;
+		length--;
+	}
+	htotal = temp;
+}
+
+void *child_thread(void *arg)
+{
+	unsigned long total_mmap = 0, total = 0;
+	unsigned long delta_usec;
+	int flags = MAP_SHARED;
+	struct timeval t0, t1;
+	char *buffer = NULL;
+	void *oaddr = NULL;
+	double throughput;
+	struct rusage ru;
+	int lu, fd;
+
+	fd = (int)(unsigned long)arg;
+
+	gettimeofday(&t0, NULL);
+
+	fcntl(fd, F_SETFL, O_NDELAY);
+	buffer = malloc(chunk_size);
+	if (!buffer) {
+		perror("malloc");
+		goto error;
+	}
+	while (1) {
+		struct pollfd pfd = { .fd = fd, .events = POLLIN, };
+		int sub;
+
+		poll(&pfd, 1, 10000);
+		if (zflg) {
+			void *naddr;
+
+			naddr = mmap(oaddr, chunk_size, PROT_READ, flags, fd, 0);
+			if (naddr == (void *)-1) {
+				if (errno == EAGAIN) {
+					/* That is if SO_RCVLOWAT is buggy */
+					usleep(1000);
+					continue;
+				}
+				if (errno == EINVAL) {
+					flags = MAP_SHARED;
+					oaddr = NULL;
+					goto fallback;
+				}
+				if (errno != EIO)
+					perror("mmap()");
+				break;
+			}
+			total_mmap += chunk_size;
+			if (xflg)
+				hash_zone(naddr, chunk_size);
+			total += chunk_size;
+			if (!keepflag) {
+				flags |= MAP_FIXED;
+				oaddr = naddr;
+			}
+			continue;
+		}
+fallback:
+		sub = 0;
+		while (sub < chunk_size) {
+			lu = read(fd, buffer + sub, chunk_size - sub);
+			if (lu == 0)
+				goto end;
+			if (lu < 0)
+				break;
+			if (xflg)
+				hash_zone(buffer + sub, lu);
+			total += lu;
+			sub += lu;
+		}
+	}
+end:
+	gettimeofday(&t1, NULL);
+	delta_usec = (t1.tv_sec - t0.tv_sec) * 1000000 + t1.tv_usec - t0.tv_usec;
+
+	throughput = 0;
+	if (delta_usec)
+		throughput = total * 8.0 / (double)delta_usec / 1000.0;
+	getrusage(RUSAGE_THREAD, &ru);
+	if (total > 1024*1024) {
+		unsigned long total_usec;
+		unsigned long mb = total >> 20;
+		total_usec = 1000000*ru.ru_utime.tv_sec + ru.ru_utime.tv_usec +
+			     1000000*ru.ru_stime.tv_sec + ru.ru_stime.tv_usec;
+		printf("received %lg MB (%lg %% mmap'ed) in %lg s, %lg Gbit\n"
+		       "  cpu usage user:%lg sys:%lg, %lg usec per MB, %lu c-switches\n",
+				total / (1024.0 * 1024.0),
+				100.0*total_mmap/total,
+				(double)delta_usec / 1000000.0,
+				throughput,
+				(double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1000000.0,
+				(double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1000000.0,
+				(double)total_usec/mb,
+				ru.ru_nvcsw);
+	}
+error:
+	free(buffer);
+	close(fd);
+	pthread_exit(0);
+}
+
+static void apply_rcvsnd_buf(int fd)
+{
+	if (rcvbuf && setsockopt(fd, SOL_SOCKET,
+				 SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) == -1) {
+		perror("setsockopt SO_RCVBUF");
+	}
+
+	if (sndbuf && setsockopt(fd, SOL_SOCKET,
+				 SO_SNDBUF, &sndbuf, sizeof(sndbuf)) == -1) {
+		perror("setsockopt SO_SNDBUF");
+	}
+}
+
+
+static void setup_sockaddr(int domain, const char *str_addr,
+			   struct sockaddr_storage *sockaddr)
+{
+	struct sockaddr_in6 *addr6 = (void *) sockaddr;
+	struct sockaddr_in *addr4 = (void *) sockaddr;
+
+	switch (domain) {
+	case PF_INET:
+		memset(addr4, 0, sizeof(*addr4));
+		addr4->sin_family = AF_INET;
+		addr4->sin_port = htons(cfg_port);
+		if (str_addr &&
+		    inet_pton(AF_INET, str_addr, &(addr4->sin_addr)) != 1)
+			error(1, 0, "ipv4 parse error: %s", str_addr);
+		break;
+	case PF_INET6:
+		memset(addr6, 0, sizeof(*addr6));
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_port = htons(cfg_port);
+		if (str_addr &&
+		    inet_pton(AF_INET6, str_addr, &(addr6->sin6_addr)) != 1)
+			error(1, 0, "ipv6 parse error: %s", str_addr);
+		break;
+	default:
+		error(1, 0, "illegal domain");
+	}
+}
+
+static void do_accept(int fdlisten)
+{
+	if (setsockopt(fdlisten, SOL_SOCKET, SO_RCVLOWAT,
+		       &chunk_size, sizeof(chunk_size)) == -1) {
+		perror("setsockopt SO_RCVLOWAT");
+	}
+
+	apply_rcvsnd_buf(fdlisten);
+
+	while (1) {
+		struct sockaddr_in addr;
+		socklen_t addrlen = sizeof(addr);
+		pthread_t th;
+		int fd, res;
+
+		fd = accept(fdlisten, (struct sockaddr *)&addr, &addrlen);
+		if (fd == -1) {
+			perror("accept");
+			continue;
+		}
+		res = pthread_create(&th, NULL, child_thread,
+				     (void *)(unsigned long)fd);
+		if (res) {
+			errno = res;
+			perror("pthread_create");
+			close(fd);
+		}
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	struct sockaddr_storage listenaddr, addr;
+	unsigned int max_pacing_rate = 0;
+	unsigned long total = 0;
+	char *host = NULL;
+	int fd, c, on = 1;
+	char *buffer;
+	int sflg = 0;
+	int mss = 0;
+
+	while ((c = getopt(argc, argv, "46p:svr:w:H:zxkP:M:")) != -1) {
+		switch (c) {
+		case '4':
+			cfg_family = PF_INET;
+			cfg_alen = sizeof(struct sockaddr_in);
+			break;
+		case '6':
+			cfg_family = PF_INET6;
+			cfg_alen = sizeof(struct sockaddr_in6);
+			break;
+		case 'p':
+			cfg_port = atoi(optarg);
+			break;
+		case 'H':
+			host = optarg;
+			break;
+		case 's': /* server : listen for incoming connections */
+			sflg++;
+			break;
+		case 'r':
+			rcvbuf = atoi(optarg);
+			break;
+		case 'w':
+			sndbuf = atoi(optarg);
+			break;
+		case 'z':
+			zflg = 1;
+			break;
+		case 'M':
+			mss = atoi(optarg);
+			break;
+		case 'x':
+			xflg = 1;
+			break;
+		case 'k':
+			keepflag = 1;
+			break;
+		case 'P':
+			max_pacing_rate = atoi(optarg) ;
+			break;
+		default:
+			exit(1);
+		}
+	}
+	if (sflg) {
+		int fdlisten = socket(cfg_family, SOCK_STREAM, 0);
+
+		if (fdlisten == -1) {
+			perror("socket");
+			exit(1);
+		}
+		apply_rcvsnd_buf(fdlisten);
+		setsockopt(fdlisten, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
+
+		setup_sockaddr(cfg_family, host, &listenaddr);
+
+		if (mss &&
+		    setsockopt(fdlisten, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) {
+			perror("setsockopt TCP_MAXSEG");
+			exit(1);
+		}
+		if (bind(fdlisten, (const struct sockaddr *)&listenaddr, cfg_alen) == -1) {
+			perror("bind");
+			exit(1);
+		}
+		if (listen(fdlisten, 128) == -1) {
+			perror("listen");
+			exit(1);
+		}
+		do_accept(fdlisten);
+	}
+	buffer = mmap(NULL, chunk_size, PROT_READ | PROT_WRITE,
+			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (buffer == (char *)-1) {
+		perror("mmap");
+		exit(1);
+	}
+
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (fd == -1) {
+		perror("socket");
+		exit(1);
+	}
+	apply_rcvsnd_buf(fd);
+
+	setup_sockaddr(cfg_family, host, &addr);
+
+	if (mss &&
+	    setsockopt(fd, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) {
+		perror("setsockopt TCP_MAXSEG");
+		exit(1);
+	}
+	if (connect(fd, (const struct sockaddr *)&addr, cfg_alen) == -1) {
+		perror("connect");
+		exit(1);
+	}
+	if (max_pacing_rate &&
+	    setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
+		       &max_pacing_rate, sizeof(max_pacing_rate)) == -1)
+		perror("setsockopt SO_MAX_PACING_RATE");
+
+	if (zflg && setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY,
+			       &on, sizeof(on)) == -1) {
+		perror("setsockopt SO_ZEROCOPY, (-z option disabled)");
+		zflg = 0;
+	}
+	while (total < FILE_SZ) {
+		long wr = FILE_SZ - total;
+
+		if (wr > chunk_size)
+			wr = chunk_size;
+		/* Note : we just want to fill the pipe with 0 bytes */
+		wr = send(fd, buffer, wr, zflg ? MSG_ZEROCOPY : 0);
+		if (wr <= 0)
+			break;
+		total += wr;
+	}
+	close(fd);
+	munmap(buffer, chunk_size);
+	return 0;
+}
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH net] Count IPv6 interface receive statistics on the ingress netdev
From: Stephen Suryaputra @ 2018-04-16 17:42 UTC (permalink / raw)
  To: netdev, ja; +Cc: Stephen Suryaputra

The statistics such as InHdrErrors should be counted on the ingress
netdev rather than on the dev from the dst, which is the egress.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
---
 include/net/addrconf.h          | 14 +++++++++++
 net/ipv6/exthdrs.c              | 55 ++++++++++++++++-------------------------
 net/ipv6/ip6_input.c            |  2 +-
 net/ipv6/ip6_output.c           | 18 ++++++--------
 net/ipv6/reassembly.c           |  6 ++---
 net/ipv6/route.c                |  3 ++-
 net/netfilter/ipvs/ip_vs_xmit.c |  5 ++--
 7 files changed, 51 insertions(+), 52 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 378d601..8312cc2 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -308,6 +308,20 @@ static inline struct inet6_dev *__in6_dev_get(const struct net_device *dev)
 }
 
 /**
+ * __in6_dev_get_safely - get inet6_dev pointer from netdevice
+ * @dev: network device
+ *
+ * This is a safer version of __in6_dev_get
+ */
+static inline struct inet6_dev *__in6_dev_get_safely(const struct net_device *dev)
+{
+	if (likely(dev))
+		return rcu_dereference_rtnl(dev->ip6_ptr);
+	else
+		return NULL;
+}
+
+/**
  * in6_dev_get - get inet6_dev pointer from netdevice
  * @dev: network device
  *
diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index bc68eb6..5bc2bf3 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -280,6 +280,7 @@ static const struct tlvtype_proc tlvprocdestopt_lst[] = {
 
 static int ipv6_destopt_rcv(struct sk_buff *skb)
 {
+	struct inet6_dev *idev = __in6_dev_get(skb->dev);
 	struct inet6_skb_parm *opt = IP6CB(skb);
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
 	__u16 dstbuf;
@@ -291,7 +292,7 @@ static int ipv6_destopt_rcv(struct sk_buff *skb)
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) + 8) ||
 	    !pskb_may_pull(skb, (skb_transport_offset(skb) +
 				 ((skb_transport_header(skb)[1] + 1) << 3)))) {
-		__IP6_INC_STATS(dev_net(dst->dev), ip6_dst_idev(dst),
+		__IP6_INC_STATS(dev_net(dst->dev), idev,
 				IPSTATS_MIB_INHDRERRORS);
 fail_and_free:
 		kfree_skb(skb);
@@ -319,8 +320,7 @@ static int ipv6_destopt_rcv(struct sk_buff *skb)
 		return 1;
 	}
 
-	__IP6_INC_STATS(dev_net(dst->dev),
-			ip6_dst_idev(dst), IPSTATS_MIB_INHDRERRORS);
+	__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 	return -1;
 }
 
@@ -416,8 +416,7 @@ static int ipv6_srh_rcv(struct sk_buff *skb)
 	}
 
 	if (hdr->segments_left >= (hdr->hdrlen >> 1)) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 				  ((&hdr->segments_left) -
 				   skb_network_header(skb)));
@@ -456,8 +455,7 @@ static int ipv6_srh_rcv(struct sk_buff *skb)
 
 	if (skb_dst(skb)->dev->flags & IFF_LOOPBACK) {
 		if (ipv6_hdr(skb)->hop_limit <= 1) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INHDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 			icmpv6_send(skb, ICMPV6_TIME_EXCEED,
 				    ICMPV6_EXC_HOPLIMIT, 0);
 			kfree_skb(skb);
@@ -481,10 +479,10 @@ static int ipv6_srh_rcv(struct sk_buff *skb)
 /* called with rcu_read_lock() */
 static int ipv6_rthdr_rcv(struct sk_buff *skb)
 {
+	struct inet6_dev *idev = __in6_dev_get(skb->dev);
 	struct inet6_skb_parm *opt = IP6CB(skb);
 	struct in6_addr *addr = NULL;
 	struct in6_addr daddr;
-	struct inet6_dev *idev;
 	int n, i;
 	struct ipv6_rt_hdr *hdr;
 	struct rt0_hdr *rthdr;
@@ -498,8 +496,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) + 8) ||
 	    !pskb_may_pull(skb, (skb_transport_offset(skb) +
 				 ((skb_transport_header(skb)[1] + 1) << 3)))) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		kfree_skb(skb);
 		return -1;
 	}
@@ -508,8 +505,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 
 	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr) ||
 	    skb->pkt_type != PACKET_HOST) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INADDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 		kfree_skb(skb);
 		return -1;
 	}
@@ -527,7 +523,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 			 * processed by own
 			 */
 			if (!addr) {
-				__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+				__IP6_INC_STATS(net, idev,
 						IPSTATS_MIB_INADDRERRORS);
 				kfree_skb(skb);
 				return -1;
@@ -553,8 +549,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 			goto unknown_rh;
 		/* Silently discard invalid RTH type 2 */
 		if (hdr->hdrlen != 2 || hdr->segments_left != 1) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INHDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 			kfree_skb(skb);
 			return -1;
 		}
@@ -572,8 +567,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 	n = hdr->hdrlen >> 1;
 
 	if (hdr->segments_left > n) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 				  ((&hdr->segments_left) -
 				   skb_network_header(skb)));
@@ -609,14 +603,12 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 		if (xfrm6_input_addr(skb, (xfrm_address_t *)addr,
 				     (xfrm_address_t *)&ipv6_hdr(skb)->saddr,
 				     IPPROTO_ROUTING) < 0) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INADDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 			kfree_skb(skb);
 			return -1;
 		}
 		if (!ipv6_chk_home_addr(dev_net(skb_dst(skb)->dev), addr)) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INADDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 			kfree_skb(skb);
 			return -1;
 		}
@@ -627,8 +619,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 	}
 
 	if (ipv6_addr_is_multicast(addr)) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INADDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 		kfree_skb(skb);
 		return -1;
 	}
@@ -647,8 +638,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 
 	if (skb_dst(skb)->dev->flags&IFF_LOOPBACK) {
 		if (ipv6_hdr(skb)->hop_limit <= 1) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INHDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 			icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT,
 				    0);
 			kfree_skb(skb);
@@ -663,7 +653,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 	return -1;
 
 unknown_rh:
-	__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_INHDRERRORS);
+	__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 	icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 			  (&hdr->type) - skb_network_header(skb));
 	return -1;
@@ -755,34 +745,31 @@ static bool ipv6_hop_ra(struct sk_buff *skb, int optoff)
 static bool ipv6_hop_jumbo(struct sk_buff *skb, int optoff)
 {
 	const unsigned char *nh = skb_network_header(skb);
+	struct inet6_dev *idev = __in6_dev_get_safely(skb->dev);
 	struct net *net = ipv6_skb_net(skb);
 	u32 pkt_len;
 
 	if (nh[optoff + 1] != 4 || (optoff & 3) != 2) {
 		net_dbg_ratelimited("ipv6_hop_jumbo: wrong jumbo opt length/alignment %d\n",
 				    nh[optoff+1]);
-		__IP6_INC_STATS(net, ipv6_skb_idev(skb),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		goto drop;
 	}
 
 	pkt_len = ntohl(*(__be32 *)(nh + optoff + 2));
 	if (pkt_len <= IPV6_MAXPLEN) {
-		__IP6_INC_STATS(net, ipv6_skb_idev(skb),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, optoff+2);
 		return false;
 	}
 	if (ipv6_hdr(skb)->payload_len) {
-		__IP6_INC_STATS(net, ipv6_skb_idev(skb),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, optoff);
 		return false;
 	}
 
 	if (pkt_len > skb->len - sizeof(struct ipv6hdr)) {
-		__IP6_INC_STATS(net, ipv6_skb_idev(skb),
-				IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	}
 
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 9ee208a..f08d344 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -336,7 +336,7 @@ int ip6_mc_input(struct sk_buff *skb)
 	bool deliver;
 
 	__IP6_UPD_PO_STATS(dev_net(skb_dst(skb)->dev),
-			 ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_INMCAST,
+			 __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INMCAST,
 			 skb->len);
 
 	hdr = ipv6_hdr(skb);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 2e891d2..a39b04f9 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -425,6 +425,7 @@ static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 
 int ip6_forward(struct sk_buff *skb)
 {
+	struct inet6_dev *idev = __in6_dev_get_safely(skb->dev);
 	struct dst_entry *dst = skb_dst(skb);
 	struct ipv6hdr *hdr = ipv6_hdr(skb);
 	struct inet6_skb_parm *opt = IP6CB(skb);
@@ -444,8 +445,7 @@ int ip6_forward(struct sk_buff *skb)
 		goto drop;
 
 	if (!xfrm6_policy_check(NULL, XFRM_POLICY_FWD, skb)) {
-		__IP6_INC_STATS(net, ip6_dst_idev(dst),
-				IPSTATS_MIB_INDISCARDS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -476,8 +476,7 @@ int ip6_forward(struct sk_buff *skb)
 		/* Force OUTPUT device used as source address */
 		skb->dev = dst->dev;
 		icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, 0);
-		__IP6_INC_STATS(net, ip6_dst_idev(dst),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 
 		kfree_skb(skb);
 		return -ETIMEDOUT;
@@ -490,15 +489,13 @@ int ip6_forward(struct sk_buff *skb)
 		if (proxied > 0)
 			return ip6_input(skb);
 		else if (proxied < 0) {
-			__IP6_INC_STATS(net, ip6_dst_idev(dst),
-					IPSTATS_MIB_INDISCARDS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
 			goto drop;
 		}
 	}
 
 	if (!xfrm6_route_forward(skb)) {
-		__IP6_INC_STATS(net, ip6_dst_idev(dst),
-				IPSTATS_MIB_INDISCARDS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 	dst = skb_dst(skb);
@@ -554,8 +551,7 @@ int ip6_forward(struct sk_buff *skb)
 		/* Again, force OUTPUT device used as source address */
 		skb->dev = dst->dev;
 		icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
-		__IP6_INC_STATS(net, ip6_dst_idev(dst),
-				IPSTATS_MIB_INTOOBIGERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INTOOBIGERRORS);
 		__IP6_INC_STATS(net, ip6_dst_idev(dst),
 				IPSTATS_MIB_FRAGFAILS);
 		kfree_skb(skb);
@@ -579,7 +575,7 @@ int ip6_forward(struct sk_buff *skb)
 		       ip6_forward_finish);
 
 error:
-	__IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_INADDRERRORS);
+	__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 drop:
 	kfree_skb(skb);
 	return -EINVAL;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 4979610..2cdf3dc 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -179,7 +179,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 			((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
 	if ((unsigned int)end > IPV6_MAXPLEN) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+		__IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev),
 				IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 				  ((u8 *)&fhdr->frag_off -
@@ -214,7 +214,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 			/* RFC2460 says always send parameter problem in
 			 * this case. -DaveM
 			 */
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+			__IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev),
 					IPSTATS_MIB_INHDRERRORS);
 			icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 					  offsetof(struct ipv6hdr, payload_len));
@@ -536,7 +536,7 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
 	return -1;
 
 fail_hdr:
-	__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+	__IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev),
 			IPSTATS_MIB_INHDRERRORS);
 	icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, skb_network_header_len(skb));
 	return -1;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 49b954d..1d738bf 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3541,7 +3541,8 @@ static int ip6_pkt_drop(struct sk_buff *skb, u8 code, int ipstats_mib_noroutes)
 	case IPSTATS_MIB_INNOROUTES:
 		type = ipv6_addr_type(&ipv6_hdr(skb)->daddr);
 		if (type == IPV6_ADDR_ANY) {
-			IP6_INC_STATS(dev_net(dst->dev), ip6_dst_idev(dst),
+			IP6_INC_STATS(dev_net(dst->dev),
+				      __in6_dev_get_safely(skb->dev),
 				      IPSTATS_MIB_INADDRERRORS);
 			break;
 		}
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 4527921..ba0a0fd 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -266,12 +266,13 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 
 		/* check and decrement ttl */
 		if (ipv6_hdr(skb)->hop_limit <= 1) {
+			struct inet6_dev *idev = __in6_dev_get_safely(skb->dev);
+
 			/* Force OUTPUT device used as source address */
 			skb->dev = dst->dev;
 			icmpv6_send(skb, ICMPV6_TIME_EXCEED,
 				    ICMPV6_EXC_HOPLIMIT, 0);
-			__IP6_INC_STATS(net, ip6_dst_idev(dst),
-					IPSTATS_MIB_INHDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 
 			return false;
 		}
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH] net: mediatek: use of_device_get_match_data()
From: David Miller @ 2018-04-16 17:43 UTC (permalink / raw)
  To: ryder.lee
  Cc: sean.wang, netdev, linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <9bf87205f84062580934168774d690d4cd2bf26b.1523347340.git.ryder.lee@mediatek.com>

From: Ryder Lee <ryder.lee@mediatek.com>
Date: Mon, 16 Apr 2018 10:33:41 +0800

> The usage of of_device_get_match_data() reduce the code size a bit.
> 
> Also, the only way to call mtk_probe() is to match an entry in
> of_mtk_match[], so match cannot be NULL.
> 
> Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH 1/2] net: netsec: enable tx-irq during open callback
From: David Miller @ 2018-04-16 17:46 UTC (permalink / raw)
  To: jassisinghbrar; +Cc: netdev, masahisa.kojima, ard.biesheuvel, jaswinder.singh
In-Reply-To: <1523863336-12653-1-git-send-email-jassisinghbrar@gmail.com>

From: jassisinghbrar@gmail.com
Date: Mon, 16 Apr 2018 12:52:16 +0530

> From: Jassi Brar <jaswinder.singh@linaro.org>
> 
> Enable TX-irq as well during ndo_open() as we can not count upon
> RX to arrive early enough to trigger the napi. This patch is critical
> for installation over network.
> 
> Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
> Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>

Applied.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox