Netdev List


* RE: [PATCH net 0/2] tipc: Better check user provided attributes
From: Jon Maloy @ 2018-04-16 19:21 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <20180416152943.103989-1-edumazet@google.com>

Acked-by: Jon Maloy <jon.maloy@ericsson.com>

Thank you, Eric.


> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Eric Dumazet
> Sent: Monday, April 16, 2018 11:30
> To: David S . Miller <davem@davemloft.net>
> Cc: netdev <netdev@vger.kernel.org>; Eric Dumazet
> <edumazet@google.com>; Eric Dumazet <eric.dumazet@gmail.com>
> Subject: [PATCH net 0/2] tipc: Better check user provided attributes
> 
> syzbot reported a crash in __tipc_nl_net_set()
> 
> While fixing it, I also had to fix an old bug involving TIPC_NLA_NET_ADDR
> 
> Eric Dumazet (2):
>   tipc: add policy for TIPC_NLA_NET_ADDR
>   tipc: fix possible crash in __tipc_nl_net_set()
> 
>  net/tipc/net.c     | 2 ++
>  net/tipc/netlink.c | 5 ++++-
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> --
> 2.17.0.484.g0c8726318c-goog

^ permalink raw reply

* Re: [PATCH 4/4] dt-bindings: Document the DT bindings for lan78xx
From: Rob Herring @ 2018-04-16 19:22 UTC (permalink / raw)
  To: Phil Elwell
  Cc: Woojung Huh, Microchip Linux Driver Support, Mark Rutland,
	David S. Miller, Mauro Carvalho Chehab, Greg Kroah-Hartman,
	Linus Walleij, Andrew Morton, Randy Dunlap, netdev, devicetree,
	linux-kernel, linux-usb
In-Reply-To: <1523541336-145953-5-git-send-email-phil@raspberrypi.org>

On Thu, Apr 12, 2018 at 02:55:36PM +0100, Phil Elwell wrote:
> The Microchip LAN78XX family of devices are Ethernet controllers with
> a USB interface. Despite being discoverable devices it can be useful to
> be able to configure them from Device Tree, particularly in low-cost
> applications without an EEPROM or programmed OTP.
> 
> Document the supported properties in a bindings file, adding it to
> MAINTAINERS at the same time.
> 
> Signed-off-by: Phil Elwell <phil@raspberrypi.org>
> ---
>  .../devicetree/bindings/net/microchip,lan78xx.txt  | 44 ++++++++++++++++++++++
>  MAINTAINERS                                        |  1 +
>  2 files changed, 45 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/microchip,lan78xx.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/microchip,lan78xx.txt b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
> new file mode 100644
> index 0000000..e7d7850
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
> @@ -0,0 +1,44 @@
> +Microchip LAN78xx Gigabit Ethernet controller
> +
> +The LAN78XX devices are usually configured by programming their OTP or with
> +an external EEPROM, but some platforms (e.g. Raspberry Pi 3 B+) have neither.
> +
> +Please refer to ethernet.txt for a description of common Ethernet bindings.
> +
> +Optional properties:
> +- microchip,eee-enabled: if present, enable Energy Efficient Ethernet support;

I see we have some flags for broken EEE, but nothing already defined to 
enable EEE. Seems like this should either be a user option (therefore 
not in DT) or we should use the broken EEE properties if this is h/w 
dependent.

> +- microchip,led-modes: a two-element vector, with each element configuring
> +  the operating mode of an LED. The values supported by the device are;
> +  0: Link/Activity
> +  1: Link1000/Activity
> +  2: Link100/Activity
> +  3: Link10/Activity
> +  4: Link100/1000/Activity
> +  5: Link10/1000/Activity
> +  6: Link10/100/Activity
> +  7: RESERVED
> +  8: Duplex/Collision
> +  9: Collision
> +  10: Activity
> +  11: RESERVED
> +  12: Auto-negotiation Fault
> +  13: RESERVED
> +  14: Off
> +  15: On
> +- microchip,tx-lpi-timer: the delay (in microseconds) between the TX fifo
> +  becoming empty and invoking Low Power Idles (default 600).

Needs a unit suffix as defined in property-units.txt.

> +
> +Example:
> +
> +	/* Standard configuration for a Raspberry Pi 3 B+ */
> +	ethernet: usbether@1 {
> +		compatible = "usb424,7800";
> +		reg = <1>;
> +		microchip,eee-enabled;
> +		microchip,tx-lpi-timer = <600>;
> +		/*
> +		 * led0 = 1:link1000/activity
> +		 * led1 = 6:link10/100/activity
> +		 */
> +		microchip,led-modes = <1 6>;
> +	};
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2328eed..b637aad 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14482,6 +14482,7 @@ M:	Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
>  L:	netdev@vger.kernel.org
>  S:	Maintained
>  F:	drivers/net/usb/lan78xx.*
> +F:	Documentation/devicetree/bindings/net/microchip,lan78xx.txt
>  
>  USB MASS STORAGE DRIVER
>  M:	Alan Stern <stern@rowland.harvard.edu>
> -- 
> 2.7.4
> 

^ permalink raw reply
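
The `microchip,led-modes` table quoted above marks values 7, 11 and 13 as RESERVED. A minimal sketch of how a consumer of the property might reject those values follows; the function names are hypothetical and are not taken from the lan78xx driver, and only the value table comes from the proposed binding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LED_MODE_MAX 15u	/* highest value listed in the binding ("On") */

/* Hypothetical check: 7, 11 and 13 are RESERVED in the quoted table. */
static bool led_mode_valid(unsigned int mode)
{
	switch (mode) {
	case 7:
	case 11:
	case 13:
		return false;
	default:
		return mode <= LED_MODE_MAX;
	}
}

/* Validate the two-element vector, one entry per LED. */
static bool led_modes_valid(const unsigned int modes[2])
{
	assert(modes != NULL);
	return led_mode_valid(modes[0]) && led_mode_valid(modes[1]);
}
```

With the example values from the binding, `<1 6>`, both entries pass this check.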

* Re: [PATCH] PCI: Add two more values for PCIe Max_Read_Request_Size
From: Heiner Kallweit @ 2018-04-16 19:21 UTC (permalink / raw)
  To: Bjorn Helgaas, David Miller
  Cc: Bjorn Helgaas, linux-pci, netdev@vger.kernel.org
In-Reply-To: <e24bac5a-cf03-3579-2d44-6548a0998258@gmail.com>

Am 13.04.2018 um 07:48 schrieb Heiner Kallweit:
> Am 13.04.2018 um 00:55 schrieb Bjorn Helgaas:
>> Hello Heiner,
>>
>> On Thu, Apr 12, 2018 at 11:08:04PM +0200, Heiner Kallweit wrote:
>>> This patch adds missing values for the max read request size.
>>> E.g. network driver r8169 uses a value of 4K.
>>
>> Is there a r8169 patch that adds uses of PCI_EXP_DEVCTL_READRQ_4096B?
>> If so, we should probably keep it together with this one since this
>> one would have to be merged first.
>>
> Not yet, so far r8169 uses the following to set a MRRS of 4K.
> 
> #define MAX_READ_REQUEST_SHIFT	12
> 0x5 << MAX_READ_REQUEST_SHIFT
> 
> I have to check when setting 4K MRRS was added, but it seems
> that so far it's fine with all users.
> 
> Before submitting the r8169 patch I wanted to check for feedback
> regarding use of the two additional MRRS values, so what you write
> in the following is exactly the type of feedback I was looking for.
> 
> I will read through the linked discussion and then also submit the
> r8169 patch.
> 
Hello Bjorn,

I checked and using 4K MRRS was added to the r8169 driver about
10 years ago. So it seems the described potential problems didn't
hit anybody in reality.

When submitting the patch series consisting of pci core change
and r8169 change I will address it to pci and netdev.
Then I leave it to David and you to agree on through which tree
the series should go.

To mention it briefly as I added David:
The patch won't include any functional change.

Regards, Heiner

> Thanks, Heiner
> 
>> There's also a larger issue in that when Linux configures MPS and MRRS
>> at enumeration-time, it makes some assumptions about how MRRS is set.
>> If a driver like r8169 changes MRRS later, it may break those
>> assumptions, which might lead to PCIe errors.
>>
>> If a user boots with "pci=pcie_bus_perf", we use PCIE_BUS_PERFORMANCE
>> mode, and in that case we may set MPS to something larger than some
>> devices can support, and we rely on MRRS to avoid problems.
>>
>> I don't really *like* that scheme because it makes assumptions like
>> "drivers never change MRRS", but that's what we have right now.
>>
>> So just be aware that if r8169 changes MRRS and users boot with
>> "pci=pcie_bus_perf", there is the potential for PCIe bus errors.
>>
>> There was some recent discussion about this; [1] is a good place to
>> start.
>>
>> [1] http://lkml.kernel.org/r/20180119205153.GB160618@bhelgaas-glaptop.roam.corp.google.com
>>
>>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>>> ---
>>>  include/uapi/linux/pci_regs.h | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
>>> index 0c79eac5..699257fb 100644
>>> --- a/include/uapi/linux/pci_regs.h
>>> +++ b/include/uapi/linux/pci_regs.h
>>> @@ -506,6 +506,8 @@
>>>  #define  PCI_EXP_DEVCTL_READRQ_256B  0x1000 /* 256 Bytes */
>>>  #define  PCI_EXP_DEVCTL_READRQ_512B  0x2000 /* 512 Bytes */
>>>  #define  PCI_EXP_DEVCTL_READRQ_1024B 0x3000 /* 1024 Bytes */
>>> +#define  PCI_EXP_DEVCTL_READRQ_2048B 0x4000 /* 2048 Bytes */
>>> +#define  PCI_EXP_DEVCTL_READRQ_4096B 0x5000 /* 4096 Bytes */
>>>  #define  PCI_EXP_DEVCTL_BCR_FLR 0x8000  /* Bridge Configuration Retry / FLR */
>>>  #define PCI_EXP_DEVSTA		10	/* Device Status */
>>>  #define  PCI_EXP_DEVSTA_CED	0x0001	/* Correctable Error Detected */
>>> -- 
>>> 2.17.0
>>>
>>
> 

^ permalink raw reply
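
The link between the r8169 expression `0x5 << MAX_READ_REQUEST_SHIFT` and the proposed `PCI_EXP_DEVCTL_READRQ_4096B` value can be checked with a short sketch. It assumes the PCIe encoding in which the 3-bit field in bits 14:12 of Device Control selects 128 << n bytes; the helper names are hypothetical, and only the shift of 12 and the `0x4000`/`0x5000` values come from the thread.

```c
#include <assert.h>
#include <stdint.h>

#define MRRS_SHIFT 12		/* same value as MAX_READ_REQUEST_SHIFT above */
#define MRRS_MASK  0x7000u	/* bits 14:12 of PCIe Device Control */

/* Hypothetical helper: encode a size in bytes (power of two, 128..4096). */
static uint16_t mrrs_encode(unsigned int bytes)
{
	unsigned int n = 0;

	assert(bytes >= 128 && bytes <= 4096 && (bytes & (bytes - 1)) == 0);
	while ((128u << n) < bytes)
		n++;
	return (uint16_t)(n << MRRS_SHIFT);
}

/* Hypothetical helper: decode the field back to a size in bytes. */
static unsigned int mrrs_decode(uint16_t devctl)
{
	return 128u << ((devctl & MRRS_MASK) >> MRRS_SHIFT);
}
```

Under that assumption, `mrrs_encode(4096)` yields the same bits as `0x5 << 12`, which is consistent with the statement that replacing the open-coded shift by the named constant is not a functional change.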

* Re: [PATCH net-next 07/21] net/ipv6: Save route type in rt6_info
From: David Miller @ 2018-04-16 19:20 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji
In-Reply-To: <20180416152255.2256-8-dsahern@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Mon, 16 Apr 2018 08:22:41 -0700

> @@ -2394,6 +2395,7 @@ static void addrconf_add_mroute(struct net_device *dev)
>  		.fc_ifindex = dev->ifindex,
>  		.fc_dst_len = 8,
>  		.fc_flags = RTF_UP,
> +		.fc_type = RTN_UNICAST,
>  		.fc_nlinfo.nl_net = dev_net(dev),
>  	};
>  

Multicast route is of type RTN_UNICAST?

All of these cases were passing in a zero initialized value of
fc_type up until this patch.  Perhaps you should discuss that in your
commit message a little bit.

Thanks.

^ permalink raw reply
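
The point about zero-initialized values follows from how C designated initializers work: any member not named in the initializer is set to zero. A minimal sketch, using a hypothetical stand-in struct rather than the kernel's `fib6_config` (only the field names and `RTN_UNICAST` come from the quoted hunk; `RTN_UNSPEC` being the zero value is from the kernel's route-type enum):

```c
#include <assert.h>

/* Leading values of the kernel's route-type enum. */
enum { RTN_UNSPEC, RTN_UNICAST };

static_assert(RTN_UNSPEC == 0, "an unnamed fc_type reads back as RTN_UNSPEC");

/* Hypothetical stand-in for the few fields shown in the quoted hunk. */
struct cfg_sketch {
	int fc_ifindex;
	int fc_dst_len;
	unsigned char fc_type;
};

/* Before the patch: fc_type is not named, so it is zero-initialized. */
static struct cfg_sketch cfg_before(int ifindex)
{
	struct cfg_sketch cfg = {
		.fc_ifindex = ifindex,
		.fc_dst_len = 8,
	};
	return cfg;
}

/* After the patch: fc_type is set explicitly. */
static struct cfg_sketch cfg_after(int ifindex)
{
	struct cfg_sketch cfg = {
		.fc_ifindex = ifindex,
		.fc_dst_len = 8,
		.fc_type = RTN_UNICAST,
	};
	return cfg;
}
```

So the callers changed by the patch go from an implicit `RTN_UNSPEC` to an explicit `RTN_UNICAST`, which is the difference the reply asks to have discussed in the commit message.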

* Re: [PATCH 08/12] mmc: reduce use of block bounce buffers (fwd)
From: Julia Lawall @ 2018-04-16 19:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: iommu, linux-arch, linux-block, linux-ide, linux-scsi, netdev,
	linux-kernel, kbuild-all

There is a duplicated test on line 360.

julia

---------- Forwarded message ----------
Date: Mon, 16 Apr 2018 23:04:18 +0800
From: kbuild test robot <lkp@intel.com>
To: kbuild@01.org
Cc: Julia Lawall <julia.lawall@lip6.fr>
Subject: Re: [PATCH 08/12] mmc: reduce use of block bounce buffers

CC: kbuild-all@01.org
In-Reply-To: <20180416085032.7367-9-hch@lst.de>
References: <20180416085032.7367-9-hch@lst.de>
TO: Christoph Hellwig <hch@lst.de>
CC: iommu@lists.linux-foundation.org, linux-arch@vger.kernel.org, linux-block@vger.kernel.org, linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org, netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org

Hi Christoph,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc1 next-20180416]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Christoph-Hellwig/iscsi_tcp-don-t-set-a-bounce-limit/20180416-172618
:::::: branch date: 6 hours ago
:::::: commit date: 6 hours ago

>> drivers/mmc/core/queue.c:360:5-29: duplicated argument to && or ||

# https://github.com/0day-ci/linux/commit/6620a69f0eea8e8b7586f08f721c95a336022497
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 6620a69f0eea8e8b7586f08f721c95a336022497
vim +360 drivers/mmc/core/queue.c

81196976 Adrian Hunter     2017-11-29  350
c8b5fd03 Adrian Hunter     2017-09-22  351  static void mmc_setup_queue(struct mmc_queue *mq, struct mmc_card *card)
c8b5fd03 Adrian Hunter     2017-09-22  352  {
c8b5fd03 Adrian Hunter     2017-09-22  353  	struct mmc_host *host = card->host;
c8b5fd03 Adrian Hunter     2017-09-22  354
8b904b5b Bart Van Assche   2018-03-07  355  	blk_queue_flag_set(QUEUE_FLAG_NONROT, mq->queue);
8b904b5b Bart Van Assche   2018-03-07  356  	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, mq->queue);
c8b5fd03 Adrian Hunter     2017-09-22  357  	if (mmc_can_erase(card))
c8b5fd03 Adrian Hunter     2017-09-22  358  		mmc_queue_setup_discard(mq->queue, card);
c8b5fd03 Adrian Hunter     2017-09-22  359
6620a69f Christoph Hellwig 2018-04-16 @360  	if (!mmc_dev(host)->dma_mask || !mmc_dev(host)->dma_mask)
6620a69f Christoph Hellwig 2018-04-16  361  		blk_queue_bounce_limit(mq->queue, BLK_BOUNCE_HIGH);
c8b5fd03 Adrian Hunter     2017-09-22  362  	blk_queue_max_hw_sectors(mq->queue,
c8b5fd03 Adrian Hunter     2017-09-22  363  		min(host->max_blk_count, host->max_req_size / 512));
c8b5fd03 Adrian Hunter     2017-09-22  364  	blk_queue_max_segments(mq->queue, host->max_segs);
c8b5fd03 Adrian Hunter     2017-09-22  365  	blk_queue_max_segment_size(mq->queue, host->max_seg_size);
c8b5fd03 Adrian Hunter     2017-09-22  366
1e8e55b6 Adrian Hunter     2017-11-29  367  	INIT_WORK(&mq->recovery_work, mmc_mq_recovery_handler);
81196976 Adrian Hunter     2017-11-29  368  	INIT_WORK(&mq->complete_work, mmc_blk_mq_complete_work);
81196976 Adrian Hunter     2017-11-29  369
81196976 Adrian Hunter     2017-11-29  370  	mutex_init(&mq->complete_lock);
81196976 Adrian Hunter     2017-11-29  371
81196976 Adrian Hunter     2017-11-29  372  	init_waitqueue_head(&mq->wait);
81196976 Adrian Hunter     2017-11-29  373  }
81196976 Adrian Hunter     2017-11-29  374

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply
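
The warning on line 360 can be reproduced in isolation. The sketch below contrasts the flagged condition with one plausible intent, namely checking the pointed-to mask as well as the pointer. That intent is an assumption, not something the report states, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flagged form: both operands of || are the same NULL check. */
static bool needs_bounce_flagged(const uint64_t *dma_mask)
{
	return !dma_mask || !dma_mask;
}

/* Assumed intent: also treat a present-but-zero mask as "no DMA mask". */
static bool needs_bounce_assumed(const uint64_t *dma_mask)
{
	return !dma_mask || !*dma_mask;
}

/* The two forms differ only when the pointer is valid but the mask is 0. */
void bounce_forms_selfcheck(void)
{
	uint64_t zero = 0;

	assert(needs_bounce_flagged(NULL) == needs_bounce_assumed(NULL));
	assert(needs_bounce_flagged(&zero) != needs_bounce_assumed(&zero));
}
```

If the second operand was indeed meant to dereference the mask, the duplicated test silently skips the bounce limit for devices with a zero DMA mask.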

* Re: [PATCH] dt-bindings: net: ravb: Add support for r8a77965 SoC
From: jacopo mondi @ 2018-04-16 18:50 UTC (permalink / raw)
  To: David Miller
  Cc: jacopo+renesas, sergei.shtylyov, robh+dt, mark.rutland, netdev,
	devicetree, linux-renesas-soc, linux-kernel
In-Reply-To: <20180416.101755.99677892602830691.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 841 bytes --]

Hi David,

On Mon, Apr 16, 2018 at 10:17:55AM -0400, David Miller wrote:
> From: Jacopo Mondi <jacopo+renesas@jmondi.org>
> Date: Mon, 16 Apr 2018 15:55:17 +0200
>
> > Add documentation for r8a77965 compatible string to renesas ravb device
> > tree bindings documentation.
> >
> > Signed-off-by: Jacopo Mondi <jacopo+renesas@jmondi.org>
> > Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
> > Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
> > Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
> > ---
> >
> > Renesas R-Car M3-N support has been merged for v4.17.
> > Document the missing device tree bindings.
>
> Since this is purely a devicetree update, I'm assuming that it doesn't
> go through my networking tree.

Rob has picked this patch up in his tree, so no need for you to do the
same.

Thanks
   j

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 1/5] dt-bindings: allow dwmac-sun8i to use other devices' exported regmap
From: Rob Herring @ 2018-04-16 18:47 UTC (permalink / raw)
  To: Icenowy Zheng
  Cc: Maxime Ripard, Chen-Yu Tsai, Giuseppe Cavallaro, Corentin Labbe,
	netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw
In-Reply-To: <20180411141641.14675-2-icenowy-h8G6r0blFSE@public.gmane.org>

On Wed, Apr 11, 2018 at 10:16:37PM +0800, Icenowy Zheng wrote:
> On some Allwinner SoCs the EMAC clock register needed by dwmac-sun8i is
> in another device's memory space. In this situation dwmac-sun8i can use
> a regmap exported by the other device with only the EMAC clock register.

If this is a clock, then why not use the clock binding?

> 
> Document this situation in the dwmac-sun8i device tree binding
> documentation.
> 
> Signed-off-by: Icenowy Zheng <icenowy-h8G6r0blFSE@public.gmane.org>
> ---
>  Documentation/devicetree/bindings/net/dwmac-sun8i.txt | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
> index 3d6d5fa0c4d5..0c5f63a80617 100644
> --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
> +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
> @@ -20,8 +20,9 @@ Required properties:
>  - phy-handle: See ethernet.txt
>  - #address-cells: shall be 1
>  - #size-cells: shall be 0
> -- syscon: A phandle to the syscon of the SoC with one of the following
> - compatible string:
> +- syscon: A phandle to a device which exports the EMAC clock register as a
> + regmap or to the syscon of the SoC with one of the following compatible
> + string:
>    - allwinner,sun8i-h3-system-controller
>    - allwinner,sun8i-v3s-system-controller
>    - allwinner,sun50i-a64-system-controller
> -- 
> 2.15.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe devicetree" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: dma-mapping: bypass dma_ops for direct mappings
From: kbuild test robot @ 2018-04-16 18:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: kbuild-all, Jesper Dangaard Brouer, xdp-newbies@vger.kernel.org,
	netdev@vger.kernel.org, Christoph Hellwig, David Woodhouse,
	William Tu, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Arnaldo Carvalho de Melo
In-Reply-To: <20180416122706.GA20624@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 12130 bytes --]

Hi Christoph,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc1 next-20180416]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Christoph-Hellwig/dma-mapping-bypass-dma_ops-for-direct-mappings/20180416-230032
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   include/net/mac80211.h:2083: warning: bad line: >
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'rx_stats_avg' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'rx_stats_avg.signal' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'rx_stats_avg.chain_signal' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.filtered' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.retry_failed' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.retry_count' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.lost_packets' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.last_tdls_pkt_time' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.msdu_retries' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.msdu_failed' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.last_ack' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.last_ack_signal' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'status_stats.ack_signal_filled' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'tx_stats.packets' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'tx_stats.bytes' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'tx_stats.last_rate' not described in 'sta_info'
   net/mac80211/sta_info.h:586: warning: Function parameter or member 'tx_stats.msdu' not described in 'sta_info'
   kernel/sched/fair.c:3731: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_excl.cb' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_excl.poll' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_excl.active' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_shared.cb' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_shared.poll' not described in 'dma_buf'
   include/linux/dma-buf.h:307: warning: Function parameter or member 'cb_shared.active' not described in 'dma_buf'
   include/linux/dma-fence-array.h:54: warning: Function parameter or member 'work' not described in 'dma_fence_array'
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function request_firmware drivers/base/firmware_class.c' failed with return code 1
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function request_firmware_direct drivers/base/firmware_class.c' failed with return code 1
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function request_firmware_into_buf drivers/base/firmware_class.c' failed with return code 1
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function request_firmware_nowait drivers/base/firmware_class.c' failed with return code 1
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -function firmware_request_cache drivers/base/firmware_class.c' failed with return code 1
   include/linux/gpio/driver.h:142: warning: Function parameter or member 'request_key' not described in 'gpio_irq_chip'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.sign' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.realbits' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.storagebits' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.shift' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.repeat' not described in 'iio_chan_spec'
   include/linux/iio/iio.h:270: warning: Function parameter or member 'scan_type.endianness' not described in 'iio_chan_spec'
   include/linux/iio/hw-consumer.h:1: warning: no structured comments found
>> include/linux/device.h:995: warning: Function parameter or member 'is_dma_direct' not described in 'device'
   Error: Cannot open file drivers/base/firmware_class.c
   Error: Cannot open file drivers/base/firmware_class.c
   WARNING: kernel-doc 'scripts/kernel-doc -rst -enable-lineno -export drivers/base/firmware_class.c' failed with return code 2
   include/linux/input/sparse-keymap.h:46: warning: Function parameter or member 'sw' not described in 'key_entry'
   include/linux/mtd/rawnand.h:752: warning: Function parameter or member 'timings.sdr' not described in 'nand_data_interface'
   include/linux/mtd/rawnand.h:817: warning: Function parameter or member 'buf' not described in 'nand_op_data_instr'
   include/linux/mtd/rawnand.h:817: warning: Function parameter or member 'buf.in' not described in 'nand_op_data_instr'
   include/linux/mtd/rawnand.h:817: warning: Function parameter or member 'buf.out' not described in 'nand_op_data_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx.cmd' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx.addr' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx.data' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:863: warning: Function parameter or member 'ctx.waitrdy' not described in 'nand_op_instr'
   include/linux/mtd/rawnand.h:1010: warning: Function parameter or member 'ctx' not described in 'nand_op_parser_pattern_elem'
   include/linux/mtd/rawnand.h:1010: warning: Function parameter or member 'ctx.addr' not described in 'nand_op_parser_pattern_elem'
   include/linux/mtd/rawnand.h:1010: warning: Function parameter or member 'ctx.data' not described in 'nand_op_parser_pattern_elem'
   include/linux/mtd/rawnand.h:1313: warning: Function parameter or member 'manufacturer.desc' not described in 'nand_chip'
   include/linux/mtd/rawnand.h:1313: warning: Function parameter or member 'manufacturer.priv' not described in 'nand_chip'
   include/linux/regulator/driver.h:222: warning: Function parameter or member 'resume_early' not described in 'regulator_ops'
   drivers/regulator/core.c:4306: warning: Excess function parameter 'state' description in 'regulator_suspend_late'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.esw0' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.esw1' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.esw2' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.esw3' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 'esw.eadm' not described in 'irb'
   drivers/usb/typec/mux.c:186: warning: Function parameter or member 'mux' not described in 'typec_mux_unregister'
   drivers/usb/typec/mux.c:186: warning: Excess function parameter 'sw' description in 'typec_mux_unregister'

vim +995 include/linux/device.h

^1da177e Linus Torvalds 2005-04-16 @995  

:::::: The code at line 995 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

:::::: TO: Linus Torvalds <torvalds@ppc970.osdl.org>
:::::: CC: Linus Torvalds <torvalds@ppc970.osdl.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6353 bytes --]

^ permalink raw reply

* Re: dma-mapping: bypass dma_ops for direct mappings
From: Jesper Dangaard Brouer @ 2018-04-16 18:26 UTC (permalink / raw)
  To: kbuild test robot
  Cc: Christoph Hellwig, kbuild-all, netdev@vger.kernel.org,
	Christoph Hellwig, David Woodhouse, William Tu,
	Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Arnaldo Carvalho de Melo, brouer
In-Reply-To: <201804170016.fO08NqAo%fengguang.wu@intel.com>

On Tue, 17 Apr 2018 02:05:12 +0800
kbuild test robot <lkp@intel.com> wrote:

> Hi Christoph,
> 
> I love your patch! Yet something to improve:

I was just about to complain about the same compile error ;-)

> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.17-rc1 next-20180416]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Christoph-Hellwig/dma-mapping-bypass-dma_ops-for-direct-mappings/20180416-230032
> config: i386-defconfig (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>    drivers/char/agp/intel-gtt.o: In function `intel_gmch_probe':
> >> intel-gtt.c:(.text+0x11e4): undefined reference to `swiotlb_dma_ops'  
>    drivers/ata/ahci.o: In function `ahci_init_one':
> >> ahci.c:(.text+0x108d): undefined reference to `swiotlb_dma_ops'  
>    drivers/net/ethernet/broadcom/bnx2.o: In function `bnx2_init_one':
> >> bnx2.c:(.text+0x7fe7): undefined reference to `swiotlb_dma_ops'  
>    drivers/net/ethernet/broadcom/tg3.o: In function `tg3_init_one':
> >> tg3.c:(.text+0x13549): undefined reference to `swiotlb_dma_ops'  
>    drivers/net/ethernet/intel/e1000/e1000_main.o: In function `e1000_probe':
> >> e1000_main.c:(.text+0x49b3): undefined reference to `swiotlb_dma_ops'  
>    drivers/net/ethernet/intel/e1000e/netdev.o:netdev.c:(.text+0xa65e): more undefined references to `swiotlb_dma_ops' follow
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH net-next 2/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-04-16 18:15 UTC (permalink / raw)
  To: netdev; +Cc: Yi-Hung Wei
In-Reply-To: <1523902550-10767-1-git-send-email-yihung.wei@gmail.com>

Currently, nf_conntrack_max limits the maximum number of conntrack
entries in the conntrack table of each network namespace.  VMs and
containers that reside in the same namespace share one conntrack table,
so the total number of conntrack entries across all of them is bounded
by nf_conntrack_max.  In this case, if one VM or container abuses its
usage of conntrack entries, it prevents the others from committing valid
conntrack entries into the table.  Even if each VM could be placed in a
different network namespace, the current nf_conntrack_max configuration
is too rigid to give different VMs or containers different limits on the
number of conntrack entries.

To address this issue, this patch proposes a fine-grained mechanism
that further limits the number of conntrack entries per zone.  For
example, we can assign a different zone to each VM and set a conntrack
limit on each zone.  With this isolation, a misbehaving VM only consumes
the conntrack entries in its own zone and does not affect other
well-behaved VMs.  Moreover, users can set different conntrack limits on
different zones according to their preference.

The proposed implementation uses Netfilter's nf_conncount backend to
count the number of connections in a particular zone.  If the number of
connections is above the configured limit, ovs returns ENOMEM to
userspace.  If userspace does not configure a zone limit, the limit
defaults to zero, which means no limit and is backward compatible with
the behavior without this patch.

The following high-level APIs are provided to userspace:
  - OVS_CT_LIMIT_CMD_SET:
    * set default connection limit for all zones
    * set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
    * remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
    * get the default connection limit for all zones
    * get the connection limit for a particular zone
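
For illustration only, the SET/DEL/GET semantics listed above can be
sketched in plain userspace C.  This is not the code in this patch: the
zl_* names are hypothetical, a singly linked chain stands in for the
RCU-protected hlist, and locking is left out.  It only shows the intended
lookup rule, namely that a zone without its own entry falls back to the
default limit and that a limit of zero means "no limit".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ZL_BUCKETS   512u	/* power of two, so a mask selects the bucket */
#define ZL_UNLIMITED 0u		/* a limit of zero means "no limit" */

/* Hypothetical userspace stand-in for one per-zone limit entry. */
struct zl_entry {
	struct zl_entry *next;
	uint16_t zone;
	uint32_t limit;
};

struct zl_table {
	uint32_t default_limit;
	struct zl_entry *buckets[ZL_BUCKETS];
};

static void zl_init(struct zl_table *t)
{
	assert(t != NULL);
	t->default_limit = ZL_UNLIMITED;
	for (unsigned int i = 0; i < ZL_BUCKETS; i++)
		t->buckets[i] = NULL;
}

static struct zl_entry **zl_bucket(struct zl_table *t, uint16_t zone)
{
	return &t->buckets[zone & (ZL_BUCKETS - 1)];
}

/* SET: update the entry for @zone, or add one.  Returns 0, or -1 if
 * the allocation fails.
 */
static int zl_set(struct zl_table *t, uint16_t zone, uint32_t limit)
{
	struct zl_entry *e;

	for (e = *zl_bucket(t, zone); e; e = e->next) {
		if (e->zone == zone) {
			e->limit = limit;
			return 0;
		}
	}

	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->zone = zone;
	e->limit = limit;
	e->next = *zl_bucket(t, zone);
	*zl_bucket(t, zone) = e;
	return 0;
}

/* DEL: drop the entry for @zone, if any; the zone then uses the default. */
static void zl_del(struct zl_table *t, uint16_t zone)
{
	struct zl_entry **pp;

	for (pp = zl_bucket(t, zone); *pp; pp = &(*pp)->next) {
		if ((*pp)->zone == zone) {
			struct zl_entry *victim = *pp;

			*pp = victim->next;
			free(victim);
			return;
		}
	}
}

/* GET: the per-zone limit if one is set, else the default limit. */
static uint32_t zl_get(const struct zl_table *t, uint16_t zone)
{
	const struct zl_entry *e;

	for (e = t->buckets[zone & (ZL_BUCKETS - 1)]; e; e = e->next)
		if (e->zone == zone)
			return e->limit;

	return t->default_limit;
}
```

With this sketch, zones 7 and 519 land in the same bucket (519 & 511 == 7)
but keep separate limits, and deleting the entry for zone 7 makes
zl_get() return the default again.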

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 net/openvswitch/Kconfig     |   3 +-
 net/openvswitch/conntrack.c | 497 +++++++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   1 +
 5 files changed, 511 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
 		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 				     (!NF_NAT || NF_NAT) && \
 				     (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-				     (!NF_NAT_IPV6 || NF_NAT_IPV6)))
+				     (!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+				     (!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
 	select LIBCRC32C
 	select MPLS
 	select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c5904f629091..2f51da91d056 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -17,7 +17,9 @@
 #include <linux/udp.h>
 #include <linux/sctp.h>
 #include <net/ip.h>
+#include <net/genetlink.h>
 #include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_count.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_labels.h>
 #include <net/netfilter/nf_conntrack_seqadj.h>
@@ -76,6 +78,38 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED	0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+
+struct ovs_ct_limit {
+	/* Elements in ovs_ct_limit_info->limits hash table */
+	struct hlist_node hlist_node;
+	struct rcu_head rcu;
+	u16 zone;
+	u32 limit;
+};
+
+struct ovs_ct_limit_info {
+	u32 default_limit;
+	struct hlist_head *limits;
+	struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+	[OVS_CT_LIMIT_ATTR_OPTION] = { .type = NLA_NESTED, },
+};
+
+static const struct nla_policy
+	ct_zone_limit_policy[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1] = {
+		[OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT] = { .type = NLA_U32, },
+		[OVS_CT_ZONE_LIMIT_ATTR_ZONE] = { .type = NLA_U16, },
+		[OVS_CT_ZONE_LIMIT_ATTR_LIMIT] = { .type = NLA_U32, },
+		[OVS_CT_ZONE_LIMIT_ATTR_COUNT] = { .type = NLA_U32, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1070,94 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels)
 	return false;
 }
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+	const struct ovs_ct_limit_info *info, u16 zone)
+{
+	return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)];
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_set(const struct ovs_ct_limit_info *info,
+			 struct ovs_ct_limit *new_ct_limit)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, new_ct_limit->zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == new_ct_limit->zone) {
+			hlist_replace_rcu(&ct_limit->hlist_node,
+					  &new_ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+
+	hlist_add_head_rcu(&new_ct_limit->hlist_node, head);
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_del(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == zone) {
+			hlist_del_rcu(&ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+}
+
+/* Call with RCU read lock */
+static u32 ct_limit_get(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == zone)
+			return ct_limit->limit;
+	}
+
+	return info->default_limit;
+}
+
+static int ovs_ct_check_limit(struct net *net,
+			      const struct ovs_conntrack_info *info,
+			      const struct nf_conntrack_tuple *tuple)
+{
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	const struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	u32 per_zone_limit, connections;
+	u32 conncount_key[5];
+
+	conncount_key[0] = info->zone.id;
+
+	rcu_read_lock();
+	per_zone_limit = ct_limit_get(ct_limit_info, info->zone.id);
+	if (per_zone_limit == OVS_CT_LIMIT_UNLIMITED) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	connections = nf_conncount_count(net, ct_limit_info->data,
+					 conncount_key, tuple, &info->zone);
+	if (connections > per_zone_limit) {
+		rcu_read_unlock();
+		return -ENOMEM;
+	}
+
+	rcu_read_unlock();
+	return 0;
+}
+#endif
+
 /* Lookup connection and confirm if unconfirmed. */
 static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 			 const struct ovs_conntrack_info *info,
@@ -1054,6 +1176,13 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 	if (!ct)
 		return 0;
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	err = ovs_ct_check_limit(net, info,
+				 &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+	if (err)
+		return err;
+#endif
+
 	/* Set the conntrack event mask if given.  NEW and DELETE events have
 	 * their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener
 	 * typically would receive many kinds of updates.  Setting the event
@@ -1655,7 +1784,363 @@ static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info)
 		nf_ct_tmpl_free(ct_info->ct);
 }
 
-void ovs_ct_init(struct net *net)
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static int ovs_ct_limit_init(struct net *net, struct ovs_net *ovs_net)
+{
+	int i;
+
+	ovs_net->ct_limit_info = kmalloc(sizeof *ovs_net->ct_limit_info,
+					 GFP_KERNEL);
+	if (!ovs_net->ct_limit_info)
+		return -ENOMEM;
+
+	ovs_net->ct_limit_info->default_limit = OVS_CT_LIMIT_DEFAULT;
+	ovs_net->ct_limit_info->limits =
+		kmalloc_array(CT_LIMIT_HASH_BUCKETS, sizeof(struct hlist_head),
+			      GFP_KERNEL);
+	if (!ovs_net->ct_limit_info->limits) {
+		kfree(ovs_net->ct_limit_info);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; i++)
+		INIT_HLIST_HEAD(&ovs_net->ct_limit_info->limits[i]);
+
+	ovs_net->ct_limit_info->data =
+		nf_conncount_init(net, NFPROTO_INET, sizeof(u32));
+
+	if (IS_ERR(ovs_net->ct_limit_info->data)) {
+		kfree(ovs_net->ct_limit_info->limits);
+		kfree(ovs_net->ct_limit_info);
+		return PTR_ERR(ovs_net->ct_limit_info->data);
+	}
+	return 0;
+}
+
+static void ovs_ct_limit_exit(struct net *net, struct ovs_net *ovs_net)
+{
+	const struct ovs_ct_limit_info *info = ovs_net->ct_limit_info;
+	int i;
+
+	nf_conncount_destroy(net, NFPROTO_INET, info->data);
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+		struct hlist_head *head = &info->limits[i];
+		struct ovs_ct_limit *ct_limit;
+
+		hlist_for_each_entry_rcu(ct_limit, head, hlist_node)
+			kfree_rcu(ct_limit, rcu);
+	}
+	kfree(ovs_net->ct_limit_info->limits);
+	kfree(ovs_net->ct_limit_info);
+}
+
+static struct sk_buff *
+ovs_ct_limit_cmd_reply_start(struct genl_info *info, u8 cmd,
+			     struct ovs_header **ovs_reply_header)
+{
+	struct sk_buff *skb;
+	struct ovs_header *ovs_header = info->userhdr;
+
+	skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	*ovs_reply_header = genlmsg_put(skb, info->snd_portid,
+					info->snd_seq,
+					&dp_ct_limit_genl_family, 0, cmd);
+
+	if (!*ovs_reply_header) {
+		nlmsg_free(skb);
+		return ERR_PTR(-EMSGSIZE);
+	}
+	(*ovs_reply_header)->dp_ifindex = ovs_header->dp_ifindex;
+
+	return skb;
+}
+
+static int ovs_ct_limit_set_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct nlattr *nla;
+	int rem, err;
+
+	nla_for_each_nested(nla, nla_zone_limit, rem) {
+		struct nlattr *attr[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1];
+		struct ovs_ct_limit *ct_limit;
+
+		if (nla_type(nla) != OVS_CT_ZONE_LIMIT_ATTR_SET_REQ)
+			return  -EINVAL;
+
+		err = nla_parse((struct nlattr **)&attr,
+				OVS_CT_ZONE_LIMIT_ATTR_MAX, nla_data(nla),
+				nla_len(nla), ct_zone_limit_policy, NULL);
+		if (err)
+			return err;
+
+		if (attr[OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT]) {
+			u32 default_limit = nla_get_u32(
+				attr[OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT]);
+			ovs_lock();
+			info->default_limit = default_limit;
+			ovs_unlock();
+		} else {
+			if (!attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE] ||
+			    !attr[OVS_CT_ZONE_LIMIT_ATTR_LIMIT]) {
+				return -EINVAL;
+			}
+
+			ct_limit = kmalloc(sizeof(*ct_limit), GFP_KERNEL);
+			if (!ct_limit)
+				return -ENOMEM;
+
+			ct_limit->zone = nla_get_u16(
+				attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE]);
+			ct_limit->limit = nla_get_u32(
+				attr[OVS_CT_ZONE_LIMIT_ATTR_LIMIT]);
+
+			ovs_lock();
+			ct_limit_set(info, ct_limit);
+			ovs_unlock();
+		}
+	}
+	return 0;
+}
+
+static int ovs_ct_limit_del_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct nlattr *nla;
+	int rem, err;
+
+	nla_for_each_nested(nla, nla_zone_limit, rem) {
+		struct nlattr *attr[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1];
+		u16 zone;
+
+		if (nla_type(nla) != OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ)
+			return  -EINVAL;
+
+		err = nla_parse((struct nlattr **)&attr,
+				OVS_CT_ZONE_LIMIT_ATTR_MAX, nla_data(nla),
+				nla_len(nla), ct_zone_limit_policy, NULL);
+		if (err)
+			return err;
+
+		if (!attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE])
+			return -EINVAL;
+
+		zone = nla_get_u16(attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE]);
+
+		ovs_lock();
+		ct_limit_del(info, zone);
+		ovs_unlock();
+	}
+	return 0;
+}
+
+static int ovs_ct_limit_get_default_limit(struct ovs_ct_limit_info *info,
+					  struct sk_buff *reply)
+{
+	int err;
+	struct nlattr *nla_nested;
+
+	nla_nested = nla_nest_start(reply, OVS_CT_ZONE_LIMIT_ATTR_GET_RLY);
+
+	err = nla_put_u32(reply, OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,
+			  info->default_limit);
+	if (err)
+		return err;
+
+	nla_nest_end(reply, nla_nested);
+	return 0;
+}
+
+static int ovs_ct_limit_get_zone_limit(struct net *net,
+				       struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info,
+				       struct sk_buff *reply)
+{
+	struct nlattr *nla, *nla_nested;
+	int rem, err;
+	u16 zone;
+	u32 limit, count, conncount_key[5];
+	struct nf_conntrack_zone ct_zone;
+
+	nla_for_each_nested(nla, nla_zone_limit, rem) {
+		struct nlattr *attr[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1];
+
+		if (nla_type(nla) != OVS_CT_ZONE_LIMIT_ATTR_GET_REQ)
+			return -EINVAL;
+
+		err = nla_parse((struct nlattr **)&attr,
+				OVS_CT_ZONE_LIMIT_ATTR_MAX, nla_data(nla),
+				nla_len(nla), ct_zone_limit_policy, NULL);
+		if (err)
+			return err;
+
+		if (!attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE])
+			return -EINVAL;
+
+		zone = nla_get_u16(attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE]);
+		nf_ct_zone_init(&ct_zone, zone, NF_CT_DEFAULT_ZONE_DIR, 0);
+		rcu_read_lock();
+		limit = ct_limit_get(info, zone);
+		rcu_read_unlock();
+
+		conncount_key[0] = zone;
+		count = nf_conncount_count(net, info->data, conncount_key,
+					   NULL, &ct_zone);
+
+		nla_nested = nla_nest_start(reply,
+					    OVS_CT_ZONE_LIMIT_ATTR_GET_RLY);
+		if (nla_put_u16(reply, OVS_CT_ZONE_LIMIT_ATTR_ZONE, zone) ||
+		    nla_put_u32(reply, OVS_CT_ZONE_LIMIT_ATTR_LIMIT, limit) ||
+		    nla_put_u32(reply, OVS_CT_ZONE_LIMIT_ATTR_COUNT, count))
+			return -EMSGSIZE;
+		nla_nest_end(reply, nla_nested);
+	}
+
+	return 0;
+}
+
+static int ovs_ct_limit_cmd_set(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	if (!a[OVS_CT_LIMIT_ATTR_OPTION])
+		return -EINVAL;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_SET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	err = ovs_ct_limit_set_zone_limit(a[OVS_CT_LIMIT_ATTR_OPTION],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_del(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	if (!a[OVS_CT_LIMIT_ATTR_OPTION])
+		return -EINVAL;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_DEL,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	err = ovs_ct_limit_del_zone_limit(a[OVS_CT_LIMIT_ATTR_OPTION],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_get(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct nlattr *nla_reply;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct net *net = sock_net(skb->sk);
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_GET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	nla_reply = nla_nest_start(reply, OVS_CT_LIMIT_ATTR_OPTION);
+
+	err = ovs_ct_limit_get_default_limit(ct_limit_info, reply);
+	if (err)
+		goto exit_err;
+
+	if (a[OVS_CT_LIMIT_ATTR_OPTION]) {
+		err = ovs_ct_limit_get_zone_limit(
+			net, a[OVS_CT_LIMIT_ATTR_OPTION], ct_limit_info,
+			reply);
+		if (err)
+			goto exit_err;
+	}
+
+	nla_nest_end(reply, nla_reply);
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static struct genl_ops ct_limit_genl_ops[] = {
+	{ .cmd = OVS_CT_LIMIT_CMD_SET,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_set,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_DEL,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_del,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_GET,
+		.flags = 0,		  /* OK for unprivileged users. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_get,
+	},
+};
+
+static const struct genl_multicast_group ovs_ct_limit_multicast_group = {
+	.name = OVS_CT_LIMIT_MCGROUP,
+};
+
+struct genl_family dp_ct_limit_genl_family __ro_after_init = {
+	.hdrsize = sizeof(struct ovs_header),
+	.name = OVS_CT_LIMIT_FAMILY,
+	.version = OVS_CT_LIMIT_VERSION,
+	.maxattr = OVS_CT_LIMIT_ATTR_MAX,
+	.netnsok = true,
+	.parallel_ops = true,
+	.ops = ct_limit_genl_ops,
+	.n_ops = ARRAY_SIZE(ct_limit_genl_ops),
+	.mcgrps = &ovs_ct_limit_multicast_group,
+	.n_mcgrps = 1,
+	.module = THIS_MODULE,
+};
+#endif
+
+int ovs_ct_init(struct net *net)
 {
 	unsigned int n_bits = sizeof(struct ovs_key_ct_labels) * BITS_PER_BYTE;
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
@@ -1666,12 +2151,22 @@ void ovs_ct_init(struct net *net)
 	} else {
 		ovs_net->xt_label = true;
 	}
+
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	return ovs_ct_limit_init(net, ovs_net);
+#else
+	return 0;
+#endif
 }
 
 void ovs_ct_exit(struct net *net)
 {
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	ovs_ct_limit_exit(net, ovs_net);
+#endif
+
 	if (ovs_net->xt_label)
 		nf_connlabels_put(net);
 }
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 399dfdd2c4f9..900dadd70974 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -17,10 +17,11 @@
 #include "flow.h"
 
 struct ovs_conntrack_info;
+struct ovs_ct_limit_info;
 enum ovs_key_attr;
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-void ovs_ct_init(struct net *);
+int ovs_ct_init(struct net *);
 void ovs_ct_exit(struct net *);
 bool ovs_ct_verify(struct net *, enum ovs_key_attr attr);
 int ovs_ct_copy_action(struct net *, const struct nlattr *,
@@ -44,7 +45,7 @@ void ovs_ct_free_action(const struct nlattr *a);
 #else
 #include <linux/errno.h>
 
-static inline void ovs_ct_init(struct net *net) { }
+static inline int ovs_ct_init(struct net *net) { return 0; }
 
 static inline void ovs_ct_exit(struct net *net) { }
 
@@ -104,4 +105,8 @@ static inline void ovs_ct_free_action(const struct nlattr *a) { }
 
 #define CT_SUPPORTED_MASK 0
 #endif /* CONFIG_NF_CONNTRACK */
+
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+extern struct genl_family dp_ct_limit_genl_family;
+#endif
 #endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 015e24e08909..a61818e94396 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2288,6 +2288,9 @@ static struct genl_family * const dp_genl_families[] = {
 	&dp_flow_genl_family,
 	&dp_packet_genl_family,
 	&dp_meter_genl_family,
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	&dp_ct_limit_genl_family,
+#endif
 };
 
 static void dp_unregister_genl(int n_families)
@@ -2323,8 +2326,7 @@ static int __net_init ovs_init_net(struct net *net)
 
 	INIT_LIST_HEAD(&ovs_net->dps);
 	INIT_WORK(&ovs_net->dp_notify_work, ovs_dp_notify_wq);
-	ovs_ct_init(net);
-	return 0;
+	return ovs_ct_init(net);
 }
 
 static void __net_exit list_vports_from_net(struct net *net, struct net *dnet,
@@ -2469,3 +2471,4 @@ MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_METER_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_CT_LIMIT_FAMILY);
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 523d65526766..51bd4dcb6c8b 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -144,6 +144,7 @@ struct dp_upcall_info {
 struct ovs_net {
 	struct list_head dps;
 	struct work_struct dp_notify_work;
+	struct ovs_ct_limit_info *ct_limit_info;
 
 	/* Module reference for configuring conntrack. */
 	bool xt_label;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 1/2] openvswitch: Add conntrack limit netlink definition
From: Yi-Hung Wei @ 2018-04-16 18:15 UTC (permalink / raw)
  To: netdev; +Cc: Yi-Hung Wei
In-Reply-To: <1523902550-10767-1-git-send-email-yihung.wei@gmail.com>

Define netlink messages and attributes to support communication between
userspace and the kernel for the conntrack limit feature.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 include/uapi/linux/openvswitch.h | 62 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..ca63c16375ce 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,66 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+	OVS_CT_LIMIT_CMD_UNSPEC,
+	OVS_CT_LIMIT_CMD_SET,		/* Add or modify ct limit. */
+	OVS_CT_LIMIT_CMD_DEL,		/* Delete ct limit. */
+	OVS_CT_LIMIT_CMD_GET		/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+	OVS_CT_LIMIT_ATTR_UNSPEC,
+	OVS_CT_LIMIT_ATTR_OPTION,	/* Nested OVS_CT_LIMIT_ATTR_* */
+	__OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+/**
+ * @OVS_CT_ZONE_LIMIT_ATTR_SET_REQ: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a pair of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE and OVS_CT_ZONE_LIMIT_ATTR_LIMIT.
+ * @OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_RLY: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a triple of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE, OVS_CT_ZONE_LIMIT_ATTR_LIMIT and
+ * OVS_CT_ZONE_LIMIT_ATTR_COUNT.
+ */
+enum ovs_ct_limit_option_attr {
+	OVS_CT_LIMIT_OPTION_ATTR_UNSPEC,
+	OVS_CT_ZONE_LIMIT_ATTR_SET_REQ,	/* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+					 * attributes. */
+	OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ,	/* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+					 * attributes. */
+	OVS_CT_ZONE_LIMIT_ATTR_GET_REQ,	/* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+					 * attributes. */
+	OVS_CT_ZONE_LIMIT_ATTR_GET_RLY,	/* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+					 * attributes. */
+	__OVS_CT_LIMIT_OPTION_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_OPTION_ATTR_MAX (__OVS_CT_LIMIT_OPTION_ATTR_MAX - 1)
+
+enum ovs_ct_zone_limit_attr {
+	OVS_CT_ZONE_LIMIT_ATTR_UNSPEC,
+	OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,	/* u32 default conntrack limit
+						 * for all zones. */
+	OVS_CT_ZONE_LIMIT_ATTR_ZONE,		/* u16 conntrack zone id. */
+	OVS_CT_ZONE_LIMIT_ATTR_LIMIT,		/* u32 max number of conntrack
+						 * entries allowed in the
+						 * corresponding zone. */
+	OVS_CT_ZONE_LIMIT_ATTR_COUNT,		/* u32 number of conntrack
+						 * entries in the corresponding
+						 * zone. */
+	__OVS_CT_ZONE_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_ZONE_LIMIT_ATTR_MAX (__OVS_CT_ZONE_LIMIT_ATTR_MAX - 1)
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 0/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-04-16 18:15 UTC (permalink / raw)
  To: netdev; +Cc: Yi-Hung Wei

Currently, nf_conntrack_max limits the maximum number of conntrack
entries in the conntrack table of each network namespace.  VMs and
containers that reside in the same namespace share one conntrack table,
so the total number of conntrack entries across all of them is bounded
by nf_conntrack_max.  In this case, if one VM or container abuses its
usage of conntrack entries, it prevents the others from committing valid
conntrack entries into the table.  Even if each VM could be placed in a
different network namespace, the current nf_conntrack_max configuration
is too rigid to give different VMs or containers different limits on the
number of conntrack entries.

To address this issue, this patch proposes a fine-grained mechanism
that further limits the number of conntrack entries per zone.  For
example, we can assign a different zone to each VM and set a conntrack
limit on each zone.  With this isolation, a misbehaving VM only consumes
the conntrack entries in its own zone and does not affect other
well-behaved VMs.  Moreover, users can set different conntrack limits on
different zones according to their preference.

The proposed implementation uses Netfilter's nf_conncount backend to
count the number of connections in a particular zone.  If the number of
connections is above the configured limit, ovs returns ENOMEM to
userspace.  If userspace does not configure a zone limit, the limit
defaults to zero, which means no limit and is backward compatible with
the behavior without this patch.
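
As a rough illustration of the check described above, here is a small
userspace sketch.  It is an assumption-laden model, not the patch code:
struct zone_state and zone_try_commit() are hypothetical names, and the
connection count is a plain integer here, whereas the series obtains it
from nf_conncount.  The point is only the decision rule: a limit of zero
never rejects, and a zone that already holds "limit" connections rejects
the next commit with -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ZONE_LIMIT_UNLIMITED 0u	/* zero disables the check */

/* Hypothetical per-zone state: configured limit and current usage. */
struct zone_state {
	uint32_t limit;
	uint32_t connections;
};

/* Try to account one more connection in @z.
 * Returns 0 and bumps the count on success, or -ENOMEM if doing so
 * would push the zone above its limit (the count is left unchanged).
 */
static int zone_try_commit(struct zone_state *z)
{
	assert(z != NULL);

	if (z->limit != ZONE_LIMIT_UNLIMITED && z->connections >= z->limit)
		return -ENOMEM;

	z->connections++;
	return 0;
}
```

For example, with limit = 2 the first two calls succeed and the third
returns -ENOMEM, while a zone whose limit is left at zero accepts every
call.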

The first patch adds the conntrack limit netlink definitions, and the
second patch provides the implementation.

Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  62 +++++
 net/openvswitch/Kconfig          |   3 +-
 net/openvswitch/conntrack.c      | 497 ++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h      |   9 +-
 net/openvswitch/datapath.c       |   7 +-
 net/openvswitch/datapath.h       |   1 +
 6 files changed, 573 insertions(+), 6 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: dma-mapping: bypass dma_ops for direct mappings
From: kbuild test robot @ 2018-04-16 18:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: kbuild-all, Jesper Dangaard Brouer, xdp-newbies@vger.kernel.org,
	netdev@vger.kernel.org, Christoph Hellwig, David Woodhouse,
	William Tu, Björn Töpel, Karlsson, Magnus,
	Alexander Duyck, Arnaldo Carvalho de Melo
In-Reply-To: <20180416122706.GA20624@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 1561 bytes --]

Hi Christoph,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc1 next-20180416]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Christoph-Hellwig/dma-mapping-bypass-dma_ops-for-direct-mappings/20180416-230032
config: i386-defconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/char/agp/intel-gtt.o: In function `intel_gmch_probe':
>> intel-gtt.c:(.text+0x11e4): undefined reference to `swiotlb_dma_ops'
   drivers/ata/ahci.o: In function `ahci_init_one':
>> ahci.c:(.text+0x108d): undefined reference to `swiotlb_dma_ops'
   drivers/net/ethernet/broadcom/bnx2.o: In function `bnx2_init_one':
>> bnx2.c:(.text+0x7fe7): undefined reference to `swiotlb_dma_ops'
   drivers/net/ethernet/broadcom/tg3.o: In function `tg3_init_one':
>> tg3.c:(.text+0x13549): undefined reference to `swiotlb_dma_ops'
   drivers/net/ethernet/intel/e1000/e1000_main.o: In function `e1000_probe':
>> e1000_main.c:(.text+0x49b3): undefined reference to `swiotlb_dma_ops'
   drivers/net/ethernet/intel/e1000e/netdev.o:netdev.c:(.text+0xa65e): more undefined references to `swiotlb_dma_ops' follow

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26269 bytes --]

^ permalink raw reply

* Re: [ethtool] ethtool: Add register dump support for MICROCHIP LAN78xx
From: John W. Linville @ 2018-04-16 17:57 UTC (permalink / raw)
  To: Raghuram Chary J; +Cc: davem, netdev, unglinuxdriver, woojung.huh
In-Reply-To: <20180405061128.5479-1-raghuramchary.jallipalli@microchip.com>

On Thu, Apr 05, 2018 at 11:41:28AM +0530, Raghuram Chary J wrote:
> This patch adds support for Microchip's lan78xx families
> of USB Ethernet controllers to ethtool's dump registers
> command.
> 
> This patch is for use with the lan78xx driver.
> 
> Signed-off-by: Raghuram Chary J <raghuramchary.jallipalli@microchip.com>

LGTM -- queued for next release...

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: [PATCH 2/2] net: socionext: reset hardware in ndo_stop
From: David Miller @ 2018-04-16 17:47 UTC (permalink / raw)
  To: jassisinghbrar; +Cc: netdev, masahisa.kojima, ard.biesheuvel, jaswinder.singh
In-Reply-To: <1523864399-27296-1-git-send-email-jassisinghbrar@gmail.com>

From: jassisinghbrar@gmail.com
Date: Mon, 16 Apr 2018 13:09:59 +0530

> From: Masahisa KOJIMA <masahisa.kojima@linaro.org>
> 
> When the interface is down, head/tail of the descriptor
> ring address is set to 0 in netsec_netdev_stop().
> But netsec hardware still keeps the previous descriptor
> ring address, so there is inconsistency between driver
> and hardware after interface is up at a later time.
> To address this inconsistency, add netsec_reset_hardware()
> when the interface is down.
> 
> In addition, to minimize the reset process,
> add flag to decide whether driver loads the netsec microcode.
> Even if driver resets the netsec hardware, netsec microcode
> keeps resident on RAM, so it is ok we only load the microcode
> at initialization.
> 
> This patch is critical for installation over network.
> 
> Signed-off-by: Masahisa KOJIMA <masahisa.kojima@linaro.org>
> Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
> Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>

Applied.

^ permalink raw reply

* Re: [PATCH 1/2] net: netsec: enable tx-irq during open callback
From: David Miller @ 2018-04-16 17:46 UTC (permalink / raw)
  To: jassisinghbrar; +Cc: netdev, masahisa.kojima, ard.biesheuvel, jaswinder.singh
In-Reply-To: <1523863336-12653-1-git-send-email-jassisinghbrar@gmail.com>

From: jassisinghbrar@gmail.com
Date: Mon, 16 Apr 2018 12:52:16 +0530

> From: Jassi Brar <jaswinder.singh@linaro.org>
> 
> Enable TX-irq as well during ndo_open() as we can not count upon
> RX to arrive early enough to trigger the napi. This patch is critical
> for installation over network.
> 
> Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
> Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>

Applied.

^ permalink raw reply

* Re: [PATCH] net: mediatek: use of_device_get_match_data()
From: David Miller @ 2018-04-16 17:43 UTC (permalink / raw)
  To: ryder.lee
  Cc: sean.wang, netdev, linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <9bf87205f84062580934168774d690d4cd2bf26b.1523347340.git.ryder.lee@mediatek.com>

From: Ryder Lee <ryder.lee@mediatek.com>
Date: Mon, 16 Apr 2018 10:33:41 +0800

> Using of_device_get_match_data() reduces the code size a bit.
> 
> Also, the only way to call mtk_probe() is to match an entry in
> of_mtk_match[], so match cannot be NULL.
> 
> Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>

Applied to net-next.

* [PATCH net] Count IPv6 interface receive statistics on the ingress netdev
From: Stephen Suryaputra @ 2018-04-16 17:42 UTC (permalink / raw)
  To: netdev, ja; +Cc: Stephen Suryaputra

The statistics such as InHdrErrors should be counted on the ingress
netdev rather than on the dev from the dst, which is the egress.
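
A minimal userspace sketch of the idea, assuming simplified stand-in
types (the *_model structs and helper names below are hypothetical and
are not kernel API). It only shows that the counter is looked up from
the ingress device, with a NULL-tolerant accessor shaped like
__in6_dev_get_safely(); the real __IP6_INC_STATS() also updates
per-netns counters, which this model leaves out:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct inet6_dev_model { unsigned long in_hdr_errors; };
struct net_device_model { struct inet6_dev_model *ip6_ptr; };
struct skb_model {
	struct net_device_model *dev;     /* ingress device, like skb->dev */
	struct net_device_model *dst_dev; /* egress device, like skb_dst(skb)->dev */
};

/* Same shape as __in6_dev_get_safely(): tolerate a NULL device. */
static struct inet6_dev_model *idev_get_safely(const struct net_device_model *dev)
{
	return dev ? dev->ip6_ptr : NULL;
}

/* Charge a header error to the ingress device; do nothing if it has no idev. */
static void count_in_hdr_error(const struct skb_model *skb)
{
	struct inet6_dev_model *idev;

	assert(skb != NULL);
	idev = idev_get_safely(skb->dev);
	if (idev)
		idev->in_hdr_errors++;
}
```

With an skb received on one device and routed out of another, only the
receiving device's counter moves, and an skb without a device is
ignored rather than dereferenced.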

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
---
 include/net/addrconf.h          | 14 +++++++++++
 net/ipv6/exthdrs.c              | 55 ++++++++++++++++-------------------------
 net/ipv6/ip6_input.c            |  2 +-
 net/ipv6/ip6_output.c           | 18 ++++++--------
 net/ipv6/reassembly.c           |  6 ++---
 net/ipv6/route.c                |  3 ++-
 net/netfilter/ipvs/ip_vs_xmit.c |  5 ++--
 7 files changed, 51 insertions(+), 52 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 378d601..8312cc2 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -308,6 +308,20 @@ static inline struct inet6_dev *__in6_dev_get(const struct net_device *dev)
 }
 
 /**
+ * __in6_dev_get_safely - get inet6_dev pointer from netdevice
+ * @dev: network device
+ *
+ * This is a safer version of __in6_dev_get
+ */
+static inline struct inet6_dev *__in6_dev_get_safely(const struct net_device *dev)
+{
+	if (likely(dev))
+		return rcu_dereference_rtnl(dev->ip6_ptr);
+	else
+		return NULL;
+}
+
+/**
  * in6_dev_get - get inet6_dev pointer from netdevice
  * @dev: network device
  *
diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index bc68eb6..5bc2bf3 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -280,6 +280,7 @@ static const struct tlvtype_proc tlvprocdestopt_lst[] = {
 
 static int ipv6_destopt_rcv(struct sk_buff *skb)
 {
+	struct inet6_dev *idev = __in6_dev_get(skb->dev);
 	struct inet6_skb_parm *opt = IP6CB(skb);
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
 	__u16 dstbuf;
@@ -291,7 +292,7 @@ static int ipv6_destopt_rcv(struct sk_buff *skb)
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) + 8) ||
 	    !pskb_may_pull(skb, (skb_transport_offset(skb) +
 				 ((skb_transport_header(skb)[1] + 1) << 3)))) {
-		__IP6_INC_STATS(dev_net(dst->dev), ip6_dst_idev(dst),
+		__IP6_INC_STATS(dev_net(dst->dev), idev,
 				IPSTATS_MIB_INHDRERRORS);
 fail_and_free:
 		kfree_skb(skb);
@@ -319,8 +320,7 @@ static int ipv6_destopt_rcv(struct sk_buff *skb)
 		return 1;
 	}
 
-	__IP6_INC_STATS(dev_net(dst->dev),
-			ip6_dst_idev(dst), IPSTATS_MIB_INHDRERRORS);
+	__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 	return -1;
 }
 
@@ -416,8 +416,7 @@ static int ipv6_srh_rcv(struct sk_buff *skb)
 	}
 
 	if (hdr->segments_left >= (hdr->hdrlen >> 1)) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 				  ((&hdr->segments_left) -
 				   skb_network_header(skb)));
@@ -456,8 +455,7 @@ static int ipv6_srh_rcv(struct sk_buff *skb)
 
 	if (skb_dst(skb)->dev->flags & IFF_LOOPBACK) {
 		if (ipv6_hdr(skb)->hop_limit <= 1) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INHDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 			icmpv6_send(skb, ICMPV6_TIME_EXCEED,
 				    ICMPV6_EXC_HOPLIMIT, 0);
 			kfree_skb(skb);
@@ -481,10 +479,10 @@ static int ipv6_srh_rcv(struct sk_buff *skb)
 /* called with rcu_read_lock() */
 static int ipv6_rthdr_rcv(struct sk_buff *skb)
 {
+	struct inet6_dev *idev = __in6_dev_get(skb->dev);
 	struct inet6_skb_parm *opt = IP6CB(skb);
 	struct in6_addr *addr = NULL;
 	struct in6_addr daddr;
-	struct inet6_dev *idev;
 	int n, i;
 	struct ipv6_rt_hdr *hdr;
 	struct rt0_hdr *rthdr;
@@ -498,8 +496,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 	if (!pskb_may_pull(skb, skb_transport_offset(skb) + 8) ||
 	    !pskb_may_pull(skb, (skb_transport_offset(skb) +
 				 ((skb_transport_header(skb)[1] + 1) << 3)))) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		kfree_skb(skb);
 		return -1;
 	}
@@ -508,8 +505,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 
 	if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr) ||
 	    skb->pkt_type != PACKET_HOST) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INADDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 		kfree_skb(skb);
 		return -1;
 	}
@@ -527,7 +523,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 			 * processed by own
 			 */
 			if (!addr) {
-				__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+				__IP6_INC_STATS(net, idev,
 						IPSTATS_MIB_INADDRERRORS);
 				kfree_skb(skb);
 				return -1;
@@ -553,8 +549,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 			goto unknown_rh;
 		/* Silently discard invalid RTH type 2 */
 		if (hdr->hdrlen != 2 || hdr->segments_left != 1) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INHDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 			kfree_skb(skb);
 			return -1;
 		}
@@ -572,8 +567,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 	n = hdr->hdrlen >> 1;
 
 	if (hdr->segments_left > n) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 				  ((&hdr->segments_left) -
 				   skb_network_header(skb)));
@@ -609,14 +603,12 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 		if (xfrm6_input_addr(skb, (xfrm_address_t *)addr,
 				     (xfrm_address_t *)&ipv6_hdr(skb)->saddr,
 				     IPPROTO_ROUTING) < 0) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INADDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 			kfree_skb(skb);
 			return -1;
 		}
 		if (!ipv6_chk_home_addr(dev_net(skb_dst(skb)->dev), addr)) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INADDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 			kfree_skb(skb);
 			return -1;
 		}
@@ -627,8 +619,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 	}
 
 	if (ipv6_addr_is_multicast(addr)) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-				IPSTATS_MIB_INADDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 		kfree_skb(skb);
 		return -1;
 	}
@@ -647,8 +638,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 
 	if (skb_dst(skb)->dev->flags&IFF_LOOPBACK) {
 		if (ipv6_hdr(skb)->hop_limit <= 1) {
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-					IPSTATS_MIB_INHDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 			icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT,
 				    0);
 			kfree_skb(skb);
@@ -663,7 +653,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb)
 	return -1;
 
 unknown_rh:
-	__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_INHDRERRORS);
+	__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 	icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 			  (&hdr->type) - skb_network_header(skb));
 	return -1;
@@ -755,34 +745,31 @@ static bool ipv6_hop_ra(struct sk_buff *skb, int optoff)
 static bool ipv6_hop_jumbo(struct sk_buff *skb, int optoff)
 {
 	const unsigned char *nh = skb_network_header(skb);
+	struct inet6_dev *idev = __in6_dev_get_safely(skb->dev);
 	struct net *net = ipv6_skb_net(skb);
 	u32 pkt_len;
 
 	if (nh[optoff + 1] != 4 || (optoff & 3) != 2) {
 		net_dbg_ratelimited("ipv6_hop_jumbo: wrong jumbo opt length/alignment %d\n",
 				    nh[optoff+1]);
-		__IP6_INC_STATS(net, ipv6_skb_idev(skb),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		goto drop;
 	}
 
 	pkt_len = ntohl(*(__be32 *)(nh + optoff + 2));
 	if (pkt_len <= IPV6_MAXPLEN) {
-		__IP6_INC_STATS(net, ipv6_skb_idev(skb),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, optoff+2);
 		return false;
 	}
 	if (ipv6_hdr(skb)->payload_len) {
-		__IP6_INC_STATS(net, ipv6_skb_idev(skb),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, optoff);
 		return false;
 	}
 
 	if (pkt_len > skb->len - sizeof(struct ipv6hdr)) {
-		__IP6_INC_STATS(net, ipv6_skb_idev(skb),
-				IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	}
 
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 9ee208a..f08d344 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -336,7 +336,7 @@ int ip6_mc_input(struct sk_buff *skb)
 	bool deliver;
 
 	__IP6_UPD_PO_STATS(dev_net(skb_dst(skb)->dev),
-			 ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_INMCAST,
+			 __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INMCAST,
 			 skb->len);
 
 	hdr = ipv6_hdr(skb);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 2e891d2..a39b04f9 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -425,6 +425,7 @@ static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 
 int ip6_forward(struct sk_buff *skb)
 {
+	struct inet6_dev *idev = __in6_dev_get_safely(skb->dev);
 	struct dst_entry *dst = skb_dst(skb);
 	struct ipv6hdr *hdr = ipv6_hdr(skb);
 	struct inet6_skb_parm *opt = IP6CB(skb);
@@ -444,8 +445,7 @@ int ip6_forward(struct sk_buff *skb)
 		goto drop;
 
 	if (!xfrm6_policy_check(NULL, XFRM_POLICY_FWD, skb)) {
-		__IP6_INC_STATS(net, ip6_dst_idev(dst),
-				IPSTATS_MIB_INDISCARDS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -476,8 +476,7 @@ int ip6_forward(struct sk_buff *skb)
 		/* Force OUTPUT device used as source address */
 		skb->dev = dst->dev;
 		icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, 0);
-		__IP6_INC_STATS(net, ip6_dst_idev(dst),
-				IPSTATS_MIB_INHDRERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 
 		kfree_skb(skb);
 		return -ETIMEDOUT;
@@ -490,15 +489,13 @@ int ip6_forward(struct sk_buff *skb)
 		if (proxied > 0)
 			return ip6_input(skb);
 		else if (proxied < 0) {
-			__IP6_INC_STATS(net, ip6_dst_idev(dst),
-					IPSTATS_MIB_INDISCARDS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
 			goto drop;
 		}
 	}
 
 	if (!xfrm6_route_forward(skb)) {
-		__IP6_INC_STATS(net, ip6_dst_idev(dst),
-				IPSTATS_MIB_INDISCARDS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 	dst = skb_dst(skb);
@@ -554,8 +551,7 @@ int ip6_forward(struct sk_buff *skb)
 		/* Again, force OUTPUT device used as source address */
 		skb->dev = dst->dev;
 		icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
-		__IP6_INC_STATS(net, ip6_dst_idev(dst),
-				IPSTATS_MIB_INTOOBIGERRORS);
+		__IP6_INC_STATS(net, idev, IPSTATS_MIB_INTOOBIGERRORS);
 		__IP6_INC_STATS(net, ip6_dst_idev(dst),
 				IPSTATS_MIB_FRAGFAILS);
 		kfree_skb(skb);
@@ -579,7 +575,7 @@ int ip6_forward(struct sk_buff *skb)
 		       ip6_forward_finish);
 
 error:
-	__IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_INADDRERRORS);
+	__IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS);
 drop:
 	kfree_skb(skb);
 	return -EINVAL;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 4979610..2cdf3dc 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -179,7 +179,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 			((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
 
 	if ((unsigned int)end > IPV6_MAXPLEN) {
-		__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+		__IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev),
 				IPSTATS_MIB_INHDRERRORS);
 		icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 				  ((u8 *)&fhdr->frag_off -
@@ -214,7 +214,7 @@ static int ip6_frag_queue(struct frag_queue *fq, struct sk_buff *skb,
 			/* RFC2460 says always send parameter problem in
 			 * this case. -DaveM
 			 */
-			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+			__IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev),
 					IPSTATS_MIB_INHDRERRORS);
 			icmpv6_param_prob(skb, ICMPV6_HDR_FIELD,
 					  offsetof(struct ipv6hdr, payload_len));
@@ -536,7 +536,7 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
 	return -1;
 
 fail_hdr:
-	__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
+	__IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev),
 			IPSTATS_MIB_INHDRERRORS);
 	icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, skb_network_header_len(skb));
 	return -1;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 49b954d..1d738bf 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3541,7 +3541,8 @@ static int ip6_pkt_drop(struct sk_buff *skb, u8 code, int ipstats_mib_noroutes)
 	case IPSTATS_MIB_INNOROUTES:
 		type = ipv6_addr_type(&ipv6_hdr(skb)->daddr);
 		if (type == IPV6_ADDR_ANY) {
-			IP6_INC_STATS(dev_net(dst->dev), ip6_dst_idev(dst),
+			IP6_INC_STATS(dev_net(dst->dev),
+				      __in6_dev_get_safely(skb->dev),
 				      IPSTATS_MIB_INADDRERRORS);
 			break;
 		}
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 4527921..ba0a0fd 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -266,12 +266,13 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 
 		/* check and decrement ttl */
 		if (ipv6_hdr(skb)->hop_limit <= 1) {
+			struct inet6_dev *idev = __in6_dev_get_safely(skb->dev);
+
 			/* Force OUTPUT device used as source address */
 			skb->dev = dst->dev;
 			icmpv6_send(skb, ICMPV6_TIME_EXCEED,
 				    ICMPV6_EXC_HOPLIMIT, 0);
-			__IP6_INC_STATS(net, ip6_dst_idev(dst),
-					IPSTATS_MIB_INHDRERRORS);
+			__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
 
 			return false;
 		}
-- 
2.7.4

* [PATCH net-next 5/5] selftests: net: add tcp_mmap program
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

This is a reference program showing how mmap() can be used
on TCP flows to implement receive zero copy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 tools/testing/selftests/net/Makefile   |   2 +
 tools/testing/selftests/net/tcp_mmap.c | 437 +++++++++++++++++++++++++
 2 files changed, 439 insertions(+)
 create mode 100644 tools/testing/selftests/net/tcp_mmap.c

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 785fc18a16b4701f3ef875b60648726750b0cd26..23e725f4305eae6bc23fe705c8d3fe262110395a 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -8,9 +8,11 @@ TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetl
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
+TEST_GEN_FILES += tcp_mmap
 TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict
 
 include ../lib.mk
 
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
+$(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/tcp_mmap.c b/tools/testing/selftests/net/tcp_mmap.c
new file mode 100644
index 0000000000000000000000000000000000000000..dea342fe6f4e88b5709d2ac37b2fc9a2a320bf44
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_mmap.c
@@ -0,0 +1,437 @@
+/*
+ * Copyright 2018 Google Inc.
+ * Author: Eric Dumazet (edumazet@google.com)
+ *
+ * Reference program demonstrating tcp mmap() usage,
+ * and SO_RCVLOWAT hints for receiver.
+ *
+ * Note : NIC with header split is needed to use mmap() on TCP :
+ * Each incoming frame must be a multiple of PAGE_SIZE bytes of TCP payload.
+ *
+ * How to use on loopback interface :
+ *
+ *  ifconfig lo mtu 61512  # 15*4096 + 40 (ipv6 header) + 32 (TCP with TS option header)
+ *  tcp_mmap -s -z &
+ *  tcp_mmap -H ::1 -z
+ *
+ *  Or leave default lo mtu, but use -M option to set TCP_MAXSEG option to (4096 + 12)
+ *      (4096 : page size on x86, 12: TCP TS option length)
+ *  tcp_mmap -s -z -M $((4096+12)) &
+ *  tcp_mmap -H ::1 -z -M $((4096+12))
+ *
+ * Note: -z option on sender uses MSG_ZEROCOPY, which forces a copy when packets go through loopback interface.
+ *       We might use sendfile() instead, but really this test program is about mmap(), for receivers ;)
+ *
+ * $ ./tcp_mmap -s &                                 # Without mmap()
+ * $ for i in {1..4}; do ./tcp_mmap -H ::1 -z ; done
+ * received 32768 MB (0 % mmap'ed) in 14.1157 s, 19.4732 Gbit
+ *   cpu usage user:0.057 sys:7.815, 240.234 usec per MB, 65531 c-switches
+ * received 32768 MB (0 % mmap'ed) in 14.6833 s, 18.7204 Gbit
+ *  cpu usage user:0.043 sys:8.103, 248.596 usec per MB, 65524 c-switches
+ * received 32768 MB (0 % mmap'ed) in 11.143 s, 24.6682 Gbit
+ *   cpu usage user:0.044 sys:6.576, 202.026 usec per MB, 65519 c-switches
+ * received 32768 MB (0 % mmap'ed) in 14.9056 s, 18.4413 Gbit
+ *   cpu usage user:0.036 sys:8.193, 251.129 usec per MB, 65530 c-switches
+ * $ kill %1   # kill tcp_mmap server
+ *
+ * $ ./tcp_mmap -s -z &                              # With mmap()
+ * $ for i in {1..4}; do ./tcp_mmap -H ::1 -z ; done
+ * received 32768 MB (99.9939 % mmap'ed) in 6.73792 s, 40.7956 Gbit
+ *   cpu usage user:0.045 sys:2.827, 87.6465 usec per MB, 65532 c-switches
+ * received 32768 MB (99.9939 % mmap'ed) in 7.26732 s, 37.8238 Gbit
+ *   cpu usage user:0.037 sys:3.087, 95.3369 usec per MB, 65532 c-switches
+ * received 32768 MB (99.9939 % mmap'ed) in 7.61661 s, 36.0893 Gbit
+ *   cpu usage user:0.046 sys:3.559, 110.016 usec per MB, 65529 c-switches
+ * received 32768 MB (99.9939 % mmap'ed) in 7.43764 s, 36.9577 Gbit
+ *   cpu usage user:0.035 sys:3.467, 106.873 usec per MB, 65530 c-switches
+ *
+ * License (GPLv2):
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+#define _GNU_SOURCE
+#include <pthread.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <error.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <sys/resource.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <errno.h>
+#include <time.h>
+#include <sys/time.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <arpa/inet.h>
+#include <poll.h>
+
+#ifndef MSG_ZEROCOPY
+#define MSG_ZEROCOPY    0x4000000
+#endif
+
+#define FILE_SZ (1UL << 35)
+static int cfg_family = AF_INET6;
+static socklen_t cfg_alen = sizeof(struct sockaddr_in6);
+static int cfg_port = 8787;
+
+static int rcvbuf; /* Default: autotuning.  Can be set with -r <integer> option */
+static int sndbuf; /* Default: autotuning.  Can be set with -w <integer> option */
+static int zflg; /* zero copy option. (MSG_ZEROCOPY for sender, mmap() for receiver) */
+static int xflg; /* hash received data (simple xor) (-x option) */
+static int keepflag; /* -k option: receiver shall keep all received file in memory (no munmap() calls) */
+
+static int chunk_size  = 512*1024;
+
+unsigned long htotal;
+
+static inline void prefetch(const void *x)
+{
+#if defined(__x86_64__)
+	asm volatile("prefetcht0 %P0" : : "m" (*(const char *)x));
+#endif
+}
+
+void hash_zone(void *zone, unsigned int length)
+{
+	unsigned long temp = htotal;
+
+	while (length >= 8*sizeof(long)) {
+		prefetch(zone + 384);
+		temp ^= *(unsigned long *)zone;
+		temp ^= *(unsigned long *)(zone + sizeof(long));
+		temp ^= *(unsigned long *)(zone + 2*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 3*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 4*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 5*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 6*sizeof(long));
+		temp ^= *(unsigned long *)(zone + 7*sizeof(long));
+		zone += 8*sizeof(long);
+		length -= 8*sizeof(long);
+	}
+	while (length >= 1) {
+		temp ^= *(unsigned char *)zone;
+		zone += 1;
+		length--;
+	}
+	htotal = temp;
+}
+
+void *child_thread(void *arg)
+{
+	unsigned long total_mmap = 0, total = 0;
+	unsigned long delta_usec;
+	int flags = MAP_SHARED;
+	struct timeval t0, t1;
+	char *buffer = NULL;
+	void *oaddr = NULL;
+	double throughput;
+	struct rusage ru;
+	int lu, fd;
+
+	fd = (int)(unsigned long)arg;
+
+	gettimeofday(&t0, NULL);
+
+	fcntl(fd, F_SETFL, O_NDELAY);
+	buffer = malloc(chunk_size);
+	if (!buffer) {
+		perror("malloc");
+		goto error;
+	}
+	while (1) {
+		struct pollfd pfd = { .fd = fd, .events = POLLIN, };
+		int sub;
+
+		poll(&pfd, 1, 10000);
+		if (zflg) {
+			void *naddr;
+
+			naddr = mmap(oaddr, chunk_size, PROT_READ, flags, fd, 0);
+			if (naddr == (void *)-1) {
+				if (errno == EAGAIN) {
+					/* That is if SO_RCVLOWAT is buggy */
+					usleep(1000);
+					continue;
+				}
+				if (errno == EINVAL) {
+					flags = MAP_SHARED;
+					oaddr = NULL;
+					goto fallback;
+				}
+				if (errno != EIO)
+					perror("mmap()");
+				break;
+			}
+			total_mmap += chunk_size;
+			if (xflg)
+				hash_zone(naddr, chunk_size);
+			total += chunk_size;
+			if (!keepflag) {
+				flags |= MAP_FIXED;
+				oaddr = naddr;
+			}
+			continue;
+		}
+fallback:
+		sub = 0;
+		while (sub < chunk_size) {
+			lu = read(fd, buffer + sub, chunk_size - sub);
+			if (lu == 0)
+				goto end;
+			if (lu < 0)
+				break;
+			if (xflg)
+				hash_zone(buffer + sub, lu);
+			total += lu;
+			sub += lu;
+		}
+	}
+end:
+	gettimeofday(&t1, NULL);
+	delta_usec = (t1.tv_sec - t0.tv_sec) * 1000000 + t1.tv_usec - t0.tv_usec;
+
+	throughput = 0;
+	if (delta_usec)
+		throughput = total * 8.0 / (double)delta_usec / 1000.0;
+	getrusage(RUSAGE_THREAD, &ru);
+	if (total > 1024*1024) {
+		unsigned long total_usec;
+		unsigned long mb = total >> 20;
+		total_usec = 1000000*ru.ru_utime.tv_sec + ru.ru_utime.tv_usec +
+			     1000000*ru.ru_stime.tv_sec + ru.ru_stime.tv_usec;
+		printf("received %lg MB (%lg %% mmap'ed) in %lg s, %lg Gbit\n"
+		       "  cpu usage user:%lg sys:%lg, %lg usec per MB, %lu c-switches\n",
+				total / (1024.0 * 1024.0),
+				100.0*total_mmap/total,
+				(double)delta_usec / 1000000.0,
+				throughput,
+				(double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1000000.0,
+				(double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1000000.0,
+				(double)total_usec/mb,
+				ru.ru_nvcsw);
+	}
+error:
+	free(buffer);
+	close(fd);
+	pthread_exit(0);
+}
+
+static void apply_rcvsnd_buf(int fd)
+{
+	if (rcvbuf && setsockopt(fd, SOL_SOCKET,
+				 SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) == -1) {
+		perror("setsockopt SO_RCVBUF");
+	}
+
+	if (sndbuf && setsockopt(fd, SOL_SOCKET,
+				 SO_SNDBUF, &sndbuf, sizeof(sndbuf)) == -1) {
+		perror("setsockopt SO_SNDBUF");
+	}
+}
+
+
+static void setup_sockaddr(int domain, const char *str_addr,
+			   struct sockaddr_storage *sockaddr)
+{
+	struct sockaddr_in6 *addr6 = (void *) sockaddr;
+	struct sockaddr_in *addr4 = (void *) sockaddr;
+
+	switch (domain) {
+	case PF_INET:
+		memset(addr4, 0, sizeof(*addr4));
+		addr4->sin_family = AF_INET;
+		addr4->sin_port = htons(cfg_port);
+		if (str_addr &&
+		    inet_pton(AF_INET, str_addr, &(addr4->sin_addr)) != 1)
+			error(1, 0, "ipv4 parse error: %s", str_addr);
+		break;
+	case PF_INET6:
+		memset(addr6, 0, sizeof(*addr6));
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_port = htons(cfg_port);
+		if (str_addr &&
+		    inet_pton(AF_INET6, str_addr, &(addr6->sin6_addr)) != 1)
+			error(1, 0, "ipv6 parse error: %s", str_addr);
+		break;
+	default:
+		error(1, 0, "illegal domain");
+	}
+}
+
+static void do_accept(int fdlisten)
+{
+	if (setsockopt(fdlisten, SOL_SOCKET, SO_RCVLOWAT,
+		       &chunk_size, sizeof(chunk_size)) == -1) {
+		perror("setsockopt SO_RCVLOWAT");
+	}
+
+	apply_rcvsnd_buf(fdlisten);
+
+	while (1) {
+		struct sockaddr_in addr;
+		socklen_t addrlen = sizeof(addr);
+		pthread_t th;
+		int fd, res;
+
+		fd = accept(fdlisten, (struct sockaddr *)&addr, &addrlen);
+		if (fd == -1) {
+			perror("accept");
+			continue;
+		}
+		res = pthread_create(&th, NULL, child_thread,
+				     (void *)(unsigned long)fd);
+		if (res) {
+			errno = res;
+			perror("pthread_create");
+			close(fd);
+		}
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	struct sockaddr_storage listenaddr, addr;
+	unsigned int max_pacing_rate = 0;
+	unsigned long total = 0;
+	char *host = NULL;
+	int fd, c, on = 1;
+	char *buffer;
+	int sflg = 0;
+	int mss = 0;
+
+	while ((c = getopt(argc, argv, "46p:svr:w:H:zxkP:M:")) != -1) {
+		switch (c) {
+		case '4':
+			cfg_family = PF_INET;
+			cfg_alen = sizeof(struct sockaddr_in);
+			break;
+		case '6':
+			cfg_family = PF_INET6;
+			cfg_alen = sizeof(struct sockaddr_in6);
+			break;
+		case 'p':
+			cfg_port = atoi(optarg);
+			break;
+		case 'H':
+			host = optarg;
+			break;
+		case 's': /* server : listen for incoming connections */
+			sflg++;
+			break;
+		case 'r':
+			rcvbuf = atoi(optarg);
+			break;
+		case 'w':
+			sndbuf = atoi(optarg);
+			break;
+		case 'z':
+			zflg = 1;
+			break;
+		case 'M':
+			mss = atoi(optarg);
+			break;
+		case 'x':
+			xflg = 1;
+			break;
+		case 'k':
+			keepflag = 1;
+			break;
+		case 'P':
+			max_pacing_rate = atoi(optarg) ;
+			break;
+		default:
+			exit(1);
+		}
+	}
+	if (sflg) {
+		int fdlisten = socket(cfg_family, SOCK_STREAM, 0);
+
+		if (fdlisten == -1) {
+			perror("socket");
+			exit(1);
+		}
+		apply_rcvsnd_buf(fdlisten);
+		setsockopt(fdlisten, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
+
+		setup_sockaddr(cfg_family, host, &listenaddr);
+
+		if (mss &&
+		    setsockopt(fdlisten, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) {
+			perror("setsockopt TCP_MAXSEG");
+			exit(1);
+		}
+		if (bind(fdlisten, (const struct sockaddr *)&listenaddr, cfg_alen) == -1) {
+			perror("bind");
+			exit(1);
+		}
+		if (listen(fdlisten, 128) == -1) {
+			perror("listen");
+			exit(1);
+		}
+		do_accept(fdlisten);
+	}
+	buffer = mmap(NULL, chunk_size, PROT_READ | PROT_WRITE,
+			      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (buffer == (char *)-1) {
+		perror("mmap");
+		exit(1);
+	}
+
+	fd = socket(cfg_family, SOCK_STREAM, 0);
+	if (fd == -1) {
+		perror("socket");
+		exit(1);
+	}
+	apply_rcvsnd_buf(fd);
+
+	setup_sockaddr(cfg_family, host, &addr);
+
+	if (mss &&
+	    setsockopt(fd, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) {
+		perror("setsockopt TCP_MAXSEG");
+		exit(1);
+	}
+	if (connect(fd, (const struct sockaddr *)&addr, cfg_alen) == -1) {
+		perror("connect");
+		exit(1);
+	}
+	if (max_pacing_rate &&
+	    setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
+		       &max_pacing_rate, sizeof(max_pacing_rate)) == -1)
+		perror("setsockopt SO_MAX_PACING_RATE");
+
+	if (zflg && setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY,
+			       &on, sizeof(on)) == -1) {
+		perror("setsockopt SO_ZEROCOPY, (-z option disabled)");
+		zflg = 0;
+	}
+	while (total < FILE_SZ) {
+		long wr = FILE_SZ - total;
+
+		if (wr > chunk_size)
+			wr = chunk_size;
+		/* Note : we just want to fill the pipe with 0 bytes */
+		wr = send(fd, buffer, wr, zflg ? MSG_ZEROCOPY : 0);
+		if (wr <= 0)
+			break;
+		total += wr;
+	}
+	close(fd);
+	munmap(buffer, chunk_size);
+	return 0;
+}
-- 
2.17.0.484.g0c8726318c-goog

* [PATCH net-next 4/5] tcp: implement mmap() for zero copy receive
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

Some networks can ensure that TCP payload exactly fits 4KB pages,
given a well-chosen MSS/MTU and a suitable architecture.

Implement mmap() system call so that applications can avoid
copying data without complex splice() games.

Note that a successful mmap() of X bytes on a TCP socket consumes
those bytes, as if recvmsg() had been called (tp->copied_seq += X).

Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read only.

If tcp_mmap() finds data that does not fill a whole page, or a patch of
urgent data, -EINVAL is returned and no bytes are consumed.

The application must then fall back to recvmsg() to read the
problematic sequence.

mmap() won't block, regardless of the socket being in blocking or
non-blocking mode. If not enough bytes are in the receive queue,
mmap() returns -EAGAIN, or -EIO if the socket is in a state
where no more bytes can be added to the receive queue.

An application might use SO_RCVLOWAT, poll() and/or ioctl(FIONREAD)
to use mmap() efficiently.
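
A rough sketch of how those three calls could be combined on the
receiver before attempting a chunk-sized mmap(). The helper names
(set_rcvlowat, bytes_readable, chunk_ready) are made up for this
illustration; only the underlying setsockopt(), ioctl() and poll()
calls are real:

```c
#include <assert.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Ask to be woken only once at least `chunk` bytes are queued. */
static int set_rcvlowat(int fd, int chunk)
{
	return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &chunk, sizeof(chunk));
}

/* Number of bytes currently queued for reading, or -1 on error. */
static int bytes_readable(int fd)
{
	int n = 0;

	if (ioctl(fd, FIONREAD, &n) == -1)
		return -1;
	return n;
}

/* Wait up to timeout_ms, then report whether a full chunk is queued:
 * 1 = yes, 0 = not yet, -1 = error.
 */
static int chunk_ready(int fd, int chunk, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n;

	assert(chunk > 0);
	if (poll(&pfd, 1, timeout_ms) == -1)
		return -1;
	n = bytes_readable(fd);
	if (n == -1)
		return -1;
	return n >= chunk;
}
```

The FIONREAD check after poll() is deliberate: it keeps the caller from
issuing an mmap() that would only return -EAGAIN if the wakeup arrived
before a whole chunk was queued.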

On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.

Tested:

mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168  (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)
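
The MTU arithmetic can be written out as a small helper. This is only a
worked example under the assumptions of the test setup (4096-byte
pages, IPv6, TCP timestamps enabled); the constant and function names
are chosen for this illustration:

```c
#include <assert.h>

enum {
	PAGE_BYTES   = 4096, /* page size on x86 */
	IPV6_HDR_LEN = 40,   /* fixed IPv6 header */
	TCP_HDR_LEN  = 32,   /* 20-byte TCP header + 12 bytes of timestamp option */
};

/* MTU needed so that each frame carries exactly `pages` full pages of payload. */
static int mtu_for_pages(int pages)
{
	assert(pages > 0);
	return pages * PAGE_BYTES + IPV6_HDR_LEN + TCP_HDR_LEN;
}
```

mtu_for_pages(1) gives the 4168 used here, and mtu_for_pages(15) gives
the 61512 that the tcp_mmap program suggests for the loopback
interface.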

Without mmap() (tcp_mmap -s)

received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
  cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
  cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
  cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
  cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches

With mmap() on receiver (tcp_mmap -s -z)

received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
  cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
  cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
  cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
  cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches
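
For reference, the derived figures in these reports follow from the raw
byte and time counts. The sketch below mirrors the arithmetic used by
the tcp_mmap program (function names are chosen for this illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Gbit/s from bytes received and elapsed microseconds:
 * bits per microsecond is Mbit/s, so divide by 1000 for Gbit/s.
 */
static double gbit_per_sec(uint64_t total_bytes, uint64_t delta_usec)
{
	assert(delta_usec > 0);
	return total_bytes * 8.0 / (double)delta_usec / 1000.0;
}

/* CPU microseconds (user + sys) spent per MB received. */
static double cpu_usec_per_mb(uint64_t cpu_usec, uint64_t total_bytes)
{
	uint64_t mb = total_bytes >> 20;

	assert(mb > 0);
	return (double)cpu_usec / (double)mb;
}
```

Taking the first mmap() run: 32768 MB is 2^35 bytes, received in
8.03083 s, which works out to about 34.23 Gbit; user 0.024 s plus
sys 1.466 s is 1490000 usec, or about 45.47 usec per MB.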

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h   |   2 +
 net/ipv4/af_inet.c  |   2 +-
 net/ipv4/tcp.c      | 113 ++++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/af_inet6.c |   2 +-
 4 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0ee85c47c185afcb8e1017d59e02313cb5df78ec..833154e3df173ea41aa16dd1ec739a175c679c5c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -404,6 +404,8 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		int flags, int *addr_len);
 int tcp_set_rcvlowat(struct sock *sk, int val);
 void tcp_data_ready(struct sock *sk);
+int tcp_mmap(struct file *file, struct socket *sock,
+	     struct vm_area_struct *vma);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx,
 		       int estab, struct tcp_fastopen_cookie *foc);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f5c562aaef3522519bcf1ae37782a7e14e278723..3ebf599cebaea4926decc1aad7274b12ec7e1566 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -994,7 +994,7 @@ const struct proto_ops inet_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
 	.recvmsg	   = inet_recvmsg,
-	.mmap		   = sock_no_mmap,
+	.mmap		   = tcp_mmap,
 	.sendpage	   = inet_sendpage,
 	.splice_read	   = tcp_splice_read,
 	.read_sock	   = tcp_read_sock,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c768d306b65714bb8740c60110c43042508af6b7..438fbca96cd3100d722e1bd8bcc6f49624495a21 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1726,6 +1726,119 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_set_rcvlowat);
 
+/* When user wants to mmap X pages, we first need to perform the mapping
+ * before freeing any skbs in receive queue, otherwise user would be unable
+ * to fallback to standard recvmsg(). This happens if some data in the
+ * requested block is not exactly fitting in a page.
+ *
+ * We only support order-0 pages for the moment.
+ * mmap() on TCP is very strict, there is no point
+ * trying to accommodate with pathological layouts.
+ */
+int tcp_mmap(struct file *file, struct socket *sock,
+	     struct vm_area_struct *vma)
+{
+	unsigned long size = vma->vm_end - vma->vm_start;
+	unsigned int nr_pages = size >> PAGE_SHIFT;
+	struct page **pages_array = NULL;
+	u32 seq, len, offset, nr = 0;
+	struct sock *sk = sock->sk;
+	const skb_frag_t *frags;
+	struct tcp_sock *tp;
+	struct sk_buff *skb;
+	int ret;
+
+	if (vma->vm_pgoff || !nr_pages)
+		return -EINVAL;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+	/* TODO: Maybe the following is not needed if pages are COW */
+	vma->vm_flags &= ~VM_MAYWRITE;
+
+	lock_sock(sk);
+
+	ret = -ENOTCONN;
+	if (sk->sk_state == TCP_LISTEN)
+		goto out;
+
+	sock_rps_record_flow(sk);
+
+	if (tcp_inq(sk) < size) {
+		ret = sock_flag(sk, SOCK_DONE) ? -EIO : -EAGAIN;
+		goto out;
+	}
+	tp = tcp_sk(sk);
+	seq = tp->copied_seq;
+	/* Abort if urgent data is in the area */
+	if (unlikely(tp->urg_data)) {
+		u32 urg_offset = tp->urg_seq - seq;
+
+		ret = -EINVAL;
+		if (urg_offset < size)
+			goto out;
+	}
+	ret = -ENOMEM;
+	pages_array = kvmalloc_array(nr_pages, sizeof(struct page *),
+				     GFP_KERNEL);
+	if (!pages_array)
+		goto out;
+	skb = tcp_recv_skb(sk, seq, &offset);
+	ret = -EINVAL;
+skb_start:
+	/* We do not support anything not in page frags */
+	offset -= skb_headlen(skb);
+	if ((int)offset < 0)
+		goto out;
+	if (skb_has_frag_list(skb))
+		goto out;
+	len = skb->data_len - offset;
+	frags = skb_shinfo(skb)->frags;
+	while (offset) {
+		if (frags->size > offset)
+			goto out;
+		offset -= frags->size;
+		frags++;
+	}
+	while (nr < nr_pages) {
+		if (len) {
+			if (len < PAGE_SIZE)
+				goto out;
+			if (frags->size != PAGE_SIZE || frags->page_offset)
+				goto out;
+			pages_array[nr++] = skb_frag_page(frags);
+			frags++;
+			len -= PAGE_SIZE;
+			seq += PAGE_SIZE;
+			continue;
+		}
+		skb = skb->next;
+		offset = seq - TCP_SKB_CB(skb)->seq;
+		goto skb_start;
+	}
+	/* OK, we have a full set of pages ready to be inserted into vma */
+	for (nr = 0; nr < nr_pages; nr++) {
+		ret = vm_insert_page(vma, vma->vm_start + (nr << PAGE_SHIFT),
+				     pages_array[nr]);
+		if (ret)
+			goto out;
+	}
+	/* operation is complete, we can 'consume' all skbs */
+	tp->copied_seq = seq;
+	tcp_rcv_space_adjust(sk);
+
+	/* Clean up data we have read: This will do ACK frames. */
+	tcp_recv_skb(sk, seq, &offset);
+	tcp_cleanup_rbuf(sk, size);
+
+	ret = 0;
+out:
+	release_sock(sk);
+	kvfree(pages_array);
+	return ret;
+}
+EXPORT_SYMBOL(tcp_mmap);
+
 static void tcp_update_recv_tstamps(struct sk_buff *skb,
 				    struct scm_timestamping *tss)
 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index e70d59fb26e16ace1eb484d23964946092a2cd57..2c694912df2e77b414de5cc2aa43e2ec59286836 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -579,7 +579,7 @@ const struct proto_ops inet6_stream_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet_sendmsg,		/* ok		*/
 	.recvmsg	   = inet_recvmsg,		/* ok		*/
-	.mmap		   = sock_no_mmap,
+	.mmap		   = tcp_mmap,
 	.sendpage	   = inet_sendpage,
 	.sendmsg_locked    = tcp_sendmsg_locked,
 	.sendpage_locked   = tcp_sendpage_locked,
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related
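
The layout restriction enforced by tcp_mmap() in the patch above (every
fragment must be exactly one page long and start at offset 0) can be
modelled in user space. This is only a sketch under simplifying
assumptions: all skbs are flattened into one array, the linear-head and
frag_list checks are left out, and struct frag_model and
frags_are_mappable() are hypothetical names, not kernel API.

```c
#include <stddef.h>

/* Hypothetical stand-in for skb_frag_t: only the two fields the check reads. */
struct frag_model {
	unsigned int size;
	unsigned int page_offset;
};

/*
 * Model of the page-collection loop: succeed only if the first nr_pages
 * fragments are each exactly one page long and start at offset 0.
 * Returns 0 on success and -1 otherwise (the patch uses -EINVAL for a
 * bad layout and reports a short receive queue earlier, separately).
 */
int frags_are_mappable(const struct frag_model *frags, size_t nr_frags,
		       size_t nr_pages, unsigned int page_size)
{
	size_t nr;

	if (nr_pages == 0 || nr_frags < nr_pages)
		return -1;
	for (nr = 0; nr < nr_pages; nr++) {
		if (frags[nr].size != page_size || frags[nr].page_offset != 0)
			return -1;
	}
	return 0;
}
```

A queue filled with 1448-byte fragments therefore fails the check, and
the caller is expected to fall back to recvmsg().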

* [PATCH net-next 3/5] tcp: avoid extra wakeups for SO_RCVLOWAT users
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only
generated when enough bytes are available in the receive queue, after
David's change (commit c7004482e8dc "tcp: Respect SO_RCVLOWAT in tcp_poll().")

But TCP still calls sk->sk_data_ready() for each chunk added to the receive
queue, meaning the thread is woken up and goes back to sleep shortly after.
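
The effect of gating the wakeup on the low-water mark can be sketched
with a small user-space model. It assumes that a woken reader drains
everything that is queued and that end of stream always wakes the
reader; data_ready_model() and count_wakeups() are hypothetical names
used for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Gating rule: wake the reader only at the low-water mark or at end of stream. */
bool data_ready_model(size_t avail, size_t rcvlowat, bool done)
{
	return avail >= rcvlowat || done;
}

/*
 * Count wakeups for `total` bytes arriving in `seg`-byte chunks
 * (seg must be non-zero). Assumption: a woken reader drains the queue.
 */
size_t count_wakeups(size_t total, size_t seg, size_t rcvlowat)
{
	size_t received = 0, avail = 0, wakeups = 0;

	while (received < total) {
		size_t left = total - received;
		size_t chunk = left < seg ? left : seg;

		received += chunk;
		avail += chunk;
		if (data_ready_model(avail, rcvlowat, false)) {
			wakeups++;
			avail = 0;
		}
	}
	/* End of stream always signals the reader if data is still queued. */
	if (avail > 0 && data_ready_model(avail, rcvlowat, true))
		wakeups++;
	return wakeups;
}
```

For 4 MB delivered in 64 KB chunks, the model yields 64 wakeups with a
low-water mark of 1 byte and 8 wakeups with a mark of 512 KB, that is,
2 per MB.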

Tested:

tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB

-> Should get ~2 wakeups (c-switches) per MB, regardless of how many
(tiny or big) packets were received.

High speed (mostly full size GRO packets)

received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit,
  cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches

received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit,
  cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches

Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping)

received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit,
  cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h    |  1 +
 net/ipv4/tcp.c       |  4 ++++
 net/ipv4/tcp_input.c | 15 +++++++++++++--
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b2318242cad89176d3c2c027affd4db3c2549ff4..0ee85c47c185afcb8e1017d59e02313cb5df78ec 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -403,6 +403,7 @@ void tcp_syn_ack_timeout(const struct request_sock *req);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		int flags, int *addr_len);
 int tcp_set_rcvlowat(struct sock *sk, int val);
+void tcp_data_ready(struct sock *sk);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx,
 		       int estab, struct tcp_fastopen_cookie *foc);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0abd8d1d3d1d4f0bd6e2762c8a2b862ecf31e4ae..c768d306b65714bb8740c60110c43042508af6b7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1705,6 +1705,10 @@ EXPORT_SYMBOL(tcp_peek_len);
 int tcp_set_rcvlowat(struct sock *sk, int val)
 {
 	sk->sk_rcvlowat = val ? : 1;
+
+	/* Check if we need to signal EPOLLIN right now */
+	tcp_data_ready(sk);
+
 	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
 		return 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d854363a43875e98adbeea72c3434afb06f0f2b4..f93687f97d805732f1093d55a402638c4290700a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4576,6 +4576,17 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 
 }
 
+void tcp_data_ready(struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	int avail = tp->rcv_nxt - tp->copied_seq;
+
+	if (avail < sk->sk_rcvlowat && !sock_flag(sk, SOCK_DONE))
+		return;
+
+	sk->sk_data_ready(sk);
+}
+
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -4633,7 +4644,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		if (eaten > 0)
 			kfree_skb_partial(skb, fragstolen);
 		if (!sock_flag(sk, SOCK_DEAD))
-			sk->sk_data_ready(sk);
+			tcp_data_ready(sk);
 		return;
 	}
 
@@ -5434,7 +5445,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 no_ack:
 			if (eaten)
 				kfree_skb_partial(skb, fragstolen);
-			sk->sk_data_ready(sk);
+			tcp_data_ready(sk);
 			return;
 		}
 	}
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH net-next 2/5] tcp: fix delayed acks behavior for SO_RCVLOWAT
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

We should not delay acks if there are not enough bytes
in receive queue to satisfy SO_RCVLOWAT.

Since [E]POLLIN event is not going to be generated, there is little
hope for a delayed ack to be useful.

In fact, delaying ACK prevents sender from completing
the transfer.
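
The resulting condition can be written as a stand-alone predicate. This
is a sketch of the first clause of __tcp_ack_snd_check() only (quick-ack
mode and out-of-order data are left out); struct ack_model and
ack_now_model() are hypothetical names, and new_wnd stands in for the
value returned by __tcp_select_window(sk).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the fields the first ACK condition reads. */
struct ack_model {
	uint32_t rcv_nxt;     /* next sequence number expected */
	uint32_t rcv_wup;     /* rcv_nxt when the last window update was sent */
	uint32_t copied_seq;  /* first sequence number not yet read by the user */
	uint32_t rcv_mss;     /* estimated MSS of the peer */
	uint32_t rcv_wnd;     /* currently advertised window */
	uint32_t new_wnd;     /* stand-in for __tcp_select_window(sk) */
	uint32_t rcvlowat;    /* SO_RCVLOWAT value */
};

/* True when the first clause asks for an immediate ACK. */
bool ack_now_model(const struct ack_model *m)
{
	bool full_frame = (uint32_t)(m->rcv_nxt - m->rcv_wup) > m->rcv_mss;
	bool below_lowat = (uint32_t)(m->rcv_nxt - m->copied_seq) < m->rcvlowat;
	bool window_advances = m->new_wnd >= m->rcv_wnd;

	return full_frame && (below_lowat || window_advances);
}
```

With 10000 bytes queued and SO_RCVLOWAT set to 512 KB, the predicate is
true even though the window has not advanced, so the ACK is not delayed.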

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 367def6ddeda950db841c0b9ccec98787e19e728..d854363a43875e98adbeea72c3434afb06f0f2b4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5026,9 +5026,12 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	    /* More than one full frame received... */
 	if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
 	     /* ... and right edge of window advances far enough.
-	      * (tcp_recvmsg() will send ACK otherwise). Or...
+	      * (tcp_recvmsg() will send ACK otherwise).
+	      * If application uses SO_RCVLOWAT, we want send ack now if
+	      * we have not received enough bytes to satisfy the condition.
 	      */
-	     __tcp_select_window(sk) >= tp->rcv_wnd) ||
+	    (tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
+	     __tcp_select_window(sk) >= tp->rcv_wnd)) ||
 	    /* We ACK each frame or... */
 	    tcp_in_quickack_mode(sk) ||
 	    /* We have out of order data. */
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH net-next 1/5] tcp: fix SO_RCVLOWAT and RCVBUF autotuning
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20180416173339.6310-1-edumazet@google.com>

Applications might use SO_RCVLOWAT on a TCP socket, hoping to receive
one [E]POLLIN event only when a given number of bytes is ready in the
socket receive queue.

Problem is that receive autotuning is not aware of this constraint,
meaning sk_rcvbuf might be too small to allow all bytes to be stored.

Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
can override the default setsockopt(SO_RCVLOWAT) behavior.
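
The sizing rule that tcp_set_rcvlowat() applies in this patch can be
sketched in user space as follows. rcvbuf_for_lowat() is a hypothetical
helper, rmem_max stands in for sysctl_tcp_rmem[2], and the
SOCK_RCVBUF_LOCK early return and the window_clamp update are left out.
The doubling is written as an explicit saturation because overflowing a
signed shift is undefined in portable C.

```c
#include <limits.h>

/*
 * Model of the receive-buffer target chosen for a given SO_RCVLOWAT value:
 * twice the low-water mark, saturated at INT_MAX, capped by rmem_max,
 * and never smaller than the current buffer size.
 */
int rcvbuf_for_lowat(int lowat, int cur_rcvbuf, int rmem_max)
{
	int val;

	if (lowat < 0)	/* sock_setsockopt() maps negative values to INT_MAX */
		lowat = INT_MAX;
	/* Saturating doubling; the patch uses "val <<= 1" and checks the sign. */
	val = lowat > INT_MAX / 2 ? INT_MAX : lowat * 2;
	if (val > rmem_max)
		val = rmem_max;
	return val > cur_rcvbuf ? val : cur_rcvbuf;
}
```

With illustrative values of 87380 for the current buffer and 6291456 for
the cap, a 512 KB low-water mark yields a 1 MB buffer, while a request
close to INT_MAX is clamped to the cap.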

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/net.h |  1 +
 include/net/tcp.h   |  1 +
 net/core/sock.c     |  5 ++++-
 net/ipv4/af_inet.c  |  1 +
 net/ipv4/tcp.c      | 21 +++++++++++++++++++++
 net/ipv6/af_inet6.c |  1 +
 6 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 2248a052061d8aeb0ae08d233f181f09cba6384b..6554d3ba4396b3df49acac934ad16eeb71a695f4 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -197,6 +197,7 @@ struct proto_ops {
 					   int offset, size_t size, int flags);
 	int		(*sendmsg_locked)(struct sock *sk, struct msghdr *msg,
 					  size_t size);
+	int		(*set_rcvlowat)(struct sock *sk, int val);
 };
 
 #define DECLARE_SOCKADDR(type, dst, src)	\
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9c9b3768b350abfd51776563d220d5e97ca9da69..b2318242cad89176d3c2c027affd4db3c2549ff4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -402,6 +402,7 @@ void tcp_set_keepalive(struct sock *sk, int val);
 void tcp_syn_ack_timeout(const struct request_sock *req);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		int flags, int *addr_len);
+int tcp_set_rcvlowat(struct sock *sk, int val);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx,
 		       int estab, struct tcp_fastopen_cookie *foc);
diff --git a/net/core/sock.c b/net/core/sock.c
index 6444525f610cf8039516744ad26aec58485b9b8a..b2c3db169ca1892c4d624fc5e30af12f4eed0adb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -905,7 +905,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 	case SO_RCVLOWAT:
 		if (val < 0)
 			val = INT_MAX;
-		sk->sk_rcvlowat = val ? : 1;
+		if (sock->ops->set_rcvlowat)
+			ret = sock->ops->set_rcvlowat(sk, val);
+		else
+			sk->sk_rcvlowat = val ? : 1;
 		break;
 
 	case SO_RCVTIMEO:
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eaed0367e669aec7635b3cc41de4ece63bb018ec..f5c562aaef3522519bcf1ae37782a7e14e278723 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1006,6 +1006,7 @@ const struct proto_ops inet_stream_ops = {
 	.compat_getsockopt = compat_sock_common_getsockopt,
 	.compat_ioctl	   = inet_compat_ioctl,
 #endif
+	.set_rcvlowat	   = tcp_set_rcvlowat,
 };
 EXPORT_SYMBOL(inet_stream_ops);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bccc4c2700870b8c7ff592a6bd27acebd9bc6471..0abd8d1d3d1d4f0bd6e2762c8a2b862ecf31e4ae 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1701,6 +1701,27 @@ int tcp_peek_len(struct socket *sock)
 }
 EXPORT_SYMBOL(tcp_peek_len);
 
+/* Make sure sk_rcvbuf is big enough to satisfy SO_RCVLOWAT hint */
+int tcp_set_rcvlowat(struct sock *sk, int val)
+{
+	sk->sk_rcvlowat = val ? : 1;
+	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+		return 0;
+
+	/* val comes from user space and might be close to INT_MAX */
+	val <<= 1;
+	if (val < 0)
+		val = INT_MAX;
+
+	val = min(val, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
+	if (val > sk->sk_rcvbuf) {
+		sk->sk_rcvbuf = val;
+		tcp_sk(sk)->window_clamp = tcp_win_from_space(sk, val);
+	}
+	return 0;
+}
+EXPORT_SYMBOL(tcp_set_rcvlowat);
+
 static void tcp_update_recv_tstamps(struct sk_buff *skb,
 				    struct scm_timestamping *tss)
 {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 8da0b513f1882b39be4fa72a8233d702ae9ec53b..e70d59fb26e16ace1eb484d23964946092a2cd57 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -590,6 +590,7 @@ const struct proto_ops inet6_stream_ops = {
 	.compat_setsockopt = compat_sock_common_setsockopt,
 	.compat_getsockopt = compat_sock_common_getsockopt,
 #endif
+	.set_rcvlowat	   = tcp_set_rcvlowat,
 };
 
 const struct proto_ops inet6_dgram_ops = {
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* [PATCH net-next 0/5] tcp: add zero copy receive
From: Eric Dumazet @ 2018-04-16 17:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet

This patch series adds mmap() support to TCP sockets for RX zero copy.

While tcp_mmap() patch itself is quite small (~100 LOC), optimal support
for asynchronous mmap() required better SO_RCVLOWAT behavior, and a
test program to demonstrate how mmap() on TCP sockets can be used.

Note that mmap() (and associated munmap()) calls add more
pressure on the per-process VM semaphore, so they might not show a
benefit for processes with a high number of threads.

Eric Dumazet (5):
  tcp: fix SO_RCVLOWAT and RCVBUF autotuning
  tcp: fix delayed acks behavior for SO_RCVLOWAT
  tcp: avoid extra wakeups for SO_RCVLOWAT users
  tcp: implement mmap() for zero copy receive
  selftests: net: add tcp_mmap program

 include/linux/net.h                    |   1 +
 include/net/tcp.h                      |   4 +
 net/core/sock.c                        |   5 +-
 net/ipv4/af_inet.c                     |   3 +-
 net/ipv4/tcp.c                         | 138 ++++++++
 net/ipv4/tcp_input.c                   |  22 +-
 net/ipv6/af_inet6.c                    |   3 +-
 tools/testing/selftests/net/Makefile   |   2 +
 tools/testing/selftests/net/tcp_mmap.c | 437 +++++++++++++++++++++++++
 9 files changed, 608 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/net/tcp_mmap.c

-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply
