Netdev List
 help / color / mirror / Atom feed
* [iproute PATCH 1/2] link_gre6: Fix for changing tclass/flowlabel
From: Phil Sutter @ 2017-09-01 14:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20170901140809.13230-1-phil@nwl.cc>

When trying to change tclass or flowlabel of a GREv6 tunnel which has
the respective value set already, the code accidentally bitwise OR'ed
the old and the new value, leading to unexpected results. Fix this by
clearing the relevant bits of flowinfo variable prior to assigning the
new value.

Fixes: af89576d7a8c4 ("iproute2: GRE over IPv6 tunnel support.")
Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 ip/link_gre6.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 4d3d4b54210b9..447ac5d78ab7b 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -288,6 +288,7 @@ get_failed:
 			else {
 				if (get_u8(&uval, *argv, 16))
 					invarg("invalid TClass", *argv);
+				flowinfo &= ~IP6_FLOWINFO_TCLASS;
 				flowinfo |= htonl((__u32)uval << 20) & IP6_FLOWINFO_TCLASS;
 				flags &= ~IP6_TNL_F_USE_ORIG_TCLASS;
 			}
@@ -303,6 +304,7 @@ get_failed:
 					invarg("invalid Flowlabel", *argv);
 				if (uval > 0xFFFFF)
 					invarg("invalid Flowlabel", *argv);
+				flowinfo &= ~IP6_FLOWINFO_FLOWLABEL;
 				flowinfo |= htonl(uval) & IP6_FLOWINFO_FLOWLABEL;
 				flags &= ~IP6_TNL_F_USE_ORIG_FLOWLABEL;
 			}
-- 
2.13.1

^ permalink raw reply related

* Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters
From: Tejun Heo @ 2017-09-01 14:11 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David Ahern, netdev, daniel, ast, davem
In-Reply-To: <20170901032754.ogqmgqaqp3dwfdbw@ast-mbp>

Hello, Alexei.

On Thu, Aug 31, 2017 at 08:27:56PM -0700, Alexei Starovoitov wrote:
> > > The 2 flags are completely independent. The existing override logic is
> > > unchanged. If a program can not be overridden, then the new recursive
> > > flag is irrelevant.
> > 
> > I'm not sure all four combo of the two flags makes sense.  Can't we
> > have something simpler like the following?
> > 
> > 1. None: No further bpf programs allowed in the subtree.
> > 
> > 2. Overridable: If a sub-cgroup installs the same bpf program, this
> >    one yields to that one.
> > 
> > 3. Recursive: If a sub-cgroup installs the same bpf program, that
> >    cgroup program gets run in addition to this one.
> > 
> > Note that we can have combinations of overridables and recursives -
> > both allow further programs in the sub-hierarchy and the only
> > distinction is whether that specific program behaves when that
> > happens.
> 
> If I understand the proposal correctly in case of:
> A (with recurs) -> B (with override) -> C (with recurse) -> D (with override)
> when something happens in D, you propose to run D,C,A  ?

Yes, B gets overridden by C, so the effective progarms are A, C and D.

> With the order of execution from children to parent?

Hmm... I'm not sure about the execution ordering.  How these programs
chain would be dependent on the type of the program, right?  Would we
be able to use the same chaining order for all types of programs?

> That's a bit a different then what I was proposing with 'multi-prog' flag,
> but the more I think about it the more I like it.

Great.

> In your case detach is sort of transparent to everything around.
> And you would also allow to say 'None' to one of the substrees too, right?
> So something like:
> A (with recurs) -> B (with override) -> C (with recurse) -> D (None) -> E
> would mean that E cannot attach anything and events in E will
> call D->C->A, right?

Yeap.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH net-next 2/3] bpf: Inline LRU map lookup
From: Alexei Starovoitov @ 2017-09-01 14:22 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Daniel Borkmann, kernel-team
In-Reply-To: <20170901062713.1842249-3-kafai@fb.com>

On 8/31/17 11:27 PM, Martin KaFai Lau wrote:
> Inline the lru map lookup to save the cost in making calls to
> bpf_map_lookup_elem() and htab_lru_map_lookup_elem().
>
> Different LRU hash size is tested.  The benefit diminishes when
> the cache miss starts to dominate in the bigger LRU hash.
> Considering the change is simple, it is still worth to optimize.
>
> First column: Size of the LRU hash
> Second column: Number of lookups/s
>
> Before:
>> for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
> 513: 1132020
> 1025: 1056826
> 2049: 1007024
> 4097: 853298
> 8193: 742723
> 16385: 712600
> 32769: 688142
> 65537: 677028
> 131073: 619437
> 262145: 498770
> 524289: 316695
> 1048577: 260038
>
> After:
>> for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
> 513: 1221851
> 1025: 1144695
> 2049: 1049902
> 4097: 884460
> 8193: 773731
> 16385: 729673
> 32769: 721989
> 65537: 715530
> 131073: 671665
> 262145: 516987
> 524289: 321125
> 1048577: 260048
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next 1/3] bpf: Add lru_hash_lookup performance test
From: Alexei Starovoitov @ 2017-09-01 14:22 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Daniel Borkmann, kernel-team
In-Reply-To: <20170901062713.1842249-2-kafai@fb.com>

On 8/31/17 11:27 PM, Martin KaFai Lau wrote:
> Create a new case to test the LRU lookup performance.
>
> At the beginning, the LRU map is fully loaded (i.e. the number of keys
> is equal to map->max_entries).   The lookup is done through key 0
> to num_map_entries and then repeats from 0 again.
>
> This patch also creates an anonymous struct to properly
> name the test params in stress_lru_hmap_alloc() in map_perf_test_kern.c.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next 3/3] bpf: Only set node->ref = 1 if it has not been set
From: Alexei Starovoitov @ 2017-09-01 14:22 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Daniel Borkmann, kernel-team
In-Reply-To: <20170901062713.1842249-4-kafai@fb.com>

On 8/31/17 11:27 PM, Martin KaFai Lau wrote:
> This patch writes 'node->ref = 1' only if node->ref is 0.
> The number of lookups/s for a ~1M entries LRU map increased by
> ~30% (260097 to 343313).
>
> Other writes on 'node->ref = 0' is not changed.  In those cases, the
> same cache line has to be changed anyway.
>
> First column: Size of the LRU hash
> Second column: Number of lookups/s
>
> Before:
>> echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
> 1048577: 260097
>
> After:
>> echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
> 1048577: 343313
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* pull-request: wireless-drivers-next 2017-09-01
From: Kalle Valo @ 2017-09-01 14:34 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, linux-kernel

Hi Dave,

here's a pull request to net-next for 4.14. If the merge window opens on
Sunday I'm planning to have this as the last one.

Please let me know if there are any problems.

Kalle

The following changes since commit d081a16db80ef7a260fb178aa1199e01f7432625:

  net: bcmgenet: Do not return from void function (2017-08-29 22:39:51 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git tags/wireless-drivers-next-for-davem-2017-09-01

for you to fetch changes up to eb464d4a8d092a793b97b724cd3cc6eeb229232a:

  Merge ath-next from git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git (2017-08-31 21:34:22 +0300)

----------------------------------------------------------------
wireless-drivers-next patches for 4.14

Few last patches for 4.14, nothing really major here.

Major changes:

wil6210

* support FW RSSI reporting (by mistake this was accidentally
  mentioned already in the previous pull request, but now it's really
  included)

* make debugfs optional, adds new Kconfig option CONFIG_WIL6210_DEBUGFS

qtnfmac

* implement 64-bit DMA support

----------------------------------------------------------------
Andy Shevchenko (1):
      ath10k: switch to use new generic UUID API

Arvind Yadav (2):
      ath6kl: constify usb_device_id
      ath9k: constify usb_device_id

Bhumika Goyal (1):
      ath9k: make ath_ps_ops structures as const

Bjorn Andersson (1):
      wcn36xx: Introduce mutual exclusion of fw configuration

Dan Carpenter (2):
      rsi: update some comments
      rsi: missing unlocks on error paths

David Spinadel (1):
      iwlwifi: mvm: Avoid deferring non bufferable frames

Dedy Lansky (4):
      wil6210: support FW RSSI reporting
      wil6210: store FW RF calibration result
      wil6210: move pre-FW configuration to separate function
      wil6210: clear PAL_UNIT_ICR part of device reset

Emmanuel Grumbach (1):
      iwlwifi: mvm: bump API to 34 for 8000 and up

Erik Stromdahl (1):
      ath10k: sdio: remove unused struct member

Gabriel Craciunescu (1):
      ath10k: ath10k_htt_rx_amsdu_allowed() use ath10k_dbg()

Gidon Studinski (2):
      wil6210: move vring_idle_trsh definition to wil6210_priv
      wil6210: make debugfs compilation optional

Gustavo A. R. Silva (1):
      rtlwifi: rtl8723be: fix duplicated code for different branches

Hamad Kadmany (2):
      wil6210: protect against invalid length of tx management frame
      wil6210: fix interface-up check

Hans de Goede (1):
      brcmfmac: Log chip id and revision

Hauke Mehrtens (1):
      ath10k: activate user space firmware loading again

Himanshu Jha (1):
      rsi: remove memset before memcpy

Kalle Valo (2):
      Merge tag 'iwlwifi-next-for-kalle-2017-08-30' of git://git.kernel.org/.../iwlwifi/iwlwifi-next
      Merge ath-next from git://git.kernel.org/.../kvalo/ath.git

Lazar Alexei (1):
      wil6210: align to latest auto generated wmi.h

Liad Kaufman (1):
      iwlwifi: fix long debug print

Lior David (3):
      wil6210: ratelimit errors in TX/RX interrupts
      wil6210: increase connect timeout
      wil6210: ensure P2P device is stopped before removing interface

Maya Erez (3):
      wil6210: check no_fw_recovery in resume failure recovery
      wil6210: add statistics for suspend time
      wil6210: notify wiphy on wowlan support

Rakesh Pillai (1):
      ath10k: fix memory leak in rx ring buffer allocation

Ryan Hsu (3):
      ath10k: fix napi_poll budget overflow
      ath10k: add the PCI PM core suspend/resume ops
      ath10k: configure and enable the wakeup capability

Sergey Matyukevich (5):
      qtnfmac: drop -D__CHECK_ENDIAN from cflags
      qtnfmac: module param sanity check
      qtnfmac: modify qtnf_map_bar not to return NULL
      qtnfmac: fix free_xfer_buffer cleanup
      qtnfmac: implement 64-bit dma support

Stanislaw Gruszka (1):
      rt2800: fix TX_PIN_CFG setting for non MT7620 chips

 drivers/net/wireless/ath/ath10k/core.c             |  14 +-
 drivers/net/wireless/ath/ath10k/core.h             |   2 +-
 drivers/net/wireless/ath/ath10k/debug.c            |   6 +-
 drivers/net/wireless/ath/ath10k/htt_rx.c           |  19 +-
 drivers/net/wireless/ath/ath10k/mac.c              |   1 +
 drivers/net/wireless/ath/ath10k/pci.c              |  50 +-
 drivers/net/wireless/ath/ath10k/sdio.c             |   4 -
 drivers/net/wireless/ath/ath10k/sdio.h             |   2 -
 drivers/net/wireless/ath/ath10k/wow.c              |  14 +
 drivers/net/wireless/ath/ath10k/wow.h              |   1 +
 drivers/net/wireless/ath/ath6kl/usb.c              |   2 +-
 drivers/net/wireless/ath/ath9k/hif_usb.c           |   2 +-
 drivers/net/wireless/ath/ath9k/htc_drv_init.c      |   2 +-
 drivers/net/wireless/ath/ath9k/init.c              |   2 +-
 drivers/net/wireless/ath/wcn36xx/main.c            |  52 +-
 drivers/net/wireless/ath/wcn36xx/wcn36xx.h         |   3 +
 drivers/net/wireless/ath/wil6210/Kconfig           |  12 +
 drivers/net/wireless/ath/wil6210/Makefile          |   2 +-
 drivers/net/wireless/ath/wil6210/cfg80211.c        |  84 ++-
 drivers/net/wireless/ath/wil6210/debugfs.c         |  27 +-
 drivers/net/wireless/ath/wil6210/interrupt.c       |  14 +-
 drivers/net/wireless/ath/wil6210/main.c            |  42 +-
 drivers/net/wireless/ath/wil6210/pcie_bus.c        |   3 +
 drivers/net/wireless/ath/wil6210/pm.c              |  27 +-
 drivers/net/wireless/ath/wil6210/txrx.c            |   6 +-
 drivers/net/wireless/ath/wil6210/wil6210.h         |  20 +-
 drivers/net/wireless/ath/wil6210/wmi.c             |  14 +-
 drivers/net/wireless/ath/wil6210/wmi.h             | 720 ++++++++++++++-------
 .../broadcom/brcm80211/brcmfmac/firmware.c         |   3 +
 drivers/net/wireless/intel/iwlwifi/cfg/8000.c      |   4 +-
 drivers/net/wireless/intel/iwlwifi/cfg/9000.c      |   2 +-
 drivers/net/wireless/intel/iwlwifi/cfg/a000.c      |   2 +-
 drivers/net/wireless/intel/iwlwifi/iwl-nvm-parse.c |   9 +-
 drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c  |   9 +-
 drivers/net/wireless/quantenna/qtnfmac/Makefile    |   4 -
 .../net/wireless/quantenna/qtnfmac/pearl/pcie.c    | 101 ++-
 .../wireless/quantenna/qtnfmac/pearl/pcie_ipc.h    |  10 +-
 .../quantenna/qtnfmac/pearl/pcie_regs_pearl.h      |   1 +
 drivers/net/wireless/ralink/rt2x00/rt2800lib.c     |   5 +-
 .../net/wireless/realtek/rtlwifi/rtl8723be/dm.c    |   8 +-
 drivers/net/wireless/rsi/rsi_91x_mac80211.c        |  25 +-
 drivers/net/wireless/rsi/rsi_91x_sdio.c            |   1 -
 drivers/net/wireless/rsi/rsi_91x_usb.c             |   1 -
 43 files changed, 931 insertions(+), 401 deletions(-)

^ permalink raw reply

* [PATCH] mvneta: Driver and hardware supports IPv6 offload, so enable it
From: Andrew Pilloud @ 2017-09-01 14:49 UTC (permalink / raw)
  To: Thomas Petazzoni; +Cc: netdev, Andrew Pilloud

The mvneta driver and hardware supports IPv6 offload, however it
isn't enabled. Set the NETIF_F_IPV6_CSUM feature to inform the
network layer that this driver can offload IPV6 TCP and UDP
checksums. This change has been tested on an Armada 370 and the
feature support confirmed with several device datasheets
including the Armada XP and Armada 3700.

Signed-off-by: Andrew Pilloud <andrewpilloud@igneoussystems.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 0aab74c2a209..369ad971b42a 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -4332,7 +4332,7 @@ static int mvneta_probe(struct platform_device *pdev)
 		}
 	}
 
-	dev->features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
+	dev->features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_TSO;
 	dev->hw_features |= dev->features;
 	dev->vlan_features |= dev->features;
 	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next] bpf: Collapse offset checks in sock_filter_is_valid_access
From: David Ahern @ 2017-09-01 15:18 UTC (permalink / raw)
  To: netdev, daniel; +Cc: David Ahern

Make sock_filter_is_valid_access consistent with other is_valid_access
helpers.

Requested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/core/filter.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 9dad3e7e2e10..f9add024d92f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3468,9 +3468,7 @@ static bool sock_filter_is_valid_access(int off, int size,
 	if (type == BPF_WRITE) {
 		switch (off) {
 		case offsetof(struct bpf_sock, bound_dev_if):
-			break;
 		case offsetof(struct bpf_sock, mark):
-			break;
 		case offsetof(struct bpf_sock, priority):
 			break;
 		default:
-- 
2.1.4

^ permalink raw reply related

* Re: [net-next PATCH] ixgbe: add counter for times rx pages gets allocated, not recycled
From: Alexander Duyck @ 2017-09-01 15:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Netdev, Jeff Kirsher
In-Reply-To: <150426329512.26774.12697329010993174947.stgit@firesoul>

On Fri, Sep 1, 2017 at 3:54 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> The ixgbe driver have page recycle scheme based around the RX-ring
> queue, where a RX page is shared between two packets. Based on the
> refcnt, the driver can determine if the RX-page is currently only used
> by a single packet, if so it can then directly refill/recycle the
> RX-slot by with the opposite "side" of the page.
>
> While this is a clever trick, it is hard to determine when this
> recycling is successful and when it fails.  Adding a counter, which is
> available via ethtool --statistics as 'alloc_rx_page'.  Which counts
> the number of times the recycle fails and the real page allocator is
> invoked.  When interpreting the stats, do remember that every alloc
> will serve two packets.
>
> The counter is collected per rx_ring, but is summed and ethtool
> exported as 'alloc_rx_page'.  It would be relevant to know what
> rx_ring that cannot keep up, but that can be exported later if
> someone experience a need for this.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h         |    2 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |    1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |    4 ++++
>  3 files changed, 7 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index dd5578756ae0..008d0085e01f 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -275,6 +275,7 @@ struct ixgbe_rx_queue_stats {
>         u64 rsc_count;
>         u64 rsc_flush;
>         u64 non_eop_descs;
> +       u64 alloc_rx_page;
>         u64 alloc_rx_page_failed;
>         u64 alloc_rx_buff_failed;
>         u64 csum_err;
> @@ -655,6 +656,7 @@ struct ixgbe_adapter {
>         u64 rsc_total_count;
>         u64 rsc_total_flush;
>         u64 non_eop_descs;
> +       u32 alloc_rx_page;
>         u32 alloc_rx_page_failed;
>         u32 alloc_rx_buff_failed;
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
> index 72c565712a5f..d96d9d6c3492 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
> @@ -104,6 +104,7 @@ static const struct ixgbe_stats ixgbe_gstrings_stats[] = {
>         {"tx_flow_control_xoff", IXGBE_STAT(stats.lxofftxc)},
>         {"rx_flow_control_xoff", IXGBE_STAT(stats.lxoffrxc)},
>         {"rx_csum_offload_errors", IXGBE_STAT(hw_csum_rx_error)},
> +       {"alloc_rx_page", IXGBE_STAT(alloc_rx_page)},
>         {"alloc_rx_page_failed", IXGBE_STAT(alloc_rx_page_failed)},
>         {"alloc_rx_buff_failed", IXGBE_STAT(alloc_rx_buff_failed)},
>         {"rx_no_dma_resources", IXGBE_STAT(hw_rx_no_dma_resources)},
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index d962368d08d0..7d2e4b08cdf4 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1598,6 +1598,7 @@ static bool ixgbe_alloc_mapped_page(struct ixgbe_ring *rx_ring,
>                 rx_ring->rx_stats.alloc_rx_page_failed++;
>                 return false;
>         }
> +       rx_ring->rx_stats.alloc_rx_page++;

So this line should be moved down past the DMA page mapping as it can
fail on some architectures. My personal preference would be to have
this placed in the lines just after the "bi" members being updated,
and before the return.

>         /* map page for use */
>         dma = dma_map_page_attrs(rx_ring->dev, page, 0,
> @@ -6771,6 +6772,7 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
>         u32 i, missed_rx = 0, mpc, bprc, lxon, lxoff, xon_off_tot;
>         u64 non_eop_descs = 0, restart_queue = 0, tx_busy = 0;
>         u64 alloc_rx_page_failed = 0, alloc_rx_buff_failed = 0;
> +       u64 alloc_rx_page = 0;
>         u64 bytes = 0, packets = 0, hw_csum_rx_error = 0;
>
>         if (test_bit(__IXGBE_DOWN, &adapter->state) ||
> @@ -6791,6 +6793,7 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
>         for (i = 0; i < adapter->num_rx_queues; i++) {
>                 struct ixgbe_ring *rx_ring = adapter->rx_ring[i];
>                 non_eop_descs += rx_ring->rx_stats.non_eop_descs;
> +               alloc_rx_page += rx_ring->rx_stats.alloc_rx_page;
>                 alloc_rx_page_failed += rx_ring->rx_stats.alloc_rx_page_failed;
>                 alloc_rx_buff_failed += rx_ring->rx_stats.alloc_rx_buff_failed;
>                 hw_csum_rx_error += rx_ring->rx_stats.csum_err;
> @@ -6798,6 +6801,7 @@ void ixgbe_update_stats(struct ixgbe_adapter *adapter)
>                 packets += rx_ring->stats.packets;
>         }
>         adapter->non_eop_descs = non_eop_descs;
> +       adapter->alloc_rx_page = alloc_rx_page;
>         adapter->alloc_rx_page_failed = alloc_rx_page_failed;
>         adapter->alloc_rx_buff_failed = alloc_rx_buff_failed;
>         adapter->hw_csum_rx_error = hw_csum_rx_error;
>

^ permalink raw reply

* ptp device strangeness
From: Tim Sander @ 2017-09-01 15:28 UTC (permalink / raw)
  To: Richard Cochran, linux-kernel; +Cc: netdev

Hi

I am currently using ptp on a Altera/Intel SOC with a dp8640 PHY.
PTP functionality seems to be right. But i am doing timestamping
with gpio0 and sometimes i loose the sync of the stamping and
the events. So i would like to read out all messages. Reading O_NONBLOCK
does not work so i tried polling from usermode with the below code:

		np = poll(&ev, 1, 0);
		ev.fd=ptpDev;
		ev.events = POLLIN;
		if (np>0) {
			if (ev.revents>0) {
				std::cout<<"discarded ptp event"<<std::endl;
				read(ptpDev, &event, sizeof(event));
			}
But as confirmed in the debugger np=1 and read blocks forever.
I don't think that this is correct behavior?

For pinning down this misbehavior I would like to know it this is a local 
problem of my hardware or if this is a general problem with the ptp chardev 
interface? 

I am currently on 4.11.12. As this is the latest preempt rt release.

Best regards
Tim

^ permalink raw reply

* (unknown), 
From: stef.ryckmans @ 2017-09-01 15:30 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: 8247893720.doc --]
[-- Type: application/msword, Size: 40147 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 2/2] flow_dissector: Add limits for encapsulation and EH
From: Tom Herbert @ 2017-09-01 15:38 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S . Miller, Linux Kernel Network Developers, alex.popov
In-Reply-To: <87fuc6zgra.fsf@stressinduktion.org>

On Fri, Sep 1, 2017 at 6:32 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Tom Herbert <tom@quantonium.net> writes:
>
>> In flow dissector there are no limits to the number of nested
>> encapsulations that might be dissected which makes for a nice DOS
>> attack. This patch limits for dissecting nested encapsulations
>> as well as for dissecting over extension headers.
>
> I was actually more referring to your patch, because the flow dissector
> right now is not stack recursive. Your changes would make it doing
> recursion on the stack.

I don't believe those patches had any recursion.

>> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Signed-off-by: Tom Herbert <tom@quantonium.net>
>> ---
>>  net/core/flow_dissector.c | 48 ++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> index 5110180a3e96..1bca748de27d 100644
>> --- a/net/core/flow_dissector.c
>> +++ b/net/core/flow_dissector.c
>> @@ -396,6 +396,35 @@ __skb_flow_dissect_ipv6(const struct sk_buff *skb,
>>       key_ip->ttl = iph->hop_limit;
>>  }
>>
>> +/* Maximum number of nested encapsulations that can be processed in
>> + * __skb_flow_dissect
>> + */
>> +#define MAX_FLOW_DISSECT_ENCAPS      5
>
> I think you can exactly parse one encapsulation layer safely. This is
> because you can only keep state on one encapsulation layer protocol
> right now.
>
> For example this scenario:
>
> I would like to circumvent tc flower rules that filter base
> simply construct a packet looking like:
>
> Ethernet|Vlan|IP|GRE|Ethernet|Vlan|
>
> because we don't recurse in the flow keys either, the second vlan header
> would overwrite the information of the first one.
>
Flow dissector returns the inner most information it sees. If only
information from the outermost headers is needed then the caller
should set FLOW_DISSECTOR_F_STOP_AT_ENCAP in the flags.

>> +
>> +static bool skb_flow_dissect_encap_allowed(int *num_encaps, unsigned int *flags)
>> +{
>> +     ++*num_encaps;
>> +
>> +     if (*num_encaps >= MAX_FLOW_DISSECT_ENCAPS) {
>> +             if (*num_encaps == MAX_FLOW_DISSECT_ENCAPS) {
>> +                     /* Allow one more pass but ignore disregard
>> +                      * further encapsulations
>> +                      */
>> +                     *flags |= FLOW_DISSECTOR_F_STOP_AT_ENCAP;
>> +             } else {
>> +                     /* Max encaps reached */
>> +                     return  false;
>> +             }
>> +     }
>> +
>> +     return true;
>> +}
>> +
>> +/* Maximum number of extension headers can be processed in __skb_flow_dissect
>> + * per IPv6 packet
>> + */
>> +#define MAX_FLOW_DISSECT_EH  5
>
> I would at least allow each extension header once, DEST_OPS twice, thus
> let's say 12? It is safe in this version because it does not consume
> stack space anyway and doesn't update internal state.

This per each IPv6 header. This patch would allow up to five
encapsulations of IPv6/IPv6 so maximum number of EHs we would consider
is 6*5=30.

>
>> +
>>  /**
>>   * __skb_flow_dissect - extract the flow_keys struct and return it
>>   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
>> @@ -426,6 +455,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>>       struct flow_dissector_key_tags *key_tags;
>>       struct flow_dissector_key_vlan *key_vlan;
>>       enum flow_dissect_ret fdret;
>> +     int num_eh, num_encaps = 0;
>>       bool skip_vlan = false;
>>       u8 ip_proto = 0;
>>       bool ret;
>> @@ -714,7 +744,9 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>>       case FLOW_DISSECT_RET_OUT_GOOD:
>>               goto out_good;
>>       case FLOW_DISSECT_RET_PROTO_AGAIN:
>> -             goto proto_again;
>> +             if (skb_flow_dissect_encap_allowed(&num_encaps, &flags))
>> +                     goto proto_again;
>
> I think you should get the check to the proto_again label. In case you
> loop to often you can `goto out_good'.
>
That would add an extra conditional in the common case of no
encapsulation. Having it here means we only care about the
encapsulation limit when there is encapsulation.

Tom

^ permalink raw reply

* Re: virtio_net: ethtool supported link modes
From: Michael S. Tsirkin @ 2017-09-01 15:43 UTC (permalink / raw)
  To: Radu Rendec; +Cc: virtualization, netdev, linux-kernel, Jason Wang, virtio-dev
In-Reply-To: <1504199044.22080.11.camel@arista.com>

On Thu, Aug 31, 2017 at 06:04:04PM +0100, Radu Rendec wrote:
> Hello,
> 
> Looking at the code in virtnet_set_link_ksettings, it seems the speed
> and duplex can be set to any valid value. The driver will "remember"
> them and report them back in virtnet_get_link_ksettings.
> 
> However, the supported link modes (link_modes.supported in struct
> ethtool_link_ksettings) is always 0, indicating that no speed/duplex
> setting is supported.
> 
> Does it make more sense to set (at least a few of) the supported link
> modes, such as 10baseT_Half ... 10000baseT_Full?
> 
> I would expect to see consistency between what is reported in
> link_modes.supported and what can actually be set. Could you please
> share your opinion on this?
> 
> Thank you,
> Radu Rendec


I would like to know more about why this is desirable.

We used not to support the modes at all, but it turned out
some tools are confused by this: e.g. people would try to
bond virtio with a hardware device, tools would see
a mismatch in speed and features between bonded devices
and get confused.

See

	commit 16032be56c1f66770da15cb94f0eb366c37aff6e
	Author: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
	Date:   Wed Feb 3 04:04:37 2016 +0100

	    virtio_net: add ethtool support for set and get of settings


as well as the discussion around it
	https://www.spinics.net/lists/netdev/msg362111.html


If you think we need to add more hacks like this, a stronger
motivation than "to see consistency" would be needed.

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next v5 2/2] tcp_diag: report TCP MD5 signing keys and addresses
From: Sabrina Dubroca @ 2017-09-01 15:47 UTC (permalink / raw)
  To: Ivan Delalande; +Cc: David Miller, Eric Dumazet, netdev
In-Reply-To: <20170901002118.sprwhuybmk4spy34@ycc.fr>

2017-09-01, 02:21:18 +0200, Ivan Delalande wrote:
> On Fri, Sep 01, 2017 at 01:26:33AM +0200, Sabrina Dubroca wrote:
> > 2017-08-31, 09:59:39 -0700, Ivan Delalande wrote:
> > > diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
> > > index a748c74aa8b7..abbf0edcf6c2 100644
> > > --- a/net/ipv4/tcp_diag.c
> > > +++ b/net/ipv4/tcp_diag.c
> > [...]
> > > +static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
> > > +			    struct sk_buff *skb)
> > > +{
> > > +#ifdef CONFIG_TCP_MD5SIG
> > > +	if (net_admin) {
> > 
> > In tcp_diag_get_aux_size() you put a check for sk_fullsock. I don't
> > see anything preventing you from reaching this with a !fullsock?
> 
> Currently handler->idiag_get_aux is only called from inet_sk_diag_fill
> which has a `BUG_ON(!sk_fullsock(sk));`, but I could add another

Oh, right, that's the part I was missing. Thanks.

> explicit check in that function if you think it's more consistent.
> 
> Actually, I wasn't sure when adding this idiag_get_aux in v2 if it
> should be called from inet_twsk_diag_fill, inet_req_diag_fill and
> inet_csk_diag_fill, or just the last one. I chose that simpler approach
> for now to avoid duplicating these state checks in the idiag_get_aux
> defined by protocols and because we didn't need for INET_DIAG_MD5SIG,
> but it shouldn't be too hard to change. Do you think this could be
> useful for other protocols or attributes?

No opinion. It can be added later if necessary.

-- 
Sabrina

^ permalink raw reply

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
From: Willem de Bruijn @ 2017-09-01 15:47 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Network Development, David Miller, Willem de Bruijn
In-Reply-To: <20170901031009.2p4vv6fncma2y2l7@ast-mbp>

On Thu, Aug 31, 2017 at 11:10 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 31, 2017 at 11:04:41PM -0400, Willem de Bruijn wrote:
>> On Thu, Aug 31, 2017 at 10:10 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Thu, Aug 31, 2017 at 05:00:13PM -0400, Willem de Bruijn wrote:
>> >> From: Willem de Bruijn <willemb@google.com>
>> >>
>> >> Documentation for this feature was missing from the patchset.
>> >> Copied a lot from the netdev 2.1 paper, addressing some small
>> >> interface changes since then.
>> >>
>> >> Signed-off-by: Willem de Bruijn <willemb@google.com>
>> > ...
>> >> +Notification Batching
>> >> +~~~~~~~~~~~~~~~~~~~~~
>> >> +
>> >> +Multiple outstanding packets can be read at once using the recvmmsg
>> >> +call. This is often not needed. In each message the kernel returns not
>> >> +a single value, but a range. It coalesces consecutive notifications
>> >> +while one is outstanding for reception on the error queue.
>> >> +
>> >> +When a new notification is about to be queued, it checks whether the
>> >> +new value extends the range of the notification at the tail of the
>> >> +queue. If so, it drops the new notification packet and instead increases
>> >> +the range upper value of the outstanding notification.
>> >
>> > Would it make sense to mention that max notification range is 32-bit?
>> > So each 4Gbyte of xmit bytes there will be a notification.
>> > In modern 40Gbps NICs it's not a lot. Means that there will be
>> > at least one notification every second.
>> > Or I misread the code?
>>
>> You're right. The doc does mention that the counter and range
>> are 32-bit. I can state more explicitly that that bounds the working
>> set size to 4GB. Do you expect this to be problematic? Processing
>> a single notification per 4GB of data should not be a significant
>> cost in itself.

Actually, the counter is not a byte counter. It is incremented on each
system call that sends data with MSG_ZEROCOPY. So the 4GB limit
would only hold if a caller sends single byte requests at a time.

I will make this more clear in v2.

>
> I think 4GB is fine. Just there was an idea that in cases when
> notification of transmission can be known by other means the user space
> could have skipped reading errqeuee completely, but looks like it
> still needs to poll. That's fine.
>> > Thanks for the doc!
>>
>> Thanks for reviewing :)
>>
>> >
>> > Acked-by: Alexei Starovoitov <ast@kernel.org>
>> >

^ permalink raw reply

* Re: [PATCH net] vhost_net: correctly check tx avail during rx busy polling
From: Michael S. Tsirkin @ 2017-09-01 15:51 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, virtualization, netdev, linux-kernel
In-Reply-To: <1504256570-3488-1-git-send-email-jasowang@redhat.com>

On Fri, Sep 01, 2017 at 05:02:50PM +0800, Jason Wang wrote:
> We check tx avail through vhost_enable_notify() in the past which is
> wrong since it only checks whether or not guest has filled more
> available buffer since last avail idx synchronization which was just
> done by vhost_vq_avail_empty() before. What we really want is checking
> pending buffers in the avail ring.

These are rx buffers, right? I'm not even sure why do we need to poll
for them. Running out of rx buffers is a slow path.

> Fix this by calling
> vhost_vq_avail_empty() instead.
> 
> This issue could be noticed by doing netperf TCP_RR benchmark as
> client from guest (but not host). With this fix, TCP_RR from guest to
> localhost restores from 1375.91 trans per sec to 55235.28 trans per
> sec on my laptop (Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz).
> 
> Fixes: 030881372460 ("vhost_net: basic polling support")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> - The patch is needed for -stable
> ---
>  drivers/vhost/net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 06d0448..1b68253 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -634,7 +634,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)

In fact why does it poll the ring at all? I thought this function's
job is to poll the socket, isn't it?


>  
>  		preempt_enable();
>  
> -		if (vhost_enable_notify(&net->dev, vq))
> +		if (!vhost_vq_avail_empty(&net->dev, vq))
>  			vhost_poll_queue(&vq->poll);
>  		mutex_unlock(&vq->mutex);


Adding more contex:

                mutex_lock(&vq->mutex);
                vhost_disable_notify(&net->dev, vq);

                preempt_disable();
                endtime = busy_clock() + vq->busyloop_timeout;

                while (vhost_can_busy_poll(&net->dev, endtime) &&
                       !sk_has_rx_data(sk) &&
                       vhost_vq_avail_empty(&net->dev, vq))
                        cpu_relax();
                
                preempt_enable();
        
                if (vhost_enable_notify(&net->dev, vq))
                        vhost_poll_queue(&vq->poll);
                mutex_unlock(&vq->mutex);

                len = peek_head_len(rvq, sk);


If you drop this we'll exit the function with notifications
disabled. Seems wrong to me.


>  
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net-next] bpf: Collapse offset checks in sock_filter_is_valid_access
From: Daniel Borkmann @ 2017-09-01 15:53 UTC (permalink / raw)
  To: David Ahern, netdev
In-Reply-To: <1504279087-11379-1-git-send-email-dsahern@gmail.com>

On 09/01/2017 05:18 PM, David Ahern wrote:
> Make sock_filter_is_valid_access consistent with other is_valid_access
> helpers.
>
> Requested-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: David Ahern <dsahern@gmail.com>

Thanks for following up,

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* [PATCH net 1/2] l2tp: prevent creation of sessions on terminated tunnels
From: Guillaume Nault @ 2017-09-01 15:58 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1504277892.git.g.nault@alphalink.fr>

l2tp_tunnel_destruct() sets tunnel->sock to NULL, then removes the
tunnel from the pernet list and finally closes all its sessions.
Therefore, it's possible to add a session to a tunnel that is still
reachable, but for which tunnel->sock has already been reset. This can
make l2tp_session_create() dereference a NULL pointer when calling
sock_hold(tunnel->sock).

This patch adds the .acpt_newsess field to struct l2tp_tunnel, which is
used by l2tp_tunnel_closeall() to prevent addition of new sessions to
tunnels. Resetting tunnel->sock is done after l2tp_tunnel_closeall()
returned, so that l2tp_session_add_to_tunnel() can safely take a
reference on it when .acpt_newsess is true.

The .acpt_newsess field is modified in l2tp_tunnel_closeall(), rather
than in l2tp_tunnel_destruct(), so that it benefits all tunnel removal
mechanisms. E.g. on UDP tunnels, a session could be added to a tunnel
after l2tp_udp_encap_destroy() proceeded. This would prevent the tunnel
from being removed because of the references held by this new session
on the tunnel and its socket. Even though the session could be removed
manually later on, this defeats the purpose of
commit 9980d001cec8 ("l2tp: add udp encap socket destroy handler").

Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_core.c | 41 ++++++++++++++++++++++++++++-------------
 net/l2tp/l2tp_core.h |  4 ++++
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 90165a6874bc..ee485df73ccd 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -329,13 +329,21 @@ static int l2tp_session_add_to_tunnel(struct l2tp_tunnel *tunnel,
 	struct hlist_head *g_head;
 	struct hlist_head *head;
 	struct l2tp_net *pn;
+	int err;
 
 	head = l2tp_session_id_hash(tunnel, session->session_id);
 
 	write_lock_bh(&tunnel->hlist_lock);
+	if (!tunnel->acpt_newsess) {
+		err = -ENODEV;
+		goto err_tlock;
+	}
+
 	hlist_for_each_entry(session_walk, head, hlist)
-		if (session_walk->session_id == session->session_id)
-			goto exist;
+		if (session_walk->session_id == session->session_id) {
+			err = -EEXIST;
+			goto err_tlock;
+		}
 
 	if (tunnel->version == L2TP_HDR_VER_3) {
 		pn = l2tp_pernet(tunnel->l2tp_net);
@@ -343,12 +351,21 @@ static int l2tp_session_add_to_tunnel(struct l2tp_tunnel *tunnel,
 						session->session_id);
 
 		spin_lock_bh(&pn->l2tp_session_hlist_lock);
+
 		hlist_for_each_entry(session_walk, g_head, global_hlist)
-			if (session_walk->session_id == session->session_id)
-				goto exist_glob;
+			if (session_walk->session_id == session->session_id) {
+				err = -EEXIST;
+				goto err_tlock_pnlock;
+			}
 
+		l2tp_tunnel_inc_refcount(tunnel);
+		sock_hold(tunnel->sock);
 		hlist_add_head_rcu(&session->global_hlist, g_head);
+
 		spin_unlock_bh(&pn->l2tp_session_hlist_lock);
+	} else {
+		l2tp_tunnel_inc_refcount(tunnel);
+		sock_hold(tunnel->sock);
 	}
 
 	hlist_add_head(&session->hlist, head);
@@ -356,12 +373,12 @@ static int l2tp_session_add_to_tunnel(struct l2tp_tunnel *tunnel,
 
 	return 0;
 
-exist_glob:
+err_tlock_pnlock:
 	spin_unlock_bh(&pn->l2tp_session_hlist_lock);
-exist:
+err_tlock:
 	write_unlock_bh(&tunnel->hlist_lock);
 
-	return -EEXIST;
+	return err;
 }
 
 /* Lookup a tunnel by id
@@ -1251,7 +1268,6 @@ static void l2tp_tunnel_destruct(struct sock *sk)
 	/* Remove hooks into tunnel socket */
 	sk->sk_destruct = tunnel->old_sk_destruct;
 	sk->sk_user_data = NULL;
-	tunnel->sock = NULL;
 
 	/* Remove the tunnel struct from the tunnel list */
 	pn = l2tp_pernet(tunnel->l2tp_net);
@@ -1261,6 +1277,8 @@ static void l2tp_tunnel_destruct(struct sock *sk)
 	atomic_dec(&l2tp_tunnel_count);
 
 	l2tp_tunnel_closeall(tunnel);
+
+	tunnel->sock = NULL;
 	l2tp_tunnel_dec_refcount(tunnel);
 
 	/* Call the original destructor */
@@ -1285,6 +1303,7 @@ void l2tp_tunnel_closeall(struct l2tp_tunnel *tunnel)
 		  tunnel->name);
 
 	write_lock_bh(&tunnel->hlist_lock);
+	tunnel->acpt_newsess = false;
 	for (hash = 0; hash < L2TP_HASH_SIZE; hash++) {
 again:
 		hlist_for_each_safe(walk, tmp, &tunnel->session_hlist[hash]) {
@@ -1581,6 +1600,7 @@ int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32
 	tunnel->magic = L2TP_TUNNEL_MAGIC;
 	sprintf(&tunnel->name[0], "tunl %u", tunnel_id);
 	rwlock_init(&tunnel->hlist_lock);
+	tunnel->acpt_newsess = true;
 
 	/* The net we belong to */
 	tunnel->l2tp_net = net;
@@ -1829,11 +1849,6 @@ struct l2tp_session *l2tp_session_create(int priv_size, struct l2tp_tunnel *tunn
 			return ERR_PTR(err);
 		}
 
-		l2tp_tunnel_inc_refcount(tunnel);
-
-		/* Ensure tunnel socket isn't deleted */
-		sock_hold(tunnel->sock);
-
 		/* Ignore management session in session count value */
 		if (session->session_id != 0)
 			atomic_inc(&l2tp_session_count);
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 9101297f27ad..4593d48df953 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -162,6 +162,10 @@ struct l2tp_tunnel {
 	int			magic;		/* Should be L2TP_TUNNEL_MAGIC */
 	struct rcu_head rcu;
 	rwlock_t		hlist_lock;	/* protect session_hlist */
+	bool			acpt_newsess;	/* Indicates whether this
+						 * tunnel accepts new sessions.
+						 * Protected by hlist_lock.
+						 */
 	struct hlist_head	session_hlist[L2TP_HASH_SIZE];
 						/* hashed list of sessions,
 						 * hashed by id */
-- 
2.14.1

^ permalink raw reply related

* [PATCH net 0/2] l2tp: session creation fixes
From: Guillaume Nault @ 2017-09-01 15:58 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman

The session creation process has a few issues wrt. concurrent tunnel
deletion.

Patch #1 avoids creating sessions in tunnels that are getting removed.
This prevents races where sessions could try to take tunnel resources
that were already released.

Patch #2 removes some racy l2tp_tunnel_find() calls in session creation
callbacks. Together with path #1 it ensures that sessions can only
access tunnel resources that are guaranteed to remain valid during the
session creation process.


There are other problems with how sessions are created: pseudo-wire
specific data are set after the session is added to the tunnel. So
the session can be used, or deleted, before it has been completely
initialised. Separating session allocation from session registration
would be necessary, but we'd still have circular dependencies
preventing race-free registration. I'll consider this issue in future
series.


Guillaume Nault (2):
  l2tp: prevent creation of sessions on terminated tunnels
  l2tp: pass tunnel pointer to ->session_create()

 net/l2tp/l2tp_core.c    | 41 ++++++++++++++++++++++++++++-------------
 net/l2tp/l2tp_core.h    |  8 +++++++-
 net/l2tp/l2tp_eth.c     | 11 +++--------
 net/l2tp/l2tp_netlink.c |  8 ++++----
 net/l2tp/l2tp_ppp.c     | 19 +++++++------------
 5 files changed, 49 insertions(+), 38 deletions(-)

-- 
2.14.1

^ permalink raw reply

* [PATCH net 2/2] l2tp: pass tunnel pointer to ->session_create()
From: Guillaume Nault @ 2017-09-01 15:58 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman
In-Reply-To: <cover.1504277892.git.g.nault@alphalink.fr>

Using l2tp_tunnel_find() in pppol2tp_session_create() and
l2tp_eth_create() is racy, because no reference is held on the
returned session. These functions are only used to implement the
->session_create callback which is run by l2tp_nl_cmd_session_create().
Therefore searching for the parent tunnel isn't necessary because
l2tp_nl_cmd_session_create() already has a pointer to it and holds a
reference.

This patch modifies ->session_create()'s prototype to directly pass the
the parent tunnel as parameter, thus avoiding searching for it in
pppol2tp_session_create() and l2tp_eth_create().

Since we have to touch the ->session_create() call in
l2tp_nl_cmd_session_create(), let's also remove the useless conditional:
we know that ->session_create isn't NULL at this point because it's
already been checked earlier in this same function.

Finally, one might be tempted to think that the removed
l2tp_tunnel_find() calls were harmless because they would return the
same tunnel as the one held by l2tp_nl_cmd_session_create() anyway.
But that tunnel might be removed and a new one created with same tunnel
Id before the l2tp_tunnel_find() call. In this case l2tp_tunnel_find()
would return the new tunnel which wouldn't be protected by the
reference held by l2tp_nl_cmd_session_create().

Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Fixes: d9e31d17ceba ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
---
 net/l2tp/l2tp_core.h    |  4 +++-
 net/l2tp/l2tp_eth.c     | 11 +++--------
 net/l2tp/l2tp_netlink.c |  8 ++++----
 net/l2tp/l2tp_ppp.c     | 19 +++++++------------
 4 files changed, 17 insertions(+), 25 deletions(-)

diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 4593d48df953..a305e0c5925a 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -201,7 +201,9 @@ struct l2tp_tunnel {
 };
 
 struct l2tp_nl_cmd_ops {
-	int (*session_create)(struct net *net, u32 tunnel_id, u32 session_id, u32 peer_session_id, struct l2tp_session_cfg *cfg);
+	int (*session_create)(struct net *net, struct l2tp_tunnel *tunnel,
+			      u32 session_id, u32 peer_session_id,
+			      struct l2tp_session_cfg *cfg);
 	int (*session_delete)(struct l2tp_session *session);
 };
 
diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index 4de2ec94b08c..87da9ef61860 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -262,24 +262,19 @@ static void l2tp_eth_adjust_mtu(struct l2tp_tunnel *tunnel,
 	dev->needed_headroom += session->hdr_len;
 }
 
-static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 peer_session_id, struct l2tp_session_cfg *cfg)
+static int l2tp_eth_create(struct net *net, struct l2tp_tunnel *tunnel,
+			   u32 session_id, u32 peer_session_id,
+			   struct l2tp_session_cfg *cfg)
 {
 	unsigned char name_assign_type;
 	struct net_device *dev;
 	char name[IFNAMSIZ];
-	struct l2tp_tunnel *tunnel;
 	struct l2tp_session *session;
 	struct l2tp_eth *priv;
 	struct l2tp_eth_sess *spriv;
 	int rc;
 	struct l2tp_eth_net *pn;
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
-	if (!tunnel) {
-		rc = -ENODEV;
-		goto out;
-	}
-
 	if (cfg->ifname) {
 		strlcpy(name, cfg->ifname, IFNAMSIZ);
 		name_assign_type = NET_NAME_USER;
diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index 57427d430f10..7135f4645d3a 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -643,10 +643,10 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 		break;
 	}
 
-	ret = -EPROTONOSUPPORT;
-	if (l2tp_nl_cmd_ops[cfg.pw_type]->session_create)
-		ret = (*l2tp_nl_cmd_ops[cfg.pw_type]->session_create)(net, tunnel_id,
-			session_id, peer_session_id, &cfg);
+	ret = l2tp_nl_cmd_ops[cfg.pw_type]->session_create(net, tunnel,
+							   session_id,
+							   peer_session_id,
+							   &cfg);
 
 	if (ret >= 0) {
 		session = l2tp_session_get(net, tunnel, session_id, false);
diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index f0edb7209079..50e3ee9a9d61 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -788,25 +788,20 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
 
 #ifdef CONFIG_L2TP_V3
 
-/* Called when creating sessions via the netlink interface.
- */
-static int pppol2tp_session_create(struct net *net, u32 tunnel_id, u32 session_id, u32 peer_session_id, struct l2tp_session_cfg *cfg)
+/* Called when creating sessions via the netlink interface. */
+static int pppol2tp_session_create(struct net *net, struct l2tp_tunnel *tunnel,
+				   u32 session_id, u32 peer_session_id,
+				   struct l2tp_session_cfg *cfg)
 {
 	int error;
-	struct l2tp_tunnel *tunnel;
 	struct l2tp_session *session;
 	struct pppol2tp_session *ps;
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
-
-	/* Error if we can't find the tunnel */
-	error = -ENOENT;
-	if (tunnel == NULL)
-		goto out;
-
 	/* Error if tunnel socket is not prepped */
-	if (tunnel->sock == NULL)
+	if (!tunnel->sock) {
+		error = -ENOENT;
 		goto out;
+	}
 
 	/* Default MTU values. */
 	if (cfg->mtu == 0)
-- 
2.14.1

^ permalink raw reply related

* [PATCH net-next v2] doc: document MSG_ZEROCOPY
From: Willem de Bruijn @ 2017-09-01 16:01 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, brouer, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Documentation for this feature was missing from the patchset.
Copied a lot from the netdev 2.1 paper, addressing some small
interface changes since then.

Changes
  v1 -> v2
    - change email discussion URL format
    - clarify that u32 counter is per-syscall, unsigned and
      wraps after UINT_MAX calls
    - describe errno on send failure specific to MSG_ZEROCOPY
    - a few very minor rewordings

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 Documentation/networking/msg_zerocopy.rst | 258 ++++++++++++++++++++++++++++++
 1 file changed, 258 insertions(+)
 create mode 100644 Documentation/networking/msg_zerocopy.rst

diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
new file mode 100644
index 000000000000..1d18e79fba86
--- /dev/null
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -0,0 +1,258 @@
+
+============
+MSG_ZEROCOPY
+============
+
+Intro
+=====
+
+The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
+The feature is currently implemented for TCP sockets.
+
+
+Opportunity and Caveats
+-----------------------
+
+Copying large buffers between user process and kernel can be
+expensive. Linux supports various interfaces that eschew copying,
+such as sendpage and splice. The MSG_ZEROCOPY flag extends the
+underlying copy avoidance mechanism to common socket send calls.
+
+Copy avoidance is not a free lunch. As implemented, with page pinning,
+it replaces per byte copy cost with page accounting and completion
+notification overhead. As a result, MSG_ZEROCOPY is generally only
+effective at writes over around 10 KB.
+
+Page pinning also changes system call semantics. It temporarily shares
+the buffer between process and network stack. Unlike with copying, the
+process cannot immediately overwrite the buffer after system call
+return without possibly modifying the data in flight. Kernel integrity
+is not affected, but a buggy program can possibly corrupt its own data
+stream.
+
+The kernel returns a notification when it is safe to modify data.
+Converting an existing application to MSG_ZEROCOPY is not always as
+trivial as just passing the flag, then.
+
+
+More Info
+---------
+
+Much of this document was derived from a longer paper presented at
+netdev 2.1. For more in-depth information see that paper and talk,
+the excellent reporting over at LWN.net or read the original code.
+
+  paper, slides, video
+    https://netdevconf.org/2.1/session.html?debruijn
+
+  LWN article
+    https://lwn.net/Articles/726917/
+
+  patchset
+    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
+    http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
+
+
+Interface
+=========
+
+Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
+avoidance, but not the only one.
+
+Socket Setup
+------------
+
+The kernel is permissive when applications pass undefined flags to the
+send system call. By default it simply ignores these. To avoid enabling
+copy avoidance mode for legacy processes that accidentally already pass
+this flag, a process must first signal intent by setting a socket option:
+
+::
+
+	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
+		error(1, errno, "setsockopt zerocopy");
+
+
+Transmission
+------------
+
+The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
+Pass the new flag.
+
+::
+
+	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
+
+A zerocopy failure will return -1 with errno ENOBUFS. This happens if
+the socket option was not set, the socket exceeds its optmem limit or
+the user exceeds its ulimit on locked pages.
+
+
+Mixing copy avoidance and copying
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many workloads have a mixture of large and small buffers. Because copy
+avoidance is more expensive than copying for small packets, the
+feature is implemented as a flag. It is safe to mix calls with the flag
+with those without.
+
+
+Notifications
+-------------
+
+The kernel has to notify the process when it is safe to reuse a
+previously passed buffer. It queues completion notifications on the
+socket error queue, akin to the transmit timestamping interface.
+
+The notification itself is a simple scalar value. Each socket
+maintains an internal unsigned 32-bit counter. Each send call with
+MSG_ZEROCOPY that successfully sends data increments the counter. The
+counter is not incremented on failure or if called with length zero.
+The counter counts system call invocations, not bytes. It wraps after
+UINT_MAX calls.
+
+
+Notification Reception
+~~~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates the API. In the simplest case, each
+send syscall is followed by a poll and recvmsg on the error queue.
+
+Reading from the error queue is always a non-blocking operation. The
+poll call is there to block until an error is outstanding. It will set
+POLLERR in its output flags. That flag does not have to be set in the
+events field. Errors are signaled unconditionally.
+
+::
+
+	pfd.fd = fd;
+	pfd.events = 0;
+	if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0)
+		error(1, errno, "poll");
+
+	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+	if (ret == -1)
+		error(1, errno, "recvmsg");
+
+	read_notification(msg);
+
+The example is for demonstration purpose only. In practice, it is more
+efficient to not wait for notifications, but read without blocking
+every couple of send calls.
+
+Notifications can be processed out of order with other operations on
+the socket. A socket that has an error queued would normally block
+other operations until the error is read. Zerocopy notifications have
+a zero error code, however, to not block send and recv calls.
+
+
+Notification Batching
+~~~~~~~~~~~~~~~~~~~~~
+
+Multiple outstanding packets can be read at once using the recvmmsg
+call. This is often not needed. In each message the kernel returns not
+a single value, but a range. It coalesces consecutive notifications
+while one is outstanding for reception on the error queue.
+
+When a new notification is about to be queued, it checks whether the
+new value extends the range of the notification at the tail of the
+queue. If so, it drops the new notification packet and instead increases
+the range upper value of the outstanding notification.
+
+For protocols that acknowledge data in-order, like TCP, each
+notification can be squashed into the previous one, so that no more
+than one notification is outstanding at any one point.
+
+Ordered delivery is the common case, but not guaranteed. Notifications
+may arrive out of order on retransmission and socket teardown.
+
+
+Notification Parsing
+~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates how to parse the control message: the
+read_notification() call in the previous snippet. A notification
+is encoded in the standard error format, sock_extended_err.
+
+The level and type fields in the control data are protocol family
+specific, IP_RECVERR or IPV6_RECVERR.
+
+Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
+as explained before, to avoid blocking read and write system calls on
+the socket.
+
+The 32-bit notification range is encoded as [ee_info, ee_data]. This
+range is inclusive. Other fields in the struct must be treated as
+undefined, bar for ee_code, as discussed below.
+
+::
+
+	struct sock_extended_err *serr;
+	struct cmsghdr *cm;
+
+	cm = CMSG_FIRSTHDR(msg);
+	if (cm->cmsg_level != SOL_IP &&
+	    cm->cmsg_type != IP_RECVERR)
+		error(1, 0, "cmsg");
+
+	serr = (void *) CMSG_DATA(cm);
+	if (serr->ee_errno != 0 ||
+	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+		error(1, 0, "serr");
+
+	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
+
+
+Deferred copies
+~~~~~~~~~~~~~~~
+
+Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
+avoidance, and a contract that the kernel will queue a completion
+notification. It is not a guarantee that the copy is elided.
+
+Copy avoidance is not always feasible. Devices that do not support
+scatter-gather I/O cannot send packets made up of kernel generated
+protocol headers plus zerocopy user data. A packet may need to be
+converted to a private copy of data deep in the stack, say to compute
+a checksum.
+
+In all these cases, the kernel returns a completion notification when
+it releases its hold on the shared pages. That notification may arrive
+before the (copied) data is fully transmitted. A zerocopy completion
+notification is not a transmit completion notification, therefore.
+
+Deferred copies can be more expensive than a copy immediately in the
+system call, if the data is no longer warm in the cache. The process
+also incurs notification processing cost for no benefit. For this
+reason, the kernel signals if data was completed with a copy, by
+setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
+A process may use this signal to stop passing flag MSG_ZEROCOPY on
+subsequent requests on the same socket.
+
+
+Implementation
+==============
+
+Loopback
+--------
+
+Data sent to local sockets can be queued indefinitely if the receive
+process does not read its socket. Unbound notification latency is not
+acceptable. For this reason all packets generated with MSG_ZEROCOPY
+that are looped to a local socket will incur a deferred copy. This
+includes looping onto packet sockets (e.g., tcpdump) and tun devices.
+
+
+Testing
+=======
+
+More realistic example code can be found in the kernel source under
+tools/testing/selftests/net/msg_zerocopy.c.
+
+Be cognizant of the loopback constraint. The test can be run between
+a pair of hosts. But if run between a local pair of processes, for
+instance when run with msg_zerocopy.sh between a veth pair across
+namespaces, the test will not show any improvement. For testing, the
+loopback restriction can be temporarily relaxed by making
+skb_orphan_frags_rx identical to skb_orphan_frags.
+
-- 
2.14.1.581.gf28d330327-goog

^ permalink raw reply related

* Re: [PATCH net-next v2] doc: document MSG_ZEROCOPY
From: Alexei Starovoitov @ 2017-09-01 16:12 UTC (permalink / raw)
  To: Willem de Bruijn, netdev; +Cc: davem, brouer, Willem de Bruijn
In-Reply-To: <20170901160141.54616-1-willemdebruijn.kernel@gmail.com>

On 9/1/17 9:01 AM, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> Documentation for this feature was missing from the patchset.
> Copied a lot from the netdev 2.1 paper, addressing some small
> interface changes since then.
>
> Changes
>   v1 -> v2
>     - change email discussion URL format
>     - clarify that u32 counter is per-syscall, unsigned and
>       wraps after UINT_MAX calls
>     - describe errno on send failure specific to MSG_ZEROCOPY
>     - a few very minor rewordings
>
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next 1/2] flow_dissector: Cleanup control flow
From: Tom Herbert @ 2017-09-01 16:12 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S . Miller, Linux Kernel Network Developers, alex.popov
In-Reply-To: <87k21izjdc.fsf@stressinduktion.org>

On Fri, Sep 1, 2017 at 5:35 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Tom Herbert <tom@quantonium.net> writes:
>
>> __skb_flow_dissect is riddled with gotos that make discerning the flow,
>> debugging, and extending the capability difficult. This patch
>> reorganizes things so that we only perform goto's after the two main
>> switch statements (no gotos within the cases now). It also eliminates
>> several goto labels so that there are only two labels that can be target
>> for goto.
>
> The problem with the
>
> fdret = ... ;
> break;
>
> is that we now have to count curly braces to look what break does
> actually break and where fdret is being processed. With the number of
> context lines you send for the patch this is hard to review.
>
> I tend to like the gotos a bit more for now.

This is a step towards a more modular design for flow dissector. The
goto's force a monolithic design and make it hard to implement new
functionality like trying to enforce limits on encapsulation which
requires a single point for logic. The ip, ipv6, and mpls labels were
really unnecessary to begin with, the proto again logic works fine for
those. This is also a segway to breaking up the very large
__skb_flow_dissect function into more manageable components (IP
protocol dissection should be its own function for instance). Follow
on patches will allow protocol specific implementations of flow
dissection located with the rest of the protocol implementation, so
hopefully we can end the practice of adding support for every
networking protocol in one single core function (analogous to how we
parse protocols in GRO).

Tom

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-09-01 16:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Koichiro Den, virtualization,
	Network Development
In-Reply-To: <5ef7fcf3-d4f0-be16-6ddb-724d954cfc68@redhat.com>

On Thu, Aug 31, 2017 at 11:25 PM, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2017年08月31日 22:30, Willem de Bruijn wrote:
>>>
>>> Incomplete results at this stage, but I do see this correlation between
>>> flows. It occurs even while not running out of zerocopy descriptors,
>>> which I cannot yet explain.
>>>
>>> Running two threads in a guest, each with a udp socket, each
>>> sending up to 100 datagrams, or until EAGAIN, every msec.
>>>
>>> Sender A sends 1B datagrams.
>>> Sender B sends VHOST_GOODCOPY_LEN, which is enough
>>> to trigger zcopy_used in vhost net.
>>>
>>> A local receive process on the host receives both flows. To avoid
>>> a deep copy when looping the packet onto the receive path,
>>> changed skb_orphan_frags_rx to always return false (gross hack).
>>>
>>> The flow with the larger packets is redirected through netem on ifb0:
>>>
>>>    modprobe ifb
>>>    ip link set dev ifb0 up
>>>    tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>>>
>>>    tc qdisc add dev tap0 ingress
>>>    tc filter add dev tap0 parent ffff: protocol ip \
>>>        u32 match ip dport 8000 0xffff \
>>>        action mirred egress redirect dev ifb0
>>>
>>> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>>>
>>> no filter
>>>    rx.A: ~840,000
>>>    rx.B: ~840,000
>>>
>>> limit 1
>>>    rx.A: ~500,000
>>>    rx.B: ~3100
>>>    ifb0: 3273 sent, 371141 dropped
>>>
>>> limit 100
>>>    rx.A: ~9000
>>>    rx.B: ~4200
>>>    ifb0: 4630 sent, 1491 dropped
>>>
>>> limit 1000
>>>    rx.A: ~6800
>>>    rx.B: ~4200
>>>    ifb0: 4651 sent, 0 dropped
>>>
>>> Sender B is always correctly rate limited to 1 MBps or less. With a
>>> short queue, it ends up dropping a lot and sending even less.
>>>
>>> When a queue builds up for sender B, sender A throughput is strongly
>>> correlated with queue length. With queue length 1, it can send almost
>>> at unthrottled speed. But even at limit 100 its throughput is on the
>>> same order as sender B.
>>>
>>> What is surprising to me is that this happens even though the number
>>> of ubuf_info in use at limit 100 is around 100 at all times. In other
>>> words,
>>> it does not exhaust the pool.
>>>
>>> When forcing zcopy_used to be false for all packets, this effect of
>>> sender A throughput being correlated with sender B does not happen.
>>>
>>> no filter
>>>    rx.A: ~850,000
>>>    rx.B: ~850,000
>>>
>>> limit 100
>>>    rx.A: ~850,000
>>>    rx.B: ~4200
>>>    ifb0: 4518 sent, 876182 dropped
>>>
>>> Also relevant is that with zerocopy, the sender processes back off
>>> and report the same count as the receiver. Without zerocopy,
>>> both senders send at full speed, even if only 4200 packets from flow
>>> B arrive at the receiver.
>>>
>>> This is with the default virtio_net driver, so without napi-tx.
>>>
>>> It appears that the zerocopy notifications are pausing the guest.
>>> Will look at that now.
>>
>> It was indeed as simple as that. With 256 descriptors, queuing even
>> a hundred or so packets causes the guest to stall the device as soon
>> as the qdisc is installed.
>>
>> Adding this check
>>
>> +                       in_use = nvq->upend_idx - nvq->done_idx;
>> +                       if (nvq->upend_idx < nvq->done_idx)
>> +                               in_use += UIO_MAXIOV;
>> +
>> +                       if (in_use > (vq->num >> 2))
>> +                               zcopy_used = false;
>>
>> Has the desired behavior of reverting zerocopy requests to copying.
>>
>> Without this change, the result is, as previously reported, throughput
>> dropping to hundreds of packets per second on both flows.
>>
>> With the change, pps as observed for a few seconds at handle_tx is
>>
>> zerocopy=165 copy=168435
>> zerocopy=0 copy=168500
>> zerocopy=65 copy=168535
>>
>> Both flows continue to send at more or less normal rate, with only
>> sender B observing massive drops at the netem.
>>
>> With the queue removed the rate reverts to
>>
>> zerocopy=58878 copy=110239
>> zerocopy=58833 copy=110207
>>
>> This is not a 50/50 split, which impliesTw that some packets from the
>> large
>> packet flow are still converted to copying. Without the change the rate
>> without queue was 80k zerocopy vs 80k copy, so this choice of
>> (vq->num >> 2) appears too conservative.
>>
>> However, testing with (vq->num >> 1) was not as effective at mitigating
>> stalls. I did not save that data, unfortunately. Can run more tests on
>> fine
>> tuning this variable, if the idea sounds good.
>
>
> Looks like there're still two cases were left:

To be clear, this patch is not intended to fix all issues. It is a small
improvement to avoid HoL blocking due to queued zerocopy skbs.

The trade-off is that reverting to copying in these cases increases
cycle cost. I think that that is a trade-off worth making compared to
the alternative drop in throughput. It probably would be good to be
able to measure this without kernel instrumentation: export
counters similar to net->tx_zcopy_err and net->tx_packets (though
without reset to zero, as in vhost_net_tx_packet).

> 1) sndbuf is not INT_MAX

You mean the case where the device stalls, later zerocopy notifications
are queued, but these are never cleaned in free_old_xmit_skbs,
because it requires a start_xmit and by now the (only) socket is out of
descriptors?

A watchdog would help somewhat. With tx-napi, this case cannot occur,
either, as free_old_xmit_skbs no longer depends on a call to start_xmit.

> 2) tx napi is used for virtio-net

I am not aware of any issue specific to the use of tx-napi?

> 1) could be a corner case, and for 2) what your suggest here may not solve
> the issue since it still do in order completion.

Somewhat tangential, but it might also help to break the in-order
completion processing in vhost_zerocopy_signal_used. Complete
all descriptors between done_idx and upend_idx. done_idx should
then only be forward to the oldest still not-completed descriptor.

In the test I ran, where the oldest descriptors are held in a queue and
all newer ones are tail-dropped, this would avoid blocking a full ring
of completions, when only a small number (or 1) is actually delayed.

Dynamic switching between copy and zerocopy using zcopy_used
already returns completions out-of-order, so this is not a huge leap.

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-09-01 16:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Koichiro Den, virtualization,
	Network Development
In-Reply-To: <CAF=yD-JREJqpsRdCi31qo1QNFEE2DDCaTGYw1f1waPgtov8WEg@mail.gmail.com>

>>> This is not a 50/50 split, which impliesTw that some packets from the
>>> large
>>> packet flow are still converted to copying. Without the change the rate
>>> without queue was 80k zerocopy vs 80k copy, so this choice of
>>> (vq->num >> 2) appears too conservative.
>>>
>>> However, testing with (vq->num >> 1) was not as effective at mitigating
>>> stalls. I did not save that data, unfortunately. Can run more tests on
>>> fine
>>> tuning this variable, if the idea sounds good.
>>
>>
>> Looks like there're still two cases were left:
>
> To be clear, this patch is not intended to fix all issues. It is a small
> improvement to avoid HoL blocking due to queued zerocopy skbs.
>
> The trade-off is that reverting to copying in these cases increases
> cycle cost. I think that that is a trade-off worth making compared to
> the alternative drop in throughput. It probably would be good to be
> able to measure this without kernel instrumentation: export
> counters similar to net->tx_zcopy_err and net->tx_packets (though
> without reset to zero, as in vhost_net_tx_packet).
>
>> 1) sndbuf is not INT_MAX
>
> You mean the case where the device stalls, later zerocopy notifications
> are queued, but these are never cleaned in free_old_xmit_skbs,
> because it requires a start_xmit and by now the (only) socket is out of
> descriptors?

Typo, sorry. I meant out of sndbuf.

> A watchdog would help somewhat. With tx-napi, this case cannot occur,
> either, as free_old_xmit_skbs no longer depends on a call to start_xmit.
>
>> 2) tx napi is used for virtio-net
>
> I am not aware of any issue specific to the use of tx-napi?
>
>> 1) could be a corner case, and for 2) what your suggest here may not solve
>> the issue since it still do in order completion.
>
> Somewhat tangential, but it might also help to break the in-order
> completion processing in vhost_zerocopy_signal_used. Complete
> all descriptors between done_idx and upend_idx. done_idx should
> then only be forward to the oldest still not-completed descriptor.
>
> In the test I ran, where the oldest descriptors are held in a queue and
> all newer ones are tail-dropped, this would avoid blocking a full ring
> of completions, when only a small number (or 1) is actually delayed.
>
> Dynamic switching between copy and zerocopy using zcopy_used
> already returns completions out-of-order, so this is not a huge leap.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox