Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [pull request][net-next 0/5] Mellanox, mlx5 updates 2017-04-22
From: David Miller @ 2017-04-24 18:11 UTC (permalink / raw)
  To: saeedm; +Cc: netdev, ogerlitz, roid, stephen
In-Reply-To: <20170422184507.26569-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Sat, 22 Apr 2017 21:45:02 +0300

> This series contains some updates to mlx5 driver.
> 
> Sparse and compiler warnings fixes from Stephen Hemminger.
> 
> From Roi Dayan and Or Gerlitz, Add devlink and mlx5 support for controlling
> E-Switch encapsulation mode, this knob will enable HW support for applying
> encapsulation/decapsulation to VF traffic as part of SRIOV e-switch offloading.
> 
> Please pull and let me know if there's any problem.

Applied, thank you.

^ permalink raw reply

* Re: [PATCH v2] brcmfmac: Make skb header writable before use
From: Eric Dumazet @ 2017-04-24 18:09 UTC (permalink / raw)
  To: James Hughes
  Cc: Arend van Spriel, Franky Lin, Hante Meuleman, Kalle Valo,
	linux-wireless, brcm80211-dev-list.pdl, netdev
In-Reply-To: <20170424130322.476-1-james.hughes@raspberrypi.org>

On Mon, 2017-04-24 at 14:03 +0100, James Hughes wrote:
> The driver was making changes to the skb_header without
> ensuring it was writable (i.e. uncloned).
> This patch also removes some boiler plate header size
> checking/adjustment code as that is also handled by the
> skb_cow_header function used to make header writable.
> 
> This patch depends on
> brcmfmac: Ensure pointer correctly set if skb data location changes
> 
> Signed-off-by: James Hughes <james.hughes@raspberrypi.org>
> ---
> Changes in v2
>   Makes the _cow_ call at the entry point of the skb in to the
>   stack, means only needs to be done once, and error handling
>   is easier.
>   Split a separate minor bug fix off to a separate patch (which
>   this patch depends on)

Best way to handle dependencies is to send a patch series.

1/2 brcmfmac: Ensure pointer correctly set if skb data location changes
2/2 brcmfmac: Make skb header writable before use

^ permalink raw reply

* Re: [PATCH net-next] net: add rcu locking when changing early demux
From: David Miller @ 2017-04-24 18:08 UTC (permalink / raw)
  To: dsa; +Cc: netdev, subashab
In-Reply-To: <1492878796-31328-1-git-send-email-dsa@cumulusnetworks.com>

From: David Ahern <dsa@cumulusnetworks.com>
Date: Sat, 22 Apr 2017 09:33:16 -0700

> systemd-sysctl is triggering a suspicious RCU usage message when
> net.ipv4.tcp_early_demux or net.ipv4.udp_early_demux is changed via
> a sysctl config file:
> 
> [   33.896184] ===============================
> [   33.899558] [ ERR: suspicious RCU usage.  ]
> [   33.900624] 4.11.0-rc7+ #104 Not tainted
> [   33.901698] -------------------------------
> [   33.903059] /home/dsa/kernel-2.git/net/ipv4/sysctl_net_ipv4.c:305 suspicious rcu_dereference_check() usage!
> [   33.905724]
> other info that might help us debug this:
> 
> [   33.907656]
> rcu_scheduler_active = 2, debug_locks = 0
> [   33.909288] 1 lock held by systemd-sysctl/143:
> [   33.910373]  #0:  (sb_writers#5){.+.+.+}, at: [<ffffffff8123a370>] file_start_write+0x45/0x48
> [   33.912407]
> stack backtrace:
> [   33.914018] CPU: 0 PID: 143 Comm: systemd-sysctl Not tainted 4.11.0-rc7+ #104
> [   33.915631] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
> [   33.917870] Call Trace:
> [   33.918431]  dump_stack+0x81/0xb6
> [   33.919241]  lockdep_rcu_suspicious+0x10f/0x118
> [   33.920263]  proc_configure_early_demux+0x65/0x10a
> [   33.921391]  proc_udp_early_demux+0x3a/0x41
> 
> add rcu locking to proc_configure_early_demux.
> 
> Fixes: dddb64bcb3461 ("net: Add sysctl to toggle early demux for tcp and udp")
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

Applied, thanks David.

^ permalink raw reply

* Re: [PATCH v2 net-next] net: ipv6: send unsolicited NA if enabled for all interfaces
From: David Miller @ 2017-04-24 18:07 UTC (permalink / raw)
  To: dsa; +Cc: netdev, hannes
In-Reply-To: <1492877413-21906-1-git-send-email-dsa@cumulusnetworks.com>

From: David Ahern <dsa@cumulusnetworks.com>
Date: Sat, 22 Apr 2017 09:10:13 -0700

> When arp_notify is set to 1 for either a specific interface or for 'all'
> interfaces, gratuitous arp requests are sent. Since ndisc_notify is the
> ipv6 equivalent to arp_notify, it should follow the same semantics.
> Commit 4a6e3c5def13 ("net: ipv6: send unsolicited NA on admin up") sends
> the NA on admin up. The final piece is checking devconf_all->ndisc_notify
> in addition to the per device setting. Add it.
> 
> Fixes: 5cb04436eef6 ("ipv6: add knob to send unsolicited ND on link-layer address change")
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---
> v2
> - update commit message with subject of commit 4a6e3c5def13 per comment
>   from Sergei

Applied, thanks David.

^ permalink raw reply

* Re: pull request: bluetooth-next 2017-04-22
From: David Miller @ 2017-04-24 18:06 UTC (permalink / raw)
  To: johan.hedberg; +Cc: linux-bluetooth, netdev
In-Reply-To: <20170422131302.GA17928@x1c>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Sat, 22 Apr 2017 16:13:02 +0300

> Here are some more Bluetooth patches (and one 802.15.4 patch) in the
> bluetooth-next tree targeting the 4.12 kernel. Most of them are pure
> fixes.
> 
> Please let me know if there are any issues pulling. Thanks.

Pulled, thank you.

^ permalink raw reply

* Re: [PATCH v2] net: natsemi: ns83820: add checks for dma mapping error
From: David Miller @ 2017-04-24 18:04 UTC (permalink / raw)
  To: khoroshilov; +Cc: netdev, linux-kernel, ldv-project
In-Reply-To: <1492866365-5422-1-git-send-email-khoroshilov@ispras.ru>

From: Alexey Khoroshilov <khoroshilov@ispras.ru>
Date: Sat, 22 Apr 2017 16:06:05 +0300

> +		free_idx = (free_idx + NR_TX_DESC - 1) % NR_TX_DESC;

This is more simply stated as "(free_idx - 1) % NR_TX_DESC.

^ permalink raw reply

* Re: [PATCH] net: netcp: fix spelling mistake: "memomry" -> "memory"
From: David Miller @ 2017-04-24 17:59 UTC (permalink / raw)
  To: colin.king; +Cc: w-kwok2, m-karicheri2, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20170422122201.7979-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Sat, 22 Apr 2017 13:22:01 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> Trivial fix to spelling mistake in dev_err message and rejoin
> line.
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] ravb: Double free on error in ravb_start_xmit()
From: David Miller @ 2017-04-24 17:59 UTC (permalink / raw)
  To: dan.carpenter
  Cc: sergei.shtylyov, kazuya.mizuguchi.ks, horms+renesas, ykaneko0929,
	masaru.nagai.vx, geert+renesas, niklas.soderlund+renesas, tremyfr,
	netdev, linux-renesas-soc, kernel-janitors
In-Reply-To: <20170422104656.jjnooh3or4x3lbw3@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Sat, 22 Apr 2017 13:46:56 +0300

> If skb_put_padto() fails then it frees the skb.  I shifted that code
> up a bit to make my error handling a little simpler.
> 
> Fixes: a0d2f20650e8 ("Renesas Ethernet AVB PTP clock driver")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied.

^ permalink raw reply

* Re: [PATCH] net: atheros: atl1: use offset_in_page() macro
From: David Miller @ 2017-04-24 17:58 UTC (permalink / raw)
  To: geliangtang; +Cc: jcliburn, chris.snook, edumazet, netdev, linux-kernel
In-Reply-To: <c8746e014d4f7121ab4602548e199f8e8f2fc083.1492758007.git.geliangtang@gmail.com>

From: Geliang Tang <geliangtang@gmail.com>
Date: Sat, 22 Apr 2017 09:21:10 +0800

> Use offset_in_page() macro instead of open-coding.
> 
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 0/5] bnxt_en: Updates for net-next.
From: David Miller @ 2017-04-24 17:55 UTC (permalink / raw)
  To: michael.chan; +Cc: netdev
In-Reply-To: <1492819886-19625-1-git-send-email-michael.chan@broadcom.com>

From: Michael Chan <michael.chan@broadcom.com>
Date: Fri, 21 Apr 2017 20:11:21 -0400

> Miscellaneous updates include passing DCBX RoCE VLAN priority to firmware,
> checking one more new firmware flag before allowing DCBX to run on the host,
> adding 100Gbps speed support, adding check to disallow speed settings on
> Multi-host NICs, and a minor fix for reporting VF attributes.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next v2 2/2] openvswitch: Add eventmask support to CT action.
From: David Miller @ 2017-04-24 17:53 UTC (permalink / raw)
  To: jarno; +Cc: netdev
In-Reply-To: <1492818486-57292-2-git-send-email-jarno@ovn.org>

From: Jarno Rajahalme <jarno@ovn.org>
Date: Fri, 21 Apr 2017 16:48:06 -0700

> Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
> which can be used in conjunction with the commit flag
> (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
> conntrack events (IPCT_*) should be delivered via the Netfilter
> netlink multicast groups.  Default behavior depends on the system
> configuration, but typically a lot of events are delivered.  This can be
> very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
> types of events are of interest.
> 
> Netfilter core init_conntrack() adds the event cache extension, so we
> only need to set the ctmask value.  However, if the system is
> configured without support for events, the setting will be skipped due
> to extension not being found.
> 
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> Reviewed-by: Greg Rose <gvrose8192@gmail.com>
> Acked-by: Joe Stringer <joe@ovn.org>

Applied.

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] openvswitch: Typo fix.
From: David Miller @ 2017-04-24 17:53 UTC (permalink / raw)
  To: jarno; +Cc: netdev
In-Reply-To: <1492818486-57292-1-git-send-email-jarno@ovn.org>

From: Jarno Rajahalme <jarno@ovn.org>
Date: Fri, 21 Apr 2017 16:48:05 -0700

> Fix typo in a comment.
> 
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> Acked-by: Greg Rose <gvrose8192@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next v2 2/5] virtio-net: transmit napi
From: Willem de Bruijn @ 2017-04-24 17:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Network Development, virtualization, David Miller,
	Willem de Bruijn
In-Reply-To: <20170424201346-mutt-send-email-mst@kernel.org>

On Mon, Apr 24, 2017 at 1:14 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 24, 2017 at 01:05:45PM -0400, Willem de Bruijn wrote:
>> On Mon, Apr 24, 2017 at 12:40 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Apr 21, 2017 at 10:50:12AM -0400, Willem de Bruijn wrote:
>> >> >>> Maybe I was wrong, but according to Michael's comment it looks like he
>> >> >>> want
>> >> >>> check affinity_hint_set just for speculative tx polling on rx napi
>> >> >>> instead
>> >> >>> of disabling it at all.
>> >> >>>
>> >> >>> And I'm not convinced this is really needed, driver only provide affinity
>> >> >>> hint instead of affinity, so it's not guaranteed that tx and rx interrupt
>> >> >>> are in the same vcpus.
>> >> >>
>> >> >> You're right. I made the restriction broader than the request, to really
>> >> >> err
>> >> >> on the side of caution for the initial merge of napi tx. And enabling
>> >> >> the optimization is always a win over keeping it off, even without irq
>> >> >> affinity.
>> >> >>
>> >> >> The cycle cost is significant without affinity regardless of whether the
>> >> >> optimization is used.
>> >> >
>> >> >
>> >> > Yes, I noticed this in the past too.
>> >> >
>> >> >> Though this is not limited to napi-tx, it is more
>> >> >> pronounced in that mode than without napi.
>> >> >>
>> >> >> 1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:
>> >> >>
>> >> >> upstream:
>> >> >>
>> >> >> 1,1,1: 28985 Mbps, 278 Gcyc
>> >> >> 1,0,2: 30067 Mbps, 402 Gcyc
>> >> >>
>> >> >> napi tx:
>> >> >>
>> >> >> 1,1,1: 34492 Mbps, 269 Gcyc
>> >> >> 1,0,2: 36527 Mbps, 537 Gcyc (!)
>> >> >> 1,0,1: 36269 Mbps, 394 Gcyc
>> >> >> 1,0,0: 34674 Mbps, 402 Gcyc
>> >> >>
>> >> >> This is a particularly strong example. It is also representative
>> >> >> of most RR tests. It is less pronounced in other streaming tests.
>> >> >> 10x TCP_RR, for instance:
>> >> >>
>> >> >> upstream:
>> >> >>
>> >> >> 1,1,1: 42267 Mbps, 301 Gcyc
>> >> >> 1,0,2: 40663 Mbps, 445 Gcyc
>> >> >>
>> >> >> napi tx:
>> >> >>
>> >> >> 1,1,1: 42420 Mbps, 303 Gcyc
>> >> >> 1,0,2:  42267 Mbps, 431 Gcyc
>> >> >>
>> >> >> These numbers were obtained with the virtqueue_enable_cb_delayed
>> >> >> optimization after xmit_skb, btw. It turns out that moving that before
>> >> >> increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
>> >> >> 100x TCP_RR a bit.
>> >> >
>> >> >
>> >> > I see, so I think we can leave the affinity hint optimization/check for
>> >> > future investigation:
>> >> >
>> >> > - to avoid endless optimization (e.g we may want to share a single
>> >> > vector/napi for tx/rx queue pairs in the future) for this series.
>> >> > - tx napi is disabled by default which means we can do optimization on top.
>> >>
>> >> Okay. I'll drop the vi->affinity_hint_set from the patch set for now.
>> >
>> > I kind of like it, let's be conservative. But I'd prefer a comment
>> > near it explaining why it's there.
>>
>> I don't feel strongly. Was minutes away from sending a v3 with this
>> code reverted, but I'll reinstate it and add a comment. Other planned
>> changes based on Jason's feedback to v2:
>>
>>   v2 -> v3:
>>     - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll
>>           ensure that the handler always cleans, to avoid deadlock
>>     - unconditionally clean in start_xmit
>>           avoid adding an unnecessary "if (use_napi)" branch
>>     - remove virtqueue_disable_cb in patch 5/5
>>           a noop in the common event_idx based loop
>
> Makes sense, thanks!

Great. Sent that, thanks.

The actual diff to v2 is quite small:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b107ae011632..003143835766 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -986,6 +986,9 @@ static void virtnet_napi_tx_enable(struct virtnet_info *vi,
        if (!napi->weight)
                return;

+       /* Tx napi touches cachelines on the cpu handling tx interrupts. Only
+        * enable the feature if this is likely affine with the transmit path.
+        */
        if (!vi->affinity_hint_set) {
                napi->weight = 0;
                return;
@@ -1131,10 +1134,9 @@ static int virtnet_poll_tx(struct napi_struct
*napi, int budget)
        struct virtnet_info *vi = sq->vq->vdev->priv;
        struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));

-       if (__netif_tx_trylock(txq)) {
-               free_old_xmit_skbs(sq);
-               __netif_tx_unlock(txq);
-       }
+       __netif_tx_lock(txq, raw_smp_processor_id());
+       free_old_xmit_skbs(sq);
+       __netif_tx_unlock(txq);

        virtqueue_napi_complete(napi, sq->vq, 0);

@@ -1196,14 +1198,10 @@ static netdev_tx_t start_xmit(struct sk_buff
*skb, struct net_device *dev)
        bool use_napi = sq->napi.weight;

        /* Free up any pending old buffers before queueing new ones. */
-       if (use_napi) {
-               if (kick)
-                       virtqueue_enable_cb_delayed(sq->vq);
-               else
-                       virtqueue_disable_cb(sq->vq);
-       } else {
-               free_old_xmit_skbs(sq);
-       }
+       free_old_xmit_skbs(sq);
+
+       if (use_napi && kick)
+               virtqueue_enable_cb_delayed(sq->vq);

(gmail will munge the identation, sorry)

^ permalink raw reply related

* Re: [PATCH v3 07/29] x86: bpf_jit, use ENTRY+ENDPROC
From: Jiri Slaby @ 2017-04-24 17:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, David Miller, mingo, tglx, hpa, x86, jpoimboe,
	linux-kernel, netdev, daniel, edumazet
In-Reply-To: <20170424164717.GA80404@ast-mbp.thefacebook.com>

On 04/24/2017, 06:47 PM, Alexei Starovoitov wrote:
> On Mon, Apr 24, 2017 at 06:02:51PM +0200, Jiri Slaby wrote:
>> On 04/24/2017, 05:55 PM, Ingo Molnar wrote:
>>> * Jiri Slaby <jslaby@suse.cz> wrote:
>>>
>>>> On 04/24/2017, 05:08 PM, David Miller wrote:
>>>>> If you align the entry points, then the code sequence as a whole is
>>>>> are no longer densely packed.
>>>>
>>>> Sure.
>>>>
>>>>> Or do I misunderstand how your macros work?
>>>>
>>>> Perhaps. So the suggested macros for the code are:
>>>> #define BPF_FUNC_START_LOCAL(name) \
>>>> 		SYM_START(name, SYM_V_LOCAL, SYM_A_NONE)
>>>> #define BPF_FUNC_START(name) \
>>>> 		SYM_START(name, SYM_V_GLOBAL, SYM_A_NONE)
>>>>
>>>> and they differ from the standard ones:
>>>> #define SYM_FUNC_START_LOCAL(name)                      \
>>>>         SYM_START(name, SYM_V_LOCAL, SYM_A_ALIGN)
>>>> #define SYM_FUNC_START(name)                            \
>>>>         SYM_START(name, SYM_V_GLOBAL, SYM_A_ALIGN)
>>>>
>>>>
>>>> The difference is SYM_A_NONE vs. SYM_A_ALIGN, which means:
>>>> #define SYM_A_ALIGN                             ALIGN
>>>> #define SYM_A_NONE                              /* nothing */
>>>>
>>>> Does it look OK now?
>>>
>>> No, the patch changes alignment which is undesirable, it needs to preserve the 
>>> existing (non-)alignment of the symbols!
>>
>> OK, so I am not expressing myself explicitly enough, it seems.
>>
>> So, correct, the patch v3 adds alignments. I suggested in the discussion
>> the macros above. They do not add alignments. If everybody is OK with
>> that, v4 of the patch won't add alignments. OK?
> 
> can we go back to what problem this patch set is trying to solve?
> Sounds like you want to add _function_ start/end marks to aid debugging?
> Debugging with what? What tool will recognize this stuff?

objtool will generate DWARF debuginfo between every ENTRY and ENDPROC
(dubbed differently at the end of the series). DWARF is understood by
everything, including the kernel proper (we have DWARF unwinder in
SUSE's kernels available for decades).

> Take a look at what your patch does:
> +ENTRY(sk_load_word)
>         test    %esi,%esi
>         js      bpf_slow_path_word_neg
> +ENDPROC(sk_load_word)
> 
> Does above two assembler instructions look like a function?

Yes, you are right, the code is complete mess.

For example what's the point of making the sk_load_word_positive_offset
label a global, callable function? Note that this is exactly the reason
why this particular two hunks look weird to you even though the
annotations only mechanically paraphrase what is in the current code.

> or this:
> +ENTRY(sk_load_byte_positive_offset)
>         cmp     %esi,%r9d   /* if (offset >= hlen) goto bpf_slow_path_byte */
>         jle     bpf_slow_path_byte
>         movzbl  (SKBDATA,%rsi),%eax
>         ret
> +ENDPROC(sk_load_byte_positive_offset)
> 
> This assembler code doesn't represent functions. There is no prologue/epilogue
> and no stack frame. JITed code uses 'call' insn to jump into them, but they're
> not your typical C functions.

I am not looking for C functions everywhere in assembly, actually. (In
the ideal assembly, I would and in most cases I really am.) But you are
right the annotations of the current code aren't right. It results in my
annotations being wrong too -- based on invalid assumptions.

> Take a look at bpf_slow_path_common() macro that creates the frame before
> calling into C code with 'call skb_copy_bits;'
> 
> I still think that this code should be left alone.
> Even macro names you're proposing:
>  #define BPF_FUNC_START_LOCAL
> don't sound right. These are not functions.

Well, what is the reason to call them FUNC in the current code then?

thanks,
-- 
js
suse labs

^ permalink raw reply

* PATCH drivers:net:cris/eth_v10: alternate string char arrary
From: Karim Eshapa @ 2017-04-24 17:49 UTC (permalink / raw)
  To: davem
  Cc: felipe.balbi, paul.gortmaker, mugunthanvnm, jarod, fw, netdev,
	linux-kernel, Karim Eshapa

static char pointer creates two variables in final assembly.
static string and pointer to it according to
Jeff Garzik janitors TODO.

Signed-off-by: Karim Eshapa <karim.eshapa@gmail.com>
---
 drivers/net/cris/eth_v10.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/cris/eth_v10.c b/drivers/net/cris/eth_v10.c
index 91c876a..370765f 100644
--- a/drivers/net/cris/eth_v10.c
+++ b/drivers/net/cris/eth_v10.c
@@ -44,7 +44,7 @@
  * io regions, irqs and dma channels
  */
 
-static const char* cardname = "ETRAX 100LX built-in ethernet controller";
+static const char cardname[] = "ETRAX 100LX built-in ethernet controller";
 
 /* A default ethernet address. Highlevel SW will set the real one later */
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v3 5/5] virtio-net: keep tx interrupts disabled unless kick
From: Willem de Bruijn @ 2017-04-24 17:49 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, virtualization, davem, Willem de Bruijn
In-Reply-To: <20170424174930.82623-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Tx napi mode increases the rate of transmit interrupts. Suppress some
by masking interrupts while more packets are expected. The interrupts
will be reenabled before the last packet is sent.

This optimization reduces the througput drop with tx napi for
unidirectional flows such as UDP_STREAM that do not benefit from
cleaning tx completions in the the receive napi handler.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/virtio_net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9dd978f34c1f..003143835766 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1200,6 +1200,9 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Free up any pending old buffers before queueing new ones. */
 	free_old_xmit_skbs(sq);
 
+	if (use_napi && kick)
+		virtqueue_enable_cb_delayed(sq->vq);
+
 	/* timestamp packet in software */
 	skb_tx_timestamp(skb);
 
-- 
2.12.2.816.g2cccc81164-goog

^ permalink raw reply related

* [PATCH net-next v3 4/5] virtio-net: clean tx descriptors from rx napi
From: Willem de Bruijn @ 2017-04-24 17:49 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, virtualization, davem, Willem de Bruijn
In-Reply-To: <20170424174930.82623-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Amortize the cost of virtual interrupts by doing both rx and tx work
on reception of a receive interrupt if tx napi is enabled. With
VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
interrupts for bidirectional workloads.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/virtio_net.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4ec79e5d7a86..9dd978f34c1f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1075,12 +1075,33 @@ static void free_old_xmit_skbs(struct send_queue *sq)
 	u64_stats_update_end(&stats->tx_syncp);
 }
 
+static void virtnet_poll_cleantx(struct receive_queue *rq)
+{
+	struct virtnet_info *vi = rq->vq->vdev->priv;
+	unsigned int index = vq2rxq(rq->vq);
+	struct send_queue *sq = &vi->sq[index];
+	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, index);
+
+	if (!sq->napi.weight)
+		return;
+
+	if (__netif_tx_trylock(txq)) {
+		free_old_xmit_skbs(sq);
+		__netif_tx_unlock(txq);
+	}
+
+	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+		netif_tx_wake_queue(txq);
+}
+
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
 	struct receive_queue *rq =
 		container_of(napi, struct receive_queue, napi);
 	unsigned int received;
 
+	virtnet_poll_cleantx(rq);
+
 	received = virtnet_receive(rq, budget);
 
 	/* Out of packets? */
-- 
2.12.2.816.g2cccc81164-goog

^ permalink raw reply related

* [PATCH net-next v3 3/5] virtio-net: move free_old_xmit_skbs
From: Willem de Bruijn @ 2017-04-24 17:49 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, virtualization, davem, Willem de Bruijn
In-Reply-To: <20170424174930.82623-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

An upcoming patch will call free_old_xmit_skbs indirectly from
virtnet_poll. Move the function above this to avoid having to
introduce a forward declaration.

This is a pure move: no code changes.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/virtio_net.c | 60 ++++++++++++++++++++++++------------------------
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 356d18481ee4..4ec79e5d7a86 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1045,6 +1045,36 @@ static int virtnet_receive(struct receive_queue *rq, int budget)
 	return received;
 }
 
+static void free_old_xmit_skbs(struct send_queue *sq)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+	unsigned int packets = 0;
+	unsigned int bytes = 0;
+
+	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+
+		bytes += skb->len;
+		packets++;
+
+		dev_kfree_skb_any(skb);
+	}
+
+	/* Avoid overhead when no packets have been processed
+	 * happens when called speculatively from start_xmit.
+	 */
+	if (!packets)
+		return;
+
+	u64_stats_update_begin(&stats->tx_syncp);
+	stats->tx_bytes += bytes;
+	stats->tx_packets += packets;
+	u64_stats_update_end(&stats->tx_syncp);
+}
+
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
 	struct receive_queue *rq =
@@ -1077,36 +1107,6 @@ static int virtnet_open(struct net_device *dev)
 	return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
-{
-	struct sk_buff *skb;
-	unsigned int len;
-	struct virtnet_info *vi = sq->vq->vdev->priv;
-	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-	unsigned int packets = 0;
-	unsigned int bytes = 0;
-
-	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		pr_debug("Sent skb %p\n", skb);
-
-		bytes += skb->len;
-		packets++;
-
-		dev_kfree_skb_any(skb);
-	}
-
-	/* Avoid overhead when no packets have been processed
-	 * happens when called speculatively from start_xmit.
-	 */
-	if (!packets)
-		return;
-
-	u64_stats_update_begin(&stats->tx_syncp);
-	stats->tx_bytes += bytes;
-	stats->tx_packets += packets;
-	u64_stats_update_end(&stats->tx_syncp);
-}
-
 static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 {
 	struct send_queue *sq = container_of(napi, struct send_queue, napi);
-- 
2.12.2.816.g2cccc81164-goog

^ permalink raw reply related

* [PATCH net-next v3 2/5] virtio-net: transmit napi
From: Willem de Bruijn @ 2017-04-24 17:49 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, virtualization, davem, Willem de Bruijn
In-Reply-To: <20170424174930.82623-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.

The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on tranmission. This breaks socket backpressure,
including TSQ.

Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional and disabled for
now. Follow-on patches will reduce the interrupt cost.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 76 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 67 insertions(+), 9 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b9c1df29892c..356d18481ee4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -33,9 +33,10 @@
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
-static bool csum = true, gso = true;
+static bool csum = true, gso = true, napi_tx;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
+module_param(napi_tx, bool, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -86,6 +87,8 @@ struct send_queue {
 
 	/* Name of the send queue: output.$index */
 	char name[40];
+
+	struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -262,12 +265,16 @@ static void virtqueue_napi_complete(struct napi_struct *napi,
 static void skb_xmit_done(struct virtqueue *vq)
 {
 	struct virtnet_info *vi = vq->vdev->priv;
+	struct napi_struct *napi = &vi->sq[vq2txq(vq)].napi;
 
 	/* Suppress further interrupts. */
 	virtqueue_disable_cb(vq);
 
-	/* We were probably waiting for more output buffers. */
-	netif_wake_subqueue(vi->dev, vq2txq(vq));
+	if (napi->weight)
+		virtqueue_napi_schedule(napi, vq);
+	else
+		/* We were probably waiting for more output buffers. */
+		netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -972,6 +979,24 @@ static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
 	local_bh_enable();
 }
 
+static void virtnet_napi_tx_enable(struct virtnet_info *vi,
+				   struct virtqueue *vq,
+				   struct napi_struct *napi)
+{
+	if (!napi->weight)
+		return;
+
+	/* Tx napi touches cachelines on the cpu handling tx interrupts. Only
+	 * enable the feature if this is likely affine with the transmit path.
+	 */
+	if (!vi->affinity_hint_set) {
+		napi->weight = 0;
+		return;
+	}
+
+	return virtnet_napi_enable(vq, napi);
+}
+
 static void refill_work(struct work_struct *work)
 {
 	struct virtnet_info *vi =
@@ -1046,6 +1071,7 @@ static int virtnet_open(struct net_device *dev)
 			if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+		virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
 	}
 
 	return 0;
@@ -1081,6 +1107,24 @@ static void free_old_xmit_skbs(struct send_queue *sq)
 	u64_stats_update_end(&stats->tx_syncp);
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct send_queue *sq = container_of(napi, struct send_queue, napi);
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+
+	__netif_tx_lock(txq, raw_smp_processor_id());
+	free_old_xmit_skbs(sq);
+	__netif_tx_unlock(txq);
+
+	virtqueue_napi_complete(napi, sq->vq, 0);
+
+	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
+		netif_tx_wake_queue(txq);
+
+	return 0;
+}
+
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
@@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	int err;
 	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
 	bool kick = !skb->xmit_more;
+	bool use_napi = sq->napi.weight;
 
 	/* Free up any pending old buffers before queueing new ones. */
 	free_old_xmit_skbs(sq);
@@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	/* Don't wait up for transmitted skbs to be freed. */
-	skb_orphan(skb);
-	nf_reset(skb);
+	if (!use_napi) {
+		skb_orphan(skb);
+		nf_reset(skb);
+	}
 
 	/* If running out of space, stop queue to avoid getting packets that we
 	 * are then unable to transmit.
@@ -1167,7 +1214,8 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 */
 	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
 		netif_stop_subqueue(dev, qnum);
-		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
+		if (!use_napi &&
+		    unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
 			free_old_xmit_skbs(sq);
 			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
@@ -1371,8 +1419,10 @@ static int virtnet_close(struct net_device *dev)
 	/* Make sure refill_work doesn't re-enable napi! */
 	cancel_delayed_work_sync(&vi->refill);
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		napi_disable(&vi->rq[i].napi);
+		napi_disable(&vi->sq[i].napi);
+	}
 
 	return 0;
 }
@@ -1727,8 +1777,10 @@ static void virtnet_freeze_down(struct virtio_device *vdev)
 	cancel_delayed_work_sync(&vi->refill);
 
 	if (netif_running(vi->dev)) {
-		for (i = 0; i < vi->max_queue_pairs; i++)
+		for (i = 0; i < vi->max_queue_pairs; i++) {
 			napi_disable(&vi->rq[i].napi);
+			napi_disable(&vi->sq[i].napi);
+		}
 	}
 }
 
@@ -1751,8 +1803,11 @@ static int virtnet_restore_up(struct virtio_device *vdev)
 			if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 
-		for (i = 0; i < vi->max_queue_pairs; i++)
+		for (i = 0; i < vi->max_queue_pairs; i++) {
 			virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+			virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+					       &vi->sq[i].napi);
+		}
 	}
 
 	netif_device_attach(vi->dev);
@@ -1957,6 +2012,7 @@ static void virtnet_free_queues(struct virtnet_info *vi)
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		napi_hash_del(&vi->rq[i].napi);
 		netif_napi_del(&vi->rq[i].napi);
+		netif_napi_del(&vi->sq[i].napi);
 	}
 
 	/* We called napi_hash_del() before netif_napi_del(),
@@ -2142,6 +2198,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 		vi->rq[i].pages = NULL;
 		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
 			       napi_weight);
+		netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
+			       napi_tx ? napi_weight : 0);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
 		ewma_pkt_len_init(&vi->rq[i].mrg_avg_pkt_len);
-- 
2.12.2.816.g2cccc81164-goog

^ permalink raw reply related

* [PATCH net-next v3 1/5] virtio-net: napi helper functions
From: Willem de Bruijn @ 2017-04-24 17:49 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, virtualization, davem, Willem de Bruijn
In-Reply-To: <20170424174930.82623-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Prepare virtio-net for tx napi by converting existing napi code to
use helper functions. This also deduplicates some logic.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 65 ++++++++++++++++++++++++++----------------------
 1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 666ada6130ab..b9c1df29892c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -239,6 +239,26 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 	return p;
 }
 
+static void virtqueue_napi_schedule(struct napi_struct *napi,
+				    struct virtqueue *vq)
+{
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(vq);
+		__napi_schedule(napi);
+	}
+}
+
+static void virtqueue_napi_complete(struct napi_struct *napi,
+				    struct virtqueue *vq, int processed)
+{
+	int opaque;
+
+	opaque = virtqueue_enable_cb_prepare(vq);
+	if (napi_complete_done(napi, processed) &&
+	    unlikely(virtqueue_poll(vq, opaque)))
+		virtqueue_napi_schedule(napi, vq);
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
 	struct virtnet_info *vi = vq->vdev->priv;
@@ -936,27 +956,20 @@ static void skb_recv_done(struct virtqueue *rvq)
 	struct virtnet_info *vi = rvq->vdev->priv;
 	struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
 
-	/* Schedule NAPI, Suppress further interrupts if successful. */
-	if (napi_schedule_prep(&rq->napi)) {
-		virtqueue_disable_cb(rvq);
-		__napi_schedule(&rq->napi);
-	}
+	virtqueue_napi_schedule(&rq->napi, rvq);
 }
 
-static void virtnet_napi_enable(struct receive_queue *rq)
+static void virtnet_napi_enable(struct virtqueue *vq, struct napi_struct *napi)
 {
-	napi_enable(&rq->napi);
+	napi_enable(napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
-	 * won't get another interrupt, so process any outstanding packets
-	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
-	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&rq->napi)) {
-		virtqueue_disable_cb(rq->vq);
-		local_bh_disable();
-		__napi_schedule(&rq->napi);
-		local_bh_enable();
-	}
+	 * won't get another interrupt, so process any outstanding packets now.
+	 * Call local_bh_enable after to trigger softIRQ processing.
+	 */
+	local_bh_disable();
+	virtqueue_napi_schedule(napi, vq);
+	local_bh_enable();
 }
 
 static void refill_work(struct work_struct *work)
@@ -971,7 +984,7 @@ static void refill_work(struct work_struct *work)
 
 		napi_disable(&rq->napi);
 		still_empty = !try_fill_recv(vi, rq, GFP_KERNEL);
-		virtnet_napi_enable(rq);
+		virtnet_napi_enable(rq->vq, &rq->napi);
 
 		/* In theory, this can happen: if we don't get any buffers in
 		 * we will *never* try to fill again.
@@ -1011,21 +1024,13 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
 {
 	struct receive_queue *rq =
 		container_of(napi, struct receive_queue, napi);
-	unsigned int r, received;
+	unsigned int received;
 
 	received = virtnet_receive(rq, budget);
 
 	/* Out of packets? */
-	if (received < budget) {
-		r = virtqueue_enable_cb_prepare(rq->vq);
-		if (napi_complete_done(napi, received)) {
-			if (unlikely(virtqueue_poll(rq->vq, r)) &&
-			    napi_schedule_prep(napi)) {
-				virtqueue_disable_cb(rq->vq);
-				__napi_schedule(napi);
-			}
-		}
-	}
+	if (received < budget)
+		virtqueue_napi_complete(napi, rq->vq, received);
 
 	return received;
 }
@@ -1040,7 +1045,7 @@ static int virtnet_open(struct net_device *dev)
 			/* Make sure we have some buffers: if oom use wq. */
 			if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
-		virtnet_napi_enable(&vi->rq[i]);
+		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 	}
 
 	return 0;
@@ -1747,7 +1752,7 @@ static int virtnet_restore_up(struct virtio_device *vdev)
 				schedule_delayed_work(&vi->refill, 0);
 
 		for (i = 0; i < vi->max_queue_pairs; i++)
-			virtnet_napi_enable(&vi->rq[i]);
+			virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 	}
 
 	netif_device_attach(vi->dev);
-- 
2.12.2.816.g2cccc81164-goog

^ permalink raw reply related

* [PATCH net-next v3 0/5] virtio-net tx napi
From: Willem de Bruijn @ 2017-04-24 17:49 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, virtualization, davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add napi for virtio-net transmit completion processing.

Changes:
  v2 -> v3:
    - convert __netif_tx_trylock to __netif_tx_lock on tx napi poll
          ensure that the handler always cleans, to avoid deadlock
    - unconditionally clean in start_xmit
          avoid adding an unnecessary "if (use_napi)" branch
    - remove virtqueue_disable_cb in patch 5/5
          a noop in the common event_idx based loop
    - document affinity_hint_set constraint

  v1 -> v2:
    - disable by default
    - disable unless affinity_hint_set
          because cache misses add up to a third higher cycle cost,
	  e.g., in TCP_RR tests. This is not limited to the patch
	  that enables tx completion cleaning in rx napi.
    - use trylock to avoid contention between tx and rx napi
    - keep interrupts masked during xmit_more (new patch 5/5)
          this improves cycles especially for multi UDP_STREAM, which
	  does not benefit from cleaning tx completions on rx napi.
    - move free_old_xmit_skbs (new patch 3/5)
          to avoid forward declaration

    not changed:
    - deduplicate virnet_poll_tx and virtnet_poll_txclean
          they look similar, but have differ too much to make it
	  worthwhile.
    - delay netif_wake_subqueue for more than 2 + MAX_SKB_FRAGS
          evaluated, but made no difference
    - patch 1/5

  RFC -> v1:
    - dropped vhost interrupt moderation patch:
          not needed and likely expensive at light load
    - remove tx napi weight
        - always clean all tx completions
        - use boolean to toggle tx-napi, instead
    - only clean tx in rx if tx-napi is enabled
        - then clean tx before rx
    - fix: add missing braces in virtnet_freeze_down
    - testing: add 4KB TCP_RR + UDP test results

Based on previous patchsets by Jason Wang:

  [RFC V7 PATCH 0/7] enable tx interrupts for virtio-net
  http://lkml.iu.edu/hypermail/linux/kernel/1505.3/00245.html

Before commit b0c39dbdc204 ("virtio_net: don't free buffers in xmit
ring") the virtio-net driver would free transmitted packets on
transmission of new packets in ndo_start_xmit and, to catch the edge
case when no new packet is sent, also in a timer at 10HZ.

A timer can cause long stalls. VIRTIO_F_NOTIFY_ON_EMPTY avoids stalls
due to low free descriptor count. It does not address a stalls due to
low socket SO_SNDBUF. Increasing timer frequency decreases that stall
time, but increases interrupt rate and, thus, cycle count.

Currently, with no timer, packets are freed only at ndo_start_xmit.
Latency of consume_skb is now unbounded. To avoid a deadlock if a sock
reaches SO_SNDBUF, packets are orphaned on tx. This breaks TCP small
queues.

Reenable TCP small queues by removing the orphan. Instead of using a
timer, convert the driver to regular tx napi. This does not have the
unresolved stall issue and does not have any frequency to tune.

By keeping interrupts enabled by default, napi increases tx
interrupt rate. VIRTIO_F_EVENT_IDX avoids sending an interrupt if
one is already unacknowledged, so makes this more feasible today.
Combine that with an optimization that brings interrupt rate
back in line with the existing version for most workloads:

Tx completion cleaning on rx interrupts elides most explicit tx
interrupts by relying on the fact that many rx interrupts fire.

Tested by running {1, 10, 100} {TCP, UDP} STREAM, RR, 4K_RR benchmarks
from a guest to a server on the host, on an x86_64 Haswell. The guest
runs 4 vCPUs pinned to 4 cores. vhost and the test server are
pinned to a core each.

All results are the median of 5 runs, with variance well < 10%.
Used neper (github.com/google/neper) as test process.

Napi increases single stream throughput, but increases cycle cost.
The optimizations bring this down. The previous patchset saw a
regression with UDP_STREAM, which does not benefit from cleaning tx
interrupts in rx napi. This regression is now gone for 10x, 100x.
Remaining difference is higher 1x TCP_STREAM, lower 1x UDP_STREAM.

The latest results are with process, rx napi and tx napi affine to
the same core. All numbers are lower than the previous patchset.

             upstream     napi
TCP_STREAM:
1x:
  Mbps          27816    39805
  Gcycles         274      285

10x:
  Mbps          42947    42531
  Gcycles         300      296

100x:
  Mbps          31830    28042
  Gcycles         279      269

TCP_RR Latency (us):
1x:
  p50              21       21
  p99              27       27
  Gcycles         180      167

10x:
  p50              40       39
  p99              52       52
  Gcycles         214      211

100x:
  p50             281      241
  p99             411      337
  Gcycles         218      226

TCP_RR 4K:
1x:
  p50              28       29
  p99              34       36
  Gcycles         177      167

10x:
  p50              70       71
  p99              85      134
  Gcycles         213      214

100x:
  p50             442      611
  p99             802      785
  Gcycles         237      216

UDP_STREAM:
1x:
  Mbps          29468    26800
  Gcycles         284      293

10x:
  Mbps          29891    29978
  Gcycles         285      312

100x:
  Mbps          30269    30304
  Gcycles         318      316

UDP_RR:
1x:
  p50              19       19
  p99              23       23
  Gcycles         180      173

10x:
  p50              35       40
  p99              54       64
  Gcycles         245      237

100x:
  p50             234      286
  p99             484      473
  Gcycles         224      214

Note that GSO is enabled, so 4K RR still translates to one packet
per request.

Lower throughput at 100x vs 10x can be (at least in part)
explained by looking at bytes per packet sent (nstat). It likely
also explains the lower throughput of 1x for some variants.

upstream:

 N=1   bytes/pkt=16581
 N=10  bytes/pkt=61513
 N=100 bytes/pkt=51558

at_rx:

 N=1   bytes/pkt=65204
 N=10  bytes/pkt=65148
 N=100 bytes/pkt=56840

Willem de Bruijn (5):
  virtio-net: napi helper functions
  virtio-net: transmit napi
  virtio-net: move free_old_xmit_skbs
  virtio-net: clean tx descriptors from rx napi
  virtio-net: keep tx interrupts disabled unless kick

 drivers/net/virtio_net.c | 193 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 132 insertions(+), 61 deletions(-)

-- 
2.12.2.816.g2cccc81164-goog

^ permalink raw reply

* Re: [PATCH net] udp: disable inner UDP checksum offloads in IPsec case
From: David Miller @ 2017-04-24 17:49 UTC (permalink / raw)
  To: aatteka; +Cc: netdev
In-Reply-To: <1492813385-31736-1-git-send-email-aatteka@ovn.org>

From: Ansis Atteka <aatteka@ovn.org>
Date: Fri, 21 Apr 2017 15:23:05 -0700

> Otherwise, UDP checksum offloads could corrupt ESP packets by attempting
> to calculate UDP checksum when this inner UDP packet is already protected
> by IPsec.
> 
> One way to reproduce this bug is to have a VM with virtio_net driver (UFO
> set to ON in the guest VM); and then encapsulate all guest's Ethernet
> frames in Geneve; and then further encrypt Geneve with IPsec.  In this
> case following symptoms are observed:
> 1. If using ixgbe NIC, then it will complain with following error message:
>    ixgbe 0000:01:00.1: partial checksum but l4 proto=32!
> 2. Receiving IPsec stack will drop all the corrupted ESP packets and
>    increase XfrmInStateProtoError counter in /proc/net/xfrm_stat.
> 3. iperf UDP test from the VM with packet sizes above MTU will not work at
>    all.
> 4. iperf TCP test from the VM will get ridiculously low performance because.
> 
> Signed-off-by: Ansis Atteka <aatteka@ovn.org>
> Co-authored-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] macsec: avoid heap overflow in skb_to_sgvec
From: David Miller @ 2017-04-24 17:47 UTC (permalink / raw)
  To: Jason; +Cc: netdev, linux-kernel, stable, security
In-Reply-To: <20170421211448.16995-1-Jason@zx2c4.com>

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Fri, 21 Apr 2017 23:14:48 +0200

> While this may appear as a humdrum one line change, it's actually quite
> important. An sk_buff stores data in three places:
> 
> 1. A linear chunk of allocated memory in skb->data. This is the easiest
>    one to work with, but it precludes using scatterdata since the memory
>    must be linear.
> 2. The array skb_shinfo(skb)->frags, which is of maximum length
>    MAX_SKB_FRAGS. This is nice for scattergather, since these fragments
>    can point to different pages.
> 3. skb_shinfo(skb)->frag_list, which is a pointer to another sk_buff,
>    which in turn can have data in either (1) or (2).
> 
> The first two are rather easy to deal with, since they're of a fixed
> maximum length, while the third one is not, since there can be
> potentially limitless chains of fragments. Fortunately dealing with
> frag_list is opt-in for drivers, so drivers don't actually have to deal
> with this mess. For whatever reason, macsec decided it wanted pain, and
> so it explicitly specified NETIF_F_FRAGLIST.
> 
> Because dealing with (1), (2), and (3) is insane, most users of sk_buff
> doing any sort of crypto or paging operation calls a convenient function
> called skb_to_sgvec (which happens to be recursive if (3) is in use!).
> This takes a sk_buff as input, and writes into its output pointer an
> array of scattergather list items. Sometimes people like to declare a
> fixed size scattergather list on the stack; othertimes people like to
> allocate a fixed size scattergather list on the heap. However, if you're
> doing it in a fixed-size fashion, you really shouldn't be using
> NETIF_F_FRAGLIST too (unless you're also ensuring the sk_buff and its
> frag_list children arent't shared and then you check the number of
> fragments in total required.)
> 
> Macsec specifically does this:
> 
>         size += sizeof(struct scatterlist) * (MAX_SKB_FRAGS + 1);
>         tmp = kmalloc(size, GFP_ATOMIC);
>         *sg = (struct scatterlist *)(tmp + sg_offset);
> 	...
>         sg_init_table(sg, MAX_SKB_FRAGS + 1);
>         skb_to_sgvec(skb, sg, 0, skb->len);
> 
> Specifying MAX_SKB_FRAGS + 1 is the right answer usually, but not if you're
> using NETIF_F_FRAGLIST, in which case the call to skb_to_sgvec will
> overflow the heap, and disaster ensues.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

This is definitely the simplest fix for now, applied, thanks.

^ permalink raw reply

* Re: cpsw regression in mainline with "cpsw/netcp: cpts depends on posix_timers"
From: Tony Lindgren @ 2017-04-24 17:44 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Grygorii Strashko, Networking, linux-omap
In-Reply-To: <CAK8P3a3kvMubq5Mtu3XKnityAgRv5R35dMtTX=BO8Qv_iX9=VQ@mail.gmail.com>

* Arnd Bergmann <arnd@arndb.de> [170424 10:38]:
> On Mon, Apr 24, 2017 at 6:51 PM, Tony Lindgren <tony@atomide.com> wrote:
> > Hi,
> >
> > Looks like commit 07fef3623407 ("cpsw/netcp: cpts depends on posix_timers")
> > in mainline started triggering the following oops at least on j5eco-evm.
> >
> > Adding CONFIG_PTP_1588_CLOCK to .config solves it, but the oops hints
> > something is wrong with the dependencies.. CONFIG_TI_CPTS defaults to N
> > and not selecting it causes the oops.
> >
> > Any ideas what's needed to properly fix this?
> 
> Does your configuration have POSIX_TIMERS enabled? If not, then CPTS
> is now also disabled. There are two issues here that we need to solve:
> 
> a) find out why POSIX_TIMERS got disabled, and make sure it's always
>     turned on for normal users

$ make omap2plus_defconfig
...
CONFIG_POSIX_TIMERS=y
# CONFIG_PTP_1588_CLOCK is not set

So CONFIG_TI_CPTS is unselectable.

> b) find out what's wrong with the driver when CPTS is disabled. This would
>     be an existing problem that you just never saw because CPTS was
>     always enabled so far.

See a) above, yes seems there are two problems here.

Regards,

Tony

^ permalink raw reply

* qed*: debug infrastructures
From: Elior, Ariel @ 2017-04-24 17:38 UTC (permalink / raw)
  To: David Miller
  Cc: netdev@vger.kernel.org, Mintz, Yuval, Tayar, Tomer, Dupuis, Chad

Hi Dave,

According to the recent messages on the list indicating debugfs is not the way
to go, I am looking for some guidance on what is. dpipe approach was
mentioned as favorable, but I wanted to make sure molding our debug features to
this infrastructure will result in something acceptable. A few points:

1.
One of our HW debug features is a signal recording feature. HW is configured to
output specific signals, which are then continuously dumped into a cyclic
buffer on host. There are ~8000 signals, which can be logically divided to ~100
groups. I believe this can be modeled in dpipe (or similar tool) as a set of
~100 tables with ~80 entries each, and user would be able to see them all and
choose what they like. The output data itself is binary, and meaningless to
"the user". The amount of data is basically as large a contiguous buffer as
driver can allocate, i.e. usually 4MB. When user selects the signals, and sets
meta data regarding to mode of operations, some device configuration will have
to take place. Does that sound reasonable?
This debug feature already exists out of tree for bnx2x and qed* drivers and is
*very* effective in field deployments. I would very much like to see this as an
in-tree feature via some infrastructure or another.

2.
Sometimes we want to debug the probe or removal flow of the driver. ethtool has
the disadvantage of only being available once network device is available. If a
bug stops the load flow before the ethtool debug paths are available, there is
no way to collect a dump. Similarly, removal flows which hit a bug but do remove
the network device, can't be debugged from ethtool. Does dpipe suffer from the
same problem? qed* (like mlx*) has a common-functionality module. This allows
creating debugfs nodes even before the network drivers are probed, providing a
solution for this (debug nodes are also available after network driver removal).
If dpipe does hold an answer here (e.g. provide preconfiguration which would
kick in when network device registers) then we might want to port all of our
register dump logic over there for this advantage. Does that sound reasonable?

3.
Yuval mentioned this, but I wanted to reiterate that the same is necessary for
our storage drivers (qedi/qedf). debugfs does have the advantage of being non
sub-system specific. Is there perhaps another non subsystem specific debug
infrastructure which *is* acceptable to the networking subsystem? My guess is
that the storage drivers will turn to debugfs (in their own subsystem).

Thanks,
Ariel

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox