Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: RTL8723BE performance regression
From: Steve deRosier @ 2018-04-04 14:54 UTC (permalink / raw)
  To: João Paulo Rechi Vita
  Cc: Larry Finger, Yan-Hsuan Chuang, Ping-Ke Shih, Birming Chiu,
	Shaofu, Steven Ting, Chaoming Li, Kalle Valo, linux-wireless,
	Network Development, LKML, Daniel Drake,
	João Paulo Rechi Vita, Linux Upstreaming Team
In-Reply-To: <CA+A7VXX+z8FrDhzW2BQ6AGMJ=_j_53wkUL_YNgkreEbWSh+jCw@mail.gmail.com>

On Tue, Apr 3, 2018 at 6:51 PM, João Paulo Rechi Vita <jprvita@gmail.com> wrote:
>
> This are the results (testing with speedtest.net) I got at some key points:
>
> Version        Commit        Ping        Down        Up
>
> v4.11            a351e9b        12        25.44        5.99
> v4.11            a351e9b        131      17.02        5.89
>
> v4.13            569dbb8        174      14.08        0.00
> v4.13            569dbb8        261      8.41          0.00
>
> v4.15+revert d8a5b80        19        23.86        1.41
> v4.15+revert d8a5b80        189      18.69        1.39
>

I recommend doing throughput testing in a closed system using iperf.
speedtest.net is potentially useful for testing your ISP's bandwidth
at some particular point in time, but little else as it exposes you to
too many variables. I wouldn't take those numbers to mean much and the
inconclusive results you're getting could be explained by external
network loading and having little to do with your bisect effort. I can
get that spread in numbers from speedtest.net without making any
changes other than the time of day I do the test.

Here's how to do it. Install iperf2 (you could use iperf3, personal
choice) on two machines, one being your device under test (DUT). Setup
a network configuration that looks similar to this:

server <==hardwire==> AP <--wireless link--> DUT

Be sure your hardwire is more bandwidth than your wireless link is
capable of, or set it up where the server is the AP. What you're
looking for here is environmental consistency, not maximum throughput
numbers.

On the computer hardwired to the network, start the server, we'll
assume it has an ip of 192.168.33.18:

    iperf -s

On your DUT:

    iperf -c 192.168.33.18

That's the most basic setup, check the man page for more options.

You will get best results if you can exclude other computers from your
test network and other wireless devices from your airspace.

- Steve

--
Steve deRosier
Cal-Sierra Consulting LLC
https://www.cal-sierra.com/

^ permalink raw reply

* RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
From: Jon Rosen (jrosen) @ 2018-04-04 14:34 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, Willem de Bruijn, Eric Dumazet, Kees Cook,
	David Windsor, Rosen, Rami, Reshetova, Elena, Mike Maloney,
	Benjamin Poirier, open list:NETWORKING [GENERAL], open list
In-Reply-To: <20180403161630.24f8339c@xeon-e3>



> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index e0f3f4a..264d7b2 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> >  		if (po->stats.stats1.tp_drops)
> >  			status |= TP_STATUS_LOSING;
> >  	}
> > +
> > +        /*
> > +         * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > +         * kernel threads from re-using this same entry.
> > +         */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> > +	if (po->tp_version <= TPACKET_V2)
> > +            __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> >  	po->stats.stats1.tp_packets++;
> >  	if (copy_skb) {
> >  		status |= TP_STATUS_COPY;
> 
> This patch looks correct. Please resend it with proper signed-off-by
> and with a kernel code indenting style (tabs).  Is this bug present
> since the beginning of af_packet and multiqueue devices or did it get
> introduced in some previous kernel?

Sorry about the tabs, I'll fix that and try to figure out what I did wrong with
the signed-off-by.

I've looked back as far as I could find online (2.6.11) and it would appear that
this bug has always been there.

Thanks, jon.

^ permalink raw reply

* Re: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
From: Willem de Bruijn @ 2018-04-04 13:49 UTC (permalink / raw)
  To: Jon Rosen
  Cc: David S. Miller, Willem de Bruijn, Eric Dumazet, Kees Cook,
	David Windsor, Rosen, Rami, Reshetova, Elena, Mike Maloney,
	Benjamin Poirier, open list:NETWORKING [GENERAL], open list
In-Reply-To: <20180403215526.15934-1-jrosen@cisco.com>

On Tue, Apr 3, 2018 at 11:55 PM, Jon Rosen <jrosen@cisco.com> wrote:
> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> casues the ring to get corrupted by allowing multiple kernel threads
> to claim ownership of the same ring entry, Mark the ring entry as
> already being used within the spin_lock to prevent other kernel
> threads from reusing the same entry before it's fully filled in,
> passed to user space, and then eventually passed back to the kernel
> for use with a new packet.
>
> Note that the proposed change may modify the semantics of the
> interface between kernel space and user space in a way which may cause
> some applications to no longer work properly.

As long as TP_STATUS_USER (1) is not set, userspace should ignore
the slot..

>    One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
>    is that the documentation of the tp_status field is somewhat
>    inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
>    meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
>    meaning the entry is owned by user space.  In other places ownership
>    by user space is defined by the TP_STATUS_USER(1) bit being set.

But indeed this example in packet_mmap.txt is problematic

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

It does not really matter whether the docs are possibly inconsistent and
which one is authoritative. Examples like the above make it likely that
some user code expects such code to work.

> +++ b/net/packet/af_packet.c
> @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>                 if (po->stats.stats1.tp_drops)
>                         status |= TP_STATUS_LOSING;
>         }
> +
> +        /*
> +         * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> +         * kernel threads from re-using this same entry.
> +         */
> +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING

No need to reinterpret existing flags. tp_status is a u32 with
sufficient undefined bits.

> +       if (po->tp_version <= TPACKET_V2)
> +            __packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> +
>         po->stats.stats1.tp_packets++;
>         if (copy_skb) {
>                 status |= TP_STATUS_COPY;
> --
> 2.10.3.dirty
>

^ permalink raw reply

* Re: [PATCH] [net] nvmem: disallow modular CONFIG_NVMEM
From: Arnd Bergmann @ 2018-04-04 13:38 UTC (permalink / raw)
  To: Mike Looijmans
  Cc: Srinivas Kandagatla, Florian Fainelli, Andrew Lunn,
	David S . Miller, Networking, Greg Kroah-Hartman, Oleksij Rempel,
	Martin Blumenstingl, Linux Kernel Mailing List
In-Reply-To: <23fe2eeb-127e-7917-0d56-714414cc8c06@topic.nl>

On Wed, Apr 4, 2018 at 3:32 PM, Mike Looijmans <mike.looijmans@topic.nl> wrote:

>
> Which may still be very confusing, since it means that CONFIG_MACB=y with
> CONFIG_NVMEM=m will fail, but setting both to "y" or both to "m" will work.
> So that would introduce more build failures again, right?

Right, it would require having a

      depends on NVMEM || !NVMEM

line in Kconfig for each such driver, or a

      depends on NVMEM != m

for drivers that cannot be modules themselves.

      Arnd

^ permalink raw reply

* Re: [net-next V9 PATCH 14/16] mlx5: use page_pool for xdp_return_frame call
From: Jesper Dangaard Brouer @ 2018-04-04 13:36 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	Jason Wang, John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, brouer
In-Reply-To: <72eccc60-21d2-0789-81c6-1bad2adb41ed@mellanox.com>

On Wed, 4 Apr 2018 16:12:14 +0300
Tariq Toukan <tariqt@mellanox.com> wrote:

> > @@ -432,9 +434,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
> >   
> >   	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> >   	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
> > +	pool_size = 1 << params->log_rq_mtu_frames;
> >   
> >   	switch (rq->wq_type) {
> >   	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> > +
> > +		pool_size = pool_size * MLX5_MPWRQ_PAGES_PER_WQE;  
> 
> For rq->wq_type != MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ, please use:
> pool_size = 1 << params->log_rq_mtu_frames;
> 
> For rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ, please use:
> pool_size = MLX5_MPWRQ_PAGES_PER_WQ * mlx5e_mpwqe_get_log_rq_size(params);

Okay, fixed.  Ready for V10, when net-next opens again...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] [net] nvmem: disallow modular CONFIG_NVMEM
From: Mike Looijmans @ 2018-04-04 13:32 UTC (permalink / raw)
  To: Arnd Bergmann, Srinivas Kandagatla
  Cc: Florian Fainelli, Andrew Lunn, David S . Miller, netdev,
	Greg Kroah-Hartman, Oleksij Rempel, Martin Blumenstingl,
	linux-kernel
In-Reply-To: <20180404103900.4090118-1-arnd@arndb.de>

For making things build again:

Acked-by: Mike Looijmans

I'd be interested in use-cases for having nvmem as a module, I cannot think of 
any.


I am working on splitting of_get_nvmem_mac_address(), by moving most of its
burden to the nvmem interface. Intend to create this "just give me my bytes" 
function in nvmem:

int of_nvmem_cell_read(struct device_node *np, const char *name,
                        void *buf, size_t bytes);

With that in place, of_get_nvmem_mac_address becomes a one-liner:

return of_nvmem_cell_read(np, "mac-address", addr, ETH_ALEN);

That would allow of_get_nvmem_mac_address to be inlined, and this would move 
the link issues into modules that actually use that function.

Which may still be very confusing, since it means that CONFIG_MACB=y with 
CONFIG_NVMEM=m will fail, but setting both to "y" or both to "m" will work. So 
that would introduce more build failures again, right?

--
Mike.


On 04-04-18 12:38, Arnd Bergmann wrote:
> The new of_get_nvmem_mac_address() helper function causes a link error
> with CONFIG_NVMEM=m:
> 
> drivers/of/of_net.o: In function `of_get_nvmem_mac_address':
> of_net.c:(.text+0x168): undefined reference to `of_nvmem_cell_get'
> of_net.c:(.text+0x19c): undefined reference to `nvmem_cell_read'
> of_net.c:(.text+0x1a8): undefined reference to `nvmem_cell_put'
> 
> I could not come up with a good solution for this, as the code is always
> built-in. Using an #if IS_REACHABLE() check around it would solve the
> link time issue but then stop it from working in that configuration.
> Making of_nvmem_cell_get() an inline function could also solve that, but
> seems a bit ugly since it's somewhat larger than most inline functions,
> and it would just bring that problem into the callers.  Splitting the
> function into a separate file might be an alternative.
> 
> This uses the big hammer by making CONFIG_NVMEM itself a 'bool' symbol,
> which avoids the problem entirely but makes the vmlinux larger for anyone
> that might use NVMEM support but doesn't need it built-in otherwise.
> 
> Fixes: 9217e566bdee ("of_net: Implement of_get_nvmem_mac_address helper")
> Cc: Mike Looijmans <mike.looijmans@topic.nl>
> Cc: Florian Fainelli <f.fainelli@gmail.com>
> Cc: Andrew Lunn <andrew@lunn.ch>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> The problem arrived through the networking tree, but it's now in
> mainline, so the fix could go through either tree
> ---
>   drivers/nvmem/Kconfig | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvmem/Kconfig b/drivers/nvmem/Kconfig
> index 5f9bc787d634..1090924efdb1 100644
> --- a/drivers/nvmem/Kconfig
> +++ b/drivers/nvmem/Kconfig
> @@ -1,5 +1,5 @@
>   menuconfig NVMEM
> -	tristate "NVMEM Support"
> +	bool "NVMEM Support"
>   	help
>   	  Support for NVMEM(Non Volatile Memory) devices like EEPROM, EFUSES...
>   
> 



Kind regards,

Mike Looijmans
System Expert

TOPIC Products
Materiaalweg 4, NL-5681 RJ Best
Postbus 440, NL-5680 AK Best
Telefoon: +31 (0) 499 33 69 79
E-mail: mike.looijmans@topicproducts.com
Website: www.topicproducts.com

Please consider the environment before printing this e-mail

^ permalink raw reply

* [PATCH net-next] vxlan: add ttl inherit support
From: Hangbin Liu @ 2018-04-04 13:26 UTC (permalink / raw)
  To: netdev; +Cc: Jiri Benc, Xin Long, David S. Miller, Hangbin Liu

Like tos inherit, ttl inherit should also means inherit the inner protocol's
ttl values, which actually not implemented in vxlan yet.

But we could not treat ttl == 0 as "use the inner TTL", because that would be
used also when the "ttl" option is not specified and that would be a behavior
change, and breaking real use cases.

So add a different attribute IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is
specified with ip cmd.

Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 drivers/net/vxlan.c          | 17 ++++++++++++++---
 include/net/ip_tunnels.h     | 11 +++++++++++
 include/net/vxlan.h          |  1 +
 include/uapi/linux/if_link.h |  1 +
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index aa5f034..209a840 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2085,9 +2085,13 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		local_ip = vxlan->cfg.saddr;
 		dst_cache = &rdst->dst_cache;
 		md->gbp = skb->mark;
-		ttl = vxlan->cfg.ttl;
-		if (!ttl && vxlan_addr_multicast(dst))
-			ttl = 1;
+		if (flags & VXLAN_F_TTL_INHERIT) {
+			ttl = ip_tunnel_get_ttl(old_iph, skb);
+		} else {
+			ttl = vxlan->cfg.ttl;
+			if (!ttl && vxlan_addr_multicast(dst))
+				ttl = 1;
+		}
 
 		tos = vxlan->cfg.tos;
 		if (tos == 1)
@@ -2709,6 +2713,7 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
 	[IFLA_VXLAN_GBP]	= { .type = NLA_FLAG, },
 	[IFLA_VXLAN_GPE]	= { .type = NLA_FLAG, },
 	[IFLA_VXLAN_REMCSUM_NOPARTIAL]	= { .type = NLA_FLAG },
+	[IFLA_VXLAN_TTL_INHERIT]	= { .type = NLA_FLAG },
 };
 
 static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[],
@@ -3254,6 +3259,12 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
 	if (data[IFLA_VXLAN_TTL])
 		conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]);
 
+	if (data[IFLA_VXLAN_TTL_INHERIT]) {
+		if (changelink)
+			return -EOPNOTSUPP;
+		conf->flags |= VXLAN_F_TTL_INHERIT;
+	}
+
 	if (data[IFLA_VXLAN_LABEL])
 		conf->label = nla_get_be32(data[IFLA_VXLAN_LABEL]) &
 			     IPV6_FLOWLABEL_MASK;
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index cbe5add..6c3c421 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -377,6 +377,17 @@ static inline u8 ip_tunnel_get_dsfield(const struct iphdr *iph,
 		return 0;
 }
 
+static inline u8 ip_tunnel_get_ttl(const struct iphdr *iph,
+				       const struct sk_buff *skb)
+{
+	if (skb->protocol == htons(ETH_P_IP))
+		return iph->ttl;
+	else if (skb->protocol == htons(ETH_P_IPV6))
+		return ((const struct ipv6hdr *)iph)->hop_limit;
+	else
+		return 0;
+}
+
 /* Propogate ECN bits out */
 static inline u8 ip_tunnel_ecn_encap(u8 tos, const struct iphdr *iph,
 				     const struct sk_buff *skb)
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index ad73d8b..b99a02ae 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -262,6 +262,7 @@ struct vxlan_dev {
 #define VXLAN_F_COLLECT_METADATA	0x2000
 #define VXLAN_F_GPE			0x4000
 #define VXLAN_F_IPV6_LINKLOCAL		0x8000
+#define VXLAN_F_TTL_INHERIT		0x10000
 
 /* Flags that are used in the receive path. These flags must match in
  * order for a socket to be shareable
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 11d0c0e..e771a63 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -516,6 +516,7 @@ enum {
 	IFLA_VXLAN_COLLECT_METADATA,
 	IFLA_VXLAN_LABEL,
 	IFLA_VXLAN_GPE,
+	IFLA_VXLAN_TTL_INHERIT,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
-- 
2.5.5

^ permalink raw reply related

* Re: [PATCH net-next] net: avoid unneeded atomic operation in ip*_append_data()
From: Eric Dumazet @ 2018-04-04 13:26 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: David S. Miller, Eric Dumazet
In-Reply-To: <7e55c93c2c7cddf4c077aa77aa1ab58396f502ff.1522844999.git.pabeni@redhat.com>

On 04/04/2018 05:30 AM, Paolo Abeni wrote:
> After commit 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates
> done by __ip_append_data()") and commit 1f4c6eb24029 ("ipv6:
> factorize sk_wmem_alloc updates done by __ip6_append_data()"),
> when transmitting sub MTU datagram, an addtional, unneeded atomic
> operation is performed in ip*_append_data() to update wmem_alloc:
> in the above condition the delta is 0.
> 
> The above cause small but measurable performance regression in UDP
> xmit tput test with packet size below MTU.
> 
> This change avoids such overhead updating wmem_alloc only if
> wmem_alloc_delta is non zero.
> 
> The error path is left intentionally unmodified: it's a slow path
> and simplicity is preferred to performances.
> 
> Fixes: 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()")
> Fixes: 1f4c6eb24029 ("ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---

SGTM, thanks Paolo

Reviewed-by: Eric Dumazet <edumazet@google.com>

My intent was to modify sock_alloc_send_pskb() to accept to opt-out
the skb_set_owner_w() call, but I forgot that the merge window was starting last week-end.

So if a UDP datagram needs 2 skb, only one sk_wmem_alloc change would happen, not 1 + 1.

^ permalink raw reply

* Re: [RFC PATCH net] tcp: allow to use TCP Fastopen with MSG_ZEROCOPY
From: Willem de Bruijn @ 2018-04-04 13:22 UTC (permalink / raw)
  To: Alexey Kodanev
  Cc: Network Development, Willem de Bruijn, Eric Dumazet, David Miller,
	Soheil Hassas Yeganeh, Neal Cardwell
In-Reply-To: <1522759399-25693-1-git-send-email-alexey.kodanev@oracle.com>

On Tue, Apr 3, 2018 at 2:43 PM, Alexey Kodanev
<alexey.kodanev@oracle.com> wrote:
> With TCP Fastopen we can have the following cases, which could also
> use MSG_ZEROCOPY flag with send() and sendto():
>
> * sendto() + MSG_FASTOPEN flag, sk state can be in TCP_CLOSE at
>   the start of tcp_sendmsg()
>
> * set socket option TCP_FASTOPEN_CONNECT, then connect()
>   and send(), sk state in TCP_SYN_SENT
>
> Currently, both cases with tcp_sendmsg() and MSG_ZEROCOPY flag results
> to EINVAL error, because of the check for TCP_ESTABLISHED sk state in
> the beginning of tcp_sendmsg().
>
> Both conditions require two more checks there: !tp->fastopen_connect
> and !(flags & MSG_FASTOPEN).  It looks like we could remove the original
> check altogether for this unlikely event instead. That way tcp_sendmsg()
> without TFO should fail with EPIPE on sk_stream_wait_connect(), as
> before the introduction of MSG_ZEROCOPY there. And work smoothly for
> the TFO cases.
>
> Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")

This patch adds MSG_ZEROCOPY support for TFO sockets. It is not
a fix that needs to go to stable.

> Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
> ---
>
> Is there something that I've overlooked and we can't use it here, and
> we should handle this type of error, while using sendto() + TFO,
> in userspace?
>
>  net/ipv4/tcp.c | 5 -----
>  1 file changed, 5 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9225610..768f02c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1193,11 +1193,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>         flags = msg->msg_flags;
>
>         if (flags & MSG_ZEROCOPY && size) {
> -               if (sk->sk_state != TCP_ESTABLISHED) {
> -                       err = -EINVAL;
> -                       goto out_err;
> -               }
> -
>                 skb = tcp_write_queue_tail(sk);
>                 uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
>                 if (!uarg) {
> --
> 1.8.3.1
>

^ permalink raw reply

* Re: [net-next V9 PATCH 14/16] mlx5: use page_pool for xdp_return_frame call
From: Tariq Toukan @ 2018-04-04 13:12 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, BjörnTöpel,
	magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Daniel Borkmann, Alexei Starovoitov,
	Tariq Toukan
In-Reply-To: <152275372705.1026.11963048035925660701.stgit@firesoul>



On 03/04/2018 2:08 PM, Jesper Dangaard Brouer wrote:
> This patch shows how it is possible to have both the driver local page
> cache, which uses elevated refcnt for "catching"/avoiding SKB
> put_page returns the page through the page allocator.  And at the
> same time, have pages getting returned to the page_pool from
> ndp_xdp_xmit DMA completion.
> 
> The performance improvement for XDP_REDIRECT in this patch is really
> good.  Especially considering that (currently) the xdp_return_frame
> API and page_pool_put_page() does per frame operations of both
> rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
> (It is the plan to remove these per frame operation in a followup
> patchset).
> 
> The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
> with xdp_redirect_map (using devmap) . And the target/maximum
> capability of ixgbe is 13Mpps (on this HW setup).
> 
> Before this patch for mlx5, XDP redirected frames were returned via
> the page allocator.  The single flow performance was 6Mpps, and if I
> started two flows the collective performance drop to 4Mpps, because we
> hit the page allocator lock (further negative scaling occurs).
> 
> Two test scenarios need to be covered, for xdp_return_frame API, which
> is DMA-TX completion running on same-CPU or cross-CPU free/return.
> Results were same-CPU=10Mpps, and cross-CPU=12Mpps.  This is very
> close to our 13Mpps max target.
> 
> The reason max target isn't reached in cross-CPU test, is likely due
> to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
> ixgbe testing).  It is also planned to remove this unnecessary DMA
> unmap in a later patchset
> 
> V2: Adjustments requested by Tariq
>   - Changed page_pool_create return codes not return NULL, only
>     ERR_PTR, as this simplifies err handling in drivers.
>   - Save a branch in mlx5e_page_release
>   - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
> 
> V5: Updated patch desc
> 
> V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
> V9:
>   - Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
>   - Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
>   - Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en.h      |    3 ++
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   41 +++++++++++++++++----
>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   16 ++++++--
>   3 files changed, 48 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index 1a05d1072c5e..3317a4da87cb 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -53,6 +53,8 @@
>   #include "mlx5_core.h"
>   #include "en_stats.h"
>   
> +struct page_pool;
> +
>   #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
>   
>   #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
> @@ -534,6 +536,7 @@ struct mlx5e_rq {
>   	unsigned int           hw_mtu;
>   	struct mlx5e_xdpsq     xdpsq;
>   	DECLARE_BITMAP(flags, 8);
> +	struct page_pool      *page_pool;
>   
>   	/* control */
>   	struct mlx5_wq_ctrl    wq_ctrl;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 13c1e61258a7..d0f2cd86ef32 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -35,6 +35,7 @@
>   #include <linux/mlx5/fs.h>
>   #include <net/vxlan.h>
>   #include <linux/bpf.h>
> +#include <net/page_pool.h>
>   #include "eswitch.h"
>   #include "en.h"
>   #include "en_tc.h"
> @@ -389,10 +390,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>   			  struct mlx5e_rq_param *rqp,
>   			  struct mlx5e_rq *rq)
>   {
> +	struct page_pool_params pp_params = { 0 };
>   	struct mlx5_core_dev *mdev = c->mdev;
>   	void *rqc = rqp->rqc;
>   	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
> -	u32 byte_count;
> +	u32 byte_count, pool_size;
>   	int npages;
>   	int wq_sz;
>   	int err;
> @@ -432,9 +434,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>   
>   	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
>   	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
> +	pool_size = 1 << params->log_rq_mtu_frames;
>   
>   	switch (rq->wq_type) {
>   	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> +
> +		pool_size = pool_size * MLX5_MPWRQ_PAGES_PER_WQE;

For rq->wq_type != MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ, please use:
pool_size = 1 << params->log_rq_mtu_frames;

For rq->wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ, please use:
pool_size = MLX5_MPWRQ_PAGES_PER_WQ * mlx5e_mpwqe_get_log_rq_size(params);

Thanks,
Tariq

^ permalink raw reply

* [net-next  1/1] tipc: Fix namespace violation in tipc_sk_fill_sock_diag
From: GhantaKrishnamurthy MohanKrishna @ 2018-04-04 12:49 UTC (permalink / raw)
  To: tipc-discussion, jon.maloy, maloy, ying.xue,
	mohan.krishna.ghanta.krishnamurthy, davem, netdev

To fetch UID info for socket diagnostics, we determine the
namespace of user context using tipc socket instance. This
may cause namespace violation, as the kernel will remap based
on UID.

We fix this by fetching namespace info using the calling userspace
netlink socket.

Fixes: c30b70deb5f4 (tipc: implement socket diagnostics for AF_TIPC)
Reported-by: syzbot+326e587eff1074657718@syzkaller.appspotmail.com
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com>
---
 net/tipc/socket.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 3e5eba30865e..cee6674a3bf4 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -3280,7 +3280,8 @@ int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct tipc_sock *tsk,
 	    nla_put_u32(skb, TIPC_NLA_SOCK_TIPC_STATE, (u32)sk->sk_state) ||
 	    nla_put_u32(skb, TIPC_NLA_SOCK_INO, sock_i_ino(sk)) ||
 	    nla_put_u32(skb, TIPC_NLA_SOCK_UID,
-			from_kuid_munged(sk_user_ns(sk), sock_i_uid(sk))) ||
+			from_kuid_munged(sk_user_ns(NETLINK_CB(skb).sk),
+					 sock_i_uid(sk))) ||
 	    nla_put_u64_64bit(skb, TIPC_NLA_SOCK_COOKIE,
 			      tipc_diag_gen_cookie(sk),
 			      TIPC_NLA_SOCK_PAD))
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
From: Andrew Lunn @ 2018-04-04 12:42 UTC (permalink / raw)
  To: Stuart Yoder
  Cc: Arnd Bergmann, Ioana Ciornei, gregkh, Laurentiu Tudor,
	Linux Kernel Mailing List, Ruxandra Ioana Ciocoi Radulescu,
	Razvan Stefanescu, Roy Pledge, Networking
In-Reply-To: <CAEac7tYSAmgt-kAcLNToKdrBOZhvq4Q+HxYEG+NkGZAk6kDADw@mail.gmail.com>

> I hear you.  It is more complicated this way...having all these individual
> objects vs just a single "bundle" of them that represents a NIC.  But, that's
> the way the DPAA2 hardware is, and we're implementing kernel support for
> the hardware as it is.

Hi Stuart

I see we are not making any progress here.

So what i suggest is you post the kernel code and configuration tool
concept to netdev for a full review. You want reviews from David
Miller, Jiri Pirko, Jakub Kicinski, David Ahern, etc.

	Andrew

^ permalink raw reply

* Re: [PATCH] net: improve ipv4 performances
From: Paolo Abeni @ 2018-04-04 12:34 UTC (permalink / raw)
  To: Anton Gary Ceph, netdev, linux-kernel
In-Reply-To: <20180401183121.13022-1-agaceph@gmail.com>

On Sun, 2018-04-01 at 20:31 +0200, Anton Gary Ceph wrote:
> After a few profiling and analysis, turned out that the ethertype field
> of the packets has the following distribution:
[...]
>      0.6% don't know/no opinion

Am I the only one finding the submission date and the above info
suspicious ?!?

/P

^ permalink raw reply

* [PATCH net-next] net: avoid unneeded atomic operation in ip*_append_data()
From: Paolo Abeni @ 2018-04-04 12:30 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet

After commit 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates
done by __ip_append_data()") and commit 1f4c6eb24029 ("ipv6:
factorize sk_wmem_alloc updates done by __ip6_append_data()"),
when transmitting sub MTU datagram, an addtional, unneeded atomic
operation is performed in ip*_append_data() to update wmem_alloc:
in the above condition the delta is 0.

The above cause small but measurable performance regression in UDP
xmit tput test with packet size below MTU.

This change avoids such overhead updating wmem_alloc only if
wmem_alloc_delta is non zero.

The error path is left intentionally unmodified: it's a slow path
and simplicity is preferred to performances.

Fixes: 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()")
Fixes: 1f4c6eb24029 ("ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/ip_output.c  | 3 ++-
 net/ipv6/ip6_output.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 94cacae76aca..4c11b810a447 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1090,7 +1090,8 @@ static int __ip_append_data(struct sock *sk,
 		length -= copy;
 	}

-	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
+	if (wmem_alloc_delta)
+		refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
 	return 0;

 error_efault:
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index e6eaa4dd9f60..b4ae9b7941fa 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1536,7 +1536,8 @@ static int __ip6_append_data(struct sock *sk,
 		length -= copy;
 	}

-	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
+	if (wmem_alloc_delta)
+		refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
 	return 0;

 error_efault:
-- 
2.14.3

^ permalink raw reply related

* [PATCH iproute2-next] tc: Correct json output for actions
From: Yuval Mintz @ 2018-04-04 12:24 UTC (permalink / raw)
  To: dsahern; +Cc: mlxsw, netdev, Yuval Mintz

Commit 9fd3f0b255d9 ("tc: enable json output for actions") added JSON
support for tc-actions at the expense of breaking other use cases that
reach tc_print_action(), as the latter don't expect the 'actions' array
to be a new object.

Consider the following taken duringrun of tc_chain.sh selftest,
and see the latter command output is broken:

$ ./tc/tc -j -p actions list action gact | grep -C 3 actions
[ {
        "total acts": 1
    },{
        "actions": [ {
                "order": 0,

$ ./tc/tc -p -j -s filter show dev enp3s0np2 ingress | grep -C 3 actions
            },
            "skip_hw": true,
            "not_in_hw": true,{
                "actions": [ {
                        "order": 1,
                        "kind": "gact",
                        "control_action": {

Relocate the open/close of the JSON object to declare the object only
for the case that needs it.

Signed-off-by: Yuval Mintz <yuvalm@mellanox.com>
---
 tc/m_action.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tc/m_action.c b/tc/m_action.c
index 2f85d35..8993b93 100644
--- a/tc/m_action.c
+++ b/tc/m_action.c
@@ -366,7 +366,6 @@ tc_print_action(FILE *f, const struct rtattr *arg, unsigned short tot_acts)
 	if (tab_flush && NULL != tb[0]  && NULL == tb[1])
 		return tc_print_action_flush(f, tb[0]);
 
-	open_json_object(NULL);
 	open_json_array(PRINT_JSON, "actions");
 	for (i = 0; i <= tot_acts; i++) {
 		if (tb[i]) {
@@ -383,7 +382,6 @@ tc_print_action(FILE *f, const struct rtattr *arg, unsigned short tot_acts)
 
 	}
 	close_json_array(PRINT_JSON, NULL);
-	close_json_object();
 
 	return 0;
 }
@@ -439,8 +437,9 @@ int print_action(const struct sockaddr_nl *who,
 		}
 	}
 
-
+	open_json_object(NULL);
 	tc_print_action(fp, tb[TCA_ACT_TAB], tot_acts ? *tot_acts:0);
+	close_json_object();
 
 	return 0;
 }
-- 
2.4.3

^ permalink raw reply related

* [PATCH] netdevsim: remove incorrect __net_initdata annotations
From: Arnd Bergmann @ 2018-04-04 12:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Arnd Bergmann, David Ahern, David S. Miller, netdev, linux-kernel

The __net_initdata section cannot currently be used for structures that
get cleaned up in an exitcall using unregister_pernet_operations:

WARNING: vmlinux.o(.text+0x868c34): Section mismatch in reference from the function nsim_devlink_exit() to the (unknown reference) .init.data:(unknown)
The function nsim_devlink_exit() references
the (unknown reference) __initdata (unknown).
This is often because nsim_devlink_exit lacks a __initdata
annotation or the annotation of (unknown) is wrong.
WARNING: vmlinux.o(.text+0x868c64): Section mismatch in reference from the function nsim_devlink_init() to the (unknown reference) .init.data:(unknown)
WARNING: vmlinux.o(.text+0x8692bc): Section mismatch in reference from the function nsim_fib_exit() to the (unknown reference) .init.data:(unknown)
WARNING: vmlinux.o(.text+0x869300): Section mismatch in reference from the function nsim_fib_init() to the (unknown reference) .init.data:(unknown)

As that warning tells us, discarding the structure after a module is
loaded would lead to a undefined behavior when that module is removed.

It might be possible to change that annotation so it has no effect for
loadable modules, but I have not figured out exactly how to do that, and
we want this to be fixed in -rc1.

This just removes the annotations, just like we do for all other such
modules.

Fixes: 37923ed6b8ce ("netdevsim: Add simple FIB resource controller via devlink")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/netdevsim/devlink.c | 2 +-
 drivers/net/netdevsim/fib.c     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/netdevsim/devlink.c b/drivers/net/netdevsim/devlink.c
index 27ae05c5fdaf..1dba47936456 100644
--- a/drivers/net/netdevsim/devlink.c
+++ b/drivers/net/netdevsim/devlink.c
@@ -267,7 +267,7 @@ static int __net_init nsim_devlink_netns_init(struct net *net)
 	return 0;
 }

-static struct pernet_operations nsim_devlink_net_ops __net_initdata = {
+static struct pernet_operations nsim_devlink_net_ops = {
 	.init = nsim_devlink_netns_init,
 	.id   = &nsim_devlink_id,
 	.size = sizeof(bool),
diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c
index 0d105bafa261..9bfe9e151e13 100644
--- a/drivers/net/netdevsim/fib.c
+++ b/drivers/net/netdevsim/fib.c
@@ -230,7 +230,7 @@ static int __net_init nsim_fib_netns_init(struct net *net)
 	return 0;
 }

-static struct pernet_operations nsim_fib_net_ops __net_initdata = {
+static struct pernet_operations nsim_fib_net_ops = {
 	.init = nsim_fib_netns_init,
 	.id   = &nsim_fib_net_id,
 	.size = sizeof(struct nsim_fib_data),
-- 
2.9.0

^ permalink raw reply related

* [PATCH] mac80211: Fix bad line warning
From: Masanari Iida @ 2018-04-04 11:53 UTC (permalink / raw)
  To: davem, netdev, linux-kernel; +Cc: Masanari Iida

After 03fe2debbb2771fb90881e merged during 4.17 marge window,
I start to see following warning during "make xmldocs"

./include/net/mac80211.h:2083: warning: bad line:  >

Replace ">" with "*" fix the issue.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
---
 include/net/mac80211.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/mac80211.h b/include/net/mac80211.h
index d2279b2d61aa..b2f3a0c018e7 100644
--- a/include/net/mac80211.h
+++ b/include/net/mac80211.h
@@ -2080,7 +2080,7 @@ struct ieee80211_txq {
  *	virtual interface might not be given air time for the transmission of
  *	the frame, as it is not synced with the AP/P2P GO yet, and thus the
  *	deauthentication frame might not be transmitted.
- >
+ *
  * @IEEE80211_HW_DOESNT_SUPPORT_QOS_NDP: The driver (or firmware) doesn't
  *	support QoS NDP for AP probing - that's most likely a driver bug.
  *
-- 
2.17.0.rc2.3.gc2a499e6c31e

^ permalink raw reply related

* Re: [PATCH net-next 0/9] devlink: Add support for region access
From: Alex Vesker @ 2018-04-04 11:07 UTC (permalink / raw)
  To: David Ahern, Andrew Lunn
  Cc: David S. Miller, netdev, Tariq Toukan, Jiri Pirko
In-Reply-To: <5790065d-1eb3-ceb5-3f3d-17babd7d77c2@gmail.com>

On 3/31/2018 8:21 PM, David Ahern wrote:
> On 3/31/18 9:53 AM, Andrew Lunn wrote:
>>> I want to be able to login to a customer and accessing this snapshot
>>> without any previous configuration from the user and not asking for
>>> enabling the feature and then waiting for a repro...this will help
>>> debugging issues that are hard to reproduce, I don't see any reason
>>> to disable this.
>> The likely reality is 99.9% of these snapshots will never be seen or
>> used. But they take up memory sitting there doing nothing. And if the
>> snapshot is 2GB, that is a lot of memory. I expect a system admin
>> wants to be able to choose to enable this feature or not, because of
>> that memory. You should also consider implementing the memory pressure
>> callbacks, so you can discard snapshots, rather than OOM the machine.
>>
> That is exactly my point. Nobody wants one rogue device triggering
> snapshots, consuming system resources and with no options to disable it.

OK, currently there is a task to support persistent/permanent configuration
to devlink. Once this support is in I will add my code on top of it.
This will allow a user to enable the snapshot functionality on the driver.
Regarding the double continuous allocation of memory, I will fix to a single
vmalloc on the driver in case of adding a snapshot. Tell me what you think.

^ permalink raw reply

* Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
From: Ilpo Järvinen @ 2018-04-04 10:42 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <CAK6E8=ffjLxqYuQf0FMge9XjNbrFvPzE9NvM5ahJxhxAwZwsOA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 8656 bytes --]

On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ycheng@google.com> wrote:
> >
> > On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
> > <ilpo.jarvinen@helsinki.fi> wrote:
> > > On Tue, 27 Mar 2018, Yuchung Cheng wrote:
> > >
> > >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> > >> <ilpo.jarvinen@helsinki.fi> wrote:
> > >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> > >> >
> > >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> > >> >> <ilpo.jarvinen@helsinki.fi> wrote:
> > >> >> >
> > >> >> > A miscalculation for the number of acknowledged packets occurs during
> > >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> > >> >> > any non-retransmitted skbs. The reason is that pkts_acked value
> > >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> > >> >> > RTO as it may include segments that were not lost and therefore did
> > >> >> > not need retransmissions in the slow start following the RTO. Then
> > >> >> > tcp_slow_start will add the excess into cwnd bloating it and
> > >> >> > triggering a burst.
> > >> >> >
> > >> >> > Instead, we want to pass only the number of retransmitted segments
> > >> >> > that were covered by the cumulative ACK (and potentially newly sent
> > >> >> > data segments too if the cumulative ACK covers that far).
> > >> >> >
> > >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > >> >> > ---
> > >> >>
> > >> >> My understanding is there are two problems
> > >> >>
> > >> >> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> > >> >> precise cumulatively acked count, not newly acked count?
> > >> >
> > >> > While I'm not entirely sure if you intented to say that my fix is broken
> > >> > or not, I thought this very difference alot while making the fix and I
> > >> > believe that this fix is needed because of the discontinuity at RTO
> > >> > (sacked_out is cleared as we set L-bits + lost_out). This is an artifact
> > >> > in the imitation of sacked_out for non-SACK but at RTO we can't keep that
> > >> > in sync because we set L-bits (and have no S-bits to guide us). Thus, we
> > >> > cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.
> > >> >
> > >> > In tcp_remove_reno_sacks acked - sacked_out is being used to calculate
> > >> > tp->delivered, using plain cumulative acked causes congestion control
> > >> > breakage later as call to tcp_cong_control will directly use the
> > >> > difference in tp->delivered.
> > >> >
> > >> > This boils down the exact definition of tp->delivered (the one given in
> > >> > the header is not detailed enough). I guess you might have better idea
> > >> > what it exactly is since one of you has added it? There are subtle things
> > >> > in the defination that can make it entirely unsuitable for cc decisions.
> > >> > Should those segments that we (possibly) already counted into
> > >> > tp->delivered during (potentially preceeding) CA_Recovery be added to it
> > >> > for _second time_ or not? This fix avoids such double counting (the
> > >> Where is the double counting, assuming normal DUPACK behavior?
> > >>
> > >> In the non-sack case:
> > >>
> > >> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> > >> incrementing tp->delivered in tcp_add_reno_sack()
> > >
> > > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> > > I've tried to point out quite many times already)...
> > >
> > >> 2. upon receiving a partial ACK or an ACK that acks recovery point
> > >> (high_seq), tp->delivered is incremented by (cumulatively acked -
> > >> #dupacks) in tcp_remove_reno_sacks()
> > >
> > > ...and this won't happen correctly anymore after RTO (since non-SACK
> > > won't keep #dupacks due to the discontinuity). Thus we end up adding
> > > cumulatively acked - 0 to tp->delivered on those ACKs.
> > >
> > >> therefore tp->delivered is tracking the # of packets delivered
> > >> (sacked, acked, DUPACK'd) with the most information it could have
> > >> inferred.
> > >
> > > Since you didn't answer any of my questions about tp->delivered directly,
> > > let me rephrase them to this example (non-SACK, of course):
> > >
> > > 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> > > Cwnd = 2 so the sender rexmits 2 out of 4. We get cumulative ACK for
> > > three segments. How much should tp->delivered be incremented? 2 or 3?
> > >
> > > ...I think 2 is the right answer.
> > >
> > >> From congestion control's perspective, it cares about the delivery
> > >> information (e.g. how much), not the sequences (what or how).
> > >
> > > I guess you must have missed my point. I'm talking about "how much"
> > > whole the time. It's just about when can we account for it (and when not).
> > >
> > >> I am pointing out that
> > >>
> > >> 1) your fix may fix one corner CC packet accounting issue in the
> > >> non-SACK and CA_Loss
> > >> 2) but it does not fix the other (major) CC packet accounting issue in
> > >> the SACK case
> > >
> > > You say major but it's major if and only if one is using one of the
> > > affected cc modules which were not that many and IMHO not very mainstream
> > > anyway (check my list from the previous email, not that I'd mind if they'd
> > > get fixed too). The rest of the cc modules do not seem to use the incorrect
> > > value even if some of them have the pkts_acked callback.
> > >
> > > Other CC packet accounting (regardless of SACK) is based on tp->delivered
> > > and tp->delivered is NOT similarly miscounted with SACK because the logic
> > > depend on !TCPCB_SACKED_ACKED (that fails only if we have very high ACK
> > > loss).
> > >
> > >> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> > >> as tcp_check_reno_reordering requires strictly cum. ack.
> > >
> > > I think that the current non-SACK reordering detection logic is not that
> > > sound after RTO (but I'm yet to prove this). ...There seems to be some
> > > failure modes which is why I even thought of disabling the whole thing
> > > for non-SACK RTO recoveries and as it now seems you do care more about
> > > non-SACK than you initial claimed, I might even have motivation to fix
> > > more those corners rather than the worst bugs only ;-). ...But I'll make
> > > the tp->delivered fix only in this patch to avoid any change this part of
> > > the code.
> > >
> > >> Therefore I prefer
> > >> a) using tp->delivered to fix (1)(2)
> > >> b) perhaps improving tp->delivered SACK emulation code in the non-SACK case
> > >
> > > Somehow I get an impression that you might assume/say here that
> > [resending in plaintext]
> > That's wrong impression. Perhaps it's worth re-iterating what I agree
> > and disagree
> >
> > 1. [agree] there's accounting issue in non-SACK as you discovered
> > which causes CC misbehavior
> >
> > 2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
> >     => that field is not used by common C.C. (you said so too)
> >
> > 3. [disagree] adjusting pkts_acked may not affect reordering
> > accounting in non-sack
> >
> >
> >
> > For cubic or reno, the main source is the "acked_sacked" passed into
> > tcp_cong_avoid(). that variable is derived from tp->delivered.
> > Therefore we need to fix that to address the problem in (1)
> >
> > I have yet to read your code. Will read later today.
> Your patch looks good. Some questions / comments:

Just to be sure that we understand each other the correct way around this 
time, I guess it resolved both of your "disagree" points above?

> 1. Why only apply to CA_Loss and not also CA_Recovery? this may
> mitigate GRO issue I noted in other thread.

Hmm, that's indeed a good idea. I'll give it some more thought but my 
initial impression is that it should work.

I initially thought that the GRO issues just had to do with missing xmit 
opportunities and was confused by the use of "mitigate" here but then 
realized you meant also (or perhaps mainly?) that GRO causes similar 
bursts (once the too small sacked_out runs out).

> 2. minor style nit: can we adjust tp->delivered directly in
> tcp_clean_rtx_queue(). I am personally against calling "non-sack" reno
> (e.g. tcp_*reno*()) because its really confusing w/ Reno congestion
> control when SACK is used. One day I'd like to rename all these *reno*
> to _nonsack_.

That's what I actually did first but put ended up putting it into own 
function because of line lengths (not a particularly good reason).

...So yes, I can put it into the tcp_clean_rtx_queue in the next version.

I'll also try to keep that "reno" thing in my mind.


-- 
 i.

^ permalink raw reply

* [PATCH] [net] nvmem: disallow modular CONFIG_NVMEM
From: Arnd Bergmann @ 2018-04-04 10:38 UTC (permalink / raw)
  To: Srinivas Kandagatla
  Cc: Arnd Bergmann, Mike Looijmans, Florian Fainelli, Andrew Lunn,
	David S . Miller, netdev, Greg Kroah-Hartman, Oleksij Rempel,
	Martin Blumenstingl, linux-kernel

The new of_get_nvmem_mac_address() helper function causes a link error
with CONFIG_NVMEM=m:

drivers/of/of_net.o: In function `of_get_nvmem_mac_address':
of_net.c:(.text+0x168): undefined reference to `of_nvmem_cell_get'
of_net.c:(.text+0x19c): undefined reference to `nvmem_cell_read'
of_net.c:(.text+0x1a8): undefined reference to `nvmem_cell_put'

I could not come up with a good solution for this, as the code is always
built-in. Using an #if IS_REACHABLE() check around it would solve the
link time issue but then stop it from working in that configuration.
Making of_nvmem_cell_get() an inline function could also solve that, but
seems a bit ugly since it's somewhat larger than most inline functions,
and it would just bring that problem into the callers.  Splitting the
function into a separate file might be an alternative.

This uses the big hammer by making CONFIG_NVMEM itself a 'bool' symbol,
which avoids the problem entirely but makes the vmlinux larger for anyone
that might use NVMEM support but doesn't need it built-in otherwise.

Fixes: 9217e566bdee ("of_net: Implement of_get_nvmem_mac_address helper")
Cc: Mike Looijmans <mike.looijmans@topic.nl>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
The problem arrived through the networking tree, but it's now in
mainline, so the fix could go through either tree
---
 drivers/nvmem/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvmem/Kconfig b/drivers/nvmem/Kconfig
index 5f9bc787d634..1090924efdb1 100644
--- a/drivers/nvmem/Kconfig
+++ b/drivers/nvmem/Kconfig
@@ -1,5 +1,5 @@
 menuconfig NVMEM
-	tristate "NVMEM Support"
+	bool "NVMEM Support"
 	help
 	  Support for NVMEM(Non Volatile Memory) devices like EEPROM, EFUSES...

-- 
2.9.0

^ permalink raw reply related

* Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
From: Ilpo Järvinen @ 2018-04-04 10:35 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <CAK6E8=d_NS+X0wm1mFZ7=FRkJD+wLFTmdBPHhrjOGHn-rU801w@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4262 bytes --]

On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > If SACK is not enabled and the first cumulative ACK after the RTO
> > retransmission covers more than the retransmitted skb, a spurious
> > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > The reason is that any non-retransmitted segment acknowledged will
> > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > no indication that it would have been delivered for real (the
> > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > case so the check for that bit won't help like it does with SACK).
> > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > in tcp_process_loss.
> >
> > We need to use more strict condition for non-SACK case and check
> > that none of the cumulatively ACKed segments were retransmitted
> > to prove that progress is due to original transmissions. Only then
> > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > non-SACK case.
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > ---
> >  net/ipv4/tcp_input.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 4a26c09..c60745c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >                                 pkts_acked = rexmit_acked + newdata_acked;
> >
> >                         tcp_remove_reno_sacks(sk, pkts_acked);
> > +
> > +                       /* If any of the cumulatively ACKed segments was
> > +                        * retransmitted, non-SACK case cannot confirm that
> > +                        * progress was due to original transmission due to
> > +                        * lack of TCPCB_SACKED_ACKED bits even if some of
> > +                        * the packets may have been never retransmitted.
> > +                        */
> > +                       if (flag & FLAG_RETRANS_DATA_ACKED)
> > +                               flag &= ~FLAG_ORIG_SACK_ACKED;
> 
> How about keeping your excellent comment but move the fix to F-RTO
> code directly so it's more clear? this way the flag remains clear that
> indicates some never-retransmitted data are acked/sacked.
> 
> // pseudo code for illustration
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 8d480542aa07..f7f3357de618 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2629,8 +2629,15 @@ static void tcp_process_loss(struct sock *sk,
> int flag, bool is_dupack,
>         if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
>                 /* Step 3.b. A timeout is spurious if not all data are
>                  * lost, i.e., never-retransmitted data are (s)acked.
> +                *
> +                * If any of the cumulatively ACKed segments was
> +                * retransmitted, non-SACK case cannot confirm that
> +                * progress was due to original transmission due to
> +                * lack of TCPCB_SACKED_ACKED bits even if some of
> +                * the packets may have been never retransmitted.
>                  */
>                 if ((flag & FLAG_ORIG_SACK_ACKED) &&
> +                   (tcp_is_sack(tp) || !FLAG_RETRANS_DATA_ACKED) &&
>                     tcp_try_undo_loss(sk, true))
>                         return;

Of course I could put the back there but I really like the new place more 
(which was a result of your suggestion to place the code elsewhere).
IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we 
weren't successful in proving (there in tcp_clean_rtx_queue) that progress 
was due original transmission and thus I would not want falsely indicate 
it with that flag. And there's the non-SACK related block anyway already 
there so it keeps the non-SACK "pollution" off from the SACK code paths.

(In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to 
FLAG_ORIG_PROGRESS, the latter is more descriptive about the condition 
we're after regardless of SACK and less ambiguous in non-SACK case).

-- 
 i.

^ permalink raw reply

* Re: [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled
From: Ilpo Järvinen @ 2018-04-04 10:23 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: netdev, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <CAK6E8=ea8JJu2J369vdUa3FPwj8JNRAGeh4d1WC8RcNZpQp=4Q@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1826 bytes --]

On Wed, 28 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > When a cumulative ACK lands to high_seq at the end of loss
> > recovery and SACK is not enabled, the sender needs to avoid
> > false fast retransmits (RFC6582). The avoidance mechanisms is
> > implemented by remaining in the loss recovery CA state until
> > one additional cumulative ACK arrives. During the operation of
> > this avoidance mechanism, there is internal transient in the
> > use of state variables which will always trigger a bogus undo.
>
> Do we have to make undo in non-sack perfect? can we consider a much
> simpler but imperfect fix of
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 8d480542aa07..95225d9de0af 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2356,6 +2356,7 @@ static bool tcp_try_undo_recovery(struct sock *sk)
>                  * fast retransmits (RFC2582). SACK TCP is safe. */
>                 if (!tcp_any_retrans_done(sk))
>                         tp->retrans_stamp = 0;
> +               tp->undo_marker = 0;
>                 return true;
>         }

Yes, that's of course a possible and would workaround the issue too. 
In fact, I initially did that kind of fix for myself (I put it into a 
block with tp->retrans_stamp = 0 though). But then I realized that it is 
not that complicated to make the fix locally into tcp_packet_delayed()
(except the annoyance of passing all the necessary state parameters
through the deep static call-chain but that should pose no big challenge 
for the compiler to handle I guess).

BTW, do you know under what circumstances that tcp_any_retrans_done(sk) 
would return non-zero here (snd_una == high_seq so those rexmit 
would need to be above high_seq)?

-- 
 i.

^ permalink raw reply

* Re: PROBLEM: Using BPF_PROG_TEST_RUN with data_out != NULL is unsafe
From: Daniel Borkmann @ 2018-04-04 10:01 UTC (permalink / raw)
  To: Lorenz Bauer, ast; +Cc: netdev, linux-kernel
In-Reply-To: <CACAyw9-TvamC2HMNv=WDk_aeAM=R7J_BgLiKjevY5+am71W6-A@mail.gmail.com>

On 04/04/2018 11:04 AM, Lorenz Bauer wrote:
> Hello,
> 
> I’ve encountered an issue when using BPF_PROG_TEST_RUN and capturing the output.
> The kernel copies data into user space without checking the length of
> the destination buffer.
> 
> In bpf_test_finish(), size is the amount of data in the XDP buffer /
> skb after the program is run. This can be larger than data_size_in due
> to bpf_xdp_adjust_head() and friends.
> bpf_test_finish doesn’t clamp size to data_size_out, which is what I
> was expecting.
> 
> What is the correct way to use this interface?

Yeah, so XDP has a headroom of XDP_PACKET_HEADROOM + NET_IP_ALIGN which in case
of x86 is 256 bytes (NET_IP_ALIGN being 0 there). This means that attr.test.data_out
buffer needs to be 256 bytes larger than attr.test.data_in in case the BPF prog
calls bpf_xdp_adjust_head() or bpf_xdp_adjust_meta(). In case you point data_in
and data_out to the same address, then the total buffer size therefore has to be
attr.test.data_size_in + 256 in order to not overrun anything while not being
aware of the BPF test program. The XDP_PACKET_HEADROOM is exposed to user space
in linux/bpf.h.

^ permalink raw reply

* [PATCH] net: hns3: fix length overflow when CONFIG_ARM64_64K_PAGES
From: Tan Xiaojun @ 2018-04-04  9:40 UTC (permalink / raw)
  To: yisen.zhuang, salil.mehta
  Cc: huawei.libin, tanxiaojun, netdev, linux-kernel, wangzhou1,
	linyunsheng

When enable the config item "CONFIG_ARM64_64K_PAGES", the size of PAGE_SIZE
is 65536(64K). But the type of length is u16, it will overflow. So change it
to u32.

Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 9e4cfbb..98cdbd3 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -288,7 +288,7 @@ struct hns3_desc_cb {
 	u16 page_offset;
 	u16 reuse_flag;
 
-	u16 length;     /* length of the buffer */
+	u32 length;     /* length of the buffer */
 
        /* desc type, used by the ring user to mark the type of the priv data */
 	u16 type;
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-04-04  9:34 UTC (permalink / raw)
  To: David Lebrun; +Cc: Alexei Starovoitov, netdev, David Lebrun, Daniel Borkmann
In-Reply-To: <c4e374e8-d385-1f86-cafe-85d983f6c45e@gmail.com>

2018-04-03 16:25 GMT+02:00 David Lebrun <dav.lebrun@gmail.com>:
> Actually I'm wrong here. dst_input() will call either ip6_input() or
> ip6_forward(), not ipv6_rcv(). Both functions expect IP6CB() to be set,
> so using skb->cb here will interfere with them.
>
> What about saving and restoring the IPv6 CB, similarly to what TCP does with
> tcp_v6_restore_cb() ?


Yes. I can change the call to bpf_prog_run_save_cb to bpf_prog_run_clear_cb,
and then manually save/restore the IPv6 CB in input_action_end_bpf.

Or is there maybe a better solution to share some state between the bpf caller
and helpers, that does not need access to skb->cb ?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox