Netdev List
 help / color / mirror / Atom feed
* [PATCH] net: fec: Detect and recover receive queue hangs
From: Chris Lesiak @ 2016-11-17 21:14 UTC (permalink / raw)
  To: Fugang Duan; +Cc: netdev, linux-kernel, Jaccon Bastiaansen, chris.lesiak

This corrects a problem that appears to be similar to ERR006358.  But
while ERR006358 is a race when the tx queue transitions from empty to
not empty, this problem is a race when the rx queue transitions from
full to not full.

The symptom is a receive queue that is stuck.  The ENET_RDAR register
will read 0, indicating that there are no empty receive descriptors in
the receive ring.  Since no additional frames can be queued, no RXF
interrupts occur.

This problem can be triggered with a 1 Gb link and about 400 Mbps of
traffic.

This patch detects this condition, sets the work_rx bit, and
reschedules the poll method.

Signed-off-by: Chris Lesiak <chris.lesiak@licor.com>
---
 drivers/net/ethernet/freescale/fec_main.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index fea0f33..8a87037 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1588,6 +1588,34 @@ fec_enet_interrupt(int irq, void *dev_id)
 	return ret;
 }
 
+static inline bool
+fec_enet_recover_rxq(struct fec_enet_private *fep, u16 queue_id)
+{
+	int work_bit = (queue_id == 0) ? 2 : ((queue_id == 1) ? 0 : 1);
+
+	if (readl(fep->rx_queue[queue_id]->bd.reg_desc_active))
+		return false;
+
+	dev_notice_once(&fep->pdev->dev, "Recovered rx queue\n");
+
+	fep->work_rx |= 1 << work_bit;
+
+	return true;
+}
+
+static inline bool fec_enet_recover_rxqs(struct fec_enet_private *fep)
+{
+	unsigned int q;
+	bool ret = false;
+
+	for (q = 0; q < fep->num_rx_queues; q++) {
+		if (fec_enet_recover_rxq(fep, q))
+			ret = true;
+	}
+
+	return ret;
+}
+
 static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 {
 	struct net_device *ndev = napi->dev;
@@ -1601,6 +1629,9 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 	if (pkts < budget) {
 		napi_complete(napi);
 		writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+
+		if (fec_enet_recover_rxqs(fep) && napi_reschedule(napi))
+			writel(FEC_NAPI_IMASK, fep->hwp + FEC_IMASK);
 	}
 	return pkts;
 }
-- 
2.5.5

^ permalink raw reply related

* Re: Netperf UDP issue with connected sockets
From: Jesper Dangaard Brouer @ 2016-11-17 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, brouer
In-Reply-To: <1479408683.8455.273.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 17 Nov 2016 10:51:23 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> 
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application have something to send. Thus, and possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > socket layer into qdisc layer, should be fairly simple (and rest of
> > xmit_more is already in place).    
> 
> 
> As I said, you are fooled by TX completions.
> 
> Please make sure to increase the sndbuf limits !
> 
> echo 2129920 >/proc/sys/net/core/wmem_default
> 
> lpaa23:~# sar -n DEV 1 10|grep eth1
> 10:49:25         eth1      7.00 9273283.00      0.61 2187214.90      0.00      0.00      0.00
> 10:49:26         eth1      1.00 9230795.00      0.06 2176787.57      0.00      0.00      1.00
> 10:49:27         eth1      2.00 9247906.00      0.17 2180915.45      0.00      0.00      0.00
> 10:49:28         eth1      3.00 9246542.00      0.23 2180790.38      0.00      0.00      1.00
> 10:49:29         eth1      1.00 9239218.00      0.06 2179044.83      0.00      0.00      0.00
> 10:49:30         eth1      3.00 9248775.00      0.23 2181257.84      0.00      0.00      1.00
> 10:49:31         eth1      4.00 9225471.00      0.65 2175772.75      0.00      0.00      0.00
> 10:49:32         eth1      2.00 9253536.00      0.33 2182666.44      0.00      0.00      1.00
> 10:49:33         eth1      1.00 9265900.00      0.06 2185598.40      0.00      0.00      0.00
> 10:49:34         eth1      1.00 6949031.00      0.06 1638889.63      0.00      0.00      1.00
> Average:         eth1      2.50 9018045.70      0.25 2126893.82      0.00      0.00      0.50
> 
> 
> lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
>      xmit_more: 2251366909
>      xmit_more: 2256011392
> 
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

xmit more not happen that frequently for my setup, it does happen
sometimes. And I do monitor with "ethtool -S".

~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2
Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
Ethtool(mlx5p2  ) stat:     92900913 (     92,900,913) <= tx0_bytes /sec
Ethtool(mlx5p2  ) stat:        36073 (         36,073) <= tx0_nop /sec
Ethtool(mlx5p2  ) stat:      1548349 (      1,548,349) <= tx0_packets /sec
Ethtool(mlx5p2  ) stat:            1 (              1) <= tx0_xmit_more /sec
Ethtool(mlx5p2  ) stat:     92884899 (     92,884,899) <= tx_bytes /sec
Ethtool(mlx5p2  ) stat:     99297696 (     99,297,696) <= tx_bytes_phy /sec
Ethtool(mlx5p2  ) stat:      1548082 (      1,548,082) <= tx_csum_partial /sec
Ethtool(mlx5p2  ) stat:      1548082 (      1,548,082) <= tx_packets /sec
Ethtool(mlx5p2  ) stat:      1551527 (      1,551,527) <= tx_packets_phy /sec
Ethtool(mlx5p2  ) stat:     99076658 (     99,076,658) <= tx_prio1_bytes /sec
Ethtool(mlx5p2  ) stat:      1548073 (      1,548,073) <= tx_prio1_packets /sec
Ethtool(mlx5p2  ) stat:     92936078 (     92,936,078) <= tx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat:      1548934 (      1,548,934) <= tx_vport_unicast_packets /sec
Ethtool(mlx5p2  ) stat:            1 (              1) <= tx_xmit_more /sec

(after several attempts I got:)
$ ethtool -S mlx5p2|grep more; sleep 1;ethtool -S mlx5p2|grep more
     tx_xmit_more: 14048
     tx0_xmit_more: 14048
     tx_xmit_more: 14049
     tx0_xmit_more: 14049

This was with:
 $ grep -H . /proc/sys/net/core/wmem_default
 /proc/sys/net/core/wmem_default:2129920

>    PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp],  (all, 48 CPUs)
> ---------------------------------------------------------------------------------------------
> 
>     11.64%  [kernel]  [k] skb_set_owner_w               
>      6.21%  [kernel]  [k] queued_spin_lock_slowpath     
>      4.76%  [kernel]  [k] _raw_spin_lock                
>      4.40%  [kernel]  [k] __ip_make_skb                 
>      3.10%  [kernel]  [k] sock_wfree                    
>      2.87%  [kernel]  [k] ipt_do_table                  
>      2.76%  [kernel]  [k] fq_dequeue                    
>      2.71%  [kernel]  [k] mlx4_en_xmit                  
>      2.50%  [kernel]  [k] __dev_queue_xmit              
>      2.29%  [kernel]  [k] __ip_append_data.isra.40      
>      2.28%  [kernel]  [k] udp_sendmsg                   
>      2.01%  [kernel]  [k] __alloc_skb                   
>      1.90%  [kernel]  [k] napi_consume_skb              
>      1.63%  [kernel]  [k] udp_send_skb                  
>      1.62%  [kernel]  [k] skb_release_data              
>      1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath     
>      1.56%  [kernel]  [k] dev_hard_start_xmit           
>      1.55%  udpsnd    [.] __libc_send                   
>      1.48%  [kernel]  [k] netif_skb_features            
>      1.42%  [kernel]  [k] __qdisc_run                   
>      1.35%  [kernel]  [k] sk_dst_check                  
>      1.33%  [kernel]  [k] sock_def_write_space          
>      1.30%  [kernel]  [k] kmem_cache_alloc_node_trace   
>      1.29%  [kernel]  [k] __local_bh_enable_ip          
>      1.21%  [kernel]  [k] copy_user_enhanced_fast_string
>      1.08%  [kernel]  [k] __kmalloc_reserve.isra.40     
>      1.08%  [kernel]  [k] SYSC_sendto                   
>      1.07%  [kernel]  [k] kmem_cache_alloc_node         
>      0.95%  [kernel]  [k] ip_finish_output2             
>      0.95%  [kernel]  [k] ktime_get                     
>      0.91%  [kernel]  [k] validate_xmit_skb             
>      0.88%  [kernel]  [k] sock_alloc_send_pskb          
>      0.82%  [kernel]  [k] sock_sendmsg                  

I'm more interested in why I see fib_table_lookup() and
__ip_route_output_key_hash() when you don't ?!?  There must be some
mistake in my setup!

Maybe you can share your udp flood "udpsnd" program source?

Maybe I'm missing some important sysctl /proc/net/sys/ ?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

p.s. I placed my testing software here:
 https://github.com/netoptimizer/network-testing/tree/master/src

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Eric Dumazet @ 2016-11-17 21:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan
In-Reply-To: <20161117221941.3b525181@redhat.com>

On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:

> 
> Maybe you can share your udp flood "udpsnd" program source?

Very ugly. This is based on what I wrote when tracking the UDP v6
checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
checksum in skb_checksum_help()), because netperf sends the same message
over and over...


Use -d 2   to remove the ip_idents_reserve() overhead.


#define _GNU_SOURCE

#include <errno.h>
#include <error.h>
#include <linux/errqueue.h>
#include <netinet/in.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

char buffer[1400];

int main(int argc, char** argv) {
  int fd, i;
  struct sockaddr_in6 addr;
  char *host = "2002:af6:798::1";
  int family = AF_INET6;
  int discover = -1;

  while ((i = getopt(argc, argv, "4H:d:")) != -1) {
    switch (i) {
    case 'H': host = optarg; break;
    case '4': family = AF_INET; break;
    case 'd': discover = atoi(optarg); break;
    }
  }
  fd = socket(family, SOCK_DGRAM, 0);
  if (fd < 0)
    error(1, errno, "failed to create socket");
  if (discover != -1)
    setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
               &discover, sizeof(discover));

  memset(&addr, 0, sizeof(addr));
  if (family == AF_INET6) {
	  addr.sin6_family = AF_INET6;
	  addr.sin6_port = htons(9);
      inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
  } else {
    struct sockaddr_in *in = (struct sockaddr_in *)&addr;
    in->sin_family = family;
    in->sin_port = htons(9);
      inet_pton(family, host, &in->sin_addr);
  }
  connect(fd, (struct sockaddr *)&addr,
          (family == AF_INET6) ? sizeof(addr) :
                                 sizeof(struct sockaddr_in));
  memset(buffer, 1, 1400);
  for (i = 0; i < 655360000; i++) {
    memcpy(buffer, &i, sizeof(i));
    send(fd, buffer, 100 + rand() % 200, 0);
  }
  return 0;
}

^ permalink raw reply

* Re: net: BUG still has locks held in unix_stream_splice_read
From: Cong Wang @ 2016-11-17 21:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Dmitry Vyukov, David Miller, Hannes Frederic Sowa, Eric Dumazet,
	netdev, LKML, syzkaller, Colin Cross, Mandeep Singh Baines
In-Reply-To: <20161010031450.GW19539@ZenIV.linux.org.uk>

On Sun, Oct 9, 2016 at 8:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> E.g what will happen if some code does a read on AF_UNIX socket with
> some local mutex held?  AFAICS, there are exactly two callers of
> freezable_schedule_timeout() - this one and one in XFS; the latter is
> in a kernel thread where we do have good warranties about the locking
> environment, but here it's in the bleeding ->recvmsg/->splice_read and
> for those assumption that caller doesn't hold any locks is pretty
> strong, especially since it's not documented anywhere.
>
> What's going on there?

Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably correct
at that time, but later, commit 2b514574f7e88c8498027ee366
("net: af_unix: implement splice for stream af_unix sockets") breaks its
requirement for a freezable sleep:

    commit 0f9548ca10916dec166eaf74c816bded7d8e611d

    lockdep: check that no locks held at freeze time

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

So probably we just need to revert commit 2b15af6f95 now.

I am going to send a revert for at least -net and -stable, since Dmitry
saw this warning again.

^ permalink raw reply

* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Jerome Brunet @ 2016-11-17 21:47 UTC (permalink / raw)
  To: André Roth
  Cc: Martin Blumenstingl, Johnson Leung, Giuseppe CAVALLARO,
	linux-amlogic, Alexandre Torgue, netdev
In-Reply-To: <20161117194405.4ca7899b@gmail.com>

On Thu, 2016-11-17 at 19:44 +0100, André Roth wrote:
> Hi all,
> 
> > 
> > I checked again the kernel
> > at https://github.com/hardkernel/linux/tree/ odroidc2-3.14.y. The
> > version you mention (3.14.65-73) seems to be:
> > sha1: c75d5f4d1516cdd86d90a9d1c565bb0ed9251036 tag: jenkins-deb
> > s905
> > kernel-73
> 
> I downloaded the prebuilt image from hardkernel, I did not build the
> kernel myself. but hardkernel has an earlier release of the same
> kernel
> version, which works fine too. I assume they would have committed the
> change in the newer version..
>  
> > 
> > In this particular version, both realtek drivers:
> > - drivers/net/phy/realtek.c
> > - drivers/amlogic/ethernet/phy/am_realtek.c
> > 
> > have the hack to disable 1000M advertisement. I don't understand
> > how
> > it possible for you to have 1000Base-T Full Duplex with this, maybe
> > I'm missing something here ?
> 
> that's what I don't understand as well...
> 
> the patched kernel shows the following:
> 
> $ uname -a
> Linux T-06 4.9.0-rc4+ #21 SMP PREEMPT Sun Nov 13 12:07:19 UTC 2016
> 
> $ sudo ethtool eth0
> Settings for eth0:
>         Supported ports: [ TP MII ]
>         Supported link modes:   10baseT/Half 10baseT/Full 
>                                 100baseT/Half 100baseT/Full 
>                                 1000baseT/Full 
>         Supported pause frame use: No
>         Supports auto-negotiation: Yes
>         Advertised link modes:  10baseT/Half 10baseT/Full 
>                                 100baseT/Half 100baseT/Full 
>                                 1000baseT/Full 
>         Advertised pause frame use: No
>         Advertised auto-negotiation: Yes
>         Link partner advertised link modes:  10baseT/Half
> 10baseT/Full 
>                                              100baseT/Half
> 100baseT/Full 1000baseT/Half 1000baseT/Full 
>         Link partner advertised pause frame use: Symmetric Receive-
> only
>         Link partner advertised auto-negotiation: Yes
>         Speed: 1000Mb/s
>         Duplex: Full
>         Port: MII
>         PHYAD: 0
>         Transceiver: external
>         Auto-negotiation: on
>         Supports Wake-on: ug
>         Wake-on: d
>         Current message level: 0x0000003f (63)
>                                drv probe link timer ifdown ifup
>         Link detected: yes
> 
> $ sudo ethtool --show-eee eth0
> EEE Settings for eth0:
>         EEE status: disabled
>         Tx LPI: disabled
>         Supported EEE link modes:  100baseT/Full 
>                                    1000baseT/Full 
>         Advertised EEE link modes:  100baseT/Full 
>         Link partner advertised EEE link modes:  100baseT/Full 
>                                                  1000baseT/Full 
> 
> can it be that "EEE link modes" and the "normal" link modes are two
> different things ? 

Exactly, They are.
Hardkernel code disable both.

With hardkernel's kernel, you should not have 1000baseT/Full in
"Advertised link modes" and you would have nothing reported in
"Advertised EEE link modes"

> 
> > 
> > If you did compile the kernel yourself, could you check the 2 file
> > mentioned above ? Just to be sure there was no patch applied at the
> > last minute, which would not show up in the git history of
> > hardkernel ?
> 
> I cannot check this easily at the moment..
> 
> Regards,
> 
>  André
> 

^ permalink raw reply

* Re: [PATCH net 1/3] net: phy: realtek: add eee advertisement disable options
From: Jerome Brunet @ 2016-11-17 21:48 UTC (permalink / raw)
  To: Anand Moon
  Cc: netdev, devicetree, Florian Fainelli, Alexandre TORGUE,
	Neil Armstrong, Martin Blumenstingl, Kevin Hilman, Linux Kernel,
	Andre Roth, linux-amlogic, Carlo Caione, Giuseppe Cavallaro,
	linux-arm-kernel
In-Reply-To: <CANAwSgTB5=8tVj1FrZF3EgjK5mjUMwLYb90NeWSFDCsRjsuB6A@mail.gmail.com>

On Thu, 2016-11-17 at 23:30 +0530, Anand Moon wrote:
> Hi Jerone,
> 
> > > How about adding callback functionality for .soft_reset to handle
> > > BMCR
> > > where we update the Auto-Negotiation for the phy,
> > > as per the datasheet of the rtl8211f.

I think BMCR is already pretty well handled by the genphy, don't you
think ?

> > 
> > I'm not sure I understand how this would help with our issue (and
> > EEE).
> > Am I missing something or is it something unrelated that you would
> > like
> > to see happening on this driver ?
> > 
> [snip]
> 
> I was just tying other phy module to understand the feature.
> But in order to improve  the throughput I tried to integrate blow u-
> boot commit.
> 
> commit 3d6af748ebd831524cb22a29433e9092af469ec7
> Author: Shengzhou Liu <Shengzhou.Liu@freescale.com>
> Date:   Thu Mar 12 18:54:59 2015 +0800
> 
>     net/phy: Add support for realtek RTL8211F
> 
>     RTL8211F has different registers from RTL8211E.
>     This patch adds support for RTL8211F PHY which
>     can be found on Freescale's T1023 RDB board.
> 
> And added the similar functionality to  .config_aneg    =
> &rtl8211f_config_aneg,
> 

I assume this is the commit you are referring to : 
http://git.denx.de/?p=u-boot.git;a=commit;h=3d6af748ebd831524cb22a29433
e9092af469ec7

I tried looking a this particular commit and the other ones in
realtek.c history of u-boot. I don't really see what it does that linux
is not already doing.

> And I seem to have better results in through put with periodic drop
> but it recovers.
> -----
> odroid@odroid64:~$ iperf3 -c 10.0.0.102 -p 2006 -i 1 -t 100 -V
> iperf 3.0.11
> Linux odroid64 4.9.0-rc5-xc2ml #18 SMP PREEMPT Thu Nov 17 22:56:00 

[...]

> 
> Test Complete. Summary Results:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-100.00 sec  10.5 GBytes   902
> Mbits/sec    4             sender
> [  4]   0.00-100.00 sec  10.5 GBytes   902
> Mbits/sec                  receiver
> CPU Utilization: local/sender 5.6% (0.2%u/5.4%s), remote/receiver
> 17.1% (1.2%u/15.9%s)
> 

That's the kind of throughput we have on the C2 once the link is
reliable (with EEE switch off for GbE)

> Can your confirm this at your end.
> Once confirm I will try to send this as a fix for this issue.
> 

I'm testing the code to disable EEE in a generic way. I'll post the RFC
for it asap.

> -Best Regards
> Anand Moon

^ permalink raw reply

* Re: [net-next PATCH v2] net: dummy: Introduce dummy virtual functions
From: Or Gerlitz @ 2016-11-17 22:04 UTC (permalink / raw)
  To: Phil Sutter; +Cc: David Miller, Linux Netdev List, Sabrina Dubroca
In-Reply-To: <20161114130205.18333-1-phil@nwl.cc>

On Mon, Nov 14, 2016 at 3:02 PM, Phil Sutter <phil@nwl.cc> wrote:

> Due to the assumption that all PFs are PCI devices, this implementation
> is not completely straightforward: In order to allow for
> rtnl_fill_ifinfo() to see the dummy VFs, a fake PCI parent device is
> attached to the dummy netdev. This has to happen at the right spot so
> register_netdevice() does not get confused. This patch abuses
> ndo_fix_features callback for that. In ndo_uninit callback, the fake
> parent is removed again for the same purpose.

So you did some mimic-ing of PCI interface, how do you let the user to
config the number of VFs? though a module param? why? if the module
param only serves to say how many VF the device supports, maybe
support the maximum possible by PCI spec and skip the module param?


> +module_param(num_vfs, int, 0);
> +MODULE_PARM_DESC(num_vfs, "Number of dummy VFs per dummy device");
> +

> @@ -190,6 +382,7 @@ static int __init dummy_init_one(void)
>         err = register_netdevice(dev_dummy);
>         if (err < 0)
>                 goto err;
> +
>         return 0;

nit, remove this added blank line..

^ permalink raw reply

* [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"
From: Cong Wang @ 2016-11-17 22:09 UTC (permalink / raw)
  To: netdev; +Cc: dvyukov, Cong Wang, Tejun Heo, Colin Cross, Rafael J. Wysocki

Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably
correct at that time, but later, commit 2b514574f7e8
("net: af_unix: implement splice for stream af_unix sockets") breaks
the strong requirement for a freezable sleep, according to
commit 0f9548ca1091:

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

The pipe_lock is still held at that point. So just revert commit 2b15af6f95.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Colin Cross <ccross@android.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/unix/af_unix.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5d1c14a..6f37f9e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -116,7 +116,6 @@
 #include <linux/mount.h>
 #include <net/checksum.h>
 #include <linux/security.h>
-#include <linux/freezer.h>
 
 struct hlist_head unix_socket_table[2 * UNIX_HASH_SIZE];
 EXPORT_SYMBOL_GPL(unix_socket_table);
@@ -2220,7 +2219,7 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		timeo = freezable_schedule_timeout(timeo);
+		timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))
-- 
2.1.0

^ permalink raw reply related

* [PATCH v2 net-next] lan78xx: relocate mdix setting to phy driver
From: Woojung.Huh @ 2016-11-17 22:10 UTC (permalink / raw)
  To: davem, f.fainelli, netdev; +Cc: andrew, UNGLinuxDriver

From: Woojung Huh <woojung.huh@microchip.com>

Relocate mdix code to phy driver to be called at config_init().

Signed-off-by: Woojung Huh <woojung.huh@microchip.com>
---
 drivers/net/phy/microchip.c | 36 +++++++++++++++++++++-
 drivers/net/usb/lan78xx.c   | 73 ++-------------------------------------------
 2 files changed, 38 insertions(+), 71 deletions(-)

diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
index 7c00e50..eb4db22 100644
--- a/drivers/net/phy/microchip.c
+++ b/drivers/net/phy/microchip.c
@@ -106,6 +106,40 @@ static int lan88xx_set_wol(struct phy_device *phydev,
 	return 0;
 }
 
+static void lan88xx_set_mdix(struct phy_device *phydev)
+{
+	int buf;
+	int val;
+
+	switch (phydev->mdix) {
+	case ETH_TP_MDI:
+		val = LAN88XX_EXT_MODE_CTRL_MDI_;
+		break;
+	case ETH_TP_MDI_X:
+		val = LAN88XX_EXT_MODE_CTRL_MDI_X_;
+		break;
+	case ETH_TP_MDI_AUTO:
+		val = LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_;
+		break;
+	default:
+		return;
+	}
+
+	phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_1);
+	buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
+	buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
+	buf |= val;
+	phy_write(phydev, LAN88XX_EXT_MODE_CTRL, buf);
+	phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0);
+}
+
+static int lan88xx_config_aneg(struct phy_device *phydev)
+{
+	lan88xx_set_mdix(phydev);
+
+	return genphy_config_aneg(phydev);
+}
+
 static struct phy_driver microchip_phy_driver[] = {
 {
 	.phy_id		= 0x0007c130,
@@ -120,7 +154,7 @@ static struct phy_driver microchip_phy_driver[] = {
 	.remove		= lan88xx_remove,
 
 	.config_init	= genphy_config_init,
-	.config_aneg	= genphy_config_aneg,
+	.config_aneg	= lan88xx_config_aneg,
 	.read_status	= genphy_read_status,
 
 	.ack_interrupt	= lan88xx_phy_ack_interrupt,
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index cf2857f..0c459e9 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -1471,62 +1471,12 @@ static void lan78xx_set_msglevel(struct net_device *net, u32 level)
 	dev->msg_enable = level;
 }
 
-static int lan78xx_get_mdix_status(struct net_device *net)
-{
-	struct phy_device *phydev = net->phydev;
-	int buf;
-
-	phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_1);
-	buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-	phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0);
-
-	return buf;
-}
-
-static void lan78xx_set_mdix_status(struct net_device *net, __u8 mdix_ctrl)
-{
-	struct lan78xx_net *dev = netdev_priv(net);
-	struct phy_device *phydev = net->phydev;
-	int buf;
-
-	if (mdix_ctrl == ETH_TP_MDI) {
-		phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
-			  LAN88XX_EXT_PAGE_SPACE_1);
-		buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-		buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
-		phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
-			  buf | LAN88XX_EXT_MODE_CTRL_MDI_);
-		phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
-			  LAN88XX_EXT_PAGE_SPACE_0);
-	} else if (mdix_ctrl == ETH_TP_MDI_X) {
-		phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
-			  LAN88XX_EXT_PAGE_SPACE_1);
-		buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-		buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
-		phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
-			  buf | LAN88XX_EXT_MODE_CTRL_MDI_X_);
-		phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
-			  LAN88XX_EXT_PAGE_SPACE_0);
-	} else if (mdix_ctrl == ETH_TP_MDI_AUTO) {
-		phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
-			  LAN88XX_EXT_PAGE_SPACE_1);
-		buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-		buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
-		phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
-			  buf | LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_);
-		phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
-			  LAN88XX_EXT_PAGE_SPACE_0);
-	}
-	dev->mdix_ctrl = mdix_ctrl;
-}
-
 static int lan78xx_get_link_ksettings(struct net_device *net,
 				      struct ethtool_link_ksettings *cmd)
 {
 	struct lan78xx_net *dev = netdev_priv(net);
 	struct phy_device *phydev = net->phydev;
 	int ret;
-	int buf;
 
 	ret = usb_autopm_get_interface(dev->intf);
 	if (ret < 0)
@@ -1534,20 +1484,6 @@ static int lan78xx_get_link_ksettings(struct net_device *net,
 
 	ret = phy_ethtool_ksettings_get(phydev, cmd);
 
-	buf = lan78xx_get_mdix_status(net);
-
-	buf &= LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
-	if (buf == LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_) {
-		cmd->base.eth_tp_mdix = ETH_TP_MDI_AUTO;
-		cmd->base.eth_tp_mdix_ctrl = ETH_TP_MDI_AUTO;
-	} else if (buf == LAN88XX_EXT_MODE_CTRL_MDI_) {
-		cmd->base.eth_tp_mdix = ETH_TP_MDI;
-		cmd->base.eth_tp_mdix_ctrl = ETH_TP_MDI;
-	} else if (buf == LAN88XX_EXT_MODE_CTRL_MDI_X_) {
-		cmd->base.eth_tp_mdix = ETH_TP_MDI_X;
-		cmd->base.eth_tp_mdix_ctrl = ETH_TP_MDI_X;
-	}
-
 	usb_autopm_put_interface(dev->intf);
 
 	return ret;
@@ -1565,9 +1501,6 @@ static int lan78xx_set_link_ksettings(struct net_device *net,
 	if (ret < 0)
 		return ret;
 
-	if (dev->mdix_ctrl != cmd->base.eth_tp_mdix_ctrl)
-		lan78xx_set_mdix_status(net, cmd->base.eth_tp_mdix_ctrl);
-
 	/* change speed & duplex */
 	ret = phy_ethtool_ksettings_set(phydev, cmd);
 
@@ -2019,6 +1952,9 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 		phydev->irq = 0;
 	netdev_dbg(dev->net, "phydev->irq = %d\n", phydev->irq);
 
+	/* set to AUTOMDIX */
+	phydev->mdix = ETH_TP_MDI_AUTO;
+
 	ret = phy_connect_direct(dev->net, phydev,
 				 lan78xx_link_status_change,
 				 PHY_INTERFACE_MODE_GMII);
@@ -2028,9 +1964,6 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 		return -EIO;
 	}
 
-	/* set to AUTOMDIX */
-	lan78xx_set_mdix_status(dev->net, ETH_TP_MDI_AUTO);
-
 	/* MAC doesn't support 1000T Half */
 	phydev->supported &= ~SUPPORTED_1000baseT_Half;
 
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2 net-next] lan78xx: relocate mdix setting to phy driver
From: Florian Fainelli @ 2016-11-17 22:16 UTC (permalink / raw)
  To: Woojung.Huh, davem, netdev; +Cc: andrew, UNGLinuxDriver
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D40966AA4@CHN-SV-EXMX02.mchp-main.com>

On 11/17/2016 02:10 PM, Woojung.Huh@microchip.com wrote:
> From: Woojung Huh <woojung.huh@microchip.com>
> 
> Relocate mdix code to phy driver to be called at config_init().
> 
> Signed-off-by: Woojung Huh <woojung.huh@microchip.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: Debugging Ethernet issues
From: Måns Rullgård @ 2016-11-17 22:17 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sebastian Frias, Mason, Andrew Lunn, netdev, Sergei Shtylyov,
	Tom Lendacky, Zach Brown, Shaohui Xie, Tim Beale, Brian Hill,
	Vince Bridgers, Balakumaran Kannan, David S. Miller,
	Kirill Kapranov
In-Reply-To: <8efc016a-3390-3bec-d74c-7c215c151b2f@gmail.com>

Florian Fainelli <f.fainelli@gmail.com> writes:

> On 11/14/2016 11:00 AM, Måns Rullgård wrote:
>> Florian Fainelli <f.fainelli@gmail.com> writes:
>> 
>>> On 11/14/2016 10:20 AM, Florian Fainelli wrote:
>>>> On 11/14/2016 09:59 AM, Sebastian Frias wrote:
>>>>> On 11/14/2016 06:32 PM, Florian Fainelli wrote:
>>>>>> On 11/14/2016 07:33 AM, Mason wrote:
>>>>>>> On 14/11/2016 15:58, Mason wrote:
>>>>>>>
>>>>>>>> nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
>>>>>>>> vs
>>>>>>>> nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off
>>>>>>>>
>>>>>>>> I'm not sure whether "flow control" is relevant...
>>>>>>>
>>>>>>> Based on phy_print_status()
>>>>>>> phydev->pause ? "rx/tx" : "off"
>>>>>>> I added the following patch.
>>>>>>>
>>>>>>> diff --git a/drivers/net/ethernet/aurora/nb8800.c b/drivers/net/ethernet/aurora/nb8800.c
>>>>>>> index defc22a15f67..4e758c1cfa4e 100644
>>>>>>> --- a/drivers/net/ethernet/aurora/nb8800.c
>>>>>>> +++ b/drivers/net/ethernet/aurora/nb8800.c
>>>>>>> @@ -667,6 +667,8 @@ static void nb8800_link_reconfigure(struct net_device *dev)
>>>>>>>         struct phy_device *phydev = priv->phydev;
>>>>>>>         int change = 0;
>>>>>>>  
>>>>>>> +       printk("%s from %pf\n", __func__, __builtin_return_address(0));
>>>>>>> +
>>>>>>>         if (phydev->link) {
>>>>>>>                 if (phydev->speed != priv->speed) {
>>>>>>>                         priv->speed = phydev->speed;
>>>>>>> @@ -1274,9 +1276,9 @@ static int nb8800_hw_init(struct net_device *dev)
>>>>>>>         nb8800_writeb(priv, NB8800_PQ2, val & 0xff);
>>>>>>>  
>>>>>>>         /* Auto-negotiate by default */
>>>>>>> -       priv->pause_aneg = true;
>>>>>>> -       priv->pause_rx = true;
>>>>>>> -       priv->pause_tx = true;
>>>>>>> +       priv->pause_aneg = false;
>>>>>>> +       priv->pause_rx = false;
>>>>>>> +       priv->pause_tx = false;
>>>>>>>  
>>>>>>>         nb8800_mc_init(dev, 0);
>>>>>>>  
>>>>>>>
>> 
>> [...]
>> 
>>>>>> And the time difference is clearly accounted for auto-negotiation time
>>>>>> here, as you can see it takes about 3 seconds for Gigabit Ethernet to
>>>>>> auto-negotiate and that seems completely acceptable and normal to me
>>>>>> since it is a more involved process than lower speeds.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> OK, so now it works (by accident?) even on 100 Mbps switch, but it still
>>>>>>> prints "flow control rx/tx"...
>>>>>>
>>>>>> Because your link partner advertises flow control, and that's what
>>>>>> phydev->pause and phydev->asym_pause report (I know it's confusing, but
>>>>>> that's what it is at the moment).
>>>>>
>>>>> Thanks.
>>>>> Could you confirm that Mason's patch is correct and/or that it does not
>>>>> has negative side-effects?
>>>>
>>>> The patch is not correct nor incorrect per-se, it changes the default
>>>> policy of having pause frames advertised by default to not having them
>>>> advertised by default.
>> 
>> I was advised to advertise flow control by default back when I was
>> working on the driver, and I think it makes sense to do so.
>> 
>>>> This influences both your Ethernet MAC and the link partner in that
>>>> the result is either flow control is enabled (before) or it is not
>>>> (with the patch). There must be something amiss if you see packet
>>>> loss or some kind of problem like that with an early exchange such as
>>>> DHCP. Flow control tend to kick in under higher packet rates (at
>>>> least, that's what you expect).
>>>>
>>>>>
>>>>> Right now we know that Mason's patch makes this work, but we do not
>>>>> understand why nor its implications.
>>>>
>>>> You need to understand why, right now, the way this problem is
>>>> presented, you came up with a workaround, not with the root cause or the
>>>> solution. What does your link partner (switch?) reports, that is, what
>>>> is the ethtool output when you have a link up from  your nb8800 adapter?
>>>
>>> Actually, nb8800_pause_config() seems to be doing a complete MAC/DMA
>>> reconfiguration when pause frames get auto-negotiated while the link is
>>> UP,
>> 
>> This is due to a silly hardware limitation.  The register containing the
>> flow control bits can't be written while rx is enabled.
>
> You do a DMA stop, but you don't disable the MAC receiver unlike what
> nb8800_stop() does, why is not calling nb8800_mac_rx() necessary here?

Oh, right.  That's because the RXC_CR register (where the flow control
bits are) can't be modified when the RCR_EN bit (rx dma enable) is set.
The MAC core register controlled by nb8800_mac_rx() doesn't matter here.
There is no way of changing the flow control setting without briefly
stopping rx dma.

None of this should be relevant here though since everything should be
all set up before dma is enabled the first time.

>>> and it does not differentiate being called from
>>> ethtool::set_pauseparam or the PHYLIB adjust_link callback (which it
>>> probably should),
>> 
>> Differentiate how?
>
> Differentiate in that when you are called from adjust_link, why bother
> checking with netif_running() since you are only configuring the pause
> settings when phydev->link is set. Not that this matters much, but
> that's something the caller can tell you.

netif_running() can be true or false independently of the link state.

-- 
Måns Rullgård

^ permalink raw reply

* Re: net: BUG still has locks held in unix_stream_splice_read
From: Hannes Frederic Sowa @ 2016-11-17 22:27 UTC (permalink / raw)
  To: Cong Wang, Al Viro
  Cc: Dmitry Vyukov, David Miller, Eric Dumazet, netdev, LKML,
	syzkaller, Colin Cross, Mandeep Singh Baines
In-Reply-To: <CAM_iQpUEEDE+OcfX66YJDC6dA+b-URHhUtWtv+sn-t5Esk_FWw@mail.gmail.com>

On 17.11.2016 22:44, Cong Wang wrote:
> On Sun, Oct 9, 2016 at 8:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> E.g what will happen if some code does a read on AF_UNIX socket with
>> some local mutex held?  AFAICS, there are exactly two callers of
>> freezable_schedule_timeout() - this one and one in XFS; the latter is
>> in a kernel thread where we do have good warranties about the locking
>> environment, but here it's in the bleeding ->recvmsg/->splice_read and
>> for those assumption that caller doesn't hold any locks is pretty
>> strong, especially since it's not documented anywhere.
>>
>> What's going on there?
> 
> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
> converts schedule_timeout() to its freezable version, it was probably correct
> at that time, but later, commit 2b514574f7e88c8498027ee366
> ("net: af_unix: implement splice for stream af_unix sockets") breaks its
> requirement for a freezable sleep:
> 
>     commit 0f9548ca10916dec166eaf74c816bded7d8e611d
> 
>     lockdep: check that no locks held at freeze time
> 
>     We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
>     deadlock if the lock is later acquired in the suspend or hibernate path
>     (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
>     cgroup_freezer if a lock is held inside a frozen cgroup that is later
>     acquired by a process outside that group.
> 
> So probably we just need to revert commit 2b15af6f95 now.
> 
> I am going to send a revert for at least -net and -stable, since Dmitry
> saw this warning again.

I am not an expert on freezing but this looks around right from the
freezer code. Awesome, thanks a lot for spotting this one!

^ permalink raw reply

* Re: [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"
From: Colin Cross @ 2016-11-17 22:30 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, dvyukov, Tejun Heo, Rafael J. Wysocki
In-Reply-To: <1479420564-15799-1-git-send-email-xiyou.wangcong@gmail.com>

On Thu, Nov 17, 2016 at 2:09 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
> converts schedule_timeout() to its freezable version, it was probably
> correct at that time, but later, commit 2b514574f7e8
> ("net: af_unix: implement splice for stream af_unix sockets") breaks
> the strong requirement for a freezable sleep, according to
> commit 0f9548ca1091:
>
>     We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
>     deadlock if the lock is later acquired in the suspend or hibernate path
>     (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
>     cgroup_freezer if a lock is held inside a frozen cgroup that is later
>     acquired by a process outside that group.
>
> The pipe_lock is still held at that point. So just revert commit 2b15af6f95.

On my phone 77 threads are blocked in unix_stream_recvmsg.  A simple
revert of this patch will cause every one of those threads to wake up
twice per suspend cycle, which can be multiple times a second.  How
about adding a freezable flag to unix_stream_read_state so
unix_stream_recvmsg can stay freezable, and unix_stream_splice_read
can be unfreezable?

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Alexander Duyck @ 2016-11-17 22:39 UTC (permalink / raw)
  To: David Laight
  Cc: Jesper Dangaard Brouer, Eric Dumazet, Rick Jones,
	netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB0222982@AcuExch.aculab.com>

On Thu, Nov 17, 2016 at 9:34 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Jesper Dangaard Brouer
>> Sent: 17 November 2016 14:58
>> On Thu, 17 Nov 2016 06:17:38 -0800
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
>> >
>> > > I can see that qdisc layer does not activate xmit_more in this case.
>> > >
>> >
>> > Sure. Not enough pressure from the sender(s).
>> >
>> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
>> > limit is kept at a small value.
>> >
>> > (BTW not all NIC have expensive doorbells)
>>
>> I believe this NIC mlx5 (50G edition) does.
>>
>> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
>> 2414 cycles(tsc) 603.86 ns
>>
>> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>>
>>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>>    Overhead  Command        Shared Object        Symbol
>>  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
>>    - _raw_spin_lock
>>       + 90.78% __dev_queue_xmit
>>       + 7.83% dev_queue_xmit
>>       + 1.30% ___slab_alloc
>>  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w
>>  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
>>  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup
>>  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
>>  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash
>>  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
>>
>> In this setup the spinlock in __dev_queue_xmit should be uncongested.
>> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
>>
>> But 8.92% of the time is spend on it, which corresponds to a cost of 215
>> cycles (2414*0.0892).  This cost is too high, thus something else is
>> going on... I claim this mysterious extra cost is the tailptr/doorbell.
>
> Try adding code to ring the doorbell twice.
> If this doesn't slow things down then it isn't (likely to be) responsible
> for the delay you are seeing.
>
>         David
>

The problem isn't only the doorbell.  It is doorbell plus a locked
transaction on x86 results in a long wait until the doorbell write has
been completed.

You could batch a bunch of doorbell writes together and it isn't an
issue unless you do something like writel(), wmb(), writel(), wmb(),
then you will see the effect double since the write memory barrier is
what is forcing the delays.

- Alex

^ permalink raw reply

* Re: [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"
From: Cong Wang @ 2016-11-17 22:47 UTC (permalink / raw)
  To: Colin Cross; +Cc: netdev, Dmitry Vyukov, Tejun Heo, Rafael J. Wysocki
In-Reply-To: <CAMbhsRSQ+eVid4OG_xVMDyuZGfOJ2gj6DvHakOH5vbu3ySHF3A@mail.gmail.com>

On Thu, Nov 17, 2016 at 2:30 PM, Colin Cross <ccross@android.com> wrote:
> On Thu, Nov 17, 2016 at 2:09 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
>> converts schedule_timeout() to its freezable version, it was probably
>> correct at that time, but later, commit 2b514574f7e8
>> ("net: af_unix: implement splice for stream af_unix sockets") breaks
>> the strong requirement for a freezable sleep, according to
>> commit 0f9548ca1091:
>>
>>     We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
>>     deadlock if the lock is later acquired in the suspend or hibernate path
>>     (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
>>     cgroup_freezer if a lock is held inside a frozen cgroup that is later
>>     acquired by a process outside that group.
>>
>> The pipe_lock is still held at that point. So just revert commit 2b15af6f95.
>
> On my phone 77 threads are blocked in unix_stream_recvmsg.  A simple
> revert of this patch will cause every one of those threads to wake up
> twice per suspend cycle, which can be multiple times a second.  How
> about adding a freezable flag to unix_stream_read_state so
> unix_stream_recvmsg can stay freezable, and unix_stream_splice_read
> can be unfreezable?

Fair enough, I didn't know it could have such an impact.
I will send v2.

Thanks.

^ permalink raw reply

* [RFC PATCH net-next] virtio_net: Support UDP Tunnel offloads.
From: Jarno Rajahalme @ 2016-11-17 23:01 UTC (permalink / raw)
  To: netdev; +Cc: jarno, james.zhangming, mst, vyasevic, ailan

This patch is a proof-of-concept I did a few months ago for UDP tunnel
offload support in virtio_net interface, and rebased on to the current
net-next.

Real implementation needs to extend the virtio_net header rather than
piggy-backing on existing fields.  Inner MAC length (or inner network
offset) also needs to be passed as a new field.  Control plane (QEMU)
also needs to be updated.

All testing was done using Geneve, but this should work for all UDP
tunnels the same.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 drivers/net/tun.c               |  7 ++++-
 drivers/net/virtio_net.c        | 16 +++++++---
 include/linux/skbuff.h          |  5 ++++
 include/linux/virtio_net.h      | 66 ++++++++++++++++++++++++++++++-----------
 include/uapi/linux/virtio_net.h |  7 +++++
 5 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1588469..36f3219 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -198,7 +198,9 @@ struct tun_struct {
 	struct net_device	*dev;
 	netdev_features_t	set_features;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
-			  NETIF_F_TSO6|NETIF_F_UFO)
+			   NETIF_F_TSO6|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL| \
+			   NETIF_F_GSO_UDP_TUNNEL_CSUM| \
+			   NETIF_F_GSO_TUNNEL_REMCSUM)
 
 	int			align;
 	int			vnet_hdr_sz;
@@ -1877,6 +1879,9 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
 
 		if (arg & TUN_F_UFO) {
 			features |= NETIF_F_UFO;
+#if 1
+			features |= NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM;
+#endif
 			arg &= ~TUN_F_UFO;
 		}
 	}
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ca5239a..eb8d887 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1789,7 +1789,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 		if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
 			dev->hw_features |= NETIF_F_TSO | NETIF_F_UFO
-				| NETIF_F_TSO_ECN | NETIF_F_TSO6;
+				| NETIF_F_TSO_ECN | NETIF_F_TSO6
+				| NETIF_F_GSO_UDP_TUNNEL
+				| NETIF_F_GSO_UDP_TUNNEL_CSUM
+				| NETIF_F_GSO_TUNNEL_REMCSUM;
 		}
 		/* Individual feature bits: what can host handle? */
 		if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_TSO4))
@@ -1798,13 +1801,18 @@ static int virtnet_probe(struct virtio_device *vdev)
 			dev->hw_features |= NETIF_F_TSO6;
 		if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_ECN))
 			dev->hw_features |= NETIF_F_TSO_ECN;
-		if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO))
+		if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO)) {
 			dev->hw_features |= NETIF_F_UFO;
-
+#if 1
+			dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL;
+			dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
+			dev->hw_features |= NETIF_F_GSO_TUNNEL_REMCSUM;
+#endif
+		}
 		dev->features |= NETIF_F_GSO_ROBUST;
 
 		if (gso)
-			dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO);
+			dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM);
 		/* (!csum && gso) case will be fixed by register_netdev() */
 	}
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a4aeeca..992ad30 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2115,6 +2115,11 @@ static inline unsigned char *skb_inner_mac_header(const struct sk_buff *skb)
 	return skb->head + skb->inner_mac_header;
 }
 
+static inline int skb_inner_mac_offset(const struct sk_buff *skb)
+{
+	return skb_inner_mac_header(skb) - skb->data;
+}
+
 static inline void skb_reset_inner_mac_header(struct sk_buff *skb)
 {
 	skb->inner_mac_header = skb->data - skb->head;
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1c912f8..17384d1 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -8,10 +8,19 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 					const struct virtio_net_hdr *hdr,
 					bool little_endian)
 {
-	unsigned short gso_type = 0;
+	u16 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
+
+	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		u16 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
+
+		if (!skb_partial_csum_set(skb, start, off))
+			return -EINVAL;
+	}
 
 	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
-		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		unsigned short gso_type = 0;
+
+		switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_FLAGS) {
 		case VIRTIO_NET_HDR_GSO_TCPV4:
 			gso_type = SKB_GSO_TCPV4;
 			break;
@@ -27,23 +36,28 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
 
 		if (hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
 			gso_type |= SKB_GSO_TCP_ECN;
+		if (hdr->gso_type & VIRTIO_NET_HDR_GSO_UDP_TUNNEL)
+			gso_type |= SKB_GSO_UDP_TUNNEL;
+		if (hdr->gso_type & VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM)
+			gso_type |= SKB_GSO_UDP_TUNNEL_CSUM;
+		if (hdr->gso_type & VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM) {
+			gso_type |= SKB_GSO_TUNNEL_REMCSUM;
+			skb->remcsum_offload = true;
+		}
+		if (gso_type & (SKB_GSO_UDP_TUNNEL | SKB_GSO_UDP_TUNNEL_CSUM)) {
+			u16 hdr_len = __virtio16_to_cpu(little_endian,
+							hdr->hdr_len);
+			skb->encapsulation = 1;
+			skb_set_inner_mac_header(skb, hdr_len);
+			skb_set_inner_network_header(skb, hdr_len + ETH_HLEN);
+			/* XXX: What if start is not set? */
+			skb_set_inner_transport_header(skb, start);
+		}
 
 		if (hdr->gso_size == 0)
 			return -EINVAL;
-	}
-
-	if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-		u16 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
-		u16 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
-
-		if (!skb_partial_csum_set(skb, start, off))
-			return -EINVAL;
-	}
-
-	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
-		u16 gso_size = __virtio16_to_cpu(little_endian, hdr->gso_size);
-
-		skb_shinfo(skb)->gso_size = gso_size;
+		skb_shinfo(skb)->gso_size = __virtio16_to_cpu(little_endian,
+							      hdr->gso_size);
 		skb_shinfo(skb)->gso_type = gso_type;
 
 		/* Header must be checked, and gso_segs computed. */
@@ -64,8 +78,8 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 		struct skb_shared_info *sinfo = skb_shinfo(skb);
 
 		/* This is a hint as to how much should be linear. */
-		hdr->hdr_len = __cpu_to_virtio16(little_endian,
-						 skb_headlen(skb));
+		u16 hdr_len = skb_headlen(skb);
+
 		hdr->gso_size = __cpu_to_virtio16(little_endian,
 						  sinfo->gso_size);
 		if (sinfo->gso_type & SKB_GSO_TCPV4)
@@ -78,6 +92,22 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 			return -EINVAL;
 		if (sinfo->gso_type & SKB_GSO_TCP_ECN)
 			hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+		if (sinfo->gso_type & SKB_GSO_UDP_TUNNEL)
+			hdr->gso_type |= VIRTIO_NET_HDR_GSO_UDP_TUNNEL;
+		if (sinfo->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)
+			hdr->gso_type |= VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM;
+		if (sinfo->gso_type & SKB_GSO_TUNNEL_REMCSUM)
+			hdr->gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM;
+
+		if (sinfo->gso_type &
+		    (SKB_GSO_UDP_TUNNEL | SKB_GSO_UDP_TUNNEL_CSUM))
+			/* For encapsulated packets 'hdr_len' is the offset to
+			 * the beginning of the inner packet.  This way the
+			 * encapsulation can remain ignorant of the size of the
+			 * UDP tunnel header.
+			 */
+			hdr_len = skb_inner_mac_offset(skb);
+		hdr->hdr_len = __cpu_to_virtio16(little_endian, hdr_len);
 	} else
 		hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
 
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..833950b 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -93,7 +93,14 @@ struct virtio_net_hdr_v1 {
 #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
 #define VIRTIO_NET_HDR_GSO_UDP		3	/* GSO frame, IPv4 UDP (UFO) */
 #define VIRTIO_NET_HDR_GSO_TCPV6	4	/* GSO frame, IPv6 TCP */
+#define VIRTIO_NET_HDR_GSO_UDP_TUNNEL	0x10	/* GSO frame, UDP tunnel */
+#define VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM 0x20	/* GSO frame, UDP tnl w CSUM */
+#define VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM 0x40	/* TUNNEL with TSO & REMCSUM */
 #define VIRTIO_NET_HDR_GSO_ECN		0x80	/* TCP has ECN set */
+#define VIRTIO_NET_HDR_GSO_FLAGS (VIRTIO_NET_HDR_GSO_UDP_TUNNEL | \
+				  VIRTIO_NET_HDR_GSO_UDP_TUNNEL_CSUM | \
+				  VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM | \
+				  VIRTIO_NET_HDR_GSO_ECN)
 	__u8 gso_type;
 	__virtio16 hdr_len;	/* Ethernet + IP + tcp/udp hdrs */
 	__virtio16 gso_size;	/* Bytes to append to hdr_len per frame */
-- 
2.1.4

^ permalink raw reply related

* Re: Netperf UDP issue with connected sockets
From: Rick Jones @ 2016-11-17 23:08 UTC (permalink / raw)
  To: Eric Dumazet, Jesper Dangaard Brouer; +Cc: netdev, Saeed Mahameed, Tariq Toukan
In-Reply-To: <1479419042.8455.280.camel@edumazet-glaptop3.roam.corp.google.com>

On 11/17/2016 01:44 PM, Eric Dumazet wrote:
> because netperf sends the same message
> over and over...

Well, sort of, by default.  That can be altered to a degree.

The global -F option should cause netperf to fill the buffers in its 
send ring with data from the specified file.  The number of buffers in 
the send ring can be controlled via the global -W option.  The number of 
elements in the ring will default to one more than the initial SO_SNDBUF 
size divided by the send size.

raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf 
-F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472

...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
open("src/nettest_omni.c", O_RDONLY)    = 5
fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, 
st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, 
st_size=230027, st_atime=2016/11/16-09:49:29, 
st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f3099f62000
read(5, "#ifdef HAVE_CONFIG_H\n#include <c"..., 4096) = 4096
read(5, "_INTEGER *intvl_two_ptr = &intvl"..., 4096) = 4096
read(5, "interval_count = interval_burst;"..., 4096) = 4096
read(5, ";\n\n/* these will control the wid"..., 4096) = 4096
read(5, "\n  LOCAL_SECURITY_ENABLED_NUM,\n "..., 4096) = 4096
read(5, "      &dwBytes,  \n              "..., 4096) = 4096

...

rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0
alarm(1)                                = 0
sendto(4, "#ifdef HAVE_CONFIG_H\n#include <c"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " used\\n\\\n    -m local,remote   S"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " do here but clear the legacy fl"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "e before we scan the test-specif"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472

Of course, it will continue to send the same messages from the send_ring 
over and over instead of putting different data into the buffers each 
time, but if one has a sufficiently large -W option specified...
happy benchmarking,

rick jones

^ permalink raw reply

* Re: Virtio_net support vxlan encapsulation package TSO offload discuss
From: Jarno Rajahalme @ 2016-11-17 23:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhangming (James, Euler), netdev@vger.kernel.org,
	Michael S. Tsirkin, Vlad Yasevic, Amnon Ilan
In-Reply-To: <0438366c-8b0e-270e-cbd1-334c1d655428@redhat.com>

I worked on the same issue a few months back. I rebased my proof-of-concept code to the current net-next and posted an RFC patch a moment ago.

I have zero experience on QEMU feature negotiation or extending the virtio_net spec. Since the virtio_net handling code is now all done using shared code, this should work for macvtap as well, not sure if macvtap needs some control plane changes.

I posted a separate patch to make af_packet also use the shared infra for virtio_net handling yesterday. My RFC patch assumes that af_packet need not be touched, i.e., assumes the af_packet patch is applied, even though the patches apply to net-next in either order.

  Jarno

> On Nov 16, 2016, at 11:27 PM, Jason Wang <jasowang@redhat.com> wrote:
> 
> 
> 
> On 2016年11月17日 09:31, Zhangming (James, Euler) wrote:
>> On 2016年11月15日 11:28, Jason Wang wrote:
>>> On 2016年11月10日 14:19, Zhangming (James, Euler) wrote:
>>>> On 2016年11月09日 15:14, Jason Wang wrote:
>>>>> On 2016年11月08日 19:58, Zhangming (James, Euler) wrote:
>>>>>> On 2016年11月08日 19:17, Jason Wang wrote:
>>>>>> 
>>>>>>> On 2016年11月08日 19:13, Jason Wang wrote:
>>>>>>>> Cc Michael
>>>>>>>> 
>>>>>>>> On 2016年11月08日 16:34, Zhangming (James, Euler) wrote:
>>>>>>>>> In container scenario, OVS is installed in the Virtual machine,
>>>>>>>>> and all the containers connected to the OVS will communicated
>>>>>>>>> through VXLAN encapsulation.
>>>>>>>>> 
>>>>>>>>> By now, virtio_net does not support TSO offload for VXLAN
>>>>>>>>> encapsulated TSO package. In this condition, the performance is
>>>>>>>>> not good, sender is bottleneck
>>>>>>>>> 
>>>>>>>>> I googled this scenario, but I didn’t find any information. Will
>>>>>>>>> virtio_net support VXLAN encapsulation package TSO offload later?
>>>>>>>>> 
>>>>>>>> Yes and for both sender and receiver.
>>>>>>>> 
>>>>>>>>> My idea is virtio_net open encapsulated TSO offload, and
>>>>>>>>> transport encapsulation info to TUN, TUN will parse the info and
>>>>>>>>> build skb with encapsulation info.
>>>>>>>>> 
>>>>>>>>> OVS or kernel on the host should be modified to support this.
>>>>>>>>> Using this method, the TCP performance aremore than 2x as before.
>>>>>>>>> 
>>>>>>>>> Any advice and suggestions for this idea or new idea will be
>>>>>>>>> greatly appreciated!
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> 
>>>>>>>>>      James zhang
>>>>>>>>> 
>>>>>>>> Sounds very good. And we may also need features bits
>>>>>>>> (VIRTIO_NET_F_GUEST|HOST_GSO_X) for this.
>>>>>>>> 
>>>>>>>> This is in fact one of items in networking todo list. (See
>>>>>>>> http://www.linux-kvm.org/page/NetworkingTodo). While at it, we'd
>>>>>>>> better support not only VXLAN but also other tunnels.
>>>>>>> Cc Vlad who is working on extending virtio-net headers.
>>>>>>> 
>>>>>>>> We can start with the spec work, or if you've already had some
>>>>>>>> bits you can post them as RFC for early review.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>> Below is my demo code
>>>>>> Virtio_net.c
>>>>>> static int virtnet_probe(struct virtio_device *vdev), add belows codes:
>>>>>>           if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) ||				// avoid gso segment, it should be negotiation later, because in the demo I reuse num_buffers.
>>>>>>               virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
>>>>>>                   dev->hw_enc_features |= NETIF_F_TSO;
>>>>>>                   dev->hw_enc_features |= NETIF_F_ALL_CSUM;
>>>>>>                   dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL;
>>>>>>                   dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
>>>>>>                   dev->hw_enc_features |=
>>>>>> NETIF_F_GSO_TUNNEL_REMCSUM;
>>>>>> 
>>>>>>                   dev->features |= NETIF_F_GSO_UDP_TUNNEL;
>>>>>>                   dev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
>>>>>>                   dev->features |= NETIF_F_GSO_TUNNEL_REMCSUM;
>>>>>>           }
>>>>>> 
>>>>>> static int xmit_skb(struct send_queue *sq, struct sk_buff *skb), add
>>>>>> below to pieces of codes
>>>>>> 
>>>>>>                   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL)
>>>>>>                           hdr->hdr.gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL;
>>>>>>                   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)
>>>>>>                           hdr->hdr.gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL_CSUM;
>>>>>>                   if (skb_shinfo(skb)->gso_type & SKB_GSO_TUNNEL_REMCSUM)
>>>>>>                           hdr->hdr.gso_type |=
>>>>>> VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM;
>>>>>> 
>>>>>>           if (skb->encapsulation && skb_is_gso(skb)) {
>>>>>>                   inner_mac_len = skb_inner_network_header(skb) - skb_inner_mac_header(skb);
>>>>>>                   tnl_len = skb_inner_mac_header(skb) - skb_mac_header(skb);
>>>>>>                   if ( !(inner_mac_len >> DATA_LEN_SHIFT) && !(tnl_len >> DATA_LEN_SHIFT) ) {
>>>>>>                           hdr->hdr.flags |= VIRTIO_NET_HDR_F_ENCAPSULATION;
>>>>>>                           hdr->num_buffers = (__virtio16)((inner_mac_len << DATA_LEN_SHIFT) | tnl_len);		//we reuse num_buffers for simple , we should add extend member for later.
>>>>>>                   }  else
>>>>>>                           hdr->num_buffers = 0;
>>>>>>           }
>>>>>> 
>>>>>> Tun.c
>>>>>>                   if (memcpy_fromiovecend((void *)&hdr, iv, offset, tun->vnet_hdr_sz))		//read header with negotiation length
>>>>>>                           return -EFAULT;
>>>>>> 
>>>>>>                   if (hdr.gso_type & VIRTIO_NET_HDR_GSO_TUNNEL)					//set tunnel gso info
>>>>>>                           skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
>>>>>>                   if (hdr.gso_type & VIRTIO_NET_HDR_GSO_TUNNEL_CSUM)
>>>>>>                           skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL_CSUM;
>>>>>>                   if (hdr.gso_type & VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM)
>>>>>>                           skb_shinfo(skb)->gso_type |=
>>>>>> SKB_GSO_TUNNEL_REMCSUM;
>>>>>> 
>>>>>>           if (hdr.flags & VIRTIO_NET_HDR_F_ENCAPSULATION) {						//read tunnel info from header and set to built skb.
>>>>>>                   tnl_len = tun16_to_cpu(tun, hdr.num_buffers) & TUN_TNL_LEN_MASK;
>>>>>>                   payload_mac_len = tun16_to_cpu(tun, hdr.num_buffers) >> TUN_DATA_LEN_SHIFT;
>>>>>>                   mac_len = skb_network_header(skb) - skb_mac_header(skb);
>>>>>>                   skb_set_inner_mac_header(skb, tnl_len - mac_len);
>>>>>>                   skb_set_inner_network_header(skb, tnl_len + payload_mac_len - mac_len);
>>>>>>                   skb->encapsulation = 1;
>>>>>>           }
>>>>>> 
>>>>>> 
>>>>> Something like this, and you probably need do something more:
>>>>> 
>>>>> - use net-next.git to generate the patch (for the latest code)
>>>>> - add feature negotiation
>>>>> - tun/macvtap/qemu patches for this, you can start with tun/macvtap
>>>>> patches
>>>>> - support for all other SKB_GSO_* types which is not supported
>>>>> - use a new field instead of num_buffers
>>>>> - a virtio spec patch to describe the support for encapsulation
>>>>> offload
>>>>> 
>>>>> Thanks
>>>> Thank you for your advice, I will start it right now.
>>>> 
>>>> Thanks
>>> Cool, one more question: while at it, I think you may want to add support for dpdk too?
>>> 
>>> Thanks
>> Do you mean that the patch should be compatible with virtio pmd, or give virtio pmd patch?
>> 
>> Thanks
> 
> I mean it's better to prepare patches for both virtio pmd and dpdk.
> 
> Thanks

^ permalink raw reply

* [Patch net v2] af_unix: conditionally use freezable blocking calls in read
From: Cong Wang @ 2016-11-17 23:55 UTC (permalink / raw)
  To: netdev
  Cc: dvyukov, Cong Wang, Tejun Heo, Colin Cross, Rafael J. Wysocki,
	Hannes Frederic Sowa

Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably
correct at that time, but later, commit 2b514574f7e8
("net: af_unix: implement splice for stream af_unix sockets") breaks
the strong requirement for a freezable sleep, according to
commit 0f9548ca1091:

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

The pipe_lock is still held at that point.

So use freezable version only for the recvmsg call path, avoid impact for
Android.

Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Colin Cross <ccross@android.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/unix/af_unix.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5d1c14a..2358f26 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2199,7 +2199,8 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
  *	Sleep until more data has arrived. But check for races..
  */
 static long unix_stream_data_wait(struct sock *sk, long timeo,
-				  struct sk_buff *last, unsigned int last_len)
+				  struct sk_buff *last, unsigned int last_len,
+				  bool freezable)
 {
 	struct sk_buff *tail;
 	DEFINE_WAIT(wait);
@@ -2220,7 +2221,10 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		timeo = freezable_schedule_timeout(timeo);
+		if (freezable)
+			timeo = freezable_schedule_timeout(timeo);
+		else
+			timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))
@@ -2250,7 +2254,8 @@ struct unix_stream_read_state {
 	unsigned int splice_flags;
 };
 
-static int unix_stream_read_generic(struct unix_stream_read_state *state)
+static int unix_stream_read_generic(struct unix_stream_read_state *state,
+				    bool freezable)
 {
 	struct scm_cookie scm;
 	struct socket *sock = state->socket;
@@ -2330,7 +2335,7 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
 			mutex_unlock(&u->iolock);
 
 			timeo = unix_stream_data_wait(sk, timeo, last,
-						      last_len);
+						      last_len, freezable);
 
 			if (signal_pending(current)) {
 				err = sock_intr_errno(timeo);
@@ -2472,7 +2477,7 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
 		.flags = flags
 	};
 
-	return unix_stream_read_generic(&state);
+	return unix_stream_read_generic(&state, true);
 }
 
 static int unix_stream_splice_actor(struct sk_buff *skb,
@@ -2503,7 +2508,7 @@ static ssize_t unix_stream_splice_read(struct socket *sock,  loff_t *ppos,
 	    flags & SPLICE_F_NONBLOCK)
 		state.flags = MSG_DONTWAIT;
 
-	return unix_stream_read_generic(&state);
+	return unix_stream_read_generic(&state, false);
 }
 
 static int unix_shutdown(struct socket *sock, int mode)
-- 
2.1.0

^ permalink raw reply related

* Re: Long delays creating a netns after deleting one (possibly RCU related)
From: Jarno Rajahalme @ 2016-11-18  0:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric W. Biederman, Paul E. McKenney, Cong Wang, Rolf Neugebauer,
	LKML, Linux Kernel Network Developers, Justin Cormack,
	Ian Campbell, Eric Dumazet
In-Reply-To: <1479164967.8455.87.camel@edumazet-glaptop3.roam.corp.google.com>


> On Nov 14, 2016, at 3:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>> 
>>> synchronize_rcu_expidited is not enough if you have multiple network
>>> devices in play.
>>> 
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise add rcu grace period combining by Eric Dumazet.
>>> 
>>> Eric since people are hitting noticable stalls because of the rcu grace
>>> period taking a long time do you think you could look at this code path
>>> a bit more?
>>> 
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet <edumazet@google.com>
>>> Date:   Wed Nov 18 06:31:03 2015 -0800
>> 
>> Absolutely, I will take a loop asap.
> 
> The worst offender should be fixed by the following patch.
> 
> busy poll needs to poll the physical device, not a virtual one...
> 
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
> 		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
> 
> 		__skb_queue_head_init(&cell->napi_skbs);
> +
> +		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
> 		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
> 		napi_enable(&cell->napi);
> 	}
> 

This solved a ~20 second slowdown between OVS datapath unit tests for me.

  Jarno

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Julian Anastasov @ 2016-11-18  0:37 UTC (permalink / raw)
  To: Rick Jones
  Cc: Eric Dumazet, Jesper Dangaard Brouer, netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <d7d209a8-6142-2599-3303-19640d340b97@hpe.com>


	Hello,

On Thu, 17 Nov 2016, Rick Jones wrote:

> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
> src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
> 
> ...
> 
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
> 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0

	connected socket can benefit from dst cached in socket
but not if SO_DONTROUTE is set. If we do not want to send packets
via gateway this -l 1 should help but I don't see IP_TTL setsockopt
in your first example with connect() to 127.0.0.1.

	Also, may be there can be another default, if -l is used to
specify TTL then SO_DONTROUTE should not be set. I.e. we should
avoid SO_DONTROUTE, if possible.

Regards

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Rick Jones @ 2016-11-18  0:42 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Eric Dumazet, Jesper Dangaard Brouer, netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <alpine.LFD.2.11.1611180145580.2135@ja.home.ssi.bg>

On 11/17/2016 04:37 PM, Julian Anastasov wrote:
> On Thu, 17 Nov 2016, Rick Jones wrote:
>
>> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
>> src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
>>
>> ...
>>
>> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
>> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
>> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
>> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
>> 16) = 0
>> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
>
> 	connected socket can benefit from dst cached in socket
> but not if SO_DONTROUTE is set. If we do not want to send packets
> via gateway this -l 1 should help but I don't see IP_TTL setsockopt
> in your first example with connect() to 127.0.0.1.
>
> 	Also, may be there can be another default, if -l is used to
> specify TTL then SO_DONTROUTE should not be set. I.e. we should
> avoid SO_DONTROUTE, if possible.

The global -l option specifies the duration of the test.  It doesn't 
specify the TTL of the IP datagrams being generated by the actions of 
the test.

I resisted setting SO_DONTROUTE for a number of years after the first 
instance of UDP_STREAM being used in link up/down testing took-out a 
company's network (including security camera feeds to galactic HQ) but 
at this point I'm likely to keep it in there because there ended-up 
being a second such incident.  It is set only for UDP_STREAM.  It isn't 
set for UDP_RR or TCP_*.  And for UDP_STREAM it can be overridden by the 
test-specific -R option.

happy benchmarking,

rick jones

^ permalink raw reply

* [iproute2] iproute2: fix the link group name getting error
From: Zhang Shengju @ 2016-11-18  1:12 UTC (permalink / raw)
  To: netdev

In the situation where more than one entry live in the same hash bucket,
loop to get the correct one.

Before:
$ cat /etc/iproute2/group
0	default
256     test

$ sudo ip link set group test dummy1

$ ip link show type dummy
11: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group 0 qlen 1000
    link/ether 4e:3b:d3:6c:f0:e6 brd ff:ff:ff:ff:ff:ff
12: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group test qlen 1000
    link/ether d6:9c:a4:1f:e7:e5 brd ff:ff:ff:ff:ff:ff

After:
$ ip link show type dummy
11: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 4e:3b:d3:6c:f0:e6 brd ff:ff:ff:ff:ff:ff
12: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group test qlen 1000
    link/ether d6:9c:a4:1f:e7:e5 brd ff:ff:ff:ff:ff:ff

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
---
 lib/rt_names.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/rt_names.c b/lib/rt_names.c
index b665d3e..c66cb1e 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -559,8 +559,12 @@ const char *rtnl_group_n2a(int id, char *buf, int len)
 
 	for (i = 0; i < 256; i++) {
 		entry = rtnl_group_hash[i];
-		if (entry && entry->id == id)
-			return entry->name;
+
+		while (entry) {
+			if (entry->id == id)
+				return entry->name;
+			entry = entry->next;
+		}
 	}
 
 	snprintf(buf, len, "%d", id);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
From: Harini Katakam @ 2016-11-18  4:29 UTC (permalink / raw)
  To: Rafal Ozieblo
  Cc: Nicolas Ferre, harini.katakam@xilinx.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <BN3PR07MB2516EBCDD8822FC5F8A9095FC9B10@BN3PR07MB2516.namprd07.prod.outlook.com>

Hi Rafal,

On Thu, Nov 17, 2016 at 7:05 PM, Rafal Ozieblo <rafalo@cadence.com> wrote:
> -----Original Message-----
> From: Nicolas Ferre [mailto:nicolas.ferre@atmel.com]
> Sent: 17 listopada 2016 14:29
> To: Harini Katakam; Rafal Ozieblo
> Cc: harini.katakam@xilinx.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
>
>> Le 17/11/2016 à 13:21, Harini Katakam a écrit :
>> > Hi Rafal,
>> >
>> > On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo <rafalo@cadence.com> wrote:
>> >> Hello,
>> >> I think, there could a bug in your patch.
>> >>
>> >>> +
>> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> >>> +             dmacfg |= GEM_BIT(ADDR64); #endif
>> >>
>> >> You enable 64 bit addressing (64b dma bus width) always when appropriate architecture config option is enabled.
>> >> But there are some legacy controllers which do not support that feature. According Cadence hardware team:
>> >> "64 bit addressing was added in July 2013. Earlier version do not have it.
>> >> This feature was enhanced in release August 2014 to have separate upper address values for transmit and receive."
>> >>
>> >>> /* Bitfields in NSR */
>> >>> @@ -474,6 +479,10 @@
>> >>>  struct macb_dma_desc {
>> >>  >      u32     addr;
>> >>>       u32     ctrl;
>> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> >>> +     u32     addrh;
>> >>> +     u32     resvd;
>> >>> +#endif
>> >>>  };
>> >>
>> >> It will not work for legacy hardware. Old descriptor is 2 words wide, the new one is 4 words wide.
>> >> If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT but hardware doesn't
>> >> support it at all, you will miss every second descriptor.
>> >>
>> >
>> > True, this feature is not available in all of Cadence IP versions.
>> > In fact, the IP version Zynq does not support this. But the one in ZynqMP does.
>> > So, we enable kernel config for 64 bit DMA addressing for this SoC and
>> > hence the driver picks it up. My assumption was that if the legacy IP
>> > does not support
>> > 64 bit addressing, then this DMA option wouldn't be enabled.
>> >
>> > There is a design config register in Cadence IP which is being read to
>> > check for 64 bit address support - DMA mask is set based on that.
>> > But the addition of two descriptor words cannot be based on this runtime check.
>> > For this reason, all the static changes were placed under this check.
>>
>> We have quite a bunch of options in this driver to determinate what is the real capacity of the underlying hardware.
>> If HW configuration registers are not appropriate, and it seems they are not, I would advice to simply use the DT compatibility string.
>>
>> Best regards,
>> --
>> Nicolas Ferre
>
> HW configuration registers are appropriate. The issue is that this code doesn’t use the capability bit to switch between different dma descriptors (2 words vs. 4 words).
> DMA descriptor size is chosen based on kernel configuration, not based on hardware capabilities.

HW configuration register does give appropriate information.
But addition of two address words in the macb descriptor structure is
a static change.

+static inline void macb_set_addr(struct macb_dma_desc *desc, dma_addr_t addr)
+{
+       desc->addr = (u32)addr;
+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+       desc->addrh = (u32)(addr >> 32);
+#endif
+

Even if the #ifdef condition here is changed to HW config check, addr
and addrh are different.
And "addrh" entry has to be present for 64 bit desc case to be handled
separately.
Can you please tell me how you propose change in DMA descriptor structure from
4 to 2 or 2 to 4 words *after* reading the DCFG register?

Regards,
Harini

^ permalink raw reply

* Proposal
From: Teresa Au @ 2016-11-18  4:34 UTC (permalink / raw)




Business Partnership Proposal For You,contact me via my personal E-mail for further 
detail's: ms_teresa_au17@outlook.com

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox