Netdev List

Netdev List
 help / color / mirror / Atom feed

* [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"
From: Cong Wang @ 2016-11-17 22:09 UTC (permalink / raw)
  To: netdev; +Cc: dvyukov, Cong Wang, Tejun Heo, Colin Cross, Rafael J. Wysocki

Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably
correct at that time, but later, commit 2b514574f7e8
("net: af_unix: implement splice for stream af_unix sockets") breaks
the strong requirement for a freezable sleep, according to
commit 0f9548ca1091:

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

The pipe_lock is still held at that point. So just revert commit 2b15af6f95.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Colin Cross <ccross@android.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/unix/af_unix.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5d1c14a..6f37f9e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -116,7 +116,6 @@
 #include <linux/mount.h>
 #include <net/checksum.h>
 #include <linux/security.h>
-#include <linux/freezer.h>
 
 struct hlist_head unix_socket_table[2 * UNIX_HASH_SIZE];
 EXPORT_SYMBOL_GPL(unix_socket_table);
@@ -2220,7 +2219,7 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 
 		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
 		unix_state_unlock(sk);
-		timeo = freezable_schedule_timeout(timeo);
+		timeo = schedule_timeout(timeo);
 		unix_state_lock(sk);
 
 		if (sock_flag(sk, SOCK_DEAD))
-- 
2.1.0

^ permalink raw reply related

* Re: [net-next PATCH v2] net: dummy: Introduce dummy virtual functions
From: Or Gerlitz @ 2016-11-17 22:04 UTC (permalink / raw)
  To: Phil Sutter; +Cc: David Miller, Linux Netdev List, Sabrina Dubroca
In-Reply-To: <20161114130205.18333-1-phil@nwl.cc>

On Mon, Nov 14, 2016 at 3:02 PM, Phil Sutter <phil@nwl.cc> wrote:

> Due to the assumption that all PFs are PCI devices, this implementation
> is not completely straightforward: In order to allow for
> rtnl_fill_ifinfo() to see the dummy VFs, a fake PCI parent device is
> attached to the dummy netdev. This has to happen at the right spot so
> register_netdevice() does not get confused. This patch abuses
> ndo_fix_features callback for that. In ndo_uninit callback, the fake
> parent is removed again for the same purpose.

So you did some mimic-ing of PCI interface, how do you let the user to
config the number of VFs? though a module param? why? if the module
param only serves to say how many VF the device supports, maybe
support the maximum possible by PCI spec and skip the module param?


> +module_param(num_vfs, int, 0);
> +MODULE_PARM_DESC(num_vfs, "Number of dummy VFs per dummy device");
> +

> @@ -190,6 +382,7 @@ static int __init dummy_init_one(void)
>         err = register_netdevice(dev_dummy);
>         if (err < 0)
>                 goto err;
> +
>         return 0;

nit, remove this added blank line..

^ permalink raw reply

* Re: [PATCH net 1/3] net: phy: realtek: add eee advertisement disable options
From: Jerome Brunet @ 2016-11-17 21:48 UTC (permalink / raw)
  To: Anand Moon
  Cc: netdev, devicetree, Florian Fainelli, Alexandre TORGUE,
	Neil Armstrong, Martin Blumenstingl, Kevin Hilman, Linux Kernel,
	Andre Roth, linux-amlogic, Carlo Caione, Giuseppe Cavallaro,
	linux-arm-kernel
In-Reply-To: <CANAwSgTB5=8tVj1FrZF3EgjK5mjUMwLYb90NeWSFDCsRjsuB6A@mail.gmail.com>

On Thu, 2016-11-17 at 23:30 +0530, Anand Moon wrote:
> Hi Jerone,
> 
> > > How about adding callback functionality for .soft_reset to handle
> > > BMCR
> > > where we update the Auto-Negotiation for the phy,
> > > as per the datasheet of the rtl8211f.

I think BMCR is already pretty well handled by the genphy, don't you
think ?

> > 
> > I'm not sure I understand how this would help with our issue (and
> > EEE).
> > Am I missing something or is it something unrelated that you would
> > like
> > to see happening on this driver ?
> > 
> [snip]
> 
> I was just tying other phy module to understand the feature.
> But in order to improve  the throughput I tried to integrate blow u-
> boot commit.
> 
> commit 3d6af748ebd831524cb22a29433e9092af469ec7
> Author: Shengzhou Liu <Shengzhou.Liu@freescale.com>
> Date:   Thu Mar 12 18:54:59 2015 +0800
> 
>     net/phy: Add support for realtek RTL8211F
> 
>     RTL8211F has different registers from RTL8211E.
>     This patch adds support for RTL8211F PHY which
>     can be found on Freescale's T1023 RDB board.
> 
> And added the similar functionality to  .config_aneg    =
> &rtl8211f_config_aneg,
> 

I assume this is the commit you are referring to : 
http://git.denx.de/?p=u-boot.git;a=commit;h=3d6af748ebd831524cb22a29433
e9092af469ec7

I tried looking a this particular commit and the other ones in
realtek.c history of u-boot. I don't really see what it does that linux
is not already doing.

> And I seem to have better results in through put with periodic drop
> but it recovers.
> -----
> odroid@odroid64:~$ iperf3 -c 10.0.0.102 -p 2006 -i 1 -t 100 -V
> iperf 3.0.11
> Linux odroid64 4.9.0-rc5-xc2ml #18 SMP PREEMPT Thu Nov 17 22:56:00 

[...]

> 
> Test Complete. Summary Results:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-100.00 sec  10.5 GBytes   902
> Mbits/sec    4             sender
> [  4]   0.00-100.00 sec  10.5 GBytes   902
> Mbits/sec                  receiver
> CPU Utilization: local/sender 5.6% (0.2%u/5.4%s), remote/receiver
> 17.1% (1.2%u/15.9%s)
> 

That's the kind of throughput we have on the C2 once the link is
reliable (with EEE switch off for GbE)

> Can your confirm this at your end.
> Once confirm I will try to send this as a fix for this issue.
> 

I'm testing the code to disable EEE in a generic way. I'll post the RFC
for it asap.

> -Best Regards
> Anand Moon

^ permalink raw reply

* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: Jerome Brunet @ 2016-11-17 21:47 UTC (permalink / raw)
  To: André Roth
  Cc: Martin Blumenstingl, Johnson Leung, Giuseppe CAVALLARO,
	linux-amlogic, Alexandre Torgue, netdev
In-Reply-To: <20161117194405.4ca7899b@gmail.com>

On Thu, 2016-11-17 at 19:44 +0100, André Roth wrote:
> Hi all,
> 
> > 
> > I checked again the kernel
> > at https://github.com/hardkernel/linux/tree/ odroidc2-3.14.y. The
> > version you mention (3.14.65-73) seems to be:
> > sha1: c75d5f4d1516cdd86d90a9d1c565bb0ed9251036 tag: jenkins-deb
> > s905
> > kernel-73
> 
> I downloaded the prebuilt image from hardkernel, I did not build the
> kernel myself. but hardkernel has an earlier release of the same
> kernel
> version, which works fine too. I assume they would have committed the
> change in the newer version..
>  
> > 
> > In this particular version, both realtek drivers:
> > - drivers/net/phy/realtek.c
> > - drivers/amlogic/ethernet/phy/am_realtek.c
> > 
> > have the hack to disable 1000M advertisement. I don't understand
> > how
> > it possible for you to have 1000Base-T Full Duplex with this, maybe
> > I'm missing something here ?
> 
> that's what I don't understand as well...
> 
> the patched kernel shows the following:
> 
> $ uname -a
> Linux T-06 4.9.0-rc4+ #21 SMP PREEMPT Sun Nov 13 12:07:19 UTC 2016
> 
> $ sudo ethtool eth0
> Settings for eth0:
>         Supported ports: [ TP MII ]
>         Supported link modes:   10baseT/Half 10baseT/Full 
>                                 100baseT/Half 100baseT/Full 
>                                 1000baseT/Full 
>         Supported pause frame use: No
>         Supports auto-negotiation: Yes
>         Advertised link modes:  10baseT/Half 10baseT/Full 
>                                 100baseT/Half 100baseT/Full 
>                                 1000baseT/Full 
>         Advertised pause frame use: No
>         Advertised auto-negotiation: Yes
>         Link partner advertised link modes:  10baseT/Half
> 10baseT/Full 
>                                              100baseT/Half
> 100baseT/Full 1000baseT/Half 1000baseT/Full 
>         Link partner advertised pause frame use: Symmetric Receive-
> only
>         Link partner advertised auto-negotiation: Yes
>         Speed: 1000Mb/s
>         Duplex: Full
>         Port: MII
>         PHYAD: 0
>         Transceiver: external
>         Auto-negotiation: on
>         Supports Wake-on: ug
>         Wake-on: d
>         Current message level: 0x0000003f (63)
>                                drv probe link timer ifdown ifup
>         Link detected: yes
> 
> $ sudo ethtool --show-eee eth0
> EEE Settings for eth0:
>         EEE status: disabled
>         Tx LPI: disabled
>         Supported EEE link modes:  100baseT/Full 
>                                    1000baseT/Full 
>         Advertised EEE link modes:  100baseT/Full 
>         Link partner advertised EEE link modes:  100baseT/Full 
>                                                  1000baseT/Full 
> 
> can it be that "EEE link modes" and the "normal" link modes are two
> different things ? 

Exactly, They are.
Hardkernel code disable both.

With hardkernel's kernel, you should not have 1000baseT/Full in
"Advertised link modes" and you would have nothing reported in
"Advertised EEE link modes"

> 
> > 
> > If you did compile the kernel yourself, could you check the 2 file
> > mentioned above ? Just to be sure there was no patch applied at the
> > last minute, which would not show up in the git history of
> > hardkernel ?
> 
> I cannot check this easily at the moment..
> 
> Regards,
> 
>  André
> 

^ permalink raw reply

* Re: net: BUG still has locks held in unix_stream_splice_read
From: Cong Wang @ 2016-11-17 21:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Dmitry Vyukov, David Miller, Hannes Frederic Sowa, Eric Dumazet,
	netdev, LKML, syzkaller, Colin Cross, Mandeep Singh Baines
In-Reply-To: <20161010031450.GW19539@ZenIV.linux.org.uk>

On Sun, Oct 9, 2016 at 8:14 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> E.g what will happen if some code does a read on AF_UNIX socket with
> some local mutex held?  AFAICS, there are exactly two callers of
> freezable_schedule_timeout() - this one and one in XFS; the latter is
> in a kernel thread where we do have good warranties about the locking
> environment, but here it's in the bleeding ->recvmsg/->splice_read and
> for those assumption that caller doesn't hold any locks is pretty
> strong, especially since it's not documented anywhere.
>
> What's going on there?

Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably correct
at that time, but later, commit 2b514574f7e88c8498027ee366
("net: af_unix: implement splice for stream af_unix sockets") breaks its
requirement for a freezable sleep:

    commit 0f9548ca10916dec166eaf74c816bded7d8e611d

    lockdep: check that no locks held at freeze time

    We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

So probably we just need to revert commit 2b15af6f95 now.

I am going to send a revert for at least -net and -stable, since Dmitry
saw this warning again.

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Eric Dumazet @ 2016-11-17 21:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan
In-Reply-To: <20161117221941.3b525181@redhat.com>

On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:

> 
> Maybe you can share your udp flood "udpsnd" program source?

Very ugly. This is based on what I wrote when tracking the UDP v6
checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
checksum in skb_checksum_help()), because netperf sends the same message
over and over...


Use -d 2   to remove the ip_idents_reserve() overhead.


#define _GNU_SOURCE

#include <errno.h>
#include <error.h>
#include <linux/errqueue.h>
#include <netinet/in.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

char buffer[1400];

int main(int argc, char** argv) {
  int fd, i;
  struct sockaddr_in6 addr;
  char *host = "2002:af6:798::1";
  int family = AF_INET6;
  int discover = -1;

  while ((i = getopt(argc, argv, "4H:d:")) != -1) {
    switch (i) {
    case 'H': host = optarg; break;
    case '4': family = AF_INET; break;
    case 'd': discover = atoi(optarg); break;
    }
  }
  fd = socket(family, SOCK_DGRAM, 0);
  if (fd < 0)
    error(1, errno, "failed to create socket");
  if (discover != -1)
    setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
               &discover, sizeof(discover));

  memset(&addr, 0, sizeof(addr));
  if (family == AF_INET6) {
	  addr.sin6_family = AF_INET6;
	  addr.sin6_port = htons(9);
      inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
  } else {
    struct sockaddr_in *in = (struct sockaddr_in *)&addr;
    in->sin_family = family;
    in->sin_port = htons(9);
      inet_pton(family, host, &in->sin_addr);
  }
  connect(fd, (struct sockaddr *)&addr,
          (family == AF_INET6) ? sizeof(addr) :
                                 sizeof(struct sockaddr_in));
  memset(buffer, 1, 1400);
  for (i = 0; i < 655360000; i++) {
    memcpy(buffer, &i, sizeof(i));
    send(fd, buffer, 100 + rand() % 200, 0);
  }
  return 0;
}

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Jesper Dangaard Brouer @ 2016-11-17 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan, brouer
In-Reply-To: <1479408683.8455.273.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 17 Nov 2016 10:51:23 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> 
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application have something to send. Thus, and possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > socket layer into qdisc layer, should be fairly simple (and rest of
> > xmit_more is already in place).    
> 
> 
> As I said, you are fooled by TX completions.
> 
> Please make sure to increase the sndbuf limits !
> 
> echo 2129920 >/proc/sys/net/core/wmem_default
> 
> lpaa23:~# sar -n DEV 1 10|grep eth1
> 10:49:25         eth1      7.00 9273283.00      0.61 2187214.90      0.00      0.00      0.00
> 10:49:26         eth1      1.00 9230795.00      0.06 2176787.57      0.00      0.00      1.00
> 10:49:27         eth1      2.00 9247906.00      0.17 2180915.45      0.00      0.00      0.00
> 10:49:28         eth1      3.00 9246542.00      0.23 2180790.38      0.00      0.00      1.00
> 10:49:29         eth1      1.00 9239218.00      0.06 2179044.83      0.00      0.00      0.00
> 10:49:30         eth1      3.00 9248775.00      0.23 2181257.84      0.00      0.00      1.00
> 10:49:31         eth1      4.00 9225471.00      0.65 2175772.75      0.00      0.00      0.00
> 10:49:32         eth1      2.00 9253536.00      0.33 2182666.44      0.00      0.00      1.00
> 10:49:33         eth1      1.00 9265900.00      0.06 2185598.40      0.00      0.00      0.00
> 10:49:34         eth1      1.00 6949031.00      0.06 1638889.63      0.00      0.00      1.00
> Average:         eth1      2.50 9018045.70      0.25 2126893.82      0.00      0.00      0.50
> 
> 
> lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
>      xmit_more: 2251366909
>      xmit_more: 2256011392
> 
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

xmit more not happen that frequently for my setup, it does happen
sometimes. And I do monitor with "ethtool -S".

~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2
Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
Ethtool(mlx5p2  ) stat:     92900913 (     92,900,913) <= tx0_bytes /sec
Ethtool(mlx5p2  ) stat:        36073 (         36,073) <= tx0_nop /sec
Ethtool(mlx5p2  ) stat:      1548349 (      1,548,349) <= tx0_packets /sec
Ethtool(mlx5p2  ) stat:            1 (              1) <= tx0_xmit_more /sec
Ethtool(mlx5p2  ) stat:     92884899 (     92,884,899) <= tx_bytes /sec
Ethtool(mlx5p2  ) stat:     99297696 (     99,297,696) <= tx_bytes_phy /sec
Ethtool(mlx5p2  ) stat:      1548082 (      1,548,082) <= tx_csum_partial /sec
Ethtool(mlx5p2  ) stat:      1548082 (      1,548,082) <= tx_packets /sec
Ethtool(mlx5p2  ) stat:      1551527 (      1,551,527) <= tx_packets_phy /sec
Ethtool(mlx5p2  ) stat:     99076658 (     99,076,658) <= tx_prio1_bytes /sec
Ethtool(mlx5p2  ) stat:      1548073 (      1,548,073) <= tx_prio1_packets /sec
Ethtool(mlx5p2  ) stat:     92936078 (     92,936,078) <= tx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat:      1548934 (      1,548,934) <= tx_vport_unicast_packets /sec
Ethtool(mlx5p2  ) stat:            1 (              1) <= tx_xmit_more /sec

(after several attempts I got:)
$ ethtool -S mlx5p2|grep more; sleep 1;ethtool -S mlx5p2|grep more
     tx_xmit_more: 14048
     tx0_xmit_more: 14048
     tx_xmit_more: 14049
     tx0_xmit_more: 14049

This was with:
 $ grep -H . /proc/sys/net/core/wmem_default
 /proc/sys/net/core/wmem_default:2129920

>    PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp],  (all, 48 CPUs)
> ---------------------------------------------------------------------------------------------
> 
>     11.64%  [kernel]  [k] skb_set_owner_w               
>      6.21%  [kernel]  [k] queued_spin_lock_slowpath     
>      4.76%  [kernel]  [k] _raw_spin_lock                
>      4.40%  [kernel]  [k] __ip_make_skb                 
>      3.10%  [kernel]  [k] sock_wfree                    
>      2.87%  [kernel]  [k] ipt_do_table                  
>      2.76%  [kernel]  [k] fq_dequeue                    
>      2.71%  [kernel]  [k] mlx4_en_xmit                  
>      2.50%  [kernel]  [k] __dev_queue_xmit              
>      2.29%  [kernel]  [k] __ip_append_data.isra.40      
>      2.28%  [kernel]  [k] udp_sendmsg                   
>      2.01%  [kernel]  [k] __alloc_skb                   
>      1.90%  [kernel]  [k] napi_consume_skb              
>      1.63%  [kernel]  [k] udp_send_skb                  
>      1.62%  [kernel]  [k] skb_release_data              
>      1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath     
>      1.56%  [kernel]  [k] dev_hard_start_xmit           
>      1.55%  udpsnd    [.] __libc_send                   
>      1.48%  [kernel]  [k] netif_skb_features            
>      1.42%  [kernel]  [k] __qdisc_run                   
>      1.35%  [kernel]  [k] sk_dst_check                  
>      1.33%  [kernel]  [k] sock_def_write_space          
>      1.30%  [kernel]  [k] kmem_cache_alloc_node_trace   
>      1.29%  [kernel]  [k] __local_bh_enable_ip          
>      1.21%  [kernel]  [k] copy_user_enhanced_fast_string
>      1.08%  [kernel]  [k] __kmalloc_reserve.isra.40     
>      1.08%  [kernel]  [k] SYSC_sendto                   
>      1.07%  [kernel]  [k] kmem_cache_alloc_node         
>      0.95%  [kernel]  [k] ip_finish_output2             
>      0.95%  [kernel]  [k] ktime_get                     
>      0.91%  [kernel]  [k] validate_xmit_skb             
>      0.88%  [kernel]  [k] sock_alloc_send_pskb          
>      0.82%  [kernel]  [k] sock_sendmsg                  

I'm more interested in why I see fib_table_lookup() and
__ip_route_output_key_hash() when you don't ?!?  There must be some
mistake in my setup!

Maybe you can share your udp flood "udpsnd" program source?

Maybe I'm missing some important sysctl /proc/net/sys/ ?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

p.s. I placed my testing software here:
 https://github.com/netoptimizer/network-testing/tree/master/src

^ permalink raw reply

* [PATCH] net: fec: Detect and recover receive queue hangs
From: Chris Lesiak @ 2016-11-17 21:14 UTC (permalink / raw)
  To: Fugang Duan; +Cc: netdev, linux-kernel, Jaccon Bastiaansen, chris.lesiak

This corrects a problem that appears to be similar to ERR006358.  But
while ERR006358 is a race when the tx queue transitions from empty to
not empty, this problem is a race when the rx queue transitions from
full to not full.

The symptom is a receive queue that is stuck.  The ENET_RDAR register
will read 0, indicating that there are no empty receive descriptors in
the receive ring.  Since no additional frames can be queued, no RXF
interrupts occur.

This problem can be triggered with a 1 Gb link and about 400 Mbps of
traffic.

This patch detects this condition, sets the work_rx bit, and
reschedules the poll method.

Signed-off-by: Chris Lesiak <chris.lesiak@licor.com>
---
 drivers/net/ethernet/freescale/fec_main.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index fea0f33..8a87037 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1588,6 +1588,34 @@ fec_enet_interrupt(int irq, void *dev_id)
 	return ret;
 }
 
+static inline bool
+fec_enet_recover_rxq(struct fec_enet_private *fep, u16 queue_id)
+{
+	int work_bit = (queue_id == 0) ? 2 : ((queue_id == 1) ? 0 : 1);
+
+	if (readl(fep->rx_queue[queue_id]->bd.reg_desc_active))
+		return false;
+
+	dev_notice_once(&fep->pdev->dev, "Recovered rx queue\n");
+
+	fep->work_rx |= 1 << work_bit;
+
+	return true;
+}
+
+static inline bool fec_enet_recover_rxqs(struct fec_enet_private *fep)
+{
+	unsigned int q;
+	bool ret = false;
+
+	for (q = 0; q < fep->num_rx_queues; q++) {
+		if (fec_enet_recover_rxq(fep, q))
+			ret = true;
+	}
+
+	return ret;
+}
+
 static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 {
 	struct net_device *ndev = napi->dev;
@@ -1601,6 +1629,9 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 	if (pkts < budget) {
 		napi_complete(napi);
 		writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+
+		if (fec_enet_recover_rxqs(fep) && napi_reschedule(napi))
+			writel(FEC_NAPI_IMASK, fep->hwp + FEC_IMASK);
 	}
 	return pkts;
 }
-- 
2.5.5

^ permalink raw reply related

* Re: [PATCH net 1/1] net sched filters: pass netlink message flags in event notification
From: Cong Wang @ 2016-11-17 21:02 UTC (permalink / raw)
  To: Roman Mashak
  Cc: David Miller, Linux Kernel Network Developers, Jamal Hadi Salim
In-Reply-To: <1479334570-25159-1-git-send-email-mrv@mojatatu.com>

On Wed, Nov 16, 2016 at 2:16 PM, Roman Mashak <mrv@mojatatu.com> wrote:
> Userland client should be able to read an event, and reflect it back to
> the kernel, therefore it needs to extract complete set of netlink flags.
>
> For example, this will allow "tc monitor" to distinguish Add and Replace
> operations.
>
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> ---
>  net/sched/cls_api.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2b2a797..8e93d4a 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -112,7 +112,7 @@ static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb,
>
>         for (it_chain = chain; (tp = rtnl_dereference(*it_chain)) != NULL;
>              it_chain = &tp->next)
> -               tfilter_notify(net, oskb, n, tp, 0, event, false);
> +               tfilter_notify(net, oskb, n, tp, n->nlmsg_flags, event, false);


I must miss something, why does it make sense to pass n->nlmsg_flags
as 'fh' to tfilter_notify()??


>  }
>
>  /* Select new prio value from the range, managed by kernel. */
> @@ -430,7 +430,8 @@ static int tfilter_notify(struct net *net, struct sk_buff *oskb,
>         if (!skb)
>                 return -ENOBUFS;
>
> -       if (tcf_fill_node(net, skb, tp, fh, portid, n->nlmsg_seq, 0, event) <= 0) {
> +       if (tcf_fill_node(net, skb, tp, fh, portid, n->nlmsg_seq,
> +                         n->nlmsg_flags, event) <= 0) {


This part makes sense.

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] bpf, mlx5: fix various refcount issues in mlx5e_xdp_set
From: Daniel Borkmann @ 2016-11-17 20:03 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, Alexei Starovoitov, Brenden Blanco, Zhiyi Sun,
	Rana Shahout, Saeed Mahameed, Linux Netdev List
In-Reply-To: <CALzJLG-ZQxF0t1N5ry2OUqcBjrs+7XzuWb_dRDOjEnRf1975Hg@mail.gmail.com>

On 11/17/2016 10:48 AM, Saeed Mahameed wrote:
> On Wed, Nov 16, 2016 at 5:26 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 11/16/2016 03:30 PM, Saeed Mahameed wrote:
>>> On Wed, Nov 16, 2016 at 3:54 PM, Daniel Borkmann <daniel@iogearbox.net>
>>> wrote:
>>>> On 11/16/2016 01:25 PM, Saeed Mahameed wrote:
>>>>> On Wed, Nov 16, 2016 at 2:04 AM, Daniel Borkmann <daniel@iogearbox.net>
>>>>> wrote:
>>>>>>
>>>>>> There are multiple issues in mlx5e_xdp_set():
>>>>>>
>>>>>> 1) The batched bpf_prog_add() is currently not checked for errors! When
>>>>>>       doing so, it should be done at an earlier point in time to makes
>>>>>> sure
>>>>>>       that we cannot fail anymore at the time we want to set the program
>>>>>> for
>>>>>>       each channel. This only means that we have to undo the
>>>>>> bpf_prog_add()
>>>>>>       in case we return due to required reset.
>>>>>>
>>>>>> 2) When swapping the priv->xdp_prog, then no extra reference count must
>>>>>> be
>>>>>>       taken since we got that from call path via dev_change_xdp_fd()
>>>>>> already.
>>>>>>       Otherwise, we'd never be able to free the program. Also,
>>>>>> bpf_prog_add()
>>>>>>       without checking the return code could fail.
>>>>>>
>>>>>> Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs
>>>>>> support")
>>>>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>>> ---
>>>>>>     drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 25
>>>>>> ++++++++++++++++++-----
>>>>>>     1 file changed, 20 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>>>> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>>>> index ab0c336..cf26672 100644
>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>>>> @@ -3142,6 +3142,17 @@ static int mlx5e_xdp_set(struct net_device
>>>>>> *netdev, struct bpf_prog *prog)
>>>>>>                    goto unlock;
>>>>>>            }
>>>>>>
>>>>>> +       if (prog) {
>>>>>> +               /* num_channels is invariant here, so we can take the
>>>>>> +                * batched reference right upfront.
>>>>>> +                */
>>>>>> +               prog = bpf_prog_add(prog, priv->params.num_channels);
>>>>>> +               if (IS_ERR(prog)) {
>>>>>> +                       err = PTR_ERR(prog);
>>>>>> +                       goto unlock;
>>>>>> +               }
>>>>>> +       }
>>>>>> +
>>>>>
>>>>> With this approach you will end up taking a ref count twice per RQ! on
>>>>> the first time xdp_set is called i.e (old_prog = NULL, prog != NULL).
>>>>> One ref will be taken per RQ/Channel from the above code, and since
>>>>> reset will be TRUE mlx5e_open_locked will be called and another ref
>>>>> count will be taken on mlx5e_create_rq.
>>>>>
>>>>> The issue here is that we have two places for ref count accounting,
>>>>> xdp_set and  mlx5e_create_rq. Having ref-count updates in
>>>>> mlx5e_create_rq is essential for num_channels configuration changes
>>>>> (mlx5e_set_ringparam).
>>>>>
>>>>> Our previous approach made sure that only one path will do the ref
>>>>> counting (mlx5e_open_locked vs. mlx5e_xdp_set batch ref update when
>>>>> reset == false).
>>>>
>>>> That is correct, for a short time bpf_prog_add() was charged also when
>>>> we reset. When we reset, we will then jump to unlock_put and safely undo
>>>> this since we took ref from mlx5e_create_rq() already in that case, and
>>>> return successfully. That was intentional, so that eventually we end up
>>>> just taking a /single/ ref per RQ when we exit mlx5e_xdp_set(), but more
>>>> below ...
>>>>
>>>> [...]
>>>>>
>>>>> 2. Keep the current approach and make sure to not update the ref count
>>>>> twice, you can batch update only if (!reset && was_open) otherwise you
>>>>> can rely on mlx5e_open_locked to take the num_channels ref count for
>>>>> you.
>>>>>
>>>>> Personally I prefer option 2, since we will keep the current logic
>>>>> which will allow configuration updates such as (mlx5e_set_ringparam)
>>>>> to not worry about ref counting since it will be done in the reset
>>>>> flow.
>>>>
>>>> ... agree on keeping current approach. I actually like the idea, so we'd
>>>> end up with this simpler version for the batched ref then.
>>>
>>> Great :).
>>>
>>> So let's do the batched update only if we are not going to reset (we
>>> already know that in advance), instead of the current patch where you
>>> batch update unconditionally and then
>>>    unlock_put in case reset was performed (which is just redundant and
>>> confusing).
>>>
>>> Please note that if open_locked fails you need to goto unlock_put.
>>
>> Sorry, I'm not quite sure I follow you here; are you /now/ commenting on
>> the original patch or on the already updated diff I did from my very last
>> email, that is, http://patchwork.ozlabs.org/patch/695564/ ?
>
> I am commenting on this patch  "V2 2/4".
> this patch will take batched ref count unconditionally and release the
> extra refs "unlock_put" if reset was performed.
> I am asking to take the batched ref only if we are not going to reset
> in advance to save the extra "unlock_put"

Okay, sure, it's clear and we discussed about that already earlier, and
my response back then was the diff in the /end/ of this email here:

   https://www.spinics.net/lists/netdev/msg404709.html

Hence my confusion why we were back on the original patch after that. ;)
Anyway, I'll get the updated set with the suggestion out tomorrow, thanks
for the review!

>>>> Note, your "bpf_prog_add(prog, 1) // one ref for the device." is not
>>>> needed
>>>> since this we already got that one through dev_change_xdp_fd() as
>>>> mentioned.
>>>
>>> If it is not needed then why we need bpf_prog_put on mlx5e_nic_cleanup
>>> in your next patch? this doesn't look symmetric (right) !
>>> you shouldn't release a reference you didn't take.
>>> Overall with this series the driver can take num_channels refs and
>>> will release num_channels refs on mlx5e_close. we shouldn't release
>>> one extra ref on NIC cleanup.
>>
>> I already explained this in the commit description; when dev_change_xdp_fd()
>> is called and sees a valid fd, it does bpf_prog_get_type(), which calls the
>> __bpf_prog_get() taking a ref on the program (bpf_prog_inc()). That is then
>> passed down to ops->ndo_xdp(). Only in case of error from the ->ndo_xdp()
>> callback, we bpf_prog_put() this reference from dev_change_xdp_fd() side.
>>
>> For drivers that implement against this ndo, it means that you need N-1 refs
>> in addition. Have a look at the other two in-tree users, which do it
>> correctly,
>> that is mlx4_xdp_set() and nfp_net_xdp_setup().
>>
>> It's documented here (include/linux/netdevice.h) with ndo_xdp referring to
>> it:
>>
>> /* These structures hold the attributes of xdp state that are being passed
>>   * to the netdevice through the xdp op.
>>   */
>> enum xdp_netdev_command {
>>          /* Set or clear a bpf program used in the earliest stages of packet
>>           * rx. The prog will have been loaded as BPF_PROG_TYPE_XDP. The
>> callee
>>           * is responsible for calling bpf_prog_put on any old progs that are
>>           * stored. In case of error, the callee need not release the new
>> prog
>>           * reference, but on success it takes ownership and must
>> bpf_prog_put
>>           * when it is no longer used.
>>           */
>>          XDP_SETUP_PROG,
>> [...]
>> };
>>
>> I think reason why this is rather confusing would be that initially, it was
>> just a single prog for all queues, but it was requested along the way to
>> move
>> prog pointer down to queues instead and let them have their own ref, so that
>> some time in future individual progs from a subset of the queues can be
>> exchanged.
>>
>> Since the way xdp in mlx5 is implemented, you currently have the
>> priv->xdp_prog
>> as the control prog. That's okay right now, but requires to drop the last
>> ref
>> on priv->xdp_prog via bpf_prog_put() when netdev is dismantled and
>> priv->xdp_prog
>> still present.
>
> Got it, but i would rather make the stack cleanup those refs once it
> receives an unregester_netdev event instead of counting on netdev to
> do it, as i said before, like any other resource added to the netdev
> by the stack (vlan/mac/etc..).
>
> but this is a general discussion of xdp API, it won't block this series :).
>
> Saeed.

^ permalink raw reply

* [PATCH net-next v4 5/5] net: dsa: mv88e6xxx: Select IRQ_DOMAIN
From: Florian Fainelli @ 2016-11-17 19:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, mw, arnd, gregory.clement, Shaohui.Xie, andrew,
	Florian Fainelli
In-Reply-To: <20161117191914.11077-1-f.fainelli@gmail.com>

Some architectures may not define IRQ_DOMAIN (like m32r), fixes
undefined references to IRQ_DOMAIN functions.

Fixes: dc30c35be720 ("net: dsa: mv88e6xxx: Implement interrupt support.")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/dsa/mv88e6xxx/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dsa/mv88e6xxx/Kconfig b/drivers/net/dsa/mv88e6xxx/Kconfig
index 486668813e15..1aaa7a95ebc4 100644
--- a/drivers/net/dsa/mv88e6xxx/Kconfig
+++ b/drivers/net/dsa/mv88e6xxx/Kconfig
@@ -1,6 +1,7 @@
 config NET_DSA_MV88E6XXX
 	tristate "Marvell 88E6xxx Ethernet switch fabric support"
 	depends on NET_DSA
+	select IRQ_DOMAIN
 	select NET_DSA_TAG_EDSA
 	select NET_DSA_TAG_DSA
 	help
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v4 4/5] net: marvell: Allow drivers to be built with COMPILE_TEST
From: Florian Fainelli @ 2016-11-17 19:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, mw, arnd, gregory.clement, Shaohui.Xie, andrew,
	Florian Fainelli
In-Reply-To: <20161117191914.11077-1-f.fainelli@gmail.com>

All Marvell Ethernet drivers actually build fine with COMPILE_TEST with
a few warnings. We need to add a few HAS_DMA dependencies to fix linking
failures on problematic architectures like m32r.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/marvell/Kconfig | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
index 2664827ddecd..d74d4e6f0b34 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -5,7 +5,7 @@
 config NET_VENDOR_MARVELL
 	bool "Marvell devices"
 	default y
-	depends on PCI || CPU_PXA168 || MV64X60 || PPC32 || PLAT_ORION || INET
+	depends on PCI || CPU_PXA168 || MV64X60 || PPC32 || PLAT_ORION || INET || COMPILE_TEST
 	---help---
 	  If you have a network (Ethernet) card belonging to this class, say Y.
 
@@ -18,7 +18,8 @@ if NET_VENDOR_MARVELL
 
 config MV643XX_ETH
 	tristate "Marvell Discovery (643XX) and Orion ethernet support"
-	depends on (MV64X60 || PPC32 || PLAT_ORION) && INET
+	depends on (MV64X60 || PPC32 || PLAT_ORION || COMPILE_TEST) && INET
+	depends on HAS_DMA
 	select PHYLIB
 	select MVMDIO
 	---help---
@@ -55,7 +56,8 @@ config MVNETA_BM_ENABLE
 
 config MVNETA
 	tristate "Marvell Armada 370/38x/XP network interface support"
-	depends on PLAT_ORION
+	depends on PLAT_ORION || COMPILE_TEST
+	depends on HAS_DMA
 	select MVMDIO
 	select FIXED_PHY
 	---help---
@@ -77,7 +79,8 @@ config MVNETA_BM
 
 config MVPP2
 	tristate "Marvell Armada 375 network interface support"
-	depends on MACH_ARMADA_375
+	depends on MACH_ARMADA_375 || COMPILE_TEST
+	depends on HAS_DMA
 	select MVMDIO
 	---help---
 	  This driver supports the network interface units in the
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v4 3/5] bus: mvebu-bus: Provide inline stub for mvebu_mbus_get_dram_win_info
From: Florian Fainelli @ 2016-11-17 19:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, mw, arnd, gregory.clement, Shaohui.Xie, andrew,
	Florian Fainelli
In-Reply-To: <20161117191914.11077-1-f.fainelli@gmail.com>

In preparation for allowing CONFIG_MVNETA_BM to build with COMPILE_TEST,
provide an inline stub for mvebu_mbus_get_dram_win_info().

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/linux/mbus.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/mbus.h b/include/linux/mbus.h
index 2931aa43dab1..0d3f14fd2621 100644
--- a/include/linux/mbus.h
+++ b/include/linux/mbus.h
@@ -82,6 +82,7 @@ static inline int mvebu_mbus_get_io_win_info(phys_addr_t phyaddr, u32 *size,
 }
 #endif
 
+#ifdef CONFIG_MVEBU_MBUS
 int mvebu_mbus_save_cpu_target(u32 __iomem *store_addr);
 void mvebu_mbus_get_pcie_mem_aperture(struct resource *res);
 void mvebu_mbus_get_pcie_io_aperture(struct resource *res);
@@ -97,5 +98,12 @@ int mvebu_mbus_init(const char *soc, phys_addr_t mbus_phys_base,
 		    size_t mbus_size, phys_addr_t sdram_phys_base,
 		    size_t sdram_size);
 int mvebu_mbus_dt_init(bool is_coherent);
+#else
+static inline int mvebu_mbus_get_dram_win_info(phys_addr_t phyaddr, u8 *target,
+					       u8 *attr)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_MVEBU_MBUS */
 
 #endif /* __LINUX_MBUS_H */
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v4 2/5] net: fsl: Allow most drivers to be built with COMPILE_TEST
From: Florian Fainelli @ 2016-11-17 19:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, mw, arnd, gregory.clement, Shaohui.Xie, andrew,
	Florian Fainelli
In-Reply-To: <20161117191914.11077-1-f.fainelli@gmail.com>

There are only a handful of Freescale Ethernet drivers that don't
actually build with COMPILE_TEST:

* FEC, for which we would need to define a default register layout if no
  supported architecture is defined

* UCC_GETH which depends on PowerPC cpm.h header (which could be moved
  to a generic location)

* GIANFAR needs to depend on HAS_DMA to fix linking failures on some
  architectures (like m32r)

We need to fix an unmet dependency to get there though:
warning: (FSL_XGMAC_MDIO) selects OF_MDIO which has unmet direct
dependencies (OF && PHYLIB)

which would result in CONFIG_OF_MDIO=[ym] without CONFIG_OF to be set.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/freescale/Kconfig | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/Kconfig b/drivers/net/ethernet/freescale/Kconfig
index aa3f615886b4..0d415516b577 100644
--- a/drivers/net/ethernet/freescale/Kconfig
+++ b/drivers/net/ethernet/freescale/Kconfig
@@ -8,7 +8,7 @@ config NET_VENDOR_FREESCALE
 	depends on FSL_SOC || QUICC_ENGINE || CPM1 || CPM2 || PPC_MPC512x || \
 		   M523x || M527x || M5272 || M528x || M520x || M532x || \
 		   ARCH_MXC || ARCH_MXS || (PPC_MPC52xx && PPC_BESTCOMM) || \
-		   ARCH_LAYERSCAPE
+		   ARCH_LAYERSCAPE || COMPILE_TEST
 	---help---
 	  If you have a network (Ethernet) card belonging to this class, say Y.
 
@@ -65,6 +65,7 @@ config FSL_PQ_MDIO
 config FSL_XGMAC_MDIO
 	tristate "Freescale XGMAC MDIO"
 	select PHYLIB
+	depends on OF
 	select OF_MDIO
 	---help---
 	  This driver supports the MDIO bus on the Fman 10G Ethernet MACs, and
@@ -85,6 +86,7 @@ config UGETH_TX_ON_DEMAND
 
 config GIANFAR
 	tristate "Gianfar Ethernet"
+	depends on HAS_DMA
 	select FSL_PQ_MDIO
 	select PHYLIB
 	select CRC32
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v4 0/5] net: Enable COMPILE_TEST for Marvell & Freescale drivers
From: Florian Fainelli @ 2016-11-17 19:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, mw, arnd, gregory.clement, Shaohui.Xie, andrew,
	Florian Fainelli

Hi all,

This patch series allows building the Freescale and Marvell Ethernet network
drivers with COMPILE_TEST.

Changes in v4:

- add proper HAS_DMA to fix build errors on m32r
- provide an inline stub for mvebu_mbus_get_dram_win_info
- added an additional patch to fix build errors with mv88e6xxx on m32r

Changes in v3:

- reorder patches to avoid introducing a build warning between commits

Changes in v2:

- rename register define clash when building for i386 (spotted by LKP)


Florian Fainelli (5):
  net: gianfar_ptp: Rename FS bit to FIPERST
  net: fsl: Allow most drivers to be built with COMPILE_TEST
  bus: mvebu-bus: Provide inline stub for mvebu_mbus_get_dram_win_info
  net: marvell: Allow drivers to be built with COMPILE_TEST
  net: dsa: mv88e6xxx: Select IRQ_DOMAIN

 drivers/net/dsa/mv88e6xxx/Kconfig            |  1 +
 drivers/net/ethernet/freescale/Kconfig       |  4 +++-
 drivers/net/ethernet/freescale/gianfar_ptp.c |  4 ++--
 drivers/net/ethernet/marvell/Kconfig         | 11 +++++++----
 include/linux/mbus.h                         |  8 ++++++++
 5 files changed, 21 insertions(+), 7 deletions(-)

-- 
2.9.3

^ permalink raw reply

* [PATCH net-next v4 1/5] net: gianfar_ptp: Rename FS bit to FIPERST
From: Florian Fainelli @ 2016-11-17 19:19 UTC (permalink / raw)
  To: netdev
  Cc: davem, mw, arnd, gregory.clement, Shaohui.Xie, andrew,
	Florian Fainelli
In-Reply-To: <20161117191914.11077-1-f.fainelli@gmail.com>

FS is a global symbol used by the x86 32-bit architecture, fixes builds
re-definitions:

>> drivers/net/ethernet/freescale/gianfar_ptp.c:75:0: warning: "FS"
>> redefined
    #define FS                    (1<<28) /* FIPER start indication */

   In file included from arch/x86/include/uapi/asm/ptrace.h:5:0,
                    from arch/x86/include/asm/ptrace.h:6,
                    from arch/x86/include/asm/math_emu.h:4,
                    from arch/x86/include/asm/processor.h:11,
                    from include/linux/mutex.h:19,
                    from include/linux/kernfs.h:13,
                    from include/linux/sysfs.h:15,
                    from include/linux/kobject.h:21,
                    from include/linux/device.h:17,
                    from
drivers/net/ethernet/freescale/gianfar_ptp.c:23:
   arch/x86/include/uapi/asm/ptrace-abi.h:15:0: note: this is the
location of the previous definition
    #define FS 9

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/freescale/gianfar_ptp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar_ptp.c b/drivers/net/ethernet/freescale/gianfar_ptp.c
index 57798814160d..3e8d1fffe34e 100644
--- a/drivers/net/ethernet/freescale/gianfar_ptp.c
+++ b/drivers/net/ethernet/freescale/gianfar_ptp.c
@@ -72,7 +72,7 @@ struct gianfar_ptp_registers {
 /* Bit definitions for the TMR_CTRL register */
 #define ALM1P                 (1<<31) /* Alarm1 output polarity */
 #define ALM2P                 (1<<30) /* Alarm2 output polarity */
-#define FS                    (1<<28) /* FIPER start indication */
+#define FIPERST               (1<<28) /* FIPER start indication */
 #define PP1L                  (1<<27) /* Fiper1 pulse loopback mode enabled. */
 #define PP2L                  (1<<26) /* Fiper2 pulse loopback mode enabled. */
 #define TCLK_PERIOD_SHIFT     (16) /* 1588 timer reference clock period. */
@@ -502,7 +502,7 @@ static int gianfar_ptp_probe(struct platform_device *dev)
 	gfar_write(&etsects->regs->tmr_fiper1, etsects->tmr_fiper1);
 	gfar_write(&etsects->regs->tmr_fiper2, etsects->tmr_fiper2);
 	set_alarm(etsects);
-	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl|FS|RTPE|TE|FRD);
+	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl|FIPERST|RTPE|TE|FRD);
 
 	spin_unlock_irqrestore(&etsects->lock, flags);
 
-- 
2.9.3

^ permalink raw reply related

* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Hannes Frederic Sowa @ 2016-11-17 19:05 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji,
	kaber
In-Reply-To: <20161117183240.fkzwqjzp2z2x7r3z@splinter.mtl.com>

On 17.11.2016 19:32, Ido Schimmel wrote:
> On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
>>>> The other way is the journal idea I had, which uses an rb-tree with
>>>> timestamps as keys (can be lamport timestamps). You insert into the tree
>>>> until the dump is finished and use it as queue later to shuffle stuff
>>>> into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entires?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
> 
> The delayed work queue only postpones the insert into the listener's
> table, but the entries will be present in the kernel's table as usual.
> Therefore, other lookup made on the kernel's table will see the pending
> entries.
> 
> Note that both listeners that currently call the dump API do that during
> their init, before any lookups can be made on their tables. If you think
> it's critical, then we can make sure the workqueues are flushed before
> the listeners register their netdevs and effectively expose their tables
> to lookups.

I do see routing anyway as a best-effort. ;)

That means in not too long time the hardware needs to be fully
synchronized to the software path and either have all correct entries or
abort must have been called.

> I'm looking into the reordering issues you mentioned. I belive it's a
> valid point.

Thanks!

>> For user space queries we would still query the in-kernel table.
> 
> Right. No changes here.
> 
>>> If, once inserted into the pending queue, you return success to the
>>> inserting entity, then you must make those pending entries visible
>>> to lookups.
>>
>> I think this same problem is in this patch set already. The way I
>> understood it, is, that if a problem during insert emerges, the driver
>> calls abort and every packet will be send to user space, as the routing
>> table cannot be offloaded and it won't try it again, Ido?
> 
> First of all, this abort mechanism is already in place and isn't part of
> this patchset. Secondly, why is this a problem? The all-or-nothing
> approach is an hard requirement and current implementation is infinitely
> better than previous approach in which the kernel's tables were flushed
> upon route insertion failure. It rendered the system unusable. Current
> implementation of abort mechanism keeps the system usable, but with
> reduced performance.

Yes, I argued below that I am toally fine with this approach.

>> Probably this is an artifact of the mellanox implementation and we can
>> implement this differently for other cards, but the schema to abort all
>> if the modification doesn't work, seems to be fundamental (I think we
>> require the all-or-nothing approach for now).
> 
> Yes, it's an hard requirement for all implementations. mlxsw implements
> it by evicting all routes from its tables and inserting a default route
> that traps packets to CPU.

Correct, I don't see how fib offloading can do something else, besides
semantically looking at the routing table and figure out where to insert
"exceptions" for subtrees but keep most packets flowing over the hw
directly. But this for another time... ;)

>>> If you block the inserting entity, well that doesn't make a lot of
>>> sense.  If blocking is a workable solution, then we can just block the
>>> entire insert while this FIB dump to the device driver is happening.
>>
>> I don't think we should really block (as in kernel-block) at any time.
>>
>> I was suggesting something like:
>>
>> rtnl_lock();
>> synchronize_rcu_expedited(); // barrier, all rounting modifications are
>> stable and no new can be added due to rtnl_lock
>> register notifier(); // notifier adds entries also into journal();
>> rtnl_unlock();
>> walk_fib_tree_rcu_into_journal();
>> // walk finished
>> start_syncing_journal_to_hw(); // if new entries show up we sync them
>> asap after this point
>>
>> The journal would need a spin lock to protect its internal state and
>> order events correctly.
>>
>>> But I am pretty sure the idea of blocking modifications for so long
>>> was considered undesirable.
>>
>> Yes, this is also still my opinion.
> 
> It was indeed rejected :)
> https://marc.info/?l=linux-netdev&m=147794848224884&w=2

Bye,
Hannes

^ permalink raw reply

* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Hannes Frederic Sowa @ 2016-11-17 19:03 UTC (permalink / raw)
  To: David Miller
  Cc: idosch, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20161117.131609.164578639844581895.davem@davemloft.net>

On 17.11.2016 19:16, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Thu, 17 Nov 2016 18:20:39 +0100
> 
>> Hi,
>>
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
>>>> The other way is the journal idea I had, which uses an rb-tree with
>>>> timestamps as keys (can be lamport timestamps). You insert into the tree
>>>> until the dump is finished and use it as queue later to shuffle stuff
>>>> into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entires?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
>>
>> For user space queries we would still query the in-kernel table.
> 
> Ok, I think I might misunderstand something.
> 
> What is going into this journal exactly?  The actual full software and
> hardware insert operation, or just the notification to the hardware
> device driver notifiers?

The journal is only used as a timely ordered queue for updating the
hardware in correct order.

The enqueue side is the fib notifier only. If no fib notifier is
registered we don't use this code at all (and also don't hit the lock
protecting this journal in fib insertion/deletion path - fast in-kernel
path is untouched - otherwise just a spin_lock already under rtnl_lock
in slow path).

The fib notifier enqueues the packet with a timestamp into this journal
and can also merge entries while they are in the queue, e.g. we got a
delete from the fib notifier but the rcu walk indicated an addition of
the entry, so we can merge that at this point and depending on the
timestamp remove the entry or drop the deletion event.

We start dequeueing the fib entries into the hardware as soon as the rcu
dump is finished, at this point we are up-to-date in the queue with all
events. New events can be added to the journal (with appropriate
locking) during this time, as the queue was once in proper synced state
we stay proper synchronized. We keep up with the queue in steady state
after the dump, so syncing happens ASAP. Maybe we can also drop the
journal then.

Something alike this described queue is implemented here (haven't
checked if it exactly matches the specification, certainly it provides
more features):
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.h
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.c

For this to work the config path needs to add timestamps to the
fib_infos or fib_aliases.

> The "lookup" I'm mostly concerned with is the fast path where the
> packets being processed actually look up a route.

This doesn't change at all. All code will be hidden in a library that
gets attached to the fib notifier, which is configuration code path.

> I do not think we can return success on the insert to the user yet
> have the route lookup dataplace not return that route on a lookup.

We don't change kernel fast path at all.

If we add/delete a route in software and hardware, kernel indicates
success as soon as software has the entry added. It also gets queued up
in the journal. Journal will be lazily processed, if error happens
during that (e.g. hardware signals table full), abort is called and all
packets go to user space ASAP. User space will always show the route as
it is added to in the first place and after the driver called abort also
process the packets accordingly.

I can imagine this can get very complicated. David's approach with a
counter to check for modifications and a limited number of retries
probably works too, especially because the hardware will probably be
initialized before routing daemons start up and will be synced up
hopefully all the time.

So maybe this is over engineered, but I have no idea how long hardware
needs to sync up a e.g. full IPv4 routing table into hardware (if that
is actually the current goal of this).

Bye,
Hannes

^ permalink raw reply

* Re: Netperf UDP issue with connected sockets
From: Eric Dumazet @ 2016-11-17 18:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Rick Jones, netdev, Saeed Mahameed, Tariq Toukan
In-Reply-To: <20161117193021.580589ae@redhat.com>

On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:

> The point is I can see a socket Send-Q forming, thus we do know the
> application have something to send. Thus, and possibility for
> non-opportunistic bulking. Allowing/implementing bulk enqueue from
> socket layer into qdisc layer, should be fairly simple (and rest of
> xmit_more is already in place).  


As I said, you are fooled by TX completions.

Please make sure to increase the sndbuf limits !

echo 2129920 >/proc/sys/net/core/wmem_default

lpaa23:~# sar -n DEV 1 10|grep eth1
10:49:25         eth1      7.00 9273283.00      0.61 2187214.90      0.00      0.00      0.00
10:49:26         eth1      1.00 9230795.00      0.06 2176787.57      0.00      0.00      1.00
10:49:27         eth1      2.00 9247906.00      0.17 2180915.45      0.00      0.00      0.00
10:49:28         eth1      3.00 9246542.00      0.23 2180790.38      0.00      0.00      1.00
10:49:29         eth1      1.00 9239218.00      0.06 2179044.83      0.00      0.00      0.00
10:49:30         eth1      3.00 9248775.00      0.23 2181257.84      0.00      0.00      1.00
10:49:31         eth1      4.00 9225471.00      0.65 2175772.75      0.00      0.00      0.00
10:49:32         eth1      2.00 9253536.00      0.33 2182666.44      0.00      0.00      1.00
10:49:33         eth1      1.00 9265900.00      0.06 2185598.40      0.00      0.00      0.00
10:49:34         eth1      1.00 6949031.00      0.06 1638889.63      0.00      0.00      1.00
Average:         eth1      2.50 9018045.70      0.25 2126893.82      0.00      0.00      0.50


lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
     xmit_more: 2251366909
     xmit_more: 2256011392

lpaa23:~# echo 2256011392-2251366909 | bc
4644483

   PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp],  (all, 48 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------------------------------

    11.64%  [kernel]  [k] skb_set_owner_w               
     6.21%  [kernel]  [k] queued_spin_lock_slowpath     
     4.76%  [kernel]  [k] _raw_spin_lock                
     4.40%  [kernel]  [k] __ip_make_skb                 
     3.10%  [kernel]  [k] sock_wfree                    
     2.87%  [kernel]  [k] ipt_do_table                  
     2.76%  [kernel]  [k] fq_dequeue                    
     2.71%  [kernel]  [k] mlx4_en_xmit                  
     2.50%  [kernel]  [k] __dev_queue_xmit              
     2.29%  [kernel]  [k] __ip_append_data.isra.40      
     2.28%  [kernel]  [k] udp_sendmsg                   
     2.01%  [kernel]  [k] __alloc_skb                   
     1.90%  [kernel]  [k] napi_consume_skb              
     1.63%  [kernel]  [k] udp_send_skb                  
     1.62%  [kernel]  [k] skb_release_data              
     1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath     
     1.56%  [kernel]  [k] dev_hard_start_xmit           
     1.55%  udpsnd    [.] __libc_send                   
     1.48%  [kernel]  [k] netif_skb_features            
     1.42%  [kernel]  [k] __qdisc_run                   
     1.35%  [kernel]  [k] sk_dst_check                  
     1.33%  [kernel]  [k] sock_def_write_space          
     1.30%  [kernel]  [k] kmem_cache_alloc_node_trace   
     1.29%  [kernel]  [k] __local_bh_enable_ip          
     1.21%  [kernel]  [k] copy_user_enhanced_fast_string
     1.08%  [kernel]  [k] __kmalloc_reserve.isra.40     
     1.08%  [kernel]  [k] SYSC_sendto                   
     1.07%  [kernel]  [k] kmem_cache_alloc_node         
     0.95%  [kernel]  [k] ip_finish_output2             
     0.95%  [kernel]  [k] ktime_get                     
     0.91%  [kernel]  [k] validate_xmit_skb             
     0.88%  [kernel]  [k] sock_alloc_send_pskb          
     0.82%  [kernel]  [k] sock_sendmsg                  

^ permalink raw reply

* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: André Roth @ 2016-11-17 18:44 UTC (permalink / raw)
  To: Jerome Brunet
  Cc: Martin Blumenstingl, Johnson Leung, Giuseppe CAVALLARO,
	linux-amlogic, Alexandre Torgue, netdev
In-Reply-To: <1479120574.29252.29.camel@baylibre.com>


Hi all,

> I checked again the kernel
> at https://github.com/hardkernel/linux/tree/ odroidc2-3.14.y. The
> version you mention (3.14.65-73) seems to be:
> sha1: c75d5f4d1516cdd86d90a9d1c565bb0ed9251036 tag: jenkins-deb s905
> kernel-73

I downloaded the prebuilt image from hardkernel, I did not build the
kernel myself. but hardkernel has an earlier release of the same kernel
version, which works fine too. I assume they would have committed the
change in the newer version..
 
> In this particular version, both realtek drivers:
> - drivers/net/phy/realtek.c
> - drivers/amlogic/ethernet/phy/am_realtek.c
> 
> have the hack to disable 1000M advertisement. I don't understand how
> it possible for you to have 1000Base-T Full Duplex with this, maybe
> I'm missing something here ?

that's what I don't understand as well...

the patched kernel shows the following:

$ uname -a
Linux T-06 4.9.0-rc4+ #21 SMP PREEMPT Sun Nov 13 12:07:19 UTC 2016

$ sudo ethtool eth0
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Link partner advertised link modes:  10baseT/Half 10baseT/Full 
                                             100baseT/Half
100baseT/Full 1000baseT/Half 1000baseT/Full 
        Link partner advertised pause frame use: Symmetric Receive-only
        Link partner advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: ug
        Wake-on: d
        Current message level: 0x0000003f (63)
                               drv probe link timer ifdown ifup
        Link detected: yes

$ sudo ethtool --show-eee eth0
EEE Settings for eth0:
        EEE status: disabled
        Tx LPI: disabled
        Supported EEE link modes:  100baseT/Full 
                                   1000baseT/Full 
        Advertised EEE link modes:  100baseT/Full 
        Link partner advertised EEE link modes:  100baseT/Full 
                                                 1000baseT/Full 

can it be that "EEE link modes" and the "normal" link modes are two
different things ? 

> If you did compile the kernel yourself, could you check the 2 file
> mentioned above ? Just to be sure there was no patch applied at the
> last minute, which would not show up in the git history of
> hardkernel ?

I cannot check this easily at the moment..

Regards,

 André

^ permalink raw reply

* Re: [PATCH net 1/1] net sched filters: pass netlink message flags in event notification
From: David Miller @ 2016-11-17 18:42 UTC (permalink / raw)
  To: mrv; +Cc: netdev, jhs
In-Reply-To: <1479334570-25159-1-git-send-email-mrv@mojatatu.com>

From: Roman Mashak <mrv@mojatatu.com>
Date: Wed, 16 Nov 2016 17:16:10 -0500

> Userland client should be able to read an event, and reflect it back to
> the kernel, therefore it needs to extract complete set of netlink flags.
> 
> For example, this will allow "tc monitor" to distinguish Add and Replace
> operations.
> 
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>

Looks good, applied.

^ permalink raw reply

* Re: [PATCH net-next 0/3] RDS: TCP: HA/Failover fixes
From: David Miller @ 2016-11-17 18:35 UTC (permalink / raw)
  To: sowmini.varadhan; +Cc: netdev, santosh.shilimkar, rds-devel
In-Reply-To: <cover.1478876910.git.sowmini.varadhan@oracle.com>

From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Date: Wed, 16 Nov 2016 13:29:47 -0800

> This series contains a set of fixes for bugs exposed when
> we ran the following in a loop between a test machine pair:
> 
>  while (1); do
>    # modprobe rds-tcp on test nodes
>    # run rds-stress in bi-dir mode between test machine pair 
>    # modprobe -r rds-tcp on test nodes
>  done
> 
> rds-stress in bi-dir mode will cause both nodes to initiate
> RDS-TCP connections at almost the same instant, exposing the 
> bugs fixed in this series. 
> 
> Without the fixes, rds-stress reports sporadic packet drops,
> and packets arriving out of sequence. After the fixes,we have
> been able to run the  test overnight, without any issues.
> 
> Each patch has a detailed description of the root-cause fixed
> by the patch.

Series applied, thank you.

^ permalink raw reply

* [PATCH 12/20] net/iucv: Convert to hotplug state machine
From: Sebastian Andrzej Siewior @ 2016-11-17 18:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: rt, Sebastian Andrzej Siewior, Ursula Braun, David S. Miller,
	linux-s390, netdev
In-Reply-To: <20161117183541.8588-1-bigeasy@linutronix.de>

Install the callbacks via the state machine and let the core invoke the
callbacks on the already online CPUs. The smp function calls in the
online/downprep callbacks are not required as the callback is guaranteed to
be invoked on the upcoming/outgoing cpu.

Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-s390@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/cpuhotplug.h |   1 +
 net/iucv/iucv.c            | 118 +++++++++++++++++----------------------------
 2 files changed, 45 insertions(+), 74 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index fd5598b8353a..69abf2c09f6c 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -63,6 +63,7 @@ enum cpuhp_state {
 	CPUHP_X86_THERM_PREPARE,
 	CPUHP_X86_CPUID_PREPARE,
 	CPUHP_X86_MSR_PREPARE,
+	CPUHP_NET_IUCV_PREPARE,
 	CPUHP_TIMERS_DEAD,
 	CPUHP_NOTF_ERR_INJ_PREPARE,
 	CPUHP_MIPS_SOC_PREPARE,
diff --git a/net/iucv/iucv.c b/net/iucv/iucv.c
index 88a2a3ba4212..f0d6afc5d4a9 100644
--- a/net/iucv/iucv.c
+++ b/net/iucv/iucv.c
@@ -639,7 +639,7 @@ static void iucv_disable(void)
 	put_online_cpus();
 }
 
-static void free_iucv_data(int cpu)
+static int iucv_cpu_dead(unsigned int cpu)
 {
 	kfree(iucv_param_irq[cpu]);
 	iucv_param_irq[cpu] = NULL;
@@ -647,9 +647,10 @@ static void free_iucv_data(int cpu)
 	iucv_param[cpu] = NULL;
 	kfree(iucv_irq_data[cpu]);
 	iucv_irq_data[cpu] = NULL;
+	return 0;
 }
 
-static int alloc_iucv_data(int cpu)
+static int iucv_cpu_prepare(unsigned int cpu)
 {
 	/* Note: GFP_DMA used to get memory below 2G */
 	iucv_irq_data[cpu] = kmalloc_node(sizeof(struct iucv_irq_data),
@@ -671,58 +672,38 @@ static int alloc_iucv_data(int cpu)
 	return 0;
 
 out_free:
-	free_iucv_data(cpu);
+	iucv_cpu_dead(cpu);
 	return -ENOMEM;
 }
 
-static int iucv_cpu_notify(struct notifier_block *self,
-				     unsigned long action, void *hcpu)
+static int iucv_cpu_online(unsigned int cpu)
 {
-	cpumask_t cpumask;
-	long cpu = (long) hcpu;
-
-	switch (action) {
-	case CPU_UP_PREPARE:
-	case CPU_UP_PREPARE_FROZEN:
-		if (alloc_iucv_data(cpu))
-			return notifier_from_errno(-ENOMEM);
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_iucv_data(cpu);
-		break;
-	case CPU_ONLINE:
-	case CPU_ONLINE_FROZEN:
-	case CPU_DOWN_FAILED:
-	case CPU_DOWN_FAILED_FROZEN:
-		if (!iucv_path_table)
-			break;
-		smp_call_function_single(cpu, iucv_declare_cpu, NULL, 1);
-		break;
-	case CPU_DOWN_PREPARE:
-	case CPU_DOWN_PREPARE_FROZEN:
-		if (!iucv_path_table)
-			break;
-		cpumask_copy(&cpumask, &iucv_buffer_cpumask);
-		cpumask_clear_cpu(cpu, &cpumask);
-		if (cpumask_empty(&cpumask))
-			/* Can't offline last IUCV enabled cpu. */
-			return notifier_from_errno(-EINVAL);
-		smp_call_function_single(cpu, iucv_retrieve_cpu, NULL, 1);
-		if (cpumask_empty(&iucv_irq_cpumask))
-			smp_call_function_single(
-				cpumask_first(&iucv_buffer_cpumask),
-				iucv_allow_cpu, NULL, 1);
-		break;
-	}
-	return NOTIFY_OK;
+	if (!iucv_path_table)
+		return 0;
+	iucv_declare_cpu(NULL);
+	return 0;
 }
 
-static struct notifier_block __refdata iucv_cpu_notifier = {
-	.notifier_call = iucv_cpu_notify,
-};
+static int iucv_cpu_down_prep(unsigned int cpu)
+{
+	cpumask_t cpumask;
+
+	if (!iucv_path_table)
+		return 0;
+
+	cpumask_copy(&cpumask, &iucv_buffer_cpumask);
+	cpumask_clear_cpu(cpu, &cpumask);
+	if (cpumask_empty(&cpumask))
+		/* Can't offline last IUCV enabled cpu. */
+		return -EINVAL;
+
+	iucv_retrieve_cpu(NULL);
+	if (!cpumask_empty(&iucv_irq_cpumask))
+		return 0;
+	smp_call_function_single(cpumask_first(&iucv_buffer_cpumask),
+				 iucv_allow_cpu, NULL, 1);
+	return 0;
+}
 
 /**
  * iucv_sever_pathid
@@ -2027,6 +2008,7 @@ struct iucv_interface iucv_if = {
 };
 EXPORT_SYMBOL(iucv_if);
 
+static enum cpuhp_state iucv_online;
 /**
  * iucv_init
  *
@@ -2035,7 +2017,6 @@ EXPORT_SYMBOL(iucv_if);
 static int __init iucv_init(void)
 {
 	int rc;
-	int cpu;
 
 	if (!MACHINE_IS_VM) {
 		rc = -EPROTONOSUPPORT;
@@ -2054,23 +2035,19 @@ static int __init iucv_init(void)
 		goto out_int;
 	}
 
-	cpu_notifier_register_begin();
-
-	for_each_online_cpu(cpu) {
-		if (alloc_iucv_data(cpu)) {
-			rc = -ENOMEM;
-			goto out_free;
-		}
-	}
-	rc = __register_hotcpu_notifier(&iucv_cpu_notifier);
+	rc = cpuhp_setup_state(CPUHP_NET_IUCV_PREPARE, "net/iucv:prepare",
+			       iucv_cpu_prepare, iucv_cpu_dead);
 	if (rc)
 		goto out_free;
-
-	cpu_notifier_register_done();
+	rc = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "net/iucv:online",
+			       iucv_cpu_online, iucv_cpu_down_prep);
+	if (rc < 0)
+		goto out_free;
+	iucv_online = rc;
 
 	rc = register_reboot_notifier(&iucv_reboot_notifier);
 	if (rc)
-		goto out_cpu;
+		goto out_free;
 	ASCEBC(iucv_error_no_listener, 16);
 	ASCEBC(iucv_error_no_memory, 16);
 	ASCEBC(iucv_error_pathid, 16);
@@ -2084,14 +2061,10 @@ static int __init iucv_init(void)
 
 out_reboot:
 	unregister_reboot_notifier(&iucv_reboot_notifier);
-out_cpu:
-	cpu_notifier_register_begin();
-	__unregister_hotcpu_notifier(&iucv_cpu_notifier);
 out_free:
-	for_each_possible_cpu(cpu)
-		free_iucv_data(cpu);
-
-	cpu_notifier_register_done();
+	if (iucv_online)
+		cpuhp_remove_state(iucv_online);
+	cpuhp_remove_state(CPUHP_NET_IUCV_PREPARE);
 
 	root_device_unregister(iucv_root);
 out_int:
@@ -2110,7 +2083,6 @@ static int __init iucv_init(void)
 static void __exit iucv_exit(void)
 {
 	struct iucv_irq_list *p, *n;
-	int cpu;
 
 	spin_lock_irq(&iucv_queue_lock);
 	list_for_each_entry_safe(p, n, &iucv_task_queue, list)
@@ -2119,11 +2091,9 @@ static void __exit iucv_exit(void)
 		kfree(p);
 	spin_unlock_irq(&iucv_queue_lock);
 	unregister_reboot_notifier(&iucv_reboot_notifier);
-	cpu_notifier_register_begin();
-	__unregister_hotcpu_notifier(&iucv_cpu_notifier);
-	for_each_possible_cpu(cpu)
-		free_iucv_data(cpu);
-	cpu_notifier_register_done();
+
+	cpuhp_remove_state_nocalls(iucv_online);
+	cpuhp_remove_state(CPUHP_NET_IUCV_PREPARE);
 	root_device_unregister(iucv_root);
 	bus_unregister(&iucv_bus);
 	unregister_external_irq(EXT_IRQ_IUCV, iucv_external_interrupt);
-- 
2.10.2

^ permalink raw reply related

* Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Ido Schimmel @ 2016-11-17 18:32 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kuznet, jmorris, yoshfuji,
	kaber
In-Reply-To: <f58c008f-c23a-4950-2975-3c1b4c3b8692@stressinduktion.org>

On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
> On 17.11.2016 17:45, David Miller wrote:
> > From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > Date: Thu, 17 Nov 2016 15:36:48 +0100
> > 
> >> The other way is the journal idea I had, which uses an rb-tree with
> >> timestamps as keys (can be lamport timestamps). You insert into the tree
> >> until the dump is finished and use it as queue later to shuffle stuff
> >> into the hardware.
> > 
> > If you have this "place" where pending inserts are stored, you have
> > a policy decision to make.
> > 
> > First of all what do other lookups see when there are pending entires?
> 
> I think this is a problem with the current approach already, as the
> delayed work queue already postpones the insert for an undecidable
> amount of time (and reorders depending on which CPU the entry was
> inserted and the fib notifier was called).

The delayed work queue only postpones the insert into the listener's
table, but the entries will be present in the kernel's table as usual.
Therefore, other lookup made on the kernel's table will see the pending
entries.

Note that both listeners that currently call the dump API do that during
their init, before any lookups can be made on their tables. If you think
it's critical, then we can make sure the workqueues are flushed before
the listeners register their netdevs and effectively expose their tables
to lookups.

I'm looking into the reordering issues you mentioned. I belive it's a
valid point.

> For user space queries we would still query the in-kernel table.

Right. No changes here.

> > If, once inserted into the pending queue, you return success to the
> > inserting entity, then you must make those pending entries visible
> > to lookups.
> 
> I think this same problem is in this patch set already. The way I
> understood it, is, that if a problem during insert emerges, the driver
> calls abort and every packet will be send to user space, as the routing
> table cannot be offloaded and it won't try it again, Ido?

First of all, this abort mechanism is already in place and isn't part of
this patchset. Secondly, why is this a problem? The all-or-nothing
approach is an hard requirement and current implementation is infinitely
better than previous approach in which the kernel's tables were flushed
upon route insertion failure. It rendered the system unusable. Current
implementation of abort mechanism keeps the system usable, but with
reduced performance.

> Probably this is an artifact of the mellanox implementation and we can
> implement this differently for other cards, but the schema to abort all
> if the modification doesn't work, seems to be fundamental (I think we
> require the all-or-nothing approach for now).

Yes, it's an hard requirement for all implementations. mlxsw implements
it by evicting all routes from its tables and inserting a default route
that traps packets to CPU.

> > If you block the inserting entity, well that doesn't make a lot of
> > sense.  If blocking is a workable solution, then we can just block the
> > entire insert while this FIB dump to the device driver is happening.
> 
> I don't think we should really block (as in kernel-block) at any time.
> 
> I was suggesting something like:
> 
> rtnl_lock();
> synchronize_rcu_expedited(); // barrier, all rounting modifications are
> stable and no new can be added due to rtnl_lock
> register notifier(); // notifier adds entries also into journal();
> rtnl_unlock();
> walk_fib_tree_rcu_into_journal();
> // walk finished
> start_syncing_journal_to_hw(); // if new entries show up we sync them
> asap after this point
> 
> The journal would need a spin lock to protect its internal state and
> order events correctly.
> 
> > But I am pretty sure the idea of blocking modifications for so long
> > was considered undesirable.
> 
> Yes, this is also still my opinion.

It was indeed rejected :)
https://marc.info/?l=linux-netdev&m=147794848224884&w=2

^ permalink raw reply

* Re: [PATCH 3/3] net: stmmac: replace if (netif_msg_type) by their netif_xxx counterpart
From: David Miller @ 2016-11-17 18:31 UTC (permalink / raw)
  To: clabbe.montjoie; +Cc: peppe.cavallaro, alexandre.torgue, netdev, linux-kernel
In-Reply-To: <1479323381-26639-3-git-send-email-clabbe.montjoie@gmail.com>

From: Corentin Labbe <clabbe.montjoie@gmail.com>
Date: Wed, 16 Nov 2016 20:09:41 +0100

> From: LABBE Corentin <clabbe.montjoie@gmail.com>
> 
> As sugested by Joe Perches, we could replace all
> if (netif_msg_type(priv)) dev_xxx(priv->devices, ...)
> by the simpler macro netif_xxx(priv, hw, priv->dev, ...)
> 
> Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>

Applied to net-next.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox