Netdev List


* Re: [PATCH net-next] rhashtable: remove insecure_max_entries param
From: David Miller @ 2017-04-26 18:39 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <20170425094134.21885-1-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Tue, 25 Apr 2017 11:41:34 +0200

> no users in the tree, insecure_max_entries is always set to
> ht->p.max_size * 2 in rhashtable_init().
> 
> Replace only spot that uses it with a ht->p.max_size check.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>

Applied, thanks Florian.
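
For illustration only (not part of the patch): a minimal userspace sketch of
the check the commit message describes, assuming the limit can be written
against max_size alone. The struct and function names are hypothetical, and
the exact expression used in the kernel may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two rhashtable fields involved. */
struct toy_ht {
	unsigned int max_size;	/* 0 means "no limit" */
	unsigned int nelems;	/* current number of entries */
};

/* Old style: a separate limit that was always max_size * 2. */
static bool toy_above_max_old(const struct toy_ht *ht)
{
	unsigned int insecure_max_entries = ht->max_size * 2u;

	return ht->max_size && ht->nelems >= insecure_max_entries;
}

/* New style: the same test written against max_size alone. */
static bool toy_above_max_new(const struct toy_ht *ht)
{
	return ht->max_size && ht->nelems / 2u >= ht->max_size;
}
```

Both helpers agree for small values, which is why the extra parameter
carries no information of its own.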


* Re: [PATCH v2] macsec: dynamically allocate space for sglist
From: David Miller @ 2017-04-26 18:42 UTC (permalink / raw)
  To: Jason; +Cc: netdev, linux-kernel, stable, security, sd
In-Reply-To: <20170425170818.32661-1-Jason@zx2c4.com>

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Tue, 25 Apr 2017 19:08:18 +0200

> We call skb_cow_data, which is good anyway to ensure we can actually
> modify the skb as such (another error from prior). Now that we have the
> number of fragments required, we can safely allocate exactly that amount
> of memory.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Sabrina Dubroca <sd@queasysnail.net>
> Cc: security@kernel.org
> Cc: stable@vger.kernel.org

Applied, thanks.
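
For illustration only (not the macsec code): a small userspace sketch of
sizing a scatterlist-like array from a fragment count known at run time,
rather than from a fixed compile-time bound. The toy_sg type and
toy_sg_alloc() are hypothetical names; in the patch the count comes from
skb_cow_data().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one scatterlist entry. */
struct toy_sg {
	const void *addr;
	size_t len;
};

/*
 * Allocate exactly nfrags zeroed entries. A non-positive count is
 * treated as an error, mirroring a failed fragment count lookup.
 */
static struct toy_sg *toy_sg_alloc(int nfrags)
{
	if (nfrags <= 0)
		return NULL;

	/* calloc() checks the nfrags * size multiplication for overflow. */
	return calloc((size_t)nfrags, sizeof(struct toy_sg));
}
```

The caller owns the array and releases it with free() once the request
has completed.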


* Re: [PATCH net-next 00/10] tcp: do not use tcp_time_stamp for rcv autotuning
From: David Miller @ 2017-04-26 18:44 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, soheil, eric.dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Tue, 25 Apr 2017 10:15:31 -0700

> Some devices or linux distributions use HZ=100 or HZ=250
> 
> TCP receive buffer autotuning has poor behavior caused by this choice.
> Since autotuning happens after 4 ms or 10 ms, short distance flows
> get their receive buffer tuned to a very high value, but after an initial
> period where it was frozen to (too small) initial value.
> 
> With BBR (or other CC allowing to increase BDP), we are willing to
> increase tcp_rmem[2], but this receive autotuning defect is a blocker
> for hosts dealing with gazillions of TCP flows in the data centers,
> since many of them have inflated RCVBUF. Risk of OOM is too high.
> 
> Note that TSO autodefer, tcp cubic, and TCP TS options (RFC 7323)
> also suffer from our dependency to jiffies (via tcp_time_stamp).
> 
> We have ongoing efforts to improve all that in the future.

Looks great, series applied, thanks Eric.
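
For illustration only: a userspace sketch of why a jiffies-based clock
produces the "4 ms or 10 ms" figures quoted above. The helper names are
hypothetical, and the rounding model (an interval is seen as at least one
whole tick) is a simplifying assumption, not the kernel's exact behavior.

```c
#include <assert.h>

/* Length of one jiffy in milliseconds for a given HZ. */
static unsigned int toy_jiffy_ms(unsigned int hz)
{
	assert(hz > 0 && hz <= 1000);
	return 1000u / hz;
}

/*
 * Interval, in milliseconds, that a tick-based clock would report for
 * a real interval of real_us microseconds. Assumption: the interval is
 * rounded up to whole ticks and never reported as less than one tick.
 */
static unsigned int toy_elapsed_ms(unsigned int real_us, unsigned int hz)
{
	unsigned int tick_us = 1000000u / hz;
	unsigned int ticks = (real_us + tick_us - 1u) / tick_us;

	if (ticks == 0)
		ticks = 1;
	return ticks * tick_us / 1000u;
}
```

Under this model a 100 us data center RTT is seen as 10 ms with HZ=100 and
as 4 ms with HZ=250, which matches the delay before autotuning reacts.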


* Re: [oss-drivers] Re: [RFC 3/4] nfp: make use of extended ack message reporting
From: Simon Horman @ 2017-04-26 18:44 UTC (permalink / raw)
  To: David Miller
  Cc: jhs, jakub.kicinski, netdev, johannes, dsa, daniel,
	alexei.starovoitov, bblanco, john.fastabend, kubakici,
	oss-drivers
In-Reply-To: <20170426.104416.270999555163740292.davem@davemloft.net>

On Wed, Apr 26, 2017 at 10:44:16AM -0400, David Miller wrote:
> From: Simon Horman <simon.horman@netronome.com>
> Date: Wed, 26 Apr 2017 13:13:16 +0200
> 
> > On Tue, Apr 25, 2017 at 10:20:22AM -0400, David Miller wrote:
> >> From: Jamal Hadi Salim <jhs@mojatatu.com>
> >> Date: Tue, 25 Apr 2017 08:42:32 -0400
> >> 
> >> > So are we going to standardize these strings?
> >> 
> >> No.
> >> 
> >> > i.e. what if some user has written a bash script that depends on this
> >> > string and it gets changed later.
> >> 
> >> They can't do that.
> >> 
> >> It's free form extra information an application may or may not provide
> >> to the user when the kernel emits it.
> > 
> > I don't feel strongly about this and perhaps it can be revisited at some
> > point but perhaps it would be worth documenting that the strings do not
> > form part of the UAPI as my expectation would have been that they do f.e. to
> > facilitate internationalisation.
> 
> These two things are entirely separate.
> 
> We can maintain up-to-date translations of the strings, yet document that
> they can change at any time and are thus not UAPI.

Thanks, I see that now.


* Re: [PATCH net-next] dt-bindings: mdio: Clarify binding document
From: David Miller @ 2017-04-26 18:46 UTC (permalink / raw)
  To: f.fainelli-Re5JQEeQqe8AvxtiuMwx3w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, rogerq-l0cyMroinI0,
	andrew-g2DYL2Zd6BY, tony-4v6yS6AI5VpBDgjK7y7TUQ,
	nsekhar-l0cyMroinI0, jsarha-l0cyMroinI0,
	linux-omap-u79uwXL29TY76Z2rM5mHXA, lars-Qo5EllUWu/uELgA04lAiVw,
	robh+dt-DgEjT+Ai2ygdnm+yROfE0A, mark.rutland-5wv7dgnIgG8,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20170425183308.26107-1-f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

From: Florian Fainelli <f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Tue, 25 Apr 2017 11:33:03 -0700

> The described GPIO reset property is applicable to *all* child PHYs. If
> we have one reset line per PHY present on the MDIO bus, these
> automatically become properties of the child PHY nodes.
> 
> Finally, indicate how the RESET pulse width must be defined, which is
> the maximum value of all individual PHYs RESET pulse widths determined
> by reading their datasheets.
> 
> Fixes: 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.")
> Signed-off-by: Florian Fainelli <f.fainelli-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Applied, thanks Florian.


* Re: [PATCH v3] net: core: Prevent from dereferencing null pointer when releasing SKB
From: David Miller @ 2017-04-26 18:47 UTC (permalink / raw)
  To: mhjungk; +Cc: netdev
In-Reply-To: <1493146695-5387-1-git-send-email-mhjungk@gmail.com>

From: Myungho Jung <mhjungk@gmail.com>
Date: Tue, 25 Apr 2017 11:58:15 -0700

> Added NULL check to make __dev_kfree_skb_irq consistent with kfree
> family of functions.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=195289
> 
> Signed-off-by: Myungho Jung <mhjungk@gmail.com>

Applied, thank you.
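
For illustration only (userspace, not the kernel function): a sketch of a
release helper that tolerates NULL in the same way free() does, which is
the convention the commit message refers to. struct toy_buf and both
helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical buffer object standing in for an skb. */
struct toy_buf {
	unsigned char *data;
	size_t len;
};

/* Allocate a buffer with len bytes of payload; NULL on failure. */
static struct toy_buf *toy_buf_alloc(size_t len)
{
	struct toy_buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->data = malloc(len ? len : 1);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->len = len;
	return b;
}

/*
 * Release a buffer. A NULL argument is a no-op, like free(NULL).
 * Returns 1 if something was released, 0 if b was NULL.
 */
static int toy_buf_free(struct toy_buf *b)
{
	if (!b)
		return 0;
	free(b->data);
	free(b);
	return 1;
}
```

With the early return in place, error paths can call the release helper
unconditionally without checking the pointer first.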


* Re: [PATCH net-next 0/2] Move sub crq init out of interrupt context
From: David Miller @ 2017-04-26 18:49 UTC (permalink / raw)
  To: nfont; +Cc: netdev, jallen, tlfalcon
In-Reply-To: <20170425185704.41126.65738.stgit@ltcalpine2-lp23.aus.stglabs.ibm.com>

From: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Date: Tue, 25 Apr 2017 15:00:58 -0400

> The sub crqs are currently initialized in interrupt context when
> handling a crq response from the vios server. There is no reason
> they must be initialized there.
> 
> Moving the initialization of the sub crqs to the ibmvnic_init routine
> allows us to do the initialization outside of interrupt context and
> make all of the allocations with GFP_KERNEL instead of GFP_ATOMIC.

Series applied, thanks.


* Re: [PATCH net-next] virtio-net: on tx, only call napi_disable if tx napi is on
From: David Miller @ 2017-04-26 18:50 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, mst, jasowang, virtualization, willemb
In-Reply-To: <20170425195917.54209-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Tue, 25 Apr 2017 15:59:17 -0400

> From: Willem de Bruijn <willemb@google.com>
> 
> As of tx napi, device down (`ip link set dev $dev down`) hangs unless
> tx napi is enabled. Else napi_enable is not called, so napi_disable
> will spin on test_and_set_bit NAPI_STATE_SCHED.
> 
> Only call napi_disable if tx napi is enabled.
> 
> Fixes: 5a719c2552ca ("virtio-net: transmit napi")
> Reported-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Applied, thanks.
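
For illustration only: a toy userspace model of the hang described above.
It assumes a NAPI context starts with its SCHED bit set, that enabling
clears the bit, and that disabling waits until it can set the bit again.
All names are hypothetical; "would spin forever" is modeled as a false
return value instead of an actual loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one tx NAPI context. */
struct toy_napi {
	bool sched;		/* models NAPI_STATE_SCHED */
	bool tx_napi_on;	/* whether tx napi is enabled */
};

static void toy_napi_init(struct toy_napi *n, bool tx_napi_on)
{
	n->sched = true;	/* assumed initial state before enable */
	n->tx_napi_on = tx_napi_on;
}

/* Device open: the context is only enabled when tx napi is on. */
static void toy_open(struct toy_napi *n)
{
	if (n->tx_napi_on)
		n->sched = false;
}

/* Returns false where the real code would spin on the SCHED bit. */
static bool toy_napi_disable(struct toy_napi *n)
{
	if (n->sched)
		return false;
	n->sched = true;
	return true;
}

/* Device close before the fix: always disables. */
static bool toy_close_buggy(struct toy_napi *n)
{
	return toy_napi_disable(n);
}

/* Device close after the fix: disable only if tx napi is on. */
static bool toy_close_fixed(struct toy_napi *n)
{
	if (!n->tx_napi_on)
		return true;
	return toy_napi_disable(n);
}
```

In the model, closing a device whose tx napi is off only completes with
the guarded variant.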


* Re: [Patch net] ipv6: check skb->protocol before lookup for nexthop
From: David Miller @ 2017-04-26 18:51 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, andreyknvl, steffen.klassert
In-Reply-To: <1493156235-9823-1-git-send-email-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 25 Apr 2017 14:37:15 -0700

> Andrey reported an out-of-bounds access in ip6_tnl_xmit(), this
> is because we use an ipv4 dst in ip6_tnl_xmit() and cast an IPv4
> neigh key as an IPv6 address:
> 
>         neigh = dst_neigh_lookup(skb_dst(skb),
>                                  &ipv6_hdr(skb)->daddr);
>         if (!neigh)
>                 goto tx_err_link_failure;
> 
>         addr6 = (struct in6_addr *)&neigh->primary_key; // <=== HERE
>         addr_type = ipv6_addr_type(addr6);
> 
>         if (addr_type == IPV6_ADDR_ANY)
>                 addr6 = &ipv6_hdr(skb)->daddr;
> 
>         memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));
> 
> Also the network header of the skb at this point should be still IPv4
> for 4in6 tunnels, we should not just use it as an IPv6 header.
> 
> This patch fixes it by checking if skb->protocol is ETH_P_IPV6: if it
> is, we are safe to do the nexthop lookup using skb_dst() and
> ipv6_hdr(skb)->daddr; if not (aka IPv4), we have no clue about which
> dest address we can pick here, we have to rely on callers to fill it
> from tunnel config, so just fall to ip6_route_output() to make the
> decision.
> 
> Fixes: ea3dc9601bda ("ip6_tunnel: Add support for wildcard tunnel endpoints.")
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Andrey Konovalov <andreyknvl@google.com>
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied and queued up for -stable, thanks Cong.
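
For illustration only (userspace, hypothetical types): a sketch of the
guard described in the commit message. A neighbour key is only treated as
an IPv6 address when the packet's protocol is IPv6; otherwise the caller
is told to fall back to a route lookup. The TOY_ETH_P_* values match the
standard EtherType numbers; everything else is an assumption made for the
example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_P_IP	0x0800	/* EtherType for IPv4 */
#define TOY_ETH_P_IPV6	0x86DD	/* EtherType for IPv6 */

/* Hypothetical neighbour entry: key is 4 bytes (IPv4) or 16 (IPv6). */
struct toy_neigh {
	size_t key_len;
	uint8_t key[16];
};

/*
 * Copy the neighbour key into out[] only when it really is an IPv6
 * address. Returns 1 on success, 0 when the caller must fall back to
 * a route lookup instead.
 */
static int toy_pick_daddr(uint16_t protocol, const struct toy_neigh *neigh,
			  uint8_t out[16])
{
	if (protocol != TOY_ETH_P_IPV6)
		return 0;
	if (!neigh || neigh->key_len != 16)
		return 0;
	memcpy(out, neigh->key, 16);
	return 1;
}
```

Reading 16 bytes from a 4-byte IPv4 key is the out-of-bounds access the
report refers to; the protocol check keeps that path from being taken.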


* Re: [PATCH net-next] tcp: memset ca_priv data to 0 properly
From: David Miller @ 2017-04-26 18:59 UTC (permalink / raw)
  To: weiwan; +Cc: netdev, edumazet, ycheng, ncardwell
In-Reply-To: <20170426003802.40091-1-tracywwnj@gmail.com>

From: Wei Wang <weiwan@google.com>
Date: Tue, 25 Apr 2017 17:38:02 -0700

> From: Wei Wang <weiwan@google.com>
> 
> Always zero out ca_priv data in tcp_assign_congestion_control() so that
> ca_priv data is cleared out during socket creation.
> Also always zero out ca_priv data in tcp_reinit_congestion_control() so
> that when cc algorithm is changed, ca_priv data is cleared out as well.
> We should still zero out ca_priv data even in TCP_CLOSE state because
> user could call connect() on AF_UNSPEC to disconnect the socket and
> leave it in TCP_CLOSE state and later call setsockopt() to switch cc
> algorithm on this socket.
> 
> Fixes: 2b0a8c9ee ("tcp: add CDG congestion control")
> Reported-by: Andrey Konovalov  <andreyknvl@google.com>
> Signed-off-by: Wei Wang <weiwan@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>

Applied to 'net' and queued up for -stable, thanks.
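
For illustration only: a userspace sketch of clearing per-algorithm
private state whenever the algorithm is assigned or switched, regardless
of connection state. struct toy_sock, TOY_CA_PRIV_SIZE and the helpers are
hypothetical; they only model the idea in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_CA_PRIV_SIZE 64	/* arbitrary size for the example */

/* Hypothetical socket with an opaque per-algorithm scratch area. */
struct toy_sock {
	int ca_id;
	unsigned char ca_priv[TOY_CA_PRIV_SIZE];
};

/*
 * Switch congestion control algorithm. The scratch area is always
 * zeroed so the new algorithm never sees the old one's state.
 */
static void toy_set_ca(struct toy_sock *sk, int new_ca_id)
{
	memset(sk->ca_priv, 0, sizeof(sk->ca_priv));
	sk->ca_id = new_ca_id;
}

/* True if every byte of the scratch area is zero. */
static bool toy_ca_priv_is_zero(const struct toy_sock *sk)
{
	size_t i;

	for (i = 0; i < sizeof(sk->ca_priv); i++)
		if (sk->ca_priv[i])
			return false;
	return true;
}
```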


* Re: [PATCH] ipv6: check raw payload size correctly in ioctl
From: David Miller @ 2017-04-26 19:00 UTC (permalink / raw)
  To: jbainbri; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1493167407-27969-1-git-send-email-jbainbri@redhat.com>

From: Jamie Bainbridge <jbainbri@redhat.com>
Date: Wed, 26 Apr 2017 10:43:27 +1000

> In situations where an skb is paged, the transport header pointer and
> tail pointer can be the same because the skb contents are in frags.
> 
> This results in ioctl(SIOCINQ/FIONREAD) incorrectly returning a
> length of 0 when the length to receive is actually greater than zero.
> 
> skb->len is already correctly set in ip6_input_finish() with
> pskb_pull(), so use skb->len as it always returns the correct result
> for both linear and paged data.
> 
> Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com>

Applied and queued up for -stable, thanks.
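
For illustration only: a toy userspace model of the two ways of computing
the readable length. struct toy_skb and its offsets are hypothetical; the
point is that a pointer difference over the linear area reports 0 when all
payload sits in page fragments, while the total length field does not.

```c
#include <assert.h>

/* Hypothetical packet buffer with linear and paged data. */
struct toy_skb {
	unsigned int len;		/* total bytes: linear + paged */
	unsigned int data_len;		/* bytes held in page fragments */
	unsigned int transport_off;	/* offset of transport header */
	unsigned int tail_off;		/* end of the linear area */
};

/* Old computation: only sees bytes in the linear area. */
static unsigned int toy_inq_old(const struct toy_skb *skb)
{
	return skb->tail_off - skb->transport_off;
}

/* New computation: total length, correct for linear and paged data. */
static unsigned int toy_inq_new(const struct toy_skb *skb)
{
	return skb->len;
}
```

For a fully linear buffer both helpers agree; they only diverge once the
payload lives in fragments.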


* [GIT] Networking
From: David Miller @ 2017-04-26 19:21 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) MLX5 bug fixes from Saeed Mahameed et al.
   a) Release wrong resources when firmware timeout happens
   b) Wrong check for encapsulation size limits
   c) UAR memory leak
   d) ETHTOOL_GRXCLSRLALL fails to fill in info->data

2) Don't cache l3mdev on mismatched local route, causes
   net devices to leak refs.  From Robert Shearman.

3) Handle fragmented SKBs properly in macsec driver, the problem
   is that we were mis-sizing the sgvec table.  From Jason A.
   Donenfeld.

4) We cannot have checksum offload enabled for inner UDP tunneled
   packet during IPSEC, from Ansis Atteka.

5) Fix double SKB free in ravb driver, from Dan Carpenter.

6) Fix CPU port handling in b53 DSA driver, from Florian Fainelli.

7) Don't use on-stack buffers for usb_control_msg() in CAN usb driver,
   from Maksim Salau.

8) Fix device leak in macvlan driver, from Herbert Xu.  We have to
   purge the broadcast queue properly on port destroy.

9) Fix tx ring entry limit on EF10 devices in sfc driver.  From
   Bert Kenward.

10) Fix memory leaks in team driver, from Pan Bian.

11) Don't setup ipv6_stub before it can be actually used, from Paolo
    Abeni.

12) Fix tipc socket flow control accounting, from Parthasarathy
    Bhuvaragan.

13) Fix crash on module unload in hso driver, from Andreas Kemnade.

14) Fix purging of bridge multicast entries, the problem is that if
    we don't defer it to ndo_uninit it's possible for new entries to
    get added after we purge.  Fix from Xin Long.

15) Don't return garbage for PACKET_HDRLEN getsockopt, from Alexander
    Potapenko.

16) Fix autoneg stall properly in PHY layer, and revert micrel driver
    change that was papering over it.  From Alexander Kochetkov.

17) Don't dereference an ipv4 route as an ipv6 one in the ip6_tunnel
    code, from Cong Wang.

18) Clear out the congestion control private of the TCP socket in all
    of the right places, from Wei Wang.

19) rawv6_ioctl measures SKB length incorrectly, fix from Jamie
    Bainbridge.

Please pull, thanks a lot!

The following changes since commit 94836ecf1e7378b64d37624fbb81fe48fbd4c772:

  Merge tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux (2017-04-21 16:37:48 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 105f5528b9bbaa08b526d3405a5bcd2ff0c953c8:

  ipv6: check raw payload size correctly in ioctl (2017-04-26 14:59:35 -0400)

----------------------------------------------------------------
Alexander Kochetkov (1):
      net: phy: fix auto-negotiation stall due to unavailable interrupt

Alexander Potapenko (1):
      net/packet: check length in getsockopt() called with PACKET_HDRLEN

Andreas Kemnade (1):
      net: hso: fix module unloading

Ansis Atteka (1):
      udp: disable inner UDP checksum offloads in IPsec case

Bert Kenward (1):
      sfc: tx ring can only have 2048 entries for all EF10 NICs

Dan Carpenter (2):
      net: tc35815: move free after the dereference
      ravb: Double free on error in ravb_start_xmit()

David Ahern (2):
      net: ipv6: send unsolicited NA if enabled for all interfaces
      net: ipv6: regenerate host route if moved to gc list

David S. Miller (4):
      Merge tag 'mlx5-fixes-2017-04-22' of git://git.kernel.org/.../saeed/linux
      Merge branch 'dsa-b53-58xx-fixes'
      Merge tag 'linux-can-fixes-for-4.11-20170425' of git://git.kernel.org/.../mkl/linux-can
      Revert "phy: micrel: Disable auto negotiation on startup"

Eugenia Emantayev (1):
      net/mlx5e: Fix small packet threshold

Florian Fainelli (3):
      net: dsa: b53: Include IMP/CPU port in dumb forwarding mode
      net: dsa: b53: Implement software reset for 58xx devices
      net: dsa: b53: Fix CPU port for 58xx devices

Herbert Xu (1):
      macvlan: Fix device ref leak when purging bc_queue

Ilan Tayari (1):
      net/mlx5e: Fix ETHTOOL_GRXCLSRLALL handling

Jamie Bainbridge (1):
      ipv6: check raw payload size correctly in ioctl

Jason A. Donenfeld (2):
      macsec: avoid heap overflow in skb_to_sgvec
      macsec: dynamically allocate space for sglist

Maksim Salau (1):
      net: can: usb: gs_usb: Fix buffer on stack

Maor Gottlieb (1):
      net/mlx5: Fix UAR memory leak

Martin KaFai Lau (1):
      net/mlx5e: Fix race in mlx5e_sw_stats and mlx5e_vport_stats

Mohamad Haj Yahia (1):
      net/mlx5: Fix driver load bad flow when having fw initializing timeout

Myungho Jung (1):
      net: core: Prevent from dereferencing null pointer when releasing SKB

Or Gerlitz (3):
      net/mlx5: E-Switch, Correctly deal with inline mode on ConnectX-5
      net/mlx5e: Make sure the FW max encap size is enough for ipv4 tunnels
      net/mlx5e: Make sure the FW max encap size is enough for ipv6 tunnels

Pan Bian (1):
      team: fix memory leaks

Paolo Abeni (1):
      ipv6: move stub initialization after ipv6 setup completion

Parthasarathy Bhuvaragan (2):
      tipc: fix socket flow control accounting error at tipc_send_stream
      tipc: fix socket flow control accounting error at tipc_recv_stream

Robert Shearman (1):
      ipv4: Avoid caching l3mdev dst on mismatched local route

Roman Spychała (1):
      usb: plusb: Add support for PL-27A1

Sabrina Dubroca (1):
      ipv6: fix source routing

Stephane Grosjean (2):
      can: usb: Add support of PCAN-Chip USB stamp module
      can: usb: Kconfig: Add PCAN-USB X6 device in help text

WANG Cong (1):
      ipv6: check skb->protocol before lookup for nexthop

Wei Wang (1):
      tcp: memset ca_priv data to 0 properly

Xin Long (1):
      bridge: move bridge multicast cleanup to ndo_uninit

stephen hemminger (1):
      netvsc: fix calculation of available send sections

sudarsana.kalluru@cavium.com (1):
      qed: Fix error in the dcbx app meta data initialization.

 drivers/net/can/usb/Kconfig                                |  2 ++
 drivers/net/can/usb/gs_usb.c                               | 17 ++++++++++++-----
 drivers/net/can/usb/peak_usb/pcan_usb_core.c               |  2 ++
 drivers/net/can/usb/peak_usb/pcan_usb_core.h               |  2 ++
 drivers/net/can/usb/peak_usb/pcan_usb_fd.c                 | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/net/dsa/b53/b53_common.c                           | 37 +++++++++++++++++++++++++++++++++++--
 drivers/net/dsa/b53/b53_regs.h                             |  5 +++++
 drivers/net/ethernet/mellanox/mlx5/core/en.h               |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c    |  1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c          |  4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c            | 87 ++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 36 ++++++++++++++++++++++++------------
 drivers/net/ethernet/mellanox/mlx5/core/main.c             |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/uar.c              |  1 +
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c                 | 10 +++++-----
 drivers/net/ethernet/renesas/ravb_main.c                   |  7 ++++---
 drivers/net/ethernet/sfc/efx.h                             |  5 ++++-
 drivers/net/ethernet/sfc/workarounds.h                     |  1 +
 drivers/net/ethernet/toshiba/tc35815.c                     |  2 +-
 drivers/net/hyperv/hyperv_net.h                            |  1 -
 drivers/net/hyperv/netvsc.c                                |  9 ++++-----
 drivers/net/macsec.c                                       | 27 +++++++++++++++++++++------
 drivers/net/macvlan.c                                      | 11 ++++++++++-
 drivers/net/phy/micrel.c                                   | 11 -----------
 drivers/net/phy/phy.c                                      | 40 ++++++++++++++++++++++++++++++++++++----
 drivers/net/team/team.c                                    |  8 ++++++--
 drivers/net/usb/Kconfig                                    |  2 +-
 drivers/net/usb/hso.c                                      |  2 +-
 drivers/net/usb/plusb.c                                    | 15 +++++++++++++--
 include/linux/phy.h                                        |  1 +
 net/bridge/br_device.c                                     |  1 +
 net/bridge/br_if.c                                         |  1 -
 net/core/dev.c                                             |  3 +++
 net/ipv4/route.c                                           |  3 ++-
 net/ipv4/tcp_cong.c                                        | 11 +++--------
 net/ipv4/udp_offload.c                                     |  3 +++
 net/ipv6/addrconf.c                                        | 14 ++++++++++++--
 net/ipv6/af_inet6.c                                        |  6 ++++--
 net/ipv6/exthdrs.c                                         |  4 ++++
 net/ipv6/ip6_tunnel.c                                      | 34 ++++++++++++++++++----------------
 net/ipv6/ndisc.c                                           |  3 ++-
 net/ipv6/raw.c                                             |  3 +--
 net/packet/af_packet.c                                     |  2 ++
 net/tipc/socket.c                                          |  4 ++--
 44 files changed, 373 insertions(+), 141 deletions(-)


* Low speed MPLS to virtio-net
From: Алексей Болдырев @ 2017-04-26 19:15 UTC (permalink / raw)
  To: netdev

Started MPLS on the branch, and everything was fine. When I tried to run MPLS on a real network of virtual machines, throughput was very low:
root@containers:~# iperf3 -c 10.194.10.2 -B 10.194.10.1 -Z       
Connecting to host 10.194.10.2, port 5201
[  4] local 10.194.10.1 port 49533 connected to 10.194.10.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1018 KBytes  8.34 Mbits/sec  238   5.64 KBytes       
[  4]   1.00-2.00   sec  1.42 MBytes  11.9 Mbits/sec  373   1.41 KBytes       
[  4]   2.00-3.00   sec  1.43 MBytes  12.0 Mbits/sec  379   5.64 KBytes       
[  4]   3.00-4.00   sec  1.43 MBytes  12.0 Mbits/sec  376   5.64 KBytes       
[  4]   4.00-5.00   sec  1.41 MBytes  11.8 Mbits/sec  375   2.82 KBytes       
[  4]   5.00-6.00   sec  1.42 MBytes  11.9 Mbits/sec  376   2.82 KBytes       
[  4]   6.00-7.00   sec  1.42 MBytes  11.9 Mbits/sec  373   5.64 KBytes       
[  4]   7.00-8.00   sec  1.41 MBytes  11.8 Mbits/sec  372   5.64 KBytes       
[  4]   8.00-9.00   sec  1.42 MBytes  11.9 Mbits/sec  379   2.82 KBytes       
[  4]   9.00-10.00  sec  1.42 MBytes  11.9 Mbits/sec  373   5.64 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  13.8 MBytes  11.5 Mbits/sec  3614             sender
[  4]   0.00-10.00  sec  13.6 MBytes  11.4 Mbits/sec                  receiver

iperf Done.
root@containers:~# 
Here are the settings:
test0:
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo:1: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 10.194.10.1  netmask 255.255.255.255
        loop  txqueuelen 1000  (Local Loopback)

test0p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.194.1.50  netmask 255.255.255.0  broadcast 10.194.1.255
        inet6 fe80::b0a7:b1ff:fec1:3d5c  prefixlen 64  scopeid 0x20<link>
        ether b2:a7:b1:c1:3d:5c  txqueuelen 1000  (Ethernet)
        RX packets 19974  bytes 1410944 (1.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3726844  bytes 5236310466 (4.8 GiB)
        TX errors 0  dropped 3604 overruns 0  carrier 0  collisions 0

test1:
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo:1: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 10.194.10.2  netmask 255.255.255.255
        loop  txqueuelen 1000  (Local Loopback)

test1p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.194.1.51  netmask 255.255.255.0  broadcast 10.194.1.255
        inet6 fe80::5cc0:45ff:fe1a:9705  prefixlen 64  scopeid 0x20<link>
        ether 5e:c0:45:1a:97:05  txqueuelen 1000  (Ethernet)
        RX packets 2001923  bytes 2806406771 (2.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19907  bytes 1485150 (1.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Server configuration:
root@ne-vlezay80:~# ip -M r
100 via inet 10.194.1.50 dev vlan11 
101 via inet 10.194.1.51 dev vlan11 
root@ne-vlezay80:~# 
root@ne-vlezay80:~# ifconfig
eth0      Link encap:Ethernet  HWaddr 52:54:00:5d:81:90  
          inet addr:10.247.0.250  Bcast:10.247.0.255  Mask:255.255.255.0
          inet6 addr: fe80::5054:ff:fe5d:8190/64 Scope:Link
          inet6 addr: fd00:1002:1289:10::10/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:2500  Metric:1
          RX packets:7403 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4182 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:621871 (607.2 KiB)  TX bytes:445766 (435.3 KiB)

eth1      Link encap:Ethernet  HWaddr 52:54:00:0b:ff:2e  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          inet6 addr: fd00:104:1::1/64 Scope:Global
          inet6 addr: fe80::5054:ff:fe0b:ff2e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2500  Metric:1
          RX packets:2204837 errors:0 dropped:5 overruns:0 frame:0
          TX packets:2083636 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3073876412 (2.8 GiB)  TX bytes:2897017540 (2.6 GiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:75 errors:0 dropped:0 overruns:0 frame:0
          TX packets:75 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:6600 (6.4 KiB)  TX bytes:6600 (6.4 KiB)

servers   Link encap:Ethernet  HWaddr b2:c2:cf:9a:9c:00  
          UP RUNNING NOARP MASTER  MTU:65536  Metric:1
          RX packets:2259 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4062 errors:0 dropped:219 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:220368 (215.2 KiB)  TX bytes:340283 (332.3 KiB)

vlan10    Link encap:Ethernet  HWaddr 52:54:00:0b:ff:2e  
          inet addr:10.194.0.1  Bcast:10.194.0.255  Mask:255.255.255.0
          inet6 addr: 2a01:d0:c353:180::1/64 Scope:Global
          inet6 addr: fe80::5054:ff:fe0b:ff2e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2500  Metric:1
          RX packets:2260 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:85868 (83.8 KiB)  TX bytes:1322 (1.2 KiB)

vlan11    Link encap:Ethernet  HWaddr 52:54:00:0b:ff:2e  
          inet addr:10.194.1.1  Bcast:10.194.1.255  Mask:255.255.255.0
          inet6 addr: fe80::5054:ff:fe0b:ff2e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2500  Metric:1
          RX packets:2201655 errors:0 dropped:58 overruns:0 frame:0
          TX packets:2083391 errors:0 dropped:119523 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3034042872 (2.8 GiB)  TX bytes:2888672290 (2.6 GiB)


* Re: [PATCH net-next v8 2/3] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
From: Jamal Hadi Salim @ 2017-04-26 20:07 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: davem, xiyou.wangcong, eric.dumazet, netdev, Simon Horman,
	Benjamin LaHaise
In-Reply-To: <20170426135627.GI1867@nanopsycho.orion>

On 17-04-26 09:56 AM, Jiri Pirko wrote:
> Wed, Apr 26, 2017 at 03:14:38PM CEST, jhs@mojatatu.com wrote:
>> On 17-04-26 08:08 AM, Jiri Pirko wrote:

[..]

>> Jiri, what are you arguing about if you have done the math? ;->
>
> I can do 3*2*64. What I cannot do is to figure out the real performance
> impact.
>

Jiri, I do a lot of very large data dumping and setting towards the
kernel. You know that. It is why I even have these patches to begin
with.

The math should be convincing enough.
48B per rule extra for just MPLS in a filter rule. I haven't started
testing the overhead of flower but I do plan to use it - with about a
million rules for offloading. I will give you the numbers then.

I think we are at a stalemate.
You are not going to convince me to use an attribute with a
u8 for a bit flag when I can fit 32 of them in one attribute (with
the same cost). And I am not able to convince you that you are
wrong to put beauty first.

>> Again: You are looking at this from a manageability point of view which
>> is useful but not the only input into a design. If I can squeeze more
>> data without killing usability - I am all for it. It just doesn't
>> compute that it is ok to use a flag per attribute because it looks
>> beautiful.
>
> Hmm. Now that I'm thinking about it, why don't we have NLA_FLAGS with
> couple of helpers around it? It will be obvious what the attr is, all
> kernel code would use the same helpers. Would be nice.
>

I think to have flags at that level is useful but it
is a different hierarchy level. I am not sure the
"actions dump large messages" is a fit for that level.

cheers,
jamal
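
For illustration only: a userspace sketch of the size argument made in
this message. It assumes the usual netlink attribute layout, a 4-byte
header followed by a payload padded to a 4-byte boundary. The TOY_*
macros and toy_nla_total_size() are local stand-ins written for this
example, not the kernel definitions.

```c
#include <assert.h>

#define TOY_NLA_ALIGNTO	4u
#define TOY_NLA_ALIGN(len) \
	(((len) + TOY_NLA_ALIGNTO - 1u) & ~(TOY_NLA_ALIGNTO - 1u))
#define TOY_NLA_HDRLEN	4u	/* assumed attribute header size */

/* Bytes one attribute occupies on the wire, including padding. */
static unsigned int toy_nla_total_size(unsigned int payload)
{
	return TOY_NLA_ALIGN(TOY_NLA_HDRLEN + payload);
}

/* Cost of carrying nflags flags as one u8 attribute per flag. */
static unsigned int toy_cost_u8_per_flag(unsigned int nflags)
{
	return nflags * toy_nla_total_size(1);
}

/* Cost of carrying up to 32 flags as bits of a single u32 attribute. */
static unsigned int toy_cost_u32_bitmap(void)
{
	return toy_nla_total_size(4);
}
```

Under these assumptions a u8 attribute and a u32 attribute each take 8
bytes, so 32 separate u8 flags cost 256 bytes where one u32 bitmap costs 8.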


* Re: [PATCH net-next 3/6] bpf: bpf_progs stores all loaded programs
From: Daniel Borkmann @ 2017-04-26 20:44 UTC (permalink / raw)
  To: Hannes Frederic Sowa, netdev; +Cc: ast, jbenc, aconole
In-Reply-To: <20170426182419.14574-4-hannes@stressinduktion.org>

[ -daniel@iogearbox.com (wrong address) ]

On 04/26/2017 08:24 PM, Hannes Frederic Sowa wrote:
> We later want to give users a quick dump of what is possible with procfs,
> so store a list of all currently loaded bpf programs. Later this list
> will be printed in procfs.
>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
>   include/linux/filter.h |  4 ++--
>   kernel/bpf/core.c      | 51 +++++++++++++++++++++++---------------------------
>   kernel/bpf/syscall.c   |  4 ++--
>   3 files changed, 27 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 9a7786db14fa53..63624c619e371b 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -753,8 +753,8 @@ bpf_address_lookup(unsigned long addr, unsigned long *size,
>   	return ret;
>   }
>
> -void bpf_prog_kallsyms_add(struct bpf_prog *fp);
> -void bpf_prog_kallsyms_del(struct bpf_prog *fp);
> +void bpf_prog_link(struct bpf_prog *fp);
> +void bpf_prog_unlink(struct bpf_prog *fp);
>
>   #else /* CONFIG_BPF_JIT */
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 043f634ff58d87..2139118258cdf8 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -365,22 +365,6 @@ static struct latch_tree_root bpf_tree __cacheline_aligned;
>
>   int bpf_jit_kallsyms __read_mostly;
>
> -static void bpf_prog_ksym_node_add(struct bpf_prog_aux *aux)
> -{
> -	WARN_ON_ONCE(!list_empty(&aux->bpf_progs_head));
> -	list_add_tail_rcu(&aux->bpf_progs_head, &bpf_progs);
> -	latch_tree_insert(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops);
> -}
> -
> -static void bpf_prog_ksym_node_del(struct bpf_prog_aux *aux)
> -{
> -	if (list_empty(&aux->bpf_progs_head))
> -		return;
> -
> -	latch_tree_erase(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops);
> -	list_del_rcu(&aux->bpf_progs_head);
> -}
> -
>   static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp)
>   {
>   	return fp->jited && !bpf_prog_was_classic(fp);
> @@ -392,38 +376,45 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
>   	       fp->aux->bpf_progs_head.prev == LIST_POISON2;
>   }
>
> -void bpf_prog_kallsyms_add(struct bpf_prog *fp)
> +void bpf_prog_link(struct bpf_prog *fp)
>   {
> -	if (!bpf_prog_kallsyms_candidate(fp) ||
> -	    !capable(CAP_SYS_ADMIN))
> -		return;
> +	struct bpf_prog_aux *aux = fp->aux;
>
>   	spin_lock_bh(&bpf_lock);
> -	bpf_prog_ksym_node_add(fp->aux);
> +	list_add_tail_rcu(&aux->bpf_progs_head, &bpf_progs);
> +	if (bpf_prog_kallsyms_candidate(fp))
> +		latch_tree_insert(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops);

Hmm, this has the side-effect that it will hook up all progs
to kallsyms (I left out !capable(CAP_SYS_ADMIN) intentionally).

>   	spin_unlock_bh(&bpf_lock);
>   }
>
> -void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> +void bpf_prog_unlink(struct bpf_prog *fp)
>   {
> -	if (!bpf_prog_kallsyms_candidate(fp))
> -		return;
> +	struct bpf_prog_aux *aux = fp->aux;
>
>   	spin_lock_bh(&bpf_lock);
> -	bpf_prog_ksym_node_del(fp->aux);
> +	list_del_rcu(&aux->bpf_progs_head);
> +	if (bpf_prog_kallsyms_candidate(fp))
> +		latch_tree_erase(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops);
>   	spin_unlock_bh(&bpf_lock);
>   }
>
>   static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr)
>   {
>   	struct latch_tree_node *n;
> +	struct bpf_prog *prog;
>
>   	if (!bpf_jit_kallsyms_enabled())
>   		return NULL;
>
>   	n = latch_tree_find((void *)addr, &bpf_tree, &bpf_tree_ops);
> -	return n ?
> -	       container_of(n, struct bpf_prog_aux, ksym_tnode)->prog :
> -	       NULL;
> +	if (!n)
> +		return NULL;
> +
> +	prog = container_of(n, struct bpf_prog_aux, ksym_tnode)->prog;
> +	if (!prog->priv_cap_sys_admin)

Where is this bit defined?

If we return NULL on them anyway, why add them to the tree
in the first place? That just wastes resources on the traversal.

> +		return NULL;
> +
> +	return prog;
>   }
>
>   const char *__bpf_address_lookup(unsigned long addr, unsigned long *size,
> @@ -474,6 +465,10 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
>
>   	rcu_read_lock();
>   	list_for_each_entry_rcu(aux, &bpf_progs, bpf_progs_head) {
> +		if (!bpf_prog_kallsyms_candidate(aux->prog) ||
> +		    !aux->prog->priv_cap_sys_admin)

Same here.

> +			continue;
> +
>   		if (it++ != symnum)
>   			continue;
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 13642c73dca0b4..d61d1bd3e6fee6 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -664,7 +664,7 @@ void bpf_prog_put(struct bpf_prog *prog)
>   {
>   	if (atomic_dec_and_test(&prog->aux->refcnt)) {
>   		trace_bpf_prog_put_rcu(prog);
> -		bpf_prog_kallsyms_del(prog);
> +		bpf_prog_unlink(prog);
>   		call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
>   	}
>   }
> @@ -858,7 +858,7 @@ static int bpf_prog_load(union bpf_attr *attr)
>   		/* failed to allocate fd */
>   		goto free_used_maps;
>
> -	bpf_prog_kallsyms_add(prog);
> +	bpf_prog_link(prog);
>   	trace_bpf_prog_load(prog, err);
>   	return err;
>
>

^ permalink raw reply

* Re: [PATCH net-next] bpf: restore skb->sk before pskb_trim() call
From: Alexei Starovoitov @ 2017-04-26 20:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Andrey Konovalov, Willem de Bruijn
In-Reply-To: <1493222963.6453.77.camel@edumazet-glaptop3.roam.corp.google.com>

On Wed, Apr 26, 2017 at 09:09:23AM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> While testing a fix [1] in ___pskb_trim(), addressing the WARN_ON_ONCE()
> in skb_try_coalesce() reported by Andrey, I found that we had an skb
> with skb->sk set but no skb->destructor.
> 
> This invalidated heuristic found in commit 158f323b9868 ("net: adjust
> skb->truesize in pskb_expand_head()") and in cited patch.
> 
> Considering the BUG_ON(skb->sk) we have in skb_orphan(), we should
> restrain the temporary setting to a minimal section.
> 
> [1] https://patchwork.ozlabs.org/patch/755570/ 
>     net: adjust skb->truesize in ___pskb_trim()
> 
> Fixes: 8f917bba0042 ("bpf: pass sk to helper functions")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Andrey Konovalov <andreyknvl@google.com>

Ahh. Thanks for the fix.
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* [Patch net-next] ipv4: get rid of ip_ra_lock
From: Cong Wang @ 2017-04-26 20:55 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang

After commit 1215e51edad1 ("ipv4: fix a deadlock in ip_ra_control")
we always take RTNL lock for ip_ra_control() which is the only place
we update the list ip_ra_chain, so the ip_ra_lock is no longer needed,
we just need to disable BH there.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/ipv4/ip_sockglue.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 1d46d05..2923ea1 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -330,7 +330,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
    sent to multicast group to reach destination designated router.
  */
 struct ip_ra_chain __rcu *ip_ra_chain;
-static DEFINE_SPINLOCK(ip_ra_lock);
 
 
 static void ip_ra_destroy_rcu(struct rcu_head *head)
@@ -352,21 +351,21 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
 	new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
-	spin_lock_bh(&ip_ra_lock);
+	ASSERT_RTNL();
+	local_bh_disable();
 	for (rap = &ip_ra_chain;
-	     (ra = rcu_dereference_protected(*rap,
-			lockdep_is_held(&ip_ra_lock))) != NULL;
+	     (ra = rtnl_dereference(*rap)) != NULL;
 	     rap = &ra->next) {
 		if (ra->sk == sk) {
 			if (on) {
-				spin_unlock_bh(&ip_ra_lock);
+				local_bh_enable();
 				kfree(new_ra);
 				return -EADDRINUSE;
 			}
 			/* dont let ip_call_ra_chain() use sk again */
 			ra->sk = NULL;
 			RCU_INIT_POINTER(*rap, ra->next);
-			spin_unlock_bh(&ip_ra_lock);
+			local_bh_enable();
 
 			if (ra->destructor)
 				ra->destructor(sk);
@@ -381,7 +380,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 		}
 	}
 	if (!new_ra) {
-		spin_unlock_bh(&ip_ra_lock);
+		local_bh_enable();
 		return -ENOBUFS;
 	}
 	new_ra->sk = sk;
@@ -390,7 +389,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 	RCU_INIT_POINTER(new_ra->next, ra);
 	rcu_assign_pointer(*rap, new_ra);
 	sock_hold(sk);
-	spin_unlock_bh(&ip_ra_lock);
+	local_bh_enable();
 
 	return 0;
 }
-- 
2.5.5
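
For readers unfamiliar with the traversal idiom in ip_ra_control(), here is
a minimal userspace sketch of the same pointer-to-link walk. All names
(demo_ra, demo_ra_control) are hypothetical, the socket is reduced to an
int, and locking and RCU are left out entirely: the caller is assumed to
hold whatever lock serializes writers, which is the role RTNL plays in the
patch above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ip_ra_chain. */
struct demo_ra {
	struct demo_ra *next;
	int sk;			/* stand-in for the socket pointer */
};

/*
 * Add (on != 0) or remove (on == 0) the entry for sk.
 * The caller must hold the lock that serializes writers.
 */
static int demo_ra_control(struct demo_ra **chain, int sk, int on)
{
	struct demo_ra **rap, *ra, *new_ra;

	/* rap always points at the link field that leads to ra. */
	for (rap = chain; (ra = *rap) != NULL; rap = &ra->next) {
		if (ra->sk != sk)
			continue;
		if (on)
			return -1;	/* already registered */
		*rap = ra->next;	/* unlink without tracking a "prev" node */
		free(ra);
		return 0;
	}
	if (!on)
		return -2;		/* nothing to remove */

	new_ra = malloc(sizeof(*new_ra));
	if (!new_ra)
		return -3;
	new_ra->sk = sk;
	new_ra->next = NULL;
	*rap = new_ra;			/* rap points at the tail link here */
	return 0;
}
```

Because the loop keeps the address of the link rather than the previous
node, removing the head and removing an inner entry take the same code path.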

^ permalink raw reply related

* Re: xdp_redirect ifindex vs port. Was: best API for returning/setting egress port?
From: Andy Gospodarek @ 2017-04-26 20:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Jesper Dangaard Brouer, Alexei Starovoitov,
	Daniel Borkmann, Daniel Borkmann, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org
In-Reply-To: <c9b3ff9b-8938-3274-da29-c04f9da2a314@fb.com>

On Wed, Apr 26, 2017 at 10:58:45AM -0700, Alexei Starovoitov wrote:
> On 4/26/17 9:35 AM, John Fastabend wrote:
> > 
> > > As Alexei also mentioned before, ifindex vs port makes no real
> > > difference seen from the bpf program side.  It is userspace's
> > > responsibility to add ifindex/port's to the bpf-maps, according to how
> > > the bpf program "policy" want to "connect" these ports.  The
> > > port-table system add one extra step, of also adding this port to the
> > > port-table (which lives inside the kernel).
> > > 
> > 
> > I'm not sure I understand the "lives inside the kernel" bit. I assumed
> > the 'map' should be a bpf map and behave like any other bpf map.
> > 
> > I wanted a new map to be defined, something like this from the bpf programmer
> > side.
> > 
> > > struct bpf_map_def SEC("maps") port_table = {
> > 	.type = BPF_MAP_TYPE_PORT_CONNECTION,
> > 	.key_size = sizeof(u32),
> > 	.value_size = BPF_PORT_CONNECTION_SIZE,
> > 	.max_entries = 256,
> > };
> 
> I like the idea.
> We have prog_array, perf_event_array, cgroup_array map specializations.
> This one can be new netdev_array with some new bpf_redirect-like helper
> accessing it.
> 
> > > When loading the XDP program, we also need to pass along a port table
> > > "id" this XDP program is associated with (and if it doesn't exists you
> > > create it).  And your userspace "control-plane" application also need
> > > to know this port table "id", when adding a new port.
> > 
> > So the user space application that is loading the program also needs
> > to handle this map. This seems correct to me. But I don't see the
> > value in making some new port table when we already have well understood
> > framework for maps.
> 
> +1
> 
> > > 
> > > The concept of having multiple port tables is key.  As this implies we
> > > can have several simultaneous "data-planes" that is *isolated* from
> > > each-other.  Think about how network-namespaces/containers want
> > > isolation. A subtle thing I'm afraid to mention, is that oppose to the
> > > ifindex model, a port table with mapping to a net_device pointer, would
> > > allow (faster) delivery into the container's inner net_device, which
> > > sort of violates the isolation, but I would argue it is not a problem
> > > as this net_device pointer could only be added from a process within the
> > > namespace.  I like this feature, but it could easily be disallowed via
> > > port insertion-time validation.
> > > 
> > 
> > I think the above optimization should be allowed. And agree multiple port
> > tables (maps?) is needed. Again all this points to using standard maps
> > logic in my mind. For permissions and different domains, which I think
> > you were starting to touch on, it looks like we could extend the pinning API.
> > At the moment it does an inode_permission(inode, MAY_WRITE) check but I
> > presume this could be extended. None of this would be needed in v1 and
> > could be added subsequently. read-only maps seems doable.
> 
> this is great idea. Once BPF_MAP_TYPE_NETDEV_ARRAY is populated
> the user space can make it readonly to prevent further changes.
> 
> From user space it can be done similar to perf_events/cgroups as well.
> bpf_map_update_elem(&netdev_array, &port_num, &ifindex)
> should work.
> For bpf_map_lookup_elem() from such netdev_array we can return
> ifindex back.
> The bpf_map_show_fdinfo() can be customized as well to pretty print
> ifindexes of netdevs stored in there.
> 

I agree with both of you on all of these points.  Having the port
redirection in a new type of map and/or array seems like the way to go.

I understood Jesper's perspective when thinking about a way to pass a
port-table id down, but I think the idea that the userspace loader code
defining the maps is going to be the one making this link is the right
idea and handling things like ifindex changes (rather than identifiers
that perform lookups in other tables) is going to have to be yet another
exercise left up to the...user.  :-)
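
To make the port-to-ifindex idea discussed above concrete, here is a rough
userspace model of what such an array could look like. This is only a
sketch: the netdev_array map type did not exist at the time of this thread,
and every name below (port_table, port_table_update, port_table_lookup,
port_table_freeze) is hypothetical. The 256-entry size mirrors the
.max_entries value in John's example, and the freeze step models the
"make it readonly once populated" suggestion.

```c
#include <stddef.h>
#include <stdint.h>

#define PORT_TABLE_MAX_ENTRIES 256	/* mirrors .max_entries in the quoted map */

/* Hypothetical userspace model: port number -> ifindex. */
struct port_table {
	uint32_t ifindex[PORT_TABLE_MAX_ENTRIES];	/* 0 means "slot empty" */
	int read_only;					/* set once population is done */
};

/* Models bpf_map_update_elem(&netdev_array, &port_num, &ifindex). */
static int port_table_update(struct port_table *t, uint32_t port,
			     uint32_t ifindex)
{
	if (t->read_only)
		return -1;	/* table was frozen by the control plane */
	if (port >= PORT_TABLE_MAX_ENTRIES || ifindex == 0)
		return -1;
	t->ifindex[port] = ifindex;
	return 0;
}

/* Models bpf_map_lookup_elem(): returns the ifindex, or 0 if unset. */
static uint32_t port_table_lookup(const struct port_table *t, uint32_t port)
{
	if (port >= PORT_TABLE_MAX_ENTRIES)
		return 0;
	return t->ifindex[port];
}

/* Models switching the map to read-only after the initial population. */
static void port_table_freeze(struct port_table *t)
{
	t->read_only = 1;
}
```

In this model the program side only ever sees port numbers; keeping the
ifindex values current when devices come and go stays with the loader,
as noted above.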

^ permalink raw reply

* Re: [PATCH net-next 4/6] bpf: track if the bpf program was loaded with SYS_ADMIN capabilities
From: Daniel Borkmann @ 2017-04-26 21:04 UTC (permalink / raw)
  To: Hannes Frederic Sowa, netdev; +Cc: ast, daniel, jbenc, aconole
In-Reply-To: <20170426182419.14574-5-hannes@stressinduktion.org>

On 04/26/2017 08:24 PM, Hannes Frederic Sowa wrote:
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Ahh, looks like this got swapped with 3/6.

> ---
>   include/linux/filter.h | 6 ++++--
>   kernel/bpf/core.c      | 4 +++-
>   kernel/bpf/syscall.c   | 7 ++++---
>   kernel/bpf/verifier.c  | 4 ++--
>   net/core/filter.c      | 6 +++---
>   5 files changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 63624c619e371b..635311f57bf24f 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -413,7 +413,8 @@ struct bpf_prog {
>   				locked:1,	/* Program image locked? */
>   				gpl_compatible:1, /* Is filter GPL compatible? */
>   				cb_access:1,	/* Is control block accessed? */
> -				dst_needed:1;	/* Do we need dst entry? */
> +				dst_needed:1,	/* Do we need dst entry? */
> +				priv_cap_sys_admin:1; /* Where we loaded as sys_admin? */
>   	kmemcheck_bitfield_end(meta);
>   	enum bpf_prog_type	type;		/* Type of BPF program */
[...]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 6f8b6ed690be93..24c9dac374770f 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3488,7 +3488,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
>   	if (ret < 0)
>   		goto skip_full_check;
>
> -	env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
> +	env->allow_ptr_leaks = env->prog->priv_cap_sys_admin;
>
>   	ret = do_check(env);
>
> @@ -3589,7 +3589,7 @@ int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops,
>   	if (ret < 0)
>   		goto skip_full_check;
>
> -	env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
> +	env->allow_ptr_leaks = prog->priv_cap_sys_admin;
>
>   	ret = do_check(env);
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9a37860a80fc78..dc020d40bb770a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1100,7 +1100,7 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog)
>   	if (!bpf_check_basics_ok(fprog->filter, fprog->len))
>   		return -EINVAL;
>
> -	fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0);
> +	fp = bpf_prog_alloc(bpf_prog_size(fprog->len), 0, false);
>   	if (!fp)
>   		return -ENOMEM;
>

Did you check that transferring allow_ptr_leaks doesn't have a side
effect on the nfp JIT? I believe it can also do cbpf migrations to
a certain extent.

^ permalink raw reply

* Re: [PATCH net-next 4/6] bpf: track if the bpf program was loaded with SYS_ADMIN capabilities
From: Alexei Starovoitov @ 2017-04-26 21:08 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev, ast, daniel, jbenc, aconole
In-Reply-To: <20170426182419.14574-5-hannes@stressinduktion.org>

On Wed, Apr 26, 2017 at 08:24:17PM +0200, Hannes Frederic Sowa wrote:
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
>  include/linux/filter.h | 6 ++++--
>  kernel/bpf/core.c      | 4 +++-
>  kernel/bpf/syscall.c   | 7 ++++---
>  kernel/bpf/verifier.c  | 4 ++--
>  net/core/filter.c      | 6 +++---
>  5 files changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 63624c619e371b..635311f57bf24f 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -413,7 +413,8 @@ struct bpf_prog {
>  				locked:1,	/* Program image locked? */
>  				gpl_compatible:1, /* Is filter GPL compatible? */
>  				cb_access:1,	/* Is control block accessed? */
> -				dst_needed:1;	/* Do we need dst entry? */
> +				dst_needed:1,	/* Do we need dst entry? */
> +				priv_cap_sys_admin:1; /* Where we loaded as sys_admin? */

This is a no-go.
You didn't provide any explanation whatsoever why you want to see this boolean value.

^ permalink raw reply

* Re: [PATCH net-next 6/6] bpf: show bpf programs
From: Alexei Starovoitov @ 2017-04-26 21:25 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: netdev, ast, daniel, jbenc, aconole, Martin KaFai Lau
In-Reply-To: <20170426182419.14574-7-hannes@stressinduktion.org>

On Wed, Apr 26, 2017 at 08:24:19PM +0200, Hannes Frederic Sowa wrote:
>  
> +static const char *bpf_type_string(enum bpf_prog_type type)
> +{
> +	static const char *bpf_type_names[] = {
> +#define X(type) #type
> +		BPF_PROG_TYPES
> +#undef X
> +	};
> +
> +	if (type >= ARRAY_SIZE(bpf_type_names))
> +		return "<unknown>";
> +
> +	return bpf_type_names[type];
> +}
> +
>  static int ebpf_proc_show(struct seq_file *s, void *v)
>  {
> +	struct bpf_prog *prog;
> +	struct bpf_prog_aux *aux;
> +	char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
> +
>  	if (v == SEQ_START_TOKEN) {
> -		seq_printf(s, "# tag\n");
> +		seq_printf(s, "# tag\t\t\ttype\t\t\truntime\tcap\tmemlock\n");
>  		return 0;
>  	}
>  
> +	aux = v;
> +	prog = aux->prog;
> +
> +	bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
> +	seq_printf(s, "%s\t%s\t%s\t%s\t%llu\n", prog_tag,
> +		   bpf_type_string(prog->type),
> +		   prog->jited ? "jit" : "int",
> +		   prog->priv_cap_sys_admin ? "priv" : "unpriv",
> +		   prog->pages * 1ULL << PAGE_SHIFT);

As I said several times already I'm strongly against procfs
style of exposing information about the programs.
I don't want this to become debugfs for bpf.
Maintaining the list of all loaded programs is fine
and we need a way to iterate through them, but procfs
is obviously not the interface to do that.
Programs/maps are binary whereas any fs interface is text.
It also doesn't scale with a large number of programs/maps.
I prefer Daniel's suggestion on adding 'get_next' like API.
Also would be good if you can wait for Martin to finish his
prog->handle/id patches. Then user space will be able
to iterate through all the progs/maps and fetch all info about
them through syscall in extensible way.
And you wouldn't need to abuse kallsyms list for different purpose.

^ permalink raw reply

* Re: [PATCH net-next 6/6] bpf: show bpf programs
From: Daniel Borkmann @ 2017-04-26 21:35 UTC (permalink / raw)
  To: Hannes Frederic Sowa, netdev; +Cc: ast, daniel, jbenc, aconole
In-Reply-To: <20170426182419.14574-7-hannes@stressinduktion.org>

On 04/26/2017 08:24 PM, Hannes Frederic Sowa wrote:
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
>   include/uapi/linux/bpf.h | 32 +++++++++++++++++++-------------
>   kernel/bpf/core.c        | 30 +++++++++++++++++++++++++++++-
>   2 files changed, 48 insertions(+), 14 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e553529929f683..d6506e320953d5 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -101,20 +101,26 @@ enum bpf_map_type {
>   	BPF_MAP_TYPE_HASH_OF_MAPS,
>   };
>
> +#define BPF_PROG_TYPES			\
> +	X(BPF_PROG_TYPE_UNSPEC),	\
> +	X(BPF_PROG_TYPE_SOCKET_FILTER),	\
> +	X(BPF_PROG_TYPE_KPROBE),	\
> +	X(BPF_PROG_TYPE_SCHED_CLS),	\
> +	X(BPF_PROG_TYPE_SCHED_ACT),	\
> +	X(BPF_PROG_TYPE_TRACEPOINT),	\
> +	X(BPF_PROG_TYPE_XDP),		\
> +	X(BPF_PROG_TYPE_PERF_EVENT),	\
> +	X(BPF_PROG_TYPE_CGROUP_SKB),	\
> +	X(BPF_PROG_TYPE_CGROUP_SOCK),	\
> +	X(BPF_PROG_TYPE_LWT_IN),	\
> +	X(BPF_PROG_TYPE_LWT_OUT),	\
> +	X(BPF_PROG_TYPE_LWT_XMIT),
> +
> +
>   enum bpf_prog_type {
> -	BPF_PROG_TYPE_UNSPEC,
> -	BPF_PROG_TYPE_SOCKET_FILTER,
> -	BPF_PROG_TYPE_KPROBE,
> -	BPF_PROG_TYPE_SCHED_CLS,
> -	BPF_PROG_TYPE_SCHED_ACT,
> -	BPF_PROG_TYPE_TRACEPOINT,
> -	BPF_PROG_TYPE_XDP,
> -	BPF_PROG_TYPE_PERF_EVENT,
> -	BPF_PROG_TYPE_CGROUP_SKB,
> -	BPF_PROG_TYPE_CGROUP_SOCK,
> -	BPF_PROG_TYPE_LWT_IN,
> -	BPF_PROG_TYPE_LWT_OUT,
> -	BPF_PROG_TYPE_LWT_XMIT,
> +#define X(type) type

Defining X in uapi could clash easily with other headers e.g.
from application side.
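
One possible way around the clash, shown here only as a sketch with made-up
names (DEMO_PROG_TYPES, DEMO_ENUM_ENTRY, DEMO_NAME_ENTRY, demo_type_string),
is to pass the per-entry macro in as a parameter so that nothing called X is
ever defined in a shared header. The list is shortened for brevity.

```c
#include <stddef.h>

/*
 * Hypothetical list: the macro to apply is a parameter, so no global
 * macro named X leaks out of the header.
 */
#define DEMO_PROG_TYPES(fn)			\
	fn(DEMO_PROG_TYPE_UNSPEC)		\
	fn(DEMO_PROG_TYPE_SOCKET_FILTER)	\
	fn(DEMO_PROG_TYPE_KPROBE)		\
	fn(DEMO_PROG_TYPE_XDP)

/* Expand each entry to an enumerator. */
#define DEMO_ENUM_ENTRY(name) name,
enum demo_prog_type {
	DEMO_PROG_TYPES(DEMO_ENUM_ENTRY)
};
#undef DEMO_ENUM_ENTRY

/* Expand each entry to its name as a string, in the same order. */
static const char *demo_type_string(enum demo_prog_type type)
{
#define DEMO_NAME_ENTRY(name) #name,
	static const char *const names[] = {
		DEMO_PROG_TYPES(DEMO_NAME_ENTRY)
	};
#undef DEMO_NAME_ENTRY

	if ((size_t)type >= sizeof(names) / sizeof(names[0]))
		return "<unknown>";
	return names[type];
}
```

The helper macros are namespaced and undefined right after use, so the only
name an application header could collide with is the list macro itself.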

> +	BPF_PROG_TYPES
> +#undef X
>   };
>
>   enum bpf_attach_type {
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 3ba175a24e971a..685c1d0f31e029 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -536,13 +536,41 @@ static void ebpf_proc_stop(struct seq_file *s, void *v)
>   	rcu_read_unlock();
>   }
>
> +static const char *bpf_type_string(enum bpf_prog_type type)
> +{
> +	static const char *bpf_type_names[] = {
> +#define X(type) #type
> +		BPF_PROG_TYPES
> +#undef X
> +	};
> +
> +	if (type >= ARRAY_SIZE(bpf_type_names))
> +		return "<unknown>";
> +
> +	return bpf_type_names[type];
> +}
> +
>   static int ebpf_proc_show(struct seq_file *s, void *v)
>   {
> +	struct bpf_prog *prog;
> +	struct bpf_prog_aux *aux;
> +	char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
> +
>   	if (v == SEQ_START_TOKEN) {
> -		seq_printf(s, "# tag\n");
> +		seq_printf(s, "# tag\t\t\ttype\t\t\truntime\tcap\tmemlock\n");
>   		return 0;
>   	}
>
> +	aux = v;
> +	prog = aux->prog;
> +
> +	bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
> +	seq_printf(s, "%s\t%s\t%s\t%s\t%llu\n", prog_tag,
> +		   bpf_type_string(prog->type),
> +		   prog->jited ? "jit" : "int",
> +		   prog->priv_cap_sys_admin ? "priv" : "unpriv",
> +		   prog->pages * 1ULL << PAGE_SHIFT);

Yeah, so that would be quite similar to what we dump in
bpf_prog_show_fdinfo() modulo the priv bit.

I generally agree that a facility for dumping all progs is needed
and it was also on the TODO list after the bpf(2) cmd for dumping
program insns back to user space.

I think the procfs interface has pros and cons: the upside is that
you can use it with tools like cat to inspect it, but what you still
cannot do is to say that you want to see the prog insns for, say,
prog #4 from that list. If we could iterate over that list through fds
via bpf(2) syscall, you could i) present the same info you have above
via fdinfo already and ii) also dump the BPF insns from that specific
program through a BPF_PROG_DUMP bpf(2) command. Once that dump also
supports maps in progs, you could go further and fetch related map
fds for inspection, etc.

Such an option of iterating through that list would need a new BPF syscall
cmd aka BPF_PROG_GET_NEXT which returns the first prog from the list
and you would walk the next one by passing the current fd, which can
later also be closed as not needed anymore. We could restrict that
dump to capable(CAP_SYS_ADMIN), and the kernel tree would need to
ship a tool e.g. under tools/bpf/ that can be used for inspection.
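
The iteration pattern described above could look roughly like the following
userspace model. It is a sketch under stated assumptions, not the proposed
syscall: programs are identified by an ascending integer id instead of an
fd, the registry is a plain sorted array, and all names (demo_prog,
demo_registry, demo_prog_get_next, demo_total_pages) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for one loaded program. */
struct demo_prog {
	int id;			/* non-negative, ascending within the registry */
	unsigned int pages;	/* pages charged to the program */
};

/* Hypothetical registry: an array of programs sorted by id. */
struct demo_registry {
	const struct demo_prog *progs;
	size_t count;
};

/*
 * Return the first program whose id is greater than cur_id, or NULL
 * when the end of the list is reached. cur_id = -1 yields the first one.
 */
static const struct demo_prog *
demo_prog_get_next(const struct demo_registry *reg, int cur_id)
{
	size_t i;

	for (i = 0; i < reg->count; i++) {
		if (reg->progs[i].id > cur_id)
			return &reg->progs[i];
	}
	return NULL;
}

/* Walk the whole registry, summing page counts as per-program work. */
static unsigned long demo_total_pages(const struct demo_registry *reg)
{
	const struct demo_prog *p;
	unsigned long total = 0;
	int cur = -1;

	while ((p = demo_prog_get_next(reg, cur)) != NULL) {
		total += p->pages;
		cur = p->id;
	}
	return total;
}
```

Asking for "the next entry after the one I hold" rather than "entry number
N" means the walk still makes progress if the current program goes away
between two calls.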

> +
>   	return 0;
>   }
>
>

^ permalink raw reply

* [PATCH 1/2] wcn36xx: Pass used skb to ieee80211_tx_status()
From: Bjorn Andersson @ 2017-04-26 22:04 UTC (permalink / raw)
  To: Eugene Krasnikov, Kalle Valo
  Cc: Andy Gross, David Brown, devicetree, linux-arm-kernel,
	linux-arm-msm, linux-kernel, linux-soc, linux-wireless, netdev,
	wcn36xx, Nicolas Dechesne

As the tx skbs are collected they should be passed to
ieee80211_tx_status() rather than ieee80211_free_txskb(), as the former
will take care of monitoring and LED triggers while the latter will
consider the skb dropped.

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
---
 drivers/net/wireless/ath/wcn36xx/dxe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ath/wcn36xx/dxe.c b/drivers/net/wireless/ath/wcn36xx/dxe.c
index 87dfdaf9044c..938b7bd733cf 100644
--- a/drivers/net/wireless/ath/wcn36xx/dxe.c
+++ b/drivers/net/wireless/ath/wcn36xx/dxe.c
@@ -371,7 +371,7 @@ static void reap_tx_dxes(struct wcn36xx *wcn, struct wcn36xx_dxe_ch *ch)
 			info = IEEE80211_SKB_CB(ctl->skb);
 			if (!(info->flags & IEEE80211_TX_CTL_REQ_TX_STATUS)) {
 				/* Keep frame until TX status comes */
-				ieee80211_free_txskb(wcn->hw, ctl->skb);
+				ieee80211_tx_status(wcn->hw, ctl->skb);
 			}
 			spin_lock(&ctl->skb_lock);
 			if (wcn->queues_stopped) {
-- 
2.12.0


^ permalink raw reply related

* [PATCH 2/2] arm64: dts: apq8016-sbc: Correct WLAN LED default-trigger
From: Bjorn Andersson @ 2017-04-26 22:04 UTC (permalink / raw)
  To: Andy Gross, David Brown
  Cc: devicetree, linux-arm-msm, linux-wireless, linux-kernel,
	Kalle Valo, netdev, Nicolas Dechesne, wcn36xx, linux-soc,
	Eugene Krasnikov, linux-arm-kernel
In-Reply-To: <20170426220444.10539-1-bjorn.andersson@linaro.org>

The TX status trigger of the wlan interface is named phy0tx, so this
updates the default-trigger for the WLAN LED to use that instead.

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
---

Note that without patch 1/2 this trigger does not fire - but there's also no
harm in picking the two patches through separate trees.

 arch/arm64/boot/dts/qcom/apq8016-sbc.dtsi | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/boot/dts/qcom/apq8016-sbc.dtsi b/arch/arm64/boot/dts/qcom/apq8016-sbc.dtsi
index 5d83b02b7c4a..21a8f5ce8955 100644
--- a/arch/arm64/boot/dts/qcom/apq8016-sbc.dtsi
+++ b/arch/arm64/boot/dts/qcom/apq8016-sbc.dtsi
@@ -178,7 +178,7 @@
 			led@5 {
 				label = "apq8016-sbc:yellow:wlan";
 				gpios = <&pm8916_mpps 2 GPIO_ACTIVE_HIGH>;
-				linux,default-trigger = "wlan";
+				linux,default-trigger = "phy0tx";
 				default-state = "off";
 			};
 
-- 
2.12.0

^ permalink raw reply related

* Re: [PATCH 4/7] ixgbe: use pcie_flr instead of duplicating it
From: Jeff Kirsher @ 2017-04-26 22:09 UTC (permalink / raw)
  To: Christoph Hellwig, Bjorn Helgaas, Giovanni Cabiddu,
	Salvatore Benedetto, Mike Marciniszyn, Dennis Dalessandro,
	Derek Chickles, Satanand Burla, Felix Manlunas, Raghu Vatsavayi
  Cc: linux-pci, qat-linux, linux-crypto, linux-rdma, netdev,
	linux-kernel
In-Reply-To: <20170413145339.20186-5-hch@lst.de>

On Thu, 2017-04-13 at 16:53 +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 16 ++--------------
>  1 file changed, 2 insertions(+), 14 deletions(-)

Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Sorry for the late ACK.

^ permalink raw reply
