Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v2 1/2] net: stmmac: Rework coalesce timer and fix multi-queue races
From: Jose Abreu @ 2018-09-12 14:23 UTC (permalink / raw)
  To: Florian Fainelli, Jose Abreu, netdev, Tal Gilboa
  Cc: Jerome Brunet, Martin Blumenstingl, David S. Miller, Joao Pinto,
	Giuseppe Cavallaro, Alexandre Torgue
In-Reply-To: <f0288def-ecf3-851b-fe81-12f0f79a061d@gmail.com>

Hi Florian,

Thanks for your input.

On 10-09-2018 20:22, Florian Fainelli wrote:
> On 09/10/2018 02:14 AM, Jose Abreu wrote:
>> This follows David Miller advice and tries to fix coalesce timer in
>> multi-queue scenarios.
>>
>> We are now using per-queue coalesce values and per-queue TX timer.
>>
>> Coalesce timer default values was changed to 1ms and the coalesce frames
>> to 25.
>>
>> Tested in B2B setup between XGMAC2 and GMAC5.
> Why not revert the entire features for this merge window and work on
> getting it to work over the next weeks/merge windows?

It was already reverted but the performance drops a little bit
(not that much but I'm trying to fix it).

>
> The idea of using a timer to coalesce TX path when there is not a HW
> timer is a good idea and if this is made robust enough, you could even
> promote that as being a network stack library/feature that could be used
> by other drivers. In fact, this could be a great addition to the net DIM
> library (Tal, what do you think?)
>
> Here's a quick drive by review of things that appear wrong in the
> current driver (without your patches):
>
> - in stmmac_xmit(), in case we hit the !is_jumbo branch and we fail the
> DMA mapping, there is no timer cancellation, don't we want to abort the
> whole transmission?

I don't think this is needed because then tx pointer will not
advance and in stmmac_tx_clean we just won't perform any work.
Besides, we can have a pending timer from previous packets
running so canceling it can cause some problems.

>
> - stmmac_tx_clean() should probably use netif_lock_bh() to guard against
> the timer (soft IRQ context) and the the NAPI context (also soft IRQ)
> running in parallel on two different CPUs. This may not explain all
> problems, but these two things are fundamentally exclusive, because the
> timer is meant to emulate the interrupt after N packets, while NAPI
> executes when such a thing did actually occur

Ok, and now I'm also using __netif_tx_lock_bh(queue) to just lock
per queue instead of the whole TX.

>
> - stmmac_poll() should cancel pending timer(s) if it was able to reclaim
> packets, likewise stmmac_tx_timer() should re-enable TX interrupts if it
> reclaimed packets, since TX interrupts could have been left disabled
> from a prior NAPI run. These could be considered optimizations, since
> you could leave the TX timer running all the time, just adjust the
> deadline (based on line rate, MTU, IPG, number of fragments and their
> respective length), worst case, both NAPI and the timer clean up your TX
> ring, so you should always have room to push more packets

In next version I'm dropping the direct call to stmmac_tx_clean()
in the timer function and just scheduling NAPI instead.

Thanks and Best Regards,
Jose Miguel Abreu

^ permalink raw reply

* Re: [RFC v2 1/2] netlink: add NLA_REJECT policy type
From: Michal Kubecek @ 2018-09-12 19:29 UTC (permalink / raw)
  To: Johannes Berg
  Cc: David Miller, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1536777285.3678.28.camel-cdvu00un1VgdHxzADdlk8Q@public.gmane.org>

On Wed, Sep 12, 2018 at 08:34:45PM +0200, Johannes Berg wrote:
> It wouldn't be hard to reject attributes that are higher than maxtype -
> we already pass that to nla_parse() wherever we call it, but we'd have
> to find a way to make it optional I guess, for compatibility reasons.
> Perhaps with a warning, like attribute validation. For genetlink, a flag
> in the family (something like "strict attribute validation") would be
> easy, but for "netlink proper" we have a lot of nlmsg_parse() calls to
> patch, and/or replace by nlmsg_parse_strict().
> 
> I guess we should
> 
> 1) implement nlmsg_parse_strict() for those new things that want it
>    strictly - greenfield type stuff that doesn't need to work with
>    existing applications
> 
> 2) add a warning to nlmsg_parse() when a too high attribute is
>    encountered
> 
> 3) eventually replace nlmsg_parse() calls by nlmsg_parse_strict() and
>    see what breaks? :-) We won't be able to rely on that any time soon
>    though (unless userspace first checks with a guaranteed rejected
>    attribute, e.g. one that has NLA_REJECT, perhaps the u64 pad
>    attributes could be marked such since the kernel can't assume
>    alignment anyway)

I'm not so sure we (eventually) want to reject unknown attributes
everywhere. I don't have any concrete example in mind but I think there
will be use cases where we want to ignore unrecognized attributes
(probably per parse call). But it makes sense to require caller to
explicitely declare this is the case.

> While we're talking about wishlist, I'm also toying with the idea of
> having some sort of generic mechanism to convert netlink attributes
> to/from structs, for internal kernel representation; so far though I
> haven't been able to come up with anything useful.

I was also thinking about something like this. One motivation was to
design extensible version of ethtool_ops, the other was allowing complex
data types (structures, arrays) for ethtool tunables. But I have nothing
more than a vague idea so far.

Michal Kubecek

^ permalink raw reply

* Re: [net-next, v2, 1/2] net: stmmac: Rework coalesce timer and fix multi-queue races
From: Jose Abreu @ 2018-09-12 14:25 UTC (permalink / raw)
  To: Neil Armstrong, Jose Abreu, netdev
  Cc: Jerome Brunet, Martin Blumenstingl, David S. Miller, Joao Pinto,
	Giuseppe Cavallaro, Alexandre Torgue
In-Reply-To: <949c5c50-eb73-c02f-14a7-c3956d4847f2@baylibre.com>

Hi Neil,

On 12-09-2018 14:50, Neil Armstrong wrote:
> Hi Jose,
>
> On 11/09/2018 10:17, Jose Abreu wrote:
>> On 10-09-2018 19:15, Neil Armstrong wrote:
>>> RX is still ok but now TX fails almost immediately...
>>>
>>> With 100ms report :
>>>
>>> $ iperf3 -c 192.168.1.47 -t 0 -p 5202 -R -i 0.1
>>> Connecting to host 192.168.1.47, port 5202
>>> Reverse mode, remote host 192.168.1.47 is sending
>>> [  4] local 192.168.1.45 port 45900 connected to 192.168.1.47 port 5202
>>> [ ID] Interval           Transfer     Bandwidth
>>> [  4]   0.00-0.10   sec  10.9 MBytes   913 Mbits/sec
>>> [  4]   0.10-0.20   sec  11.0 MBytes   923 Mbits/sec
>>> [  4]   0.20-0.30   sec  6.34 MBytes   532 Mbits/sec
>>> [  4]   0.30-0.40   sec  0.00 Bytes  0.00 bits/sec
>>> [  4]   0.40-0.50   sec  0.00 Bytes  0.00 bits/sec
>>> [  4]   0.50-0.60   sec  0.00 Bytes  0.00 bits/sec
>>> [  4]   0.60-0.70   sec  0.00 Bytes  0.00 bits/sec
>>> [  4]   0.70-0.80   sec  0.00 Bytes  0.00 bits/sec
>>> [  4]   0.80-0.90   sec  0.00 Bytes  0.00 bits/sec
>>> [  4]   0.90-1.00   sec  0.00 Bytes  0.00 bits/sec
>>> [  4]   1.00-1.10   sec  0.00 Bytes  0.00 bits/sec
>>> ^C[  4]   1.10-1.10   sec  0.00 Bytes  0.00 bits/sec
>>> - - - - - - - - - - - - - - - - - - - - - - - - -
>>> [ ID] Interval           Transfer     Bandwidth
>>> [  4]   0.00-1.10   sec  0.00 Bytes  0.00 bits/sec                  sender
>>> [  4]   0.00-1.10   sec  28.2 MBytes   214 Mbits/sec                  receiver
>>> iperf3: interrupt - the client has terminated
>>>
>>> Neil
>> Ok, here goes another incremental patch. If this doesn't work can
>> you please send me a link to the spec of the board you are using ?
> Sorry for the delay...
>
> Not better, sorry.
>
> $ iperf3 -c 10.1.3.201 -p 5202 -R -t 0
> Connecting to host 10.1.3.201, port 5202
> Reverse mode, remote host 10.1.3.201 is sending
> [  4] local 10.1.2.12 port 60612 connected to 10.1.3.201 port 5202
> [ ID] Interval           Transfer     Bandwidth
> [  4]   0.00-1.00   sec   110 MBytes   920 Mbits/sec
> [  4]   1.00-2.00   sec   110 MBytes   926 Mbits/sec
> [  4]   2.00-3.00   sec  1.94 MBytes  16.3 Mbits/sec
> [  4]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec
> ^C[  4]   4.00-4.76   sec  0.00 Bytes  0.00 bits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth
> [  4]   0.00-4.76   sec  0.00 Bytes  0.00 bits/sec                  sender
> [  4]   0.00-4.76   sec   222 MBytes   391 Mbits/sec                  receiver
> iperf3: interrupt - the client has terminated
>
> The board is the Amlogic S400 with the A113D SoC, sorry there is no public spec for this board and for this SoC.

Thanks for testing. I will send a new version with major
differences, if you could validate it it would be great.

Thanks and Best Regards,
Jose Miguel Abreu

>
> Neil
>
>> Thanks and Best Regards,
>> Jose Miguel Abreu
>>

^ permalink raw reply

* Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
From: Tobias Hommel @ 2018-09-12 15:18 UTC (permalink / raw)
  To: Steffen Klassert
  Cc: Wolfgang Walter, Kristian Evensen, Network Development, weiwan,
	edumazet
In-Reply-To: <20180912085046.GZ23674@gauss3.secunet.de>

[-- Attachment #1: Type: text/plain, Size: 2514 bytes --]

On Wed, Sep 12, 2018 at 10:50:46AM +0200, Steffen Klassert wrote:
> On Tue, Sep 11, 2018 at 09:02:48PM +0200, Tobias Hommel wrote:
> > > > Subject: [PATCH RFC] xfrm: Fix NULL pointer dereference when skb_dst_force
> > > > clears the dst_entry.
> > > > 
> > > > Since commit 222d7dbd258d ("net: prevent dst uses after free")
> > > > skb_dst_force() might clear the dst_entry attached to the skb.
> > > > The xfrm code don't expect this to happen, so we crash with
> > > > a NULL pointer dereference in this case. Fix it by checking
> > > > skb_dst(skb) for NULL after skb_dst_force() and drop the packet
> > > > in cast the dst_entry was cleared.
> > > > 
> > > > Fixes: 222d7dbd258d ("net: prevent dst uses after free")
> > > > Reported-by: Tobias Hommel <netdev-list@genoetigt.de>
> > > > Reported-by: Kristian Evensen <kristian.evensen@gmail.com>
> > > > Reported-by: Wolfgang Walter <linux@stwm.de>
> > > > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> > > > ---
> > > 
> > > This patch fixes the problem here.
> > > 
> > > XfrmFwdHdrError gets around 80 at the very beginning and remains so. Probably 
> > > this happens when some route are changed/set then. 
> > > 
> > > Regards and thanks,
> > 
> > Same here, we're now running stable for ~6 hours, XfrmFwdHdrError is at 220.
> > This is less than 1 lost packet per minute, which seems to be okay for now.
> 
> Thanks a lot for testing! This is now applied to the ipsec tree.

After running for about 24 hours, I now encountered another panic. This time it
is caused by an out of memory situation. Although the trace shows action in the
filesystem code I'm posting it here because I cannot isolate the error and
maybe it is caused by our NULL pointer bug or by the new fix.
I do not have a serial console attached, so I could only attach a screenshot of
the panic to this mail.

I am running v4.19-rc3 from git with the above mentioned patch applied.
After 19 hours everything still looked fine, XfrmFwdHdrError value was at ~950.
Overall memory usage shown by htop was at 1.2G/15.6G.
I had htop running via ssh so I was able to see at least some status post
mortem. Uptime: 23:50:57
Overall memory usage was at 10.2G/15.6G and user processes were just
using the usual amount of memory, so it looks like the kernel was eating up at
least 9G of RAM.

Maybe this information is not very helpful for debugging, but it is at least a
warning that something might still be wrong.

I'll try to gather some more information and keep you updated.

[-- Attachment #2: oom_panic.png --]
[-- Type: image/png, Size: 56627 bytes --]

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2018-09-12 20:29 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Fix up several Kconfig dependencies in netfilter, from Martin Willi and
   Florian Westphal.

2) Memory leak in be2net driver, from Petr Oros.

3) Memory leak in E-Switch handling of mlx5 driver, from Raed Salem.

4) mlx5_attach_interface needs to check for errors, from Huy Nguyen.

5) tipc_release() needs to orphan the sock, from Cong Wang.

6) Need to program TxConfig register after TX/RX is enabled in r8169
   driver, not beforehand, from Maciej S. Szmigiero.

7) Handle 64K PAGE_SIZE properly in ena driver, from Netanel Belgazal.

8) Fix crash regression in ip_do_fragment(), from Taehee Yoo.

9) syzbot can create conditions where kernel log is flooded with
   synflood warnings due to creation of many listening sockets,
   fix that.  From Willem de Bruijn.

10) Fix RCU issues in rds socket layer, from Cong Wang.

11) Fix vlan matching in nfp driver, from Pieter Jansen van Vuuren.

Please pull, thanks a lot!

The following changes since commit 28619527b8a712590c93d0a9e24b4425b9376a8c:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-09-04 12:45:11 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 4851bfd64d42d9fb6d2d30a41c8523614b412a7a:

  Merge branch 'nfp-flower-fixes' (2018-09-12 13:18:30 -0700)

----------------------------------------------------------------
Cong Wang (6):
      tipc: orphan sock in tipc_release()
      tipc: call start and done ops directly in __tipc_nl_compat_dumpit()
      net_sched: properly cancel netlink dump on failure
      netfilter: xt_hashlimit: use s->file instead of s->private
      rds: fix two RCU related problems
      tipc: check return value of __tipc_dump_start()

Daniel Jurgens (1):
      net/mlx5: Consider PCI domain in search for next dev

David S. Miller (6):
      Merge branch 'iucv-fixes'
      Merge tag 'mlx5e-fixes-2018-09-05' of git://git.kernel.org/.../saeed/linux
      Merge branch 'ena-fixes'
      Merge git://git.kernel.org/.../pablo/nf
      Merge branch 'qeth-fixes'
      Merge branch 'nfp-flower-fixes'

Davide Caratti (1):
      net/sched: fix memory leak in act_tunnel_key_init()

Florian Westphal (5):
      netfilter: xt_checksum: ignore gso skbs
      netfilter: conntrack: place 'new' timeout in first location too
      netfilter: nf_tables: rework ct timeout set support
      netfilter: kconfig: nat related expression depend on nftables core
      netfilter: conntrack: reset tcp maxwin on re-register

Haishuang Yan (2):
      erspan: return PACKET_REJECT when the appropriate tunnel is not found
      erspan: fix error handling for erspan tunnel

Hauke Mehrtens (1):
      MIPS: lantiq: dma: add dev pointer

Huy Nguyen (1):
      net/mlx5: Check for error in mlx5_attach_interface

Jack Morgenstein (2):
      net/mlx5: Fix use-after-free in self-healing flow
      net/mlx5: Fix debugfs cleanup in the device init/remove flow

Juergen Gross (1):
      xen/netfront: fix waiting for xenbus state change

Julian Wiedmann (6):
      net/af_iucv: drop inbound packets with invalid flags
      net/af_iucv: fix skb handling on HiperTransport xmit error
      net/iucv: declare iucv_path_table_empty() as static
      s390/qeth: indicate error when netdev allocation fails
      s390/qeth: switch on SG by default for IQD devices
      s390/qeth: don't dump past end of unknown HW header

Kai-Heng Feng (1):
      r8169: Clear RTL_FLAG_TASK_*_PENDING when clearing RTL_FLAG_TASK_ENABLED

Kristian Evensen (1):
      qmi_wwan: Support dynamic config on Quectel EP06

Kuninori Morimoto (1):
      ethernet: renesas: convert to SPDX identifiers

Louis Peens (1):
      nfp: flower: reject tunnel encap with ipv6 outer headers for offloading

Maciej S. Szmigiero (1):
      r8169: set TxConfig register after TX / RX is enabled, just like RxConfig

Martin Willi (1):
      netfilter: xt_cluster: add dependency on conntrack module

Michal 'vorner' Vaner (1):
      netfilter: nfnetlink_queue: Solve the NFQUEUE/conntrack clash for NF_REPEAT

Netanel Belgazal (7):
      net: ena: fix surprise unplug NULL dereference kernel crash
      net: ena: fix driver when PAGE_SIZE == 64kB
      net: ena: fix device destruction to gracefully free resources
      net: ena: fix potential double ena_destroy_device()
      net: ena: fix missing lock during device destruction
      net: ena: fix missing calls to READ_ONCE
      net: ena: fix incorrect usage of memory barriers

Pablo Neira Ayuso (2):
      netfilter: conntrack: timeout interface depend on CONFIG_NF_CONNTRACK_TIMEOUT
      netfilter: cttimeout: ctnl_timeout_find_get() returns incorrect pointer to type

Petr Machata (1):
      mlxsw: spectrum_buffers: Set up a dedicated pool for BUM traffic

Petr Oros (1):
      be2net: Fix memory leak in be_cmd_get_profile_config()

Pieter Jansen van Vuuren (1):
      nfp: flower: fix vlan match by checking both vlan id and vlan pcp

Raed Salem (1):
      net/mlx5: E-Switch, Fix memory leak when creating switchdev mode FDB tables

Roi Dayan (2):
      net/mlx5: Fix not releasing read lock when adding flow rules
      net/mlx5: Fix possible deadlock from lockdep when adding fte to fg

Saeed Mahameed (1):
      net/mlx5e: Ethtool steering, fix udp source port value

Stefan Wahren (1):
      net: qca_spi: Fix race condition in spi transfers

Taehee Yoo (2):
      netfilter: nf_tables: release chain in flushing set
      ip: frags: fix crash in ip_do_fragment()

Tariq Toukan (2):
      net/mlx5: Use u16 for Work Queue buffer fragment size
      net/mlx5: Use u16 for Work Queue buffer strides offset

Vakul Garg (1):
      net/tls: Set count of SG entries if sk_alloc_sg returns -ENOSPC

Vincent Whitchurch (1):
      tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY

Wenjia Zhang (1):
      s390/qeth: use vzalloc for QUERY OAT buffer

Willem de Bruijn (1):
      tcp: rate limit synflood warnings further

Yue Haibing (1):
      netfilter: conntrack: remove duplicated include from nf_conntrack_proto_udp.c

 arch/mips/include/asm/mach-lantiq/xway/xway_dma.h          |   1 +
 arch/mips/lantiq/xway/dma.c                                |   4 +--
 drivers/net/ethernet/amazon/ena/ena_com.c                  |  24 ++++++++---------
 drivers/net/ethernet/amazon/ena/ena_eth_com.c              |   6 +++++
 drivers/net/ethernet/amazon/ena/ena_eth_com.h              |   8 ++----
 drivers/net/ethernet/amazon/ena/ena_netdev.c               |  82 ++++++++++++++++++++++++++------------------------------
 drivers/net/ethernet/amazon/ena/ena_netdev.h               |  11 ++++++++
 drivers/net/ethernet/emulex/benet/be_cmds.c                |   2 +-
 drivers/net/ethernet/lantiq_etop.c                         |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/dev.c              |  22 ++++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c    |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c          |  76 ++++++++++++++++++++++++++--------------------------
 drivers/net/ethernet/mellanox/mlx5/core/health.c           |  10 ++++++-
 drivers/net/ethernet/mellanox/mlx5/core/main.c             |  12 +++++----
 drivers/net/ethernet/mellanox/mlx5/core/wq.c               |   6 ++---
 drivers/net/ethernet/mellanox/mlx5/core/wq.h               |   2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_buffers.c     |  16 +++++------
 drivers/net/ethernet/netronome/nfp/flower/action.c         |   6 +++++
 drivers/net/ethernet/netronome/nfp/flower/main.h           |   1 +
 drivers/net/ethernet/netronome/nfp/flower/match.c          |   2 +-
 drivers/net/ethernet/netronome/nfp/flower/offload.c        |  11 ++++++++
 drivers/net/ethernet/qualcomm/qca_7k.c                     |  76 +++++++++++++++++++++++++---------------------------
 drivers/net/ethernet/qualcomm/qca_spi.c                    | 110 +++++++++++++++++++++++++++++++++++++++-------------------------------------
 drivers/net/ethernet/qualcomm/qca_spi.h                    |   5 ----
 drivers/net/ethernet/realtek/r8169.c                       |  11 +++++---
 drivers/net/ethernet/renesas/Kconfig                       |   1 +
 drivers/net/ethernet/renesas/Makefile                      |   1 +
 drivers/net/ethernet/renesas/ravb_ptp.c                    |   6 +----
 drivers/net/usb/qmi_wwan.c                                 |  30 ++++++++++++++++++++-
 drivers/net/xen-netfront.c                                 |  24 +++++++----------
 drivers/s390/net/qeth_core_main.c                          |  11 +++++---
 drivers/s390/net/qeth_l2_main.c                            |   2 +-
 drivers/s390/net/qeth_l3_main.c                            |   2 +-
 include/linux/mlx5/driver.h                                |   8 +++---
 include/net/netfilter/nf_conntrack_timeout.h               |   2 +-
 net/core/skbuff.c                                          |   3 ---
 net/ipv4/ip_fragment.c                                     |   1 +
 net/ipv4/ip_gre.c                                          |   5 ++++
 net/ipv4/netfilter/Kconfig                                 |   8 +++---
 net/ipv4/tcp.c                                             |   2 +-
 net/ipv4/tcp_input.c                                       |   4 +--
 net/ipv6/netfilter/nf_conntrack_reasm.c                    |   1 +
 net/iucv/af_iucv.c                                         |  38 +++++++++++++++++---------
 net/iucv/iucv.c                                            |   2 +-
 net/netfilter/Kconfig                                      |  12 ++++-----
 net/netfilter/nf_conntrack_proto.c                         |  26 ++++++++++++++++++
 net/netfilter/nf_conntrack_proto_dccp.c                    |  19 ++++++++-----
 net/netfilter/nf_conntrack_proto_generic.c                 |   8 +++---
 net/netfilter/nf_conntrack_proto_gre.c                     |   8 +++---
 net/netfilter/nf_conntrack_proto_icmp.c                    |   8 +++---
 net/netfilter/nf_conntrack_proto_icmpv6.c                  |   8 +++---
 net/netfilter/nf_conntrack_proto_sctp.c                    |  21 ++++++++++-----
 net/netfilter/nf_conntrack_proto_tcp.c                     |  19 ++++++++-----
 net/netfilter/nf_conntrack_proto_udp.c                     |  21 +++++++--------
 net/netfilter/nf_tables_api.c                              |   1 +
 net/netfilter/nfnetlink_cttimeout.c                        |   6 ++---
 net/netfilter/nfnetlink_queue.c                            |   1 +
 net/netfilter/nft_ct.c                                     |  59 ++++++++++++++++++++---------------------
 net/netfilter/xt_CHECKSUM.c                                |  22 +++++++++++++++-
 net/netfilter/xt_cluster.c                                 |  14 +++++++++-
 net/netfilter/xt_hashlimit.c                               |  18 ++++++-------
 net/rds/bind.c                                             |   5 +++-
 net/sched/act_tunnel_key.c                                 |  28 +++++++++++++-------
 net/tipc/netlink_compat.c                                  |   5 ++++
 net/tipc/socket.c                                          |  18 ++++++++-----
 net/tipc/socket.h                                          |   1 +
 net/tls/tls_sw.c                                           |   6 +++++
 68 files changed, 593 insertions(+), 400 deletions(-)

^ permalink raw reply

* [PATCH net v2 0/3] tls: don't leave keys in kernel memory
From: Sabrina Dubroca @ 2018-09-12 15:44 UTC (permalink / raw)
  To: netdev
  Cc: Sabrina Dubroca, Aviad Yehezkel, Boris Pismenny, Dave Watson,
	Vakul Garg

There are a few places where the RX/TX key for a TLS socket is copied
to kernel memory. This series clears those memory areas when they're no
longer needed.

v2: add union tls_crypto_context, following Vakul Garg's comment
    swap patch 2 and 3, using new union in patch 3

Sabrina Dubroca (3):
  tls: don't copy the key out of tls12_crypto_info_aes_gcm_128
  tls: zero the crypto information from tls_context before freeing
  tls: clear key material from kernel memory when do_tls_setsockopt_conf
    fails

 include/net/tls.h             | 19 +++++++++----------
 net/tls/tls_device.c          |  6 +++---
 net/tls/tls_device_fallback.c |  2 +-
 net/tls/tls_main.c            | 22 ++++++++++++++++------
 net/tls/tls_sw.c              | 13 +++++--------
 5 files changed, 34 insertions(+), 28 deletions(-)

-- 
2.18.0

^ permalink raw reply

* [PATCH net v2 1/3] tls: don't copy the key out of tls12_crypto_info_aes_gcm_128
From: Sabrina Dubroca @ 2018-09-12 15:44 UTC (permalink / raw)
  To: netdev
  Cc: Sabrina Dubroca, Aviad Yehezkel, Boris Pismenny, Dave Watson,
	Vakul Garg
In-Reply-To: <cover.1536766755.git.sd@queasysnail.net>

There's no need to copy the key to an on-stack buffer before calling
crypto_aead_setkey().

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
 net/tls/tls_sw.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index e28a6ff25d96..f29b7c49cbf2 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1136,7 +1136,6 @@ void tls_sw_free_resources_rx(struct sock *sk)
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 {
-	char keyval[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
 	struct tls_crypto_info *crypto_info;
 	struct tls12_crypto_info_aes_gcm_128 *gcm_128_info;
 	struct tls_sw_context_tx *sw_ctx_tx = NULL;
@@ -1265,9 +1264,7 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 
 	ctx->push_pending_record = tls_sw_push_pending_record;
 
-	memcpy(keyval, gcm_128_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
-
-	rc = crypto_aead_setkey(*aead, keyval,
+	rc = crypto_aead_setkey(*aead, gcm_128_info->key,
 				TLS_CIPHER_AES_GCM_128_KEY_SIZE);
 	if (rc)
 		goto free_aead;
-- 
2.18.0

^ permalink raw reply related

* [PATCH net v2 2/3] tls: zero the crypto information from tls_context before freeing
From: Sabrina Dubroca @ 2018-09-12 15:44 UTC (permalink / raw)
  To: netdev
  Cc: Sabrina Dubroca, Aviad Yehezkel, Boris Pismenny, Dave Watson,
	Vakul Garg
In-Reply-To: <cover.1536766755.git.sd@queasysnail.net>

This contains key material in crypto_send_aes_gcm_128 and
crypto_recv_aes_gcm_128.

Introduce union tls_crypto_context, and replace the two identical
unions directly embedded in struct tls_context with it. We can then
use this union to clean up the memory in the new tls_ctx_free()
function.

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
v2: introduce union tls_crypto_context

 include/net/tls.h             | 19 +++++++++----------
 net/tls/tls_device.c          |  6 +++---
 net/tls/tls_device_fallback.c |  2 +-
 net/tls/tls_main.c            | 20 +++++++++++++++-----
 net/tls/tls_sw.c              |  8 ++++----
 5 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index d5c683e8bb22..0a769cf2f5f3 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -171,15 +171,14 @@ struct cipher_context {
 	char *rec_seq;
 };
 
+union tls_crypto_context {
+	struct tls_crypto_info info;
+	struct tls12_crypto_info_aes_gcm_128 aes_gcm_128;
+};
+
 struct tls_context {
-	union {
-		struct tls_crypto_info crypto_send;
-		struct tls12_crypto_info_aes_gcm_128 crypto_send_aes_gcm_128;
-	};
-	union {
-		struct tls_crypto_info crypto_recv;
-		struct tls12_crypto_info_aes_gcm_128 crypto_recv_aes_gcm_128;
-	};
+	union tls_crypto_context crypto_send;
+	union tls_crypto_context crypto_recv;
 
 	struct list_head list;
 	struct net_device *netdev;
@@ -367,8 +366,8 @@ static inline void tls_fill_prepend(struct tls_context *ctx,
 	 * size KTLS_DTLS_HEADER_SIZE + KTLS_DTLS_NONCE_EXPLICIT_SIZE
 	 */
 	buf[0] = record_type;
-	buf[1] = TLS_VERSION_MINOR(ctx->crypto_send.version);
-	buf[2] = TLS_VERSION_MAJOR(ctx->crypto_send.version);
+	buf[1] = TLS_VERSION_MINOR(ctx->crypto_send.info.version);
+	buf[2] = TLS_VERSION_MAJOR(ctx->crypto_send.info.version);
 	/* we can use IV for nonce explicit according to spec */
 	buf[3] = pkt_len >> 8;
 	buf[4] = pkt_len & 0xFF;
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 292742e50bfa..961b07d4d41c 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -686,7 +686,7 @@ int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
 		goto free_marker_record;
 	}
 
-	crypto_info = &ctx->crypto_send;
+	crypto_info = &ctx->crypto_send.info;
 	switch (crypto_info->cipher_type) {
 	case TLS_CIPHER_AES_GCM_128:
 		nonce_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
@@ -780,7 +780,7 @@ int tls_set_device_offload(struct sock *sk, struct tls_context *ctx)
 
 	ctx->priv_ctx_tx = offload_ctx;
 	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
-					     &ctx->crypto_send,
+					     &ctx->crypto_send.info,
 					     tcp_sk(sk)->write_seq);
 	if (rc)
 		goto release_netdev;
@@ -862,7 +862,7 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx)
 		goto release_ctx;
 
 	rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX,
-					     &ctx->crypto_recv,
+					     &ctx->crypto_recv.info,
 					     tcp_sk(sk)->copied_seq);
 	if (rc) {
 		pr_err_ratelimited("%s: The netdev has refused to offload this socket\n",
diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
index 6102169239d1..450a6dbc5a88 100644
--- a/net/tls/tls_device_fallback.c
+++ b/net/tls/tls_device_fallback.c
@@ -320,7 +320,7 @@ static struct sk_buff *tls_enc_skb(struct tls_context *tls_ctx,
 		goto free_req;
 
 	iv = buf;
-	memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
+	memcpy(iv, tls_ctx->crypto_send.aes_gcm_128.salt,
 	       TLS_CIPHER_AES_GCM_128_SALT_SIZE);
 	aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
 	      TLS_CIPHER_AES_GCM_128_IV_SIZE;
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 180b6640e531..737b3865be1b 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -241,6 +241,16 @@ static void tls_write_space(struct sock *sk)
 	ctx->sk_write_space(sk);
 }
 
+static void tls_ctx_free(struct tls_context *ctx)
+{
+	if (!ctx)
+		return;
+
+	memzero_explicit(&ctx->crypto_send, sizeof(ctx->crypto_send));
+	memzero_explicit(&ctx->crypto_recv, sizeof(ctx->crypto_recv));
+	kfree(ctx);
+}
+
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
 	struct tls_context *ctx = tls_get_ctx(sk);
@@ -294,7 +304,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 #else
 	{
 #endif
-		kfree(ctx);
+		tls_ctx_free(ctx);
 		ctx = NULL;
 	}
 
@@ -305,7 +315,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	 * for sk->sk_prot->unhash [tls_hw_unhash]
 	 */
 	if (free_ctx)
-		kfree(ctx);
+		tls_ctx_free(ctx);
 }
 
 static int do_tls_getsockopt_tx(struct sock *sk, char __user *optval,
@@ -330,7 +340,7 @@ static int do_tls_getsockopt_tx(struct sock *sk, char __user *optval,
 	}
 
 	/* get user crypto info */
-	crypto_info = &ctx->crypto_send;
+	crypto_info = &ctx->crypto_send.info;
 
 	if (!TLS_CRYPTO_INFO_READY(crypto_info)) {
 		rc = -EBUSY;
@@ -417,9 +427,9 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 	}
 
 	if (tx)
-		crypto_info = &ctx->crypto_send;
+		crypto_info = &ctx->crypto_send.info;
 	else
-		crypto_info = &ctx->crypto_recv;
+		crypto_info = &ctx->crypto_recv.info;
 
 	/* Currently we don't support set crypto info more than one time */
 	if (TLS_CRYPTO_INFO_READY(crypto_info)) {
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f29b7c49cbf2..9e918489f4fb 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1055,8 +1055,8 @@ static int tls_read_size(struct strparser *strp, struct sk_buff *skb)
 		goto read_failure;
 	}
 
-	if (header[1] != TLS_VERSION_MINOR(tls_ctx->crypto_recv.version) ||
-	    header[2] != TLS_VERSION_MAJOR(tls_ctx->crypto_recv.version)) {
+	if (header[1] != TLS_VERSION_MINOR(tls_ctx->crypto_recv.info.version) ||
+	    header[2] != TLS_VERSION_MAJOR(tls_ctx->crypto_recv.info.version)) {
 		ret = -EINVAL;
 		goto read_failure;
 	}
@@ -1180,12 +1180,12 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 
 	if (tx) {
 		crypto_init_wait(&sw_ctx_tx->async_wait);
-		crypto_info = &ctx->crypto_send;
+		crypto_info = &ctx->crypto_send.info;
 		cctx = &ctx->tx;
 		aead = &sw_ctx_tx->aead_send;
 	} else {
 		crypto_init_wait(&sw_ctx_rx->async_wait);
-		crypto_info = &ctx->crypto_recv;
+		crypto_info = &ctx->crypto_recv.info;
 		cctx = &ctx->rx;
 		aead = &sw_ctx_rx->aead_recv;
 	}
-- 
2.18.0

^ permalink raw reply related

* [PATCH net v2 3/3] tls: clear key material from kernel memory when do_tls_setsockopt_conf fails
From: Sabrina Dubroca @ 2018-09-12 15:44 UTC (permalink / raw)
  To: netdev
  Cc: Sabrina Dubroca, Aviad Yehezkel, Boris Pismenny, Dave Watson,
	Vakul Garg
In-Reply-To: <cover.1536766755.git.sd@queasysnail.net>

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
v2: use the new union tls_crypto_context

 net/tls/tls_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 737b3865be1b..523622dc74f8 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -509,7 +509,7 @@ static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval,
 	goto out;
 
 err_crypto_info:
-	memset(crypto_info, 0, sizeof(*crypto_info));
+	memzero_explicit(crypto_info, sizeof(union tls_crypto_context));
 out:
 	return rc;
 }
-- 
2.18.0

^ permalink raw reply related

* Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg
From: Mauro Carvalho Chehab @ 2018-09-12 16:01 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: viro, linux-fsdevel, Greg Kroah-Hartman, David S. Miller, devel,
	linux-kernel, qat-linux, linux-crypto, linux-media, dri-devel,
	linaro-mm-sig, amd-gfx, linux-input, linux-iio, linux-rdma,
	linux-nvdimm, linux-nvme, linux-pci, platform-driver-x86,
	linux-remoteproc, sparclinux, linux-scsi, linux-usb, linux-fbdev,
	linuxppc-dev, linux-btrfs
In-Reply-To: <20180912151134.436719-1-arnd@arndb.de>

Em Wed, 12 Sep 2018 17:08:52 +0200
Arnd Bergmann <arnd@arndb.de> escreveu:

> The .ioctl and .compat_ioctl file operations have the same prototype so
> they can both point to the same function, which works great almost all
> the time when all the commands are compatible.
> 
> One exception is the s390 architecture, where a compat pointer is only
> 31 bit wide, and converting it into a 64-bit pointer requires calling
> compat_ptr(). Most drivers here will ever run in s390, but since we now
> have a generic helper for it, it's easy enough to use it consistently.
> 
> I double-checked all these drivers to ensure that all ioctl arguments
> are used as pointers or are ignored, but are not interpreted as integer
> values.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---

>  drivers/media/rc/lirc_dev.c                 | 4 +---

> diff --git a/drivers/media/rc/lirc_dev.c b/drivers/media/rc/lirc_dev.c
> index f862f1b7f996..077209f414ed 100644
> --- a/drivers/media/rc/lirc_dev.c
> +++ b/drivers/media/rc/lirc_dev.c
> @@ -730,9 +730,7 @@ static const struct file_operations lirc_fops = {
>  	.owner		= THIS_MODULE,
>  	.write		= ir_lirc_transmit_ir,
>  	.unlocked_ioctl	= ir_lirc_ioctl,
> -#ifdef CONFIG_COMPAT
> -	.compat_ioctl	= ir_lirc_ioctl,
> -#endif
> +	.compat_ioctl	= generic_compat_ioctl_ptrarg,
>  	.read		= ir_lirc_read,
>  	.poll		= ir_lirc_poll,
>  	.open		= ir_lirc_open,

Adding an infrared remote controller to a s390 mainframe sounds fun :-)

I suspect that one could implement it on a s390 platform 
using gpio-ir-recv and/or gpio-ir-tx drivers. Perhaps one possible
practical usage would be to let the mainframe to send remote
controller codes to adjust the air conditioning system ;-)

From lirc driver's PoV, there's nothing that really prevents one to
do that and use lirc API, and the driver is generic enough to work
on any hardware platform.

I didn't check the implementation of generic_compat_ioctl_ptrarg(),
but assuming it is ok,

Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

Thanks,
Mauro

^ permalink raw reply

* Re: [PATCH v2 02/17] compat_ioctl: move drivers to generic_compat_ioctl_ptrarg
From: Arnd Bergmann @ 2018-09-12 16:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Al Viro, Linux FS-devel Mailing List, Sudip Mukherjee, gregkh,
	Peter Huewe, Jarkko Sakkinen, Stefan Richter, Jiri Kosina,
	Benjamin Tissoires, Alexander Shishkin, Winkler, Tomas,
	Artem Bityutskiy, Marek Vasut, David Miller, Alex Williamson,
	OGAWA Hirofumi, Linux Kernel Mailing List, linux
In-Reply-To: <20180912153300.GE5633@ziepe.ca>

On Wed, Sep 12, 2018 at 5:33 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Sep 12, 2018 at 05:01:03PM +0200, Arnd Bergmann wrote:
> > Each of these drivers has a copy of the same trivial helper function to
> > convert the pointer argument and then call the native ioctl handler.
> >
> > We now have a generic implementation of that, so use it.
> >
>
> For vtpm:
>
> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
>
> Arnd, would you consider including a patch as part of/after this
> series to make compat_ioctl in drivers/infiniband/core/uverbs_main.c
> use this as well?  Looks like a bug too?

That should be included in patch 5 in this series. I may have skipped
some Cc there since it had too many recipients (sent only to the
mailing lists instead).

        Arnd

^ permalink raw reply

* Re: [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
From: Tushar Dave @ 2018-09-12 16:21 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <1536694684-3200-2-git-send-email-tushar.n.dave@oracle.com>



On 09/11/2018 12:38 PM, Tushar Dave wrote:
> Helper bpg_msg_pull_data() can allocate multiple pages while
> linearizing multiple scatterlist elements into one shared page.
> However, if the shared page has size > PAGE_SIZE, using
> copy_page_to_iter() causes below warning.
> 
> e.g.
> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
> page_copy_sane.part.8+0x0/0x8
> 
> To avoid above warning, use __GFP_COMP while allocating multiple
> contiguous pages.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> ---
>   net/core/filter.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index d301134..0b40f95 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>   	if (unlikely(bytes_sg_total > copy))
>   		return -EINVAL;
>   
> -	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
> +	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
> +			   get_order(copy));
>   	if (unlikely(!page))
>   		return -ENOMEM;
>   	p = page_address(page);

I should have mentioned that I could re-order this patch anywhere in
patch series (as long as it doesn't break git bisect). I kept it first
because I think it is more like a bug fix. I sent it along with these
patch series considering we have a context of why and for what I need
this patch!

Daniel, John,

Not sure if you guys hit this page_copy_sane warning. I hit it when RDS
copy sg page to userspace using copy_page_to_iter().

example:

RDS packet size 8KB represented in scatterlist:
sg_data[0].length = 1400
sg_data[1].length = 1448
sg_data[2].length = 1448
sg_data[3].length = 1448
sg_data[4].length = 1448
sg_data[5].length = 1000

If start=0 and end=8192, bpf_msg_pull_data() will linearize all
sg_data elements into one shared page. e.g. sg_data[0].length = 8192.
Using this sg_data[0].page in function copy_page_to_iter() causes:
WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8

(FYI, patch 4 has code that does copy_page_to_iter)


Comments?

Thanks in advance,
-Tushar

^ permalink raw reply

* [PATCH v2 04/30] inet: frags: Convert timers to use timer_setup()
From: Stephen Hemminger @ 2018-09-12 16:36 UTC (permalink / raw)
  To: davem, gregkh
  Cc: edumazet, Kees Cook, Alexander Aring, Stefan Schmidt,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, linux-wpan, netdev,
	netfilter-devel, coreteam
In-Reply-To: <20180912163648.17611-1-sthemmin@microsoft.com>

From: Kees Cook <keescook@chromium.org>

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: linux-wpan@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Stefan Schmidt <stefan@osg.samsung.com> # for ieee802154
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 78802011fbe34331bdef6f2dfb1634011f0e4c32)
---
 include/net/inet_frag.h                 | 2 +-
 net/ieee802154/6lowpan/reassembly.c     | 5 +++--
 net/ipv4/inet_fragment.c                | 4 ++--
 net/ipv4/ip_fragment.c                  | 5 +++--
 net/ipv6/netfilter/nf_conntrack_reasm.c | 5 +++--
 net/ipv6/reassembly.c                   | 5 +++--
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index fd338293a095..69e531ed8189 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -97,7 +97,7 @@ struct inet_frags {
 	void			(*constructor)(struct inet_frag_queue *q,
 					       const void *arg);
 	void			(*destructor)(struct inet_frag_queue *);
-	void			(*frag_expire)(unsigned long data);
+	void			(*frag_expire)(struct timer_list *t);
 	struct kmem_cache	*frags_cachep;
 	const char		*frags_cache_name;
 };
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 9ccb8458b5c3..6badc055555b 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -80,12 +80,13 @@ static void lowpan_frag_init(struct inet_frag_queue *q, const void *a)
 	fq->daddr = *arg->dst;
 }
 
-static void lowpan_frag_expire(unsigned long data)
+static void lowpan_frag_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct frag_queue *fq;
 	struct net *net;
 
-	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+	fq = container_of(frag, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, ieee802154_lowpan.frags);
 
 	spin_lock(&fq->q.lock);
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 4b44f973c37f..97e747b1e9a0 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -150,7 +150,7 @@ inet_evict_bucket(struct inet_frags *f, struct inet_frag_bucket *hb)
 	spin_unlock(&hb->chain_lock);
 
 	hlist_for_each_entry_safe(fq, n, &expired, list_evictor)
-		f->frag_expire((unsigned long) fq);
+		f->frag_expire(&fq->timer);
 
 	return evicted;
 }
@@ -367,7 +367,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 	f->constructor(q, arg);
 	add_frag_mem_limit(nf, f->qsize);
 
-	setup_timer(&q->timer, f->frag_expire, (unsigned long)q);
+	timer_setup(&q->timer, f->frag_expire, 0);
 	spin_lock_init(&q->lock);
 	refcount_set(&q->refcnt, 1);
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9d0b08c8ee00..5171c8cc0eb6 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -191,12 +191,13 @@ static bool frag_expire_skip_icmp(u32 user)
 /*
  * Oops, a fragment queue timed out.  Kill it and send an ICMP reply.
  */
-static void ip_expire(unsigned long arg)
+static void ip_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct ipq *qp;
 	struct net *net;
 
-	qp = container_of((struct inet_frag_queue *) arg, struct ipq, q);
+	qp = container_of(frag, struct ipq, q);
 	net = container_of(qp->q.net, struct net, ipv4.frags);
 
 	rcu_read_lock();
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 7ea2b4490672..bc776ef392ea 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -169,12 +169,13 @@ static unsigned int nf_hashfn(const struct inet_frag_queue *q)
 	return nf_hash_frag(nq->id, &nq->saddr, &nq->daddr);
 }
 
-static void nf_ct_frag6_expire(unsigned long data)
+static void nf_ct_frag6_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct frag_queue *fq;
 	struct net *net;
 
-	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+	fq = container_of(frag, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, nf_frag.frags);
 
 	ip6_expire_frag_queue(net, fq);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 26f737c3fc7b..b85ef051b75c 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -169,12 +169,13 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 }
 EXPORT_SYMBOL(ip6_expire_frag_queue);
 
-static void ip6_frag_expire(unsigned long data)
+static void ip6_frag_expire(struct timer_list *t)
 {
+	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	struct frag_queue *fq;
 	struct net *net;
 
-	fq = container_of((struct inet_frag_queue *)data, struct frag_queue, q);
+	fq = container_of(frag, struct frag_queue, q);
 	net = container_of(fq->q.net, struct net, ipv6.frags);
 
 	ip6_expire_frag_queue(net, fq);
-- 
2.18.0

^ permalink raw reply related

* Need input on placement of driver
From: Sunil Kovvuri @ 2018-09-12 16:50 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List, Arnd Bergmann

Hi David,

I am trying to submit a driver into drivers/soc folder and Arnd is of
the opinion that
the driver should be moved to drivers/net/ethernet.

Can you please go through below and give your feedback.

HW functionality in brief
# HW has a Admin function (AF) PCI device which has privilege access
to configure co-processors.
# Co-processors include network block, crypto block, ring buffer block
used by both network and crypto blocks, packet or anyother work
scheduler block, ingress/egress packet parser and forwarding block,
internal state machine caches etc.
# Each of these blocks are multiple in number and can be attached to
other PCI devices.
# Future variants of the same silicon might have additional functional
blocks in AF.
# There are other SRIOV PF/VF devices which are dumb at power-on and
acquire functionality based
    what blocks are attached to them by AF.
# So AF is the one which configures, facilities and manages all HW
resources (network and non-network).
   But doesn't handle any data.
# PF/VFs communicate with AF via a shared mailbox memory for functional block
   attach / detach requests, HW configuration etc etc.

AF driver will have logic not only the functionality needed by kernel
netdev or crypto drivers but
also the HW configuration logic needed by userspace application drivers.

Keeping current and future functionality in view we thought of having
3 drivers in kernel
# AF driver at drivers/soc
# PF/ VF netdev driver (network & ring buffer blocks attached to these
devices) at drivers/net/ethernet
# PF / VF crypto driver (ring buffer and crypto blocks attached to
these devices) at drivers/crypto.

I have submitted few patches for the AF driver
https://patchwork.kernel.org/cover/10587635/

Here Arnd has opined that all drivers should move into drivers/net/ethernet.
So wanted to check if you would be okay with this.

Thanks,
Sunil.

^ permalink raw reply

* Re: [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
From: John Fastabend @ 2018-09-12 16:51 UTC (permalink / raw)
  To: Tushar Dave, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <2fd601e3-5679-08f4-1610-b3c22de80935@oracle.com>

On 09/12/2018 09:21 AM, Tushar Dave wrote:
> 
> 
> On 09/11/2018 12:38 PM, Tushar Dave wrote:
>> Helper bpg_msg_pull_data() can allocate multiple pages while
>> linearizing multiple scatterlist elements into one shared page.
>> However, if the shared page has size > PAGE_SIZE, using
>> copy_page_to_iter() causes below warning.
>>
>> e.g.
>> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
>> page_copy_sane.part.8+0x0/0x8
>>
>> To avoid above warning, use __GFP_COMP while allocating multiple
>> contiguous pages.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> ---
>>   net/core/filter.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index d301134..0b40f95 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>       if (unlikely(bytes_sg_total > copy))
>>           return -EINVAL;
>>   -    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
>> +    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
>> +               get_order(copy));
>>       if (unlikely(!page))
>>           return -ENOMEM;
>>       p = page_address(page);
> 
> I should have mentioned that I could re-order this patch anywhere in
> patch series (as long as it doesn't break git bisect). I kept it first
> because I think it is more like a bug fix. I sent it along with these
> patch series considering we have a context of why and for what I need
> this patch!
> 
> Daniel, John,
> 
> Not sure if you guys hit this page_copy_sane warning. I hit it when RDS
> copy sg page to userspace using copy_page_to_iter().
> 

I have not hit this before but I'm working on a set of patches for
test_sockmap to test the bpf_msg_pull_data() so I'll add a case
for this. Currently, we only test the simple case where we pull
data out of a single page in selftests. This was sufficient for
my use case but missed a handful of other valid cases.

> example:
> 
> RDS packet size 8KB represented in scatterlist:
> sg_data[0].length = 1400
> sg_data[1].length = 1448
> sg_data[2].length = 1448
> sg_data[3].length = 1448
> sg_data[4].length = 1448
> sg_data[5].length = 1000
> 
> If start=0 and end=8192, bpf_msg_pull_data() will linearize all
> sg_data elements into one shared page. e.g. sg_data[0].length = 8192.
> Using this sg_data[0].page in function copy_page_to_iter() causes:
> WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
> page_copy_sane.part.8+0x0/0x8
> 
> (FYI, patch 4 has code that does copy_page_to_iter)
> 

How about sending it as a bugfix against bpf on its own. It
looks like we could reproduce this with a combination of
bpf_msg_pull_data() + redirect (to ingress) perhaps. Either
way seems like a candidate for the bpf fixes tree to me.

Thanks,
John

> 
> Comments?
> 
> Thanks in advance,
> -Tushar
> 
> 

^ permalink raw reply

* [PATCH bpf] bpf: btf: Fix end boundary calculation for type section
From: Martin KaFai Lau @ 2018-09-12 17:29 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team

The end boundary math for type section is incorrect in
btf_check_all_metas().  It just happens that hdr->type_off
is always 0 for now because there are only two sections
(type and string) and string section must be at the end (ensured
in btf_parse_str_sec).

However, type_off may not be 0 if a new section would be added later.
This patch fixes it.

Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/btf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 2590700237c1..138f0302692e 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -1844,7 +1844,7 @@ static int btf_check_all_metas(struct btf_verifier_env *env)
 
 	hdr = &btf->hdr;
 	cur = btf->nohdr_data + hdr->type_off;
-	end = btf->nohdr_data + hdr->type_len;
+	end = cur + hdr->type_len;
 
 	env->log_type_id = 1;
 	while (cur < end) {
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH bpf] bpf: btf: Fix end boundary calculation for type section
From: Yonghong Song @ 2018-09-12 17:38 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180912172911.3609494-1-kafai@fb.com>



On 9/12/18 10:29 AM, Martin KaFai Lau wrote:
> The end boundary math for type section is incorrect in
> btf_check_all_metas().  It just happens that hdr->type_off
> is always 0 for now because there are only two sections
> (type and string) and string section must be at the end (ensured
> in btf_parse_str_sec).
> 
> However, type_off may not be 0 if a new section would be added later.
> This patch fixes it.
> 
> Fixes: f80442a4cd18 ("bpf: btf: Change how section is supported in btf_header")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>   kernel/bpf/btf.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 2590700237c1..138f0302692e 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -1844,7 +1844,7 @@ static int btf_check_all_metas(struct btf_verifier_env *env)
>   
>   	hdr = &btf->hdr;
>   	cur = btf->nohdr_data + hdr->type_off;
> -	end = btf->nohdr_data + hdr->type_len;
> +	end = cur + hdr->type_len;
>   
>   	env->log_type_id = 1;
>   	while (cur < end) {
> 

^ permalink raw reply

* Re: [PATCH net-next] virtio_net: ethtool tx napi configuration
From: Florian Fainelli @ 2018-09-12 17:42 UTC (permalink / raw)
  To: Willem de Bruijn, netdev
  Cc: davem, caleb.raitto, jasowang, mst, jonolson, Willem de Bruijn
In-Reply-To: <20180909224449.203593-1-willemdebruijn.kernel@gmail.com>



On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> Interrupt moderation is currently not supported, so these accept and
> display the default settings of 0 usec and 1 frame.
> 
> Toggle tx napi through a bit in tx-frames. So as to not interfere
> with possible future interrupt moderation, use bit 10, well outside
> the reasonable range of real interrupt moderation values.
> 
> Changes are not atomic. The tx IRQ, napi BH and transmit path must
> be quiesced when switching modes. Only allow changing this setting
> when the device is down.

Humm, would not a private ethtool flag to switch TX NAPI on/off be more 
appropriate rather than use the coalescing configuration API here?

> 
> Link: https://patchwork.ozlabs.org/patch/948149/
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>   drivers/net/virtio_net.c | 52 ++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 52 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 765920905226..b320b6b14749 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -66,6 +66,8 @@ DECLARE_EWMA(pkt_len, 0, 64)
>   
>   #define VIRTNET_DRIVER_VERSION "1.0.0"
>   
> +static const u32 ethtool_coalesce_napi_mask = (1UL << 10);
> +
>   static const unsigned long guest_offloads[] = {
>   	VIRTIO_NET_F_GUEST_TSO4,
>   	VIRTIO_NET_F_GUEST_TSO6,
> @@ -2181,6 +2183,54 @@ static int virtnet_get_link_ksettings(struct net_device *dev,
>   	return 0;
>   }
>   
> +static int virtnet_set_coalesce(struct net_device *dev,
> +				struct ethtool_coalesce *ec)
> +{
> +	const struct ethtool_coalesce ec_default = {
> +		.cmd = ETHTOOL_SCOALESCE,
> +		.rx_max_coalesced_frames = 1,
> +		.tx_max_coalesced_frames = 1,
> +	};
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	int i, napi_weight = 0;
> +
> +	if (ec->tx_max_coalesced_frames & ethtool_coalesce_napi_mask) {
> +		ec->tx_max_coalesced_frames &= ~ethtool_coalesce_napi_mask;
> +		napi_weight = NAPI_POLL_WEIGHT;
> +	}
> +
> +	/* disallow changes to fields not explicitly tested above */
> +	if (memcmp(ec, &ec_default, sizeof(ec_default)))
> +		return -EINVAL;
> +
> +	if (napi_weight ^ vi->sq[0].napi.weight) {
> +		if (dev->flags & IFF_UP)
> +			return -EBUSY;
> +		for (i = 0; i < vi->max_queue_pairs; i++)
> +			vi->sq[i].napi.weight = napi_weight;
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtnet_get_coalesce(struct net_device *dev,
> +				struct ethtool_coalesce *ec)
> +{
> +	const struct ethtool_coalesce ec_default = {
> +		.cmd = ETHTOOL_GCOALESCE,
> +		.rx_max_coalesced_frames = 1,
> +		.tx_max_coalesced_frames = 1,
> +	};
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	memcpy(ec, &ec_default, sizeof(ec_default));
> +
> +	if (vi->sq[0].napi.weight)
> +		ec->tx_max_coalesced_frames |= ethtool_coalesce_napi_mask;
> +
> +	return 0;
> +}
> +
>   static void virtnet_init_settings(struct net_device *dev)
>   {
>   	struct virtnet_info *vi = netdev_priv(dev);
> @@ -2219,6 +2269,8 @@ static const struct ethtool_ops virtnet_ethtool_ops = {
>   	.get_ts_info = ethtool_op_get_ts_info,
>   	.get_link_ksettings = virtnet_get_link_ksettings,
>   	.set_link_ksettings = virtnet_set_link_ksettings,
> +	.set_coalesce = virtnet_set_coalesce,
> +	.get_coalesce = virtnet_get_coalesce,
>   };
>   
>   static void virtnet_freeze_down(struct virtio_device *vdev)
> 

-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] virtio_net: ethtool tx napi configuration
From: Willem de Bruijn @ 2018-09-12 18:07 UTC (permalink / raw)
  To: f.fainelli
  Cc: Network Development, David Miller, caleb.raitto, Jason Wang,
	Michael S. Tsirkin, Jon Olson (Google Drive), Willem de Bruijn
In-Reply-To: <a0272b54-3bf6-1c07-ab1d-62545d16a633@gmail.com>

On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@google.com>
> >
> > Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> > Interrupt moderation is currently not supported, so these accept and
> > display the default settings of 0 usec and 1 frame.
> >
> > Toggle tx napi through a bit in tx-frames. So as to not interfere
> > with possible future interrupt moderation, use bit 10, well outside
> > the reasonable range of real interrupt moderation values.
> >
> > Changes are not atomic. The tx IRQ, napi BH and transmit path must
> > be quiesced when switching modes. Only allow changing this setting
> > when the device is down.
>
> Humm, would not a private ethtool flag to switch TX NAPI on/off be more
> appropriate rather than use the coalescing configuration API here?

What do you mean by private ethtool flag? A new field in ethtool
--features (-k)?

Configurable napi-tx is not a common feature across devices. We really
want virtio-net to also just convert to napi-tx as default, but need a
way to gradually convert with application opt-out if some workloads
see regressions. There is prior art in interpreting coalesce values as
more than a direct mapping to usec. The e1000 is one example.

^ permalink raw reply

* Re: [PATCH net-next] virtio_net: ethtool tx napi configuration
From: Florian Fainelli @ 2018-09-12 18:16 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, David Miller, caleb.raitto, Jason Wang,
	Michael S. Tsirkin, Jon Olson (Google Drive), Willem de Bruijn
In-Reply-To: <CAF=yD-K4mrqY-YjsWL1JkYhqBOSciRTyvMtTv5CQU7Uq6GeypA@mail.gmail.com>



On 9/12/2018 11:07 AM, Willem de Bruijn wrote:
> On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>
>>
>>
>> On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
>>> From: Willem de Bruijn <willemb@google.com>
>>>
>>> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
>>> Interrupt moderation is currently not supported, so these accept and
>>> display the default settings of 0 usec and 1 frame.
>>>
>>> Toggle tx napi through a bit in tx-frames. So as to not interfere
>>> with possible future interrupt moderation, use bit 10, well outside
>>> the reasonable range of real interrupt moderation values.
>>>
>>> Changes are not atomic. The tx IRQ, napi BH and transmit path must
>>> be quiesced when switching modes. Only allow changing this setting
>>> when the device is down.
>>
>> Humm, would not a private ethtool flag to switch TX NAPI on/off be more
>> appropriate rather than use the coalescing configuration API here?
> 
> What do you mean by private ethtool flag? A new field in ethtool
> --features (-k)?

I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then 
ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that 
private flag. mlx5 has a number of privates flags for instance.

> 
> Configurable napi-tx is not a common feature across devices. We really
> want virtio-net to also just convert to napi-tx as default, but need a
> way to gradually convert with application opt-out if some workloads
> see regressions.

The rationale makes sense, no questions about it.

> There is prior art in interpreting coalesce values as
> more than a direct mapping to usec. The e1000 is one example.
>

Looked at both e1000 and e1000e and they both have a similar programming 
of the HW's interrupt target rate register, which is relevant to 
interrupt coalescing, what part of these drivers do you see as doing 
something not quite coalescing related?
-- 
Florian

^ permalink raw reply

* Re: [RFC v2 1/2] netlink: add NLA_REJECT policy type
From: Johannes Berg @ 2018-09-12 18:34 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, mkubecek
In-Reply-To: <20180912.111555.1317690378514849083.davem@davemloft.net>

On Wed, 2018-09-12 at 11:15 -0700, David Miller wrote:

> This looks great, no objections to this idea or the facility.

Great. I'll post this (with the fixups) for real tomorrow then, I guess.
A bit too late for me to do now.

> It does, however, remind me about about the classic problem of how bad
> we are at feature support detection because unrecognized attributes are
> ignored.
> 
> I do really hope we can fully solve that problem some day.

Yes.

There may be two or more levels to this.

It wouldn't be hard to reject attributes that are higher than maxtype -
we already pass that to nla_parse() wherever we call it, but we'd have
to find a way to make it optional I guess, for compatibility reasons.
Perhaps with a warning, like attribute validation. For genetlink, a flag
in the family (something like "strict attribute validation") would be
easy, but for "netlink proper" we have a lot of nlmsg_parse() calls to
patch, and/or replace by nlmsg_parse_strict().

I guess we should

1) implement nlmsg_parse_strict() for those new things that want it
   strictly - greenfield type stuff that doesn't need to work with
   existing applications

2) add a warning to nlmsg_parse() when a too high attribute is
   encountered

3) eventually replace nlmsg_parse() calls by nlmsg_parse_strict() and
   see what breaks? :-) We won't be able to rely on that any time soon
   though (unless userspace first checks with a guaranteed rejected
   attribute, e.g. one that has NLA_REJECT, perhaps the u64 pad
   attributes could be marked such since the kernel can't assume
   alignment anyway)

Perhaps we also have too many calls to nlmsg_parse() without a policy,
but that's orthogonal to this check.

On a second level though, with complex things like nl80211 it's often
not clear at all which attributes are used with which commands. Some
attributes (like NL80211_ATTR_IFINDEX) are (almost) universal, but there
are others that aren't. Perhaps this isn't all that important, since if
you try to trigger scanning and at the same time tell the kernel about a
key index, that clearly makes no sense at all. OTOH, we have no good way
of discovering what attribute is used where - we (try to) document this
well in the nl80211.h kernel-doc, but that isn't always complete.

So more introspection (of sorts) could be useful.

While we're talking about wishlist, I'm also toying with the idea of
having some sort of generic mechanism to convert netlink attributes
to/from structs, for internal kernel representation; so far though I
haven't been able to come up with anything useful.

johannes

^ permalink raw reply

* Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
From: Andy Lutomirski @ 2018-09-12 23:45 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jason A. Donenfeld, LKML, Netdev, David Miller,
	Greg Kroah-Hartman, Andrew Lutomirski, Samuel Neves,
	Jean-Philippe Aumasson, Linux Crypto Mailing List
In-Reply-To: <CAKv+Gu9RGo9rtGsmwMLZ_=-WSQ_h7har4agXP-2XOKupq-KYtA@mail.gmail.com>

On Wed, Sep 12, 2018 at 3:56 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> I realize you've put a lot of good and hard work into the existing
> I am also concerned about your claim that all software algorithms will
> be moved into this crypto library. You are not specific about whose
> responsibility it will be that this is going to happen in a timely
> fashion. But more importantly, it is not clear at all how you expect
> this to work for, e.g., h/w instruction based SHAxxx or AES in various
> chaining modes, which should be used only on cores that implement
> those instructions (note that on arm64, we have optional instructions
> for AES, PMULL, SHA1, SHA256, SHA512, SHA3, SM3 and SM4). Are all
> those implementations (only few of which will be used on a certain
> core) going to be part of the monolithic library? What are the APIs
> going to look like for block ciphers, taking chaining modes into
> account?

I'm not convinced that there's any real need for *all* crypto
algorithms to move into lib/zinc or to move at all.  As I see it,
there are two classes of crypto algorithms in the kernel:

a) Crypto that is used by code that chooses its algorithm statically
and wants synchronous operations.  These include everything in
drivers/char/random.c, but also a bunch of various networking things
that are hardcoded and basically everything that uses stack buffers.
(This means it includes all the code that I broke when I did
VMAP_STACK.  Sign.)

b) Crypto that is used dynamically.  This includes dm-crypt
(aes-xts-plain64, aes-cbc-essiv, etc), all the ALG_IF interfaces, a
lot of IPSEC stuff, possibly KCM, and probably many more.  These will
get comparatively little benefit from being converted to a zinc-like
interface.  For some of these cases, it wouldn't make any sense at all
to convert them.  Certainly the ones that do async hardware crypto
using DMA engines will never look at all like zinc, even under the
hood.

I think that, as a short-term goal, it makes a lot of sense to have
implementations of the crypto that *new* kernel code (like Wireguard)
wants to use in style (a) that live in /lib, and it obviously makes
sense to consolidate their implementations with the crypto/
implementations in a timely manner.  As a medium-term goal, adding
more algorithms as needed for things that could use the simpler APIs
(Bluetooth, perhaps) would make sense.

But I see no reason at all that /lib should ever contain a grab-bag of
crypto implementations just for the heck of it.  They should have real
in-kernel users IMO.  And this means that there will probably always
be some crypto implementations in crypto/ for things like aes-xts.

--Andy

^ permalink raw reply

* Re: [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook
From: Willem de Bruijn @ 2018-09-12 18:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, ecree,
	songliubraving, Tom Herbert, Petar Penkov, Willem de Bruijn
In-Reply-To: <20180912034632.bkxpktzvze242rvr@ast-mbp>

On Tue, Sep 11, 2018 at 11:47 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 07, 2018 at 05:11:08PM -0700, Petar Penkov wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is per-network namespace.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/linux/bpf.h            |   1 +
> >  include/linux/bpf_types.h      |   1 +
> >  include/linux/skbuff.h         |   7 ++
> >  include/net/net_namespace.h    |   3 +
> >  include/net/sch_generic.h      |  12 ++-
> >  include/uapi/linux/bpf.h       |  25 ++++++
> >  kernel/bpf/syscall.c           |   8 ++
> >  kernel/bpf/verifier.c          |  32 ++++++++
> >  net/core/filter.c              |  67 ++++++++++++++++
> >  net/core/flow_dissector.c      | 136 +++++++++++++++++++++++++++++++++
> >  tools/bpf/bpftool/prog.c       |   1 +
> >  tools/include/uapi/linux/bpf.h |  25 ++++++
> >  tools/lib/bpf/libbpf.c         |   2 +
>
> please split up update to tools/include/uapi/linux/bpf.h as a separate patch 2.
> We often have conflicts in there, so best to have a separate.
> Also please split tools/lib and tools/bpf chnages into patch 3.

Will do in v3.

> >  struct qdisc_skb_cb {
> > -     unsigned int            pkt_len;
> > -     u16                     slave_dev_queue_mapping;
> > -     u16                     tc_classid;
> > +     union {
> > +             struct {
> > +                     unsigned int            pkt_len;
> > +                     u16                     slave_dev_queue_mapping;
> > +                     u16                     tc_classid;
> > +             };
> > +             struct bpf_flow_keys *flow_keys;
> > +     };
>
> is this magic really necessary? flow_dissector runs very early in recv path.
> There is no qdisc or conflicts with tcp/ip use of cb.
> I think the whole cb block can be used.

The flow dissector also runs in the context of TC, from flower.
But that is not a reason to use this struct.

We need both (a) data shared with the BPF application and between
applications using tail-calls, to pass along the parse offset (nhoff),
and (b) data not accessible by the program, to store the flow_keys
pointer.

qdisc_skb_cb already has this split, exposing only the 20B .data
field to the application. Flow dissection currently reuses the existing
bpf_convert_ctx_access logic for this field.

We could create a separate flowdissect_skb_cb struct with the
same split setup, but a second constraint is that relevant internal
BPF interfaces already expect qdisc_skb_cb, notably
bkf_skb_data_end. So the union was easier.

There is another way to pass nhoff besides cb[] (see below). But
I don't immediately see another place to store the flow_keys ptr.

At least, if we pass skb as context. One larger change would
be to introduce another ctx struct, similar to sk_reuseport_(kern|md).

> > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> >       /* ... here. */
> >
> >       __u32 data_meta;
> > +     __u32 flow_keys;
>
> please use
> struct bpf_flow_keys *flow_keys;
> instead.
>
> See what we did in 'struct sk_msg_md' and in 'struct sk_reuseport_md'.
> There is no need to hide pointers in u32.
>

Will do in v3.

> > @@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >                                             FLOW_DISSECTOR_KEY_BASIC,
> >                                             target_container);
> >
> > +     rcu_read_lock();
> > +     attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
> > +                    : NULL;
> > +     if (attached) {
> > +             /* Note that even though the const qualifier is discarded
> > +              * throughout the execution of the BPF program, all changes(the
> > +              * control block) are reverted after the BPF program returns.
> > +              * Therefore, __skb_flow_dissect does not alter the skb.
> > +              */
> > +             struct bpf_flow_keys flow_keys = {};
> > +             struct qdisc_skb_cb cb_saved;
> > +             struct qdisc_skb_cb *cb;
> > +             u16 *pseudo_cb;
> > +             u32 result;
> > +
> > +             cb = qdisc_skb_cb(skb);
> > +             pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
> > +
> > +             /* Save Control Block */
> > +             memcpy(&cb_saved, cb, sizeof(cb_saved));
> > +             memset(cb, 0, sizeof(cb_saved));
> > +
> > +             /* Pass parameters to the BPF program */
> > +             cb->flow_keys = &flow_keys;
> > +             *pseudo_cb = nhoff;
>
> I don't understand this bit.
> What is this pseudo_cb and why nhoff goes in there?
> Some odd way to pass it into the prog?

Yes. nhoff passes the offset to the program to start parsing from.
Applications also pass this during tail calls.

Alternatively we can just add a new field to struct bpf_flow_keys.

^ permalink raw reply

* Re: [PATCH net-next] virtio_net: ethtool tx napi configuration
From: Willem de Bruijn @ 2018-09-12 19:11 UTC (permalink / raw)
  To: f.fainelli
  Cc: Network Development, David Miller, caleb.raitto, Jason Wang,
	Michael S. Tsirkin, Jon Olson (Google Drive), Willem de Bruijn
In-Reply-To: <2fd8d46f-7466-e507-af31-c587693beb6e@gmail.com>

On Wed, Sep 12, 2018 at 2:16 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 9/12/2018 11:07 AM, Willem de Bruijn wrote:
> > On Wed, Sep 12, 2018 at 1:42 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >>
> >>
> >>
> >> On 9/9/2018 3:44 PM, Willem de Bruijn wrote:
> >>> From: Willem de Bruijn <willemb@google.com>
> >>>
> >>> Implement ethtool .set_coalesce (-C) and .get_coalesce (-c) handlers.
> >>> Interrupt moderation is currently not supported, so these accept and
> >>> display the default settings of 0 usec and 1 frame.
> >>>
> >>> Toggle tx napi through a bit in tx-frames. So as to not interfere
> >>> with possible future interrupt moderation, use bit 10, well outside
> >>> the reasonable range of real interrupt moderation values.
> >>>
> >>> Changes are not atomic. The tx IRQ, napi BH and transmit path must
> >>> be quiesced when switching modes. Only allow changing this setting
> >>> when the device is down.
> >>
> >> Humm, would not a private ethtool flag to switch TX NAPI on/off be more
> >> appropriate rather than use the coalescing configuration API here?
> >
> > What do you mean by private ethtool flag? A new field in ethtool
> > --features (-k)?
>
> I meant using ethtool_drvinfo::n_priv_flags, ETH_SS_PRIV_FLAGS and then
> ETHTOOL_GFPFLAGS and ETHTOOL_SPFLAGS to control the toggling of that
> private flag. mlx5 has a number of privates flags for instance.

Interesting, thanks! I was not at all aware of those ethtool flags.
Am having a look. It definitely looks promising.

> > Configurable napi-tx is not a common feature across devices. We really
> > want virtio-net to also just convert to napi-tx as default, but need a
> > way to gradually convert with application opt-out if some workloads
> > see regressions.
>
> The rationale makes sense, no questions about it.
>
> > There is prior art in interpreting coalesce values as
> > more than a direct mapping to usec. The e1000 is one example.
> >
>
> Looked at both e1000 and e1000e and they both have a similar programming
> of the HW's interrupt target rate register, which is relevant to
> interrupt coalescing, what part of these drivers do you see as doing
> something not quite coalescing related?

It's all coalescing related, for sure. e1000_set_coalesce just does not
translate the tx-usecs values into microsecond latency directly.

It modifies both the interrupt throttle rate adapter->itr and interrupt mode
adapter->itr_setting, which are initially set in e1000_check_options from
module param InterruptThrottleRate.

Value 0 disables interrupt moderation. 1 and 3 program a dynamic mode.
2 is an illegal value as is 5..9. 10..10000 converts from usec to interrupt
rate/sec.

I took tx-napi to be a similar interrupt related option as, say, dynamic
conservative mode interrupt moderation.

^ permalink raw reply

* Re: [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Tushar Dave @ 2018-09-12 19:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai, rdna,
	yhs, netdev, rds-devel, sowmini.varadhan
In-Reply-To: <20180912035719.ff5mcjg3gbrg52xt@ast-mbp>



On 09/11/2018 08:57 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:01PM +0200, Tushar Dave wrote:
>> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
>> existing socket filter infrastructure for bpf program attach and load.
>> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
>> contrast to SOCKET_FILTER which deals with struct skb. This is useful
>> for kernel entities that don't have skb to represent packet data but
>> want to run eBPF socket filter on packet data that is in form of struct
>> scatterlist e.g. IB/RDMA
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   include/linux/bpf_types.h      |  1 +
>>   include/uapi/linux/bpf.h       |  1 +
>>   kernel/bpf/syscall.c           |  1 +
>>   kernel/bpf/verifier.c          |  1 +
>>   net/core/filter.c              | 55 ++++++++++++++++++++++++++++++++++++++++--
>>   samples/bpf/bpf_load.c         | 11 ++++++---
>>   tools/bpf/bpftool/prog.c       |  1 +
>>   tools/include/uapi/linux/bpf.h |  1 +
>>   tools/lib/bpf/libbpf.c         |  3 +++
>>   tools/lib/bpf/libbpf.h         |  2 ++
>

Alexi,

Thank you for reviewing the patches.

> please do not mix core kernel and user space into single patch.
> split tools/include/uapi/linux/bpf.h sync into separate patch
> and changes to tools/lib/bpf as yet another patch.

Sure, I can do that.

> 
>>   10 files changed, 72 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index cd26c09..7dc1503 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -16,6 +16,7 @@
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
>>   #endif
>>   #ifdef CONFIG_BPF_EVENTS
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 66917a4..6ec1e32 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -152,6 +152,7 @@ enum bpf_prog_type {
>>   	BPF_PROG_TYPE_LWT_SEG6LOCAL,
>>   	BPF_PROG_TYPE_LIRC_MODE2,
>>   	BPF_PROG_TYPE_SK_REUSEPORT,
>> +	BPF_PROG_TYPE_SOCKET_SG_FILTER,
>>   };
>>   
>>   enum bpf_attach_type {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 3c9636f..5f302b7 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -1361,6 +1361,7 @@ static int bpf_prog_load(union bpf_attr *attr)
>>   
>>   	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
>>   	    type != BPF_PROG_TYPE_CGROUP_SKB &&
>> +	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
> 
> I'm not comfortable to let unpriv use this right away.
> Can you live with root-only ?

Honestly, I prep this the same as how we treat
BPF_PROG_TYPE_SOCKET_FILTER. But sure I can live with root-only.

> 
>>   	    !capable(CAP_SYS_ADMIN))
>>   		return -EPERM;
>>   
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f4ff0c5..17fc4d2 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -1234,6 +1234,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>>   	case BPF_PROG_TYPE_LWT_XMIT:
>>   	case BPF_PROG_TYPE_SK_SKB:
>>   	case BPF_PROG_TYPE_SK_MSG:
>> +	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
>>   		if (meta)
>>   			return meta->pkt_access;
>>   
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 0b40f95..469c488 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
>>   
>>   static void __bpf_prog_release(struct bpf_prog *prog)
>>   {
>> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
>> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
>> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>>   		bpf_prog_put(prog);
> 
> this doesn't look right.
> Why this is needed?
> Are you using old-style setsockopt to attach?

Yes.

> I think new style of attaching that all bpf prog types that came
> after socket_filter are using is preferred.
> Pls take a look at BPF_PROG_ATTACH cmd.

well, I am not sure if that is going to work.

I did this way so I can attach eBPF prog type
BPF_PROG_TYPE_SOCKET_SG_FILTER to a socket. Like today we can attach
regular socket filter bpf program (e.g. BPF_PROG_TYPE_SOCKET_FILTER) to
TCP, UDP sockets using setsockopt. The only difference between them is,
BPF_PROG_TYPE_SOCKET_FILTER deals with struct sk_buff while
BPF_PROG_TYPE_SOCKET_SG_FILTER deals with strut scatterlist.

e.g. setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
sizeof(prog_fd[0]));


> 
> Also it looks the first patch doesn't really add the useful logic, but adds
> few lines of code here and there. Then more code comes in patches 3 and 4.
> Please rearrange them that they're reviewable as logical pieces.

okay.

-Tushar
> 
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox