Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Andrew Lunn @ 2018-06-03  0:08 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: Ilias Apalodimas, Ivan Vecera, Jiri Pirko, netdev,
	ivan.khoronzhuk, nsekhar, francois.ozog, yogeshs, spatton
In-Reply-To: <2b3cabca-4710-0a71-69c7-cc433e2b3062@ti.com>

On Sat, Jun 02, 2018 at 06:28:22PM -0500, Grygorii Strashko wrote:

Hi Grygorii

I'm just picking out one thing here... there is lots more good stuff here.

> Additional headache is PTP: we have on PHC, but both external interfaces P1/P2
> can timestamp packets.

This should not be a problem. The Marvell switches have one PHC, but
each port can time stamp packets using this counter. Each port has its
own receive and transmit time stamp registers. So i don't think this
will cause you problems.

     Andrew

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2018-06-02 23:39 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Infinite loop in _decode_session6(), from Eric Dumazet.

2) Pass correct argument to nla_strlcpy() in netfilter, also
   from Eric Dumazet.

3) Out of bounds memory access in ipv6 srh code, from Mathieu
   Xhonneux.

4) NULL deref in XDP_REDIRECT handling of tun driver, from
   Toshiaki Makita.

5) Incorrect idr release in cls_flower, from Paul Blakey.

6) Probe error handling fix in davinci_emac, from Dan Carpenter.

7) Memory leak in XPS configuration, from Alexander Duyck.

8) Use after free with cloned sockets in kcm, from Kirill Tkhai.

9) MTU handling fixes fo ip_tunnel and ip6_tunnel, from Nicolas
   Dichtel.

10) Fix UAPI hole in bpf data structure for 32-bit compat
    applications, from Daniel Borkmann.

Please pull, thanks a lot!

The following changes since commit bc2dbc5420e82560e650f8531ceca597441ca171:

  Merge branch 'akpm' (patches from Andrew) (2018-05-25 20:24:28 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to cd075ce4679ca7797734c4f6c5aa23878c8e2208:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (2018-06-02 08:07:52 -0400)

----------------------------------------------------------------
Alexander Duyck (1):
      net-sysfs: Fix memory leak in XPS configuration

Ard Biesheuvel (1):
      net: netsec: reduce DMA mask to 40 bits

Damien Thébault (1):
      net: dsa: b53: Add BCM5389 support

Dan Carpenter (1):
      net: ethernet: davinci_emac: fix error handling in probe()

Daniel Borkmann (1):
      bpf: fix uapi hole for 32 bit compat applications

Daniele Palmas (1):
      net: usb: cdc_mbim: add flag FLAG_SEND_ZLP

David S. Miller (5):
      Merge git://git.kernel.org/.../pablo/nf
      Merge tag 'wireless-drivers-for-davem-2018-05-30' of git://git.kernel.org/.../kvalo/wireless-drivers
      Merge branch 'master' of git://git.kernel.org/.../klassert/ipsec
      Merge branch 'tunnel-mtus'
      Merge git://git.kernel.org/.../bpf/bpf

Eric Dumazet (2):
      xfrm6: avoid potential infinite loop in _decode_session6()
      netfilter: provide correct argument to nla_strlcpy()

Finn Thain (1):
      net/sonic: Use dma_mapping_error()

Hao Wei Tee (1):
      iwlwifi: pcie: compare with number of IRQs requested for, not number of CPUs

Ivan Bornyakov (1):
      atm: zatm: fix memcmp casting

Jason Wang (1):
      vhost_net: flush batched heads before trying to busy polling

Josh Hill (1):
      net: qmi_wwan: Add Netgear Aircard 779S

Julian Anastasov (1):
      ipvs: fix buffer overflow with sync daemon and service

Kirill Tkhai (1):
      kcm: Fix use-after-free caused by clonned sockets

Mathieu Xhonneux (1):
      ipv6: sr: fix memory OOB access in seg6_do_srh_encap/inline

Nicolas Dichtel (2):
      ip_tunnel: restore binding to ifaces with a large mtu
      ip6_tunnel: remove magic mtu value 0xFFF8

Ondřej Hlavatý (1):
      ixgbe: fix parsing of TC actions for HW offload

Pablo Neira Ayuso (2):
      netfilter: nft_limit: fix packet ratelimiting
      netfilter: nf_tables: disable preemption in nft_update_chain_stats()

Paolo Abeni (1):
      netfilter: ebtables: handle string from userspace with care

Paul Blakey (1):
      cls_flower: Fix incorrect idr release when failing to modify rule

Petr Machata (1):
      mlxsw: spectrum: Forbid creation of VLAN 1 over port/LAG

Samuel Mendoza-Jonas (1):
      net/ncsi: Fix array size in dumpit handler

Stanislaw Gruszka (1):
      Revert "rt2800: use TXOP_BACKOFF for probe frames"

Steffen Klassert (1):
      xfrm Fix potential error pointer dereference in xfrm_bundle_create.

Suresh Reddy (1):
      be2net: Fix error detection logic for BE3

Taehee Yoo (4):
      netfilter: nf_tables: fix NULL pointer dereference on nft_ct_helper_obj_dump()
      netfilter: nft_meta: fix wrong value dereference in nft_meta_set_eval
      netfilter: nf_tables: fix NULL-ptr in nf_tables_dump_obj()
      netfilter: nf_tables: increase nft_counters_enabled in nft_chain_stats_replace()

Toshiaki Makita (1):
      tun: Fix NULL pointer dereference in XDP redirect

 Documentation/devicetree/bindings/net/dsa/b53.txt |  1 +
 drivers/atm/zatm.c                                |  4 ++--
 drivers/net/dsa/b53/b53_common.c                  | 13 +++++++++++++
 drivers/net/dsa/b53/b53_mdio.c                    |  5 ++++-
 drivers/net/dsa/b53/b53_priv.h                    |  1 +
 drivers/net/ethernet/emulex/benet/be_main.c       |  4 +++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |  9 ++++-----
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c    |  5 +++++
 drivers/net/ethernet/natsemi/sonic.c              |  2 +-
 drivers/net/ethernet/socionext/netsec.c           |  4 ++--
 drivers/net/ethernet/ti/davinci_emac.c            | 22 ++++++++++++----------
 drivers/net/tun.c                                 | 15 +++++++++------
 drivers/net/usb/cdc_mbim.c                        |  2 +-
 drivers/net/usb/qmi_wwan.c                        |  1 +
 drivers/net/wireless/intel/iwlwifi/pcie/trans.c   | 10 +++++-----
 drivers/net/wireless/ralink/rt2x00/rt2x00queue.c  |  7 +++----
 drivers/vhost/net.c                               | 37 ++++++++++++++++++++++++-------------
 include/uapi/linux/bpf.h                          |  2 ++
 net/bridge/netfilter/ebtables.c                   |  3 ++-
 net/core/net-sysfs.c                              |  6 +++---
 net/ipv4/ip_tunnel.c                              |  8 ++++----
 net/ipv6/ip6_tunnel.c                             | 11 ++++++++---
 net/ipv6/seg6_iptunnel.c                          |  4 ++--
 net/ipv6/sit.c                                    |  5 +++--
 net/ipv6/xfrm6_policy.c                           |  2 +-
 net/kcm/kcmsock.c                                 |  2 +-
 net/ncsi/ncsi-netlink.c                           |  2 +-
 net/netfilter/ipvs/ip_vs_ctl.c                    | 21 +++++++++++++++------
 net/netfilter/nf_tables_api.c                     |  8 +++++---
 net/netfilter/nf_tables_core.c                    |  4 ++--
 net/netfilter/nfnetlink_acct.c                    |  2 +-
 net/netfilter/nfnetlink_cthelper.c                |  4 ++--
 net/netfilter/nft_ct.c                            | 20 ++++++++++++--------
 net/netfilter/nft_limit.c                         | 38 ++++++++++++++++++++++++--------------
 net/netfilter/nft_meta.c                          | 14 ++++++++------
 net/sched/cls_flower.c                            |  2 +-
 net/xfrm/xfrm_policy.c                            |  5 ++---
 tools/include/uapi/linux/bpf.h                    |  2 ++
 38 files changed, 192 insertions(+), 115 deletions(-)

^ permalink raw reply

* Re: [PATCH 0/4] RFC CPSW switchdev mode
From: Grygorii Strashko @ 2018-06-02 23:28 UTC (permalink / raw)
  To: Ilias Apalodimas, Ivan Vecera, Andrew Lunn
  Cc: Jiri Pirko, netdev, ivan.khoronzhuk, nsekhar, francois.ozog,
	yogeshs, spatton, Jiri Pirko
In-Reply-To: <20180524140831.GA16793@apalos>

Hi All,

Sry, for delayed reply.

On 05/24/2018 09:08 AM, Ilias Apalodimas wrote:
> On Thu, May 24, 2018 at 03:44:54PM +0200, Ivan Vecera wrote:
>> On 24.5.2018 14:54, Andrew Lunn wrote:
>>> On Thu, May 24, 2018 at 11:48:31AM +0300, Ilias Apalodimas wrote:
>>>> On Thu, May 24, 2018 at 10:05:28AM +0200, Jiri Pirko wrote:
>>>>> Thu, May 24, 2018 at 08:56:20AM CEST, ilias.apalodimas@linaro.org wrote:
>>>>> Any reason you need cpu port? We don't need it in mlxsw and also in dsa.
>>>> Yes i've seen that on mlxsw/rocker drivers and i was reluctant adding one here.
>>>> The reason is that TI wants this configured differently from customer facing
>>>> ports. Apparently there are existing customers already using the "feature".
>>>> So OR'ing and adding the cpu port on every operation (add/del vlans add
>>>> ucast/mcast entries etc) was less favoured.
>>>
>>> Hi Ilias
>>>
>>> Nice to see this device moving away from its custom model and towards
>>> the switchdev model.
>> +1
> Thanks. To be honest it opens up so many posibilities for common configuration
> from userspace across vendors that doing something new without it doesn't make
> any sense (at least to me).
> 
>>
>>> Did you consider making a clean break from the existing code and write
>>> a new driver. Let the existing customers using the existing
>>> driver. Have the new switchdev driver fully conform to switchdev.
>>
>> I would also prefer fresh new driver. The existing one can be marked as
>> 'bugfix-only' and later pertinently deprecated/removed.
> Yes, but given the driver and the platforms it's used at, we ended up patching
> the existing driver. I am not opposed to the idea, but Grygorii is more suited
> to reply on that.

Correct, we considered two options - start from scratch or hack existing driver
to get working prototype as fast as possible.
Hacking of existing driver was just faster way to go and i agree that new driver 
might be better approach for the future.

> 
>>>
>>> I don't like having this 'cpu' interface. As you say, it breaks the
>>> switchhdev model. If we need to extend the switchdev model to support
>>> some use case, lets do that. Please can you fully describe the use
>>> cases, so we can discuss how to implement them cleanly within the
>>> switchdev model.
>> +1
> There's configuration needs from customers adding or not adding a VLAN to the
> CPU port. In my configuration examples for instance, if the cpu port is not
> added to the bridge, you cannot get an ip address on it.
> Similar cases exist for customers on adding MDBs as far as i know. So they want
> the "customer facing ports" to have the MDBs present but not the cpu port.
> In some cases (where the CPE/device that has the switch) participates in the
> traffic they want the cpu port to have the samne MDBs installed.
> This is just two simple cases that come in mind, again Grygorii is more suited
> to answer and explain existing/more complex use cases better than me.
> 
> Adding a cpu port that cannot transmit or receive traffic is a bit "weird", on
> the other hand you can access it's configuration using the same userspace tools
> and the same commands you do for the "normal" ports. Extending switchdev might
> be the proper solution here.

I'd try to provide more information about this switch and why we end up adding eth0 port.
Please, be patient to me as this is new area for me and I might not be right in some
conclusions or can missing smth. 

*Before this patch set*: Current CPSW driver in switch mode can be described as below
          +----------------------------------------+
          |Linux Host 0                            |
          |                                        |
          |   TI tool (ioctl)/netdev               |
          |  + +                                   |
          +-----------------+----------------------+
             | |            |
Control MMIO | |            | Data DMA
             | |            |
         +------------+-----+-----+-----------------+
         |   | |      |   CPDMA   |        CPSW     |
         |   v |      |           |                 |
         |     |      +-----------+                 |
         |     |      |    P0     |                 |
         |     |      |           |                 |
         |     |      +-----+-----+                 |
         |     |            |  FDB: static MAC Port=0
         |     |      +-----+-----+                 |
         |     +------>    ALE    +-----+           |
         |     +------+           |     |           |
         |     |      +-----------+     |           |
         | +---+----+              +----+---+       |
         | |        |              |        |       |
         | |  P1    |              |  P2    |       |
         +------------------------------------------+
           |        |              |        |
           |  PHY   |              |  PHY   |
           +---+----+              +----+---+
               |                        |
           +---+----+               +---+----+
           |        |               |        |
           |  Host 1|               | Host 2 |
           +--------+               +--------+

Note. This is embedded world and in many cases network configuration is
static, everything which is not allowed - drop (means no such things like
IGMP or even ARP).

The core Part of CPSW is ALE module which perform switching ops according 
to ALE table (fdb/mdb/vlan). From ALE point of view *all* port absolutely equal.
And Linux Host 0 in most of customer use cases is the no much different from Hosts 1 or 2,
so allowed net traffic to Linux Host 0 have to be very carefully configured as Apps running
on it must continue working even if network failed or overloaded (packet storms).

So, as per my understanding, P0 meaning here is absolutely not the same as CPU port in DSA.
It just another switch port with bad luck to be connected directly to CPU.

Current CPSW driver offloads basic switch configuration in HW and additional configuration
can be done using TI custom tool (kernel updated to accept additional IOCTL).
Current driver creates one netdev ETH0 and all net traffic enter/exit through this device
- it possible to distinguish which external port p1/2 received packet or send packet directly
to external port p1/2, but current CPSW doesn't do this.
Static Unicast FDB entries created in ALE table for Port 0 to specify which 
unicast traffic need to be accepted by Linux Host 0.  By default only registered multicast 
or broadcast packets forwarded to Linux Host 0.



*After this patch set*: goal keep things working the same as max as possible and
get rid of TI custom tool.

        +-------------------------------------------------------+
        | Linux Host 0                                          |
        |                            +------------------+       |
        |                            |            br0   |       |
        |                            |                  |       |
        |                            +---+--------------+       |
        |                                |             |data    |
        |     +-------+--------------+---------------+ |        |
        |     |       ^              ^   |           | |        |
        |     |      ++-----+      +-+---+-+    +----+-+-+      |
        |     |      |sw0p0 |      | sw0p1 |    |sw0p2   |      |
        |     |      +------+      +----^--+    +-----^--+      |
        |     |                         |             |         |
        |     |             +-----------+-------------+         |
        |     |             |                                   |
        +----+v+------------------------------------------------+
             | |            |
Control MMIO | |            | Data DMA
             | |            |
         +------------+-----+-----+-----------------------------+
         |   | |      |   CPDMA   |        CPSW                 |
         |   v |      |           |                             |
         |     |      +-----------+                             |
         |     |      |    P0     |                             |
         |     |      |           |                             |
         |     |      +-----+-----+                             |
         |     |            |        FDB: static MACP1 port=0   |
         |     |      +-----+-----+  FDB: static MACP2 port=0   |
         |     +------>    ALE    +-----+                       |
         |     +------+           |     |                       |
         |     |      +-----------+     |                       |
         | +---+----+              +----+---+                   |
         | |        |              |        |                   |
         | |  P1    |              |  P2    |                   |
         +------------------------------------------------------+
           |        |              |        |
           |  PHY   |              |  PHY   |
           +--------+              +--------+

In this implementation switch egress traffic to Linux Host 0 always
split between sw0p1 and sw0p2 depending on which external port P1/P2
packet was received, sw0p0 doesn't produce any traffic.
On switch ingress from Linux Host 0 packets are always sent directly to
external ports P1/P2, with assumption that Linux Bridge knows where packet
should go, and ALE bypassed in this case. sw0p0 doesn't allows to
send any traffic and serves for configuration purposes only.

Below I've described some tested use cases (not include full static configuration),
but regarding sw0p0 - there is work done by Ivan Khoronzhuk [1] which enables
adds MQPRIO and CBS Qdisc and targets AVB network features. It required to
offload MQPRIO and CBS parameters on all ports including P0. In case of P0,
CPDMA TX channels shapers need to be configured, and in case 
of sw0p1/sw0p2 internal FIFOS. 
sw0p0 also expected to be used to configure CPDMA interface in general -
number of tx/rx channels, rates, ring sizes.
In addition there is set of global CPSW parameters (not related to P1/P2, like
MAC Authorization Mode, OUI Deny Mode, crc ) which I've 
thought can be added to sw0p0 (using ethtool -priv-flags).

Additional headache is PTP: we have on PHC, but both external interfaces P1/P2
can timestamp packets.

[1] https://lkml.org/lkml/2018/5/18/1134


Below use cases were tried with this approach tried with current LKML,
so possible changes to Net/Bridge framework not considered:

 
1) boot, ping no vlan

# ip link add name br0 type bridge
# echo 0 > /sys/class/net/br0/bridge/default_pvid
# ip link set dev eth2 master br0
# ip link set dev eth0 master br0
# ip link set dev eth1 master br0
# ifconfig br0 192.168.1.2

*Note*: I've had to disable default_pvid as otherwise linux Bridge adds
and offloads default vlan 1, but default configuration for CPSW driver is vid 0.
+  CPSW specific - it can't untag packets for P0.
Another option I've found:
# ip link set dev br0 type bridge vlan_filtering 1.
but anyway, I've found it confusing that Linux bridge adds default vlan when vlan_filtering == 0


2) add vlans.

Host 1 - not vlan capable,
Host 2/ Linux Host 0 can handle vlans

# bridge vlan add dev sw0p1 vid 100 pvid untagged master
# bridge vlan add dev sw0p2 vid 100 master
# bridge vlan add dev sw0p0 vid 100 master

Any combination expected to work.

*Note*. Ilias reused IFF_ALLMULTI and IFF_MULTICAST on each interface to
 configure registered and unregistered multicast ports masks for each ALE VLAN
entries. So, 

# ifconfig eth0 -multicast
# ifconfig eth0 --allmulti
# bridge vlan add dev sw0p0 vid 100 master
expected to stop forwarding  any multicast traffic to Linux Host 0

3) MDB

Allow mcast 239.1.1.1 between P1/P2
# bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent
# bridge mdb add dev br0 port sw0p2 grp 239.1.1.1 permanent

Allow mcast 239.1.1.2 between P0/P2
# bridge mdb add dev br0 port sw0p0 grp 239.1.1.2 permanent
# bridge mdb add dev br0 port sw0p2 grp 239.1.1.2 permanent

*Note !!!!!*. I've found no proper way to add L2 mcast addresses as
01-80-C2-00-00-00 or 01-1B-19-00-00-00. probably I missing smth.

4) stp

I did some updates (added port stp state offload/stp mcats addr configuration
- will post diff on Monday) and tried stp (kernel STP) by connecting P1/P2
ports to the same switch:

# ip link add name br0 type bridge
# echo 0 > /sys/class/net/br0/bridge/default_pvid
# ip link set dev sw0p2 master br0
# ip link set dev sw0p0 master br0
# ip link set dev sw0p1 master br0
# brctl show
# ifconfig br0 192.168.1.2
# brctl stp br0 on
# brctl showstp br0

sw0p1 (3)
 port id                8003                    state                  blocking
sw0p2 (1)
 port id                8001                    state                forwarding


Don't know howto:
1) add FDB entry with "blocked" flag - ALE can discard all packets with SRC/DST
address = blocked MAC
2) add multicast MAC address with Supervisory Packet flag set. 
Such packets will bypass most of checks inside ALE and will be forwarded in all port's
states except "disabled".
3) add "unknown vlan configuration" : ALE provides possibility to configure
default behavior for tagged packets with "unknown vlan" by configuring 
- Unknown VLAN Force Untagged Egress ports Mask.
- Unknown VLAN Registered Multicast Flood Ports Mask
- Unknown VLAN Multicast Flood ports Mask
- Unknown VLAN Member ports List
4) The way to detect "brctl stp br0 on/off"


If you are here. Thanks for reading this and your patience.

-- 
regards,
-grygorii

^ permalink raw reply

* [PATCH 0/9,v2] Netfilter updates for net-next
From: Pablo Neira Ayuso @ 2018-06-02 23:25 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

v2: Please scratch the previous pull request, it was missing the
    nft_connlimit.c file in diffstat. Sorry about that.

-o-

The following patchset contains Netfilter updates for your net-next tree:

1) Get rid of nf_sk_is_transparent(), use inet_sk_transparent() instead.
   From Máté Eckl.

2) Move shared tproxy infrastructure to nf_tproxy_ipv4 and nf_tproxy_ipv6.
   Also from Máté.

3) Add hashtable to speed up chain lookups by name, from Florian Westphal.

4) Patch series to add connlimit support reusing part of the
   nf_conncount infrastructure. This includes preparation changes such
   passing context to the object and expression destroy interface;
   garbage collection for expressions embedded into set elements, and
   the introduction of the clone_destroy interface for expressions.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks.

----------------------------------------------------------------

The following changes since commit 1ffdd8e1643f6ce28792edd3314be84167faabf1:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next (2018-06-02 09:04:21 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git HEAD

for you to fetch changes up to 1b2470e59fb1e983a3655feba30cdfc03e609d51:

  netfilter: nf_tables: handle chain name lookups via rhltable (2018-06-03 01:18:37 +0200)

----------------------------------------------------------------
Florian Westphal (1):
      netfilter: nf_tables: handle chain name lookups via rhltable

Máté Eckl (2):
      netfilter: Decrease code duplication regarding transparent socket option
      netfilter: Libify xt_TPROXY

Pablo Neira Ayuso (6):
      netfilter: nf_tables: pass context to object destroy indirection
      netfilter: nf_conncount: expose connection list interface
      netfilter: nf_tables: pass ctx to nf_tables_expr_destroy()
      netfilter: nf_tables: garbage collection for stateful expressions
      netfilter: nf_tables: add destroy_clone expression
      netfilter: nf_tables: add connlimit support

 include/net/netfilter/nf_conntrack_count.h |  11 +
 include/net/netfilter/nf_socket.h          |  13 -
 include/net/netfilter/nf_tables.h          |  20 +-
 include/net/netfilter/nf_tproxy.h          | 113 +++++++++
 include/uapi/linux/netfilter/nf_tables.h   |  21 +-
 net/ipv4/netfilter/Kconfig                 |   5 +-
 net/ipv4/netfilter/Makefile                |   1 +
 net/ipv4/netfilter/nf_tproxy_ipv4.c        | 147 ++++++++++++
 net/ipv6/netfilter/Kconfig                 |   5 +-
 net/ipv6/netfilter/Makefile                |   1 +
 net/ipv6/netfilter/nf_tproxy_ipv6.c        | 146 ++++++++++++
 net/netfilter/Kconfig                      |  11 +
 net/netfilter/Makefile                     |   1 +
 net/netfilter/nf_conncount.c               |  36 ++-
 net/netfilter/nf_tables_api.c              | 151 ++++++++++--
 net/netfilter/nft_connlimit.c              | 297 +++++++++++++++++++++++
 net/netfilter/nft_counter.c                |   4 +-
 net/netfilter/nft_ct.c                     |   3 +-
 net/netfilter/nft_dynset.c                 |   9 +
 net/netfilter/nft_set_hash.c               |  21 +-
 net/netfilter/nft_socket.c                 |   3 +-
 net/netfilter/xt_TPROXY.c                  | 366 ++---------------------------
 net/netfilter/xt_socket.c                  |   4 +-
 23 files changed, 977 insertions(+), 412 deletions(-)
 create mode 100644 include/net/netfilter/nf_tproxy.h
 create mode 100644 net/ipv4/netfilter/nf_tproxy_ipv4.c
 create mode 100644 net/ipv6/netfilter/nf_tproxy_ipv6.c
 create mode 100644 net/netfilter/nft_connlimit.c

^ permalink raw reply

* [PATCH 7/9] netfilter: nf_tables: add destroy_clone expression
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180602231413.31974-1-pablo@netfilter.org>

Before this patch, cloned expressions are released via ->destroy. This
is a problem for the new connlimit expression since the ->destroy path
drop a reference on the conntrack modules and it unregisters hooks. The
new ->destroy_clone provides context that this expression is being
released from the packet path, so it is mirroring ->clone(), where
neither module reference is dropped nor hooks need to be unregistered -
because this done from the control plane path from the ->init() path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |  2 ++
 net/netfilter/nf_tables_api.c     | 12 ++++++++++--
 net/netfilter/nft_counter.c       |  1 +
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 871cb3b012e9..83e7b83ecf3e 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -745,6 +745,8 @@ struct nft_expr_ops {
 						      const struct nft_expr *expr);
 	void				(*destroy)(const struct nft_ctx *ctx,
 						   const struct nft_expr *expr);
+	void				(*destroy_clone)(const struct nft_ctx *ctx,
+							 const struct nft_expr *expr);
 	int				(*dump)(struct sk_buff *skb,
 						const struct nft_expr *expr);
 	int				(*validate)(const struct nft_ctx *ctx,
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 12463984dd5c..0a6eafa49879 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4046,8 +4046,16 @@ void nft_set_elem_destroy(const struct nft_set *set, void *elem,
 	nft_data_release(nft_set_ext_key(ext), NFT_DATA_VALUE);
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA))
 		nft_data_release(nft_set_ext_data(ext), set->dtype);
-	if (destroy_expr && nft_set_ext_exists(ext, NFT_SET_EXT_EXPR))
-		nf_tables_expr_destroy(&ctx, nft_set_ext_expr(ext));
+	if (destroy_expr && nft_set_ext_exists(ext, NFT_SET_EXT_EXPR)) {
+		struct nft_expr *expr = nft_set_ext_expr(ext);
+
+		if (expr->ops->destroy_clone) {
+			expr->ops->destroy_clone(&ctx, expr);
+			module_put(expr->ops->type->owner);
+		} else {
+			nf_tables_expr_destroy(&ctx, expr);
+		}
+	}
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF))
 		(*nft_set_ext_obj(ext))->use--;
 	kfree(elem);
diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index e59a74d6b7d6..a61d7edfc290 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -258,6 +258,7 @@ static const struct nft_expr_ops nft_counter_ops = {
 	.eval		= nft_counter_eval,
 	.init		= nft_counter_init,
 	.destroy	= nft_counter_destroy,
+	.destroy_clone	= nft_counter_destroy,
 	.dump		= nft_counter_dump,
 	.clone		= nft_counter_clone,
 };
-- 
2.11.0

^ permalink raw reply related

* [PATCH 9/9] netfilter: nf_tables: handle chain name lookups via rhltable
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180602231413.31974-1-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

If there is a significant amount of chains list search is too slow, so
add an rhlist table for this.

This speeds up ruleset loading: for every new rule we have to check if
the name already exists in current generation.

We need to be able to cope with duplicate chain names in case a transaction
drops the nfnl mutex (for request_module) and the abort of this old
transaction is still pending.

The list is kept -- we need a way to iterate chains even if hash resize is
in progress without missing an entry.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |   7 ++-
 net/netfilter/nf_tables_api.c     | 113 +++++++++++++++++++++++++++++++++-----
 2 files changed, 104 insertions(+), 16 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 83e7b83ecf3e..08c005ce56e9 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -9,6 +9,7 @@
 #include <linux/netfilter/x_tables.h>
 #include <linux/netfilter/nf_tables.h>
 #include <linux/u64_stats_sync.h>
+#include <linux/rhashtable.h>
 #include <net/netfilter/nf_flow_table.h>
 #include <net/netlink.h>
 
@@ -860,6 +861,7 @@ enum nft_chain_flags {
  *
  *	@rules: list of rules in the chain
  *	@list: used internally
+ *	@rhlhead: used internally
  *	@table: table that this chain belongs to
  *	@handle: chain handle
  *	@use: number of jump references to this chain
@@ -872,6 +874,7 @@ struct nft_chain {
 	struct nft_rule			*__rcu *rules_gen_1;
 	struct list_head		rules;
 	struct list_head		list;
+	struct rhlist_head		rhlhead;
 	struct nft_table		*table;
 	u64				handle;
 	u32				use;
@@ -965,7 +968,8 @@ unsigned int nft_do_chain(struct nft_pktinfo *pkt, void *priv);
  *	struct nft_table - nf_tables table
  *
  *	@list: used internally
- *	@chains: chains in the table
+ *	@chains_ht: chains in the table
+ *	@chains: same, for stable walks
  *	@sets: sets in the table
  *	@objects: stateful objects in the table
  *	@flowtables: flow tables in the table
@@ -979,6 +983,7 @@ unsigned int nft_do_chain(struct nft_pktinfo *pkt, void *priv);
  */
 struct nft_table {
 	struct list_head		list;
+	struct rhltable			chains_ht;
 	struct list_head		chains;
 	struct list_head		sets;
 	struct list_head		objects;
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 0a6eafa49879..2e8fd961746d 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -34,6 +34,20 @@ enum {
 	NFT_VALIDATE_DO,
 };
 
+static u32 nft_chain_hash(const void *data, u32 len, u32 seed);
+static u32 nft_chain_hash_obj(const void *data, u32 len, u32 seed);
+static int nft_chain_hash_cmp(struct rhashtable_compare_arg *, const void *);
+
+static const struct rhashtable_params nft_chain_ht_params = {
+	.head_offset		= offsetof(struct nft_chain, rhlhead),
+	.key_offset		= offsetof(struct nft_chain, name),
+	.hashfn			= nft_chain_hash,
+	.obj_hashfn		= nft_chain_hash_obj,
+	.obj_cmpfn		= nft_chain_hash_cmp,
+	.locks_mul		= 1,
+	.automatic_shrinking	= true,
+};
+
 static void nft_validate_state_update(struct net *net, u8 new_validate_state)
 {
 	switch (net->nft.validate_state) {
@@ -720,6 +734,29 @@ static int nf_tables_updtable(struct nft_ctx *ctx)
 	return ret;
 }
 
+static u32 nft_chain_hash(const void *data, u32 len, u32 seed)
+{
+	const char *name = data;
+
+	return jhash(name, strlen(name), seed);
+}
+
+static u32 nft_chain_hash_obj(const void *data, u32 len, u32 seed)
+{
+	const struct nft_chain *chain = data;
+
+	return nft_chain_hash(chain->name, 0, seed);
+}
+
+static int nft_chain_hash_cmp(struct rhashtable_compare_arg *arg,
+			      const void *ptr)
+{
+	const struct nft_chain *chain = ptr;
+	const char *name = arg->key;
+
+	return strcmp(chain->name, name);
+}
+
 static int nf_tables_newtable(struct net *net, struct sock *nlsk,
 			      struct sk_buff *skb, const struct nlmsghdr *nlh,
 			      const struct nlattr * const nla[],
@@ -766,6 +803,10 @@ static int nf_tables_newtable(struct net *net, struct sock *nlsk,
 	if (table->name == NULL)
 		goto err_strdup;
 
+	err = rhltable_init(&table->chains_ht, &nft_chain_ht_params);
+	if (err)
+		goto err_chain_ht;
+
 	INIT_LIST_HEAD(&table->chains);
 	INIT_LIST_HEAD(&table->sets);
 	INIT_LIST_HEAD(&table->objects);
@@ -782,6 +823,8 @@ static int nf_tables_newtable(struct net *net, struct sock *nlsk,
 	list_add_tail_rcu(&table->list, &net->nft.tables);
 	return 0;
 err_trans:
+	rhltable_destroy(&table->chains_ht);
+err_chain_ht:
 	kfree(table->name);
 err_strdup:
 	kfree(table);
@@ -922,6 +965,7 @@ static void nf_tables_table_destroy(struct nft_ctx *ctx)
 {
 	BUG_ON(ctx->table->use > 0);
 
+	rhltable_destroy(&ctx->table->chains_ht);
 	kfree(ctx->table->name);
 	kfree(ctx->table);
 }
@@ -967,21 +1011,35 @@ nft_chain_lookup_byhandle(const struct nft_table *table, u64 handle, u8 genmask)
 	return ERR_PTR(-ENOENT);
 }
 
-static struct nft_chain *nft_chain_lookup(const struct nft_table *table,
+static struct nft_chain *nft_chain_lookup(struct nft_table *table,
 					  const struct nlattr *nla, u8 genmask)
 {
+	char search[NFT_CHAIN_MAXNAMELEN + 1];
+	struct rhlist_head *tmp, *list;
 	struct nft_chain *chain;
 
 	if (nla == NULL)
 		return ERR_PTR(-EINVAL);
 
-	list_for_each_entry_rcu(chain, &table->chains, list) {
-		if (!nla_strcmp(nla, chain->name) &&
-		    nft_active_genmask(chain, genmask))
-			return chain;
-	}
+	nla_strlcpy(search, nla, sizeof(search));
 
-	return ERR_PTR(-ENOENT);
+	WARN_ON(!rcu_read_lock_held() &&
+		!lockdep_nfnl_is_held(NFNL_SUBSYS_NFTABLES));
+
+	chain = ERR_PTR(-ENOENT);
+	rcu_read_lock();
+	list = rhltable_lookup(&table->chains_ht, search, nft_chain_ht_params);
+	if (!list)
+		goto out_unlock;
+
+	rhl_for_each_entry_rcu(chain, tmp, list, rhlhead) {
+		if (nft_active_genmask(chain, genmask))
+			goto out_unlock;
+	}
+	chain = ERR_PTR(-ENOENT);
+out_unlock:
+	rcu_read_unlock();
+	return chain;
 }
 
 static const struct nla_policy nft_chain_policy[NFTA_CHAIN_MAX + 1] = {
@@ -1185,8 +1243,8 @@ static int nf_tables_getchain(struct net *net, struct sock *nlsk,
 {
 	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
 	u8 genmask = nft_genmask_cur(net);
-	const struct nft_table *table;
 	const struct nft_chain *chain;
+	struct nft_table *table;
 	struct sk_buff *skb2;
 	int family = nfmsg->nfgen_family;
 	int err;
@@ -1504,9 +1562,17 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask,
 	if (err < 0)
 		goto err1;
 
+	err = rhltable_insert_key(&table->chains_ht, chain->name,
+				  &chain->rhlhead, nft_chain_ht_params);
+	if (err)
+		goto err2;
+
 	err = nft_trans_chain_add(ctx, NFT_MSG_NEWCHAIN);
-	if (err < 0)
+	if (err < 0) {
+		rhltable_remove(&table->chains_ht, &chain->rhlhead,
+				nft_chain_ht_params);
 		goto err2;
+	}
 
 	table->use++;
 	list_add_tail_rcu(&chain->list, &table->chains);
@@ -2206,9 +2272,9 @@ static int nf_tables_getrule(struct net *net, struct sock *nlsk,
 {
 	const struct nfgenmsg *nfmsg = nlmsg_data(nlh);
 	u8 genmask = nft_genmask_cur(net);
-	const struct nft_table *table;
 	const struct nft_chain *chain;
 	const struct nft_rule *rule;
+	struct nft_table *table;
 	struct sk_buff *skb2;
 	int family = nfmsg->nfgen_family;
 	int err;
@@ -5981,8 +6047,16 @@ static void nft_chain_commit_update(struct nft_trans *trans)
 {
 	struct nft_base_chain *basechain;
 
-	if (nft_trans_chain_name(trans))
+	if (nft_trans_chain_name(trans)) {
+		rhltable_remove(&trans->ctx.table->chains_ht,
+				&trans->ctx.chain->rhlhead,
+				nft_chain_ht_params);
 		swap(trans->ctx.chain->name, nft_trans_chain_name(trans));
+		rhltable_insert_key(&trans->ctx.table->chains_ht,
+				    trans->ctx.chain->name,
+				    &trans->ctx.chain->rhlhead,
+				    nft_chain_ht_params);
+	}
 
 	if (!nft_is_base_chain(trans->ctx.chain))
 		return;
@@ -6159,6 +6233,15 @@ static void nf_tables_commit_chain_active(struct net *net, struct nft_chain *cha
 		nf_tables_commit_chain_free_rules_old(g0);
 }
 
+static void nft_chain_del(struct nft_chain *chain)
+{
+	struct nft_table *table = chain->table;
+
+	WARN_ON_ONCE(rhltable_remove(&table->chains_ht, &chain->rhlhead,
+				     nft_chain_ht_params));
+	list_del_rcu(&chain->list);
+}
+
 static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 {
 	struct nft_trans *trans, *next;
@@ -6233,7 +6316,7 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 			nft_trans_destroy(trans);
 			break;
 		case NFT_MSG_DELCHAIN:
-			list_del_rcu(&trans->ctx.chain->list);
+			nft_chain_del(trans->ctx.chain);
 			nf_tables_chain_notify(&trans->ctx, NFT_MSG_DELCHAIN);
 			nf_tables_unregister_hook(trans->ctx.net,
 						  trans->ctx.table,
@@ -6384,7 +6467,7 @@ static int nf_tables_abort(struct net *net, struct sk_buff *skb)
 				nft_trans_destroy(trans);
 			} else {
 				trans->ctx.table->use--;
-				list_del_rcu(&trans->ctx.chain->list);
+				nft_chain_del(trans->ctx.chain);
 				nf_tables_unregister_hook(trans->ctx.net,
 							  trans->ctx.table,
 							  trans->ctx.chain);
@@ -6986,7 +7069,7 @@ int __nft_release_basechain(struct nft_ctx *ctx)
 		ctx->chain->use--;
 		nf_tables_rule_release(ctx, rule);
 	}
-	list_del(&ctx->chain->list);
+	nft_chain_del(ctx->chain);
 	ctx->table->use--;
 	nf_tables_chain_destroy(ctx);
 
@@ -7042,7 +7125,7 @@ static void __nft_release_tables(struct net *net)
 		}
 		list_for_each_entry_safe(chain, nc, &table->chains, list) {
 			ctx.chain = chain;
-			list_del(&chain->list);
+			nft_chain_del(chain);
 			table->use--;
 			nf_tables_chain_destroy(&ctx);
 		}
-- 
2.11.0

^ permalink raw reply related

* [PATCH 8/9] netfilter: nf_tables: add connlimit support
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180602231413.31974-1-pablo@netfilter.org>

This features which allows you to limit the maximum number of
connections per arbitrary key. The connlimit expression is stateful,
therefore it can be used from meters to dynamically populate a set, this
provides a mapping to the iptables' connlimit match. This patch also
comes that allows you define static connlimit policies.

This extension depends on the nf_conncount infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/uapi/linux/netfilter/nf_tables.h | 21 ++++++++++++++++++++-
 net/netfilter/Kconfig                    |  9 +++++++++
 net/netfilter/Makefile                   |  1 +
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index a089af092a29..ae00a3c49b8a 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -1043,6 +1043,24 @@ enum nft_limit_attributes {
 };
 #define NFTA_LIMIT_MAX		(__NFTA_LIMIT_MAX - 1)
 
+enum nft_connlimit_flags {
+	NFT_CONNLIMIT_F_INV	= (1 << 0),
+};
+
+/**
+ * enum nft_connlimit_attributes - nf_tables connlimit expression netlink attributes
+ *
+ * @NFTA_CONNLIMIT_COUNT: number of connections (NLA_U32)
+ * @NFTA_CONNLIMIT_FLAGS: flags (NLA_U32: enum nft_connlimit_flags)
+ */
+enum nft_connlimit_attributes {
+	NFTA_CONNLIMIT_UNSPEC,
+	NFTA_CONNLIMIT_COUNT,
+	NFTA_CONNLIMIT_FLAGS,
+	__NFTA_CONNLIMIT_MAX
+};
+#define NFTA_CONNLIMIT_MAX	(__NFTA_CONNLIMIT_MAX - 1)
+
 /**
  * enum nft_counter_attributes - nf_tables counter expression netlink attributes
  *
@@ -1357,7 +1375,8 @@ enum nft_ct_helper_attributes {
 #define NFT_OBJECT_QUOTA	2
 #define NFT_OBJECT_CT_HELPER	3
 #define NFT_OBJECT_LIMIT	4
-#define __NFT_OBJECT_MAX	5
+#define NFT_OBJECT_CONNLIMIT	5
+#define __NFT_OBJECT_MAX	6
 #define NFT_OBJECT_MAX		(__NFT_OBJECT_MAX - 1)
 
 /**
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 41240abd755f..dbd7d1fad277 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -517,6 +517,15 @@ config NFT_COUNTER
 	  This option adds the "counter" expression that you can use to
 	  include packet and byte counters in a rule.
 
+config NFT_CONNLIMIT
+	tristate "Netfilter nf_tables connlimit module"
+	depends on NF_CONNTRACK
+	depends on NETFILTER_ADVANCED
+	select NETFILTER_CONNCOUNT
+	help
+	  This option adds the "connlimit" expression that you can use to
+	  ratelimit rule matchings per connections.
+
 config NFT_LOG
 	tristate "Netfilter nf_tables log module"
 	help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index eec169555731..44449389e527 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -80,6 +80,7 @@ nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NFT_COMPAT)	+= nft_compat.o
+obj-$(CONFIG_NFT_CONNLIMIT)	+= nft_connlimit.o
 obj-$(CONFIG_NFT_NUMGEN)	+= nft_numgen.o
 obj-$(CONFIG_NFT_CT)		+= nft_ct.o
 obj-$(CONFIG_NFT_FLOW_OFFLOAD)	+= nft_flow_offload.o
-- 
2.11.0

^ permalink raw reply related

* [PATCH 4/9] netfilter: nf_conncount: expose connection list interface
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180602231413.31974-1-pablo@netfilter.org>

This patch provides an interface to maintain the list of connections and
the lookup function to obtain the number of connections in the list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_count.h | 11 +++++++++
 net/netfilter/nf_conncount.c               | 36 +++++++++++++++++++-----------
 2 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_count.h b/include/net/netfilter/nf_conntrack_count.h
index e61184fbfb71..1910b6572430 100644
--- a/include/net/netfilter/nf_conntrack_count.h
+++ b/include/net/netfilter/nf_conntrack_count.h
@@ -13,4 +13,15 @@ unsigned int nf_conncount_count(struct net *net,
 				const u32 *key,
 				const struct nf_conntrack_tuple *tuple,
 				const struct nf_conntrack_zone *zone);
+
+unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
+				 const struct nf_conntrack_tuple *tuple,
+				 const struct nf_conntrack_zone *zone,
+				 bool *addit);
+
+bool nf_conncount_add(struct hlist_head *head,
+		      const struct nf_conntrack_tuple *tuple);
+
+void nf_conncount_cache_free(struct hlist_head *hhead);
+
 #endif
diff --git a/net/netfilter/nf_conncount.c b/net/netfilter/nf_conncount.c
index 153e690e2893..3b5059a8dcdd 100644
--- a/net/netfilter/nf_conncount.c
+++ b/net/netfilter/nf_conncount.c
@@ -79,7 +79,7 @@ static int key_diff(const u32 *a, const u32 *b, unsigned int klen)
 	return memcmp(a, b, klen * sizeof(u32));
 }
 
-static bool add_hlist(struct hlist_head *head,
+bool nf_conncount_add(struct hlist_head *head,
 		      const struct nf_conntrack_tuple *tuple)
 {
 	struct nf_conncount_tuple *conn;
@@ -91,12 +91,12 @@ static bool add_hlist(struct hlist_head *head,
 	hlist_add_head(&conn->node, head);
 	return true;
 }
+EXPORT_SYMBOL_GPL(nf_conncount_add);
 
-static unsigned int check_hlist(struct net *net,
-				struct hlist_head *head,
-				const struct nf_conntrack_tuple *tuple,
-				const struct nf_conntrack_zone *zone,
-				bool *addit)
+unsigned int nf_conncount_lookup(struct net *net, struct hlist_head *head,
+				 const struct nf_conntrack_tuple *tuple,
+				 const struct nf_conntrack_zone *zone,
+				 bool *addit)
 {
 	const struct nf_conntrack_tuple_hash *found;
 	struct nf_conncount_tuple *conn;
@@ -141,6 +141,7 @@ static unsigned int check_hlist(struct net *net,
 
 	return length;
 }
+EXPORT_SYMBOL_GPL(nf_conncount_lookup);
 
 static void tree_nodes_free(struct rb_root *root,
 			    struct nf_conncount_rb *gc_nodes[],
@@ -187,13 +188,15 @@ count_tree(struct net *net, struct rb_root *root,
 		} else {
 			/* same source network -> be counted! */
 			unsigned int count;
-			count = check_hlist(net, &rbconn->hhead, tuple, zone, &addit);
+
+			count = nf_conncount_lookup(net, &rbconn->hhead, tuple,
+						    zone, &addit);
 
 			tree_nodes_free(root, gc_nodes, gc_count);
 			if (!addit)
 				return count;
 
-			if (!add_hlist(&rbconn->hhead, tuple))
+			if (!nf_conncount_add(&rbconn->hhead, tuple))
 				return 0; /* hotdrop */
 
 			return count + 1;
@@ -203,7 +206,7 @@ count_tree(struct net *net, struct rb_root *root,
 			continue;
 
 		/* only used for GC on hhead, retval and 'addit' ignored */
-		check_hlist(net, &rbconn->hhead, tuple, zone, &addit);
+		nf_conncount_lookup(net, &rbconn->hhead, tuple, zone, &addit);
 		if (hlist_empty(&rbconn->hhead))
 			gc_nodes[gc_count++] = rbconn;
 	}
@@ -303,11 +306,19 @@ struct nf_conncount_data *nf_conncount_init(struct net *net, unsigned int family
 }
 EXPORT_SYMBOL_GPL(nf_conncount_init);
 
-static void destroy_tree(struct rb_root *r)
+void nf_conncount_cache_free(struct hlist_head *hhead)
 {
 	struct nf_conncount_tuple *conn;
-	struct nf_conncount_rb *rbconn;
 	struct hlist_node *n;
+
+	hlist_for_each_entry_safe(conn, n, hhead, node)
+		kmem_cache_free(conncount_conn_cachep, conn);
+}
+EXPORT_SYMBOL_GPL(nf_conncount_cache_free);
+
+static void destroy_tree(struct rb_root *r)
+{
+	struct nf_conncount_rb *rbconn;
 	struct rb_node *node;
 
 	while ((node = rb_first(r)) != NULL) {
@@ -315,8 +326,7 @@ static void destroy_tree(struct rb_root *r)
 
 		rb_erase(node, r);
 
-		hlist_for_each_entry_safe(conn, n, &rbconn->hhead, node)
-			kmem_cache_free(conncount_conn_cachep, conn);
+		nf_conncount_cache_free(&rbconn->hhead);
 
 		kmem_cache_free(conncount_rb_cachep, rbconn);
 	}
-- 
2.11.0

^ permalink raw reply related

* [PATCH 5/9] netfilter: nf_tables: pass ctx to nf_tables_expr_destroy()
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180602231413.31974-1-pablo@netfilter.org>

nft_set_elem_destroy() can be called from call_rcu context. Annotate
netns and table in set object so we can populate the context object.
Moreover, pass context object to nf_tables_set_elem_destroy() from the
commit phase, since it is already available from there.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |  4 ++++
 net/netfilter/nf_tables_api.c     | 16 ++++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 81ec070582b6..e3d1bac9b0d5 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -370,6 +370,8 @@ void nft_unregister_set(struct nft_set_type *type);
  *
  *	@list: table set list node
  *	@bindings: list of set bindings
+ *	@table: table this set belongs to
+ *	@net: netnamespace this set belongs to
  * 	@name: name of the set
  *	@handle: unique handle of the set
  * 	@ktype: key type (numeric type defined by userspace, not used in the kernel)
@@ -393,6 +395,8 @@ void nft_unregister_set(struct nft_set_type *type);
 struct nft_set {
 	struct list_head		list;
 	struct list_head		bindings;
+	struct nft_table		*table;
+	possible_net_t			net;
 	char				*name;
 	u64				handle;
 	u32				ktype;
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 177658f4007e..12463984dd5c 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -3359,6 +3359,8 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk,
 	}
 
 	INIT_LIST_HEAD(&set->bindings);
+	set->table = table;
+	write_pnet(&set->net, net);
 	set->ops   = ops;
 	set->ktype = ktype;
 	set->klen  = desc.klen;
@@ -4036,12 +4038,16 @@ void nft_set_elem_destroy(const struct nft_set *set, void *elem,
 			  bool destroy_expr)
 {
 	struct nft_set_ext *ext = nft_set_elem_ext(set, elem);
+	struct nft_ctx ctx = {
+		.net	= read_pnet(&set->net),
+		.family	= set->table->family,
+	};
 
 	nft_data_release(nft_set_ext_key(ext), NFT_DATA_VALUE);
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA))
 		nft_data_release(nft_set_ext_data(ext), set->dtype);
 	if (destroy_expr && nft_set_ext_exists(ext, NFT_SET_EXT_EXPR))
-		nf_tables_expr_destroy(NULL, nft_set_ext_expr(ext));
+		nf_tables_expr_destroy(&ctx, nft_set_ext_expr(ext));
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_OBJREF))
 		(*nft_set_ext_obj(ext))->use--;
 	kfree(elem);
@@ -4051,12 +4057,13 @@ EXPORT_SYMBOL_GPL(nft_set_elem_destroy);
 /* Only called from commit path, nft_set_elem_deactivate() already deals with
  * the refcounting from the preparation phase.
  */
-static void nf_tables_set_elem_destroy(const struct nft_set *set, void *elem)
+static void nf_tables_set_elem_destroy(const struct nft_ctx *ctx,
+				       const struct nft_set *set, void *elem)
 {
 	struct nft_set_ext *ext = nft_set_elem_ext(set, elem);
 
 	if (nft_set_ext_exists(ext, NFT_SET_EXT_EXPR))
-		nf_tables_expr_destroy(NULL, nft_set_ext_expr(ext));
+		nf_tables_expr_destroy(ctx, nft_set_ext_expr(ext));
 	kfree(elem);
 }
 
@@ -5999,7 +6006,8 @@ static void nft_commit_release(struct nft_trans *trans)
 		nft_set_destroy(nft_trans_set(trans));
 		break;
 	case NFT_MSG_DELSETELEM:
-		nf_tables_set_elem_destroy(nft_trans_elem_set(trans),
+		nf_tables_set_elem_destroy(&trans->ctx,
+					   nft_trans_elem_set(trans),
 					   nft_trans_elem(trans).priv);
 		break;
 	case NFT_MSG_DELOBJ:
-- 
2.11.0

^ permalink raw reply related

* [PATCH 6/9] netfilter: nf_tables: garbage collection for stateful expressions
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180602231413.31974-1-pablo@netfilter.org>

Use garbage collector to schedule removal of elements based of feedback
from expression that this element comes with. Therefore, the garbage
collector is not guided by timeout expirations in this new mode.

The new connlimit expression sets on the NFT_EXPR_GC flag to enable this
behaviour, the dynset expression needs to explicitly enable the garbage
collector via set->ops->gc_init call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |  4 ++++
 net/netfilter/nft_dynset.c        |  9 +++++++++
 net/netfilter/nft_set_hash.c      | 21 +++++++++++++++++++--
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index e3d1bac9b0d5..871cb3b012e9 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -342,6 +342,7 @@ struct nft_set_ops {
 						const struct nft_set_desc *desc,
 						const struct nlattr * const nla[]);
 	void				(*destroy)(const struct nft_set *set);
+	void				(*gc_init)(const struct nft_set *set);
 
 	unsigned int			elemsize;
 };
@@ -712,6 +713,7 @@ struct nft_expr_type {
 };
 
 #define NFT_EXPR_STATEFUL		0x1
+#define NFT_EXPR_GC			0x2
 
 /**
  *	struct nft_expr_ops - nf_tables expression operations
@@ -748,6 +750,8 @@ struct nft_expr_ops {
 	int				(*validate)(const struct nft_ctx *ctx,
 						    const struct nft_expr *expr,
 						    const struct nft_data **data);
+	bool				(*gc)(struct net *net,
+					      const struct nft_expr *expr);
 	const struct nft_expr_type	*type;
 	void				*data;
 };
diff --git a/net/netfilter/nft_dynset.c b/net/netfilter/nft_dynset.c
index b07a3fd9eeea..4d49529cff61 100644
--- a/net/netfilter/nft_dynset.c
+++ b/net/netfilter/nft_dynset.c
@@ -195,6 +195,15 @@ static int nft_dynset_init(const struct nft_ctx *ctx,
 		err = -EOPNOTSUPP;
 		if (!(priv->expr->ops->type->flags & NFT_EXPR_STATEFUL))
 			goto err1;
+
+		if (priv->expr->ops->type->flags & NFT_EXPR_GC) {
+			if (set->flags & NFT_SET_TIMEOUT)
+				goto err1;
+			if (!set->ops->gc_init)
+				goto err1;
+			set->ops->gc_init(set);
+		}
+
 	} else if (set->flags & NFT_SET_EVAL)
 		return -EINVAL;
 
diff --git a/net/netfilter/nft_set_hash.c b/net/netfilter/nft_set_hash.c
index dbf1f4ad077c..6f9a1365a09f 100644
--- a/net/netfilter/nft_set_hash.c
+++ b/net/netfilter/nft_set_hash.c
@@ -311,8 +311,16 @@ static void nft_rhash_gc(struct work_struct *work)
 			continue;
 		}
 
+		if (nft_set_ext_exists(&he->ext, NFT_SET_EXT_EXPR)) {
+			struct nft_expr *expr = nft_set_ext_expr(&he->ext);
+
+			if (expr->ops->gc &&
+			    expr->ops->gc(read_pnet(&set->net), expr))
+				goto gc;
+		}
 		if (!nft_set_elem_expired(&he->ext))
 			continue;
+gc:
 		if (nft_set_elem_mark_busy(&he->ext))
 			continue;
 
@@ -339,6 +347,14 @@ static unsigned int nft_rhash_privsize(const struct nlattr * const nla[],
 	return sizeof(struct nft_rhash);
 }
 
+static void nft_rhash_gc_init(const struct nft_set *set)
+{
+	struct nft_rhash *priv = nft_set_priv(set);
+
+	queue_delayed_work(system_power_efficient_wq, &priv->gc_work,
+			   nft_set_gc_interval(set));
+}
+
 static int nft_rhash_init(const struct nft_set *set,
 			  const struct nft_set_desc *desc,
 			  const struct nlattr * const tb[])
@@ -356,8 +372,8 @@ static int nft_rhash_init(const struct nft_set *set,
 
 	INIT_DEFERRABLE_WORK(&priv->gc_work, nft_rhash_gc);
 	if (set->flags & NFT_SET_TIMEOUT)
-		queue_delayed_work(system_power_efficient_wq, &priv->gc_work,
-				   nft_set_gc_interval(set));
+		nft_rhash_gc_init(set);
+
 	return 0;
 }
 
@@ -647,6 +663,7 @@ static struct nft_set_type nft_rhash_type __read_mostly = {
 		.elemsize	= offsetof(struct nft_rhash_elem, ext),
 		.estimate	= nft_rhash_estimate,
 		.init		= nft_rhash_init,
+		.gc_init	= nft_rhash_gc_init,
 		.destroy	= nft_rhash_destroy,
 		.insert		= nft_rhash_insert,
 		.activate	= nft_rhash_activate,
-- 
2.11.0

^ permalink raw reply related

* [PATCH 2/9] netfilter: Libify xt_TPROXY
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180602231413.31974-1-pablo@netfilter.org>

From: Máté Eckl <ecklm94@gmail.com>

The extracted functions will likely be usefull to implement tproxy
support in nf_tables.

Extrancted functions:
	- nf_tproxy_sk_is_transparent
	- nf_tproxy_laddr4
	- nf_tproxy_handle_time_wait4
	- nf_tproxy_get_sock_v4
	- nf_tproxy_laddr6
	- nf_tproxy_handle_time_wait6
	- nf_tproxy_get_sock_v6

(nf_)tproxy_handle_time_wait6 also needed some refactor as its current
implementation was xtables-specific.

Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tproxy.h   | 113 ++++++++++++
 net/ipv4/netfilter/Kconfig          |   5 +-
 net/ipv4/netfilter/Makefile         |   1 +
 net/ipv4/netfilter/nf_tproxy_ipv4.c | 147 +++++++++++++++
 net/ipv6/netfilter/Kconfig          |   5 +-
 net/ipv6/netfilter/Makefile         |   1 +
 net/ipv6/netfilter/nf_tproxy_ipv6.c | 146 +++++++++++++++
 net/netfilter/Kconfig               |   2 +
 net/netfilter/xt_TPROXY.c           | 355 ++----------------------------------
 9 files changed, 436 insertions(+), 339 deletions(-)
 create mode 100644 include/net/netfilter/nf_tproxy.h
 create mode 100644 net/ipv4/netfilter/nf_tproxy_ipv4.c
 create mode 100644 net/ipv6/netfilter/nf_tproxy_ipv6.c

diff --git a/include/net/netfilter/nf_tproxy.h b/include/net/netfilter/nf_tproxy.h
new file mode 100644
index 000000000000..9754a50ecde9
--- /dev/null
+++ b/include/net/netfilter/nf_tproxy.h
@@ -0,0 +1,113 @@
+#ifndef _NF_TPROXY_H_
+#define _NF_TPROXY_H_
+
+#include <net/tcp.h>
+
+enum nf_tproxy_lookup_t {
+	 NF_TPROXY_LOOKUP_LISTENER,
+	 NF_TPROXY_LOOKUP_ESTABLISHED,
+};
+
+static inline bool nf_tproxy_sk_is_transparent(struct sock *sk)
+{
+	if (inet_sk_transparent(sk))
+		return true;
+
+	sock_gen_put(sk);
+	return false;
+}
+
+__be32 nf_tproxy_laddr4(struct sk_buff *skb, __be32 user_laddr, __be32 daddr);
+
+/**
+ * nf_tproxy_handle_time_wait4 - handle IPv4 TCP TIME_WAIT reopen redirections
+ * @skb:	The skb being processed.
+ * @laddr:	IPv4 address to redirect to or zero.
+ * @lport:	TCP port to redirect to or zero.
+ * @sk:		The TIME_WAIT TCP socket found by the lookup.
+ *
+ * We have to handle SYN packets arriving to TIME_WAIT sockets
+ * differently: instead of reopening the connection we should rather
+ * redirect the new connection to the proxy if there's a listener
+ * socket present.
+ *
+ * nf_tproxy_handle_time_wait4() consumes the socket reference passed in.
+ *
+ * Returns the listener socket if there's one, the TIME_WAIT socket if
+ * no such listener is found, or NULL if the TCP header is incomplete.
+ */
+struct sock *
+nf_tproxy_handle_time_wait4(struct net *net, struct sk_buff *skb,
+			    __be32 laddr, __be16 lport, struct sock *sk);
+
+/*
+ * This is used when the user wants to intercept a connection matching
+ * an explicit iptables rule. In this case the sockets are assumed
+ * matching in preference order:
+ *
+ *   - match: if there's a fully established connection matching the
+ *     _packet_ tuple, it is returned, assuming the redirection
+ *     already took place and we process a packet belonging to an
+ *     established connection
+ *
+ *   - match: if there's a listening socket matching the redirection
+ *     (e.g. on-port & on-ip of the connection), it is returned,
+ *     regardless if it was bound to 0.0.0.0 or an explicit
+ *     address. The reasoning is that if there's an explicit rule, it
+ *     does not really matter if the listener is bound to an interface
+ *     or to 0. The user already stated that he wants redirection
+ *     (since he added the rule).
+ *
+ * Please note that there's an overlap between what a TPROXY target
+ * and a socket match will match. Normally if you have both rules the
+ * "socket" match will be the first one, effectively all packets
+ * belonging to established connections going through that one.
+ */
+struct sock *
+nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb, void *hp,
+		      const u8 protocol,
+		      const __be32 saddr, const __be32 daddr,
+		      const __be16 sport, const __be16 dport,
+		      const struct net_device *in,
+		      const enum nf_tproxy_lookup_t lookup_type);
+
+const struct in6_addr *
+nf_tproxy_laddr6(struct sk_buff *skb, const struct in6_addr *user_laddr,
+		 const struct in6_addr *daddr);
+
+/**
+ * nf_tproxy_handle_time_wait6 - handle IPv6 TCP TIME_WAIT reopen redirections
+ * @skb:	The skb being processed.
+ * @tproto:	Transport protocol.
+ * @thoff:	Transport protocol header offset.
+ * @net:	Network namespace.
+ * @laddr:	IPv6 address to redirect to.
+ * @lport:	TCP port to redirect to or zero.
+ * @sk:		The TIME_WAIT TCP socket found by the lookup.
+ *
+ * We have to handle SYN packets arriving to TIME_WAIT sockets
+ * differently: instead of reopening the connection we should rather
+ * redirect the new connection to the proxy if there's a listener
+ * socket present.
+ *
+ * nf_tproxy_handle_time_wait6() consumes the socket reference passed in.
+ *
+ * Returns the listener socket if there's one, the TIME_WAIT socket if
+ * no such listener is found, or NULL if the TCP header is incomplete.
+ */
+struct sock *
+nf_tproxy_handle_time_wait6(struct sk_buff *skb, int tproto, int thoff,
+			    struct net *net,
+			    const struct in6_addr *laddr,
+			    const __be16 lport,
+			    struct sock *sk);
+
+struct sock *
+nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, void *hp,
+		      const u8 protocol,
+		      const struct in6_addr *saddr, const struct in6_addr *daddr,
+		      const __be16 sport, const __be16 dport,
+		      const struct net_device *in,
+		      const enum nf_tproxy_lookup_t lookup_type);
+
+#endif /* _NF_TPROXY_H_ */
diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
index d03bc5a01a70..bbfc356cb1b5 100644
--- a/net/ipv4/netfilter/Kconfig
+++ b/net/ipv4/netfilter/Kconfig
@@ -29,7 +29,10 @@ config NF_SOCKET_IPV4
 	tristate "IPv4 socket lookup support"
 	help
 	  This option enables the IPv4 socket lookup infrastructure. This is
-	  is required by the iptables socket match.
+	  is required by the {ip,nf}tables socket match.
+
+config NF_TPROXY_IPV4
+	tristate "IPv4 tproxy support"
 
 if NF_TABLES
 
diff --git a/net/ipv4/netfilter/Makefile b/net/ipv4/netfilter/Makefile
index c4b05b174091..8394c17c269f 100644
--- a/net/ipv4/netfilter/Makefile
+++ b/net/ipv4/netfilter/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_NF_NAT_IPV4) += nf_nat_ipv4.o
 obj-$(CONFIG_NF_DEFRAG_IPV4) += nf_defrag_ipv4.o
 
 obj-$(CONFIG_NF_SOCKET_IPV4) += nf_socket_ipv4.o
+obj-$(CONFIG_NF_TPROXY_IPV4) += nf_tproxy_ipv4.o
 
 # logging
 obj-$(CONFIG_NF_LOG_ARP) += nf_log_arp.o
diff --git a/net/ipv4/netfilter/nf_tproxy_ipv4.c b/net/ipv4/netfilter/nf_tproxy_ipv4.c
new file mode 100644
index 000000000000..805e83ec3ad9
--- /dev/null
+++ b/net/ipv4/netfilter/nf_tproxy_ipv4.c
@@ -0,0 +1,147 @@
+/*
+ * Copyright (C) 2007-2008 BalaBit IT Ltd.
+ * Author: Krisztian Kovacs
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include <net/netfilter/nf_tproxy.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <net/inet_sock.h>
+#include <linux/ip.h>
+#include <net/checksum.h>
+#include <net/udp.h>
+#include <net/tcp.h>
+#include <linux/inetdevice.h>
+
+struct sock *
+nf_tproxy_handle_time_wait4(struct net *net, struct sk_buff *skb,
+			 __be32 laddr, __be16 lport, struct sock *sk)
+{
+	const struct iphdr *iph = ip_hdr(skb);
+	struct tcphdr _hdr, *hp;
+
+	hp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(_hdr), &_hdr);
+	if (hp == NULL) {
+		inet_twsk_put(inet_twsk(sk));
+		return NULL;
+	}
+
+	if (hp->syn && !hp->rst && !hp->ack && !hp->fin) {
+		/* SYN to a TIME_WAIT socket, we'd rather redirect it
+		 * to a listener socket if there's one */
+		struct sock *sk2;
+
+		sk2 = nf_tproxy_get_sock_v4(net, skb, hp, iph->protocol,
+					    iph->saddr, laddr ? laddr : iph->daddr,
+					    hp->source, lport ? lport : hp->dest,
+					    skb->dev, NF_TPROXY_LOOKUP_LISTENER);
+		if (sk2) {
+			inet_twsk_deschedule_put(inet_twsk(sk));
+			sk = sk2;
+		}
+	}
+
+	return sk;
+}
+EXPORT_SYMBOL_GPL(nf_tproxy_handle_time_wait4);
+
+__be32 nf_tproxy_laddr4(struct sk_buff *skb, __be32 user_laddr, __be32 daddr)
+{
+	struct in_device *indev;
+	__be32 laddr;
+
+	if (user_laddr)
+		return user_laddr;
+
+	laddr = 0;
+	indev = __in_dev_get_rcu(skb->dev);
+	for_primary_ifa(indev) {
+		laddr = ifa->ifa_local;
+		break;
+	} endfor_ifa(indev);
+
+	return laddr ? laddr : daddr;
+}
+EXPORT_SYMBOL_GPL(nf_tproxy_laddr4);
+
+struct sock *
+nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb, void *hp,
+		      const u8 protocol,
+		      const __be32 saddr, const __be32 daddr,
+		      const __be16 sport, const __be16 dport,
+		      const struct net_device *in,
+		      const enum nf_tproxy_lookup_t lookup_type)
+{
+	struct sock *sk;
+	struct tcphdr *tcph;
+
+	switch (protocol) {
+	case IPPROTO_TCP:
+		switch (lookup_type) {
+		case NF_TPROXY_LOOKUP_LISTENER:
+			tcph = hp;
+			sk = inet_lookup_listener(net, &tcp_hashinfo, skb,
+						    ip_hdrlen(skb) +
+						      __tcp_hdrlen(tcph),
+						    saddr, sport,
+						    daddr, dport,
+						    in->ifindex, 0);
+
+			if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
+				sk = NULL;
+			/* NOTE: we return listeners even if bound to
+			 * 0.0.0.0, those are filtered out in
+			 * xt_socket, since xt_TPROXY needs 0 bound
+			 * listeners too
+			 */
+			break;
+		case NF_TPROXY_LOOKUP_ESTABLISHED:
+			sk = inet_lookup_established(net, &tcp_hashinfo,
+						    saddr, sport, daddr, dport,
+						    in->ifindex);
+			break;
+		default:
+			BUG();
+		}
+		break;
+	case IPPROTO_UDP:
+		sk = udp4_lib_lookup(net, saddr, sport, daddr, dport,
+				     in->ifindex);
+		if (sk) {
+			int connected = (sk->sk_state == TCP_ESTABLISHED);
+			int wildcard = (inet_sk(sk)->inet_rcv_saddr == 0);
+
+			/* NOTE: we return listeners even if bound to
+			 * 0.0.0.0, those are filtered out in
+			 * xt_socket, since xt_TPROXY needs 0 bound
+			 * listeners too
+			 */
+			if ((lookup_type == NF_TPROXY_LOOKUP_ESTABLISHED &&
+			      (!connected || wildcard)) ||
+			    (lookup_type == NF_TPROXY_LOOKUP_LISTENER && connected)) {
+				sock_put(sk);
+				sk = NULL;
+			}
+		}
+		break;
+	default:
+		WARN_ON(1);
+		sk = NULL;
+	}
+
+	pr_debug("tproxy socket lookup: proto %u %08x:%u -> %08x:%u, lookup type: %d, sock %p\n",
+		 protocol, ntohl(saddr), ntohs(sport), ntohl(daddr), ntohs(dport), lookup_type, sk);
+
+	return sk;
+}
+EXPORT_SYMBOL_GPL(nf_tproxy_get_sock_v4);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Balazs Scheidler, Krisztian Kovacs");
+MODULE_DESCRIPTION("Netfilter IPv4 transparent proxy support");
diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
index 9f5b00a39adf..37b14dc9d863 100644
--- a/net/ipv6/netfilter/Kconfig
+++ b/net/ipv6/netfilter/Kconfig
@@ -29,7 +29,10 @@ config NF_SOCKET_IPV6
 	tristate "IPv6 socket lookup support"
 	help
 	  This option enables the IPv6 socket lookup infrastructure. This
-	  is used by the ip6tables socket match.
+	  is used by the {ip6,nf}tables socket match.
+
+config NF_TPROXY_IPV6
+	tristate "IPv6 tproxy support"
 
 if NF_TABLES
 
diff --git a/net/ipv6/netfilter/Makefile b/net/ipv6/netfilter/Makefile
index 71518f22ae39..10a5a1c87320 100644
--- a/net/ipv6/netfilter/Makefile
+++ b/net/ipv6/netfilter/Makefile
@@ -26,6 +26,7 @@ nf_defrag_ipv6-y := nf_defrag_ipv6_hooks.o nf_conntrack_reasm.o
 obj-$(CONFIG_NF_DEFRAG_IPV6) += nf_defrag_ipv6.o
 
 obj-$(CONFIG_NF_SOCKET_IPV6) += nf_socket_ipv6.o
+obj-$(CONFIG_NF_TPROXY_IPV6) += nf_tproxy_ipv6.o
 
 # logging
 obj-$(CONFIG_NF_LOG_IPV6) += nf_log_ipv6.o
diff --git a/net/ipv6/netfilter/nf_tproxy_ipv6.c b/net/ipv6/netfilter/nf_tproxy_ipv6.c
new file mode 100644
index 000000000000..bf1d6c421e3b
--- /dev/null
+++ b/net/ipv6/netfilter/nf_tproxy_ipv6.c
@@ -0,0 +1,146 @@
+#include <net/netfilter/nf_tproxy.h>
+#include <linux/module.h>
+#include <net/inet6_hashtables.h>
+#include <net/addrconf.h>
+#include <net/udp.h>
+#include <net/tcp.h>
+
+const struct in6_addr *
+nf_tproxy_laddr6(struct sk_buff *skb, const struct in6_addr *user_laddr,
+	      const struct in6_addr *daddr)
+{
+	struct inet6_dev *indev;
+	struct inet6_ifaddr *ifa;
+	struct in6_addr *laddr;
+
+	if (!ipv6_addr_any(user_laddr))
+		return user_laddr;
+	laddr = NULL;
+
+	indev = __in6_dev_get(skb->dev);
+	if (indev) {
+		read_lock_bh(&indev->lock);
+		list_for_each_entry(ifa, &indev->addr_list, if_list) {
+			if (ifa->flags & (IFA_F_TENTATIVE | IFA_F_DEPRECATED))
+				continue;
+
+			laddr = &ifa->addr;
+			break;
+		}
+		read_unlock_bh(&indev->lock);
+	}
+
+	return laddr ? laddr : daddr;
+}
+EXPORT_SYMBOL_GPL(nf_tproxy_laddr6);
+
+struct sock *
+nf_tproxy_handle_time_wait6(struct sk_buff *skb, int tproto, int thoff,
+			 struct net *net,
+			 const struct in6_addr *laddr,
+			 const __be16 lport,
+			 struct sock *sk)
+{
+	const struct ipv6hdr *iph = ipv6_hdr(skb);
+	struct tcphdr _hdr, *hp;
+
+	hp = skb_header_pointer(skb, thoff, sizeof(_hdr), &_hdr);
+	if (hp == NULL) {
+		inet_twsk_put(inet_twsk(sk));
+		return NULL;
+	}
+
+	if (hp->syn && !hp->rst && !hp->ack && !hp->fin) {
+		/* SYN to a TIME_WAIT socket, we'd rather redirect it
+		 * to a listener socket if there's one */
+		struct sock *sk2;
+
+		sk2 = nf_tproxy_get_sock_v6(net, skb, thoff, hp, tproto,
+					    &iph->saddr,
+					    nf_tproxy_laddr6(skb, laddr, &iph->daddr),
+					    hp->source,
+					    lport ? lport : hp->dest,
+					    skb->dev, NF_TPROXY_LOOKUP_LISTENER);
+		if (sk2) {
+			inet_twsk_deschedule_put(inet_twsk(sk));
+			sk = sk2;
+		}
+	}
+
+	return sk;
+}
+EXPORT_SYMBOL_GPL(nf_tproxy_handle_time_wait6);
+
+struct sock *
+nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, void *hp,
+		      const u8 protocol,
+		      const struct in6_addr *saddr, const struct in6_addr *daddr,
+		      const __be16 sport, const __be16 dport,
+		      const struct net_device *in,
+		      const enum nf_tproxy_lookup_t lookup_type)
+{
+	struct sock *sk;
+	struct tcphdr *tcph;
+
+	switch (protocol) {
+	case IPPROTO_TCP:
+		switch (lookup_type) {
+		case NF_TPROXY_LOOKUP_LISTENER:
+			tcph = hp;
+			sk = inet6_lookup_listener(net, &tcp_hashinfo, skb,
+						   thoff + __tcp_hdrlen(tcph),
+						   saddr, sport,
+						   daddr, ntohs(dport),
+						   in->ifindex, 0);
+
+			if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
+				sk = NULL;
+			/* NOTE: we return listeners even if bound to
+			 * 0.0.0.0, those are filtered out in
+			 * xt_socket, since xt_TPROXY needs 0 bound
+			 * listeners too
+			 */
+			break;
+		case NF_TPROXY_LOOKUP_ESTABLISHED:
+			sk = __inet6_lookup_established(net, &tcp_hashinfo,
+							saddr, sport, daddr, ntohs(dport),
+							in->ifindex, 0);
+			break;
+		default:
+			BUG();
+		}
+		break;
+	case IPPROTO_UDP:
+		sk = udp6_lib_lookup(net, saddr, sport, daddr, dport,
+				     in->ifindex);
+		if (sk) {
+			int connected = (sk->sk_state == TCP_ESTABLISHED);
+			int wildcard = ipv6_addr_any(&sk->sk_v6_rcv_saddr);
+
+			/* NOTE: we return listeners even if bound to
+			 * 0.0.0.0, those are filtered out in
+			 * xt_socket, since xt_TPROXY needs 0 bound
+			 * listeners too
+			 */
+			if ((lookup_type == NF_TPROXY_LOOKUP_ESTABLISHED && (!connected || wildcard)) ||
+			    (lookup_type == NF_TPROXY_LOOKUP_LISTENER && connected)) {
+				sock_put(sk);
+				sk = NULL;
+			}
+		}
+		break;
+	default:
+		WARN_ON(1);
+		sk = NULL;
+	}
+
+	pr_debug("tproxy socket lookup: proto %u %pI6:%u -> %pI6:%u, lookup type: %d, sock %p\n",
+		 protocol, saddr, ntohs(sport), daddr, ntohs(dport), lookup_type, sk);
+
+	return sk;
+}
+EXPORT_SYMBOL_GPL(nf_tproxy_get_sock_v6);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Balazs Scheidler, Krisztian Kovacs");
+MODULE_DESCRIPTION("Netfilter IPv4 transparent proxy support");
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 276e1e32f44e..41240abd755f 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -989,6 +989,8 @@ config NETFILTER_XT_TARGET_TPROXY
 	depends on IP_NF_MANGLE
 	select NF_DEFRAG_IPV4
 	select NF_DEFRAG_IPV6 if IP6_NF_IPTABLES != n
+	select NF_TPROXY_IPV4
+	select NF_TPROXY_IPV6 if IP6_NF_IPTABLES
 	help
 	  This option adds a `TPROXY' target, which is somewhat similar to
 	  REDIRECT.  It can only be used in the mangle table and is useful
diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index 74df7977c17c..58fce4e749a9 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -33,253 +33,9 @@
 #include <net/netfilter/ipv6/nf_defrag_ipv6.h>
 #endif
 
+#include <net/netfilter/nf_tproxy.h>
 #include <linux/netfilter/xt_TPROXY.h>
 
-enum nf_tproxy_lookup_t {
-	 NFT_LOOKUP_LISTENER,
-	 NFT_LOOKUP_ESTABLISHED,
-};
-
-static bool tproxy_sk_is_transparent(struct sock *sk)
-{
-	if (inet_sk_transparent(sk))
-		return true;
-
-	sock_gen_put(sk);
-	return false;
-}
-
-static inline __be32
-tproxy_laddr4(struct sk_buff *skb, __be32 user_laddr, __be32 daddr)
-{
-	struct in_device *indev;
-	__be32 laddr;
-
-	if (user_laddr)
-		return user_laddr;
-
-	laddr = 0;
-	indev = __in_dev_get_rcu(skb->dev);
-	for_primary_ifa(indev) {
-		laddr = ifa->ifa_local;
-		break;
-	} endfor_ifa(indev);
-
-	return laddr ? laddr : daddr;
-}
-
-/*
- * This is used when the user wants to intercept a connection matching
- * an explicit iptables rule. In this case the sockets are assumed
- * matching in preference order:
- *
- *   - match: if there's a fully established connection matching the
- *     _packet_ tuple, it is returned, assuming the redirection
- *     already took place and we process a packet belonging to an
- *     established connection
- *
- *   - match: if there's a listening socket matching the redirection
- *     (e.g. on-port & on-ip of the connection), it is returned,
- *     regardless if it was bound to 0.0.0.0 or an explicit
- *     address. The reasoning is that if there's an explicit rule, it
- *     does not really matter if the listener is bound to an interface
- *     or to 0. The user already stated that he wants redirection
- *     (since he added the rule).
- *
- * Please note that there's an overlap between what a TPROXY target
- * and a socket match will match. Normally if you have both rules the
- * "socket" match will be the first one, effectively all packets
- * belonging to established connections going through that one.
- */
-static inline struct sock *
-nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb, void *hp,
-		      const u8 protocol,
-		      const __be32 saddr, const __be32 daddr,
-		      const __be16 sport, const __be16 dport,
-		      const struct net_device *in,
-		      const enum nf_tproxy_lookup_t lookup_type)
-{
-	struct sock *sk;
-	struct tcphdr *tcph;
-
-	switch (protocol) {
-	case IPPROTO_TCP:
-		switch (lookup_type) {
-		case NFT_LOOKUP_LISTENER:
-			tcph = hp;
-			sk = inet_lookup_listener(net, &tcp_hashinfo, skb,
-						    ip_hdrlen(skb) +
-						      __tcp_hdrlen(tcph),
-						    saddr, sport,
-						    daddr, dport,
-						    in->ifindex, 0);
-
-			if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
-				sk = NULL;
-			/* NOTE: we return listeners even if bound to
-			 * 0.0.0.0, those are filtered out in
-			 * xt_socket, since xt_TPROXY needs 0 bound
-			 * listeners too
-			 */
-			break;
-		case NFT_LOOKUP_ESTABLISHED:
-			sk = inet_lookup_established(net, &tcp_hashinfo,
-						    saddr, sport, daddr, dport,
-						    in->ifindex);
-			break;
-		default:
-			BUG();
-		}
-		break;
-	case IPPROTO_UDP:
-		sk = udp4_lib_lookup(net, saddr, sport, daddr, dport,
-				     in->ifindex);
-		if (sk) {
-			int connected = (sk->sk_state == TCP_ESTABLISHED);
-			int wildcard = (inet_sk(sk)->inet_rcv_saddr == 0);
-
-			/* NOTE: we return listeners even if bound to
-			 * 0.0.0.0, those are filtered out in
-			 * xt_socket, since xt_TPROXY needs 0 bound
-			 * listeners too
-			 */
-			if ((lookup_type == NFT_LOOKUP_ESTABLISHED && (!connected || wildcard)) ||
-			    (lookup_type == NFT_LOOKUP_LISTENER && connected)) {
-				sock_put(sk);
-				sk = NULL;
-			}
-		}
-		break;
-	default:
-		WARN_ON(1);
-		sk = NULL;
-	}
-
-	pr_debug("tproxy socket lookup: proto %u %08x:%u -> %08x:%u, lookup type: %d, sock %p\n",
-		 protocol, ntohl(saddr), ntohs(sport), ntohl(daddr), ntohs(dport), lookup_type, sk);
-
-	return sk;
-}
-
-#ifdef XT_TPROXY_HAVE_IPV6
-static inline struct sock *
-nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, void *hp,
-		      const u8 protocol,
-		      const struct in6_addr *saddr, const struct in6_addr *daddr,
-		      const __be16 sport, const __be16 dport,
-		      const struct net_device *in,
-		      const enum nf_tproxy_lookup_t lookup_type)
-{
-	struct sock *sk;
-	struct tcphdr *tcph;
-
-	switch (protocol) {
-	case IPPROTO_TCP:
-		switch (lookup_type) {
-		case NFT_LOOKUP_LISTENER:
-			tcph = hp;
-			sk = inet6_lookup_listener(net, &tcp_hashinfo, skb,
-						   thoff + __tcp_hdrlen(tcph),
-						   saddr, sport,
-						   daddr, ntohs(dport),
-						   in->ifindex, 0);
-
-			if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
-				sk = NULL;
-			/* NOTE: we return listeners even if bound to
-			 * 0.0.0.0, those are filtered out in
-			 * xt_socket, since xt_TPROXY needs 0 bound
-			 * listeners too
-			 */
-			break;
-		case NFT_LOOKUP_ESTABLISHED:
-			sk = __inet6_lookup_established(net, &tcp_hashinfo,
-							saddr, sport, daddr, ntohs(dport),
-							in->ifindex, 0);
-			break;
-		default:
-			BUG();
-		}
-		break;
-	case IPPROTO_UDP:
-		sk = udp6_lib_lookup(net, saddr, sport, daddr, dport,
-				     in->ifindex);
-		if (sk) {
-			int connected = (sk->sk_state == TCP_ESTABLISHED);
-			int wildcard = ipv6_addr_any(&sk->sk_v6_rcv_saddr);
-
-			/* NOTE: we return listeners even if bound to
-			 * 0.0.0.0, those are filtered out in
-			 * xt_socket, since xt_TPROXY needs 0 bound
-			 * listeners too
-			 */
-			if ((lookup_type == NFT_LOOKUP_ESTABLISHED && (!connected || wildcard)) ||
-			    (lookup_type == NFT_LOOKUP_LISTENER && connected)) {
-				sock_put(sk);
-				sk = NULL;
-			}
-		}
-		break;
-	default:
-		WARN_ON(1);
-		sk = NULL;
-	}
-
-	pr_debug("tproxy socket lookup: proto %u %pI6:%u -> %pI6:%u, lookup type: %d, sock %p\n",
-		 protocol, saddr, ntohs(sport), daddr, ntohs(dport), lookup_type, sk);
-
-	return sk;
-}
-#endif
-
-/**
- * tproxy_handle_time_wait4 - handle IPv4 TCP TIME_WAIT reopen redirections
- * @skb:	The skb being processed.
- * @laddr:	IPv4 address to redirect to or zero.
- * @lport:	TCP port to redirect to or zero.
- * @sk:		The TIME_WAIT TCP socket found by the lookup.
- *
- * We have to handle SYN packets arriving to TIME_WAIT sockets
- * differently: instead of reopening the connection we should rather
- * redirect the new connection to the proxy if there's a listener
- * socket present.
- *
- * tproxy_handle_time_wait4() consumes the socket reference passed in.
- *
- * Returns the listener socket if there's one, the TIME_WAIT socket if
- * no such listener is found, or NULL if the TCP header is incomplete.
- */
-static struct sock *
-tproxy_handle_time_wait4(struct net *net, struct sk_buff *skb,
-			 __be32 laddr, __be16 lport, struct sock *sk)
-{
-	const struct iphdr *iph = ip_hdr(skb);
-	struct tcphdr _hdr, *hp;
-
-	hp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(_hdr), &_hdr);
-	if (hp == NULL) {
-		inet_twsk_put(inet_twsk(sk));
-		return NULL;
-	}
-
-	if (hp->syn && !hp->rst && !hp->ack && !hp->fin) {
-		/* SYN to a TIME_WAIT socket, we'd rather redirect it
-		 * to a listener socket if there's one */
-		struct sock *sk2;
-
-		sk2 = nf_tproxy_get_sock_v4(net, skb, hp, iph->protocol,
-					    iph->saddr, laddr ? laddr : iph->daddr,
-					    hp->source, lport ? lport : hp->dest,
-					    skb->dev, NFT_LOOKUP_LISTENER);
-		if (sk2) {
-			inet_twsk_deschedule_put(inet_twsk(sk));
-			sk = sk2;
-		}
-	}
-
-	return sk;
-}
-
 /* assign a socket to the skb -- consumes sk */
 static void
 nf_tproxy_assign_sock(struct sk_buff *skb, struct sock *sk)
@@ -308,26 +64,26 @@ tproxy_tg4(struct net *net, struct sk_buff *skb, __be32 laddr, __be16 lport,
 	sk = nf_tproxy_get_sock_v4(net, skb, hp, iph->protocol,
 				   iph->saddr, iph->daddr,
 				   hp->source, hp->dest,
-				   skb->dev, NFT_LOOKUP_ESTABLISHED);
+				   skb->dev, NF_TPROXY_LOOKUP_ESTABLISHED);
 
-	laddr = tproxy_laddr4(skb, laddr, iph->daddr);
+	laddr = nf_tproxy_laddr4(skb, laddr, iph->daddr);
 	if (!lport)
 		lport = hp->dest;
 
 	/* UDP has no TCP_TIME_WAIT state, so we never enter here */
 	if (sk && sk->sk_state == TCP_TIME_WAIT)
 		/* reopening a TIME_WAIT connection needs special handling */
-		sk = tproxy_handle_time_wait4(net, skb, laddr, lport, sk);
+		sk = nf_tproxy_handle_time_wait4(net, skb, laddr, lport, sk);
 	else if (!sk)
 		/* no, there's no established connection, check if
 		 * there's a listener on the redirected addr/port */
 		sk = nf_tproxy_get_sock_v4(net, skb, hp, iph->protocol,
 					   iph->saddr, laddr,
 					   hp->source, lport,
-					   skb->dev, NFT_LOOKUP_LISTENER);
+					   skb->dev, NF_TPROXY_LOOKUP_LISTENER);
 
 	/* NOTE: assign_sock consumes our sk reference */
-	if (sk && tproxy_sk_is_transparent(sk)) {
+	if (sk && nf_tproxy_sk_is_transparent(sk)) {
 		/* This should be in a separate target, but we don't do multiple
 		   targets on the same rule yet */
 		skb->mark = (skb->mark & ~mark_mask) ^ mark_value;
@@ -366,87 +122,6 @@ tproxy_tg4_v1(struct sk_buff *skb, const struct xt_action_param *par)
 
 #ifdef XT_TPROXY_HAVE_IPV6
 
-static inline const struct in6_addr *
-tproxy_laddr6(struct sk_buff *skb, const struct in6_addr *user_laddr,
-	      const struct in6_addr *daddr)
-{
-	struct inet6_dev *indev;
-	struct inet6_ifaddr *ifa;
-	struct in6_addr *laddr;
-
-	if (!ipv6_addr_any(user_laddr))
-		return user_laddr;
-	laddr = NULL;
-
-	indev = __in6_dev_get(skb->dev);
-	if (indev) {
-		read_lock_bh(&indev->lock);
-		list_for_each_entry(ifa, &indev->addr_list, if_list) {
-			if (ifa->flags & (IFA_F_TENTATIVE | IFA_F_DEPRECATED))
-				continue;
-
-			laddr = &ifa->addr;
-			break;
-		}
-		read_unlock_bh(&indev->lock);
-	}
-
-	return laddr ? laddr : daddr;
-}
-
-/**
- * tproxy_handle_time_wait6 - handle IPv6 TCP TIME_WAIT reopen redirections
- * @skb:	The skb being processed.
- * @tproto:	Transport protocol.
- * @thoff:	Transport protocol header offset.
- * @par:	Iptables target parameters.
- * @sk:		The TIME_WAIT TCP socket found by the lookup.
- *
- * We have to handle SYN packets arriving to TIME_WAIT sockets
- * differently: instead of reopening the connection we should rather
- * redirect the new connection to the proxy if there's a listener
- * socket present.
- *
- * tproxy_handle_time_wait6() consumes the socket reference passed in.
- *
- * Returns the listener socket if there's one, the TIME_WAIT socket if
- * no such listener is found, or NULL if the TCP header is incomplete.
- */
-static struct sock *
-tproxy_handle_time_wait6(struct sk_buff *skb, int tproto, int thoff,
-			 const struct xt_action_param *par,
-			 struct sock *sk)
-{
-	const struct ipv6hdr *iph = ipv6_hdr(skb);
-	struct tcphdr _hdr, *hp;
-	const struct xt_tproxy_target_info_v1 *tgi = par->targinfo;
-
-	hp = skb_header_pointer(skb, thoff, sizeof(_hdr), &_hdr);
-	if (hp == NULL) {
-		inet_twsk_put(inet_twsk(sk));
-		return NULL;
-	}
-
-	if (hp->syn && !hp->rst && !hp->ack && !hp->fin) {
-		/* SYN to a TIME_WAIT socket, we'd rather redirect it
-		 * to a listener socket if there's one */
-		struct sock *sk2;
-
-		sk2 = nf_tproxy_get_sock_v6(xt_net(par), skb, thoff, hp, tproto,
-					    &iph->saddr,
-					    tproxy_laddr6(skb, &tgi->laddr.in6, &iph->daddr),
-					    hp->source,
-					    tgi->lport ? tgi->lport : hp->dest,
-					    skb->dev, NFT_LOOKUP_LISTENER);
-		if (sk2) {
-			inet_twsk_deschedule_put(inet_twsk(sk));
-			sk = sk2;
-		}
-	}
-
-	return sk;
-}
-
 static unsigned int
 tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 {
@@ -478,25 +153,31 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 	sk = nf_tproxy_get_sock_v6(xt_net(par), skb, thoff, hp, tproto,
 				   &iph->saddr, &iph->daddr,
 				   hp->source, hp->dest,
-				   xt_in(par), NFT_LOOKUP_ESTABLISHED);
+				   xt_in(par), NF_TPROXY_LOOKUP_ESTABLISHED);
 
-	laddr = tproxy_laddr6(skb, &tgi->laddr.in6, &iph->daddr);
+	laddr = nf_tproxy_laddr6(skb, &tgi->laddr.in6, &iph->daddr);
 	lport = tgi->lport ? tgi->lport : hp->dest;
 
 	/* UDP has no TCP_TIME_WAIT state, so we never enter here */
-	if (sk && sk->sk_state == TCP_TIME_WAIT)
+	if (sk && sk->sk_state == TCP_TIME_WAIT) {
+		const struct xt_tproxy_target_info_v1 *tgi = par->targinfo;
 		/* reopening a TIME_WAIT connection needs special handling */
-		sk = tproxy_handle_time_wait6(skb, tproto, thoff, par, sk);
+		sk = nf_tproxy_handle_time_wait6(skb, tproto, thoff,
+					      xt_net(par),
+					      &tgi->laddr.in6,
+					      tgi->lport,
+					      sk);
+	}
 	else if (!sk)
 		/* no there's no established connection, check if
 		 * there's a listener on the redirected addr/port */
 		sk = nf_tproxy_get_sock_v6(xt_net(par), skb, thoff, hp,
 					   tproto, &iph->saddr, laddr,
 					   hp->source, lport,
-					   xt_in(par), NFT_LOOKUP_LISTENER);
+					   xt_in(par), NF_TPROXY_LOOKUP_LISTENER);
 
 	/* NOTE: assign_sock consumes our sk reference */
-	if (sk && tproxy_sk_is_transparent(sk)) {
+	if (sk && nf_tproxy_sk_is_transparent(sk)) {
 		/* This should be in a separate target, but we don't do multiple
 		   targets on the same rule yet */
 		skb->mark = (skb->mark & ~tgi->mark_mask) ^ tgi->mark_value;
-- 
2.11.0

^ permalink raw reply related

* [PATCH 3/9] netfilter: nf_tables: pass context to object destroy indirection
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180602231413.31974-1-pablo@netfilter.org>

The new connlimit object needs this to properly deal with conntrack
dependencies.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |  3 ++-
 net/netfilter/nf_tables_api.c     | 12 ++++++------
 net/netfilter/nft_counter.c       |  3 ++-
 net/netfilter/nft_ct.c            |  3 ++-
 4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 435c32d8a995..81ec070582b6 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -1070,7 +1070,8 @@ struct nft_object_ops {
 	int				(*init)(const struct nft_ctx *ctx,
 						const struct nlattr *const tb[],
 						struct nft_object *obj);
-	void				(*destroy)(struct nft_object *obj);
+	void				(*destroy)(const struct nft_ctx *ctx,
+						   struct nft_object *obj);
 	int				(*dump)(struct sk_buff *skb,
 						struct nft_object *obj,
 						bool reset);
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index c785bc5a66f1..177658f4007e 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4787,7 +4787,7 @@ static int nf_tables_newobj(struct net *net, struct sock *nlsk,
 	kfree(obj->name);
 err2:
 	if (obj->ops->destroy)
-		obj->ops->destroy(obj);
+		obj->ops->destroy(&ctx, obj);
 	kfree(obj);
 err1:
 	module_put(type->owner);
@@ -4997,10 +4997,10 @@ static int nf_tables_getobj(struct net *net, struct sock *nlsk,
 	return err;
 }
 
-static void nft_obj_destroy(struct nft_object *obj)
+static void nft_obj_destroy(const struct nft_ctx *ctx, struct nft_object *obj)
 {
 	if (obj->ops->destroy)
-		obj->ops->destroy(obj);
+		obj->ops->destroy(ctx, obj);
 
 	module_put(obj->ops->type->owner);
 	kfree(obj->name);
@@ -6003,7 +6003,7 @@ static void nft_commit_release(struct nft_trans *trans)
 					   nft_trans_elem(trans).priv);
 		break;
 	case NFT_MSG_DELOBJ:
-		nft_obj_destroy(nft_trans_obj(trans));
+		nft_obj_destroy(&trans->ctx, nft_trans_obj(trans));
 		break;
 	case NFT_MSG_DELFLOWTABLE:
 		nf_tables_flowtable_destroy(nft_trans_flowtable(trans));
@@ -6328,7 +6328,7 @@ static void nf_tables_abort_release(struct nft_trans *trans)
 				     nft_trans_elem(trans).priv, true);
 		break;
 	case NFT_MSG_NEWOBJ:
-		nft_obj_destroy(nft_trans_obj(trans));
+		nft_obj_destroy(&trans->ctx, nft_trans_obj(trans));
 		break;
 	case NFT_MSG_NEWFLOWTABLE:
 		nf_tables_flowtable_destroy(nft_trans_flowtable(trans));
@@ -7022,7 +7022,7 @@ static void __nft_release_tables(struct net *net)
 		list_for_each_entry_safe(obj, ne, &table->objects, list) {
 			list_del(&obj->list);
 			table->use--;
-			nft_obj_destroy(obj);
+			nft_obj_destroy(&ctx, obj);
 		}
 		list_for_each_entry_safe(chain, nc, &table->chains, list) {
 			ctx.chain = chain;
diff --git a/net/netfilter/nft_counter.c b/net/netfilter/nft_counter.c
index eefe3b409925..e59a74d6b7d6 100644
--- a/net/netfilter/nft_counter.c
+++ b/net/netfilter/nft_counter.c
@@ -96,7 +96,8 @@ static void nft_counter_do_destroy(struct nft_counter_percpu_priv *priv)
 	free_percpu(priv->counter);
 }
 
-static void nft_counter_obj_destroy(struct nft_object *obj)
+static void nft_counter_obj_destroy(const struct nft_ctx *ctx,
+				    struct nft_object *obj)
 {
 	struct nft_counter_percpu_priv *priv = nft_obj_data(obj);
 
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index ea737fd789e8..f8b19eacfa0c 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -826,7 +826,8 @@ static int nft_ct_helper_obj_init(const struct nft_ctx *ctx,
 	return 0;
 }
 
-static void nft_ct_helper_obj_destroy(struct nft_object *obj)
+static void nft_ct_helper_obj_destroy(const struct nft_ctx *ctx,
+				      struct nft_object *obj)
 {
 	struct nft_ct_helper_obj *priv = nft_obj_data(obj);
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH 1/9] netfilter: Decrease code duplication regarding transparent socket option
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <20180602231413.31974-1-pablo@netfilter.org>

From: Máté Eckl <ecklm94@gmail.com>

There is a function in include/net/netfilter/nf_socket.h to decide if a
socket has IP(V6)_TRANSPARENT socket option set or not. However this
does the same as inet_sk_transparent() in include/net/tcp.h

include/net/tcp.h:1733
/* This helper checks if socket has IP_TRANSPARENT set */
static inline bool inet_sk_transparent(const struct sock *sk)
{
	switch (sk->sk_state) {
	case TCP_TIME_WAIT:
		return inet_twsk(sk)->tw_transparent;
	case TCP_NEW_SYN_RECV:
		return inet_rsk(inet_reqsk(sk))->no_srccheck;
	}
	return inet_sk(sk)->transparent;
}

tproxy_sk_is_transparent has also been refactored to use this function
instead of reimplementing it.

Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_socket.h | 13 -------------
 net/netfilter/nft_socket.c        |  3 ++-
 net/netfilter/xt_TPROXY.c         | 15 ++-------------
 net/netfilter/xt_socket.c         |  4 ++--
 4 files changed, 6 insertions(+), 29 deletions(-)

diff --git a/include/net/netfilter/nf_socket.h b/include/net/netfilter/nf_socket.h
index 29b6313f0557..f9d7bee9bd4e 100644
--- a/include/net/netfilter/nf_socket.h
+++ b/include/net/netfilter/nf_socket.h
@@ -3,19 +3,6 @@
 #define _NF_SOCK_H_
 
 #include <net/sock.h>
-#include <net/inet_timewait_sock.h>
-
-static inline bool nf_sk_is_transparent(struct sock *sk)
-{
-	switch (sk->sk_state) {
-	case TCP_TIME_WAIT:
-		return inet_twsk(sk)->tw_transparent;
-	case TCP_NEW_SYN_RECV:
-		return inet_rsk(inet_reqsk(sk))->no_srccheck;
-	default:
-		return inet_sk(sk)->transparent;
-	}
-}
 
 struct sock *nf_sk_lookup_slow_v4(struct net *net, const struct sk_buff *skb,
 				  const struct net_device *indev);
diff --git a/net/netfilter/nft_socket.c b/net/netfilter/nft_socket.c
index d86337068ecb..f28a0b944087 100644
--- a/net/netfilter/nft_socket.c
+++ b/net/netfilter/nft_socket.c
@@ -5,6 +5,7 @@
 #include <net/netfilter/nf_tables_core.h>
 #include <net/netfilter/nf_socket.h>
 #include <net/inet_sock.h>
+#include <net/tcp.h>
 
 struct nft_socket {
 	enum nft_socket_keys		key:8;
@@ -48,7 +49,7 @@ static void nft_socket_eval(const struct nft_expr *expr,
 
 	switch(priv->key) {
 	case NFT_SOCKET_TRANSPARENT:
-		nft_reg_store8(dest, nf_sk_is_transparent(sk));
+		nft_reg_store8(dest, inet_sk_transparent(sk));
 		break;
 	default:
 		WARN_ON(1);
diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index 8c89323c06af..74df7977c17c 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -42,19 +42,8 @@ enum nf_tproxy_lookup_t {
 
 static bool tproxy_sk_is_transparent(struct sock *sk)
 {
-	switch (sk->sk_state) {
-	case TCP_TIME_WAIT:
-		if (inet_twsk(sk)->tw_transparent)
-			return true;
-		break;
-	case TCP_NEW_SYN_RECV:
-		if (inet_rsk(inet_reqsk(sk))->no_srccheck)
-			return true;
-		break;
-	default:
-		if (inet_sk(sk)->transparent)
-			return true;
-	}
+	if (inet_sk_transparent(sk))
+		return true;
 
 	sock_gen_put(sk);
 	return false;
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 2ac7f674d19b..5c0779c4fa3c 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -73,7 +73,7 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 		 * if XT_SOCKET_TRANSPARENT is used
 		 */
 		if (info->flags & XT_SOCKET_TRANSPARENT)
-			transparent = nf_sk_is_transparent(sk);
+			transparent = inet_sk_transparent(sk);
 
 		if (info->flags & XT_SOCKET_RESTORESKMARK && !wildcard &&
 		    transparent && sk_fullsock(sk))
@@ -130,7 +130,7 @@ socket_mt6_v1_v2_v3(const struct sk_buff *skb, struct xt_action_param *par)
 		 * if XT_SOCKET_TRANSPARENT is used
 		 */
 		if (info->flags & XT_SOCKET_TRANSPARENT)
-			transparent = nf_sk_is_transparent(sk);
+			transparent = inet_sk_transparent(sk);
 
 		if (info->flags & XT_SOCKET_RESTORESKMARK && !wildcard &&
 		    transparent && sk_fullsock(sk))
-- 
2.11.0

^ permalink raw reply related

* [PATCH 0/9] Netfilter updates for net-next
From: Pablo Neira Ayuso @ 2018-06-02 23:14 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

The following patchset contains Netfilter updates for your net-next tree:

1) Get rid of nf_sk_is_transparent(), use inet_sk_transparent() instead.
   From Máté Eckl.

2) Move shared tproxy infrastructure to nf_tproxy_ipv4 and nf_tproxy_ipv6.
   Also from Máté.

3) Add hashtable to speed up chain lookups by name, from Florian Westphal.

4) Patch series to add connlimit support reusing part of the
   nf_conncount infrastructure. This includes preparation changes such
   passing context to the object and expression destroy interface;
   garbage collection for expressions embedded into set elements, and
   the introduction of the clone_destroy interface for expressions.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks.

----------------------------------------------------------------

The following changes since commit 1ffdd8e1643f6ce28792edd3314be84167faabf1:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next (2018-06-02 09:04:21 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git HEAD

for you to fetch changes up to 9b73c579df3d3d6359a010beda9b03be5a4c3ed5:

  netfilter: nf_tables: handle chain name lookups via rhltable (2018-06-03 00:02:13 +0200)

----------------------------------------------------------------
Florian Westphal (1):
      netfilter: nf_tables: handle chain name lookups via rhltable

Máté Eckl (2):
      netfilter: Decrease code duplication regarding transparent socket option
      netfilter: Libify xt_TPROXY

Pablo Neira Ayuso (6):
      netfilter: nf_tables: pass context to object destroy indirection
      netfilter: nf_conncount: expose connection list interface
      netfilter: nf_tables: pass ctx to nf_tables_expr_destroy()
      netfilter: nf_tables: garbage collection for stateful expressions
      netfilter: nf_tables: add destroy_clone expression
      netfilter: nf_tables: add connlimit support

 include/net/netfilter/nf_conntrack_count.h |  11 +
 include/net/netfilter/nf_socket.h          |  13 -
 include/net/netfilter/nf_tables.h          |  20 +-
 include/net/netfilter/nf_tproxy.h          | 113 +++++++++
 include/uapi/linux/netfilter/nf_tables.h   |  21 +-
 net/ipv4/netfilter/Kconfig                 |   5 +-
 net/ipv4/netfilter/Makefile                |   1 +
 net/ipv4/netfilter/nf_tproxy_ipv4.c        | 147 ++++++++++++
 net/ipv6/netfilter/Kconfig                 |   5 +-
 net/ipv6/netfilter/Makefile                |   1 +
 net/ipv6/netfilter/nf_tproxy_ipv6.c        | 146 ++++++++++++
 net/netfilter/Kconfig                      |  11 +
 net/netfilter/Makefile                     |   1 +
 net/netfilter/nf_conncount.c               |  36 ++-
 net/netfilter/nf_tables_api.c              | 151 ++++++++++--
 net/netfilter/nft_counter.c                |   4 +-
 net/netfilter/nft_ct.c                     |   3 +-
 net/netfilter/nft_dynset.c                 |   9 +
 net/netfilter/nft_set_hash.c               |  21 +-
 net/netfilter/nft_socket.c                 |   3 +-
 net/netfilter/xt_TPROXY.c                  | 366 ++---------------------------
 net/netfilter/xt_socket.c                  |   4 +-
 22 files changed, 680 insertions(+), 412 deletions(-)
 create mode 100644 include/net/netfilter/nf_tproxy.h
 create mode 100644 net/ipv4/netfilter/nf_tproxy_ipv4.c
 create mode 100644 net/ipv6/netfilter/nf_tproxy_ipv6.c

^ permalink raw reply

* [PATCH net-next] Allow ethtool to change tun link settings
From: Chas Williams @ 2018-06-02 21:49 UTC (permalink / raw)
  To: davem; +Cc: netdev, Chas Williams

Let user space set whatever it would like to advertise for the
tun interface.  Preserve the existing defaults.

Signed-off-by: Chas Williams <3chas3@gmail.com>
---
 drivers/net/tun.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2265d2ccea47..7a7492845f53 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -81,6 +81,9 @@
 #include <linux/uaccess.h>
 #include <linux/proc_fs.h>
 
+static void tun_default_link_ksettings(struct net_device *dev,
+				       struct ethtool_link_ksettings *cmd);
+
 /* Uncomment to enable debugging */
 /* #define TUN_DEBUG 1 */
 
@@ -242,6 +245,7 @@ struct tun_struct {
 	struct bpf_prog __rcu *xdp_prog;
 	struct tun_prog __rcu *steering_prog;
 	struct tun_prog __rcu *filter_prog;
+	struct ethtool_link_ksettings link_ksettings;
 };
 
 struct veth {
@@ -2292,6 +2296,7 @@ static void tun_setup(struct net_device *dev)
 
 	tun->owner = INVALID_UID;
 	tun->group = INVALID_GID;
+	tun_default_link_ksettings(dev, &tun->link_ksettings);
 
 	dev->ethtool_ops = &tun_ethtool_ops;
 	dev->needs_free_netdev = true;
@@ -3323,8 +3328,8 @@ static struct miscdevice tun_miscdev = {
 
 /* ethtool interface */
 
-static int tun_get_link_ksettings(struct net_device *dev,
-				  struct ethtool_link_ksettings *cmd)
+static void tun_default_link_ksettings(struct net_device *dev,
+				       struct ethtool_link_ksettings *cmd)
 {
 	ethtool_link_ksettings_zero_link_mode(cmd, supported);
 	ethtool_link_ksettings_zero_link_mode(cmd, advertising);
@@ -3333,6 +3338,23 @@ static int tun_get_link_ksettings(struct net_device *dev,
 	cmd->base.port		= PORT_TP;
 	cmd->base.phy_address	= 0;
 	cmd->base.autoneg	= AUTONEG_DISABLE;
+}
+
+static int tun_get_link_ksettings(struct net_device *dev,
+				  struct ethtool_link_ksettings *cmd)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+
+	memcpy(cmd, &tun->link_ksettings, sizeof(*cmd));
+	return 0;
+}
+
+static int tun_set_link_ksettings(struct net_device *dev,
+				  const struct ethtool_link_ksettings *cmd)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+
+	memcpy(&tun->link_ksettings, cmd, sizeof(*cmd));
 	return 0;
 }
 
@@ -3403,6 +3425,7 @@ static const struct ethtool_ops tun_ethtool_ops = {
 	.get_coalesce   = tun_get_coalesce,
 	.set_coalesce   = tun_set_coalesce,
 	.get_link_ksettings = tun_get_link_ksettings,
+	.set_link_ksettings = tun_set_link_ksettings,
 };
 
 static int tun_queue_resize(struct tun_struct *tun)
-- 
2.14.3

^ permalink raw reply related

* Re: [bpf PATCH v2] bpf: sockmap, fix crash when ipv6 sock is added
From: John Fastabend @ 2018-06-02 21:39 UTC (permalink / raw)
  To: Eric Dumazet, edumazet, ast, daniel, Dave Watson; +Cc: netdev
In-Reply-To: <13d3be75-4ed2-aeca-caba-797766e9b676@gmail.com>

On 06/01/2018 12:58 PM, Eric Dumazet wrote:
> 
> 
> On 06/01/2018 03:46 PM, John Fastabend wrote:
>> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
>> of tcpv6_prot.
> 
> ...
> 
>> +	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
>> +	 * state. Supporting sockets in LISTEN state will require us to
>> +	 * modify the accept implementation to clone rather then share the
>> +	 * ulp context.
>> +	 */
>> +	if (sock->sk_state != TCP_ESTABLISHED)
>> +		return -ENOTSUPP;
>> +
>>  	/* 1. If sock map has BPF programs those will be inherited by the
>>  	 * sock being added. If the sock is already attached to BPF programs
>>  	 * this results in an error.
>>
> 
> Next question will be then : What happens if syzbot uses tcp_disconnect() and then listen() ?

Yep we need to fix that as well :( Looks like we can plumb the
unhash callback and remove it from the sockmap when the socket
goes through tcp_disconnect().

This patch should go in as-is though and we can fix the disconnect
issue with a new patch.

Adding Dave Watson to the thread as well because I'm guessing
the disconnect() case is also applicable to TLS. At least I see
a hw handler for unhash but there does not appear to be a handler
in the SW case, at least from a quick glance.

Thanks again!

> 
> Thanks !
> 

^ permalink raw reply

* Re: [PATCH bpf-next v2 10/11] bpf: sync bpf uapi header with tools
From: Daniel Borkmann @ 2018-06-02 21:09 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev
In-Reply-To: <ad87d48c7fb46aaad9d731354d6beca88120b44f.1527971376.git.daniel@iogearbox.net>

On 06/02/2018 10:53 PM, Daniel Borkmann wrote:
[...]
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 64ac0f7..3f47f62 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -2008,6 +2008,7 @@ union bpf_attr {
>   *	Return
>   * 		0 on success, or a negative error in case of failure.
>   *
> +<<<<<<< c19df1a3088759f66949226160124eca31065b34
>   * int bpf_rc_keydown(void *ctx, u32 protocol, u64 scancode, u32 toggle)
>   *	Description

Sorry for the noise, this went unnoticed in the comment during rebase, therefore
tossed the v2 series in patchwork and removed in v3.

^ permalink raw reply

* [PATCH bpf-next v3 11/11] bpf, doc: add missing patchwork url and libbpf to maintainers
From: Daniel Borkmann @ 2018-06-02 21:06 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180602210641.6163-1-daniel@iogearbox.net>

Add missing bits under tools/lib/bpf/ and also Q: entry in order to
make it easier for people to retrieve current patch queue.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index f492431..2fd51db 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2722,6 +2722,7 @@ L:	netdev@vger.kernel.org
 L:	linux-kernel@vger.kernel.org
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+Q:	https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147
 S:	Supported
 F:	arch/x86/net/bpf_jit*
 F:	Documentation/networking/filter.txt
@@ -2740,6 +2741,7 @@ F:	net/sched/act_bpf.c
 F:	net/sched/cls_bpf.c
 F:	samples/bpf/
 F:	tools/bpf/
+F:	tools/lib/bpf/
 F:	tools/testing/selftests/bpf/
 
 BROADCOM B44 10/100 ETHERNET DRIVER
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 10/11] bpf: sync bpf uapi header with tools
From: Daniel Borkmann @ 2018-06-02 21:06 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180602210641.6163-1-daniel@iogearbox.net>

Pull in recent changes from include/uapi/linux/bpf.h.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
---
 tools/include/uapi/linux/bpf.h | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 64ac0f7..f0b6608 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2054,6 +2054,22 @@ union bpf_attr {
  *
  *	Return
  *		0
+ *
+ * uint64_t bpf_skb_cgroup_id(struct sk_buff *skb)
+ * 	Description
+ * 		Return the cgroup v2 id of the socket associated with the *skb*.
+ * 		This is roughly similar to the **bpf_get_cgroup_classid**\ ()
+ * 		helper for cgroup v1 by providing a tag resp. identifier that
+ * 		can be matched on or used for map lookups e.g. to implement
+ * 		policy. The cgroup v2 id of a given path in the hierarchy is
+ * 		exposed in user space through the f_handle API in order to get
+ * 		to the same 64-bit id.
+ *
+ * 		This helper can be used on TC egress path, but not on ingress,
+ * 		and is available only if the kernel was compiled with the
+ * 		**CONFIG_SOCK_CGROUP_DATA** configuration option.
+ * 	Return
+ * 		The id is returned or 0 in case the id could not be retrieved.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2134,7 +2150,8 @@ union bpf_attr {
 	FN(lwt_seg6_adjust_srh),	\
 	FN(lwt_seg6_action),		\
 	FN(rc_repeat),			\
-	FN(rc_keydown),
+	FN(rc_keydown),			\
+	FN(skb_cgroup_id),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2251,7 +2268,7 @@ struct bpf_tunnel_key {
 	};
 	__u8 tunnel_tos;
 	__u8 tunnel_ttl;
-	__u16 tunnel_ext;
+	__u16 tunnel_ext;	/* Padding, future use. */
 	__u32 tunnel_label;
 };
 
@@ -2262,6 +2279,7 @@ struct bpf_xfrm_state {
 	__u32 reqid;
 	__u32 spi;	/* Stored in network byte order */
 	__u16 family;
+	__u16 ext;	/* Padding, future use. */
 	union {
 		__u32 remote_ipv4;	/* Stored in network byte order */
 		__u32 remote_ipv6[4];	/* Stored in network byte order */
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 09/11] bpf: fix context access in tracing progs on 32 bit archs
From: Daniel Borkmann @ 2018-06-02 21:06 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180602210641.6163-1-daniel@iogearbox.net>

Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT
program type in test_verifier report the following errors on x86_32:

  172/p unpriv: spill/fill of different pointers ldx FAIL
  Unexpected error message!
  0: (bf) r6 = r10
  1: (07) r6 += -8
  2: (15) if r1 == 0x0 goto pc+3
  R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
  3: (bf) r2 = r10
  4: (07) r2 += -76
  5: (7b) *(u64 *)(r6 +0) = r2
  6: (55) if r1 != 0x0 goto pc+1
  R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp
  7: (7b) *(u64 *)(r6 +0) = r1
  8: (79) r1 = *(u64 *)(r6 +0)
  9: (79) r1 = *(u64 *)(r1 +68)
  invalid bpf_context access off=68 size=8

  378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (71) r0 = *(u8 *)(r1 +68)
  invalid bpf_context access off=68 size=1

  379/p check bpf_perf_event_data->sample_period half load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (69) r0 = *(u16 *)(r1 +68)
  invalid bpf_context access off=68 size=2

  380/p check bpf_perf_event_data->sample_period word load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (61) r0 = *(u32 *)(r1 +68)
  invalid bpf_context access off=68 size=4

  381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (79) r0 = *(u64 *)(r1 +68)
  invalid bpf_context access off=68 size=8

Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte
boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok()
will then bail out saying that off & (size_default - 1) which is 68 & 7
doesn't cleanly align in the case of sample_period access from struct
bpf_perf_event_data, hence verifier wrongly thinks we might be doing an
unaligned access here though underlying arch can handle it just fine.
Therefore adjust this down to machine size and check and rewrite the
offset for narrow access on that basis. We also need to fix corresponding
pe_prog_is_valid_access(), since we hit the check for off % size != 0
(e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs
for tracing work on x86_32.

Reported-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang YanQing <udknight@gmail.com>
---
 include/linux/filter.h   | 30 ++++++++++++++++++++++++------
 kernel/bpf/verifier.c    |  3 ++-
 kernel/trace/bpf_trace.c | 10 ++++++++--
 3 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 8e60f1e..45fc0f5 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -639,16 +639,34 @@ static inline bool bpf_prog_was_classic(const struct bpf_prog *prog)
 	return prog->type == BPF_PROG_TYPE_UNSPEC;
 }
 
-static inline bool
-bpf_ctx_narrow_access_ok(u32 off, u32 size, const u32 size_default)
+static inline u32 bpf_ctx_off_adjust_machine(u32 size)
+{
+	const u32 size_machine = sizeof(unsigned long);
+
+	if (size > size_machine && size % size_machine == 0)
+		size = size_machine;
+
+	return size;
+}
+
+static inline bool bpf_ctx_narrow_align_ok(u32 off, u32 size_access,
+					   u32 size_default)
 {
-	bool off_ok;
+	size_default = bpf_ctx_off_adjust_machine(size_default);
+	size_access  = bpf_ctx_off_adjust_machine(size_access);
+
 #ifdef __LITTLE_ENDIAN
-	off_ok = (off & (size_default - 1)) == 0;
+	return (off & (size_default - 1)) == 0;
 #else
-	off_ok = (off & (size_default - 1)) + size == size_default;
+	return (off & (size_default - 1)) + size_access == size_default;
 #endif
-	return off_ok && size <= size_default && (size & (size - 1)) == 0;
+}
+
+static inline bool
+bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default)
+{
+	return bpf_ctx_narrow_align_ok(off, size, size_default) &&
+	       size <= size_default && (size & (size - 1)) == 0;
 }
 
 #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5684b15..011def5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5349,6 +5349,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		 */
 		is_narrower_load = size < ctx_field_size;
 		if (is_narrower_load) {
+			u32 size_default = bpf_ctx_off_adjust_machine(ctx_field_size);
 			u32 off = insn->off;
 			u8 size_code;
 
@@ -5363,7 +5364,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			else if (ctx_field_size == 8)
 				size_code = BPF_DW;
 
-			insn->off = off & ~(ctx_field_size - 1);
+			insn->off = off & ~(size_default - 1);
 			insn->code = BPF_LDX | BPF_MEM | size_code;
 		}
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index af1486d..752992c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -880,8 +880,14 @@ static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type
 		return false;
 	if (type != BPF_READ)
 		return false;
-	if (off % size != 0)
-		return false;
+	if (off % size != 0) {
+		if (sizeof(unsigned long) != 4)
+			return false;
+		if (size != 8)
+			return false;
+		if (off % size != 4)
+			return false;
+	}
 
 	switch (off) {
 	case bpf_ctx_range(struct bpf_perf_event_data, sample_period):
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 08/11] bpf: fix cbpf parser bug for octal numbers
From: Daniel Borkmann @ 2018-06-02 21:06 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180602210641.6163-1-daniel@iogearbox.net>

Range is 0-7, not 0-9, otherwise parser silently excludes it from the
strtol() rather than throwing an error.

Reported-by: Marc Boschma <marc@boschma.cx>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
---
 tools/bpf/bpf_exp.l | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpf_exp.l b/tools/bpf/bpf_exp.l
index bd83149..4da8d05 100644
--- a/tools/bpf/bpf_exp.l
+++ b/tools/bpf/bpf_exp.l
@@ -175,7 +175,7 @@ extern void yyerror(const char *str);
 			yylval.number = strtol(yytext, NULL, 10);
 			return number;
 		}
-([0][0-9]+)	{
+([0][0-7]+)	{
 			yylval.number = strtol(yytext + 1, NULL, 8);
 			return number;
 		}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 07/11] bpf: make sure to clear unused fields in tunnel/xfrm state fetch
From: Daniel Borkmann @ 2018-06-02 21:06 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180602210641.6163-1-daniel@iogearbox.net>

Since the remaining bits are not filled in struct bpf_tunnel_key
resp. struct bpf_xfrm_state and originate from uninitialized stack
space, we should make sure to clear them before handing control
back to the program.

Also add a padding element to struct bpf_xfrm_state for future use
similar as we have in struct bpf_tunnel_key and clear it as well.

  struct bpf_xfrm_state {
      __u32                      reqid;            /*     0     4 */
      __u32                      spi;              /*     4     4 */
      __u16                      family;           /*     8     2 */

      /* XXX 2 bytes hole, try to pack */

      union {
          __u32              remote_ipv4;          /*           4 */
          __u32              remote_ipv6[4];       /*          16 */
      };                                           /*    12    16 */

      /* size: 28, cachelines: 1, members: 4 */
      /* sum members: 26, holes: 1, sum holes: 2 */
      /* last cacheline: 28 bytes */
  };

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
---
 include/uapi/linux/bpf.h | 3 ++-
 net/core/filter.c        | 6 ++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6613181..f0b6608 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2268,7 +2268,7 @@ struct bpf_tunnel_key {
 	};
 	__u8 tunnel_tos;
 	__u8 tunnel_ttl;
-	__u16 tunnel_ext;
+	__u16 tunnel_ext;	/* Padding, future use. */
 	__u32 tunnel_label;
 };
 
@@ -2279,6 +2279,7 @@ struct bpf_xfrm_state {
 	__u32 reqid;
 	__u32 spi;	/* Stored in network byte order */
 	__u16 family;
+	__u16 ext;	/* Padding, future use. */
 	union {
 		__u32 remote_ipv4;	/* Stored in network byte order */
 		__u32 remote_ipv6[4];	/* Stored in network byte order */
diff --git a/net/core/filter.c b/net/core/filter.c
index edbfaa6..28e8647 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3445,6 +3445,7 @@ BPF_CALL_4(bpf_skb_get_tunnel_key, struct sk_buff *, skb, struct bpf_tunnel_key
 	to->tunnel_id = be64_to_cpu(info->key.tun_id);
 	to->tunnel_tos = info->key.tos;
 	to->tunnel_ttl = info->key.ttl;
+	to->tunnel_ext = 0;
 
 	if (flags & BPF_F_TUNINFO_IPV6) {
 		memcpy(to->remote_ipv6, &info->key.u.ipv6.src,
@@ -3452,6 +3453,8 @@ BPF_CALL_4(bpf_skb_get_tunnel_key, struct sk_buff *, skb, struct bpf_tunnel_key
 		to->tunnel_label = be32_to_cpu(info->key.label);
 	} else {
 		to->remote_ipv4 = be32_to_cpu(info->key.u.ipv4.src);
+		memset(&to->remote_ipv6[1], 0, sizeof(__u32) * 3);
+		to->tunnel_label = 0;
 	}
 
 	if (unlikely(size != sizeof(struct bpf_tunnel_key)))
@@ -4047,11 +4050,14 @@ BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index,
 	to->reqid = x->props.reqid;
 	to->spi = x->id.spi;
 	to->family = x->props.family;
+	to->ext = 0;
+
 	if (to->family == AF_INET6) {
 		memcpy(to->remote_ipv6, x->props.saddr.a6,
 		       sizeof(to->remote_ipv6));
 	} else {
 		to->remote_ipv4 = x->props.saddr.a4;
+		memset(&to->remote_ipv6[1], 0, sizeof(__u32) * 3);
 	}
 
 	return 0;
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 06/11] bpf: add bpf_skb_cgroup_id helper
From: Daniel Borkmann @ 2018-06-02 21:06 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180602210641.6163-1-daniel@iogearbox.net>

Add a new bpf_skb_cgroup_id() helper that allows to retrieve the
cgroup id from the skb's socket. This is useful in particular to
enable bpf_get_cgroup_classid()-like behavior for cgroup v1 in
cgroup v2 by allowing ID based matching on egress. This can in
particular be used in combination with applying policy e.g. from
map lookups, and also complements the older bpf_skb_under_cgroup()
interface. In user space the cgroup id for a given path can be
retrieved through the f_handle as demonstrated in [0] recently.

  [0] https://lkml.org/lkml/2018/5/22/1190

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 include/uapi/linux/bpf.h | 19 ++++++++++++++++++-
 net/core/filter.c        | 29 +++++++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 64ac0f7..6613181 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2054,6 +2054,22 @@ union bpf_attr {
  *
  *	Return
  *		0
+ *
+ * uint64_t bpf_skb_cgroup_id(struct sk_buff *skb)
+ * 	Description
+ * 		Return the cgroup v2 id of the socket associated with the *skb*.
+ * 		This is roughly similar to the **bpf_get_cgroup_classid**\ ()
+ * 		helper for cgroup v1 by providing a tag resp. identifier that
+ * 		can be matched on or used for map lookups e.g. to implement
+ * 		policy. The cgroup v2 id of a given path in the hierarchy is
+ * 		exposed in user space through the f_handle API in order to get
+ * 		to the same 64-bit id.
+ *
+ * 		This helper can be used on TC egress path, but not on ingress,
+ * 		and is available only if the kernel was compiled with the
+ * 		**CONFIG_SOCK_CGROUP_DATA** configuration option.
+ * 	Return
+ * 		The id is returned or 0 in case the id could not be retrieved.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2134,7 +2150,8 @@ union bpf_attr {
 	FN(lwt_seg6_adjust_srh),	\
 	FN(lwt_seg6_action),		\
 	FN(rc_repeat),			\
-	FN(rc_keydown),
+	FN(rc_keydown),			\
+	FN(skb_cgroup_id),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 885fb0e..edbfaa6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3661,6 +3661,27 @@ static const struct bpf_func_proto bpf_skb_under_cgroup_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+#ifdef CONFIG_SOCK_CGROUP_DATA
+BPF_CALL_1(bpf_skb_cgroup_id, const struct sk_buff *, skb)
+{
+	struct sock *sk = skb_to_full_sk(skb);
+	struct cgroup *cgrp;
+
+	if (!sk || !sk_fullsock(sk))
+		return 0;
+
+	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	return cgrp->kn->id.id;
+}
+
+static const struct bpf_func_proto bpf_skb_cgroup_id_proto = {
+	.func           = bpf_skb_cgroup_id,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+};
+#endif
+
 static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 				  unsigned long off, unsigned long len)
 {
@@ -4747,12 +4768,16 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_cookie_proto;
 	case BPF_FUNC_get_socket_uid:
 		return &bpf_get_socket_uid_proto;
+	case BPF_FUNC_fib_lookup:
+		return &bpf_skb_fib_lookup_proto;
 #ifdef CONFIG_XFRM
 	case BPF_FUNC_skb_get_xfrm_state:
 		return &bpf_skb_get_xfrm_state_proto;
 #endif
-	case BPF_FUNC_fib_lookup:
-		return &bpf_skb_fib_lookup_proto;
+#ifdef CONFIG_SOCK_CGROUP_DATA
+	case BPF_FUNC_skb_cgroup_id:
+		return &bpf_skb_cgroup_id_proto;
+#endif
 	default:
 		return bpf_base_func_proto(func_id);
 	}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 05/11] bpf: avoid retpoline for lookup/update/delete calls on maps
From: Daniel Borkmann @ 2018-06-02 21:06 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180602210641.6163-1-daniel@iogearbox.net>

While some of the BPF map lookup helpers provide a ->map_gen_lookup()
callback for inlining the map lookup altogether it is not available
for every map, so the remaining ones have to call bpf_map_lookup_elem()
helper which does a dispatch to map->ops->map_lookup_elem(). In
times of retpolines, this will control and trap speculative execution
rather than letting it do its work for the indirect call and will
therefore cause a slowdown. Likewise, bpf_map_update_elem() and
bpf_map_delete_elem() do not have an inlined version and need to call
into their map->ops->map_update_elem() resp. map->ops->map_delete_elem()
handlers.

Before:

  # bpftool p d x i 1
    0: (bf) r2 = r10
    1: (07) r2 += -8
    2: (7a) *(u64 *)(r2 +0) = 0
    3: (18) r1 = map[id:1]
    5: (85) call __htab_map_lookup_elem#232656
    6: (15) if r0 == 0x0 goto pc+4
    7: (71) r1 = *(u8 *)(r0 +35)
    8: (55) if r1 != 0x0 goto pc+1
    9: (72) *(u8 *)(r0 +35) = 1
   10: (07) r0 += 56
   11: (15) if r0 == 0x0 goto pc+4
   12: (bf) r2 = r0
   13: (18) r1 = map[id:1]
   15: (85) call bpf_map_delete_elem#215008  <-- indirect call via
   16: (95) exit                                 helper

After:

  # bpftool p d x i 1
    0: (bf) r2 = r10
    1: (07) r2 += -8
    2: (7a) *(u64 *)(r2 +0) = 0
    3: (18) r1 = map[id:1]
    5: (85) call __htab_map_lookup_elem#233328
    6: (15) if r0 == 0x0 goto pc+4
    7: (71) r1 = *(u8 *)(r0 +35)
    8: (55) if r1 != 0x0 goto pc+1
    9: (72) *(u8 *)(r0 +35) = 1
   10: (07) r0 += 56
   11: (15) if r0 == 0x0 goto pc+4
   12: (bf) r2 = r0
   13: (18) r1 = map[id:1]
   15: (85) call htab_lru_map_delete_elem#238240  <-- direct call
   16: (95) exit

In all three lookup/update/delete cases however we can use the actual
address of the map callback directly if we find that there's only a
single path with a map pointer leading to the helper call, meaning
when the map pointer has not been poisoned from verifier side.
Example code can be seen above for the delete case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
---
 include/linux/filter.h |  3 +++
 kernel/bpf/hashtab.c   | 12 ++++++---
 kernel/bpf/verifier.c  | 67 +++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 62 insertions(+), 20 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6fd375fe..8e60f1e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -301,6 +301,9 @@ struct xdp_buff;
 
 /* Function call */
 
+#define BPF_CAST_CALL(x)					\
+		((u64 (*)(u64, u64, u64, u64, u64))(x))
+
 #define BPF_EMIT_CALL(FUNC)					\
 	((struct bpf_insn) {					\
 		.code  = BPF_JMP | BPF_CALL,			\
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index b76828f..3ca2198 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -503,7 +503,9 @@ static u32 htab_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 	struct bpf_insn *insn = insn_buf;
 	const int ret = BPF_REG_0;
 
-	*insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem);
+	BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
+		     (void *(*)(struct bpf_map *map, void *key))NULL));
+	*insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem));
 	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
 	*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
 				offsetof(struct htab_elem, key) +
@@ -530,7 +532,9 @@ static u32 htab_lru_map_gen_lookup(struct bpf_map *map,
 	const int ret = BPF_REG_0;
 	const int ref_reg = BPF_REG_1;
 
-	*insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem);
+	BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
+		     (void *(*)(struct bpf_map *map, void *key))NULL));
+	*insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem));
 	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 4);
 	*insn++ = BPF_LDX_MEM(BPF_B, ref_reg, ret,
 			      offsetof(struct htab_elem, lru_node) +
@@ -1369,7 +1373,9 @@ static u32 htab_of_map_gen_lookup(struct bpf_map *map,
 	struct bpf_insn *insn = insn_buf;
 	const int ret = BPF_REG_0;
 
-	*insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem);
+	BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem,
+		     (void *(*)(struct bpf_map *map, void *key))NULL));
+	*insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem));
 	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 2);
 	*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
 				offsetof(struct htab_elem, key) +
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4f4786e..5684b15 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2421,8 +2421,11 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	struct bpf_insn_aux_data *aux = &env->insn_aux_data[insn_idx];
 
 	if (func_id != BPF_FUNC_tail_call &&
-	    func_id != BPF_FUNC_map_lookup_elem)
+	    func_id != BPF_FUNC_map_lookup_elem &&
+	    func_id != BPF_FUNC_map_update_elem &&
+	    func_id != BPF_FUNC_map_delete_elem)
 		return 0;
+
 	if (meta->map_ptr == NULL) {
 		verbose(env, "kernel subsystem misconfigured verifier\n");
 		return -EINVAL;
@@ -5586,6 +5589,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 	struct bpf_insn *insn = prog->insnsi;
 	const struct bpf_func_proto *fn;
 	const int insn_cnt = prog->len;
+	const struct bpf_map_ops *ops;
 	struct bpf_insn_aux_data *aux;
 	struct bpf_insn insn_buf[16];
 	struct bpf_prog *new_prog;
@@ -5715,10 +5719,13 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 		}
 
 		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
-		 * handlers are currently limited to 64 bit only.
+		 * and other inlining handlers are currently limited to 64 bit
+		 * only.
 		 */
 		if (prog->jit_requested && BITS_PER_LONG == 64 &&
-		    insn->imm == BPF_FUNC_map_lookup_elem) {
+		    (insn->imm == BPF_FUNC_map_lookup_elem ||
+		     insn->imm == BPF_FUNC_map_update_elem ||
+		     insn->imm == BPF_FUNC_map_delete_elem)) {
 			aux = &env->insn_aux_data[i + delta];
 			if (bpf_map_ptr_poisoned(aux))
 				goto patch_call_imm;
@@ -5727,23 +5734,49 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			if (!map_ptr->ops->map_gen_lookup)
 				goto patch_call_imm;
 
-			cnt = map_ptr->ops->map_gen_lookup(map_ptr, insn_buf);
-			if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) {
-				verbose(env, "bpf verifier is misconfigured\n");
-				return -EINVAL;
-			}
+			ops = map_ptr->ops;
+			if (insn->imm == BPF_FUNC_map_lookup_elem &&
+			    ops->map_gen_lookup) {
+				cnt = ops->map_gen_lookup(map_ptr, insn_buf);
+				if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) {
+					verbose(env, "bpf verifier is misconfigured\n");
+					return -EINVAL;
+				}
 
-			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf,
-						       cnt);
-			if (!new_prog)
-				return -ENOMEM;
+				new_prog = bpf_patch_insn_data(env, i + delta,
+							       insn_buf, cnt);
+				if (!new_prog)
+					return -ENOMEM;
 
-			delta += cnt - 1;
+				delta    += cnt - 1;
+				env->prog = prog = new_prog;
+				insn      = new_prog->insnsi + i + delta;
+				continue;
+			}
 
-			/* keep walking new program and skip insns we just inserted */
-			env->prog = prog = new_prog;
-			insn      = new_prog->insnsi + i + delta;
-			continue;
+			BUILD_BUG_ON(!__same_type(ops->map_lookup_elem,
+				     (void *(*)(struct bpf_map *map, void *key))NULL));
+			BUILD_BUG_ON(!__same_type(ops->map_delete_elem,
+				     (int (*)(struct bpf_map *map, void *key))NULL));
+			BUILD_BUG_ON(!__same_type(ops->map_update_elem,
+				     (int (*)(struct bpf_map *map, void *key, void *value,
+					      u64 flags))NULL));
+			switch (insn->imm) {
+			case BPF_FUNC_map_lookup_elem:
+				insn->imm = BPF_CAST_CALL(ops->map_lookup_elem) -
+					    __bpf_call_base;
+				continue;
+			case BPF_FUNC_map_update_elem:
+				insn->imm = BPF_CAST_CALL(ops->map_update_elem) -
+					    __bpf_call_base;
+				continue;
+			case BPF_FUNC_map_delete_elem:
+				insn->imm = BPF_CAST_CALL(ops->map_delete_elem) -
+					    __bpf_call_base;
+				continue;
+			}
+
+			goto patch_call_imm;
 		}
 
 		if (insn->imm == BPF_FUNC_redirect_map) {
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v3 04/11] bpf: show prog and map id in fdinfo
From: Daniel Borkmann @ 2018-06-02 21:06 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
In-Reply-To: <20180602210641.6163-1-daniel@iogearbox.net>

Its trivial and straight forward to expose it for scripts that can
then use it along with bpftool in order to inspect an individual
application's used maps and progs. Right now we dump some basic
information in the fdinfo file but with the help of the map/prog
id full introspection becomes possible now.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 kernel/bpf/syscall.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7365d79..0fa2062 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -327,13 +327,15 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
 		   "value_size:\t%u\n"
 		   "max_entries:\t%u\n"
 		   "map_flags:\t%#x\n"
-		   "memlock:\t%llu\n",
+		   "memlock:\t%llu\n"
+		   "map_id:\t%u\n",
 		   map->map_type,
 		   map->key_size,
 		   map->value_size,
 		   map->max_entries,
 		   map->map_flags,
-		   map->pages * 1ULL << PAGE_SHIFT);
+		   map->pages * 1ULL << PAGE_SHIFT,
+		   map->id);
 
 	if (owner_prog_type) {
 		seq_printf(m, "owner_prog_type:\t%u\n",
@@ -1070,11 +1072,13 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp)
 		   "prog_type:\t%u\n"
 		   "prog_jited:\t%u\n"
 		   "prog_tag:\t%s\n"
-		   "memlock:\t%llu\n",
+		   "memlock:\t%llu\n"
+		   "prog_id:\t%u\n",
 		   prog->type,
 		   prog->jited,
 		   prog_tag,
-		   prog->pages * 1ULL << PAGE_SHIFT);
+		   prog->pages * 1ULL << PAGE_SHIFT,
+		   prog->aux->id);
 }
 #endif
 
-- 
2.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox