Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] net/sched: cls_flower: Add user specified data
From: John Fastabend @ 2017-01-02 18:23 UTC (permalink / raw)
  To: Jamal Hadi Salim, Paul Blakey, David S. Miller, netdev
  Cc: Jiri Pirko, Hadar Hen Zion, Or Gerlitz, Roi Dayan, Roman Mashak
In-Reply-To: <14675f63-4212-2f72-da4c-cd24b9d10881@mojatatu.com>

On 17-01-02 06:59 AM, Jamal Hadi Salim wrote:
> 
> We have been using a cookie as well for actions (which we have been
> using but have been too lazy to submit so far). I am going to port
> it over to the newer kernels and post it.
> In our case that is intended to be opaque to the kernel i.e kernel
> never inteprets it; in that case it is similar to the kernel
> FIB protocol field.
> 
> In your case - could this cookie have been a class/flowid
> (a 32 bit)?
> And would it not make more sense for it the cookie to be
> generic to all classifiers? i.e why is it specific to flower?
> 
> cheers,
> jamal
> 
> On 17-01-02 08:13 AM, Paul Blakey wrote:
>> This is to support saving extra data that might be helpful on retrieval.
>> First use case is upcoming openvswitch flow offloads, extra data will
>> include UFID and port mappings for each added flow.
>>
>> Signed-off-by: Paul Blakey <paulb@mellanox.com>
>> Reviewed-by: Roi Dayan <roid@mellanox.com>
>> Acked-by: Jiri Pirko <jiri@mellanox.com>
>> ---

Additionally I would like to point out this is an arbitrary length binary
blob (for undefined use, without even a specified encoding) that gets pushed
between user space and hardware ;) This seemed to get folks fairly excited in
the past.

Some questions, exactly what do you mean by "port mappings" above? In
general the 'tc' API uses the netdev the netlink msg is processed on as
the port mapping. If you mean OVS port to netdev port I think this is
a OVS problem and nothing to do with 'tc'. For what its worth there is an
existing problem with 'tc' where rules only apply to a single ingress or
egress port which is limiting on hardware.

The UFID in my ovs code base is defined as best I can tell here,

        [OVS_FLOW_ATTR_UFID] = { .type = NL_A_UNSPEC, .optional = true,
                                 .min_len = sizeof(ovs_u128) },

So you need 128 bits if you want a 1:1 mapping onto 'tc'. So rather
than an arbitrary blob why not make the case that 'tc' ids need to be
128 bits long? Even if its just initially done in flower call it
flower_flow_id and define it so its not opaque and at least at the code
level it isn't an arbitrary blob of data.

And what are the "next" uses of this besides OVS. It would be really
valuable to see how this generalizes to other usage models. To avoid
embedding OVS syntax into 'tc'.

Finally if you want to see an example of binary data encodings look at
how drivers/hardware/users are currently using the user defined bits in
ethtools ntuple API. Also track down out of tree drivers to see other
interesting uses. And that was capped at 64bits :/

Thanks,
John

^ permalink raw reply

* Re: tcp_bbr: Forcing set of BBR congestion control as default
From: Sedat Dilek @ 2017-01-02 18:49 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Netdev
In-Reply-To: <CADVnQy=q88Hvh2aYnGCpCQH_b5PNwTZr_AWfgU0NnuSs6aatVQ@mail.gmail.com>

On Mon, Jan 2, 2017 at 7:17 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Mon, Jan 2, 2017 at 12:05 AM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
>>
>> Hi,
>>
>> I am trying to force the set of BBR congestion control as default.
>> My old linux-config uses CUBIC as default.
>> I want both BBR and CUBIC to be built but BBR shall be my default.
>>
>> I tried the below snippet.
>>
>> I refresh my new linux-config like this...
>>
>> $ MAKE="make V=1" ; COMPILER="mycompiler" ; MAKE_OPTS="CC=$COMPILER
>> HOSTCC=$COMPILER" ; yes "" | $MAKE $MAKE_OPTS oldconfig && $MAKE
>> $MAKE_OPTS silentoldconfig < /dev/null
>>
>> What am I doing wrong?
>
> Perhaps your build directory already has an old .config file that sets
> the DEFAULT_TCP_CONG to be "cubic", and your "make oldconfig" and
> "make silentoldconfig" are maintaining those lines from the old
> .config?
>
> If you want to start with your existing .config but then make BBR the
> default, then it might be simplest just to edit your .config directly:
>
> -CONFIG_DEFAULT_TCP_CONG="cubic"
> +CONFIG_DEFAULT_TCP_CONG="bbr"
>

Just to clarify...

I can have both TCP_CONG cubic and bbr built and switch for example via sysctl?

Which tc version is required?
Here tc is from iproute (20121211-2~precise).
Is that enough?

> BTW, I presume you have seen the warning in the BBR commit message or
> tcp_bbr.c about ensuring that BBR is used with the "fq" qdisc:
>
>   NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
>   enabled, since pacing is integral to the BBR design and
>   implementation. BBR without pacing would not function properly, and
>   may incur unnecessary high packet loss rates.
>
> The BBR quick-start guide has some details about how to build and
> enable BBR and fq:
>
>   https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md
>

Hmmm...

>From [1] Section "Further reading"...

egrep '(CONFIG_TCP_CONG_BBR|CONFIG_NET_SCH_FQ)=' .config

then you see exactly the following lines:

CONFIG_TCP_CONG_BBR=y
CONFIG_NET_SCH_FQ=y

Should CONFIG_TCP_CONG_BBR have a "select CONFIG_NET_SCH_FQ" in its Kconfig?
That would be safer.

[1] https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md

> Hope that helps,

Thanks you reply helped for further testings.

- Sedat -

P.S.: Note2myself: Enable NET_SCH_FQ

$ ./scripts/diffconfig /boot/config-4.9.0-2-iniza-amd64 .config
 NET_SCH_FQ n -> y

^ permalink raw reply

* [PATCH] net: faraday: ftmac100: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes @ 2017-01-02 18:53 UTC (permalink / raw)
  To: davem, joel, gwshan, andrew, andrew; +Cc: netdev, linux-kernel, Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
---
 drivers/net/ethernet/faraday/ftmac100.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/faraday/ftmac100.c b/drivers/net/ethernet/faraday/ftmac100.c
index dce5f7b..c0ddbbe 100644
--- a/drivers/net/ethernet/faraday/ftmac100.c
+++ b/drivers/net/ethernet/faraday/ftmac100.c
@@ -825,16 +825,18 @@ static void ftmac100_get_drvinfo(struct net_device *netdev,
 	strlcpy(info->bus_info, dev_name(&netdev->dev), sizeof(info->bus_info));
 }
 
-static int ftmac100_get_settings(struct net_device *netdev, struct ethtool_cmd *cmd)
+static int ftmac100_get_link_ksettings(struct net_device *netdev,
+				       struct ethtool_link_ksettings *cmd)
 {
 	struct ftmac100 *priv = netdev_priv(netdev);
-	return mii_ethtool_gset(&priv->mii, cmd);
+	return mii_ethtool_get_link_ksettings(&priv->mii, cmd);
 }
 
-static int ftmac100_set_settings(struct net_device *netdev, struct ethtool_cmd *cmd)
+static int ftmac100_set_link_ksettings(struct net_device *netdev,
+				       const struct ethtool_link_ksettings *cmd)
 {
 	struct ftmac100 *priv = netdev_priv(netdev);
-	return mii_ethtool_sset(&priv->mii, cmd);
+	return mii_ethtool_set_link_ksettings(&priv->mii, cmd);
 }
 
 static int ftmac100_nway_reset(struct net_device *netdev)
@@ -850,11 +852,11 @@ static u32 ftmac100_get_link(struct net_device *netdev)
 }
 
 static const struct ethtool_ops ftmac100_ethtool_ops = {
-	.set_settings		= ftmac100_set_settings,
-	.get_settings		= ftmac100_get_settings,
 	.get_drvinfo		= ftmac100_get_drvinfo,
 	.nway_reset		= ftmac100_nway_reset,
 	.get_link		= ftmac100_get_link,
+	.get_link_ksettings	= ftmac100_get_link_ksettings,
+	.set_link_ksettings	= ftmac100_set_link_ksettings,
 };
 
 /******************************************************************************
-- 
1.7.4.4

^ permalink raw reply related

* Re: [PATCH iproute2 net-next] tc: flower: support matching flags
From: Jiri Benc @ 2017-01-02 18:55 UTC (permalink / raw)
  To: Paul Blakey
  Cc: netdev, Stephen Hemminger, David S. Miller, Hadar Hen Zion,
	Or Gerlitz, Roi Dayan
In-Reply-To: <1482930409-55059-1-git-send-email-paulb@mellanox.com>

On Wed, 28 Dec 2016 15:06:49 +0200, Paul Blakey wrote:
> Enhance flower to support matching on flags.
> 
> The 1st flag allows to match on whether the packet is
> an IP fragment.
> 
> Example:
> 
> 	# add a flower filter that will drop fragmented packets
> 	# (bit 0 of control flags)
> 	tc filter add dev ens4f0 protocol ip parent ffff: \
> 		flower \
> 		src_mac e4:1d:2d:fd:8b:01 \
> 		dst_mac e4:1d:2d:fd:8b:02 \
> 		indev ens4f0 \
> 		matching_flags 0x1/0x1 \
> 	action drop

This is very poor API. First, how is the user supposed to know what
those magic values in "matching_flags" mean? At the very least, it
should be documented in the man page.

Second, why "matching_flags"? That name suggests that those modify the
way the matching is done (to illustrate my point, I'd expect things
like "if the packet is too short, match this rule anyway" to be a
"matching flag"). But this is not the case. What's wrong with plain
"flags"? Or, if you want to be more specific, perhaps packet_flags?

Third, all of this looks very wrong anyway. There should be separate
keywords for individual flags. In this case, there should be an
"ip_fragment" flag. The tc tool should be responsible for putting the
flags together and creating the appropriate mask. The example would
then be:

	tc filter add dev ens4f0 protocol ip parent ffff: \
		flower \
		src_mac e4:1d:2d:fd:8b:01 \
		dst_mac e4:1d:2d:fd:8b:02 \
		indev ens4f0 \
		ip_fragment yes\
	action drop

I don't care whether it's "ip_fragment yes/no", "ip_fragment 1/0",
"ip_fragment/noip_fragment" or similar. The important thing is it's a
boolean flag; if specified, it's set to 0/1 and unmasked, if not
specified, it's wildcarded.

Stephen, I understand that you already applied this patch but given how
horrible the proposed API is and that's even undocumented in this
patch, please reconsider this. If this is released, the API is set in
stone and, frankly, it's very user unfriendly this way.

Paul, could you please prepare a patch that would introduce a more sane
API? I'd strongly prefer what I described under "third" but should you
strongly disagree, at least implement "second" and document the
currently known flag values.

Thanks,

 Jiri

^ permalink raw reply

* Re: tcp_bbr: Forcing set of BBR congestion control as default
From: Neal Cardwell @ 2017-01-02 19:12 UTC (permalink / raw)
  To: sedat.dilek; +Cc: Netdev
In-Reply-To: <CA+icZUUxHFDqfEj+DitFFH6P9+P5kOfMLoQpnK1qW+EZ6PmPKQ@mail.gmail.com>

On Mon, Jan 2, 2017 at 1:49 PM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
> On Mon, Jan 2, 2017 at 7:17 PM, Neal Cardwell <ncardwell@google.com> wrote:
>> On Mon, Jan 2, 2017 at 12:05 AM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I am trying to force the set of BBR congestion control as default.
>>> My old linux-config uses CUBIC as default.
>>> I want both BBR and CUBIC to be built but BBR shall be my default.
>>>
>>> I tried the below snippet.
>>>
>>> I refresh my new linux-config like this...
>>>
>>> $ MAKE="make V=1" ; COMPILER="mycompiler" ; MAKE_OPTS="CC=$COMPILER
>>> HOSTCC=$COMPILER" ; yes "" | $MAKE $MAKE_OPTS oldconfig && $MAKE
>>> $MAKE_OPTS silentoldconfig < /dev/null
>>>
>>> What am I doing wrong?
>>
>> Perhaps your build directory already has an old .config file that sets
>> the DEFAULT_TCP_CONG to be "cubic", and your "make oldconfig" and
>> "make silentoldconfig" are maintaining those lines from the old
>> .config?
>>
>> If you want to start with your existing .config but then make BBR the
>> default, then it might be simplest just to edit your .config directly:
>>
>> -CONFIG_DEFAULT_TCP_CONG="cubic"
>> +CONFIG_DEFAULT_TCP_CONG="bbr"
>>
>
> Just to clarify...
>
> I can have both TCP_CONG cubic and bbr built and switch for example via sysctl?

Yes, you can set the default congestion control using sysctl, eg:

  sysctl net.ipv4.tcp_congestion_control=bbr

And (as mentioned in the quick-start guide) on many Linux systems you
can make this "sticky" after reboots with something like:

  sudo bash -c 'echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf'

> Which tc version is required?
> Here tc is from iproute (20121211-2~precise).
> Is that enough?

You should not need a particular version of tc to install "fq".

In the BBR quick-start guide we recommend setting fq to be the default qdisc:

  sudo bash -c 'echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf'

But if that doesn't work in your environment, then I believe you
should be able to install the fq qdisc with any version of tc, with
something like:

  tc qdisc replace dev eth0 root fq

You just won't be able to set or view configuration options.

>> BTW, I presume you have seen the warning in the BBR commit message or
>> tcp_bbr.c about ensuring that BBR is used with the "fq" qdisc:
>>
>>   NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
>>   enabled, since pacing is integral to the BBR design and
>>   implementation. BBR without pacing would not function properly, and
>>   may incur unnecessary high packet loss rates.
>>
>> The BBR quick-start guide has some details about how to build and
>> enable BBR and fq:
>>
>>   https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md
>>
>
> Hmmm...
>
> From [1] Section "Further reading"...
>
> egrep '(CONFIG_TCP_CONG_BBR|CONFIG_NET_SCH_FQ)=' .config
>
> then you see exactly the following lines:
>
> CONFIG_TCP_CONG_BBR=y
> CONFIG_NET_SCH_FQ=y
>
> Should CONFIG_TCP_CONG_BBR have a "select CONFIG_NET_SCH_FQ" in its Kconfig?
> That would be safer.
>
> [1] https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md

That would be a little safer, but not sufficient (since the qdisc
still has to be configured to be in the transmit path somewhere).

thanks,
neal

^ permalink raw reply

* Re: tcp_bbr: Forcing set of BBR congestion control as default
From: Sedat Dilek @ 2017-01-02 19:30 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Netdev
In-Reply-To: <CADVnQymyX1G4=1qaegDYfbq89k15NsG-Acw7mTHRc7jo7=xbQw@mail.gmail.com>

On Mon, Jan 2, 2017 at 8:12 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Mon, Jan 2, 2017 at 1:49 PM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
>> On Mon, Jan 2, 2017 at 7:17 PM, Neal Cardwell <ncardwell@google.com> wrote:
>>> On Mon, Jan 2, 2017 at 12:05 AM, Sedat Dilek <sedat.dilek@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am trying to force the set of BBR congestion control as default.
>>>> My old linux-config uses CUBIC as default.
>>>> I want both BBR and CUBIC to be built but BBR shall be my default.
>>>>
>>>> I tried the below snippet.
>>>>
>>>> I refresh my new linux-config like this...
>>>>
>>>> $ MAKE="make V=1" ; COMPILER="mycompiler" ; MAKE_OPTS="CC=$COMPILER
>>>> HOSTCC=$COMPILER" ; yes "" | $MAKE $MAKE_OPTS oldconfig && $MAKE
>>>> $MAKE_OPTS silentoldconfig < /dev/null
>>>>
>>>> What am I doing wrong?
>>>
>>> Perhaps your build directory already has an old .config file that sets
>>> the DEFAULT_TCP_CONG to be "cubic", and your "make oldconfig" and
>>> "make silentoldconfig" are maintaining those lines from the old
>>> .config?
>>>
>>> If you want to start with your existing .config but then make BBR the
>>> default, then it might be simplest just to edit your .config directly:
>>>
>>> -CONFIG_DEFAULT_TCP_CONG="cubic"
>>> +CONFIG_DEFAULT_TCP_CONG="bbr"
>>>
>>
>> Just to clarify...
>>
>> I can have both TCP_CONG cubic and bbr built and switch for example via sysctl?
>
> Yes, you can set the default congestion control using sysctl, eg:
>
>   sysctl net.ipv4.tcp_congestion_control=bbr
>
> And (as mentioned in the quick-start guide) on many Linux systems you
> can make this "sticky" after reboots with something like:
>
>   sudo bash -c 'echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf'
>
>> Which tc version is required?
>> Here tc is from iproute (20121211-2~precise).
>> Is that enough?
>
> You should not need a particular version of tc to install "fq".
>
> In the BBR quick-start guide we recommend setting fq to be the default qdisc:
>
>   sudo bash -c 'echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf'
>

OK, this looks now good.

# sysctl net.core.default_qdisc
net.core.default_qdisc = fq

# sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = bbr

I wondered why my Internet connection has so many stalls.

> But if that doesn't work in your environment, then I believe you
> should be able to install the fq qdisc with any version of tc, with
> something like:
>
>   tc qdisc replace dev eth0 root fq
>
> You just won't be able to set or view configuration options.
>
>>> BTW, I presume you have seen the warning in the BBR commit message or
>>> tcp_bbr.c about ensuring that BBR is used with the "fq" qdisc:
>>>
>>>   NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
>>>   enabled, since pacing is integral to the BBR design and
>>>   implementation. BBR without pacing would not function properly, and
>>>   may incur unnecessary high packet loss rates.
>>>
>>> The BBR quick-start guide has some details about how to build and
>>> enable BBR and fq:
>>>
>>>   https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md
>>>
>>
>> Hmmm...
>>
>> From [1] Section "Further reading"...
>>
>> egrep '(CONFIG_TCP_CONG_BBR|CONFIG_NET_SCH_FQ)=' .config
>>
>> then you see exactly the following lines:
>>
>> CONFIG_TCP_CONG_BBR=y
>> CONFIG_NET_SCH_FQ=y
>>
>> Should CONFIG_TCP_CONG_BBR have a "select CONFIG_NET_SCH_FQ" in its Kconfig?
>> That would be safer.
>>
>> [1] https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md
>
> That would be a little safer, but not sufficient (since the qdisc
> still has to be configured to be in the transmit path somewhere).
>

Does BBR only work with fq-qdisc best?
What about fq_codel?

- Sedat -

^ permalink raw reply

* [PATCH net-next] bridge: multicast to unicast
From: Linus Lüssing @ 2017-01-02 19:32 UTC (permalink / raw)
  To: netdev
  Cc: bridge, linux-wireless, linux-kernel, David S . Miller,
	Felix Fietkau

Implements an optional, per bridge port flag and feature to deliver
multicast packets to any host on the according port via unicast
individually. This is done by copying the packet per host and
changing the multicast destination MAC to a unicast one accordingly.

multicast-to-unicast works on top of the multicast snooping feature of
the bridge. Which means unicast copies are only delivered to hosts which
are interested in it and signalized this via IGMP/MLD reports
previously.

This feature is intended for interface types which have a more reliable
and/or efficient way to deliver unicast packets than broadcast ones
(e.g. wifi).

However, it should only be enabled on interfaces where no IGMPv2/MLDv1
report suppression takes place. This feature is disabled by default.

The initial patch and idea is from Felix Fietkau.

Cc: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>

---

This feature is used and enabled by default in OpenWRT and LEDE for AP
interfaces for more than a year now to allow both a more robust multicast
delivery and multicast at higher rates (e.g. multicast streaming).

In OpenWRT/LEDE the IGMP/MLD report suppression issue is overcome by
the network daemon enabling AP isolation and by that separating all STAs.
Delivery of STA-to-STA IP mulitcast is made possible again by
enabling and utilizing the bridge hairpin mode, which considers the
incoming port as a potential outgoing port, too.

Hairpin-mode is performed after multicast snooping, therefore leading to
only deliver reports to STAs running a multicast router.
---
 include/linux/if_bridge.h |  1 +
 net/bridge/br_forward.c   | 44 +++++++++++++++++++++--
 net/bridge/br_mdb.c       |  2 +-
 net/bridge/br_multicast.c | 92 ++++++++++++++++++++++++++++++++++-------------
 net/bridge/br_private.h   |  4 ++-
 net/bridge/br_sysfs_if.c  |  2 ++
 6 files changed, 115 insertions(+), 30 deletions(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index c6587c0..f1b0d78 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -46,6 +46,7 @@ struct br_ip_list {
 #define BR_LEARNING_SYNC	BIT(9)
 #define BR_PROXYARP_WIFI	BIT(10)
 #define BR_MCAST_FLOOD		BIT(11)
+#define BR_MULTICAST_TO_UCAST	BIT(12)
 
 #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 7cb41ae..49d742d 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -174,6 +174,33 @@ static struct net_bridge_port *maybe_deliver(
 	return p;
 }
 
+static struct net_bridge_port *maybe_deliver_addr(
+	struct net_bridge_port *prev, struct net_bridge_port *p,
+	struct sk_buff *skb, const unsigned char *addr,
+	bool local_orig)
+{
+	struct net_device *dev = BR_INPUT_SKB_CB(skb)->brdev;
+	const unsigned char *src = eth_hdr(skb)->h_source;
+
+	if (!should_deliver(p, skb))
+		return prev;
+
+	/* Even with hairpin, no soliloquies - prevent breaking IPv6 DAD */
+	if (skb->dev == p->dev && ether_addr_equal(src, addr))
+		return prev;
+
+	skb = skb_copy(skb, GFP_ATOMIC);
+	if (!skb) {
+		dev->stats.tx_dropped++;
+		return prev;
+	}
+
+	memcpy(eth_hdr(skb)->h_dest, addr, ETH_ALEN);
+	__br_forward(p, skb, local_orig);
+
+	return prev;
+}
+
 /* called under rcu_read_lock */
 void br_flood(struct net_bridge *br, struct sk_buff *skb,
 	      enum br_pkt_type pkt_type, bool local_rcv, bool local_orig)
@@ -231,6 +258,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
 	struct net_bridge_port *prev = NULL;
 	struct net_bridge_port_group *p;
 	struct hlist_node *rp;
+	const unsigned char *addr;
 
 	rp = rcu_dereference(hlist_first_rcu(&br->router_list));
 	p = mdst ? rcu_dereference(mdst->ports) : NULL;
@@ -241,10 +269,20 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
 		rport = rp ? hlist_entry(rp, struct net_bridge_port, rlist) :
 			     NULL;
 
-		port = (unsigned long)lport > (unsigned long)rport ?
-		       lport : rport;
+		if ((unsigned long)lport > (unsigned long)rport) {
+			port = lport;
+			addr = p->unicast ? p->eth_addr : NULL;
+		} else {
+			port = rport;
+			addr = NULL;
+		}
+
+		if (addr)
+			prev = maybe_deliver_addr(prev, port, skb, addr,
+						  local_orig);
+		else
+			prev = maybe_deliver(prev, port, skb, local_orig);
 
-		prev = maybe_deliver(prev, port, skb, local_orig);
 		if (IS_ERR(prev))
 			goto out;
 		if (prev == port)
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 7dbc80d..056e6ac 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -531,7 +531,7 @@ static int br_mdb_add_group(struct net_bridge *br, struct net_bridge_port *port,
 			break;
 	}
 
-	p = br_multicast_new_port_group(port, group, *pp, state);
+	p = br_multicast_new_port_group(port, group, *pp, state, NULL);
 	if (unlikely(!p))
 		return -ENOMEM;
 	rcu_assign_pointer(*pp, p);
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index b30e77e..470a2409 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -43,12 +43,14 @@ static void br_multicast_add_router(struct net_bridge *br,
 static void br_ip4_multicast_leave_group(struct net_bridge *br,
 					 struct net_bridge_port *port,
 					 __be32 group,
-					 __u16 vid);
+					 __u16 vid,
+					 const unsigned char *src);
+
 #if IS_ENABLED(CONFIG_IPV6)
 static void br_ip6_multicast_leave_group(struct net_bridge *br,
 					 struct net_bridge_port *port,
 					 const struct in6_addr *group,
-					 __u16 vid);
+					 __u16 vid, const unsigned char *src);
 #endif
 unsigned int br_mdb_rehash_seq;
 
@@ -711,7 +713,8 @@ struct net_bridge_port_group *br_multicast_new_port_group(
 			struct net_bridge_port *port,
 			struct br_ip *group,
 			struct net_bridge_port_group __rcu *next,
-			unsigned char flags)
+			unsigned char flags,
+			const unsigned char *src)
 {
 	struct net_bridge_port_group *p;
 
@@ -726,12 +729,35 @@ struct net_bridge_port_group *br_multicast_new_port_group(
 	hlist_add_head(&p->mglist, &port->mglist);
 	setup_timer(&p->timer, br_multicast_port_group_expired,
 		    (unsigned long)p);
+
+	if ((port->flags & BR_MULTICAST_TO_UCAST) && src) {
+		memcpy(p->eth_addr, src, ETH_ALEN);
+		p->unicast = true;
+	}
+
 	return p;
 }
 
+static bool br_port_group_equal(struct net_bridge_port_group *p,
+				struct net_bridge_port *port,
+				const unsigned char *src)
+{
+	if (p->port != port)
+		return false;
+
+	if (!p->unicast)
+		return true;
+
+	if (!src)
+		return false;
+
+	return ether_addr_equal(src, p->eth_addr);
+}
+
 static int br_multicast_add_group(struct net_bridge *br,
 				  struct net_bridge_port *port,
-				  struct br_ip *group)
+				  struct br_ip *group,
+				  const unsigned char *src)
 {
 	struct net_bridge_port_group __rcu **pp;
 	struct net_bridge_port_group *p;
@@ -758,13 +784,13 @@ static int br_multicast_add_group(struct net_bridge *br,
 	for (pp = &mp->ports;
 	     (p = mlock_dereference(*pp, br)) != NULL;
 	     pp = &p->next) {
-		if (p->port == port)
+		if (br_port_group_equal(p, port, src))
 			goto found;
 		if ((unsigned long)p->port < (unsigned long)port)
 			break;
 	}
 
-	p = br_multicast_new_port_group(port, group, *pp, 0);
+	p = br_multicast_new_port_group(port, group, *pp, 0, src);
 	if (unlikely(!p))
 		goto err;
 	rcu_assign_pointer(*pp, p);
@@ -783,7 +809,8 @@ static int br_multicast_add_group(struct net_bridge *br,
 static int br_ip4_multicast_add_group(struct net_bridge *br,
 				      struct net_bridge_port *port,
 				      __be32 group,
-				      __u16 vid)
+				      __u16 vid,
+				      const unsigned char *src)
 {
 	struct br_ip br_group;
 
@@ -794,14 +821,15 @@ static int br_ip4_multicast_add_group(struct net_bridge *br,
 	br_group.proto = htons(ETH_P_IP);
 	br_group.vid = vid;
 
-	return br_multicast_add_group(br, port, &br_group);
+	return br_multicast_add_group(br, port, &br_group, src);
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
 static int br_ip6_multicast_add_group(struct net_bridge *br,
 				      struct net_bridge_port *port,
 				      const struct in6_addr *group,
-				      __u16 vid)
+				      __u16 vid,
+				      const unsigned char *src)
 {
 	struct br_ip br_group;
 
@@ -812,7 +840,7 @@ static int br_ip6_multicast_add_group(struct net_bridge *br,
 	br_group.proto = htons(ETH_P_IPV6);
 	br_group.vid = vid;
 
-	return br_multicast_add_group(br, port, &br_group);
+	return br_multicast_add_group(br, port, &br_group, src);
 }
 #endif
 
@@ -1081,6 +1109,7 @@ static int br_ip4_multicast_igmp3_report(struct net_bridge *br,
 					 struct sk_buff *skb,
 					 u16 vid)
 {
+	const unsigned char *src;
 	struct igmpv3_report *ih;
 	struct igmpv3_grec *grec;
 	int i;
@@ -1121,12 +1150,14 @@ static int br_ip4_multicast_igmp3_report(struct net_bridge *br,
 			continue;
 		}
 
+		src = eth_hdr(skb)->h_source;
 		if ((type == IGMPV3_CHANGE_TO_INCLUDE ||
 		     type == IGMPV3_MODE_IS_INCLUDE) &&
 		    ntohs(grec->grec_nsrcs) == 0) {
-			br_ip4_multicast_leave_group(br, port, group, vid);
+			br_ip4_multicast_leave_group(br, port, group, vid, src);
 		} else {
-			err = br_ip4_multicast_add_group(br, port, group, vid);
+			err = br_ip4_multicast_add_group(br, port, group, vid,
+							 src);
 			if (err)
 				break;
 		}
@@ -1141,6 +1172,7 @@ static int br_ip6_multicast_mld2_report(struct net_bridge *br,
 					struct sk_buff *skb,
 					u16 vid)
 {
+	const unsigned char *src = eth_hdr(skb)->h_source;
 	struct icmp6hdr *icmp6h;
 	struct mld2_grec *grec;
 	int i;
@@ -1192,10 +1224,11 @@ static int br_ip6_multicast_mld2_report(struct net_bridge *br,
 		     grec->grec_type == MLD2_MODE_IS_INCLUDE) &&
 		    ntohs(*nsrcs) == 0) {
 			br_ip6_multicast_leave_group(br, port, &grec->grec_mca,
-						     vid);
+						     vid, src);
 		} else {
 			err = br_ip6_multicast_add_group(br, port,
-							 &grec->grec_mca, vid);
+							 &grec->grec_mca, vid,
+							 src);
 			if (err)
 				break;
 		}
@@ -1511,7 +1544,8 @@ br_multicast_leave_group(struct net_bridge *br,
 			 struct net_bridge_port *port,
 			 struct br_ip *group,
 			 struct bridge_mcast_other_query *other_query,
-			 struct bridge_mcast_own_query *own_query)
+			 struct bridge_mcast_own_query *own_query,
+			 const unsigned char *src)
 {
 	struct net_bridge_mdb_htable *mdb;
 	struct net_bridge_mdb_entry *mp;
@@ -1535,7 +1569,7 @@ br_multicast_leave_group(struct net_bridge *br,
 		for (pp = &mp->ports;
 		     (p = mlock_dereference(*pp, br)) != NULL;
 		     pp = &p->next) {
-			if (p->port != port)
+			if (!br_port_group_equal(p, port, src))
 				continue;
 
 			rcu_assign_pointer(*pp, p->next);
@@ -1566,7 +1600,7 @@ br_multicast_leave_group(struct net_bridge *br,
 		for (p = mlock_dereference(mp->ports, br);
 		     p != NULL;
 		     p = mlock_dereference(p->next, br)) {
-			if (p->port != port)
+			if (!br_port_group_equal(p, port, src))
 				continue;
 
 			if (!hlist_unhashed(&p->mglist) &&
@@ -1617,7 +1651,8 @@ br_multicast_leave_group(struct net_bridge *br,
 static void br_ip4_multicast_leave_group(struct net_bridge *br,
 					 struct net_bridge_port *port,
 					 __be32 group,
-					 __u16 vid)
+					 __u16 vid,
+					 const unsigned char *src)
 {
 	struct br_ip br_group;
 	struct bridge_mcast_own_query *own_query;
@@ -1632,14 +1667,15 @@ static void br_ip4_multicast_leave_group(struct net_bridge *br,
 	br_group.vid = vid;
 
 	br_multicast_leave_group(br, port, &br_group, &br->ip4_other_query,
-				 own_query);
+				 own_query, src);
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
 static void br_ip6_multicast_leave_group(struct net_bridge *br,
 					 struct net_bridge_port *port,
 					 const struct in6_addr *group,
-					 __u16 vid)
+					 __u16 vid,
+					 const unsigned char *src)
 {
 	struct br_ip br_group;
 	struct bridge_mcast_own_query *own_query;
@@ -1654,7 +1690,7 @@ static void br_ip6_multicast_leave_group(struct net_bridge *br,
 	br_group.vid = vid;
 
 	br_multicast_leave_group(br, port, &br_group, &br->ip6_other_query,
-				 own_query);
+				 own_query, src);
 }
 #endif
 
@@ -1711,6 +1747,7 @@ static int br_multicast_ipv4_rcv(struct net_bridge *br,
 				 struct sk_buff *skb,
 				 u16 vid)
 {
+	const unsigned char *src;
 	struct sk_buff *skb_trimmed = NULL;
 	struct igmphdr *ih;
 	int err;
@@ -1731,13 +1768,14 @@ static int br_multicast_ipv4_rcv(struct net_bridge *br,
 	}
 
 	ih = igmp_hdr(skb);
+	src = eth_hdr(skb)->h_source;
 	BR_INPUT_SKB_CB(skb)->igmp = ih->type;
 
 	switch (ih->type) {
 	case IGMP_HOST_MEMBERSHIP_REPORT:
 	case IGMPV2_HOST_MEMBERSHIP_REPORT:
 		BR_INPUT_SKB_CB(skb)->mrouters_only = 1;
-		err = br_ip4_multicast_add_group(br, port, ih->group, vid);
+		err = br_ip4_multicast_add_group(br, port, ih->group, vid, src);
 		break;
 	case IGMPV3_HOST_MEMBERSHIP_REPORT:
 		err = br_ip4_multicast_igmp3_report(br, port, skb_trimmed, vid);
@@ -1746,7 +1784,7 @@ static int br_multicast_ipv4_rcv(struct net_bridge *br,
 		err = br_ip4_multicast_query(br, port, skb_trimmed, vid);
 		break;
 	case IGMP_HOST_LEAVE_MESSAGE:
-		br_ip4_multicast_leave_group(br, port, ih->group, vid);
+		br_ip4_multicast_leave_group(br, port, ih->group, vid, src);
 		break;
 	}
 
@@ -1765,6 +1803,7 @@ static int br_multicast_ipv6_rcv(struct net_bridge *br,
 				 struct sk_buff *skb,
 				 u16 vid)
 {
+	const unsigned char *src;
 	struct sk_buff *skb_trimmed = NULL;
 	struct mld_msg *mld;
 	int err;
@@ -1785,8 +1824,10 @@ static int br_multicast_ipv6_rcv(struct net_bridge *br,
 
 	switch (mld->mld_type) {
 	case ICMPV6_MGM_REPORT:
+		src = eth_hdr(skb)->h_source;
 		BR_INPUT_SKB_CB(skb)->mrouters_only = 1;
-		err = br_ip6_multicast_add_group(br, port, &mld->mld_mca, vid);
+		err = br_ip6_multicast_add_group(br, port, &mld->mld_mca, vid,
+						 src);
 		break;
 	case ICMPV6_MLD2_REPORT:
 		err = br_ip6_multicast_mld2_report(br, port, skb_trimmed, vid);
@@ -1795,7 +1836,8 @@ static int br_multicast_ipv6_rcv(struct net_bridge *br,
 		err = br_ip6_multicast_query(br, port, skb_trimmed, vid);
 		break;
 	case ICMPV6_MGM_REDUCTION:
-		br_ip6_multicast_leave_group(br, port, &mld->mld_mca, vid);
+		src = eth_hdr(skb)->h_source;
+		br_ip6_multicast_leave_group(br, port, &mld->mld_mca, vid, src);
 		break;
 	}
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 8ce621e..cc55100 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -177,6 +177,8 @@ struct net_bridge_port_group {
 	struct timer_list		timer;
 	struct br_ip			addr;
 	unsigned char			flags;
+	unsigned char			eth_addr[ETH_ALEN];
+	bool				unicast;
 };
 
 struct net_bridge_mdb_entry
@@ -599,7 +601,7 @@ void br_multicast_free_pg(struct rcu_head *head);
 struct net_bridge_port_group *
 br_multicast_new_port_group(struct net_bridge_port *port, struct br_ip *group,
 			    struct net_bridge_port_group __rcu *next,
-			    unsigned char flags);
+			    unsigned char flags, const unsigned char *src);
 void br_mdb_init(void);
 void br_mdb_uninit(void);
 void br_mdb_notify(struct net_device *dev, struct net_bridge_port *port,
diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
index 8bd5696..1730278 100644
--- a/net/bridge/br_sysfs_if.c
+++ b/net/bridge/br_sysfs_if.c
@@ -188,6 +188,7 @@ static BRPORT_ATTR(multicast_router, S_IRUGO | S_IWUSR, show_multicast_router,
 		   store_multicast_router);
 
 BRPORT_ATTR_FLAG(multicast_fast_leave, BR_MULTICAST_FAST_LEAVE);
+BRPORT_ATTR_FLAG(multicast_to_unicast, BR_MULTICAST_TO_UCAST);
 #endif
 
 static const struct brport_attribute *brport_attrs[] = {
@@ -214,6 +215,7 @@ static const struct brport_attribute *brport_attrs[] = {
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 	&brport_attr_multicast_router,
 	&brport_attr_multicast_fast_leave,
+	&brport_attr_multicast_to_unicast,
 #endif
 	&brport_attr_proxyarp,
 	&brport_attr_proxyarp_wifi,
-- 
2.10.2

^ permalink raw reply related

* [RFC PATCH] virtio_net: XDP support for adjust_head
From: John Fastabend @ 2017-01-02 19:44 UTC (permalink / raw)
  To: jasowang, mst
  Cc: john.r.fastabend, netdev, john.fastabend, alexei.starovoitov,
	daniel

Add support for XDP adjust head by allocating a 256B header region
that XDP programs can grow into. This is only enabled when a XDP
program is loaded.

In order to ensure that we do not have to unwind queue headroom push
queue setup below bpf_prog_add. It reads better to do a prog ref
unwind vs another queue setup call.

: There is a problem with this patch as is. When xdp prog is loaded
  the old buffers without the 256B headers need to be flushed so that
  the bpf prog has the necessary headroom. This patch does this by
  calling the virtqueue_detach_unused_buf() and followed by the
  virtnet_set_queues() call to reinitialize the buffers. However I
  don't believe this is safe per comment in virtio_ring this API
  is not valid on an active queue and the only thing we have done
  here is napi_disable/napi_enable wrappers which doesn't do anything
  to the emulation layer.

  So the RFC is really to find the best solution to this problem.
  A couple things come to mind, (a) always allocate the necessary
  headroom but this is a bit of a waste (b) add some bit somewhere
  to check if the buffer has headroom but this would mean XDP programs
  would be broke for a cycle through the ring, (c) figure out how
  to deactivate a queue, free the buffers and finally reallocate.
  I think (c) is the best choice for now but I'm not seeing the
  API to do this so virtio/qemu experts anyone know off-hand
  how to make this work? I started looking into the PCI callbacks
  reset() and virtio_device_ready() or possibly hitting the right
  set of bits with vp_set_status() but my first attempt just hung
  the device.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |  106 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 80 insertions(+), 26 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5deeda6..fcc5bd7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -159,6 +159,9 @@ struct virtnet_info {
 	/* Ethtool settings */
 	u8 duplex;
 	u32 speed;
+
+	/* Headroom allocated in RX Queue */
+	unsigned int headroom;
 };
 
 struct padded_vnet_hdr {
@@ -355,6 +358,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
 	}
 
 	if (vi->mergeable_rx_bufs) {
+		xdp->data -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
 		/* Zero header and leave csum up to XDP layers */
 		hdr = xdp->data;
 		memset(hdr, 0, vi->hdr_len);
@@ -371,7 +375,7 @@ static void virtnet_xdp_xmit(struct virtnet_info *vi,
 		num_sg = 2;
 		sg_init_table(sq->sg, 2);
 		sg_set_buf(sq->sg, hdr, vi->hdr_len);
-		skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+		skb_to_sgvec(skb, sq->sg + 1, vi->headroom, xdp->data_end - xdp->data);
 	}
 	err = virtqueue_add_outbuf(sq->vq, sq->sg, num_sg,
 				   data, GFP_ATOMIC);
@@ -393,34 +397,39 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
 		       struct bpf_prog *xdp_prog,
 		       void *data, int len)
 {
-	int hdr_padded_len;
 	struct xdp_buff xdp;
-	void *buf;
 	unsigned int qp;
 	u32 act;
 
+
 	if (vi->mergeable_rx_bufs) {
-		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
-		xdp.data = data + hdr_padded_len;
+		int desc_room = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+
+		/* Allow consuming headroom but reserve enough space to push
+		 * the descriptor on if we get an XDP_TX return code.
+		 */
+		xdp.data_hard_start = data - vi->headroom + desc_room;
+		xdp.data = data + desc_room;
 		xdp.data_end = xdp.data + (len - vi->hdr_len);
-		buf = data;
 	} else { /* small buffers */
 		struct sk_buff *skb = data;
 
-		xdp.data = skb->data;
+		xdp.data_hard_start = skb->data;
+		xdp.data = skb->data + vi->headroom;
 		xdp.data_end = xdp.data + len;
-		buf = skb->data;
 	}
 
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
+		if (!vi->mergeable_rx_bufs)
+			__skb_pull((struct sk_buff *) data,
+				   xdp.data - xdp.data_hard_start);
 		return XDP_PASS;
 	case XDP_TX:
 		qp = vi->curr_queue_pairs -
 			vi->xdp_queue_pairs +
 			smp_processor_id();
-		xdp.data = buf;
 		virtnet_xdp_xmit(vi, rq, &vi->sq[qp], &xdp, data);
 		return XDP_TX;
 	default:
@@ -440,7 +449,6 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	struct bpf_prog *xdp_prog;
 
 	len -= vi->hdr_len;
-	skb_trim(skb, len);
 
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
@@ -463,7 +471,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 		}
 	}
 	rcu_read_unlock();
-
+	skb_trim(skb, len);
 	return skb;
 
 err_xdp:
@@ -772,20 +780,21 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
 static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
 			     gfp_t gfp)
 {
+	int headroom = GOOD_PACKET_LEN + vi->headroom;
 	struct sk_buff *skb;
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
 	int err;
 
-	skb = __netdev_alloc_skb_ip_align(vi->dev, GOOD_PACKET_LEN, gfp);
+	skb = __netdev_alloc_skb_ip_align(vi->dev, headroom, gfp);
 	if (unlikely(!skb))
 		return -ENOMEM;
 
-	skb_put(skb, GOOD_PACKET_LEN);
+	skb_put(skb, headroom);
 
 	hdr = skb_vnet_hdr(skb);
 	sg_init_table(rq->sg, 2);
 	sg_set_buf(rq->sg, hdr, vi->hdr_len);
-	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
+	skb_to_sgvec(skb, rq->sg + 1, vi->headroom, skb->len - vi->headroom);
 
 	err = virtqueue_add_inbuf(rq->vq, rq->sg, 2, skb, gfp);
 	if (err < 0)
@@ -853,24 +862,27 @@ static unsigned int get_mergeable_buf_len(struct ewma_pkt_len *avg_pkt_len)
 	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static int add_recvbuf_mergeable(struct virtnet_info *vi,
+				 struct receive_queue *rq, gfp_t gfp)
 {
 	struct page_frag *alloc_frag = &rq->alloc_frag;
+	unsigned int headroom = vi->headroom;
 	char *buf;
 	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
 	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
-	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
+	if (unlikely(!skb_page_frag_refill(len + headroom, alloc_frag, gfp)))
 		return -ENOMEM;
 
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	buf += headroom; /* advance address leaving hole at front of pkt */
 	ctx = mergeable_buf_to_ctx(buf, len);
 	get_page(alloc_frag->page);
-	alloc_frag->offset += len;
+	alloc_frag->offset += len + headroom;
 	hole = alloc_frag->size - alloc_frag->offset;
-	if (hole < len) {
+	if (hole < len + headroom) {
 		/* To avoid internal fragmentation, if there is very likely not
 		 * enough space for another buffer, add the remaining space to
 		 * the current buffer. This extra space is not included in
@@ -904,7 +916,7 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
 	gfp |= __GFP_COLD;
 	do {
 		if (vi->mergeable_rx_bufs)
-			err = add_recvbuf_mergeable(rq, gfp);
+			err = add_recvbuf_mergeable(vi, rq, gfp);
 		else if (vi->big_packets)
 			err = add_recvbuf_big(vi, rq, gfp);
 		else
@@ -1699,12 +1711,15 @@ static void virtnet_init_settings(struct net_device *dev)
 	.set_settings = virtnet_set_settings,
 };
 
+#define VIRTIO_XDP_HEADROOM 256
+
 static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
 	unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct bpf_prog *old_prog;
 	u16 xdp_qp = 0, curr_qp;
+	unsigned int old_hr;
 	int i, err;
 
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -1736,19 +1751,58 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 		return -ENOMEM;
 	}
 
+	old_hr = vi->headroom;
+	if (prog) {
+		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+		vi->headroom = VIRTIO_XDP_HEADROOM;
+	} else {
+		vi->headroom = 0;
+	}
+
+	/* Changing the headroom in buffers is a disruptive operation because
+	 * existing buffers must be flushed and reallocated. This will happen
+	 * when a xdp program is initially added or xdp is disabled by removing
+	 * the xdp program.
+	 */
+	if (old_hr != vi->headroom) {
+		cancel_delayed_work_sync(&vi->refill);
+		if (netif_running(vi->dev)) {
+			for (i = 0; i < vi->max_queue_pairs; i++)
+				napi_disable(&vi->rq[i].napi);
+		}
+		for (i = 0; i < vi->max_queue_pairs; i++) {
+			struct virtqueue *vq = vi->rq[i].vq;
+			void *buf;
+
+			while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
+				if (vi->mergeable_rx_bufs) {
+					unsigned long ctx = (unsigned long)buf;
+					void *base = mergeable_ctx_to_buf_address(ctx);
+					put_page(virt_to_head_page(base));
+				} else if (vi->big_packets) {
+					give_pages(&vi->rq[i], buf);
+				} else {
+					dev_kfree_skb(buf);
+				}
+			}
+		}
+		if (netif_running(vi->dev)) {
+			for (i = 0; i < vi->max_queue_pairs; i++)
+				virtnet_napi_enable(&vi->rq[i]);
+		}
+	}
+
 	err = virtnet_set_queues(vi, curr_qp + xdp_qp);
 	if (err) {
 		dev_warn(&dev->dev, "XDP Device queue allocation failure.\n");
+		vi->headroom = old_hr;
+		if (prog)
+			bpf_prog_sub(prog, vi->max_queue_pairs - 1);
 		return err;
 	}
 
-	if (prog) {
-		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
-		if (IS_ERR(prog)) {
-			virtnet_set_queues(vi, curr_qp);
-			return PTR_ERR(prog);
-		}
-	}
 
 	vi->xdp_queue_pairs = xdp_qp;
 	netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);

^ permalink raw reply related

* [PATCH] net: fealnx: use new api ethtool_{get|set}_link_ksettings
From: Philippe Reynes @ 2017-01-02 19:47 UTC (permalink / raw)
  To: davem, mugunthanvnm, a, fw, jarod; +Cc: netdev, linux-kernel, Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
---
 drivers/net/ethernet/fealnx.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/fealnx.c b/drivers/net/ethernet/fealnx.c
index 9cb436c..766636a 100644
--- a/drivers/net/ethernet/fealnx.c
+++ b/drivers/net/ethernet/fealnx.c
@@ -1817,25 +1817,27 @@ static void netdev_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *i
 	strlcpy(info->bus_info, pci_name(np->pci_dev), sizeof(info->bus_info));
 }
 
-static int netdev_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int netdev_get_link_ksettings(struct net_device *dev,
+				     struct ethtool_link_ksettings *cmd)
 {
 	struct netdev_private *np = netdev_priv(dev);
 	int rc;
 
 	spin_lock_irq(&np->lock);
-	rc = mii_ethtool_gset(&np->mii, cmd);
+	rc = mii_ethtool_get_link_ksettings(&np->mii, cmd);
 	spin_unlock_irq(&np->lock);
 
 	return rc;
 }
 
-static int netdev_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int netdev_set_link_ksettings(struct net_device *dev,
+				     const struct ethtool_link_ksettings *cmd)
 {
 	struct netdev_private *np = netdev_priv(dev);
 	int rc;
 
 	spin_lock_irq(&np->lock);
-	rc = mii_ethtool_sset(&np->mii, cmd);
+	rc = mii_ethtool_set_link_ksettings(&np->mii, cmd);
 	spin_unlock_irq(&np->lock);
 
 	return rc;
@@ -1865,12 +1867,12 @@ static void netdev_set_msglevel(struct net_device *dev, u32 value)
 
 static const struct ethtool_ops netdev_ethtool_ops = {
 	.get_drvinfo		= netdev_get_drvinfo,
-	.get_settings		= netdev_get_settings,
-	.set_settings		= netdev_set_settings,
 	.nway_reset		= netdev_nway_reset,
 	.get_link		= netdev_get_link,
 	.get_msglevel		= netdev_get_msglevel,
 	.set_msglevel		= netdev_set_msglevel,
+	.get_link_ksettings	= netdev_get_link_ksettings,
+	.set_link_ksettings	= netdev_set_link_ksettings,
 };
 
 static int mii_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
-- 
1.7.4.4

^ permalink raw reply related

* Re: [RFC PATCH] virtio_net: XDP support for adjust_head
From: John Fastabend @ 2017-01-02 19:47 UTC (permalink / raw)
  To: jasowang, mst; +Cc: john.r.fastabend, netdev, alexei.starovoitov, daniel
In-Reply-To: <20170102194413.9089.39078.stgit@john-Precision-Tower-5810>

On 17-01-02 11:44 AM, John Fastabend wrote:
> Add support for XDP adjust head by allocating a 256B header region
> that XDP programs can grow into. This is only enabled when a XDP
> program is loaded.
> 
> In order to ensure that we do not have to unwind queue headroom push
> queue setup below bpf_prog_add. It reads better to do a prog ref
> unwind vs another queue setup call.
> 
> : There is a problem with this patch as is. When xdp prog is loaded
>   the old buffers without the 256B headers need to be flushed so that
>   the bpf prog has the necessary headroom. This patch does this by
>   calling the virtqueue_detach_unused_buf() and followed by the
>   virtnet_set_queues() call to reinitialize the buffers. However I
>   don't believe this is safe per comment in virtio_ring this API
>   is not valid on an active queue and the only thing we have done
>   here is napi_disable/napi_enable wrappers which doesn't do anything
>   to the emulation layer.
> 
>   So the RFC is really to find the best solution to this problem.
>   A couple things come to mind, (a) always allocate the necessary
>   headroom but this is a bit of a waste (b) add some bit somewhere
>   to check if the buffer has headroom but this would mean XDP programs
>   would be broke for a cycle through the ring, (c) figure out how
>   to deactivate a queue, free the buffers and finally reallocate.
>   I think (c) is the best choice for now but I'm not seeing the
>   API to do this so virtio/qemu experts anyone know off-hand
>   how to make this work? I started looking into the PCI callbacks
>   reset() and virtio_device_ready() or possibly hitting the right
>   set of bits with vp_set_status() but my first attempt just hung
>   the device.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---


[...]

> +
> +	/* Changing the headroom in buffers is a disruptive operation because
> +	 * existing buffers must be flushed and reallocated. This will happen
> +	 * when a xdp program is initially added or xdp is disabled by removing
> +	 * the xdp program.
> +	 */
> +	if (old_hr != vi->headroom) {
> +		cancel_delayed_work_sync(&vi->refill);
> +		if (netif_running(vi->dev)) {
> +			for (i = 0; i < vi->max_queue_pairs; i++)
> +				napi_disable(&vi->rq[i].napi);
> +		}
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
> +			struct virtqueue *vq = vi->rq[i].vq;
> +			void *buf;
> +
> +			while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Per API in virtio_ring.c queue must be deactivated to call this.


> +				if (vi->mergeable_rx_bufs) {
> +					unsigned long ctx = (unsigned long)buf;
> +					void *base = mergeable_ctx_to_buf_address(ctx);
> +					put_page(virt_to_head_page(base));
> +				} else if (vi->big_packets) {
> +					give_pages(&vi->rq[i], buf);
> +				} else {
> +					dev_kfree_skb(buf);
> +				}
> +			}
> +		}
> +		if (netif_running(vi->dev)) {
> +			for (i = 0; i < vi->max_queue_pairs; i++)
> +				virtnet_napi_enable(&vi->rq[i]);
> +		}
> +	}
> +

Hi Jason, Michael,

Any hints on how to solve the above kludge to flush the existing ring and
reallocate with correct headroom? The above doesn't look safe to me per commit
message.

Thanks!
John

^ permalink raw reply

* Re: [PATCH 02/12] Common functions and definitions
From: Stephen Hemminger @ 2017-01-02 20:00 UTC (permalink / raw)
  To: David VomLehn
  Cc: netdev, Simon Edelhaus, Dmitrii Tarakanov, Alexander Loktionov
In-Reply-To: <67f8be9c3f6a4fede2eefaa4a3c45f9dbaf9b411.1482844668.git.vomlehn@texas.net>


> +#define AQ_OBJ_SET(_OBJ_, _F_) \
> +{ unsigned long flags_old, flags_new; atomic_t *flags = &(_OBJ_)->flags; \
> +do { \
> +	flags_old = atomic_read(flags); \
> +	flags_new = flags_old | (_F_); \
> +} while (atomic_cmpxchg(flags, \
> +	flags_old, flags_new) != flags_old); }
> +
> +#define AQ_OBJ_CLR(_OBJ_, _F_) \
> +{ unsigned long flags_old, flags_new; atomic_t *flags = &(_OBJ_)->flags; \
> +do { \
> +	flags_old = atomic_read(flags); \
> +	flags_new = flags_old & ~(_F_); \
> +} while (atomic_cmpxchg(flags, \
> +	flags_old, flags_new) != flags_old); }
> +

These are way to complex to be macros. Can't the same logic be done
as inline functions.

^ permalink raw reply

* [PATCH] tcp: provide tx timestamps for partial writes
From: Soheil Hassas Yeganeh @ 2017-01-02 20:20 UTC (permalink / raw)
  To: davem, netdev
  Cc: Soheil Hassas Yeganeh, Willem de Bruijn, Yuchung Cheng,
	Eric Dumazet, Neal Cardwell, Martin KaFai Lau

From: Soheil Hassas Yeganeh <soheil@google.com>

For TCP sockets, tx timestamps are only captured when the user data
is successfully and fully written to the socket. In many cases,
however, TCP writes can be partial for which no timestamp is
collected.

Collect timestamps when the user data is partially copied into
the socket.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/tcp.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2e3807d..c207b16 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -992,8 +992,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 	return copied;
 
 do_error:
-	if (copied)
+	if (copied) {
+		tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
 		goto out;
+	}
 out_err:
 	/* make sure we wake any epoll edge trigger waiter */
 	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
@@ -1329,8 +1331,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	}
 
 do_error:
-	if (copied + copied_syn)
+	if (copied + copied_syn) {
+		tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
 		goto out;
+	}
 out_err:
 	err = sk_stream_error(sk, flags, err);
 	/* make sure we wake any epoll edge trigger waiter */
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: [PATCH] tcp: provide tx timestamps for partial writes
From: Soheil Hassas Yeganeh @ 2017-01-02 20:23 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: Willem de Bruijn, Yuchung Cheng, Eric Dumazet, Neal Cardwell,
	Martin KaFai Lau
In-Reply-To: <1483388457-4041-1-git-send-email-soheil.kdev@gmail.com>

On Mon, Jan 2, 2017 at 3:20 PM, Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> For TCP sockets, tx timestamps are only captured when the user data
> is successfully and fully written to the socket. In many cases,
> however, TCP writes can be partial for which no timestamp is
> collected.
>
> Collect timestamps when the user data is partially copied into
> the socket.
>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Martin KaFai Lau <kafai@fb.com>
> ---
>  net/ipv4/tcp.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 2e3807d..c207b16 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -992,8 +992,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
>         return copied;
>
>  do_error:
> -       if (copied)
> +       if (copied) {
> +               tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
>                 goto out;
> +       }
>  out_err:
>         /* make sure we wake any epoll edge trigger waiter */
>         if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
> @@ -1329,8 +1331,10 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>         }
>
>  do_error:
> -       if (copied + copied_syn)
> +       if (copied + copied_syn) {
> +               tcp_tx_timestamp(sk, sk->sk_tsflags, tcp_write_queue_tail(sk));
>                 goto out;
> +       }
>  out_err:
>         err = sk_stream_error(sk, flags, err);
>         /* make sure we wake any epoll edge trigger waiter */
> --
> 2.8.0.rc3.226.g39d4020
>

I'm sorry for the incomplete annotation. This is for [net-next].

Thanks,
Soheil

^ permalink raw reply

* Re: pull-request: wireless-drivers-next 2017-01-02
From: David Miller @ 2017-01-02 20:46 UTC (permalink / raw)
  To: kvalo; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <8760lxk37o.fsf@kamboji.qca.qualcomm.com>

From: Kalle Valo <kvalo@codeaurora.org>
Date: Mon, 02 Jan 2017 15:20:59 +0200

> first pull request for 4.11. The tree is based on 4.9 but that shouldn't
> be a problem, at least my test pull to net-next worked ok. I'll fast
> forward my trees after you have pulled this.
> 
> Please let me know if you have any problems.

Happy new year, pulled, thanks!

^ permalink raw reply

* Re: [PATCH net] Documentation/networking: fix typo in mpls-sysctl
From: David Miller @ 2017-01-02 20:49 UTC (permalink / raw)
  To: alexander; +Cc: netdev, corbet
In-Reply-To: <20170102175224.8010-1-alexander@alemayhu.com>

From: Alexander Alemayhu <alexander@alemayhu.com>
Date: Mon,  2 Jan 2017 18:52:24 +0100

> s/utliziation/utilization
> 
> Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH for-next V2 00/11] Mellanox mlx5 core and ODP updates 2017-01-01
From: David Miller @ 2017-01-02 20:53 UTC (permalink / raw)
  To: saeedm; +Cc: dledford, netdev, linux-rdma, leonro, ilyal, artemyko
In-Reply-To: <1483349868-11716-1-git-send-email-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Mon,  2 Jan 2017 11:37:37 +0200

> The following eleven patches mainly come from Artemy Kovalyov
> who expanded mlx5 on-demand-paging (ODP) support. In addition
> there are three cleanup patches which don't change any functionality,
> but are needed to align codebase prior accepting other patches.

Series applied to net-next, thanks.

^ permalink raw reply

* [net PATCH] ipv4: Do not allow MAIN to be alias for new LOCAL w/ custom rules
From: Alexander Duyck @ 2017-01-02 21:32 UTC (permalink / raw)
  To: netdev, jjk, davem, dsa

From: Alexander Duyck <alexander.h.duyck@intel.com>

In the case of custom rules being present we need to handle the case of the
LOCAL table being intialized after the new rule has been added.  To address
that I am adding a new check so that we can make certain we don't use an
alias of MAIN for LOCAL when allocating a new table.

Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Oliver Brunel <jjk@jjacky.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/ipv4/fib_frontend.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 3ff8938893ec..eae0332b0e8c 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -85,7 +85,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
 	if (tb)
 		return tb;
 
-	if (id == RT_TABLE_LOCAL)
+	if (id == RT_TABLE_LOCAL && !net->ipv4.fib_has_custom_rules)
 		alias = fib_new_table(net, RT_TABLE_MAIN);
 
 	tb = fib_trie_table(id, alias);

^ permalink raw reply related

* [PATCH nf-next 0/7] xtables: use dedicated copy_to_user helpers
From: Willem de Bruijn @ 2017-01-02 22:19 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, davem, fw, dborkman, pablo, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

xtables list and save interfaces share xt_match and xt_target state
with userspace. The kernel and userspace definitions of these structs
differ. Currently, the structs are copied wholesale, then patched up.
The match and target structs contain a kernel pointer. Type-specific
data may contain additional kernel-only state.

Introduce xt_match_to_user and xt_target_to_user helper functions to
copy only fields intended to be shared with userspace.

Introduce xt_data_to_user to do the same for type-specific state. Add
a field .usersize to xt_match and xt_target to define the range of
bytes in .matchsize that should be shared with userspace. All matches
and targets that define kernel-only data store this at the tail of
their struct.

Tested:

  Ran iptables-test.py from iptables.git, with both a 64-bit and
  32-bit compat binary. 603/603 tests passed both before and after
  the patches (out of 705, but some CONFIGs were not enabled).

  Also ran the following example queries manually, again using 64-bit
  and 32-bit compat paths:

  iptables -A INPUT  -m string --algo bm --string 'xxx' -j LOG
  iptables -L
  iptables-save

  ip6tables -A INPUT  -m string --algo bm --string 'xxx' -j LOG
  ip6tables -L
  ip6tables-save

  ebtables -A INPUT --limit 3 -j ACCEPT
  ebtables -L

  arptables -A INPUT --source-mac 00:11:22:33:44:55 -j ACCEPT
  arptables -L

  An instrumented binary that initializes its buffer with 0x66 bytes
  shows the result of the patchset.

  iptables LOG target in hex before and after. The xt_target struct
  only has its size, name and revision specified. Trailing bytes in
  the name field are not zeroed:

    40 00 4c 4f 47 00 00 00
    40 e1 0a a0 ff ff ff ff
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    04 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00

    40 00 4c 4f 47 00 66 66
    66 66 66 66 66 66 66 66
    66 66 66 66 66 66 66 66
    66 66 66 66 66 66 66 00
    04 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00

  ebtables limit match in hex before and after. Only the avg and burst
  fields of ebt_limit_info are shared.

    6c 69 6d 69 74 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    20 00 00 00 00 00 00 00
    05 0d 00 00 05 00 00 00
    66 de fc ff 00 00 00 00
    50 d0 00 00 50 d0 00 00
    a9 29 00 00 00 00 00 00

    6c 69 6d 69 74 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    20 00 00 00 00 00 00 00
    05 0d 00 00 05 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00


Willem de Bruijn (7):
  xtables: add xt_match, xt_target and data copy_to_user functions
  iptables: use match, target and data copy_to_user helpers
  ip6tables: use match, target and data copy_to_user helpers
  arptables: use match, target and data copy_to_user helpers
  ebtables: use match, target and data copy_to_user helpers
  xtables: use match, target and data copy_to_user helpers in compat
  xtables: extend matches and targets with .usersize

 include/linux/netfilter/x_tables.h |  9 +++++
 net/bridge/netfilter/ebt_limit.c   |  1 +
 net/bridge/netfilter/ebtables.c    | 78 +++++++++++++++++++++++---------------
 net/ipv4/netfilter/arp_tables.c    | 15 +++-----
 net/ipv4/netfilter/ip_tables.c     | 21 +++-------
 net/ipv4/netfilter/ipt_CLUSTERIP.c |  1 +
 net/ipv6/netfilter/ip6_tables.c    | 21 +++-------
 net/ipv6/netfilter/ip6t_NPT.c      |  2 +
 net/netfilter/x_tables.c           | 68 ++++++++++++++++++++++++++++-----
 net/netfilter/xt_CT.c              |  3 ++
 net/netfilter/xt_RATEEST.c         |  1 +
 net/netfilter/xt_TEE.c             |  2 +
 net/netfilter/xt_bpf.c             |  2 +
 net/netfilter/xt_cgroup.c          |  1 +
 net/netfilter/xt_connlimit.c       |  1 +
 net/netfilter/xt_hashlimit.c       |  4 ++
 net/netfilter/xt_limit.c           |  2 +
 net/netfilter/xt_quota.c           |  1 +
 net/netfilter/xt_rateest.c         |  1 +
 net/netfilter/xt_string.c          |  1 +
 20 files changed, 154 insertions(+), 81 deletions(-)

-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply

* [PATCH nf-next 1/7] xtables: add xt_match, xt_target and data copy_to_user functions
From: Willem de Bruijn @ 2017-01-02 22:19 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, davem, fw, dborkman, pablo, Willem de Bruijn
In-Reply-To: <1483395586-105774-1-git-send-email-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

xt_entry_target, xt_entry_match and their private data may contain
kernel data.

Introduce helper functions xt_match_to_user, xt_target_to_user and
xt_data_to_user that copy only the expected fields. These replace
existing logic that calls copy_to_user on entire structs, then
overwrites select fields.

Private data is defined in xt_match and xt_target. All matches and
targets that maintain kernel data store this at the tail of their
private structure. Extend xt_match and xt_target with .usersize to
limit how many bytes of data are copied. The remainder is cleared.

If compatsize is specified, usersize can only safely be used if all
fields up to usersize use platform-independent types. Otherwise, the
compat_to_user callback must be defined.

This patch does not yet enable the support logic.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/netfilter/x_tables.h |  9 +++++++
 net/netfilter/x_tables.c           | 54 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+)

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5117e4d..be378cf 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -167,6 +167,7 @@ struct xt_match {
 
 	const char *table;
 	unsigned int matchsize;
+	unsigned int usersize;
 #ifdef CONFIG_COMPAT
 	unsigned int compatsize;
 #endif
@@ -207,6 +208,7 @@ struct xt_target {
 
 	const char *table;
 	unsigned int targetsize;
+	unsigned int usersize;
 #ifdef CONFIG_COMPAT
 	unsigned int compatsize;
 #endif
@@ -287,6 +289,13 @@ int xt_check_match(struct xt_mtchk_param *, unsigned int size, u_int8_t proto,
 int xt_check_target(struct xt_tgchk_param *, unsigned int size, u_int8_t proto,
 		    bool inv_proto);
 
+int xt_match_to_user(const struct xt_entry_match *m,
+		     struct xt_entry_match __user *u);
+int xt_target_to_user(const struct xt_entry_target *t,
+		      struct xt_entry_target __user *u);
+int xt_data_to_user(void __user *dst, const void *src,
+		    int usersize, int size);
+
 void *xt_copy_counters_from_user(const void __user *user, unsigned int len,
 				 struct xt_counters_info *info, bool compat);
 
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 2ff4996..feccf52 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -262,6 +262,60 @@ struct xt_target *xt_request_find_target(u8 af, const char *name, u8 revision)
 }
 EXPORT_SYMBOL_GPL(xt_request_find_target);
 
+
+static int xt_obj_to_user(u16 __user *psize, u16 size,
+			  void __user *pname, const char *name,
+			  u8 __user *prev, u8 rev)
+{
+	if (put_user(size, psize))
+		return -EFAULT;
+	if (copy_to_user(pname, name, strlen(name) + 1))
+		return -EFAULT;
+	if (put_user(rev, prev))
+		return -EFAULT;
+
+	return 0;
+}
+
+#define XT_OBJ_TO_USER(U, K, TYPE, C_SIZE)				\
+	xt_obj_to_user(&U->u.TYPE##_size, C_SIZE ? : K->u.TYPE##_size,	\
+		       U->u.user.name, K->u.kernel.TYPE->name,		\
+		       &U->u.user.revision, K->u.kernel.TYPE->revision)
+
+int xt_data_to_user(void __user *dst, const void *src,
+		    int usersize, int size)
+{
+	usersize = usersize ? : size;
+	if (copy_to_user(dst, src, usersize))
+		return -EFAULT;
+	if (usersize != size && clear_user(dst + usersize, size - usersize))
+		return -EFAULT;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xt_data_to_user);
+
+#define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)				\
+	xt_data_to_user(U->data, K->data,				\
+			K->u.kernel.TYPE->usersize,			\
+			C_SIZE ? : K->u.kernel.TYPE->TYPE##size)
+
+int xt_match_to_user(const struct xt_entry_match *m,
+		     struct xt_entry_match __user *u)
+{
+	return XT_OBJ_TO_USER(u, m, match, 0) ||
+	       XT_DATA_TO_USER(u, m, match, 0);
+}
+EXPORT_SYMBOL_GPL(xt_match_to_user);
+
+int xt_target_to_user(const struct xt_entry_target *t,
+		      struct xt_entry_target __user *u)
+{
+	return XT_OBJ_TO_USER(u, t, target, 0) ||
+	       XT_DATA_TO_USER(u, t, target, 0);
+}
+EXPORT_SYMBOL_GPL(xt_target_to_user);
+
 static int match_revfn(u8 af, const char *name, u8 revision, int *bestp)
 {
 	const struct xt_match *m;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH nf-next 2/7] iptables: use match, target and data copy_to_user helpers
From: Willem de Bruijn @ 2017-01-02 22:19 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, davem, fw, dborkman, pablo, Willem de Bruijn
In-Reply-To: <1483395586-105774-1-git-send-email-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Convert iptables to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/netfilter/ip_tables.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 91656a1..384b857 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -826,10 +826,6 @@ copy_entries_to_user(unsigned int total_size,
 		return PTR_ERR(counters);
 
 	loc_cpu_entry = private->entries;
-	if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
-		ret = -EFAULT;
-		goto free_counters;
-	}
 
 	/* FIXME: use iterator macros --RR */
 	/* ... then go back and fix counters and names */
@@ -839,6 +835,10 @@ copy_entries_to_user(unsigned int total_size,
 		const struct xt_entry_target *t;
 
 		e = (struct ipt_entry *)(loc_cpu_entry + off);
+		if (copy_to_user(userptr + off, e, sizeof(*e))) {
+			ret = -EFAULT;
+			goto free_counters;
+		}
 		if (copy_to_user(userptr + off
 				 + offsetof(struct ipt_entry, counters),
 				 &counters[num],
@@ -852,23 +852,14 @@ copy_entries_to_user(unsigned int total_size,
 		     i += m->u.match_size) {
 			m = (void *)e + i;
 
-			if (copy_to_user(userptr + off + i
-					 + offsetof(struct xt_entry_match,
-						    u.user.name),
-					 m->u.kernel.match->name,
-					 strlen(m->u.kernel.match->name)+1)
-			    != 0) {
+			if (xt_match_to_user(m, userptr + off + i)) {
 				ret = -EFAULT;
 				goto free_counters;
 			}
 		}
 
 		t = ipt_get_target_c(e);
-		if (copy_to_user(userptr + off + e->target_offset
-				 + offsetof(struct xt_entry_target,
-					    u.user.name),
-				 t->u.kernel.target->name,
-				 strlen(t->u.kernel.target->name)+1) != 0) {
+		if (xt_target_to_user(t, userptr + off + e->target_offset)) {
 			ret = -EFAULT;
 			goto free_counters;
 		}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH nf-next 3/7] ip6tables: use match, target and data copy_to_user helpers
From: Willem de Bruijn @ 2017-01-02 22:19 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, davem, fw, dborkman, pablo, Willem de Bruijn
In-Reply-To: <1483395586-105774-1-git-send-email-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Convert ip6tables to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv6/netfilter/ip6_tables.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 25a022d..1e15c54 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -855,10 +855,6 @@ copy_entries_to_user(unsigned int total_size,
 		return PTR_ERR(counters);
 
 	loc_cpu_entry = private->entries;
-	if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
-		ret = -EFAULT;
-		goto free_counters;
-	}
 
 	/* FIXME: use iterator macros --RR */
 	/* ... then go back and fix counters and names */
@@ -868,6 +864,10 @@ copy_entries_to_user(unsigned int total_size,
 		const struct xt_entry_target *t;
 
 		e = (struct ip6t_entry *)(loc_cpu_entry + off);
+		if (copy_to_user(userptr + off, e, sizeof(*e))) {
+			ret = -EFAULT;
+			goto free_counters;
+		}
 		if (copy_to_user(userptr + off
 				 + offsetof(struct ip6t_entry, counters),
 				 &counters[num],
@@ -881,23 +881,14 @@ copy_entries_to_user(unsigned int total_size,
 		     i += m->u.match_size) {
 			m = (void *)e + i;
 
-			if (copy_to_user(userptr + off + i
-					 + offsetof(struct xt_entry_match,
-						    u.user.name),
-					 m->u.kernel.match->name,
-					 strlen(m->u.kernel.match->name)+1)
-			    != 0) {
+			if (xt_match_to_user(m, userptr + off + i)) {
 				ret = -EFAULT;
 				goto free_counters;
 			}
 		}
 
 		t = ip6t_get_target_c(e);
-		if (copy_to_user(userptr + off + e->target_offset
-				 + offsetof(struct xt_entry_target,
-					    u.user.name),
-				 t->u.kernel.target->name,
-				 strlen(t->u.kernel.target->name)+1) != 0) {
+		if (xt_target_to_user(t, userptr + off + e->target_offset)) {
 			ret = -EFAULT;
 			goto free_counters;
 		}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH nf-next 4/7] arptables: use match, target and data copy_to_user helpers
From: Willem de Bruijn @ 2017-01-02 22:19 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, davem, fw, dborkman, pablo, Willem de Bruijn
In-Reply-To: <1483395586-105774-1-git-send-email-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Convert arptables to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/netfilter/arp_tables.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index a467e12..6241a81 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -677,11 +677,6 @@ static int copy_entries_to_user(unsigned int total_size,
 		return PTR_ERR(counters);
 
 	loc_cpu_entry = private->entries;
-	/* ... then copy entire thing ... */
-	if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
-		ret = -EFAULT;
-		goto free_counters;
-	}
 
 	/* FIXME: use iterator macros --RR */
 	/* ... then go back and fix counters and names */
@@ -689,6 +684,10 @@ static int copy_entries_to_user(unsigned int total_size,
 		const struct xt_entry_target *t;
 
 		e = (struct arpt_entry *)(loc_cpu_entry + off);
+		if (copy_to_user(userptr + off, e, sizeof(*e))) {
+			ret = -EFAULT;
+			goto free_counters;
+		}
 		if (copy_to_user(userptr + off
 				 + offsetof(struct arpt_entry, counters),
 				 &counters[num],
@@ -698,11 +697,7 @@ static int copy_entries_to_user(unsigned int total_size,
 		}
 
 		t = arpt_get_target_c(e);
-		if (copy_to_user(userptr + off + e->target_offset
-				 + offsetof(struct xt_entry_target,
-					    u.user.name),
-				 t->u.kernel.target->name,
-				 strlen(t->u.kernel.target->name)+1) != 0) {
+		if (xt_target_to_user(t, userptr + off + e->target_offset)) {
 			ret = -EFAULT;
 			goto free_counters;
 		}
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH nf-next 5/7] ebtables: use match, target and data copy_to_user helpers
From: Willem de Bruijn @ 2017-01-02 22:19 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, davem, fw, dborkman, pablo, Willem de Bruijn
In-Reply-To: <1483395586-105774-1-git-send-email-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Convert ebtables to copying entries, matches and targets one by one.

The solution is analogous to that of generic xt_(match|target)_to_user
helpers, but is applied to different structs.

Convert existing helpers ebt_make_XXXname helpers that overwrite
fields of an already copy_to_user'd struct with ebt_XXX_to_user
helpers that copy all relevant fields of the struct from scratch.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/bridge/netfilter/ebtables.c | 78 +++++++++++++++++++++++++----------------
 1 file changed, 47 insertions(+), 31 deletions(-)

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 537e3d5..79b6991 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1346,56 +1346,72 @@ static int update_counters(struct net *net, const void __user *user,
 				hlp.num_counters, user, len);
 }
 
-static inline int ebt_make_matchname(const struct ebt_entry_match *m,
-				     const char *base, char __user *ubase)
+static inline int ebt_obj_to_user(char __user *um, const char *_name,
+				  const char *data, int entrysize,
+				  int usersize, int datasize)
 {
-	char __user *hlp = ubase + ((char *)m - base);
-	char name[EBT_FUNCTION_MAXNAMELEN] = {};
+	char name[EBT_FUNCTION_MAXNAMELEN] = {0};
 
 	/* ebtables expects 32 bytes long names but xt_match names are 29 bytes
 	 * long. Copy 29 bytes and fill remaining bytes with zeroes.
 	 */
-	strlcpy(name, m->u.match->name, sizeof(name));
-	if (copy_to_user(hlp, name, EBT_FUNCTION_MAXNAMELEN))
+	strlcpy(name, _name, sizeof(name));
+	if (copy_to_user(um, name, EBT_FUNCTION_MAXNAMELEN) ||
+	    put_user(datasize, (int __user *)(um + EBT_FUNCTION_MAXNAMELEN)) ||
+	    xt_data_to_user(um + entrysize, data, usersize, datasize))
 		return -EFAULT;
+
 	return 0;
 }
 
-static inline int ebt_make_watchername(const struct ebt_entry_watcher *w,
-				       const char *base, char __user *ubase)
+static inline int ebt_match_to_user(const struct ebt_entry_match *m,
+				    const char *base, char __user *ubase)
 {
-	char __user *hlp = ubase + ((char *)w - base);
-	char name[EBT_FUNCTION_MAXNAMELEN] = {};
+	return ebt_obj_to_user(ubase + ((char *)m - base),
+			       m->u.match->name, m->data, sizeof(*m),
+			       m->u.match->usersize, m->match_size);
+}
 
-	strlcpy(name, w->u.watcher->name, sizeof(name));
-	if (copy_to_user(hlp, name, EBT_FUNCTION_MAXNAMELEN))
-		return -EFAULT;
-	return 0;
+static inline int ebt_watcher_to_user(const struct ebt_entry_watcher *w,
+				      const char *base, char __user *ubase)
+{
+	return ebt_obj_to_user(ubase + ((char *)w - base),
+			       w->u.watcher->name, w->data, sizeof(*w),
+			       w->u.watcher->usersize, w->watcher_size);
 }
 
-static inline int ebt_make_names(struct ebt_entry *e, const char *base,
-				 char __user *ubase)
+static inline int ebt_entry_to_user(struct ebt_entry *e, const char *base,
+				    char __user *ubase)
 {
 	int ret;
 	char __user *hlp;
 	const struct ebt_entry_target *t;
-	char name[EBT_FUNCTION_MAXNAMELEN] = {};
 
-	if (e->bitmask == 0)
+	if (e->bitmask == 0) {
+		/* special case !EBT_ENTRY_OR_ENTRIES */
+		if (copy_to_user(ubase + ((char *)e - base), e,
+				 sizeof(struct ebt_entries)))
+			return -EFAULT;
 		return 0;
+	}
+
+	if (copy_to_user(ubase + ((char *)e - base), e, sizeof(*e)))
+		return -EFAULT;
 
 	hlp = ubase + (((char *)e + e->target_offset) - base);
 	t = (struct ebt_entry_target *)(((char *)e) + e->target_offset);
 
-	ret = EBT_MATCH_ITERATE(e, ebt_make_matchname, base, ubase);
+	ret = EBT_MATCH_ITERATE(e, ebt_match_to_user, base, ubase);
 	if (ret != 0)
 		return ret;
-	ret = EBT_WATCHER_ITERATE(e, ebt_make_watchername, base, ubase);
+	ret = EBT_WATCHER_ITERATE(e, ebt_watcher_to_user, base, ubase);
 	if (ret != 0)
 		return ret;
-	strlcpy(name, t->u.target->name, sizeof(name));
-	if (copy_to_user(hlp, name, EBT_FUNCTION_MAXNAMELEN))
-		return -EFAULT;
+	ret = ebt_obj_to_user(hlp, t->u.target->name, t->data, sizeof(*t),
+			      t->u.target->usersize, t->target_size);
+	if (ret != 0)
+		return ret;
+
 	return 0;
 }
 
@@ -1475,13 +1491,9 @@ static int copy_everything_to_user(struct ebt_table *t, void __user *user,
 	if (ret)
 		return ret;
 
-	if (copy_to_user(tmp.entries, entries, entries_size)) {
-		BUGPRINT("Couldn't copy entries to userspace\n");
-		return -EFAULT;
-	}
 	/* set the match/watcher/target names right */
 	return EBT_ENTRY_ITERATE(entries, entries_size,
-	   ebt_make_names, entries, tmp.entries);
+	   ebt_entry_to_user, entries, tmp.entries);
 }
 
 static int do_ebt_set_ctl(struct sock *sk,
@@ -1630,8 +1642,10 @@ static int compat_match_to_user(struct ebt_entry_match *m, void __user **dstptr,
 	if (match->compat_to_user) {
 		if (match->compat_to_user(cm->data, m->data))
 			return -EFAULT;
-	} else if (copy_to_user(cm->data, m->data, msize))
+	} else {
+		if (xt_data_to_user(cm->data, m->data, match->usersize, msize))
 			return -EFAULT;
+	}
 
 	*size -= ebt_compat_entry_padsize() + off;
 	*dstptr = cm->data;
@@ -1657,8 +1671,10 @@ static int compat_target_to_user(struct ebt_entry_target *t,
 	if (target->compat_to_user) {
 		if (target->compat_to_user(cm->data, t->data))
 			return -EFAULT;
-	} else if (copy_to_user(cm->data, t->data, tsize))
-		return -EFAULT;
+	} else {
+		if (xt_data_to_user(cm->data, t->data, target->usersize, tsize))
+			return -EFAULT;
+	}
 
 	*size -= ebt_compat_entry_padsize() + off;
 	*dstptr = cm->data;
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH nf-next 6/7] xtables: use match, target and data copy_to_user helpers in compat
From: Willem de Bruijn @ 2017-01-02 22:19 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, davem, fw, dborkman, pablo, Willem de Bruijn
In-Reply-To: <1483395586-105774-1-git-send-email-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Convert compat to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/netfilter/x_tables.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index feccf52..016db6b 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -619,17 +619,14 @@ int xt_compat_match_to_user(const struct xt_entry_match *m,
 	int off = xt_compat_match_offset(match);
 	u_int16_t msize = m->u.user.match_size - off;
 
-	if (copy_to_user(cm, m, sizeof(*cm)) ||
-	    put_user(msize, &cm->u.user.match_size) ||
-	    copy_to_user(cm->u.user.name, m->u.kernel.match->name,
-			 strlen(m->u.kernel.match->name) + 1))
+	if (XT_OBJ_TO_USER(cm, m, match, msize))
 		return -EFAULT;
 
 	if (match->compat_to_user) {
 		if (match->compat_to_user((void __user *)cm->data, m->data))
 			return -EFAULT;
 	} else {
-		if (copy_to_user(cm->data, m->data, msize - sizeof(*cm)))
+		if (XT_DATA_TO_USER(cm, m, match, msize - sizeof(*cm)))
 			return -EFAULT;
 	}
 
@@ -977,17 +974,14 @@ int xt_compat_target_to_user(const struct xt_entry_target *t,
 	int off = xt_compat_target_offset(target);
 	u_int16_t tsize = t->u.user.target_size - off;
 
-	if (copy_to_user(ct, t, sizeof(*ct)) ||
-	    put_user(tsize, &ct->u.user.target_size) ||
-	    copy_to_user(ct->u.user.name, t->u.kernel.target->name,
-			 strlen(t->u.kernel.target->name) + 1))
+	if (XT_OBJ_TO_USER(ct, t, target, tsize))
 		return -EFAULT;
 
 	if (target->compat_to_user) {
 		if (target->compat_to_user((void __user *)ct->data, t->data))
 			return -EFAULT;
 	} else {
-		if (copy_to_user(ct->data, t->data, tsize - sizeof(*ct)))
+		if (XT_DATA_TO_USER(ct, t, target, tsize - sizeof(*ct)))
 			return -EFAULT;
 	}
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH nf-next 7/7] xtables: extend matches and targets with .usersize
From: Willem de Bruijn @ 2017-01-02 22:19 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, davem, fw, dborkman, pablo, Willem de Bruijn
In-Reply-To: <1483395586-105774-1-git-send-email-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

In matches and targets that define a kernel-only tail to their
xt_match and xt_target data structs, add a field .usersize that
specifies up to where data is to be shared with userspace.

Performed a search for comment "Used internally by the kernel" to find
relevant matches and targets. Manually inspected the structs to derive
a valid offsetof.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/bridge/netfilter/ebt_limit.c   | 1 +
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 1 +
 net/ipv6/netfilter/ip6t_NPT.c      | 2 ++
 net/netfilter/xt_CT.c              | 3 +++
 net/netfilter/xt_RATEEST.c         | 1 +
 net/netfilter/xt_TEE.c             | 2 ++
 net/netfilter/xt_bpf.c             | 2 ++
 net/netfilter/xt_cgroup.c          | 1 +
 net/netfilter/xt_connlimit.c       | 1 +
 net/netfilter/xt_hashlimit.c       | 4 ++++
 net/netfilter/xt_limit.c           | 2 ++
 net/netfilter/xt_quota.c           | 1 +
 net/netfilter/xt_rateest.c         | 1 +
 net/netfilter/xt_string.c          | 1 +
 14 files changed, 23 insertions(+)

diff --git a/net/bridge/netfilter/ebt_limit.c b/net/bridge/netfilter/ebt_limit.c
index 517e78b..61a9f1b 100644
--- a/net/bridge/netfilter/ebt_limit.c
+++ b/net/bridge/netfilter/ebt_limit.c
@@ -105,6 +105,7 @@ static struct xt_match ebt_limit_mt_reg __read_mostly = {
 	.match		= ebt_limit_mt,
 	.checkentry	= ebt_limit_mt_check,
 	.matchsize	= sizeof(struct ebt_limit_info),
+	.usersize	= offsetof(struct ebt_limit_info, prev),
 #ifdef CONFIG_COMPAT
 	.compatsize	= sizeof(struct ebt_compat_limit_info),
 #endif
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 21db00d..8a3d20e 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -468,6 +468,7 @@ static struct xt_target clusterip_tg_reg __read_mostly = {
 	.checkentry	= clusterip_tg_check,
 	.destroy	= clusterip_tg_destroy,
 	.targetsize	= sizeof(struct ipt_clusterip_tgt_info),
+	.usersize	= offsetof(struct ipt_clusterip_tgt_info, config),
 #ifdef CONFIG_COMPAT
 	.compatsize	= sizeof(struct compat_ipt_clusterip_tgt_info),
 #endif /* CONFIG_COMPAT */
diff --git a/net/ipv6/netfilter/ip6t_NPT.c b/net/ipv6/netfilter/ip6t_NPT.c
index 590f767..a379d2f 100644
--- a/net/ipv6/netfilter/ip6t_NPT.c
+++ b/net/ipv6/netfilter/ip6t_NPT.c
@@ -112,6 +112,7 @@ static struct xt_target ip6t_npt_target_reg[] __read_mostly = {
 		.table		= "mangle",
 		.target		= ip6t_snpt_tg,
 		.targetsize	= sizeof(struct ip6t_npt_tginfo),
+		.usersize	= offsetof(struct ip6t_npt_tginfo, adjustment),
 		.checkentry	= ip6t_npt_checkentry,
 		.family		= NFPROTO_IPV6,
 		.hooks		= (1 << NF_INET_LOCAL_IN) |
@@ -123,6 +124,7 @@ static struct xt_target ip6t_npt_target_reg[] __read_mostly = {
 		.table		= "mangle",
 		.target		= ip6t_dnpt_tg,
 		.targetsize	= sizeof(struct ip6t_npt_tginfo),
+		.usersize	= offsetof(struct ip6t_npt_tginfo, adjustment),
 		.checkentry	= ip6t_npt_checkentry,
 		.family		= NFPROTO_IPV6,
 		.hooks		= (1 << NF_INET_PRE_ROUTING) |
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 95c7503..26b0bccfa 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -373,6 +373,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = {
 		.name		= "CT",
 		.family		= NFPROTO_UNSPEC,
 		.targetsize	= sizeof(struct xt_ct_target_info),
+		.usersize	= offsetof(struct xt_ct_target_info, ct),
 		.checkentry	= xt_ct_tg_check_v0,
 		.destroy	= xt_ct_tg_destroy_v0,
 		.target		= xt_ct_target_v0,
@@ -384,6 +385,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = {
 		.family		= NFPROTO_UNSPEC,
 		.revision	= 1,
 		.targetsize	= sizeof(struct xt_ct_target_info_v1),
+		.usersize	= offsetof(struct xt_ct_target_info, ct),
 		.checkentry	= xt_ct_tg_check_v1,
 		.destroy	= xt_ct_tg_destroy_v1,
 		.target		= xt_ct_target_v1,
@@ -395,6 +397,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = {
 		.family		= NFPROTO_UNSPEC,
 		.revision	= 2,
 		.targetsize	= sizeof(struct xt_ct_target_info_v1),
+		.usersize	= offsetof(struct xt_ct_target_info, ct),
 		.checkentry	= xt_ct_tg_check_v2,
 		.destroy	= xt_ct_tg_destroy_v1,
 		.target		= xt_ct_target_v1,
diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
index 91a373a..498b54f 100644
--- a/net/netfilter/xt_RATEEST.c
+++ b/net/netfilter/xt_RATEEST.c
@@ -162,6 +162,7 @@ static struct xt_target xt_rateest_tg_reg __read_mostly = {
 	.checkentry = xt_rateest_tg_checkentry,
 	.destroy    = xt_rateest_tg_destroy,
 	.targetsize = sizeof(struct xt_rateest_target_info),
+	.usersize   = offsetof(struct xt_rateest_target_info, est),
 	.me         = THIS_MODULE,
 };
 
diff --git a/net/netfilter/xt_TEE.c b/net/netfilter/xt_TEE.c
index 1c57ace..86b0580 100644
--- a/net/netfilter/xt_TEE.c
+++ b/net/netfilter/xt_TEE.c
@@ -133,6 +133,7 @@ static struct xt_target tee_tg_reg[] __read_mostly = {
 		.family     = NFPROTO_IPV4,
 		.target     = tee_tg4,
 		.targetsize = sizeof(struct xt_tee_tginfo),
+		.usersize   = offsetof(struct xt_tee_tginfo, priv),
 		.checkentry = tee_tg_check,
 		.destroy    = tee_tg_destroy,
 		.me         = THIS_MODULE,
@@ -144,6 +145,7 @@ static struct xt_target tee_tg_reg[] __read_mostly = {
 		.family     = NFPROTO_IPV6,
 		.target     = tee_tg6,
 		.targetsize = sizeof(struct xt_tee_tginfo),
+		.usersize   = offsetof(struct xt_tee_tginfo, priv),
 		.checkentry = tee_tg_check,
 		.destroy    = tee_tg_destroy,
 		.me         = THIS_MODULE,
diff --git a/net/netfilter/xt_bpf.c b/net/netfilter/xt_bpf.c
index 2dedaa2..38986a9 100644
--- a/net/netfilter/xt_bpf.c
+++ b/net/netfilter/xt_bpf.c
@@ -110,6 +110,7 @@ static struct xt_match bpf_mt_reg[] __read_mostly = {
 		.match		= bpf_mt,
 		.destroy	= bpf_mt_destroy,
 		.matchsize	= sizeof(struct xt_bpf_info),
+		.usersize	= offsetof(struct xt_bpf_info, filter),
 		.me		= THIS_MODULE,
 	},
 	{
@@ -120,6 +121,7 @@ static struct xt_match bpf_mt_reg[] __read_mostly = {
 		.match		= bpf_mt_v1,
 		.destroy	= bpf_mt_destroy_v1,
 		.matchsize	= sizeof(struct xt_bpf_info_v1),
+		.usersize	= offsetof(struct xt_bpf_info_v1, filter),
 		.me		= THIS_MODULE,
 	},
 };
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index a086a91..1db1ce5 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -122,6 +122,7 @@ static struct xt_match cgroup_mt_reg[] __read_mostly = {
 		.checkentry	= cgroup_mt_check_v1,
 		.match		= cgroup_mt_v1,
 		.matchsize	= sizeof(struct xt_cgroup_info_v1),
+		.usersize	= offsetof(struct xt_cgroup_info_v1, priv),
 		.destroy	= cgroup_mt_destroy_v1,
 		.me		= THIS_MODULE,
 		.hooks		= (1 << NF_INET_LOCAL_OUT) |
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 2aff2b7..6d16f46 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -431,6 +431,7 @@ static struct xt_match connlimit_mt_reg __read_mostly = {
 	.checkentry = connlimit_mt_check,
 	.match      = connlimit_mt,
 	.matchsize  = sizeof(struct xt_connlimit_info),
+	.usersize   = offsetof(struct xt_connlimit_info, data),
 	.destroy    = connlimit_mt_destroy,
 	.me         = THIS_MODULE,
 };
diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 1006340..26ef70c 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -838,6 +838,7 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = {
 		.family         = NFPROTO_IPV4,
 		.match          = hashlimit_mt_v1,
 		.matchsize      = sizeof(struct xt_hashlimit_mtinfo1),
+		.usersize	= offsetof(struct xt_hashlimit_mtinfo1, hinfo),
 		.checkentry     = hashlimit_mt_check_v1,
 		.destroy        = hashlimit_mt_destroy_v1,
 		.me             = THIS_MODULE,
@@ -848,6 +849,7 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = {
 		.family         = NFPROTO_IPV4,
 		.match          = hashlimit_mt,
 		.matchsize      = sizeof(struct xt_hashlimit_mtinfo2),
+		.usersize	= offsetof(struct xt_hashlimit_mtinfo2, hinfo),
 		.checkentry     = hashlimit_mt_check,
 		.destroy        = hashlimit_mt_destroy,
 		.me             = THIS_MODULE,
@@ -859,6 +861,7 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = {
 		.family         = NFPROTO_IPV6,
 		.match          = hashlimit_mt_v1,
 		.matchsize      = sizeof(struct xt_hashlimit_mtinfo1),
+		.usersize	= offsetof(struct xt_hashlimit_mtinfo1, hinfo),
 		.checkentry     = hashlimit_mt_check_v1,
 		.destroy        = hashlimit_mt_destroy_v1,
 		.me             = THIS_MODULE,
@@ -869,6 +872,7 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = {
 		.family         = NFPROTO_IPV6,
 		.match          = hashlimit_mt,
 		.matchsize      = sizeof(struct xt_hashlimit_mtinfo2),
+		.usersize	= offsetof(struct xt_hashlimit_mtinfo2, hinfo),
 		.checkentry     = hashlimit_mt_check,
 		.destroy        = hashlimit_mt_destroy,
 		.me             = THIS_MODULE,
diff --git a/net/netfilter/xt_limit.c b/net/netfilter/xt_limit.c
index bef8505..dab962d 100644
--- a/net/netfilter/xt_limit.c
+++ b/net/netfilter/xt_limit.c
@@ -192,6 +192,8 @@ static struct xt_match limit_mt_reg __read_mostly = {
 	.compatsize       = sizeof(struct compat_xt_rateinfo),
 	.compat_from_user = limit_mt_compat_from_user,
 	.compat_to_user   = limit_mt_compat_to_user,
+#else
+	.usersize         = offsetof(struct xt_rateinfo, prev),
 #endif
 	.me               = THIS_MODULE,
 };
diff --git a/net/netfilter/xt_quota.c b/net/netfilter/xt_quota.c
index 44c8eb4..10d61a6 100644
--- a/net/netfilter/xt_quota.c
+++ b/net/netfilter/xt_quota.c
@@ -73,6 +73,7 @@ static struct xt_match quota_mt_reg __read_mostly = {
 	.checkentry = quota_mt_check,
 	.destroy    = quota_mt_destroy,
 	.matchsize  = sizeof(struct xt_quota_info),
+	.usersize   = offsetof(struct xt_quota_info, master),
 	.me         = THIS_MODULE,
 };
 
diff --git a/net/netfilter/xt_rateest.c b/net/netfilter/xt_rateest.c
index 1db02f6..755d2f6 100644
--- a/net/netfilter/xt_rateest.c
+++ b/net/netfilter/xt_rateest.c
@@ -133,6 +133,7 @@ static struct xt_match xt_rateest_mt_reg __read_mostly = {
 	.checkentry = xt_rateest_mt_checkentry,
 	.destroy    = xt_rateest_mt_destroy,
 	.matchsize  = sizeof(struct xt_rateest_match_info),
+	.usersize   = offsetof(struct xt_rateest_match_info, est1),
 	.me         = THIS_MODULE,
 };
 
diff --git a/net/netfilter/xt_string.c b/net/netfilter/xt_string.c
index 0bc3460..423293e 100644
--- a/net/netfilter/xt_string.c
+++ b/net/netfilter/xt_string.c
@@ -77,6 +77,7 @@ static struct xt_match xt_string_mt_reg __read_mostly = {
 	.match      = string_mt,
 	.destroy    = string_mt_destroy,
 	.matchsize  = sizeof(struct xt_string_info),
+	.usersize   = offsetof(struct xt_string_info, config),
 	.me         = THIS_MODULE,
 };
 
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox