Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2] net: macb: do not scan PHYs manually
From: Nathan Sullivan @ 2016-04-28 15:55 UTC (permalink / raw)
  To: Nicolas Ferre; +Cc: netdev, linux-kernel, Florian Fainelli, Alexandre Belloni
In-Reply-To: <57222FCE.8050407@atmel.com>

On Thu, Apr 28, 2016 at 05:44:14PM +0200, Nicolas Ferre wrote:
> Le 28/04/2016 16:46, Nathan Sullivan a écrit :
> > Since of_mdiobus_register and mdiobus_register will scan automatically,
> > do not manually scan for PHY devices in the macb ethernet driver. Doing
> > so will result in many nonexistent PHYs on the MDIO bus if the MDIO
> > lines are floating or grounded, such as when they are not used.
> > 
> > Signed-off-by: Nathan Sullivan <nathan.sullivan@ni.com>
> 
> Well, as explained in the commit message that added this feature and in
> the comment, if no phy is specified in the DT we end up without phy...
> 
> There are AT91 platforms which lack specification for the phy node in
> the DT. So, I don't know if there is a better way to deal with this case
> but I see this removal as risky.
> 
> Bye,
> 
> Nicolas Ferre

Hmm, are AT91 platforms special in this regard? As far as I can tell, this
driver (macb) and Marvell PXA are the only ethernet drivers that call
mdiobus_scan directly, and PXA does it on a known address. I do see that there
are trees that use macb and don't have a phy listed, which is unfortunate.

Another way to fix our issue would be to consider all 0x0s a bad ID in
mdiobus_scan, so grounded MDIO lines do not get PHYs scanned.  Or we could add a
DT property to disable the manual scan.  I'm not sure what the correct solution
is, do you have a preference?

^ permalink raw reply

* Re: [PATCH net v3 0/5] drivers: net: cpsw: phy-handle fixes
From: Grygorii Strashko @ 2016-04-28 15:55 UTC (permalink / raw)
  To: David Rivshin (Allworx), netdev, linux-omap
  Cc: linux-arm-kernel, devicetree, linux-kernel, David Miller,
	Mugunthan V N, Andrew Goodbody, Markus Brunner, Nicolas Chauvet
In-Reply-To: <1461805808-4102-1-git-send-email-drivshin.allworx@gmail.com>

On 04/28/2016 04:10 AM, David Rivshin (Allworx) wrote:
> From: David Rivshin <drivshin@allworx.com>
>
> This series fixes a number of related issues around using phy-handle
> properties in cpsw emac nodes.
>
> Patch 1 fixes a bug if more than one slave is used, and either
> slave uses the phy-handle property in the devicetree.
>
> Patch 2 fixes a NULL pointer dereference which can occur if a
> phy-handle property is used and of_phy_connect() return NULL,
> such as with a bad devicetree.
>
> Patch 3 fixes an issue where the phy-mode property would be ignored
> if a phy-handle property was used. This also fixes a bogus error
> message that would be emitted.
>
> Patch 4 fixes makes the binding documentation more explicit that
> exactly one PHY property should be used, and also marks phy_id as
> deprecated.
>
> Patch 5 cleans up the fixed-link case to work like the now-fixed
> phy-handle case.
>
> I have tested on the following hardware configurations:
>   - (EVMSK) dual emac, phy_id property in both slaves
>   - (EVMSK) dual emac, phy-handle property in both slaves
>   - (EVMSK) a bad phy-handle property pointing to &mmc1
>   - (EVMSK) phy_id property with incorrect PHY address
>   - (BeagleBoneBlack) single emac, phy_id property
>   - (custom) single emac, fixed-link subnode
>
> Andrew Goodbody reported testing v2 on a board that doesn't use
> dual_emac mode, but with 2 PHYs using phy-handle properties [1].
>
> Nicolas Chauvet reported testing v2 on an HP t410 (dm8148).
>
> Markus Brunner reported testing v1 on the following [2]:
>   - emac0 with phy_id and emac1 with fixed phy
>   - emac0 with phy-handle and emac1 with fixed phy
>   - emac0 with fixed phy and emac1 with fixed phy
>
> [1] https://lkml.org/lkml/2016/4/22/537
> [2] http://www.spinics.net/lists/netdev/msg357890.html
>
> David Rivshin (5):
>    drivers: net: cpsw: fix parsing of phy-handle DT property in dual_emac
>      config
>    drivers: net: cpsw: fix segfault in case of bad phy-handle
>    drivers: net: cpsw: don't ignore phy-mode if phy-handle is used
>    dt: cpsw: phy-handle, phy_id, and fixed-link are mutually exclusive
>    drivers: net: cpsw: use of_phy_connect() in fixed-link case
>
>   Documentation/devicetree/bindings/net/cpsw.txt |  6 +--
>   drivers/net/ethernet/ti/cpsw.c                 | 69 ++++++++++++++------------
>   drivers/net/ethernet/ti/cpsw.h                 |  1 +
>   3 files changed, 41 insertions(+), 35 deletions(-)
>

Thanks a lot.
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>

-- 
regards,
-grygorii

^ permalink raw reply

* [PATCH net-next] net: constify is_skb_forwardable's arguments
From: Nikolay Aleksandrov @ 2016-04-28 15:59 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, Nikolay Aleksandrov

is_skb_forwardable is not supposed to change anything so constify its
arguments

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
Hit this while working on the bridge per-vlan stats and needed to pass a
constified skb down. skb_is_gso() already takes a const skb.

 include/linux/netdevice.h | 3 ++-
 net/core/dev.c            | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 18d8394f2e5d..f9d573e26db3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3260,7 +3260,8 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
-bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb);
+bool is_skb_forwardable(const struct net_device *dev,
+			const struct sk_buff *skb);
 
 extern int		netdev_budget;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 6324bc9267f7..36f0170a0754 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1741,7 +1741,7 @@ static inline void net_timestamp_set(struct sk_buff *skb)
 			__net_timestamp(SKB);		\
 	}						\
 
-bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb)
+bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
 {
 	unsigned int len;
 
-- 
2.4.11

^ permalink raw reply related

* Re: [RFC PATCH net] ipv6/ila: fix nlsize calculation for lwtunnel
From: Tom Herbert @ 2016-04-28 16:07 UTC (permalink / raw)
  To: David Miller; +Cc: Nicolas Dichtel, Linux Kernel Network Developers
In-Reply-To: <20160427.152053.2243888701043810040.davem@davemloft.net>

On Wed, Apr 27, 2016 at 12:20 PM, David Miller <davem@davemloft.net> wrote:
> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Date: Fri, 22 Apr 2016 17:58:02 +0200
>
>> The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.
>>
>> Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module")
>> CC: Tom Herbert <tom@herbertland.com>
>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> ---
>>
>> Tom, when I read the comment, I feel I'm misssing something, but what?
>
> Tom, seriously, please look at this.
>
Yes, it is an issue. Need to do add size for ILA_ATTR_LOCATOR and
ILA_ATTR_CSUM_MODE. Also not we made ILA_ATTR_CSUM_MODE u64 in fill
encap info. I will send a path shortly.

> And with recent changes in net-next the csum attribute size needs to
> be specified as well, plus the locator needs to use the 64-bit
> alignment sizing helper.

Something other than nla_put_u64_64bit?

Thanks,
Tom

^ permalink raw reply

* Re: [RFC PATCH net] ipv6/ila: fix nlsize calculation for lwtunnel
From: David Miller @ 2016-04-28 16:16 UTC (permalink / raw)
  To: tom; +Cc: nicolas.dichtel, netdev
In-Reply-To: <CALx6S36jkPqx5U+B9akRYZ+xiRhozk_AdOmKBEr4N9mhwTKQMw@mail.gmail.com>

From: Tom Herbert <tom@herbertland.com>
Date: Thu, 28 Apr 2016 09:07:25 -0700

> On Wed, Apr 27, 2016 at 12:20 PM, David Miller <davem@davemloft.net> wrote:
>> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> Date: Fri, 22 Apr 2016 17:58:02 +0200
>>
>>> The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.
>>>
>>> Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module")
>>> CC: Tom Herbert <tom@herbertland.com>
>>> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>>> ---
>>>
>>> Tom, when I read the comment, I feel I'm misssing something, but what?
>>
>> Tom, seriously, please look at this.
>>
> Yes, it is an issue. Need to do add size for ILA_ATTR_LOCATOR and
> ILA_ATTR_CSUM_MODE. Also not we made ILA_ATTR_CSUM_MODE u64 in fill
> encap info. I will send a path shortly.
> 
>> And with recent changes in net-next the csum attribute size needs to
>> be specified as well, plus the locator needs to use the 64-bit
>> alignment sizing helper.
> 
> Something other than nla_put_u64_64bit?

That's what it should use, yes.

^ permalink raw reply

* Re: [PATCH v2] net: macb: do not scan PHYs manually
From: Andrew Lunn @ 2016-04-28 16:32 UTC (permalink / raw)
  To: Nathan Sullivan
  Cc: Nicolas Ferre, netdev, linux-kernel, Florian Fainelli,
	Alexandre Belloni
In-Reply-To: <20160428155556.GA8333@nathan3500-linux-VM>

> Hmm, are AT91 platforms special in this regard? As far as I can tell, this
> driver (macb) and Marvell PXA are the only ethernet drivers that call
> mdiobus_scan directly, and PXA does it on a known address. I do see that there
> are trees that use macb and don't have a phy listed, which is unfortunate.

How it is supposed to work is that you do one of two things:

1) Your device tree does not have an mdio node. In this case, you call
mdiobus_register() and it will perform a scan of the bus, and find the
phys.

2) Your device tree does have an MDIO node, and you list your PHYs.

Having an MDIO node and not listing the PHYs is broken...

There are however, a few broken device trees around, and a few drivers
have workarounds. e.g. davinci_mdio.c

       /* register the mii bus
         * Create PHYs from DT only in case if PHY child nodes are explicitly
         * defined to support backward compatibility with DTs which assume that
         * Davinci MDIO will always scan the bus for PHYs detection.
         */
        if (dev->of_node && of_get_child_count(dev->of_node)) {
                data->skip_scan = true;
                ret = of_mdiobus_register(data->bus, dev->of_node);
        } else {
                ret = mdiobus_register(data->bus);
        }

You probably need to do the same for AT91, count the number of
children, and it if it zero, fall back to the non-DT way. It would
also be good to print a warning to get people to fix their device
tree.

    Andrew

^ permalink raw reply

* Re: [PATCH net-next] macvtap: add namespace support to the sysfs device class
From: Marc Angel @ 2016-04-28 16:35 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <8737q933kv.fsf@x220.int.ebiederm.org>

On Mon, Apr 25, 2016 at 8:12 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>> The 'net' device class is isolated between network namespaces so each
>> one has its own hierarchy of net devices.
>> This isn't the case for the 'macvtap' device class.
>> The problem occurs half-way through the netdev registration, when
>> `macvtap_device_event` is called-back to create the 'tapNN' macvtap
>> class device under the 'macvtapX' net class device.
>>
>> This patch adds namespace support the the 'macvtap' device class so
>> that /sys/class/macvtap is no longer shared between net namespaces.
>>
>> However, doing this has the side effect of changing
>> /sys/devices/virtual/net/macvtapX/tapNN  into
>> /sys/devices/virtual/net/macvtapX/macvtap/tapNN
>
> I forget the details of how this interface works, but
> /sys/devices/virtual/net is definitely allows different overlapping
> content per network namespace, so we should not need to add an extra
> directory to make this work.

It really seems like we do, unfortunately.

For a kernfs_node to have the KERNFS_NS flag enabled, sysfs_enable_ns
has to be called on it. This is only done in the create_dir function of
lib/kobject.c, and only when the parent of that kobject has a ktype with
the child_ns_type field set to something.

This is the case for class_dir_ktype which is the type used for the
"glue" dirs (the extra macvtap/ that is created under macvtapX).
This, however, is not the case for device_ktype, which is the type
used for every device directory.
When we create tapN directly under macvtapX, tapN doesn't get the
KERNFS_NS flag enabled -- unlike when created under the "glue" dir.

This is problematic when creating the following symlink:
/sys/class/macvtap/tapN -> /sys/devices/virtual/net/macvtapX/tapN.
The tapN in /sys/class/macvtap inherits the namespace tag from
/sys/devices/virtual/net/macvtapX/tapN, which doesn't have one anymore
and kernfs_add_one fails because it expects it to.

Adding a child_ns_type field to device_ktype is probably not a good idea
and seems to cause other problems.

The best workaround is probably to just create a symlink inside the
macvtapX device directory (tapN -> macvtap/tapN).

I'll update my patch accordingly if you don't have a better idea.

>> Should it even be possible to add a device of a class that doesn't
>> support namespaces under one that does?
>> This could lead to dead symlinks in the new device class directory or
>> duplicate warnings because a device of the same name already exists in
>> another namespace.
>
> This definitely looks like something that bears digging into, and fixing
> properly.
>
> Eric

^ permalink raw reply

* Re: [net-next PATCH V4 3/5] samples/bpf: add a README file to get users started
From: Alexei Starovoitov @ 2016-04-28 16:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, linux-kbuild, bblanco, naveen.n.rao, borkmann
In-Reply-To: <20160428122103.16807.41837.stgit@firesoul>

On Thu, Apr 28, 2016 at 02:21:04PM +0200, Jesper Dangaard Brouer wrote:
> Getting started with using examples in samples/bpf/ is not
> straightforward.  There are several dependencies, and specific
> versions of these dependencies.
> 
> Just compiling the example tool is also slightly obscure, e.g. one
> need to call make like:
> 
>  make samples/bpf/
> 
> Do notice the "/" slash after the directory name.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>


^ permalink raw reply

* Re: iproute2: bash completion function for tc
From: Alexei Starovoitov @ 2016-04-28 16:53 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Stephen Hemminger, Jamal Hadi Salim, Vincent Jardin, netdev
In-Reply-To: <5721FEE1.7080807@6wind.com>

On Thu, Apr 28, 2016 at 02:15:29PM +0200, Quentin Monnet wrote:
> Hi Alexei, Stephen,
> 
> 2016-04-27 (22:13 UTC-0700) ~ Stephen Hemminger:
> > On Wed, 27 Apr 2016 20:19:26 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> >> On Tue, Apr 26, 2016 at 09:28:17AM +0200, Quentin Monnet wrote:
> >>> Hi Jamal, Stephen,
> >>>
> >>> I searched for a function providing auto-completion for `tc` utility in
> >>> bash, but I found none. So I have created one, and I would like share it
> >>> with the community. It is available here:
> >>> https://github.com/6WIND/tc_bash-completion/blob/master/tc
> >>> I would like to make it easily available to tc users, so here is a
> >>> twofold request:
> >>>
> >>> * I do not know where to submit the code. Should I submit here on netdev
> >>> for inclusion in iproute2 package, or rather to the bash-completion
> >>> repository on GitHub? I feel like it would receive better feedback and
> >>> updates if pushed to iproute2. Could you please provide some advice here?
> >>> * The completion for `tc` seems to work well; I have tested it with many
> >>> commands, but I am no tc expert, and there are probably some cases where
> >>> the completion fails to propose the correct choices. I would be really
> >>> interested in any feedback/bug reports that you, or anyone on this list
> >>> who uses tc, could provide.
> >>
> >> that looks very interesting.
> >> I think making it a part of iproute2 is a good thing.
> >> How about installing it into /etc/iproute2/ ?
> >> Stephen, any comments?
> >>
> > 
> > I am ok with keeping it in the repository.
> > But it would need to be installed in the standard bash directory,
> > is that distro dependent?
> > 
> 
> As far as I know the bash-completion directory is not distro dependent,
> but it moved at some point (2011) from /etc/bash_completion.d/ to
> /usr/share/bash-completion/completions/. While backward compatibility is
> provided with the former location, it is now recommended to use the latter.
> 
> So one idea could be to check (in iproute2 Makefile) for existence of
> /usr/share/bash-completion/completions/ directory, and if not found to
> fall back to /etc/bash_completion.d/. If none is found, use
> /usr/share/bash-completion/completions/. Does this seem correct?

make sense to me.
For some reason there is also /etc/bash-completion.d/ (note - vs _)
and git installs there (which seems to be a bug), but your proposal is good as-is.

^ permalink raw reply

* Re: [RFC 07/20] net: dsa: list ports in switch\\
From: Florian Fainelli @ 2016-04-28 17:00 UTC (permalink / raw)
  To: Andrew Lunn, Vivien Didelot
  Cc: netdev, linux-kernel, kernel, David S. Miller, Jiri Pirko
In-Reply-To: <20160427231537.GJ29024@lunn.ch>

On 27/04/16 16:15, Andrew Lunn wrote:
> On Wed, Apr 27, 2016 at 06:30:04PM -0400, Vivien Didelot wrote:
>> List DSA port structures in their switch structure, so that drivers can
>> iterate on them to retrieve information such as their ports membership.
> 
> And this would be so much easier using a plan array.

Agreed, I do not see much value in doing this at the moment. Even if you
have unused ports in a switch, allocating an array is a small price to
pay compared to directly indexing by port number.

NAK from me unless there is a compelling reason for doing so.
-- 
Florian

^ permalink raw reply

* [PATCH nf-next 1/9] netfilter: conntrack: keep BH enabled during lookup
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1461863628-23350-1-git-send-email-fw@strlen.de>

No need to disable BH here anymore:

stats are switched to _ATOMIC variant (== this_cpu_inc()), which
nowadays generates same code as the non _ATOMIC NF_STAT, at least on x86.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 1fd0ff1..1b63359 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -472,18 +472,13 @@ ____nf_conntrack_find(struct net *net, const struct nf_conntrack_zone *zone,
 	struct hlist_nulls_node *n;
 	unsigned int bucket = hash_bucket(hash, net);
 
-	/* Disable BHs the entire time since we normally need to disable them
-	 * at least once for the stats anyway.
-	 */
-	local_bh_disable();
 begin:
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
 		if (nf_ct_key_equal(h, tuple, zone)) {
-			NF_CT_STAT_INC(net, found);
-			local_bh_enable();
+			NF_CT_STAT_INC_ATOMIC(net, found);
 			return h;
 		}
-		NF_CT_STAT_INC(net, searched);
+		NF_CT_STAT_INC_ATOMIC(net, searched);
 	}
 	/*
 	 * if the nulls value we got at the end of this lookup is
@@ -491,10 +486,9 @@ begin:
 	 * We probably met an item that was moved to another chain.
 	 */
 	if (get_nulls_value(n) != bucket) {
-		NF_CT_STAT_INC(net, search_restart);
+		NF_CT_STAT_INC_ATOMIC(net, search_restart);
 		goto begin;
 	}
-	local_bh_enable();
 
 	return NULL;
 }
@@ -735,22 +729,19 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 	zone = nf_ct_zone(ignored_conntrack);
 	hash = hash_conntrack(net, tuple);
 
-	/* Disable BHs the entire time since we need to disable them at
-	 * least once for the stats anyway.
-	 */
-	rcu_read_lock_bh();
+	rcu_read_lock();
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (ct != ignored_conntrack &&
 		    nf_ct_tuple_equal(tuple, &h->tuple) &&
 		    nf_ct_zone_equal(ct, zone, NF_CT_DIRECTION(h))) {
-			NF_CT_STAT_INC(net, found);
-			rcu_read_unlock_bh();
+			NF_CT_STAT_INC_ATOMIC(net, found);
+			rcu_read_unlock();
 			return 1;
 		}
-		NF_CT_STAT_INC(net, searched);
+		NF_CT_STAT_INC_ATOMIC(net, searched);
 	}
-	rcu_read_unlock_bh();
+	rcu_read_unlock();
 
 	return 0;
 }
-- 
2.7.3

^ permalink raw reply related

* [PATCH nf-next 0/9] netfilter: remove per-netns conntrack tables, part 1
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev

[ CCing netdev so netns folks can have a look too ]

This patch series removes the per-netns connection tracking tables.
All conntrack objects are then stored in one global global table.

This avoids the infamous 'vmalloc' when lots of namespaces are used:
We no longer allocate a new conntrack table for each namespace (with 64k
size this saves 512kb of memory per netns).

- net namespace address is made part of conntrack hash, to spread
  conntracks over entire table even if netns has overlapping ip addresses.
- lookup and iterators net_eq() to skip conntracks living in a different
  namespace.

Only the main conntrack table is converted here:
NAT bysrc and expectation hashes are still per namespace (will be unified
in a followup series).  Also, this retains the per-namespace kmem cache
for the conntrack objects.  This will also be resolved in a followup series.

Comments welcome.

 include/net/netfilter/nf_conntrack_core.h             |   11 
 include/net/netns/conntrack.h                         |    2 
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c        |    2 
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c |   38 ++
 net/netfilter/nf_conntrack_core.c                     |  233 +++++++++---------
 net/netfilter/nf_conntrack_helper.c                   |    6 
 net/netfilter/nf_conntrack_netlink.c                  |   11 
 net/netfilter/nf_conntrack_standalone.c               |   13 -
 net/netfilter/nf_nat_core.c                           |    2 
 net/netfilter/nfnetlink_cttimeout.c                   |    6 
 10 files changed, 179 insertions(+), 145 deletions(-)


^ permalink raw reply

* [PATCH nf-next 2/9] netfilter: conntrack: fix lookup race during hash resize
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1461863628-23350-1-git-send-email-fw@strlen.de>

When resizing the conntrack hash table at runtime via
echo 42 > /sys/module/nf_conntrack/parameters/hashsize, we are racing with
the conntrack lookup path -- reads can happen in parallel and nothing
prevents readers from observing a the newly allocated hash but the old
size (or vice versa).

So access to hash[bucket] can trigger OOB read access in case the table got
expanded and we saw the new size but the old hash pointer (or it got shrunk
and we got new hash ptr but the size of the old and larger table):

kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN
CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107
[..]
Call Trace:
[<ffffffff822c3d6a>] ? nf_conntrack_tuple_taken+0x12a/0xe90
[<ffffffff822c3ac1>] ? nf_ct_invert_tuplepr+0x221/0x3a0
[<ffffffff8230e703>] get_unique_tuple+0xfb3/0x2760

Use generation counter to obtain the address/length of the same table.

Also add a synchronize_net before freeing the old hash.
AFAICS, without it we might access ct_hash[bucket] after ct_hash has been
freed, provided that lockless reader got delayed by another event:

CPU1			CPU2
seq_begin
seq_retry
<delay>			resize occurs
			free oldhash
for_each(oldhash[size])

Note that resize is only supported in init_netns, it took over 2 minutes
of constant resizing+flooding to produce the warning, so this isn't a
big problem in practice.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 1b63359..29fa08b 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -469,11 +469,18 @@ ____nf_conntrack_find(struct net *net, const struct nf_conntrack_zone *zone,
 		      const struct nf_conntrack_tuple *tuple, u32 hash)
 {
 	struct nf_conntrack_tuple_hash *h;
+	struct hlist_nulls_head *ct_hash;
 	struct hlist_nulls_node *n;
-	unsigned int bucket = hash_bucket(hash, net);
+	unsigned int bucket, sequence;
 
 begin:
-	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
+	do {
+		sequence = read_seqcount_begin(&nf_conntrack_generation);
+		bucket = hash_bucket(hash, net);
+		ct_hash = net->ct.hash;
+	} while (read_seqcount_retry(&nf_conntrack_generation, sequence));
+
+	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) {
 		if (nf_ct_key_equal(h, tuple, zone)) {
 			NF_CT_STAT_INC_ATOMIC(net, found);
 			return h;
@@ -722,15 +729,21 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 	struct net *net = nf_ct_net(ignored_conntrack);
 	const struct nf_conntrack_zone *zone;
 	struct nf_conntrack_tuple_hash *h;
+	struct hlist_nulls_head *ct_hash;
+	unsigned int hash, sequence;
 	struct hlist_nulls_node *n;
 	struct nf_conn *ct;
-	unsigned int hash;
 
 	zone = nf_ct_zone(ignored_conntrack);
-	hash = hash_conntrack(net, tuple);
 
 	rcu_read_lock();
-	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
+	do {
+		sequence = read_seqcount_begin(&nf_conntrack_generation);
+		hash = hash_conntrack(net, tuple);
+		ct_hash = net->ct.hash;
+	} while (read_seqcount_retry(&nf_conntrack_generation, sequence));
+
+	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[hash], hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (ct != ignored_conntrack &&
 		    nf_ct_tuple_equal(tuple, &h->tuple) &&
@@ -1607,6 +1620,7 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	nf_conntrack_all_unlock();
 	local_bh_enable();
 
+	synchronize_net();
 	nf_ct_free_hashtable(old_hash, old_size);
 	return 0;
 }
-- 
2.7.3


^ permalink raw reply related

* [PATCH nf-next 3/9] netfilter: conntrack: don't attempt to iterate over empty table
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1461863628-23350-1-git-send-email-fw@strlen.de>

Once we place all conntracks into same table iteration becomes more
costly because the table contains conntracks that we are not interested
in (belonging to other netns).

So don't bother scanning if the current namespace has no entries.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 29fa08b..f2e75a5 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1428,6 +1428,9 @@ void nf_ct_iterate_cleanup(struct net *net,
 
 	might_sleep();
 
+	if (atomic_read(&net->ct.count) == 0)
+		return;
+
 	while ((ct = get_next_corpse(net, iter, data, &bucket)) != NULL) {
 		/* Time to push up daises... */
 		if (del_timer(&ct->timeout))
-- 
2.7.3


^ permalink raw reply related

* [PATCH nf-next 4/9] netfilter: conntrack: use nf_ct_key_equal() in more places
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1461863628-23350-1-git-send-email-fw@strlen.de>

This prepares for upcoming change that places all conntracks into a
single, global table.  For this to work we will need to also compare
net pointer during lookup.  To avoid open-coding such check use the
nf_ct_key_equal helper and then later extend it to also consider net_eq.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f2e75a5..3b9c302 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -572,16 +572,13 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 
 	/* See if there's one in the list already, including reverse */
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
-		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
-				      &h->tuple) &&
-		    nf_ct_zone_equal(nf_ct_tuplehash_to_ctrack(h), zone,
-				     NF_CT_DIRECTION(h)))
+		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
+				    zone))
 			goto out;
+
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[reply_hash], hnnode)
-		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
-				      &h->tuple) &&
-		    nf_ct_zone_equal(nf_ct_tuplehash_to_ctrack(h), zone,
-				     NF_CT_DIRECTION(h)))
+		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
+				    zone))
 			goto out;
 
 	add_timer(&ct->timeout);
@@ -665,16 +662,13 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	   NAT could have grabbed it without realizing, since we're
 	   not in the hash.  If there is, we lost race. */
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
-		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
-				      &h->tuple) &&
-		    nf_ct_zone_equal(nf_ct_tuplehash_to_ctrack(h), zone,
-				     NF_CT_DIRECTION(h)))
+		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
+				    zone))
 			goto out;
+
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[reply_hash], hnnode)
-		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
-				      &h->tuple) &&
-		    nf_ct_zone_equal(nf_ct_tuplehash_to_ctrack(h), zone,
-				     NF_CT_DIRECTION(h)))
+		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
+				    zone))
 			goto out;
 
 	/* Timer relative to confirmation time, not original
@@ -746,8 +740,7 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[hash], hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (ct != ignored_conntrack &&
-		    nf_ct_tuple_equal(tuple, &h->tuple) &&
-		    nf_ct_zone_equal(ct, zone, NF_CT_DIRECTION(h))) {
+		    nf_ct_key_equal(h, tuple, zone)) {
 			NF_CT_STAT_INC_ATOMIC(net, found);
 			rcu_read_unlock();
 			return 1;
-- 
2.7.3

^ permalink raw reply related

* [PATCH nf-next 5/9] netfilter: conntrack: small refactoring of conntrack seq_printf
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1461863628-23350-1-git-send-email-fw@strlen.de>

The iteration process is lockless, so we test if the conntrack object is
eligible for printing (e.g. is AF_INET) after obtaining the reference
count.

Once we put all conntracks into same hash table we might see more
entries that need to be skipped.

So add a helper and first perform the test in a lockless fashion
for fast skip.

Once we obtain the reference count, just repeat the check.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../netfilter/nf_conntrack_l3proto_ipv4_compat.c   | 24 +++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index f0dfe92..483cf79 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -114,6 +114,19 @@ static inline void ct_show_secctx(struct seq_file *s, const struct nf_conn *ct)
 }
 #endif
 
+static bool ct_seq_should_skip(const struct nf_conn *ct,
+			       const struct nf_conntrack_tuple_hash *hash)
+{
+	/* we only want to print DIR_ORIGINAL */
+	if (NF_CT_DIRECTION(hash))
+		return true;
+
+	if (nf_ct_l3num(ct) != AF_INET)
+		return true;
+
+	return false;
+}
+
 static int ct_seq_show(struct seq_file *s, void *v)
 {
 	struct nf_conntrack_tuple_hash *hash = v;
@@ -123,14 +136,15 @@ static int ct_seq_show(struct seq_file *s, void *v)
 	int ret = 0;
 
 	NF_CT_ASSERT(ct);
-	if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
+	if (ct_seq_should_skip(ct, hash))
 		return 0;
 
+	if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
+		return 0;
 
-	/* we only want to print DIR_ORIGINAL */
-	if (NF_CT_DIRECTION(hash))
-		goto release;
-	if (nf_ct_l3num(ct) != AF_INET)
+	/* check if we raced w. object reuse */
+	if (!nf_ct_is_confirmed(ct) ||
+	    ct_seq_should_skip(ct, hash))
 		goto release;
 
 	l3proto = __nf_ct_l3proto_find(nf_ct_l3num(ct));
-- 
2.7.3


^ permalink raw reply related

* [PATCH nf-next 7/9] netfilter: conntrack: make netns address part of hash
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1461863628-23350-1-git-send-email-fw@strlen.de>

Once we place all conntracks into a global hash table we want them to be
spread across entire hash table, even if namespaces have overlapping ip
addresses.

We add nf_conntrack_netns_hash helper to later re-use it for nat bysrc
and expectation hash handling.  The helper also allows us to avoid the
(then) pointless hashing of init_net if kernel is built without netns
support.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack_core.h | 10 +++++++++
 net/netfilter/nf_conntrack_core.c         | 34 +++++++++++++++----------------
 2 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 62e17d1..389e6da 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -12,6 +12,7 @@
 #ifndef _NF_CONNTRACK_CORE_H
 #define _NF_CONNTRACK_CORE_H
 
+#include <linux/hash.h>
 #include <linux/netfilter.h>
 #include <net/netfilter/nf_conntrack_l3proto.h>
 #include <net/netfilter/nf_conntrack_l4proto.h>
@@ -86,4 +87,13 @@ void nf_conntrack_lock(spinlock_t *lock);
 
 extern spinlock_t nf_conntrack_expect_lock;
 
+static inline u32 nf_conntrack_netns_hash(const struct net *net)
+{
+#ifdef CONFIG_NET_NS
+	return hash_ptr(net, 32);
+#else
+	return 0;
+#endif
+}
+
 #endif /* _NF_CONNTRACK_CORE_H */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 10ae2ee..c29b929 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -144,9 +144,11 @@ EXPORT_PER_CPU_SYMBOL(nf_conntrack_untracked);
 
 static unsigned int nf_conntrack_hash_rnd __read_mostly;
 
-static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple)
+static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple,
+			      const struct net *net)
 {
 	unsigned int n;
+	u32 seed;
 
 	get_random_once(&nf_conntrack_hash_rnd, sizeof(nf_conntrack_hash_rnd));
 
@@ -154,32 +156,29 @@ static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple)
 	 * destination ports (which is a multiple of 4) and treat the last
 	 * three bytes manually.
 	 */
+	seed = nf_conntrack_hash_rnd ^ nf_conntrack_netns_hash(net);
 	n = (sizeof(tuple->src) + sizeof(tuple->dst.u3)) / sizeof(u32);
-	return jhash2((u32 *)tuple, n, nf_conntrack_hash_rnd ^
+	return jhash2((u32 *)tuple, n, seed ^
 		      (((__force __u16)tuple->dst.u.all << 16) |
 		      tuple->dst.protonum));
 }
 
-static u32 __hash_bucket(u32 hash, unsigned int size)
-{
-	return reciprocal_scale(hash, size);
-}
-
 static u32 hash_bucket(u32 hash, const struct net *net)
 {
-	return __hash_bucket(hash, net->ct.htable_size);
+	return reciprocal_scale(hash, net->ct.htable_size);
 }
 
-static u_int32_t __hash_conntrack(const struct nf_conntrack_tuple *tuple,
-				  unsigned int size)
+static u32 __hash_conntrack(const struct net *net,
+			    const struct nf_conntrack_tuple *tuple,
+			    unsigned int size)
 {
-	return __hash_bucket(hash_conntrack_raw(tuple), size);
+	return reciprocal_scale(hash_conntrack_raw(tuple, net), size);
 }
 
-static inline u_int32_t hash_conntrack(const struct net *net,
-				       const struct nf_conntrack_tuple *tuple)
+static u32 hash_conntrack(const struct net *net,
+			  const struct nf_conntrack_tuple *tuple)
 {
-	return __hash_conntrack(tuple, net->ct.htable_size);
+	return __hash_conntrack(net, tuple, net->ct.htable_size);
 }
 
 bool
@@ -535,7 +534,7 @@ nf_conntrack_find_get(struct net *net, const struct nf_conntrack_zone *zone,
 		      const struct nf_conntrack_tuple *tuple)
 {
 	return __nf_conntrack_find_get(net, zone, tuple,
-				       hash_conntrack_raw(tuple));
+				       hash_conntrack_raw(tuple, net));
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_find_get);
 
@@ -1041,7 +1040,7 @@ resolve_normal_ct(struct net *net, struct nf_conn *tmpl,
 
 	/* look for tuple match */
 	zone = nf_ct_zone_tmpl(tmpl, skb, &tmp);
-	hash = hash_conntrack_raw(&tuple);
+	hash = hash_conntrack_raw(&tuple, net);
 	h = __nf_conntrack_find_get(net, zone, &tuple, hash);
 	if (!h) {
 		h = init_conntrack(net, tmpl, &tuple, l3proto, l4proto,
@@ -1605,7 +1604,8 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 					struct nf_conntrack_tuple_hash, hnnode);
 			ct = nf_ct_tuplehash_to_ctrack(h);
 			hlist_nulls_del_rcu(&h->hnnode);
-			bucket = __hash_conntrack(&h->tuple, hashsize);
+			bucket = __hash_conntrack(nf_ct_net(ct),
+						  &h->tuple, hashsize);
 			hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]);
 		}
 	}
-- 
2.7.3


^ permalink raw reply related

* [PATCH nf-next 9/9] netfilter: conntrack: consider ct netns in early_drop logic
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1461863628-23350-1-git-send-email-fw@strlen.de>

When iterating, skip conntrack entries living in a different netns.

We could ignore netns and kill some other non-assured one, but it
has two problems:

- a netns can kill non-assured conntracks in other namespace
- we would start to 'over-subscribe' the affected/overlimit netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 43 +++++++++++++++++++++++----------------
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index d58b597..418e4bc 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -763,18 +763,20 @@ static noinline int early_drop(struct net *net, unsigned int _hash)
 {
 	/* Use oldest entry, which is roughly LRU */
 	struct nf_conntrack_tuple_hash *h;
-	struct nf_conn *ct = NULL, *tmp;
+	struct nf_conn *tmp;
 	struct hlist_nulls_node *n;
-	unsigned int i = 0, cnt = 0;
-	int dropped = 0;
-	unsigned int hash, sequence;
+	unsigned int i, hash, sequence;
+	struct nf_conn *ct = NULL;
 	spinlock_t *lockp;
+	bool ret = false;
+
+	i = 0;
 
 	local_bh_disable();
 restart:
 	sequence = read_seqcount_begin(&nf_conntrack_generation);
-	hash = scale_hash(_hash);
-	for (; i < nf_conntrack_htable_size; i++) {
+	for (; i < NF_CT_EVICTION_RANGE; i++) {
+		hash = scale_hash(_hash++);
 		lockp = &nf_conntrack_locks[hash % CONNTRACK_LOCKS];
 		nf_conntrack_lock(lockp);
 		if (read_seqcount_retry(&nf_conntrack_generation, sequence)) {
@@ -784,35 +786,40 @@ restart:
 		hlist_nulls_for_each_entry_rcu(h, n, &nf_conntrack_hash[hash],
 					       hnnode) {
 			tmp = nf_ct_tuplehash_to_ctrack(h);
-			if (!test_bit(IPS_ASSURED_BIT, &tmp->status) &&
-			    !nf_ct_is_dying(tmp) &&
-			    atomic_inc_not_zero(&tmp->ct_general.use)) {
+
+			if (test_bit(IPS_ASSURED_BIT, &tmp->status) ||
+			    !net_eq(nf_ct_net(tmp), net) ||
+			    nf_ct_is_dying(tmp))
+				continue;
+
+			if (atomic_inc_not_zero(&tmp->ct_general.use)) {
 				ct = tmp;
 				break;
 			}
-			cnt++;
 		}
 
-		hash = (hash + 1) % nf_conntrack_htable_size;
 		spin_unlock(lockp);
-
-		if (ct || cnt >= NF_CT_EVICTION_RANGE)
+		if (ct)
 			break;
-
 	}
+
 	local_bh_enable();
 
 	if (!ct)
-		return dropped;
+		return false;
 
-	if (del_timer(&ct->timeout)) {
+	/* kill only if in same netns -- might have moved due to
+	 * SLAB_DESTROY_BY_RCU rules
+	 */
+	if (net_eq(nf_ct_net(ct), net) && del_timer(&ct->timeout)) {
 		if (nf_ct_delete(ct, 0, 0)) {
-			dropped = 1;
 			NF_CT_STAT_INC_ATOMIC(net, early_drop);
+			ret = true;
 		}
 	}
+
 	nf_ct_put(ct);
-	return dropped;
+	return ret;
 }
 
 static struct nf_conn *
-- 
2.7.3


^ permalink raw reply related

* [PATCH nf-next 6/9] netfilter: conntrack: check netns when comparing conntrack objects
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1461863628-23350-1-git-send-email-fw@strlen.de>

Once we place all conntracks in the same hash table we must also compare
the netns pointer to skip conntracks that belong to a different namespace.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../netfilter/nf_conntrack_l3proto_ipv4_compat.c   |  8 ++++++--
 net/netfilter/nf_conntrack_core.c                  | 23 ++++++++++++----------
 net/netfilter/nf_conntrack_netlink.c               |  3 +++
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 483cf79..171aba1 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -115,6 +115,7 @@ static inline void ct_show_secctx(struct seq_file *s, const struct nf_conn *ct)
 #endif
 
 static bool ct_seq_should_skip(const struct nf_conn *ct,
+			       const struct net *net,
 			       const struct nf_conntrack_tuple_hash *hash)
 {
 	/* we only want to print DIR_ORIGINAL */
@@ -124,6 +125,9 @@ static bool ct_seq_should_skip(const struct nf_conn *ct,
 	if (nf_ct_l3num(ct) != AF_INET)
 		return true;
 
+	if (!net_eq(nf_ct_net(ct), net))
+		return true;
+
 	return false;
 }
 
@@ -136,7 +140,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
 	int ret = 0;
 
 	NF_CT_ASSERT(ct);
-	if (ct_seq_should_skip(ct, hash))
+	if (ct_seq_should_skip(ct, seq_file_net(s), hash))
 		return 0;
 
 	if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
@@ -144,7 +148,7 @@ static int ct_seq_show(struct seq_file *s, void *v)
 
 	/* check if we raced w. object reuse */
 	if (!nf_ct_is_confirmed(ct) ||
-	    ct_seq_should_skip(ct, hash))
+	    ct_seq_should_skip(ct, seq_file_net(s), hash))
 		goto release;
 
 	l3proto = __nf_ct_l3proto_find(nf_ct_l3num(ct));
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 3b9c302..10ae2ee 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -447,7 +447,8 @@ static void death_by_timeout(unsigned long ul_conntrack)
 static inline bool
 nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
 		const struct nf_conntrack_tuple *tuple,
-		const struct nf_conntrack_zone *zone)
+		const struct nf_conntrack_zone *zone,
+		const struct net *net)
 {
 	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
 
@@ -456,7 +457,8 @@ nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
 	 */
 	return nf_ct_tuple_equal(tuple, &h->tuple) &&
 	       nf_ct_zone_equal(ct, zone, NF_CT_DIRECTION(h)) &&
-	       nf_ct_is_confirmed(ct);
+	       nf_ct_is_confirmed(ct) &&
+	       net_eq(net, nf_ct_net(ct));
 }
 
 /*
@@ -481,7 +483,7 @@ begin:
 	} while (read_seqcount_retry(&nf_conntrack_generation, sequence));
 
 	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) {
-		if (nf_ct_key_equal(h, tuple, zone)) {
+		if (nf_ct_key_equal(h, tuple, zone, net)) {
 			NF_CT_STAT_INC_ATOMIC(net, found);
 			return h;
 		}
@@ -517,7 +519,7 @@ begin:
 			     !atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
 		else {
-			if (unlikely(!nf_ct_key_equal(h, tuple, zone))) {
+			if (unlikely(!nf_ct_key_equal(h, tuple, zone, net))) {
 				nf_ct_put(ct);
 				goto begin;
 			}
@@ -573,12 +575,12 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 	/* See if there's one in the list already, including reverse */
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
 		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
-				    zone))
+				    zone, net))
 			goto out;
 
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[reply_hash], hnnode)
 		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
-				    zone))
+				    zone, net))
 			goto out;
 
 	add_timer(&ct->timeout);
@@ -663,12 +665,12 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	   not in the hash.  If there is, we lost race. */
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
 		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
-				    zone))
+				    zone, net))
 			goto out;
 
 	hlist_nulls_for_each_entry(h, n, &net->ct.hash[reply_hash], hnnode)
 		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
-				    zone))
+				    zone, net))
 			goto out;
 
 	/* Timer relative to confirmation time, not original
@@ -740,7 +742,7 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[hash], hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
 		if (ct != ignored_conntrack &&
-		    nf_ct_key_equal(h, tuple, zone)) {
+		    nf_ct_key_equal(h, tuple, zone, net)) {
 			NF_CT_STAT_INC_ATOMIC(net, found);
 			rcu_read_unlock();
 			return 1;
@@ -1383,7 +1385,8 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 				if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
 					continue;
 				ct = nf_ct_tuplehash_to_ctrack(h);
-				if (iter(ct, data))
+				if (net_eq(nf_ct_net(ct), net) &&
+				    iter(ct, data))
 					goto found;
 			}
 		}
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 294a8e2..f6bbcb2 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -837,6 +837,9 @@ restart:
 			if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
 				continue;
 			ct = nf_ct_tuplehash_to_ctrack(h);
+			if (!net_eq(net, nf_ct_net(ct)))
+				continue;
+
 			/* Dump entries of a given L3 protocol number.
 			 * If it is not specified, ie. l3proto == 0,
 			 * then dump everything. */
-- 
2.7.3

^ permalink raw reply related

* [PATCH nf-next 8/9] netfilter: conntrack: use a single hashtable for all namespaces
From: Florian Westphal @ 2016-04-28 17:13 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal
In-Reply-To: <1461863628-23350-1-git-send-email-fw@strlen.de>

We already include netns address in the hash and compare the netns pointers
during lookup, so even if namespaces have overlapping addresses entries
will be spread across the table.

Assuming 64k bucket size, this change saves 0.5 mbyte per namespace on a
64bit system.

NAT bysrc and expectation hash is still per namespace, those will
changed too soon.

Future patch will also make conntrack object slab cache global again.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 checkpatch complains about 'WARNING: line over 80 characters' but
 forcing line breaks looked even worse to me.

 include/net/netfilter/nf_conntrack_core.h          |  1 +
 include/net/netns/conntrack.h                      |  2 -
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c     |  2 +-
 .../netfilter/nf_conntrack_l3proto_ipv4_compat.c   | 10 ++-
 net/netfilter/nf_conntrack_core.c                  | 78 +++++++++++-----------
 net/netfilter/nf_conntrack_helper.c                |  6 +-
 net/netfilter/nf_conntrack_netlink.c               |  8 +--
 net/netfilter/nf_conntrack_standalone.c            | 13 ++--
 net/netfilter/nf_nat_core.c                        |  2 +-
 net/netfilter/nfnetlink_cttimeout.c                |  6 +-
 10 files changed, 60 insertions(+), 68 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 389e6da..e8ad0ad 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -82,6 +82,7 @@ print_tuple(struct seq_file *s, const struct nf_conntrack_tuple *tuple,
 
 #define CONNTRACK_LOCKS 1024
 
+extern struct hlist_nulls_head *nf_conntrack_hash;
 extern spinlock_t nf_conntrack_locks[CONNTRACK_LOCKS];
 void nf_conntrack_lock(spinlock_t *lock);
 
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index b052785..251c435 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -93,9 +93,7 @@ struct netns_ct {
 	int			sysctl_tstamp;
 	int			sysctl_checksum;
 
-	unsigned int		htable_size;
 	struct kmem_cache	*nf_conntrack_cachep;
-	struct hlist_nulls_head	*hash;
 	struct hlist_head	*expect_hash;
 	struct ct_pcpu __percpu *pcpu_lists;
 	struct ip_conntrack_stat __percpu *stat;
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index e3c46e8..ae1a71a 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -360,7 +360,7 @@ static int ipv4_init_net(struct net *net)
 
 	in->ctl_table[0].data = &nf_conntrack_max;
 	in->ctl_table[1].data = &net->ct.count;
-	in->ctl_table[2].data = &net->ct.htable_size;
+	in->ctl_table[2].data = &nf_conntrack_htable_size;
 	in->ctl_table[3].data = &net->ct.sysctl_checksum;
 	in->ctl_table[4].data = &net->ct.sysctl_log_invalid;
 #endif
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 171aba1..f8fc7ab 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -31,15 +31,14 @@ struct ct_iter_state {
 
 static struct hlist_nulls_node *ct_get_first(struct seq_file *seq)
 {
-	struct net *net = seq_file_net(seq);
 	struct ct_iter_state *st = seq->private;
 	struct hlist_nulls_node *n;
 
 	for (st->bucket = 0;
-	     st->bucket < net->ct.htable_size;
+	     st->bucket < nf_conntrack_htable_size;
 	     st->bucket++) {
 		n = rcu_dereference(
-			hlist_nulls_first_rcu(&net->ct.hash[st->bucket]));
+			hlist_nulls_first_rcu(&nf_conntrack_hash[st->bucket]));
 		if (!is_a_nulls(n))
 			return n;
 	}
@@ -49,17 +48,16 @@ static struct hlist_nulls_node *ct_get_first(struct seq_file *seq)
 static struct hlist_nulls_node *ct_get_next(struct seq_file *seq,
 				      struct hlist_nulls_node *head)
 {
-	struct net *net = seq_file_net(seq);
 	struct ct_iter_state *st = seq->private;
 
 	head = rcu_dereference(hlist_nulls_next_rcu(head));
 	while (is_a_nulls(head)) {
 		if (likely(get_nulls_value(head) == st->bucket)) {
-			if (++st->bucket >= net->ct.htable_size)
+			if (++st->bucket >= nf_conntrack_htable_size)
 				return NULL;
 		}
 		head = rcu_dereference(
-			hlist_nulls_first_rcu(&net->ct.hash[st->bucket]));
+			hlist_nulls_first_rcu(&nf_conntrack_hash[st->bucket]));
 	}
 	return head;
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index c29b929..d58b597 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -68,6 +68,9 @@ EXPORT_SYMBOL_GPL(nf_conntrack_locks);
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(nf_conntrack_expect_lock);
 EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock);
 
+struct hlist_nulls_head *nf_conntrack_hash __read_mostly;
+EXPORT_SYMBOL_GPL(nf_conntrack_hash);
+
 static __read_mostly spinlock_t nf_conntrack_locks_all_lock;
 static __read_mostly seqcount_t nf_conntrack_generation;
 static __read_mostly bool nf_conntrack_locks_all;
@@ -163,9 +166,9 @@ static u32 hash_conntrack_raw(const struct nf_conntrack_tuple *tuple,
 		      tuple->dst.protonum));
 }
 
-static u32 hash_bucket(u32 hash, const struct net *net)
+static u32 scale_hash(u32 hash)
 {
-	return reciprocal_scale(hash, net->ct.htable_size);
+	return reciprocal_scale(hash, nf_conntrack_htable_size);
 }
 
 static u32 __hash_conntrack(const struct net *net,
@@ -178,7 +181,7 @@ static u32 __hash_conntrack(const struct net *net,
 static u32 hash_conntrack(const struct net *net,
 			  const struct nf_conntrack_tuple *tuple)
 {
-	return __hash_conntrack(net, tuple, net->ct.htable_size);
+	return scale_hash(hash_conntrack_raw(tuple, net));
 }
 
 bool
@@ -477,8 +480,8 @@ ____nf_conntrack_find(struct net *net, const struct nf_conntrack_zone *zone,
 begin:
 	do {
 		sequence = read_seqcount_begin(&nf_conntrack_generation);
-		bucket = hash_bucket(hash, net);
-		ct_hash = net->ct.hash;
+		bucket = scale_hash(hash);
+		ct_hash = nf_conntrack_hash;
 	} while (read_seqcount_retry(&nf_conntrack_generation, sequence));
 
 	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) {
@@ -542,12 +545,10 @@ static void __nf_conntrack_hash_insert(struct nf_conn *ct,
 				       unsigned int hash,
 				       unsigned int reply_hash)
 {
-	struct net *net = nf_ct_net(ct);
-
 	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
-			   &net->ct.hash[hash]);
+			   &nf_conntrack_hash[hash]);
 	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
-			   &net->ct.hash[reply_hash]);
+			   &nf_conntrack_hash[reply_hash]);
 }
 
 int
@@ -572,12 +573,12 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 	} while (nf_conntrack_double_lock(net, hash, reply_hash, sequence));
 
 	/* See if there's one in the list already, including reverse */
-	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
+	hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[hash], hnnode)
 		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
 				    zone, net))
 			goto out;
 
-	hlist_nulls_for_each_entry(h, n, &net->ct.hash[reply_hash], hnnode)
+	hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[reply_hash], hnnode)
 		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
 				    zone, net))
 			goto out;
@@ -632,7 +633,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 		sequence = read_seqcount_begin(&nf_conntrack_generation);
 		/* reuse the hash saved before */
 		hash = *(unsigned long *)&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev;
-		hash = hash_bucket(hash, net);
+		hash = scale_hash(hash);
 		reply_hash = hash_conntrack(net,
 					   &ct->tuplehash[IP_CT_DIR_REPLY].tuple);
 
@@ -662,12 +663,12 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	/* See if there's one in the list already, including reverse:
 	   NAT could have grabbed it without realizing, since we're
 	   not in the hash.  If there is, we lost race. */
-	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
+	hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[hash], hnnode)
 		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
 				    zone, net))
 			goto out;
 
-	hlist_nulls_for_each_entry(h, n, &net->ct.hash[reply_hash], hnnode)
+	hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[reply_hash], hnnode)
 		if (nf_ct_key_equal(h, &ct->tuplehash[IP_CT_DIR_REPLY].tuple,
 				    zone, net))
 			goto out;
@@ -735,7 +736,7 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 	do {
 		sequence = read_seqcount_begin(&nf_conntrack_generation);
 		hash = hash_conntrack(net, tuple);
-		ct_hash = net->ct.hash;
+		ct_hash = nf_conntrack_hash;
 	} while (read_seqcount_retry(&nf_conntrack_generation, sequence));
 
 	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[hash], hnnode) {
@@ -772,16 +773,16 @@ static noinline int early_drop(struct net *net, unsigned int _hash)
 	local_bh_disable();
 restart:
 	sequence = read_seqcount_begin(&nf_conntrack_generation);
-	hash = hash_bucket(_hash, net);
-	for (; i < net->ct.htable_size; i++) {
+	hash = scale_hash(_hash);
+	for (; i < nf_conntrack_htable_size; i++) {
 		lockp = &nf_conntrack_locks[hash % CONNTRACK_LOCKS];
 		nf_conntrack_lock(lockp);
 		if (read_seqcount_retry(&nf_conntrack_generation, sequence)) {
 			spin_unlock(lockp);
 			goto restart;
 		}
-		hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash],
-					 hnnode) {
+		hlist_nulls_for_each_entry_rcu(h, n, &nf_conntrack_hash[hash],
+					       hnnode) {
 			tmp = nf_ct_tuplehash_to_ctrack(h);
 			if (!test_bit(IPS_ASSURED_BIT, &tmp->status) &&
 			    !nf_ct_is_dying(tmp) &&
@@ -792,7 +793,7 @@ restart:
 			cnt++;
 		}
 
-		hash = (hash + 1) % net->ct.htable_size;
+		hash = (hash + 1) % nf_conntrack_htable_size;
 		spin_unlock(lockp);
 
 		if (ct || cnt >= NF_CT_EVICTION_RANGE)
@@ -1375,12 +1376,12 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 	int cpu;
 	spinlock_t *lockp;
 
-	for (; *bucket < net->ct.htable_size; (*bucket)++) {
+	for (; *bucket < nf_conntrack_htable_size; (*bucket)++) {
 		lockp = &nf_conntrack_locks[*bucket % CONNTRACK_LOCKS];
 		local_bh_disable();
 		nf_conntrack_lock(lockp);
-		if (*bucket < net->ct.htable_size) {
-			hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) {
+		if (*bucket < nf_conntrack_htable_size) {
+			hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[*bucket], hnnode) {
 				if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
 					continue;
 				ct = nf_ct_tuplehash_to_ctrack(h);
@@ -1527,7 +1528,6 @@ i_see_dead_people:
 	}
 
 	list_for_each_entry(net, net_exit_list, exit_list) {
-		nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size);
 		nf_conntrack_proto_pernet_fini(net);
 		nf_conntrack_helper_pernet_fini(net);
 		nf_conntrack_ecache_pernet_fini(net);
@@ -1598,10 +1598,10 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 	 * though since that required taking the locks.
 	 */
 
-	for (i = 0; i < init_net.ct.htable_size; i++) {
-		while (!hlist_nulls_empty(&init_net.ct.hash[i])) {
-			h = hlist_nulls_entry(init_net.ct.hash[i].first,
-					struct nf_conntrack_tuple_hash, hnnode);
+	for (i = 0; i < nf_conntrack_htable_size; i++) {
+		while (!hlist_nulls_empty(&nf_conntrack_hash[i])) {
+			h = hlist_nulls_entry(nf_conntrack_hash[i].first,
+					      struct nf_conntrack_tuple_hash, hnnode);
 			ct = nf_ct_tuplehash_to_ctrack(h);
 			hlist_nulls_del_rcu(&h->hnnode);
 			bucket = __hash_conntrack(nf_ct_net(ct),
@@ -1609,11 +1609,11 @@ int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 			hlist_nulls_add_head_rcu(&h->hnnode, &hash[bucket]);
 		}
 	}
-	old_size = init_net.ct.htable_size;
-	old_hash = init_net.ct.hash;
+	old_size = nf_conntrack_htable_size;
+	old_hash = nf_conntrack_hash;
 
-	init_net.ct.htable_size = nf_conntrack_htable_size = hashsize;
-	init_net.ct.hash = hash;
+	nf_conntrack_hash = hash;
+	nf_conntrack_htable_size = hashsize;
 
 	write_seqcount_end(&nf_conntrack_generation);
 	nf_conntrack_all_unlock();
@@ -1669,6 +1669,11 @@ int nf_conntrack_init_start(void)
 		 * entries. */
 		max_factor = 4;
 	}
+
+	nf_conntrack_hash = nf_ct_alloc_hashtable(&nf_conntrack_htable_size, 1);
+	if (!nf_conntrack_hash)
+		return -ENOMEM;
+
 	nf_conntrack_max = max_factor * nf_conntrack_htable_size;
 
 	printk(KERN_INFO "nf_conntrack version %s (%u buckets, %d max)\n",
@@ -1747,6 +1752,7 @@ err_tstamp:
 err_acct:
 	nf_conntrack_expect_fini();
 err_expect:
+	nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
 	return ret;
 }
 
@@ -1799,12 +1805,6 @@ int nf_conntrack_init_net(struct net *net)
 		goto err_cache;
 	}
 
-	net->ct.htable_size = nf_conntrack_htable_size;
-	net->ct.hash = nf_ct_alloc_hashtable(&net->ct.htable_size, 1);
-	if (!net->ct.hash) {
-		printk(KERN_ERR "Unable to create nf_conntrack_hash\n");
-		goto err_hash;
-	}
 	ret = nf_conntrack_expect_pernet_init(net);
 	if (ret < 0)
 		goto err_expect;
@@ -1836,8 +1836,6 @@ err_tstamp:
 err_acct:
 	nf_conntrack_expect_pernet_fini(net);
 err_expect:
-	nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size);
-err_hash:
 	kmem_cache_destroy(net->ct.nf_conntrack_cachep);
 err_cache:
 	kfree(net->ct.slabname);
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 498bf74..cb48e6a 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -424,10 +424,10 @@ static void __nf_conntrack_helper_unregister(struct nf_conntrack_helper *me,
 		spin_unlock_bh(&pcpu->lock);
 	}
 	local_bh_disable();
-	for (i = 0; i < net->ct.htable_size; i++) {
+	for (i = 0; i < nf_conntrack_htable_size; i++) {
 		nf_conntrack_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
-		if (i < net->ct.htable_size) {
-			hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode)
+		if (i < nf_conntrack_htable_size) {
+			hlist_nulls_for_each_entry(h, nn, &nf_conntrack_hash[i], hnnode)
 				unhelp(h, me);
 		}
 		spin_unlock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index f6bbcb2..e00f178 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -824,16 +824,16 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 	last = (struct nf_conn *)cb->args[1];
 
 	local_bh_disable();
-	for (; cb->args[0] < net->ct.htable_size; cb->args[0]++) {
+	for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) {
 restart:
 		lockp = &nf_conntrack_locks[cb->args[0] % CONNTRACK_LOCKS];
 		nf_conntrack_lock(lockp);
-		if (cb->args[0] >= net->ct.htable_size) {
+		if (cb->args[0] >= nf_conntrack_htable_size) {
 			spin_unlock(lockp);
 			goto out;
 		}
-		hlist_nulls_for_each_entry(h, n, &net->ct.hash[cb->args[0]],
-					 hnnode) {
+		hlist_nulls_for_each_entry(h, n, &nf_conntrack_hash[cb->args[0]],
+					   hnnode) {
 			if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
 				continue;
 			ct = nf_ct_tuplehash_to_ctrack(h);
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 0f1a45b..f87e84e 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -54,14 +54,13 @@ struct ct_iter_state {
 
 static struct hlist_nulls_node *ct_get_first(struct seq_file *seq)
 {
-	struct net *net = seq_file_net(seq);
 	struct ct_iter_state *st = seq->private;
 	struct hlist_nulls_node *n;
 
 	for (st->bucket = 0;
-	     st->bucket < net->ct.htable_size;
+	     st->bucket < nf_conntrack_htable_size;
 	     st->bucket++) {
-		n = rcu_dereference(hlist_nulls_first_rcu(&net->ct.hash[st->bucket]));
+		n = rcu_dereference(hlist_nulls_first_rcu(&nf_conntrack_hash[st->bucket]));
 		if (!is_a_nulls(n))
 			return n;
 	}
@@ -71,18 +70,17 @@ static struct hlist_nulls_node *ct_get_first(struct seq_file *seq)
 static struct hlist_nulls_node *ct_get_next(struct seq_file *seq,
 				      struct hlist_nulls_node *head)
 {
-	struct net *net = seq_file_net(seq);
 	struct ct_iter_state *st = seq->private;
 
 	head = rcu_dereference(hlist_nulls_next_rcu(head));
 	while (is_a_nulls(head)) {
 		if (likely(get_nulls_value(head) == st->bucket)) {
-			if (++st->bucket >= net->ct.htable_size)
+			if (++st->bucket >= nf_conntrack_htable_size)
 				return NULL;
 		}
 		head = rcu_dereference(
 				hlist_nulls_first_rcu(
-					&net->ct.hash[st->bucket]));
+					&nf_conntrack_hash[st->bucket]));
 	}
 	return head;
 }
@@ -458,7 +456,7 @@ static struct ctl_table nf_ct_sysctl_table[] = {
 	},
 	{
 		.procname       = "nf_conntrack_buckets",
-		.data           = &init_net.ct.htable_size,
+		.data           = &nf_conntrack_htable_size,
 		.maxlen         = sizeof(unsigned int),
 		.mode           = 0444,
 		.proc_handler   = proc_dointvec,
@@ -512,7 +510,6 @@ static int nf_conntrack_standalone_init_sysctl(struct net *net)
 		goto out_kmemdup;
 
 	table[1].data = &net->ct.count;
-	table[2].data = &net->ct.htable_size;
 	table[3].data = &net->ct.sysctl_checksum;
 	table[4].data = &net->ct.sysctl_log_invalid;
 
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 3d52271..d74e716 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -824,7 +824,7 @@ nfnetlink_parse_nat_setup(struct nf_conn *ct,
 static int __net_init nf_nat_net_init(struct net *net)
 {
 	/* Leave them the same for the moment. */
-	net->ct.nat_htable_size = net->ct.htable_size;
+	net->ct.nat_htable_size = nf_conntrack_htable_size;
 	net->ct.nat_bysource = nf_ct_alloc_hashtable(&net->ct.nat_htable_size, 0);
 	if (!net->ct.nat_bysource)
 		return -ENOMEM;
diff --git a/net/netfilter/nfnetlink_cttimeout.c b/net/netfilter/nfnetlink_cttimeout.c
index 2671b9d..3c84f14 100644
--- a/net/netfilter/nfnetlink_cttimeout.c
+++ b/net/netfilter/nfnetlink_cttimeout.c
@@ -306,10 +306,10 @@ static void ctnl_untimeout(struct net *net, struct ctnl_timeout *timeout)
 	int i;
 
 	local_bh_disable();
-	for (i = 0; i < net->ct.htable_size; i++) {
+	for (i = 0; i < nf_conntrack_htable_size; i++) {
 		nf_conntrack_lock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
-		if (i < net->ct.htable_size) {
-			hlist_nulls_for_each_entry(h, nn, &net->ct.hash[i], hnnode)
+		if (i < nf_conntrack_htable_size) {
+			hlist_nulls_for_each_entry(h, nn, &nf_conntrack_hash[i], hnnode)
 				untimeout(h, timeout);
 		}
 		spin_unlock(&nf_conntrack_locks[i % CONNTRACK_LOCKS]);
-- 
2.7.3

^ permalink raw reply related

* RE: VRF_DEVICE integration plan
From: Elluru, Krishna Mohan @ 2016-04-28 17:16 UTC (permalink / raw)
  To: David Ahern, netdev@vger.kernel.org
  Cc: Kumara, Shantha (HP Networking), Govindan Nair, Anoop
In-Reply-To: <571D5720.7000908@cumulusnetworks.com>

HI David,
	Thanks a lot for your response. It clarifies few of my questions. Please see inline for with tag MOHAN> for my response.

Thanks
Krishna Mohan.
-----Original Message-----
From: David Ahern [mailto:dsa@cumulusnetworks.com] 
Sent: Monday, April 25, 2016 5:01 AM
To: Elluru, Krishna Mohan <elluru.kri.mohan@hpe.com>; netdev@vger.kernel.org
Subject: Re: VRF_DEVICE integration plan

On 4/23/16 10:07 PM, Elluru, Krishna Mohan wrote:
> HI Netdev team,
>
>   	Greetings. We have been monitoring the vrf device approach for l3 isolation from cumulus networks and we are currently interested in validating it. We have following questions on them and hoping to get answers from you/concerned team.
>
> 1. As per the linux documentation, there are known limits on if_index lookup, as the incoming if_index is changed to vrf_device index and thus an application receiving this packet will perceive this as a vrf_device packet, than right if_index. I saw you mentioned about a special flag to identify the origin, but didn't see the same in the latest linux 4.4.2 version code. Is there a patch expected for it?

you are referring to IP{6}_PKTINFO? I have patches from our 4.1 kernel tree that I have rebased to top of tree. I hope to send those out in the next few weeks.
MOHAN> Yes. Sure. Thanks.

>
> 2. What are the future additions planned for this approach? Are there any ipv4 and ipv6 known bugs with vrf_device model?

We have about 20 patches in our tree that I have not sent upstream yet. 
Those patches fix PKTINFO, allow local traffic (e.g, ping in a VRF to a 
local address in a VRF), allow IPv6 multicast and linklocal traffic, and 
the cgroup implementation which has been sent as an RFC.

I posted a few bug fix patches a week or two ago. Not sure what the 
status is with respect to 4.3 - 4.5 trees.

MOHAN> Sure. Are those patches sent over netdev mailer list?

>
> 3. It has been said in the documentation that, with addition of cgroup functionality for vrf device, with net_admin capabilities, we should be able to add an interface to vrf_device, currently it is not so. Any timelines on these?

I don't understand that question. The current implementation allows 
adding interfaces (netdev's) to a VRF. The cgroup allows running a 
process in a VRF context such that AF_INET{6} sockets are automatically 
bound to the VRF device.

MOHAN> sorry for not being clear. My ask was, to create a namespace we need cap_admin privileges currently, but your earlier mails suggested that we should be able to configure/create vrf device with net_admin capabilities. Is this support present /expected to be added soon?

>
> 4. Currently the changes are available and portable from 4.3.x onwards. Is there a plan to port them to previous kernel versions?

no. Anyone wanting to use the vrf patches on other kernel versions will 
need to port them.
MOHAN> Sure.
>
> 5. Is there a possibility of enabling secondary level lookup, to give a leak functionality to parent route table from device local route table? I tested with veth pair, configured one as default gateway, it is possible to forward traffic b/w the interfaces, looking for cleaner method.

Are you referring to inter-vrf routing? See slide 27
http://www.netdevconf.org/1.1/proceedings/slides/ahern-vrf-tutorial.pdf
Full lookup in VRF table
▪ ip route add table vrf-red 1.1.1.0/24 dev vrf-green
MOHAN> In slide 27 above shows inter vrf routing, requirement is to use current namespace global route table if the ip lookup fails in vrf-device routing table. 
Reference: https://www.juniper.net/techpubs/en_US/junose16.1/topics/task/configuration/mbgp-secondary-routing-table-search.html

>
> 6. With "VRF Device" in place,  please confirm if there are any plans to add VRF support for applications like
>
> 	1.	Ping

no need. ping{6} -I <vrf device> ...

> 	2.	Traceroute

no need. traceroute{6} -i <vrf device> ...

> 	3.	DNS-Client [glibc]
>
> 	In case of DNS-Client, most of the name resolution APIs will have to consider the VRF to do the lookup in  and the way the domain-name/name-server configuration is stored.

I have looked into it but no patches worth distributing at the moment.

MOHAN> Okay, thanks for the inputs.

^ permalink raw reply

* Re: [PATCH net-next 0/6] net: make TCP preemptible
From: Alexei Starovoitov @ 2016-04-28 17:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet
In-Reply-To: <1461821152-23200-1-git-send-email-edumazet@google.com>

On Wed, Apr 27, 2016 at 10:25:46PM -0700, Eric Dumazet wrote:
> Most of TCP stack assumed it was running from BH handler.
> 
> This is great for most things, as TCP behavior is very sensitive
> to scheduling artifacts.
> 
> However, the prequeue and backlog processing are problematic,
> as they need to be flushed with BH being blocked.
> 
> To cope with modern needs, TCP sockets have big sk_rcvbuf values,
> in the order of 16 MB.
> This means that backlog can hold thousands of packets, and things
> like TCP coalescing or collapsing on this amount of packets can
> lead to insane latency spikes, since BH are blocked for too long.
> 
> It is time to make UDP/TCP stacks preemptible.
> 
> Note that fast path still runs from BH handler.

this looks pretty awesome.
the change will make the backlog run in bh enabled, so that one
large flow reciever will not penalize the rest of the system, right?
but you're saying that prequeue is also expensive, but not touched
by this patchset? was it addressed by your eariler patch?
Or more work still tbd?
I'm just trying to understand more about tcp stack.

^ permalink raw reply

* [PATCH] net: l2tp: fix reversed udp6 checksum flags
From: Wang Shanker @ 2016-04-28 17:29 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Tom Herbert, David S. Miller

[-- Attachment #1: Type: text/plain, Size: 1554 bytes --]

This patch fixes a bug which causes the behavior of whether to ignore
udp6 checksum of udp6 encapsulated l2tp tunnel contrary to what
userspace program requests.

When the flag `L2TP_ATTR_UDP_ZERO_CSUM6_RX` is set by userspace, it is
expected that udp6 checksums of received packets of the l2tp tunnel
to create should be ignored. In `l2tp_netlink.c`:
`l2tp_nl_cmd_tunnel_create()`, `cfg.udp6_zero_rx_checksums` is set
according to the flag, and then passed to `l2tp_core.c`:
`l2tp_tunnel_create()` and then `l2tp_tunnel_sock_create()`. In
`l2tp_tunnel_sock_create()`, `udp_conf.use_udp6_rx_checksums` is set
the same to `cfg.udp6_zero_rx_checksums`. However, if we want the
checksum to be ignored, `udp_conf.use_udp6_rx_checksums` should be set
to `false`, i.e. be set to the contrary. Similarly, the same should be
done to `udp_conf.use_udp6_tx_checksums`.

Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
---
 net/l2tp/l2tp_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index afca2eb..6edfa99 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1376,9 +1376,9 @@ static int l2tp_tunnel_sock_create(struct net *net,
 			memcpy(&udp_conf.peer_ip6, cfg->peer_ip6,
 			       sizeof(udp_conf.peer_ip6));
 			udp_conf.use_udp6_tx_checksums =
-			    cfg->udp6_zero_tx_checksums;
+			  ! cfg->udp6_zero_tx_checksums;
 			udp_conf.use_udp6_rx_checksums =
-			    cfg->udp6_zero_rx_checksums;
+			  ! cfg->udp6_zero_rx_checksums;
 		} else
 #endif
 		{
-- 
2.5.2


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 4130 bytes --]

^ permalink raw reply related

* [PATCH] net: l2tp: fix reversed udp6 checksum flags
From: Wang Shanker @ 2016-04-28 17:32 UTC (permalink / raw)
  To: netdev; +Cc: James Chapman, Tom Herbert, David S. Miller

This patch fixes a bug which causes the behavior of whether to ignore
udp6 checksum of udp6 encapsulated l2tp tunnel contrary to what
userspace program requests.

When the flag `L2TP_ATTR_UDP_ZERO_CSUM6_RX` is set by userspace, it is
expected that udp6 checksums of received packets of the l2tp tunnel
to create should be ignored. In `l2tp_netlink.c`:
`l2tp_nl_cmd_tunnel_create()`, `cfg.udp6_zero_rx_checksums` is set
according to the flag, and then passed to `l2tp_core.c`:
`l2tp_tunnel_create()` and then `l2tp_tunnel_sock_create()`. In
`l2tp_tunnel_sock_create()`, `udp_conf.use_udp6_rx_checksums` is set
the same to `cfg.udp6_zero_rx_checksums`. However, if we want the
checksum to be ignored, `udp_conf.use_udp6_rx_checksums` should be set
to `false`, i.e. be set to the contrary. Similarly, the same should be
done to `udp_conf.use_udp6_tx_checksums`.

Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
---
net/l2tp/l2tp_core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index afca2eb..6edfa99 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1376,9 +1376,9 @@ static int l2tp_tunnel_sock_create(struct net *net,
			memcpy(&udp_conf.peer_ip6, cfg->peer_ip6,
			       sizeof(udp_conf.peer_ip6));
			udp_conf.use_udp6_tx_checksums =
-			    cfg->udp6_zero_tx_checksums;
+			  ! cfg->udp6_zero_tx_checksums;
			udp_conf.use_udp6_rx_checksums =
-			    cfg->udp6_zero_rx_checksums;
+			  ! cfg->udp6_zero_rx_checksums;
		} else
#endif
		{
-- 
2.5.2

^ permalink raw reply related

* Re: [PATCH net-next 0/6] net: make TCP preemptible
From: Eric Dumazet @ 2016-04-28 17:41 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Eric Dumazet, David S . Miller, netdev
In-Reply-To: <20160428172320.GA83199@ast-mbp.thefacebook.com>

On Thu, 2016-04-28 at 10:23 -0700, Alexei Starovoitov wrote:
> On Wed, Apr 27, 2016 at 10:25:46PM -0700, Eric Dumazet wrote:
> > Most of TCP stack assumed it was running from BH handler.
> > 
> > This is great for most things, as TCP behavior is very sensitive
> > to scheduling artifacts.
> > 
> > However, the prequeue and backlog processing are problematic,
> > as they need to be flushed with BH being blocked.
> > 
> > To cope with modern needs, TCP sockets have big sk_rcvbuf values,
> > in the order of 16 MB.
> > This means that backlog can hold thousands of packets, and things
> > like TCP coalescing or collapsing on this amount of packets can
> > lead to insane latency spikes, since BH are blocked for too long.
> > 
> > It is time to make UDP/TCP stacks preemptible.
> > 
> > Note that fast path still runs from BH handler.
> 
> this looks pretty awesome.

Yes, I am pretty excited ;)

> the change will make the backlog run in bh enabled, so that one
> large flow reciever will not penalize the rest of the system, right?

Not only large flows, but flows with losses/reorders.
Typically many flows are in this case when a congestion collapse
happens.



> but you're saying that prequeue is also expensive, but not touched
> by this patchset? was it addressed by your eariler patch?
> Or more work still tbd?

prequeue is handled by "[2/6] tcp: do not block bh during prequeue
processing"

Note that I also sent a patch earlier (("tcp: give prequeue mode some
care")) to control max size of prequeue to 32 packets.

> I'm just trying to understand more about tcp stack.
> 
Sure ;)

I also have a patch to add scheduling point in sendmsg() (ie : draining
the backlog if not empty) for each new skb added to the write queue.

Since each skb is about 64KB (with GSO/TSO), it means an application no
longer will hold the socket lock too long, even when doing a
write()/sendmsg() of say 8 MB at once ;)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox