Netdev List
 help / color / mirror / Atom feed
* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Blake Matheny @ 2014-12-15 16:08 UTC (permalink / raw)
  To: Eric Dumazet, Alexei Starovoitov, Laurent Chavey, Yuchung Cheng
  Cc: Martin Lau, netdev@vger.kernel.org, David S. Miller,
	Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
	Josef Bacik, Kernel Team
In-Reply-To: <1418659395.9773.13.camel@edumazet-glaptop2.roam.corp.google.com>

We have an additional set of patches for web10g that builds on these
tracepoints. It can be made to work either way, but I agree the idea of
something like a sockopt would be really nice.

-Blake

On 12/15/14, 8:03 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:

>On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
>
>> I think patches 1 and 3 are good additions, since they establish
>> few permanent points of instrumentation in tcp stack.
>> Patches 4-5 look more like use cases of tracepoints established
>> before. They may feel like simple additions and, no doubt,
>> they are useful, but since they expose things via tracing
>> infra they become part of api and cannot be changed later,
>> when more stats would be needed.
>> I think systemtap like scripting on top of patches 1 and 3
>> should solve your use case ?
>> Also, have you looked at recent eBPF work?
>> Though it's not completely ready yet, soon it should
>> be able to do the same stats collection as you have
>> in 4/5 without adding permanent pieces to the kernel.
>
>So it looks like web10g like interfaces are very often requested by
>various teams.
>
>And we have many different views on how to hack this. I am astonished by
>number of hacks I saw about this stuff going on.
>
>What about a clean way, extending current TCP_INFO, which is both
>available as a getsockopt() for socket owners and ss/iproute2
>information for 'external entities'
>
>If we consider web10g info needed, then adding a ftrace/eBPF like
>interface is simply yet another piece of code we need to maintain,
>and the argument of 'this should cost nothing if not activated' is
>nonsense since major players need to constantly monitor TCP metrics and
>behavior.
>
>It seems both FaceBook and Google are working on a subset of web10g.
>
>I suggest we meet together and establish a common ground, preferably
>after Christmas holidays.
>
>Thanks
>
>


^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Eric Dumazet @ 2014-12-15 16:03 UTC (permalink / raw)
  To: Alexei Starovoitov, Laurent Chavey, Yuchung Cheng
  Cc: Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
	Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
	Josef Bacik, Kernel Team
In-Reply-To: <CAADnVQJ+8mtB8LD=U7XbxOC2hxhDChxOELhZ3NEYeoTk1G3LYg@mail.gmail.com>

On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:

> I think patches 1 and 3 are good additions, since they establish
> few permanent points of instrumentation in tcp stack.
> Patches 4-5 look more like use cases of tracepoints established
> before. They may feel like simple additions and, no doubt,
> they are useful, but since they expose things via tracing
> infra they become part of api and cannot be changed later,
> when more stats would be needed.
> I think systemtap like scripting on top of patches 1 and 3
> should solve your use case ?
> Also, have you looked at recent eBPF work?
> Though it's not completely ready yet, soon it should
> be able to do the same stats collection as you have
> in 4/5 without adding permanent pieces to the kernel.

So it looks like web10g like interfaces are very often requested by
various teams.

And we have many different views on how to hack this. I am astonished by
number of hacks I saw about this stuff going on.

What about a clean way, extending current TCP_INFO, which is both
available as a getsockopt() for socket owners and ss/iproute2
information for 'external entities'

If we consider web10g info needed, then adding a ftrace/eBPF like
interface is simply yet another piece of code we need to maintain,
and the argument of 'this should cost nothing if not activated' is
nonsense since major players need to constantly monitor TCP metrics and
behavior.

It seems both FaceBook and Google are working on a subset of web10g.

I suggest we meet together and establish a common ground, preferably
after Christmas holidays.

Thanks

^ permalink raw reply

* Re: [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: Jamal Hadi Salim @ 2014-12-15 15:31 UTC (permalink / raw)
  To: Jiri Pirko, Or Gerlitz
  Cc: Or Gerlitz, David S. Miller, Linux Netdev List, Andy Gospodarek,
	John Fastabend
In-Reply-To: <20141214223354.GA2035@nanopsycho.orion>

On 12/14/14 17:33, Jiri Pirko wrote:
> Sun, Dec 14, 2014 at 09:14:27PM CET, gerlitz.or@gmail.com wrote:
>> On Sun, Dec 14, 2014 at 9:23 PM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Sun, Dec 14, 2014 at 05:19:05PM CET, ogerlitz@mellanox.com wrote:
>>>> The current implementations all use dev_uc_add_excl() and such whose API
>>>> doesn't support vlans, so we can't make it with NICs HW for now.
>>>>
>>>> Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
>>>
>>> Maybe I'm missing something, but this commit did not introduce the
>>> problem.
>>
>> This commit introduced the vid param to ndo_fdb_add and ndo_dflt_fdb_add
>> which further call the dev_uc_add APIs... so it did introduced the
>> ability to provide VID into these APIs, right? and we want to protect
>> against anyone using this ability @ this point.
>
> That is in-kernel change. Vs. usespace the patch is a no-op. If userspace
> fills up NDA_VLAN, it is ignored before the patch as well as after. No
> behaviour change, just +- cosmetics.
>

Indeed doesnt break anything, but:
I think this brings to the forefront what these dev_uc/mc addresses
are supposed to be here. I am sure there is a good reason to tie them
to the fdb API - I am just not grokking it at the moment.
The concept of a vlanid makes sense for an fdb entry. It
starts to break when you start tying in vlans - i.e i am not aware
of multicast/unicast device entries which are tied to vlans.
Maybe this is some SRIOV thing?

cheers,
jamal

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Jamal Hadi Salim @ 2014-12-15 15:26 UTC (permalink / raw)
  To: Roopa Prabhu, Jiri Pirko
  Cc: sfeldma, bcrl, tgraf, john.fastabend, stephen, linville, vyasevic,
	netdev, davem, shm, gospo
In-Reply-To: <548DE7E2.6010705@cumulusnetworks.com>


Sorry - i didnt quiet follow the discussion, but i can see the value
of propagating things from parent to children netdevs as part of the
generic approach. And in that spirit:

Ben's patches (and I am sure the cumulus folk do this) expose ports.
i.e you boot up the hardware and you see ports. You can then put these
ports in a bridge and you can offload fdbs and do other parametrization
to the ASIC. IOW, this only becomes a bridge because you created one
in the kernel and attached bridge ports to it.

Lets say i didnt want a bridge. I want instead to take these exposed
ports and create a bond (and maybe play with LACP). How does this
propagation from parent->child->child work then? I think the idea
of just bonding and not exposing it as a switch is a reasonable use
case.
Also how does it work when i start doing L3 and the bond's port doesnt
support L3? Is it time to revive the thing we called TheThing in Du?

cheers,
jamal

On 12/14/14 14:41, Roopa Prabhu wrote:
> On 12/14/14, 7:35 AM, Jiri Pirko wrote:

[..chopped off for brevity and saving electrons..]

cheers,
jamal

^ permalink raw reply

* Re: [bisected] tg3 broken in 3.18.0?
From: Marcelo Ricardo Leitner @ 2014-12-15 15:06 UTC (permalink / raw)
  To: Nils Holland, David Miller; +Cc: netdev, linux-pci, rajatxjain
In-Reply-To: <20141213210251.GA12812@teela.fritz.box>

On 13-12-2014 19:02, Nils Holland wrote:
> rajatxjain@gmail.com
> Bcc:
> Subject: Re: [bisected] tg3 broken in 3.18.0?
> Reply-To:
> In-Reply-To: <20141212.201831.186234837340644301.davem@davemloft.net>
>
> On Fri, Dec 12, 2014 at 08:18:31PM -0500, David Miller wrote:
>> From: Nils Holland <nholland@tisys.org>
>> Date: Sat, 13 Dec 2014 02:14:08 +0100
>>
>>>
>>> My bisect exercise suggests that the following commit is the culprit:
>>>
>>> 89665a6a71408796565bfd29cfa6a7877b17a667 (PCI: Check only the Vendor
>>> ID to identify Configuration Request Retry)
>>
>> You definitely need to bring this up with the author of that change
>> and the relevent list for the PCI subsystem and/or linux-kernel.
>
> I've now already sent an inquiry to Rajat Jain, the author of the
> patch in question, and this message here is now also CC'd to
> linux-pci@.
>
> With this message, I'd like to add one last result of investigation
> I've done today, in the hope that it will aid the folks with more
> knowledge to go after the issue.

FWIW, reverting this change fixes tg3 in here too.

Thanks Nils for doing the bisect!

With these debugs (note the re-revert):

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c 

index 2306268..4474502 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1436,14 +1436,22 @@ bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, 
int devfn, u32 *l,
                 return false;

         /* Configuration request Retry Status */
-       while (*l == 0xffff0001) {
-               if (!crs_timeout)
+       printk ("pci %04x:%02x:%02x.%d: 1st %x %x\n", pci_domain_nr(bus), 
bus->number,
+               PCI_SLOT(devfn), PCI_FUNC(devfn), *l, *l & 0xffff);
+       while ((*l & 0xffff) == 0x0001) {
+               if (!crs_timeout) {
+                       printk ("pci %04x:%02x:%02x.%d: crs_timeout: %d\n", 
pci_domain_nr(bus),
+                               bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn), 
crs_timeout);
                         return false;
+               }

                 msleep(delay);
                 delay *= 2;
-               if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
+               if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l)) {
+                       printk ("pci %04x:%02x:%02x.%d: 
pci_bus_read_config_dword failed\n", pci_domain_nr(bus),
+                               bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn));
                         return false;
+               }
                 /* Card hasn't responded in 60 seconds?  Must be stuck. */
                 if (delay > crs_timeout) {
                         printk(KERN_WARNING "pci %04x:%02x:%02x.%d: not 
responding\n",
@@ -1451,6 +1459,7 @@ bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int 
devfn, u32 *l,
                                PCI_FUNC(devfn));
                         return false;
                 }
+               printk ("pci %04x:%02x:%02x.%d: %x %x\n", pci_domain_nr(bus), 
bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn), *l, *l & 0xffff);
         }

         return true;

I'm getting, with commit 89665a6a71408796565bfd29cfa6a7877b17a667:

$ grep 'pci 0000:02' tg3.bad
[    0.190733] pci 0000:02:00.0: 1st 165a14e4 14e4
[    0.190736] pci 0000:02:00.0: 1st 165a14e4 14e4
[    0.190810] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
[    0.190885] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
[    0.191048] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
[    0.191382] pci 0000:02:00.0: PME# supported from D3hot D3cold
[    0.191438] pci 0000:02:00.0: System wakeup disabled by ACPI
[    1.561555] pci 0000:02:00.0: 1st 1 1
[    1.561558] pci 0000:02:00.0: crs_timeout: 0
[   20.412021] pci 0000:02:00.0: 1st 1 1
[   20.412022] pci 0000:02:00.0: crs_timeout: 0
[   20.413596] pci 0000:02:00.0: 1st 1 1
[   20.413598] pci 0000:02:00.0: crs_timeout: 0

And without it:

$ grep 'pci 0000:02' tg3.good
[    0.190734] pci 0000:02:00.0: 1st 165a14e4 14e4
[    0.190738] pci 0000:02:00.0: 1st 165a14e4 14e4
[    0.190811] pci 0000:02:00.0: [14e4:165a] type 00 class 0x020000
[    0.190884] pci 0000:02:00.0: reg 0x10: [mem 0xf7c40000-0xf7c4ffff 64bit]
[    0.191047] pci 0000:02:00.0: reg 0x30: [mem 0xf7c00000-0xf7c3ffff pref]
[    0.191380] pci 0000:02:00.0: PME# supported from D3hot D3cold
[    0.191439] pci 0000:02:00.0: System wakeup disabled by ACPI
[    1.576778] pci 0000:02:00.0: 1st 1 1
[   19.068517] pci 0000:02:00.0: 1st 165a14e4 14e4

Hope that helps!

   Marcelo

^ permalink raw reply related

* Re: [RFC PATCH net-next 1/1] net: Support for switch port configuration
From: Thomas Graf @ 2014-12-15 14:40 UTC (permalink / raw)
  To: Varlese, Marco
  Cc: John Fastabend, Jiri Pirko, netdev@vger.kernel.org,
	stephen@networkplumber.org, Fastabend, John R,
	roopa@cumulusnetworks.com, sfeldma@gmail.com,
	linux-kernel@vger.kernel.org
In-Reply-To: <C4896FB061E7DE4AAC93031BDCA044B104AC5DCF@IRSMSX108.ger.corp.intel.com>

On 12/15/14 at 02:29pm, Varlese, Marco wrote:
> > All of these are highly generic and should *not* be passed through from user
> > space to the driver directly but rather be properly abstracted as Roopa
> > proposed. The value of this API is abstraction.
> How would you let the user enable/disable features then? For instance, how would the user enable/disable flooding for broadcast packets (BFLOODING) on a given port? What I was proposing is to have a list of attributes (to be added in if_link.h) which can be tuned by the user using a tool like iproute2. What do you propose? 

Excellent, I agree with what you are saying. What set me off is that
the patch does not reflect that yet. Instead, the patch introduces
a pure Netlink pass-through API to the driver.

I would expect the patch to:
 1. Parse the Netlink messages and be aware of individual attributes
 2. Validate them
 3. Pass the configuration to the driver using an API that can also
    be consumed from in-kernel users.

> I think I have seen Roopa posting his updated ndo patch and getting some feedback by few people already and as long as I will be able to accomplish the use case described here I am happy with his way.

I think Roopa's patches are supplementary. Not all switchdev users
will be backed with a Linux Bridge. I therefore welcome your patches
very much.

The overlap is in the ndo. I think both the API you propose and
Roopa's bridge code should use the same NDO.

> I do not have an example right now of a vendor specific attribute but I was just saying that might happen (i.e. someone will have a feature not implemented by others?).

That's fine. Once we have them we can consider adding vendor specific
extensions.

^ permalink raw reply

* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Jamal Hadi Salim @ 2014-12-15 14:29 UTC (permalink / raw)
  To: John Fastabend
  Cc: Hubert Sokolowski, Roopa Prabhu, netdev@vger.kernel.org,
	Vlad Yasevich
In-Reply-To: <548B4AA4.1020804@gmail.com>

On 12/12/14 15:05, John Fastabend wrote:
> On 12/12/2014 06:35 AM, Jamal Hadi Salim wrote:


> I'll wake up ;)


Vlad made me go over those patches in a few iterations to make
sure that the use cases covered in the test case work. It is
holiday season, so he may be offline.

> First quick grep of code finds some strange uses of ndo_fdb_dump like
> this in macvlan,
>
>    ./drivers/net/macvlan.c
>          .ndo_fdb_dump           = ndo_dflt_fdb_dump,
>
> I'll be sending a patch once net-next opens up again to resolve it. Its
> harmless though so not really a fix for net.
>
> There seem to be a few places that have the potential to return
> different values then the uc/mc lists.
>
>      ./drivers/net/vxlan.c
>      ./drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
>      ./drivers/net/ethernet/rocker/rocker.c
>
>      ./net/bridge/br_device.c
>

Yes, thats my observation as well.
The question is: Are multi/unicast address unconditionally dumped?
Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
the other driver did).
If you go over the original thread exchange with Vlad, you'll notice
i was kind of unsure why dumping of unicast/multicast had anything to
do with fdb dumping.
It is still my view that we shouldnt be treating these addresses as if
they were fdb entries. But: The problem is once you allow an API to
user space you cant take it back even if people are depending on bugs.


> So I guess we can walk through the list and analyse them a bit.
>
> vxlan:
>
> Try stacking devices on top of the vxlan device this will call a uc_add
> routine if you then change the mac addr on the vlan. This would get
> reported by the dflt fdb dump handlers but not the drivers fdb dump
> handlers. So removing the dflt dump handler from this patch at least
> changes things. We should either explain why this is OK or accept that
> the driver needs to be fixed. Or I guess that the patch is just wrong.
> My guess is one of the latter options.
>
> Also Jamal, your original patch seems like it might of changed this
> and Hubert's patch is reverting back to its original case. Was this
> specific part of your patch intentional?
>

Yes.
This is based on the view that unicast/multicast must be dumped
*unconditionally*. If the view is that uni/mcast addresses are
dumped conditionally based on what the driver thinks, then Hubert's
one liner is good. But i really would like Vlad to comment. 80%
of the effort on my part if you look at the thread was the refactoring
of the code to meet the use case.

I thought the abstraction which requires that your own MAC addresses
are treated as fdb entries was broken - but it is too late to change
that.

cheers,
jamal

^ permalink raw reply

* RE: [RFC PATCH net-next 1/1] net: Support for switch port configuration
From: Varlese, Marco @ 2014-12-15 14:29 UTC (permalink / raw)
  To: Thomas Graf
  Cc: John Fastabend, Jiri Pirko, netdev@vger.kernel.org,
	stephen@networkplumber.org, Fastabend, John R,
	roopa@cumulusnetworks.com, sfeldma@gmail.com,
	linux-kernel@vger.kernel.org
In-Reply-To: <20141215140749.GB21952@casper.infradead.org>

> -----Original Message-----
> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
> Sent: Monday, December 15, 2014 2:08 PM
> To: Varlese, Marco
> Cc: John Fastabend; Jiri Pirko; netdev@vger.kernel.org;
> stephen@networkplumber.org; Fastabend, John R;
> roopa@cumulusnetworks.com; sfeldma@gmail.com; linux-
> kernel@vger.kernel.org
> Subject: Re: [RFC PATCH net-next 1/1] net: Support for switch port
> configuration
> 
> On 12/11/14 at 09:59am, Varlese, Marco wrote:
> > An example of attributes are:
> > * enabling/disabling of learning of source addresses on a given port
> > (you can imagine the attribute called LEARNING for example);
> > * internal loopback control (i.e. LOOPBACK) which will control how the
> > flow of traffic behaves from the switch fabric towards an egress port;
> > * flooding for broadcast/multicast/unicast type of packets (i.e.
> > BFLOODING, MFLOODING, UFLOODING);
> 
> All of these are highly generic and should *not* be passed through from user
> space to the driver directly but rather be properly abstracted as Roopa
> proposed. The value of this API is abstraction.
How would you let the user enable/disable features then? For instance, how would the user enable/disable flooding for broadcast packets (BFLOODING) on a given port? What I was proposing is to have a list of attributes (to be added in if_link.h) which can be tuned by the user using a tool like iproute2. What do you propose? 
I think I have seen Roopa posting his updated ndo patch and getting some feedback by few people already and as long as I will be able to accomplish the use case described here I am happy with his way.

> If we introduce per device attributes for generic functions we lose large
> portions of the value gained.
> You mentioned you have additional attributes in mind, maybe you can give a
> few examples which are not generic, i.e. do not apply to multiple vendors.
I do not have an example right now of a vendor specific attribute but I was just saying that might happen (i.e. someone will have a feature not implemented by others?).

^ permalink raw reply

* [PATCH 3.16.y-ckt 111/168] net/ping: handle protocol mismatching scenario
From: Luis Henriques @ 2014-12-15 14:26 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, Jane Zhou, Yiwei Zhao,
	Luis Henriques
In-Reply-To: <1418653622-21105-1-git-send-email-luis.henriques@canonical.com>

3.16.7-ckt3 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Jane Zhou <a17711@motorola.com>

commit 91a0b603469069cdcce4d572b7525ffc9fd352a6 upstream.

ping_lookup() may return a wrong sock if sk_buff's and sock's protocols
dont' match. For example, sk_buff's protocol is ETH_P_IPV6, but sock's
sk_family is AF_INET, in that case, if sk->sk_bound_dev_if is zero, a wrong
sock will be returned.
the fix is to "continue" the searching, if no matching, return NULL.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Jane Zhou <a17711@motorola.com>
Signed-off-by: Yiwei Zhao <gbjc64@motorola.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
---
 net/ipv4/ping.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 044a0ddf6a79..620e8ffa62e8 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -217,6 +217,8 @@ static struct sock *ping_lookup(struct net *net, struct sk_buff *skb, u16 ident)
 					     &ipv6_hdr(skb)->daddr))
 				continue;
 #endif
+		} else {
+			continue;
 		}
 
 		if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif)
-- 
2.1.3

^ permalink raw reply related

* Re: [LKP] [genirq] c291ee62216:
From: Thomas Gleixner @ 2014-12-15 14:25 UTC (permalink / raw)
  To: Huang Ying; +Cc: LKML, LKP ML, Rick Jones, Peter Zijlstra, netdev
In-Reply-To: <1418627248.5745.186.camel@intel.com>

On Mon, 15 Dec 2014, Huang Ying wrote:
> FYI, we noticed the below changes on
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/urgent
> commit c291ee622165cb2c8d4e7af63fffd499354a23be ("genirq: Prevent proc race against freeing of irq descriptors")
> 
> testbox/testcase/testparams: lkp-nex04/netperf/performance-300s-200%-SCTP_STREAM

>                            time.voluntary_context_switches
> 
>   100500 ++-----------------------------------------------------------------+
>          O       O                    O          O                          |
>          |  O                            O          O                       |
>   100000 ++           O  O  O O     O          O                            |
>          |    O                  O          O          O                    |
>          |          O                                                       |
>    99500 ++                                                                 |
>          |                                                                  |
>    99000 ++                                                                 |
>          |                                                                  |
>          |                                                                  |
>    98500 ++        .*.*..                          .*.. .*..    .*..*..     |
>          *..*.*..*.      *..*.*..*..*.*..*..*..*.*.    *    *..*       *.*..*
>          |                                                                  |
>    98000 ++-----------------------------------------------------------------+
> 
> 
> 	[*] bisect-good sample
> 	[O] bisect-bad  sample

Cute. Looking at netperf source it seems to do a high frequency
readout of /proc/stat from all involved threads. Which of course
explains that the number of context switches is going up as the stuff
is going to content on the sparse_irq_mutex.

While its possible to fix^W band aid that case, I'm really not too
happy to do so just to please a wreckaged use case. High frequency
polling of /proc/stat is just asking for trouble and on larger
machines it's a complete scalability fail. Especially the interrupt
part is amazingly horrible

      for_each_irq_nr()
	for_each_possible_cpu()

Is it really required for netperf to do that stat poll in a loop or
can it be made smarter?

Btw, in that test scenario runs netserver and the test threads on the
same machine. So the utilization data is pretty useless anyway because
all threads will read more or less the same data which cannot be
correlated to a particular instance.

Thanks,

	tglx

^ permalink raw reply

* [PATCH 2/2] net/macb: remove useless calls of devm_free_irq()
From: Cyrille Pitchen @ 2014-12-15 14:13 UTC (permalink / raw)
  To: nicolas.ferre, davem, linux-arm-kernel, netdev, soren.brinkmann
  Cc: linux-kernel, Cyrille Pitchen
In-Reply-To: <cover.1418651689.git.cyrille.pitchen@atmel.com>

Inside macb_probe(), when devm_request_irq() fails on queue q, there is no need
to call devm_free_irq() on queues 0..q-1 because the managed device resources
are released later when calling free_netdev().

Also removing devm_free_irq() call from macb_remove() for the same reason.

Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com>
---
 drivers/net/ethernet/cadence/macb.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index 81f317f..414f796 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -2160,7 +2160,7 @@ static int __init macb_probe(struct platform_device *pdev)
 	int err = -ENXIO;
 	const char *mac;
 	void __iomem *mem;
-	unsigned int hw_q, queue_mask, q, num_queues, q_irq = 0;
+	unsigned int hw_q, queue_mask, q, num_queues;
 	struct clk *pclk, *hclk, *tx_clk;
 
 	regs = platform_get_resource(pdev, IORESOURCE_MEM, 0);
@@ -2235,11 +2235,11 @@ static int __init macb_probe(struct platform_device *pdev)
 	 * register mapping but we don't want to test the queue index then
 	 * compute the corresponding register offset at run time.
 	 */
-	for (hw_q = 0; hw_q < MACB_MAX_QUEUES; ++hw_q) {
+	for (hw_q = 0, q = 0; hw_q < MACB_MAX_QUEUES; ++hw_q) {
 		if (!(queue_mask & (1 << hw_q)))
 			continue;
 
-		queue = &bp->queues[q_irq];
+		queue = &bp->queues[q];
 		queue->bp = bp;
 		if (hw_q) {
 			queue->ISR  = GEM_ISR(hw_q - 1);
@@ -2261,18 +2261,18 @@ static int __init macb_probe(struct platform_device *pdev)
 		 * must remove the optional gaps that could exist in the
 		 * hardware queue mask.
 		 */
-		queue->irq = platform_get_irq(pdev, q_irq);
+		queue->irq = platform_get_irq(pdev, q);
 		err = devm_request_irq(&pdev->dev, queue->irq, macb_interrupt,
 				       0, dev->name, queue);
 		if (err) {
 			dev_err(&pdev->dev,
 				"Unable to request IRQ %d (error %d)\n",
 				queue->irq, err);
-			goto err_out_free_irq;
+			goto err_out_free_netdev;
 		}
 
 		INIT_WORK(&queue->tx_error_task, macb_tx_error_task);
-		q_irq++;
+		q++;
 	}
 	dev->irq = bp->queues[0].irq;
 
@@ -2350,7 +2350,7 @@ static int __init macb_probe(struct platform_device *pdev)
 	err = register_netdev(dev);
 	if (err) {
 		dev_err(&pdev->dev, "Cannot register net device, aborting.\n");
-		goto err_out_free_irq;
+		goto err_out_free_netdev;
 	}
 
 	err = macb_mii_init(bp);
@@ -2373,9 +2373,7 @@ static int __init macb_probe(struct platform_device *pdev)
 
 err_out_unregister_netdev:
 	unregister_netdev(dev);
-err_out_free_irq:
-	for (q = 0, queue = bp->queues; q < q_irq; ++q, ++queue)
-		devm_free_irq(&pdev->dev, queue->irq, queue);
+err_out_free_netdev:
 	free_netdev(dev);
 err_out_disable_clocks:
 	if (!IS_ERR(tx_clk))
@@ -2392,8 +2390,6 @@ static int __exit macb_remove(struct platform_device *pdev)
 {
 	struct net_device *dev;
 	struct macb *bp;
-	struct macb_queue *queue;
-	unsigned int q;
 
 	dev = platform_get_drvdata(pdev);
 
@@ -2405,9 +2401,6 @@ static int __exit macb_remove(struct platform_device *pdev)
 		kfree(bp->mii_bus->irq);
 		mdiobus_free(bp->mii_bus);
 		unregister_netdev(dev);
-		queue = bp->queues;
-		for (q = 0; q < bp->num_queues; ++q, ++queue)
-			devm_free_irq(&pdev->dev, queue->irq, queue);
 		if (!IS_ERR(bp->tx_clk))
 			clk_disable_unprepare(bp->tx_clk);
 		clk_disable_unprepare(bp->hclk);
-- 
1.8.2.2

^ permalink raw reply related

* [PATCH 1/2] net/macb: fix misplaced call of free_netdev() in macb_remove()
From: Cyrille Pitchen @ 2014-12-15 14:13 UTC (permalink / raw)
  To: nicolas.ferre, davem, linux-arm-kernel, netdev, soren.brinkmann
  Cc: linux-kernel, Cyrille Pitchen
In-Reply-To: <cover.1418651689.git.cyrille.pitchen@atmel.com>

fix a bug introduced by the multiqueue support patch:
"net/macb: add TX multiqueue support for gem"

the "bp" pointer to the netdev private data was dereferenced and used after the
associated memory had been freed by calling free_netdev().

Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com>
---
 drivers/net/ethernet/cadence/macb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index 0987d2a..81f317f 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -2408,11 +2408,11 @@ static int __exit macb_remove(struct platform_device *pdev)
 		queue = bp->queues;
 		for (q = 0; q < bp->num_queues; ++q, ++queue)
 			devm_free_irq(&pdev->dev, queue->irq, queue);
-		free_netdev(dev);
 		if (!IS_ERR(bp->tx_clk))
 			clk_disable_unprepare(bp->tx_clk);
 		clk_disable_unprepare(bp->hclk);
 		clk_disable_unprepare(bp->pclk);
+		free_netdev(dev);
 	}
 
 	return 0;
-- 
1.8.2.2

^ permalink raw reply related

* [PATCH 0/2] net/macb: fix multiqueue support patch up
From: Cyrille Pitchen @ 2014-12-15 14:13 UTC (permalink / raw)
  To: nicolas.ferre, davem, linux-arm-kernel, netdev, soren.brinkmann
  Cc: linux-kernel, Cyrille Pitchen

This series of patches is a fixup for the multiqueue support patch.

The first patch fixes a bug introduced by the multiqueue support patch.
The second one doesn't fix a bug but simplify the source code by removing
useless calls to devm_free_irq() since we use managed device resources for
IRQs.

They were applied on the net-next tree and tested with a sama5d36ek board.

Cyrille Pitchen (2):
  net/macb: fix misplaced call of free_netdev() in macb_remove()
  net/macb: remove useless calls of devm_free_irq()

 drivers/net/ethernet/cadence/macb.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

-- 
1.8.2.2

^ permalink raw reply

* Re: [RFC PATCH net-next 1/1] net: Support for switch port configuration
From: Thomas Graf @ 2014-12-15 14:07 UTC (permalink / raw)
  To: Varlese, Marco
  Cc: John Fastabend, Jiri Pirko, netdev@vger.kernel.org,
	stephen@networkplumber.org, Fastabend, John R,
	roopa@cumulusnetworks.com, sfeldma@gmail.com,
	linux-kernel@vger.kernel.org
In-Reply-To: <C4896FB061E7DE4AAC93031BDCA044B104AC3914@IRSMSX108.ger.corp.intel.com>

On 12/11/14 at 09:59am, Varlese, Marco wrote:
> An example of attributes are:
> * enabling/disabling of learning of source addresses on a given port (you can imagine the attribute called LEARNING for example);
> * internal loopback control (i.e. LOOPBACK) which will control how the flow of traffic behaves from the switch fabric towards an egress port;
> * flooding for broadcast/multicast/unicast type of packets (i.e. BFLOODING, MFLOODING, UFLOODING);

All of these are highly generic and should *not* be passed through
from user space to the driver directly but rather be properly
abstracted as Roopa proposed. The value of this API is abstraction.
If we introduce per device attributes for generic functions we lose
large portions of the value gained.

You mentioned you have additional attributes in mind, maybe you can
give a few examples which are not generic, i.e. do not apply to
multiple vendors.

^ permalink raw reply

* Re: [RFC PATCH net-next 1/1] net: Support for switch port configuration
From: Thomas Graf @ 2014-12-15 14:05 UTC (permalink / raw)
  To: Varlese, Marco
  Cc: netdev@vger.kernel.org, stephen@networkplumber.org,
	Fastabend, John R, jiri@resnulli.us, roopa@cumulusnetworks.com,
	sfeldma@gmail.com, linux-kernel@vger.kernel.org
In-Reply-To: <C4896FB061E7DE4AAC93031BDCA044B104AC3257@IRSMSX108.ger.corp.intel.com>

On 12/10/14 at 04:23pm, Varlese, Marco wrote:
> +#ifdef CONFIG_NET_SWITCHDEV
> +static int do_setswcfg(struct net_device *dev, struct nlattr *attr)
> +{
> +	int rem, err = -EINVAL;
> +	struct nlattr *v;
> +	const struct net_device_ops *ops = dev->netdev_ops;
> +
> +	nla_for_each_nested(v, attr, rem) {
> +		u32 op = nla_type(v);
> +		u64 value = nla_get_u64(v);
> +
> +		err = ops->ndo_switch_port_set_cfg(dev, op, value);
> +		if (err)
> +			break;
> +	}
> +	return err;
> +}
> +#endif

A strictly technical feedback first: I suggest to split the above into
a validation and commit part to keep Netlink operations atomic. Doing
commit & rollback for the deeply nested configuration we are heading
to will be difficult and error prone. Let's keep updates atomic for as
long as possible, i.e. individual set operations can't fail.

^ permalink raw reply

* Re: [PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
From: Willy Tarreau @ 2014-12-15 13:33 UTC (permalink / raw)
  To: Zhu, Yanjun
  Cc: Sergei Shtylyov, Zhu Yanjun, netdev@vger.kernel.org, ALLAN, BRUCE,
	KIRSHER, JEFFREY
In-Reply-To: <E66F6C44A7A6EA49ABFC80EA526506DF65A865A3@ALA-MBB.corp.ad.wrs.com>

On Mon, Dec 15, 2014 at 01:21:43PM +0000, Zhu, Yanjun wrote:
> Hi, Willy
> 
> Thanks for your reply.
> 
> This patch "[PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked" exists in tag v3.1.
> 
> Please follow these steps, this patch can be found:
> 
> 1. git checkout -f v3.1
> 
> 2. git log -p drivers/net/e1000e/ich8lan.c
> 
> 3. search "b7d6e335"
> 
> Then we will find this patch.

Ah it's because you truncated the commit ID from the right instead of from
the left. Truncated commit IDs are valid from the left, not from the right.
In your case, the commit is 6cc7aaed70c96c3933fbacbad582fc79b7d6e335
("e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked"), so
the truncated ID is 6cc7aae, not b7d6e335. It's important to fix that in
your commit messages so that a "git show" works correctly (it failed for me
for this precise reason).

Thanks,
Willy

^ permalink raw reply

* RE: [PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked
From: Zhu, Yanjun @ 2014-12-15 13:21 UTC (permalink / raw)
  To: Willy Tarreau, Sergei Shtylyov
  Cc: Zhu Yanjun, netdev@vger.kernel.org, ALLAN, BRUCE,
	KIRSHER, JEFFREY
In-Reply-To: <20141215121639.GB28701@1wt.eu>

Hi, Willy

Thanks for your reply.

This patch "[PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked" exists in tag v3.1.

Please follow these steps, this patch can be found:

1. git checkout -f v3.1

2. git log -p drivers/net/e1000e/ich8lan.c

3. search "b7d6e335"

Then we will find this patch.

Best Regards!
Zhu Yanjun

________________________________________
From: Willy Tarreau [w@1wt.eu]
Sent: Monday, December 15, 2014 4:16 AM
To: Sergei Shtylyov
Cc: Zhu Yanjun; netdev@vger.kernel.org; Zhu, Yanjun; ALLAN, BRUCE; KIRSHER, JEFFREY
Subject: Re: [PATCH 3/5] e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked

Hello,

On Mon, Dec 15, 2014 at 03:01:48PM +0300, Sergei Shtylyov wrote:
> Hello.
>
> On 12/15/2014 11:39 AM, Zhu Yanjun wrote:
>
> >2.6.x kernels require a similar logic change as commit b7d6e335
> >[e1000e: do not toggle LANPHYPC value bit when PHY reset is blocked]
> >introduces for newer kernels.
>
>    Hm, so is this patch to 2.6.x-stable kernels or a recent kernel?
> If the former, you should follow the rules in
> Documentation/stable_kernel_rules.txt.

I don't see anything there that does not comply with the rules. This
patchset looks fine and acceptable to me, I'll just wait a bit so that
if either Jeff or Bruce rejects it I respect their wish.

However you just made me realize that I can't find commit b7d6e335.
Zhu, please double-check that this commit (or an equivalent) was merged
upstream, *this* it a prerequisite for going into -stable.



> >When PHY reset is intentionally blocked on 82577/8/9, do not toggle the
> >LANPHYPC value bit (essentially performing a hard power reset of the
> >device) otherwise the PHY can be put into an unknown state.
>
> >Cleanup whitespace in the same function.
>
> >[yanjun.zhu: whitespace remains unchanged]
>
> >Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
>
>    So, is this your patch, or Bruce's? If the latter, you should add:
>
> From: Bruce Allan <bruce.w.allan@intel.com>
>
> at the start of the change log.
>
> >Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
> >Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> >Signed-off-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>

The case where a commit gets backported is always ambiguous, especially
when the context changes a lot. I'm personally in favor of keeping the
original signed-of-by chain related to the original fix, and adding some
text between it and the backporter's s-o-b indicating that the changes
were, so that original authors do not get blamed for mistakes.

Best regards,
Willy

^ permalink raw reply

* Re: [PATCH 0/9 net-next] rhashtable: Per bucket locks & deferred table resizing
From: Thomas Graf @ 2014-12-15 13:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, herbert, paulmck, edumazet, john.r.fastabend,
	josh
In-Reply-To: <cover.1418647641.git.tgraf@suug.ch>

Meant for "net-next". Apologies for the missing prefix.

^ permalink raw reply

* [PATCH 9/9] netlink: Lockless lookup with RCU grace period in socket release
From: Thomas Graf @ 2014-12-15 12:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, herbert, paulmck, edumazet, john.r.fastabend,
	josh
In-Reply-To: <cover.1418647641.git.tgraf@suug.ch>

Defers the release of the socket reference using call_rcu() to
allow using an RCU read-side protected call to rhashtable_lookup()

This restores behaviour and performance gains as previously
introduced by e341694 ("netlink: Convert netlink_lookup() to use
RCU protected hash table") without the side effect of severely
delayed socket destruction.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/netlink/af_netlink.c | 32 ++++++++++++++++----------------
 net/netlink/af_netlink.h |  1 +
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b38b148..4ad23f9 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -97,12 +97,12 @@ static int netlink_dump(struct sock *sk);
 static void netlink_skb_destructor(struct sk_buff *skb);
 
 /* nl_table locking explained:
- * Lookup and traversal are protected with nl_sk_hash_lock or nl_table_lock
- * combined with an RCU read-side lock. Insertion and removal are protected
- * with nl_sk_hash_lock while using RCU list modification primitives and may
- * run in parallel to nl_table_lock protected lookups. Destruction of the
- * Netlink socket may only occur *after* nl_table_lock has been acquired
- * either during or after the socket has been removed from the list.
+ * Lookup and traversal are protected with an RCU read-side lock. Insertion
+ * and removal are protected with nl_sk_hash_lock while using RCU list
+ * modification primitives and may run in parallel to RCU protected lookups.
+ * Destruction of the Netlink socket may only occur *after* nl_table_lock has
+ * been acquired * either during or after the socket has been removed from
+ * the list and after an RCU grace period.
  */
 DEFINE_RWLOCK(nl_table_lock);
 EXPORT_SYMBOL_GPL(nl_table_lock);
@@ -1023,13 +1023,11 @@ static struct sock *netlink_lookup(struct net *net, int protocol, u32 portid)
 	struct netlink_table *table = &nl_table[protocol];
 	struct sock *sk;
 
-	read_lock(&nl_table_lock);
 	rcu_read_lock();
 	sk = __netlink_lookup(table, portid, net);
 	if (sk)
 		sock_hold(sk);
 	rcu_read_unlock();
-	read_unlock(&nl_table_lock);
 
 	return sk;
 }
@@ -1201,6 +1199,13 @@ out_module:
 	goto out;
 }
 
+static void deferred_put_nlk_sk(struct rcu_head *head)
+{
+	struct netlink_sock *nlk = container_of(head, struct netlink_sock, rcu);
+
+	sock_put(&nlk->sk);
+}
+
 static int netlink_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -1261,7 +1266,7 @@ static int netlink_release(struct socket *sock)
 	local_bh_disable();
 	sock_prot_inuse_add(sock_net(sk), &netlink_proto, -1);
 	local_bh_enable();
-	sock_put(sk);
+	call_rcu(&nlk->rcu, deferred_put_nlk_sk);
 	return 0;
 }
 
@@ -1276,7 +1281,6 @@ static int netlink_autobind(struct socket *sock)
 
 retry:
 	cond_resched();
-	netlink_table_grab();
 	rcu_read_lock();
 	if (__netlink_lookup(table, portid, net)) {
 		/* Bind collision, search negative portid values. */
@@ -1284,11 +1288,9 @@ retry:
 		if (rover > -4097)
 			rover = -4097;
 		rcu_read_unlock();
-		netlink_table_ungrab();
 		goto retry;
 	}
 	rcu_read_unlock();
-	netlink_table_ungrab();
 
 	err = netlink_insert(sk, net, portid);
 	if (err == -EADDRINUSE)
@@ -2922,9 +2924,8 @@ static struct sock *netlink_seq_socket_idx(struct seq_file *seq, loff_t pos)
 }
 
 static void *netlink_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(nl_table_lock) __acquires(RCU)
+	__acquires(RCU)
 {
-	read_lock(&nl_table_lock);
 	rcu_read_lock();
 	return *pos ? netlink_seq_socket_idx(seq, *pos - 1) : SEQ_START_TOKEN;
 }
@@ -2976,10 +2977,9 @@ static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 }
 
 static void netlink_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU) __releases(nl_table_lock)
+	__releases(RCU)
 {
 	rcu_read_unlock();
-	read_unlock(&nl_table_lock);
 }
 
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index b20a173..ab539c5 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -50,6 +50,7 @@ struct netlink_sock {
 #endif /* CONFIG_NETLINK_MMAP */
 
 	struct rhash_head	node;
+	struct rcu_head		rcu;
 };
 
 static inline struct netlink_sock *nlk_sk(struct sock *sk)
-- 
1.9.3

^ permalink raw reply related

* [PATCH 8/9] rhashtable: Supports for nulls marker
From: Thomas Graf @ 2014-12-15 12:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, herbert, paulmck, edumazet, john.r.fastabend,
	josh
In-Reply-To: <cover.1418647641.git.tgraf@suug.ch>

In order to allow for wider usage of rhashtable, use a special nulls
marker to terminate each chain. The reason for not using the existing
nulls_list is that the prev pointer usage would not be valid as entries
can be linked in two different buckets at the same time.

The 4 nulls base bits can be set through the rhashtable_params structure
like this:

struct rhashtable_params params = {
        [...]
        .nulls_base = (1U << RHT_BASE_SHIFT),
};

This reduces the hash length from 32 bits to 27 bits.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/list_nulls.h |  3 ++-
 include/linux/rhashtable.h | 57 ++++++++++++++++++++++++++++++++++++++--------
 lib/rhashtable.c           | 37 ++++++++++++++++++++++++------
 3 files changed, 79 insertions(+), 18 deletions(-)

diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
index 5d10ae36..e8c300e 100644
--- a/include/linux/list_nulls.h
+++ b/include/linux/list_nulls.h
@@ -21,8 +21,9 @@ struct hlist_nulls_head {
 struct hlist_nulls_node {
 	struct hlist_nulls_node *next, **pprev;
 };
+#define NULLS_MARKER(value) (1UL | (((long)value) << 1))
 #define INIT_HLIST_NULLS_HEAD(ptr, nulls) \
-	((ptr)->first = (struct hlist_nulls_node *) (1UL | (((long)nulls) << 1)))
+	((ptr)->first = (struct hlist_nulls_node *) NULLS_MARKER(nulls))
 
 #define hlist_nulls_entry(ptr, type, member) container_of(ptr,type,member)
 /**
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index a1688f0..de7cac7 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -18,15 +18,32 @@
 #ifndef _LINUX_RHASHTABLE_H
 #define _LINUX_RHASHTABLE_H
 
-#include <linux/rculist.h>
+#include <linux/list_nulls.h>
 #include <linux/workqueue.h>
 
+/*
+ * The end of the chain is marked with a special nulls marks which has
+ * the following format:
+ *
+ * +-------+-----------------------------------------------------+-+
+ * | Base  |                      Hash                           |1|
+ * +-------+-----------------------------------------------------+-+
+ *
+ * Base (4 bits) : Reserved to distinguish between multiple tables.
+ *                 Specified via &struct rhashtable_params.nulls_base.
+ * Hash (27 bits): Full hash (unmasked) of first element added to bucket
+ * 1 (1 bit)     : Nulls marker (always set)
+ *
+ * The remaining bits of the next pointer remain unused for now.
+ */
+#define RHT_BASE_BITS		4
+#define RHT_HASH_BITS		27
+#define RHT_BASE_SHIFT		RHT_HASH_BITS
+
 struct rhash_head {
 	struct rhash_head __rcu		*next;
 };
 
-#define INIT_HASH_HEAD(ptr) ((ptr)->next = NULL)
-
 /**
  * struct bucket_table - Table of hash buckets
  * @size: Number of hash buckets
@@ -55,6 +72,7 @@ struct rhashtable;
  * @hash_rnd: Seed to use while hashing
  * @max_shift: Maximum number of shifts while expanding
  * @min_shift: Minimum number of shifts while shrinking
+ * @nulls_base: Base value to generate nulls marker
  * @locks_mul: Number of bucket locks to allocate per cpu (default: 128)
  * @hashfn: Function to hash key
  * @obj_hashfn: Function to hash object
@@ -69,6 +87,7 @@ struct rhashtable_params {
 	u32			hash_rnd;
 	size_t			max_shift;
 	size_t			min_shift;
+	u32			nulls_base;
 	size_t			locks_mul;
 	rht_hashfn_t		hashfn;
 	rht_obj_hashfn_t	obj_hashfn;
@@ -100,6 +119,24 @@ struct rhashtable {
 	bool                            being_destroyed;
 };
 
+static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
+{
+	return NULLS_MARKER(ht->p.nulls_base + hash);
+}
+
+#define INIT_RHT_NULLS_HEAD(ptr, ht, hash) \
+	((ptr) = (typeof(ptr)) rht_marker(ht, hash))
+
+static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
+{
+	return ((unsigned long) ptr & 1);
+}
+
+static inline unsigned long rht_get_nulls_value(const struct rhash_head *ptr)
+{
+	return ((unsigned long) ptr) >> 1;
+}
+
 #ifdef CONFIG_PROVE_LOCKING
 int lockdep_rht_mutex_is_held(struct rhashtable *ht);
 int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash);
@@ -157,7 +194,7 @@ void rhashtable_destroy(struct rhashtable *ht);
  */
 #define rht_for_each_continue(pos, head, tbl, hash) \
 	for (pos = rht_dereference_bucket(head, tbl, hash); \
-	     pos; \
+	     !rht_is_a_nulls(pos); \
 	     pos = rht_dereference_bucket((pos)->next, tbl, hash))
 
 /**
@@ -180,7 +217,7 @@ void rhashtable_destroy(struct rhashtable *ht);
  */
 #define rht_for_each_entry_continue(tpos, pos, head, tbl, hash, member)	\
 	for (pos = rht_dereference_bucket(head, tbl, hash);		\
-	     pos && rht_entry(tpos, pos, member);			\
+	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	\
 	     pos = rht_dereference_bucket((pos)->next, tbl, hash))
 
 /**
@@ -209,9 +246,9 @@ void rhashtable_destroy(struct rhashtable *ht);
  */
 #define rht_for_each_entry_safe(tpos, pos, next, tbl, hash, member)	    \
 	for (pos = rht_dereference_bucket((tbl)->buckets[hash], tbl, hash), \
-	     next = pos ? rht_dereference_bucket(pos->next, tbl, hash)      \
-			: NULL;						    \
-	     pos && rht_entry(tpos, pos, member);			    \
+	     next = !rht_is_a_nulls(pos) ?				    \
+		       rht_dereference_bucket(pos->next, tbl, hash) : NULL; \
+	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	    \
 	     pos = next)
 
 /**
@@ -228,7 +265,7 @@ void rhashtable_destroy(struct rhashtable *ht);
 #define rht_for_each_rcu_continue(pos, head, tbl, hash)			\
 	for (({barrier(); }),						\
 	     pos = rht_dereference_bucket_rcu(head, tbl, hash);		\
-	     pos;							\
+	     !rht_is_a_nulls(pos);					\
 	     pos = rcu_dereference_raw(pos->next))
 
 /**
@@ -260,7 +297,7 @@ void rhashtable_destroy(struct rhashtable *ht);
 #define rht_for_each_entry_rcu_continue(tpos, pos, head, tbl, hash, member) \
 	for (({barrier(); }),						    \
 	     pos = rht_dereference_bucket_rcu(head, tbl, hash);		    \
-	     pos && rht_entry(tpos, pos, member);			    \
+	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	    \
 	     pos = rht_dereference_bucket_rcu(pos->next, tbl, hash))
 
 /**
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 312e343..cbad192 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -28,6 +28,9 @@
 #define HASH_MIN_SIZE		4UL
 #define BUCKET_LOCKS_PER_CPU   128UL
 
+/* Base bits plus 1 bit for nulls marker */
+#define HASH_RESERVED_SPACE	(RHT_BASE_BITS + 1)
+
 enum {
 	RHT_LOCK_NORMAL,
 	RHT_LOCK_NESTED,
@@ -86,7 +89,7 @@ static u32 obj_raw_hashfn(const struct rhashtable *ht, const void *ptr)
 		hash = ht->p.hashfn(ptr + ht->p.key_offset, ht->p.key_len,
 				    ht->p.hash_rnd);
 
-	return hash;
+	return hash >> HASH_RESERVED_SPACE;
 }
 
 static u32 key_hashfn(struct rhashtable *ht, const void *key, u32 len)
@@ -95,6 +98,7 @@ static u32 key_hashfn(struct rhashtable *ht, const void *key, u32 len)
 	u32 hash;
 
 	hash = ht->p.hashfn(key, len, ht->p.hash_rnd);
+	hash >>= HASH_RESERVED_SPACE;
 
 	return rht_bucket_index(tbl, hash);
 }
@@ -111,7 +115,7 @@ static struct rhash_head __rcu **bucket_tail(struct bucket_table *tbl, u32 n)
 	struct rhash_head __rcu **pprev;
 
 	for (pprev = &tbl->buckets[n];
-	     rht_dereference_bucket(*pprev, tbl, n);
+	     !rht_is_a_nulls(rht_dereference_bucket(*pprev, tbl, n));
 	     pprev = &rht_dereference_bucket(*pprev, tbl, n)->next)
 		;
 
@@ -164,6 +168,7 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 {
 	struct bucket_table *tbl;
 	size_t size;
+	int i;
 
 	size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
 	tbl = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
@@ -180,6 +185,9 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 		return NULL;
 	}
 
+	for (i = 0; i < nbuckets; i++)
+		INIT_RHT_NULLS_HEAD(tbl->buckets[i], ht, i);
+
 	return tbl;
 }
 
@@ -221,7 +229,7 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 	/* Old bucket empty, no work needed. */
 	p = rht_dereference_bucket(old_tbl->buckets[old_hash], old_tbl,
 				   old_hash);
-	if (!p)
+	if (rht_is_a_nulls(p))
 		return;
 
 	new_hash = new_hash2 = head_hashfn(ht, new_tbl, p);
@@ -252,8 +260,8 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 	/* Find the subsequent node which does hash to the same
 	 * bucket as node P, or NULL if no such node exists.
 	 */
-	next = NULL;
-	if (he) {
+	INIT_RHT_NULLS_HEAD(next, ht, old_hash);
+	if (!rht_is_a_nulls(he)) {
 		rht_for_each_continue(he, he->next, old_tbl, old_hash) {
 			if (head_hashfn(ht, new_tbl, he) == new_hash) {
 				next = he;
@@ -369,11 +377,15 @@ int rhashtable_expand(struct rhashtable *ht)
 		 */
 		complete = true;
 		for (old_hash = 0; old_hash < old_tbl->size; old_hash++) {
+			struct rhash_head *head;
+
 			old_bucket_lock = bucket_lock(old_tbl, old_hash);
 			spin_lock_bh(old_bucket_lock);
 
 			hashtable_chain_unzip(ht, new_tbl, old_tbl, old_hash);
-			if (old_tbl->buckets[old_hash] != NULL)
+			head = rht_dereference_bucket(old_tbl->buckets[old_hash],
+						      old_tbl, old_hash);
+			if (!rht_is_a_nulls(head))
 				complete = false;
 
 			spin_unlock_bh(old_bucket_lock);
@@ -498,6 +510,7 @@ static void rht_deferred_worker(struct work_struct *work)
 void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 {
 	struct bucket_table *tbl;
+	struct rhash_head *head;
 	spinlock_t *lock;
 	unsigned hash;
 
@@ -508,7 +521,12 @@ void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 	lock = bucket_lock(tbl, hash);
 
 	spin_lock_bh(lock);
-	RCU_INIT_POINTER(obj->next, tbl->buckets[hash]);
+	head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
+	if (rht_is_a_nulls(head))
+		INIT_RHT_NULLS_HEAD(obj->next, ht, hash);
+	else
+		RCU_INIT_POINTER(obj->next, head);
+
 	rcu_assign_pointer(tbl->buckets[hash], obj);
 	spin_unlock_bh(lock);
 
@@ -709,6 +727,7 @@ static size_t rounded_hashtable_size(struct rhashtable_params *params)
  *	.key_offset = offsetof(struct test_obj, key),
  *	.key_len = sizeof(int),
  *	.hashfn = jhash,
+ *	.nulls_base = (1U << RHT_BASE_SHIFT),
  * };
  *
  * Configuration Example 2: Variable length keys
@@ -741,6 +760,9 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params)
 	    (!params->key_len && !params->obj_hashfn))
 		return -EINVAL;
 
+	if (params->nulls_base && params->nulls_base < (1U << RHT_BASE_SHIFT))
+		return -EINVAL;
+
 	params->min_shift = max_t(size_t, params->min_shift,
 				  ilog2(HASH_MIN_SIZE));
 
@@ -974,6 +996,7 @@ static int __init test_rht_init(void)
 		.key_offset = offsetof(struct test_obj, value),
 		.key_len = sizeof(int),
 		.hashfn = jhash,
+		.nulls_base = (3U << RHT_BASE_SHIFT),
 		.grow_decision = rht_grow_above_75,
 		.shrink_decision = rht_shrink_below_30,
 	};
-- 
1.9.3

^ permalink raw reply related

* [PATCH 1/9] rhashtable: Do hashing inside of rhashtable_lookup_compare()
From: Thomas Graf @ 2014-12-15 12:51 UTC (permalink / raw)
  To: davem
  Cc: netdev, kernel, herbert, paulmck, edumazet, john.r.fastabend,
	josh, netfilter-devel
In-Reply-To: <cover.1418647641.git.tgraf@suug.ch>

Hash the key inside of rhashtable_lookup_compare() like
rhashtable_lookup() does. This allows to simplify the hashing
functions and keep them private.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Cc: netfilter-devel@vger.kernel.org
---
 include/linux/rhashtable.h |  5 +--
 lib/rhashtable.c           | 91 +++++++++++++++-------------------------------
 net/netfilter/nft_hash.c   | 46 ++++++++++++++---------
 net/netlink/af_netlink.c   |  5 +--
 4 files changed, 61 insertions(+), 86 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index b93fd89..1b51221 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -96,9 +96,6 @@ static inline int lockdep_rht_mutex_is_held(const struct rhashtable *ht)
 
 int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params);
 
-u32 rhashtable_hashfn(const struct rhashtable *ht, const void *key, u32 len);
-u32 rhashtable_obj_hashfn(const struct rhashtable *ht, void *ptr);
-
 void rhashtable_insert(struct rhashtable *ht, struct rhash_head *node);
 bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *node);
 void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
@@ -111,7 +108,7 @@ int rhashtable_expand(struct rhashtable *ht);
 int rhashtable_shrink(struct rhashtable *ht);
 
 void *rhashtable_lookup(const struct rhashtable *ht, const void *key);
-void *rhashtable_lookup_compare(const struct rhashtable *ht, u32 hash,
+void *rhashtable_lookup_compare(const struct rhashtable *ht, const void *key,
 				bool (*compare)(void *, void *), void *arg);
 
 void rhashtable_destroy(const struct rhashtable *ht);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 6c3c723..1ee0eb6 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -42,69 +42,39 @@ static void *rht_obj(const struct rhashtable *ht, const struct rhash_head *he)
 	return (void *) he - ht->p.head_offset;
 }
 
-static u32 __hashfn(const struct rhashtable *ht, const void *key,
-		      u32 len, u32 hsize)
+static u32 rht_bucket_index(const struct bucket_table *tbl, u32 hash)
 {
-	u32 h;
-
-	h = ht->p.hashfn(key, len, ht->p.hash_rnd);
-
-	return h & (hsize - 1);
-}
-
-/**
- * rhashtable_hashfn - compute hash for key of given length
- * @ht:		hash table to compute for
- * @key:	pointer to key
- * @len:	length of key
- *
- * Computes the hash value using the hash function provided in the 'hashfn'
- * of struct rhashtable_params. The returned value is guaranteed to be
- * smaller than the number of buckets in the hash table.
- */
-u32 rhashtable_hashfn(const struct rhashtable *ht, const void *key, u32 len)
-{
-	struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
-
-	return __hashfn(ht, key, len, tbl->size);
+	return hash & (tbl->size - 1);
 }
-EXPORT_SYMBOL_GPL(rhashtable_hashfn);
 
-static u32 obj_hashfn(const struct rhashtable *ht, const void *ptr, u32 hsize)
+static u32 obj_raw_hashfn(const struct rhashtable *ht, const void *ptr)
 {
-	if (unlikely(!ht->p.key_len)) {
-		u32 h;
-
-		h = ht->p.obj_hashfn(ptr, ht->p.hash_rnd);
+	u32 hash;
 
-		return h & (hsize - 1);
-	}
+	if (unlikely(!ht->p.key_len))
+		hash = ht->p.obj_hashfn(ptr, ht->p.hash_rnd);
+	else
+		hash = ht->p.hashfn(ptr + ht->p.key_offset, ht->p.key_len,
+				    ht->p.hash_rnd);
 
-	return __hashfn(ht, ptr + ht->p.key_offset, ht->p.key_len, hsize);
+	return hash;
 }
 
-/**
- * rhashtable_obj_hashfn - compute hash for hashed object
- * @ht:		hash table to compute for
- * @ptr:	pointer to hashed object
- *
- * Computes the hash value using the hash function `hashfn` respectively
- * 'obj_hashfn' depending on whether the hash table is set up to work with
- * a fixed length key. The returned value is guaranteed to be smaller than
- * the number of buckets in the hash table.
- */
-u32 rhashtable_obj_hashfn(const struct rhashtable *ht, void *ptr)
+static u32 key_hashfn(const struct rhashtable *ht, const void *key, u32 len)
 {
 	struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
+	u32 hash;
+
+	hash = ht->p.hashfn(key, len, ht->p.hash_rnd);
 
-	return obj_hashfn(ht, ptr, tbl->size);
+	return rht_bucket_index(tbl, hash);
 }
-EXPORT_SYMBOL_GPL(rhashtable_obj_hashfn);
 
 static u32 head_hashfn(const struct rhashtable *ht,
-		       const struct rhash_head *he, u32 hsize)
+		       const struct bucket_table *tbl,
+		       const struct rhash_head *he)
 {
-	return obj_hashfn(ht, rht_obj(ht, he), hsize);
+	return rht_bucket_index(tbl, obj_raw_hashfn(ht, rht_obj(ht, he)));
 }
 
 static struct bucket_table *bucket_table_alloc(size_t nbuckets)
@@ -170,9 +140,9 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 	 * reaches a node that doesn't hash to the same bucket as the
 	 * previous node p. Call the previous node p;
 	 */
-	h = head_hashfn(ht, p, new_tbl->size);
+	h = head_hashfn(ht, new_tbl, p);
 	rht_for_each(he, p->next, ht) {
-		if (head_hashfn(ht, he, new_tbl->size) != h)
+		if (head_hashfn(ht, new_tbl, he) != h)
 			break;
 		p = he;
 	}
@@ -184,7 +154,7 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 	next = NULL;
 	if (he) {
 		rht_for_each(he, he->next, ht) {
-			if (head_hashfn(ht, he, new_tbl->size) == h) {
+			if (head_hashfn(ht, new_tbl, he) == h) {
 				next = he;
 				break;
 			}
@@ -237,9 +207,9 @@ int rhashtable_expand(struct rhashtable *ht)
 	 * single imprecise chain.
 	 */
 	for (i = 0; i < new_tbl->size; i++) {
-		h = i & (old_tbl->size - 1);
+		h = rht_bucket_index(old_tbl, i);
 		rht_for_each(he, old_tbl->buckets[h], ht) {
-			if (head_hashfn(ht, he, new_tbl->size) == i) {
+			if (head_hashfn(ht, new_tbl, he) == i) {
 				RCU_INIT_POINTER(new_tbl->buckets[i], he);
 				break;
 			}
@@ -353,7 +323,7 @@ void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 
 	ASSERT_RHT_MUTEX(ht);
 
-	hash = head_hashfn(ht, obj, tbl->size);
+	hash = head_hashfn(ht, tbl, obj);
 	RCU_INIT_POINTER(obj->next, tbl->buckets[hash]);
 	rcu_assign_pointer(tbl->buckets[hash], obj);
 	ht->nelems++;
@@ -413,7 +383,7 @@ bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 
 	ASSERT_RHT_MUTEX(ht);
 
-	h = head_hashfn(ht, obj, tbl->size);
+	h = head_hashfn(ht, tbl, obj);
 
 	pprev = &tbl->buckets[h];
 	rht_for_each(he, tbl->buckets[h], ht) {
@@ -452,7 +422,7 @@ void *rhashtable_lookup(const struct rhashtable *ht, const void *key)
 
 	BUG_ON(!ht->p.key_len);
 
-	h = __hashfn(ht, key, ht->p.key_len, tbl->size);
+	h = key_hashfn(ht, key, ht->p.key_len);
 	rht_for_each_rcu(he, tbl->buckets[h], ht) {
 		if (memcmp(rht_obj(ht, he) + ht->p.key_offset, key,
 			   ht->p.key_len))
@@ -467,7 +437,7 @@ EXPORT_SYMBOL_GPL(rhashtable_lookup);
 /**
  * rhashtable_lookup_compare - search hash table with compare function
  * @ht:		hash table
- * @hash:	hash value of desired entry
+ * @key:	the pointer to the key
  * @compare:	compare function, must return true on match
  * @arg:	argument passed on to compare function
  *
@@ -479,15 +449,14 @@ EXPORT_SYMBOL_GPL(rhashtable_lookup);
  *
  * Returns the first entry on which the compare function returned true.
  */
-void *rhashtable_lookup_compare(const struct rhashtable *ht, u32 hash,
+void *rhashtable_lookup_compare(const struct rhashtable *ht, const void *key,
 				bool (*compare)(void *, void *), void *arg)
 {
 	const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
 	struct rhash_head *he;
+	u32 hash;
 
-	if (unlikely(hash >= tbl->size))
-		return NULL;
-
+	hash = key_hashfn(ht, key, ht->p.key_len);
 	rht_for_each_rcu(he, tbl->buckets[hash], ht) {
 		if (!compare(rht_obj(ht, he), arg))
 			continue;
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 1e316ce..614ee09 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -94,28 +94,40 @@ static void nft_hash_remove(const struct nft_set *set,
 	kfree(he);
 }
 
+struct nft_compare_arg {
+	const struct nft_set *set;
+	struct nft_set_elem *elem;
+};
+
+static bool nft_hash_compare(void *ptr, void *arg)
+{
+	struct nft_hash_elem *he = ptr;
+	struct nft_compare_arg *x = arg;
+
+	if (!nft_data_cmp(&he->key, &x->elem->key, x->set->klen)) {
+		x->elem->cookie = &he->node;
+		x->elem->flags = 0;
+		if (x->set->flags & NFT_SET_MAP)
+			nft_data_copy(&x->elem->data, he->data);
+
+		return true;
+	}
+
+	return false;
+}
+
 static int nft_hash_get(const struct nft_set *set, struct nft_set_elem *elem)
 {
 	const struct rhashtable *priv = nft_set_priv(set);
-	const struct bucket_table *tbl = rht_dereference_rcu(priv->tbl, priv);
-	struct rhash_head __rcu * const *pprev;
-	struct nft_hash_elem *he;
-	u32 h;
-
-	h = rhashtable_hashfn(priv, &elem->key, set->klen);
-	pprev = &tbl->buckets[h];
-	rht_for_each_entry_rcu(he, tbl->buckets[h], node) {
-		if (nft_data_cmp(&he->key, &elem->key, set->klen)) {
-			pprev = &he->node.next;
-			continue;
-		}
+	struct nft_compare_arg arg = {
+		.set = set,
+		.elem = elem,
+	};
 
-		elem->cookie = (void *)pprev;
-		elem->flags = 0;
-		if (set->flags & NFT_SET_MAP)
-			nft_data_copy(&elem->data, he->data);
+	if (rhashtable_lookup_compare(priv, &elem->key,
+				      &nft_hash_compare, &arg))
 		return 0;
-	}
+
 	return -ENOENT;
 }
 
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ef5f77b..dc8354b 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1022,11 +1022,8 @@ static struct sock *__netlink_lookup(struct netlink_table *table, u32 portid,
 		.net = net,
 		.portid = portid,
 	};
-	u32 hash;
 
-	hash = rhashtable_hashfn(&table->hash, &portid, sizeof(portid));
-
-	return rhashtable_lookup_compare(&table->hash, hash,
+	return rhashtable_lookup_compare(&table->hash, &portid,
 					 &netlink_compare, &arg);
 }
 
-- 
1.9.3

^ permalink raw reply related

* [PATCH 0/9] rhashtable: Per bucket locks & deferred table resizing
From: Thomas Graf @ 2014-12-15 12:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, herbert, paulmck, edumazet, john.r.fastabend,
	josh

Prepares for and introduces per bucket spinlocks and deferred table
resizing. This allows for parallel table mutations in different hash
buckets from atomic context. The resizing occurs in the background
in a separate worker thread while lookups, inserts, and removals can
continue.

Also modified the chain linked list to be terminated with a special
nulls marker to allow entries to move between multiple lists.

Last but not least, reintroduces lockless netlink_lookup() with
deferred Netlink socket destruction to avoid the side effect of
increased netlink_release() runtime.

Thomas Graf (9):
  rhashtable: Do hashing inside of rhashtable_lookup_compare()
  rhashtable: Use rht_obj() instead of manual offset calculation
  rhashtable: Convert bucket iterators to take table and index
  rhashtable: Factor out bucket_tail() function
  nft_hash: Remove rhashtable_remove_pprev()
  spinlock: Add spin_lock_bh_nested()
  rhashtable: Per bucket locks & deferred expansion/shrinking
  rhashtable: Supports for nulls marker
  netlink: Lockless lookup with RCU grace period in socket release

 include/linux/list_nulls.h       |   3 +-
 include/linux/rhashtable.h       | 258 ++++++++++++-----
 include/linux/spinlock.h         |   8 +
 include/linux/spinlock_api_smp.h |   2 +
 include/linux/spinlock_api_up.h  |   1 +
 kernel/locking/spinlock.c        |   8 +
 lib/rhashtable.c                 | 607 ++++++++++++++++++++++++++-------------
 net/netfilter/nft_hash.c         |  92 +++---
 net/netlink/af_netlink.c         |  64 ++---
 net/netlink/af_netlink.h         |   1 +
 net/netlink/diag.c               |   4 +-
 11 files changed, 693 insertions(+), 355 deletions(-)

-- 
1.9.3

^ permalink raw reply

* [PATCH 3/9] rhashtable: Convert bucket iterators to take table and index
From: Thomas Graf @ 2014-12-15 12:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, herbert, paulmck, edumazet, john.r.fastabend,
	josh
In-Reply-To: <cover.1418647641.git.tgraf@suug.ch>

This patch is in preparation to introduce per bucket spinlocks. It
extends all iterator macros to take the bucket table and bucket
index. It also introduces a new rht_dereference_bucket() to
handle protected accesses to buckets.

It introduces a barrier() to the RCU iterators to the prevent
the compiler from caching the first element.

The lockdep verifier is introduced as stub which always succeeds
and properly implement in the next patch when the locks are
introduced.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/rhashtable.h | 173 ++++++++++++++++++++++++++++++---------------
 lib/rhashtable.c           |  30 +++++---
 net/netfilter/nft_hash.c   |  12 ++--
 net/netlink/af_netlink.c   |  12 ++--
 net/netlink/diag.c         |   4 +-
 5 files changed, 152 insertions(+), 79 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1b51221..b54e24a 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -87,11 +87,18 @@ struct rhashtable {
 
 #ifdef CONFIG_PROVE_LOCKING
 int lockdep_rht_mutex_is_held(const struct rhashtable *ht);
+int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash);
 #else
 static inline int lockdep_rht_mutex_is_held(const struct rhashtable *ht)
 {
 	return 1;
 }
+
+static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl,
+					     u32 hash)
+{
+	return 1;
+}
 #endif /* CONFIG_PROVE_LOCKING */
 
 int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params);
@@ -119,92 +126,144 @@ void rhashtable_destroy(const struct rhashtable *ht);
 #define rht_dereference_rcu(p, ht) \
 	rcu_dereference_check(p, lockdep_rht_mutex_is_held(ht))
 
-#define rht_entry(ptr, type, member) container_of(ptr, type, member)
-#define rht_entry_safe(ptr, type, member) \
-({ \
-	typeof(ptr) __ptr = (ptr); \
-	   __ptr ? rht_entry(__ptr, type, member) : NULL; \
-})
+#define rht_dereference_bucket(p, tbl, hash) \
+	rcu_dereference_protected(p, lockdep_rht_bucket_is_held(tbl, hash))
 
-#define rht_next_entry_safe(pos, ht, member) \
-({ \
-	pos ? rht_entry_safe(rht_dereference((pos)->member.next, ht), \
-			     typeof(*(pos)), member) : NULL; \
-})
+#define rht_dereference_bucket_rcu(p, tbl, hash) \
+	rcu_dereference_check(p, lockdep_rht_bucket_is_held(tbl, hash))
+
+#define rht_entry(tpos, pos, member) \
+	({ tpos = container_of(pos, typeof(*tpos), member); 1; })
 
 /**
- * rht_for_each - iterate over hash chain
- * @pos:	&struct rhash_head to use as a loop cursor.
- * @head:	head of the hash chain (struct rhash_head *)
- * @ht:		pointer to your struct rhashtable
+ * rht_for_each_continue - continue iterating over hash chain
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @head:	the previous &struct rhash_head to continue from
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
  */
-#define rht_for_each(pos, head, ht) \
-	for (pos = rht_dereference(head, ht); \
+#define rht_for_each_continue(pos, head, tbl, hash) \
+	for (pos = rht_dereference_bucket(head, tbl, hash); \
 	     pos; \
-	     pos = rht_dereference((pos)->next, ht))
+	     pos = rht_dereference_bucket((pos)->next, tbl, hash))
+
+/**
+ * rht_for_each - iterate over hash chain
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ */
+#define rht_for_each(pos, tbl, hash) \
+	rht_for_each_continue(pos, (tbl)->buckets[hash], tbl, hash)
+
+/**
+ * rht_for_each_entry_continue - continue iterating over hash chain
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @head:	the previous &struct rhash_head to continue from
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ * @member:	name of the &struct rhash_head within the hashable struct.
+ */
+#define rht_for_each_entry_continue(tpos, pos, head, tbl, hash, member)	\
+	for (pos = rht_dereference_bucket(head, tbl, hash);		\
+	     pos && rht_entry(tpos, pos, member);			\
+	     pos = rht_dereference_bucket((pos)->next, tbl, hash))
 
 /**
  * rht_for_each_entry - iterate over hash chain of given type
- * @pos:	type * to use as a loop cursor.
- * @head:	head of the hash chain (struct rhash_head *)
- * @ht:		pointer to your struct rhashtable
- * @member:	name of the rhash_head within the hashable struct.
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ * @member:	name of the &struct rhash_head within the hashable struct.
  */
-#define rht_for_each_entry(pos, head, ht, member) \
-	for (pos = rht_entry_safe(rht_dereference(head, ht), \
-				   typeof(*(pos)), member); \
-	     pos; \
-	     pos = rht_next_entry_safe(pos, ht, member))
+#define rht_for_each_entry(tpos, pos, tbl, hash, member)		\
+	rht_for_each_entry_continue(tpos, pos, (tbl)->buckets[hash],	\
+				    tbl, hash, member)
 
 /**
  * rht_for_each_entry_safe - safely iterate over hash chain of given type
- * @pos:	type * to use as a loop cursor.
- * @n:		type * to use for temporary next object storage
- * @head:	head of the hash chain (struct rhash_head *)
- * @ht:		pointer to your struct rhashtable
- * @member:	name of the rhash_head within the hashable struct.
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @next:	the &struct rhash_head to use as next in loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ * @member:	name of the &struct rhash_head within the hashable struct.
  *
  * This hash chain list-traversal primitive allows for the looped code to
  * remove the loop cursor from the list.
  */
-#define rht_for_each_entry_safe(pos, n, head, ht, member)		\
-	for (pos = rht_entry_safe(rht_dereference(head, ht), \
-				  typeof(*(pos)), member), \
-	     n = rht_next_entry_safe(pos, ht, member); \
-	     pos; \
-	     pos = n, \
-	     n = rht_next_entry_safe(pos, ht, member))
+#define rht_for_each_entry_safe(tpos, pos, next, tbl, hash, member)	    \
+	for (pos = rht_dereference_bucket((tbl)->buckets[hash], tbl, hash), \
+	     next = pos ? rht_dereference_bucket(pos->next, tbl, hash)      \
+			: NULL;						    \
+	     pos && rht_entry(tpos, pos, member);			    \
+	     pos = next)
+
+/**
+ * rht_for_each_rcu_continue - continue iterating over rcu hash chain
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @head:	the previous &struct rhash_head to continue from
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ *
+ * This hash chain list-traversal primitive may safely run concurrently with
+ * the _rcu mutation primitives such as rhashtable_insert() as long as the
+ * traversal is guarded by rcu_read_lock().
+ */
+#define rht_for_each_rcu_continue(pos, head, tbl, hash)			\
+	for (({barrier(); }),						\
+	     pos = rht_dereference_bucket_rcu(head, tbl, hash);		\
+	     pos;							\
+	     pos = rcu_dereference_raw(pos->next))
 
 /**
  * rht_for_each_rcu - iterate over rcu hash chain
- * @pos:	&struct rhash_head to use as a loop cursor.
- * @head:	head of the hash chain (struct rhash_head *)
- * @ht:		pointer to your struct rhashtable
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
  *
  * This hash chain list-traversal primitive may safely run concurrently with
- * the _rcu fkht mutation primitives such as rht_insert() as long as the
+ * the _rcu mutation primitives such as rhashtable_insert() as long as the
  * traversal is guarded by rcu_read_lock().
  */
-#define rht_for_each_rcu(pos, head, ht) \
-	for (pos = rht_dereference_rcu(head, ht); \
-	     pos; \
-	     pos = rht_dereference_rcu((pos)->next, ht))
+#define rht_for_each_rcu(pos, tbl, hash)				\
+	rht_for_each_rcu_continue(pos, (tbl)->buckets[hash], tbl, hash)
+
+/**
+ * rht_for_each_entry_rcu_continue - continue iterating over rcu hash chain
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @head:	the previous &struct rhash_head to continue from
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ * @member:	name of the &struct rhash_head within the hashable struct.
+ *
+ * This hash chain list-traversal primitive may safely run concurrently with
+ * the _rcu mutation primitives such as rhashtable_insert() as long as the
+ * traversal is guarded by rcu_read_lock().
+ */
+#define rht_for_each_entry_rcu_continue(tpos, pos, head, tbl, hash, member) \
+	for (({barrier(); }),						    \
+	     pos = rht_dereference_bucket_rcu(head, tbl, hash);		    \
+	     pos && rht_entry(tpos, pos, member);			    \
+	     pos = rht_dereference_bucket_rcu(pos->next, tbl, hash))
 
 /**
  * rht_for_each_entry_rcu - iterate over rcu hash chain of given type
- * @pos:	type * to use as a loop cursor.
- * @head:	head of the hash chain (struct rhash_head *)
- * @member:	name of the rhash_head within the hashable struct.
+ * @tpos:	the type * to use as a loop cursor.
+ * @pos:	the &struct rhash_head to use as a loop cursor.
+ * @tbl:	the &struct bucket_table
+ * @hash:	the hash value / bucket index
+ * @member:	name of the &struct rhash_head within the hashable struct.
  *
  * This hash chain list-traversal primitive may safely run concurrently with
- * the _rcu fkht mutation primitives such as rht_insert() as long as the
+ * the _rcu mutation primitives such as rhashtable_insert() as long as the
  * traversal is guarded by rcu_read_lock().
  */
-#define rht_for_each_entry_rcu(pos, head, member) \
-	for (pos = rht_entry_safe(rcu_dereference_raw(head), \
-				  typeof(*(pos)), member); \
-	     pos; \
-	     pos = rht_entry_safe(rcu_dereference_raw((pos)->member.next), \
-				  typeof(*(pos)), member))
+#define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member)		\
+	rht_for_each_entry_rcu_continue(tpos, pos, (tbl)->buckets[hash],\
+					tbl, hash, member)
 
 #endif /* _LINUX_RHASHTABLE_H */
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index b658245..ce450d0 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -35,6 +35,12 @@ int lockdep_rht_mutex_is_held(const struct rhashtable *ht)
 	return ht->p.mutex_is_held(ht->p.parent);
 }
 EXPORT_SYMBOL_GPL(lockdep_rht_mutex_is_held);
+
+int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash)
+{
+	return 1;
+}
+EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
 #endif
 
 static void *rht_obj(const struct rhashtable *ht, const struct rhash_head *he)
@@ -141,7 +147,7 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 	 * previous node p. Call the previous node p;
 	 */
 	h = head_hashfn(ht, new_tbl, p);
-	rht_for_each(he, p->next, ht) {
+	rht_for_each_continue(he, p->next, old_tbl, n) {
 		if (head_hashfn(ht, new_tbl, he) != h)
 			break;
 		p = he;
@@ -153,7 +159,7 @@ static void hashtable_chain_unzip(const struct rhashtable *ht,
 	 */
 	next = NULL;
 	if (he) {
-		rht_for_each(he, he->next, ht) {
+		rht_for_each_continue(he, he->next, old_tbl, n) {
 			if (head_hashfn(ht, new_tbl, he) == h) {
 				next = he;
 				break;
@@ -208,7 +214,7 @@ int rhashtable_expand(struct rhashtable *ht)
 	 */
 	for (i = 0; i < new_tbl->size; i++) {
 		h = rht_bucket_index(old_tbl, i);
-		rht_for_each(he, old_tbl->buckets[h], ht) {
+		rht_for_each(he, old_tbl, h) {
 			if (head_hashfn(ht, new_tbl, he) == i) {
 				RCU_INIT_POINTER(new_tbl->buckets[i], he);
 				break;
@@ -286,7 +292,7 @@ int rhashtable_shrink(struct rhashtable *ht)
 		 * to the new bucket.
 		 */
 		for (pprev = &ntbl->buckets[i]; *pprev != NULL;
-		     pprev = &rht_dereference(*pprev, ht)->next)
+		     pprev = &rht_dereference_bucket(*pprev, ntbl, i)->next)
 			;
 		RCU_INIT_POINTER(*pprev, tbl->buckets[i + ntbl->size]);
 	}
@@ -386,7 +392,7 @@ bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 	h = head_hashfn(ht, tbl, obj);
 
 	pprev = &tbl->buckets[h];
-	rht_for_each(he, tbl->buckets[h], ht) {
+	rht_for_each(he, tbl, h) {
 		if (he != obj) {
 			pprev = &he->next;
 			continue;
@@ -423,7 +429,7 @@ void *rhashtable_lookup(const struct rhashtable *ht, const void *key)
 	BUG_ON(!ht->p.key_len);
 
 	h = key_hashfn(ht, key, ht->p.key_len);
-	rht_for_each_rcu(he, tbl->buckets[h], ht) {
+	rht_for_each_rcu(he, tbl, h) {
 		if (memcmp(rht_obj(ht, he) + ht->p.key_offset, key,
 			   ht->p.key_len))
 			continue;
@@ -457,7 +463,7 @@ void *rhashtable_lookup_compare(const struct rhashtable *ht, const void *key,
 	u32 hash;
 
 	hash = key_hashfn(ht, key, ht->p.key_len);
-	rht_for_each_rcu(he, tbl->buckets[hash], ht) {
+	rht_for_each_rcu(he, tbl, hash) {
 		if (!compare(rht_obj(ht, he), arg))
 			continue;
 		return rht_obj(ht, he);
@@ -625,6 +631,7 @@ static int __init test_rht_lookup(struct rhashtable *ht)
 static void test_bucket_stats(struct rhashtable *ht, bool quiet)
 {
 	unsigned int cnt, rcu_cnt, i, total = 0;
+	struct rhash_head *pos;
 	struct test_obj *obj;
 	struct bucket_table *tbl;
 
@@ -635,14 +642,14 @@ static void test_bucket_stats(struct rhashtable *ht, bool quiet)
 		if (!quiet)
 			pr_info(" [%#4x/%zu]", i, tbl->size);
 
-		rht_for_each_entry_rcu(obj, tbl->buckets[i], node) {
+		rht_for_each_entry_rcu(obj, pos, tbl, i, node) {
 			cnt++;
 			total++;
 			if (!quiet)
 				pr_cont(" [%p],", obj);
 		}
 
-		rht_for_each_entry_rcu(obj, tbl->buckets[i], node)
+		rht_for_each_entry_rcu(obj, pos, tbl, i, node)
 			rcu_cnt++;
 
 		if (rcu_cnt != cnt)
@@ -664,7 +671,8 @@ static void test_bucket_stats(struct rhashtable *ht, bool quiet)
 static int __init test_rhashtable(struct rhashtable *ht)
 {
 	struct bucket_table *tbl;
-	struct test_obj *obj, *next;
+	struct test_obj *obj;
+	struct rhash_head *pos, *next;
 	int err;
 	unsigned int i;
 
@@ -733,7 +741,7 @@ static int __init test_rhashtable(struct rhashtable *ht)
 error:
 	tbl = rht_dereference_rcu(ht->tbl, ht);
 	for (i = 0; i < tbl->size; i++)
-		rht_for_each_entry_safe(obj, next, tbl->buckets[i], ht, node)
+		rht_for_each_entry_safe(obj, pos, next, tbl, i, node)
 			kfree(obj);
 
 	return err;
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index 614ee09..d93f1f4 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -142,7 +142,9 @@ static void nft_hash_walk(const struct nft_ctx *ctx, const struct nft_set *set,
 
 	tbl = rht_dereference_rcu(priv->tbl, priv);
 	for (i = 0; i < tbl->size; i++) {
-		rht_for_each_entry_rcu(he, tbl->buckets[i], node) {
+		struct rhash_head *pos;
+
+		rht_for_each_entry_rcu(he, pos, tbl, i, node) {
 			if (iter->count < iter->skip)
 				goto cont;
 
@@ -197,15 +199,13 @@ static void nft_hash_destroy(const struct nft_set *set)
 {
 	const struct rhashtable *priv = nft_set_priv(set);
 	const struct bucket_table *tbl = priv->tbl;
-	struct nft_hash_elem *he, *next;
+	struct nft_hash_elem *he;
+	struct rhash_head *pos, *next;
 	unsigned int i;
 
 	for (i = 0; i < tbl->size; i++) {
-		for (he = rht_entry(tbl->buckets[i], struct nft_hash_elem, node);
-		     he != NULL; he = next) {
-			next = rht_entry(he->node.next, struct nft_hash_elem, node);
+		rht_for_each_entry_safe(he, pos, next, tbl, i, node)
 			nft_hash_elem_destroy(set, he);
-		}
 	}
 	rhashtable_destroy(priv);
 }
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index dc8354b..bfae4be 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2910,7 +2910,9 @@ static struct sock *netlink_seq_socket_idx(struct seq_file *seq, loff_t pos)
 		const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
 
 		for (j = 0; j < tbl->size; j++) {
-			rht_for_each_entry_rcu(nlk, tbl->buckets[j], node) {
+			struct rhash_head *node;
+
+			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
 				s = (struct sock *)nlk;
 
 				if (sock_net(s) != seq_file_net(seq))
@@ -2938,6 +2940,8 @@ static void *netlink_seq_start(struct seq_file *seq, loff_t *pos)
 static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	struct rhashtable *ht;
+	const struct bucket_table *tbl;
+	struct rhash_head *node;
 	struct netlink_sock *nlk;
 	struct nl_seq_iter *iter;
 	struct net *net;
@@ -2954,17 +2958,17 @@ static void *netlink_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 
 	i = iter->link;
 	ht = &nl_table[i].hash;
-	rht_for_each_entry(nlk, nlk->node.next, ht, node)
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+	rht_for_each_entry_rcu_continue(nlk, node, nlk->node.next, tbl, iter->hash_idx, node)
 		if (net_eq(sock_net((struct sock *)nlk), net))
 			return nlk;
 
 	j = iter->hash_idx + 1;
 
 	do {
-		const struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
 
 		for (; j < tbl->size; j++) {
-			rht_for_each_entry(nlk, tbl->buckets[j], ht, node) {
+			rht_for_each_entry_rcu(nlk, node, tbl, j, node) {
 				if (net_eq(sock_net((struct sock *)nlk), net)) {
 					iter->link = i;
 					iter->hash_idx = j;
diff --git a/net/netlink/diag.c b/net/netlink/diag.c
index de8c74a..fcca36d 100644
--- a/net/netlink/diag.c
+++ b/net/netlink/diag.c
@@ -113,7 +113,9 @@ static int __netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	req = nlmsg_data(cb->nlh);
 
 	for (i = 0; i < htbl->size; i++) {
-		rht_for_each_entry(nlsk, htbl->buckets[i], ht, node) {
+		struct rhash_head *pos;
+
+		rht_for_each_entry(nlsk, pos, htbl, i, node) {
 			sk = (struct sock *)nlsk;
 
 			if (!net_eq(sock_net(sk), net))
-- 
1.9.3

^ permalink raw reply related

* [PATCH 5/9] nft_hash: Remove rhashtable_remove_pprev()
From: Thomas Graf @ 2014-12-15 12:51 UTC (permalink / raw)
  To: davem
  Cc: netdev, kernel, herbert, paulmck, edumazet, john.r.fastabend,
	josh, netfilter-devel
In-Reply-To: <cover.1418647641.git.tgraf@suug.ch>

The removal function of nft_hash currently stores a reference to the
previous element during lookup which is used to optimize removal later
on. This was possible because a lock is held throughout calling
rhashtable_lookup() and rhashtable_remove().

With the introdution of deferred table resizing in parallel to lookups
and insertions, the nftables lock will no longer synchronize all
table mutations and the stored pprev may become invalid.

Removing this optimization makes removal slightly more expensive on
average but allows taking the resize cost out of the insert and
remove path.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Cc: netfilter-devel@vger.kernel.org
---
 include/linux/rhashtable.h |  2 --
 lib/rhashtable.c           | 34 +++++++---------------------------
 net/netfilter/nft_hash.c   | 11 +++--------
 3 files changed, 10 insertions(+), 37 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index b54e24a..f624d4b 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -105,8 +105,6 @@ int rhashtable_init(struct rhashtable *ht, struct rhashtable_params *params);
 
 void rhashtable_insert(struct rhashtable *ht, struct rhash_head *node);
 bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *node);
-void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
-			     struct rhash_head __rcu **pprev);
 
 bool rht_grow_above_75(const struct rhashtable *ht, size_t new_size);
 bool rht_shrink_below_30(const struct rhashtable *ht, size_t new_size);
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 0bd29c1..e6b85c4 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -345,32 +345,6 @@ void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 EXPORT_SYMBOL_GPL(rhashtable_insert);
 
 /**
- * rhashtable_remove_pprev - remove object from hash table given previous element
- * @ht:		hash table
- * @obj:	pointer to hash head inside object
- * @pprev:	pointer to previous element
- *
- * Identical to rhashtable_remove() but caller is alreayd aware of the element
- * in front of the element to be deleted. This is in particular useful for
- * deletion when combined with walking or lookup.
- */
-void rhashtable_remove_pprev(struct rhashtable *ht, struct rhash_head *obj,
-			     struct rhash_head __rcu **pprev)
-{
-	struct bucket_table *tbl = rht_dereference(ht->tbl, ht);
-
-	ASSERT_RHT_MUTEX(ht);
-
-	RCU_INIT_POINTER(*pprev, obj->next);
-	ht->nelems--;
-
-	if (ht->p.shrink_decision &&
-	    ht->p.shrink_decision(ht, tbl->size))
-		rhashtable_shrink(ht);
-}
-EXPORT_SYMBOL_GPL(rhashtable_remove_pprev);
-
-/**
  * rhashtable_remove - remove object from hash table
  * @ht:		hash table
  * @obj:	pointer to hash head inside object
@@ -403,7 +377,13 @@ bool rhashtable_remove(struct rhashtable *ht, struct rhash_head *obj)
 			continue;
 		}
 
-		rhashtable_remove_pprev(ht, he, pprev);
+		RCU_INIT_POINTER(*pprev, he->next);
+		ht->nelems--;
+
+		if (ht->p.shrink_decision &&
+		    ht->p.shrink_decision(ht, tbl->size))
+			rhashtable_shrink(ht);
+
 		return true;
 	}
 
diff --git a/net/netfilter/nft_hash.c b/net/netfilter/nft_hash.c
index d93f1f4..7f903cf 100644
--- a/net/netfilter/nft_hash.c
+++ b/net/netfilter/nft_hash.c
@@ -83,15 +83,10 @@ static void nft_hash_remove(const struct nft_set *set,
 			    const struct nft_set_elem *elem)
 {
 	struct rhashtable *priv = nft_set_priv(set);
-	struct rhash_head *he, __rcu **pprev;
-
-	pprev = elem->cookie;
-	he = rht_dereference((*pprev), priv);
-
-	rhashtable_remove_pprev(priv, he, pprev);
 
+	rhashtable_remove(priv, elem->cookie);
 	synchronize_rcu();
-	kfree(he);
+	kfree(elem->cookie);
 }
 
 struct nft_compare_arg {
@@ -105,7 +100,7 @@ static bool nft_hash_compare(void *ptr, void *arg)
 	struct nft_compare_arg *x = arg;
 
 	if (!nft_data_cmp(&he->key, &x->elem->key, x->set->klen)) {
-		x->elem->cookie = &he->node;
+		x->elem->cookie = he;
 		x->elem->flags = 0;
 		if (x->set->flags & NFT_SET_MAP)
 			nft_data_copy(&x->elem->data, he->data);
-- 
1.9.3

^ permalink raw reply related

* [PATCH 6/9] spinlock: Add spin_lock_bh_nested()
From: Thomas Graf @ 2014-12-15 12:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, herbert, paulmck, edumazet, john.r.fastabend,
	josh
In-Reply-To: <cover.1418647641.git.tgraf@suug.ch>

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/spinlock.h         | 8 ++++++++
 include/linux/spinlock_api_smp.h | 2 ++
 include/linux/spinlock_api_up.h  | 1 +
 kernel/locking/spinlock.c        | 8 ++++++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 262ba4e..3e18379 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -190,6 +190,8 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define raw_spin_lock_nested(lock, subclass) \
 	_raw_spin_lock_nested(lock, subclass)
+# define raw_spin_lock_bh_nested(lock, subclass) \
+	_raw_spin_lock_bh_nested(lock, subclass)
 
 # define raw_spin_lock_nest_lock(lock, nest_lock)			\
 	 do {								\
@@ -205,6 +207,7 @@ static inline void do_raw_spin_unlock(raw_spinlock_t *lock) __releases(lock)
 # define raw_spin_lock_nested(lock, subclass)		\
 	_raw_spin_lock(((void)(subclass), (lock)))
 # define raw_spin_lock_nest_lock(lock, nest_lock)	_raw_spin_lock(lock)
+# define raw_spin_lock_bh_nested(lock, subclass)	_raw_spin_lock_bh(lock)
 #endif
 
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
@@ -324,6 +327,11 @@ do {								\
 	raw_spin_lock_nested(spinlock_check(lock), subclass);	\
 } while (0)
 
+#define spin_lock_bh_nested(lock, subclass)			\
+do {								\
+	raw_spin_lock_bh_nested(spinlock_check(lock), subclass);\
+} while (0)
+
 #define spin_lock_nest_lock(lock, nest_lock)				\
 do {									\
 	raw_spin_lock_nest_lock(spinlock_check(lock), nest_lock);	\
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 42dfab8..5344268 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -22,6 +22,8 @@ int in_lock_functions(unsigned long addr);
 void __lockfunc _raw_spin_lock(raw_spinlock_t *lock)		__acquires(lock);
 void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
 								__acquires(lock);
+void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
+								__acquires(lock);
 void __lockfunc
 _raw_spin_lock_nest_lock(raw_spinlock_t *lock, struct lockdep_map *map)
 								__acquires(lock);
diff --git a/include/linux/spinlock_api_up.h b/include/linux/spinlock_api_up.h
index d0d1888..d3afef9 100644
--- a/include/linux/spinlock_api_up.h
+++ b/include/linux/spinlock_api_up.h
@@ -57,6 +57,7 @@
 
 #define _raw_spin_lock(lock)			__LOCK(lock)
 #define _raw_spin_lock_nested(lock, subclass)	__LOCK(lock)
+#define _raw_spin_lock_bh_nested(lock, subclass) __LOCK(lock)
 #define _raw_read_lock(lock)			__LOCK(lock)
 #define _raw_write_lock(lock)			__LOCK(lock)
 #define _raw_spin_lock_bh(lock)			__LOCK_BH(lock)
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 4b082b5..db3ccb1 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -363,6 +363,14 @@ void __lockfunc _raw_spin_lock_nested(raw_spinlock_t *lock, int subclass)
 }
 EXPORT_SYMBOL(_raw_spin_lock_nested);
 
+void __lockfunc _raw_spin_lock_bh_nested(raw_spinlock_t *lock, int subclass)
+{
+	__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
+	spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
+}
+EXPORT_SYMBOL(_raw_spin_lock_bh_nested);
+
 unsigned long __lockfunc _raw_spin_lock_irqsave_nested(raw_spinlock_t *lock,
 						   int subclass)
 {
-- 
1.9.3

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox