Netdev List
 help / color / mirror / Atom feed
* Re: linux-next: build failure after merge of the net-next tree
From: Stephen Rothwell @ 2018-06-01  3:59 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20180531073855.29c23fce@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 1995 bytes --]

Hi all,

On Thu, 31 May 2018 07:38:55 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> On Tue, 29 May 2018 13:25:48 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >
> > After merging the net-next tree, today's linux-next build (x86_64
> > allmodconfig) failed like this:
> > 
> > x86_64-linux-ld: unknown architecture of input file `net/bpfilter/bpfilter_umh.o' is incompatible with i386:x86-64 output
> > 
> > Caused by commit
> > 
> >   d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
> > 
> > In my builds, the host is PowerPC 64 LE ...
> > 
> > I have reverted that commit along with
> > 
> >   61a552eb487f ("bpfilter: fix build dependency")
> >   13405468f49d ("bpfilter: don't pass O_CREAT when opening console for debug")
> > 
> > for today.  
> 
> I am still getting this failure (well, at least yesterday I did).

Still happened today.  My guess is that bpfilter_umh needs to be built
with the kernel compiler (not the host compiler - since ir is meant to
run on the some machine as the kernel, right?) but will require the
kernel architecture libc etc


I replaced the reverts above with the following:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Fri, 1 Jun 2018 13:33:28 +1000
Subject: [PATCH] net: bpfilter: mark as BROKEN for now

This does not build in a cross compile environment

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 net/bpfilter/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
index a948b072c28f..ea4be72fdf6e 100644
--- a/net/bpfilter/Kconfig
+++ b/net/bpfilter/Kconfig
@@ -2,6 +2,7 @@ menuconfig BPFILTER
 	bool "BPF based packet filtering framework (BPFILTER)"
 	default n
 	depends on NET && BPF && INET
+	depends on BROKEN
 	help
 	  This builds experimental bpfilter framework that is aiming to
 	  provide netfilter compatible functionality via BPF
-- 
2.17.0

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related

* [RFC PATCH 00/18] Assorted rhashtable improvements
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel

Hi,
 the following is my current set of rhashtable improvements.
 Some have been seen before, some have been improved,
 others are new.

 They include:
   working list-nulls support
   stability improvements for rhashtable_walk
   bit-spin-locks for simplicity and reduced cache footprint
      during modification
   optional per-cpu locks to improve scalability for modificiation
   various cleanups

 If I get suitable acks I will send more focused subsets to Davem for
 inclusion.

 I had said previously that I thought there was a way to provide
 stable walking of an rhl table in the face of concurrent
 insert/delete.  Having tried, I no longer think this can be
 done without substantial impact to lookups and/or other operations.
 The idea of attaching a marker to the list is incompatible with
 the normal rules for working with rcu-protected lists ("attaching"
 might be manageable.  "moving" or "removing" is the problematic part).

 The last patch is the one I'm least certain of.  It seems like a good
 idea to improve the chance of a walk avoiding any rehash, but it
 cannot provide a solid guarantee without risking a denial-of-service.
 My compromise is to guarantee no rehashes caused by shrinkage, and
 discourage rehashes caused by growth.  I'm not yet sure if that is
 sufficiently valuable, but I thought I would include the patch in the
 RFC anyway.

Thanks,
NeilBrown


---

NeilBrown (18):
      rhashtable: silence RCU warning in rhashtable_test.
      rhashtable: split rhashtable.h
      rhashtable: remove nulls_base and related code.
      rhashtable: detect when object movement might have invalidated a lookup
      rhashtable: simplify INIT_RHT_NULLS_HEAD()
      rhashtable: simplify nested_table_alloc() and rht_bucket_nested_insert()
      rhashtable: use cmpxchg() to protect ->future_tbl.
      rhashtable: clean up dereference of ->future_tbl.
      rhashtable: use cmpxchg() in nested_table_alloc()
      rhashtable: remove rhashtable_walk_peek()
      rhashtable: further improve stability of rhashtable_walk
      rhashtable: add rhashtable_walk_prev()
      rhashtable: don't hold lock on first table throughout insertion.
      rhashtable: allow rht_bucket_var to return NULL.
      rhashtable: use bit_spin_locks to protect hash bucket.
      rhashtable: allow percpu element counter
      rhashtable: rename rht_for_each*continue as *from.
      rhashtable: add rhashtable_walk_delay_rehash()


 .clang-format                                     |    8 
 MAINTAINERS                                       |    2 
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h        |    1 
 drivers/staging/lustre/lustre/fid/fid_request.c   |    2 
 drivers/staging/lustre/lustre/fld/fld_request.c   |    1 
 drivers/staging/lustre/lustre/include/lu_object.h |    1 
 include/linux/ipc.h                               |    2 
 include/linux/ipc_namespace.h                     |    2 
 include/linux/mroute_base.h                       |    2 
 include/linux/percpu_counter.h                    |    4 
 include/linux/rhashtable-types.h                  |  147 ++++++
 include/linux/rhashtable.h                        |  537 +++++++++++----------
 include/net/inet_frag.h                           |    2 
 include/net/netfilter/nf_flow_table.h             |    2 
 include/net/sctp/structs.h                        |    2 
 include/net/seg6.h                                |    2 
 include/net/seg6_hmac.h                           |    2 
 ipc/msg.c                                         |    1 
 ipc/sem.c                                         |    1 
 ipc/shm.c                                         |    1 
 ipc/util.c                                        |    2 
 lib/rhashtable.c                                  |  481 +++++++++++--------
 lib/test_rhashtable.c                             |   22 +
 net/bridge/br_fdb.c                               |    1 
 net/bridge/br_vlan.c                              |    1 
 net/bridge/br_vlan_tunnel.c                       |    1 
 net/ipv4/inet_fragment.c                          |    1 
 net/ipv4/ipmr.c                                   |    2 
 net/ipv4/ipmr_base.c                              |    1 
 net/ipv6/ip6mr.c                                  |    2 
 net/ipv6/seg6.c                                   |    1 
 net/ipv6/seg6_hmac.c                              |    1 
 net/netfilter/nf_tables_api.c                     |    1 
 net/sctp/input.c                                  |    1 
 net/sctp/socket.c                                 |    1 
 35 files changed, 760 insertions(+), 481 deletions(-)
 create mode 100644 include/linux/rhashtable-types.h

--
Signature

^ permalink raw reply

* [PATCH 01/18] rhashtable: silence RCU warning in rhashtable_test.
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

print_ht in rhashtable_test calls rht_dereference() with neither
RCU protection or the mutex.  This triggers an RCU warning.
So take the mutex to silence the warning.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 lib/test_rhashtable.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index f4000c137dbe..bf92b7aa2a49 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -499,6 +499,8 @@ static unsigned int __init print_ht(struct rhltable *rhlt)
 	unsigned int i, cnt = 0;
 
 	ht = &rhlt->ht;
+	/* Take the mutex to avoid RCU warning */
+	mutex_lock(&ht->mutex);
 	tbl = rht_dereference(ht->tbl, ht);
 	for (i = 0; i < tbl->size; i++) {
 		struct rhash_head *pos, *next;
@@ -532,6 +534,7 @@ static unsigned int __init print_ht(struct rhltable *rhlt)
 		}
 	}
 	printk(KERN_ERR "\n---- ht: ----%s\n-------------\n", buff);
+	mutex_unlock(&ht->mutex);
 
 	return cnt;
 }

^ permalink raw reply related

* [PATCH 02/18] rhashtable: split rhashtable.h
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

Due to the use of rhashtables in net namespaces,
rhashtable.h is included in lots of the kernel,
so a small changes can required a large recompilation.
This makes development painful.

This patch splits out rhashtable-types.h which just includes
the major type declarations, and does not include (non-trivial)
inline code.  rhashtable.h is no longer included by anything
in the include/ directory.
Common include files only include rhashtable-types.h so a large
recompilation is only triggered when that changes.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 MAINTAINERS                                       |    2 
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h        |    1 
 drivers/staging/lustre/lustre/fid/fid_request.c   |    2 
 drivers/staging/lustre/lustre/fld/fld_request.c   |    1 
 drivers/staging/lustre/lustre/include/lu_object.h |    1 
 include/linux/ipc.h                               |    2 
 include/linux/ipc_namespace.h                     |    2 
 include/linux/mroute_base.h                       |    2 
 include/linux/rhashtable-types.h                  |  139 +++++++++++++++++++++
 include/linux/rhashtable.h                        |  127 -------------------
 include/net/inet_frag.h                           |    2 
 include/net/netfilter/nf_flow_table.h             |    2 
 include/net/sctp/structs.h                        |    2 
 include/net/seg6.h                                |    2 
 include/net/seg6_hmac.h                           |    2 
 ipc/msg.c                                         |    1 
 ipc/sem.c                                         |    1 
 ipc/shm.c                                         |    1 
 ipc/util.c                                        |    1 
 lib/rhashtable.c                                  |    1 
 net/ipv4/inet_fragment.c                          |    1 
 net/ipv4/ipmr.c                                   |    1 
 net/ipv4/ipmr_base.c                              |    1 
 net/ipv6/ip6mr.c                                  |    1 
 net/ipv6/seg6.c                                   |    1 
 net/ipv6/seg6_hmac.c                              |    1 
 net/netfilter/nf_tables_api.c                     |    1 
 net/sctp/input.c                                  |    1 
 net/sctp/socket.c                                 |    1 
 29 files changed, 169 insertions(+), 134 deletions(-)
 create mode 100644 include/linux/rhashtable-types.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 66985d05bced..a3a4f44d3ce1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12012,7 +12012,9 @@ M:	Herbert Xu <herbert@gondor.apana.org.au>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	lib/rhashtable.c
+F:	lib/test_rhashtable.c
 F:	include/linux/rhashtable.h
+F:	include/linux/rhashtable-types.h
 
 RICOH R5C592 MEMORYSTICK DRIVER
 M:	Maxim Levitsky <maximlevitsky@gmail.com>
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 688f95440af2..66a7a02d616a 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -46,6 +46,7 @@
 #include <linux/spinlock.h>
 #include <linux/timer.h>
 #include <linux/vmalloc.h>
+#include <linux/rhashtable.h>
 #include <linux/etherdevice.h>
 #include <linux/net_tstamp.h>
 #include <linux/ptp_clock_kernel.h>
diff --git a/drivers/staging/lustre/lustre/fid/fid_request.c b/drivers/staging/lustre/lustre/fid/fid_request.c
index c674652af03a..a9af68178eff 100644
--- a/drivers/staging/lustre/lustre/fid/fid_request.c
+++ b/drivers/staging/lustre/lustre/fid/fid_request.c
@@ -40,7 +40,7 @@
 #define DEBUG_SUBSYSTEM S_FID
 
 #include <linux/module.h>
-
+#include <linux/rhashtable.h>
 #include <obd.h>
 #include <obd_class.h>
 #include <obd_support.h>
diff --git a/drivers/staging/lustre/lustre/fld/fld_request.c b/drivers/staging/lustre/lustre/fld/fld_request.c
index 7b7ba93a4db6..292ff5756243 100644
--- a/drivers/staging/lustre/lustre/fld/fld_request.c
+++ b/drivers/staging/lustre/lustre/fld/fld_request.c
@@ -40,6 +40,7 @@
 #define DEBUG_SUBSYSTEM S_FLD
 
 #include <linux/module.h>
+#include <linux/rhashtable.h>
 #include <asm/div64.h>
 
 #include <obd.h>
diff --git a/drivers/staging/lustre/lustre/include/lu_object.h b/drivers/staging/lustre/lustre/include/lu_object.h
index 1b5284d2416c..205463c47bda 100644
--- a/drivers/staging/lustre/lustre/include/lu_object.h
+++ b/drivers/staging/lustre/lustre/include/lu_object.h
@@ -37,6 +37,7 @@
 #include <stdarg.h>
 #include <linux/percpu_counter.h>
 #include <linux/libcfs/libcfs.h>
+#include <linux/rhashtable-types.h>
 #include <uapi/linux/lustre/lustre_idl.h>
 #include <lu_ref.h>
 
diff --git a/include/linux/ipc.h b/include/linux/ipc.h
index 6cc2df7f7ac9..e1c9eea6015b 100644
--- a/include/linux/ipc.h
+++ b/include/linux/ipc.h
@@ -4,7 +4,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/uidgid.h>
-#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
 #include <uapi/linux/ipc.h>
 #include <linux/refcount.h>
 
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8eb2f3..6cea726612b7 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -9,7 +9,7 @@
 #include <linux/nsproxy.h>
 #include <linux/ns_common.h>
 #include <linux/refcount.h>
-#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
 
 struct user_namespace;
 
diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index d617fe45543e..fd673be398ff 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -2,7 +2,7 @@
 #define __LINUX_MROUTE_BASE_H
 
 #include <linux/netdevice.h>
-#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
 #include <linux/spinlock.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
new file mode 100644
index 000000000000..9740063ff13b
--- /dev/null
+++ b/include/linux/rhashtable-types.h
@@ -0,0 +1,139 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Resizable, Scalable, Concurrent Hash Table
+ *
+ * Simple structures that might be needed in include
+ * files.
+ */
+
+#ifndef _LINUX_RHASHTABLE_TYPES_H
+#define _LINUX_RHASHTABLE_TYPES_H
+
+#include <linux/atomic.h>
+#include <linux/compiler.h>
+#include <linux/mutex.h>
+#include <linux/workqueue.h>
+
+struct rhash_head {
+	struct rhash_head __rcu		*next;
+};
+
+struct rhlist_head {
+	struct rhash_head		rhead;
+	struct rhlist_head __rcu	*next;
+};
+
+struct bucket_table;
+
+/**
+ * struct rhashtable_compare_arg - Key for the function rhashtable_compare
+ * @ht: Hash table
+ * @key: Key to compare against
+ */
+struct rhashtable_compare_arg {
+	struct rhashtable *ht;
+	const void *key;
+};
+
+typedef u32 (*rht_hashfn_t)(const void *data, u32 len, u32 seed);
+typedef u32 (*rht_obj_hashfn_t)(const void *data, u32 len, u32 seed);
+typedef int (*rht_obj_cmpfn_t)(struct rhashtable_compare_arg *arg,
+			       const void *obj);
+
+/**
+ * struct rhashtable_params - Hash table construction parameters
+ * @nelem_hint: Hint on number of elements, should be 75% of desired size
+ * @key_len: Length of key
+ * @key_offset: Offset of key in struct to be hashed
+ * @head_offset: Offset of rhash_head in struct to be hashed
+ * @max_size: Maximum size while expanding
+ * @min_size: Minimum size while shrinking
+ * @locks_mul: Number of bucket locks to allocate per cpu (default: 32)
+ * @automatic_shrinking: Enable automatic shrinking of tables
+ * @nulls_base: Base value to generate nulls marker
+ * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
+ * @obj_hashfn: Function to hash object
+ * @obj_cmpfn: Function to compare key with object
+ */
+struct rhashtable_params {
+	u16			nelem_hint;
+	u16			key_len;
+	u16			key_offset;
+	u16			head_offset;
+	unsigned int		max_size;
+	u16			min_size;
+	bool			automatic_shrinking;
+	u8			locks_mul;
+	u32			nulls_base;
+	rht_hashfn_t		hashfn;
+	rht_obj_hashfn_t	obj_hashfn;
+	rht_obj_cmpfn_t		obj_cmpfn;
+};
+
+/**
+ * struct rhashtable - Hash table handle
+ * @tbl: Bucket table
+ * @key_len: Key length for hashfn
+ * @max_elems: Maximum number of elements in table
+ * @p: Configuration parameters
+ * @rhlist: True if this is an rhltable
+ * @run_work: Deferred worker to expand/shrink asynchronously
+ * @mutex: Mutex to protect current/future table swapping
+ * @lock: Spin lock to protect walker list
+ * @nelems: Number of elements in table
+ */
+struct rhashtable {
+	struct bucket_table __rcu	*tbl;
+	unsigned int			key_len;
+	unsigned int			max_elems;
+	struct rhashtable_params	p;
+	bool				rhlist;
+	struct work_struct		run_work;
+	struct mutex                    mutex;
+	spinlock_t			lock;
+	atomic_t			nelems;
+};
+
+/**
+ * struct rhltable - Hash table with duplicate objects in a list
+ * @ht: Underlying rhtable
+ */
+struct rhltable {
+	struct rhashtable ht;
+};
+
+/**
+ * struct rhashtable_walker - Hash table walker
+ * @list: List entry on list of walkers
+ * @tbl: The table that we were walking over
+ */
+struct rhashtable_walker {
+	struct list_head list;
+	struct bucket_table *tbl;
+};
+
+/**
+ * struct rhashtable_iter - Hash table iterator
+ * @ht: Table to iterate through
+ * @p: Current pointer
+ * @list: Current hash list pointer
+ * @walker: Associated rhashtable walker
+ * @slot: Current slot
+ * @skip: Number of entries to skip in slot
+ */
+struct rhashtable_iter {
+	struct rhashtable *ht;
+	struct rhash_head *p;
+	struct rhlist_head *list;
+	struct rhashtable_walker walker;
+	unsigned int slot;
+	unsigned int skip;
+	bool end_of_table;
+};
+
+int rhashtable_init(struct rhashtable *ht,
+		    const struct rhashtable_params *params);
+int rhltable_init(struct rhltable *hlt,
+		  const struct rhashtable_params *params);
+
+#endif /* _LINUX_RHASHTABLE_TYPES_H */
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 4e1f535c2034..48754ab07cdf 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 /*
  * Resizable, Scalable, Concurrent Hash Table
  *
@@ -17,16 +18,14 @@
 #ifndef _LINUX_RHASHTABLE_H
 #define _LINUX_RHASHTABLE_H
 
-#include <linux/atomic.h>
-#include <linux/compiler.h>
 #include <linux/err.h>
 #include <linux/errno.h>
 #include <linux/jhash.h>
 #include <linux/list_nulls.h>
 #include <linux/workqueue.h>
-#include <linux/mutex.h>
 #include <linux/rculist.h>
 
+#include <linux/rhashtable-types.h>
 /*
  * The end of the chain is marked with a special nulls marks which has
  * the following format:
@@ -64,15 +63,6 @@
  */
 #define RHT_ELASTICITY	16u
 
-struct rhash_head {
-	struct rhash_head __rcu		*next;
-};
-
-struct rhlist_head {
-	struct rhash_head		rhead;
-	struct rhlist_head __rcu	*next;
-};
-
 /**
  * struct bucket_table - Table of hash buckets
  * @size: Number of hash buckets
@@ -102,114 +92,6 @@ struct bucket_table {
 	struct rhash_head __rcu *buckets[] ____cacheline_aligned_in_smp;
 };
 
-/**
- * struct rhashtable_compare_arg - Key for the function rhashtable_compare
- * @ht: Hash table
- * @key: Key to compare against
- */
-struct rhashtable_compare_arg {
-	struct rhashtable *ht;
-	const void *key;
-};
-
-typedef u32 (*rht_hashfn_t)(const void *data, u32 len, u32 seed);
-typedef u32 (*rht_obj_hashfn_t)(const void *data, u32 len, u32 seed);
-typedef int (*rht_obj_cmpfn_t)(struct rhashtable_compare_arg *arg,
-			       const void *obj);
-
-struct rhashtable;
-
-/**
- * struct rhashtable_params - Hash table construction parameters
- * @nelem_hint: Hint on number of elements, should be 75% of desired size
- * @key_len: Length of key
- * @key_offset: Offset of key in struct to be hashed
- * @head_offset: Offset of rhash_head in struct to be hashed
- * @max_size: Maximum size while expanding
- * @min_size: Minimum size while shrinking
- * @locks_mul: Number of bucket locks to allocate per cpu (default: 32)
- * @automatic_shrinking: Enable automatic shrinking of tables
- * @nulls_base: Base value to generate nulls marker
- * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
- * @obj_hashfn: Function to hash object
- * @obj_cmpfn: Function to compare key with object
- */
-struct rhashtable_params {
-	u16			nelem_hint;
-	u16			key_len;
-	u16			key_offset;
-	u16			head_offset;
-	unsigned int		max_size;
-	u16			min_size;
-	bool			automatic_shrinking;
-	u8			locks_mul;
-	u32			nulls_base;
-	rht_hashfn_t		hashfn;
-	rht_obj_hashfn_t	obj_hashfn;
-	rht_obj_cmpfn_t		obj_cmpfn;
-};
-
-/**
- * struct rhashtable - Hash table handle
- * @tbl: Bucket table
- * @key_len: Key length for hashfn
- * @max_elems: Maximum number of elements in table
- * @p: Configuration parameters
- * @rhlist: True if this is an rhltable
- * @run_work: Deferred worker to expand/shrink asynchronously
- * @mutex: Mutex to protect current/future table swapping
- * @lock: Spin lock to protect walker list
- * @nelems: Number of elements in table
- */
-struct rhashtable {
-	struct bucket_table __rcu	*tbl;
-	unsigned int			key_len;
-	unsigned int			max_elems;
-	struct rhashtable_params	p;
-	bool				rhlist;
-	struct work_struct		run_work;
-	struct mutex                    mutex;
-	spinlock_t			lock;
-	atomic_t			nelems;
-};
-
-/**
- * struct rhltable - Hash table with duplicate objects in a list
- * @ht: Underlying rhtable
- */
-struct rhltable {
-	struct rhashtable ht;
-};
-
-/**
- * struct rhashtable_walker - Hash table walker
- * @list: List entry on list of walkers
- * @tbl: The table that we were walking over
- */
-struct rhashtable_walker {
-	struct list_head list;
-	struct bucket_table *tbl;
-};
-
-/**
- * struct rhashtable_iter - Hash table iterator
- * @ht: Table to iterate through
- * @p: Current pointer
- * @list: Current hash list pointer
- * @walker: Associated rhashtable walker
- * @slot: Current slot
- * @skip: Number of entries to skip in slot
- */
-struct rhashtable_iter {
-	struct rhashtable *ht;
-	struct rhash_head *p;
-	struct rhlist_head *list;
-	struct rhashtable_walker walker;
-	unsigned int slot;
-	unsigned int skip;
-	bool end_of_table;
-};
-
 static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
 {
 	return NULLS_MARKER(ht->p.nulls_base + hash);
@@ -376,11 +258,6 @@ static inline int lockdep_rht_bucket_is_held(const struct bucket_table *tbl,
 }
 #endif /* CONFIG_PROVE_LOCKING */
 
-int rhashtable_init(struct rhashtable *ht,
-		    const struct rhashtable_params *params);
-int rhltable_init(struct rhltable *hlt,
-		  const struct rhashtable_params *params);
-
 void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
 			     struct rhash_head *obj);
 
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index ed07e3786d98..f4272a29dc44 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -2,7 +2,7 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
-#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
 
 struct netns_frags {
 	/* sysctls */
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 833752dd0c58..3bb75491482f 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -4,7 +4,7 @@
 #include <linux/in.h>
 #include <linux/in6.h>
 #include <linux/netdevice.h>
-#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
 #include <linux/rcupdate.h>
 #include <net/dst.h>
 
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index a0ec462bc1a9..e5ac430c8717 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -48,7 +48,7 @@
 #define __sctp_structs_h__
 
 #include <linux/ktime.h>
-#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
 #include <linux/socket.h>	/* linux/in.h needs this!!    */
 #include <linux/in.h>		/* We get struct sockaddr_in. */
 #include <linux/in6.h>		/* We get struct in6_addr     */
diff --git a/include/net/seg6.h b/include/net/seg6.h
index 099bad59dc90..85ee569c8306 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -18,7 +18,7 @@
 #include <linux/ipv6.h>
 #include <net/lwtunnel.h>
 #include <linux/seg6.h>
-#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
 
 static inline void update_csum_diff4(struct sk_buff *skb, __be32 from,
 				     __be32 to)
diff --git a/include/net/seg6_hmac.h b/include/net/seg6_hmac.h
index 69c3a106056b..7fda469e2758 100644
--- a/include/net/seg6_hmac.h
+++ b/include/net/seg6_hmac.h
@@ -22,7 +22,7 @@
 #include <linux/route.h>
 #include <net/seg6.h>
 #include <linux/seg6_hmac.h>
-#include <linux/rhashtable.h>
+#include <linux/rhashtable-types.h>
 
 #define SEG6_HMAC_MAX_DIGESTSIZE	160
 #define SEG6_HMAC_RING_SIZE		256
diff --git a/ipc/msg.c b/ipc/msg.c
index 56fd1c73eedc..8fa2cac6830a 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -38,6 +38,7 @@
 #include <linux/rwsem.h>
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
+#include <linux/rhashtable.h>
 
 #include <asm/current.h>
 #include <linux/uaccess.h>
diff --git a/ipc/sem.c b/ipc/sem.c
index 06be75d9217a..f0c4a0ed91c7 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -84,6 +84,7 @@
 #include <linux/nsproxy.h>
 #include <linux/ipc_namespace.h>
 #include <linux/sched/wake_q.h>
+#include <linux/rhashtable.h>
 
 #include <linux/uaccess.h>
 #include "util.h"
diff --git a/ipc/shm.c b/ipc/shm.c
index d73269381ec7..5824a7f3253e 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -43,6 +43,7 @@
 #include <linux/nsproxy.h>
 #include <linux/mount.h>
 #include <linux/ipc_namespace.h>
+#include <linux/rhashtable.h>
 
 #include <linux/uaccess.h>
 
diff --git a/ipc/util.c b/ipc/util.c
index 4e81182fa0ac..fdffff41f65b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -63,6 +63,7 @@
 #include <linux/rwsem.h>
 #include <linux/memory.h>
 #include <linux/ipc_namespace.h>
+#include <linux/rhashtable.h>
 
 #include <asm/unistd.h>
 
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 9427b5766134..c9fafea7dc6e 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -28,6 +28,7 @@
 #include <linux/rhashtable.h>
 #include <linux/err.h>
 #include <linux/export.h>
+#include <linux/rhashtable.h>
 
 #define HASH_DEFAULT_SIZE	64UL
 #define HASH_MIN_SIZE		4U
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index c9e35b81d093..316518f87294 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -20,6 +20,7 @@
 #include <linux/skbuff.h>
 #include <linux/rtnetlink.h>
 #include <linux/slab.h>
+#include <linux/rhashtable.h>
 
 #include <net/sock.h>
 #include <net/inet_frag.h>
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 2fb4de3f7f66..adbc3d3a560b 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -60,6 +60,7 @@
 #include <linux/netfilter_ipv4.h>
 #include <linux/compat.h>
 #include <linux/export.h>
+#include <linux/rhashtable.h>
 #include <net/ip_tunnels.h>
 #include <net/checksum.h>
 #include <net/netlink.h>
diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 30221701614c..4f39b27b5084 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -2,6 +2,7 @@
  * Common logic shared by IPv4 [ipmr] and IPv6 [ip6mr] implementation
  */
 
+#include <linux/rhashtable.h>
 #include <linux/mroute_base.h>
 
 /* Sets everything common except 'dev', since that is done under locking */
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 298fd8b6ed17..dfb339f79bde 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -32,6 +32,7 @@
 #include <linux/seq_file.h>
 #include <linux/init.h>
 #include <linux/compat.h>
+#include <linux/rhashtable.h>
 #include <net/protocol.h>
 #include <linux/skbuff.h>
 #include <net/raw.h>
diff --git a/net/ipv6/seg6.c b/net/ipv6/seg6.c
index 7f5621d09571..e24a91c7892a 100644
--- a/net/ipv6/seg6.c
+++ b/net/ipv6/seg6.c
@@ -17,6 +17,7 @@
 #include <linux/net.h>
 #include <linux/in6.h>
 #include <linux/slab.h>
+#include <linux/rhashtable.h>
 
 #include <net/ipv6.h>
 #include <net/protocol.h>
diff --git a/net/ipv6/seg6_hmac.c b/net/ipv6/seg6_hmac.c
index 33fb35cbfac1..b1791129a875 100644
--- a/net/ipv6/seg6_hmac.c
+++ b/net/ipv6/seg6_hmac.c
@@ -22,6 +22,7 @@
 #include <linux/icmpv6.h>
 #include <linux/mroute6.h>
 #include <linux/slab.h>
+#include <linux/rhashtable.h>
 
 #include <linux/netfilter.h>
 #include <linux/netfilter_ipv6.h>
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 91e80aa852d6..85077c2f3379 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -14,6 +14,7 @@
 #include <linux/skbuff.h>
 #include <linux/netlink.h>
 #include <linux/vmalloc.h>
+#include <linux/rhashtable.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nfnetlink.h>
 #include <linux/netfilter/nf_tables.h>
diff --git a/net/sctp/input.c b/net/sctp/input.c
index ba8a6e6c36fa..9bbc5f92c941 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -56,6 +56,7 @@
 #include <net/sctp/sm.h>
 #include <net/sctp/checksum.h>
 #include <net/net_namespace.h>
+#include <linux/rhashtable.h>
 
 /* Forward declarations for internal helpers. */
 static int sctp_rcv_ootb(struct sk_buff *);
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index ae7e7c606f72..0adce7b22675 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -66,6 +66,7 @@
 #include <linux/slab.h>
 #include <linux/file.h>
 #include <linux/compat.h>
+#include <linux/rhashtable.h>
 
 #include <net/ip.h>
 #include <net/icmp.h>

^ permalink raw reply related

* [PATCH 03/18] rhashtable: remove nulls_base and related code.
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

This "feature" is unused, undocumented, and untested and so
doesn't really belong.  Next patch will introduce support
to detect when a search gets diverted down a different chain,
which the common purpose of nulls markers.

This patch actually fixes a bug too.  The table resizing allows a
table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
any growth beyond 2^27 is wasteful an ineffective.

This patch results in NULLS_MARKER(0) being used for all chains,
and leaves the use of rht_is_a_null() to test for it.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable-types.h |    2 --
 include/linux/rhashtable.h       |   33 +++------------------------------
 lib/rhashtable.c                 |    8 --------
 lib/test_rhashtable.c            |    5 +----
 4 files changed, 4 insertions(+), 44 deletions(-)

diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index 9740063ff13b..763d613ce2c2 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -50,7 +50,6 @@ typedef int (*rht_obj_cmpfn_t)(struct rhashtable_compare_arg *arg,
  * @min_size: Minimum size while shrinking
  * @locks_mul: Number of bucket locks to allocate per cpu (default: 32)
  * @automatic_shrinking: Enable automatic shrinking of tables
- * @nulls_base: Base value to generate nulls marker
  * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
  * @obj_hashfn: Function to hash object
  * @obj_cmpfn: Function to compare key with object
@@ -64,7 +63,6 @@ struct rhashtable_params {
 	u16			min_size;
 	bool			automatic_shrinking;
 	u8			locks_mul;
-	u32			nulls_base;
 	rht_hashfn_t		hashfn;
 	rht_obj_hashfn_t	obj_hashfn;
 	rht_obj_cmpfn_t		obj_cmpfn;
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 48754ab07cdf..d9f719af7936 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -28,25 +28,8 @@
 #include <linux/rhashtable-types.h>
 /*
  * The end of the chain is marked with a special nulls marks which has
- * the following format:
- *
- * +-------+-----------------------------------------------------+-+
- * | Base  |                      Hash                           |1|
- * +-------+-----------------------------------------------------+-+
- *
- * Base (4 bits) : Reserved to distinguish between multiple tables.
- *                 Specified via &struct rhashtable_params.nulls_base.
- * Hash (27 bits): Full hash (unmasked) of first element added to bucket
- * 1 (1 bit)     : Nulls marker (always set)
- *
- * The remaining bits of the next pointer remain unused for now.
+ * the least significant bit set.
  */
-#define RHT_BASE_BITS		4
-#define RHT_HASH_BITS		27
-#define RHT_BASE_SHIFT		RHT_HASH_BITS
-
-/* Base bits plus 1 bit for nulls marker */
-#define RHT_HASH_RESERVED_SPACE	(RHT_BASE_BITS + 1)
 
 /* Maximum chain length before rehash
  *
@@ -92,24 +75,14 @@ struct bucket_table {
 	struct rhash_head __rcu *buckets[] ____cacheline_aligned_in_smp;
 };
 
-static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
-{
-	return NULLS_MARKER(ht->p.nulls_base + hash);
-}
-
 #define INIT_RHT_NULLS_HEAD(ptr, ht, hash) \
-	((ptr) = (typeof(ptr)) rht_marker(ht, hash))
+	((ptr) = (typeof(ptr)) NULLS_MARKER(0))
 
 static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
 {
 	return ((unsigned long) ptr & 1);
 }
 
-static inline unsigned long rht_get_nulls_value(const struct rhash_head *ptr)
-{
-	return ((unsigned long) ptr) >> 1;
-}
-
 static inline void *rht_obj(const struct rhashtable *ht,
 			    const struct rhash_head *he)
 {
@@ -119,7 +92,7 @@ static inline void *rht_obj(const struct rhashtable *ht,
 static inline unsigned int rht_bucket_index(const struct bucket_table *tbl,
 					    unsigned int hash)
 {
-	return (hash >> RHT_HASH_RESERVED_SPACE) & (tbl->size - 1);
+	return hash & (tbl->size - 1);
 }
 
 static inline unsigned int rht_key_get_hash(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index c9fafea7dc6e..688693c919be 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -995,7 +995,6 @@ static u32 rhashtable_jhash2(const void *key, u32 length, u32 seed)
  *	.key_offset = offsetof(struct test_obj, key),
  *	.key_len = sizeof(int),
  *	.hashfn = jhash,
- *	.nulls_base = (1U << RHT_BASE_SHIFT),
  * };
  *
  * Configuration Example 2: Variable length keys
@@ -1029,9 +1028,6 @@ int rhashtable_init(struct rhashtable *ht,
 	    (params->obj_hashfn && !params->obj_cmpfn))
 		return -EINVAL;
 
-	if (params->nulls_base && params->nulls_base < (1U << RHT_BASE_SHIFT))
-		return -EINVAL;
-
 	memset(ht, 0, sizeof(*ht));
 	mutex_init(&ht->mutex);
 	spin_lock_init(&ht->lock);
@@ -1096,10 +1092,6 @@ int rhltable_init(struct rhltable *hlt, const struct rhashtable_params *params)
 {
 	int err;
 
-	/* No rhlist NULLs marking for now. */
-	if (params->nulls_base)
-		return -EINVAL;
-
 	err = rhashtable_init(&hlt->ht, params);
 	hlt->ht.rhlist = true;
 	return err;
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index bf92b7aa2a49..b428a9c7522a 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -83,7 +83,7 @@ static u32 my_hashfn(const void *data, u32 len, u32 seed)
 {
 	const struct test_obj_rhl *obj = data;
 
-	return (obj->value.id % 10) << RHT_HASH_RESERVED_SPACE;
+	return (obj->value.id % 10);
 }
 
 static int my_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
@@ -99,7 +99,6 @@ static struct rhashtable_params test_rht_params = {
 	.key_offset = offsetof(struct test_obj, value),
 	.key_len = sizeof(struct test_obj_val),
 	.hashfn = jhash,
-	.nulls_base = (3U << RHT_BASE_SHIFT),
 };
 
 static struct rhashtable_params test_rht_params_dup = {
@@ -294,8 +293,6 @@ static int __init test_rhltable(unsigned int entries)
 	if (!obj_in_table)
 		goto out_free;
 
-	/* nulls_base not supported in rhlist interface */
-	test_rht_params.nulls_base = 0;
 	err = rhltable_init(&rhlt, &test_rht_params);
 	if (WARN_ON(err))
 		goto out_free;

^ permalink raw reply related

* [PATCH 04/18] rhashtable: detect when object movement might have invalidated a lookup
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

Some users of rhashtable might need to change the key
of an object and move it to a different location in the table.
Other users might want to allocate objects using
SLAB_TYPESAFE_BY_RCU which can result in the same memory allocation
being used for a different (type-compatible) purpose and similarly
end up in a different hash-chain.

To support these, we store a unique NULLS_MARKER at the end of
each chain, and when a search fails to find a match, we check
if the NULLS marker found was the expected one.  If not,
the search is repeated.

The unique NULLS_MARKER is derived from the address of the
head of the chain.

If an object is removed and re-added to the same hash chain, we won't
notice by looking that the NULLS marker.  In this case we must be sure
that it was not re-added *after* its original location, or a lookup may
incorrectly fail.  The easiest solution is to ensure it is inserted at
the start of the chain.  insert_slow() already does that,
insert_fast() does not.  So this patch changes insert_fast to always
insert at the head of the chain.

Note that such a user must do their own double-checking of
the object found by rhashtable_lookup_fast() after ensuring
mutual exclusion which anything that might change the key, such as
successfully taking a new reference.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |   35 +++++++++++++++++++++++------------
 lib/rhashtable.c           |    8 +++++---
 2 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index d9f719af7936..25d839881ae5 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -75,8 +75,10 @@ struct bucket_table {
 	struct rhash_head __rcu *buckets[] ____cacheline_aligned_in_smp;
 };
 
+#define	RHT_NULLS_MARKER(ptr)	\
+	((void *)NULLS_MARKER(((unsigned long) (ptr)) >> 1))
 #define INIT_RHT_NULLS_HEAD(ptr, ht, hash) \
-	((ptr) = (typeof(ptr)) NULLS_MARKER(0))
+	((ptr) = RHT_NULLS_MARKER(&(ptr)))
 
 static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
 {
@@ -471,6 +473,7 @@ static inline struct rhash_head *__rhashtable_lookup(
 		.ht = ht,
 		.key = key,
 	};
+	struct rhash_head __rcu * const *head;
 	struct bucket_table *tbl;
 	struct rhash_head *he;
 	unsigned int hash;
@@ -478,13 +481,19 @@ static inline struct rhash_head *__rhashtable_lookup(
 	tbl = rht_dereference_rcu(ht->tbl, ht);
 restart:
 	hash = rht_key_hashfn(ht, tbl, key, params);
-	rht_for_each_rcu(he, tbl, hash) {
-		if (params.obj_cmpfn ?
-		    params.obj_cmpfn(&arg, rht_obj(ht, he)) :
-		    rhashtable_compare(&arg, rht_obj(ht, he)))
-			continue;
-		return he;
-	}
+	head = rht_bucket(tbl, hash);
+	do {
+		rht_for_each_rcu_continue(he, *head, tbl, hash) {
+			if (params.obj_cmpfn ?
+			    params.obj_cmpfn(&arg, rht_obj(ht, he)) :
+			    rhashtable_compare(&arg, rht_obj(ht, he)))
+				continue;
+			return he;
+		}
+		/* An object might have been moved to a different hash chain,
+		 * while we walk along it - better check and retry.
+		 */
+	} while (he != RHT_NULLS_MARKER(head));
 
 	/* Ensure we see any new tables. */
 	smp_rmb();
@@ -580,6 +589,7 @@ static inline void *__rhashtable_insert_fast(
 		.ht = ht,
 		.key = key,
 	};
+	struct rhash_head __rcu **headp;
 	struct rhash_head __rcu **pprev;
 	struct bucket_table *tbl;
 	struct rhash_head *head;
@@ -603,12 +613,13 @@ static inline void *__rhashtable_insert_fast(
 	}
 
 	elasticity = RHT_ELASTICITY;
-	pprev = rht_bucket_insert(ht, tbl, hash);
+	headp = rht_bucket_insert(ht, tbl, hash);
+	pprev = headp;
 	data = ERR_PTR(-ENOMEM);
 	if (!pprev)
 		goto out;
 
-	rht_for_each_continue(head, *pprev, tbl, hash) {
+	rht_for_each_continue(head, *headp, tbl, hash) {
 		struct rhlist_head *plist;
 		struct rhlist_head *list;
 
@@ -648,7 +659,7 @@ static inline void *__rhashtable_insert_fast(
 	if (unlikely(rht_grow_above_100(ht, tbl)))
 		goto slow_path;
 
-	head = rht_dereference_bucket(*pprev, tbl, hash);
+	head = rht_dereference_bucket(*headp, tbl, hash);
 
 	RCU_INIT_POINTER(obj->next, head);
 	if (rhlist) {
@@ -658,7 +669,7 @@ static inline void *__rhashtable_insert_fast(
 		RCU_INIT_POINTER(list->next, NULL);
 	}
 
-	rcu_assign_pointer(*pprev, obj);
+	rcu_assign_pointer(*headp, obj);
 
 	atomic_inc(&ht->nelems);
 	if (rht_grow_above_75(ht, tbl))
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 688693c919be..69f05cf9e9e8 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -1174,8 +1174,7 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
 					    unsigned int hash)
 {
 	const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
-	static struct rhash_head __rcu *rhnull =
-		(struct rhash_head __rcu *)NULLS_MARKER(0);
+	static struct rhash_head __rcu *rhnull;
 	unsigned int index = hash & ((1 << tbl->nest) - 1);
 	unsigned int size = tbl->size >> tbl->nest;
 	unsigned int subhash = hash;
@@ -1193,8 +1192,11 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
 		subhash >>= shift;
 	}
 
-	if (!ntbl)
+	if (!ntbl) {
+		if (!rhnull)
+			INIT_RHT_NULLS_HEAD(rhnull, NULL, 0);
 		return &rhnull;
+	}
 
 	return &ntbl[subhash].bucket;
 

^ permalink raw reply related

* [PATCH 06/18] rhashtable: simplify nested_table_alloc() and rht_bucket_nested_insert()
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

Now that we don't use the hash value or shift in nested_table_alloc()
there is room for simplification.
We only need to pass a "is this a leaf" flag to nested_table_alloc(),
and don't need to track as much information in
rht_bucket_nested_insert().

Note there is another minor cleanup in nested_table_alloc() here.
The number of elements in a page of "union nested_tables" is most naturally

  PAGE_SIZE / sizeof(ntbl[0])

The previous code had

  PAGE_SIZE / sizeof(ntbl[0].bucket)

which happens to be the correct value only because the bucket uses all
the space in the union.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 lib/rhashtable.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 8582e1916c2d..e209069f1d74 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -116,7 +116,7 @@ static void bucket_table_free_rcu(struct rcu_head *head)
 
 static union nested_table *nested_table_alloc(struct rhashtable *ht,
 					      union nested_table __rcu **prev,
-					      unsigned int shifted)
+					      bool leaf)
 {
 	union nested_table *ntbl;
 	int i;
@@ -127,8 +127,8 @@ static union nested_table *nested_table_alloc(struct rhashtable *ht,
 
 	ntbl = kzalloc(PAGE_SIZE, GFP_ATOMIC);
 
-	if (ntbl && shifted) {
-		for (i = 0; i < PAGE_SIZE / sizeof(ntbl[0].bucket); i++)
+	if (ntbl && leaf) {
+		for (i = 0; i < PAGE_SIZE / sizeof(ntbl[0]); i++)
 			INIT_RHT_NULLS_HEAD(ntbl[i].bucket);
 	}
 
@@ -155,7 +155,7 @@ static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht,
 		return NULL;
 
 	if (!nested_table_alloc(ht, (union nested_table __rcu **)tbl->buckets,
-				0)) {
+				false)) {
 		kfree(tbl);
 		return NULL;
 	}
@@ -1209,24 +1209,18 @@ struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
 	unsigned int index = hash & ((1 << tbl->nest) - 1);
 	unsigned int size = tbl->size >> tbl->nest;
 	union nested_table *ntbl;
-	unsigned int shifted;
-	unsigned int nhash;
 
 	ntbl = (union nested_table *)rcu_dereference_raw(tbl->buckets[0]);
 	hash >>= tbl->nest;
-	nhash = index;
-	shifted = tbl->nest;
 	ntbl = nested_table_alloc(ht, &ntbl[index].table,
-				  size <= (1 << shift) ? shifted : 0);
+				  size <= (1 << shift));
 
 	while (ntbl && size > (1 << shift)) {
 		index = hash & ((1 << shift) - 1);
 		size >>= shift;
 		hash >>= shift;
-		nhash |= index << shifted;
-		shifted += shift;
 		ntbl = nested_table_alloc(ht, &ntbl[index].table,
-					  size <= (1 << shift) ? shifted : 0);
+					  size <= (1 << shift));
 	}
 
 	if (!ntbl)

^ permalink raw reply related

* [PATCH 07/18] rhashtable: use cmpxchg() to protect ->future_tbl.
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

Rather than borrowing one of the bucket locks to
protect ->future_tbl updates, use cmpxchg().
This gives more freedom to change how bucket locking
is implemented.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 lib/rhashtable.c |   15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index e209069f1d74..b5d17bce19ff 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -297,21 +297,14 @@ static int rhashtable_rehash_attach(struct rhashtable *ht,
 				    struct bucket_table *old_tbl,
 				    struct bucket_table *new_tbl)
 {
-	/* Protect future_tbl using the first bucket lock. */
-	spin_lock_bh(old_tbl->locks);
-
-	/* Did somebody beat us to it? */
-	if (rcu_access_pointer(old_tbl->future_tbl)) {
-		spin_unlock_bh(old_tbl->locks);
-		return -EEXIST;
-	}
-
 	/* Make insertions go into the new, empty table right away. Deletions
 	 * and lookups will be attempted in both tables until we synchronize.
+	 * As cmpxchg() provides strong barriers, we do not need
+	 * rcu_assign_pointer().
 	 */
-	rcu_assign_pointer(old_tbl->future_tbl, new_tbl);
 
-	spin_unlock_bh(old_tbl->locks);
+	if (cmpxchg(&old_tbl->future_tbl, NULL, new_tbl) != NULL)
+		return -EEXIST;
 
 	return 0;
 }

^ permalink raw reply related

* [PATCH 08/18] rhashtable: clean up dereference of ->future_tbl.
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

Using rht_dereference_bucket() to dereference
->future_tbl looks like a type error, and could be confusing.
Using rht_dereference_rcu() to test a pointer for NULL
adds an unnecessary barrier - rcu_access_pointer() is preferred
for NULL tests when no lock is held.

This uses 3 different ways to access ->future_tbl.
- if we know the mutex is held, use rht_dereference()
- if we don't hold the mutex, and are only testing for NULL,
  use rcu_access_pointer()
- otherwise (using RCU protection for true dereference),
  use rht_dereference_rcu().

Note that this includes a simplification of the call to
rhashtable_last_table() - we don't do an extra dereference
before the call any more.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |    2 +-
 lib/rhashtable.c           |    9 ++++-----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 49f80506d02e..10435a77b156 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -605,7 +605,7 @@ static inline void *__rhashtable_insert_fast(
 	lock = rht_bucket_lock(tbl, hash);
 	spin_lock_bh(lock);
 
-	if (unlikely(rht_dereference_bucket(tbl->future_tbl, tbl, hash))) {
+	if (unlikely(rcu_access_pointer(tbl->future_tbl))) {
 slow_path:
 		spin_unlock_bh(lock);
 		rcu_read_unlock();
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index b5d17bce19ff..1737fbd049da 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -226,8 +226,7 @@ static struct bucket_table *rhashtable_last_table(struct rhashtable *ht,
 static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 {
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
-	struct bucket_table *new_tbl = rhashtable_last_table(ht,
-		rht_dereference_rcu(old_tbl->future_tbl, ht));
+	struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl);
 	struct rhash_head __rcu **pprev = rht_bucket_var(old_tbl, old_hash);
 	int err = -EAGAIN;
 	struct rhash_head *head, *next, *entry;
@@ -467,7 +466,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 
 fail:
 	/* Do not fail the insert if someone else did a rehash. */
-	if (likely(rcu_dereference_raw(tbl->future_tbl)))
+	if (likely(rcu_access_pointer(tbl->future_tbl)))
 		return 0;
 
 	/* Schedule async rehash to retry allocation in process context. */
@@ -540,7 +539,7 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 	if (PTR_ERR(data) != -EAGAIN && PTR_ERR(data) != -ENOENT)
 		return ERR_CAST(data);
 
-	new_tbl = rcu_dereference(tbl->future_tbl);
+	new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
 	if (new_tbl)
 		return new_tbl;
 
@@ -599,7 +598,7 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
 			break;
 
 		spin_unlock_bh(lock);
-		tbl = rcu_dereference(tbl->future_tbl);
+		tbl = rht_dereference_rcu(tbl->future_tbl, ht);
 	}
 
 	data = rhashtable_lookup_one(ht, tbl, hash, key, obj);

^ permalink raw reply related

* [PATCH 09/18] rhashtable: use cmpxchg() in nested_table_alloc()
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

nested_table_alloc() relies on the fact that there is
at most one spinlock allocated for every slot in the top
level nested table, so it is not possible for two threads
to try to allocate the same table at the same time.

A future patch will change the locking and invalidate
this assumption.  So change the code to protect against
a race using cmpxchg() - if it loses, it frees the table
that it allocated.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 lib/rhashtable.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 1737fbd049da..86c801d04d4a 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -132,9 +132,11 @@ static union nested_table *nested_table_alloc(struct rhashtable *ht,
 			INIT_RHT_NULLS_HEAD(ntbl[i].bucket);
 	}
 
-	rcu_assign_pointer(*prev, ntbl);
-
-	return ntbl;
+	if (cmpxchg(prev, NULL, ntbl) == NULL)
+		return ntbl;
+	/* Raced with another thread. */
+	kfree(ntbl);
+	return rcu_dereference(*prev);
 }
 
 static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht,

^ permalink raw reply related

* [PATCH 10/18] rhashtable: remove rhashtable_walk_peek()
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

This function has a somewhat confused behavior that is not properly
described by the documentation.
Sometimes is returns the previous object, sometimes it returns the
next one.
Sometimes it changes the iterator, sometimes it doesn't.

This function is not currently used and is not worth keeping, so
remove it.

A future patch will introduce a new function with a
simpler interface which can meet the same need that
this was added for.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |    1 -
 lib/rhashtable.c           |   34 ----------------------------------
 2 files changed, 35 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 10435a77b156..4e2fc97667e0 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -247,7 +247,6 @@ static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
 }
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
-void *rhashtable_walk_peek(struct rhashtable_iter *iter);
 void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
 
 void rhashtable_free_and_destroy(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 86c801d04d4a..120382b0bad0 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -891,40 +891,6 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
-/**
- * rhashtable_walk_peek - Return the next object but don't advance the iterator
- * @iter:	Hash table iterator
- *
- * Returns the next object or NULL when the end of the table is reached.
- *
- * Returns -EAGAIN if resize event occurred.  Note that the iterator
- * will rewind back to the beginning and you may continue to use it.
- */
-void *rhashtable_walk_peek(struct rhashtable_iter *iter)
-{
-	struct rhlist_head *list = iter->list;
-	struct rhashtable *ht = iter->ht;
-	struct rhash_head *p = iter->p;
-
-	if (p)
-		return rht_obj(ht, ht->rhlist ? &list->rhead : p);
-
-	/* No object found in current iter, find next one in the table. */
-
-	if (iter->skip) {
-		/* A nonzero skip value points to the next entry in the table
-		 * beyond that last one that was found. Decrement skip so
-		 * we find the current value. __rhashtable_walk_find_next
-		 * will restore the original value of skip assuming that
-		 * the table hasn't changed.
-		 */
-		iter->skip--;
-	}
-
-	return __rhashtable_walk_find_next(iter);
-}
-EXPORT_SYMBOL_GPL(rhashtable_walk_peek);
-
 /**
  * rhashtable_walk_stop - Finish a hash table walk
  * @iter:	Hash table iterator

^ permalink raw reply related

* [PATCH 12/18] rhashtable: add rhashtable_walk_prev()
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

rhashtable_walk_prev() returns the object returned by
the previous rhashtable_walk_next(), providing it is still in the
table (or was during this grace period).
This works even if rhashtable_walk_stop() and rhashtable_talk_start()
have been called since the last rhashtable_walk_next().

If there have been no calls to rhashtable_walk_next(), or if the
object is gone from the table, then NULL is returned.

This can usefully be used in a seq_file ->start() function.
If the pos is the same as was returned by the last ->next() call,
then rhashtable_walk_prev() can be used to re-establish the
current location in the table.  If it returns NULL, then
rhashtable_walk_next() should be used.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |    1 +
 lib/rhashtable.c           |   31 +++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index c612d2446d90..1b70d690ab65 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -247,6 +247,7 @@ static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
 }
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
+void *rhashtable_walk_prev(struct rhashtable_iter *iter);
 void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
 
 void rhashtable_free_and_destroy(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 5ab0f4825271..a185f5a8b34d 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -915,6 +915,37 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
+/**
+ * rhashtable_walk_prev - Return the previously returned object, if available
+ * @iter:	Hash table iterator
+ *
+ * If rhashtable_walk_next() has previously been called and the object
+ * it returned is still in the hash table, that object is returned again,
+ * otherwise %NULL is returned.
+ *
+ * If the recent rhashtable_walk_next() call was since the most recent
+ * rhashtable_walk_start() call then the returned object may not, strictly
+ * speaking, still be in the table.  It will be safe to dereference.
+ *
+ * Note that the iterator is not changed and in particular it does not
+ * step backwards.
+ */
+void *rhashtable_walk_prev(struct rhashtable_iter *iter)
+{
+	struct rhashtable *ht = iter->ht;
+	struct rhash_head *p = iter->p;
+
+	if (!p)
+		return NULL;
+	if (!iter->p_is_unsafe || ht->rhlist)
+		return p;
+	rht_for_each_rcu(p, iter->walker.tbl, iter->slot)
+		if (p == iter->p)
+			return p;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_prev);
+
 /**
  * rhashtable_walk_stop - Finish a hash table walk
  * @iter:	Hash table iterator

^ permalink raw reply related

* [PATCH 13/18] rhashtable: don't hold lock on first table throughout insertion.
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

rhashtable_try_insert() currently hold a lock on the bucket in
the first table, while also locking buckets in subsequent tables.
This is unnecessary and looks like a hold-over from some earlier
version of the implementation.

As insert and remove always lock a bucket in each table in turn, and
as insert only inserts in the final table, there cannot be any races
that are not covered by simply locking a bucket in each table in turn.

When an insert call reaches that last table it can be sure that there
is no match entry in any other table as it has searched them all, and
insertion never happens anywhere but in the last table.  The fact that
code tests for the existence of future_tbl while holding a lock on
the relevant bucket ensures that two threads inserting the same key
will make compatible decisions about which is the "last" table.

This simplifies the code and allows the ->rehash field to be
discarded.

We still need a way to ensure that a dead bucket_table is never
re-linked by rhashtable_walk_stop().  This can be achieved by
calling call_rcu() inside the locked region, and checking
->rcu.func in rhashtable_walk_stop().  If it is not NULL, then
the bucket table is empty and dead.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |   13 -------------
 lib/rhashtable.c           |   44 +++++++++++---------------------------------
 2 files changed, 11 insertions(+), 46 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1b70d690ab65..5f0511bd5a39 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -63,7 +63,6 @@
 struct bucket_table {
 	unsigned int		size;
 	unsigned int		nest;
-	unsigned int		rehash;
 	u32			hash_rnd;
 	unsigned int		locks_mask;
 	spinlock_t		*locks;
@@ -774,12 +773,6 @@ static inline int rhltable_insert(
  * @obj:	pointer to hash head inside object
  * @params:	hash table parameters
  *
- * Locks down the bucket chain in both the old and new table if a resize
- * is in progress to ensure that writers can't remove from the old table
- * and can't insert to the new table during the atomic operation of search
- * and insertion. Searches for duplicates in both the old and new table if
- * a resize is in progress.
- *
  * This lookup function may only be used for fixed key hash table (key_len
  * parameter set). It will BUG() if used inappropriately.
  *
@@ -835,12 +828,6 @@ static inline void *rhashtable_lookup_get_insert_fast(
  * @obj:	pointer to hash head inside object
  * @params:	hash table parameters
  *
- * Locks down the bucket chain in both the old and new table if a resize
- * is in progress to ensure that writers can't remove from the old table
- * and can't insert to the new table during the atomic operation of search
- * and insertion. Searches for duplicates in both the old and new table if
- * a resize is in progress.
- *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  *
  * Will trigger an automatic deferred table resizing if residency in the
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index a185f5a8b34d..919eebd6757d 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -289,10 +289,9 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,
 	while (!(err = rhashtable_rehash_one(ht, old_hash)))
 		;
 
-	if (err == -ENOENT) {
-		old_tbl->rehash++;
+	if (err == -ENOENT)
 		err = 0;
-	}
+
 	spin_unlock_bh(old_bucket_lock);
 
 	return err;
@@ -339,13 +338,16 @@ static int rhashtable_rehash_table(struct rhashtable *ht)
 	spin_lock(&ht->lock);
 	list_for_each_entry(walker, &old_tbl->walkers, list)
 		walker->tbl = NULL;
-	spin_unlock(&ht->lock);
 
 	/* Wait for readers. All new readers will see the new
 	 * table, and thus no references to the old table will
 	 * remain.
+	 * We do this inside the locked region so that
+	 * rhashtable_walk_stop() can check ->rcu.func and know
+	 * not to re-link the table.
 	 */
 	call_rcu(&old_tbl->rcu, bucket_table_free_rcu);
+	spin_unlock(&ht->lock);
 
 	return rht_dereference(new_tbl->future_tbl, ht) ? -EAGAIN : 0;
 }
@@ -591,36 +593,14 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
 	struct bucket_table *new_tbl;
 	struct bucket_table *tbl;
 	unsigned int hash;
-	spinlock_t *lock;
 	void *data;
 
-	tbl = rcu_dereference(ht->tbl);
-
-	/* All insertions must grab the oldest table containing
-	 * the hashed bucket that is yet to be rehashed.
-	 */
-	for (;;) {
-		hash = rht_head_hashfn(ht, tbl, obj, ht->p);
-		lock = rht_bucket_lock(tbl, hash);
-		spin_lock_bh(lock);
-
-		if (tbl->rehash <= hash)
-			break;
-
-		spin_unlock_bh(lock);
-		tbl = rht_dereference_rcu(tbl->future_tbl, ht);
-	}
+	new_tbl = rcu_dereference(ht->tbl);
 
-	data = rhashtable_lookup_one(ht, tbl, hash, key, obj);
-	new_tbl = rhashtable_insert_one(ht, tbl, hash, obj, data);
-	if (PTR_ERR(new_tbl) != -EEXIST)
-		data = ERR_CAST(new_tbl);
-
-	while (!IS_ERR_OR_NULL(new_tbl)) {
+	do {
 		tbl = new_tbl;
 		hash = rht_head_hashfn(ht, tbl, obj, ht->p);
-		spin_lock_nested(rht_bucket_lock(tbl, hash),
-				 SINGLE_DEPTH_NESTING);
+		spin_lock(rht_bucket_lock(tbl, hash));
 
 		data = rhashtable_lookup_one(ht, tbl, hash, key, obj);
 		new_tbl = rhashtable_insert_one(ht, tbl, hash, obj, data);
@@ -628,9 +608,7 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
 			data = ERR_CAST(new_tbl);
 
 		spin_unlock(rht_bucket_lock(tbl, hash));
-	}
-
-	spin_unlock_bh(lock);
+	} while (!IS_ERR_OR_NULL(new_tbl));
 
 	if (PTR_ERR(data) == -EAGAIN)
 		data = ERR_PTR(rhashtable_insert_rehash(ht, tbl) ?:
@@ -965,7 +943,7 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
 	ht = iter->ht;
 
 	spin_lock(&ht->lock);
-	if (tbl->rehash < tbl->size)
+	if (tbl->rcu.func == NULL)
 		list_add(&iter->walker.list, &tbl->walkers);
 	else
 		iter->walker.tbl = NULL;

^ permalink raw reply related

* [PATCH 15/18] rhashtable: use bit_spin_locks to protect hash bucket.
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

This patch changes rhashtables to use a bit_spin_lock (BIT(1))
the bucket pointer to lock the hash chain for that bucket.

The benefits of a bit spin_lock are:
 - no need to allocate a separate array of locks.
 - no need to have a configuration option to guide the
   choice of the size of this array
 - locking cost if often a single test-and-set in a cache line
   that will have to be loaded anyway.  When inserting at, or removing
   from, the head of the chain, the unlock is free - writing the new
   address in the bucket head implicitly clears the lock bit.
 - even when lockings costs 2 updates (lock and unlock), they are
   in a cacheline that needs to be read anyway.

The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.

Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend of the same lock, one CPU can
easily be starved.  This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.

As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable-types.h |    2 
 include/linux/rhashtable.h       |  190 +++++++++++++++++++++++++-------------
 ipc/util.c                       |    1 
 lib/rhashtable.c                 |  118 ++++++++++++------------
 net/bridge/br_fdb.c              |    1 
 net/bridge/br_vlan.c             |    1 
 net/bridge/br_vlan_tunnel.c      |    1 
 net/ipv4/ipmr.c                  |    1 
 net/ipv6/ip6mr.c                 |    1 
 9 files changed, 184 insertions(+), 132 deletions(-)

diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index bc3e84547ba7..39e5e1fb9b65 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -48,7 +48,6 @@ typedef int (*rht_obj_cmpfn_t)(struct rhashtable_compare_arg *arg,
  * @head_offset: Offset of rhash_head in struct to be hashed
  * @max_size: Maximum size while expanding
  * @min_size: Minimum size while shrinking
- * @locks_mul: Number of bucket locks to allocate per cpu (default: 32)
  * @automatic_shrinking: Enable automatic shrinking of tables
  * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
  * @obj_hashfn: Function to hash object
@@ -62,7 +61,6 @@ struct rhashtable_params {
 	unsigned int		max_size;
 	u16			min_size;
 	bool			automatic_shrinking;
-	u8			locks_mul;
 	rht_hashfn_t		hashfn;
 	rht_obj_hashfn_t	obj_hashfn;
 	rht_obj_cmpfn_t		obj_cmpfn;
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 81ca3ed2927b..8e4d483ddf22 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -24,6 +24,7 @@
 #include <linux/list_nulls.h>
 #include <linux/workqueue.h>
 #include <linux/rculist.h>
+#include <linux/bit_spinlock.h>
 
 #include <linux/rhashtable-types.h>
 /*
@@ -52,8 +53,6 @@
  * @nest: Number of bits of first-level nested table.
  * @rehash: Current bucket being rehashed
  * @hash_rnd: Random seed to fold into hash
- * @locks_mask: Mask to apply before accessing locks[]
- * @locks: Array of spinlocks protecting individual buckets
  * @walkers: List of active walkers
  * @rcu: RCU structure for freeing the table
  * @future_tbl: Table under construction during rehashing
@@ -64,8 +63,6 @@ struct bucket_table {
 	unsigned int		size;
 	unsigned int		nest;
 	u32			hash_rnd;
-	unsigned int		locks_mask;
-	spinlock_t		*locks;
 	struct list_head	walkers;
 	struct rcu_head		rcu;
 
@@ -74,6 +71,61 @@ struct bucket_table {
 	struct rhash_head __rcu *buckets[] ____cacheline_aligned_in_smp;
 };
 
+/*
+ * We lock a bucket by setting BIT(1) in the pointer - this is always
+ * zero in real pointers and in the nulls marker.
+ * bit_spin_locks do not handle contention well, but the whole point
+ * of the hashtable design is to achieve minimum per-bucket contention.
+ * A nested hash table might not have a bucket pointer.  In that case
+ * we cannot get a lock.  For remove and replace the bucket cannot be
+ * interesting and doesn't need locking.
+ * For insert we allocate the bucket if this is the last bucket_table,
+ * and then take the lock.
+ * Sometimes we unlock a bucket by writing a new pointer there.  In that
+ * case we don't need to unlock, but we do need to reset state such as
+ * local_bh. For that we have rht_unlocked().  This doesn't include
+ * the memory barrier that bit_spin_unlock() provides, but rcu_assign_pointer()
+ * will have provided that.
+ */
+
+static inline void rht_lock(struct rhash_head **bucket)
+{
+	local_bh_disable();
+	bit_spin_lock(1, (unsigned long *)bucket);
+}
+
+static inline void rht_unlock(struct rhash_head **bucket)
+{
+	bit_spin_unlock(1, (unsigned long *)bucket);
+	local_bh_enable();
+}
+
+static inline void rht_unlocked(void)
+{
+	preempt_enable();
+	__release(bitlock);
+	local_bh_enable();
+}
+
+/*
+ * If 'p' is a bucket head and might be locked, rht_ptr returns
+ * the address without the lock bit.
+ */
+static inline struct rhash_head __rcu *rht_ptr(const struct rhash_head *p)
+{
+	return (void *)(((unsigned long)p) & ~2UL);
+}
+
+static inline struct rhash_head __rcu *rht_ptr_locked(const struct rhash_head *p)
+{
+	return (void *)(((unsigned long)p) | 2UL);
+}
+
+static inline bool rht_is_locked(const struct rhash_head *p)
+{
+	return rht_ptr_locked(p) == p;
+}
+
 #define	RHT_NULLS_MARKER(ptr)	\
 	((void *)NULLS_MARKER(((unsigned long) (ptr)) >> 1))
 #define INIT_RHT_NULLS_HEAD(ptr)	\
@@ -197,25 +249,6 @@ static inline bool rht_grow_above_max(const struct rhashtable *ht,
 	return atomic_read(&ht->nelems) >= ht->max_elems;
 }
 
-/* The bucket lock is selected based on the hash and protects mutations
- * on a group of hash buckets.
- *
- * A maximum of tbl->size/2 bucket locks is allocated. This ensures that
- * a single lock always covers both buckets which may both contains
- * entries which link to the same bucket of the old table during resizing.
- * This allows to simplify the locking as locking the bucket in both
- * tables during resize always guarantee protection.
- *
- * IMPORTANT: When holding the bucket lock of both the old and new table
- * during expansions and shrinking, the old bucket lock must always be
- * acquired first.
- */
-static inline spinlock_t *rht_bucket_lock(const struct bucket_table *tbl,
-					  unsigned int hash)
-{
-	return &tbl->locks[hash & tbl->locks_mask];
-}
-
 #ifdef CONFIG_PROVE_LOCKING
 int lockdep_rht_mutex_is_held(struct rhashtable *ht);
 int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash);
@@ -317,7 +350,7 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * @hash:	the hash value / bucket index
  */
 #define rht_for_each(pos, tbl, hash) \
-	rht_for_each_continue(pos, *rht_bucket(tbl, hash), tbl, hash)
+	rht_for_each_continue(pos, rht_ptr(*rht_bucket(tbl, hash)), tbl, hash)
 
 /**
  * rht_for_each_entry_continue - continue iterating over hash chain
@@ -342,7 +375,7 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * @member:	name of the &struct rhash_head within the hashable struct.
  */
 #define rht_for_each_entry(tpos, pos, tbl, hash, member)		\
-	rht_for_each_entry_continue(tpos, pos, *rht_bucket(tbl, hash),	\
+	rht_for_each_entry_continue(tpos, pos, rht_ptr(*rht_bucket(tbl, hash)), \
 				    tbl, hash, member)
 
 /**
@@ -358,7 +391,8 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * remove the loop cursor from the list.
  */
 #define rht_for_each_entry_safe(tpos, pos, next, tbl, hash, member)	      \
-	for (pos = rht_dereference_bucket(*rht_bucket(tbl, hash), tbl, hash), \
+	for (pos = rht_dereference_bucket(rht_ptr(*rht_bucket(tbl, hash)),    \
+					  tbl, hash),			      \
 	     next = !rht_is_a_nulls(pos) ?				      \
 		       rht_dereference_bucket(pos->next, tbl, hash) : NULL;   \
 	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	      \
@@ -379,7 +413,7 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  */
 #define rht_for_each_rcu_continue(pos, head, tbl, hash)			\
 	for (({barrier(); }),						\
-	     pos = rht_dereference_bucket_rcu(head, tbl, hash);		\
+		pos = rht_ptr(rht_dereference_bucket_rcu(head, tbl, hash)); \
 	     !rht_is_a_nulls(pos);					\
 	     pos = rcu_dereference_raw(pos->next))
 
@@ -394,7 +428,11 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * traversal is guarded by rcu_read_lock().
  */
 #define rht_for_each_rcu(pos, tbl, hash)				\
-	rht_for_each_rcu_continue(pos, *rht_bucket(tbl, hash), tbl, hash)
+	for (({barrier(); }),						\
+	     pos = rht_ptr(rht_dereference_bucket_rcu(*rht_bucket(tbl, hash), \
+						      tbl, hash));	\
+	     !rht_is_a_nulls(pos);					\
+	     pos = rcu_dereference_raw(pos->next))
 
 /**
  * rht_for_each_entry_rcu_continue - continue iterating over rcu hash chain
@@ -428,7 +466,8 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * traversal is guarded by rcu_read_lock().
  */
 #define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member)		   \
-	rht_for_each_entry_rcu_continue(tpos, pos, *rht_bucket(tbl, hash), \
+	rht_for_each_entry_rcu_continue(tpos, pos,			   \
+					rht_ptr(*rht_bucket(tbl, hash)),   \
 					tbl, hash, member)
 
 /**
@@ -592,9 +631,9 @@ static inline void *__rhashtable_insert_fast(
 	};
 	struct rhash_head __rcu **headp;
 	struct rhash_head __rcu **pprev;
+	struct rhash_head __rcu **lock;
 	struct bucket_table *tbl;
 	struct rhash_head *head;
-	spinlock_t *lock;
 	unsigned int hash;
 	int elasticity;
 	void *data;
@@ -603,24 +642,23 @@ static inline void *__rhashtable_insert_fast(
 
 	tbl = rht_dereference_rcu(ht->tbl, ht);
 	hash = rht_head_hashfn(ht, tbl, obj, params);
-	lock = rht_bucket_lock(tbl, hash);
-	spin_lock_bh(lock);
+	elasticity = RHT_ELASTICITY;
+	headp = rht_bucket_insert(ht, tbl, hash);
+	data = ERR_PTR(-ENOMEM);
+	if (!headp)
+		goto out;
+	lock = pprev = headp;
+	rht_lock(lock);
 
 	if (unlikely(rcu_access_pointer(tbl->future_tbl))) {
 slow_path:
-		spin_unlock_bh(lock);
+		rht_unlock(lock);
 		rcu_read_unlock();
 		return rhashtable_insert_slow(ht, key, obj);
 	}
 
-	elasticity = RHT_ELASTICITY;
-	headp = rht_bucket_insert(ht, tbl, hash);
-	pprev = headp;
-	data = ERR_PTR(-ENOMEM);
-	if (!pprev)
-		goto out;
 
-	rht_for_each_continue(head, *headp, tbl, hash) {
+	rht_for_each_continue(head, rht_ptr(*headp), tbl, hash) {
 		struct rhlist_head *plist;
 		struct rhlist_head *list;
 
@@ -651,6 +689,8 @@ static inline void *__rhashtable_insert_fast(
 		head = rht_dereference_bucket(head->next, tbl, hash);
 		RCU_INIT_POINTER(list->rhead.next, head);
 		rcu_assign_pointer(*pprev, obj);
+		/* This is where we inserted */
+		headp = pprev;
 
 		goto good;
 	}
@@ -667,7 +707,7 @@ static inline void *__rhashtable_insert_fast(
 
 	head = rht_dereference_bucket(*headp, tbl, hash);
 
-	RCU_INIT_POINTER(obj->next, head);
+	RCU_INIT_POINTER(obj->next, rht_ptr(head));
 	if (rhlist) {
 		struct rhlist_head *list;
 
@@ -684,8 +724,15 @@ static inline void *__rhashtable_insert_fast(
 good:
 	data = NULL;
 
+	if (headp == lock) {
+		/* Assigning to *headp unlocked the chain, so we
+		 * don't need to do it again.
+		 */
+		rht_unlocked();
+	} else {
 out:
-	spin_unlock_bh(lock);
+		rht_unlock(lock);
+	}
 	rcu_read_unlock();
 
 	return data;
@@ -697,9 +744,9 @@ static inline void *__rhashtable_insert_fast(
  * @obj:	pointer to hash head inside object
  * @params:	hash table parameters
  *
- * Will take a per bucket spinlock to protect against mutual mutations
+ * Will take the per bucket bitlock to protect against mutual mutations
  * on the same bucket. Multiple insertions may occur in parallel unless
- * they map to the same bucket lock.
+ * they map to the same bucket.
  *
  * It is safe to call this function from atomic context.
  *
@@ -726,9 +773,9 @@ static inline int rhashtable_insert_fast(
  * @list:	pointer to hash list head inside object
  * @params:	hash table parameters
  *
- * Will take a per bucket spinlock to protect against mutual mutations
+ * Will take the per bucket bitlock to protect against mutual mutations
  * on the same bucket. Multiple insertions may occur in parallel unless
- * they map to the same bucket lock.
+ * they map to the same bucket.
  *
  * It is safe to call this function from atomic context.
  *
@@ -749,9 +796,9 @@ static inline int rhltable_insert_key(
  * @list:	pointer to hash list head inside object
  * @params:	hash table parameters
  *
- * Will take a per bucket spinlock to protect against mutual mutations
+ * Will take the per bucket bitlock to protect against mutual mutations
  * on the same bucket. Multiple insertions may occur in parallel unless
- * they map to the same bucket lock.
+ * they map to the same bucket.
  *
  * It is safe to call this function from atomic context.
  *
@@ -879,20 +926,19 @@ static inline int __rhashtable_remove_fast_one(
 	bool rhlist)
 {
 	struct rhash_head __rcu **pprev;
+	struct rhash_head __rcu **lock;
 	struct rhash_head *he;
-	spinlock_t * lock;
 	unsigned int hash;
 	int err = -ENOENT;
 
 	hash = rht_head_hashfn(ht, tbl, obj, params);
-	lock = rht_bucket_lock(tbl, hash);
-
-	spin_lock_bh(lock);
-
 	pprev = rht_bucket_var(tbl, hash);
 	if (!pprev)
-		goto out;
-	rht_for_each_continue(he, *pprev, tbl, hash) {
+		return -ENOENT;
+	lock = pprev;
+	rht_lock(lock);
+
+	rht_for_each_continue(he, rht_ptr(*pprev), tbl, hash) {
 		struct rhlist_head *list;
 
 		list = container_of(he, struct rhlist_head, rhead);
@@ -933,12 +979,16 @@ static inline int __rhashtable_remove_fast_one(
 		}
 
 		rcu_assign_pointer(*pprev, obj);
+		if (lock == pprev) {
+			/* That rcu_assign_pointer() unlocked the chain */
+			rht_unlocked();
+			goto unlocked;
+		}
 		break;
 	}
 
-out:
-	spin_unlock_bh(lock);
-
+	rht_unlock(lock);
+unlocked:
 	if (err > 0) {
 		atomic_dec(&ht->nelems);
 		if (unlikely(ht->p.automatic_shrinking &&
@@ -1028,8 +1078,8 @@ static inline int __rhashtable_replace_fast(
 	const struct rhashtable_params params)
 {
 	struct rhash_head __rcu **pprev;
+	struct rhash_head __rcu **lock;
 	struct rhash_head *he;
-	spinlock_t *lock;
 	unsigned int hash;
 	int err = -ENOENT;
 
@@ -1040,14 +1090,14 @@ static inline int __rhashtable_replace_fast(
 	if (hash != rht_head_hashfn(ht, tbl, obj_new, params))
 		return -EINVAL;
 
-	lock = rht_bucket_lock(tbl, hash);
-
-	spin_lock_bh(lock);
-
 	pprev = rht_bucket_var(tbl, hash);
 	if (!pprev)
-		goto out;
-	rht_for_each_continue(he, *pprev, tbl, hash) {
+		return -ENOENT;
+
+	lock = pprev;
+	rht_lock(lock);
+
+	rht_for_each_continue(he, rht_ptr(*pprev), tbl, hash) {
 		if (he != obj_old) {
 			pprev = &he->next;
 			continue;
@@ -1056,11 +1106,17 @@ static inline int __rhashtable_replace_fast(
 		rcu_assign_pointer(obj_new->next, obj_old->next);
 		rcu_assign_pointer(*pprev, obj_new);
 		err = 0;
+		if (pprev == lock) {
+			/* We just unlocked the chain by assigning to *pprev */
+			rht_unlocked();
+			goto unlocked;
+		}
 		break;
 	}
-out:
-	spin_unlock_bh(lock);
 
+	rht_unlock(lock);
+
+unlocked:
 	return err;
 }
 
diff --git a/ipc/util.c b/ipc/util.c
index fdffff41f65b..cc78eb76df8b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -105,7 +105,6 @@ static const struct rhashtable_params ipc_kht_params = {
 	.head_offset		= offsetof(struct kern_ipc_perm, khtnode),
 	.key_offset		= offsetof(struct kern_ipc_perm, key),
 	.key_len		= FIELD_SIZEOF(struct kern_ipc_perm, key),
-	.locks_mul		= 1,
 	.automatic_shrinking	= true,
 };
 
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index bac2493808f0..381acab7a909 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -32,7 +32,6 @@
 
 #define HASH_DEFAULT_SIZE	64UL
 #define HASH_MIN_SIZE		4U
-#define BUCKET_LOCKS_PER_CPU	32UL
 
 union nested_table {
 	union nested_table __rcu *table;
@@ -57,9 +56,11 @@ EXPORT_SYMBOL_GPL(lockdep_rht_mutex_is_held);
 
 int lockdep_rht_bucket_is_held(const struct bucket_table *tbl, u32 hash)
 {
-	spinlock_t *lock = rht_bucket_lock(tbl, hash);
-
-	return (debug_locks) ? lockdep_is_held(lock) : 1;
+	if (!debug_locks)
+		return 1;
+	if (unlikely(tbl->nest))
+		return 1;
+	return bit_spin_is_locked(1, (unsigned long *)&tbl->buckets[hash]);
 }
 EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
 #else
@@ -105,7 +106,6 @@ static void bucket_table_free(const struct bucket_table *tbl)
 	if (tbl->nest)
 		nested_bucket_table_free(tbl);
 
-	free_bucket_spinlocks(tbl->locks);
 	kvfree(tbl);
 }
 
@@ -172,7 +172,7 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 					       gfp_t gfp)
 {
 	struct bucket_table *tbl = NULL;
-	size_t size, max_locks;
+	size_t size;
 	int i;
 
 	size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);
@@ -192,16 +192,6 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 
 	tbl->size = size;
 
-	max_locks = size >> 1;
-	if (tbl->nest)
-		max_locks = min_t(size_t, max_locks, 1U << tbl->nest);
-
-	if (alloc_bucket_spinlocks(&tbl->locks, &tbl->locks_mask, max_locks,
-				   ht->p.locks_mul, gfp) < 0) {
-		bucket_table_free(tbl);
-		return NULL;
-	}
-
 	INIT_LIST_HEAD(&tbl->walkers);
 
 	tbl->hash_rnd = get_random_u32();
@@ -225,25 +215,24 @@ static struct bucket_table *rhashtable_last_table(struct rhashtable *ht,
 	return new_tbl;
 }
 
-static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
+static int rhashtable_rehash_one(struct rhashtable *ht,
+				 struct rhash_head __rcu **pprev,
+				 unsigned int old_hash)
 {
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
 	struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl);
-	struct rhash_head __rcu **pprev = rht_bucket_var(old_tbl, old_hash);
 	struct rhash_head __rcu **inspos;
+	struct rhash_head __rcu **lock;
 	int err = -EAGAIN;
 	struct rhash_head *head, *next, *entry;
-	spinlock_t *new_bucket_lock;
 	unsigned int new_hash;
 
 	if (new_tbl->nest)
 		goto out;
 
 	err = -ENOENT;
-	if (!pprev)
-		goto out;
 
-	rht_for_each_continue(entry, *pprev, old_tbl, old_hash) {
+	rht_for_each_continue(entry, rht_ptr(*pprev), old_tbl, old_hash) {
 		err = 0;
 		next = rht_dereference_bucket(entry->next, old_tbl, old_hash);
 
@@ -258,11 +247,11 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 
 	new_hash = head_hashfn(ht, new_tbl, entry);
 
-	new_bucket_lock = rht_bucket_lock(new_tbl, new_hash);
-
-	spin_lock_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
 	inspos = &new_tbl->buckets[new_hash];
-	head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
+	lock = inspos;
+	rht_lock(lock);
+
+	head = rht_ptr(rht_dereference_bucket(*inspos, new_tbl, new_hash));
 	while (!rht_is_a_nulls(head) && head < entry) {
 		inspos = &head->next;
 		head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
@@ -270,7 +259,14 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 	RCU_INIT_POINTER(entry->next, head);
 
 	rcu_assign_pointer(*inspos, entry);
-	spin_unlock(new_bucket_lock);
+	if (inspos != lock)
+		rht_unlock(lock);
+	else
+		rht_unlocked();
+
+	/* Need to preserved the bit lock. */
+	if (rht_is_locked(*pprev))
+		next = rht_ptr_locked(next);
 
 	rcu_assign_pointer(*pprev, next);
 
@@ -282,19 +278,19 @@ static int rhashtable_rehash_chain(struct rhashtable *ht,
 				    unsigned int old_hash)
 {
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
-	spinlock_t *old_bucket_lock;
+	struct rhash_head __rcu **pprev = rht_bucket_var(old_tbl, old_hash);
 	int err;
 
-	old_bucket_lock = rht_bucket_lock(old_tbl, old_hash);
+	if (!pprev)
+		return 0;
+	rht_lock(pprev);
 
-	spin_lock_bh(old_bucket_lock);
-	while (!(err = rhashtable_rehash_one(ht, old_hash)))
+	while (!(err = rhashtable_rehash_one(ht, pprev, old_hash)))
 		;
 
 	if (err == -ENOENT)
 		err = 0;
-
-	spin_unlock_bh(old_bucket_lock);
+	rht_unlock(pprev);
 
 	return err;
 }
@@ -487,6 +483,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 }
 
 static void *rhashtable_lookup_one(struct rhashtable *ht,
+				   struct rhash_head __rcu **pprev,
 				   struct bucket_table *tbl, unsigned int hash,
 				   const void *key, struct rhash_head *obj)
 {
@@ -494,15 +491,12 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
 		.ht = ht,
 		.key = key,
 	};
-	struct rhash_head __rcu **pprev;
+	struct rhash_head **lock = pprev;
 	struct rhash_head *head;
 	int elasticity;
 
 	elasticity = RHT_ELASTICITY;
-	pprev = rht_bucket_var(tbl, hash);
-	if (!pprev)
-		return ERR_PTR(-ENOENT);
-	rht_for_each_continue(head, *pprev, tbl, hash) {
+	rht_for_each_continue(head, rht_ptr(*pprev), tbl, hash) {
 		struct rhlist_head *list;
 		struct rhlist_head *plist;
 
@@ -524,6 +518,9 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
 		RCU_INIT_POINTER(list->next, plist);
 		head = rht_dereference_bucket(head->next, tbl, hash);
 		RCU_INIT_POINTER(list->rhead.next, head);
+		if (pprev == lock)
+			/* Need to preserve the bit lock */
+			obj = rht_ptr_locked(obj);
 		rcu_assign_pointer(*pprev, obj);
 
 		return NULL;
@@ -536,12 +533,13 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
 }
 
 static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
+						  struct rhash_head __rcu **pprev,
 						  struct bucket_table *tbl,
 						  unsigned int hash,
 						  struct rhash_head *obj,
 						  void *data)
 {
-	struct rhash_head __rcu **pprev;
+	struct rhash_head **lock = pprev;
 	struct bucket_table *new_tbl;
 	struct rhash_head *head;
 
@@ -564,11 +562,7 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 	if (unlikely(rht_grow_above_100(ht, tbl)))
 		return ERR_PTR(-EAGAIN);
 
-	pprev = rht_bucket_insert(ht, tbl, hash);
-	if (!pprev)
-		return ERR_PTR(-ENOMEM);
-
-	head = rht_dereference_bucket(*pprev, tbl, hash);
+	head = rht_ptr(rht_dereference_bucket(*pprev, tbl, hash));
 	while (!ht->rhlist && !rht_is_a_nulls(head) && head < obj) {
 		pprev = &head->next;
 		head = rht_dereference_bucket(*pprev, tbl, hash);
@@ -582,6 +576,9 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 		RCU_INIT_POINTER(list->next, NULL);
 	}
 
+	if (pprev == lock)
+		/* Need to preserve the bit lock */
+		obj = (void *)(2UL | (unsigned long)obj);
 	rcu_assign_pointer(*pprev, obj);
 
 	atomic_inc(&ht->nelems);
@@ -596,6 +593,7 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
 {
 	struct bucket_table *new_tbl;
 	struct bucket_table *tbl;
+	struct rhash_head __rcu **pprev;
 	unsigned int hash;
 	void *data;
 
@@ -604,14 +602,25 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
 	do {
 		tbl = new_tbl;
 		hash = rht_head_hashfn(ht, tbl, obj, ht->p);
-		spin_lock(rht_bucket_lock(tbl, hash));
-
-		data = rhashtable_lookup_one(ht, tbl, hash, key, obj);
-		new_tbl = rhashtable_insert_one(ht, tbl, hash, obj, data);
-		if (PTR_ERR(new_tbl) != -EEXIST)
-			data = ERR_CAST(new_tbl);
-
-		spin_unlock(rht_bucket_lock(tbl, hash));
+		if (rcu_access_pointer(tbl->future_tbl))
+			/* Failure is OK */
+			pprev = rht_bucket_var(tbl, hash);
+		else
+			pprev = rht_bucket_insert(ht, tbl, hash);
+		if (pprev == NULL) {
+			new_tbl = rht_dereference_rcu(tbl->future_tbl, ht);
+			data = ERR_PTR(-EAGAIN);
+		} else {
+			rht_lock(pprev);
+			data = rhashtable_lookup_one(ht, pprev, tbl,
+						     hash, key, obj);
+			new_tbl = rhashtable_insert_one(ht, pprev, tbl,
+							hash, obj, data);
+			if (PTR_ERR(new_tbl) != -EEXIST)
+				data = ERR_CAST(new_tbl);
+
+			rht_unlock(pprev);
+		}
 	} while (!IS_ERR_OR_NULL(new_tbl));
 
 	if (PTR_ERR(data) == -EAGAIN)
@@ -1045,11 +1054,6 @@ int rhashtable_init(struct rhashtable *ht,
 	if (params->nelem_hint)
 		size = rounded_hashtable_size(&ht->p);
 
-	if (params->locks_mul)
-		ht->p.locks_mul = roundup_pow_of_two(params->locks_mul);
-	else
-		ht->p.locks_mul = BUCKET_LOCKS_PER_CPU;
-
 	ht->key_len = ht->p.key_len;
 	if (!params->hashfn) {
 		ht->p.hashfn = jhash;
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index d9e69e4514be..aa989318162f 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -33,7 +33,6 @@ static const struct rhashtable_params br_fdb_rht_params = {
 	.key_offset = offsetof(struct net_bridge_fdb_entry, key),
 	.key_len = sizeof(struct net_bridge_fdb_key),
 	.automatic_shrinking = true,
-	.locks_mul = 1,
 };
 
 static struct kmem_cache *br_fdb_cache __read_mostly;
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 9896f4975353..7569c3c04585 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -21,7 +21,6 @@ static const struct rhashtable_params br_vlan_rht_params = {
 	.key_offset = offsetof(struct net_bridge_vlan, vid),
 	.key_len = sizeof(u16),
 	.nelem_hint = 3,
-	.locks_mul = 1,
 	.max_size = VLAN_N_VID,
 	.obj_cmpfn = br_vlan_cmp,
 	.automatic_shrinking = true,
diff --git a/net/bridge/br_vlan_tunnel.c b/net/bridge/br_vlan_tunnel.c
index 6d2c4eed2dc8..758151863669 100644
--- a/net/bridge/br_vlan_tunnel.c
+++ b/net/bridge/br_vlan_tunnel.c
@@ -34,7 +34,6 @@ static const struct rhashtable_params br_vlan_tunnel_rht_params = {
 	.key_offset = offsetof(struct net_bridge_vlan, tinfo.tunnel_id),
 	.key_len = sizeof(__be64),
 	.nelem_hint = 3,
-	.locks_mul = 1,
 	.obj_cmpfn = br_vlan_tunid_cmp,
 	.automatic_shrinking = true,
 };
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index adbc3d3a560b..1995520518ab 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -371,7 +371,6 @@ static const struct rhashtable_params ipmr_rht_params = {
 	.key_offset = offsetof(struct mfc_cache, cmparg),
 	.key_len = sizeof(struct mfc_cache_cmp_arg),
 	.nelem_hint = 3,
-	.locks_mul = 1,
 	.obj_cmpfn = ipmr_hash_cmp,
 	.automatic_shrinking = true,
 };
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index dfb339f79bde..f7dff3f219d3 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -340,7 +340,6 @@ static const struct rhashtable_params ip6mr_rht_params = {
 	.key_offset = offsetof(struct mfc6_cache, cmparg),
 	.key_len = sizeof(struct mfc6_cache_cmp_arg),
 	.nelem_hint = 3,
-	.locks_mul = 1,
 	.obj_cmpfn = ip6mr_hash_cmp,
 	.automatic_shrinking = true,
 };

^ permalink raw reply related

* [PATCH 16/18] rhashtable: allow percpu element counter
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

When an rhashtable is receiving a high rate of updates on multiple
CPUs, the atomic update to the nelems counter can become a
measurable contention point.

We can use a per-cpu counter instead of an atomic_t.  This reduces the
contention on updates, but reduces the precision for grow/shrink
decisions.  This results in a performance trade-off between updates
and lookups.  Updates should be faster due to reduced contention,
lookups could be slower if a resize is delayed.  This trade-off will
only be appropriate on rhashtables which have a high proportion of
updates compared to lookups.

A fast read of a percpu counter can have an error of as much
as 32 times the number of CPUs.  A more precise count can be
extracted, but this takes longer.  It seems sensible to
use the fast read to schedule a rehash, and the slow read
to choose the new size of the table.  This has the drawback
that if the fast read is sufficiently higher than the slow read,
we might repeatedly schedule a rehash which never happens.
To avoid this, we record the most recently noticed difference
between the two reads and use this to adjust the estimate returned
by the fast read.  If the rehash thread determines that a rehash is
not needed, it will update the recored error so that updaters
will not schedule the thread again until there is substantially
more growth.

When the fast-read shows that the table is over 100% full,
we need to double check with a slow read at that point too,
and update the recorded error to ensure we only resize
when it is really appropriate, but don't keep performing slow
tests to see exactly when it is appropriate.

The error we keep (in pnelems_error) is 1.5 time the absolute
difference between percpu_counter_read_positive() and
percpu_counter_sum_positive().  The extra 50% reduces the frequency
with which we have to do the slow read, at a cost of allowing
the occupancy rate to grow well above 1 (but usually less than 2).

This patch includes a change to include/linux/percpu_counter.h
to make some functions accept a const pointer.

It also adds a module parameter to test_rhashtable.ko to allow
testing of percpu counters.

Signed-off-by: NeilBrown <neil@brown.name>
---
 include/linux/percpu_counter.h   |    4 +
 include/linux/rhashtable-types.h |    8 ++
 include/linux/rhashtable.h       |  126 +++++++++++++++++++++++++++++++++-----
 lib/rhashtable.c                 |   70 +++++++++++++++++----
 lib/test_rhashtable.c            |   14 ++++
 5 files changed, 189 insertions(+), 33 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 4f052496cdfd..a541a6c49a88 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -66,7 +66,7 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 	return __percpu_counter_sum(fbc);
 }
 
-static inline s64 percpu_counter_read(struct percpu_counter *fbc)
+static inline s64 percpu_counter_read(const struct percpu_counter *fbc)
 {
 	return fbc->count;
 }
@@ -76,7 +76,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
  * number for some counter which should never be negative.
  *
  */
-static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
+static inline s64 percpu_counter_read_positive(const struct percpu_counter *fbc)
 {
 	s64 ret = fbc->count;
 
diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index 39e5e1fb9b65..c08c3bcfb5ad 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -13,6 +13,7 @@
 #include <linux/compiler.h>
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
+#include <linux/percpu_counter.h>
 
 struct rhash_head {
 	struct rhash_head __rcu		*next;
@@ -49,6 +50,7 @@ typedef int (*rht_obj_cmpfn_t)(struct rhashtable_compare_arg *arg,
  * @max_size: Maximum size while expanding
  * @min_size: Minimum size while shrinking
  * @automatic_shrinking: Enable automatic shrinking of tables
+ * @percpu_count: Use a per-cpu counter instead of atomic_t
  * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
  * @obj_hashfn: Function to hash object
  * @obj_cmpfn: Function to compare key with object
@@ -61,6 +63,7 @@ struct rhashtable_params {
 	unsigned int		max_size;
 	u16			min_size;
 	bool			automatic_shrinking;
+	bool			percpu_count;
 	rht_hashfn_t		hashfn;
 	rht_obj_hashfn_t	obj_hashfn;
 	rht_obj_cmpfn_t		obj_cmpfn;
@@ -77,6 +80,9 @@ struct rhashtable_params {
  * @mutex: Mutex to protect current/future table swapping
  * @lock: Spin lock to protect walker list
  * @nelems: Number of elements in table
+ * @pnelems: Alternate per-cpu nelems counter
+ * @pnelems_error: absolute difference between approximate and actual
+ *          percpu count at last check: 1.5 * abs("read" - "sum").
  */
 struct rhashtable {
 	struct bucket_table __rcu	*tbl;
@@ -88,6 +94,8 @@ struct rhashtable {
 	struct mutex                    mutex;
 	spinlock_t			lock;
 	atomic_t			nelems;
+	struct percpu_counter		pnelems;
+	unsigned int			pnelems_error;
 };
 
 /**
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 8e4d483ddf22..f58cb62dce05 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -27,6 +27,8 @@
 #include <linux/bit_spinlock.h>
 
 #include <linux/rhashtable-types.h>
+#include <linux/percpu_counter.h>
+
 /*
  * The end of the chain is marked with a special nulls marks which has
  * the least significant bit set.
@@ -200,30 +202,95 @@ static inline unsigned int rht_head_hashfn(
 	       rht_key_hashfn(ht, tbl, ptr + params.key_offset, params);
 }
 
+/*
+ * Per-CPU element counting.
+ * During insert/remove we check the percpu element counter (if active)
+ * with percpu_counter_read_positive().  This is fast but has a possible
+ * error of as much as 32*NUM_CPUS.
+ * In the rehash thread, we determine the number of elements with
+ * percpu_counter_sum_positive().  This is slower and more precise.
+ * We chose a new table size based on this precise count.
+ * If the approximate count is more than 70% of the table size, while
+ * the precise count is below that, the above rules would cause the
+ * rehash thread to be continually scheduled, which is not ideal.
+ * So we record an error, being 1.5 times the absolute difference between
+ * the approximate and precise element counts.  When testing if we need to grow,
+ * we subtract the error from the approximate count, and test that.
+ * When testing if we need to shrink, we add the error.
+ * This ensures we don't thrash the rehash thread, at the cost of delaying
+ * resize operations.  This is the cost trade-off for using per-cpu
+ * element counting:  insert/remove will be faster by avoiding an atomic
+ * operation 31/32 of the time, but lookup might be slower as hash chains
+ * can be expected to be slightly longer on average.  If the error is
+ * so bad that hash chains grow to 16 (could happen when the table is small
+ * and lots of CPUs are adding items), that will trigger a rehash which will
+ * correct the table size update the recorded error.  Possibly the '16' should
+ * be reduce to 8 when per-cpu element counting is active.
+ */
+
+static inline bool __rht_grow_above_75(const struct rhashtable *ht,
+				       const struct bucket_table *tbl,
+				       unsigned int nelems)
+{
+	return nelems > (tbl->size / 4 * 3) &&
+	       (!ht->p.max_size || tbl->size < ht->p.max_size);
+}
 /**
  * rht_grow_above_75 - returns true if nelems > 0.75 * table-size
  * @ht:		hash table
  * @tbl:	current table
  */
 static inline bool rht_grow_above_75(const struct rhashtable *ht,
-				     const struct bucket_table *tbl)
+				     const struct bucket_table *tbl,
+				     const struct rhashtable_params params)
 {
 	/* Expand table when exceeding 75% load */
-	return atomic_read(&ht->nelems) > (tbl->size / 4 * 3) &&
-	       (!ht->p.max_size || tbl->size < ht->p.max_size);
+	unsigned int nelems;
+
+	if (!params.percpu_count)
+		nelems = atomic_read(&ht->nelems);
+	else {
+		nelems = percpu_counter_read_positive(&ht->pnelems);
+		if (nelems > ht->pnelems_error)
+			nelems -= ht->pnelems_error;
+	}
+	return __rht_grow_above_75(ht, tbl, nelems);
 }
 
+static inline bool __rht_shrink_below_30(const struct rhashtable *ht,
+					 const struct bucket_table *tbl,
+					 const unsigned int nelems)
+{
+	/* Shrink table beneath 30% load */
+	return nelems < (tbl->size * 3 / 10) &&
+	       tbl->size > ht->p.min_size;
+}
 /**
  * rht_shrink_below_30 - returns true if nelems < 0.3 * table-size
  * @ht:		hash table
  * @tbl:	current table
  */
 static inline bool rht_shrink_below_30(const struct rhashtable *ht,
-				       const struct bucket_table *tbl)
+				       const struct bucket_table *tbl,
+				       const struct rhashtable_params params)
 {
 	/* Shrink table beneath 30% load */
-	return atomic_read(&ht->nelems) < (tbl->size * 3 / 10) &&
-	       tbl->size > ht->p.min_size;
+	unsigned int nelems;
+
+	if (params.percpu_count)
+		nelems = percpu_counter_read_positive(&ht->pnelems) +
+			ht->pnelems_error;
+	else
+		nelems = atomic_read(&ht->nelems);
+	return __rht_shrink_below_30(ht, tbl, nelems);
+}
+
+static inline bool __rht_grow_above_100(const struct rhashtable *ht,
+					const struct bucket_table *tbl,
+					unsigned int nelems)
+{
+	return nelems > tbl->size &&
+		(!ht->p.max_size || tbl->size < ht->p.max_size);
 }
 
 /**
@@ -232,10 +299,19 @@ static inline bool rht_shrink_below_30(const struct rhashtable *ht,
  * @tbl:	current table
  */
 static inline bool rht_grow_above_100(const struct rhashtable *ht,
-				      const struct bucket_table *tbl)
+				      const struct bucket_table *tbl,
+				      const struct rhashtable_params params)
 {
-	return atomic_read(&ht->nelems) > tbl->size &&
-		(!ht->p.max_size || tbl->size < ht->p.max_size);
+	unsigned int nelems;
+
+	if (!params.percpu_count)
+		nelems = atomic_read(&ht->nelems);
+	else {
+		nelems = percpu_counter_read_positive(&ht->pnelems);
+		if (nelems > ht->pnelems_error)
+			nelems -= ht->pnelems_error;
+	}
+	return __rht_grow_above_100(ht, tbl, nelems);
 }
 
 /**
@@ -244,9 +320,19 @@ static inline bool rht_grow_above_100(const struct rhashtable *ht,
  * @tbl:	current table
  */
 static inline bool rht_grow_above_max(const struct rhashtable *ht,
-				      const struct bucket_table *tbl)
+				      const struct bucket_table *tbl,
+				      const struct rhashtable_params params)
 {
-	return atomic_read(&ht->nelems) >= ht->max_elems;
+	unsigned int nelems;
+
+	if (!params.percpu_count)
+		nelems = atomic_read(&ht->nelems);
+	else {
+		nelems = percpu_counter_read_positive(&ht->pnelems);
+		if (nelems > ht->pnelems_error)
+			nelems -= ht->pnelems_error;
+	}
+	return nelems >= ht->max_elems;
 }
 
 #ifdef CONFIG_PROVE_LOCKING
@@ -699,10 +785,10 @@ static inline void *__rhashtable_insert_fast(
 		goto slow_path;
 
 	data = ERR_PTR(-E2BIG);
-	if (unlikely(rht_grow_above_max(ht, tbl)))
+	if (unlikely(rht_grow_above_max(ht, tbl, params)))
 		goto out;
 
-	if (unlikely(rht_grow_above_100(ht, tbl)))
+	if (unlikely(rht_grow_above_100(ht, tbl, params)))
 		goto slow_path;
 
 	head = rht_dereference_bucket(*headp, tbl, hash);
@@ -717,8 +803,11 @@ static inline void *__rhashtable_insert_fast(
 
 	rcu_assign_pointer(*headp, obj);
 
-	atomic_inc(&ht->nelems);
-	if (rht_grow_above_75(ht, tbl))
+	if (params.percpu_count)
+		percpu_counter_inc(&ht->pnelems);
+	else
+		atomic_inc(&ht->nelems);
+	if (rht_grow_above_75(ht, tbl, params))
 		schedule_work(&ht->run_work);
 
 good:
@@ -990,9 +1079,12 @@ static inline int __rhashtable_remove_fast_one(
 	rht_unlock(lock);
 unlocked:
 	if (err > 0) {
-		atomic_dec(&ht->nelems);
+		if (params.percpu_count)
+			percpu_counter_dec(&ht->pnelems);
+		else
+			atomic_dec(&ht->nelems);
 		if (unlikely(ht->p.automatic_shrinking &&
-			     rht_shrink_below_30(ht, tbl)))
+			     rht_shrink_below_30(ht, tbl, params)))
 			schedule_work(&ht->run_work);
 		err = 0;
 	}
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 381acab7a909..1b9820787cd5 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -386,10 +386,9 @@ static int rhashtable_rehash_alloc(struct rhashtable *ht,
  * It is valid to have concurrent insertions and deletions protected by per
  * bucket locks or concurrent RCU protected lookups and traversals.
  */
-static int rhashtable_shrink(struct rhashtable *ht)
+static int rhashtable_shrink(struct rhashtable *ht, const unsigned int nelems)
 {
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
-	unsigned int nelems = atomic_read(&ht->nelems);
 	unsigned int size = 0;
 
 	if (nelems)
@@ -406,10 +405,28 @@ static int rhashtable_shrink(struct rhashtable *ht)
 	return rhashtable_rehash_alloc(ht, old_tbl, size);
 }
 
+static int check_nelems(struct rhashtable *ht)
+{
+	unsigned int nelems;
+
+	if (ht->p.percpu_count) {
+		unsigned int approx = percpu_counter_read_positive(&ht->pnelems);
+
+		nelems = percpu_counter_sum_positive(&ht->pnelems);
+		if (approx > nelems)
+			ht->pnelems_error = (approx - nelems) * 3 / 2;
+		else
+			ht->pnelems_error = (nelems - approx) * 3 / 2;
+	} else
+		nelems = atomic_read(&ht->nelems);
+	return nelems;
+}
+
 static void rht_deferred_worker(struct work_struct *work)
 {
 	struct rhashtable *ht;
 	struct bucket_table *tbl;
+	unsigned int nelems;
 	int err = 0;
 
 	ht = container_of(work, struct rhashtable, run_work);
@@ -418,10 +435,18 @@ static void rht_deferred_worker(struct work_struct *work)
 	tbl = rht_dereference(ht->tbl, ht);
 	tbl = rhashtable_last_table(ht, tbl);
 
-	if (rht_grow_above_75(ht, tbl))
-		err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2);
-	else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))
-		err = rhashtable_shrink(ht);
+	nelems = check_nelems(ht);
+
+	if (__rht_grow_above_75(ht, tbl, nelems)) {
+		unsigned int size = roundup_pow_of_two(nelems * 4 / 3);
+
+		if (ht->p.max_size && size > ht->p.max_size)
+			size = ht->p.max_size;
+
+		err = rhashtable_rehash_alloc(ht, tbl, size);
+	} else if (ht->p.automatic_shrinking &&
+		   __rht_shrink_below_30(ht, tbl, nelems))
+		err = rhashtable_shrink(ht, nelems);
 	else if (tbl->nest)
 		err = rhashtable_rehash_alloc(ht, tbl, tbl->size);
 
@@ -439,6 +464,7 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 {
 	struct bucket_table *old_tbl;
 	struct bucket_table *new_tbl;
+	unsigned int nelems;
 	unsigned int size;
 	int err;
 
@@ -448,7 +474,12 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 
 	err = -EBUSY;
 
-	if (rht_grow_above_75(ht, tbl))
+	/* Now is a good time to ensure ->pnelems_error is up to date,
+	 * and to use the exact count
+	 */
+	nelems = check_nelems(ht);
+
+	if (__rht_grow_above_75(ht, tbl, nelems))
 		size *= 2;
 	/* Do not schedule more than one rehash */
 	else if (old_tbl != tbl)
@@ -556,11 +587,15 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 	if (PTR_ERR(data) != -ENOENT)
 		return ERR_CAST(data);
 
-	if (unlikely(rht_grow_above_max(ht, tbl)))
+	if (unlikely(rht_grow_above_max(ht, tbl, ht->p)))
 		return ERR_PTR(-E2BIG);
 
-	if (unlikely(rht_grow_above_100(ht, tbl)))
-		return ERR_PTR(-EAGAIN);
+	if (unlikely(rht_grow_above_100(ht, tbl, ht->p))) {
+		unsigned int nelems = check_nelems(ht);
+
+		if (__rht_grow_above_100(ht, tbl, nelems))
+			return ERR_PTR(-EAGAIN);
+	}
 
 	head = rht_ptr(rht_dereference_bucket(*pprev, tbl, hash));
 	while (!ht->rhlist && !rht_is_a_nulls(head) && head < obj) {
@@ -581,8 +616,11 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 		obj = (void *)(2UL | (unsigned long)obj);
 	rcu_assign_pointer(*pprev, obj);
 
-	atomic_inc(&ht->nelems);
-	if (rht_grow_above_75(ht, tbl))
+	if (ht->p.percpu_count)
+		percpu_counter_inc(&ht->pnelems);
+	else
+		atomic_inc(&ht->nelems);
+	if (rht_grow_above_75(ht, tbl, ht->p))
 		schedule_work(&ht->run_work);
 
 	return NULL;
@@ -1069,6 +1107,12 @@ int rhashtable_init(struct rhashtable *ht,
 		return -ENOMEM;
 
 	atomic_set(&ht->nelems, 0);
+	ht->pnelems_error = 0;
+	if (params->percpu_count)
+		if (percpu_counter_init(&ht->pnelems, 0, GFP_KERNEL)) {
+			bucket_table_free(tbl);
+			return -ENOMEM;
+		}
 
 	RCU_INIT_POINTER(ht->tbl, tbl);
 
@@ -1159,6 +1203,8 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
 	}
 
 	bucket_table_free(tbl);
+	if (ht->p.percpu_count)
+		percpu_counter_destroy(&ht->pnelems);
 	mutex_unlock(&ht->mutex);
 }
 EXPORT_SYMBOL_GPL(rhashtable_free_and_destroy);
diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index b428a9c7522a..0902db84a37d 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -57,6 +57,10 @@ static bool enomem_retry = false;
 module_param(enomem_retry, bool, 0);
 MODULE_PARM_DESC(enomem_retry, "Retry insert even if -ENOMEM was returned (default: off)");
 
+static bool percpu_count = false;
+module_param(percpu_count, bool, 0);
+MODULE_PARM_DESC(percpu_count, "Use less-precise percpu counter for counting elements (default: off)");
+
 struct test_obj_val {
 	int	id;
 	int	tid;
@@ -180,6 +184,7 @@ static void test_bucket_stats(struct rhashtable *ht, unsigned int entries)
 	unsigned int err, total = 0, chain_len = 0;
 	struct rhashtable_iter hti;
 	struct rhash_head *pos;
+	unsigned int nelems;
 
 	err = rhashtable_walk_init(ht, &hti, GFP_KERNEL);
 	if (err) {
@@ -206,10 +211,14 @@ static void test_bucket_stats(struct rhashtable *ht, unsigned int entries)
 	rhashtable_walk_stop(&hti);
 	rhashtable_walk_exit(&hti);
 
+	if (ht->p.percpu_count)
+		nelems = percpu_counter_sum(&ht->pnelems);
+	else
+		nelems = atomic_read(&ht->nelems);
 	pr_info("  Traversal complete: counted=%u, nelems=%u, entries=%d, table-jumps=%u\n",
-		total, atomic_read(&ht->nelems), entries, chain_len);
+		total, nelems, entries, chain_len);
 
-	if (total != atomic_read(&ht->nelems) || total != entries)
+	if (total != nelems || total != entries)
 		pr_warn("Test failed: Total count mismatch ^^^");
 }
 
@@ -705,6 +714,7 @@ static int __init test_rht_init(void)
 	test_rht_params.automatic_shrinking = shrinking;
 	test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
 	test_rht_params.nelem_hint = size;
+	test_rht_params.percpu_count = percpu_count;
 
 	objs = vzalloc((test_rht_params.max_size + 1) * sizeof(struct test_obj));
 	if (!objs)

^ permalink raw reply related

* [PATCH 18/18] rhashtable: add rhashtable_walk_delay_rehash()
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

This function allows an rhashtable walk to complete
without any restarts as long as it progresses reasonably
quickly.
Any current rehash is waited for, then a counter is incremented
to stop any further rehash from happening until the 100%
occupancy mark is reached.
This particularly ensures that restarts won't happen due to
the hash table shrinking as shrinking is delayed indefinitely.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable-types.h |    3 ++
 include/linux/rhashtable.h       |   14 ++++++++--
 lib/rhashtable.c                 |   54 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index c08c3bcfb5ad..43de8d10638e 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -76,6 +76,7 @@ struct rhashtable_params {
  * @max_elems: Maximum number of elements in table
  * @p: Configuration parameters
  * @rhlist: True if this is an rhltable
+ * @rehash_delayers: number of walkers asking for rehash to be delayed.
  * @run_work: Deferred worker to expand/shrink asynchronously
  * @mutex: Mutex to protect current/future table swapping
  * @lock: Spin lock to protect walker list
@@ -90,6 +91,7 @@ struct rhashtable {
 	unsigned int			max_elems;
 	struct rhashtable_params	p;
 	bool				rhlist;
+	atomic_t			rehash_delayers;
 	struct work_struct		run_work;
 	struct mutex                    mutex;
 	spinlock_t			lock;
@@ -134,6 +136,7 @@ struct rhashtable_iter {
 	unsigned int skip;
 	bool p_is_unsafe;
 	bool end_of_table;
+	bool delay_rehash;
 };
 
 int rhashtable_init(struct rhashtable *ht,
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index e148f72d4a89..42b5d037ad2e 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -233,7 +233,8 @@ static inline bool __rht_grow_above_75(const struct rhashtable *ht,
 				       unsigned int nelems)
 {
 	return nelems > (tbl->size / 4 * 3) &&
-	       (!ht->p.max_size || tbl->size < ht->p.max_size);
+	       (!ht->p.max_size || tbl->size < ht->p.max_size) &&
+		atomic_read(&ht->rehash_delayers) == 0;
 }
 /**
  * rht_grow_above_75 - returns true if nelems > 0.75 * table-size
@@ -263,7 +264,8 @@ static inline bool __rht_shrink_below_30(const struct rhashtable *ht,
 {
 	/* Shrink table beneath 30% load */
 	return nelems < (tbl->size * 3 / 10) &&
-	       tbl->size > ht->p.min_size;
+	       tbl->size > ht->p.min_size &&
+	       atomic_read(&ht->rehash_delayers) == 0;
 }
 /**
  * rht_shrink_below_30 - returns true if nelems < 0.3 * table-size
@@ -285,9 +287,14 @@ static inline bool rht_shrink_below_30(const struct rhashtable *ht,
 	return __rht_shrink_below_30(ht, tbl, nelems);
 }
 
+/**
+ * rht_grow_above_100 - returns true if nelems > table-size
+ * @ht:		hash table
+ * @tbl:	current table
+ */
 static inline bool __rht_grow_above_100(const struct rhashtable *ht,
 					const struct bucket_table *tbl,
-					unsigned int nelems)
+					const unsigned int nelems)
 {
 	return nelems > tbl->size &&
 		(!ht->p.max_size || tbl->size < ht->p.max_size);
@@ -357,6 +364,7 @@ void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
 void rhashtable_walk_enter(struct rhashtable *ht,
 			   struct rhashtable_iter *iter);
 void rhashtable_walk_exit(struct rhashtable_iter *iter);
+void rhashtable_walk_delay_rehash(struct rhashtable_iter *iter);
 int rhashtable_walk_start_check(struct rhashtable_iter *iter) __acquires(RCU);
 
 static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 84288142ddf6..aa1e7be8fc5b 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -436,8 +436,16 @@ static void rht_deferred_worker(struct work_struct *work)
 	tbl = rhashtable_last_table(ht, tbl);
 
 	nelems = check_nelems(ht);
+	if (atomic_read(&ht->rehash_delayers) &&
+	    tbl == ht->tbl &&
+	    !__rht_grow_above_100(ht, tbl, nelems))
+		goto out;
 
-	if (__rht_grow_above_75(ht, tbl, nelems)) {
+	/* Need to test both _75 and _100 and _75 always
+	 * says "no need to grow" if ->rehash_delayers
+	 */
+	if (__rht_grow_above_75(ht, tbl, nelems) ||
+	    __rht_grow_above_100(ht, tbl, nelems)) {
 		unsigned int size = roundup_pow_of_two(nelems * 4 / 3);
 
 		if (ht->p.max_size && size > ht->p.max_size)
@@ -453,6 +461,7 @@ static void rht_deferred_worker(struct work_struct *work)
 	if (!err)
 		err = rhashtable_rehash_table(ht);
 
+out:
 	mutex_unlock(&ht->mutex);
 
 	if (err)
@@ -711,6 +720,7 @@ void rhashtable_walk_enter(struct rhashtable *ht, struct rhashtable_iter *iter)
 	iter->slot = 0;
 	iter->skip = 0;
 	iter->end_of_table = 0;
+	iter->delay_rehash = false;
 
 	spin_lock(&ht->lock);
 	iter->walker.tbl =
@@ -732,9 +742,50 @@ void rhashtable_walk_exit(struct rhashtable_iter *iter)
 	if (iter->walker.tbl)
 		list_del(&iter->walker.list);
 	spin_unlock(&iter->ht->lock);
+	if (iter->delay_rehash &&
+	    atomic_dec_and_test(&iter->ht->rehash_delayers))
+		schedule_work(&iter->ht->run_work);
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_exit);
 
+/**
+ * rhashtable_walk_delay_rehash - ask for rehash to be delayed
+ * @iter:	Hash table Iterator
+ *
+ * If a rehash happens during a table walk, the walk can
+ * see duplicate entries and need to restart.
+ * If the walk calls this function immediately after
+ * rhashtable_walk_enter(), the chance of this happening is
+ * substantially reduced.  It waits for any
+ * current rehash to complete, then temporarily disables
+ * shinking and blocks and growth of the table until it
+ * reaches 100% occupancy, rather than the normal 70%.
+ *
+ * This function can be useful when walking a table to
+ * produce a debugfs file.  It should probably *not* be
+ * used when the file is access by an unprivileged process
+ * as that could conceivably result in a minor denial for service.
+ */
+void rhashtable_walk_delay_rehash(struct rhashtable_iter *iter)
+{
+	if (!iter->delay_rehash) {
+		struct rhashtable *ht = iter->ht;
+
+		atomic_inc(&ht->rehash_delayers);
+		iter->delay_rehash = true;
+		/* make sure any schedule rehash has finished */
+		flush_work(&ht->run_work);
+		mutex_lock(&ht->mutex);
+		/* Make double-sure there is only one table */
+		while (rhashtable_rehash_table(ht) == -EAGAIN)
+			;
+		mutex_unlock(&ht->mutex);
+		/* Need to swallow any -EAGAIN */
+		rhashtable_walk_start(iter);
+		rhashtable_walk_stop(iter);
+	}
+}
+
 /**
  * rhashtable_walk_start_check - Start a hash table walk
  * @iter:	Hash table iterator
@@ -1113,6 +1164,7 @@ int rhashtable_init(struct rhashtable *ht,
 			bucket_table_free(tbl);
 			return -ENOMEM;
 		}
+	atomic_set(&ht->rehash_delayers, 0);
 
 	RCU_INIT_POINTER(ht->tbl, tbl);
 

^ permalink raw reply related

* [PATCH 17/18] rhashtable: rename rht_for_each*continue as *from.
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

The pattern set by list.h is that for_each..continue()
iterators start at the next entry after the given one,
while for_each..next() iterators start at the given
entry.

The rht_for_each*continue() iterates are documented as though the
start at the 'next' entry, but actually start at the given entry,
and they are used expecting that behaviour.
So fix the documentation and change the names to *from for consistency
with list.h

Signed-off-by: NeilBrown <neilb@suse.com>
---
 .clang-format              |    8 ++++----
 include/linux/rhashtable.h |   38 +++++++++++++++++++-------------------
 lib/rhashtable.c           |    4 ++--
 3 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/.clang-format b/.clang-format
index faffc0d5af4e..53a5188df01a 100644
--- a/.clang-format
+++ b/.clang-format
@@ -332,14 +332,14 @@ ForEachMacros:
   - 'rhl_for_each_entry_rcu'
   - 'rhl_for_each_rcu'
   - 'rht_for_each'
-  - 'rht_for_each_continue'
+  - 'rht_for_each_from'
   - 'rht_for_each_entry'
-  - 'rht_for_each_entry_continue'
+  - 'rht_for_each_entry_from'
   - 'rht_for_each_entry_rcu'
-  - 'rht_for_each_entry_rcu_continue'
+  - 'rht_for_each_entry_rcu_from'
   - 'rht_for_each_entry_safe'
   - 'rht_for_each_rcu'
-  - 'rht_for_each_rcu_continue'
+  - 'rht_for_each_rcu_from'
   - '__rq_for_each_bio'
   - 'rq_for_each_segment'
   - 'scsi_for_each_prot_sg'
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index f58cb62dce05..e148f72d4a89 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -418,13 +418,13 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
 }
 
 /**
- * rht_for_each_continue - continue iterating over hash chain
+ * rht_for_each_from - iterate over hash chain from given head
  * @pos:	the &struct rhash_head to use as a loop cursor.
- * @head:	the previous &struct rhash_head to continue from
+ * @head:	the &struct rhash_head to start from
  * @tbl:	the &struct bucket_table
  * @hash:	the hash value / bucket index
  */
-#define rht_for_each_continue(pos, head, tbl, hash) \
+#define rht_for_each_from(pos, head, tbl, hash) \
 	for (pos = rht_dereference_bucket(head, tbl, hash); \
 	     !rht_is_a_nulls(pos); \
 	     pos = rht_dereference_bucket((pos)->next, tbl, hash))
@@ -436,18 +436,18 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * @hash:	the hash value / bucket index
  */
 #define rht_for_each(pos, tbl, hash) \
-	rht_for_each_continue(pos, rht_ptr(*rht_bucket(tbl, hash)), tbl, hash)
+	rht_for_each_from(pos, rht_ptr(*rht_bucket(tbl, hash)), tbl, hash)
 
 /**
- * rht_for_each_entry_continue - continue iterating over hash chain
+ * rht_for_each_entry_from - iterate over hash chain from given head
  * @tpos:	the type * to use as a loop cursor.
  * @pos:	the &struct rhash_head to use as a loop cursor.
- * @head:	the previous &struct rhash_head to continue from
+ * @head:	the &struct rhash_head to start from
  * @tbl:	the &struct bucket_table
  * @hash:	the hash value / bucket index
  * @member:	name of the &struct rhash_head within the hashable struct.
  */
-#define rht_for_each_entry_continue(tpos, pos, head, tbl, hash, member)	\
+#define rht_for_each_entry_from(tpos, pos, head, tbl, hash, member)	\
 	for (pos = rht_dereference_bucket(head, tbl, hash);		\
 	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	\
 	     pos = rht_dereference_bucket((pos)->next, tbl, hash))
@@ -461,7 +461,7 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * @member:	name of the &struct rhash_head within the hashable struct.
  */
 #define rht_for_each_entry(tpos, pos, tbl, hash, member)		\
-	rht_for_each_entry_continue(tpos, pos, rht_ptr(*rht_bucket(tbl, hash)), \
+	rht_for_each_entry_from(tpos, pos, rht_ptr(*rht_bucket(tbl, hash)), \
 				    tbl, hash, member)
 
 /**
@@ -487,9 +487,9 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
 		       rht_dereference_bucket(pos->next, tbl, hash) : NULL)
 
 /**
- * rht_for_each_rcu_continue - continue iterating over rcu hash chain
+ * rht_for_each_rcu_from - iterate over rcu hash chain from given head
  * @pos:	the &struct rhash_head to use as a loop cursor.
- * @head:	the previous &struct rhash_head to continue from
+ * @head:	the &struct rhash_head to start from
  * @tbl:	the &struct bucket_table
  * @hash:	the hash value / bucket index
  *
@@ -497,7 +497,7 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * the _rcu mutation primitives such as rhashtable_insert() as long as the
  * traversal is guarded by rcu_read_lock().
  */
-#define rht_for_each_rcu_continue(pos, head, tbl, hash)			\
+#define rht_for_each_rcu_from(pos, head, tbl, hash)			\
 	for (({barrier(); }),						\
 		pos = rht_ptr(rht_dereference_bucket_rcu(head, tbl, hash)); \
 	     !rht_is_a_nulls(pos);					\
@@ -521,10 +521,10 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
 	     pos = rcu_dereference_raw(pos->next))
 
 /**
- * rht_for_each_entry_rcu_continue - continue iterating over rcu hash chain
+ * rht_for_each_entry_rcu_from - iterated over rcu hash chain from given head
  * @tpos:	the type * to use as a loop cursor.
  * @pos:	the &struct rhash_head to use as a loop cursor.
- * @head:	the previous &struct rhash_head to continue from
+ * @head:	the &struct rhash_head to start from
  * @tbl:	the &struct bucket_table
  * @hash:	the hash value / bucket index
  * @member:	name of the &struct rhash_head within the hashable struct.
@@ -533,7 +533,7 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * the _rcu mutation primitives such as rhashtable_insert() as long as the
  * traversal is guarded by rcu_read_lock().
  */
-#define rht_for_each_entry_rcu_continue(tpos, pos, head, tbl, hash, member) \
+#define rht_for_each_entry_rcu_from(tpos, pos, head, tbl, hash, member) \
 	for (({barrier(); }),						    \
 	     pos = rht_dereference_bucket_rcu(head, tbl, hash);		    \
 	     (!rht_is_a_nulls(pos)) && rht_entry(tpos, pos, member);	    \
@@ -552,7 +552,7 @@ static inline struct rhash_head __rcu **rht_bucket_insert(
  * traversal is guarded by rcu_read_lock().
  */
 #define rht_for_each_entry_rcu(tpos, pos, tbl, hash, member)		   \
-	rht_for_each_entry_rcu_continue(tpos, pos,			   \
+	rht_for_each_entry_rcu_from(tpos, pos,			   \
 					rht_ptr(*rht_bucket(tbl, hash)),   \
 					tbl, hash, member)
 
@@ -609,7 +609,7 @@ static inline struct rhash_head *__rhashtable_lookup(
 	hash = rht_key_hashfn(ht, tbl, key, params);
 	head = rht_bucket(tbl, hash);
 	do {
-		rht_for_each_rcu_continue(he, *head, tbl, hash) {
+		rht_for_each_rcu_from(he, *head, tbl, hash) {
 			if (params.obj_cmpfn ?
 			    params.obj_cmpfn(&arg, rht_obj(ht, he)) :
 			    rhashtable_compare(&arg, rht_obj(ht, he)))
@@ -744,7 +744,7 @@ static inline void *__rhashtable_insert_fast(
 	}
 
 
-	rht_for_each_continue(head, rht_ptr(*headp), tbl, hash) {
+	rht_for_each_from(head, rht_ptr(*headp), tbl, hash) {
 		struct rhlist_head *plist;
 		struct rhlist_head *list;
 
@@ -1027,7 +1027,7 @@ static inline int __rhashtable_remove_fast_one(
 	lock = pprev;
 	rht_lock(lock);
 
-	rht_for_each_continue(he, rht_ptr(*pprev), tbl, hash) {
+	rht_for_each_from(he, rht_ptr(*pprev), tbl, hash) {
 		struct rhlist_head *list;
 
 		list = container_of(he, struct rhlist_head, rhead);
@@ -1189,7 +1189,7 @@ static inline int __rhashtable_replace_fast(
 	lock = pprev;
 	rht_lock(lock);
 
-	rht_for_each_continue(he, rht_ptr(*pprev), tbl, hash) {
+	rht_for_each_from(he, rht_ptr(*pprev), tbl, hash) {
 		if (he != obj_old) {
 			pprev = &he->next;
 			continue;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 1b9820787cd5..84288142ddf6 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -232,7 +232,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht,
 
 	err = -ENOENT;
 
-	rht_for_each_continue(entry, rht_ptr(*pprev), old_tbl, old_hash) {
+	rht_for_each_from(entry, rht_ptr(*pprev), old_tbl, old_hash) {
 		err = 0;
 		next = rht_dereference_bucket(entry->next, old_tbl, old_hash);
 
@@ -527,7 +527,7 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
 	int elasticity;
 
 	elasticity = RHT_ELASTICITY;
-	rht_for_each_continue(head, rht_ptr(*pprev), tbl, hash) {
+	rht_for_each_from(head, rht_ptr(*pprev), tbl, hash) {
 		struct rhlist_head *list;
 		struct rhlist_head *plist;
 

^ permalink raw reply related

* [PATCH 05/18] rhashtable: simplify INIT_RHT_NULLS_HEAD()
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

The 'ht' and 'hash' arguments to INIT_RHT_NULLS_HEAD() are
no longer used - so drop them.  This allows us to also
from the nhash argument from nested_table_alloc().

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |    2 +-
 lib/rhashtable.c           |   17 +++++++----------
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 25d839881ae5..49f80506d02e 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -77,7 +77,7 @@ struct bucket_table {
 
 #define	RHT_NULLS_MARKER(ptr)	\
 	((void *)NULLS_MARKER(((unsigned long) (ptr)) >> 1))
-#define INIT_RHT_NULLS_HEAD(ptr, ht, hash) \
+#define INIT_RHT_NULLS_HEAD(ptr)	\
 	((ptr) = RHT_NULLS_MARKER(&(ptr)))
 
 static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 69f05cf9e9e8..8582e1916c2d 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -116,8 +116,7 @@ static void bucket_table_free_rcu(struct rcu_head *head)
 
 static union nested_table *nested_table_alloc(struct rhashtable *ht,
 					      union nested_table __rcu **prev,
-					      unsigned int shifted,
-					      unsigned int nhash)
+					      unsigned int shifted)
 {
 	union nested_table *ntbl;
 	int i;
@@ -130,8 +129,7 @@ static union nested_table *nested_table_alloc(struct rhashtable *ht,
 
 	if (ntbl && shifted) {
 		for (i = 0; i < PAGE_SIZE / sizeof(ntbl[0].bucket); i++)
-			INIT_RHT_NULLS_HEAD(ntbl[i].bucket, ht,
-					    (i << shifted) | nhash);
+			INIT_RHT_NULLS_HEAD(ntbl[i].bucket);
 	}
 
 	rcu_assign_pointer(*prev, ntbl);
@@ -157,7 +155,7 @@ static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht,
 		return NULL;
 
 	if (!nested_table_alloc(ht, (union nested_table __rcu **)tbl->buckets,
-				0, 0)) {
+				0)) {
 		kfree(tbl);
 		return NULL;
 	}
@@ -207,7 +205,7 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
 	tbl->hash_rnd = get_random_u32();
 
 	for (i = 0; i < nbuckets; i++)
-		INIT_RHT_NULLS_HEAD(tbl->buckets[i], ht, i);
+		INIT_RHT_NULLS_HEAD(tbl->buckets[i]);
 
 	return tbl;
 }
@@ -1194,7 +1192,7 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
 
 	if (!ntbl) {
 		if (!rhnull)
-			INIT_RHT_NULLS_HEAD(rhnull, NULL, 0);
+			INIT_RHT_NULLS_HEAD(rhnull);
 		return &rhnull;
 	}
 
@@ -1219,7 +1217,7 @@ struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
 	nhash = index;
 	shifted = tbl->nest;
 	ntbl = nested_table_alloc(ht, &ntbl[index].table,
-				  size <= (1 << shift) ? shifted : 0, nhash);
+				  size <= (1 << shift) ? shifted : 0);
 
 	while (ntbl && size > (1 << shift)) {
 		index = hash & ((1 << shift) - 1);
@@ -1228,8 +1226,7 @@ struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
 		nhash |= index << shifted;
 		shifted += shift;
 		ntbl = nested_table_alloc(ht, &ntbl[index].table,
-					  size <= (1 << shift) ? shifted : 0,
-					  nhash);
+					  size <= (1 << shift) ? shifted : 0);
 	}
 
 	if (!ntbl)

^ permalink raw reply related

* [PATCH 11/18] rhashtable: further improve stability of rhashtable_walk
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

If the sequence:
   obj = rhashtable_walk_next(iter);
   rhashtable_walk_stop(iter);
   rhashtable_remove_fast(ht, &obj->head, params);
   rhashtable_walk_start(iter);

 races with another thread inserting or removing
 an object on the same hash chain, a subsequent
 rhashtable_walk_next() is not guaranteed to get the "next"
 object. It is possible that an object could be
 repeated, or missed.

 This can be made more reliable by keeping the objects in a hash chain
 sorted by memory address.  A subsequent rhashtable_walk_next()
 call can reliably find the correct position in the list, and thus
 find the 'next' object.

 It is not possible to take this approach with an rhltable as keeping
 the hash chain in order is not so easy.  When the first object with a
 given key is removed, it is replaced in the chain with the next
 object with the same key, and the address of that object may not be
 correctly ordered.
 I have not yet found any way to achieve the same stability
 with rhltables, that doesn't have a major impact on lookup
 or insert.  No code currently in Linux would benefit from
 such extra stability.

 With this patch:
 - a new object is always inserted after the last object with a
   smaller address, or at the start.  This preserves the property,
   important when allowing objects to be removed and re-added, that
   an object is never inserted *after* a position that it previously
   held in the list.
 - when rhashtable_walk_start() is called, it records that 'p' is not
   'safe', meaning that it cannot be dereferenced.  The revalidation
   that was previously done here is moved to rhashtable_walk_next()
 - when rhashtable_walk_next() is called while p is not NULL and not
   safe, it walks the chain looking for the first object with an
   address greater than p and returns that.  If there is none, it moves
   to the next hash chain.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable-types.h |    1 
 include/linux/rhashtable.h       |   10 ++++-
 lib/rhashtable.c                 |   82 +++++++++++++++++++++++++-------------
 3 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index 763d613ce2c2..bc3e84547ba7 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -126,6 +126,7 @@ struct rhashtable_iter {
 	struct rhashtable_walker walker;
 	unsigned int slot;
 	unsigned int skip;
+	bool p_is_unsafe;
 	bool end_of_table;
 };
 
diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 4e2fc97667e0..c612d2446d90 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -627,7 +627,12 @@ static inline void *__rhashtable_insert_fast(
 		    (params.obj_cmpfn ?
 		     params.obj_cmpfn(&arg, rht_obj(ht, head)) :
 		     rhashtable_compare(&arg, rht_obj(ht, head)))) {
-			pprev = &head->next;
+			if (rhlist) {
+				pprev = &head->next;
+			} else {
+				if (head < obj)
+					headp = &head->next;
+			}
 			continue;
 		}
 
@@ -1123,7 +1128,8 @@ static inline int rhashtable_walk_init(struct rhashtable *ht,
  * Note that if you restart a walk after rhashtable_walk_stop you
  * may see the same object twice.  Also, you may miss objects if
  * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * call to rhashtable_walk_start.  Note that this is different to
+ * rhashtable_walk_enter() which misses objects.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 120382b0bad0..5ab0f4825271 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -230,6 +230,7 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
 	struct bucket_table *new_tbl = rhashtable_last_table(ht, old_tbl);
 	struct rhash_head __rcu **pprev = rht_bucket_var(old_tbl, old_hash);
+	struct rhash_head __rcu **inspos;
 	int err = -EAGAIN;
 	struct rhash_head *head, *next, *entry;
 	spinlock_t *new_bucket_lock;
@@ -258,12 +259,15 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 	new_bucket_lock = rht_bucket_lock(new_tbl, new_hash);
 
 	spin_lock_nested(new_bucket_lock, SINGLE_DEPTH_NESTING);
-	head = rht_dereference_bucket(new_tbl->buckets[new_hash],
-				      new_tbl, new_hash);
-
+	inspos = &new_tbl->buckets[new_hash];
+	head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
+	while (!rht_is_a_nulls(head) && head < entry) {
+		inspos = &head->next;
+		head = rht_dereference_bucket(*inspos, new_tbl, new_hash);
+	}
 	RCU_INIT_POINTER(entry->next, head);
 
-	rcu_assign_pointer(new_tbl->buckets[new_hash], entry);
+	rcu_assign_pointer(*inspos, entry);
 	spin_unlock(new_bucket_lock);
 
 	rcu_assign_pointer(*pprev, next);
@@ -559,6 +563,10 @@ static struct bucket_table *rhashtable_insert_one(struct rhashtable *ht,
 		return ERR_PTR(-ENOMEM);
 
 	head = rht_dereference_bucket(*pprev, tbl, hash);
+	while (!ht->rhlist && !rht_is_a_nulls(head) && head < obj) {
+		pprev = &head->next;
+		head = rht_dereference_bucket(*pprev, tbl, hash);
+	}
 
 	RCU_INIT_POINTER(obj->next, head);
 	if (ht->rhlist) {
@@ -653,10 +661,10 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
  *
  * This function prepares a hash table walk.
  *
- * Note that if you restart a walk after rhashtable_walk_stop you
- * may see the same object twice.  Also, you may miss objects if
- * there are removals in between rhashtable_walk_stop and the next
- * call to rhashtable_walk_start.
+ * A walk is guaranteed to return every object that was in
+ * the table before this call, and is still in the table when
+ * rhashtable_walk_next() returns NULL.  Duplicates can be
+ * seen, but only if there is a rehash event during the walk.
  *
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
@@ -740,19 +748,10 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter)
 
 	if (iter->p && !rhlist) {
 		/*
-		 * We need to validate that 'p' is still in the table, and
-		 * if so, update 'skip'
+		 * 'p' will be revalidated when rhashtable_walk_next()
+		 * is called.
 		 */
-		struct rhash_head *p;
-		int skip = 0;
-		rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
-			skip++;
-			if (p == iter->p) {
-				iter->skip = skip;
-				goto found;
-			}
-		}
-		iter->p = NULL;
+		iter->p_is_unsafe = true;
 	} else if (iter->p && rhlist) {
 		/* Need to validate that 'list' is still in the table, and
 		 * if so, update 'skip' and 'p'.
@@ -869,15 +868,39 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 	bool rhlist = ht->rhlist;
 
 	if (p) {
-		if (!rhlist || !(list = rcu_dereference(list->next))) {
-			p = rcu_dereference(p->next);
-			list = container_of(p, struct rhlist_head, rhead);
-		}
-		if (!rht_is_a_nulls(p)) {
-			iter->skip++;
-			iter->p = p;
-			iter->list = list;
-			return rht_obj(ht, rhlist ? &list->rhead : p);
+		if (!rhlist && iter->p_is_unsafe) {
+			/*
+			 * First time next() was called after start().
+			 * Need to find location of 'p' in the list.
+			 */
+			struct rhash_head *p;
+
+			iter->skip = 0;
+			rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+				iter->skip++;
+				if (p <= iter->p)
+					continue;
+
+				/* p is the next object after iter->p */
+				iter->p = p;
+				iter->p_is_unsafe = false;
+				return rht_obj(ht, p);
+			}
+			/* There is no "next" object in the list, move
+			 * to next hash chain.
+			 */
+		} else {
+			if (!rhlist || !(list = rcu_dereference(list->next))) {
+				p = rcu_dereference(p->next);
+				list = container_of(p, struct rhlist_head,
+						    rhead);
+			}
+			if (!rht_is_a_nulls(p)) {
+				iter->skip++;
+				iter->p = p;
+				iter->list = list;
+				return rht_obj(ht, rhlist ? &list->rhead : p);
+			}
 		}
 
 		/* At the end of this slot, switch to next one and then find
@@ -887,6 +910,7 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 		iter->slot++;
 	}
 
+	iter->p_is_unsafe = false;
 	return __rhashtable_walk_find_next(iter);
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);

^ permalink raw reply related

* [PATCH 14/18] rhashtable: allow rht_bucket_var to return NULL.
From: NeilBrown @ 2018-06-01  4:44 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782754287.30340.4395718227884933670.stgit@noble>

Rather than returning a pointer a static nulls, rht_bucket_var()
now returns NULL if the bucket doesn't exist.
This will make the next patch, which stores a bitlock in the
bucket pointer, somewhat cleaner.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |   11 +++++++++--
 lib/rhashtable.c           |   26 +++++++++++++++++++-------
 2 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 5f0511bd5a39..81ca3ed2927b 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -256,6 +256,8 @@ void rhashtable_destroy(struct rhashtable *ht);
 
 struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
 					    unsigned int hash);
+struct rhash_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
+					      unsigned int hash);
 struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,
 						   struct bucket_table *tbl,
 						   unsigned int hash);
@@ -285,7 +287,7 @@ static inline struct rhash_head __rcu *const *rht_bucket(
 static inline struct rhash_head __rcu **rht_bucket_var(
 	struct bucket_table *tbl, unsigned int hash)
 {
-	return unlikely(tbl->nest) ? rht_bucket_nested(tbl, hash) :
+	return unlikely(tbl->nest) ? __rht_bucket_nested(tbl, hash) :
 				     &tbl->buckets[hash];
 }
 
@@ -888,6 +890,8 @@ static inline int __rhashtable_remove_fast_one(
 	spin_lock_bh(lock);
 
 	pprev = rht_bucket_var(tbl, hash);
+	if (!pprev)
+		goto out;
 	rht_for_each_continue(he, *pprev, tbl, hash) {
 		struct rhlist_head *list;
 
@@ -932,6 +936,7 @@ static inline int __rhashtable_remove_fast_one(
 		break;
 	}
 
+out:
 	spin_unlock_bh(lock);
 
 	if (err > 0) {
@@ -1040,6 +1045,8 @@ static inline int __rhashtable_replace_fast(
 	spin_lock_bh(lock);
 
 	pprev = rht_bucket_var(tbl, hash);
+	if (!pprev)
+		goto out;
 	rht_for_each_continue(he, *pprev, tbl, hash) {
 		if (he != obj_old) {
 			pprev = &he->next;
@@ -1051,7 +1058,7 @@ static inline int __rhashtable_replace_fast(
 		err = 0;
 		break;
 	}
-
+out:
 	spin_unlock_bh(lock);
 
 	return err;
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 919eebd6757d..bac2493808f0 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -240,8 +240,10 @@ static int rhashtable_rehash_one(struct rhashtable *ht, unsigned int old_hash)
 		goto out;
 
 	err = -ENOENT;
+	if (!pprev)
+		goto out;
 
-	rht_for_each(entry, old_tbl, old_hash) {
+	rht_for_each_continue(entry, *pprev, old_tbl, old_hash) {
 		err = 0;
 		next = rht_dereference_bucket(entry->next, old_tbl, old_hash);
 
@@ -498,6 +500,8 @@ static void *rhashtable_lookup_one(struct rhashtable *ht,
 
 	elasticity = RHT_ELASTICITY;
 	pprev = rht_bucket_var(tbl, hash);
+	if (!pprev)
+		return ERR_PTR(-ENOENT);
 	rht_for_each_continue(head, *pprev, tbl, hash) {
 		struct rhlist_head *list;
 		struct rhlist_head *plist;
@@ -1161,11 +1165,10 @@ void rhashtable_destroy(struct rhashtable *ht)
 }
 EXPORT_SYMBOL_GPL(rhashtable_destroy);
 
-struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
-					    unsigned int hash)
+struct rhash_head __rcu **__rht_bucket_nested(const struct bucket_table *tbl,
+					      unsigned int hash)
 {
 	const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
-	static struct rhash_head __rcu *rhnull;
 	unsigned int index = hash & ((1 << tbl->nest) - 1);
 	unsigned int size = tbl->size >> tbl->nest;
 	unsigned int subhash = hash;
@@ -1184,14 +1187,23 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
 	}
 
 	if (!ntbl) {
-		if (!rhnull)
-			INIT_RHT_NULLS_HEAD(rhnull);
-		return &rhnull;
+		return NULL;
 	}
 
 	return &ntbl[subhash].bucket;
 
 }
+EXPORT_SYMBOL_GPL(__rht_bucket_nested);
+
+struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
+					    unsigned int hash)
+{
+	static struct rhash_head __rcu *rhnull;
+
+	if (!rhnull)
+		INIT_RHT_NULLS_HEAD(rhnull);
+	return __rht_bucket_nested(tbl, hash) ?: &rhnull;
+}
 EXPORT_SYMBOL_GPL(rht_bucket_nested);
 
 struct rhash_head __rcu **rht_bucket_nested_insert(struct rhashtable *ht,

^ permalink raw reply related

* [net-next:master 375/376] net/core/rtnetlink.c:3099:1: warning: the frame size of 1280 bytes is larger than 1024 bytes
From: kbuild test robot @ 2018-06-01  5:07 UTC (permalink / raw)
  To: Kees Cook; +Cc: kbuild-all, netdev

[-- Attachment #1: Type: text/plain, Size: 18883 bytes --]

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   4b8e6ac41a594ea67ded6af6af5935f03221ea4c
commit: ccf8dbcd062a930e64741c939ca784d15316aa0c [375/376] rtnetlink: Remove VLA usage
config: um-x86_64_defconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        git checkout ccf8dbcd062a930e64741c939ca784d15316aa0c
        # save the attached .config to linux build tree
        make ARCH=um SUBARCH=x86_64

All warnings (new ones prefixed by >>):

   net/core/rtnetlink.c: In function 'rtnl_newlink':
>> net/core/rtnetlink.c:3099:1: warning: the frame size of 1280 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^

vim +3099 net/core/rtnetlink.c

e7ed828f1 Vlad Dogaru       2011-01-13  2842  
c21ef3e34 David Ahern       2017-04-16  2843  static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
c21ef3e34 David Ahern       2017-04-16  2844  			struct netlink_ext_ack *extack)
38f7b870d Patrick McHardy   2007-06-13  2845  {
3b1e0a655 YOSHIFUJI Hideaki 2008-03-26  2846  	struct net *net = sock_net(skb->sk);
38f7b870d Patrick McHardy   2007-06-13  2847  	const struct rtnl_link_ops *ops;
ba7d49b1f Jiri Pirko        2014-01-22  2848  	const struct rtnl_link_ops *m_ops = NULL;
38f7b870d Patrick McHardy   2007-06-13  2849  	struct net_device *dev;
ba7d49b1f Jiri Pirko        2014-01-22  2850  	struct net_device *master_dev = NULL;
38f7b870d Patrick McHardy   2007-06-13  2851  	struct ifinfomsg *ifm;
38f7b870d Patrick McHardy   2007-06-13  2852  	char kind[MODULE_NAME_LEN];
38f7b870d Patrick McHardy   2007-06-13  2853  	char ifname[IFNAMSIZ];
38f7b870d Patrick McHardy   2007-06-13  2854  	struct nlattr *tb[IFLA_MAX+1];
38f7b870d Patrick McHardy   2007-06-13  2855  	struct nlattr *linkinfo[IFLA_INFO_MAX+1];
5517750f0 Tom Gundersen     2014-07-14  2856  	unsigned char name_assign_type = NET_NAME_USER;
38f7b870d Patrick McHardy   2007-06-13  2857  	int err;
38f7b870d Patrick McHardy   2007-06-13  2858  
95a5afca4 Johannes Berg     2008-10-16  2859  #ifdef CONFIG_MODULES
38f7b870d Patrick McHardy   2007-06-13  2860  replay:
8072f085d Thomas Graf       2007-07-31  2861  #endif
c21ef3e34 David Ahern       2017-04-16  2862  	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy, extack);
38f7b870d Patrick McHardy   2007-06-13  2863  	if (err < 0)
38f7b870d Patrick McHardy   2007-06-13  2864  		return err;
38f7b870d Patrick McHardy   2007-06-13  2865  
4ff66cae7 Christian Brauner 2018-02-07  2866  	err = rtnl_ensure_unique_netns(tb, extack, false);
4ff66cae7 Christian Brauner 2018-02-07  2867  	if (err < 0)
4ff66cae7 Christian Brauner 2018-02-07  2868  		return err;
4ff66cae7 Christian Brauner 2018-02-07  2869  
38f7b870d Patrick McHardy   2007-06-13  2870  	if (tb[IFLA_IFNAME])
38f7b870d Patrick McHardy   2007-06-13  2871  		nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
38f7b870d Patrick McHardy   2007-06-13  2872  	else
38f7b870d Patrick McHardy   2007-06-13  2873  		ifname[0] = '\0';
38f7b870d Patrick McHardy   2007-06-13  2874  
38f7b870d Patrick McHardy   2007-06-13  2875  	ifm = nlmsg_data(nlh);
38f7b870d Patrick McHardy   2007-06-13  2876  	if (ifm->ifi_index > 0)
881d966b4 Eric W. Biederman 2007-09-17  2877  		dev = __dev_get_by_index(net, ifm->ifi_index);
e7ed828f1 Vlad Dogaru       2011-01-13  2878  	else {
e7ed828f1 Vlad Dogaru       2011-01-13  2879  		if (ifname[0])
881d966b4 Eric W. Biederman 2007-09-17  2880  			dev = __dev_get_by_name(net, ifname);
38f7b870d Patrick McHardy   2007-06-13  2881  		else
38f7b870d Patrick McHardy   2007-06-13  2882  			dev = NULL;
e7ed828f1 Vlad Dogaru       2011-01-13  2883  	}
38f7b870d Patrick McHardy   2007-06-13  2884  
ba7d49b1f Jiri Pirko        2014-01-22  2885  	if (dev) {
ba7d49b1f Jiri Pirko        2014-01-22  2886  		master_dev = netdev_master_upper_dev_get(dev);
ba7d49b1f Jiri Pirko        2014-01-22  2887  		if (master_dev)
ba7d49b1f Jiri Pirko        2014-01-22  2888  			m_ops = master_dev->rtnl_link_ops;
ba7d49b1f Jiri Pirko        2014-01-22  2889  	}
ba7d49b1f Jiri Pirko        2014-01-22  2890  
e0d087af7 Eric Dumazet      2009-11-07  2891  	err = validate_linkmsg(dev, tb);
e0d087af7 Eric Dumazet      2009-11-07  2892  	if (err < 0)
1840bb13c Thomas Graf       2008-02-23  2893  		return err;
1840bb13c Thomas Graf       2008-02-23  2894  
38f7b870d Patrick McHardy   2007-06-13  2895  	if (tb[IFLA_LINKINFO]) {
38f7b870d Patrick McHardy   2007-06-13  2896  		err = nla_parse_nested(linkinfo, IFLA_INFO_MAX,
fceb6435e Johannes Berg     2017-04-12  2897  				       tb[IFLA_LINKINFO], ifla_info_policy,
fceb6435e Johannes Berg     2017-04-12  2898  				       NULL);
38f7b870d Patrick McHardy   2007-06-13  2899  		if (err < 0)
38f7b870d Patrick McHardy   2007-06-13  2900  			return err;
38f7b870d Patrick McHardy   2007-06-13  2901  	} else
38f7b870d Patrick McHardy   2007-06-13  2902  		memset(linkinfo, 0, sizeof(linkinfo));
38f7b870d Patrick McHardy   2007-06-13  2903  
38f7b870d Patrick McHardy   2007-06-13  2904  	if (linkinfo[IFLA_INFO_KIND]) {
38f7b870d Patrick McHardy   2007-06-13  2905  		nla_strlcpy(kind, linkinfo[IFLA_INFO_KIND], sizeof(kind));
38f7b870d Patrick McHardy   2007-06-13  2906  		ops = rtnl_link_ops_get(kind);
38f7b870d Patrick McHardy   2007-06-13  2907  	} else {
38f7b870d Patrick McHardy   2007-06-13  2908  		kind[0] = '\0';
38f7b870d Patrick McHardy   2007-06-13  2909  		ops = NULL;
38f7b870d Patrick McHardy   2007-06-13  2910  	}
38f7b870d Patrick McHardy   2007-06-13  2911  
38f7b870d Patrick McHardy   2007-06-13  2912  	if (1) {
ccf8dbcd0 Kees Cook         2018-05-30  2913  		struct nlattr *attr[RTNL_MAX_TYPE + 1];
ccf8dbcd0 Kees Cook         2018-05-30  2914  		struct nlattr *slave_attr[RTNL_SLAVE_MAX_TYPE + 1];
ba7d49b1f Jiri Pirko        2014-01-22  2915  		struct nlattr **data = NULL;
ba7d49b1f Jiri Pirko        2014-01-22  2916  		struct nlattr **slave_data = NULL;
317f4810e Nicolas Dichtel   2015-01-15  2917  		struct net *dest_net, *link_net = NULL;
38f7b870d Patrick McHardy   2007-06-13  2918  
38f7b870d Patrick McHardy   2007-06-13  2919  		if (ops) {
ccf8dbcd0 Kees Cook         2018-05-30  2920  			if (ops->maxtype > RTNL_MAX_TYPE)
ccf8dbcd0 Kees Cook         2018-05-30  2921  				return -EINVAL;
ccf8dbcd0 Kees Cook         2018-05-30  2922  
38f7b870d Patrick McHardy   2007-06-13  2923  			if (ops->maxtype && linkinfo[IFLA_INFO_DATA]) {
38f7b870d Patrick McHardy   2007-06-13  2924  				err = nla_parse_nested(attr, ops->maxtype,
38f7b870d Patrick McHardy   2007-06-13  2925  						       linkinfo[IFLA_INFO_DATA],
fceb6435e Johannes Berg     2017-04-12  2926  						       ops->policy, NULL);
38f7b870d Patrick McHardy   2007-06-13  2927  				if (err < 0)
38f7b870d Patrick McHardy   2007-06-13  2928  					return err;
38f7b870d Patrick McHardy   2007-06-13  2929  				data = attr;
38f7b870d Patrick McHardy   2007-06-13  2930  			}
38f7b870d Patrick McHardy   2007-06-13  2931  			if (ops->validate) {
a8b8a889e Matthias Schiffer 2017-06-25  2932  				err = ops->validate(tb, data, extack);
38f7b870d Patrick McHardy   2007-06-13  2933  				if (err < 0)
38f7b870d Patrick McHardy   2007-06-13  2934  					return err;
38f7b870d Patrick McHardy   2007-06-13  2935  			}
38f7b870d Patrick McHardy   2007-06-13  2936  		}
38f7b870d Patrick McHardy   2007-06-13  2937  
ba7d49b1f Jiri Pirko        2014-01-22  2938  		if (m_ops) {
ccf8dbcd0 Kees Cook         2018-05-30  2939  			if (ops->slave_maxtype > RTNL_SLAVE_MAX_TYPE)
ccf8dbcd0 Kees Cook         2018-05-30  2940  				return -EINVAL;
ccf8dbcd0 Kees Cook         2018-05-30  2941  
ba7d49b1f Jiri Pirko        2014-01-22  2942  			if (m_ops->slave_maxtype &&
ba7d49b1f Jiri Pirko        2014-01-22  2943  			    linkinfo[IFLA_INFO_SLAVE_DATA]) {
ba7d49b1f Jiri Pirko        2014-01-22  2944  				err = nla_parse_nested(slave_attr,
ba7d49b1f Jiri Pirko        2014-01-22  2945  						       m_ops->slave_maxtype,
ba7d49b1f Jiri Pirko        2014-01-22  2946  						       linkinfo[IFLA_INFO_SLAVE_DATA],
fceb6435e Johannes Berg     2017-04-12  2947  						       m_ops->slave_policy,
fceb6435e Johannes Berg     2017-04-12  2948  						       NULL);
ba7d49b1f Jiri Pirko        2014-01-22  2949  				if (err < 0)
ba7d49b1f Jiri Pirko        2014-01-22  2950  					return err;
ba7d49b1f Jiri Pirko        2014-01-22  2951  				slave_data = slave_attr;
ba7d49b1f Jiri Pirko        2014-01-22  2952  			}
ba7d49b1f Jiri Pirko        2014-01-22  2953  		}
ba7d49b1f Jiri Pirko        2014-01-22  2954  
38f7b870d Patrick McHardy   2007-06-13  2955  		if (dev) {
90c325e3b Nicolas Dichtel   2014-09-01  2956  			int status = 0;
38f7b870d Patrick McHardy   2007-06-13  2957  
38f7b870d Patrick McHardy   2007-06-13  2958  			if (nlh->nlmsg_flags & NLM_F_EXCL)
38f7b870d Patrick McHardy   2007-06-13  2959  				return -EEXIST;
38f7b870d Patrick McHardy   2007-06-13  2960  			if (nlh->nlmsg_flags & NLM_F_REPLACE)
38f7b870d Patrick McHardy   2007-06-13  2961  				return -EOPNOTSUPP;
38f7b870d Patrick McHardy   2007-06-13  2962  
38f7b870d Patrick McHardy   2007-06-13  2963  			if (linkinfo[IFLA_INFO_DATA]) {
38f7b870d Patrick McHardy   2007-06-13  2964  				if (!ops || ops != dev->rtnl_link_ops ||
38f7b870d Patrick McHardy   2007-06-13  2965  				    !ops->changelink)
38f7b870d Patrick McHardy   2007-06-13  2966  					return -EOPNOTSUPP;
38f7b870d Patrick McHardy   2007-06-13  2967  
ad744b223 Matthias Schiffer 2017-06-25  2968  				err = ops->changelink(dev, tb, data, extack);
38f7b870d Patrick McHardy   2007-06-13  2969  				if (err < 0)
38f7b870d Patrick McHardy   2007-06-13  2970  					return err;
ba9989069 Nicolas Dichtel   2014-09-01  2971  				status |= DO_SETLINK_NOTIFY;
38f7b870d Patrick McHardy   2007-06-13  2972  			}
38f7b870d Patrick McHardy   2007-06-13  2973  
ba7d49b1f Jiri Pirko        2014-01-22  2974  			if (linkinfo[IFLA_INFO_SLAVE_DATA]) {
ba7d49b1f Jiri Pirko        2014-01-22  2975  				if (!m_ops || !m_ops->slave_changelink)
ba7d49b1f Jiri Pirko        2014-01-22  2976  					return -EOPNOTSUPP;
ba7d49b1f Jiri Pirko        2014-01-22  2977  
ba7d49b1f Jiri Pirko        2014-01-22  2978  				err = m_ops->slave_changelink(master_dev, dev,
17dd0ec47 Matthias Schiffer 2017-06-25  2979  							      tb, slave_data,
17dd0ec47 Matthias Schiffer 2017-06-25  2980  							      extack);
ba7d49b1f Jiri Pirko        2014-01-22  2981  				if (err < 0)
ba7d49b1f Jiri Pirko        2014-01-22  2982  					return err;
ba9989069 Nicolas Dichtel   2014-09-01  2983  				status |= DO_SETLINK_NOTIFY;
ba7d49b1f Jiri Pirko        2014-01-22  2984  			}
ba7d49b1f Jiri Pirko        2014-01-22  2985  
ddf9f9707 Jakub Kicinski    2017-04-30  2986  			return do_setlink(skb, dev, ifm, extack, tb, ifname,
ddf9f9707 Jakub Kicinski    2017-04-30  2987  					  status);
38f7b870d Patrick McHardy   2007-06-13  2988  		}
38f7b870d Patrick McHardy   2007-06-13  2989  
ffa934f19 Patrick McHardy   2011-01-20  2990  		if (!(nlh->nlmsg_flags & NLM_F_CREATE)) {
ffa934f19 Patrick McHardy   2011-01-20  2991  			if (ifm->ifi_index == 0 && tb[IFLA_GROUP])
90f62cf30 Eric W. Biederman 2014-04-23  2992  				return rtnl_group_changelink(skb, net,
ffa934f19 Patrick McHardy   2011-01-20  2993  						nla_get_u32(tb[IFLA_GROUP]),
ddf9f9707 Jakub Kicinski    2017-04-30  2994  						ifm, extack, tb);
38f7b870d Patrick McHardy   2007-06-13  2995  			return -ENODEV;
ffa934f19 Patrick McHardy   2011-01-20  2996  		}
38f7b870d Patrick McHardy   2007-06-13  2997  
160ca0142 Theuns Verwoerd   2017-01-31  2998  		if (tb[IFLA_MAP] || tb[IFLA_PROTINFO])
38f7b870d Patrick McHardy   2007-06-13  2999  			return -EOPNOTSUPP;
38f7b870d Patrick McHardy   2007-06-13  3000  
38f7b870d Patrick McHardy   2007-06-13  3001  		if (!ops) {
95a5afca4 Johannes Berg     2008-10-16  3002  #ifdef CONFIG_MODULES
38f7b870d Patrick McHardy   2007-06-13  3003  			if (kind[0]) {
38f7b870d Patrick McHardy   2007-06-13  3004  				__rtnl_unlock();
38f7b870d Patrick McHardy   2007-06-13  3005  				request_module("rtnl-link-%s", kind);
38f7b870d Patrick McHardy   2007-06-13  3006  				rtnl_lock();
38f7b870d Patrick McHardy   2007-06-13  3007  				ops = rtnl_link_ops_get(kind);
38f7b870d Patrick McHardy   2007-06-13  3008  				if (ops)
38f7b870d Patrick McHardy   2007-06-13  3009  					goto replay;
38f7b870d Patrick McHardy   2007-06-13  3010  			}
38f7b870d Patrick McHardy   2007-06-13  3011  #endif
38f7b870d Patrick McHardy   2007-06-13  3012  			return -EOPNOTSUPP;
38f7b870d Patrick McHardy   2007-06-13  3013  		}
38f7b870d Patrick McHardy   2007-06-13  3014  
b0ab2fabb Jiri Pirko        2014-06-26  3015  		if (!ops->setup)
b0ab2fabb Jiri Pirko        2014-06-26  3016  			return -EOPNOTSUPP;
b0ab2fabb Jiri Pirko        2014-06-26  3017  
5517750f0 Tom Gundersen     2014-07-14  3018  		if (!ifname[0]) {
38f7b870d Patrick McHardy   2007-06-13  3019  			snprintf(ifname, IFNAMSIZ, "%s%%d", ops->kind);
5517750f0 Tom Gundersen     2014-07-14  3020  			name_assign_type = NET_NAME_ENUM;
5517750f0 Tom Gundersen     2014-07-14  3021  		}
38f7b870d Patrick McHardy   2007-06-13  3022  
5bb8ed075 Christian Brauner 2018-01-29  3023  		dest_net = rtnl_link_get_net_capable(skb, net, tb, CAP_NET_ADMIN);
13ad17745 Eric W. Biederman 2011-01-29  3024  		if (IS_ERR(dest_net))
13ad17745 Eric W. Biederman 2011-01-29  3025  			return PTR_ERR(dest_net);
13ad17745 Eric W. Biederman 2011-01-29  3026  
317f4810e Nicolas Dichtel   2015-01-15  3027  		if (tb[IFLA_LINK_NETNSID]) {
317f4810e Nicolas Dichtel   2015-01-15  3028  			int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
317f4810e Nicolas Dichtel   2015-01-15  3029  
317f4810e Nicolas Dichtel   2015-01-15  3030  			link_net = get_net_ns_by_id(dest_net, id);
317f4810e Nicolas Dichtel   2015-01-15  3031  			if (!link_net) {
317f4810e Nicolas Dichtel   2015-01-15  3032  				err =  -EINVAL;
317f4810e Nicolas Dichtel   2015-01-15  3033  				goto out;
317f4810e Nicolas Dichtel   2015-01-15  3034  			}
06615bed6 Eric W. Biederman 2015-02-26  3035  			err = -EPERM;
06615bed6 Eric W. Biederman 2015-02-26  3036  			if (!netlink_ns_capable(skb, link_net->user_ns, CAP_NET_ADMIN))
06615bed6 Eric W. Biederman 2015-02-26  3037  				goto out;
317f4810e Nicolas Dichtel   2015-01-15  3038  		}
317f4810e Nicolas Dichtel   2015-01-15  3039  
317f4810e Nicolas Dichtel   2015-01-15  3040  		dev = rtnl_create_link(link_net ? : dest_net, ifname,
317f4810e Nicolas Dichtel   2015-01-15  3041  				       name_assign_type, ops, tb);
9c7dafbfa Pavel Emelyanov   2012-08-08  3042  		if (IS_ERR(dev)) {
e71992889 Pavel Emelianov   2007-08-08  3043  			err = PTR_ERR(dev);
9c7dafbfa Pavel Emelyanov   2012-08-08  3044  			goto out;
9c7dafbfa Pavel Emelyanov   2012-08-08  3045  		}
9c7dafbfa Pavel Emelyanov   2012-08-08  3046  
9c7dafbfa Pavel Emelyanov   2012-08-08  3047  		dev->ifindex = ifm->ifi_index;
9c7dafbfa Pavel Emelyanov   2012-08-08  3048  
0e0eee246 Cong Wang         2014-02-11  3049  		if (ops->newlink) {
7a3f4a185 Matthias Schiffer 2017-06-25  3050  			err = ops->newlink(link_net ? : net, dev, tb, data,
7a3f4a185 Matthias Schiffer 2017-06-25  3051  					   extack);
0e0eee246 Cong Wang         2014-02-11  3052  			/* Drivers should call free_netdev() in ->destructor
e51fb1523 Cong Wang         2014-06-03  3053  			 * and unregister it on failure after registration
e51fb1523 Cong Wang         2014-06-03  3054  			 * so that device could be finally freed in rtnl_unlock.
0e0eee246 Cong Wang         2014-02-11  3055  			 */
e51fb1523 Cong Wang         2014-06-03  3056  			if (err < 0) {
e51fb1523 Cong Wang         2014-06-03  3057  				/* If device is not registered at all, free it now */
e51fb1523 Cong Wang         2014-06-03  3058  				if (dev->reg_state == NETREG_UNINITIALIZED)
e51fb1523 Cong Wang         2014-06-03  3059  					free_netdev(dev);
0e0eee246 Cong Wang         2014-02-11  3060  				goto out;
e51fb1523 Cong Wang         2014-06-03  3061  			}
0e0eee246 Cong Wang         2014-02-11  3062  		} else {
2d85cba2b Patrick McHardy   2007-07-11  3063  			err = register_netdevice(dev);
fce9b9be8 Dan Carpenter     2013-08-14  3064  			if (err < 0) {
38f7b870d Patrick McHardy   2007-06-13  3065  				free_netdev(dev);
3729d5021 Patrick McHardy   2010-02-26  3066  				goto out;
fce9b9be8 Dan Carpenter     2013-08-14  3067  			}
0e0eee246 Cong Wang         2014-02-11  3068  		}
3729d5021 Patrick McHardy   2010-02-26  3069  		err = rtnl_configure_link(dev, ifm);
436389007 David S. Miller   2015-03-10  3070  		if (err < 0)
436389007 David S. Miller   2015-03-10  3071  			goto out_unregister;
bdef279b9 Nicolas Dichtel   2015-01-20  3072  		if (link_net) {
317f4810e Nicolas Dichtel   2015-01-15  3073  			err = dev_change_net_namespace(dev, dest_net, ifname);
bdef279b9 Nicolas Dichtel   2015-01-20  3074  			if (err < 0)
436389007 David S. Miller   2015-03-10  3075  				goto out_unregister;
bdef279b9 Nicolas Dichtel   2015-01-20  3076  		}
160ca0142 Theuns Verwoerd   2017-01-31  3077  		if (tb[IFLA_MASTER]) {
33eaf2a6e David Ahern       2017-10-04  3078  			err = do_set_master(dev, nla_get_u32(tb[IFLA_MASTER]),
33eaf2a6e David Ahern       2017-10-04  3079  					    extack);
160ca0142 Theuns Verwoerd   2017-01-31  3080  			if (err)
160ca0142 Theuns Verwoerd   2017-01-31  3081  				goto out_unregister;
160ca0142 Theuns Verwoerd   2017-01-31  3082  		}
3729d5021 Patrick McHardy   2010-02-26  3083  out:
317f4810e Nicolas Dichtel   2015-01-15  3084  		if (link_net)
317f4810e Nicolas Dichtel   2015-01-15  3085  			put_net(link_net);
81adee47d Eric W. Biederman 2009-11-08  3086  		put_net(dest_net);
38f7b870d Patrick McHardy   2007-06-13  3087  		return err;
436389007 David S. Miller   2015-03-10  3088  out_unregister:
436389007 David S. Miller   2015-03-10  3089  		if (ops->newlink) {
436389007 David S. Miller   2015-03-10  3090  			LIST_HEAD(list_kill);
436389007 David S. Miller   2015-03-10  3091  
436389007 David S. Miller   2015-03-10  3092  			ops->dellink(dev, &list_kill);
436389007 David S. Miller   2015-03-10  3093  			unregister_netdevice_many(&list_kill);
436389007 David S. Miller   2015-03-10  3094  		} else {
436389007 David S. Miller   2015-03-10  3095  			unregister_netdevice(dev);
436389007 David S. Miller   2015-03-10  3096  		}
436389007 David S. Miller   2015-03-10  3097  		goto out;
38f7b870d Patrick McHardy   2007-06-13  3098  	}
38f7b870d Patrick McHardy   2007-06-13 @3099  }
38f7b870d Patrick McHardy   2007-06-13  3100  

:::::: The code at line 3099 was first introduced by commit
:::::: 38f7b870d4a6a5d3ec21557e849620cb7d032965 [RTNETLINK]: Link creation API

:::::: TO: Patrick McHardy <kaber@trash.net>
:::::: CC: David S. Miller <davem@sunset.davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 7394 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] net: phy: consider PHY_IGNORE_INTERRUPT in state machine PHY_NOLINK handling
From: Heiner Kallweit @ 2018-06-01  6:13 UTC (permalink / raw)
  To: David Miller; +Cc: f.fainelli, andrew, netdev
In-Reply-To: <20180531.212616.129115220944192049.davem@davemloft.net>

On 01.06.2018 03:26, David Miller wrote:
> From: Heiner Kallweit <hkallweit1@gmail.com>
> Date: Wed, 30 May 2018 22:13:20 +0200
> 
>> We can bail out immediately also in case of PHY_IGNORE_INTERRUPT because
>> phy_mac_interupt() informs us once the link is up.
>>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> 
> When state is PHY_NOLINK, the phy_mac_interrupt() code paths
> will change the state to PHY_CHANGELINK before queueing up
> the state machine invocation.
> 
> So I can't even see how we can enter phy_state_machine with
> ->state == PHY_NOLINK is the mac interrupt paths are being
> used properly.
> 
We could enter the state machine with PHY_NOLINK in case
any other activity triggers a state machine run whilst
the link is down. But I'm not sure whether such a
scenario exists.

> Therefore it looks like the code as written is harmless.
> 
> Did you actually hit a problem with this test or is this
> a change based purely upon code inspection?
> 
Right, there is no actual problem, the existing code is
harmless and the change is just based on code inspection.
Small benefit is that it makes clearer that this code path
is applicable in polling mode only.

^ permalink raw reply

* Re: [PATCH net-next] net/smc: fix error return code in smc_setsockopt()
From: Ursula Braun @ 2018-06-01  7:02 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: linux-s390, netdev, kernel-janitors
In-Reply-To: <1527733882-149144-1-git-send-email-weiyongjun1@huawei.com>



On 05/31/2018 04:31 AM, Wei Yongjun wrote:
> Fix to return error code -EINVAL instead of 0 if optlen is invalid.
> 
> Fixes: 01d2f7e2cdd3 ("net/smc: sockopts TCP_NODELAY and TCP_CORK")
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
> ---
>  net/smc/af_smc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 2c369d4..973b447 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -1420,7 +1420,7 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
>  		return rc;
>  
>  	if (optlen < sizeof(int))
> -		return rc;
> +		return -EINVAL;
>  	get_user(val, (int __user *)optval);
>  
>  	lock_sock(sk);
> 

Thanks for reporting this error. Your fix is fine, but I think we can get rid of
this check at all, since it is already checked in the tcp-code invoked before with

   smc->clcsock->ops->setsockopt(smc->clcsock, level, optname, optval, optlen)

^ permalink raw reply

* Re: [PATCH V4] mlx4_core: allocate ICM memory in page size chunks
From: Michal Hocko @ 2018-06-01  7:31 UTC (permalink / raw)
  To: Qing Huang
  Cc: Eric Dumazet, David Miller, tariqt, haakon.bugge, yanjun.zhu,
	netdev, linux-rdma, linux-kernel, gi-oh.kim,
	santosh.shilimkar@oracle.com
In-Reply-To: <7d8f52e1-aa16-d20c-a9a8-35ad88c0b1ab@oracle.com>

On Thu 31-05-18 19:04:46, Qing Huang wrote:
> 
> 
> On 5/31/2018 2:10 AM, Michal Hocko wrote:
> > On Thu 31-05-18 10:55:32, Michal Hocko wrote:
> > > On Thu 31-05-18 04:35:31, Eric Dumazet wrote:
> > [...]
> > > > I merely copied/pasted from alloc_skb_with_frags() :/
> > > I will have a look at it. Thanks!
> > OK, so this is an example of an incremental development ;).
> > 
> > __GFP_NORETRY was added by ed98df3361f0 ("net: use __GFP_NORETRY for
> > high order allocations") to prevent from OOM killer. Yet this was
> > not enough because fb05e7a89f50 ("net: don't wait for order-3 page
> > allocation") didn't want an excessive reclaim for non-costly orders
> > so it made it completely NOWAIT while it preserved __GFP_NORETRY in
> > place which is now redundant. Should I send a patch?
> > 
> 
> Just curious, how about GFP_ATOMIC flag? Would it work in a similar fashion?
> We experimented
> with it a bit in the past but it seemed to cause other issue in our tests.
> :-)

GFP_ATOMIC is a non-sleeping (aka no reclaim) context with an access to
memory reserves. So the risk is that you deplete those reserves and
cause issues to other subsystems which need them as well.

> By the way, we didn't encounter any OOM killer events. It seemed that the
> mlx4_alloc_icm() triggered slowpath.
> We still had about 2GB free memory while it was highly fragmented.

The compaction was able to make a reasonable forward progress for you.
But considering mlx4_alloc_icm is called with GFP_KERNEL resp. GFP_HIGHUSER
then the OOM killer is clearly possible as long as the order is lower
than 4.

>  #0 [ffff8801f308b380] remove_migration_pte at ffffffff811f0e0b
>  #1 [ffff8801f308b3e0] rmap_walk_file at ffffffff811cb890
>  #2 [ffff8801f308b440] rmap_walk at ffffffff811cbaf2
>  #3 [ffff8801f308b450] remove_migration_ptes at ffffffff811f0db0
>  #4 [ffff8801f308b490] __unmap_and_move at ffffffff811f2ea6
>  #5 [ffff8801f308b4e0] unmap_and_move at ffffffff811f2fc5
>  #6 [ffff8801f308b540] migrate_pages at ffffffff811f3219
>  #7 [ffff8801f308b5c0] compact_zone at ffffffff811b707e
>  #8 [ffff8801f308b650] compact_zone_order at ffffffff811b735d
>  #9 [ffff8801f308b6e0] try_to_compact_pages at ffffffff811b7485
> #10 [ffff8801f308b770] __alloc_pages_direct_compact at ffffffff81195f96
> #11 [ffff8801f308b7b0] __alloc_pages_slowpath at ffffffff811978a1
> #12 [ffff8801f308b890] __alloc_pages_nodemask at ffffffff81197ec1
> #13 [ffff8801f308b970] alloc_pages_current at ffffffff811e261f
> #14 [ffff8801f308b9e0] mlx4_alloc_icm at ffffffffa01f39b2 [mlx4_core]
> 
> Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH] netfilter: nft_numgen: fix ptr_ret.cocci warnings
From: Pablo Neira Ayuso @ 2018-06-01  7:46 UTC (permalink / raw)
  To: Laura Garcia
  Cc: kbuild test robot, kbuild-all, Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	linux-kernel
In-Reply-To: <CAF90-WhsJTVcM+N_7n+4Zf_Knqzxc0UPv4i7bAM6Mvqu-GLHCw@mail.gmail.com>

On Thu, May 24, 2018 at 12:40:04PM +0200, Laura Garcia wrote:
> On Wed, May 23, 2018 at 12:58 PM, kbuild test robot
> <fengguang.wu@intel.com> wrote:
> > From: kbuild test robot <fengguang.wu@intel.com>
> >
> > net/netfilter/nft_numgen.c:117:1-3: WARNING: PTR_ERR_OR_ZERO can be used
> >
> >
> >  Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
> >
> > Generated by: scripts/coccinelle/api/ptr_ret.cocci
> >
> > Fixes: d734a2888922 ("netfilter: nft_numgen: add map lookups for numgen statements")
> > CC: Laura Garcia Liebana <nevola@gmail.com>
> > Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
> 
> Acked-by: Laura Garcia Liebana <nevola@gmail.com>

Applied, thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox