public inbox for linux-kernel@vger.kernel.org
* [PATCH 00/19] Lustre fixes
@ 2015-09-14 22:41 green
  2015-09-14 22:41 ` [PATCH 01/19] staging/lustre/lnet: Reenable lnet router debugfs green
                   ` (18 more replies)
  0 siblings, 19 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Oleg Drokin

From: Oleg Drokin <green@linuxhacker.ru>

This batch of changes collects various fixes that have accumulated
since the last time I had a chance to look at it.

The only "exception" is the last patch in the series, the
CPT-aware ptlrpcd patch. It is included because the existing
code in that area was strange and was flagged while we were
working on the cpumask code.

Please consider.

Andreas Dilger (1):
  staging/lustre/ptlrpc: remove LUSTRE_MSG_MAGIC_V1 support

Andrew Perepechko (1):
  staging/lustre/llite: ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed

Ann Koehler (1):
  staging/lustre/obdclass: Eliminate hash bucket scans in
    lu_cache_shrink

Ben Evans (1):
  staging/lustre: Remove unused MAY_ constants

Bruno Faccini (1):
  staging/lustre/llite: strengthen checks for hsm flags and archive id

Fan Yong (1):
  staging/lustre/llite: cleanup open handle for client open failure

Frank Zago (1):
  staging/lustre/obdclass: reorganize busy object accounting

Hiroya Nozaki (1):
  staging/lustre/osc: LBUG in osc_lru_reclaim

Isaac Huang (1):
  staging/lustre/o2iblnd: wrong uses of kib_tx_t::tx_nfrags

James Simmons (1):
  staging/lustre/libcfs: remove unused cfs_timer_done

Li Xi (1):
  staging/lustre/osc: use global osc_rq_pool to reduce memory usage

Liang Zhen (3):
  staging/lustre/o2iblnd: connection refcount fix for kiblnd_post_rx
  staging/lustre/lnet: fix deadloop in ksocknal_push
  staging/lustre/o2iblnd: leak cmid in kiblnd_dev_need_failover

Niu Yawei (2):
  staging/lustre/llite: deny non-root user for changelog operations
  staging/lustre/libcfs: minor fix in cfs_hash_for_each_relax()

Olaf Weber (1):
  staging/lustre/ptlrpc: make ptlrpcd threads cpt-aware

Oleg Drokin (2):
  staging/lustre/lnet: Reenable lnet router debugfs
  staging/lustre/lmv: fix potential null pointer dereference

 .../staging/lustre/include/linux/libcfs/libcfs.h   |   8 +
 .../lustre/include/linux/libcfs/libcfs_prim.h      |   1 -
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |   4 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c    |   9 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |   3 -
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c |  55 +-
 .../staging/lustre/lnet/klnds/socklnd/socklnd.c    |  51 +-
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   4 +-
 drivers/staging/lustre/lnet/lnet/router_proc.c     |  43 +-
 drivers/staging/lustre/lustre/include/lu_object.h  |   5 +-
 .../lustre/lustre/include/lustre/lustre_idl.h      |  27 +-
 .../staging/lustre/lustre/include/lustre_import.h  |   2 -
 drivers/staging/lustre/lustre/include/lustre_net.h |  63 +-
 drivers/staging/lustre/lustre/include/obd_class.h  |   4 -
 drivers/staging/lustre/lustre/ldlm/ldlm_request.c  |   8 +-
 drivers/staging/lustre/lustre/libcfs/hash.c        |   4 +
 .../lustre/lustre/libcfs/linux/linux-prim.c        |   6 -
 drivers/staging/lustre/lustre/libcfs/module.c      | 188 +++---
 drivers/staging/lustre/lustre/llite/dir.c          |   3 +
 drivers/staging/lustre/lustre/llite/file.c         |   9 +
 .../staging/lustre/lustre/llite/llite_internal.h   |  11 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c    |  48 +-
 drivers/staging/lustre/lustre/llite/llite_nfs.c    |   5 +-
 drivers/staging/lustre/lustre/llite/namei.c        |  14 +-
 drivers/staging/lustre/lustre/lmv/lmv_obd.c        |   8 +-
 drivers/staging/lustre/lustre/mdc/mdc_locks.c      |   2 +-
 drivers/staging/lustre/lustre/mdc/mdc_request.c    |   2 +-
 drivers/staging/lustre/lustre/obdclass/genops.c    |   1 -
 drivers/staging/lustre/lustre/obdclass/lu_object.c | 101 +--
 drivers/staging/lustre/lustre/osc/lproc_osc.c      |  17 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c      |  28 +-
 .../staging/lustre/lustre/osc/osc_cl_internal.h    |   2 +-
 drivers/staging/lustre/lustre/osc/osc_internal.h   |   6 +-
 drivers/staging/lustre/lustre/osc/osc_page.c       |   3 +-
 drivers/staging/lustre/lustre/osc/osc_request.c    | 120 ++--
 drivers/staging/lustre/lustre/ptlrpc/client.c      |  32 +-
 drivers/staging/lustre/lustre/ptlrpc/import.c      |   7 +-
 drivers/staging/lustre/lustre/ptlrpc/niobuf.c      |   3 +-
 .../staging/lustre/lustre/ptlrpc/pack_generic.c    |  96 +--
 drivers/staging/lustre/lustre/ptlrpc/pinger.c      |   2 +-
 .../staging/lustre/lustre/ptlrpc/ptlrpc_internal.h |   2 +-
 drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c     | 702 +++++++++++++--------
 drivers/staging/lustre/lustre/ptlrpc/wiretest.c    |   4 -
 43 files changed, 923 insertions(+), 790 deletions(-)

-- 
2.1.0



* [PATCH 01/19] staging/lustre/lnet: Reenable lnet router debugfs
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 02/19] staging/lustre/obdclass: reorganize busy object accounting green
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Oleg Drokin

From: Oleg Drokin <green@linuxhacker.ru>

It looks like the router proc files were compiled out (they sat under
#if defined(LNET_ROUTER)), so I missed them during the debugfs conversion.
Re-enable the code and move all the variables to debugfs.

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
---
 .../staging/lustre/include/linux/libcfs/libcfs.h   |   8 +
 .../staging/lustre/include/linux/lnet/lib-lnet.h   |   4 +-
 drivers/staging/lustre/lnet/lnet/api-ni.c          |   4 +-
 drivers/staging/lustre/lnet/lnet/router_proc.c     |  43 +----
 drivers/staging/lustre/lustre/libcfs/module.c      | 188 ++++++++++-----------
 5 files changed, 105 insertions(+), 142 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs.h b/drivers/staging/lustre/include/linux/libcfs/libcfs.h
index 01961d9..259a336 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs.h
@@ -161,4 +161,12 @@ extern struct cfs_psdev_ops libcfs_psdev_ops;
 
 extern struct cfs_wi_sched *cfs_sched_rehash;
 
+struct lnet_debugfs_symlink_def {
+	char *name;
+	char *target;
+};
+
+void lustre_insert_debugfs(struct ctl_table *table,
+			   const struct lnet_debugfs_symlink_def *symlinks);
+
 #endif /* _LIBCFS_H */
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index a9c9a07..22d54b2 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -443,8 +443,8 @@ int lnet_del_route(__u32 net, lnet_nid_t gw_nid);
 void lnet_destroy_routes(void);
 int lnet_get_route(int idx, __u32 *net, __u32 *hops,
 		   lnet_nid_t *gateway, __u32 *alive, __u32 *priority);
-void lnet_proc_init(void);
-void lnet_proc_fini(void);
+void lnet_router_debugfs_init(void);
+void lnet_router_debugfs_fini(void);
 int  lnet_rtrpools_alloc(int im_a_router);
 void lnet_rtrpools_free(void);
 lnet_remotenet_t *lnet_find_net_locked(__u32 net);
diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index d14fe70..7fab03b 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -1262,7 +1262,7 @@ LNetNIInit(lnet_pid_t requested_pid)
 	if (rc != 0)
 		goto failed4;
 
-	lnet_proc_init();
+	lnet_router_debugfs_init();
 	goto out;
 
  failed4:
@@ -1305,7 +1305,7 @@ LNetNIFini(void)
 	} else {
 		LASSERT(!the_lnet.ln_niinit_self);
 
-		lnet_proc_fini();
+		lnet_router_debugfs_fini();
 		lnet_router_checker_stop();
 		lnet_ping_target_fini();
 
diff --git a/drivers/staging/lustre/lnet/lnet/router_proc.c b/drivers/staging/lustre/lnet/lnet/router_proc.c
index 40f418b..a9f4cbf 100644
--- a/drivers/staging/lustre/lnet/lnet/router_proc.c
+++ b/drivers/staging/lustre/lnet/lnet/router_proc.c
@@ -25,13 +25,9 @@
 #include "../../include/linux/libcfs/libcfs.h"
 #include "../../include/linux/lnet/lib-lnet.h"
 
-#if  defined(LNET_ROUTER)
-
 /* This is really lnet_proc.c. You might need to update sanity test 215
  * if any file format is changed. */
 
-static struct ctl_table_header *lnet_table_header;
-
 #define LNET_LOFFT_BITS		(sizeof(loff_t) * 8)
 /*
  * NB: max allowed LNET_CPT_BITS is 8 on 64-bit system and 2 on 32-bit system
@@ -914,44 +910,11 @@ static struct ctl_table lnet_table[] = {
 	}
 };
 
-static struct ctl_table top_table[] = {
-	{
-		.procname = "lnet",
-		.mode     = 0555,
-		.data     = NULL,
-		.maxlen   = 0,
-		.child    = lnet_table,
-	},
-	{
-	}
-};
-
-void
-lnet_proc_init(void)
+void lnet_router_debugfs_init(void)
 {
-	if (lnet_table_header == NULL)
-		lnet_table_header = register_sysctl_table(top_table);
+	lustre_insert_debugfs(lnet_table, NULL);
 }
 
-void
-lnet_proc_fini(void)
+void lnet_router_debugfs_fini(void)
 {
-	if (lnet_table_header != NULL)
-		unregister_sysctl_table(lnet_table_header);
-
-	lnet_table_header = NULL;
 }
-
-#else
-
-void
-lnet_proc_init(void)
-{
-}
-
-void
-lnet_proc_fini(void)
-{
-}
-
-#endif
diff --git a/drivers/staging/lustre/lustre/libcfs/module.c b/drivers/staging/lustre/lustre/libcfs/module.c
index 806f974..a19f579 100644
--- a/drivers/staging/lustre/lustre/libcfs/module.c
+++ b/drivers/staging/lustre/lustre/libcfs/module.c
@@ -66,9 +66,6 @@ MODULE_AUTHOR("Peter J. Braam <braam@clusterfs.com>");
 MODULE_DESCRIPTION("Portals v3.1");
 MODULE_LICENSE("GPL");
 
-static void insert_debugfs(void);
-static void remove_debugfs(void);
-
 static struct dentry *lnet_debugfs_root;
 
 static void kportal_memhog_free(struct libcfs_device_userstate *ldu)
@@ -349,90 +346,6 @@ struct cfs_psdev_ops libcfs_psdev_ops = {
 	libcfs_ioctl
 };
 
-static int init_libcfs_module(void)
-{
-	int rc;
-
-	libcfs_arch_init();
-	libcfs_init_nidstrings();
-
-	rc = libcfs_debug_init(5 * 1024 * 1024);
-	if (rc < 0) {
-		pr_err("LustreError: libcfs_debug_init: %d\n", rc);
-		return rc;
-	}
-
-	rc = cfs_cpu_init();
-	if (rc != 0)
-		goto cleanup_debug;
-
-	rc = misc_register(&libcfs_dev);
-	if (rc) {
-		CERROR("misc_register: error %d\n", rc);
-		goto cleanup_cpu;
-	}
-
-	rc = cfs_wi_startup();
-	if (rc) {
-		CERROR("initialize workitem: error %d\n", rc);
-		goto cleanup_deregister;
-	}
-
-	/* max to 4 threads, should be enough for rehash */
-	rc = min(cfs_cpt_weight(cfs_cpt_table, CFS_CPT_ANY), 4);
-	rc = cfs_wi_sched_create("cfs_rh", cfs_cpt_table, CFS_CPT_ANY,
-				 rc, &cfs_sched_rehash);
-	if (rc != 0) {
-		CERROR("Startup workitem scheduler: error: %d\n", rc);
-		goto cleanup_deregister;
-	}
-
-	rc = cfs_crypto_register();
-	if (rc) {
-		CERROR("cfs_crypto_register: error %d\n", rc);
-		goto cleanup_wi;
-	}
-
-	insert_debugfs();
-
-	CDEBUG(D_OTHER, "portals setup OK\n");
-	return 0;
- cleanup_wi:
-	cfs_wi_shutdown();
- cleanup_deregister:
-	misc_deregister(&libcfs_dev);
-cleanup_cpu:
-	cfs_cpu_fini();
- cleanup_debug:
-	libcfs_debug_cleanup();
-	return rc;
-}
-
-static void exit_libcfs_module(void)
-{
-	int rc;
-
-	remove_debugfs();
-
-	if (cfs_sched_rehash != NULL) {
-		cfs_wi_sched_destroy(cfs_sched_rehash);
-		cfs_sched_rehash = NULL;
-	}
-
-	cfs_crypto_unregister();
-	cfs_wi_shutdown();
-
-	misc_deregister(&libcfs_dev);
-
-	cfs_cpu_fini();
-
-	rc = libcfs_debug_cleanup();
-	if (rc)
-		pr_err("LustreError: libcfs_debug_cleanup: %d\n", rc);
-
-	libcfs_arch_cleanup();
-}
-
 static int proc_call_handler(void *data, int write, loff_t *ppos,
 		void __user *buffer, size_t *lenp,
 		int (*handler)(void *data, int write,
@@ -700,11 +613,6 @@ static struct ctl_table lnet_table[] = {
 	}
 };
 
-struct lnet_debugfs_symlink_def {
-	char *name;
-	char *target;
-};
-
 static const struct lnet_debugfs_symlink_def lnet_debugfs_symlinks[] = {
 	{ "console_ratelimit",
 	  "/sys/module/libcfs/parameters/libcfs_console_ratelimit"},
@@ -756,11 +664,10 @@ static const struct file_operations lnet_debugfs_file_operations = {
 	.llseek		= default_llseek,
 };
 
-static void insert_debugfs(void)
+void lustre_insert_debugfs(struct ctl_table *table,
+			   const struct lnet_debugfs_symlink_def *symlinks)
 {
-	struct ctl_table *table;
 	struct dentry *entry;
-	const struct lnet_debugfs_symlink_def *symlinks;
 
 	if (lnet_debugfs_root == NULL)
 		lnet_debugfs_root = debugfs_create_dir("lnet", NULL);
@@ -769,19 +676,20 @@ static void insert_debugfs(void)
 	if (IS_ERR_OR_NULL(lnet_debugfs_root))
 		return;
 
-	for (table = lnet_table; table->procname; table++)
+	for (; table->procname; table++)
 		entry = debugfs_create_file(table->procname, table->mode,
 					    lnet_debugfs_root, table,
 					    &lnet_debugfs_file_operations);
 
-	for (symlinks = lnet_debugfs_symlinks; symlinks->name; symlinks++)
+	for (; symlinks && symlinks->name; symlinks++)
 		entry = debugfs_create_symlink(symlinks->name,
 					       lnet_debugfs_root,
 					       symlinks->target);
 
 }
+EXPORT_SYMBOL_GPL(lustre_insert_debugfs);
 
-static void remove_debugfs(void)
+static void lustre_remove_debugfs(void)
 {
 	if (lnet_debugfs_root != NULL)
 		debugfs_remove_recursive(lnet_debugfs_root);
@@ -789,6 +697,90 @@ static void remove_debugfs(void)
 	lnet_debugfs_root = NULL;
 }
 
+static int init_libcfs_module(void)
+{
+	int rc;
+
+	libcfs_arch_init();
+	libcfs_init_nidstrings();
+
+	rc = libcfs_debug_init(5 * 1024 * 1024);
+	if (rc < 0) {
+		pr_err("LustreError: libcfs_debug_init: %d\n", rc);
+		return rc;
+	}
+
+	rc = cfs_cpu_init();
+	if (rc != 0)
+		goto cleanup_debug;
+
+	rc = misc_register(&libcfs_dev);
+	if (rc) {
+		CERROR("misc_register: error %d\n", rc);
+		goto cleanup_cpu;
+	}
+
+	rc = cfs_wi_startup();
+	if (rc) {
+		CERROR("initialize workitem: error %d\n", rc);
+		goto cleanup_deregister;
+	}
+
+	/* max to 4 threads, should be enough for rehash */
+	rc = min(cfs_cpt_weight(cfs_cpt_table, CFS_CPT_ANY), 4);
+	rc = cfs_wi_sched_create("cfs_rh", cfs_cpt_table, CFS_CPT_ANY,
+				 rc, &cfs_sched_rehash);
+	if (rc != 0) {
+		CERROR("Startup workitem scheduler: error: %d\n", rc);
+		goto cleanup_deregister;
+	}
+
+	rc = cfs_crypto_register();
+	if (rc) {
+		CERROR("cfs_crypto_register: error %d\n", rc);
+		goto cleanup_wi;
+	}
+
+	lustre_insert_debugfs(lnet_table, lnet_debugfs_symlinks);
+
+	CDEBUG(D_OTHER, "portals setup OK\n");
+	return 0;
+ cleanup_wi:
+	cfs_wi_shutdown();
+ cleanup_deregister:
+	misc_deregister(&libcfs_dev);
+cleanup_cpu:
+	cfs_cpu_fini();
+ cleanup_debug:
+	libcfs_debug_cleanup();
+	return rc;
+}
+
+static void exit_libcfs_module(void)
+{
+	int rc;
+
+	lustre_remove_debugfs();
+
+	if (cfs_sched_rehash) {
+		cfs_wi_sched_destroy(cfs_sched_rehash);
+		cfs_sched_rehash = NULL;
+	}
+
+	cfs_crypto_unregister();
+	cfs_wi_shutdown();
+
+	misc_deregister(&libcfs_dev);
+
+	cfs_cpu_fini();
+
+	rc = libcfs_debug_cleanup();
+	if (rc)
+		pr_err("LustreError: libcfs_debug_cleanup: %d\n", rc);
+
+	libcfs_arch_cleanup();
+}
+
 MODULE_VERSION("1.0.0");
 
 module_init(init_libcfs_module);
-- 
2.1.0
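
For readers following patch 01/19: the reworked lustre_insert_debugfs()
walks a table terminated by an entry with a NULL procname, then an
optional symlink list that may be NULL (which is what the lnet router
caller passes). Below is a minimal userspace sketch of that iteration
pattern. The demo_* names and the table contents are hypothetical
stand-ins, not the kernel structures, and a counter takes the place of
the debugfs_create_file() and debugfs_create_symlink() calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct ctl_table and
 * struct lnet_debugfs_symlink_def; only the fields used here. */
struct demo_entry {
	const char *procname;
};

struct demo_symlink {
	const char *name;
	const char *target;
};

/* Count the entries that would be registered.  The table must end with
 * an entry whose procname is NULL; the symlink list may be NULL. */
static int demo_insert(const struct demo_entry *table,
		       const struct demo_symlink *symlinks)
{
	int created = 0;

	assert(table != NULL);

	for (; table->procname; table++)
		created++;	/* debugfs_create_file() in the patch */

	/* The "symlinks &&" guard is what makes a NULL list legal. */
	for (; symlinks && symlinks->name; symlinks++)
		created++;	/* debugfs_create_symlink() in the patch */

	return created;
}

/* Placeholder tables, each closed by a sentinel entry. */
static const struct demo_entry demo_table[] = {
	{ "entry_a" }, { "entry_b" }, { NULL }
};

static const struct demo_symlink demo_links[] = {
	{ "console_ratelimit",
	  "/sys/module/libcfs/parameters/libcfs_console_ratelimit" },
	{ NULL, NULL }
};
```

Calling demo_insert(demo_table, NULL) mirrors the router case, while
demo_insert(demo_table, demo_links) mirrors the libcfs module init case.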



* [PATCH 02/19] staging/lustre/obdclass: reorganize busy object accounting
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
  2015-09-14 22:41 ` [PATCH 01/19] staging/lustre/lnet: Reenable lnet router debugfs green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 03/19] staging/lustre/llite: cleanup open handle for client open failure green
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Frank Zago, Oleg Drokin

From: Frank Zago <fzago@cray.com>

Due to an accounting bug, lsb_busy of a hash bucket can become
larger than the total number of objects in said bucket. A busy object
can be counted more than once. When that happens, a negative value is
returned by the shrinker callback.

Instead of trying (and failing) to count the busy objects, count the
objects that are not busy, i.e. the objects that are present on the
lsb_lru list. The number of busy objects is then the difference
between the number of objects in the hash and the number of objects on
the lsb_lru list.

Signed-off-by: frank zago <fzago@cray.com>
Reviewed-on: http://review.whamcloud.com/12468
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5722
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Mike Pershin <mike.pershin@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lustre/include/lu_object.h  |  4 +--
 drivers/staging/lustre/lustre/obdclass/lu_object.c | 35 +++++++++-------------
 2 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lu_object.h b/drivers/staging/lustre/lustre/include/lu_object.h
index a16c9ea..ea13a82 100644
--- a/drivers/staging/lustre/lustre/include/lu_object.h
+++ b/drivers/staging/lustre/lustre/include/lu_object.h
@@ -554,9 +554,9 @@ struct fld;
 
 struct lu_site_bkt_data {
 	/**
-	 * number of busy object on this bucket
+	 * number of object in this bucket on the lsb_lru list.
 	 */
-	long		      lsb_busy;
+	long			lsb_lru_len;
 	/**
 	 * LRU list, updated on each access to object. Protected by
 	 * bucket lock of lu_site::ls_obj_hash.
diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index 3111982..4f7899f 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -113,8 +113,6 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 		return;
 	}
 
-	LASSERT(bkt->lsb_busy > 0);
-	bkt->lsb_busy--;
 	/*
 	 * When last reference is released, iterate over object
 	 * layers, and notify them that object is no longer busy.
@@ -127,6 +125,7 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 	if (!lu_object_is_dying(top)) {
 		LASSERT(list_empty(&top->loh_lru));
 		list_add_tail(&top->loh_lru, &bkt->lsb_lru);
+		bkt->lsb_lru_len++;
 		cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
 		return;
 	}
@@ -179,7 +178,13 @@ void lu_object_unhash(const struct lu_env *env, struct lu_object *o)
 		struct cfs_hash_bd bd;
 
 		cfs_hash_bd_get_and_lock(obj_hash, &top->loh_fid, &bd, 1);
+		if (!list_empty(&top->loh_lru)) {
+			struct lu_site_bkt_data *bkt;
+
 		list_del_init(&top->loh_lru);
+			bkt = cfs_hash_bd_extra_get(obj_hash, &bd);
+			bkt->lsb_lru_len--;
+		}
 		cfs_hash_bd_del_locked(obj_hash, &bd, &top->loh_hash);
 		cfs_hash_bd_unlock(obj_hash, &bd, 1);
 	}
@@ -349,6 +354,7 @@ int lu_site_purge(const struct lu_env *env, struct lu_site *s, int nr)
 			cfs_hash_bd_del_locked(s->ls_obj_hash,
 					       &bd2, &h->loh_hash);
 			list_move(&h->loh_lru, &dispose);
+			bkt->lsb_lru_len--;
 			if (did_sth == 0)
 				did_sth = 1;
 
@@ -561,7 +567,10 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 	if (likely(!lu_object_is_dying(h))) {
 		cfs_hash_get(s->ls_obj_hash, hnode);
 		lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_HIT);
+		if (!list_empty(&h->loh_lru)) {
 		list_del_init(&h->loh_lru);
+			bkt->lsb_lru_len--;
+		}
 		return lu_object_top(h);
 	}
 
@@ -599,7 +608,6 @@ static struct lu_object *lu_object_new(const struct lu_env *env,
 	struct lu_object	*o;
 	struct cfs_hash	      *hs;
 	struct cfs_hash_bd	    bd;
-	struct lu_site_bkt_data *bkt;
 
 	o = lu_object_alloc(env, dev, f, conf);
 	if (IS_ERR(o))
@@ -607,9 +615,7 @@ static struct lu_object *lu_object_new(const struct lu_env *env,
 
 	hs = dev->ld_site->ls_obj_hash;
 	cfs_hash_bd_get_and_lock(hs, (void *)f, &bd, 1);
-	bkt = cfs_hash_bd_extra_get(hs, &bd);
 	cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
-	bkt->lsb_busy++;
 	cfs_hash_bd_unlock(hs, &bd, 1);
 	return o;
 }
@@ -675,11 +681,7 @@ static struct lu_object *lu_object_find_try(const struct lu_env *env,
 
 	shadow = htable_lookup(s, &bd, f, waiter, &version);
 	if (likely(PTR_ERR(shadow) == -ENOENT)) {
-		struct lu_site_bkt_data *bkt;
-
-		bkt = cfs_hash_bd_extra_get(hs, &bd);
 		cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
-		bkt->lsb_busy++;
 		cfs_hash_bd_unlock(hs, &bd, 1);
 		return o;
 	}
@@ -926,14 +928,7 @@ static void lu_obj_hop_get(struct cfs_hash *hs, struct hlist_node *hnode)
 	struct lu_object_header *h;
 
 	h = hlist_entry(hnode, struct lu_object_header, loh_hash);
-	if (atomic_add_return(1, &h->loh_ref) == 1) {
-		struct lu_site_bkt_data *bkt;
-		struct cfs_hash_bd	    bd;
-
-		cfs_hash_bd_get(hs, &h->loh_fid, &bd);
-		bkt = cfs_hash_bd_extra_get(hs, &bd);
-		bkt->lsb_busy++;
-	}
+	atomic_inc(&h->loh_ref);
 }
 
 static void lu_obj_hop_put_locked(struct cfs_hash *hs, struct hlist_node *hnode)
@@ -1802,7 +1797,8 @@ static void lu_site_stats_get(struct cfs_hash *hs,
 		struct hlist_head	*hhead;
 
 		cfs_hash_bd_lock(hs, &bd, 1);
-		stats->lss_busy  += bkt->lsb_busy;
+		stats->lss_busy  +=
+			cfs_hash_bd_count_get(&bd) - bkt->lsb_lru_len;
 		stats->lss_total += cfs_hash_bd_count_get(&bd);
 		stats->lss_max_search = max((int)stats->lss_max_search,
 					    cfs_hash_bd_depmax_get(&bd));
@@ -2067,7 +2063,6 @@ void lu_object_assign_fid(const struct lu_env *env, struct lu_object *o,
 {
 	struct lu_site		*s = o->lo_dev->ld_site;
 	struct lu_fid		*old = &o->lo_header->loh_fid;
-	struct lu_site_bkt_data	*bkt;
 	struct lu_object	*shadow;
 	wait_queue_t		 waiter;
 	struct cfs_hash		*hs;
@@ -2082,9 +2077,7 @@ void lu_object_assign_fid(const struct lu_env *env, struct lu_object *o,
 	/* supposed to be unique */
 	LASSERT(IS_ERR(shadow) && PTR_ERR(shadow) == -ENOENT);
 	*old = *fid;
-	bkt = cfs_hash_bd_extra_get(hs, &bd);
 	cfs_hash_bd_add_locked(hs, &bd, &o->lo_header->loh_hash);
-	bkt->lsb_busy++;
 	cfs_hash_bd_unlock(hs, &bd, 1);
 }
 EXPORT_SYMBOL(lu_object_assign_fid);
-- 
2.1.0
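
As an illustration of the accounting scheme in patch 02/19: rather than
keeping a busy counter, the bucket tracks how many objects sit on its
LRU list, and the busy count is derived as total minus LRU length. The
following is a simplified userspace sketch under that reading. The
demo_* names are hypothetical, and the real code performs these updates
under the hash bucket lock, which is omitted here.

```c
#include <assert.h>

/* Hypothetical, simplified bucket: only the two counters that matter. */
struct demo_bucket {
	long total;	/* objects hashed in this bucket */
	long lru_len;	/* idle objects, i.e. those on the LRU list */
};

/* A new object is hashed while still referenced, so it is not on the LRU. */
static void demo_add(struct demo_bucket *b)
{
	b->total++;
}

/* The last reference was dropped: the object moves onto the LRU list. */
static void demo_put_last_ref(struct demo_bucket *b)
{
	assert(b->lru_len < b->total);
	b->lru_len++;
}

/* A cache hit takes the object off the LRU only if it was on it. */
static void demo_lookup_hit(struct demo_bucket *b, int on_lru)
{
	if (on_lru)
		b->lru_len--;
}

/* Purging removes an idle object from both the hash and the LRU. */
static void demo_purge_one(struct demo_bucket *b)
{
	assert(b->lru_len > 0);
	b->total--;
	b->lru_len--;
}

/* Derived value: it cannot exceed total as long as lru_len >= 0. */
static long demo_busy(const struct demo_bucket *b)
{
	return b->total - b->lru_len;
}
```

Because the busy count is computed rather than maintained, a repeated
lookup of an already busy object leaves it unchanged, which is the
double counting the patch description refers to.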



* [PATCH 03/19] staging/lustre/llite: cleanup open handle for client open failure
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
  2015-09-14 22:41 ` [PATCH 01/19] staging/lustre/lnet: Reenable lnet router debugfs green
  2015-09-14 22:41 ` [PATCH 02/19] staging/lustre/obdclass: reorganize busy object accounting green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 04/19] staging/lustre/llite: strengthen checks for hsm flags and archive id green
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Fan Yong, Oleg Drokin

From: Fan Yong <fan.yong@intel.com>

In the open case, the client-side open handling thread may hit an error
after the MDT has granted the open. In that case, the client should
send a close RPC to the MDT as cleanup; otherwise, the open handle
on the MDT is leaked until the client unmounts or is evicted.

If LFSCK sets LU_OBJECT_HEARD_BANSHEE on an MDT-object that is opened
by others, in order to repair some inconsistency (such as a
multiple-referenced OST-object), then the leaked open handle, which
still references the MDT-object, will block subsequent threads
that want to locate that object via FID.

Signed-off-by: Fan Yong <fan.yong@intel.com>
Reviewed-on: http://review.whamcloud.com/13709
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6301
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 .../staging/lustre/lustre/llite/llite_internal.h   |  1 +
 drivers/staging/lustre/lustre/llite/llite_lib.c    | 48 +++++++++++++++++++++-
 drivers/staging/lustre/lustre/llite/namei.c        | 14 +++++--
 3 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 2de64c2..8a3b03e 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -801,6 +801,7 @@ struct md_op_data *ll_prep_md_op_data(struct md_op_data *op_data,
 void ll_finish_md_op_data(struct md_op_data *op_data);
 int ll_get_obd_name(struct inode *inode, unsigned int cmd, unsigned long arg);
 char *ll_get_fsname(struct super_block *sb, char *buf, int buflen);
+void ll_open_cleanup(struct super_block *sb, struct ptlrpc_request *open_req);
 
 /* llite/llite_nfs.c */
 extern struct export_operations lustre_export_operations;
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index c60eb46e..725481d 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -1973,6 +1973,47 @@ int ll_remount_fs(struct super_block *sb, int *flags, char *data)
 	return 0;
 }
 
+/**
+ * Cleanup the open handle that is cached on MDT-side.
+ *
+ * For open case, the client side open handling thread may hit error
+ * after the MDT grant the open. Under such case, the client should
+ * send close RPC to the MDT as cleanup; otherwise, the open handle
+ * on the MDT will be leaked there until the client umount or evicted.
+ *
+ * In further, if someone unlinked the file, because the open handle
+ * holds the reference on such file/object, then it will block the
+ * subsequent threads that want to locate such object via FID.
+ *
+ * \param[in] sb	super block for this file-system
+ * \param[in] open_req	pointer to the original open request
+ */
+void ll_open_cleanup(struct super_block *sb, struct ptlrpc_request *open_req)
+{
+	struct mdt_body			*body;
+	struct md_op_data		*op_data;
+	struct ptlrpc_request		*close_req = NULL;
+	struct obd_export		*exp	   = ll_s2sbi(sb)->ll_md_exp;
+
+	body = req_capsule_server_get(&open_req->rq_pill, &RMF_MDT_BODY);
+	OBD_ALLOC_PTR(op_data);
+	if (!op_data) {
+		CWARN("%s: cannot allocate op_data to release open handle for "
+		      DFID "\n",
+		      ll_get_fsname(sb, NULL, 0), PFID(&body->fid1));
+
+		return;
+	}
+
+	op_data->op_fid1 = body->fid1;
+	op_data->op_ioepoch = body->ioepoch;
+	op_data->op_handle = body->handle;
+	op_data->op_mod_time = get_seconds();
+	md_close(exp, op_data, NULL, &close_req);
+	ptlrpc_req_finished(close_req);
+	ll_finish_md_op_data(op_data);
+}
+
 int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 		  struct super_block *sb, struct lookup_intent *it)
 {
@@ -1985,7 +2026,7 @@ int ll_prep_inode(struct inode **inode, struct ptlrpc_request *req,
 	rc = md_get_lustre_md(sbi->ll_md_exp, req, sbi->ll_dt_exp,
 			      sbi->ll_md_exp, &md);
 	if (rc)
-		return rc;
+		goto cleanup;
 
 	if (*inode) {
 		ll_update_inode(*inode, &md);
@@ -2047,6 +2088,11 @@ out:
 	if (md.lsm != NULL)
 		obd_free_memmd(sbi->ll_dt_exp, &md.lsm);
 	md_free_lustre_md(sbi->ll_md_exp, &md);
+
+cleanup:
+	if (rc != 0 && it && it->it_op & IT_OPEN)
+		ll_open_cleanup(sb ? sb : (*inode)->i_sb, req);
+
 	return rc;
 }
 
diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
index 05e7dc8..2635678 100644
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -409,7 +409,7 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 {
 	struct inode *inode = NULL;
 	__u64 bits = 0;
-	int rc;
+	int rc = 0;
 
 	/* NB 1 request reference will be taken away by ll_intent_lock()
 	 * when I return */
@@ -439,8 +439,10 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 		struct dentry *alias;
 
 		alias = ll_splice_alias(inode, *de);
-		if (IS_ERR(alias))
-			return PTR_ERR(alias);
+		if (IS_ERR(alias)) {
+			rc = PTR_ERR(alias);
+			goto out;
+		}
 		*de = alias;
 	} else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
 		   !it_disposition(it, DISP_OPEN_CREATE)) {
@@ -471,7 +473,11 @@ static int ll_lookup_it_finish(struct ptlrpc_request *request,
 		}
 	}
 
-	return 0;
+out:
+	if (rc != 0 && it->it_op & IT_OPEN)
+		ll_open_cleanup((*de)->d_sb, request);
+
+	return rc;
 }
 
 static struct dentry *ll_lookup_it(struct inode *parent, struct dentry *dentry,
-- 
2.1.0
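
A rough sketch of the control flow that patch 03/19 introduces: every
failure path after the server has granted the open funnels through one
exit label, where the handle is released if the intent was an open.
This is a userspace model only. The demo_* names are hypothetical, a
counter stands in for the MDT's open handles, and decrementing it
stands in for the close RPC sent by ll_open_cleanup().

```c
#include <assert.h>
#include <errno.h>

#define DEMO_IT_OPEN 0x1	/* hypothetical stand-in for IT_OPEN */

/* Hypothetical model of the server side: just a handle counter. */
struct demo_mdt {
	int open_handles;
};

/* The server grants the open before the client finishes its own work. */
static void demo_grant_open(struct demo_mdt *mdt)
{
	mdt->open_handles++;
}

/* Stands in for sending the close RPC. */
static void demo_open_cleanup(struct demo_mdt *mdt)
{
	assert(mdt->open_handles > 0);
	mdt->open_handles--;
}

/* step_rc simulates the result of the client-side processing that
 * runs after the grant (0 on success, negative errno on failure). */
static int demo_finish_open(struct demo_mdt *mdt, int it_op, int step_rc)
{
	int rc;

	rc = step_rc;
	if (rc != 0)
		goto out;

	/* further client-side setup would go here */
out:
	/* Single exit: release the handle only for failed opens. */
	if (rc != 0 && (it_op & DEMO_IT_OPEN))
		demo_open_cleanup(mdt);

	return rc;
}
```

The single exit label is the design point: an early "return rc" on any
error path would skip the cleanup and leave the handle behind.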



* [PATCH 04/19] staging/lustre/llite: strengthen checks for hsm flags and archive id
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (2 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 03/19] staging/lustre/llite: cleanup open handle for client open failure green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 05/19] staging/lustre/ptlrpc: remove LUSTRE_MSG_MAGIC_V1 support green
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Bruno Faccini, Oleg Drokin

From: Bruno Faccini <bruno.faccini@intel.com>

Prior to this patch, undefined flag bits and an out-of-range
archive ID could be set.

Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Reviewed-on: http://review.whamcloud.com/13337
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5757
Reviewed-by: frank zago <fzago@cray.com>
Reviewed-by: Henri Doreau <henri.doreau@cea.fr>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lustre/include/lustre/lustre_idl.h | 7 +++++++
 drivers/staging/lustre/lustre/llite/file.c                | 9 +++++++++
 2 files changed, 16 insertions(+)

diff --git a/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h b/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h
index e79af19..9416d95 100644
--- a/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h
+++ b/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h
@@ -365,6 +365,13 @@ static inline __u64 fid_ver_oid(const struct lu_fid *fid)
 	return ((__u64)fid_ver(fid) << 32 | fid_oid(fid));
 }
 
+/* copytool uses a 32b bitmask field to encode archive-Ids during register
+ * with MDT thru kuc.
+ * archive num = 0 => all
+ * archive num from 1 to 32
+ */
+#define LL_HSM_MAX_ARCHIVE (sizeof(__u32) * 8)
+
 /**
  * Note that reserved SEQ numbers below 12 will conflict with ldiskfs
  * inodes in the IGIF namespace, so these reserved SEQ numbers can be
diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index e332326..b610032 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -2118,12 +2118,21 @@ static int ll_hsm_state_set(struct inode *inode, struct hsm_state_set *hss)
 	struct md_op_data	*op_data;
 	int			 rc;
 
+	/* Detect out-of range masks */
+	if ((hss->hss_setmask | hss->hss_clearmask) & ~HSM_FLAGS_MASK)
+		return -EINVAL;
+
 	/* Non-root users are forbidden to set or clear flags which are
 	 * NOT defined in HSM_USER_MASK. */
 	if (((hss->hss_setmask | hss->hss_clearmask) & ~HSM_USER_MASK) &&
 	    !capable(CFS_CAP_SYS_ADMIN))
 		return -EPERM;
 
+	/* Detect out-of range archive id */
+	if ((hss->hss_valid & HSS_ARCHIVE_ID) &&
+	    (hss->hss_archive_id > LL_HSM_MAX_ARCHIVE))
+		return -EINVAL;
+
 	op_data = ll_prep_md_op_data(NULL, inode, NULL, NULL, 0, 0,
 				     LUSTRE_OPC_ANY, hss);
 	if (IS_ERR(op_data))
-- 
2.1.0
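
The order of the checks in patch 04/19 can be sketched as follows:
undefined bits are rejected first with -EINVAL, then bits reserved for
administrators with -EPERM, and finally an archive ID above the 32-bit
bitmask width with -EINVAL. This is a userspace sketch with made-up
values. The demo_* names and the two mask constants are assumptions
for illustration and do not match the real HSM_FLAGS_MASK or
HSM_USER_MASK; is_admin replaces the capable() check.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mask values, 64 bits wide so that "~mask" covers all
 * high bits. */
#define DEMO_FLAGS_MASK		((uint64_t)0x00ff)	/* all defined flags */
#define DEMO_USER_MASK		((uint64_t)0x000f)	/* non-root subset */
#define DEMO_VALID_ARCHIVE_ID	0x1u
/* Same expression as LL_HSM_MAX_ARCHIVE in the patch. */
#define DEMO_MAX_ARCHIVE	(sizeof(uint32_t) * 8)

static_assert(DEMO_MAX_ARCHIVE == 32, "archive bitmask has 32 bits");

struct demo_hss {
	uint32_t valid;
	uint32_t archive_id;
	uint64_t setmask;
	uint64_t clearmask;
};

static int demo_check(const struct demo_hss *hss, int is_admin)
{
	uint64_t touched = hss->setmask | hss->clearmask;

	/* Out-of-range masks are invalid for every caller. */
	if (touched & ~DEMO_FLAGS_MASK)
		return -EINVAL;

	/* Defined but privileged bits need an administrator. */
	if ((touched & ~DEMO_USER_MASK) && !is_admin)
		return -EPERM;

	/* Archive IDs run from 1 to 32; 0 means "all". */
	if ((hss->valid & DEMO_VALID_ARCHIVE_ID) &&
	    hss->archive_id > DEMO_MAX_ARCHIVE)
		return -EINVAL;

	return 0;
}
```

Checking for undefined bits before the privilege test means an
administrator also gets -EINVAL for bits that have no meaning.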



* [PATCH 05/19] staging/lustre/ptlrpc: remove LUSTRE_MSG_MAGIC_V1 support
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (3 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 04/19] staging/lustre/llite: strengthen checks for hsm flags and archive id green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 06/19] staging/lustre/lmv: fix potential null pointer dereference green
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Oleg Drokin

From: Andreas Dilger <andreas.dilger@intel.com>

Remove the remains of LUSTRE_MSG_MAGIC_V1 support from ptlrpc.
It has not been supported since 1.8 and has not been functional since 2.0.

In lustre_msg_check_version(), return an error for unsupported RPC
versions so that the server will reject such RPCs early.  Otherwise
the server only prints an error message and continues on.

Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-on: http://review.whamcloud.com/14007
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6349
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 .../lustre/lustre/include/lustre/lustre_idl.h      |  3 -
 drivers/staging/lustre/lustre/include/lustre_net.h |  2 -
 drivers/staging/lustre/lustre/ptlrpc/niobuf.c      |  3 +-
 .../staging/lustre/lustre/ptlrpc/pack_generic.c    | 96 ++++++----------------
 drivers/staging/lustre/lustre/ptlrpc/wiretest.c    |  4 -
 5 files changed, 24 insertions(+), 84 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h b/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h
index 9416d95..b0c4433 100644
--- a/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h
+++ b/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h
@@ -154,10 +154,7 @@
 #define PTL_RPC_MSG_REPLY   4713
 
 /* DON'T use swabbed values of MAGIC as magic! */
-#define LUSTRE_MSG_MAGIC_V1 0x0BD00BD0
 #define LUSTRE_MSG_MAGIC_V2 0x0BD00BD3
-
-#define LUSTRE_MSG_MAGIC_V1_SWABBED 0xD00BD00B
 #define LUSTRE_MSG_MAGIC_V2_SWABBED 0xD30BD00B
 
 #define LUSTRE_MSG_MAGIC LUSTRE_MSG_MAGIC_V2
diff --git a/drivers/staging/lustre/lustre/include/lustre_net.h b/drivers/staging/lustre/lustre/include/lustre_net.h
index 3341b5d..5df493e 100644
--- a/drivers/staging/lustre/lustre/include/lustre_net.h
+++ b/drivers/staging/lustre/lustre/include/lustre_net.h
@@ -2610,7 +2610,6 @@ void lustre_msg_set_flags(struct lustre_msg *msg, int flags);
 void lustre_msg_clear_flags(struct lustre_msg *msg, int flags);
 __u32 lustre_msg_get_op_flags(struct lustre_msg *msg);
 void lustre_msg_add_op_flags(struct lustre_msg *msg, int flags);
-void lustre_msg_set_op_flags(struct lustre_msg *msg, int flags);
 struct lustre_handle *lustre_msg_get_handle(struct lustre_msg *msg);
 __u32 lustre_msg_get_type(struct lustre_msg *msg);
 __u32 lustre_msg_get_version(struct lustre_msg *msg);
@@ -2626,7 +2625,6 @@ void lustre_msg_set_slv(struct lustre_msg *msg, __u64 slv);
 void lustre_msg_set_limit(struct lustre_msg *msg, __u64 limit);
 int lustre_msg_get_status(struct lustre_msg *msg);
 __u32 lustre_msg_get_conn_cnt(struct lustre_msg *msg);
-int lustre_msg_is_v1(struct lustre_msg *msg);
 __u32 lustre_msg_get_magic(struct lustre_msg *msg);
 __u32 lustre_msg_get_timeout(struct lustre_msg *msg);
 __u32 lustre_msg_get_service_time(struct lustre_msg *msg);
diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
index 92c746b..22194c0 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
@@ -337,9 +337,8 @@ static void ptlrpc_at_set_reply(struct ptlrpc_request *req, int flags)
 
 	if (req->rq_reqmsg &&
 	    !(lustre_msghdr_get_flags(req->rq_reqmsg) & MSGHDR_AT_SUPPORT)) {
-		CDEBUG(D_ADAPTTO, "No early reply support: flags=%#x req_flags=%#x magic=%d:%x/%x len=%d\n",
+		CDEBUG(D_ADAPTTO, "No early reply support: flags=%#x req_flags=%#x magic=%x/%x len=%d\n",
 		       flags, lustre_msg_get_flags(req->rq_reqmsg),
-		       lustre_msg_is_v1(req->rq_reqmsg),
 		       lustre_msg_get_magic(req->rq_reqmsg),
 		       lustre_msg_get_magic(req->rq_repmsg), req->rq_replen);
 	}
diff --git a/drivers/staging/lustre/lustre/ptlrpc/pack_generic.c b/drivers/staging/lustre/lustre/ptlrpc/pack_generic.c
index e9f8aa0..f138061 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/pack_generic.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/pack_generic.c
@@ -103,6 +103,7 @@ static inline int lustre_msg_check_version_v2(struct lustre_msg_v2 *msg,
 
 int lustre_msg_check_version(struct lustre_msg *msg, __u32 version)
 {
+#define LUSTRE_MSG_MAGIC_V1 0x0BD00BD0
 	switch (msg->lm_magic) {
 	case LUSTRE_MSG_MAGIC_V1:
 		CERROR("msg v1 not supported - please upgrade you system\n");
@@ -113,6 +114,7 @@ int lustre_msg_check_version(struct lustre_msg *msg, __u32 version)
 		CERROR("incorrect message magic: %08x\n", msg->lm_magic);
 		return 0;
 	}
+#undef LUSTRE_MSG_MAGIC_V1
 }
 EXPORT_SYMBOL(lustre_msg_check_version);
 
@@ -433,7 +435,8 @@ void *lustre_msg_buf(struct lustre_msg *m, int n, int min_size)
 	case LUSTRE_MSG_MAGIC_V2:
 		return lustre_msg_buf_v2(m, n, min_size);
 	default:
-		LASSERTF(0, "incorrect message magic: %08x(msg:%p)\n", m->lm_magic, m);
+		LASSERTF(0, "incorrect message magic: %08x (msg:%p)\n",
+			 m->lm_magic, m);
 		return NULL;
 	}
 }
@@ -802,14 +805,11 @@ static inline struct ptlrpc_body *lustre_msg_ptlrpc_body(struct lustre_msg *msg)
 __u32 lustre_msghdr_get_flags(struct lustre_msg *msg)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-	case LUSTRE_MSG_MAGIC_V1_SWABBED:
-		return 0;
 	case LUSTRE_MSG_MAGIC_V2:
 		/* already in host endian */
 		return msg->lm_flags;
 	default:
-		LASSERTF(0, "incorrect message magic: %08x\n", msg->lm_magic);
+		CERROR("incorrect message magic: %08x\n", msg->lm_magic);
 		return 0;
 	}
 }
@@ -818,8 +818,6 @@ EXPORT_SYMBOL(lustre_msghdr_get_flags);
 void lustre_msghdr_set_flags(struct lustre_msg *msg, __u32 flags)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-		return;
 	case LUSTRE_MSG_MAGIC_V2:
 		msg->lm_flags = flags;
 		return;
@@ -833,12 +831,12 @@ __u32 lustre_msg_get_flags(struct lustre_msg *msg)
 	switch (msg->lm_magic) {
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
-		if (!pb) {
-			CERROR("invalid msg %p: no ptlrpc body!\n", msg);
-			return 0;
-		}
-		return pb->pb_flags;
+		if (pb)
+			return pb->pb_flags;
+
+		CERROR("invalid msg %p: no ptlrpc body!\n", msg);
 	}
+	/* no break */
 	default:
 		/* flags might be printed in debug code while message
 		 * uninitialized */
@@ -897,12 +895,12 @@ __u32 lustre_msg_get_op_flags(struct lustre_msg *msg)
 	switch (msg->lm_magic) {
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
-		if (!pb) {
-			CERROR("invalid msg %p: no ptlrpc body!\n", msg);
-			return 0;
-		}
-		return pb->pb_op_flags;
+		if (pb)
+			return pb->pb_op_flags;
+
+		CERROR("invalid msg %p: no ptlrpc body!\n", msg);
 	}
+	/* no break */
 	default:
 		return 0;
 	}
@@ -924,21 +922,6 @@ void lustre_msg_add_op_flags(struct lustre_msg *msg, int flags)
 }
 EXPORT_SYMBOL(lustre_msg_add_op_flags);
 
-void lustre_msg_set_op_flags(struct lustre_msg *msg, int flags)
-{
-	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V2: {
-		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
-		LASSERTF(pb, "invalid msg %p: no ptlrpc body!\n", msg);
-		pb->pb_op_flags |= flags;
-		return;
-	}
-	default:
-		LASSERTF(0, "incorrect message magic: %08x\n", msg->lm_magic);
-	}
-}
-EXPORT_SYMBOL(lustre_msg_set_op_flags);
-
 struct lustre_handle *lustre_msg_get_handle(struct lustre_msg *msg)
 {
 	switch (msg->lm_magic) {
@@ -1020,8 +1003,8 @@ __u32 lustre_msg_get_opc(struct lustre_msg *msg)
 		return pb->pb_opc;
 	}
 	default:
-		CERROR("incorrect message magic: %08x(msg:%p)\n", msg->lm_magic, msg);
-		LBUG();
+		CERROR("incorrect message magic: %08x (msg:%p)\n",
+		       msg->lm_magic, msg);
 		return 0;
 	}
 }
@@ -1066,8 +1049,6 @@ EXPORT_SYMBOL(lustre_msg_get_last_committed);
 __u64 *lustre_msg_get_versions(struct lustre_msg *msg)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-		return NULL;
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
 		if (!pb) {
@@ -1106,12 +1087,12 @@ int lustre_msg_get_status(struct lustre_msg *msg)
 	switch (msg->lm_magic) {
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
-		if (!pb) {
-			CERROR("invalid msg %p: no ptlrpc body!\n", msg);
-			return -EINVAL;
-		}
-		return pb->pb_status;
+		if (pb)
+			return pb->pb_status;
+
+		CERROR("invalid msg %p: no ptlrpc body!\n", msg);
 	}
+	/* no break */
 	default:
 		/* status might be printed in debug code while message
 		 * uninitialized */
@@ -1214,18 +1195,6 @@ __u32 lustre_msg_get_conn_cnt(struct lustre_msg *msg)
 }
 EXPORT_SYMBOL(lustre_msg_get_conn_cnt);
 
-int lustre_msg_is_v1(struct lustre_msg *msg)
-{
-	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-	case LUSTRE_MSG_MAGIC_V1_SWABBED:
-		return 1;
-	default:
-		return 0;
-	}
-}
-EXPORT_SYMBOL(lustre_msg_is_v1);
-
 __u32 lustre_msg_get_magic(struct lustre_msg *msg)
 {
 	switch (msg->lm_magic) {
@@ -1241,9 +1210,6 @@ EXPORT_SYMBOL(lustre_msg_get_magic);
 __u32 lustre_msg_get_timeout(struct lustre_msg *msg)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-	case LUSTRE_MSG_MAGIC_V1_SWABBED:
-		return 0;
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
 		if (!pb) {
@@ -1255,16 +1221,13 @@ __u32 lustre_msg_get_timeout(struct lustre_msg *msg)
 	}
 	default:
 		CERROR("incorrect message magic: %08x\n", msg->lm_magic);
-		return 0;
+		return -EPROTO;
 	}
 }
 
 __u32 lustre_msg_get_service_time(struct lustre_msg *msg)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-	case LUSTRE_MSG_MAGIC_V1_SWABBED:
-		return 0;
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
 		if (!pb) {
@@ -1283,9 +1246,6 @@ __u32 lustre_msg_get_service_time(struct lustre_msg *msg)
 char *lustre_msg_get_jobid(struct lustre_msg *msg)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-	case LUSTRE_MSG_MAGIC_V1_SWABBED:
-		return NULL;
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb =
 			lustre_msg_buf_v2(msg, MSG_PTLRPC_BODY_OFF,
@@ -1409,8 +1369,6 @@ EXPORT_SYMBOL(lustre_msg_set_last_committed);
 void lustre_msg_set_versions(struct lustre_msg *msg, __u64 *versions)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-		return;
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
 		LASSERTF(pb, "invalid msg %p: no ptlrpc body!\n", msg);
@@ -1474,8 +1432,6 @@ EXPORT_SYMBOL(lustre_msg_set_conn_cnt);
 void lustre_msg_set_timeout(struct lustre_msg *msg, __u32 timeout)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-		return;
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
 		LASSERTF(pb, "invalid msg %p: no ptlrpc body!\n", msg);
@@ -1490,8 +1446,6 @@ void lustre_msg_set_timeout(struct lustre_msg *msg, __u32 timeout)
 void lustre_msg_set_service_time(struct lustre_msg *msg, __u32 service_time)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-		return;
 	case LUSTRE_MSG_MAGIC_V2: {
 		struct ptlrpc_body *pb = lustre_msg_ptlrpc_body(msg);
 		LASSERTF(pb, "invalid msg %p: no ptlrpc body!\n", msg);
@@ -1506,8 +1460,6 @@ void lustre_msg_set_service_time(struct lustre_msg *msg, __u32 service_time)
 void lustre_msg_set_jobid(struct lustre_msg *msg, char *jobid)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-		return;
 	case LUSTRE_MSG_MAGIC_V2: {
 		__u32 opc = lustre_msg_get_opc(msg);
 		struct ptlrpc_body *pb;
@@ -1537,8 +1489,6 @@ EXPORT_SYMBOL(lustre_msg_set_jobid);
 void lustre_msg_set_cksum(struct lustre_msg *msg, __u32 cksum)
 {
 	switch (msg->lm_magic) {
-	case LUSTRE_MSG_MAGIC_V1:
-		return;
 	case LUSTRE_MSG_MAGIC_V2:
 		msg->lm_cksum = cksum;
 		return;
diff --git a/drivers/staging/lustre/lustre/ptlrpc/wiretest.c b/drivers/staging/lustre/lustre/ptlrpc/wiretest.c
index d6d9204..b2313af 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/wiretest.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/wiretest.c
@@ -636,12 +636,8 @@ void lustre_assert_wire_constants(void)
 		 (long long)(int)offsetof(struct lustre_msg_v2, lm_buflens[0]));
 	LASSERTF((int)sizeof(((struct lustre_msg_v2 *)0)->lm_buflens[0]) == 4, "found %lld\n",
 		 (long long)(int)sizeof(((struct lustre_msg_v2 *)0)->lm_buflens[0]));
-	LASSERTF(LUSTRE_MSG_MAGIC_V1 == 0x0BD00BD0, "found 0x%.8x\n",
-		LUSTRE_MSG_MAGIC_V1);
 	LASSERTF(LUSTRE_MSG_MAGIC_V2 == 0x0BD00BD3, "found 0x%.8x\n",
 		LUSTRE_MSG_MAGIC_V2);
-	LASSERTF(LUSTRE_MSG_MAGIC_V1_SWABBED == 0xD00BD00B, "found 0x%.8x\n",
-		LUSTRE_MSG_MAGIC_V1_SWABBED);
 	LASSERTF(LUSTRE_MSG_MAGIC_V2_SWABBED == 0xD30BD00B, "found 0x%.8x\n",
 		LUSTRE_MSG_MAGIC_V2_SWABBED);
 
-- 
2.1.0
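
The "reject early" idea from the commit message above can be sketched
outside the kernel as follows. This is an illustration only: struct msg_hdr,
msg_check_version() and the choice of -EPROTO and -EINVAL as return codes
are assumptions, not the ptlrpc API. The two magic values are the ones that
appear in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_MAGIC_V1 0x0BD00BD0u /* value of the removed LUSTRE_MSG_MAGIC_V1 */
#define MSG_MAGIC_V2 0x0BD00BD3u /* value of LUSTRE_MSG_MAGIC_V2 */

/* Hypothetical, simplified message header. */
struct msg_hdr {
	uint32_t magic;
	uint32_t version;
};

/*
 * Return 0 only for a V2 message with the expected version. Any other
 * magic, including the old V1 value, yields an error so that the caller
 * can drop the request instead of logging and carrying on.
 */
static int msg_check_version(const struct msg_hdr *msg, uint32_t expected)
{
	assert(msg != NULL);

	switch (msg->magic) {
	case MSG_MAGIC_V2:
		return msg->version == expected ? 0 : -EINVAL;
	case MSG_MAGIC_V1: /* recognized, but no longer supported */
	default:
		return -EPROTO;
	}
}
```

A caller would check the result before touching any other field of the
message, which is the point of returning an error rather than 0 here.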


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 06/19] staging/lustre/lmv: fix potential null pointer dereference
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (4 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 05/19] staging/lustre/ptlrpc: remove LUSTRE_MSG_MAGIC_V1 support green
@ 2015-09-14 22:41 ` green
  2015-09-15 13:26   ` Trevor Woerner
  2015-09-14 22:41 ` [PATCH 07/19] staging/lustre/llite: deny non-root user for changelog operations green
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Oleg Drokin

From: Oleg Drokin <oleg.drokin@intel.com>

In lmv_disconnect_mdc(), remove the sysfs link only if we actually
know the name.

Reviewed-on: http://review.whamcloud.com/14605
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6517
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lustre/lmv/lmv_obd.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/lustre/lustre/lmv/lmv_obd.c b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
index 0fc0b61..cebbacf 100644
--- a/drivers/staging/lustre/lustre/lmv/lmv_obd.c
+++ b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
@@ -593,11 +593,11 @@ static int lmv_disconnect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
 		mdc_obd->obd_force = obd->obd_force;
 		mdc_obd->obd_fail = obd->obd_fail;
 		mdc_obd->obd_no_recov = obd->obd_no_recov;
-	}
 
-	if (lmv->lmv_tgts_kobj)
-		sysfs_remove_link(lmv->lmv_tgts_kobj,
-				  mdc_obd->obd_name);
+		if (lmv->lmv_tgts_kobj)
+			sysfs_remove_link(lmv->lmv_tgts_kobj,
+					  mdc_obd->obd_name);
+	}
 
 	rc = obd_fid_fini(tgt->ltd_exp->exp_obd);
 	if (rc)
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 07/19] staging/lustre/llite: deny non-root user for changelog operations
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (5 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 06/19] staging/lustre/lmv: fix potential null pointer dereference green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 08/19] staging/lustre/o2iblnd: connection refcount fix for kiblnd_post_rx green
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Niu Yawei, Oleg Drokin

From: Niu Yawei <yawei.niu@intel.com>

To avoid potential security problems, non-privileged users should
have no permission to run 'lfs changelog' & 'lfs changelog_clear'.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Reviewed-on: http://review.whamcloud.com/14280
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6415
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lustre/llite/dir.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index d407fcc..cc6f0f5 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -1734,6 +1734,9 @@ out_quotactl:
 	}
 	case OBD_IOC_CHANGELOG_SEND:
 	case OBD_IOC_CHANGELOG_CLEAR:
+		if (!capable(CFS_CAP_SYS_ADMIN))
+			return -EPERM;
+
 		rc = copy_and_ioctl(cmd, sbi->ll_md_exp, (void *)arg,
 				    sizeof(struct ioc_changelog));
 		return rc;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 08/19] staging/lustre/o2iblnd: connection refcount fix for kiblnd_post_rx
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (6 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 07/19] staging/lustre/llite: deny non-root user for changelog operations green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 09/19] staging/lustre/osc: LBUG in osc_lru_reclaim green
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Liang Zhen, Oleg Drokin

From: Liang Zhen <liang.zhen@intel.com>

kiblnd_post_rx() can't refer to rx::rx_conn anymore after
ib_post_recv() because this rx can be polled out by another thread
which may drop this rx and destroy rx::rx_conn.

This patch fixes this issue by taking an extra refcount on the
connection before calling ib_post_recv().

Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Reviewed-on: http://review.whamcloud.com/12852
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5678
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Amir Shehata <amir.shehata@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 345ed4d..c0f5682 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -178,24 +178,28 @@ kiblnd_post_rx(kib_rx_t *rx, int credit)
 
 	rx->rx_nob = -1;			/* flag posted */
 
+	/* NB: need an extra reference after ib_post_recv because we don't
+	 * own this rx (and rx::rx_conn) anymore, LU-5678.
+	 */
+	kiblnd_conn_addref(conn);
 	rc = ib_post_recv(conn->ibc_cmid->qp, &rx->rx_wrq, &bad_wrq);
-	if (rc != 0) {
+	if (unlikely(rc != 0)) {
 		CERROR("Can't post rx for %s: %d, bad_wrq: %p\n",
 		       libcfs_nid2str(conn->ibc_peer->ibp_nid), rc, bad_wrq);
 		rx->rx_nob = 0;
 	}
 
 	if (conn->ibc_state < IBLND_CONN_ESTABLISHED) /* Initial post */
-		return rc;
+		goto out;
 
-	if (rc != 0) {
+	if (unlikely(rc != 0)) {
 		kiblnd_close_conn(conn, rc);
 		kiblnd_drop_rx(rx);	     /* No more posts for this rx */
-		return rc;
+		goto out;
 	}
 
 	if (credit == IBLND_POSTRX_NO_CREDIT)
-		return 0;
+		goto out;
 
 	spin_lock(&conn->ibc_lock);
 	if (credit == IBLND_POSTRX_PEER_CREDIT)
@@ -205,7 +209,9 @@ kiblnd_post_rx(kib_rx_t *rx, int credit)
 	spin_unlock(&conn->ibc_lock);
 
 	kiblnd_check_sends(conn);
-	return 0;
+out:
+	kiblnd_conn_decref(conn);
+	return rc;
 }
 
 static kib_tx_t *
-- 
2.1.0
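
The extra-reference pattern described in the patch above can be sketched in
userspace C. Everything here is hypothetical: struct conn, conn_get(),
conn_put() and fake_post_recv() stand in for the o2iblnd types and for
ib_post_recv(), and the "other thread" is simulated by having the fake post
drop a reference immediately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted connection; *destroyed records the free. */
struct conn {
	atomic_int refcount;
	int *destroyed;
};

static struct conn *conn_new(int *destroyed)
{
	struct conn *c = malloc(sizeof(*c));

	assert(c != NULL);
	atomic_init(&c->refcount, 1); /* reference held on behalf of the rx */
	c->destroyed = destroyed;
	*destroyed = 0;
	return c;
}

static void conn_get(struct conn *c)
{
	atomic_fetch_add(&c->refcount, 1);
}

static void conn_put(struct conn *c)
{
	/* atomic_fetch_sub() returns the old value: 1 means we were last. */
	if (atomic_fetch_sub(&c->refcount, 1) == 1) {
		*c->destroyed = 1;
		free(c);
	}
}

/* Stand-in for the post: pretend the rx completed at once and its
 * reference on the connection was dropped by whoever polled it. */
static int fake_post_recv(struct conn *c)
{
	conn_put(c);
	return 0;
}

static int post_rx(struct conn *c, int *refs_after_post)
{
	int rc;

	conn_get(c);			/* pin the connection across the post */
	rc = fake_post_recv(c);
	/* Still safe to touch c here: our own reference keeps it alive. */
	*refs_after_post = atomic_load(&c->refcount);
	conn_put(c);			/* may free c; do not use it afterwards */
	return rc;
}
```

Without the conn_get() call, the fake post would drop the last reference and
the read of c->refcount would be a use-after-free, which is the situation
the patch guards against.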


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 09/19] staging/lustre/osc: LBUG in osc_lru_reclaim
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (7 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 08/19] staging/lustre/o2iblnd: connection refcount fix for kiblnd_post_rx green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 10/19] staging/lustre/libcfs: minor fix in cfs_hash_for_each_relax() green
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Hiroya Nozaki, Oleg Drokin

From: Hiroya Nozaki <nozaki.hiroya@jp.fujitsu.com>

LASSERT touches cl_client_cache->ccc_lru without any protection.
So this patch moves the LASSERT into the section protected by
cl_client_cache->ccc_lru_lock.

Signed-off-by: Hiroya Nozaki <nozaki.hiroya@jp.fujitsu.com>
Reviewed-on: http://review.whamcloud.com/14901
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6624
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lustre/osc/osc_page.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lustre/osc/osc_page.c b/drivers/staging/lustre/lustre/osc/osc_page.c
index 856d859..2af3232 100644
--- a/drivers/staging/lustre/lustre/osc/osc_page.c
+++ b/drivers/staging/lustre/lustre/osc/osc_page.c
@@ -818,7 +818,6 @@ static int osc_lru_reclaim(struct client_obd *cli)
 	int rc;
 
 	LASSERT(cache != NULL);
-	LASSERT(!list_empty(&cache->ccc_lru));
 
 	rc = osc_lru_shrink(cli, lru_shrink_min);
 	if (rc != 0) {
@@ -835,6 +834,8 @@ static int osc_lru_reclaim(struct client_obd *cli)
 	/* Reclaim LRU slots from other client_obd as it can't free enough
 	 * from its own. This should rarely happen. */
 	spin_lock(&cache->ccc_lru_lock);
+	LASSERT(!list_empty(&cache->ccc_lru));
+
 	cache->ccc_lru_shrinkers++;
 	list_move_tail(&cli->cl_lru_osc, &cache->ccc_lru);
 
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 10/19] staging/lustre/libcfs: minor fix in cfs_hash_for_each_relax()
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (8 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 09/19] staging/lustre/osc: LBUG in osc_lru_reclaim green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 11/19] staging/lustre/lnet: fix deadloop in ksocknal_push green
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Niu Yawei, Oleg Drokin

From: Niu Yawei <yawei.niu@intel.com>

cfs_hash_for_each_relax() should break iteration when the callback
returns a non-zero value.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Reviewed-on: http://review.whamcloud.com/14927
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6636
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lustre/libcfs/hash.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/staging/lustre/lustre/libcfs/hash.c b/drivers/staging/lustre/lustre/libcfs/hash.c
index 286641b..6f4c7d4 100644
--- a/drivers/staging/lustre/lustre/libcfs/hash.c
+++ b/drivers/staging/lustre/lustre/libcfs/hash.c
@@ -1623,8 +1623,12 @@ cfs_hash_for_each_relax(struct cfs_hash *hs, cfs_hash_for_each_cb_t func,
 				if (rc) /* callback wants to break iteration */
 					break;
 			}
+			if (rc) /* callback wants to break iteration */
+				break;
 		}
 		cfs_hash_bd_unlock(hs, &bd, 0);
+		if (rc) /* callback wants to break iteration */
+			break;
 	}
 	cfs_hash_unlock(hs, 0);
 
-- 
2.1.0
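
The underlying issue in the patch above is that a "break" leaves only the
innermost loop, so the callback's return value has to be re-checked at
every enclosing level. A minimal sketch with two loop levels instead of the
three in cfs_hash_for_each_relax(); the function names and the flat table
layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*visit_cb)(int item, void *arg);

/*
 * Walk a table of nbuckets * nslots items and call cb on each one.
 * Returns the number of callback invocations. A non-zero return from cb
 * stops the whole walk, not only the scan of the current bucket.
 */
static int for_each_relax(const int *table, size_t nbuckets, size_t nslots,
			  visit_cb cb, void *arg)
{
	int calls = 0;
	int rc = 0;

	assert(cb != NULL);

	for (size_t b = 0; b < nbuckets; b++) {
		for (size_t s = 0; s < nslots; s++) {
			calls++;
			rc = cb(table[b * nslots + s], arg);
			if (rc) /* callback wants to break iteration */
				break;
		}
		if (rc) /* propagate the break to the outer loop */
			break;
	}
	return calls;
}

/* Example callback: stop when the item equals *(int *)arg. */
static int stop_at(int item, void *arg)
{
	return item == *(int *)arg;
}
```

If the second check is dropped, a callback that asks to stop in bucket 0
would still be called for every item in the remaining buckets.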


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 11/19] staging/lustre/lnet: fix deadloop in ksocknal_push
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (9 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 10/19] staging/lustre/libcfs: minor fix in cfs_hash_for_each_relax() green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 12/19] staging/lustre/o2iblnd: wrong uses of kib_tx_t::tx_nfrags green
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Liang Zhen, Oleg Drokin

From: Liang Zhen <liang.zhen@intel.com>

ksocknal_push() should break the loop if it can't find a matching peer.

Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-4423
Reviewed-on: http://review.whamcloud.com/10128
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Doug Oucharek <doug.s.oucharek@intel.com>
Reviewed-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 .../staging/lustre/lnet/klnds/socklnd/socklnd.c    | 51 +++++++++++-----------
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
index d8bfcad..22f4cd0 100644
--- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
+++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
@@ -1874,52 +1874,51 @@ ksocknal_push_peer(ksock_peer_t *peer)
 	}
 }
 
-static int
-ksocknal_push(lnet_ni_t *ni, lnet_process_id_t id)
+static int ksocknal_push(lnet_ni_t *ni, lnet_process_id_t id)
 {
-	ksock_peer_t *peer;
+	struct list_head *start;
+	struct list_head *end;
 	struct list_head *tmp;
-	int index;
-	int i;
-	int j;
 	int rc = -ENOENT;
+	unsigned int hsize = ksocknal_data.ksnd_peer_hash_size;
 
-	for (i = 0; i < ksocknal_data.ksnd_peer_hash_size; i++) {
-		for (j = 0; ; j++) {
-			read_lock(&ksocknal_data.ksnd_global_lock);
+	if (id.nid == LNET_NID_ANY) {
+		start = &ksocknal_data.ksnd_peers[0];
+		end = &ksocknal_data.ksnd_peers[hsize - 1];
+	} else {
+		start = end = ksocknal_nid2peerlist(id.nid);
+	}
 
-			index = 0;
-			peer = NULL;
+	for (tmp = start; tmp <= end; tmp++) {
+		int peer_off; /* searching offset in peer hash table */
 
-			list_for_each(tmp, &ksocknal_data.ksnd_peers[i]) {
-				peer = list_entry(tmp, ksock_peer_t,
-						      ksnp_list);
+		for (peer_off = 0; ; peer_off++) {
+			ksock_peer_t *peer;
+			int i = 0;
 
+			read_lock(&ksocknal_data.ksnd_global_lock);
+			list_for_each_entry(peer, tmp, ksnp_list) {
 				if (!((id.nid == LNET_NID_ANY ||
 				       id.nid == peer->ksnp_id.nid) &&
 				      (id.pid == LNET_PID_ANY ||
-				       id.pid == peer->ksnp_id.pid))) {
-					peer = NULL;
+				       id.pid == peer->ksnp_id.pid)))
 					continue;
-				}
 
-				if (index++ == j) {
+				if (i++ == peer_off) {
 					ksocknal_peer_addref(peer);
 					break;
 				}
 			}
-
 			read_unlock(&ksocknal_data.ksnd_global_lock);
 
-			if (peer != NULL) {
-				rc = 0;
-				ksocknal_push_peer(peer);
-				ksocknal_peer_decref(peer);
-			}
-		}
+			if (i == 0) /* no match */
+				break;
 
+			rc = 0;
+			ksocknal_push_peer(peer);
+			ksocknal_peer_decref(peer);
+		}
 	}
-
 	return rc;
 }
 
-- 
2.1.0
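
The loop shape in the patch above (rescan the list on each pass, pick the
N-th match, process it outside the lock, stop when no N-th match exists) can
be sketched on a plain array. This is not the socklnd code: struct peer,
NID_ANY and push_matching() are hypothetical, locking is reduced to
comments, and the stop condition is written as "no peer found on this
pass", which is one way to express "no more matching peers".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NID_ANY (-1) /* hypothetical wildcard, in the spirit of LNET_NID_ANY */

struct peer {
	int nid;
	int pushed; /* how many times this peer was processed */
};

/* Push every peer matching nid. Returns 0 if at least one peer matched,
 * -ENOENT otherwise. */
static int push_matching(struct peer *peers, size_t n, int nid)
{
	int rc = -ENOENT;

	assert(peers != NULL || n == 0);

	for (size_t off = 0; ; off++) {
		struct peer *found = NULL;
		size_t seen = 0;

		/* The real code would take a read lock around this scan. */
		for (size_t k = 0; k < n; k++) {
			if (nid != NID_ANY && peers[k].nid != nid)
				continue;
			if (seen++ == off) {
				found = &peers[k];
				break;
			}
		}
		/* ... and drop the lock here, before doing the work. */

		if (found == NULL) /* fewer than off + 1 matches: done */
			break;

		rc = 0;
		found->pushed++;
	}
	return rc;
}
```

The inner scan restarts from the beginning on every pass because the list
may change while the lock is not held; the offset is what guarantees
forward progress and, together with the "not found" test, termination.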


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 12/19] staging/lustre/o2iblnd: wrong uses of kib_tx_t::tx_nfrags
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (10 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 11/19] staging/lustre/lnet: fix deadloop in ksocknal_push green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 13/19] staging/lustre/llite: ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed green
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Isaac Huang, Oleg Drokin

From: Isaac Huang <he.huang@intel.com>

The kib_tx_t::tx_nfrags field is the number of entries in
the kib_tx_t::tx_frags array, rather than the number of
DMA-mapped entries. So kiblnd_send/kiblnd_recv should
use kib_rdma_desc_t::rd_nfrags instead.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Reviewed-on: http://review.whamcloud.com/12857
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5956
Reviewed-by: Liang Zhen <liang.zhen@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Doug Oucharek <doug.s.oucharek@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  3 --
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 37 +++++++++++-----------
 2 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 07e81cb..eb08400 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -912,9 +912,6 @@ struct ib_mr *kiblnd_find_dma_mr(kib_hca_dev_t *hdev,
 				 __u64 addr, __u64 size);
 void kiblnd_map_rx_descs(kib_conn_t *conn);
 void kiblnd_unmap_rx_descs(kib_conn_t *conn);
-int kiblnd_map_tx(lnet_ni_t *ni, kib_tx_t *tx,
-		  kib_rdma_desc_t *rd, int nfrags);
-void kiblnd_unmap_tx(lnet_ni_t *ni, kib_tx_t *tx);
 void kiblnd_pool_free_node(kib_pool_t *pool, struct list_head *node);
 struct list_head *kiblnd_pool_alloc_node(kib_poolset_t *ps);
 
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index c0f5682..8ab73ee 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -40,6 +40,8 @@
 
 #include "o2iblnd.h"
 
+static void kiblnd_unmap_tx(lnet_ni_t *ni, kib_tx_t *tx);
+
 static void
 kiblnd_tx_done(lnet_ni_t *ni, kib_tx_t *tx)
 {
@@ -596,8 +598,7 @@ kiblnd_fmr_map_tx(kib_net_t *net, kib_tx_t *tx, kib_rdma_desc_t *rd, int nob)
 	return 0;
 }
 
-void
-kiblnd_unmap_tx(lnet_ni_t *ni, kib_tx_t *tx)
+static void kiblnd_unmap_tx(lnet_ni_t *ni, kib_tx_t *tx)
 {
 	kib_net_t *net = ni->ni_data;
 
@@ -615,9 +616,8 @@ kiblnd_unmap_tx(lnet_ni_t *ni, kib_tx_t *tx)
 	}
 }
 
-int
-kiblnd_map_tx(lnet_ni_t *ni, kib_tx_t *tx,
-	      kib_rdma_desc_t *rd, int nfrags)
+static int kiblnd_map_tx(lnet_ni_t *ni, kib_tx_t *tx, kib_rdma_desc_t *rd,
+			 int nfrags)
 {
 	kib_hca_dev_t *hdev = tx->tx_pool->tpo_hdev;
 	kib_net_t *net = ni->ni_data;
@@ -1427,6 +1427,7 @@ kiblnd_send(lnet_ni_t *ni, void *private, lnet_msg_t *lntmsg)
 	unsigned int payload_offset = lntmsg->msg_offset;
 	unsigned int payload_nob = lntmsg->msg_len;
 	kib_msg_t *ibmsg;
+	kib_rdma_desc_t  *rd;
 	kib_tx_t *tx;
 	int nob;
 	int rc;
@@ -1470,16 +1471,14 @@ kiblnd_send(lnet_ni_t *ni, void *private, lnet_msg_t *lntmsg)
 		}
 
 		ibmsg = tx->tx_msg;
-
+		rd = &ibmsg->ibm_u.get.ibgm_rd;
 		if ((lntmsg->msg_md->md_options & LNET_MD_KIOV) == 0)
-			rc = kiblnd_setup_rd_iov(ni, tx,
-						 &ibmsg->ibm_u.get.ibgm_rd,
+			rc = kiblnd_setup_rd_iov(ni, tx, rd,
 						 lntmsg->msg_md->md_niov,
 						 lntmsg->msg_md->md_iov.iov,
 						 0, lntmsg->msg_md->md_length);
 		else
-			rc = kiblnd_setup_rd_kiov(ni, tx,
-						  &ibmsg->ibm_u.get.ibgm_rd,
+			rc = kiblnd_setup_rd_kiov(ni, tx, rd,
 						  lntmsg->msg_md->md_niov,
 						  lntmsg->msg_md->md_iov.kiov,
 						  0, lntmsg->msg_md->md_length);
@@ -1490,7 +1489,7 @@ kiblnd_send(lnet_ni_t *ni, void *private, lnet_msg_t *lntmsg)
 			return -EIO;
 		}
 
-		nob = offsetof(kib_get_msg_t, ibgm_rd.rd_frags[tx->tx_nfrags]);
+		nob = offsetof(kib_get_msg_t, ibgm_rd.rd_frags[rd->rd_nfrags]);
 		ibmsg->ibm_u.get.ibgm_cookie = tx->tx_cookie;
 		ibmsg->ibm_u.get.ibgm_hdr = *hdr;
 
@@ -1655,7 +1654,6 @@ kiblnd_recv(lnet_ni_t *ni, void *private, lnet_msg_t *lntmsg, int delayed,
 	kib_msg_t *rxmsg = rx->rx_msg;
 	kib_conn_t *conn = rx->rx_conn;
 	kib_tx_t *tx;
-	kib_msg_t *txmsg;
 	int nob;
 	int post_credit = IBLND_POSTRX_PEER_CREDIT;
 	int rc = 0;
@@ -1692,7 +1690,10 @@ kiblnd_recv(lnet_ni_t *ni, void *private, lnet_msg_t *lntmsg, int delayed,
 		lnet_finalize(ni, lntmsg, 0);
 		break;
 
-	case IBLND_MSG_PUT_REQ:
+	case IBLND_MSG_PUT_REQ: {
+		kib_msg_t	*txmsg;
+		kib_rdma_desc_t *rd;
+
 		if (mlen == 0) {
 			lnet_finalize(ni, lntmsg, 0);
 			kiblnd_send_completion(rx->rx_conn, IBLND_MSG_PUT_NAK, 0,
@@ -1710,13 +1711,12 @@ kiblnd_recv(lnet_ni_t *ni, void *private, lnet_msg_t *lntmsg, int delayed,
 		}
 
 		txmsg = tx->tx_msg;
+		rd = &txmsg->ibm_u.putack.ibpam_rd;
 		if (kiov == NULL)
-			rc = kiblnd_setup_rd_iov(ni, tx,
-						 &txmsg->ibm_u.putack.ibpam_rd,
+			rc = kiblnd_setup_rd_iov(ni, tx, rd,
 						 niov, iov, offset, mlen);
 		else
-			rc = kiblnd_setup_rd_kiov(ni, tx,
-						  &txmsg->ibm_u.putack.ibpam_rd,
+			rc = kiblnd_setup_rd_kiov(ni, tx, rd,
 						  niov, kiov, offset, mlen);
 		if (rc != 0) {
 			CERROR("Can't setup PUT sink for %s: %d\n",
@@ -1728,7 +1728,7 @@ kiblnd_recv(lnet_ni_t *ni, void *private, lnet_msg_t *lntmsg, int delayed,
 			break;
 		}
 
-		nob = offsetof(kib_putack_msg_t, ibpam_rd.rd_frags[tx->tx_nfrags]);
+		nob = offsetof(kib_putack_msg_t, ibpam_rd.rd_frags[rd->rd_nfrags]);
 		txmsg->ibm_u.putack.ibpam_src_cookie = rxmsg->ibm_u.putreq.ibprm_cookie;
 		txmsg->ibm_u.putack.ibpam_dst_cookie = tx->tx_cookie;
 
@@ -1741,6 +1741,7 @@ kiblnd_recv(lnet_ni_t *ni, void *private, lnet_msg_t *lntmsg, int delayed,
 		/* reposted buffer reserved for PUT_DONE */
 		post_credit = IBLND_POSTRX_NO_CREDIT;
 		break;
+		}
 
 	case IBLND_MSG_GET_REQ:
 		if (lntmsg != NULL) {
-- 
2.1.0



* [PATCH 13/19] staging/lustre/llite: ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (11 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 12/19] staging/lustre/o2iblnd: wrong uses of kib_tx_t::tx_nfrags green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 14/19] staging/lustre/obdclass: Eliminate hash bucket scans in lu_cache_shrink green
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Andrew Perepechko, Oleg Drokin

From: Andrew Perepechko <andrew.perepechko@seagate.com>

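ll_iget_for_nfs() calls iput() when d_obtain_alias() fails, but
d_obtain_alias() already drops the inode reference on error, so the
extra iput() is unbalanced and causes memory leaks. This patch
removes this iput() call.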

Also, avoid unhashing disconnected dentries in
d_lustre_invalidate(), which is another source of memory
leaks.
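
The reference-ownership rule behind the iput() removal can be
illustrated with a minimal user-space sketch. It is not the kernel
code: toy_inode, toy_iput(), toy_obtain_alias() and toy_iget_for_nfs()
are hypothetical stand-ins, and the only behaviour assumed from the
patch is that d_obtain_alias() drops the inode reference itself when
it fails.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an inode: only the reference count matters here. */
struct toy_inode {
	int refcount;
};

static void toy_iput(struct toy_inode *inode)
{
	/* Dropping a reference that is not held would be a bug. */
	assert(inode->refcount > 0);
	inode->refcount--;
}

/*
 * Hypothetical stand-in for d_obtain_alias(): on success the caller's
 * reference is handed over to the returned alias; on failure the
 * reference is dropped here, so the caller no longer owns it.
 */
static struct toy_inode *toy_obtain_alias(struct toy_inode *inode, int fail)
{
	if (fail) {
		toy_iput(inode);
		return NULL;
	}
	return inode;
}

/* Fixed caller: no second put on the error path, just pass the result on. */
static struct toy_inode *toy_iget_for_nfs(struct toy_inode *inode, int fail)
{
	return toy_obtain_alias(inode, fail);
}
```

With this convention, a caller that also dropped the reference after a
failure would put the same reference twice.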

One of the symptoms of the leak is the following crash pattern:
LustreError: 14812:0:(lu_object.c:1251:lu_device_fini())
 ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
LustreError: 14812:0:(lu_object.c:1251:lu_device_fini()) LBUG
Pid: 14812, comm: umount

Call Trace:
 [<ffffffffa11bc895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa11bce97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa1458a48>] lu_device_fini+0xb8/0xc0 [obdclass]
 [<ffffffffa08e9ab2>] lovsub_device_free+0x52/0x220 [lov]
 [<ffffffffa145c64e>] lu_stack_fini+0x7e/0xc0 [obdclass]
 [<ffffffffa146356e>] cl_stack_fini+0xe/0x10 [obdclass]
 [<ffffffffa08bc1a8>] lov_device_fini+0x58/0x120 [lov]
 [<ffffffffa145c619>] lu_stack_fini+0x49/0xc0 [obdclass]
 [<ffffffffa146356e>] cl_stack_fini+0xe/0x10 [obdclass]
 [<ffffffffa0e1279d>] cl_sb_fini+0x6d/0x190 [lustre]
 [<ffffffffa0dd34bc>] ll_put_super+0x1bc/0x11e0 [lustre]
 [<ffffffff811cd0f2>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
 [<ffffffff811a59df>] ? destroy_inode+0x2f/0x60
 [<ffffffff811a5eac>] ? dispose_list+0xfc/0x120
 [<ffffffff811a62a6>] ? invalidate_inodes+0xf6/0x190
 [<ffffffff8118b35b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff8118b446>] kill_anon_super+0x16/0x60
 [<ffffffffa144e7ba>] lustre_kill_super+0x4a/0x60 [obdclass]
 [<ffffffff8118bbe7>] deactivate_super+0x57/0x80
 [<ffffffff811aabef>] mntput_no_expire+0xbf/0x110
 [<ffffffff811ab73b>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Signed-off-by: Andrew Perepechko <andrew.perepechko@seagate.com>
Reviewed-on: http://review.whamcloud.com/15480
Xyratex-bug-id: MRP-2414
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6794
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Reviewed-by: Bobi Jam <bobijam@hotmail.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lustre/llite/llite_internal.h | 10 +++++++++-
 drivers/staging/lustre/lustre/llite/llite_nfs.c      |  5 +----
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 8a3b03e..37bf331 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -1466,7 +1466,15 @@ static inline void d_lustre_invalidate(struct dentry *dentry, int nested)
 	spin_lock_nested(&dentry->d_lock,
 			 nested ? DENTRY_D_LOCK_NESTED : DENTRY_D_LOCK_NORMAL);
 	__d_lustre_invalidate(dentry);
-	if (d_count(dentry) == 0)
+	/*
+	 * We should be careful about dentries created by d_obtain_alias().
+	 * These dentries are not put in the dentry tree, instead they are
+	 * linked to sb->s_anon through dentry->d_hash.
+	 * shrink_dcache_for_umount() shrinks the tree and sb->s_anon list.
+	 * If we unhashed such a dentry, unmount would not be able to find
+	 * it and busy inodes would be reported.
+	 */
+	if (d_count(dentry) == 0 && !(dentry->d_flags & DCACHE_DISCONNECTED))
 		__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
 }
diff --git a/drivers/staging/lustre/lustre/llite/llite_nfs.c b/drivers/staging/lustre/lustre/llite/llite_nfs.c
index 8d1c253..400acacc 100644
--- a/drivers/staging/lustre/lustre/llite/llite_nfs.c
+++ b/drivers/staging/lustre/lustre/llite/llite_nfs.c
@@ -168,11 +168,8 @@ ll_iget_for_nfs(struct super_block *sb, struct lu_fid *fid, struct lu_fid *paren
 		spin_unlock(&lli->lli_lock);
 	}
 
+	/* N.B. d_obtain_alias() drops inode ref on error */
 	result = d_obtain_alias(inode);
-	if (IS_ERR(result)) {
-		iput(inode);
-		return result;
-	}
 
 	return result;
 }
-- 
2.1.0



* [PATCH 14/19] staging/lustre/obdclass: Eliminate hash bucket scans in lu_cache_shrink
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (12 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 13/19] staging/lustre/llite: ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 15/19] staging/lustre: Remove unused MAY_ constants green
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Ann Koehler, Oleg Drokin

From: Ann Koehler <amk@cray.com>

The lu_cache_shrink slab shrinker is too slow, accounting for > 90% of
the time spent in shrink_slab when allocating huge pages. Most of its
time is spent iterating over the buckets in each site's object hash
table to compute the number of freeable objects. This iteration is
eliminated by adding an lru length count to the lu_site struct. A
percpu counter is used to maintain the lru length, so that the
lu_site does not need to be locked when an object is accessed through
the hash table. A counter is updated whenever an object is added to
or deleted from any of the hash table buckets. The number of freeable
objects is the sum of the counter values across all cpus.
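
The counting scheme described above can be sketched in plain C as
follows. This is a single-threaded illustration only, not the lprocfs
implementation: toy_lru_len, TOY_NR_CPUS and the toy_lru_*() helpers
are hypothetical names. It shows why the sum is clamped at zero: an
object may be added on one cpu and removed on another, so an individual
slot (and, transiently, the total) can be negative.

```c
#include <assert.h>

#define TOY_NR_CPUS 4	/* hypothetical cpu count for the sketch */

/* One signed slot per cpu; a single slot may go negative on its own. */
struct toy_lru_len {
	long slot[TOY_NR_CPUS];
};

/* Called when an object is added to an lru list, on the given cpu. */
static void toy_lru_add(struct toy_lru_len *c, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	c->slot[cpu]++;
}

/* Called when an object is removed from an lru list, on the given cpu. */
static void toy_lru_del(struct toy_lru_len *c, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	c->slot[cpu]--;
}

/* Sum all slots without any lock; hide a transient negative total. */
static unsigned long toy_lru_read(const struct toy_lru_len *c)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += c->slot[cpu];
	return sum > 0 ? (unsigned long)sum : 0;
}
```

Reading the counter costs one pass over the cpus instead of one pass
over every hash bucket, which is the saving the patch is after.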

Signed-off-by: Ann Koehler <amk@cray.com>
Reviewed-on: http://review.whamcloud.com/14066
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6365
Reviewed-by: Mike Pershin <mike.pershin@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lustre/include/lu_object.h  |  1 +
 drivers/staging/lustre/lustre/obdclass/lu_object.c | 66 ++++++++++++++--------
 2 files changed, 42 insertions(+), 25 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lu_object.h b/drivers/staging/lustre/lustre/include/lu_object.h
index ea13a82..96e271d 100644
--- a/drivers/staging/lustre/lustre/include/lu_object.h
+++ b/drivers/staging/lustre/lustre/include/lu_object.h
@@ -584,6 +584,7 @@ enum {
 	LU_SS_CACHE_RACE,
 	LU_SS_CACHE_DEATH_RACE,
 	LU_SS_LRU_PURGED,
+	LU_SS_LRU_LEN,	/* # of objects in lsb_lru lists */
 	LU_SS_LAST_STAT
 };
 
diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index 4f7899f..c892e82 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -59,6 +59,7 @@
 #include <linux/list.h>
 
 static void lu_object_free(const struct lu_env *env, struct lu_object *o);
+static __u32 ls_stats_read(struct lprocfs_stats *stats, int idx);
 
 /**
  * Decrease reference counter on object. If last reference is freed, return
@@ -126,6 +127,9 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
 		LASSERT(list_empty(&top->loh_lru));
 		list_add_tail(&top->loh_lru, &bkt->lsb_lru);
 		bkt->lsb_lru_len++;
+		lprocfs_counter_incr(site->ls_stats, LU_SS_LRU_LEN);
+		CDEBUG(D_INODE, "Add %p to site lru. hash: %p, bkt: %p, lru_len: %ld\n",
+		       o, site->ls_obj_hash, bkt, bkt->lsb_lru_len);
 		cfs_hash_bd_unlock(site->ls_obj_hash, &bd, 1);
 		return;
 	}
@@ -174,16 +178,18 @@ void lu_object_unhash(const struct lu_env *env, struct lu_object *o)
 	top = o->lo_header;
 	set_bit(LU_OBJECT_HEARD_BANSHEE, &top->loh_flags);
 	if (!test_and_set_bit(LU_OBJECT_UNHASHED, &top->loh_flags)) {
-		struct cfs_hash *obj_hash = o->lo_dev->ld_site->ls_obj_hash;
+		struct lu_site *site = o->lo_dev->ld_site;
+		struct cfs_hash *obj_hash = site->ls_obj_hash;
 		struct cfs_hash_bd bd;
 
 		cfs_hash_bd_get_and_lock(obj_hash, &top->loh_fid, &bd, 1);
 		if (!list_empty(&top->loh_lru)) {
 			struct lu_site_bkt_data *bkt;
 
-		list_del_init(&top->loh_lru);
+			list_del_init(&top->loh_lru);
 			bkt = cfs_hash_bd_extra_get(obj_hash, &bd);
 			bkt->lsb_lru_len--;
+			lprocfs_counter_decr(site->ls_stats, LU_SS_LRU_LEN);
 		}
 		cfs_hash_bd_del_locked(obj_hash, &bd, &top->loh_hash);
 		cfs_hash_bd_unlock(obj_hash, &bd, 1);
@@ -355,6 +361,7 @@ int lu_site_purge(const struct lu_env *env, struct lu_site *s, int nr)
 					       &bd2, &h->loh_hash);
 			list_move(&h->loh_lru, &dispose);
 			bkt->lsb_lru_len--;
+			lprocfs_counter_decr(s->ls_stats, LU_SS_LRU_LEN);
 			if (did_sth == 0)
 				did_sth = 1;
 
@@ -568,8 +575,9 @@ static struct lu_object *htable_lookup(struct lu_site *s,
 		cfs_hash_get(s->ls_obj_hash, hnode);
 		lprocfs_counter_incr(s->ls_stats, LU_SS_CACHE_HIT);
 		if (!list_empty(&h->loh_lru)) {
-		list_del_init(&h->loh_lru);
+			list_del_init(&h->loh_lru);
 			bkt->lsb_lru_len--;
+			lprocfs_counter_decr(s->ls_stats, LU_SS_LRU_LEN);
 		}
 		return lu_object_top(h);
 	}
@@ -1029,6 +1037,12 @@ int lu_site_init(struct lu_site *s, struct lu_device *top)
 			     0, "cache_death_race", "cache_death_race");
 	lprocfs_counter_init(s->ls_stats, LU_SS_LRU_PURGED,
 			     0, "lru_purged", "lru_purged");
+	/*
+	 * Unlike other counters, lru_len can be decremented so
+	 * need lc_sum instead of just lc_count
+	 */
+	lprocfs_counter_init(s->ls_stats, LU_SS_LRU_LEN,
+			     LPROCFS_CNTR_AVGMINMAX, "lru_len", "lru_len");
 
 	INIT_LIST_HEAD(&s->ls_linkage);
 	s->ls_top_dev = top;
@@ -1817,27 +1831,21 @@ static void lu_site_stats_get(struct cfs_hash *hs,
 
 
 /*
- * There exists a potential lock inversion deadlock scenario when using
- * Lustre on top of ZFS. This occurs between one of ZFS's
- * buf_hash_table.ht_lock's, and Lustre's lu_sites_guard lock. Essentially,
- * thread A will take the lu_sites_guard lock and sleep on the ht_lock,
- * while thread B will take the ht_lock and sleep on the lu_sites_guard
- * lock. Obviously neither thread will wake and drop their respective hold
- * on their lock.
+ * lu_cache_shrink_count returns the number of cached objects that are
+ * candidates to be freed by shrink_slab(). A counter, which tracks
+ * the number of items in the site's lru, is maintained in the per cpu
+ * stats of each site. The counter is incremented when an object is added
+ * to a site's lru and decremented when one is removed. The number of
+ * free-able objects is the sum of all per cpu counters for all sites.
  *
- * To prevent this from happening we must ensure the lu_sites_guard lock is
- * not taken while down this code path. ZFS reliably does not set the
- * __GFP_FS bit in its code paths, so this can be used to determine if it
- * is safe to take the lu_sites_guard lock.
- *
- * Ideally we should accurately return the remaining number of cached
- * objects without taking the  lu_sites_guard lock, but this is not
- * possible in the current implementation.
+ * Using a per cpu counter is a compromise solution to concurrent access:
+ * lu_object_put() can update the counter without locking the site and
+ * lu_cache_shrink_count can sum the counters without locking each
+ * ls_obj_hash bucket.
  */
 static unsigned long lu_cache_shrink_count(struct shrinker *sk,
 					   struct shrink_control *sc)
 {
-	struct lu_site_stats stats;
 	struct lu_site *s;
 	struct lu_site *tmp;
 	unsigned long cached = 0;
@@ -1847,14 +1855,14 @@ static unsigned long lu_cache_shrink_count(struct shrinker *sk,
 
 	mutex_lock(&lu_sites_guard);
 	list_for_each_entry_safe(s, tmp, &lu_sites, ls_linkage) {
-		memset(&stats, 0, sizeof(stats));
-		lu_site_stats_get(s->ls_obj_hash, &stats, 0);
-		cached += stats.lss_total - stats.lss_busy;
+		cached += ls_stats_read(s->ls_stats, LU_SS_LRU_LEN);
 	}
 	mutex_unlock(&lu_sites_guard);
 
 	cached = (cached / 100) * sysctl_vfs_cache_pressure;
-	CDEBUG(D_INODE, "%ld objects cached\n", cached);
+	CDEBUG(D_INODE, "%ld objects cached, cache pressure %d\n",
+	       cached, sysctl_vfs_cache_pressure);
+
 	return cached;
 }
 
@@ -1988,6 +1996,13 @@ static __u32 ls_stats_read(struct lprocfs_stats *stats, int idx)
 	struct lprocfs_counter ret;
 
 	lprocfs_stats_collect(stats, idx, &ret);
+	if (idx == LU_SS_LRU_LEN)
+		/*
+		 * protect against counter on cpu A being decremented
+		 * before counter is incremented on cpu B; unlikely
+		 */
+		return (__u32)((ret.lc_sum > 0) ? ret.lc_sum : 0);
+
 	return (__u32)ret.lc_count;
 }
 
@@ -2002,7 +2017,7 @@ int lu_site_stats_print(const struct lu_site *s, struct seq_file *m)
 	memset(&stats, 0, sizeof(stats));
 	lu_site_stats_get(s->ls_obj_hash, &stats, 1);
 
-	seq_printf(m, "%d/%d %d/%d %d %d %d %d %d %d %d\n",
+	seq_printf(m, "%d/%d %d/%d %d %d %d %d %d %d %d %d\n",
 		   stats.lss_busy,
 		   stats.lss_total,
 		   stats.lss_populated,
@@ -2013,7 +2028,8 @@ int lu_site_stats_print(const struct lu_site *s, struct seq_file *m)
 		   ls_stats_read(s->ls_stats, LU_SS_CACHE_MISS),
 		   ls_stats_read(s->ls_stats, LU_SS_CACHE_RACE),
 		   ls_stats_read(s->ls_stats, LU_SS_CACHE_DEATH_RACE),
-		   ls_stats_read(s->ls_stats, LU_SS_LRU_PURGED));
+		   ls_stats_read(s->ls_stats, LU_SS_LRU_PURGED),
+		   ls_stats_read(s->ls_stats, LU_SS_LRU_LEN));
 	return 0;
 }
 EXPORT_SYMBOL(lu_site_stats_print);
-- 
2.1.0



* [PATCH 15/19] staging/lustre: Remove unused MAY_ constants
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (13 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 14/19] staging/lustre/obdclass: Eliminate hash bucket scans in lu_cache_shrink green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 16/19] staging/lustre/osc: use global osc_rq_pool to reduce memory usage green
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Ben Evans, Oleg Drokin

From: Ben Evans <bevans@cray.com>

Remove unused MAY_ constants from lustre_idl.h

Signed-off-by: Ben Evans <bevans@cray.com>
Reviewed-on: http://review.whamcloud.com/15398
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6450
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 .../staging/lustre/lustre/include/lustre/lustre_idl.h   | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h b/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h
index b0c4433..d20e199 100644
--- a/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h
+++ b/drivers/staging/lustre/lustre/include/lustre/lustre_idl.h
@@ -2370,23 +2370,6 @@ void lustre_swab_mdt_rec_setattr(struct mdt_rec_setattr *sa);
 					      */
 #define MDS_OPEN_RELEASE   02000000000000ULL /* Open the file for HSM release */
 
-/* permission for create non-directory file */
-#define MAY_CREATE      (1 << 7)
-/* permission for create directory file */
-#define MAY_LINK	(1 << 8)
-/* permission for delete from the directory */
-#define MAY_UNLINK      (1 << 9)
-/* source's permission for rename */
-#define MAY_RENAME_SRC  (1 << 10)
-/* target's permission for rename */
-#define MAY_RENAME_TAR  (1 << 11)
-/* part (parent's) VTX permission check */
-#define MAY_VTX_PART    (1 << 12)
-/* full VTX permission check */
-#define MAY_VTX_FULL    (1 << 13)
-/* lfs rgetfacl permission check */
-#define MAY_RGETFACL    (1 << 14)
-
 enum mds_op_bias {
 	MDS_CHECK_SPLIT		= 1 << 0,
 	MDS_CROSS_REF		= 1 << 1,
-- 
2.1.0



* [PATCH 16/19] staging/lustre/osc: use global osc_rq_pool to reduce memory usage
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (14 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 15/19] staging/lustre: Remove unused MAY_ constants green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 17/19] staging/lustre/o2iblnd: leak cmid in kiblnd_dev_need_failover green
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Li Xi, Wu Libin, Wang Shilong,
	Oleg Drokin

From: Li Xi <lixi@ddn.com>

The per-osc request pools consume a lot of memory if there are
hundreds of OSCs on one client. This will be a critical problem
if the client doesn't have sufficient memory for both OSCs and
applications.

This patch replaces the per-osc request pools with a global pool,
osc_rq_pool. The total memory usage is 5MB by default and can be
set through a module parameter of OSC:
"options osc osc_reqpool_mem_max=POOL_SIZE", where POOL_SIZE is in
MB. If cl_max_rpcs_in_flight is the same for all OSCs, the
memory usage of the OSC pool can be calculated as:
Min(POOL_SIZE * 1M,
    (cl_max_rpcs_in_flight + 2) * OSC number * OST_MAXREQSIZE)
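
The formula above can be evaluated with a small helper such as the
following sketch. toy_pool_mem_usage() is a hypothetical name, and
max_req_size stands in for OST_MAXREQSIZE, whose value is not stated
in this message and is therefore passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimated memory held by the global pool, in bytes:
 * min(POOL_SIZE * 1M, (max_rpcs_in_flight + 2) * osc_count * max_req_size).
 * max_req_size is an assumption standing in for OST_MAXREQSIZE.
 */
static uint64_t toy_pool_mem_usage(uint64_t pool_size_mb,
				   uint64_t max_rpcs_in_flight,
				   uint64_t osc_count,
				   uint64_t max_req_size)
{
	uint64_t cap = pool_size_mb << 20;
	uint64_t want = (max_rpcs_in_flight + 2) * osc_count * max_req_size;

	/* Same range the patch accepts for osc_reqpool_mem_max. */
	assert(pool_size_mb > 0 && pool_size_mb < (1u << 12));
	return want < cap ? want : cap;
}
```

For example, with an assumed 4096-byte request size and 8 RPCs in
flight, one OSC would account for 10 * 4096 bytes, while 500 OSCs
would hit the default 5MB cap.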

Also, this patch changes the allocation logic of OSC write requests.
Allocation from osc_rq_pool is now tried only after the normal
allocation fails.
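
The new allocation order can be sketched as below. This is a
user-space illustration under assumptions, not the ptlrpc code:
toy_pool, toy_request_alloc() and toy_failing_alloc() are hypothetical,
and the allocator is passed in as a function pointer only so that a
failure can be simulated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size reserve, standing in for osc_rq_pool. */
struct toy_pool {
	void *slot[4];
	int nr_free;
};

/* The normal allocator is injectable so that a failure can be simulated. */
typedef void *(*toy_alloc_fn)(size_t);

/* Try the normal allocator first; fall back to the reserve only on failure. */
static void *toy_request_alloc(struct toy_pool *pool, size_t size,
			       toy_alloc_fn normal_alloc)
{
	void *req;

	assert(normal_alloc != NULL);
	req = normal_alloc(size);
	if (req == NULL && pool != NULL && pool->nr_free > 0)
		req = pool->slot[--pool->nr_free];
	return req;
}

/* Allocator that always fails, to exercise the fallback path. */
static void *toy_failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

Because the reserve is touched only when the normal allocator fails,
it can stay small without affecting the common path.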

Signed-off-by: Wu Libin <lwu@ddn.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Li Xi <lixi@ddn.com>
Reviewed-on: http://review.whamcloud.com/15422
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6770
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 .../staging/lustre/lustre/include/lustre_import.h  |  2 -
 drivers/staging/lustre/lustre/include/lustre_net.h |  7 +-
 drivers/staging/lustre/lustre/include/obd_class.h  |  4 --
 drivers/staging/lustre/lustre/obdclass/genops.c    |  1 -
 drivers/staging/lustre/lustre/osc/lproc_osc.c      | 17 ++++-
 drivers/staging/lustre/lustre/osc/osc_internal.h   |  4 ++
 drivers/staging/lustre/lustre/osc/osc_request.c    | 79 ++++++++++++++++++----
 drivers/staging/lustre/lustre/ptlrpc/client.c      | 21 +++---
 8 files changed, 94 insertions(+), 41 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_import.h b/drivers/staging/lustre/lustre/include/lustre_import.h
index 5a38f3d..c8b89a3 100644
--- a/drivers/staging/lustre/lustre/include/lustre_import.h
+++ b/drivers/staging/lustre/lustre/include/lustre_import.h
@@ -306,8 +306,6 @@ struct obd_import {
 	__u32		     imp_msg_magic;
 	__u32		     imp_msghdr_flags;       /* adjusted based on server capability */
 
-	struct ptlrpc_request_pool *imp_rq_pool;	  /* emergency request pool */
-
 	struct imp_at	     imp_at;		 /* adaptive timeout data */
 	time_t		    imp_last_reply_time;    /* for health check */
 };
diff --git a/drivers/staging/lustre/lustre/include/lustre_net.h b/drivers/staging/lustre/lustre/include/lustre_net.h
index 5df493e..313a56c 100644
--- a/drivers/staging/lustre/lustre/include/lustre_net.h
+++ b/drivers/staging/lustre/lustre/include/lustre_net.h
@@ -500,7 +500,7 @@ struct ptlrpc_request_pool {
 	/** Maximum message size that would fit into a request from this pool */
 	int prp_rq_size;
 	/** Function to allocate more requests for this pool */
-	void (*prp_populate)(struct ptlrpc_request_pool *, int);
+	int (*prp_populate)(struct ptlrpc_request_pool *, int);
 };
 
 struct lu_context;
@@ -2381,11 +2381,11 @@ void ptlrpc_set_add_new_req(struct ptlrpcd_ctl *pc,
 			    struct ptlrpc_request *req);
 
 void ptlrpc_free_rq_pool(struct ptlrpc_request_pool *pool);
-void ptlrpc_add_rqs_to_pool(struct ptlrpc_request_pool *pool, int num_rq);
+int ptlrpc_add_rqs_to_pool(struct ptlrpc_request_pool *pool, int num_rq);
 
 struct ptlrpc_request_pool *
 ptlrpc_init_rq_pool(int, int,
-		    void (*populate_pool)(struct ptlrpc_request_pool *, int));
+		    int (*populate_pool)(struct ptlrpc_request_pool *, int));
 
 void ptlrpc_at_set_req_timeout(struct ptlrpc_request *req);
 struct ptlrpc_request *ptlrpc_request_alloc(struct obd_import *imp,
@@ -2957,7 +2957,6 @@ void ptlrpc_lprocfs_brw(struct ptlrpc_request *req, int bytes);
 
 /* ptlrpc/llog_client.c */
 extern struct llog_operations llog_client_ops;
-
 /** @} net */
 
 #endif
diff --git a/drivers/staging/lustre/lustre/include/obd_class.h b/drivers/staging/lustre/lustre/include/obd_class.h
index 87bb2ce..ce6fa55 100644
--- a/drivers/staging/lustre/lustre/include/obd_class.h
+++ b/drivers/staging/lustre/lustre/include/obd_class.h
@@ -626,10 +626,6 @@ static inline void obd_cleanup_client_import(struct obd_device *obd)
 		CDEBUG(D_CONFIG, "%s: client import never connected\n",
 		       obd->obd_name);
 		ptlrpc_invalidate_import(imp);
-		if (imp->imp_rq_pool) {
-			ptlrpc_free_rq_pool(imp->imp_rq_pool);
-			imp->imp_rq_pool = NULL;
-		}
 		client_destroy_import(imp);
 		obd->u.cli.cl_import = NULL;
 	}
diff --git a/drivers/staging/lustre/lustre/obdclass/genops.c b/drivers/staging/lustre/lustre/obdclass/genops.c
index 370d5b4..594955d 100644
--- a/drivers/staging/lustre/lustre/obdclass/genops.c
+++ b/drivers/staging/lustre/lustre/obdclass/genops.c
@@ -1663,7 +1663,6 @@ static void obd_zombie_export_add(struct obd_export *exp)
 static void obd_zombie_import_add(struct obd_import *imp)
 {
 	LASSERT(!imp->imp_sec);
-	LASSERT(!imp->imp_rq_pool);
 	spin_lock(&obd_zombie_impexp_lock);
 	LASSERT(list_empty(&imp->imp_zombie_chain));
 	zombies_count++;
diff --git a/drivers/staging/lustre/lustre/osc/lproc_osc.c b/drivers/staging/lustre/lustre/osc/lproc_osc.c
index ff6d2e2..c504d15 100644
--- a/drivers/staging/lustre/lustre/osc/lproc_osc.c
+++ b/drivers/staging/lustre/lustre/osc/lproc_osc.c
@@ -96,9 +96,9 @@ static ssize_t max_rpcs_in_flight_store(struct kobject *kobj,
 	struct obd_device *dev = container_of(kobj, struct obd_device,
 					      obd_kobj);
 	struct client_obd *cli = &dev->u.cli;
-	struct ptlrpc_request_pool *pool = cli->cl_import->imp_rq_pool;
 	int rc;
 	unsigned long val;
+	int adding, added, req_count;
 
 	rc = kstrtoul(buffer, 10, &val);
 	if (rc)
@@ -107,8 +107,19 @@ static ssize_t max_rpcs_in_flight_store(struct kobject *kobj,
 	if (val < 1 || val > OSC_MAX_RIF_MAX)
 		return -ERANGE;
 
-	if (pool && val > cli->cl_max_rpcs_in_flight)
-		pool->prp_populate(pool, val-cli->cl_max_rpcs_in_flight);
+	adding = val - cli->cl_max_rpcs_in_flight;
+	req_count = atomic_read(&osc_pool_req_count);
+	if (adding > 0 && req_count < osc_reqpool_maxreqcount) {
+		/*
+		 * There might be some race which will cause over-limit
+		 * allocation, but it is fine.
+		 */
+		if (req_count + adding > osc_reqpool_maxreqcount)
+			adding = osc_reqpool_maxreqcount - req_count;
+
+		added = osc_rq_pool->prp_populate(osc_rq_pool, adding);
+		atomic_add(added, &osc_pool_req_count);
+	}
 
 	client_obd_list_lock(&cli->cl_loi_list_lock);
 	cli->cl_max_rpcs_in_flight = val;
diff --git a/drivers/staging/lustre/lustre/osc/osc_internal.h b/drivers/staging/lustre/lustre/osc/osc_internal.h
index 470698b..7d0a3e2 100644
--- a/drivers/staging/lustre/lustre/osc/osc_internal.h
+++ b/drivers/staging/lustre/lustre/osc/osc_internal.h
@@ -39,6 +39,10 @@
 
 #define OAP_MAGIC 8675309
 
+extern atomic_t osc_pool_req_count;
+extern unsigned int osc_reqpool_maxreqcount;
+extern struct ptlrpc_request_pool *osc_rq_pool;
+
 struct lu_env;
 
 enum async_flags {
diff --git a/drivers/staging/lustre/lustre/osc/osc_request.c b/drivers/staging/lustre/lustre/osc/osc_request.c
index 114c550..f41f762 100644
--- a/drivers/staging/lustre/lustre/osc/osc_request.c
+++ b/drivers/staging/lustre/lustre/osc/osc_request.c
@@ -50,9 +50,18 @@
 #include "../include/lustre_param.h"
 #include "../include/lustre_fid.h"
 #include "../include/obd_class.h"
+#include "../include/obd.h"
 #include "osc_internal.h"
 #include "osc_cl_internal.h"
 
+atomic_t osc_pool_req_count;
+unsigned int osc_reqpool_maxreqcount;
+struct ptlrpc_request_pool *osc_rq_pool;
+
+/* max memory used for request pool, unit is MB */
+static unsigned int osc_reqpool_mem_max = 5;
+module_param(osc_reqpool_mem_max, uint, 0444);
+
 struct osc_brw_async_args {
 	struct obdo       *aa_oa;
 	int		aa_requested_nob;
@@ -1268,7 +1277,7 @@ static int osc_brw_prep_request(int cmd, struct client_obd *cli,
 	if ((cmd & OBD_BRW_WRITE) != 0) {
 		opc = OST_WRITE;
 		req = ptlrpc_request_alloc_pool(cli->cl_import,
-						cli->cl_import->imp_rq_pool,
+						osc_rq_pool,
 						&RQF_OST_BRW_WRITE);
 	} else {
 		opc = OST_READ;
@@ -3163,6 +3172,9 @@ int osc_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 	struct client_obd *cli = &obd->u.cli;
 	void *handler;
 	int rc;
+	int adding;
+	int added;
+	int req_count;
 
 	rc = ptlrpcd_addref();
 	if (rc)
@@ -3191,15 +3203,20 @@ int osc_setup(struct obd_device *obd, struct lustre_cfg *lcfg)
 		ptlrpc_lprocfs_register_obd(obd);
 	}
 
-	/* We need to allocate a few requests more, because
-	 * brw_interpret tries to create new requests before freeing
-	 * previous ones, Ideally we want to have 2x max_rpcs_in_flight
-	 * reserved, but I'm afraid that might be too much wasted RAM
-	 * in fact, so 2 is just my guess and still should work. */
-	cli->cl_import->imp_rq_pool =
-		ptlrpc_init_rq_pool(cli->cl_max_rpcs_in_flight + 2,
-				    OST_MAXREQSIZE,
-				    ptlrpc_add_rqs_to_pool);
+	/*
+	 * We try to control the total number of requests with an upper limit
+	 * osc_reqpool_maxreqcount. There might be some race which will cause
+	 * over-limit allocation, but it is fine.
+	 */
+	req_count = atomic_read(&osc_pool_req_count);
+	if (req_count < osc_reqpool_maxreqcount) {
+		adding = cli->cl_max_rpcs_in_flight + 2;
+		if (req_count + adding > osc_reqpool_maxreqcount)
+			adding = osc_reqpool_maxreqcount - req_count;
+
+		added = ptlrpc_add_rqs_to_pool(osc_rq_pool, adding);
+		atomic_add(added, &osc_pool_req_count);
+	}
 
 	INIT_LIST_HEAD(&cli->cl_grant_shrink_list);
 	ns_register_cancel(obd->obd_namespace, osc_cancel_for_recovery);
@@ -3339,6 +3356,8 @@ extern struct lock_class_key osc_ast_guard_class;
 static int __init osc_init(void)
 {
 	struct lprocfs_static_vars lvars = { NULL };
+	unsigned int reqpool_size;
+	unsigned int reqsize;
 	int rc;
 
 	/* print an address of _any_ initialized kernel symbol from this
@@ -3354,14 +3373,45 @@ static int __init osc_init(void)
 
 	rc = class_register_type(&osc_obd_ops, NULL,
 				 LUSTRE_OSC_NAME, &osc_device_type);
-	if (rc) {
-		lu_kmem_fini(osc_caches);
-		return rc;
-	}
+	if (rc)
+		goto out_kmem;
 
 	spin_lock_init(&osc_ast_guard);
 	lockdep_set_class(&osc_ast_guard, &osc_ast_guard_class);
 
+	/* This is obviously too much memory, only prevent overflow here */
+	if (osc_reqpool_mem_max >= 1 << 12 || osc_reqpool_mem_max == 0) {
+		rc = -EINVAL;
+		goto out_type;
+	}
+
+	reqpool_size = osc_reqpool_mem_max << 20;
+
+	reqsize = 1;
+	while (reqsize < OST_MAXREQSIZE)
+		reqsize = reqsize << 1;
+
+	/*
+	 * We don't enlarge the request count in OSC pool according to
+	 * cl_max_rpcs_in_flight. The allocation from the pool will only be
+	 * tried after normal allocation failed. So a small OSC pool won't
+	 * cause much performance degradation in most cases.
+	 */
+	osc_reqpool_maxreqcount = reqpool_size / reqsize;
+
+	atomic_set(&osc_pool_req_count, 0);
+	osc_rq_pool = ptlrpc_init_rq_pool(0, OST_MAXREQSIZE,
+					  ptlrpc_add_rqs_to_pool);
+
+	if (osc_rq_pool)
+		return 0;
+
+	rc = -ENOMEM;
+
+out_type:
+	class_unregister_type(LUSTRE_OSC_NAME);
+out_kmem:
+	lu_kmem_fini(osc_caches);
 	return rc;
 }
 
@@ -3369,6 +3419,7 @@ static void /*__exit*/ osc_exit(void)
 {
 	class_unregister_type(LUSTRE_OSC_NAME);
 	lu_kmem_fini(osc_caches);
+	ptlrpc_free_rq_pool(osc_rq_pool);
 }
 
 MODULE_AUTHOR("Sun Microsystems, Inc. <http://www.lustre.org/>");
diff --git a/drivers/staging/lustre/lustre/ptlrpc/client.c b/drivers/staging/lustre/lustre/ptlrpc/client.c
index 1800db1..90b24fc 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/client.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/client.c
@@ -446,7 +446,7 @@ EXPORT_SYMBOL(ptlrpc_free_rq_pool);
 /**
  * Allocates, initializes and adds \a num_rq requests to the pool \a pool
  */
-void ptlrpc_add_rqs_to_pool(struct ptlrpc_request_pool *pool, int num_rq)
+int ptlrpc_add_rqs_to_pool(struct ptlrpc_request_pool *pool, int num_rq)
 {
 	int i;
 	int size = 1;
@@ -468,11 +468,11 @@ void ptlrpc_add_rqs_to_pool(struct ptlrpc_request_pool *pool, int num_rq)
 		spin_unlock(&pool->prp_lock);
 		req = ptlrpc_request_cache_alloc(GFP_NOFS);
 		if (!req)
-			return;
+			return i;
 		msg = libcfs_kvzalloc(size, GFP_NOFS);
 		if (!msg) {
 			ptlrpc_request_cache_free(req);
-			return;
+			return i;
 		}
 		req->rq_reqbuf = msg;
 		req->rq_reqbuf_len = size;
@@ -481,6 +481,7 @@ void ptlrpc_add_rqs_to_pool(struct ptlrpc_request_pool *pool, int num_rq)
 		list_add_tail(&req->rq_list, &pool->prp_req_list);
 	}
 	spin_unlock(&pool->prp_lock);
+	return num_rq;
 }
 EXPORT_SYMBOL(ptlrpc_add_rqs_to_pool);
 
@@ -494,7 +495,7 @@ EXPORT_SYMBOL(ptlrpc_add_rqs_to_pool);
  */
 struct ptlrpc_request_pool *
 ptlrpc_init_rq_pool(int num_rq, int msgsize,
-		    void (*populate_pool)(struct ptlrpc_request_pool *, int))
+		    int (*populate_pool)(struct ptlrpc_request_pool *, int))
 {
 	struct ptlrpc_request_pool *pool;
 
@@ -512,11 +513,6 @@ ptlrpc_init_rq_pool(int num_rq, int msgsize,
 
 	populate_pool(pool, num_rq);
 
-	if (list_empty(&pool->prp_req_list)) {
-		/* have not allocated a single request for the pool */
-		kfree(pool);
-		pool = NULL;
-	}
 	return pool;
 }
 EXPORT_SYMBOL(ptlrpc_init_rq_pool);
@@ -702,11 +698,10 @@ struct ptlrpc_request *__ptlrpc_request_alloc(struct obd_import *imp,
 {
 	struct ptlrpc_request *request = NULL;
 
-	if (pool)
-		request = ptlrpc_prep_req_from_pool(pool);
+	request = ptlrpc_request_cache_alloc(GFP_NOFS);
 
-	if (!request)
-		request = ptlrpc_request_cache_alloc(GFP_NOFS);
+	if (!request && pool)
+		request = ptlrpc_prep_req_from_pool(pool);
 
 	if (request) {
 		LASSERTF((unsigned long)imp > 0x1000, "%p", imp);
-- 
2.1.0



* [PATCH 17/19] staging/lustre/o2iblnd: leak cmid in kiblnd_dev_need_failover
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (15 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 16/19] staging/lustre/osc: use global osc_rq_pool to reduce memory usage green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 18/19] staging/lustre/libcfs: remove unused cfs_timer_done green
  2015-09-14 22:41 ` [PATCH 19/19] staging/lustre/ptlrpc: make ptlrpcd threads cpt-aware green
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Liang Zhen, Oleg Drokin

From: Liang Zhen <liang.zhen@intel.com>

The cmid created by kiblnd_dev_need_failover should always be
destroyed. The current implementation does not do so: the cmid is
leaked whenever this function detects that a device failover is needed.
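
The shape of the fix can be illustrated with a small user-space
sketch. It is not the o2iblnd code: struct probe_id, probe_create(),
probe_destroy() and need_failover() are hypothetical stand-ins for
the temporary cmid and the RDMA CM calls. It only shows the pattern
of computing the answer first and then releasing the temporary
object on every path.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the temporary cmid; not the RDMA CM API. */
struct probe_id {
	int device;
};

static int live_ids;	/* number of probe ids created but not destroyed */

static struct probe_id *probe_create(int device)
{
	struct probe_id *id = malloc(sizeof(*id));

	if (id) {
		id->device = device;
		live_ids++;
	}
	return id;
}

static void probe_destroy(struct probe_id *id)
{
	assert(id != NULL);
	live_ids--;
	free(id);
}

/*
 * Returns 1 if failover is needed, 0 if not, negative on error.
 * The temporary id is released on every path after its creation.
 */
static int need_failover(int current_device, int resolved_device)
{
	struct probe_id *id = probe_create(resolved_device);
	int rc;

	if (!id)
		return -1;

	rc = current_device != id->device;	/* compute the answer first */
	probe_destroy(id);			/* then always release */
	return rc;
}
```

With a single exit after the comparison there is no return path left
on which the temporary object can be forgotten.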

Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Reviewed-on: http://review.whamcloud.com/14603
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6480
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index c29d2ce..faa70f0 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -2228,13 +2228,10 @@ static int kiblnd_dev_need_failover(kib_dev_t *dev)
 		return rc;
 	}
 
-	if (dev->ibd_hdev->ibh_ibdev == cmid->device) {
-		/* don't need device failover */
-		rdma_destroy_id(cmid);
-		return 0;
-	}
+	rc = dev->ibd_hdev->ibh_ibdev != cmid->device; /* true for failover */
+	rdma_destroy_id(cmid);
 
-	return 1;
+	return rc;
 }
 
 int kiblnd_dev_failover(kib_dev_t *dev)
-- 
2.1.0



* [PATCH 18/19] staging/lustre/libcfs: remove unused cfs_timer_done
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (16 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 17/19] staging/lustre/o2iblnd: leak cmid in kiblnd_dev_need_failover green
@ 2015-09-14 22:41 ` green
  2015-09-14 22:41 ` [PATCH 19/19] staging/lustre/ptlrpc: make ptlrpcd threads cpt-aware green
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, James Simmons, frank zago, Oleg Drokin

From: James Simmons <uja.ornl@yahoo.com>

Remove the cfs_timer_done function in the libcfs
kernel module since it is not used anywhere.

Signed-off-by: James Simmons <uja.ornl@yahoo.com>
Signed-off-by: frank zago <fzago@cray.com>
Reviewed-on: http://review.whamcloud.com/13917
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6245
Reviewed-by: John L. Hammond <john.hammond@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/include/linux/libcfs/libcfs_prim.h | 1 -
 drivers/staging/lustre/lustre/libcfs/linux/linux-prim.c   | 6 ------
 2 files changed, 7 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_prim.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_prim.h
index 978d3e2..62ade08 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_prim.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_prim.h
@@ -49,7 +49,6 @@ void add_wait_queue_exclusive_head(wait_queue_head_t *, wait_queue_t *);
 
 void cfs_init_timer(struct timer_list *t);
 void cfs_timer_init(struct timer_list *t, cfs_timer_func_t *func, void *arg);
-void cfs_timer_done(struct timer_list *t);
 void cfs_timer_arm(struct timer_list *t, unsigned long deadline);
 void cfs_timer_disarm(struct timer_list *t);
 int  cfs_timer_is_armed(struct timer_list *t);
diff --git a/drivers/staging/lustre/lustre/libcfs/linux/linux-prim.c b/drivers/staging/lustre/lustre/libcfs/linux/linux-prim.c
index 838f5f3..12b6af4 100644
--- a/drivers/staging/lustre/lustre/libcfs/linux/linux-prim.c
+++ b/drivers/staging/lustre/lustre/libcfs/linux/linux-prim.c
@@ -84,12 +84,6 @@ void cfs_timer_init(struct timer_list *t, cfs_timer_func_t *func, void *arg)
 }
 EXPORT_SYMBOL(cfs_timer_init);
 
-void cfs_timer_done(struct timer_list *t)
-{
-	return;
-}
-EXPORT_SYMBOL(cfs_timer_done);
-
 void cfs_timer_arm(struct timer_list *t, unsigned long deadline)
 {
 	mod_timer(t, deadline);
-- 
2.1.0



* [PATCH 19/19] staging/lustre/ptlrpc: make ptlrpcd threads cpt-aware
  2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
                   ` (17 preceding siblings ...)
  2015-09-14 22:41 ` [PATCH 18/19] staging/lustre/libcfs: remove unused cfs_timer_done green
@ 2015-09-14 22:41 ` green
  18 siblings, 0 replies; 22+ messages in thread
From: green @ 2015-09-14 22:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Linux Kernel Mailing List, Olaf Weber, Oleg Drokin

From: Olaf Weber <olaf@sgi.com>

On NUMA systems, the placement of worker threads relative to the
memory they use greatly affects performance. The CPT mechanism can be
used to constrain a number of Lustre thread types, and this change
makes it possible to configure the placement of ptlrpcd threads in a
similar manner.

To simplify the code changes, the global structures used to manage
ptlrpcd threads are changed to one per CPT. In particular this means
there will be one ptlrpcd recovery thread per CPT.

To prevent ptlrpcd threads from wandering all over the system, all
ptlrpcd threads are bound to a CPT. Note that some CPT configuration
is always created, but the defaults are not likely to be correct for
a NUMA system. After discussing the options with Liang Zhen we
decided not to bind ptlrpcd threads to specific CPUs, and instead
to trust the kernel scheduler to migrate ptlrpcd threads.

With all ptlrpcd threads bound to a CPT, but not to specific CPUs,
the load policy mechanism can be radically simplified:

- PDL_POLICY_LOCAL and PDL_POLICY_ROUND are currently identical.
- PDL_POLICY_ROUND, if fully implemented, would cost us the locality
  we are trying to achieve, so most or all calls using this policy
  would have to be changed to PDL_POLICY_LOCAL.
- PDL_POLICY_PREFERRED is not used, and cannot be implemented without
  binding ptlrpcd threads to individual CPUs.
- PDL_POLICY_SAME is rarely used, and cannot be implemented without
  binding ptlrpcd threads to individual CPUs.

The partner mechanism is also updated, because now all ptlrpcd
threads are "bound" threads. The only difference between the various
bind policies, PDB_POLICY_NONE, PDB_POLICY_FULL, PDB_POLICY_PAIR, and
PDB_POLICY_NEIGHBOR, is the number of partner threads. The bind
policy is replaced with a tunable that directly specifies the size of
the groups of ptlrpcd partner threads.

Ensure that the ptlrpc_request_set for a ptlrpcd thread is created on
the same CPT that the thread will work on. When threads are bound to
specific nodes and/or CPUs in a NUMA system, it pays to ensure that
the data structures used by these threads are also on the same node.

Visible changes:

* ptlrpcd thread names include the CPT number, for example
  "ptlrpcd_02_07". In this case the "07" is relative to the CPT, and
  not a CPU number.
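
As an illustration of the naming scheme, the following user-space
sketch builds such a name. The format string is an assumption
inferred from the "ptlrpcd_02_07" example above, and
ptlrpcd_name_sketch() is a hypothetical helper, not a function from
the patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a thread name from a CPT number and a per-CPT thread index.
 * Returns the length snprintf() reports for the full name.
 */
static int ptlrpcd_name_sketch(char *buf, size_t len, int cpt, int index)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "ptlrpcd_%02d_%02d", cpt, index);
}
```

Calling ptlrpcd_name_sketch(buf, sizeof(buf), 2, 7) on a 32-byte
buffer yields "ptlrpcd_02_07": the first number is the CPT and the
second is the index of the thread within that CPT.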

Tunables added:

* ptlrpcd_cpts (string): A CPT string describing the CPU partitions
  that ptlrpcd threads should run on. Used to make ptlrpcd threads
  run on a subset of all CPTs.

* ptlrpcd_per_cpt_max (int): The maximum number of ptlrpcd threads
  to run in a CPT.

* ptlrpcd_partner_group_size (int): The desired number of threads
  in each ptlrpcd partner thread group. Default is 2, corresponding
  to the old PDB_POLICY_PAIR. A negative value makes all ptlrpcd
  threads in a CPT partners of each other.
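
One way to read the partner group tunable is sketched below. This is
a simplified model under stated assumptions, not the code from the
patch: effective_group_size() and group_first() are hypothetical
helpers, an unset value of 0 is assumed to select the default of 2,
and the number of threads in the CPT is assumed to be a multiple of
the group size.

```c
#include <assert.h>

/*
 * Resolve the requested group size against the number of threads
 * running in one CPT.
 */
static int effective_group_size(int nthreads, int requested)
{
	assert(nthreads > 0);
	if (requested == 0)
		requested = 2;		/* assumed default: pairs */
	if (requested < 0 || requested > nthreads)
		requested = nthreads;	/* all threads are partners */
	return requested;
}

/* Index of the first thread in the group that thread 'index' is in. */
static int group_first(int index, int groupsize)
{
	assert(index >= 0 && groupsize > 0);
	return index - index % groupsize;
}
```

In this model, with 8 threads in a CPT and a group size of 4,
threads 0-3 form one group and threads 4-7 the other; a negative
value collapses the CPT into a single group of 8.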

Tunables obsoleted:

* max_ptlrpcds: The new ptlrpcd_per_cpt_max can be used to obtain the
  same effect.

* ptlrpcd_bind_policy: The new ptlrpcd_partner_group_size can be used
  to obtain the same effect.

Internal interface changes:

* pdb_policy_t and related code have been removed. Groups of partner
  ptlrpcd threads are still created, and all threads in a partner
  group are bound to the same CPT. The ptlrpcd threads bound to a
  CPT are typically divided into several partner groups. The partner
  groups on a CPT all have an equal number of ptlrpcd threads.

* pdl_policy_t and related code have been removed. Since ptlrpcd
  threads are not bound to a specific CPU, all the code that avoids
  scheduling on the current CPU (or attempts to do so) has been
  removed as non-functional. A simplified form of PDL_POLICY_LOCAL
  is kept as the only load policy.

* LIOD_BIND and related code have been removed. All ptlrpcd threads
  are now bound to a CPT, and no additional binding policy is
  implemented.

* ptlrpc_prep_set(): Changed to allocate a ptlrpc_request_set
  on the current CPT.

* ptlrpcd(): If an error is encountered before entering the main loop,
  store the error in pc_error before exiting.

* ptlrpcd_start(): Check pc_error to verify that the ptlrpcd thread
  has successfully entered its main loop.

* ptlrpcd_init(): Initialize the struct ptlrpcd_ctl for all threads
  for a CPT before starting any of them. This closes a race during
  startup where a partner thread could reference a non-initialized
  struct ptlrpcd_ctl.
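
The startup ordering and the pc_error check described in the last
three items can be modelled in user space as follows. This is only a
sketch: struct ctl_sketch, thread_startup() and init_then_start()
are hypothetical names, and threads are modelled as plain function
calls rather than real kernel threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct ptlrpcd_ctl. */
struct ctl_sketch {
	int index;
	int initialized;
	int error;	/* plays the role of pc_error */
};

/*
 * Model of a thread starting up: it looks at its partner and records
 * an error, instead of entering its main loop, if the partner has
 * not been initialized yet.
 */
static void thread_startup(struct ctl_sketch *pc,
			   const struct ctl_sketch *partner)
{
	pc->error = partner->initialized ? 0 : -1;
}

/* Phase 1 initializes every entry; phase 2 starts them one by one. */
static int init_then_start(struct ctl_sketch *pcs, size_t n)
{
	size_t i;

	assert(pcs != NULL && n > 0);
	for (i = 0; i < n; i++) {
		pcs[i].index = (int)i;
		pcs[i].initialized = 1;
		pcs[i].error = 0;
	}
	for (i = 0; i < n; i++) {
		/* the partner is the next entry, wrapping around */
		thread_startup(&pcs[i], &pcs[(i + 1) % n]);
		if (pcs[i].error)	/* starter checks the recorded error */
			return pcs[i].error;
	}
	return 0;
}
```

Because every entry is initialized before the first start, no
startup in this model can observe an uninitialized partner, which is
the property the reordering in ptlrpcd_init() is meant to provide.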

Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-on: http://review.whamcloud.com/13972
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6325
Reviewed-by: Grégoire Pichon <gregoire.pichon@bull.net>
Reviewed-by: Stephen Champion <schamp@sgi.com>
Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
---
 drivers/staging/lustre/lustre/include/lustre_net.h |  54 +-
 drivers/staging/lustre/lustre/ldlm/ldlm_request.c  |   8 +-
 drivers/staging/lustre/lustre/mdc/mdc_locks.c      |   2 +-
 drivers/staging/lustre/lustre/mdc/mdc_request.c    |   2 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c      |  28 +-
 .../staging/lustre/lustre/osc/osc_cl_internal.h    |   2 +-
 drivers/staging/lustre/lustre/osc/osc_internal.h   |   2 +-
 drivers/staging/lustre/lustre/osc/osc_request.c    |  41 +-
 drivers/staging/lustre/lustre/ptlrpc/client.c      |  11 +-
 drivers/staging/lustre/lustre/ptlrpc/import.c      |   7 +-
 drivers/staging/lustre/lustre/ptlrpc/pinger.c      |   2 +-
 .../staging/lustre/lustre/ptlrpc/ptlrpc_internal.h |   2 +-
 drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c     | 702 +++++++++++++--------
 13 files changed, 486 insertions(+), 377 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/lustre_net.h b/drivers/staging/lustre/lustre/include/lustre_net.h
index 313a56c..3b6a2d7 100644
--- a/drivers/staging/lustre/lustre/include/lustre_net.h
+++ b/drivers/staging/lustre/lustre/include/lustre_net.h
@@ -2191,21 +2191,29 @@ struct ptlrpcd_ctl {
 	 */
 	struct lu_env	       pc_env;
 	/**
-	 * Index of ptlrpcd thread in the array.
+	 * CPT the thread is bound on.
 	 */
-	int			 pc_index;
+	int				pc_cpt;
 	/**
-	 * Number of the ptlrpcd's partners.
+	 * Index of ptlrpcd thread in the array.
 	 */
-	int			 pc_npartners;
+	int				pc_index;
 	/**
 	 * Pointer to the array of partners' ptlrpcd_ctl structure.
 	 */
 	struct ptlrpcd_ctl	**pc_partners;
 	/**
+	 * Number of the ptlrpcd's partners.
+	 */
+	int				pc_npartners;
+	/**
 	 * Record the partner index to be processed next.
 	 */
 	int			 pc_cursor;
+	/**
+	 * Error code if the thread failed to fully start.
+	 */
+	int				pc_error;
 };
 
 /* Bits for pc_flags */
@@ -2228,10 +2236,6 @@ enum ptlrpcd_ctl_flags {
 	 * This is a recovery ptlrpc thread.
 	 */
 	LIOD_RECOVERY    = 1 << 3,
-	/**
-	 * The ptlrpcd is bound to some CPU core.
-	 */
-	LIOD_BIND	= 1 << 4,
 };
 
 /**
@@ -2903,43 +2907,11 @@ void ptlrpc_pinger_ir_down(void);
 /** @} */
 int ptlrpc_pinger_suppress_pings(void);
 
-/* ptlrpc daemon bind policy */
-typedef enum {
-	/* all ptlrpcd threads are free mode */
-	PDB_POLICY_NONE	  = 1,
-	/* all ptlrpcd threads are bound mode */
-	PDB_POLICY_FULL	  = 2,
-	/* <free1 bound1> <free2 bound2> ... <freeN boundN> */
-	PDB_POLICY_PAIR	  = 3,
-	/* <free1 bound1> <bound1 free2> ... <freeN boundN> <boundN free1>,
-	 * means each ptlrpcd[X] has two partners: thread[X-1] and thread[X+1].
-	 * If kernel supports NUMA, pthrpcd threads are binded and
-	 * grouped by NUMA node */
-	PDB_POLICY_NEIGHBOR      = 4,
-} pdb_policy_t;
-
-/* ptlrpc daemon load policy
- * It is caller's duty to specify how to push the async RPC into some ptlrpcd
- * queue, but it is not enforced, affected by "ptlrpcd_bind_policy". If it is
- * "PDB_POLICY_FULL", then the RPC will be processed by the selected ptlrpcd,
- * Otherwise, the RPC may be processed by the selected ptlrpcd or its partner,
- * depends on which is scheduled firstly, to accelerate the RPC processing. */
-typedef enum {
-	/* on the same CPU core as the caller */
-	PDL_POLICY_SAME	 = 1,
-	/* within the same CPU partition, but not the same core as the caller */
-	PDL_POLICY_LOCAL	= 2,
-	/* round-robin on all CPU cores, but not the same core as the caller */
-	PDL_POLICY_ROUND	= 3,
-	/* the specified CPU core is preferred, but not enforced */
-	PDL_POLICY_PREFERRED    = 4,
-} pdl_policy_t;
-
 /* ptlrpc/ptlrpcd.c */
 void ptlrpcd_stop(struct ptlrpcd_ctl *pc, int force);
 void ptlrpcd_free(struct ptlrpcd_ctl *pc);
 void ptlrpcd_wake(struct ptlrpc_request *req);
-void ptlrpcd_add_req(struct ptlrpc_request *req, pdl_policy_t policy, int idx);
+void ptlrpcd_add_req(struct ptlrpc_request *req);
 void ptlrpcd_add_rqset(struct ptlrpc_request_set *set);
 int ptlrpcd_addref(void);
 void ptlrpcd_decref(void);
diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
index 6245a2c..b5ee9bd 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
@@ -1212,12 +1212,12 @@ int ldlm_cli_cancel_req(struct obd_export *exp, struct list_head *cancels,
 
 		ptlrpc_request_set_replen(req);
 		if (flags & LCF_ASYNC) {
-			ptlrpcd_add_req(req, PDL_POLICY_LOCAL, -1);
+			ptlrpcd_add_req(req);
 			sent = count;
 			goto out;
-		} else {
-			rc = ptlrpc_queue_wait(req);
 		}
+
+		rc = ptlrpc_queue_wait(req);
 		if (rc == LUSTRE_ESTALE) {
 			CDEBUG(D_DLMTRACE, "client/server (nid %s) out of sync -- not fatal\n",
 			       libcfs_nid2str(req->rq_import->
@@ -2223,7 +2223,7 @@ static int replay_one_lock(struct obd_import *imp, struct ldlm_lock *lock)
 	aa = ptlrpc_req_async_args(req);
 	aa->lock_handle = body->lock_handle[0];
 	req->rq_interpret_reply = (ptlrpc_interpterer_t)replay_lock_interpret;
-	ptlrpcd_add_req(req, PDL_POLICY_LOCAL, -1);
+	ptlrpcd_add_req(req);
 
 	return 0;
 }
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_locks.c b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
index e6b3bf9..20da064 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_locks.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_locks.c
@@ -1307,7 +1307,7 @@ int mdc_intent_getattr_async(struct obd_export *exp,
 	ga->ga_einfo = einfo;
 
 	req->rq_interpret_reply = mdc_intent_getattr_async_interpret;
-	ptlrpcd_add_req(req, PDL_POLICY_LOCAL, -1);
+	ptlrpcd_add_req(req);
 
 	return 0;
 }
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 204d512..d32ae761 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -2639,7 +2639,7 @@ static int mdc_renew_capa(struct obd_export *exp, struct obd_capa *oc,
 	ra->ra_oc = oc;
 	ra->ra_cb = cb;
 	req->rq_interpret_reply = mdc_interpret_renew_capa;
-	ptlrpcd_add_req(req, PDL_POLICY_LOCAL, -1);
+	ptlrpcd_add_req(req);
 	return 0;
 }
 
diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
index c72035e..62da061 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cache.c
+++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
@@ -1934,7 +1934,7 @@ static int get_write_extents(struct osc_object *obj, struct list_head *rpclist)
 
 static int
 osc_send_write_rpc(const struct lu_env *env, struct client_obd *cli,
-		   struct osc_object *osc, pdl_policy_t pol)
+		   struct osc_object *osc)
 {
 	LIST_HEAD(rpclist);
 	struct osc_extent *ext;
@@ -1986,7 +1986,7 @@ osc_send_write_rpc(const struct lu_env *env, struct client_obd *cli,
 
 	if (!list_empty(&rpclist)) {
 		LASSERT(page_count > 0);
-		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_WRITE, pol);
+		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_WRITE);
 		LASSERT(list_empty(&rpclist));
 	}
 
@@ -2006,7 +2006,7 @@ osc_send_write_rpc(const struct lu_env *env, struct client_obd *cli,
  */
 static int
 osc_send_read_rpc(const struct lu_env *env, struct client_obd *cli,
-		  struct osc_object *osc, pdl_policy_t pol)
+		  struct osc_object *osc)
 {
 	struct osc_extent *ext;
 	struct osc_extent *next;
@@ -2033,7 +2033,7 @@ osc_send_read_rpc(const struct lu_env *env, struct client_obd *cli,
 		osc_object_unlock(osc);
 
 		LASSERT(page_count > 0);
-		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_READ, pol);
+		rc = osc_build_rpc(env, cli, &rpclist, OBD_BRW_READ);
 		LASSERT(list_empty(&rpclist));
 
 		osc_object_lock(osc);
@@ -2079,8 +2079,7 @@ static struct osc_object *osc_next_obj(struct client_obd *cli)
 }
 
 /* called with the loi list lock held */
-static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
-			   pdl_policy_t pol)
+static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli)
 {
 	struct osc_object *osc;
 	int rc = 0;
@@ -2109,7 +2108,7 @@ static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
 		 * do io on writes while there are cache waiters */
 		osc_object_lock(osc);
 		if (osc_makes_rpc(cli, osc, OBD_BRW_WRITE)) {
-			rc = osc_send_write_rpc(env, cli, osc, pol);
+			rc = osc_send_write_rpc(env, cli, osc);
 			if (rc < 0) {
 				CERROR("Write request failed with %d\n", rc);
 
@@ -2133,7 +2132,7 @@ static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
 			}
 		}
 		if (osc_makes_rpc(cli, osc, OBD_BRW_READ)) {
-			rc = osc_send_read_rpc(env, cli, osc, pol);
+			rc = osc_send_read_rpc(env, cli, osc);
 			if (rc < 0)
 				CERROR("Read request failed with %d\n", rc);
 		}
@@ -2149,7 +2148,7 @@ static void osc_check_rpcs(const struct lu_env *env, struct client_obd *cli,
 }
 
 static int osc_io_unplug0(const struct lu_env *env, struct client_obd *cli,
-			  struct osc_object *osc, pdl_policy_t pol, int async)
+			  struct osc_object *osc, int async)
 {
 	int rc = 0;
 
@@ -2161,7 +2160,7 @@ static int osc_io_unplug0(const struct lu_env *env, struct client_obd *cli,
 		 * potential stack overrun problem. LU-2859 */
 		atomic_inc(&cli->cl_lru_shrinkers);
 		client_obd_list_lock(&cli->cl_loi_list_lock);
-		osc_check_rpcs(env, cli, pol);
+		osc_check_rpcs(env, cli);
 		client_obd_list_unlock(&cli->cl_loi_list_lock);
 		atomic_dec(&cli->cl_lru_shrinkers);
 	} else {
@@ -2175,14 +2174,13 @@ static int osc_io_unplug0(const struct lu_env *env, struct client_obd *cli,
 static int osc_io_unplug_async(const struct lu_env *env,
 			       struct client_obd *cli, struct osc_object *osc)
 {
-	/* XXX: policy is no use actually. */
-	return osc_io_unplug0(env, cli, osc, PDL_POLICY_ROUND, 1);
+	return osc_io_unplug0(env, cli, osc, 1);
 }
 
 void osc_io_unplug(const struct lu_env *env, struct client_obd *cli,
-		   struct osc_object *osc, pdl_policy_t pol)
+		   struct osc_object *osc)
 {
-	(void)osc_io_unplug0(env, cli, osc, pol, 0);
+	(void)osc_io_unplug0(env, cli, osc, 0);
 }
 
 int osc_prep_async_page(struct osc_object *osc, struct osc_page *ops,
@@ -2922,7 +2920,7 @@ int osc_cache_writeback_range(const struct lu_env *env, struct osc_object *obj,
 	}
 
 	if (unplug)
-		osc_io_unplug(env, osc_cli(obj), obj, PDL_POLICY_ROUND);
+		osc_io_unplug(env, osc_cli(obj), obj);
 
 	if (hp || discard) {
 		int rc;
diff --git a/drivers/staging/lustre/lustre/osc/osc_cl_internal.h b/drivers/staging/lustre/lustre/osc/osc_cl_internal.h
index 365b278..75bfda6 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cl_internal.h
+++ b/drivers/staging/lustre/lustre/osc/osc_cl_internal.h
@@ -454,7 +454,7 @@ int osc_cache_writeback_range(const struct lu_env *env, struct osc_object *obj,
 int osc_cache_wait_range(const struct lu_env *env, struct osc_object *obj,
 			 pgoff_t start, pgoff_t end);
 void osc_io_unplug(const struct lu_env *env, struct client_obd *cli,
-		   struct osc_object *osc, pdl_policy_t pol);
+		   struct osc_object *osc);
 
 void osc_object_set_contended  (struct osc_object *obj);
 void osc_object_clear_contended(struct osc_object *obj);
diff --git a/drivers/staging/lustre/lustre/osc/osc_internal.h b/drivers/staging/lustre/lustre/osc/osc_internal.h
index 7d0a3e2..448fdf4 100644
--- a/drivers/staging/lustre/lustre/osc/osc_internal.h
+++ b/drivers/staging/lustre/lustre/osc/osc_internal.h
@@ -132,7 +132,7 @@ int osc_sync_base(struct obd_export *exp, struct obd_info *oinfo,
 
 int osc_process_config_base(struct obd_device *obd, struct lustre_cfg *cfg);
 int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
-		  struct list_head *ext_list, int cmd, pdl_policy_t p);
+		  struct list_head *ext_list, int cmd);
 int osc_lru_shrink(struct client_obd *cli, int target);
 
 extern spinlock_t osc_ast_guard;
diff --git a/drivers/staging/lustre/lustre/osc/osc_request.c b/drivers/staging/lustre/lustre/osc/osc_request.c
index f41f762..9f53627 100644
--- a/drivers/staging/lustre/lustre/osc/osc_request.c
+++ b/drivers/staging/lustre/lustre/osc/osc_request.c
@@ -437,7 +437,7 @@ int osc_setattr_async_base(struct obd_export *exp, struct obd_info *oinfo,
 	/* do mds to ost setattr asynchronously */
 	if (!rqset) {
 		/* Do not wait for response. */
-		ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+		ptlrpcd_add_req(req);
 	} else {
 		req->rq_interpret_reply =
 			(ptlrpc_interpterer_t)osc_setattr_interpret;
@@ -449,7 +449,7 @@ int osc_setattr_async_base(struct obd_export *exp, struct obd_info *oinfo,
 		sa->sa_cookie = cookie;
 
 		if (rqset == PTLRPCD_SET)
-			ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+			ptlrpcd_add_req(req);
 		else
 			ptlrpc_set_add_req(rqset, req);
 	}
@@ -590,7 +590,7 @@ int osc_punch_base(struct obd_export *exp, struct obd_info *oinfo,
 	sa->sa_upcall = upcall;
 	sa->sa_cookie = cookie;
 	if (rqset == PTLRPCD_SET)
-		ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+		ptlrpcd_add_req(req);
 	else
 		ptlrpc_set_add_req(rqset, req);
 
@@ -657,7 +657,7 @@ int osc_sync_base(struct obd_export *exp, struct obd_info *oinfo,
 	fa->fa_cookie = cookie;
 
 	if (rqset == PTLRPCD_SET)
-		ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+		ptlrpcd_add_req(req);
 	else
 		ptlrpc_set_add_req(rqset, req);
 
@@ -826,7 +826,7 @@ static int osc_destroy(const struct lu_env *env, struct obd_export *exp,
 	}
 
 	/* Do not wait for response */
-	ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+	ptlrpcd_add_req(req);
 	return 0;
 }
 
@@ -1718,7 +1718,7 @@ static int osc_brw_redo_request(struct ptlrpc_request *request,
 	 * to add a series of BRW RPCs into a self-defined ptlrpc_request_set
 	 * and wait for all of them to be finished. We should inherit request
 	 * set from old request. */
-	ptlrpcd_add_req(new_req, PDL_POLICY_SAME, -1);
+	ptlrpcd_add_req(new_req);
 
 	DEBUG_REQ(D_INFO, new_req, "new request");
 	return 0;
@@ -1859,7 +1859,7 @@ static int brw_interpret(const struct lu_env *env,
 	osc_wake_cache_waiters(cli);
 	client_obd_list_unlock(&cli->cl_loi_list_lock);
 
-	osc_io_unplug(env, cli, NULL, PDL_POLICY_SAME);
+	osc_io_unplug(env, cli, NULL);
 	return rc;
 }
 
@@ -1869,7 +1869,7 @@ static int brw_interpret(const struct lu_env *env,
  * Extents in the list must be in OES_RPC state.
  */
 int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
-		  struct list_head *ext_list, int cmd, pdl_policy_t pol)
+		  struct list_head *ext_list, int cmd)
 {
 	struct ptlrpc_request *req = NULL;
 	struct osc_extent *ext;
@@ -2043,19 +2043,7 @@ int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
 		  page_count, aa, cli->cl_r_in_flight,
 		  cli->cl_w_in_flight);
 
-	/* XXX: Maybe the caller can check the RPC bulk descriptor to
-	 * see which CPU/NUMA node the majority of pages were allocated
-	 * on, and try to assign the async RPC to the CPU core
-	 * (PDL_POLICY_PREFERRED) to reduce cross-CPU memory traffic.
-	 *
-	 * But on the other hand, we expect that multiple ptlrpcd
-	 * threads and the initial write sponsor can run in parallel,
-	 * especially when data checksum is enabled, which is CPU-bound
-	 * operation and single ptlrpcd thread cannot process in time.
-	 * So more ptlrpcd threads sharing BRW load
-	 * (with PDL_POLICY_ROUND) seems better.
-	 */
-	ptlrpcd_add_req(req, pol, -1);
+	ptlrpcd_add_req(req);
 	rc = 0;
 
 out:
@@ -2382,7 +2370,7 @@ int osc_enqueue_base(struct obd_export *exp, struct ldlm_res_id *res_id,
 			req->rq_interpret_reply =
 				(ptlrpc_interpterer_t)osc_enqueue_interpret;
 			if (rqset == PTLRPCD_SET)
-				ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+				ptlrpcd_add_req(req);
 			else
 				ptlrpc_set_add_req(rqset, req);
 		} else if (intent) {
@@ -2997,8 +2985,9 @@ static int osc_set_info_async(const struct lu_env *env, struct obd_export *exp,
 		LASSERT(set != NULL);
 		ptlrpc_set_add_req(set, req);
 		ptlrpc_check_set(NULL, set);
-	} else
-		ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+	} else {
+		ptlrpcd_add_req(req);
+	}
 
 	return 0;
 }
@@ -3090,7 +3079,7 @@ static int osc_import_event(struct obd_device *obd,
 			cli = &obd->u.cli;
 			/* all pages go to failing rpcs due to the invalid
 			 * import */
-			osc_io_unplug(env, cli, NULL, PDL_POLICY_ROUND);
+			osc_io_unplug(env, cli, NULL);
 
 			ldlm_namespace_cleanup(ns, LDLM_FL_LOCAL_ONLY);
 			cl_env_put(env, &refcheck);
@@ -3162,7 +3151,7 @@ static int brw_queue_work(const struct lu_env *env, void *data)
 
 	CDEBUG(D_CACHE, "Run writeback work for client obd %p.\n", cli);
 
-	osc_io_unplug(env, cli, NULL, PDL_POLICY_SAME);
+	osc_io_unplug(env, cli, NULL);
 	return 0;
 }
 
diff --git a/drivers/staging/lustre/lustre/ptlrpc/client.c b/drivers/staging/lustre/lustre/ptlrpc/client.c
index 90b24fc..e1830fe 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/client.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/client.c
@@ -844,14 +844,17 @@ ptlrpc_prep_req(struct obd_import *imp, __u32 version, int opcode, int count,
 EXPORT_SYMBOL(ptlrpc_prep_req);
 
 /**
- * Allocate and initialize new request set structure.
+ * Allocate and initialize new request set structure on the current CPT.
  * Returns a pointer to the newly allocated set structure or NULL on error.
  */
 struct ptlrpc_request_set *ptlrpc_prep_set(void)
 {
 	struct ptlrpc_request_set *set;
+	int cpt;
 
-	set = kzalloc(sizeof(*set), GFP_NOFS);
+	cpt = cfs_cpt_current(cfs_cpt_table, 0);
+	set = kzalloc_node(sizeof(*set), GFP_NOFS,
+			   cfs_cpt_spread_node(cfs_cpt_table, cpt));
 	if (!set)
 		return NULL;
 	atomic_set(&set->set_refcount, 1);
@@ -2827,7 +2830,7 @@ int ptlrpc_replay_req(struct ptlrpc_request *req)
 	atomic_inc(&req->rq_import->imp_replay_inflight);
 	ptlrpc_request_addref(req); /* ptlrpcd needs a ref */
 
-	ptlrpcd_add_req(req, PDL_POLICY_LOCAL, -1);
+	ptlrpcd_add_req(req);
 	return 0;
 }
 EXPORT_SYMBOL(ptlrpc_replay_req);
@@ -3033,7 +3036,7 @@ static void ptlrpcd_add_work_req(struct ptlrpc_request *req)
 	req->rq_xid		= ptlrpc_next_xid();
 	req->rq_import_generation = req->rq_import->imp_generation;
 
-	ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+	ptlrpcd_add_req(req);
 }
 
 static int work_interpreter(const struct lu_env *env,
diff --git a/drivers/staging/lustre/lustre/ptlrpc/import.c b/drivers/staging/lustre/lustre/ptlrpc/import.c
index f5b3245..c52ceef 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/import.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/import.c
@@ -742,12 +742,11 @@ int ptlrpc_connect_import(struct obd_import *imp)
 
 	DEBUG_REQ(D_RPCTRACE, request, "(re)connect request (timeout %d)",
 		  request->rq_timeout);
-	ptlrpcd_add_req(request, PDL_POLICY_ROUND, -1);
+	ptlrpcd_add_req(request);
 	rc = 0;
 out:
-	if (rc != 0) {
+	if (rc != 0)
 		IMPORT_SET_STATE(imp, LUSTRE_IMP_DISCON);
-	}
 
 	return rc;
 }
@@ -1257,7 +1256,7 @@ static int signal_completed_replay(struct obd_import *imp)
 		req->rq_timeout *= 3;
 	req->rq_interpret_reply = completed_replay_interpret;
 
-	ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+	ptlrpcd_add_req(req);
 	return 0;
 }
 
diff --git a/drivers/staging/lustre/lustre/ptlrpc/pinger.c b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
index f8edb79..d3aea4a 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/pinger.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
@@ -105,7 +105,7 @@ static int ptlrpc_ping(struct obd_import *imp)
 
 	DEBUG_REQ(D_INFO, req, "pinging %s->%s",
 		  imp->imp_obd->obd_uuid.uuid, obd2cli_tgt(imp->imp_obd));
-	ptlrpcd_add_req(req, PDL_POLICY_ROUND, -1);
+	ptlrpcd_add_req(req);
 
 	return 0;
 }
diff --git a/drivers/staging/lustre/lustre/ptlrpc/ptlrpc_internal.h b/drivers/staging/lustre/lustre/ptlrpc/ptlrpc_internal.h
index 6dc3998..1d64ca7 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/ptlrpc_internal.h
+++ b/drivers/staging/lustre/lustre/ptlrpc/ptlrpc_internal.h
@@ -50,7 +50,7 @@ extern struct mutex ptlrpc_all_services_mutex;
 
 int ptlrpc_start_thread(struct ptlrpc_service_part *svcpt, int wait);
 /* ptlrpcd.c */
-int ptlrpcd_start(int index, int max, const char *name, struct ptlrpcd_ctl *pc);
+int ptlrpcd_start(struct ptlrpcd_ctl *pc);
 
 /* client.c */
 struct ptlrpc_bulk_desc *ptlrpc_new_bulk(unsigned npages, unsigned max_brw,
diff --git a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
index 17cc81d..00efdbf 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/ptlrpcd.c
@@ -67,22 +67,94 @@
 
 #include "ptlrpc_internal.h"
 
+/* One of these per CPT. */
 struct ptlrpcd {
 	int pd_size;
 	int pd_index;
+	int pd_cpt;
+	int pd_cursor;
 	int pd_nthreads;
-	struct ptlrpcd_ctl pd_thread_rcv;
+	int pd_groupsize;
 	struct ptlrpcd_ctl pd_threads[0];
 };
 
+/*
+ * max_ptlrpcds is obsolete, but retained to ensure that the kernel
+ * module will load on a system where it has been tuned.
+ * A value other than 0 implies it was tuned, in which case the value
+ * is used to derive a setting for ptlrpcd_per_cpt_max.
+ */
 static int max_ptlrpcds;
 module_param(max_ptlrpcds, int, 0644);
 MODULE_PARM_DESC(max_ptlrpcds, "Max ptlrpcd thread count to be started.");
 
-static int ptlrpcd_bind_policy = PDB_POLICY_PAIR;
+/*
+ * ptlrpcd_bind_policy is obsolete, but retained to ensure that
+ * the kernel module will load on a system where it has been tuned.
+ * A value other than 0 implies it was tuned, in which case the value
+ * is used to derive a setting for ptlrpcd_partner_group_size.
+ */
+static int ptlrpcd_bind_policy;
 module_param(ptlrpcd_bind_policy, int, 0644);
-MODULE_PARM_DESC(ptlrpcd_bind_policy, "Ptlrpcd threads binding mode.");
-static struct ptlrpcd *ptlrpcds;
+MODULE_PARM_DESC(ptlrpcd_bind_policy,
+		 "Ptlrpcd threads binding mode (obsolete).");
+
+/*
+ * ptlrpcd_per_cpt_max: The maximum number of ptlrpcd threads to run
+ * in a CPT.
+ */
+static int ptlrpcd_per_cpt_max;
+module_param(ptlrpcd_per_cpt_max, int, 0644);
+MODULE_PARM_DESC(ptlrpcd_per_cpt_max,
+		 "Max ptlrpcd thread count to be started per cpt.");
+
+/*
+ * ptlrpcd_partner_group_size: The desired number of threads in each
+ * ptlrpcd partner thread group. Default is 2, corresponding to the
+ * old PDB_POLICY_PAIR. A negative value makes all ptlrpcd threads in
+ * a CPT partners of each other.
+ */
+static int ptlrpcd_partner_group_size;
+module_param(ptlrpcd_partner_group_size, int, 0644);
+MODULE_PARM_DESC(ptlrpcd_partner_group_size,
+		 "Number of ptlrpcd threads in a partner group.");
+
+/*
+ * ptlrpcd_cpts: A CPT string describing the CPU partitions that
+ * ptlrpcd threads should run on. Used to make ptlrpcd threads run on
+ * a subset of all CPTs.
+ *
+ * ptlrpcd_cpts=2
+ * ptlrpcd_cpts=[2]
+ *   run ptlrpcd threads only on CPT 2.
+ *
+ * ptlrpcd_cpts=0-3
+ * ptlrpcd_cpts=[0-3]
+ *   run ptlrpcd threads on CPTs 0, 1, 2, and 3.
+ *
+ * ptlrpcd_cpts=[0-3,5,7]
+ *   run ptlrpcd threads on CPTs 0, 1, 2, 3, 5, and 7.
+ */
+static char *ptlrpcd_cpts;
+module_param(ptlrpcd_cpts, charp, 0644);
+MODULE_PARM_DESC(ptlrpcd_cpts,
+		 "CPU partitions ptlrpcd threads should run in");
+
+/* ptlrpcds_cpt_idx maps cpt numbers to an index in the ptlrpcds array. */
+static int		*ptlrpcds_cpt_idx;
+
+/* ptlrpcds_num is the number of entries in the ptlrpcds array. */
+static int		ptlrpcds_num;
+static struct ptlrpcd	**ptlrpcds;
+
+/*
+ * In addition to the regular thread pool above, there is a single
+ * global recovery thread. Recovery isn't critical for performance,
+ * and doesn't block, but must always be able to proceed, and it is
+ * possible that all normal ptlrpcd threads are blocked. Hence the
+ * need for a dedicated thread.
+ */
+static struct ptlrpcd_ctl ptlrpcd_rcv;
 
 struct mutex ptlrpcd_mutex;
 static int ptlrpcd_users;
@@ -98,45 +170,29 @@ void ptlrpcd_wake(struct ptlrpc_request *req)
 EXPORT_SYMBOL(ptlrpcd_wake);
 
 static struct ptlrpcd_ctl *
-ptlrpcd_select_pc(struct ptlrpc_request *req, pdl_policy_t policy, int index)
+ptlrpcd_select_pc(struct ptlrpc_request *req)
 {
-	int idx = 0;
+	struct ptlrpcd	*pd;
+	int		cpt;
+	int		idx;
 
 	if (req != NULL && req->rq_send_state != LUSTRE_IMP_FULL)
-		return &ptlrpcds->pd_thread_rcv;
-
-	switch (policy) {
-	case PDL_POLICY_SAME:
-		idx = smp_processor_id() % ptlrpcds->pd_nthreads;
-		break;
-	case PDL_POLICY_LOCAL:
-		/* Before CPU partition patches available, process it the same
-		 * as "PDL_POLICY_ROUND". */
-# ifdef CFS_CPU_MODE_NUMA
-# warning "fix this code to use new CPU partition APIs"
-# endif
-		/* Fall through to PDL_POLICY_ROUND until the CPU
-		 * CPU partition patches are available. */
-		index = -1;
-	case PDL_POLICY_PREFERRED:
-		if (index >= 0 && index < num_online_cpus()) {
-			idx = index % ptlrpcds->pd_nthreads;
-			break;
-		}
-		/* Fall through to PDL_POLICY_ROUND for bad index. */
-	default:
-		/* Fall through to PDL_POLICY_ROUND for unknown policy. */
-	case PDL_POLICY_ROUND:
+		return &ptlrpcd_rcv;
+
+	cpt = cfs_cpt_current(cfs_cpt_table, 1);
+	if (!ptlrpcds_cpt_idx)
+		idx = cpt;
+	else
+		idx = ptlrpcds_cpt_idx[cpt];
+	pd = ptlrpcds[idx];
+
 		/* We do not care whether it is strict load balance. */
-		idx = ptlrpcds->pd_index + 1;
-		if (idx == smp_processor_id())
-			idx++;
-		idx %= ptlrpcds->pd_nthreads;
-		ptlrpcds->pd_index = idx;
-		break;
-	}
+	idx = pd->pd_cursor;
+	if (++idx == pd->pd_nthreads)
+		idx = 0;
+	pd->pd_cursor = idx;
 
-	return &ptlrpcds->pd_threads[idx];
+	return &pd->pd_threads[idx];
 }
 
 /**
@@ -150,7 +206,7 @@ void ptlrpcd_add_rqset(struct ptlrpc_request_set *set)
 	struct ptlrpc_request_set *new;
 	int count, i;
 
-	pc = ptlrpcd_select_pc(NULL, PDL_POLICY_LOCAL, -1);
+	pc = ptlrpcd_select_pc(NULL);
 	new = pc->pc_set;
 
 	list_for_each_safe(pos, tmp, &set->set_requests) {
@@ -212,7 +268,7 @@ static int ptlrpcd_steal_rqset(struct ptlrpc_request_set *des,
  * Requests that are added to the ptlrpcd queue are sent via
  * ptlrpcd_check->ptlrpc_check_set().
  */
-void ptlrpcd_add_req(struct ptlrpc_request *req, pdl_policy_t policy, int idx)
+void ptlrpcd_add_req(struct ptlrpc_request *req)
 {
 	struct ptlrpcd_ctl *pc;
 
@@ -242,7 +298,7 @@ void ptlrpcd_add_req(struct ptlrpc_request *req, pdl_policy_t policy, int idx)
 		spin_unlock(&req->rq_lock);
 	}
 
-	pc = ptlrpcd_select_pc(req, policy, idx);
+	pc = ptlrpcd_select_pc(req);
 
 	DEBUG_REQ(D_INFO, req, "add req [%p] to pc [%s:%d]",
 		  req, pc->pc_name, pc->pc_index);
@@ -372,25 +428,29 @@ static int ptlrpcd_check(struct lu_env *env, struct ptlrpcd_ctl *pc)
 static int ptlrpcd(void *arg)
 {
 	struct ptlrpcd_ctl *pc = arg;
-	struct ptlrpc_request_set *set = pc->pc_set;
+	struct ptlrpc_request_set *set;
 	struct lu_env env = { .le_ses = NULL };
-	int rc, exit = 0;
+	int rc = 0;
+	int exit = 0;
 
 	unshare_fs_struct();
-#if defined(CONFIG_SMP)
-	if (test_bit(LIOD_BIND, &pc->pc_flags)) {
-		int index = pc->pc_index;
-
-		if (index >= 0 && index < num_possible_cpus()) {
-			while (!cpu_online(index)) {
-				if (++index >= num_possible_cpus())
-					index = 0;
-			}
-			set_cpus_allowed_ptr(current,
-					cpumask_of_node(cpu_to_node(index)));
-		}
+	if (cfs_cpt_bind(cfs_cpt_table, pc->pc_cpt) != 0)
+		CWARN("Failed to bind %s on CPT %d\n", pc->pc_name, pc->pc_cpt);
+
+	/*
+	 * Allocate the request set after the thread has been bound
+	 * above. This is safe because no requests will be queued
+	 * until all ptlrpcd threads have confirmed that they have
+	 * successfully started.
+	 */
+	set = ptlrpc_prep_set();
+	if (!set) {
+		rc = -ENOMEM;
+		goto failed;
 	}
-#endif
+	spin_lock(&pc->pc_lock);
+	pc->pc_set = set;
+	spin_unlock(&pc->pc_lock);
 	/*
 	 * XXX So far only "client" ptlrpcd uses an environment. In
 	 * the future, ptlrpcd thread (or a thread-set) has to given
@@ -398,10 +458,10 @@ static int ptlrpcd(void *arg)
 	 */
 	rc = lu_context_init(&env.le_ctx,
 			     LCT_CL_THREAD|LCT_REMEMBER|LCT_NOREF);
-	complete(&pc->pc_starting);
-
 	if (rc != 0)
-		return rc;
+		goto failed;
+
+	complete(&pc->pc_starting);
 
 	/*
 	 * This mainloop strongly resembles ptlrpc_set_wait() except that our
@@ -447,174 +507,97 @@ static int ptlrpcd(void *arg)
 	complete(&pc->pc_finishing);
 
 	return 0;
+failed:
+	pc->pc_error = rc;
+	complete(&pc->pc_starting);
+	return rc;
 }
 
-/* XXX: We want multiple CPU cores to share the async RPC load. So we start many
- *      ptlrpcd threads. We also want to reduce the ptlrpcd overhead caused by
- *      data transfer cross-CPU cores. So we bind ptlrpcd thread to specified
- *      CPU core. But binding all ptlrpcd threads maybe cause response delay
- *      because of some CPU core(s) busy with other loads.
- *
- *      For example: "ls -l", some async RPCs for statahead are assigned to
- *      ptlrpcd_0, and ptlrpcd_0 is bound to CPU_0, but CPU_0 may be quite busy
- *      with other non-ptlrpcd, like "ls -l" itself (we want to the "ls -l"
- *      thread, statahead thread, and ptlrpcd thread can run in parallel), under
- *      such case, the statahead async RPCs can not be processed in time, it is
- *      unexpected. If ptlrpcd_0 can be re-scheduled on other CPU core, it may
- *      be better. But it breaks former data transfer policy.
- *
- *      So we shouldn't be blind for avoiding the data transfer. We make some
- *      compromise: divide the ptlrpcd threads pool into two parts. One part is
- *      for bound mode, each ptlrpcd thread in this part is bound to some CPU
- *      core. The other part is for free mode, all the ptlrpcd threads in the
- *      part can be scheduled on any CPU core. We specify some partnership
- *      between bound mode ptlrpcd thread(s) and free mode ptlrpcd thread(s),
- *      and the async RPC load within the partners are shared.
+static void ptlrpcd_ctl_init(struct ptlrpcd_ctl *pc, int index, int cpt)
+{
+	pc->pc_index = index;
+	pc->pc_cpt = cpt;
+	init_completion(&pc->pc_starting);
+	init_completion(&pc->pc_finishing);
+	spin_lock_init(&pc->pc_lock);
+
+	if (index < 0) {
+		/* Recovery thread. */
+		snprintf(pc->pc_name, sizeof(pc->pc_name), "ptlrpcd_rcv");
+	} else {
+		/* Regular thread. */
+		snprintf(pc->pc_name, sizeof(pc->pc_name),
+			 "ptlrpcd_%02d_%02d", cpt, index);
+	}
+}
+
+/* XXX: We want multiple CPU cores to share the async RPC load. So we
+ *	start many ptlrpcd threads. We also want to reduce the ptlrpcd
+ *	overhead caused by data transfer across CPU cores. So we bind
+ *	all ptlrpcd threads to a CPT, in the expectation that CPTs
+ *	will be defined in a way that matches these boundaries. Within
+ *	a CPT a ptlrpcd thread can be scheduled on any available core.
  *
- *      It can partly avoid data transfer cross-CPU (if the bound mode ptlrpcd
- *      thread can be scheduled in time), and try to guarantee the async RPC
- *      processed ASAP (as long as the free mode ptlrpcd thread can be scheduled
- *      on any CPU core).
+ *	Each ptlrpcd thread has its own request queue. This can cause
+ *	response delay if the thread is already busy. To help with
+ *	this we define partner threads: these are other threads bound
+ *	to the same CPT which will check for work in each other's
+ *	request queues if they have no work to do.
  *
- *      As for how to specify the partnership between bound mode ptlrpcd
- *      thread(s) and free mode ptlrpcd thread(s), the simplest way is to use
- *      <free bound> pair. In future, we can specify some more complex
- *      partnership based on the patches for CPU partition. But before such
- *      patches are available, we prefer to use the simplest one.
+ *	The desired number of partner threads can be tuned by setting
+ *	ptlrpcd_partner_group_size. The default is to create pairs of
+ *	partner threads.
  */
-# ifdef CFS_CPU_MODE_NUMA
-# warning "fix ptlrpcd_bind() to use new CPU partition APIs"
-# endif
-static int ptlrpcd_bind(int index, int max)
+static int ptlrpcd_partners(struct ptlrpcd *pd, int index)
 {
 	struct ptlrpcd_ctl *pc;
+	struct ptlrpcd_ctl **ppc;
+	int first;
+	int i;
 	int rc = 0;
-#if defined(CONFIG_NUMA)
-	cpumask_t mask;
-#endif
+	int size;
+
+	LASSERT(index >= 0 && index < pd->pd_nthreads);
+	pc = &pd->pd_threads[index];
+	pc->pc_npartners = pd->pd_groupsize - 1;
+
+	if (pc->pc_npartners <= 0)
+		goto out;
 
-	LASSERT(index <= max - 1);
-	pc = &ptlrpcds->pd_threads[index];
-	switch (ptlrpcd_bind_policy) {
-	case PDB_POLICY_NONE:
-		pc->pc_npartners = -1;
-		break;
-	case PDB_POLICY_FULL:
+	size = sizeof(struct ptlrpcd_ctl *) * pc->pc_npartners;
+	pc->pc_partners = kzalloc_node(size, GFP_NOFS,
+				       cfs_cpt_spread_node(cfs_cpt_table,
+							   pc->pc_cpt));
+	if (!pc->pc_partners) {
 		pc->pc_npartners = 0;
-		set_bit(LIOD_BIND, &pc->pc_flags);
-		break;
-	case PDB_POLICY_PAIR:
-		LASSERT(max % 2 == 0);
-		pc->pc_npartners = 1;
-		break;
-	case PDB_POLICY_NEIGHBOR:
-#if defined(CONFIG_NUMA)
-	{
-		int i;
-		cpumask_copy(&mask, cpumask_of_node(cpu_to_node(index)));
-		for (i = max; i < num_online_cpus(); i++)
-			cpumask_clear_cpu(i, &mask);
-		pc->pc_npartners = cpumask_weight(&mask) - 1;
-		set_bit(LIOD_BIND, &pc->pc_flags);
-	}
-#else
-		LASSERT(max >= 3);
-		pc->pc_npartners = 2;
-#endif
-		break;
-	default:
-		CERROR("unknown ptlrpcd bind policy %d\n", ptlrpcd_bind_policy);
-		rc = -EINVAL;
+		rc = -ENOMEM;
+		goto out;
 	}
 
-	if (rc == 0 && pc->pc_npartners > 0) {
-		pc->pc_partners = kcalloc(pc->pc_npartners,
-					  sizeof(struct ptlrpcd_ctl *),
-					  GFP_NOFS);
-		if (pc->pc_partners == NULL) {
-			pc->pc_npartners = 0;
-			rc = -ENOMEM;
-		} else {
-			switch (ptlrpcd_bind_policy) {
-			case PDB_POLICY_PAIR:
-				if (index & 0x1) {
-					set_bit(LIOD_BIND, &pc->pc_flags);
-					pc->pc_partners[0] = &ptlrpcds->
-						pd_threads[index - 1];
-					ptlrpcds->pd_threads[index - 1].
-						pc_partners[0] = pc;
-				}
-				break;
-			case PDB_POLICY_NEIGHBOR:
-#if defined(CONFIG_NUMA)
-			{
-				struct ptlrpcd_ctl *ppc;
-				int i, pidx;
-				/* partners are cores in the same NUMA node.
-				 * setup partnership only with ptlrpcd threads
-				 * that are already initialized
-				 */
-				for (pidx = 0, i = 0; i < index; i++) {
-					if (cpumask_test_cpu(i, &mask)) {
-						ppc = &ptlrpcds->pd_threads[i];
-						pc->pc_partners[pidx++] = ppc;
-						ppc->pc_partners[ppc->
-							  pc_npartners++] = pc;
-					}
-				}
-				/* adjust number of partners to the number
-				 * of partnership really setup */
-				pc->pc_npartners = pidx;
-			}
-#else
-				if (index & 0x1)
-					set_bit(LIOD_BIND, &pc->pc_flags);
-				if (index > 0) {
-					pc->pc_partners[0] = &ptlrpcds->
-						pd_threads[index - 1];
-					ptlrpcds->pd_threads[index - 1].
-						pc_partners[1] = pc;
-					if (index == max - 1) {
-						pc->pc_partners[1] =
-						&ptlrpcds->pd_threads[0];
-						ptlrpcds->pd_threads[0].
-						pc_partners[0] = pc;
-					}
-				}
-#endif
-				break;
-			}
-		}
+	first = index - index % pd->pd_groupsize;
+	ppc = pc->pc_partners;
+	for (i = first; i < first + pd->pd_groupsize; i++) {
+		if (i != index)
+			*ppc++ = &pd->pd_threads[i];
 	}
-
+out:
 	return rc;
 }
 
-
-int ptlrpcd_start(int index, int max, const char *name, struct ptlrpcd_ctl *pc)
+int ptlrpcd_start(struct ptlrpcd_ctl *pc)
 {
-	int rc;
+	struct task_struct *task;
+	int rc = 0;
 
 	/*
 	 * Do not allow start second thread for one pc.
 	 */
 	if (test_and_set_bit(LIOD_START, &pc->pc_flags)) {
 		CWARN("Starting second thread (%s) for same pc %p\n",
-		      name, pc);
+		      pc->pc_name, pc);
 		return 0;
 	}
 
-	pc->pc_index = index;
-	init_completion(&pc->pc_starting);
-	init_completion(&pc->pc_finishing);
-	spin_lock_init(&pc->pc_lock);
-	strlcpy(pc->pc_name, name, sizeof(pc->pc_name));
-	pc->pc_set = ptlrpc_prep_set();
-	if (pc->pc_set == NULL) {
-		rc = -ENOMEM;
-		goto out;
-	}
-
 	/*
 	 * So far only "client" ptlrpcd uses an environment. In the future,
 	 * ptlrpcd thread (or a thread-set) has to be given an argument,
@@ -622,29 +605,21 @@ int ptlrpcd_start(int index, int max, const char *name, struct ptlrpcd_ctl *pc)
 	 */
 	rc = lu_context_init(&pc->pc_env.le_ctx, LCT_CL_THREAD|LCT_REMEMBER);
 	if (rc != 0)
-		goto out_set;
+		goto out;
 
-	{
-		struct task_struct *task;
-		if (index >= 0) {
-			rc = ptlrpcd_bind(index, max);
-			if (rc < 0)
-				goto out_env;
-		}
+	task = kthread_run(ptlrpcd, pc, "%s", pc->pc_name);
+	if (IS_ERR(task)) {
+		rc = PTR_ERR(task);
+		goto out_set;
+	}
 
-		task = kthread_run(ptlrpcd, pc, "%s", pc->pc_name);
-		if (IS_ERR(task)) {
-			rc = PTR_ERR(task);
-			goto out_env;
-		}
+	wait_for_completion(&pc->pc_starting);
+	rc = pc->pc_error;
+	if (rc != 0)
+		goto out_set;
 
-		wait_for_completion(&pc->pc_starting);
-	}
 	return 0;
 
-out_env:
-	lu_context_fini(&pc->pc_env.le_ctx);
-
 out_set:
 	if (pc->pc_set != NULL) {
 		struct ptlrpc_request_set *set = pc->pc_set;
@@ -654,7 +629,7 @@ out_set:
 		spin_unlock(&pc->pc_lock);
 		ptlrpc_set_destroy(set);
 	}
-	clear_bit(LIOD_BIND, &pc->pc_flags);
+	lu_context_fini(&pc->pc_env.le_ctx);
 
 out:
 	clear_bit(LIOD_START, &pc->pc_flags);
@@ -694,7 +669,6 @@ void ptlrpcd_free(struct ptlrpcd_ctl *pc)
 	clear_bit(LIOD_START, &pc->pc_flags);
 	clear_bit(LIOD_STOP, &pc->pc_flags);
 	clear_bit(LIOD_FORCE, &pc->pc_flags);
-	clear_bit(LIOD_BIND, &pc->pc_flags);
 
 out:
 	if (pc->pc_npartners > 0) {
@@ -704,88 +678,262 @@ out:
 		pc->pc_partners = NULL;
 	}
 	pc->pc_npartners = 0;
+	pc->pc_error = 0;
 }
 
 static void ptlrpcd_fini(void)
 {
 	int i;
+	int j;
 
 	if (ptlrpcds != NULL) {
-		for (i = 0; i < ptlrpcds->pd_nthreads; i++)
-			ptlrpcd_stop(&ptlrpcds->pd_threads[i], 0);
-		for (i = 0; i < ptlrpcds->pd_nthreads; i++)
-			ptlrpcd_free(&ptlrpcds->pd_threads[i]);
-		ptlrpcd_stop(&ptlrpcds->pd_thread_rcv, 0);
-		ptlrpcd_free(&ptlrpcds->pd_thread_rcv);
+		for (i = 0; i < ptlrpcds_num; i++) {
+			if (!ptlrpcds[i])
+				break;
+			for (j = 0; j < ptlrpcds[i]->pd_nthreads; j++)
+				ptlrpcd_stop(&ptlrpcds[i]->pd_threads[j], 0);
+			for (j = 0; j < ptlrpcds[i]->pd_nthreads; j++)
+				ptlrpcd_free(&ptlrpcds[i]->pd_threads[j]);
+			kfree(ptlrpcds[i]);
+			ptlrpcds[i] = NULL;
+		}
 		kfree(ptlrpcds);
-		ptlrpcds = NULL;
 	}
+	ptlrpcds_num = 0;
+
+	ptlrpcd_stop(&ptlrpcd_rcv, 0);
+	ptlrpcd_free(&ptlrpcd_rcv);
+
+	kfree(ptlrpcds_cpt_idx);
+	ptlrpcds_cpt_idx = NULL;
 }
 
 static int ptlrpcd_init(void)
 {
-	int nthreads = num_online_cpus();
-	char name[16];
-	int size, i = -1, j, rc = 0;
-
-	if (max_ptlrpcds > 0 && max_ptlrpcds < nthreads)
-		nthreads = max_ptlrpcds;
-	if (nthreads < 2)
-		nthreads = 2;
-	if (nthreads < 3 && ptlrpcd_bind_policy == PDB_POLICY_NEIGHBOR)
-		ptlrpcd_bind_policy = PDB_POLICY_PAIR;
-	else if (nthreads % 2 != 0 && ptlrpcd_bind_policy == PDB_POLICY_PAIR)
-		nthreads &= ~1; /* make sure it is even */
-
-	size = offsetof(struct ptlrpcd, pd_threads[nthreads]);
-	ptlrpcds = kzalloc(size, GFP_NOFS);
+	int nthreads;
+	int groupsize;
+	int size;
+	int i;
+	int j;
+	int rc = 0;
+	struct cfs_cpt_table *cptable;
+	__u32 *cpts = NULL;
+	int ncpts;
+	int cpt;
+	struct ptlrpcd *pd;
+
+	/*
+	 * Determine the CPTs that ptlrpcd threads will run on.
+	 */
+	cptable = cfs_cpt_table;
+	ncpts = cfs_cpt_number(cptable);
+	if (ptlrpcd_cpts) {
+		struct cfs_expr_list *el;
+
+		size = ncpts * sizeof(ptlrpcds_cpt_idx[0]);
+		ptlrpcds_cpt_idx = kzalloc(size, GFP_KERNEL);
+		if (!ptlrpcds_cpt_idx) {
+			rc = -ENOMEM;
+			goto out;
+		}
+
+		rc = cfs_expr_list_parse(ptlrpcd_cpts,
+					 strlen(ptlrpcd_cpts),
+					 0, ncpts - 1, &el);
+
+		if (rc != 0) {
+			CERROR("ptlrpcd_cpts: invalid CPT pattern string: %s\n",
+			       ptlrpcd_cpts);
+			rc = -EINVAL;
+			goto out;
+		}
+
+		rc = cfs_expr_list_values(el, ncpts, &cpts);
+		cfs_expr_list_free(el);
+		if (rc <= 0) {
+			CERROR("ptlrpcd_cpts: failed to parse CPT array %s: %d\n",
+			       ptlrpcd_cpts, rc);
+			if (rc == 0)
+				rc = -EINVAL;
+			goto out;
+		}
+
+		/*
+		 * Create the cpt-to-index map. When there is no match
+		 * in the cpt table, pick an arbitrary entry (cpt modulo
+		 * the number of listed CPTs). This could be changed to
+		 * take the topology of the system into account.
+		 */
+		for (cpt = 0; cpt < ncpts; cpt++) {
+			for (i = 0; i < rc; i++)
+				if (cpts[i] == cpt)
+					break;
+			if (i >= rc)
+				i = cpt % rc;
+			ptlrpcds_cpt_idx[cpt] = i;
+		}
+
+		cfs_expr_list_values_free(cpts, rc);
+		ncpts = rc;
+	}
+	ptlrpcds_num = ncpts;
+
+	size = ncpts * sizeof(ptlrpcds[0]);
+	ptlrpcds = kzalloc(size, GFP_KERNEL);
 	if (!ptlrpcds) {
 		rc = -ENOMEM;
 		goto out;
 	}
 
-	snprintf(name, sizeof(name), "ptlrpcd_rcv");
-	set_bit(LIOD_RECOVERY, &ptlrpcds->pd_thread_rcv.pc_flags);
-	rc = ptlrpcd_start(-1, nthreads, name, &ptlrpcds->pd_thread_rcv);
+	/*
+	 * The max_ptlrpcds parameter is obsolete, but do something
+	 * sane if it has been tuned, and complain if
+	 * ptlrpcd_per_cpt_max has also been tuned.
+	 */
+	if (max_ptlrpcds != 0) {
+		CWARN("max_ptlrpcds is obsolete.\n");
+		if (ptlrpcd_per_cpt_max == 0) {
+			ptlrpcd_per_cpt_max = max_ptlrpcds / ncpts;
+			/* Round up if there is a remainder. */
+			if (max_ptlrpcds % ncpts != 0)
+				ptlrpcd_per_cpt_max++;
+			CWARN("Setting ptlrpcd_per_cpt_max = %d\n",
+			      ptlrpcd_per_cpt_max);
+		} else {
+			CWARN("ptlrpcd_per_cpt_max is also set!\n");
+		}
+	}
+
+	/*
+	 * The ptlrpcd_bind_policy parameter is obsolete, but do
+	 * something sane if it has been tuned, and complain if
+	 * ptlrpcd_partner_group_size is also tuned.
+	 */
+	if (ptlrpcd_bind_policy != 0) {
+		CWARN("ptlrpcd_bind_policy is obsolete.\n");
+		if (ptlrpcd_partner_group_size == 0) {
+			switch (ptlrpcd_bind_policy) {
+			case 1: /* PDB_POLICY_NONE */
+			case 2: /* PDB_POLICY_FULL */
+				ptlrpcd_partner_group_size = 1;
+				break;
+			case 3: /* PDB_POLICY_PAIR */
+				ptlrpcd_partner_group_size = 2;
+				break;
+			case 4: /* PDB_POLICY_NEIGHBOR */
+#ifdef CONFIG_NUMA
+				ptlrpcd_partner_group_size = -1; /* CPT */
+#else
+				ptlrpcd_partner_group_size = 3; /* Triplets */
+#endif
+				break;
+			default: /* Illegal value, use the default. */
+				ptlrpcd_partner_group_size = 2;
+				break;
+			}
+			CWARN("Setting ptlrpcd_partner_group_size = %d\n",
+			      ptlrpcd_partner_group_size);
+		} else {
+			CWARN("ptlrpcd_partner_group_size is also set!\n");
+		}
+	}
+
+	if (ptlrpcd_partner_group_size == 0)
+		ptlrpcd_partner_group_size = 2;
+	else if (ptlrpcd_partner_group_size < 0)
+		ptlrpcd_partner_group_size = -1;
+	else if (ptlrpcd_per_cpt_max > 0 &&
+		 ptlrpcd_partner_group_size > ptlrpcd_per_cpt_max)
+		ptlrpcd_partner_group_size = ptlrpcd_per_cpt_max;
+
+	/*
+	 * Start the recovery thread first.
+	 */
+	set_bit(LIOD_RECOVERY, &ptlrpcd_rcv.pc_flags);
+	ptlrpcd_ctl_init(&ptlrpcd_rcv, -1, CFS_CPT_ANY);
+	rc = ptlrpcd_start(&ptlrpcd_rcv);
 	if (rc < 0)
 		goto out;
 
-	/* XXX: We start nthreads ptlrpc daemons. Each of them can process any
-	 *      non-recovery async RPC to improve overall async RPC efficiency.
-	 *
-	 *      But there are some issues with async I/O RPCs and async non-I/O
-	 *      RPCs processed in the same set under some cases. The ptlrpcd may
-	 *      be blocked by some async I/O RPC(s), then will cause other async
-	 *      non-I/O RPC(s) can not be processed in time.
-	 *
-	 *      Maybe we should distinguish blocked async RPCs from non-blocked
-	 *      async RPCs, and process them in different ptlrpcd sets to avoid
-	 *      unnecessary dependency. But how to distribute async RPCs load
-	 *      among all the ptlrpc daemons becomes another trouble. */
-	for (i = 0; i < nthreads; i++) {
-		snprintf(name, sizeof(name), "ptlrpcd_%d", i);
-		rc = ptlrpcd_start(i, nthreads, name, &ptlrpcds->pd_threads[i]);
-		if (rc < 0)
+	for (i = 0; i < ncpts; i++) {
+		if (!cpts)
+			cpt = i;
+		else
+			cpt = cpts[i];
+
+		nthreads = cfs_cpt_weight(cptable, cpt);
+		if (ptlrpcd_per_cpt_max > 0 && ptlrpcd_per_cpt_max < nthreads)
+			nthreads = ptlrpcd_per_cpt_max;
+		if (nthreads < 2)
+			nthreads = 2;
+
+		if (ptlrpcd_partner_group_size <= 0) {
+			groupsize = nthreads;
+		} else if (nthreads <= ptlrpcd_partner_group_size) {
+			groupsize = nthreads;
+		} else {
+			groupsize = ptlrpcd_partner_group_size;
+			if (nthreads % groupsize != 0)
+				nthreads += groupsize - (nthreads % groupsize);
+		}
+
+		size = offsetof(struct ptlrpcd, pd_threads[nthreads]);
+		pd = kzalloc_node(size, GFP_NOFS,
+				  cfs_cpt_spread_node(cfs_cpt_table, cpt));
+		if (!pd) {
+			rc = -ENOMEM;
 			goto out;
-	}
+		}
+		pd->pd_size = size;
+		pd->pd_index = i;
+		pd->pd_cpt = cpt;
+		pd->pd_cursor = 0;
+		pd->pd_nthreads = nthreads;
+		pd->pd_groupsize = groupsize;
+		ptlrpcds[i] = pd;
 
-	ptlrpcds->pd_size = size;
-	ptlrpcds->pd_index = 0;
-	ptlrpcds->pd_nthreads = nthreads;
+		/*
+		 * The ptlrpcd threads in a partner group can access
+		 * each other's struct ptlrpcd_ctl, so these must be
+		 * initialized before any thread is started.
+		 */
+		for (j = 0; j < nthreads; j++) {
+			ptlrpcd_ctl_init(&pd->pd_threads[j], j, cpt);
+			rc = ptlrpcd_partners(pd, j);
+			if (rc < 0)
+				goto out;
+		}
 
-out:
-	if (rc != 0 && ptlrpcds != NULL) {
-		for (j = 0; j <= i; j++)
-			ptlrpcd_stop(&ptlrpcds->pd_threads[j], 0);
-		for (j = 0; j <= i; j++)
-			ptlrpcd_free(&ptlrpcds->pd_threads[j]);
-		ptlrpcd_stop(&ptlrpcds->pd_thread_rcv, 0);
-		ptlrpcd_free(&ptlrpcds->pd_thread_rcv);
-		kfree(ptlrpcds);
-		ptlrpcds = NULL;
+		/* XXX: We start nthreads ptlrpc daemons.
+		 *	Each of them can process any non-recovery
+		 *	async RPC to improve overall async RPC
+		 *	efficiency.
+		 *
+		 *	But there are some issues when async I/O RPCs
+		 *	and async non-I/O RPCs are processed in the
+		 *	same set. The ptlrpcd may be blocked by some
+		 *	async I/O RPC(s), which in turn prevents other
+		 *	async non-I/O RPC(s) from being processed in
+		 *	time.
+		 *
+		 *	Maybe we should distinguish blocked async RPCs
+		 *	from non-blocked async RPCs, and process them
+		 *	in different ptlrpcd sets to avoid unnecessary
+		 *	dependency. But distributing the async RPC
+		 *	load among all the ptlrpc daemons then becomes
+		 *	another problem.
+		 */
+		for (j = 0; j < nthreads; j++) {
+			rc = ptlrpcd_start(&pd->pd_threads[j]);
+			if (rc < 0)
+				goto out;
+		}
 	}
+out:
+	if (rc != 0)
+		ptlrpcd_fini();
 
-	return 0;
+	return rc;
 }
 
 int ptlrpcd_addref(void)
-- 
2.1.0
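
A note for readers of the patch above: the per-CPT thread sizing, the
partner grouping, and the cpt-to-index map are easier to follow with
concrete numbers. The following is only a userspace sketch, not part of
the patch. The names cpt_sizing, cpt_size, group_first, and
cpt_idx_build are hypothetical, and plain int arguments stand in for
cfs_cpt_weight() and the module parameters. It assumes the default
substitution (a group size of 0 becoming 2) has already happened, as it
does earlier in ptlrpcd_init().

```c
#include <assert.h>

/* Hypothetical container for the two per-CPT values computed below. */
struct cpt_sizing {
	int nthreads;	/* threads to start in this CPT */
	int groupsize;	/* threads per partner group */
};

/*
 * Mirror of the per-CPT sizing in ptlrpcd_init(). "weight" stands in
 * for cfs_cpt_weight(); the other two arguments stand in for the
 * ptlrpcd_per_cpt_max and ptlrpcd_partner_group_size parameters.
 */
struct cpt_sizing cpt_size(int weight, int per_cpt_max, int group_param)
{
	struct cpt_sizing s;

	s.nthreads = weight;
	if (per_cpt_max > 0 && per_cpt_max < s.nthreads)
		s.nthreads = per_cpt_max;
	if (s.nthreads < 2)
		s.nthreads = 2;

	if (group_param <= 0 || s.nthreads <= group_param) {
		/* One partner group spanning every thread in the CPT. */
		s.groupsize = s.nthreads;
	} else {
		/* Round nthreads up to a multiple of the group size. */
		s.groupsize = group_param;
		if (s.nthreads % s.groupsize != 0)
			s.nthreads += s.groupsize - s.nthreads % s.groupsize;
	}
	return s;
}

/* Index of the first thread in the partner group containing "index". */
int group_first(int index, int groupsize)
{
	assert(groupsize > 0 && index >= 0);
	return index - index % groupsize;
}

/*
 * Mirror of the cpt-to-index loop: a CPT listed in cpts[] maps to its
 * position in that list; any other CPT maps to cpt % n.
 */
void cpt_idx_build(const int *cpts, int n, int ncpts, int *idx)
{
	int cpt, i;

	assert(n > 0);
	for (cpt = 0; cpt < ncpts; cpt++) {
		for (i = 0; i < n; i++)
			if (cpts[i] == cpt)
				break;
		if (i >= n)
			i = cpt % n;
		idx[cpt] = i;
	}
}
```

With a CPT weight of 7, no per-CPT cap, and a group size of 3, cpt_size()
gives 9 threads in groups {0,1,2}, {3,4,5}, {6,7,8}, so thread 4 has
partners 3 and 5. For ptlrpcd_cpts=[0-3,5,7] on a system with 8 CPTs,
CPTs 4 and 5 both map to index 4 and CPT 6 maps to index 0.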



* Re: [PATCH 06/19] staging/lustre/lmv: fix potential null pointer dereference
  2015-09-14 22:41 ` [PATCH 06/19] staging/lustre/lmv: fix potential null pointer dereference green
@ 2015-09-15 13:26   ` Trevor Woerner
  2015-09-15 13:57     ` Oleg Drokin
  0 siblings, 1 reply; 22+ messages in thread
From: Trevor Woerner @ 2015-09-15 13:26 UTC (permalink / raw)
  To: green, Greg Kroah-Hartman, devel, Andreas Dilger
  Cc: Oleg Drokin, Linux Kernel Mailing List

On 09/14/15 18:41, green@linuxhacker.ru wrote:
> Reviewed-on: http://review.whamcloud.com/14605

I'm confused why the patch found in this email doesn't match the patch I
find when I click on the above link? Some of the patches in this series
match what I find on your jenkins URLs, and some do not.

For example, the function call in the body of the "if" below is
"sysfs_remove_link()" but the function call in the "if" body of the code
I find at http://review.whamcloud.com/14605 after clicking on the
"lustre/lmv/lmv_obd.c" link is "lprocfs_remove_proc_entry()".

Maybe I'm not using your jenkins correctly?

> Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6517
> Reviewed-by: James Simmons <uja.ornl@yahoo.com>
> Reviewed-by: John L. Hammond <john.hammond@intel.com>
> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>
> ---
>  drivers/staging/lustre/lustre/lmv/lmv_obd.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/lmv/lmv_obd.c b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
> index 0fc0b61..cebbacf 100644
> --- a/drivers/staging/lustre/lustre/lmv/lmv_obd.c
> +++ b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
> @@ -593,11 +593,11 @@ static int lmv_disconnect_mdc(struct obd_device *obd, struct lmv_tgt_desc *tgt)
>  		mdc_obd->obd_force = obd->obd_force;
>  		mdc_obd->obd_fail = obd->obd_fail;
>  		mdc_obd->obd_no_recov = obd->obd_no_recov;
> -	}
>  
> -	if (lmv->lmv_tgts_kobj)
> -		sysfs_remove_link(lmv->lmv_tgts_kobj,
> -				  mdc_obd->obd_name);
> +		if (lmv->lmv_tgts_kobj)
> +			sysfs_remove_link(lmv->lmv_tgts_kobj,
> +					  mdc_obd->obd_name);
> +	}
>  
>  	rc = obd_fid_fini(tgt->ltd_exp->exp_obd);
>  	if (rc)



* Re: [PATCH 06/19] staging/lustre/lmv: fix potential null pointer dereference
  2015-09-15 13:26   ` Trevor Woerner
@ 2015-09-15 13:57     ` Oleg Drokin
  0 siblings, 0 replies; 22+ messages in thread
From: Oleg Drokin @ 2015-09-15 13:57 UTC (permalink / raw)
  To: Trevor Woerner
  Cc: Greg Kroah-Hartman, devel, Andreas Dilger,
	Linux Kernel Mailing List

Hello!

On Sep 15, 2015, at 9:26 AM, Trevor Woerner wrote:

> On 09/14/15 18:41, green@linuxhacker.ru wrote:
>> Reviewed-on: http://review.whamcloud.com/14605
> 
> I'm confused why the patch found in this email doesn't match the patch I
> find when I click on the above link? Some of the patches in this series
> match what I find on your jenkins URLs, and some do not.
> 
> For example, the function call in the body of the "if" below is
> "sysfs_remove_link()" but the function call in the "if" body of the code
> I find at http://review.whamcloud.com/14605 after clicking on the
> "lustre/lmv/lmv_obd.c" link is "lprocfs_remove_proc_entry()".
> 
> Maybe I'm not using your jenkins correctly?

You are using it correctly.

The patch is a "port" from one tree to another, but the pointer is to
the original patch. As the trees have diverged, various differences have accumulated.

The pointer is still useful for seeing which patches have already been
included and which have not.

Bye,
    Oleg


Thread overview: 22+ messages
2015-09-14 22:41 [PATCH 00/19] Lustre fixes green
2015-09-14 22:41 ` [PATCH 01/19] staging/lustre/lnet: Reenable lnet router debugfs green
2015-09-14 22:41 ` [PATCH 02/19] staging/lustre/obdclass: reorganize busy object accounting green
2015-09-14 22:41 ` [PATCH 03/19] staging/lustre/llite: cleanup open handle for client open failure green
2015-09-14 22:41 ` [PATCH 04/19] staging/lustre/llite: strengthen checks for hsm flags and archive id green
2015-09-14 22:41 ` [PATCH 05/19] staging/lustre/ptlrpc: remove LUSTRE_MSG_MAGIC_V1 support green
2015-09-14 22:41 ` [PATCH 06/19] staging/lustre/lmv: fix potential null pointer dereference green
2015-09-15 13:26   ` Trevor Woerner
2015-09-15 13:57     ` Oleg Drokin
2015-09-14 22:41 ` [PATCH 07/19] staging/lustre/llite: deny non-root user for changelog operations green
2015-09-14 22:41 ` [PATCH 08/19] staging/lustre/o2iblnd: connection refcount fix for kiblnd_post_rx green
2015-09-14 22:41 ` [PATCH 09/19] staging/lustre/osc: LBUG in osc_lru_reclaim green
2015-09-14 22:41 ` [PATCH 10/19] staging/lustre/libcfs: minor fix in cfs_hash_for_each_relax() green
2015-09-14 22:41 ` [PATCH 11/19] staging/lustre/lnet: fix deadloop in ksocknal_push green
2015-09-14 22:41 ` [PATCH 12/19] staging/lustre/o2iblnd: wrong uses of kib_tx_t::tx_nfrags green
2015-09-14 22:41 ` [PATCH 13/19] staging/lustre/llite: ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed green
2015-09-14 22:41 ` [PATCH 14/19] staging/lustre/obdclass: Eliminate hash bucket scans in lu_cache_shrink green
2015-09-14 22:41 ` [PATCH 15/19] staging/lustre: Remove unused MAY_ constants green
2015-09-14 22:41 ` [PATCH 16/19] staging/lustre/osc: use global osc_rq_pool to reduce memory usage green
2015-09-14 22:41 ` [PATCH 17/19] staging/lustre/o2iblnd: leak cmid in kiblnd_dev_need_failover green
2015-09-14 22:41 ` [PATCH 18/19] staging/lustre/libcfs: remove unused cfs_timer_done green
2015-09-14 22:41 ` [PATCH 19/19] staging/lustre/ptlrpc: make ptlrpcd threads cpt-aware green
