* [PATCHv1 0/6] rdma controller support
@ 2016-01-05 18:58 Parav Pandit
  2016-01-05 18:58 ` [PATCHv1 1/6] rdmacg: Added rdma cgroup header file Parav Pandit
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Parav Pandit @ 2016-01-05 18:58 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav
This patchset adds support for RDMA cgroup by addressing review comments
of [1] by implementing published RFC [2].
Overview:
Currently user space applications can easily take away all the rdma
device specific resources such as AH, CQ, QP, MR etc. Due to which other
applications in other cgroup or kernel space ULPs may not even get chance
to allocate any rdma resources. This results into service unavailibility.
RDMA cgroup addresses this issue by allowing resource accounting,
limit enforcement on per cgroup, per rdma device basis.
Resources are not defined by the RDMA cgroup. Resources are defined
by RDMA/IB stack & optionally by HCA vendor device drivers.
This allows rdma cgroup to remain constant while RDMA/IB
stack can evolve without the need of rdma cgroup update. A new
resource can be easily added by the RDMA/IB stack without touching
rdma cgroup.
RDMA uverbs layer will enforce limits on well defined RDMA verb
resources without any HCA vendor device driver involvement.
RDMA uverbs layer will not do accounting of hw vendor specific resources.
Instead rdma cgroup provides set of APIs through which vendor specific 
drivers can define their own resources (upto 64) that can be accounted by
rdma cgroup.
Resource limit enforcement is hierarchical.
When process is migrated with active RDMA resources, rdma cgroup
continues to charge original cgroup.
Changes from v0:
(To address comments from Haggai, Doug, Liran, Tejun, Sean, Jason)
 * Redesigned to support per device per cgroup limit settings by bringing
   concept of resource pool.
 * Redesigned to let IB stack define the resources instead of rdma controller
   using resource template.
 * Redesigned to support hw vendor specific limits setting (optional to drivers).
 * Created new rdma controller instead of piggyback on device cgroup.
 * Fixed race conditions for multiple tasks sharing rdma resources.
 * Removed dependency on the task_struct.
[1] https://lkml.org/lkml/2015/9/7/476
[2] https://lkml.org/lkml/2015/10/28/144
This patchset is for Tejun's for-4.5 branch.
It is not attempted on Doug's rdma tree yet, which I will do once I receive
comments for this pathset.
Parav Pandit (6):
  rdmacg: Added rdma cgroup header file
  IB/core: Added members to support rdma cgroup
  rdmacg: implements rdma cgroup
  IB/core: rdmacg support infrastructure APIs
  IB/core: use rdma cgroup for resource accounting
  rdmacg: Added documentation for rdma controller.
 Documentation/cgroup-legacy/rdma.txt  |  129 ++++
 Documentation/cgroup.txt              |   79 +++
 drivers/infiniband/core/Makefile      |    1 +
 drivers/infiniband/core/cgroup.c      |   80 +++
 drivers/infiniband/core/core_priv.h   |    5 +
 drivers/infiniband/core/device.c      |    8 +
 drivers/infiniband/core/uverbs_cmd.c  |  244 ++++++-
 drivers/infiniband/core/uverbs_main.c |   30 +
 include/linux/cgroup_rdma.h           |   91 +++
 include/linux/cgroup_subsys.h         |    4 +
 include/rdma/ib_verbs.h               |   20 +
 init/Kconfig                          |   12 +
 kernel/Makefile                       |    1 +
 kernel/cgroup_rdma.c                  | 1220 +++++++++++++++++++++++++++++++++
 14 files changed, 1907 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/cgroup-legacy/rdma.txt
 create mode 100644 drivers/infiniband/core/cgroup.c
 create mode 100644 include/linux/cgroup_rdma.h
 create mode 100644 kernel/cgroup_rdma.c
-- 
1.8.3.1
^ permalink raw reply	[flat|nested] 39+ messages in thread
* [PATCHv1 1/6] rdmacg: Added rdma cgroup header file
  2016-01-05 18:58 [PATCHv1 0/6] rdma controller support Parav Pandit
@ 2016-01-05 18:58 ` Parav Pandit
  2016-01-05 18:58 ` [PATCHv1 2/6] IB/core: Added members to support rdma cgroup Parav Pandit
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Parav Pandit @ 2016-01-05 18:58 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav
Added rdma cgroup header file which defines its APIs to perform
charing/uncharing functionality.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 include/linux/cgroup_rdma.h | 91 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 include/linux/cgroup_rdma.h
diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
new file mode 100644
index 0000000..01d220f
--- /dev/null
+++ b/include/linux/cgroup_rdma.h
@@ -0,0 +1,91 @@
+#ifndef _CGROUP_RDMA_H
+#define _CGROUP_RDMA_H
+
+/*
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+enum rdmacg_resource_pool_type {
+	RDMACG_RESOURCE_POOL_VERB,
+	RDMACG_RESOURCE_POOL_HW,
+	RDMACG_RESOURCE_POOL_TYPE_MAX,
+};
+
+struct ib_device;
+struct pid;
+struct match_token;
+
+#ifdef CONFIG_CGROUP_RDMA
+#define RDMACG_MAX_RESOURCE_INDEX (64)
+
+struct rdmacg_pool_info {
+	struct match_token *resource_table;
+	int resource_count;
+};
+
+struct rdmacg_resource_pool_ops {
+	struct rdmacg_pool_info*
+		(*get_resource_pool_tokens)(struct ib_device *);
+};
+
+/* APIs for RDMA/IB subsystem to publish when a device wants to
+ * participate in resource accounting
+ */
+void rdmacg_register_ib_device(struct ib_device *device);
+void rdmacg_unregister_ib_device(struct ib_device *device);
+
+/* APIs for RDMA/IB subsystem to charge/uncharge pool specific resources */
+int rdmacg_try_charge_resource(struct ib_device *device,
+			       struct pid *pid,
+			       enum rdmacg_resource_pool_type type,
+			       int resource_index,
+			       int num);
+void rdmacg_uncharge_resource(struct ib_device *device,
+			      struct pid *pid,
+			      enum rdmacg_resource_pool_type type,
+			      int resource_index,
+			      int num);
+
+void rdmacg_set_rpool_ops(struct ib_device *device,
+			  enum rdmacg_resource_pool_type pool_type,
+			  struct rdmacg_resource_pool_ops *ops);
+void rdmacg_clear_rpool_ops(struct ib_device *device,
+			    enum rdmacg_resource_pool_type pool_type);
+int rdmacg_query_resource_limit(struct ib_device *device,
+				struct pid *pid,
+				enum rdmacg_resource_pool_type type,
+				int *limits, int max_count);
+#else
+/* APIs for RDMA/IB subsystem to charge/uncharge device specific resources */
+static inline
+int rdmacg_try_charge_resource(struct ib_device *device,
+			       struct pid *pid,
+			       enum rdmacg_resource_pool_type type,
+			       int resource_index,
+			       int num)
+{ return 0; }
+
+static inline void rdmacg_uncharge_resource(struct ib_device *device,
+					    struct pid *pid,
+					    enum rdmacg_resource_pool_type type,
+					    int resource_index,
+					    int num)
+{ }
+
+static inline
+int rdmacg_query_resource_limit(struct ib_device *device,
+				struct pid *pid,
+				enum rdmacg_resource_pool_type type,
+				int *limits, int max_count)
+{
+	int i;
+
+	for (i = 0; i < max_count; i++)
+		limits[i] = S32_MAX;
+
+	return 0;
+}
+#endif	/* CONFIG_CGROUP_RDMA */
+#endif	/* _CGROUP_RDMA_H */
-- 
1.8.3.1
^ permalink raw reply related	[flat|nested] 39+ messages in thread
* [PATCHv1 2/6] IB/core: Added members to support rdma cgroup
  2016-01-05 18:58 [PATCHv1 0/6] rdma controller support Parav Pandit
  2016-01-05 18:58 ` [PATCHv1 1/6] rdmacg: Added rdma cgroup header file Parav Pandit
@ 2016-01-05 18:58 ` Parav Pandit
       [not found]   ` <1452020286-9508-3-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2016-01-05 18:58 ` [PATCHv1 4/6] IB/core: rdmacg support infrastructure APIs Parav Pandit
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-05 18:58 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav
Added function pointer table to store resource pool specific
operation for each resource type (verb and hw).
Added list node to link device to rdma cgroup so that it can
participate in resource accounting and limit configuration.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 include/rdma/ib_verbs.h | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9a68a19..1a17249 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -51,6 +51,7 @@
 #include <linux/socket.h>
 #include <uapi/linux/if_ether.h>
 
+#include <linux/cgroup_rdma.h>
 #include <linux/atomic.h>
 #include <linux/mmu_notifier.h>
 #include <asm/uaccess.h>
@@ -1823,6 +1824,12 @@ struct ib_device {
 	u8                           node_type;
 	u8                           phys_port_cnt;
 
+#ifdef CONFIG_CGROUP_RDMA
+	struct rdmacg_resource_pool_ops
+				*rpool_ops[RDMACG_RESOURCE_POOL_TYPE_MAX];
+	struct list_head             rdmacg_list;
+#endif
+
 	/**
 	 * The following mandatory functions are used only at device
 	 * registration.  Keep functions such as these at the end of this
-- 
1.8.3.1
^ permalink raw reply related	[flat|nested] 39+ messages in thread
* [PATCHv1 3/6] rdmacg: implements rdma cgroup
       [not found] ` <1452020286-9508-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-01-05 18:58   ` Parav Pandit
  2016-01-05 22:01     ` Tejun Heo
  2016-01-05 21:56   ` [PATCHv1 0/6] rdma controller support Tejun Heo
  1 sibling, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-05 18:58 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA, liranl-VPRAkNaXOzVWk0Htik3J/w,
	sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	haggaie-VPRAkNaXOzVWk0Htik3J/w
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	pandit.parav-Re5JQEeQqe8AvxtiuMwx3w
Adds RDMA controller to limit the number of RDMA resources that can be
consumed by processes of a rdma cgroup.
RDMA resources are global resource that can be exhauasted without
reaching any kmemcg or other policy. RDMA cgroup implementation allows
limiting RDMA/IB well defined resources to be limited per cgroup.
RDMA resources are tracked using resource pool. Resource pool is per
device, per cgroup, per resource pool_type entity which allows setting
up accounting limits on per device basis.
RDMA cgroup returns error when user space applications try to allocate
resources more than its configured limit.
Rdma cgroup implements resource accounting for two types of resource
pools.
(a) RDMA IB specification level verb resources defined by IB stack
(b) HCA vendor device specific resources defined by vendor device driver
Resources are not defined by the RDMA cgroup, instead they are defined
by the external module, typically IB stack and optionally by HCA drivers
for those RDMA devices which doesn't have one to one mapping of IB verb
resource with hardware resource.
Signed-off-by: Parav Pandit <pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 include/linux/cgroup_subsys.h |    4 +
 init/Kconfig                  |   12 +
 kernel/Makefile               |    1 +
 kernel/cgroup_rdma.c          | 1220 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1237 insertions(+)
 create mode 100644 kernel/cgroup_rdma.c
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 0df0336a..d0e597c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -56,6 +56,10 @@ SUBSYS(hugetlb)
 SUBSYS(pids)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_RDMA)
+SUBSYS(rdma)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f8754f5..f8055f5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1070,6 +1070,18 @@ config CGROUP_PIDS
 	  since the PIDs limit only affects a process's ability to fork, not to
 	  attach to a cgroup.
 
+config CGROUP_RDMA
+	bool "RDMA controller"
+	help
+	  Provides enforcement of RDMA resources at RDMA/IB verb level and
+	  enforcement of any RDMA/IB capable hardware advertized resources.
+	  Its fairly easy for applications to exhaust RDMA resources, which
+	  can result into kernel consumers or other application consumers of
+	  RDMA resources left with no resources. RDMA controller is designed
+	  to stop this from happening.
+	  Attaching existing processes with active RDMA resources to the cgroup
+	  hierarchy will be allowed even if can cross the hierarchy's limit.
+
 config CGROUP_FREEZER
 	bool "Freezer controller"
 	help
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf00..26e413c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,6 +57,7 @@ obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
+obj-$(CONFIG_CGROUP_RDMA) += cgroup_rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_rdma.c b/kernel/cgroup_rdma.c
new file mode 100644
index 0000000..14c6fab
--- /dev/null
+++ b/kernel/cgroup_rdma.c
@@ -0,0 +1,1220 @@
+/*
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/threads.h>
+#include <linux/pid.h>
+#include <linux/spinlock.h>
+#include <linux/parser.h>
+#include <rdma/ib_verbs.h>
+#include <linux/atomic.h>
+#include <linux/seq_file.h>
+#include <linux/hashtable.h>
+#include <linux/cgroup.h>
+#include <linux/cgroup_rdma.h>
+
+enum rdmacg_file_type {
+	RDMACG_VERB_RESOURCE_LIMIT,
+	RDMACG_VERB_RESOURCE_USAGE,
+	RDMACG_VERB_RESOURCE_FAILCNT,
+	RDMACG_VERB_RESOURCE_LIST,
+	RDMACG_HW_RESOURCE_LIMIT,
+	RDMACG_HW_RESOURCE_USAGE,
+	RDMACG_HW_RESOURCE_FAILCNT,
+	RDMACG_HW_RESOURCE_LIST,
+};
+
+#define RDMACG_USR_CMD_REMOVE "remove"
+
+/* resource tracker per resource for rdma cgroup */
+struct cg_resource {
+	atomic_t usage;
+	int limit;
+	atomic_t failcnt;
+};
+
+/**
+ * pool type indicating either it got created as part of default
+ * operation or user has configured the group.
+ * Depends on the creator of the pool, its decided to free up
+ * later or not.
+ */
+enum rpool_creator {
+	RDMACG_RPOOL_CREATOR_DEFAULT,
+	RDMACG_RPOOL_CREATOR_USR,
+};
+
+/**
+ * resource pool object which represents, per cgroup, per device,
+ * per resource pool_type resources.
+ */
+struct cg_resource_pool {
+	struct list_head cg_list;
+	struct ib_device *device;
+	enum rdmacg_resource_pool_type type;
+
+	struct cg_resource *resources;
+
+	atomic_t refcnt;	/* count active user tasks of this pool */
+	atomic_t creator;	/* user created or default type */
+};
+
+struct rdma_cgroup {
+	struct cgroup_subsys_state	css;
+
+	struct list_head rpool_head;	/* head to keep track of all resource
+					 * pools that belongs to this cgroup.
+					 */
+	spinlock_t	cg_list_lock;	/* protects cgroup resource pool list */
+
+};
+
+/* hash table to keep map of tgid to owner cgroup */
+DEFINE_HASHTABLE(pid_cg_map_tbl, 7);
+DEFINE_SPINLOCK(pid_cg_map_lock);	/* lock to protect hash table access */
+
+/* Keeps mapping of pid to its owning cgroup at rdma level,
+ * This mapping doesn't change, even if process migrates from one to other
+ * rdma cgroup.
+ */
+struct pid_cg_map {
+	struct pid *pid;		/* hash key */
+	struct rdma_cgroup *cg;
+
+	struct hlist_node hlist;	/* pid to cgroup hash table link */
+	atomic_t refcnt;		/* count active user tasks to figure out
+					 * when to free the memory
+					 */
+};
+
+static DEFINE_MUTEX(dev_mutex);
+static LIST_HEAD(dev_list);
+
+static struct rdma_cgroup *css_rdmacg(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct rdma_cgroup, css);
+}
+
+static struct rdma_cgroup *parent_rdmacg(struct rdma_cgroup *cg)
+{
+	return css_rdmacg(cg->css.parent);
+}
+
+static inline struct rdma_cgroup *task_rdmacg(struct task_struct *task)
+{
+	return css_rdmacg(task_css(task, rdma_cgrp_id));
+}
+
+static struct rdmacg_resource_pool_ops*
+	get_pool_ops(struct ib_device *device,
+		     enum rdmacg_resource_pool_type pool_type)
+{
+	return device->rpool_ops[pool_type];
+}
+
+static inline void set_resource_limit(struct cg_resource_pool *rpool,
+				      int index,
+				      int new_limit)
+{
+	rpool->resources[index].limit = new_limit;
+}
+
+static void _free_cg_rpool(struct cg_resource_pool *rpool)
+{
+	kfree(rpool->resources);
+	kfree(rpool);
+}
+
+static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
+			      struct cg_resource_pool *rpool)
+{
+	spin_lock(&cg->cg_list_lock);
+
+	/* if its started getting used by other task,
+	 * before we take the spin lock, then skip,
+	 * freeing it.
+	 */
+	if (atomic_read(&rpool->refcnt) == 0) {
+		list_del_init(&rpool->cg_list);
+		spin_unlock(&cg->cg_list_lock);
+
+		_free_cg_rpool(rpool);
+		return;
+	}
+	spin_unlock(&cg->cg_list_lock);
+}
+
+static void dealloc_cg_rpool(struct rdma_cgroup *cg,
+			     struct cg_resource_pool *rpool)
+{
+	/* Don't free the resource pool which is created by the
+	 * user, otherwise we miss the configured limits. We don't
+	 * gain much either by splitting storage of limit and usage.
+	 * So keep it around until user deletes the limits.
+	 */
+	if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
+		_dealloc_cg_rpool(cg, rpool);
+}
+
+static void put_cg_rpool(struct rdma_cgroup *cg,
+			 struct cg_resource_pool *rpool)
+{
+	if (atomic_dec_and_test(&rpool->refcnt))
+		dealloc_cg_rpool(cg, rpool);
+}
+
+static struct cg_resource_pool*
+	alloc_cg_rpool(struct rdma_cgroup *cg,
+		       struct ib_device *device,
+		       int count,
+		       enum rdmacg_resource_pool_type type)
+{
+	struct cg_resource_pool *rpool;
+	int i, ret;
+
+	rpool = kzalloc(sizeof(*rpool), GFP_KERNEL);
+	if (!rpool) {
+		ret = -ENOMEM;
+		goto err;
+	}
+	rpool->resources = kcalloc(count, sizeof(*rpool->resources),
+				   GFP_KERNEL);
+	if (!rpool->resources) {
+		ret = -ENOMEM;
+		goto alloc_err;
+	}
+
+	/* set pool ownership and type, so that it can be freed correctly */
+	rpool->device = device;
+	rpool->type = type;
+	INIT_LIST_HEAD(&rpool->cg_list);
+	atomic_set(&rpool->creator, RDMACG_RPOOL_CREATOR_DEFAULT);
+
+	for (i = 0; i < count; i++)
+		set_resource_limit(rpool, i, S32_MAX);
+
+	return rpool;
+
+alloc_err:
+	kfree(rpool);
+err:
+	return ERR_PTR(ret);
+}
+
+static struct cg_resource_pool*
+	find_cg_rpool(struct rdma_cgroup *cg,
+		      struct ib_device *device,
+		      enum rdmacg_resource_pool_type type)
+
+{
+	struct cg_resource_pool *pool;
+
+	list_for_each_entry(pool, &cg->rpool_head, cg_list)
+		if (pool->device == device && pool->type == type)
+			return pool;
+
+	return NULL;
+}
+
+static struct cg_resource_pool*
+	_get_cg_rpool(struct rdma_cgroup *cg,
+		      struct ib_device *device,
+		      enum rdmacg_resource_pool_type type)
+{
+	struct cg_resource_pool *rpool;
+
+	spin_lock(&cg->cg_list_lock);
+	rpool = find_cg_rpool(cg, device, type);
+	if (rpool)
+		goto found;
+found:
+	spin_unlock(&cg->cg_list_lock);
+	return rpool;
+}
+
+/**
+ * get_cg_rpool - get pid_cg map reference.
+ * @cg: cgroup for which resouce pool to be allocated.
+ * @device: device for which to allocate resource pool.
+ * @type: type of the resource pool.
+ *
+ * It searches a cgroup resource pool object for a given pid, if it finds
+ * it would return it. If none is present, it would allocate
+ * new resource pool entry of default type.
+ * Returns resource pool on success, else return error pointer or null.
+ */
+static struct cg_resource_pool*
+	get_cg_rpool(struct rdma_cgroup *cg,
+		     struct ib_device *device,
+		     enum rdmacg_resource_pool_type type)
+{
+	struct cg_resource_pool *rpool, *other_rpool;
+	struct rdmacg_pool_info *pool_info;
+	struct rdmacg_resource_pool_ops *ops;
+	int ret = 0;
+
+	spin_lock(&cg->cg_list_lock);
+	rpool = find_cg_rpool(cg, device, type);
+	if (rpool) {
+		atomic_inc(&rpool->refcnt);
+		spin_unlock(&cg->cg_list_lock);
+		return rpool;
+	}
+	spin_unlock(&cg->cg_list_lock);
+
+	/* ops cannot be NULL at this stage, as caller made to charge/get
+	 * the resource pool being aware of such need and invoking with
+	 * because it has setup resource pool ops.
+	 */
+	ops = get_pool_ops(device, type);
+	pool_info = ops->get_resource_pool_tokens(device);
+	if (!pool_info) {
+		ret = -EINVAL;
+		goto err;
+	}
+	if (pool_info->resource_count == 0 ||
+	    pool_info->resource_count > RDMACG_MAX_RESOURCE_INDEX) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	/* allocate resource pool */
+	rpool = alloc_cg_rpool(cg, device, pool_info->resource_count, type);
+	if (IS_ERR_OR_NULL(rpool))
+		return rpool;
+
+	/* cgroup lock is held to synchronize with multiple
+	 * resource pool creation in parallel.
+	 */
+	spin_lock(&cg->cg_list_lock);
+	other_rpool = find_cg_rpool(cg, device, type);
+	/* if other task added resource pool for this device for this cgroup
+	 * free up which was recently created and use the one we found.
+	 */
+	if (other_rpool) {
+		atomic_inc(&other_rpool->refcnt);
+		spin_unlock(&cg->cg_list_lock);
+		_free_cg_rpool(rpool);
+		return other_rpool;
+	}
+
+	atomic_inc(&rpool->refcnt);
+	list_add_tail(&rpool->cg_list, &cg->rpool_head);
+
+	spin_unlock(&cg->cg_list_lock);
+	return rpool;
+
+err:
+	spin_unlock(&cg->cg_list_lock);
+	return ERR_PTR(ret);
+}
+
+static struct pid_cg_map *find_owner_cg_for_pid(struct pid *pid)
+{
+	struct pid_cg_map *map;
+	unsigned long key = (unsigned long)pid;
+
+	hash_for_each_possible(pid_cg_map_tbl, map, hlist, key)
+		if (map->pid == pid)
+			return map;
+
+	return NULL;
+}
+
+/**
+ * put_pid_cg_map - release pid_cg map reference.
+ * @map: pointer to map
+ *
+ * It free the map object if this is the last reference,
+ * otherwise just decrement the refcnt.
+ */
+static void put_pid_cg_map(struct pid_cg_map *map)
+{
+	struct rdma_cgroup *cg = map->cg;
+	struct pid_cg_map *free_map = NULL;
+
+	if (atomic_dec_and_test(&map->refcnt)) {
+		spin_lock(&pid_cg_map_lock);
+		if (atomic_read(&map->refcnt) == 0) {
+			/* read the refcnt again to make sure, if its
+			 * already in use before above spin lock
+			 * is taken by other task, than dont free it.
+			 */
+			hash_del(&map->hlist);
+			free_map = map;
+		}
+		spin_unlock(&pid_cg_map_lock);
+		if (free_map) {
+			/* now that pid is getting detached from cg,
+			 * drop the css ref count.
+			 */
+			css_put(&cg->css);
+			kfree(free_map);
+		}
+	}
+}
+
+/**
+ * _get_pid_cg_map - get the rdma cgroup from pid
+ * @cg: pointer to cgroup
+ *
+ * It search and return a cgroup map for a given pid, if it finds
+ * it would return it. If none is present, it would allocate
+ * new map entry.
+ * Returns map object on success, else return error pointer or null.
+ */
+static struct pid_cg_map *_get_pid_cg_map(struct pid *pid)
+{
+	struct pid_cg_map *map;
+
+	spin_lock(&pid_cg_map_lock);
+	map = find_owner_cg_for_pid(pid);
+	spin_unlock(&pid_cg_map_lock);
+
+	/* if we dont find a map for a given pid, its either
+	 * bug in caller with wrong pid, or bug in rdma
+	 * controller.
+	 */
+	WARN_ON(!map);
+	return map;
+}
+
+/**
+ * get_pid_cg_map - get the rdma cgroup from pid
+ * @cg: pointer to cgroup
+ *
+ * It searches a cgroup map object for a given pid, if it finds
+ * it would return it. If none is present, it would allocate
+ * new map entry.
+ * Returns map object on success, else return error pointer or null.
+ */
+static struct pid_cg_map *get_pid_cg_map(struct pid *pid)
+{
+	struct pid_cg_map *map, *other_map;
+
+	spin_lock(&pid_cg_map_lock);
+
+	map = find_owner_cg_for_pid(pid);
+	if (map)
+		atomic_inc(&map->refcnt);
+
+	spin_unlock(&pid_cg_map_lock);
+
+	/* existing cg found, as most likely the case
+	 * after first resource allocation.
+	 */
+	if (likely(map))
+		return map;
+
+	/* cg not found for this tgid */
+	map = kzalloc(sizeof(*map), GFP_KERNEL);
+	if (!map)
+		return NULL;
+	map->pid = pid;
+	INIT_HLIST_NODE(&map->hlist);
+
+	/* always consider the cgroup of calling process to map to,
+	 * as pid's cgroup and pid's thread leader might have been already
+	 * terminated by now.
+	 */
+	map->cg = task_rdmacg(current);
+
+	/* search again, in case if other process has perform binding, before
+	 * this task can do.
+	 */
+	spin_lock(&pid_cg_map_lock);
+	other_map = find_owner_cg_for_pid(pid);
+	if (other_map) {
+		/* found the map added by other task, reuse it */
+		atomic_inc(&other_map->refcnt);
+		spin_unlock(&pid_cg_map_lock);
+		kfree(map);
+
+		map = other_map;
+	} else {
+		hash_add(pid_cg_map_tbl, &map->hlist, (unsigned long)pid);
+		atomic_inc(&map->refcnt);
+		spin_unlock(&pid_cg_map_lock);
+
+		/* Hold the reference to css for every new map added.
+		 * so that even if task migrate to new cgroup, we hold the css.
+		 * We hold the reference to css, until rdma resources
+		 * are bound to it. This can keep the css to be around
+		 * for longer time, but it ensures that resource
+		 * charing/uncharging happens to right cgroup,
+		 * even when parent resource creator task is terminated
+		 * (and cgroup also might have been removed, but css is not).
+		 */
+		css_get(&map->cg->css);
+	}
+	return map;
+}
+
+/**
+ * uncharge_cg_resource - hierarchically uncharge resource for rdma cgroup
+ * @cg: pointer to cg to uncharge and all parents in hierarchy
+ * @device: pointer to ib device
+ * @type: the type of resource pool to uncharge
+ * @index: index of the resource to uncharge in cg (resource pool)
+ * @num: the number of rdma resource to uncharge
+ *
+ * It also frees the resource pool in the hierarchy for the resource pool
+ * of default type which was created as part of charing operation.
+ */
+static void uncharge_cg_resource(struct rdma_cgroup *cg,
+				 struct ib_device *device,
+				 enum rdmacg_resource_pool_type type,
+				 int index,
+				 int num)
+{
+	struct cg_resource_pool *rpool;
+
+	rpool = _get_cg_rpool(cg, device, type);
+	/*
+	 * A negative count (or overflow) is invalid,
+	 * and indicates a bug in the device rdma controller.
+	 */
+	WARN_ON_ONCE(atomic_add_negative(-num,
+					 &rpool->resources[index].usage));
+	put_cg_rpool(cg, rpool);
+}
+
+/**
+ * rdmacg_uncharge_resource - hierarchically uncharge rdma resource count
+ * @device: pointer to ib device
+ * @pid: thread group pid which will owns the resource
+ * @type: the type of resource pool to charge
+ * @index: index of the resource to uncharge in cg in given resource pool type
+ * @num: the number of rdma resource to uncharge
+ *
+ */
+void rdmacg_uncharge_resource(struct ib_device *device,
+			      struct pid *pid,
+			      enum rdmacg_resource_pool_type type,
+			      int index,
+			      int num)
+{
+	struct pid_cg_map *map;
+	struct rdma_cgroup *cg, *p;
+
+	if (!num)
+		return;
+
+	map = _get_pid_cg_map(pid);
+	cg = map->cg;
+
+	for (p = cg; p; p = parent_rdmacg(p))
+		uncharge_cg_resource(p, device, type, index, num);
+
+	put_pid_cg_map(map);
+}
+EXPORT_SYMBOL(rdmacg_uncharge_resource);
+
+/**
+ * try_charge_resource - hierarchically try to charge given resource
+ * @cg: pointer to cg to charge and all parents in hierarchy
+ * @device: pointer to ib device
+ * @type: the type of resource pool to charge
+ * @index: index of the resource to charge in cg (resource pool)
+ * @num: the number of rdma resource to charge
+ *
+ * This function follows the set limit. It will fail if the charge would cause
+ * the new value to exceed the hierarchical limit.
+ * Returns 0 if the charge succeded, otherwise -EAGAIN.
+ *
+ * It allocates resource pool in the hierarchy for each parent it come
+ * across for first resource. Later on resource pool will be available.
+ * Therefore it will be much faster thereon to charge/uncharge.
+ */
+static int try_charge_resource(struct rdma_cgroup *cg,
+			       struct ib_device *device,
+			       enum rdmacg_resource_pool_type type,
+			       int index,
+			       int num)
+{
+	struct cg_resource_pool *rpool;
+	struct rdma_cgroup *p, *q;
+	int ret;
+
+	for (p = cg; p; p = parent_rdmacg(p)) {
+		s64 new;
+
+		rpool = get_cg_rpool(p, device, type);
+		if (IS_ERR_OR_NULL(rpool)) {
+			ret = PTR_ERR(rpool);
+			goto err;
+		}
+
+		new = atomic_add_return(num, &rpool->resources[index].usage);
+		if (new > rpool->resources[index].limit) {
+			atomic_inc(&rpool->resources[index].failcnt);
+			ret = -EAGAIN;
+			goto revert;
+		}
+	}
+	return 0;
+
+revert:
+	uncharge_cg_resource(p, device, type, index, num);
+err:
+	for (q = cg; q != p; q = parent_rdmacg(q))
+		uncharge_cg_resource(q, device, type, index, num);
+	return ret;
+}
+
+/**
+ * rdmacg_try_charge_resource - hierarchically try to charge
+ * the rdma resource count
+ * @device: pointer to ib device
+ * @pid: thread group pid which will own the resource until its
+ *       destroyed.
+ * @type: the type of resource pool to charge
+ * @index: index of the resource to charge in cg (resource pool)
+ * @num: the number of rdma resource to charge
+ *
+ * This function follows charing resource in hierarchical way.
+ * It will fail if the charge would cause the new value to exceed the
+ * hierarchical limit.
+ * Returns 0 if the charge succeded, otherwise -EAGAIN, -ENOMEM or -EINVAL.
+ */
+int rdmacg_try_charge_resource(struct ib_device *device,
+			       struct pid *pid,
+			       enum rdmacg_resource_pool_type type,
+			       int index,
+			       int num)
+{
+	struct rdma_cgroup *cg;
+	struct pid_cg_map *map;
+	int status;
+
+	map = get_pid_cg_map(pid);
+	if (!map)
+		return -ENOMEM;
+	cg = map->cg;
+
+	/* Charger needs to account resources on three criteria.
+	 * (a) per cgroup (b) per device & (c) per resource type resource usage.
+	 * Per cgroup resource usage ensures that tasks of cgroup doesn't cross
+	 * the configured limits.
+	 * Per device provides granular configuration in multi device usage.
+	 * Per resource type allows resource charing for multiple
+	 * category of resources (at present two, (a) verb level & (b) hw driver
+	 * define.
+	 */
+
+	/* charge the cgroup resource pool */
+	status = try_charge_resource(cg, device, type, index, num);
+	if (status)
+		goto charge_err;
+
+	return 0;
+
+charge_err:
+	put_pid_cg_map(map);
+	return status;
+}
+EXPORT_SYMBOL(rdmacg_try_charge_resource);
+
+/**
+ * rdmacg_register_ib_device - register the ib device from rdma cgroup.
+ * @device: pointer to ib device whose resources need to be accounted.
+ *
+ * If IB subsystem wish a device to participate in rdma cgroup resource
+ * tracking, it must invoke this API to register with rdma cgroup before
+ * any user space application can start using the RDMA resources. IB subsystem
+ * and/or HCA driver must invoke rdmacg_set_rpool_ops() either for verb or
+ * for hw or for both the types as they are mandetory operations to have
+ * to register with rdma cgroup.
+ *
+ */void rdmacg_register_ib_device(struct ib_device *device)
+{
+	INIT_LIST_HEAD(&device->rdmacg_list);
+
+	mutex_lock(&dev_mutex);
+	list_add_tail(&device->rdmacg_list, &dev_list);
+	mutex_unlock(&dev_mutex);
+}
+EXPORT_SYMBOL(rdmacg_register_ib_device);
+
+/**
+ * rdmacg_unregister_ib_device - unregister the ib device from rdma cgroup.
+ * @device: pointer to ib device which was previously registered with rdma
+ *          cgroup using rdmacg_register_ib_device().
+ *
+ * IB subsystem must invoke this after all the resources of the IB device
+ * are destroyed and after ensuring that no more resources will be created
+ * when this API is invoked.
+ */
+void rdmacg_unregister_ib_device(struct ib_device *device)
+{
+	/* Synchronize with any active resource settings,
+	 * usage query happening via configfs.
+	 * At this stage, there should not be any active resource pools
+	 * for this device, as RDMA/IB subsystem is expected to shutdown,
+	 * tear down all the applications and free up resources.
+	 */
+	mutex_lock(&dev_mutex);
+	list_del_init(&device->rdmacg_list);
+	mutex_unlock(&dev_mutex);
+}
+EXPORT_SYMBOL(rdmacg_unregister_ib_device);
+
+/**
+ * rdmacg_set_rpool_ops - helper function to set the resource pool
+ *                        call back function to provide matching
+ *                        string tokens to for defining names of the
+ *                        resources.
+ * @device: pointer to ib device for which to set the resource pool
+ *          operations.
+ * @pool_type: resource pool type, either VERBS type or HW type.
+ * @ops: pointer to function pointers to return (a) matching token
+ *       which will be invoked to find out string tokens and
+ *       (b) to find out maximum resource limits that is supported
+ *       by each device of given resource pool type.
+ *
+ * This helper function allows setting resouce pool specific operation
+ * callback functions.
+ * It must be called one by respective subsystem that implements resource
+ * definition of rdma and owns the task of charging/uncharing the resource.
+ * This must be called before ib device is registered with the rdma cgroup
+ * using rdmacg_register_ib_device().
+ */
+void rdmacg_set_rpool_ops(struct ib_device *device,
+			  enum rdmacg_resource_pool_type pool_type,
+			  struct rdmacg_resource_pool_ops *ops)
+{
+	device->rpool_ops[pool_type] = ops;
+}
+EXPORT_SYMBOL(rdmacg_set_rpool_ops);
+
+/**
+ * rdmacg_clear_rpool_ops -
+ * helper function to clear the resource pool ops which was setup
+ * before using rdmacg_set_rpool_ops.
+ * @device: pointer to ib device for which to clear the resource pool
+ *          operations.
+ * @pool_type: resource pool type, either VERBS type or HW type.
+ *
+ * This must be called after deregistering ib device using rdma cgroup
+ * using rdmacg_unregister_ib_device().
+ */
+void rdmacg_clear_rpool_ops(struct ib_device *device,
+			    enum rdmacg_resource_pool_type pool_type)
+{
+	device->rpool_ops[pool_type] = NULL;
+}
+EXPORT_SYMBOL(rdmacg_clear_rpool_ops);
+
+/**
+ * rdmacg_query_resource_limit - query the resource limits that
+ * might have been configured by the user.
+ * @device: pointer to ib device
+ * @pid: thread group pid that wants to know the resource limits
+ *       of the cgroup.
+ * @type: the type of resource pool to know the limits of.
+ * @limits: pointer to an array of limits where rdma cg will provide
+ *          the configured limits of the cgroup.
+ * @limits_array_len: number of array elements to be filled up at limits.
+ *
+ * This function follows charing resource in hierarchical way.
+ * It will fail if the charge would cause the new value to exceed the
+ * hierarchical limit.
+ * Returns 0 if the charge succeded, otherwise appropriate error code.
+ */
+int rdmacg_query_resource_limit(struct ib_device *device,
+				struct pid *pid,
+				enum rdmacg_resource_pool_type type,
+				int *limits, int limits_array_len)
+{
+	struct rdma_cgroup *cg, *p, *q;
+	struct task_struct *task;
+	struct cg_resource_pool *rpool;
+	struct rdmacg_pool_info *pool_info;
+	struct rdmacg_resource_pool_ops *ops;
+	int i, status = 0;
+
+	task = pid_task(pid, PIDTYPE_PID);
+	cg = task_rdmacg(task);
+
+	ops = get_pool_ops(device, type);
+	if (!ops) {
+		status = -EINVAL;
+		goto err;
+	}
+	pool_info = ops->get_resource_pool_tokens(device);
+	if (!pool_info) {
+		status = -EINVAL;
+		goto err;
+	}
+	if (pool_info->resource_count == 0 ||
+	    pool_info->resource_count > RDMACG_MAX_RESOURCE_INDEX ||
+	    pool_info->resource_count < limits_array_len) {
+		status = -EINVAL;
+		goto err;
+	}
+
+	/* initialize to max */
+	for (i = 0; i < limits_array_len; i++)
+		limits[i] = S32_MAX;
+
+	/* check in hirerchy which pool get the least amount of
+	 * resource limits.
+	 */
+	for (p = cg; p; p = parent_rdmacg(p)) {
+		/* get handle to cgroups rpool */
+		rpool = get_cg_rpool(cg, device, type);
+		if (IS_ERR_OR_NULL(rpool))
+			goto rpool_err;
+
+		for (i = 0; i < limits_array_len; i++)
+			limits[i] = min_t(int, limits[i],
+					rpool->resources[i].limit);
+
+		put_cg_rpool(cg, rpool);
+	}
+	return 0;
+
+rpool_err:
+	for (q = cg; q != p; q = parent_rdmacg(q))
+		put_cg_rpool(cg, _get_cg_rpool(cg, device, type));
+err:
+	return status;
+}
+EXPORT_SYMBOL(rdmacg_query_resource_limit);
+
+static int rdmacg_parse_limits(char *options, struct match_token *opt_tbl,
+			       int *new_limits, u64 *enables)
+{
+	substring_t argstr[MAX_OPT_ARGS];
+	const char *c;
+	int err = -ENOMEM;
+
+	/* parse resource options */
+	while ((c = strsep(&options, " ")) != NULL) {
+		int token, intval, ret;
+
+		err = -EINVAL;
+		token = match_token((char *)c, opt_tbl, argstr);
+		if (token < 0)
+			goto err;
+
+		ret = match_int(&argstr[0], &intval);
+		if (ret < 0) {
+			pr_err("bad value (not int) at '%s'\n", c);
+			goto err;
+		}
+		new_limits[token] = intval;
+		*enables |= BIT(token);
+	}
+	return 0;
+
+err:
+	return err;
+}
+
+static enum rdmacg_resource_pool_type of_to_pool_type(int of_type)
+{
+	enum rdmacg_resource_pool_type pool_type;
+
+	switch (of_type) {
+	case RDMACG_VERB_RESOURCE_LIMIT:
+	case RDMACG_VERB_RESOURCE_USAGE:
+	case RDMACG_VERB_RESOURCE_FAILCNT:
+	case RDMACG_VERB_RESOURCE_LIST:
+		pool_type = RDMACG_RESOURCE_POOL_VERB;
+		break;
+	case RDMACG_HW_RESOURCE_LIMIT:
+	case RDMACG_HW_RESOURCE_USAGE:
+	case RDMACG_HW_RESOURCE_FAILCNT:
+	case RDMACG_HW_RESOURCE_LIST:
+	default:
+		pool_type = RDMACG_RESOURCE_POOL_HW;
+		break;
+	};
+	return pool_type;
+}
+
+static struct ib_device *_rdmacg_get_device(const char *name)
+{
+	struct ib_device *device;
+
+	list_for_each_entry(device, &dev_list, rdmacg_list)
+		if (!strncmp(name, device->name, IB_DEVICE_NAME_MAX))
+			return device;
+
+	return NULL;
+}
+
+static void remove_unused_cg_rpool(struct rdma_cgroup *cg,
+				   struct ib_device *device,
+				   enum rdmacg_resource_pool_type type,
+				   int count)
+{
+	struct cg_resource_pool *rpool = NULL;
+	int i;
+
+	spin_lock(&cg->cg_list_lock);
+	rpool = find_cg_rpool(cg, device, type);
+	if (!rpool) {
+		spin_unlock(&cg->cg_list_lock);
+		return;
+	}
+	/* found the resource pool, check now what to do
+	 * based on its reference count.
+	 */
+	if (atomic_read(&rpool->refcnt) == 0) {
+		/* if there is no active user of the rpool,
+		 * free the memory, default group will get
+		 * allocated automatically when new resource
+		 * is created.
+		 */
+		list_del_init(&rpool->cg_list);
+
+		spin_unlock(&cg->cg_list_lock);
+
+		_free_cg_rpool(rpool);
+	} else {
+		/* if there are active processes and thereby
+		 * active resources, set limits to max.
+		 * Resource pool will get freed later on,
+		 * when last resource will get deallocated.
+		 */
+		for (i = 0; i < count; i++)
+			set_resource_limit(rpool, i, S32_MAX);
+		atomic_set(&rpool->creator, RDMACG_RPOOL_CREATOR_DEFAULT);
+
+		spin_unlock(&cg->cg_list_lock);
+	}
+}
+
+static ssize_t rdmacg_resource_set_limit(struct kernfs_open_file *of,
+					 char *buf, size_t nbytes, loff_t off)
+{
+	struct rdma_cgroup *cg = css_rdmacg(of_css(of));
+	const char *dev_name;
+	struct cg_resource_pool *rpool;
+	struct rdmacg_resource_pool_ops *ops;
+	struct ib_device *device;
+	char *options = strstrip(buf);
+	enum rdmacg_resource_pool_type pool_type;
+	struct rdmacg_pool_info *resource_tokens;
+	u64 enables = 0;
+	int *new_limits;
+	int i = 0, ret = 0;
+	bool remove = false;
+
+	/* extract the device name first */
+	dev_name = strsep(&options, " ");
+	if (!dev_name) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	/* check if user asked to remove the cgroup limits */
+	if (strstr(options, RDMACG_USR_CMD_REMOVE))
+		remove = true;
+
+	new_limits = kcalloc(RDMACG_MAX_RESOURCE_INDEX, sizeof(int),
+			     GFP_KERNEL);
+	if (!new_limits) {
+		ret = -ENOMEM;
+		goto opt_err;
+	}
+
+	/* acquire lock to synchronize with hot plug devices */
+	mutex_lock(&dev_mutex);
+
+	device = _rdmacg_get_device(dev_name);
+	if (!device) {
+		ret = -ENODEV;
+		goto parse_err;
+	}
+	pool_type = of_to_pool_type(of_cft(of)->private);
+	ops = get_pool_ops(device, pool_type);
+	if (!ops) {
+		ret = -EINVAL;
+		goto parse_err;
+	}
+
+	resource_tokens = ops->get_resource_pool_tokens(device);
+	if (IS_ERR_OR_NULL(resource_tokens)) {
+		ret = -EINVAL;
+		goto parse_err;
+	}
+
+	if (remove) {
+		remove_unused_cg_rpool(cg, device, pool_type,
+				       resource_tokens->resource_count);
+		/* user asked to clear the limits; ignore rest of the options */
+		goto parse_err;
+	}
+
+	/* user didn't ask to remove, act on the options */
+	ret = rdmacg_parse_limits(options,
+				  resource_tokens->resource_table,
+				  new_limits, &enables);
+	if (ret)
+		goto parse_err;
+
+	rpool = get_cg_rpool(cg, device, pool_type);
+	if (IS_ERR_OR_NULL(rpool)) {
+		if (IS_ERR(rpool))
+			ret = PTR_ERR(rpool);
+		else
+			ret = -ENOMEM;
+		goto parse_err;
+	}
+	/* set pool type as user regardless of previous type as
+	 * user is configuring the limit now.
+	 */
+	atomic_set(&rpool->creator, RDMACG_RPOOL_CREATOR_USR);
+
+	/* now set the new limits on the existing or newly created rool */
+	while (enables) {
+		/* if user set the limit, enables bit is set */
+		if (enables & BIT(i)) {
+			enables &= ~BIT(i);
+			set_resource_limit(rpool, i, new_limits[i]);
+		}
+		i++;
+	}
+	atomic_dec(&rpool->refcnt);
+parse_err:
+	mutex_unlock(&dev_mutex);
+opt_err:
+	kfree(new_limits);
+err:
+	return ret ?: nbytes;
+}
+
+static int get_resource_val(struct cg_resource *resource,
+			    enum rdmacg_file_type type)
+{
+	int val = 0;
+
+	switch (type) {
+	case RDMACG_VERB_RESOURCE_LIMIT:
+	case RDMACG_HW_RESOURCE_LIMIT:
+		val = resource->limit;
+		break;
+	case RDMACG_VERB_RESOURCE_USAGE:
+	case RDMACG_HW_RESOURCE_USAGE:
+		val = atomic_read(&resource->usage);
+		break;
+	case RDMACG_VERB_RESOURCE_FAILCNT:
+	case RDMACG_HW_RESOURCE_FAILCNT:
+		val = atomic_read(&resource->failcnt);
+		break;
+	default:
+		val = 0;
+		break;
+	};
+	return val;
+}
+
+static int rdmacg_resource_read(struct seq_file *sf, void *v)
+{
+	struct rdma_cgroup *cg = css_rdmacg(seq_css(sf));
+	struct cg_resource_pool *rpool;
+	struct rdmacg_pool_info *pool_info;
+	struct match_token *resource_table;
+	struct rdmacg_resource_pool_ops *ops;
+	enum rdmacg_resource_pool_type pool_type;
+	int i, value, ret = 0;
+
+	pool_type = of_to_pool_type(seq_cft(sf)->private);
+
+	mutex_lock(&dev_mutex);
+
+	list_for_each_entry(rpool, &cg->rpool_head, cg_list) {
+		if (rpool->type != pool_type)
+			continue;
+
+		ops = get_pool_ops(rpool->device, pool_type);
+		if (!ops) {
+			ret = -EPERM;
+			goto err;
+		}
+
+		pool_info = ops->get_resource_pool_tokens(rpool->device);
+		if (IS_ERR_OR_NULL(pool_info)) {
+			ret = -EINVAL;
+			goto err;
+		}
+		seq_printf(sf, "%s ", rpool->device->name);
+
+		resource_table = pool_info->resource_table;
+		for (i = 0; i < pool_info->resource_count; i++) {
+			value = get_resource_val(&rpool->resources[i],
+						 seq_cft(sf)->private);
+			seq_printf(sf, resource_table[i].pattern, value);
+			seq_putc(sf, ' ');
+		}
+		seq_putc(sf, '\n');
+	}
+
+err:
+	mutex_unlock(&dev_mutex);
+	return ret;
+}
+
+static int rdmacg_resource_list(struct seq_file *sf, void *v)
+{
+	struct ib_device *device;
+	struct rdmacg_pool_info *pool_info;
+	struct match_token *resource_table;
+	struct rdmacg_resource_pool_ops *ops;
+	enum rdmacg_resource_pool_type pool_type;
+	int i, ret = 0;
+	char *name, *org_ptr;
+	char **namep;
+
+	pool_type = of_to_pool_type(seq_cft(sf)->private);
+
+	mutex_lock(&dev_mutex);
+
+	list_for_each_entry(device, &dev_list, rdmacg_list) {
+		ops = get_pool_ops(device, pool_type);
+		if (!ops)
+			continue;
+
+		pool_info = ops->get_resource_pool_tokens(device);
+		if (IS_ERR_OR_NULL(pool_info)) {
+			ret = -EINVAL;
+			goto err;
+		}
+		seq_printf(sf, "%s ", device->name);
+
+		resource_table = pool_info->resource_table;
+		for (i = 0; i < pool_info->resource_count; i++) {
+			name = kzalloc(strlen(resource_table[i].pattern),
+				       GFP_KERNEL);
+			if (!name) {
+				ret = -ENOMEM;
+				goto err;
+			}
+			namep = &name;
+			org_ptr = name;
+			strcpy(name, resource_table[i].pattern);
+			name = strsep(namep, "=");
+			if (!*namep) {
+				kfree(org_ptr);
+				continue;
+			}
+			seq_puts(sf, name);
+			seq_putc(sf, ' ');
+			kfree(org_ptr);
+		}
+		seq_putc(sf, '\n');
+	}
+
+err:
+	mutex_unlock(&dev_mutex);
+	return ret;
+}
+
+static struct cftype rdmacg_files[] = {
+	{
+		.name = "resource.verb.limit",
+		.write = rdmacg_resource_set_limit,
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_VERB_RESOURCE_LIMIT,
+	},
+	{
+		.name = "resource.verb.usage",
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_VERB_RESOURCE_USAGE,
+	},
+	{
+		.name = "resource.verb.failcnt",
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_VERB_RESOURCE_FAILCNT,
+	},
+	{
+		.name = "resource.verb.list",
+		.seq_show = rdmacg_resource_list,
+		.private = RDMACG_VERB_RESOURCE_LIST,
+	},
+	{
+		.name = "resource.hw.limit",
+		.write = rdmacg_resource_set_limit,
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_HW_RESOURCE_LIMIT,
+	},
+	{
+		.name = "resource.hw.usage",
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_HW_RESOURCE_USAGE,
+	},
+	{
+		.name = "resource.hw.failcnt",
+		.seq_show = rdmacg_resource_read,
+		.private = RDMACG_HW_RESOURCE_FAILCNT,
+	},
+	{
+		.name = "resource.hw.list",
+		.seq_show = rdmacg_resource_list,
+		.private = RDMACG_HW_RESOURCE_LIST,
+	},
+	{ }	/* terminate */
+};
+
+static struct cgroup_subsys_state *
+rdmacg_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct rdma_cgroup *cg;
+
+	cg = kzalloc(sizeof(*cg), GFP_KERNEL);
+	if (!cg)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&cg->rpool_head);
+	spin_lock_init(&cg->cg_list_lock);
+
+	return &cg->css;
+}
+
+static void rdmacg_css_free(struct cgroup_subsys_state *css)
+{
+	struct rdma_cgroup *cg = css_rdmacg(css);
+
+	kfree(cg);
+}
+
+/**
+ * rdmacg_css_offline - cgroup css_offline callback
+ * @css: css of interest
+ *
+ * This function is called when @css is about to go away and responsible
+ * for shooting down all rdmacg associated with @css. As part of that it
+ * marks all the resource pool as default type, so that when resources are
+ * freeded, associated resource pool can be freed as well.
+ *
+ */
+static void rdmacg_css_offline(struct cgroup_subsys_state *css)
+{
+	struct rdma_cgroup *cg = css_rdmacg(css);
+	struct cg_resource_pool *rpool;
+
+	spin_lock(&cg->cg_list_lock);
+
+	/* mark resource pool as of default type, so that they can be freed
+	 * later on when actual resources are freed.
+	 */
+	list_for_each_entry(rpool, &cg->rpool_head, cg_list)
+		atomic_set(&rpool->creator, RDMACG_RPOOL_CREATOR_DEFAULT);
+
+	spin_unlock(&cg->cg_list_lock);
+}
+
+struct cgroup_subsys rdma_cgrp_subsys = {
+	.css_alloc	= rdmacg_css_alloc,
+	.css_free	= rdmacg_css_free,
+	.css_offline	= rdmacg_css_offline,
+	.legacy_cftypes	= rdmacg_files,
+	.dfl_cftypes	= rdmacg_files,
+};
-- 
1.8.3.1
^ permalink raw reply related	[flat|nested] 39+ messages in thread
* [PATCHv1 4/6] IB/core: rdmacg support infrastructure APIs
  2016-01-05 18:58 [PATCHv1 0/6] rdma controller support Parav Pandit
  2016-01-05 18:58 ` [PATCHv1 1/6] rdmacg: Added rdma cgroup header file Parav Pandit
  2016-01-05 18:58 ` [PATCHv1 2/6] IB/core: Added members to support rdma cgroup Parav Pandit
@ 2016-01-05 18:58 ` Parav Pandit
  2016-01-05 18:58 ` [PATCHv1 5/6] IB/core: use rdma cgroup for resource accounting Parav Pandit
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Parav Pandit @ 2016-01-05 18:58 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav
It defines verb RDMA resources that will be registered with
RDMA cgroup. It defines new APIs to register device with
RDMA cgroup and defines resource token table access interface.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 drivers/infiniband/core/Makefile    |  1 +
 drivers/infiniband/core/cgroup.c    | 80 +++++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/core_priv.h |  5 +++
 include/rdma/ib_verbs.h             | 13 ++++++
 4 files changed, 99 insertions(+)
 create mode 100644 drivers/infiniband/core/cgroup.c
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index d43a899..df40cee 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -13,6 +13,7 @@ ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
 				roce_gid_mgmt.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
+ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
 
diff --git a/drivers/infiniband/core/cgroup.c b/drivers/infiniband/core/cgroup.c
new file mode 100644
index 0000000..8d80add
--- /dev/null
+++ b/drivers/infiniband/core/cgroup.c
@@ -0,0 +1,80 @@
+#include <linux/kernel.h>
+#include <linux/parser.h>
+#include <linux/cgroup_rdma.h>
+
+#include "core_priv.h"
+
+/**
+ * resource table definition as to be seen by the user.
+ * Need to add entries to it when more resources are
+ * added/defined at IB verb/core layer.
+ */
+static match_table_t resource_tokens = {
+	{RDMA_VERB_RESOURCE_UCTX, "uctx=%d"},
+	{RDMA_VERB_RESOURCE_AH, "ah=%d"},
+	{RDMA_VERB_RESOURCE_PD, "pd=%d"},
+	{RDMA_VERB_RESOURCE_CQ, "cq=%d"},
+	{RDMA_VERB_RESOURCE_MR, "mr=%d"},
+	{RDMA_VERB_RESOURCE_MW, "mw=%d"},
+	{RDMA_VERB_RESOURCE_SRQ, "srq=%d"},
+	{RDMA_VERB_RESOURCE_QP, "qp=%d"},
+	{RDMA_VERB_RESOURCE_FLOW, "flow=%d"},
+	{-1, NULL}
+};
+
+/**
+ * setup table pointers for RDMA cgroup to access.
+ */
+static struct rdmacg_pool_info verbs_token_info = {
+	.resource_table = resource_tokens,
+	.resource_count =
+		(sizeof(resource_tokens) / sizeof(struct match_token)) - 1,
+};
+
+static struct rdmacg_pool_info*
+	rdmacg_get_resource_pool_tokens(struct ib_device *device)
+{
+	return &verbs_token_info;
+}
+
+static struct rdmacg_resource_pool_ops verbs_pool_ops = {
+	.get_resource_pool_tokens = &rdmacg_get_resource_pool_tokens,
+};
+
+/**
+ * ib_device_register_rdmacg - register with rdma cgroup.
+ * @device: device to register to participate in resource
+ *          accounting by rdma cgroup.
+ *
+ * Register with the rdma cgroup. Should be called before
+ * exposing rdma device to user space applications to avoid
+ * resource accounting leak.
+ * HCA drivers should set resource pool ops first if they wish
+ * to support hw specific resource accounting before IB core
+ * registers with rdma cgroup.
+ */
+void ib_device_register_rdmacg(struct ib_device *device)
+{
+	rdmacg_set_rpool_ops(device,
+			     RDMACG_RESOURCE_POOL_VERB,
+			     &verbs_pool_ops);
+	rdmacg_register_ib_device(device);
+}
+
+/**
+ * ib_device_unregister_rdmacg - unregister with rdma cgroup.
+ * @device: device to unregister.
+ *
+ * Unregister with the rdma cgroup. Should be called after
+ * all the resources are deallocated, and after a stage when any
+ * other resource allocation of user application cannot be done
+ * for this device to avoid any leak in accounting.
+ * HCA drivers should clear resource pool ops after ib stack
+ * unregisters with rdma cgroup.
+ */
+void ib_device_unregister_rdmacg(struct ib_device *device)
+{
+	rdmacg_unregister_ib_device(device);
+	rdmacg_clear_rpool_ops(device,
+			       RDMACG_RESOURCE_POOL_VERB);
+}
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 5cf6eb7..29bdfe2 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -92,4 +92,9 @@ int ib_cache_setup_one(struct ib_device *device);
 void ib_cache_cleanup_one(struct ib_device *device);
 void ib_cache_release_one(struct ib_device *device);
 
+#ifdef CONFIG_CGROUP_RDMA
+void ib_device_register_rdmacg(struct ib_device *device);
+void ib_device_unregister_rdmacg(struct ib_device *device);
+#endif
+
 #endif /* _CORE_PRIV_H */
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 1a17249..f44b884 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -96,6 +96,19 @@ enum rdma_protocol_type {
 	RDMA_PROTOCOL_USNIC_UDP
 };
 
+enum rdma_resource_type {
+	RDMA_VERB_RESOURCE_UCTX,
+	RDMA_VERB_RESOURCE_AH,
+	RDMA_VERB_RESOURCE_PD,
+	RDMA_VERB_RESOURCE_CQ,
+	RDMA_VERB_RESOURCE_MR,
+	RDMA_VERB_RESOURCE_MW,
+	RDMA_VERB_RESOURCE_SRQ,
+	RDMA_VERB_RESOURCE_QP,
+	RDMA_VERB_RESOURCE_FLOW,
+	RDMA_VERB_RESOURCE_MAX,
+};
+
 __attribute_const__ enum rdma_transport_type
 rdma_node_get_transport(enum rdma_node_type node_type);
 
-- 
1.8.3.1
^ permalink raw reply related	[flat|nested] 39+ messages in thread
* [PATCHv1 5/6] IB/core: use rdma cgroup for resource accounting
  2016-01-05 18:58 [PATCHv1 0/6] rdma controller support Parav Pandit
                   ` (2 preceding siblings ...)
  2016-01-05 18:58 ` [PATCHv1 4/6] IB/core: rdmacg support infrastructure APIs Parav Pandit
@ 2016-01-05 18:58 ` Parav Pandit
  2016-01-05 18:58 ` [PATCHv1 6/6] rdmacg: Added documentation for rdma controller Parav Pandit
       [not found] ` <1452020286-9508-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  5 siblings, 0 replies; 39+ messages in thread
From: Parav Pandit @ 2016-01-05 18:58 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav
It uses charge API to perform resource charing before allocating low
level resource. It continues to link the resource to the owning
thread group leader task.
It uncharges the resource after successful deallocation of resource.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 drivers/infiniband/core/device.c      |   8 ++
 drivers/infiniband/core/uverbs_cmd.c  | 244 +++++++++++++++++++++++++++++++---
 drivers/infiniband/core/uverbs_main.c |  30 +++++
 3 files changed, 265 insertions(+), 17 deletions(-)
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 179e813..59cab6b 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -352,6 +352,10 @@ int ib_register_device(struct ib_device *device,
 		goto out;
 	}
 
+#ifdef CONFIG_CGROUP_RDMA
+	ib_device_register_rdmacg(device);
+#endif
+
 	ret = ib_device_register_sysfs(device, port_callback);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't register device %s with driver model\n",
@@ -405,6 +409,10 @@ void ib_unregister_device(struct ib_device *device)
 
 	mutex_unlock(&device_mutex);
 
+#ifdef CONFIG_CGROUP_RDMA
+	ib_device_unregister_rdmacg(device);
+#endif
+
 	ib_device_unregister_sysfs(device);
 	ib_cache_cleanup_one(device);
 
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 94816ae..1b3d60b 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -294,6 +294,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 #endif
 	struct ib_ucontext		 *ucontext;
 	struct file			 *filp;
+	struct pid                       *tgid;
 	int ret;
 
 	if (out_len < sizeof resp)
@@ -313,10 +314,20 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		   (unsigned long) cmd.response + sizeof resp,
 		   in_len - sizeof cmd, out_len - sizeof resp);
 
+	rcu_read_lock();
+	tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
+	rcu_read_unlock();
+
+	ret = rdmacg_try_charge_resource(ib_dev, tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_UCTX, 1);
+	if (ret)
+		goto err_charge;
+
 	ucontext = ib_dev->alloc_ucontext(ib_dev, &udata);
 	if (IS_ERR(ucontext)) {
 		ret = PTR_ERR(ucontext);
-		goto err;
+		goto err_alloc;
 	}
 
 	ucontext->device = ib_dev;
@@ -330,7 +341,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	INIT_LIST_HEAD(&ucontext->xrcd_list);
 	INIT_LIST_HEAD(&ucontext->rule_list);
 	rcu_read_lock();
-	ucontext->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
+	ucontext->tgid = tgid;
 	rcu_read_unlock();
 	ucontext->closing = 0;
 
@@ -383,9 +394,15 @@ err_fd:
 	put_unused_fd(resp.async_fd);
 
 err_free:
-	put_pid(ucontext->tgid);
 	ib_dev->dealloc_ucontext(ucontext);
 
+err_alloc:
+	rdmacg_uncharge_resource(ib_dev, tgid, RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_UCTX, 1);
+
+err_charge:
+	put_pid(tgid);
+
 err:
 	mutex_unlock(&file->mutex);
 	return ret;
@@ -394,7 +411,8 @@ err:
 static void copy_query_dev_fields(struct ib_uverbs_file *file,
 				  struct ib_device *ib_dev,
 				  struct ib_uverbs_query_device_resp *resp,
-				  struct ib_device_attr *attr)
+				  struct ib_device_attr *attr,
+				  int *limits)
 {
 	resp->fw_ver		= attr->fw_ver;
 	resp->node_guid		= ib_dev->node_guid;
@@ -405,14 +423,19 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file,
 	resp->vendor_part_id	= attr->vendor_part_id;
 	resp->hw_ver		= attr->hw_ver;
 	resp->max_qp		= attr->max_qp;
+	resp->max_qp		= min_t(int, attr->max_qp,
+					limits[RDMA_VERB_RESOURCE_QP]);
 	resp->max_qp_wr		= attr->max_qp_wr;
 	resp->device_cap_flags	= attr->device_cap_flags;
 	resp->max_sge		= attr->max_sge;
 	resp->max_sge_rd	= attr->max_sge_rd;
-	resp->max_cq		= attr->max_cq;
+	resp->max_cq		= min_t(int, attr->max_cq,
+					limits[RDMA_VERB_RESOURCE_CQ]);
 	resp->max_cqe		= attr->max_cqe;
-	resp->max_mr		= attr->max_mr;
-	resp->max_pd		= attr->max_pd;
+	resp->max_mr		= min_t(int, attr->max_mr,
+					limits[RDMA_VERB_RESOURCE_MR]);
+	resp->max_pd		= min_t(int, attr->max_pd,
+					limits[RDMA_VERB_RESOURCE_PD]);
 	resp->max_qp_rd_atom	= attr->max_qp_rd_atom;
 	resp->max_ee_rd_atom	= attr->max_ee_rd_atom;
 	resp->max_res_rd_atom	= attr->max_res_rd_atom;
@@ -421,16 +444,19 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file,
 	resp->atomic_cap		= attr->atomic_cap;
 	resp->max_ee			= attr->max_ee;
 	resp->max_rdd			= attr->max_rdd;
-	resp->max_mw			= attr->max_mw;
+	resp->max_mw			= min_t(int, attr->max_mw,
+						limits[RDMA_VERB_RESOURCE_MW]);
 	resp->max_raw_ipv6_qp		= attr->max_raw_ipv6_qp;
 	resp->max_raw_ethy_qp		= attr->max_raw_ethy_qp;
 	resp->max_mcast_grp		= attr->max_mcast_grp;
 	resp->max_mcast_qp_attach	= attr->max_mcast_qp_attach;
 	resp->max_total_mcast_qp_attach	= attr->max_total_mcast_qp_attach;
-	resp->max_ah			= attr->max_ah;
+	resp->max_ah			= min_t(int, attr->max_ah,
+						limits[RDMA_VERB_RESOURCE_AH]);
 	resp->max_fmr			= attr->max_fmr;
 	resp->max_map_per_fmr		= attr->max_map_per_fmr;
-	resp->max_srq			= attr->max_srq;
+	resp->max_srq			= min_t(int, attr->max_srq,
+						limits[RDMA_VERB_RESOURCE_SRQ]);
 	resp->max_srq_wr		= attr->max_srq_wr;
 	resp->max_srq_sge		= attr->max_srq_sge;
 	resp->max_pkeys			= attr->max_pkeys;
@@ -447,6 +473,8 @@ ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
 	struct ib_uverbs_query_device_resp resp;
 	struct ib_device_attr              attr;
 	int                                ret;
+	int                                limits[RDMA_VERB_RESOURCE_MAX];
+	struct pid                         *tgid;
 
 	if (out_len < sizeof resp)
 		return -ENOSPC;
@@ -458,14 +486,28 @@ ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	rcu_read_lock();
+	tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
+	rcu_read_unlock();
+
+	ret = rdmacg_query_resource_limit(ib_dev, tgid,
+					  RDMACG_RESOURCE_POOL_VERB,
+					  limits, RDMA_VERB_RESOURCE_MAX);
+	put_pid(tgid);
+	if (ret)
+		goto err;
+
 	memset(&resp, 0, sizeof resp);
-	copy_query_dev_fields(file, ib_dev, &resp, &attr);
+	copy_query_dev_fields(file, ib_dev, &resp, &attr, limits);
 
 	if (copy_to_user((void __user *) (unsigned long) cmd.response,
 			 &resp, sizeof resp))
 		return -EFAULT;
 
 	return in_len;
+
+err:
+	return ret;
 }
 
 ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
@@ -529,6 +571,7 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
 	struct ib_udata                udata;
 	struct ib_uobject             *uobj;
 	struct ib_pd                  *pd;
+	struct pid                    *tgid;
 	int                            ret;
 
 	if (out_len < sizeof resp)
@@ -545,6 +588,16 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
 	if (!uobj)
 		return -ENOMEM;
 
+	tgid = file->ucontext->tgid;
+
+	ret = rdmacg_try_charge_resource(file->device->ib_dev, tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_PD, 1);
+	if (ret) {
+		kfree(uobj);
+		return -EPERM;
+	}
+
 	init_uobj(uobj, 0, file->ucontext, &pd_lock_class);
 	down_write(&uobj->mutex);
 
@@ -590,6 +643,9 @@ err_idr:
 	ib_dealloc_pd(pd);
 
 err:
+	rdmacg_uncharge_resource(file->device->ib_dev, tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_PD, 1);
 	put_uobj_write(uobj);
 	return ret;
 }
@@ -602,6 +658,8 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file,
 	struct ib_uverbs_dealloc_pd cmd;
 	struct ib_uobject          *uobj;
 	struct ib_pd		   *pd;
+	struct pid                 *tgid;
+	struct ib_device           *device;
 	int                         ret;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
@@ -622,6 +680,13 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file,
 	if (ret)
 		goto err_put;
 
+	tgid = uobj->context->tgid;
+	device = uobj->context->device;
+
+	rdmacg_uncharge_resource(device, tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_PD, 1);
+
 	uobj->live = 0;
 	put_uobj_write(uobj);
 
@@ -995,6 +1060,12 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
 		}
 	}
 
+	ret = rdmacg_try_charge_resource(pd->device, file->ucontext->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_MR, 1);
+	if (ret)
+		goto err_charge;
+
 	mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va,
 				     cmd.access_flags, &udata);
 	if (IS_ERR(mr)) {
@@ -1043,6 +1114,11 @@ err_unreg:
 	ib_dereg_mr(mr);
 
 err_put:
+	rdmacg_uncharge_resource(pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_MR, 1);
+
+err_charge:
 	put_pd_read(pd);
 
 err_free:
@@ -1152,6 +1228,8 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
 	struct ib_uverbs_dereg_mr cmd;
 	struct ib_mr             *mr;
 	struct ib_uobject	 *uobj;
+	struct pid               *tgid;
+	struct ib_pd             *pd;
 	int                       ret = -EINVAL;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
@@ -1163,6 +1241,9 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
 
 	mr = uobj->object;
 
+	tgid = uobj->context->tgid;
+	pd = mr->pd;
+
 	ret = ib_dereg_mr(mr);
 	if (!ret)
 		uobj->live = 0;
@@ -1172,6 +1253,10 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	rdmacg_uncharge_resource(pd->device, tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_MR, 1);
+
 	idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -1214,6 +1299,12 @@ ssize_t ib_uverbs_alloc_mw(struct ib_uverbs_file *file,
 		goto err_free;
 	}
 
+	ret = rdmacg_try_charge_resource(pd->device, file->ucontext->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_MW, 1);
+	if (ret)
+		goto err_charge;
+
 	mw = pd->device->alloc_mw(pd, cmd.mw_type);
 	if (IS_ERR(mw)) {
 		ret = PTR_ERR(mw);
@@ -1259,6 +1350,11 @@ err_unalloc:
 	ib_dealloc_mw(mw);
 
 err_put:
+	rdmacg_uncharge_resource(pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_MW, 1);
+
+err_charge:
 	put_pd_read(pd);
 
 err_free:
@@ -1273,6 +1369,7 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
 {
 	struct ib_uverbs_dealloc_mw cmd;
 	struct ib_mw               *mw;
+	struct ib_pd               *pd;
 	struct ib_uobject	   *uobj;
 	int                         ret = -EINVAL;
 
@@ -1284,6 +1381,7 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
 		return -EINVAL;
 
 	mw = uobj->object;
+	pd = mw->pd;
 
 	ret = ib_dealloc_mw(mw);
 	if (!ret)
@@ -1294,6 +1392,10 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	rdmacg_uncharge_resource(pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_MW, 1);
+
 	idr_remove_uobj(&ib_uverbs_mw_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -1359,6 +1461,7 @@ static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file,
 	struct ib_ucq_object           *obj;
 	struct ib_uverbs_event_file    *ev_file = NULL;
 	struct ib_cq                   *cq;
+	struct pid                     *tgid;
 	int                             ret;
 	struct ib_uverbs_ex_create_cq_resp resp;
 	struct ib_cq_init_attr attr = {};
@@ -1393,6 +1496,14 @@ static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file,
 	if (cmd_sz > offsetof(typeof(*cmd), flags) + sizeof(cmd->flags))
 		attr.flags = cmd->flags;
 
+	tgid = file->ucontext->tgid;
+
+	ret = rdmacg_try_charge_resource(file->device->ib_dev, tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_CQ, 1);
+	if (ret)
+		goto err_charge;
+
 	cq = ib_dev->create_cq(ib_dev, &attr,
 					     file->ucontext, uhw);
 	if (IS_ERR(cq)) {
@@ -1440,6 +1551,11 @@ err_free:
 	ib_destroy_cq(cq);
 
 err_file:
+	rdmacg_uncharge_resource(file->device->ib_dev, tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_CQ, 1);
+
+err_charge:
 	if (ev_file)
 		ib_uverbs_release_ucq(file, ev_file, obj);
 
@@ -1697,6 +1813,8 @@ ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
 	struct ib_uverbs_destroy_cq_resp resp;
 	struct ib_uobject		*uobj;
 	struct ib_cq               	*cq;
+	struct ib_device                *device;
+	struct pid                      *tgid;
 	struct ib_ucq_object        	*obj;
 	struct ib_uverbs_event_file	*ev_file;
 	int                        	 ret = -EINVAL;
@@ -1720,6 +1838,12 @@ ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	tgid    = uobj->context->tgid;
+	device  = uobj->context->device;
+	rdmacg_uncharge_resource(device, tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_CQ, 1);
+
 	idr_remove_uobj(&ib_uverbs_cq_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -1775,6 +1899,12 @@ static int create_qp(struct ib_uverbs_file *file,
 		  &qp_lock_class);
 	down_write(&obj->uevent.uobject.mutex);
 
+	pd  = idr_read_pd(cmd->pd_handle, file->ucontext);
+	if (!pd) {
+		ret = -EINVAL;
+		goto err_put;
+	}
+
 	if (cmd->qp_type == IB_QPT_XRC_TGT) {
 		xrcd = idr_read_xrcd(cmd->pd_handle, file->ucontext,
 				     &xrcd_uobj);
@@ -1809,8 +1939,7 @@ static int create_qp(struct ib_uverbs_file *file,
 
 		scq = idr_read_cq(cmd->send_cq_handle, file->ucontext, !!rcq);
 		rcq = rcq ?: scq;
-		pd  = idr_read_pd(cmd->pd_handle, file->ucontext);
-		if (!pd || !scq) {
+		if (!scq) {
 			ret = -EINVAL;
 			goto err_put;
 		}
@@ -1856,6 +1985,12 @@ static int create_qp(struct ib_uverbs_file *file,
 			goto err_put;
 		}
 
+	ret = rdmacg_try_charge_resource(pd->device, file->ucontext->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_QP, 1);
+	if (ret)
+		goto err_put;
+
 	if (cmd->qp_type == IB_QPT_XRC_TGT)
 		qp = ib_create_qp(pd, &attr);
 	else
@@ -1863,7 +1998,7 @@ static int create_qp(struct ib_uverbs_file *file,
 
 	if (IS_ERR(qp)) {
 		ret = PTR_ERR(qp);
-		goto err_put;
+		goto err_create;
 	}
 
 	if (cmd->qp_type != IB_QPT_XRC_TGT) {
@@ -1938,6 +2073,11 @@ err_cb:
 err_destroy:
 	ib_destroy_qp(qp);
 
+err_create:
+	rdmacg_uncharge_resource(device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_QP, 1);
+
 err_put:
 	if (xrcd)
 		put_xrcd_read(xrcd_uobj);
@@ -2377,6 +2517,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 	struct ib_uverbs_destroy_qp_resp resp;
 	struct ib_uobject		*uobj;
 	struct ib_qp               	*qp;
+	struct ib_pd                    *pd;
 	struct ib_uqp_object        	*obj;
 	int                        	 ret = -EINVAL;
 
@@ -2389,6 +2530,7 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 	if (!uobj)
 		return -EINVAL;
 	qp  = uobj->object;
+	pd  = qp->pd;
 	obj = container_of(uobj, struct ib_uqp_object, uevent.uobject);
 
 	if (!list_empty(&obj->mcast_list)) {
@@ -2405,6 +2547,10 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	rdmacg_uncharge_resource(pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_QP, 1);
+
 	if (obj->uxrcd)
 		atomic_dec(&obj->uxrcd->refcnt);
 
@@ -2846,10 +2992,16 @@ ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
 	memset(&attr.dmac, 0, sizeof(attr.dmac));
 	memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16);
 
+	ret = rdmacg_try_charge_resource(pd->device, file->ucontext->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_AH, 1);
+	if (ret)
+		goto err_put;
+
 	ah = ib_create_ah(pd, &attr);
 	if (IS_ERR(ah)) {
 		ret = PTR_ERR(ah);
-		goto err_put;
+		goto err_create;
 	}
 
 	ah->uobject  = uobj;
@@ -2885,6 +3037,11 @@ err_copy:
 err_destroy:
 	ib_destroy_ah(ah);
 
+err_create:
+	rdmacg_uncharge_resource(pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_AH, 1);
+
 err_put:
 	put_pd_read(pd);
 
@@ -2899,6 +3056,7 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 {
 	struct ib_uverbs_destroy_ah cmd;
 	struct ib_ah		   *ah;
+	struct ib_pd		   *pd;
 	struct ib_uobject	   *uobj;
 	int			    ret;
 
@@ -2909,6 +3067,7 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 	if (!uobj)
 		return -EINVAL;
 	ah = uobj->object;
+	pd = ah->pd;
 
 	ret = ib_destroy_ah(ah);
 	if (!ret)
@@ -2919,6 +3078,10 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	rdmacg_uncharge_resource(pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_AH, 1);
+
 	idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -3171,10 +3334,17 @@ int ib_uverbs_ex_create_flow(struct ib_uverbs_file *file,
 		err = -EINVAL;
 		goto err_free;
 	}
+
+	err = rdmacg_try_charge_resource(qp->pd->device, file->ucontext->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_FLOW, 1);
+	if (err)
+		goto err_free;
+
 	flow_id = ib_create_flow(qp, flow_attr, IB_FLOW_DOMAIN_USER);
 	if (IS_ERR(flow_id)) {
 		err = PTR_ERR(flow_id);
-		goto err_free;
+		goto err_create;
 	}
 	flow_id->qp = qp;
 	flow_id->uobject = uobj;
@@ -3208,6 +3378,10 @@ err_copy:
 	idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
 destroy_flow:
 	ib_destroy_flow(flow_id);
+err_create:
+	rdmacg_uncharge_resource(qp->pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_FLOW, 1);
 err_free:
 	kfree(flow_attr);
 err_put:
@@ -3228,6 +3402,7 @@ int ib_uverbs_ex_destroy_flow(struct ib_uverbs_file *file,
 	struct ib_uverbs_destroy_flow	cmd;
 	struct ib_flow			*flow_id;
 	struct ib_uobject		*uobj;
+	struct ib_pd			*pd;
 	int				ret;
 
 	if (ucore->inlen < sizeof(cmd))
@@ -3245,11 +3420,16 @@ int ib_uverbs_ex_destroy_flow(struct ib_uverbs_file *file,
 	if (!uobj)
 		return -EINVAL;
 	flow_id = uobj->object;
+	pd = flow_id->qp->pd;
 
 	ret = ib_destroy_flow(flow_id);
 	if (!ret)
 		uobj->live = 0;
 
+	rdmacg_uncharge_resource(pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_FLOW, 1);
+
 	put_uobj_write(uobj);
 
 	idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
@@ -3316,6 +3496,12 @@ static int __uverbs_create_xsrq(struct ib_uverbs_file *file,
 	obj->uevent.events_reported = 0;
 	INIT_LIST_HEAD(&obj->uevent.event_list);
 
+	ret = rdmacg_try_charge_resource(pd->device, file->ucontext->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_SRQ, 1);
+	if (ret)
+		goto err_put_cq;
+
 	srq = pd->device->create_srq(pd, &attr, udata);
 	if (IS_ERR(srq)) {
 		ret = PTR_ERR(srq);
@@ -3380,6 +3566,9 @@ err_destroy:
 	ib_destroy_srq(srq);
 
 err_put:
+	rdmacg_uncharge_resource(pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_SRQ, 1);
 	put_pd_read(pd);
 
 err_put_cq:
@@ -3540,6 +3729,7 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 	struct ib_uverbs_destroy_srq_resp resp;
 	struct ib_uobject		 *uobj;
 	struct ib_srq               	 *srq;
+	struct ib_pd                     *pd;
 	struct ib_uevent_object        	 *obj;
 	int                         	  ret = -EINVAL;
 	struct ib_usrq_object		 *us;
@@ -3554,6 +3744,7 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 	srq = uobj->object;
 	obj = container_of(uobj, struct ib_uevent_object, uobject);
 	srq_type = srq->srq_type;
+	pd = srq->pd;
 
 	ret = ib_destroy_srq(srq);
 	if (!ret)
@@ -3564,6 +3755,10 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	rdmacg_uncharge_resource(pd->device, file->ucontext->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_SRQ, 1);
+
 	if (srq_type == IB_SRQT_XRC) {
 		us = container_of(obj, struct ib_usrq_object, uevent);
 		atomic_dec(&us->uxrcd->refcnt);
@@ -3597,6 +3792,8 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	struct ib_uverbs_ex_query_device_resp resp;
 	struct ib_uverbs_ex_query_device  cmd;
 	struct ib_device_attr attr;
+	struct pid *tgid;
+	int    limits[RDMA_VERB_RESOURCE_MAX];
 	int err;
 
 	if (ucore->inlen < sizeof(cmd))
@@ -3623,7 +3820,18 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	if (err)
 		return err;
 
-	copy_query_dev_fields(file, ib_dev, &resp.base, &attr);
+	rcu_read_lock();
+	tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
+	rcu_read_unlock();
+
+	err = rdmacg_query_resource_limit(ib_dev, tgid,
+					  RDMACG_RESOURCE_POOL_VERB,
+					  limits, RDMA_VERB_RESOURCE_MAX);
+	if (err)
+		goto end;
+
+	copy_query_dev_fields(file, ib_dev, &resp.base, &attr, limits);
+
 	resp.comp_mask = 0;
 
 	if (ucore->outlen < resp.response_length + sizeof(resp.odp_caps))
@@ -3656,6 +3864,8 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
 	resp.response_length += sizeof(resp.hca_core_clock);
 
 end:
+	put_pid(tgid);
+
 	err = ib_copy_to_udata(ucore, &resp, resp.response_length);
 	if (err)
 		return err;
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index e3ef288..4565386 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -45,6 +45,7 @@
 #include <linux/cdev.h>
 #include <linux/anon_inodes.h>
 #include <linux/slab.h>
+#include <linux/cgroup_rdma.h>
 
 #include <asm/uaccess.h>
 
@@ -214,6 +215,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->ah_list, list) {
 		struct ib_ah *ah = uobj->object;
 
+		rdmacg_uncharge_resource(ah->pd->device, uobj->context->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_AH, 1);
 		idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
 		ib_destroy_ah(ah);
 		kfree(uobj);
@@ -223,6 +227,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->mw_list, list) {
 		struct ib_mw *mw = uobj->object;
 
+		rdmacg_uncharge_resource(mw->pd->device, uobj->context->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_MW, 1);
 		idr_remove_uobj(&ib_uverbs_mw_idr, uobj);
 		ib_dealloc_mw(mw);
 		kfree(uobj);
@@ -231,6 +238,10 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->rule_list, list) {
 		struct ib_flow *flow_id = uobj->object;
 
+		rdmacg_uncharge_resource(flow_id->qp->pd->device,
+					 uobj->context->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_FLOW, 1);
 		idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
 		ib_destroy_flow(flow_id);
 		kfree(uobj);
@@ -245,6 +256,10 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		if (qp != qp->real_qp) {
 			ib_close_qp(qp);
 		} else {
+			rdmacg_uncharge_resource(qp->pd->device,
+						 uobj->context->tgid,
+						 RDMACG_RESOURCE_POOL_VERB,
+						 RDMA_VERB_RESOURCE_QP, 1);
 			ib_uverbs_detach_umcast(qp, uqp);
 			ib_destroy_qp(qp);
 		}
@@ -257,6 +272,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		struct ib_uevent_object *uevent =
 			container_of(uobj, struct ib_uevent_object, uobject);
 
+		rdmacg_uncharge_resource(srq->pd->device, uobj->context->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_SRQ, 1);
 		idr_remove_uobj(&ib_uverbs_srq_idr, uobj);
 		ib_destroy_srq(srq);
 		ib_uverbs_release_uevent(file, uevent);
@@ -269,6 +287,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		struct ib_ucq_object *ucq =
 			container_of(uobj, struct ib_ucq_object, uobject);
 
+		rdmacg_uncharge_resource(cq->device, uobj->context->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_CQ, 1);
 		idr_remove_uobj(&ib_uverbs_cq_idr, uobj);
 		ib_destroy_cq(cq);
 		ib_uverbs_release_ucq(file, ev_file, ucq);
@@ -278,6 +299,9 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) {
 		struct ib_mr *mr = uobj->object;
 
+		rdmacg_uncharge_resource(mr->pd->device, uobj->context->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_MR, 1);
 		idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
 		ib_dereg_mr(mr);
 		kfree(uobj);
@@ -298,11 +322,17 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) {
 		struct ib_pd *pd = uobj->object;
 
+		rdmacg_uncharge_resource(pd->device, uobj->context->tgid,
+					 RDMACG_RESOURCE_POOL_VERB,
+					 RDMA_VERB_RESOURCE_PD, 1);
 		idr_remove_uobj(&ib_uverbs_pd_idr, uobj);
 		ib_dealloc_pd(pd);
 		kfree(uobj);
 	}
 
+	rdmacg_uncharge_resource(context->device, context->tgid,
+				 RDMACG_RESOURCE_POOL_VERB,
+				 RDMA_VERB_RESOURCE_UCTX, 1);
 	put_pid(context->tgid);
 
 	return context->device->dealloc_ucontext(context);
-- 
1.8.3.1
^ permalink raw reply related	[flat|nested] 39+ messages in thread
* [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.
  2016-01-05 18:58 [PATCHv1 0/6] rdma controller support Parav Pandit
                   ` (3 preceding siblings ...)
  2016-01-05 18:58 ` [PATCHv1 5/6] IB/core: use rdma cgroup for resource accounting Parav Pandit
@ 2016-01-05 18:58 ` Parav Pandit
       [not found]   ` <1452020286-9508-7-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
       [not found] ` <1452020286-9508-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  5 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-05 18:58 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford, liranl, sean.hefty, jgunthorpe, haggaie
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module, pandit.parav
Added documentation for rdma controller to use in legacy mode and
using new unified hirerchy.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 Documentation/cgroup-legacy/rdma.txt | 129 +++++++++++++++++++++++++++++++++++
 Documentation/cgroup.txt             |  79 +++++++++++++++++++++
 2 files changed, 208 insertions(+)
 create mode 100644 Documentation/cgroup-legacy/rdma.txt
diff --git a/Documentation/cgroup-legacy/rdma.txt b/Documentation/cgroup-legacy/rdma.txt
new file mode 100644
index 0000000..70626c5
--- /dev/null
+++ b/Documentation/cgroup-legacy/rdma.txt
@@ -0,0 +1,129 @@
+				RDMA Resource Controller
+				------------------------
+
+Contents
+--------
+
+1. Overview
+  1-1. What is RDMA resource controller?
+  1-2. Why RDMA resource controller needed?
+  1-3. How is RDMA resource controller implemented?
+2. Usage Examples
+
+1. Overview
+
+1-1. What is RDMA resource controller?
+-------------------------------------
+
+RDMA resource controller allows user to limit RDMA/IB specific resources
+that a given set of processes can use. These processes are grouped using
+RDMA resource controller.
+
+RDMA resource controller currently allows two different type of resource
+pools.
+(a) RDMA IB specification level verb resources defined by IB stack
+(b) HCA vendor device specific resources
+
+RDMA resource controller controller allows maximum of upto 64 resources in
+a resource pool which is the internal construct of rdma cgroup explained
+at later part of this document.
+
+1-2. Why RDMA resource controller needed?
+----------------------------------------
+
+Currently user space applications can easily take away all the rdma device
+specific resources such as AH, CQ, QP, MR etc. Due to which other applications
+in other cgroup or kernel space ULPs may not even get chance to allocate any
+rdma resources. This leads to service unavailability.
+
+Therefore RDMA resource controller is needed through which resource consumption
+of processes can be limited. Through this controller various different rdma
+resources described by IB uverbs layer and any HCA vendor driver can be
+accounted.
+
+1-3. How is RDMA resource controller implemented?
+------------------------------------------------
+
+rdma cgroup allows limit configuration of resources. These resources are not
+defined by the rdma controller. Instead they are defined by the IB stack
+and HCA device drivers(optionally).
+This provides great flexibility to allow IB stack to define new resources,
+without any changes to rdma cgroup.
+Rdma cgroup maintains resource accounting per cgroup, per device, per resource
+type using resource pool structure. Each such resource pool is limited up to
+64 resources in given resource pool by rdma cgroup, which can be extended
+later if required.
+
+This resource pool object is linked to the cgroup css. Typically there
+are 0 to 4 resource pool instances per cgroup, per device in most use cases.
+But nothing limits to have it more. At present hundreds of RDMA devices per
+single cgroup may not be handled optimally, however there is no known use case
+for such configuration either.
+
+Since RDMA resources can be allocated from any process and can be freed by any
+of the child processes which shares the address space, rdma resources are
+always owned by the creator cgroup css. This allows process migration from one
+to other cgroup without major complexity of transferring resource ownership;
+because such ownership is not really present due to shared nature of
+rdma resources. Linking resources around css also ensures that cgroups can be
+deleted after processes migrated. This allow progress migration as well with
+active resources, even though that’s not the primary use case.
+
+Finally mapping of the resource owner pid to cgroup is maintained using
+simple hash table to perform quick look-up during resource charing/uncharging
+time.
+
+Resource pool object is created in following situations.
+(a) User sets the limit and no previous resource pool exist for the device
+of interest for the cgroup.
+(b) No resource limits were configured, but IB/RDMA stack tries to
+charge the resource. So that it correctly uncharge them when applications are
+running without limits and later on when limits are enforced during uncharging,
+otherwise usage count will drop to negative. This is done using default
+resource pool. Instead of implementing any sort of time markers, default pool
+simplifies the design.
+
+Resource pool is destroyed if it was of default type (not created
+by administrative operation) and it’s the last resource getting
+deallocated. Resource pool created as administrative operation is not
+deleted, as it’s expected to be used in near future.
+
+If user setting tries to delete all the resource limit
+with active resources per device, RDMA cgroup just marks the pool as
+default pool with maximum limits for each resource, otherwise it deletes the
+default resource pool.
+
+2. Usage Examples
+-----------------
+
+(a) List available RDMA verb level resources:
+
+#cat /sys/fs/cgroup/rdma/1/rdma.resource.verb.list
+Output:
+mlx4_0 uctx ah pd mr srq qp flow
+
+(b) Configure resource limit:
+echo mlx4_0 mr=100 qp=10 ah=2 > /sys/fs/cgroup/rdma/1/rdma.resource.verb.limit
+echo ocrdma1 mr=120 qp=20 cq=10 > /sys/fs/cgroup/rdma/2/rdma.resource.verb.limit
+
+(c) Query resource limit:
+cat /sys/fs/cgroup/rdma/2/rdma.resource.verb.limit
+#Output:
+mlx4_0 mr=100 qp=10 ah=2
+ocrdma1 mr=120 qp=20 cq=10
+
+(d) Query current usage:
+cat /sys/fs/cgroup/rdma/2/rdma.resource.verb.usage
+#Output:
+mlx4_0 mr=95 qp=8 ah=2
+ocrdma1 mr=0 qp=20 cq=10
+
+(e) Delete resource limit:
+echo mlx4_0 remove > /sys/fs/cgroup/rdma/1/rdma.resource.verb.limit
+
+(f) List available HCA HW specific resources: (optional)
+cat /sys/fs/cgroup/rdma/1/rdma.hw.verb.list
+vendor1 hw_qp hw_cq hw_timer
+
+(g) Configure hw specific resource limit:
+echo vendor1 hw_qp=56 > /sys/fs/cgroup/rdma/2/rdma.resource.hw.limit
diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
index 983ba63..57eb59c 100644
--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -47,6 +47,8 @@ CONTENTS
   5-3. IO
     5-3-1. IO Interface Files
     5-3-2. Writeback
+  5-4. RDMA
+    5-4-1. RDMA Interface Files
 6. Namespace
   6-1. Basics
   6-2. The Root and Views
@@ -1017,6 +1019,83 @@ writeback as follows.
 	total available memory and applied the same way as
 	vm.dirty[_background]_ratio.
 
+5-4. RDMA
+
+The "rdma" controller regulates the distribution of RDMA resources.
+This controller implements both RDMA/IB verb level and RDMA HCA
+driver level resource distribution.
+
+5-4-1. RDMA Interface Files
+
+  rdma.resource.verb.list
+
+	A read-only file that exists for all the cgroups that describes
+	which all verb specific resources of a given device can be
+	distributed and accounted.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains space separated resource name that can be
+	distributed.
+
+	An example for mlx4_0 device follows.
+
+	  mlx4_0 ah cq pd mr qp flow srq
+
+  rdma.resource.verb.limit
+	A readwrite file that exists for all the cgroups that describes
+	current configured verbs resource limit for a RDMA/IB device.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains space separated resource name and its configured
+	limit that can be distributed.
+
+	An example for mlx4 and ocrdma device follows.
+
+	  mlx4_0 mr=1000 qp=104 ah=2
+	  ocrdma1 mr=900 qp=89 cq=10
+
+  rdma.resource.verb.usage
+	A read-only file that describes current resource usage.
+	It exists for all the cgroup including root.
+
+	An example for mlx4 and ocrdma device follows.
+
+	  mlx4_0 mr=1000 qp=102 ah=2
+	  ocrdma1 mr=900 qp=79 cq=10
+
+  rdma.resource.verb.failcnt
+	A read-only file that describes resource allocation failure
+	count for a given resource type of a particular device.
+	It exists for all the cgroup including root.
+
+	An example for mlx4 and ocrdma device follows.
+
+	  mlx4_0 mr=0 qp=1 ah=1
+	  ocrdma1 mr=2 qp=1 cq=1
+
+  rdma.resource.hw.list
+
+	A read-only file that exists for all the cgroups that describes
+	which all HCA hardware specific resources of a given device can be
+	distributed and accounted.
+
+  rdma.resource.hw.limit
+	A readwrite file that exists for all the cgroups that describes
+	current configured HCA hardware resource limit for a RDMA/IB device.
+
+	Lines are keyed by device name and are not ordered.
+	Each line contains space separated resource name and its configured
+	limit that can be distributed.
+
+  rdma.resource.hw.usage
+	A read-only file that describes current resource usage.
+	It exists for all the cgroup including root.
+
+  rdma.resource.hw.failcnt
+	A read-only file that describes HCA hardware resource
+	allocation failure count for a given resource type of
+	a particular device.
+	It exists for all the cgroup including root.
 
 6. Namespace
 
-- 
1.8.3.1
^ permalink raw reply related	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.
       [not found]   ` <1452020286-9508-7-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-01-05 21:53     ` Tejun Heo
       [not found]       ` <20160105215309.GG5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-05 21:53 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	liranl-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	haggaie-VPRAkNaXOzVWk0Htik3J/w, corbet-T1hC0tSOHrs,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
Hello,
On Wed, Jan 06, 2016 at 12:28:06AM +0530, Parav Pandit wrote:
> +5-4-1. RDMA Interface Files
> +
> +  rdma.resource.verb.list
> +  rdma.resource.verb.limit
> +  rdma.resource.verb.usage
> +  rdma.resource.verb.failcnt
> +  rdma.resource.hw.list
> +  rdma.resource.hw.limit
> +  rdma.resource.hw.usage
> +  rdma.resource.hw.failcnt
Can you please read the rest of cgroup.txt and put the interface in
line with the common conventions followed by other controllers?
Thanks.
-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
       [not found] ` <1452020286-9508-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2016-01-05 18:58   ` [PATCHv1 3/6] rdmacg: implements rdma cgroup Parav Pandit
@ 2016-01-05 21:56   ` Tejun Heo
       [not found]     ` <20160105215623.GH5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  1 sibling, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-05 21:56 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	liranl-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	haggaie-VPRAkNaXOzVWk0Htik3J/w, corbet-T1hC0tSOHrs,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
Hello,
On Wed, Jan 06, 2016 at 12:28:00AM +0530, Parav Pandit wrote:
> Resources are not defined by the RDMA cgroup. Resources are defined
> by RDMA/IB stack & optionally by HCA vendor device drivers.
As I wrote before, I don't think this is a good idea.  Drivers will
inevitably add non-sensical "resources" which don't make any sense
without much scrutiny.  If different controllers can't agree upon the
same set of resources, which probably is a pretty good sign that this
isn't too well thought out to begin with, at least make all resource
types defined by the controller itself and let the controllers enable
them selectively.
Thanks.
-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 2/6] IB/core: Added members to support rdma cgroup
       [not found]   ` <1452020286-9508-3-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-01-05 21:56     ` Tejun Heo
  2016-01-06 23:16       ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-05 21:56 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	liranl-VPRAkNaXOzVWk0Htik3J/w, sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	haggaie-VPRAkNaXOzVWk0Htik3J/w, corbet-T1hC0tSOHrs,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
On Wed, Jan 06, 2016 at 12:28:02AM +0530, Parav Pandit wrote:
> Added function pointer table to store resource pool specific
> operation for each resource type (verb and hw).
> Added list node to link device to rdma cgroup so that it can
> participate in resource accounting and limit configuration.
Is there any point in splitting patches 1 and 2 from 3?
-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup
  2016-01-05 18:58   ` [PATCHv1 3/6] rdmacg: implements rdma cgroup Parav Pandit
@ 2016-01-05 22:01     ` Tejun Heo
       [not found]       ` <20160105220128.GJ5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-05 22:01 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, hannes,
	dledford, liranl, sean.hefty, jgunthorpe, haggaie, corbet,
	james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module
Hello,
On Wed, Jan 06, 2016 at 12:28:03AM +0530, Parav Pandit wrote:
> +/* hash table to keep map of tgid to owner cgroup */
> +DEFINE_HASHTABLE(pid_cg_map_tbl, 7);
> +DEFINE_SPINLOCK(pid_cg_map_lock);	/* lock to protect hash table access */
> +
> +/* Keeps mapping of pid to its owning cgroup at rdma level,
> + * This mapping doesn't change, even if process migrates from one to other
> + * rdma cgroup.
> + */
> +struct pid_cg_map {
> +	struct pid *pid;		/* hash key */
> +	struct rdma_cgroup *cg;
> +
> +	struct hlist_node hlist;	/* pid to cgroup hash table link */
> +	atomic_t refcnt;		/* count active user tasks to figure out
> +					 * when to free the memory
> +					 */
> +};
Ugh, there's something clearly wrong here.  Why does the rdma
controller need to keep track of pid cgroup membership?
> +static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
> +			      struct cg_resource_pool *rpool)
> +{
> +	spin_lock(&cg->cg_list_lock);
> +
> +	/* if its started getting used by other task,
> +	 * before we take the spin lock, then skip,
> +	 * freeing it.
> +	 */
Please follow CodingStyle.
> +	if (atomic_read(&rpool->refcnt) == 0) {
> +		list_del_init(&rpool->cg_list);
> +		spin_unlock(&cg->cg_list_lock);
> +
> +		_free_cg_rpool(rpool);
> +		return;
> +	}
> +	spin_unlock(&cg->cg_list_lock);
> +}
> +
> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
> +			     struct cg_resource_pool *rpool)
> +{
> +	/* Don't free the resource pool which is created by the
> +	 * user, otherwise we miss the configured limits. We don't
> +	 * gain much either by splitting storage of limit and usage.
> +	 * So keep it around until user deletes the limits.
> +	 */
> +	if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
> +		_dealloc_cg_rpool(cg, rpool);
I'm pretty sure you can get away with an fixed length array of
counters.  Please keep it simple.  It's a simple hard limit enforcer.
There's no need to create a massive dynamic infrastrucure.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.
       [not found]       ` <20160105215309.GG5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-01-06 22:44         ` Parav Pandit
  2016-01-06 22:57           ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-06 22:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
On Wed, Jan 6, 2016 at 3:23 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Wed, Jan 06, 2016 at 12:28:06AM +0530, Parav Pandit wrote:
>> +5-4-1. RDMA Interface Files
>> +
>> +  rdma.resource.verb.list
>> +  rdma.resource.verb.limit
>> +  rdma.resource.verb.usage
>> +  rdma.resource.verb.failcnt
>> +  rdma.resource.hw.list
>> +  rdma.resource.hw.limit
>> +  rdma.resource.hw.usage
>> +  rdma.resource.hw.failcnt
>
> Can you please read the rest of cgroup.txt and put the interface in
> line with the common conventions followed by other controllers?
>
Yes. I read through. I can see two changes to be made in V2 version of
this patch.
1. rdma.resource.verb.usage and rdma.resource.verb.limit to change
respectively to,
2. rdma.resource.verb.stat and rdma.resource.verb.max.
3. rdma.resource.verb.failcnt indicate failure events, which I think
should go to events.
I roll out new patch for events post this patch as additional feature
and remove this feature in V2.
rdma.resource.verb.list file is unique to rdma cgroup, so I believe
this is fine.
We will conclude whether to have rdma.resource.hw.<files> or not in
other patches.
I am in opinion to keep "resource" and "verb" or "hw" tags around to
keep it verbose enough to know what are we trying to control.
Is that ok?
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.
  2016-01-06 22:44         ` Parav Pandit
@ 2016-01-06 22:57           ` Tejun Heo
       [not found]             ` <20160106225714.GI3660-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-06 22:57 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
Hello,
On Thu, Jan 07, 2016 at 04:14:26AM +0530, Parav Pandit wrote:
> Yes. I read through. I can see two changes to be made in V2 version of
> this patch.
> 1. rdma.resource.verb.usage and rdma.resource.verb.limit to change
> respectively to,
> 2. rdma.resource.verb.stat and rdma.resource.verb.max.
> 3. rdma.resource.verb.failcnt indicate failure events, which I think
> should go to events.
What's up with the ".resource" part?  Also can't the .max file list
the available resources?  Why does it need a separtae list file?
> I roll out new patch for events post this patch as additional feature
> and remove this feature in V2.
> 
> rdma.resource.verb.list file is unique to rdma cgroup, so I believe
> this is fine.
Please see above.
> We will conclude whether to have rdma.resource.hw.<files> or not in
> other patches.
> I am in opinion to keep "resource" and "verb" or "hw" tags around to
> keep it verbose enough to know what are we trying to control.
What does that achieve?  I feel that it's getting overengineered
constantly.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
       [not found]     ` <20160105215623.GH5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-01-06 23:13       ` Parav Pandit
       [not found]         ` <CAG53R5XQaUz=vJCEqLwGt93CC1YMnVQ=k4DuRU0nspcb_-+_6g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-06 23:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
Hi Tejun,
On Wed, Jan 6, 2016 at 3:26 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Wed, Jan 06, 2016 at 12:28:00AM +0530, Parav Pandit wrote:
>> Resources are not defined by the RDMA cgroup. Resources are defined
>> by RDMA/IB stack & optionally by HCA vendor device drivers.
>
> As I wrote before, I don't think this is a good idea.  Drivers will
> inevitably add non-sensical "resources" which don't make any sense
> without much scrutiny.
In our last discussion on v0 patch,
http://lkml.iu.edu/hypermail/linux/kernel/1509.1/04331.html
The direction was, that vendor should be able to define their own resources.
> If different controllers can't agree upon the
> same set of resources, which probably is a pretty good sign that this
> isn't too well thought out to begin with,
When you said "different controller" you meant "different hw vendors", right?
Or you meant, rdma, mem, cpu as controller here?
> at least make all resource
> types defined by the controller itself and let the controllers enable
> them selectively.
>
In this V1 patch, resource is defined by the IB stack and rdma cgroup
is facilitator for same.
By doing so, IB stack modules can define new resource without really
making changes to cgroup.
This design also allows hw vendors to define their own resources which
will be reviewed in rdma mailing list anway.
The idea is different hw versions can have different resource support,
so the whole intention is not about defining different resource but
rather enabling it.
But yes, I equally agree that by doing so, different hw controller
vendors can define different hw resources.
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 2/6] IB/core: Added members to support rdma cgroup
  2016-01-05 21:56     ` Tejun Heo
@ 2016-01-06 23:16       ` Parav Pandit
  2016-01-07 15:07         ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-06 23:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Wed, Jan 6, 2016 at 3:26 AM, Tejun Heo <tj@kernel.org> wrote:
> On Wed, Jan 06, 2016 at 12:28:02AM +0530, Parav Pandit wrote:
>> Added function pointer table to store resource pool specific
>> operation for each resource type (verb and hw).
>> Added list node to link device to rdma cgroup so that it can
>> participate in resource accounting and limit configuration.
>
> Is there any point in splitting patches 1 and 2 from 3?
>
Patch 2 is in IB stack, so I separated that patch out from 1. That
makes it 3 patches.
If you think single patch is easier to review, let me know, I can
respin to have one patch for these 3 smaller patches.
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup
       [not found]       ` <20160105220128.GJ5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-01-06 23:33         ` Parav Pandit
  2016-01-07 15:29           ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-06 23:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
On Wed, Jan 6, 2016 at 3:31 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Wed, Jan 06, 2016 at 12:28:03AM +0530, Parav Pandit wrote:
>> +/* hash table to keep map of tgid to owner cgroup */
>> +DEFINE_HASHTABLE(pid_cg_map_tbl, 7);
>> +DEFINE_SPINLOCK(pid_cg_map_lock);    /* lock to protect hash table access */
>> +
>> +/* Keeps mapping of pid to its owning cgroup at rdma level,
>> + * This mapping doesn't change, even if process migrates from one to other
>> + * rdma cgroup.
>> + */
>> +struct pid_cg_map {
>> +     struct pid *pid;                /* hash key */
>> +     struct rdma_cgroup *cg;
>> +
>> +     struct hlist_node hlist;        /* pid to cgroup hash table link */
>> +     atomic_t refcnt;                /* count active user tasks to figure out
>> +                                      * when to free the memory
>> +                                      */
>> +};
>
> Ugh, there's something clearly wrong here.  Why does the rdma
> controller need to keep track of pid cgroup membership?
>
Rdma resource can be allocated by parent process, used and freed by
child process.
Child process could belong to different rdma cgroup.
Parent process might have been terminated after creation of rdma
cgroup. (Followed by cgroup might have been deleted too).
Its discussed in https://lkml.org/lkml/2015/11/2/307
In nutshell, there is process that clearly owns the rdma resource.
So to keep the design simple, rdma resource is owned by the creator
process and cgroup without modifying the task_struct.
>> +static void _dealloc_cg_rpool(struct rdma_cgroup *cg,
>> +                           struct cg_resource_pool *rpool)
>> +{
>> +     spin_lock(&cg->cg_list_lock);
>> +
>> +     /* if its started getting used by other task,
>> +      * before we take the spin lock, then skip,
>> +      * freeing it.
>> +      */
>
> Please follow CodingStyle.
>
>> +     if (atomic_read(&rpool->refcnt) == 0) {
>> +             list_del_init(&rpool->cg_list);
>> +             spin_unlock(&cg->cg_list_lock);
>> +
>> +             _free_cg_rpool(rpool);
>> +             return;
>> +     }
>> +     spin_unlock(&cg->cg_list_lock);
>> +}
>> +
>> +static void dealloc_cg_rpool(struct rdma_cgroup *cg,
>> +                          struct cg_resource_pool *rpool)
>> +{
>> +     /* Don't free the resource pool which is created by the
>> +      * user, otherwise we miss the configured limits. We don't
>> +      * gain much either by splitting storage of limit and usage.
>> +      * So keep it around until user deletes the limits.
>> +      */
>> +     if (atomic_read(&rpool->creator) == RDMACG_RPOOL_CREATOR_DEFAULT)
>> +             _dealloc_cg_rpool(cg, rpool);
>
> I'm pretty sure you can get away with an fixed length array of
> counters.  Please keep it simple.  It's a simple hard limit enforcer.
> There's no need to create a massive dynamic infrastrucure.
>
Every resource pool for verbs resource is fixed length array. Length
of the array is defined by the IB stack modules.
This array is per cgroup, per device.
Its per device, because we agreed that we want to address requirement
of controlling/configuring them on per device basis.
Devices appear and disappear. Therefore they are allocated dynamically.
Otherwise this array could be static in cgroup structure.
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.
       [not found]             ` <20160106225714.GI3660-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2016-01-06 23:52               ` Parav Pandit
       [not found]                 ` <CAG53R5XKpP6YAe8buRrgG=m4_dQK5MjkWtcA7CN-74AJ1SUUyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-06 23:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
On Thu, Jan 7, 2016 at 4:27 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Thu, Jan 07, 2016 at 04:14:26AM +0530, Parav Pandit wrote:
>> Yes. I read through. I can see two changes to be made in V2 version of
>> this patch.
>> 1. rdma.resource.verb.usage and rdma.resource.verb.limit to change
>> respectively to,
>> 2. rdma.resource.verb.stat and rdma.resource.verb.max.
>> 3. rdma.resource.verb.failcnt indicate failure events, which I think
>> should go to events.
>
> What's up with the ".resource" part?
I can remove "resource" key word. If just that if something other than
resource comes up to limit to in future, it will be hard to define at
that time.
> Also can't the .max file list
> the available resources?  Why does it need a separtae list file?
>
max file does lists them only after limits are configured for that
device. Thats when rpool (array of max and usage counts) is allocated.
If user wants to know what all knobs are available, than list file
exposes them on per device basis without actually mentioning actual
limit or without allocating rpool arrays.
If you are hinting that I should allocate rpool array when rdma cgroup
is created, that can be done for already discovered devices.
If new devices are discovered after cgroup is created, for them we
anyway have to allocate/free when they appear/disappear.
In different implementation, where list of all the rdma cgroups can be
maintained, and rpool arrays can be allocated for all of them when new
device appear/disappear. This can move complexity of dynamic
allocation from try_charge/uncharge to device addition and removal
APIs. ib_register_ib_device() level.
However this comes with memory cost, where even if those device doesnt
participate in cgroup, for them rpool memory will be allocated for
each such rdma cgroup.
list file looks like below for two device entries.
mlx4_0 ah qp mr pd srq flow
ocrdma0 ah qp mr pd
max file looks like below.
mlx4_0 ah=100 qp=40 mr=10 pd=90 srq=10 flow=10
>> I roll out new patch for events post this patch as additional feature
>> and remove this feature in V2.
>>
>> rdma.resource.verb.list file is unique to rdma cgroup, so I believe
>> this is fine.
>
> Please see above.
>
>> We will conclude whether to have rdma.resource.hw.<files> or not in
>> other patches.
>> I am in opinion to keep "resource" and "verb" or "hw" tags around to
>> keep it verbose enough to know what are we trying to control.
>
> What does that achieve?  I feel that it's getting overengineered
> constantly.
Please see above for "resource". I guess we are not loosing anything
by having "rdma.resource" vs just having "rdma".
But if that sounds too much, we can remove "resource".
>
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
       [not found]         ` <CAG53R5XQaUz=vJCEqLwGt93CC1YMnVQ=k4DuRU0nspcb_-+_6g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-01-07 15:07           ` Tejun Heo
       [not found]             ` <20160107150718.GC29797-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 15:07 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
Hello, Parav.
On Thu, Jan 07, 2016 at 04:43:20AM +0530, Parav Pandit wrote:
> > If different controllers can't agree upon the
> > same set of resources, which probably is a pretty good sign that this
> > isn't too well thought out to begin with,
> 
> When you said "different controller" you meant "different hw vendors", right?
> Or you meant, rdma, mem, cpu as controller here?
Different hw vendors.
> > at least make all resource
> > types defined by the controller itself and let the controllers enable
> > them selectively.
> >
> In this V1 patch, resource is defined by the IB stack and rdma cgroup
> is facilitator for same.
> By doing so, IB stack modules can define new resource without really
> making changes to cgroup.
> This design also allows hw vendors to define their own resources which
> will be reviewed in rdma mailing list anway.
> The idea is different hw versions can have different resource support,
> so the whole intention is not about defining different resource but
> rather enabling it.
> But yes, I equally agree that by doing so, different hw controller
> vendors can define different hw resources.
How many vendors and resources are we talking about?  What I was
trying to say was that unless the number is extremely high, it'd be
far simpler to hard code them in the rdma controller and let drivers
enable the ones which apply to them.  It would require updating the
rdma cgroup controller to add new resource types but I think that'd
actually be an upside, not down.  There needs to be some checks and
balances against adding new resource types; otherwise, it'll soon
become a mess.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 2/6] IB/core: Added members to support rdma cgroup
  2016-01-06 23:16       ` Parav Pandit
@ 2016-01-07 15:07         ` Tejun Heo
       [not found]           ` <20160107150756.GD29797-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 15:07 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Thu, Jan 07, 2016 at 04:46:19AM +0530, Parav Pandit wrote:
> On Wed, Jan 6, 2016 at 3:26 AM, Tejun Heo <tj@kernel.org> wrote:
> > On Wed, Jan 06, 2016 at 12:28:02AM +0530, Parav Pandit wrote:
> >> Added function pointer table to store resource pool specific
> >> operation for each resource type (verb and hw).
> >> Added list node to link device to rdma cgroup so that it can
> >> participate in resource accounting and limit configuration.
> >
> > Is there any point in splitting patches 1 and 2 from 3?
> >
> Patch 2 is in IB stack, so I separated that patch out from 1. That
> makes it 3 patches.
> If you think single patch is easier to review, let me know, I can
> respin to have one patch for these 3 smaller patches.
They don't do anything by themselves.  I think putting them into a
single patch would be easier to review.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup
  2016-01-06 23:33         ` Parav Pandit
@ 2016-01-07 15:29           ` Tejun Heo
  2016-01-07 20:25             ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 15:29 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
Hello,
On Thu, Jan 07, 2016 at 05:03:07AM +0530, Parav Pandit wrote:
> Rdma resource can be allocated by parent process, used and freed by
> child process.
> Child process could belong to different rdma cgroup.
> Parent process might have been terminated after creation of rdma
> cgroup. (Followed by cgroup might have been deleted too).
> Its discussed in https://lkml.org/lkml/2015/11/2/307
> 
> In nutshell, there is process that clearly owns the rdma resource.
> So to keep the design simple, rdma resource is owned by the creator
> process and cgroup without modifying the task_struct.
So, a resource created by a task in a cgroup staying in the cgroup
when the task gets migrated is fine; however, a resource being
allocated in a previous cgroup of the task isn't fine.  Once
allocated, the resource themselves should be associated with the
cgroup so that they can be freed from the ones they're allocated from.
If I'm understanding it correctly, the code is bending basic rules
around how resource and task cgroup membership is tracked, you really
can't do that.
> > I'm pretty sure you can get away with an fixed length array of
> > counters.  Please keep it simple.  It's a simple hard limit enforcer.
> > There's no need to create a massive dynamic infrastrucure.
>
> Every resource pool for verbs resource is fixed length array. Length
> of the array is defined by the IB stack modules.
> This array is per cgroup, per device.
> Its per device, because we agreed that we want to address requirement
> of controlling/configuring them on per device basis.
> Devices appear and disappear. Therefore they are allocated dynamically.
> Otherwise this array could be static in cgroup structure.
Please see the previous response.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.
       [not found]                 ` <CAG53R5XKpP6YAe8buRrgG=m4_dQK5MjkWtcA7CN-74AJ1SUUyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-01-07 15:42                   ` Tejun Heo
       [not found]                     ` <20160107154218.GF29797-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 15:42 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
Hello,
On Thu, Jan 07, 2016 at 05:22:40AM +0530, Parav Pandit wrote:
> I can remove "resource" key word. If just that if something other than
> resource comes up to limit to in future, it will be hard to define at
> that time.
Please remove.  The word doesn't mean anything in this context.
> > Also can't the .max file list
> > the available resources?  Why does it need a separtae list file?
> >
> max file does lists them only after limits are configured for that
> device. Thats when rpool (array of max and usage counts) is allocated.
>
> If user wants to know what all knobs are available, than list file
> exposes them on per device basis without actually mentioning actual
> limit or without allocating rpool arrays.
...
> list file looks like below for two device entries.
> mlx4_0 ah qp mr pd srq flow
> ocrdma0 ah qp mr pd
> 
> max file looks like below.
> mlx4_0 ah=100 qp=40 mr=10 pd=90 srq=10 flow=10
Just always show the settings for all devices in the max file like the
following?
  mlx4_0 ah=max qp=max mr=max pd=max srq=max flow=max
  ocrdma0 ah=max qp=max mr=max pd=max
Thanks.
-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 2/6] IB/core: Added members to support rdma cgroup
       [not found]           ` <20160107150756.GD29797-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-01-07 19:40             ` Parav Pandit
  0 siblings, 0 replies; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 19:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
On Thu, Jan 7, 2016 at 8:37 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Thu, Jan 07, 2016 at 04:46:19AM +0530, Parav Pandit wrote:
>> On Wed, Jan 6, 2016 at 3:26 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> > On Wed, Jan 06, 2016 at 12:28:02AM +0530, Parav Pandit wrote:
>> >> Added function pointer table to store resource pool specific
>> >> operation for each resource type (verb and hw).
>> >> Added list node to link device to rdma cgroup so that it can
>> >> participate in resource accounting and limit configuration.
>> >
>> > Is there any point in splitting patches 1 and 2 from 3?
>> >
>> Patch 2 is in IB stack, so I separated that patch out from 1. That
>> makes it 3 patches.
>> If you think single patch is easier to review, let me know, I can
>> respin to have one patch for these 3 smaller patches.
>
> They don't do anything by themselves.  I think putting them into a
> single patch would be easier to review.
o.k. I will put them in single patch.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 6/6] rdmacg: Added documentation for rdma controller.
       [not found]                     ` <20160107154218.GF29797-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-01-07 19:43                       ` Parav Pandit
  0 siblings, 0 replies; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 19:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
On Thu, Jan 7, 2016 at 9:12 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Thu, Jan 07, 2016 at 05:22:40AM +0530, Parav Pandit wrote:
>> I can remove "resource" key word. If just that if something other than
>> resource comes up to limit to in future, it will be hard to define at
>> that time.
>
> Please remove.  The word doesn't mean anything in this context.
>
ok. I will remove.
>> > Also can't the .max file list
>> > the available resources?  Why does it need a separtae list file?
>> >
>> max file does lists them only after limits are configured for that
>> device. Thats when rpool (array of max and usage counts) is allocated.
>>
>> If user wants to know what all knobs are available, than list file
>> exposes them on per device basis without actually mentioning actual
>> limit or without allocating rpool arrays.
> ...
>> list file looks like below for two device entries.
>> mlx4_0 ah qp mr pd srq flow
>> ocrdma0 ah qp mr pd
>>
>> max file looks like below.
>> mlx4_0 ah=100 qp=40 mr=10 pd=90 srq=10 flow=10
>
> Just always show the settings for all devices in the max file like the
> following?
>
>   mlx4_0 ah=max qp=max mr=max pd=max srq=max flow=max
>   ocrdma0 ah=max qp=max mr=max pd=max
Should be possible. I will make the change.
>
> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
       [not found]             ` <20160107150718.GC29797-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-01-07 20:01               ` Parav Pandit
  2016-01-07 20:06                 ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 20:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
On Thu, Jan 7, 2016 at 8:37 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Thu, Jan 07, 2016 at 04:43:20AM +0530, Parav Pandit wrote:
>> > If different controllers can't agree upon the
>> > same set of resources, which probably is a pretty good sign that this
>> > isn't too well thought out to begin with,
>>
>> When you said "different controller" you meant "different hw vendors", right?
>> Or you meant, rdma, mem, cpu as controller here?
>
> Different hw vendors.
>
>> > at least make all resource
>> > types defined by the controller itself and let the controllers enable
>> > them selectively.
>> >
>> In this V1 patch, resource is defined by the IB stack and rdma cgroup
>> is facilitator for same.
>> By doing so, IB stack modules can define new resource without really
>> making changes to cgroup.
>> This design also allows hw vendors to define their own resources which
>> will be reviewed in rdma mailing list anway.
>> The idea is different hw versions can have different resource support,
>> so the whole intention is not about defining different resource but
>> rather enabling it.
>> But yes, I equally agree that by doing so, different hw controller
>> vendors can define different hw resources.
>
> How many vendors and resources are we talking about?
To my knowledge Intel HFI driver is the only one.
>  What I was
> trying to say was that unless the number is extremely high, it'd be
> far simpler to hard code them in the rdma controller and let drivers
> enable the ones which apply to them.
Instead of in rdma controller, its hard coded in IB stack.
I see this as an advantage where resource definition ownership remains
with IB stack maintainers, rather than rdma cgroup maintainer.
rdma cgroup maintainer doesn't have to understand what SRQ vs QP or
ODP type MR or multicast group is.
IB stack maintainer is better placed to judge and define it.
I would like to hear from Jason, Doug, Liran and other RDMA experts
about their thoughts.
> It would require updating the
> rdma cgroup controller to add new resource types but I think that'd
> actually be an upside, not down.  There needs to be some checks and
> balances against adding new resource types; otherwise, it'll soon
> become a mess.
I think this checks for new resource type can be ensured by IB stack
maintainer to avoid the mess.
My preference is IB stack to define resource, but I am ok to let go
this feature if consensus is RDMA cgroup.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 20:01               ` Parav Pandit
@ 2016-01-07 20:06                 ` Tejun Heo
  2016-01-07 20:32                   ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 20:06 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
Hello,
On Fri, Jan 08, 2016 at 01:31:06AM +0530, Parav Pandit wrote:
> >  What I was
> > trying to say was that unless the number is extremely high, it'd be
> > far simpler to hard code them in the rdma controller and let drivers
> > enable the ones which apply to them.
> 
> Instead of in rdma controller, its hard coded in IB stack.
> I see this as an advantage where resource definition ownership remains
> with IB stack maintainers, rather than rdma cgroup maintainer.
> rdma cgroup maintainer doesn't have to understand what SRQ vs QP or
> ODP type MR or multicast group is.
> IB stack maintainer is better placed to judge and define it.
> 
> I would like to hear from Jason, Doug, Liran and other RDMA experts
> about their thoughts.
That's fine.  Make it a header file in IB stack which is included from
the rdma cgroup controller.  The only things are not building a huge
dynamic framework for something which can easily be a simple static
thing and having some oversight in adding resource types.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup
  2016-01-07 15:29           ` Tejun Heo
@ 2016-01-07 20:25             ` Parav Pandit
  2016-01-07 20:28               ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 20:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Thu, Jan 7, 2016 at 8:59 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Thu, Jan 07, 2016 at 05:03:07AM +0530, Parav Pandit wrote:
>> Rdma resource can be allocated by parent process, used and freed by
>> child process.
>> Child process could belong to different rdma cgroup.
>> Parent process might have been terminated after creation of rdma
>> cgroup. (Followed by cgroup might have been deleted too).
>> Its discussed in https://lkml.org/lkml/2015/11/2/307
>>
>> In nutshell, there is process that clearly owns the rdma resource.
>> So to keep the design simple, rdma resource is owned by the creator
>> process and cgroup without modifying the task_struct.
>
> So, a resource created by a task in a cgroup staying in the cgroup
> when the task gets migrated is fine; however, a resource being
> allocated in a previous cgroup of the task isn't fine.  Once
> allocated, the resource themselves should be associated with the
> cgroup so that they can be freed from the ones they're allocated from.
>
I probably didn't explain it well in previous email. Let me try again
and see if it make sense.
Whenever resource is allocated, it belongs to a given rdma cgroup.
Whenever its freed, its freed from same rdma cgroup.
(even if the original allocating process is terminated or the original
cgroup is offline now).
Every time a new resource is allocated, its current rdma cgroup is
being considered for allocation.
IB stack keeps the resource ownership with the pid (tgid) structure
and holds reference count to it, so that even if original process is
terminated, its pid structure keeps hovering around until last
resource is freed.
Above functionality is achieved, by maintaining the map this tgid and
associated original cgroup at try_charge(), uncharge() time.
In alternate method,
Its simple to store the pointer of rdma_cgroup structure in the IB
resource structure and hold on reference count to rdma_cgroup.
so that when its freed, uncharge_resource can accept rdma_cgroup
structure pointer.
This method will eliminate current pid map infrastructure all together
and achieve same functionality as described above.
try_charge() would return pointer to rdma_cg or NULL based on
successful charge, or fail respectively.
uncharge() will have rdma_cg as input argument.
Are you ok with that approach?
> If I'm understanding it correctly, the code is bending basic rules
> around how resource and task cgroup membership is tracked, you really
> can't do that.
>
>> > I'm pretty sure you can get away with an fixed length array of
>> > counters.  Please keep it simple.  It's a simple hard limit enforcer.
>> > There's no need to create a massive dynamic infrastrucure.
>>
>> Every resource pool for verbs resource is fixed length array. Length
>> of the array is defined by the IB stack modules.
>> This array is per cgroup, per device.
>> Its per device, because we agreed that we want to address requirement
>> of controlling/configuring them on per device basis.
>> Devices appear and disappear. Therefore they are allocated dynamically.
>> Otherwise this array could be static in cgroup structure.
>
> Please see the previous response.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup
  2016-01-07 20:25             ` Parav Pandit
@ 2016-01-07 20:28               ` Tejun Heo
  2016-01-07 20:39                 ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 20:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
Hello, Parav.
On Fri, Jan 08, 2016 at 01:55:09AM +0530, Parav Pandit wrote:
...
> Above functionality is achieved, by maintaining the map this tgid and
> associated original cgroup at try_charge(), uncharge() time.
Hmmm, what happens after the following?
1. A process allocates some rdma resources and get registered on the
   hash table.
2. The process gets migrated to a different cgroup.
3. The process allocates more rdma resources.
Which cgroup would the resources from #3 be attributed to?
> In alternate method,
> Its simple to store the pointer of rdma_cgroup structure in the IB
> resource structure and hold on reference count to rdma_cgroup.
> so that when its freed, uncharge_resource can accept rdma_cgroup
> structure pointer.
That'd be a lot more in line with how other controllers behave.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 20:06                 ` Tejun Heo
@ 2016-01-07 20:32                   ` Parav Pandit
  2016-01-07 20:34                     ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 20:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Fri, Jan 8, 2016 at 1:36 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Fri, Jan 08, 2016 at 01:31:06AM +0530, Parav Pandit wrote:
>> >  What I was
>> > trying to say was that unless the number is extremely high, it'd be
>> > far simpler to hard code them in the rdma controller and let drivers
>> > enable the ones which apply to them.
>>
>> Instead of in rdma controller, its hard coded in IB stack.
>> I see this as an advantage where resource definition ownership remains
>> with IB stack maintainers, rather than rdma cgroup maintainer.
>> rdma cgroup maintainer doesn't have to understand what SRQ vs QP or
>> ODP type MR or multicast group is.
>> IB stack maintainer is better placed to judge and define it.
>>
>> I would like to hear from Jason, Doug, Liran and other RDMA experts
>> about their thoughts.
>
> That's fine.  Make it a header file in IB stack which is included from
> the rdma cgroup controller.  The only things are not building a huge
> dynamic framework for something which can easily be a simple static
> thing and having some oversight in adding resource types.
>
o.k. That doable. I want to make sure that we are on same page on below design.
rpool (which will contain static array based on header file ) would be
still there, because resource limits are on per device basis. Number
of devices are variable and dynamically appear. Therefore rdma_cg will
have the list of rpool attached to it. Do you agree?
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 20:32                   ` Parav Pandit
@ 2016-01-07 20:34                     ` Tejun Heo
  2016-01-07 20:46                       ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 20:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Fri, Jan 08, 2016 at 02:02:20AM +0530, Parav Pandit wrote:
> o.k. That doable. I want to make sure that we are on same page on below design.
> rpool (which will contain static array based on header file ) would be
> still there, because resource limits are on per device basis. Number
> of devices are variable and dynamically appear. Therefore rdma_cg will
> have the list of rpool attached to it. Do you agree?
Yeap.  Would it make more sense to hang them off of whatever struct
which presents a rdma device tho?  And then just walk them from cgroup
controller?
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup
  2016-01-07 20:28               ` Tejun Heo
@ 2016-01-07 20:39                 ` Parav Pandit
       [not found]                   ` <CAG53R5UuCNwCKWAuub=n3FkJR6Z5oyF6cVPeW=5nk-JrURY=jw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 20:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Fri, Jan 8, 2016 at 1:58 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Jan 08, 2016 at 01:55:09AM +0530, Parav Pandit wrote:
> ...
>> Above functionality is achieved, by maintaining the map this tgid and
>> associated original cgroup at try_charge(), uncharge() time.
>
> Hmmm, what happens after the following?
>
> 1. A process allocates some rdma resources and get registered on the
>    hash table.
>
> 2. The process gets migrated to a different cgroup.
>
> 3. The process allocates more rdma resources.
>
> Which cgroup would the resources from #3 be attributed to?
Since the pid/tgid of the process doesn't change in step_3, it
allocates from original cgroup of step_1.
However in next patch V2, as described below, since IB resource will
store rdma_cg pointer,
in step_3, new resource will be allocated from new cgroup.
old resource will be freed from older cgroup.
This is what you are expecting, right?
>
>> In alternate method,
>> Its simple to store the pointer of rdma_cgroup structure in the IB
>> resource structure and hold on reference count to rdma_cgroup.
>> so that when its freed, uncharge_resource can accept rdma_cgroup
>> structure pointer.
>
> That'd be a lot more in line with how other controllers behave.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 3/6] rdmacg: implements rdma cgroup
       [not found]                   ` <CAG53R5UuCNwCKWAuub=n3FkJR6Z5oyF6cVPeW=5nk-JrURY=jw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-01-07 20:41                     ` Tejun Heo
  0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 20:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA
Hello, Parav.
On Fri, Jan 08, 2016 at 02:09:23AM +0530, Parav Pandit wrote:
> > Hmmm, what happens after the following?
> >
> > 1. A process allocates some rdma resources and get registered on the
> >    hash table.
> >
> > 2. The process gets migrated to a different cgroup.
> >
> > 3. The process allocates more rdma resources.
> >
> > Which cgroup would the resources from #3 be attributed to?
> 
> Since the pid/tgid of the process doesn't change in step_3, it
> allocates from original cgroup of step_1.
It shouldn't work that way.
> However in next patch V2, as described below, since IB resource will
> store rdma_cg pointer,
> in step_3, new resource will be allocated from new cgroup.
> old resource will be freed from older cgroup.
> This is what you are expecting, right?
Yeap.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 20:34                     ` Tejun Heo
@ 2016-01-07 20:46                       ` Parav Pandit
  2016-01-07 20:49                         ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 20:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Fri, Jan 8, 2016 at 2:04 AM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Jan 08, 2016 at 02:02:20AM +0530, Parav Pandit wrote:
>> o.k. That doable. I want to make sure that we are on same page on below design.
>> rpool (which will contain static array based on header file ) would be
>> still there, because resource limits are on per device basis. Number
>> of devices are variable and dynamically appear. Therefore rdma_cg will
>> have the list of rpool attached to it. Do you agree?
>
> Yeap.  Would it make more sense to hang them off of whatever struct
> which presents a rdma device tho?  And then just walk them from cgroup
> controller?
>
Let me think through it. Its been late night for me currently. So dont
want to conclude in hurry.
At high level it looks doable by maintaining hash table head on per
device basis, that further reduces hash contention by one level.
I will get back on this tomorrow.
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 20:46                       ` Parav Pandit
@ 2016-01-07 20:49                         ` Tejun Heo
  2016-01-07 20:50                           ` Tejun Heo
  2016-01-07 21:04                           ` Parav Pandit
  0 siblings, 2 replies; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 20:49 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
Hello, Parav.
On Fri, Jan 08, 2016 at 02:16:59AM +0530, Parav Pandit wrote:
> Let me think through it. Its been late night for me currently. So dont
> want to conclude in hurry.
Sure thing.
> At high level it looks doable by maintaining hash table head on per
> device basis, that further reduces hash contention by one level.
> I will get back on this tomorrow.
Hmmm... why would it need a hash table?  Let's say there's a struct
rdma_device for each rdma_device and then that stuct can simply have
rdma_device->res_table[] or whatever to track limits and consumptions
and rdma_device->res_enabled mask to tell which resources are enabled
on the device.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 20:49                         ` Tejun Heo
@ 2016-01-07 20:50                           ` Tejun Heo
  2016-01-07 21:01                             ` Parav Pandit
  2016-01-07 21:04                           ` Parav Pandit
  1 sibling, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 20:50 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
Ooh, btw, please don't bother to create separate interfaces for v1 and
v2 hierarchies.  Just creating one following v2 conventions and using
the same thing for v1 should do and is a lot easier for everybody.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 20:50                           ` Tejun Heo
@ 2016-01-07 21:01                             ` Parav Pandit
  2016-01-07 21:07                               ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 21:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Fri, Jan 8, 2016 at 2:20 AM, Tejun Heo <tj@kernel.org> wrote:
> Ooh, btw, please don't bother to create separate interfaces for v1 and
> v2 hierarchies.  Just creating one following v2 conventions and using
> the same thing for v1 should do and is a lot easier for everybody.
>
Sure. I have already made that change to remove list.
and changed rdma.resource.verb.limit to rdma.verb.max etc.
I will keep rdma.hw.max around until we get some consensus.
Alternatively I was thinking to merge that as another attribute in
rdma.max file, like
mlx4_0 verb ah=max pd=10 qp=10
mlx4_0 hw ah=10 pd=100
ocrdma hw ah=10 pd=100
This remove hw specific extra file for rare feature.
Parav
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 20:49                         ` Tejun Heo
  2016-01-07 20:50                           ` Tejun Heo
@ 2016-01-07 21:04                           ` Parav Pandit
  2016-01-07 21:08                             ` Tejun Heo
  1 sibling, 1 reply; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 21:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Fri, Jan 8, 2016 at 2:19 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Jan 08, 2016 at 02:16:59AM +0530, Parav Pandit wrote:
>> Let me think through it. Its been late night for me currently. So dont
>> want to conclude in hurry.
>
> Sure thing.
>
>> At high level it looks doable by maintaining hash table head on per
>> device basis, that further reduces hash contention by one level.
>> I will get back on this tomorrow.
>
> Hmmm... why would it need a hash table?  Let's say there's a struct
> rdma_device for each rdma_device and then that stuct can simply have
> rdma_device->res_table[] or whatever to track limits and consumptions
> and rdma_device->res_enabled mask to tell which resources are enabled
> on the device.
>
That table won't be sufficient, because rdma_device is shared among
multiple rdma_cgroups each such cgroup has different individual
resource limit and usage count. This is currently rpool structure.
For res_table[] needs to be per cgroup basis.
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 21:01                             ` Parav Pandit
@ 2016-01-07 21:07                               ` Tejun Heo
  2016-01-07 21:10                                 ` Parav Pandit
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 21:07 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Fri, Jan 08, 2016 at 02:31:26AM +0530, Parav Pandit wrote:
> On Fri, Jan 8, 2016 at 2:20 AM, Tejun Heo <tj@kernel.org> wrote:
> > Ooh, btw, please don't bother to create separate interfaces for v1 and
> > v2 hierarchies.  Just creating one following v2 conventions and using
> > the same thing for v1 should do and is a lot easier for everybody.
> >
> Sure. I have already made that change to remove list.
> and changed rdma.resource.verb.limit to rdma.verb.max etc.
> 
> I will keep rdma.hw.max around until we get some consensus.
> 
> Alternatively I was thinking to merge that as another attribute in
> rdma.max file, like
> 
> mlx4_0 verb ah=max pd=10 qp=10
> mlx4_0 hw ah=10 pd=100
> ocrdma hw ah=10 pd=100
> 
> This remove hw specific extra file for rare feature.
Hmm... if there are duplicate keys, I think rdma.verb.max and
rdma.hw.max would be cleaner.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 21:04                           ` Parav Pandit
@ 2016-01-07 21:08                             ` Tejun Heo
  0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2016-01-07 21:08 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
Hello,
On Fri, Jan 08, 2016 at 02:34:49AM +0530, Parav Pandit wrote:
> That table won't be sufficient, because rdma_device is shared among
> multiple rdma_cgroups each such cgroup has different individual
> resource limit and usage count. This is currently rpool structure.
> For res_table[] needs to be per cgroup basis.
Ah, you're right.  Please do whatever seems better to you.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [PATCHv1 0/6] rdma controller support
  2016-01-07 21:07                               ` Tejun Heo
@ 2016-01-07 21:10                                 ` Parav Pandit
  0 siblings, 0 replies; 39+ messages in thread
From: Parav Pandit @ 2016-01-07 21:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Liran Liss, Hefty, Sean,
	Jason Gunthorpe, Haggai Eran, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module
On Fri, Jan 8, 2016 at 2:37 AM, Tejun Heo <tj@kernel.org> wrote:
> On Fri, Jan 08, 2016 at 02:31:26AM +0530, Parav Pandit wrote:
>> On Fri, Jan 8, 2016 at 2:20 AM, Tejun Heo <tj@kernel.org> wrote:
>> > Ooh, btw, please don't bother to create separate interfaces for v1 and
>> > v2 hierarchies.  Just creating one following v2 conventions and using
>> > the same thing for v1 should do and is a lot easier for everybody.
>> >
>> Sure. I have already made that change to remove list.
>> and changed rdma.resource.verb.limit to rdma.verb.max etc.
>>
>> I will keep rdma.hw.max around until we get some consensus.
>>
>> Alternatively I was thinking to merge that as another attribute in
>> rdma.max file, like
>>
>> mlx4_0 verb ah=max pd=10 qp=10
>> mlx4_0 hw ah=10 pd=100
>> ocrdma hw ah=10 pd=100
>>
>> This remove hw specific extra file for rare feature.
>
> Hmm... if there are duplicate keys, I think rdma.verb.max and
> rdma.hw.max would be cleaner.
>
ok. I agree. I will keep as you described.
> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 39+ messages in thread
end of thread, other threads:[~2016-01-07 21:10 UTC | newest]
Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-05 18:58 [PATCHv1 0/6] rdma controller support Parav Pandit
2016-01-05 18:58 ` [PATCHv1 1/6] rdmacg: Added rdma cgroup header file Parav Pandit
2016-01-05 18:58 ` [PATCHv1 2/6] IB/core: Added members to support rdma cgroup Parav Pandit
     [not found]   ` <1452020286-9508-3-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-01-05 21:56     ` Tejun Heo
2016-01-06 23:16       ` Parav Pandit
2016-01-07 15:07         ` Tejun Heo
     [not found]           ` <20160107150756.GD29797-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-01-07 19:40             ` Parav Pandit
2016-01-05 18:58 ` [PATCHv1 4/6] IB/core: rdmacg support infrastructure APIs Parav Pandit
2016-01-05 18:58 ` [PATCHv1 5/6] IB/core: use rdma cgroup for resource accounting Parav Pandit
2016-01-05 18:58 ` [PATCHv1 6/6] rdmacg: Added documentation for rdma controller Parav Pandit
     [not found]   ` <1452020286-9508-7-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-01-05 21:53     ` Tejun Heo
     [not found]       ` <20160105215309.GG5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-01-06 22:44         ` Parav Pandit
2016-01-06 22:57           ` Tejun Heo
     [not found]             ` <20160106225714.GI3660-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2016-01-06 23:52               ` Parav Pandit
     [not found]                 ` <CAG53R5XKpP6YAe8buRrgG=m4_dQK5MjkWtcA7CN-74AJ1SUUyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-01-07 15:42                   ` Tejun Heo
     [not found]                     ` <20160107154218.GF29797-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-01-07 19:43                       ` Parav Pandit
     [not found] ` <1452020286-9508-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-01-05 18:58   ` [PATCHv1 3/6] rdmacg: implements rdma cgroup Parav Pandit
2016-01-05 22:01     ` Tejun Heo
     [not found]       ` <20160105220128.GJ5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-01-06 23:33         ` Parav Pandit
2016-01-07 15:29           ` Tejun Heo
2016-01-07 20:25             ` Parav Pandit
2016-01-07 20:28               ` Tejun Heo
2016-01-07 20:39                 ` Parav Pandit
     [not found]                   ` <CAG53R5UuCNwCKWAuub=n3FkJR6Z5oyF6cVPeW=5nk-JrURY=jw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-01-07 20:41                     ` Tejun Heo
2016-01-05 21:56   ` [PATCHv1 0/6] rdma controller support Tejun Heo
     [not found]     ` <20160105215623.GH5995-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-01-06 23:13       ` Parav Pandit
     [not found]         ` <CAG53R5XQaUz=vJCEqLwGt93CC1YMnVQ=k4DuRU0nspcb_-+_6g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-01-07 15:07           ` Tejun Heo
     [not found]             ` <20160107150718.GC29797-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-01-07 20:01               ` Parav Pandit
2016-01-07 20:06                 ` Tejun Heo
2016-01-07 20:32                   ` Parav Pandit
2016-01-07 20:34                     ` Tejun Heo
2016-01-07 20:46                       ` Parav Pandit
2016-01-07 20:49                         ` Tejun Heo
2016-01-07 20:50                           ` Tejun Heo
2016-01-07 21:01                             ` Parav Pandit
2016-01-07 21:07                               ` Tejun Heo
2016-01-07 21:10                                 ` Parav Pandit
2016-01-07 21:04                           ` Parav Pandit
2016-01-07 21:08                             ` Tejun Heo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).