* [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-07 20:38 Parav Pandit
[not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
` (4 more replies)
0 siblings, 5 replies; 60+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
dledford-H+wXaHxf7aLQT0dZR+AlfA
Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w,
ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
raindel-VPRAkNaXOzVWk0Htik3J/w,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
linux-security-module-u79uwXL29TY76Z2rM5mHXA,
pandit.parav-Re5JQEeQqe8AvxtiuMwx3w
Currently, user space applications can easily take away all of an RDMA
device's specific resources such as AH, CQ, QP, MR, etc. As a result,
applications in other cgroups or kernel space ULPs may not even get a
chance to allocate any RDMA resources.
This patch set allows limiting RDMA resources to a set of processes.
It extends the device cgroup controller to enforce RDMA device resource
limits. With this patch, the user verbs module queries the RDMA device
cgroup controller for the process's limit before consuming such a
resource, and uncharges the resource counter after the resource is
freed.
It extends the task structure to hold statistics about the process's
RDMA resource usage, so that when a process migrates from one cgroup to
another, the right amount of resources can be migrated along with it.
Future patches will add support for the RDMA flow resource and will be
enhanced further to enforce limits on other resources and capabilities.
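As an illustration (not part of this series' diffs), a minimal sketch of
how a verbs object allocation path is expected to bracket hardware
allocation with the charge/uncharge interface: the devcgroup_rdma_*
calls and DEVCG_RDMA_RES_TYPE_CQ come from patch 5/7, while the
surrounding function and the create_cq_hw() helper are hypothetical.

static int example_create_cq(struct ib_ucontext *ucontext)
{
	int ret;

	/* fail the allocation if the cgroup limit would be exceeded */
	ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_CQ, 1);
	if (ret)
		return ret;

	ret = create_cq_hw(ucontext);	/* hypothetical hardware allocation */
	if (ret)
		/* roll back the charge if the hardware allocation fails */
		devcgroup_rdma_uncharge_resource(ucontext,
						 DEVCG_RDMA_RES_TYPE_CQ, 1);
	return ret;
}
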
Parav Pandit (7):
devcg: Added user option to rdma resource tracking.
devcg: Added rdma resource tracking module.
devcg: Added infrastructure for rdma device cgroup.
devcg: Added rdma resource tracker object per task
devcg: device cgroup's extension for RDMA resource.
devcg: Added support to use RDMA device cgroup.
devcg: Added Documentation of RDMA device cgroup.
Documentation/cgroups/devices.txt | 32 ++-
drivers/infiniband/core/uverbs_cmd.c | 139 +++++++++--
drivers/infiniband/core/uverbs_main.c | 39 +++-
include/linux/device_cgroup.h | 53 +++++
include/linux/device_rdma_cgroup.h | 83 +++++++
include/linux/sched.h | 12 +-
init/Kconfig | 12 +
security/Makefile | 1 +
security/device_cgroup.c | 119 +++++++---
security/device_rdma_cgroup.c | 422 ++++++++++++++++++++++++++++++++++
10 files changed, 850 insertions(+), 62 deletions(-)
create mode 100644 include/linux/device_rdma_cgroup.h
create mode 100644 security/device_rdma_cgroup.c
--
1.8.3.1
^ permalink raw reply	[flat|nested] 60+ messages in thread
[parent not found: <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* [PATCH 1/7] devcg: Added user option to rdma resource tracking. [not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-09-07 20:38 ` Parav Pandit 2015-09-07 20:38 ` [PATCH 2/7] devcg: Added rdma resource tracking module Parav Pandit ` (4 subsequent siblings) 5 siblings, 0 replies; 60+ messages in thread From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw) To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA, pandit.parav-Re5JQEeQqe8AvxtiuMwx3w Added user configuration option to enable/disable RDMA resource tracking feature of device cgroup as sub module. Signed-off-by: Parav Pandit <pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> --- init/Kconfig | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/init/Kconfig b/init/Kconfig index 2184b34..089db85 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -977,6 +977,18 @@ config CGROUP_DEVICE Provides a cgroup implementing whitelists for devices which a process in the cgroup can mknod or open. +config CGROUP_RDMA_RESOURCE + bool "RDMA Resource Controller for cgroups" + depends on CGROUP_DEVICE + default n + help + This option enables limiting rdma resources for a device cgroup. + Using this option, user space processes can be limited to use + limited number of RDMA resources such as MR, PD, QP, AH, FLOW, CQ + etc. + + Say N if unsure. + config CPUSETS bool "Cpuset support" help -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 60+ messages in thread
* [PATCH 2/7] devcg: Added rdma resource tracking module. [not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-09-07 20:38 ` [PATCH 1/7] devcg: Added user option to rdma resource tracking Parav Pandit @ 2015-09-07 20:38 ` Parav Pandit 2015-09-07 20:38 ` [PATCH 5/7] devcg: device cgroup's extension for RDMA resource Parav Pandit ` (3 subsequent siblings) 5 siblings, 0 replies; 60+ messages in thread From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw) To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA, pandit.parav-Re5JQEeQqe8AvxtiuMwx3w Added RDMA resource tracking object of device cgroup. Signed-off-by: Parav Pandit <pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> --- security/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/security/Makefile b/security/Makefile index c9bfbc8..c9ad56d 100644 --- a/security/Makefile +++ b/security/Makefile @@ -23,6 +23,7 @@ obj-$(CONFIG_SECURITY_TOMOYO) += tomoyo/ obj-$(CONFIG_SECURITY_APPARMOR) += apparmor/ obj-$(CONFIG_SECURITY_YAMA) += yama/ obj-$(CONFIG_CGROUP_DEVICE) += device_cgroup.o +obj-$(CONFIG_CGROUP_RDMA_RESOURCE) += device_rdma_cgroup.o # Object integrity file lists subdir-$(CONFIG_INTEGRITY) += integrity -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 60+ messages in thread
* [PATCH 5/7] devcg: device cgroup's extension for RDMA resource. [not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-09-07 20:38 ` [PATCH 1/7] devcg: Added user option to rdma resource tracking Parav Pandit 2015-09-07 20:38 ` [PATCH 2/7] devcg: Added rdma resource tracking module Parav Pandit @ 2015-09-07 20:38 ` Parav Pandit 2015-09-08 8:22 ` Haggai Eran 2015-09-08 8:36 ` Haggai Eran 2015-09-07 20:38 ` [PATCH 7/7] devcg: Added Documentation of RDMA device cgroup Parav Pandit ` (2 subsequent siblings) 5 siblings, 2 replies; 60+ messages in thread From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw) To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA, pandit.parav-Re5JQEeQqe8AvxtiuMwx3w Extension of device cgroup for RDMA device resources. This implements RDMA resource tracker to limit RDMA resources such as AH, CQ, PD, QP, MR, SRQ etc resources for processes of the cgroup. It implements RDMA resource limit module to limit consuming RDMA resources for processes of the cgroup. RDMA resources are tracked on per task basis. RDMA resources across multiple such devices are limited among multiple processes of the owning device cgroup. RDMA device cgroup extension returns error when user space applications try to allocate resources more than its configured limit. Signed-off-by: Parav Pandit <pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> --- include/linux/device_rdma_cgroup.h | 83 ++++++++ security/device_rdma_cgroup.c | 422 +++++++++++++++++++++++++++++++++++++ 2 files changed, 505 insertions(+) create mode 100644 include/linux/device_rdma_cgroup.h create mode 100644 security/device_rdma_cgroup.c diff --git a/include/linux/device_rdma_cgroup.h b/include/linux/device_rdma_cgroup.h new file mode 100644 index 0000000..a2c261b --- /dev/null +++ b/include/linux/device_rdma_cgroup.h @@ -0,0 +1,83 @@ +#ifndef _DEVICE_RDMA_CGROUP_H +#define _DEVICE_RDMA_CGROUP_H + +#include <linux/cgroup.h> + +/* RDMA resources from device cgroup perspective */ +enum devcgroup_rdma_rt { + DEVCG_RDMA_RES_TYPE_UCTX, + DEVCG_RDMA_RES_TYPE_CQ, + DEVCG_RDMA_RES_TYPE_PD, + DEVCG_RDMA_RES_TYPE_AH, + DEVCG_RDMA_RES_TYPE_MR, + DEVCG_RDMA_RES_TYPE_MW, + DEVCG_RDMA_RES_TYPE_SRQ, + DEVCG_RDMA_RES_TYPE_QP, + DEVCG_RDMA_RES_TYPE_FLOW, + DEVCG_RDMA_RES_TYPE_MAX, +}; + +struct ib_ucontext; + +#define DEVCG_RDMA_MAX_RESOURCES S32_MAX + +#ifdef CONFIG_CGROUP_RDMA_RESOURCE + +#define DEVCG_RDMA_MAX_RESOURCE_STR "max" + +enum devcgroup_rdma_access_files { + DEVCG_RDMA_LIST_USAGE, +}; + +struct task_rdma_res_counter { + /* allows atomic increment of task and cgroup counters + * to avoid race with migration task. 
+ */ + spinlock_t lock; + u32 usage[DEVCG_RDMA_RES_TYPE_MAX]; +}; + +struct devcgroup_rdma_tracker { + int limit; + atomic_t usage; + int failcnt; +}; + +struct devcgroup_rdma { + struct devcgroup_rdma_tracker tracker[DEVCG_RDMA_RES_TYPE_MAX]; +}; + +struct dev_cgroup; + +void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg); +ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of, + char *buf, + size_t nbytes, loff_t off); +int devcgroup_rdma_get_max_resource(struct seq_file *m, void *v); +int devcgroup_rdma_show_usage(struct seq_file *m, void *v); + +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num); +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext, + enum devcgroup_rdma_rt type, int num); +void devcgroup_rdma_fork(struct task_struct *task, void *priv); + +int devcgroup_rdma_can_attach(struct cgroup_subsys_state *css, + struct cgroup_taskset *tset); +void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *css, + struct cgroup_taskset *tset); +int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type); +#else + +static inline int devcgroup_rdma_try_charge_resource( + enum devcgroup_rdma_rt type, int num) +{ return 0; } +static inline void devcgroup_rdma_uncharge_resource( + struct ib_ucontext *ucontext, + enum devcgroup_rdma_rt type, int num) +{ } +static inline int devcgroup_rdma_query_resource_limit( + enum devcgroup_rdma_rt type) +{ return DEVCG_RDMA_MAX_RESOURCES; } +#endif + +#endif diff --git a/security/device_rdma_cgroup.c b/security/device_rdma_cgroup.c new file mode 100644 index 0000000..fb4cc59 --- /dev/null +++ b/security/device_rdma_cgroup.c @@ -0,0 +1,422 @@ +/* + * RDMA device cgroup controller of device controller cgroup. + * + * Provides a cgroup hierarchy to limit various RDMA resource allocation to a + * configured limit of the cgroup. + * + * Its easy for user space applications to consume of RDMA device specific + * hardware resources. Such resource exhaustion should be prevented so that + * user space applications and other kernel consumers gets chance to allocate + * and effectively use the hardware resources. + * + * In order to use the device rdma controller, set the maximum resource count + * per cgroup, which ensures that total rdma resources for processes belonging + * to a cgroup doesn't exceed configured limit. + * + * RDMA resource limits are hierarchical, so the highest configured limit of + * the hierarchy is enforced. Allowing resource limit configuration to default + * cgroup allows fair share to kernel space ULPs as well. + * + * This file is subject to the terms and conditions of version 2 of the GNU + * General Public License. See the file COPYING in the main directory of the + * Linux distribution for more details. + */ + +#include <linux/slab.h> +#include <linux/device_rdma_cgroup.h> +#include <linux/device_cgroup.h> +#include <rdma/ib_verbs.h> + +/** + * init_devcgroup_rdma_tracker - initialize resource limits. + * @dev_cg: device cgroup pointer for which limits should be + * initialized. 
+ */ +void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg) +{ + int i; + + for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) + dev_cg->rdma.tracker[i].limit = DEVCG_RDMA_MAX_RESOURCES; +} + +ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of, + char *buf, + size_t nbytes, loff_t off) +{ + struct cgroup_subsys_state *css = of_css(of); + struct dev_cgroup *dev_cg = css_to_devcgroup(css); + s64 new_limit; + int type = of_cft(of)->private; + int err; + + buf = strstrip(buf); + if (!strcmp(buf, DEVCG_RDMA_MAX_RESOURCE_STR)) { + new_limit = DEVCG_RDMA_MAX_RESOURCES; + goto max_limit; + } + + err = kstrtoll(buf, 0, &new_limit); + if (err) + return err; + + if (new_limit < 0 || new_limit >= DEVCG_RDMA_MAX_RESOURCES) + return -EINVAL; + +max_limit: + dev_cg->rdma.tracker[type].limit = new_limit; + return nbytes; +} + +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v) +{ + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf)); + int type = seq_cft(sf)->private; + u32 usage; + + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) { + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR); + } else { + usage = dev_cg->rdma.tracker[type].limit; + seq_printf(sf, "%u\n", usage); + } + return 0; +} + +static const char * const rdma_res_name[] = { + [DEVCG_RDMA_RES_TYPE_UCTX] = "uctx", + [DEVCG_RDMA_RES_TYPE_CQ] = "cq", + [DEVCG_RDMA_RES_TYPE_PD] = "pd", + [DEVCG_RDMA_RES_TYPE_AH] = "ah", + [DEVCG_RDMA_RES_TYPE_MR] = "mr", + [DEVCG_RDMA_RES_TYPE_MW] = "mw", + [DEVCG_RDMA_RES_TYPE_SRQ] = "srq", + [DEVCG_RDMA_RES_TYPE_QP] = "qp", + [DEVCG_RDMA_RES_TYPE_FLOW] = "flow", +}; + +int devcgroup_rdma_show_usage(struct seq_file *m, void *v) +{ + struct dev_cgroup *devcg = css_to_devcgroup(seq_css(m)); + const char *res_name = NULL; + u32 usage; + int i; + + for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) { + res_name = rdma_res_name[i]; + usage = atomic_read(&devcg->rdma.tracker[i].usage); + if (usage == DEVCG_RDMA_MAX_RESOURCES) + seq_printf(m, "%s %s\n", res_name, + DEVCG_RDMA_MAX_RESOURCE_STR); + else + seq_printf(m, "%s %u\n", res_name, usage); + }; + return 0; +} + +static void rdma_free_res_counter(struct task_struct *task) +{ + struct task_rdma_res_counter *res_cnt = NULL; + bool free_res = false; + + task_lock(task); + res_cnt = task->rdma_res_counter; + if (res_cnt && + res_cnt->usage[DEVCG_RDMA_RES_TYPE_UCTX] == 0) { + /* free resource counters if this is the last + * ucontext, which is getting deallocated. + */ + task->rdma_res_counter = NULL; + free_res = true; + } + task_unlock(task); + + /* synchronize with task migration activity from one to other cgroup + * which might be reading this task's resource counters. + */ + synchronize_rcu(); + if (free_res) + kfree(res_cnt); +} + +static void uncharge_resource(struct dev_cgroup *dev_cg, + enum devcgroup_rdma_rt type, s64 num) +{ + /* + * A negative count (or overflow for that matter) is invalid, + * and indicates a bug in the device rdma controller. + */ + WARN_ON_ONCE(atomic_add_negative(-num, + &dev_cg->rdma.tracker[type].usage)); +} + +static void uncharge_task_resource(struct task_struct *task, + struct dev_cgroup *cg, + enum devcgroup_rdma_rt type, + int num) +{ + struct dev_cgroup *p; + + if (!num) + return; + + /* protect against actual task which might be + * freeing resource counter memory due to no resource + * consumption. 
+ */ + task_lock(task); + if (!task->rdma_res_counter) { + task_unlock(task); + return; + } + for (p = cg; p; p = parent_devcgroup(p)) + uncharge_resource(p, type, num); + + task_unlock(task); +} + +/** + * devcgroup_rdma_uncharge_resource - hierarchically uncharge + * rdma resource count + * @ucontext: the ucontext from which to uncharge the resource + * pass null when caller knows that there was past allocation + * and its calling from same process context to which this resource + * belongs. + * @type: the type of resource to uncharge + * @num: the number of resource to uncharge + */ +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext, + enum devcgroup_rdma_rt type, int num) +{ + struct dev_cgroup *dev_cg, *p; + struct task_struct *ctx_task; + + if (!num) + return; + + /* get cgroup of ib_ucontext it belong to, to uncharge + * so that when its called from any worker tasks or any + * other tasks to which this resource doesn't belong to, + * it can be uncharged correctly. + */ + if (ucontext) + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID); + else + ctx_task = current; + dev_cg = task_devcgroup(ctx_task); + + spin_lock(&ctx_task->rdma_res_counter->lock); + ctx_task->rdma_res_counter->usage[type] -= num; + + for (p = dev_cg; p; p = parent_devcgroup(p)) + uncharge_resource(p, type, num); + + spin_unlock(&ctx_task->rdma_res_counter->lock); + + if (type == DEVCG_RDMA_RES_TYPE_UCTX) + rdma_free_res_counter(ctx_task); +} +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource); + +/** + * This function does not follow configured rdma resource limit. + * It cannot fail and the new rdma resource count may exceed the limit. + * This is only used during task migration where there is no other + * way out than violating the limit. + */ +static void charge_resource(struct dev_cgroup *dev_cg, + enum devcgroup_rdma_rt type, int num) +{ + struct dev_cgroup *p; + + for (p = dev_cg; p; p = parent_devcgroup(p)) { + struct devcgroup_rdma *rdma = &p->rdma; + + atomic_add(num, &rdma->tracker[type].usage); + } +} + +/** + * try_charge_resource - hierarchically try to charge + * the rdma resource count + * @type: the type of resource to uncharge + * @num: the number of rdma resource to charge + * + * This function follows the set limit. It will fail if the charge would cause + * the new value to exceed the hierarchical limit. Returns 0 if the charge + * succeded, otherwise -EAGAIN. + */ +static int try_charge_resource(struct dev_cgroup *dev_cg, + enum devcgroup_rdma_rt type, int num) +{ + struct dev_cgroup *p, *q; + + for (p = dev_cg; p; p = parent_devcgroup(p)) { + struct devcgroup_rdma *rdma = &p->rdma; + s64 new = atomic_add_return(num, + &rdma->tracker[type].usage); + + if (new > rdma->tracker[type].limit) + goto revert; + } + return 0; + +revert: + for (q = dev_cg; q != p; q = parent_devcgroup(q)) + uncharge_resource(q, type, num); + uncharge_resource(q, type, num); + return -EAGAIN; +} + +/** + * devcgroup_rdma_try_charge_resource - hierarchically try to charge + * the rdma resource count + * @type: the type of resource to uncharge + * @num: the number of rdma resource to charge + * + * This function follows the set limit in hierarchical way. + * It will fail if the charge would cause the new value to exceed the + * hierarchical limit. + * Returns 0 if the charge succeded, otherwise -EAGAIN. 
+ */ +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num) +{ + struct dev_cgroup *dev_cg = task_devcgroup(current); + struct task_rdma_res_counter *res_cnt = current->rdma_res_counter; + int status; + + if (!res_cnt) { + res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL); + if (!res_cnt) + return -ENOMEM; + + spin_lock_init(&res_cnt->lock); + rcu_assign_pointer(current->rdma_res_counter, res_cnt); + } + + /* synchronize with migration task by taking lock, to avoid + * race condition of performing cgroup resource migration + * in non atomic way with this task, which can leads to leaked + * resources in older cgroup. + */ + spin_lock(&res_cnt->lock); + status = try_charge_resource(dev_cg, type, num); + if (status) + goto busy; + + /* single task updating its rdma resource usage, so atomic is + * not required. + */ + current->rdma_res_counter->usage[type] += num; + +busy: + spin_unlock(&res_cnt->lock); + return status; +} +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource); + +/** + * devcgroup_rdma_query_resource_limit - query the resource limit + * for a given resource type of the calling user process. It returns the + * hierarchically smallest limit of the cgroup hierarchy. + * @type: the type of resource to query the limit + * Returns resource limit across all the RDMA devices accessible + * to this process. + */ +int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type) +{ + struct dev_cgroup *dev_cg, *p; + int cur_limit, limit; + + dev_cg = task_devcgroup(current); + limit = dev_cg->rdma.tracker[type].limit; + + /* find the controller in the given hirerchy with lowest limit, + * and report its limit to avoid confusion to user and applications, + * who rely on the query functionality. + */ + for (p = dev_cg; p; p = parent_devcgroup(p)) { + cur_limit = p->rdma.tracker[type].limit; + limit = min_t(int, cur_limit, limit); + } + return limit; +} +EXPORT_SYMBOL(devcgroup_rdma_query_resource_limit); + +int devcgroup_rdma_can_attach(struct cgroup_subsys_state *dst_css, + struct cgroup_taskset *tset) +{ + struct dev_cgroup *dst_cg = css_to_devcgroup(dst_css); + struct dev_cgroup *old_cg; + struct task_struct *task; + struct task_rdma_res_counter *task_res_cnt; + int val, i; + + cgroup_taskset_for_each(task, tset) { + old_cg = task_devcgroup(task); + + /* protect against a task which might be deallocating + * rdma_res_counter structure because last resource + * of the task might undergoing deallocation. + */ + rcu_read_lock(); + task_res_cnt = rcu_dereference(task->rdma_res_counter); + if (!task_res_cnt) + goto empty_task; + + spin_lock(&task_res_cnt->lock); + for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) { + val = task_res_cnt->usage[i]; + + charge_resource(dst_cg, i, val); + uncharge_task_resource(task, old_cg, i, val); + } + spin_unlock(&task_res_cnt->lock); + +empty_task: + rcu_read_unlock(); + } + return 0; +} + +void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *dst_css, + struct cgroup_taskset *tset) +{ + struct dev_cgroup *dst_cg = css_to_devcgroup(dst_css); + struct dev_cgroup *old_cg; + struct task_struct *task; + struct task_rdma_res_counter *task_res_cnt; + u32 val; int i; + + cgroup_taskset_for_each(task, tset) { + old_cg = task_devcgroup(task); + + /* protect against task deallocating rdma_res_counter structure + * because last ucontext resource of the task might be + * getting deallocated. 
+ */ + rcu_read_lock(); + task_res_cnt = rcu_dereference(task->rdma_res_counter); + if (!task_res_cnt) + goto empty_task; + + spin_lock(&task_res_cnt->lock); + for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) { + val = task_res_cnt->usage[i]; + + charge_resource(old_cg, i, val); + uncharge_task_resource(task, dst_cg, i, val); + } + spin_unlock(&task_res_cnt->lock); +empty_task: + rcu_read_unlock(); + } +} + +void devcgroup_rdma_fork(struct task_struct *task, void *priv) +{ + /* There is per task resource counters, + * so whatever clone as copied over, ignore it. + */ + task->rdma_res_counter = NULL; +} -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource. 2015-09-07 20:38 ` [PATCH 5/7] devcg: device cgroup's extension for RDMA resource Parav Pandit @ 2015-09-08 8:22 ` Haggai Eran 2015-09-08 10:18 ` Parav Pandit 2015-09-08 8:36 ` Haggai Eran 1 sibling, 1 reply; 60+ messages in thread From: Haggai Eran @ 2015-09-08 8:22 UTC (permalink / raw) To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes, dledford Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm, linux-security-module On 07/09/2015 23:38, Parav Pandit wrote: > +/* RDMA resources from device cgroup perspective */ > +enum devcgroup_rdma_rt { > + DEVCG_RDMA_RES_TYPE_UCTX, > + DEVCG_RDMA_RES_TYPE_CQ, > + DEVCG_RDMA_RES_TYPE_PD, > + DEVCG_RDMA_RES_TYPE_AH, > + DEVCG_RDMA_RES_TYPE_MR, > + DEVCG_RDMA_RES_TYPE_MW, I didn't see memory windows in dev_cgroup_files in patch 3. Is it used? > + DEVCG_RDMA_RES_TYPE_SRQ, > + DEVCG_RDMA_RES_TYPE_QP, > + DEVCG_RDMA_RES_TYPE_FLOW, > + DEVCG_RDMA_RES_TYPE_MAX, > +}; > +struct devcgroup_rdma_tracker { > + int limit; > + atomic_t usage; > + int failcnt; > +}; Have you considered using struct res_counter? > + * RDMA resource limits are hierarchical, so the highest configured limit of > + * the hierarchy is enforced. Allowing resource limit configuration to default > + * cgroup allows fair share to kernel space ULPs as well. In what way is the highest configured limit of the hierarchy enforced? I would expect all the limits along the hierarchy to be enforced. > +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v) > +{ > + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf)); > + int type = seq_cft(sf)->private; > + u32 usage; > + > + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) { > + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR); > + } else { > + usage = dev_cg->rdma.tracker[type].limit; If this is the resource limit, don't name it 'usage'. > + seq_printf(sf, "%u\n", usage); > + } > + return 0; > +} > +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v) > +{ > + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf)); > + int type = seq_cft(sf)->private; > + u32 usage; > + > + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) { > + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR); I'm not sure hiding the actual number is good, especially in the show_usage case. > + } else { > + usage = dev_cg->rdma.tracker[type].limit; > + seq_printf(sf, "%u\n", usage); > + } > + return 0; > +} > +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext, > + enum devcgroup_rdma_rt type, int num) > +{ > + struct dev_cgroup *dev_cg, *p; > + struct task_struct *ctx_task; > + > + if (!num) > + return; > + > + /* get cgroup of ib_ucontext it belong to, to uncharge > + * so that when its called from any worker tasks or any > + * other tasks to which this resource doesn't belong to, > + * it can be uncharged correctly. > + */ > + if (ucontext) > + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID); > + else > + ctx_task = current; > + dev_cg = task_devcgroup(ctx_task); > + > + spin_lock(&ctx_task->rdma_res_counter->lock); Don't you need an rcu read lock and rcu_dereference to access rdma_res_counter? 
> + ctx_task->rdma_res_counter->usage[type] -= num; > + > + for (p = dev_cg; p; p = parent_devcgroup(p)) > + uncharge_resource(p, type, num); > + > + spin_unlock(&ctx_task->rdma_res_counter->lock); > + > + if (type == DEVCG_RDMA_RES_TYPE_UCTX) > + rdma_free_res_counter(ctx_task); > +} > +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource); > +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num) > +{ > + struct dev_cgroup *dev_cg = task_devcgroup(current); > + struct task_rdma_res_counter *res_cnt = current->rdma_res_counter; > + int status; > + > + if (!res_cnt) { > + res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL); > + if (!res_cnt) > + return -ENOMEM; > + > + spin_lock_init(&res_cnt->lock); > + rcu_assign_pointer(current->rdma_res_counter, res_cnt); Don't you need the task lock to update rdma_res_counter here? > + } > + > + /* synchronize with migration task by taking lock, to avoid > + * race condition of performing cgroup resource migration > + * in non atomic way with this task, which can leads to leaked > + * resources in older cgroup. > + */ > + spin_lock(&res_cnt->lock); > + status = try_charge_resource(dev_cg, type, num); > + if (status) > + goto busy; > + > + /* single task updating its rdma resource usage, so atomic is > + * not required. > + */ > + current->rdma_res_counter->usage[type] += num; > + > +busy: > + spin_unlock(&res_cnt->lock); > + return status; > +} > +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource); Regards, Haggai ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource. 2015-09-08 8:22 ` Haggai Eran @ 2015-09-08 10:18 ` Parav Pandit 2015-09-08 13:50 ` Haggai Eran 0 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-08 10:18 UTC (permalink / raw) To: Haggai Eran Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On Tue, Sep 8, 2015 at 1:52 PM, Haggai Eran <haggaie@mellanox.com> wrote: > On 07/09/2015 23:38, Parav Pandit wrote: >> +/* RDMA resources from device cgroup perspective */ >> +enum devcgroup_rdma_rt { >> + DEVCG_RDMA_RES_TYPE_UCTX, >> + DEVCG_RDMA_RES_TYPE_CQ, >> + DEVCG_RDMA_RES_TYPE_PD, >> + DEVCG_RDMA_RES_TYPE_AH, >> + DEVCG_RDMA_RES_TYPE_MR, >> + DEVCG_RDMA_RES_TYPE_MW, > I didn't see memory windows in dev_cgroup_files in patch 3. Is it used? ib_uverbs_dereg_mr() needs a fix in my patch for MW and alloc_mw() also needs to use it. I will fix it. >> + DEVCG_RDMA_RES_TYPE_SRQ, >> + DEVCG_RDMA_RES_TYPE_QP, >> + DEVCG_RDMA_RES_TYPE_FLOW, >> + DEVCG_RDMA_RES_TYPE_MAX, >> +}; > >> +struct devcgroup_rdma_tracker { >> + int limit; >> + atomic_t usage; >> + int failcnt; >> +}; > Have you considered using struct res_counter? No. I will look into the structure and see if it fits or not. > >> + * RDMA resource limits are hierarchical, so the highest configured limit of >> + * the hierarchy is enforced. Allowing resource limit configuration to default >> + * cgroup allows fair share to kernel space ULPs as well. > In what way is the highest configured limit of the hierarchy enforced? I > would expect all the limits along the hierarchy to be enforced. > In hierarchy, of say 3 cgroups, the smallest limit of the cgroup is applied. Lets take example to clarify. Say cg_A, cg_B, cg_C Role name limit Parent cg_A 100 Child_level1 cg_B (child of cg_A) 20 Child_level2: cg_C (child of cg_B) 50 If the process allocating rdma resource belongs to cg_C, limit lowest limit in the hierarchy is applied during charge() stage. If cg_A limit happens to be 10, since 10 is lowest, its limit would be applicable as you expected. this is similar to newly added PID subsystem in functionality. >> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v) >> +{ >> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf)); >> + int type = seq_cft(sf)->private; >> + u32 usage; >> + >> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) { >> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR); >> + } else { >> + usage = dev_cg->rdma.tracker[type].limit; > If this is the resource limit, don't name it 'usage'. > o.k. This is typo mistake from usage show function I made. I will change it. >> + seq_printf(sf, "%u\n", usage); >> + } >> + return 0; >> +} > >> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v) >> +{ >> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf)); >> + int type = seq_cft(sf)->private; >> + u32 usage; >> + >> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) { >> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR); > I'm not sure hiding the actual number is good, especially in the > show_usage case. This is similar to following other controller same as newly added PID subsystem in showing max limit. 
> >> + } else { >> + usage = dev_cg->rdma.tracker[type].limit; >> + seq_printf(sf, "%u\n", usage); >> + } >> + return 0; >> +} > >> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext, >> + enum devcgroup_rdma_rt type, int num) >> +{ >> + struct dev_cgroup *dev_cg, *p; >> + struct task_struct *ctx_task; >> + >> + if (!num) >> + return; >> + >> + /* get cgroup of ib_ucontext it belong to, to uncharge >> + * so that when its called from any worker tasks or any >> + * other tasks to which this resource doesn't belong to, >> + * it can be uncharged correctly. >> + */ >> + if (ucontext) >> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID); >> + else >> + ctx_task = current; >> + dev_cg = task_devcgroup(ctx_task); >> + >> + spin_lock(&ctx_task->rdma_res_counter->lock); > Don't you need an rcu read lock and rcu_dereference to access > rdma_res_counter? I believe, its not required because when uncharge() is happening, it can happen only from 3 contexts. (a) from the caller task context, who has made allocation call, so no synchronizing needed. (b) from the dealloc resource context, again this is from the same task context which allocated, it so this is single threaded, no need to syncronize. (c) from the fput() context when process is terminated abruptly or as part of differed cleanup, when this is happening there cannot be allocator task anyway. > >> + ctx_task->rdma_res_counter->usage[type] -= num; >> + >> + for (p = dev_cg; p; p = parent_devcgroup(p)) >> + uncharge_resource(p, type, num); >> + >> + spin_unlock(&ctx_task->rdma_res_counter->lock); >> + >> + if (type == DEVCG_RDMA_RES_TYPE_UCTX) >> + rdma_free_res_counter(ctx_task); >> +} >> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource); > >> +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num) >> +{ >> + struct dev_cgroup *dev_cg = task_devcgroup(current); >> + struct task_rdma_res_counter *res_cnt = current->rdma_res_counter; >> + int status; >> + >> + if (!res_cnt) { >> + res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL); >> + if (!res_cnt) >> + return -ENOMEM; >> + >> + spin_lock_init(&res_cnt->lock); >> + rcu_assign_pointer(current->rdma_res_counter, res_cnt); > Don't you need the task lock to update rdma_res_counter here? > No. this is the caller task allocating it, so its single threaded. It needs to syncronize with migration thread which is reading counters of all the processes, while they are getting allocated and freed. Therefore rcu() is sufficient. >> + } >> + >> + /* synchronize with migration task by taking lock, to avoid >> + * race condition of performing cgroup resource migration >> + * in non atomic way with this task, which can leads to leaked >> + * resources in older cgroup. >> + */ >> + spin_lock(&res_cnt->lock); >> + status = try_charge_resource(dev_cg, type, num); >> + if (status) >> + goto busy; >> + >> + /* single task updating its rdma resource usage, so atomic is >> + * not required. >> + */ >> + current->rdma_res_counter->usage[type] += num; >> + >> +busy: >> + spin_unlock(&res_cnt->lock); >> + return status; >> +} >> +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource); > > Regards, > Haggai ^ permalink raw reply [flat|nested] 60+ messages in thread
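To make the cg_A/cg_B/cg_C example above concrete, a small stand-alone
sketch (plain C, not the kernel code from patch 5/7) of the min-walk
that devcgroup_rdma_query_resource_limit() performs; the limits array
below simply encodes the numbers used in the example:

#include <stdio.h>

int main(void)
{
	/* limits from the task's cgroup (cg_C) up to the root (cg_A) */
	int limits[] = { 50 /* cg_C */, 20 /* cg_B */, 100 /* cg_A */ };
	int i, effective = limits[0];

	/* walk towards the root and keep the smallest configured limit */
	for (i = 1; i < 3; i++)
		if (limits[i] < effective)
			effective = limits[i];

	/* prints 20: cg_B is the tightest limit along the hierarchy */
	printf("effective limit for a task in cg_C: %d\n", effective);
	return 0;
}
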
* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource. 2015-09-08 10:18 ` Parav Pandit @ 2015-09-08 13:50 ` Haggai Eran 2015-09-08 14:13 ` Parav Pandit 0 siblings, 1 reply; 60+ messages in thread From: Haggai Eran @ 2015-09-08 13:50 UTC (permalink / raw) To: Parav Pandit Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On 08/09/2015 13:18, Parav Pandit wrote: >> > >>> >> + * RDMA resource limits are hierarchical, so the highest configured limit of >>> >> + * the hierarchy is enforced. Allowing resource limit configuration to default >>> >> + * cgroup allows fair share to kernel space ULPs as well. >> > In what way is the highest configured limit of the hierarchy enforced? I >> > would expect all the limits along the hierarchy to be enforced. >> > > In hierarchy, of say 3 cgroups, the smallest limit of the cgroup is applied. > > Lets take example to clarify. > Say cg_A, cg_B, cg_C > Role name limit > Parent cg_A 100 > Child_level1 cg_B (child of cg_A) 20 > Child_level2: cg_C (child of cg_B) 50 > > If the process allocating rdma resource belongs to cg_C, limit lowest > limit in the hierarchy is applied during charge() stage. > If cg_A limit happens to be 10, since 10 is lowest, its limit would be > applicable as you expected. Looking at the code, the usage in every level is charged. This is what I would expect. I just think the comment is a bit misleading. >>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v) >>> +{ >>> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf)); >>> + int type = seq_cft(sf)->private; >>> + u32 usage; >>> + >>> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) { >>> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR); >> I'm not sure hiding the actual number is good, especially in the >> show_usage case. > > This is similar to following other controller same as newly added PID > subsystem in showing max limit. Okay. >>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext, >>> + enum devcgroup_rdma_rt type, int num) >>> +{ >>> + struct dev_cgroup *dev_cg, *p; >>> + struct task_struct *ctx_task; >>> + >>> + if (!num) >>> + return; >>> + >>> + /* get cgroup of ib_ucontext it belong to, to uncharge >>> + * so that when its called from any worker tasks or any >>> + * other tasks to which this resource doesn't belong to, >>> + * it can be uncharged correctly. >>> + */ >>> + if (ucontext) >>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID); >>> + else >>> + ctx_task = current; >>> + dev_cg = task_devcgroup(ctx_task); >>> + >>> + spin_lock(&ctx_task->rdma_res_counter->lock); >> Don't you need an rcu read lock and rcu_dereference to access >> rdma_res_counter? > > I believe, its not required because when uncharge() is happening, it > can happen only from 3 contexts. > (a) from the caller task context, who has made allocation call, so no > synchronizing needed. > (b) from the dealloc resource context, again this is from the same > task context which allocated, it so this is single threaded, no need > to syncronize. I don't think it is true. You can access uverbs from multiple threads. What may help your case here I think is the fact that only when the last ucontext is released you can change the rdma_res_counter field, and ucontext release takes the ib_uverbs_file->mutex. 
Still, I think it would be best to use rcu_dereference(), if only for documentation and sparse. > (c) from the fput() context when process is terminated abruptly or as > part of differed cleanup, when this is happening there cannot be > allocator task anyway. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource. 2015-09-08 13:50 ` Haggai Eran @ 2015-09-08 14:13 ` Parav Pandit 0 siblings, 0 replies; 60+ messages in thread From: Parav Pandit @ 2015-09-08 14:13 UTC (permalink / raw) To: Haggai Eran Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On Tue, Sep 8, 2015 at 7:20 PM, Haggai Eran <haggaie@mellanox.com> wrote: > On 08/09/2015 13:18, Parav Pandit wrote: >>> > >>>> >> + * RDMA resource limits are hierarchical, so the highest configured limit of >>>> >> + * the hierarchy is enforced. Allowing resource limit configuration to default >>>> >> + * cgroup allows fair share to kernel space ULPs as well. >>> > In what way is the highest configured limit of the hierarchy enforced? I >>> > would expect all the limits along the hierarchy to be enforced. >>> > >> In hierarchy, of say 3 cgroups, the smallest limit of the cgroup is applied. >> >> Lets take example to clarify. >> Say cg_A, cg_B, cg_C >> Role name limit >> Parent cg_A 100 >> Child_level1 cg_B (child of cg_A) 20 >> Child_level2: cg_C (child of cg_B) 50 >> >> If the process allocating rdma resource belongs to cg_C, limit lowest >> limit in the hierarchy is applied during charge() stage. >> If cg_A limit happens to be 10, since 10 is lowest, its limit would be >> applicable as you expected. > > Looking at the code, the usage in every level is charged. This is what I > would expect. I just think the comment is a bit misleading. > >>>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v) >>>> +{ >>>> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf)); >>>> + int type = seq_cft(sf)->private; >>>> + u32 usage; >>>> + >>>> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) { >>>> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR); >>> I'm not sure hiding the actual number is good, especially in the >>> show_usage case. >> >> This is similar to following other controller same as newly added PID >> subsystem in showing max limit. > > Okay. > >>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext, >>>> + enum devcgroup_rdma_rt type, int num) >>>> +{ >>>> + struct dev_cgroup *dev_cg, *p; >>>> + struct task_struct *ctx_task; >>>> + >>>> + if (!num) >>>> + return; >>>> + >>>> + /* get cgroup of ib_ucontext it belong to, to uncharge >>>> + * so that when its called from any worker tasks or any >>>> + * other tasks to which this resource doesn't belong to, >>>> + * it can be uncharged correctly. >>>> + */ >>>> + if (ucontext) >>>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID); >>>> + else >>>> + ctx_task = current; >>>> + dev_cg = task_devcgroup(ctx_task); >>>> + >>>> + spin_lock(&ctx_task->rdma_res_counter->lock); >>> Don't you need an rcu read lock and rcu_dereference to access >>> rdma_res_counter? >> >> I believe, its not required because when uncharge() is happening, it >> can happen only from 3 contexts. >> (a) from the caller task context, who has made allocation call, so no >> synchronizing needed. >> (b) from the dealloc resource context, again this is from the same >> task context which allocated, it so this is single threaded, no need >> to syncronize. > I don't think it is true. You can access uverbs from multiple threads. Yes, thats right. 
Though I design counter structure allocation on per task basis for individual thread access, I totally missed out ucontext sharing among threads. I replied in other thread to make counters during charge, uncharge to atomic to cover that case. Therefore I need rcu lock and deference as well. > What may help your case here I think is the fact that only when the last > ucontext is released you can change the rdma_res_counter field, and > ucontext release takes the ib_uverbs_file->mutex. > > Still, I think it would be best to use rcu_dereference(), if only for > documentation and sparse. yes. > >> (c) from the fput() context when process is terminated abruptly or as >> part of differed cleanup, when this is happening there cannot be >> allocator task anyway. > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource. 2015-09-07 20:38 ` [PATCH 5/7] devcg: device cgroup's extension for RDMA resource Parav Pandit 2015-09-08 8:22 ` Haggai Eran @ 2015-09-08 8:36 ` Haggai Eran [not found] ` <55EE9DF5.7030401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 60+ messages in thread From: Haggai Eran @ 2015-09-08 8:36 UTC (permalink / raw) To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes, dledford Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm, linux-security-module On 07/09/2015 23:38, Parav Pandit wrote: > +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext, > + enum devcgroup_rdma_rt type, int num) > +{ > + struct dev_cgroup *dev_cg, *p; > + struct task_struct *ctx_task; > + > + if (!num) > + return; > + > + /* get cgroup of ib_ucontext it belong to, to uncharge > + * so that when its called from any worker tasks or any > + * other tasks to which this resource doesn't belong to, > + * it can be uncharged correctly. > + */ > + if (ucontext) > + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID); > + else > + ctx_task = current; So what happens if a process creates a ucontext, forks, and then the child creates and destroys a CQ? If I understand correctly, created resources are always charged to the current process (the child), but when it is destroyed the owner of the ucontext (the parent) will be uncharged. Since ucontexts are not meant to be used by multiple processes, I think it would be okay to always charge the owner process (the one that created the ucontext). > + dev_cg = task_devcgroup(ctx_task); > + > + spin_lock(&ctx_task->rdma_res_counter->lock); > + ctx_task->rdma_res_counter->usage[type] -= num; > + > + for (p = dev_cg; p; p = parent_devcgroup(p)) > + uncharge_resource(p, type, num); > + > + spin_unlock(&ctx_task->rdma_res_counter->lock); > + > + if (type == DEVCG_RDMA_RES_TYPE_UCTX) > + rdma_free_res_counter(ctx_task); > +} > +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource); ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <55EE9DF5.7030401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource. [not found] ` <55EE9DF5.7030401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-09-08 10:50 ` Parav Pandit 2015-09-08 14:10 ` Haggai Eran 0 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-08 10:50 UTC (permalink / raw) To: Haggai Eran Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote: > On 07/09/2015 23:38, Parav Pandit wrote: >> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext, >> + enum devcgroup_rdma_rt type, int num) >> +{ >> + struct dev_cgroup *dev_cg, *p; >> + struct task_struct *ctx_task; >> + >> + if (!num) >> + return; >> + >> + /* get cgroup of ib_ucontext it belong to, to uncharge >> + * so that when its called from any worker tasks or any >> + * other tasks to which this resource doesn't belong to, >> + * it can be uncharged correctly. >> + */ >> + if (ucontext) >> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID); >> + else >> + ctx_task = current; > So what happens if a process creates a ucontext, forks, and then the > child creates and destroys a CQ? If I understand correctly, created > resources are always charged to the current process (the child), but > when it is destroyed the owner of the ucontext (the parent) will be > uncharged. > > Since ucontexts are not meant to be used by multiple processes, I think > it would be okay to always charge the owner process (the one that > created the ucontext). I need to think about it. I would like to avoid keep per task resource counters for two reasons. For a while I thought that native fork() doesn't take care to share the RDMA resources and all CQ, QP dmaable memory from PID namespace perspective. 1. Because, it could well happen that process and its child process is created in PID namespace_A, after which child is migrated to new PID namespace_B. after which parent from the namespace_A is terminated. I am not sure how the ucontext ownership changes from parent to child process at that point today. I prefer to keep this complexity out if at all it exists as process migration across namespaces is not a frequent event for which to optimize the code for. 2. by having per task counter (as cost of memory some memory) allows to avoid using atomic during charge(), uncharge(). The intent is to have per task (process and thread) to have their resource counter instance, but I can see that its broken where its charging parent process as of now without atomics. As you said its ok to always charge the owner process, I have to relax 2nd requirement and fallback to use atomics for charge(), uncharge() or I have to get rid of ucontext from the uncharge() API which is difficult due to fput() being in worker thread context. 
> >> + dev_cg = task_devcgroup(ctx_task); >> + >> + spin_lock(&ctx_task->rdma_res_counter->lock); >> + ctx_task->rdma_res_counter->usage[type] -= num; >> + >> + for (p = dev_cg; p; p = parent_devcgroup(p)) >> + uncharge_resource(p, type, num); >> + >> + spin_unlock(&ctx_task->rdma_res_counter->lock); >> + >> + if (type == DEVCG_RDMA_RES_TYPE_UCTX) >> + rdma_free_res_counter(ctx_task); >> +} >> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource); > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource. 2015-09-08 10:50 ` Parav Pandit @ 2015-09-08 14:10 ` Haggai Eran 0 siblings, 0 replies; 60+ messages in thread From: Haggai Eran @ 2015-09-08 14:10 UTC (permalink / raw) To: Parav Pandit Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On 08/09/2015 13:50, Parav Pandit wrote: > On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran <haggaie@mellanox.com> wrote: >> On 07/09/2015 23:38, Parav Pandit wrote: >>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext, >>> + enum devcgroup_rdma_rt type, int num) >>> +{ >>> + struct dev_cgroup *dev_cg, *p; >>> + struct task_struct *ctx_task; >>> + >>> + if (!num) >>> + return; >>> + >>> + /* get cgroup of ib_ucontext it belong to, to uncharge >>> + * so that when its called from any worker tasks or any >>> + * other tasks to which this resource doesn't belong to, >>> + * it can be uncharged correctly. >>> + */ >>> + if (ucontext) >>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID); >>> + else >>> + ctx_task = current; >> So what happens if a process creates a ucontext, forks, and then the >> child creates and destroys a CQ? If I understand correctly, created >> resources are always charged to the current process (the child), but >> when it is destroyed the owner of the ucontext (the parent) will be >> uncharged. >> >> Since ucontexts are not meant to be used by multiple processes, I think >> it would be okay to always charge the owner process (the one that >> created the ucontext). > > I need to think about it. I would like to avoid keep per task resource > counters for two reasons. > For a while I thought that native fork() doesn't take care to share > the RDMA resources and all CQ, QP dmaable memory from PID namespace > perspective. > > 1. Because, it could well happen that process and its child process is > created in PID namespace_A, after which child is migrated to new PID > namespace_B. > after which parent from the namespace_A is terminated. I am not sure > how the ucontext ownership changes from parent to child process at > that point today. > I prefer to keep this complexity out if at all it exists as process > migration across namespaces is not a frequent event for which to > optimize the code for. > > 2. by having per task counter (as cost of memory some memory) allows > to avoid using atomic during charge(), uncharge(). > > The intent is to have per task (process and thread) to have their > resource counter instance, but I can see that its broken where its > charging parent process as of now without atomics. > As you said its ok to always charge the owner process, I have to relax > 2nd requirement and fallback to use atomics for charge(), uncharge() > or I have to get rid of ucontext from the uncharge() API which is > difficult due to fput() being in worker thread context. > I think the cost of atomic operations here would normally be negligible compared to the cost of accessing the hardware to allocate or deallocate these resources. ^ permalink raw reply [flat|nested] 60+ messages in thread
* [PATCH 7/7] devcg: Added Documentation of RDMA device cgroup. [not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> ` (2 preceding siblings ...) 2015-09-07 20:38 ` [PATCH 5/7] devcg: device cgroup's extension for RDMA resource Parav Pandit @ 2015-09-07 20:38 ` Parav Pandit 2015-09-08 12:45 ` [PATCH 0/7] devcg: device cgroup extension for rdma resource Haggai Eran 2015-09-08 15:23 ` Tejun Heo 5 siblings, 0 replies; 60+ messages in thread From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw) To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA, pandit.parav-Re5JQEeQqe8AvxtiuMwx3w Modified device cgroup documentation to reflect its dual purpose without creating new cgroup subsystem for rdma. Added documentation to describe functionality and usage of device cgroup extension for RDMA. Signed-off-by: Parav Pandit <pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> --- Documentation/cgroups/devices.txt | 32 +++++++++++++++++++++++++++++--- 1 file changed, 29 insertions(+), 3 deletions(-) diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt index 3c1095c..eca5b70 100644 --- a/Documentation/cgroups/devices.txt +++ b/Documentation/cgroups/devices.txt @@ -1,9 +1,12 @@ -Device Whitelist Controller +Device Controller 1. Description: -Implement a cgroup to track and enforce open and mknod restrictions -on device files. A device cgroup associates a device access +Device controller implements a cgroup for two purposes. + +1.1 Device white list controller +It implement a cgroup to track and enforce open and mknod +restrictions on device files. A device cgroup associates a device access whitelist with each cgroup. A whitelist entry has 4 fields. 'type' is a (all), c (char), or b (block). 'all' means it applies to all types and all major and minor numbers. Major and minor are @@ -15,8 +18,15 @@ cgroup gets a copy of the parent. Administrators can then remove devices from the whitelist or add new entries. A child cgroup can never receive a device access which is denied by its parent. +1.2 RDMA device resource controller +It implements a cgroup to limit various RDMA device resources for +a controller. Such resource includes RDMA PD, CQ, AH, MR, SRQ, QP, FLOW. +It limits RDMA resources access to tasks of the cgroup across multiple +RDMA devices. + 2. User Interface +2.1 Device white list controller An entry is added using devices.allow, and removed using devices.deny. For instance @@ -33,6 +43,22 @@ will remove the default 'a *:* rwm' entry. Doing will add the 'a *:* rwm' entry to the whitelist. +2.2 RDMA device controller + +RDMA resources are limited using devices.rdma.resource.max.<resource_name>. +Doing + echo 200 > /sys/fs/cgroup/1/rdma.resource.max_qp +will limit maximum number of QP across all the process of cgroup to 200. 
+ +More examples: + echo 200 > /sys/fs/cgroup/1/rdma.resource.max_flow + echo 10 > /sys/fs/cgroup/1/rdma.resource.max_pd + echo 15 > /sys/fs/cgroup/1/rdma.resource.max_srq + echo 1 > /sys/fs/cgroup/1/rdma.resource.max_uctx + +RDMA resource current usage can be tracked using devices.rdma.resource.usage + cat /sys/fs/cgroup/1/devices.rdma.resource.usage + 3. Security Any task can move itself between cgroups. This clearly won't -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> ` (3 preceding siblings ...) 2015-09-07 20:38 ` [PATCH 7/7] devcg: Added Documentation of RDMA device cgroup Parav Pandit @ 2015-09-08 12:45 ` Haggai Eran 2015-09-08 15:23 ` Tejun Heo 5 siblings, 0 replies; 60+ messages in thread From: Haggai Eran @ 2015-09-08 12:45 UTC (permalink / raw) To: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA On 07/09/2015 23:38, Parav Pandit wrote: > Currently user space applications can easily take away all the rdma > device specific resources such as AH, CQ, QP, MR etc. Due to which other > applications in other cgroup or kernel space ULPs may not even get chance > to allocate any rdma resources. > > This patch-set allows limiting rdma resources to set of processes. > It extend device cgroup controller for limiting rdma device limits. I don't think extending the device cgroup is the right place for these limits. It is currently a very generic controller and adding various RDMA resources to it looks out of place. Why not create a new controller for rdma? Another thing I noticed is that all limits in this cgroup are global, while the resources they control are hardware device specific. I think it would be better if the cgroup controlled the limits of each device separately. > With this patch, user verbs module queries rdma device cgroup controller > to query process's limit to consume such resource. It uncharge resource > counter after resource is being freed. This is another reason why per-device limits would be better. Since limits are reflected to user-space when querying a specific device, it will show the same maximum limit on every device opened. If the user opens 3 devices they might expect to be able to open 3 times the number of the resources they actually can. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> ` (4 preceding siblings ...) 2015-09-08 12:45 ` [PATCH 0/7] devcg: device cgroup extension for rdma resource Haggai Eran @ 2015-09-08 15:23 ` Tejun Heo 2015-09-09 3:57 ` Parav Pandit 5 siblings, 1 reply; 60+ messages in thread From: Tejun Heo @ 2015-09-08 15:23 UTC (permalink / raw) To: Parav Pandit Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA, corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w, ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA Hello, Parav. On Tue, Sep 08, 2015 at 02:08:16AM +0530, Parav Pandit wrote: > Currently user space applications can easily take away all the rdma > device specific resources such as AH, CQ, QP, MR etc. Due to which other > applications in other cgroup or kernel space ULPs may not even get chance > to allocate any rdma resources. Is there something simple I can read up on what each resource is? What's the usual access control mechanism? > This patch-set allows limiting rdma resources to set of processes. > It extend device cgroup controller for limiting rdma device limits. I don't think this belongs to devcg. If these make sense as a set of resources to be controlled via cgroup, the right way prolly would be a separate controller. Thanks. -- tejun ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-08 15:23 ` Tejun Heo @ 2015-09-09 3:57 ` Parav Pandit 2015-09-10 16:49 ` Tejun Heo 0 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-09 3:57 UTC (permalink / raw) To: Tejun Heo Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On Tue, Sep 8, 2015 at 8:53 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Parav. > > On Tue, Sep 08, 2015 at 02:08:16AM +0530, Parav Pandit wrote: >> Currently user space applications can easily take away all the rdma >> device specific resources such as AH, CQ, QP, MR etc. Due to which other >> applications in other cgroup or kernel space ULPs may not even get chance >> to allocate any rdma resources. > > Is there something simple I can read up on what each resource is? > What's the usual access control mechanism? > Hi Tejun, This is one old white paper, but most of the reasoning still holds true on RDMA. http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf More notes on RDMA resources and summary: RDMA allows data transport from one system to other system where RDMA device implements OSI layers 4 to 1 typically in hardware, drivers. RDMA device provides data path semantics to perform data transfer in zero copy manner from one to other host, very similar to local dma controller. It also allows data transfer operation from user space application of one to other system. In order to do so, all the resources are created using trusted kernel space which also provides isolation among applications. These resources include are- QP (queue pair) to transfer data, CQ (Completion queue) to indicate completion of data transfer operation, MR (memory region) to represent user application memory as source or destination for data transfer. Common resources are QP, SRQ (shared received queue), CQ, MR, AH (Address handle), FLOW, PD (protection domain), user context etc. >> This patch-set allows limiting rdma resources to set of processes. >> It extend device cgroup controller for limiting rdma device limits. > > I don't think this belongs to devcg. If these make sense as a set of > resources to be controlled via cgroup, the right way prolly would be a > separate controller. > In past there has been similar comment to have dedicated cgroup controller for RDMA instead of merging with device cgroup. I am ok with both the approach, however I prefer to utilize device controller instead of spinning of new controller for new devices category. I anticipate more such need would arise and for new device category, it might not be worth to have new cgroup controller. RapidIO though very less popular and upcoming PCIe are on horizon to offer similar benefits as that of RDMA and in future having one controller for each of them again would not be right approach. I certainly seek your and others inputs in this email thread here whether (a) to continue to extend device cgroup (which support character, block devices white list) and now RDMA devices or (b) to spin of new controller, if so what are the compelling reasons that it can provide compare to extension. Current scope of the patch is limited to RDMA resources as first patch, but for fact I am sure that there are more functionality in pipe to support via this cgroup by me and others. So keeping atleast these two aspects in mind, I need input on direction of dedicated controller or new one. 
In the future, I anticipate that we might have sub-directories under the device cgroup to control individual device classes, such as:
<sys/fs/cgroup/devices/
 /char
 /block
 /rdma
 /pcie
 /child_cgroup..1..N
Each controller's cgroup access files would remain within its own scope. We are not there yet in the base infrastructure, but it is something to be done as the controller matures and users start using it.

> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 60+ messages in thread
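To make the linkage between these objects concrete, the following is a minimal user-space sketch of the verbs-level allocations a single reliable connection typically charges against such limits (one user context, one PD, one CQ, one QP). It uses standard libibverbs calls; the queue depths are arbitrary and error handling is trimmed to the failure path a cgroup limit would trigger.

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **dev_list = ibv_get_device_list(NULL);
	struct ibv_qp_init_attr qp_attr = {0};
	struct ibv_context *ctx;
	struct ibv_pd *pd;
	struct ibv_cq *cq;
	struct ibv_qp *qp;

	if (!dev_list || !dev_list[0])
		return 1;

	ctx = ibv_open_device(dev_list[0]);             /* user context */
	pd  = ctx ? ibv_alloc_pd(ctx) : NULL;           /* PD */
	cq  = ctx ? ibv_create_cq(ctx, 16, NULL, NULL, 0) : NULL; /* CQ */
	if (!pd || !cq) {
		fprintf(stderr, "uctx/PD/CQ allocation failed (limit reached?)\n");
		return 1;
	}

	/* The QP is tied to its PD and CQ(s). */
	qp_attr.send_cq = cq;
	qp_attr.recv_cq = cq;
	qp_attr.qp_type = IBV_QPT_RC;
	qp_attr.cap.max_send_wr  = 16;
	qp_attr.cap.max_recv_wr  = 16;
	qp_attr.cap.max_send_sge = 1;
	qp_attr.cap.max_recv_sge = 1;
	qp = ibv_create_qp(pd, &qp_attr);               /* QP */
	if (!qp) {
		fprintf(stderr, "QP allocation failed (limit reached?)\n");
		return 1;
	}

	/* Destroying the objects is what would uncharge the cgroup counters. */
	ibv_destroy_qp(qp);
	ibv_destroy_cq(cq);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(dev_list);
	return 0;
}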
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-09 3:57 ` Parav Pandit @ 2015-09-10 16:49 ` Tejun Heo [not found] ` <20150910164946.GH8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 2015-09-10 17:48 ` Hefty, Sean 0 siblings, 2 replies; 60+ messages in thread From: Tejun Heo @ 2015-09-10 16:49 UTC (permalink / raw) To: Parav Pandit Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module Hello, Parav. On Wed, Sep 09, 2015 at 09:27:40AM +0530, Parav Pandit wrote: > This is one old white paper, but most of the reasoning still holds true on RDMA. > http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf Just read it. Much appreciated. ... > These resources include are- QP (queue pair) to transfer data, CQ > (Completion queue) to indicate completion of data transfer operation, > MR (memory region) to represent user application memory as source or > destination for data transfer. > Common resources are QP, SRQ (shared received queue), CQ, MR, AH > (Address handle), FLOW, PD (protection domain), user context etc. It's kinda bothering that all these are disparate resources. I suppose that each restriction comes from the underlying hardware and there's no accepted higher level abstraction for these things? > >> This patch-set allows limiting rdma resources to set of processes. > >> It extend device cgroup controller for limiting rdma device limits. > > > > I don't think this belongs to devcg. If these make sense as a set of > > resources to be controlled via cgroup, the right way prolly would be a > > separate controller. > > > > In past there has been similar comment to have dedicated cgroup > controller for RDMA instead of merging with device cgroup. > I am ok with both the approach, however I prefer to utilize device > controller instead of spinning of new controller for new devices > category. > I anticipate more such need would arise and for new device category, > it might not be worth to have new cgroup controller. > RapidIO though very less popular and upcoming PCIe are on horizon to > offer similar benefits as that of RDMA and in future having one > controller for each of them again would not be right approach. > > I certainly seek your and others inputs in this email thread here whether > (a) to continue to extend device cgroup (which support character, > block devices white list) and now RDMA devices > or > (b) to spin of new controller, if so what are the compelling reasons > that it can provide compare to extension. I'm doubtful that these things are gonna be mainstream w/o building up higher level abstractions on top and if we ever get there we won't be talking about MR or CQ or whatever. Also, whatever next-gen is unlikely to have enough commonalities when the proposed resource knobs are this low level, so let's please keep it separate, so that if/when this goes out of fashion for one reason or another, the controller can silently wither away too. > Current scope of the patch is limited to RDMA resources as first > patch, but for fact I am sure that there are more functionality in > pipe to support via this cgroup by me and others. > So keeping atleast these two aspects in mind, I need input on > direction of dedicated controller or new one. > > In future, I anticipate that we might have sub directory to device > cgroup for individual device class to control. 
> such as, > <sys/fs/cgroup/devices/ > /char > /block > /rdma > /pcie > /child_cgroup..1..N > Each controllers cgroup access files would remain within their own > scope. We are not there yet from base infrastructure but something to > be done as it matures and users start using it. I don't think that jives with the rest of cgroup and what generic block or pcie attributes are directly exposed to applications and need to be hierarchically controlled via cgroup? Thanks. -- tejun ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <20150910164946.GH8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <20150910164946.GH8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2015-09-10 17:46 ` Parav Pandit 2015-09-10 20:22 ` Tejun Heo 0 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-10 17:46 UTC (permalink / raw) To: Tejun Heo Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA On Thu, Sep 10, 2015 at 10:19 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > Hello, Parav. > > On Wed, Sep 09, 2015 at 09:27:40AM +0530, Parav Pandit wrote: >> This is one old white paper, but most of the reasoning still holds true on RDMA. >> http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf > > Just read it. Much appreciated. > > ... >> These resources include are- QP (queue pair) to transfer data, CQ >> (Completion queue) to indicate completion of data transfer operation, >> MR (memory region) to represent user application memory as source or >> destination for data transfer. >> Common resources are QP, SRQ (shared received queue), CQ, MR, AH >> (Address handle), FLOW, PD (protection domain), user context etc. > > It's kinda bothering that all these are disparate resources. Actually not. They are linked resources. Every QP needs associated one or two CQ, one PD. Every QP will use few MRs for data transfer. Here is the good programming guide of the RDMA APIs exposed to the user space application. http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf So first version of the cgroups patch will address the control operation for section 3.4. > I suppose that each restriction comes from the underlying hardware and > there's no accepted higher level abstraction for these things? > There is higher level abstraction which is through the verbs layer currently which does actually expose the hardware resource but in vendor agnostic way. There are many vendors who support these verbs layer, some of them which I know are Mellanox, Intel, Chelsio, Avago/Emulex whose drivers which support these verbs are in <drivers/infiniband/hw/> kernel tree. There is higher level APIs above the verb layer, such as MPI, libfabric, rsocket, rds, pgas, dapl which uses underlying verbs layer. They all rely on the hardware resource. All of these higher level abstraction is accepted and well used by certain application class. It would be long discussion to go over them here. >> >> This patch-set allows limiting rdma resources to set of processes. >> >> It extend device cgroup controller for limiting rdma device limits. >> > >> > I don't think this belongs to devcg. If these make sense as a set of >> > resources to be controlled via cgroup, the right way prolly would be a >> > separate controller. >> > >> >> In past there has been similar comment to have dedicated cgroup >> controller for RDMA instead of merging with device cgroup. >> I am ok with both the approach, however I prefer to utilize device >> controller instead of spinning of new controller for new devices >> category. >> I anticipate more such need would arise and for new device category, >> it might not be worth to have new cgroup controller. 
>> RapidIO though very less popular and upcoming PCIe are on horizon to >> offer similar benefits as that of RDMA and in future having one >> controller for each of them again would not be right approach. >> >> I certainly seek your and others inputs in this email thread here whether >> (a) to continue to extend device cgroup (which support character, >> block devices white list) and now RDMA devices >> or >> (b) to spin of new controller, if so what are the compelling reasons >> that it can provide compare to extension. > > I'm doubtful that these things are gonna be mainstream w/o building up > higher level abstractions on top and if we ever get there we won't be > talking about MR or CQ or whatever. Some of the higher level examples I gave above will adapt to resource allocation failure. Some are actually adaptive to few resource allocation failure, they do query resources. But its not completely there yet. Once we have this notion of limited resource in place, abstraction layer would adapt to relatively smaller value of such resource. These higher level abstraction is mainstream. Its shipped at least in Redhat Enterprise Linux. > Also, whatever next-gen is > unlikely to have enough commonalities when the proposed resource knobs > are this low level, I agree that resource won't be common in next-gen other transport whenever they arrive. But with my existing background working on some of those transport, they appear similar in nature and it might seek similar knobs. > so let's please keep it separate, so that if/when > this goes out of fashion for one reason or another, the controller can > silently wither away too. > >> Current scope of the patch is limited to RDMA resources as first >> patch, but for fact I am sure that there are more functionality in >> pipe to support via this cgroup by me and others. >> So keeping atleast these two aspects in mind, I need input on >> direction of dedicated controller or new one. >> >> In future, I anticipate that we might have sub directory to device >> cgroup for individual device class to control. >> such as, >> <sys/fs/cgroup/devices/ >> /char >> /block >> /rdma >> /pcie >> /child_cgroup..1..N >> Each controllers cgroup access files would remain within their own >> scope. We are not there yet from base infrastructure but something to >> be done as it matures and users start using it. > > I don't think that jives with the rest of cgroup and what generic > block or pcie attributes are directly exposed to applications and need > to be hierarchically controlled via cgroup? > I do agree that currently cgroup doesn't have notion of sub cgroup or above hierarchy today. so until than I was considering to implement it under devices cgroup as generic place without the hierarchy shown above. Therefore current interface is at device cgroup level. If you are suggesting to have rdma cgroup as separate entity for near future, its fine with me. Later on when next-gen arrives we might have scope to make rdma cgroup as more generic one. But than it might look like what I described above. In past I have discussions with Liran Liss from Mellanox as well on this topic and we also agreed to have such cgroup controller. He has recent presentation at Linux foundation event indicating to have cgroup for RDMA. Below is the link to it. http://events.linuxfoundation.org/sites/events/files/slides/containing_rdma_final.pdf Slides 1 to 7 and slide 13 will give you more insight to it. 
Liran and I had similar presentation to RDMA audience with less slides in RDMA openfabrics summit in March 2015. I am ok to create separate cgroup for rdma, if community thinks that way. My preference would be still use device cgroup for above extensions unless there are fundamental issues that I am missing. I would let you make the call. Rdma and other is just another type of device with different characteristics than character or block, so one device cgroup with sub functionalities can allow setting knobs. Every device category will have their own set of knobs for resources, ACL, limits, policy. And I think cgroup is certainly better control point than sysfs or spinning of new control infrastructure for this. That said, I would like to hear your and communities view on how they would like to see this shaping up. > Thanks. > > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 60+ messages in thread
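On the point that middleware "does query resources" and adapts: the verbs layer already reports per-device maxima that a library can read before sizing its pools, which is where an additional per-cgroup limit would naturally be taken into account. A minimal sketch using standard libibverbs calls (error handling trimmed):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **list = ibv_get_device_list(NULL);
	struct ibv_device_attr attr;
	struct ibv_context *ctx;

	if (!list || !list[0])
		return 1;
	ctx = ibv_open_device(list[0]);
	if (!ctx || ibv_query_device(ctx, &attr))
		return 1;

	/* Device-wide maxima; a per-cgroup limit would cap usage below these. */
	printf("max_qp=%d max_cq=%d max_mr=%d max_pd=%d max_ah=%d max_srq=%d\n",
	       attr.max_qp, attr.max_cq, attr.max_mr, attr.max_pd,
	       attr.max_ah, attr.max_srq);

	ibv_close_device(ctx);
	ibv_free_device_list(list);
	return 0;
}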
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-10 17:46 ` Parav Pandit @ 2015-09-10 20:22 ` Tejun Heo 2015-09-11 3:39 ` Parav Pandit 0 siblings, 1 reply; 60+ messages in thread From: Tejun Heo @ 2015-09-10 20:22 UTC (permalink / raw) To: Parav Pandit Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module Hello, Parav. On Thu, Sep 10, 2015 at 11:16:49PM +0530, Parav Pandit wrote: > >> These resources include are- QP (queue pair) to transfer data, CQ > >> (Completion queue) to indicate completion of data transfer operation, > >> MR (memory region) to represent user application memory as source or > >> destination for data transfer. > >> Common resources are QP, SRQ (shared received queue), CQ, MR, AH > >> (Address handle), FLOW, PD (protection domain), user context etc. > > > > It's kinda bothering that all these are disparate resources. > > Actually not. They are linked resources. Every QP needs associated one > or two CQ, one PD. > Every QP will use few MRs for data transfer. So, if that's the case, let's please implement something higher level. The goal is providing reasonable isolation or protection. If that can be achieved at a higher level of abstraction, please do that. > Here is the good programming guide of the RDMA APIs exposed to the > user space application. > > http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf > So first version of the cgroups patch will address the control > operation for section 3.4. > > > I suppose that each restriction comes from the underlying hardware and > > there's no accepted higher level abstraction for these things? > > There is higher level abstraction which is through the verbs layer > currently which does actually expose the hardware resource but in > vendor agnostic way. > There are many vendors who support these verbs layer, some of them > which I know are Mellanox, Intel, Chelsio, Avago/Emulex whose drivers > which support these verbs are in <drivers/infiniband/hw/> kernel tree. > > There is higher level APIs above the verb layer, such as MPI, > libfabric, rsocket, rds, pgas, dapl which uses underlying verbs layer. > They all rely on the hardware resource. All of these higher level > abstraction is accepted and well used by certain application class. It > would be long discussion to go over them here. Well, the programming interface that userland builds on top doesn't matter too much here but if there is a common resource abstraction which can be made in terms of constructs that consumers of the facility would care about, that likely is a better choice than exposing whatever hardware exposes. > > I'm doubtful that these things are gonna be mainstream w/o building up > > higher level abstractions on top and if we ever get there we won't be > > talking about MR or CQ or whatever. > > Some of the higher level examples I gave above will adapt to resource > allocation failure. Some are actually adaptive to few resource > allocation failure, they do query resources. But its not completely > there yet. Once we have this notion of limited resource in place, > abstraction layer would adapt to relatively smaller value of such > resource. > > These higher level abstraction is mainstream. Its shipped at least in > Redhat Enterprise Linux. Again, I was talking more about resource abstraction - e.g. 
something along the line of "I want N command buffers". > > Also, whatever next-gen is > > unlikely to have enough commonalities when the proposed resource knobs > > are this low level, > > I agree that resource won't be common in next-gen other transport > whenever they arrive. > But with my existing background working on some of those transport, > they appear similar in nature and it might seek similar knobs. I don't know. What's proposed in this thread seems way too low level to be useful anywhere else. Also, what if there are multiple devices? Is that a problem to worry about? > In past I have discussions with Liran Liss from Mellanox as well on > this topic and we also agreed to have such cgroup controller. > He has recent presentation at Linux foundation event indicating to > have cgroup for RDMA. > Below is the link to it. > http://events.linuxfoundation.org/sites/events/files/slides/containing_rdma_final.pdf > Slides 1 to 7 and slide 13 will give you more insight to it. > Liran and I had similar presentation to RDMA audience with less slides > in RDMA openfabrics summit in March 2015. > > I am ok to create separate cgroup for rdma, if community thinks that way. > My preference would be still use device cgroup for above extensions > unless there are fundamental issues that I am missing. The thing is that they aren't related at all in any way. There's no reason to tie them together. In fact, the way we did devcg is backward. The ideal solution would have been extending the usual ACL to understand cgroups so that it's a natural growth of the permission system. You're talking about actual hardware resources. That has nothing to do with access permissions on device nodes. > I would let you make the call. > Rdma and other is just another type of device with different > characteristics than character or block, so one device cgroup with sub > functionalities can allow setting knobs. > Every device category will have their own set of knobs for resources, > ACL, limits, policy. I'm kinda doubtful we're gonna have too many of these. Hardware details being exposed to userland this directly isn't common. > And I think cgroup is certainly better control point than sysfs or > spinning of new control infrastructure for this. > That said, I would like to hear your and communities view on how they > would like to see this shaping up. I'd say keep it simple and do the minimum. :) Thanks. -- tejun ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-10 20:22 ` Tejun Heo @ 2015-09-11 3:39 ` Parav Pandit [not found] ` <CAG53R5WtuPA=J_GYPzNTAKbjB1r0K90qhXEDxLNf7vxYyxgrKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-11 3:39 UTC (permalink / raw) To: Tejun Heo Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On Fri, Sep 11, 2015 at 1:52 AM, Tejun Heo <tj@kernel.org> wrote: > Hello, Parav. > > On Thu, Sep 10, 2015 at 11:16:49PM +0530, Parav Pandit wrote: >> >> These resources include are- QP (queue pair) to transfer data, CQ >> >> (Completion queue) to indicate completion of data transfer operation, >> >> MR (memory region) to represent user application memory as source or >> >> destination for data transfer. >> >> Common resources are QP, SRQ (shared received queue), CQ, MR, AH >> >> (Address handle), FLOW, PD (protection domain), user context etc. >> > >> > It's kinda bothering that all these are disparate resources. >> >> Actually not. They are linked resources. Every QP needs associated one >> or two CQ, one PD. >> Every QP will use few MRs for data transfer. > > So, if that's the case, let's please implement something higher level. > The goal is providing reasonable isolation or protection. If that can > be achieved at a higher level of abstraction, please do that. > >> Here is the good programming guide of the RDMA APIs exposed to the >> user space application. >> >> http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf >> So first version of the cgroups patch will address the control >> operation for section 3.4. >> >> > I suppose that each restriction comes from the underlying hardware and >> > there's no accepted higher level abstraction for these things? >> >> There is higher level abstraction which is through the verbs layer >> currently which does actually expose the hardware resource but in >> vendor agnostic way. >> There are many vendors who support these verbs layer, some of them >> which I know are Mellanox, Intel, Chelsio, Avago/Emulex whose drivers >> which support these verbs are in <drivers/infiniband/hw/> kernel tree. >> >> There is higher level APIs above the verb layer, such as MPI, >> libfabric, rsocket, rds, pgas, dapl which uses underlying verbs layer. >> They all rely on the hardware resource. All of these higher level >> abstraction is accepted and well used by certain application class. It >> would be long discussion to go over them here. > > Well, the programming interface that userland builds on top doesn't > matter too much here but if there is a common resource abstraction > which can be made in terms of constructs that consumers of the > facility would care about, that likely is a better choice than > exposing whatever hardware exposes. > Tejun, The fact is that user level application uses hardware resources. Verbs layer is software abstraction for it. Drivers are hiding how they implement this QP or CQ or whatever hardware resource they project via API layer. For all of the userland on top of verb layer I mentioned above, the common resource abstraction is these resources AH, QP, CQ, MR etc. Hardware (and driver) might have different view of this resource in their real implementation. 
For example, verb layer can say that it has 100 QPs, but hardware might actually have 20 QPs that driver decide how to efficiently use it. >> > I'm doubtful that these things are gonna be mainstream w/o building up >> > higher level abstractions on top and if we ever get there we won't be >> > talking about MR or CQ or whatever. >> >> Some of the higher level examples I gave above will adapt to resource >> allocation failure. Some are actually adaptive to few resource >> allocation failure, they do query resources. But its not completely >> there yet. Once we have this notion of limited resource in place, >> abstraction layer would adapt to relatively smaller value of such >> resource. >> >> These higher level abstraction is mainstream. Its shipped at least in >> Redhat Enterprise Linux. > > Again, I was talking more about resource abstraction - e.g. something > along the line of "I want N command buffers". > Yes. We are still talking of resource abstraction here. RDMA and IBTA defines these resources. On top of these resources various frameworks are build. so for example, User land is tuning environment deploying for MPI application, it would configure: 10 processes from the PID controller, 10 CPUs in cpuset controller, 1 PD, 20 CQ, 10 QP, 100 MRs in rdma controller, say user land is tuning environment for deploying rsocket application for 100 connections, it would configure, 100 PD, 100 QP, 200 MR. When verb layer see failure with it, they will adapt to live with what they have at lower performance. Since every higher level which I mentioned in different in the way, it uses RDMA resources, we cannot generalize it as "N command buffers". That generalization in my mind is the - rdma resources - central common entity. >> > Also, whatever next-gen is >> > unlikely to have enough commonalities when the proposed resource knobs >> > are this low level, >> >> I agree that resource won't be common in next-gen other transport >> whenever they arrive. >> But with my existing background working on some of those transport, >> they appear similar in nature and it might seek similar knobs. > > I don't know. What's proposed in this thread seems way too low level > to be useful anywhere else. Also, what if there are multiple devices? > Is that a problem to worry about? > o.k. It doesn't have to be useful anywhere else. If it suffice the need of RDMA applications, its fine for near future. This patch allows limiting resources across multiple devices. As we go along the path, and if requirement come up to have knob on per device basis, thats something we can extend in future. > >> I would let you make the call. >> Rdma and other is just another type of device with different >> characteristics than character or block, so one device cgroup with sub >> functionalities can allow setting knobs. >> Every device category will have their own set of knobs for resources, >> ACL, limits, policy. > > I'm kinda doubtful we're gonna have too many of these. Hardware > details being exposed to userland this directly isn't common. > Its common in RDMA applications. Again they may not be real hardware resource, its just API layer which defines those RDMA constructs. >> And I think cgroup is certainly better control point than sysfs or >> spinning of new control infrastructure for this. >> That said, I would like to hear your and communities view on how they >> would like to see this shaping up. > > I'd say keep it simple and do the minimum. :) > o.k. 
In that case, a new rdma cgroup controller which does rdma resource accounting is possibly the simplest form. Does that make sense?

> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 60+ messages in thread
[parent not found: <CAG53R5WtuPA=J_GYPzNTAKbjB1r0K90qhXEDxLNf7vxYyxgrKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <CAG53R5WtuPA=J_GYPzNTAKbjB1r0K90qhXEDxLNf7vxYyxgrKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-09-11 4:04 ` Tejun Heo [not found] ` <20150911040413.GA18850-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> 2015-09-11 4:43 ` Parav Pandit 0 siblings, 2 replies; 60+ messages in thread From: Tejun Heo @ 2015-09-11 4:04 UTC (permalink / raw) To: Parav Pandit Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA Hello, Parav. On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote: > The fact is that user level application uses hardware resources. > Verbs layer is software abstraction for it. Drivers are hiding how > they implement this QP or CQ or whatever hardware resource they > project via API layer. > For all of the userland on top of verb layer I mentioned above, the > common resource abstraction is these resources AH, QP, CQ, MR etc. > Hardware (and driver) might have different view of this resource in > their real implementation. > For example, verb layer can say that it has 100 QPs, but hardware > might actually have 20 QPs that driver decide how to efficiently use > it. My uneducated suspicion is that the abstraction is just not developed enough. It should be possible to virtualize these resources through, most likely, time-sharing to the level where userland simply says "I want this chunk transferred there" and OS schedules the transfer prioritizing competing requests. It could be that given the use cases rdma might not need such level of abstraction - e.g. most users want to be and are pretty close to bare metal, but, if that's true, it also kinda is weird to build hierarchical resource distribution scheme on top of such bare abstraction. ... > > I don't know. What's proposed in this thread seems way too low level > > to be useful anywhere else. Also, what if there are multiple devices? > > Is that a problem to worry about? > > o.k. It doesn't have to be useful anywhere else. If it suffice the > need of RDMA applications, its fine for near future. > This patch allows limiting resources across multiple devices. > As we go along the path, and if requirement come up to have knob on > per device basis, thats something we can extend in future. You kinda have to decide that upfront cuz it gets baked into the interface. > > I'm kinda doubtful we're gonna have too many of these. Hardware > > details being exposed to userland this directly isn't common. > > Its common in RDMA applications. Again they may not be real hardware > resource, its just API layer which defines those RDMA constructs. It's still a very low level of abstraction which pretty much gets decided by what the hardware and driver decide to do. > > I'd say keep it simple and do the minimum. :) > > o.k. In that case new rdma cgroup controller which does rdma resource > accounting is possibly the most simplest form? > Make sense? So, this fits cgroup's purpose to certain level but it feels like we're trying to build too much on top of something which hasn't developed sufficiently. 
I suppose it could be that this is the level of development that rdma is gonna reach and dumb cgroup controller can be useful for some use cases. I don't know, so, yeah, let's keep it simple and avoid doing crazy stuff. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <20150911040413.GA18850-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <20150911040413.GA18850-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org> @ 2015-09-11 4:24 ` Doug Ledford [not found] ` <55F25781.20308-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 60+ messages in thread From: Doug Ledford @ 2015-09-11 4:24 UTC (permalink / raw) To: Tejun Heo, Parav Pandit Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA [-- Attachment #1: Type: text/plain, Size: 6403 bytes --] On 09/11/2015 12:04 AM, Tejun Heo wrote: > Hello, Parav. > > On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote: >> The fact is that user level application uses hardware resources. >> Verbs layer is software abstraction for it. Drivers are hiding how >> they implement this QP or CQ or whatever hardware resource they >> project via API layer. >> For all of the userland on top of verb layer I mentioned above, the >> common resource abstraction is these resources AH, QP, CQ, MR etc. >> Hardware (and driver) might have different view of this resource in >> their real implementation. >> For example, verb layer can say that it has 100 QPs, but hardware >> might actually have 20 QPs that driver decide how to efficiently use >> it. > > My uneducated suspicion is that the abstraction is just not developed > enough. The abstraction is 10+ years old. It has had plenty of time to ferment and something better for the specific use case has not emerged. > It should be possible to virtualize these resources through, > most likely, time-sharing to the level where userland simply says "I > want this chunk transferred there" and OS schedules the transfer > prioritizing competing requests. No. And if you think this, then you miss the *entire* point of RDMA technologies. An analogy that I have used many times in presentations is that, in the networking world, the kernel is both a postman and a copy machine. It receives all incoming packets and must sort them to the right recipient (the postman job) and when the user space application is ready to use the information it must copy it into the user's VM space because it couldn't just put the user's data buffer on the RX buffer list since each buffer might belong to anyone (the copy machine). In the RDMA world, you create a new queue pair, it is often a long lived connection (like a socket), but it belongs now to the app and the app can directly queue both send and receive buffers to the card and on incoming packets the card will be able to know that the packet belongs to a specific queue pair and will immediately go to that apps buffer. You can *not* do this with TCP without moving to complete TCP offload on the card, registration of specific sockets on the card, and then allowing the application to pre-register receive buffers for a specific socket to the card so that incoming data on the wire can go straight to the right place. If you ever get to the point of "OS schedules the transfer" then you might as well throw RDMA out the window because you have totally trashed the benefit it provides. > It could be that given the use cases rdma might not need such level of > abstraction - e.g. 
most users want to be and are pretty close to bare > metal, but, if that's true, it also kinda is weird to build > hierarchical resource distribution scheme on top of such bare > abstraction. Not really. If you are going to have a bare abstraction, this one isn't really a bad one. You have devices. On a device, you allocate protection domains (PDs). If you don't care about cross connection issues, you ignore this and only use one. If you do care, this acts like a process's unique VM space only for RDMA buffers, it is a domain to protect the data of one connection from another. Then you have queue pairs (QPs) which are roughly the equivalent of a socket. Each QP has at least one Completion Queue where you get the events that tell you things have completed (although they often use two, one for send completions and one for receive completions). And then you use some number of memory registrations (MRs) and address handles (AHs) depending on your usage. Since RDMA stands for Remote Direct Memory Access, as you can imagine, giving a remote machine free reign to access all of the physical memory in your machine is a security issue. The MRs help to control what memory the remote host on a specific QP has access to. The AHs control how we actually route packets from ourselves to the remote host. Here's the deal. You might be able to create an abstraction above this that hides *some* of this. But it can't hide even nearly all of it without loosing significant functionality. The problem here is that you are thinking about RDMA connections like sockets. They aren't. Not even close. They are "how do I allow a remote machine to directly read and write into my machines physical memory in an even remotely close to secure manner?" These resources aren't hardware resources, they are the abstraction resources needed to answer that question. > ... >>> I don't know. What's proposed in this thread seems way too low level >>> to be useful anywhere else. Also, what if there are multiple devices? >>> Is that a problem to worry about? >> >> o.k. It doesn't have to be useful anywhere else. If it suffice the >> need of RDMA applications, its fine for near future. >> This patch allows limiting resources across multiple devices. >> As we go along the path, and if requirement come up to have knob on >> per device basis, thats something we can extend in future. > > You kinda have to decide that upfront cuz it gets baked into the > interface. > >>> I'm kinda doubtful we're gonna have too many of these. Hardware >>> details being exposed to userland this directly isn't common. >> >> Its common in RDMA applications. Again they may not be real hardware >> resource, its just API layer which defines those RDMA constructs. > > It's still a very low level of abstraction which pretty much gets > decided by what the hardware and driver decide to do. > >>> I'd say keep it simple and do the minimum. :) >> >> o.k. In that case new rdma cgroup controller which does rdma resource >> accounting is possibly the most simplest form? >> Make sense? > > So, this fits cgroup's purpose to certain level but it feels like > we're trying to build too much on top of something which hasn't > developed sufficiently. I suppose it could be that this is the level > of development that rdma is gonna reach and dumb cgroup controller can > be useful for some use cases. I don't know, so, yeah, let's keep it > simple and avoid doing crazy stuff. > > Thanks. 
> -- Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> GPG KeyID: 0E572FDD [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 884 bytes --] ^ permalink raw reply [flat|nested] 60+ messages in thread
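To ground the MR part of the description above, this is roughly what that controlled window into local memory looks like at the verbs level. A hedged sketch: it assumes a PD allocated as in the earlier example, and the access flags chosen here are only one possible combination.

#include <stdlib.h>
#include <infiniband/verbs.h>

/*
 * Register one buffer for remote access. The MR bounds what a peer on
 * this PD's connections may read or write (via the MR's rkey), and it
 * is one of the per-cgroup countable objects discussed in this thread.
 */
static struct ibv_mr *register_rdma_buffer(struct ibv_pd *pd, size_t len)
{
	struct ibv_mr *mr;
	void *buf;

	if (posix_memalign(&buf, 4096, len))
		return NULL;

	mr = ibv_reg_mr(pd, buf, len,
			IBV_ACCESS_LOCAL_WRITE |
			IBV_ACCESS_REMOTE_READ |
			IBV_ACCESS_REMOTE_WRITE);
	if (!mr)
		free(buf);	/* registration refused, e.g. limit reached */
	return mr;
}

The remote side then needs the buffer address and the MR's rkey to issue RDMA reads or writes into exactly that window and nothing else.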
[parent not found: <55F25781.20308-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <55F25781.20308-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-09-11 14:52 ` Tejun Heo 2015-09-11 16:26 ` Parav Pandit ` (2 more replies) 0 siblings, 3 replies; 60+ messages in thread From: Tejun Heo @ 2015-09-11 14:52 UTC (permalink / raw) To: Doug Ledford Cc: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA Hello, Doug. On Fri, Sep 11, 2015 at 12:24:33AM -0400, Doug Ledford wrote: > > My uneducated suspicion is that the abstraction is just not developed > > enough. > > The abstraction is 10+ years old. It has had plenty of time to ferment > and something better for the specific use case has not emerged. I think that is likely more reflective of the use cases rather than anything inherent in the concept. > > It should be possible to virtualize these resources through, > > most likely, time-sharing to the level where userland simply says "I > > want this chunk transferred there" and OS schedules the transfer > > prioritizing competing requests. > > No. And if you think this, then you miss the *entire* point of RDMA > technologies. An analogy that I have used many times in presentations > is that, in the networking world, the kernel is both a postman and a > copy machine. It receives all incoming packets and must sort them to > the right recipient (the postman job) and when the user space > application is ready to use the information it must copy it into the > user's VM space because it couldn't just put the user's data buffer on > the RX buffer list since each buffer might belong to anyone (the copy > machine). In the RDMA world, you create a new queue pair, it is often a > long lived connection (like a socket), but it belongs now to the app and > the app can directly queue both send and receive buffers to the card and > on incoming packets the card will be able to know that the packet > belongs to a specific queue pair and will immediately go to that apps > buffer. You can *not* do this with TCP without moving to complete TCP > offload on the card, registration of specific sockets on the card, and > then allowing the application to pre-register receive buffers for a > specific socket to the card so that incoming data on the wire can go > straight to the right place. If you ever get to the point of "OS > schedules the transfer" then you might as well throw RDMA out the window > because you have totally trashed the benefit it provides. I don't know. This sounds like classic "this is painful so it must be good" bare metal fantasy. I get that rdma succeeds at bypassing a lot of overhead. That's great but that really isn't exclusive with having more accessible mechanisms built on top. The crux of cost saving is the hardware knowing where the incoming data belongs and putting it there directly. Everything else is there to facilitate that and if you're declaring that it's impossible to build accessible abstractions for that, I can't agree with you. Note that this is not to say that rdma should do that in the operating system. 
As you said, people have been happy with the bare abstraction for a long time and, given relatively specialized use cases, that can be completely fine but please do note that the lack of proper abstraction isn't an inherent feature. It's just easier that way and putting in more effort hasn't been necessary. > > It could be that given the use cases rdma might not need such level of > > abstraction - e.g. most users want to be and are pretty close to bare > > metal, but, if that's true, it also kinda is weird to build > > hierarchical resource distribution scheme on top of such bare > > abstraction. > > Not really. If you are going to have a bare abstraction, this one isn't > really a bad one. You have devices. On a device, you allocate > protection domains (PDs). If you don't care about cross connection > issues, you ignore this and only use one. If you do care, this acts > like a process's unique VM space only for RDMA buffers, it is a domain > to protect the data of one connection from another. Then you have queue > pairs (QPs) which are roughly the equivalent of a socket. Each QP has > at least one Completion Queue where you get the events that tell you > things have completed (although they often use two, one for send > completions and one for receive completions). And then you use some > number of memory registrations (MRs) and address handles (AHs) depending > on your usage. Since RDMA stands for Remote Direct Memory Access, as > you can imagine, giving a remote machine free reign to access all of the > physical memory in your machine is a security issue. The MRs help to > control what memory the remote host on a specific QP has access to. The > AHs control how we actually route packets from ourselves to the remote host. > > Here's the deal. You might be able to create an abstraction above this > that hides *some* of this. But it can't hide even nearly all of it > without loosing significant functionality. The problem here is that you > are thinking about RDMA connections like sockets. They aren't. Not > even close. They are "how do I allow a remote machine to directly read > and write into my machines physical memory in an even remotely close to > secure manner?" These resources aren't hardware resources, they are the > abstraction resources needed to answer that question. So, the existence of resource limitations is fine. That's what we deal with all the time. The problem usually with this sort of interfaces which expose implementation details to users directly is that it severely limits engineering manuevering space. You usually want your users to express their intentions and a mechanism to arbitrate resources to satisfy those intentions (and in a way more graceful than "we can't, maybe try later?"); otherwise, implementing any sort of high level resource distribution scheme becomes painful and usually the only thing possible is preventing runaway disasters - you don't wanna pin unused resource permanently if there actually is contention around it, so usually all you can do with hard limits is overcommiting limits so that it at least prevents disasters. cpuset is a special case but think of cpu, memory or io controllers. Their resource distribution schemes are a lot more developed than what's proposed in this patchset and that's a necessity because nobody wants to cripple their machines for resource control. This is a lot more like the pids controller and that controller's almost sole purpose is preventing runaway workload wrecking the whole machine. 
It's getting rambly but the point is that if the resource being controlled by this controller is actually contended for performance reasons, this sort of hard limiting is inherently unlikely to be very useful. If the resource isn't and the main goal is preventing runaway hogs, it'll be able to do that but is that the goal here? For this to be actually useful for performance contended cases, it'd need higher level abstractions. Thanks. -- tejun ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11 14:52             ` Tejun Heo
@ 2015-09-11 16:26               ` Parav Pandit
  [not found]                     ` <CAG53R5X5z-H15f1FzCFFqao=taYeHyJnXAZT2mPzAHYOkyq-_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  [not found]                 ` <20150911145213.GQ8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  2015-09-11 19:22             ` Hefty, Sean
  2 siblings, 1 reply; 60+ messages in thread
From: Parav Pandit @ 2015-09-11 16:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

> If the resource isn't and the main goal is preventing runaway
> hogs, it'll be able to do that but is that the goal here? For this to
> be actually useful for performance contended cases, it'd need higher
> level abstractions.
>
A resource runaway by one application can leave (a) the kernel and
(b) other applications with no resources at all.
Both problems are targets of this patch set, addressed by accounting
via cgroup.

Performance contention can be resolved by higher level user space,
which will tune the limits.
Threshold and fail counters are on the way in a follow-on patch.

> Thanks.
>
> --
> tejun
^ permalink raw reply	[flat|nested] 60+ messages in thread
[parent not found: <CAG53R5X5z-H15f1FzCFFqao=taYeHyJnXAZT2mPzAHYOkyq-_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <CAG53R5X5z-H15f1FzCFFqao=taYeHyJnXAZT2mPzAHYOkyq-_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-09-11 16:34 ` Tejun Heo [not found] ` <20150911163449.GS8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 60+ messages in thread From: Tejun Heo @ 2015-09-11 16:34 UTC (permalink / raw) To: Parav Pandit Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA Hello, Parav. On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote: > Resource run away by application can lead to (a) kernel and (b) other > applications left out with no resources situation. Yeap, that this controller would be able to prevent to a reasonable extent. > Both the problems are the target of this patch set by accounting via cgroup. > > Performance contention can be resolved with higher level user space, > which will tune it. If individual applications are gonna be allowed to do that, what's to prevent them from jacking up their limits? So, I assume you're thinking of a central authority overseeing distribution and enforcing the policy through cgroups? > Threshold and fail counters are on the way in follow on patch. If you're planning on following what the existing memcg did in this area, it's unlikely to go well. Would you mind sharing what you have on mind in the long term? Where do you see this going? Thanks. -- tejun ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <20150911163449.GS8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <20150911163449.GS8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2015-09-11 16:39 ` Parav Pandit 2015-09-11 19:25 ` Tejun Heo 0 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-11 16:39 UTC (permalink / raw) To: Tejun Heo Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA On Fri, Sep 11, 2015 at 10:04 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > Hello, Parav. > > On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote: >> Resource run away by application can lead to (a) kernel and (b) other >> applications left out with no resources situation. > > Yeap, that this controller would be able to prevent to a reasonable > extent. > >> Both the problems are the target of this patch set by accounting via cgroup. >> >> Performance contention can be resolved with higher level user space, >> which will tune it. > > If individual applications are gonna be allowed to do that, what's to > prevent them from jacking up their limits? I should have been more explicit. I didnt mean the application to control which is allocating it. > So, I assume you're > thinking of a central authority overseeing distribution and enforcing > the policy through cgroups? > Exactly. >> Threshold and fail counters are on the way in follow on patch. > > If you're planning on following what the existing memcg did in this > area, it's unlikely to go well. Would you mind sharing what you have > on mind in the long term? Where do you see this going? > At least current thoughts are: central entity authority monitors fail count and new threashold count. Fail count - as similar to other indicates how many time resource failure occured threshold count - indicates upto what this resource has gone upto in usage. (application might not be able to poll on thousands of such resources entries). So based on fail count and threshold count, it can tune it further. > Thanks. > > -- > tejun -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 60+ messages in thread
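A rough sketch of the tuning loop being described, from the central authority's point of view. The rdma.resource.failcnt and rdma.resource.threshold file names below are hypothetical placeholders for the follow-on counters mentioned here; they are not part of the posted patches.

#include <stdio.h>

/*
 * Hypothetical tuning step for the central authority. The
 * rdma.resource.failcnt and rdma.resource.threshold files are
 * placeholders for the proposed follow-on counters, not existing ABI.
 */
static long read_counter(const char *cgroup, const char *name)
{
	char path[256];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", cgroup, name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

static long next_qp_limit(const char *cgroup, long cur_limit)
{
	long fails = read_counter(cgroup, "rdma.resource.failcnt");
	long peak  = read_counter(cgroup, "rdma.resource.threshold");

	if (fails > 0)
		return cur_limit + cur_limit / 4;	/* starved: grow the limit */
	if (peak >= 0 && peak < cur_limit / 2)
		return cur_limit / 2;			/* hoarding: trim the limit */
	return cur_limit;
}

The adjusted value would then be written back through the same max_qp style file shown in the documentation patch earlier in the thread.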
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-11 16:39 ` Parav Pandit @ 2015-09-11 19:25 ` Tejun Heo [not found] ` <20150911192517.GU8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> 0 siblings, 1 reply; 60+ messages in thread From: Tejun Heo @ 2015-09-11 19:25 UTC (permalink / raw) To: Parav Pandit Cc: Doug Ledford, cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module Hello, Parav. On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote: > > If you're planning on following what the existing memcg did in this > > area, it's unlikely to go well. Would you mind sharing what you have > > on mind in the long term? Where do you see this going? > > At least current thoughts are: central entity authority monitors fail > count and new threashold count. > Fail count - as similar to other indicates how many time resource > failure occured > threshold count - indicates upto what this resource has gone upto in > usage. (application might not be able to poll on thousands of such > resources entries). > So based on fail count and threshold count, it can tune it further. So, regardless of the specific resource in question, implementing adaptive resource distribution requires more than simple thresholds and failcnts. The very minimum would be a way to exert reclaim pressure and then a way to measure how much lack of a given resource is affecting the workload. Maybe it can adaptively lower the limits and then watch how often allocation fails but that's highly unlikely to be an effective measure as it can't do anything to hoarders and the frequency of allocation failure doesn't necessarily correlate with the amount of impact the workload is getting (it's not a measure of usage). This is what I'm awry about. The kernel-userland interface here is cut pretty low in the stack leaving most of arbitration and management logic in the userland, which seems to be what people wanted and that's fine, but then you're trying to implement an intelligent resource control layer which straddles across kernel and userland with those low level primitives which inevitably would increase the required interface surface as nobody has enough information. Just to illustrate the point, please think of the alsa interface. We expose hardware capabilities pretty much as-is leaving management and multiplexing to userland and there's nothing wrong with it. It fits better that way; however, we don't then go try to implement cgroup controller for PCM channels. To do any high-level resource management, you gotta do it where the said resource is actually managed and arbitrated. What's the allocation frequency you're expecting? It might be better to just let allocations themselves go through the agent that you're planning. You sure can use cgroup membership to identify who's asking tho. Given how the whole thing is architectured, I'd suggest thinking more about how the whole thing should turn out eventually. Thanks. -- tejun ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <20150911192517.GU8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  [not found] ` <20150911192517.GU8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2015-09-14 10:18   ` Parav Pandit
  0 siblings, 0 replies; 60+ messages in thread
From: Parav Pandit @ 2015-09-14 10:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On Sat, Sep 12, 2015 at 12:55 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
>> > If you're planning on following what the existing memcg did in this
>> > area, it's unlikely to go well.  Would you mind sharing what you have
>> > on mind in the long term?  Where do you see this going?
>>
>> The current thinking, at least, is that the central authority monitors a
>> fail count and a new threshold count per resource.
>> Fail count - as with other controllers, indicates how many times a
>> resource allocation has failed.
>> Threshold count - indicates the peak usage this resource has reached
>> (an application might not be able to poll thousands of such resource
>> entries itself).
>> Based on the fail count and the threshold count, the central authority
>> can then tune the limits further.
>
> So, regardless of the specific resource in question, implementing
> adaptive resource distribution requires more than simple thresholds
> and failcnts.

Maybe, yes. But it is difficult to work through the whole design and
shape it up right now. This is infrastructure being built with a few
initial capabilities; I see it as a starting point rather than an end
point.

> The very minimum would be a way to exert reclaim
> pressure and then a way to measure how much lack of a given resource
> is affecting the workload.  Maybe it can adaptively lower the limits
> and then watch how often allocation fails but that's highly unlikely
> to be an effective measure as it can't do anything to hoarders and the
> frequency of allocation failure doesn't necessarily correlate with the
> amount of impact the workload is getting (it's not a measure of
> usage).

The central authority can always kill the hoarding process(es) that
hold resources without using them. Such processes will eventually get
restarted, but they will not be able to hoard as much, because they
have been on the radar for hoarding and their limits have been reduced.

> This is what I'm awry about.  The kernel-userland interface here is
> cut pretty low in the stack leaving most of arbitration and management
> logic in the userland, which seems to be what people wanted and that's
> fine, but then you're trying to implement an intelligent resource
> control layer which straddles across kernel and userland with those
> low level primitives which inevitably would increase the required
> interface surface as nobody has enough information.
>
We might be able to gather that information as we go along. An
arbitration and management layer outside the kernel (instead of inside
it) has visibility into the multiple systems that make up a single
cluster, with processes spread across cgroups on each of those systems,
whereas logic inside the kernel can only manage the processes of a
single node across its cgroups.

> Just to illustrate the point, please think of the alsa interface.  We
> expose hardware capabilities pretty much as-is leaving management and
> multiplexing to userland and there's nothing wrong with it.  It fits
> better that way; however, we don't then go try to implement cgroup
> controller for PCM channels.  To do any high-level resource
> management, you gotta do it where the said resource is actually
> managed and arbitrated.
>
> What's the allocation frequency you're expecting?  It might be better
> to just let allocations themselves go through the agent that you're
> planning.

In that case we might need to build a FUSE-style infrastructure. The
frequency of RDMA resource allocation is certainly much lower than that
of read/write calls.

> You sure can use cgroup membership to identify who's asking
> tho.  Given how the whole thing is architectured, I'd suggest thinking
> more about how the whole thing should turn out eventually.
>
Yes, I agree. At this point it is a software solution that provides
resource isolation in a simple manner, with scope to become adaptive in
the future.

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 60+ messages in thread
[parent not found: <20150911145213.GQ8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  [not found] ` <20150911145213.GQ8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2015-09-11 16:47   ` Parav Pandit
  [not found]     ` <CAG53R5X5o8hJX1VJ00j5Bxuaps3FGCPNss4ey-07Dq+XP8xoBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 60+ messages in thread
From: Parav Pandit @ 2015-09-11 16:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

> cpuset is a special case but think of cpu, memory or io controllers.
> Their resource distribution schemes are a lot more developed than
> what's proposed in this patchset and that's a necessity because nobody
> wants to cripple their machines for resource control.

The IO controller and its applications are mature. When the IO
controller throttles IO, applications cope with it: if an IO takes
longer to complete there is usually no practical way to cancel the
system call, or the application may not even want to cancel the IO, at
least for synchronous requests. So the application simply observes
lower performance when throttled.

At the RDMA level it is really not practical to hold up a resource
creation call for a long time, because reusing an existing resource
after a failed allocation is likely to give better performance. As Doug
explained in his example, many RDMA resources, the way applications use
them, are relatively long lived. So stalling resource creation while
the resource is held by another process will certainly look worse from
an application performance standpoint than returning a failure and
reusing an existing resource, or retrying once a new one becomes
available.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread
[parent not found: <CAG53R5X5o8hJX1VJ00j5Bxuaps3FGCPNss4ey-07Dq+XP8xoBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <CAG53R5X5o8hJX1VJ00j5Bxuaps3FGCPNss4ey-07Dq+XP8xoBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-09-11 19:05 ` Tejun Heo 0 siblings, 0 replies; 60+ messages in thread From: Tejun Heo @ 2015-09-11 19:05 UTC (permalink / raw) To: Parav Pandit Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA Hello, Parav. On Fri, Sep 11, 2015 at 10:17:42PM +0530, Parav Pandit wrote: > IO controller and applications are mature in nature. > When IO controller throttles the IO, applications are pretty mature > where if IO takes longer to complete, there is possibly almost no way > to cancel the system call or rather application might not want to > cancel the IO at least the non asynchronous one. I was more talking about the fact that they allow resources to be consumed when they aren't contended. > So application just notice lower performance than throttled way. > Its really not possible at RDMA level with RDMA resource to hold up > resource creation call for longer time, because reusing existing > resource with failed status can likely to give better performance. > As Doug explained in his example, many RDMA resources as its been used > by applications are relatively long lived. So holding ups resource > creation while its taken by other process will certainly will look bad > on application performance front compare to returning failure and > reusing existing one once its available or once new one is available. I'm not really sold on the idea that this can be used to implement performance based resource distribution. I'll write more about that on the other subthread. Thanks. -- tejun ^ permalink raw reply [flat|nested] 60+ messages in thread
* RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-11 14:52 ` Tejun Heo 2015-09-11 16:26 ` Parav Pandit [not found] ` <20150911145213.GQ8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2015-09-11 19:22 ` Hefty, Sean [not found] ` <1828884A29C6694DAF28B7E6B8A82373A903A586-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org> 2015-09-14 10:15 ` Parav Pandit 2 siblings, 2 replies; 60+ messages in thread From: Hefty, Sean @ 2015-09-11 19:22 UTC (permalink / raw) To: Tejun Heo, Doug Ledford Cc: Parav Pandit, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner, Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com, Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com, akpm@linux-foundation.org, linux-security-module@vger.kernel.org > So, the existence of resource limitations is fine. That's what we > deal with all the time. The problem usually with this sort of > interfaces which expose implementation details to users directly is > that it severely limits engineering manuevering space. You usually > want your users to express their intentions and a mechanism to > arbitrate resources to satisfy those intentions (and in a way more > graceful than "we can't, maybe try later?"); otherwise, implementing > any sort of high level resource distribution scheme becomes painful > and usually the only thing possible is preventing runaway disasters - > you don't wanna pin unused resource permanently if there actually is > contention around it, so usually all you can do with hard limits is > overcommiting limits so that it at least prevents disasters. I agree with Tejun that this proposal is at the wrong level of abstraction. If you look at just trying to limit QPs, it's not clear what that attempts to accomplish. Conceptually, a QP is little more than an addressable endpoint. It may or may not map to HW resources (for Intel NICs it does not). Even when HW resources do back the QP, the hardware is limited by how many QPs can realistically be active at any one time, based on how much caching is available in the NIC. Trying to limit the number of QPs that an app can allocate, therefore, just limits how much of the address space an app can use. There's no clear link between QP limits and HW resource limits, unless you assume a very specific underlying implementation. - Sean ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <1828884A29C6694DAF28B7E6B8A82373A903A586-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <1828884A29C6694DAF28B7E6B8A82373A903A586-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org> @ 2015-09-11 19:43 ` Jason Gunthorpe 2015-09-11 20:06 ` Hefty, Sean 0 siblings, 1 reply; 60+ messages in thread From: Jason Gunthorpe @ 2015-09-11 19:43 UTC (permalink / raw) To: Hefty, Sean Cc: Tejun Heo, Doug Ledford, Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org, serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Fri, Sep 11, 2015 at 07:22:56PM +0000, Hefty, Sean wrote: > Trying to limit the number of QPs that an app can allocate, > therefore, just limits how much of the address space an app can use. > There's no clear link between QP limits and HW resource limits, > unless you assume a very specific underlying implementation. Isn't that the point though? We have several vendors with hardware that does impose hard limits on specific resources. There is no way to avoid that, and ultimately, those exact HW resources need to be limited. If we want to talk about abstraction, then I'd suggest something very general and simple - two limits: '% of the RDMA hardware resource pool' (per device or per ep?) 'bytes of kernel memory for RDMA structures' (all devices) That comfortably covers all the various kinds of hardware we support in a reasonable fashion. Unless there really is a reason why we need to constrain exactly and precisely PD/QP/MR/AH (I can't think of one off hand) The 'RDMA hardware resource pool' is a vendor-driver-device specific thing, with no generic definition beyond something that doesn't fit in the other limit. Jason ^ permalink raw reply [flat|nested] 60+ messages in thread
* RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-11 19:43 ` Jason Gunthorpe @ 2015-09-11 20:06 ` Hefty, Sean 2015-09-14 11:09 ` Parav Pandit 0 siblings, 1 reply; 60+ messages in thread From: Hefty, Sean @ 2015-09-11 20:06 UTC (permalink / raw) To: Jason Gunthorpe Cc: Tejun Heo, Doug Ledford, Parav Pandit, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner, Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com, Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com, akpm@linux-foundation.org, linux-security-module@vger.kernel.org > > Trying to limit the number of QPs that an app can allocate, > > therefore, just limits how much of the address space an app can use. > > There's no clear link between QP limits and HW resource limits, > > unless you assume a very specific underlying implementation. > > Isn't that the point though? We have several vendors with hardware > that does impose hard limits on specific resources. There is no way to > avoid that, and ultimately, those exact HW resources need to be > limited. My point is that limiting the number of QPs that an app can allocate doesn't necessarily mean anything. Is allocating 1000 QPs with 1 entry each better or worse than 1 QP with 10,000 entries? Who knows? > If we want to talk about abstraction, then I'd suggest something very > general and simple - two limits: > '% of the RDMA hardware resource pool' (per device or per ep?) > 'bytes of kernel memory for RDMA structures' (all devices) Yes - this makes more sense to me. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11 20:06           ` Hefty, Sean
@ 2015-09-14 11:09             ` Parav Pandit
  2015-09-14 14:04               ` Parav Pandit
  [not found]               ` <CAG53R5XsMwnLK7L4q1mQx3_wEJNv1qthOr5TsX0o43kRWaiWrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 60+ messages in thread
From: Parav Pandit @ 2015-09-14 11:09 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jason Gunthorpe, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On Sat, Sep 12, 2015 at 1:36 AM, Hefty, Sean <sean.hefty@intel.com> wrote:
>> > Trying to limit the number of QPs that an app can allocate,
>> > therefore, just limits how much of the address space an app can use.
>> > There's no clear link between QP limits and HW resource limits,
>> > unless you assume a very specific underlying implementation.
>>
>> Isn't that the point though? We have several vendors with hardware
>> that does impose hard limits on specific resources. There is no way to
>> avoid that, and ultimately, those exact HW resources need to be
>> limited.
>
> My point is that limiting the number of QPs that an app can allocate doesn't necessarily mean anything.  Is allocating 1000 QPs with 1 entry each better or worse than 1 QP with 10,000 entries?  Who knows?

I think it does mean something: if they are RDMA RC QPs, it determines
whether you can talk to 1000 nodes or to 1 node in the network. When we
deploy an MPI application, it knows its rank and we know the cluster
size of the deployment, and based on that the resource allocation can
be done. If you meant it from a performance point of view, then
resource count is possibly not the right measure.

Just because we have not defined performance-oriented interfaces in
this patch set today doesn't mean that we won't do it. I could easily
see number_of_messages/sec being added as one such interface in the
future. But that alone won't stop process hoarders from taking away all
the QPs, just as we needed the PID controller.

Now, when it comes to the Intel implementation: with future APIs the
driver layer could be told whether 10 or 100 user QPs should map to
fewer or more hw-QPs (uSNIC), so that the hw-QPs exposed to one cgroup
are isolated from the hw-QPs exposed to another cgroup. If the hardware
implementation doesn't require such isolation, it can simply keep
allocating from a single pool; it is left to the vendor implementation
how to use this information (this API is not present in the patch).

So the cgroup also provides a control point for the vendor layer to
tune its internal resource allocation based on the provided metrics,
which cannot be done by just providing "memory usage by RDMA
structures".

If I compare this with other cgroup knobs, a low-level individual knob
by itself doesn't serve any meaningful purpose either.
Just by defining how much CPU or how much memory to use, one cannot
define the application performance either.
I am not sure the io controller can achieve 10 million IOPS when given
a single CPU and 64KB of memory;
all the knobs need to be set the right way to reach the desired number.

In a similar vein, RDMA resource knobs taken individually are not a
definition of performance; each is just another knob.

>
>> If we want to talk about abstraction, then I'd suggest something very
>> general and simple - two limits:
>>   '% of the RDMA hardware resource pool' (per device or per ep?)
>>   'bytes of kernel memory for RDMA structures' (all devices)
>
> Yes - this makes more sense to me.
>

Sean, Jason,
Help me understand this scheme.

1. How is a % of a resource different from an absolute number? With the
rest of the cgroup subsystems we define absolute numbers in most
places, to my knowledge - such as (a) number of TCP bytes, (b) IOPS of
a block device, (c) CPU cycles, etc.
20% of QPs = 20 QPs when the hardware has 100 QPs.
I prefer to keep the scheme consistent with other resource control
points - i.e. absolute numbers.

2. Bytes of kernel memory for RDMA structures: one vendor's QP might
consume X bytes and another's Y bytes. How does the application know
how much memory to ask for?
An application can allocate 100 QPs of 1 entry each, or 1 QP that is
100 entries deep, as in Sean's example; both might consume almost the
same memory.
An application allocating 100 QPs, while still within its cgroup's
memory limit, leaves other applications without any QP.
I don't see the point of a memory-footprint-based scheme, as memory
limits are already well addressed by the much smarter memory controller
anyway.

I do agree with Tejun and Sean on the point that the abstraction level
for using RDMA has to be different, and that is why libfabric and other
interfaces are emerging, which will take their own time to stabilize
and get integrated.

As long as the pure IB-style RDMA programming model exists - based on
the RDMA resources - I think the control point also has to be on those
resources.
Once a stable abstraction level is on the table (possibly across
fabrics, not just RDMA), then the right resource controller for it can
be implemented.
Even when such an RDMA abstraction layer arrives, as Jason mentioned,
it would in the end consume some hw resources anyway, and those need to
be controlled too.

Jason,
If the hardware vendor defines the resource pool without saying whether
a given resource is a QP or an MR, how would the management/control
point actually decide what should be limited, and to what value?
We would then need an additional user-space library component to decode
that, after which it would have to be abstracted back into QPs or MRs
so the application layer can deal with it in a vendor-agnostic way -
and then it would look similar to what is being proposed here?

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-14 11:09 ` Parav Pandit @ 2015-09-14 14:04 ` Parav Pandit [not found] ` <CAG53R5U7sYnR2w+Wrhh58Ud1HOrKLDCYxZZgK58FyAkJ8exshw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> [not found] ` <CAG53R5XsMwnLK7L4q1mQx3_wEJNv1qthOr5TsX0o43kRWaiWrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-14 14:04 UTC (permalink / raw) To: Hefty, Sean Cc: Jason Gunthorpe, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner, Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com, Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com, akpm@linux-foundation.org, linux-security-module@vger.kernel.org Hi Tejun, I missed to acknowledge your point that we need both - hard limit and soft limit/weight. Current patchset is only based on hard limit. I see that weight would be another helfpul layer in chain that we can implement after this as incremental that makes review, debugging manageable? Parav On Mon, Sep 14, 2015 at 4:39 PM, Parav Pandit <pandit.parav@gmail.com> wrote: > On Sat, Sep 12, 2015 at 1:36 AM, Hefty, Sean <sean.hefty@intel.com> wrote: >>> > Trying to limit the number of QPs that an app can allocate, >>> > therefore, just limits how much of the address space an app can use. >>> > There's no clear link between QP limits and HW resource limits, >>> > unless you assume a very specific underlying implementation. >>> >>> Isn't that the point though? We have several vendors with hardware >>> that does impose hard limits on specific resources. There is no way to >>> avoid that, and ultimately, those exact HW resources need to be >>> limited. >> >> My point is that limiting the number of QPs that an app can allocate doesn't necessarily mean anything. Is allocating 1000 QPs with 1 entry each better or worse than 1 QP with 10,000 entries? Who knows? > > I think it means if its RDMA RC QP, than whether you can talk to 1000 > nodes or 1 node in network. > When we deploy MPI application, it know the rank of the application, > we know the cluster size of the deployment and based on that resource > allocation can be done. > If you meant to say from performance point of view, than resource > count is possibly not the right measure. > > Just because we have not defined those interface for performance today > in this patch set, doesn't mean that we won't do it. > I could easily see a number_of_messages/sec as one interface to be > added in future. > But that won't stop process hoarders to stop taking away all the QPs, > just the way we needed PID controller. > > Now when it comes to Intel implementation, if it driver layer knows > (in future we new APIs) that whether 10 or 100 user QPs should map to > few hw-QPs or more hw-QPs (uSNIC). > so that hw-QP exposed to one cgroup is isolated from hw-QP exposed to > other cgroup. > If hw- implementation doesn't require isolation, it could just > continue from single pool, its left to the vendor implementation on > how to use this information (this API is not present in the patch). > > So cgroup can also provides a control point for vendor layer to tune > internal resource allocation based on provided matrix, which cannot be > done by just providing "memory usage by RDMA structures". > > If I have to compare it with other cgroup knobs, low level individual > knobs by itself, doesn't serve any meaningful purpose either. 
> Just by defined how much CPU to use or how much memory to use, it > cannot define the application performance either. > I am not sure, whether iocontroller can achieve 10 million IOPs by > defining single CPU and 64KB of memory. > all the knobs needs to be set in right way to reach desired number. > > In similar line RDMA resource knobs as individual knobs are not > definition of performance, its just another knob. > >> >>> If we want to talk about abstraction, then I'd suggest something very >>> general and simple - two limits: >>> '% of the RDMA hardware resource pool' (per device or per ep?) >>> 'bytes of kernel memory for RDMA structures' (all devices) >> >> Yes - this makes more sense to me. >> > > Sean, Jason, > Help me to understand this scheme. > > 1. How does the % of resource, is different than absolute number? With > rest of the cgroups systems we define absolute number at most places > to my knowledge. > Such as (a) number_of_tcp_bytes, (b) IOPs of block device, (c) cpu cycles etc. > 20% of QP = 20 QPs when 100 QPs are with hw. > I prefer to keep the resource scheme consistent with other resource > control points - i.e. absolute number. > > 2. bytes of kernel memory for RDMA structures > One QP of one vendor might consume X bytes and other Y bytes. How does > the application knows how much memory to give. > application can allocate 100 QP of each 1 entry deep or 1 QP of 100 > entries deep as in Sean's example. > Both might consume almost same memory. > Application doing 100 QP allocation, still within limit of memory of > cgroup leaves other applications without any QP. > I don't see a point of memory footprint based scheme, as memory limits > are well addressed by more smarter memory controller anyway. > > I do agree with Tejun, Sean on the point that abstraction level has to > be different for using RDMA and thats why libfabrics and other > interfaces are emerging which will take its own time to get stabilize, > integrated. > > Until pure IB style RDMA programming model exist - based on RDMA > resource based scheme, I think control point also has to be on > resources. > Once a stable abstraction level is on table (possibly across fabric > not just RDMA), than a right resource controller can be implemented. > Even when RDMA abstraction layer arrives, as Jason mentioned, at the > end it would consume some hw resource anyway, that needs to be > controlled too. > > Jason, > If the hardware vendor defines the resource pool without saying its > resource QP or MR, how would actually management/control point can > decide what should be controlled to what limit? > We will need additional user space library component to decode than, > after that it needs to be abstracted out as QP or MR so that it can be > deal in vendor agnostic way as application layer. > and than it would look similar to what is being proposed here? ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <CAG53R5U7sYnR2w+Wrhh58Ud1HOrKLDCYxZZgK58FyAkJ8exshw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <CAG53R5U7sYnR2w+Wrhh58Ud1HOrKLDCYxZZgK58FyAkJ8exshw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-09-14 15:21 ` Tejun Heo 0 siblings, 0 replies; 60+ messages in thread From: Tejun Heo @ 2015-09-14 15:21 UTC (permalink / raw) To: Parav Pandit Cc: Hefty, Sean, Jason Gunthorpe, Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org, serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Hello, Parav. On Mon, Sep 14, 2015 at 07:34:09PM +0530, Parav Pandit wrote: > I missed to acknowledge your point that we need both - hard limit and > soft limit/weight. Current patchset is only based on hard limit. > I see that weight would be another helfpul layer in chain that we can > implement after this as incremental that makes review, debugging > manageable? At this point, I'm very unsure that doing this as a cgroup controller is a good direction. From userland interface standpoint, publishing a cgroup controller is a big commitment. It is true that we haven't been doing a good job of gatekeeping or polishing controller interfaces but we're trying hard to change that and what's being proposed in this thread doesn't really seem to be mature enough. It's not even clear what's being identified as resources here are things that the users would actually care about or if it's even possible to implement sensible resource control in the kernel via the proposed resource restrictions. So, I'd suggest going back to the board and figuring out what the actual resources are, their distribution strategies should be and at which layer such strategies can be implemented best. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <CAG53R5XsMwnLK7L4q1mQx3_wEJNv1qthOr5TsX0o43kRWaiWrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  [not found] ` <CAG53R5XsMwnLK7L4q1mQx3_wEJNv1qthOr5TsX0o43kRWaiWrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-09-14 17:28   ` Jason Gunthorpe
  [not found]     ` <20150914172832.GA21652-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2015-09-14 17:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:

> 1. How is a % of a resource different from an absolute number? With the
> rest of the cgroup subsystems we define absolute numbers in most
> places, to my knowledge.

There isn't really much choice if the abstraction is a bundle of all
resources. You can't use an absolute number unless every possible
hardware limited resource is defined, which doesn't seem smart to me
either.

It is not abstract enough, and doesn't match our universe of
hardware very well.

> 2. Bytes of kernel memory for RDMA structures: one vendor's QP might
> consume X bytes and another's Y bytes. How does the application know
> how much memory to ask for?

I don't see this distinction being useful at such a fine granularity
where the control side needs to distinguish between 1 and 2 QPs.

The majority use for control groups has been along with containers to
prevent a container from exhausting resources in a way that impacts
another.

In that use model limiting each container to N MB of kernel memory
makes it straightforward to reason about resource exhaustion in a
multi-tenant environment. We have other controllers that do this,
just more indirectly (ie limiting the number of inotifies, or the
number of fds, indirectly caps kernel memory consumption)

ie Presumably some fairly small limitation like 10MB is enough for
most non-MPI jobs.

> An application allocating 100 QPs, while still within its cgroup's
> memory limit, leaves other applications without any QP.

No, if the HW has a fixed QP pool then it would hit #1 above. Both are
active at once. For example you'd say a container cannot use more than
10% of the device's hardware resources, or more than 10MB of kernel
memory.

If on an mlx card, you probably hit the 10% of QP resources first. If
on a qib card there is no HW QP pool (well, almost, QPNs are always
limited), so you'd hit the memory limit instead.

In either case, we don't want to see a container able to exhaust
either all of kernel memory or all of the HW resources to deny other
containers.

If you have a non-container use case in mind I'd be curious to hear
it..

> I don't see the point of a memory-footprint-based scheme, as memory
> limits are already well addressed by the much smarter memory controller
> anyway.

I don't think #1 is controlled by another controller. This is long
lived kernel-side memory allocation to support RDMA resource
allocation - we certainly have nothing in the rdma layer that is
tracking this stuff.

> If the hardware vendor defines the resource pool without saying whether
> a given resource is a QP or an MR, how would the management/control
> point actually decide what should be limited, and to what value?

In the kernel each HW driver has to be involved to declare what its
hardware resource limits are.

In user space, it is just a simple limiter knob to prevent resource
exhaustion.

UAPI wise, nobody has to care if the limit is actually # of QPs or
something else.

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread
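To make the split described above concrete - the driver declares its device-wide limits, the cgroup core applies a generic share on top - a rough, self-contained sketch follows. Every name in it (rdma_hw_pool, rdma_cg_limits, rdma_cg_try_charge) and the 10%/10MB figures are invented for illustration; nothing like this exists in the posted patches or in any driver.

/*
 * Illustrative sketch only.  A vendor driver would declare its
 * device-wide pools of hardware-limited objects; the cgroup side then
 * only needs two generic knobs: a share of every pool and a kernel
 * memory byte cap.  All names here are made up for the example.
 */
#include <errno.h>
#include <stdio.h>

struct rdma_hw_pool {
	const char *name;	/* vendor specific, e.g. "qp", "mr", "cq" */
	unsigned int total;	/* device-wide hardware limit */
};

struct rdma_cg_limits {
	unsigned int hw_percent;	/* share of every pool this cgroup may use */
	unsigned long kmem_bytes;	/* kernel memory cap for RDMA objects */
};

/* Return 0 if the cgroup may take one more object from this pool. */
static int rdma_cg_try_charge(const struct rdma_hw_pool *pool,
			      const struct rdma_cg_limits *lim,
			      unsigned int cg_used, unsigned long cg_kmem,
			      unsigned long obj_kmem)
{
	unsigned int cap = pool->total * lim->hw_percent / 100;

	if (cg_used + 1 > cap)
		return -EAGAIN;		/* over the hardware-pool share */
	if (cg_kmem + obj_kmem > lim->kmem_bytes)
		return -ENOMEM;		/* over the kernel-memory share */
	return 0;
}

int main(void)
{
	struct rdma_hw_pool qp_pool = { "qp", 16384 };
	struct rdma_cg_limits lim = { 10, 10 << 20 };	/* 10% and 10MB */

	/* A cgroup already holding its full 10% share of QPs is refused. */
	printf("charge -> %d\n",
	       rdma_cg_try_charge(&qp_pool, &lim, 1638, 1 << 20, 4096));
	return 0;
}

The vendor-specific knowledge stays entirely in the driver-declared totals; the UAPI-visible knobs remain the two generic ones discussed in this thread.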
[parent not found: <20150914172832.GA21652-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  [not found] ` <20150914172832.GA21652-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-09-14 18:54   ` Parav Pandit
  2015-09-14 20:18     ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Parav Pandit @ 2015-09-14 18:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 14, 2015 at 10:58 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:
>
>> 1. How is a % of a resource different from an absolute number? With the
>> rest of the cgroup subsystems we define absolute numbers in most
>> places, to my knowledge.
>
> There isn't really much choice if the abstraction is a bundle of all
> resources. You can't use an absolute number unless every possible
> hardware limited resource is defined, which doesn't seem smart to me
> either.

An absolute number or a percentage is a representation of a given
property, and that property needs a definition, doesn't it? How can we
tell a user "you get a certain amount of some undefined resource" when
the user doesn't know what it is they are administering or configuring?
It has to be a quantifiable entity.

> It is not abstract enough, and doesn't match our universe of
> hardware very well.
>
Why does the user need to know the actual hardware resource limits, or
define hardware-based resources? RDMA verbs are the abstraction point.
We could just as well define (a) how many RDMA connections are allowed,
instead of QPs, CQs or AHs, and (b) how many data transfer buffers may
be used. But the fact is that we have so many mid-layers, each using
these resources differently, that such an abstraction does not fit the
bill. We do know how the mid-layers operate and how they use the RDMA
resources, so if we deploy an MPI application on a given cluster of
containers, we can configure the RDMA resources accurately, can't we?

Another example: if we want only 50% of the resources to be given to
all containers and the remaining 50% to kernel consumers such as NFS,
all the containers can reside in a single rdma cgroup constrained to
those limits.

>> 2. Bytes of kernel memory for RDMA structures: one vendor's QP might
>> consume X bytes and another's Y bytes. How does the application know
>> how much memory to ask for?
>
> I don't see this distinction being useful at such a fine granularity
> where the control side needs to distinguish between 1 and 2 QPs.
>
> The majority use for control groups has been along with containers to
> prevent a container from exhausting resources in a way that impacts
> another.
>
Right. That's the intention.

> In that use model limiting each container to N MB of kernel memory
> makes it straightforward to reason about resource exhaustion in a
> multi-tenant environment. We have other controllers that do this,
> just more indirectly (ie limiting the number of inotifies, or the
> number of fds, indirectly caps kernel memory consumption)
>
> ie Presumably some fairly small limitation like 10MB is enough for
> most non-MPI jobs.

A container application can always write a simple for loop that takes
away the majority of the QPs while staying within a 10MB limit.

>> An application allocating 100 QPs, while still within its cgroup's
>> memory limit, leaves other applications without any QP.
>
> No, if the HW has a fixed QP pool then it would hit #1 above. Both are
> active at once. For example you'd say a container cannot use more than
> 10% of the device's hardware resources, or more than 10MB of kernel
> memory.
>
Right, but we need to define this resource pool, right? Why can't it be
the verbs abstraction? How many resources are really used to implement
the verbs layer is left to the hardware vendor in any case. An abstract
pool just adds confusion instead of clarity. Imagine if, instead of
tcp_bytes or kmem bytes, the knob were "some memory resource" - how
would anyone debug or tune a system with such abstract knobs?

> If on an mlx card, you probably hit the 10% of QP resources first. If
> on a qib card there is no HW QP pool (well, almost, QPNs are always
> limited), so you'd hit the memory limit instead.
>
> In either case, we don't want to see a container able to exhaust
> either all of kernel memory or all of the HW resources to deny other
> containers.
>
> If you have a non-container use case in mind I'd be curious to hear
> it..

Containers are the prime case, but the non-container use case is
equally important. Today an application, being a first-class citizen,
can take up all the resources, and an NFS mount will then fail. So even
without containers we should be able to restrict the resources
available to user-mode applications.

>> I don't see the point of a memory-footprint-based scheme, as memory
>> limits are already well addressed by the much smarter memory controller
>> anyway.
>
> I don't think #1 is controlled by another controller. This is long
> lived kernel-side memory allocation to support RDMA resource
> allocation - we certainly have nothing in the rdma layer that is
> tracking this stuff.
>
Some drivers mmap() kernel memory into user space, while others
allocate user-space pages and map them to the device. Tracking all of
that would mean intrusive changes spreading down through the vendor
drivers or the IB layer, which may not be the right way to track it.
Memory allocation tracking, I believe, should be left to memcg.

>> If the hardware vendor defines the resource pool without saying whether
>> a given resource is a QP or an MR, how would the management/control
>> point actually decide what should be limited, and to what value?
>
> In the kernel each HW driver has to be involved to declare what its
> hardware resource limits are.
>
> In user space, it is just a simple limiter knob to prevent resource
> exhaustion.
>
> UAPI wise, nobody has to care if the limit is actually # of QPs or
> something else.
>
If we don't care which resource it is, we cannot tune or limit it. The
number of MRs used by MPI vs rsockets vs Accelio is very different.

> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 60+ messages in thread
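The "simple for loop" mentioned in the message above needs nothing more than libibverbs, which is why device-wide exhaustion is so easy today. A minimal sketch (most error handling omitted) might look like the following; it simply creates RC QPs until the device - or a future cgroup limit - refuses.

/*
 * Sketch of a QP-hoarding loop using plain libibverbs.  Each QP is as
 * small as possible, so the kernel memory consumed stays tiny while
 * the device-wide QP pool is drained.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx;
	struct ibv_pd *pd;
	struct ibv_cq *cq;
	struct ibv_qp_init_attr attr = {
		.cap = { .max_send_wr = 1, .max_recv_wr = 1,
			 .max_send_sge = 1, .max_recv_sge = 1 },
		.qp_type = IBV_QPT_RC,
	};
	unsigned long n = 0;

	if (!devs || !devs[0])
		return 1;
	ctx = ibv_open_device(devs[0]);
	if (!ctx)
		return 1;
	pd = ibv_alloc_pd(ctx);
	cq = ibv_create_cq(ctx, 1, NULL, NULL, 0);
	if (!pd || !cq)
		return 1;
	attr.send_cq = cq;
	attr.recv_cq = cq;

	/* Keep allocating QPs until the device (or a cgroup) says no. */
	while (ibv_create_qp(pd, &attr))
		n++;
	printf("allocated %lu QPs before hitting the limit\n", n);
	return 0;
}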
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-14 18:54 ` Parav Pandit @ 2015-09-14 20:18 ` Jason Gunthorpe 2015-09-15 3:08 ` Parav Pandit 0 siblings, 1 reply; 60+ messages in thread From: Jason Gunthorpe @ 2015-09-14 20:18 UTC (permalink / raw) To: Parav Pandit Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner, Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com, Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com, akpm@linux-foundation.org, linux-security-module@vger.kernel.org On Tue, Sep 15, 2015 at 12:24:41AM +0530, Parav Pandit wrote: > On Mon, Sep 14, 2015 at 10:58 PM, Jason Gunthorpe > <jgunthorpe@obsidianresearch.com> wrote: > > On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote: > > > >> 1. How does the % of resource, is different than absolute number? With > >> rest of the cgroups systems we define absolute number at most places > >> to my knowledge. > > > > There isn't really much choice if the abstraction is a bundle of all > > resources. You can't use an absolute number unless every possible > > hardware limited resource is defined, which doesn't seem smart to me > > either. > > Absolute number of percentage is representation for a given property. > That property needs definition. Isn't it? > How do we say that "Some undefined" resource you give certain amount, > which user doesn't know about what to administer, or configure. > It has to be quantifiable entity. Each vendor can quantify exactly what HW resources their implementation has and how the above limit impacts their card. There will be many variations, and IIRC, some vendors have resource pools not directly related to the standard PD/QP/MR/CQ/AH verbs resources. > > It is not abstract enough, and doesn't match our universe of > > hardware very well. > Why does the user need to know the actual hardware resource limits or > define hardware based resource. Because actual hardware resources *ARE* the limit. We cannot abstract it away. The hardware/driver has real, fixed, immutable limits. No API abstraction can possibly change that. The limits are such there *IS NO* API boundary that can bundle them into something simpler. There will always be apps that require wildly different ratios of the basic verbs resources (PD/QP/CQ/AH/MR) Either we control each and every vendor's limited resource directly (which is where you started), or we just roll them up into a 'all resource' bundle and control them indirectly. There just isn't a mythical third 'better API' choice with the hardware we have today. > (a) how many number of RDMA connections are allowed instead of QP, or CQ or AH. > (b) how many data transfer buffers to use. None of that accurately reflects what the real HW limits actually are. > > ie Presumably some fairly small limitation like 10MB is enough for > > most non-MPI jobs. > > Container application always write a simple for loop code to take away > majority of QP with 10MB limit. No, the HW and kmem limits must work together, the HW limit would prevent exhaustion outside the container. > Imagine instead of tcp_bytes or kmem bytes, its "some memory > resource", how would someone debug/tune a system with abstract knobs? Well, we have the memcg controller that does track kmem. The subsystem specific kmem limit is to force fair sharing of the limited kmem resource within the overall memcg limit. They are complementary. 
A fictional rdma_kmem and tcp_kmem would serve very similar purposes.

> > UAPI wise, nobody has to care if the limit is actually # of QPs or
> > something else.
> If we don't care which resource it is, we cannot tune or limit it. The
> number of MRs used by MPI vs rsockets vs Accelio is very different.

So? I don't think it is really important to have an exact, precise,
limit. The HW pools are pretty big; unless you plan to run tens of
thousands of containers each with tiny RDMA limits, it is fine to talk
in broader terms (ie 10% of all HW limited resources), which is totally
adequate to hard-prevent run away or exhaustion scenarios.

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-14 20:18             ` Jason Gunthorpe
@ 2015-09-15  3:08               ` Parav Pandit
  [not found]                 ` <CAG53R5XY1q+AqJvgtK_Qd4Sai2kZX9vhDKD_2dNXpw4Gf=nz0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 60+ messages in thread
From: Parav Pandit @ 2015-09-15 3:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

> Because actual hardware resources *ARE* the limit. We cannot abstract
> it away. The hardware/driver has real, fixed, immutable limits. No API
> abstraction can possibly change that.
>
> The limits are such there *IS NO* API boundary that can bundle them
> into something simpler. There will always be apps that require wildly
> different ratios of the basic verbs resources (PD/QP/CQ/AH/MR)
>
> Either we control each and every vendor's limited resource directly
> (which is where you started), or we just roll them up into a 'all
> resource' bundle and control them indirectly. There just isn't a
> mythical third 'better API' choice with the hardware we have today.
>
As you precisely described with the wildly different ratios: we would
be asking the vendor driver (the bottom-most layer) to statically
define what the resource pool is, without telling it which applications
are going to run on top of that pool.
Therefore the vendor layer can never define the "right" resource pool.
If we try to fix that by defining the "right" resource pool, we will
have to come up with APIs to modify/tune the individual elements of the
pool. Once we bring in that complexity, it becomes what is proposed in
this patch set.

Instead of bringing in such a complex solution, which affects all the
layers yet solves the same problem as this patch, it is better to keep
the definition of the "bundle" in the user library or the application
deployment engine, where a bundle is a set of those resources.
Maybe instead of having individual files for each resource at the user
interface level, we can have an rdma.bundle file; this bundle cgroup
file would define the resources, e.g. "ah 100 mr 100 qp 10".

> So? I don't think it is really important to have an exact, precise,
> limit. The HW pools are pretty big; unless you plan to run tens of
> thousands of containers each with tiny RDMA limits, it is fine to talk
> in broader terms (ie 10% of all HW limited resources), which is totally
> adequate to hard-prevent run away or exhaustion scenarios.
>
An rdma cgroup will allow us to run past 512 or 1024 containers without
using PCIe SR-IOV and without creating any vendor-specific resource
pools.

> Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread
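If the rdma.bundle idea floated above were pursued, the format itself would be trivial to handle; the user-space model below parses the proposed "ah 100 mr 100 qp 10" string just to show how compact such an interface could be. The file name and syntax are only the proposal made in this thread, not an existing kernel interface.

/*
 * Toy parser for the proposed "rdma.bundle" syntax, e.g. "ah 100 mr 100 qp 10".
 * Purely illustrative - this interface does not exist in the posted patches.
 */
#include <stdio.h>
#include <string.h>

struct bundle_limit {
	char name[8];
	unsigned long limit;
};

static int parse_bundle(const char *buf, struct bundle_limit *out, int max)
{
	char copy[256], *tok, *save;
	int n = 0;

	snprintf(copy, sizeof(copy), "%s", buf);
	for (tok = strtok_r(copy, " \n", &save); tok && n < max;
	     tok = strtok_r(NULL, " \n", &save)) {
		char *val = strtok_r(NULL, " \n", &save);

		if (!val || sscanf(val, "%lu", &out[n].limit) != 1)
			return -1;	/* malformed name/value pair */
		snprintf(out[n].name, sizeof(out[n].name), "%s", tok);
		n++;
	}
	return n;
}

int main(void)
{
	struct bundle_limit lim[8];
	int n = parse_bundle("ah 100 mr 100 qp 10", lim, 8);

	for (int i = 0; i < n; i++)
		printf("%s -> %lu\n", lim[i].name, lim[i].limit);
	return 0;
}

Whether such a bundle belongs in the kernel interface at all, or purely in the deployment tooling, is exactly the open question in this sub-thread.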
[parent not found: <CAG53R5XY1q+AqJvgtK_Qd4Sai2kZX9vhDKD_2dNXpw4Gf=nz0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <CAG53R5XY1q+AqJvgtK_Qd4Sai2kZX9vhDKD_2dNXpw4Gf=nz0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-09-15 3:45 ` Jason Gunthorpe [not found] ` <20150915034549.GA27847-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 60+ messages in thread From: Jason Gunthorpe @ 2015-09-15 3:45 UTC (permalink / raw) To: Parav Pandit Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org, serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Tue, Sep 15, 2015 at 08:38:54AM +0530, Parav Pandit wrote: > As you precisely described, about wild ratio, > we are asking vendor driver (bottom most layer) to statically define > what the resource pool is, without telling him which application are > we going to run to use those pool. > Therefore vendor layer cannot ever define "right" resource pool. No, I'm saying the resource pool is *well defined* and *fixed* by each hardware. The only question is how do we expose the N resource limits, the list of which is totally vendor specific. Yes, using a % scheme fixes the ratios, 1% is going to be a certain number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver configuration. That is the trade off for API simplicity. Yes, this results in some resources being over provisioned. I have no idea if that is usable for the workloads people want to run.. But *there is no middle option*. Either each and every single hardware limited resources has a dedicated per-container limit, or they are *somehow* bundled and the ratios become fixed. If Tejun says we can't have something so emphemeral as a vendor specific list of hardware resource pools - then what choice is left? > Instead of bringing such complex solution, that affecting all the > layers which solves the same problem as this patch, > its better to keep definition of "bundle" in the user > library/application deployment engine. > where bundle is set of those resources. The kernel has to do the restriction, so at some point you are telling the kernel to limit each and every unique resource the HW has, which is back to the original patch set, munging how the data is passed makes no difference to the basic objection, IMHO. > rdma cgroup will allow us to run post 512 or 1024 containers without > using PCIe SR-IOV, without creating any vendor specific resource > pools. If you ignore any vendor specific resource limits then you've just left open a hole, a wayward container can exhaust all others - so what was the point of doing all this work? Jason ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <20150915034549.GA27847-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <20150915034549.GA27847-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> @ 2015-09-16 4:41 ` Parav Pandit 2015-09-20 10:35 ` Haggai Eran 1 sibling, 0 replies; 60+ messages in thread From: Parav Pandit @ 2015-09-16 4:41 UTC (permalink / raw) To: Jason Gunthorpe Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org, serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Hi Jason, Sean, Tejun, I am in process of defining new approach, design based on the feedback given here for new RDMA cgroup from all of you. I have also collected feedback from Liran yesterday and ORNL folks too. Soon I will post the new approach, high level APIs and functionality for review before submitting actual implementation. Regards, Parav Pandit On Tue, Sep 15, 2015 at 9:15 AM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote: > On Tue, Sep 15, 2015 at 08:38:54AM +0530, Parav Pandit wrote: > >> As you precisely described, about wild ratio, >> we are asking vendor driver (bottom most layer) to statically define >> what the resource pool is, without telling him which application are >> we going to run to use those pool. >> Therefore vendor layer cannot ever define "right" resource pool. > > No, I'm saying the resource pool is *well defined* and *fixed* by each > hardware. > > The only question is how do we expose the N resource limits, the list > of which is totally vendor specific. > >> rdma cgroup will allow us to run post 512 or 1024 containers without >> using PCIe SR-IOV, without creating any vendor specific resource >> pools. > > If you ignore any vendor specific resource limits then you've just > left open a hole, a wayward container can exhaust all others - so what > was the point of doing all this work? > > Jason ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <20150915034549.GA27847-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> 2015-09-16 4:41 ` Parav Pandit @ 2015-09-20 10:35 ` Haggai Eran [not found] ` <55FE8C06.8010504-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 60+ messages in thread From: Haggai Eran @ 2015-09-20 10:35 UTC (permalink / raw) To: Jason Gunthorpe, Parav Pandit Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org, serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On 15/09/2015 06:45, Jason Gunthorpe wrote: > No, I'm saying the resource pool is *well defined* and *fixed* by each > hardware. > > The only question is how do we expose the N resource limits, the list > of which is totally vendor specific. I don't see why you say the limits are vendor specific. It is true that different RDMA devices have different implementations and capabilities, but they all use the expose the same set of RDMA objects with their limitations. Whether those limitations come from hardware limitations, from the driver, or just because the address space is limited, they can still be exhausted. > Yes, using a % scheme fixes the ratios, 1% is going to be a certain > number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver > configuration. That is the trade off for API simplicity. > > > Yes, this results in some resources being over provisioned. I agree that such a scheme will be easy to configure, but I don't think it can work well in all situations. Imagine you want to let one container use almost all RC QPs as you want it to connect to the entire cluster through RC. Other containers can still use a single datagram QP to connect to the entire cluster, but they would require many address handles. If you force a fixed ratio of resources given to each container it would be hard to describe such a partitioning. I think it would be better to expose different controls for the different RDMA resources. Regards, Haggai ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <55FE8C06.8010504-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource [not found] ` <55FE8C06.8010504-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-10-28 8:14 ` Parav Pandit 0 siblings, 0 replies; 60+ messages in thread From: Parav Pandit @ 2015-10-28 8:14 UTC (permalink / raw) To: Haggai Eran Cc: Jason Gunthorpe, Hefty, Sean, Tejun Heo, Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org, serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org, linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Hi, I finally got some chance and progress on redesigning rdma cgroup controller for the most use cases that we discussed in this email chain. I am posting RFC and soon code in new email. Parav On Sun, Sep 20, 2015 at 4:05 PM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote: > On 15/09/2015 06:45, Jason Gunthorpe wrote: >> No, I'm saying the resource pool is *well defined* and *fixed* by each >> hardware. >> >> The only question is how do we expose the N resource limits, the list >> of which is totally vendor specific. > > I don't see why you say the limits are vendor specific. It is true that > different RDMA devices have different implementations and capabilities, > but they all use the expose the same set of RDMA objects with their > limitations. Whether those limitations come from hardware limitations, > from the driver, or just because the address space is limited, they can > still be exhausted. > >> Yes, using a % scheme fixes the ratios, 1% is going to be a certain >> number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver >> configuration. That is the trade off for API simplicity. >> >> >> Yes, this results in some resources being over provisioned. > > I agree that such a scheme will be easy to configure, but I don't think > it can work well in all situations. Imagine you want to let one > container use almost all RC QPs as you want it to connect to the entire > cluster through RC. Other containers can still use a single datagram QP > to connect to the entire cluster, but they would require many address > handles. If you force a fixed ratio of resources given to each container > it would be hard to describe such a partitioning. > > I think it would be better to expose different controls for the > different RDMA resources. > > Regards, > Haggai ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
2015-09-11 19:22 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A903A586-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-09-14 10:15 ` Parav Pandit
1 sibling, 0 replies; 60+ messages in thread
From: Parav Pandit @ 2015-09-14 10:15 UTC (permalink / raw)
To: Hefty, Sean
Cc: Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On Sat, Sep 12, 2015 at 12:52 AM, Hefty, Sean <sean.hefty@intel.com> wrote:
>> So, the existence of resource limitations is fine. That's what we
>> deal with all the time. The problem usually with this sort of
>> interfaces which expose implementation details to users directly is
>> that it severely limits engineering manuevering space. You usually
>> want your users to express their intentions and a mechanism to
>> arbitrate resources to satisfy those intentions (and in a way more
>> graceful than "we can't, maybe try later?"); otherwise, implementing
>> any sort of high level resource distribution scheme becomes painful
>> and usually the only thing possible is preventing runaway disasters -
>> you don't wanna pin unused resource permanently if there actually is
>> contention around it, so usually all you can do with hard limits is
>> overcommiting limits so that it at least prevents disasters.
>
> I agree with Tejun that this proposal is at the wrong level of abstraction.
>
> If you look at just trying to limit QPs, it's not clear what that attempts to accomplish. Conceptually, a QP is little more than an addressable endpoint. It may or may not map to HW resources (for Intel NICs it does not). Even when HW resources do back the QP, the hardware is limited by how many QPs can realistically be active at any one time, based on how much caching is available in the NIC.
>

cgroups as it stands today provides resource controls, in an effective
manner, for already defined resources such as cpu cycles, memory in user
and kernel space, tcp bytes, IOPS, etc. Similarly, the RDMA programming
model defines its own set of resources, which are used by applications
that access those resources directly.

What we are debating here is whether RDMA exposing hardware resources is
correct, and therefore whether a cgroup controller is needed or not.
There are two points here.
1. Whether the RDMA programming model, which works on the resources
defined by the IB spec, is the right model or not.
2. Assuming that the programming model is fine (because we have an
actively maintained IB stack in the kernel and adoption of its user space
components in the OS), whether we need to control those resources via
cgroup or not.

Tejun is trying to say that because point 1 doesn't seem to be the right
way to solve the problem, point 2 should not be done, or should be done
at a different level of abstraction.
More questions/comments in the Jason and Sean thread.

Sean,
Even though there is no one-to-one map of verb-QP to hw-QP, in order for
the driver or lower layer to effectively map the right verb-QP to hw-QP,
such a vendor specific layer needs to know how it is going to be used.
Otherwise two applications contending for a QP may not get the right
number of hw-QPs to use.
> Trying to limit the number of QPs that an app can allocate, therefore, just limits how much of the address space an app can use. There's no clear link between QP limits and HW resource limits, unless you assume a very specific underlying implementation. > > - Sean ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
2015-09-11 4:04 ` Tejun Heo
[not found] ` <20150911040413.GA18850-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2015-09-11 4:43 ` Parav Pandit
2015-09-11 15:03 ` Tejun Heo
1 sibling, 1 reply; 60+ messages in thread
From: Parav Pandit @ 2015-09-11 4:43 UTC (permalink / raw)
To: Tejun Heo
Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge,
Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
linux-security-module

On Fri, Sep 11, 2015 at 9:34 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote:
>> The fact is that user level application uses hardware resources.
>> Verbs layer is software abstraction for it. Drivers are hiding how
>> they implement this QP or CQ or whatever hardware resource they
>> project via API layer.
>> For all of the userland on top of verb layer I mentioned above, the
>> common resource abstraction is these resources AH, QP, CQ, MR etc.
>> Hardware (and driver) might have different view of this resource in
>> their real implementation.
>> For example, verb layer can say that it has 100 QPs, but hardware
>> might actually have 20 QPs that driver decide how to efficiently use
>> it.
>
> My uneducated suspicion is that the abstraction is just not developed
> enough. It should be possible to virtualize these resources through,
> most likely, time-sharing to the level where userland simply says "I
> want this chunk transferred there" and OS schedules the transfer
> prioritizing competing requests.

Tejun,
That is such a perfect abstraction to have at the OS level, but I am not
sure how close it can get to bare-metal RDMA.
I have started a discussion on that front as well, as part of the other
thread, but it is certainly a long way to go.
Most users want to enjoy the performance benefit of the bare-metal
interfaces it provides.
The abstraction you mentioned does exist; the only difference is that
instead of the OS being the central entity, the higher level libraries,
drivers and hw together provide it today for the applications.

>
> It could be that given the use cases rdma might not need such level of
> abstraction - e.g. most users want to be and are pretty close to bare
> metal, but, if that's true, it also kinda is weird to build
> hierarchical resource distribution scheme on top of such bare
> abstraction.
>
> ...
>> > I don't know. What's proposed in this thread seems way too low level
>> > to be useful anywhere else. Also, what if there are multiple devices?
>> > Is that a problem to worry about?
>>
>> o.k. It doesn't have to be useful anywhere else. If it suffice the
>> need of RDMA applications, its fine for near future.
>> This patch allows limiting resources across multiple devices.
>> As we go along the path, and if requirement come up to have knob on
>> per device basis, thats something we can extend in future.
>
> You kinda have to decide that upfront cuz it gets baked into the
> interface.

Well, all the interfaces are not yet defined. Except for test and
benchmark utilities, real world applications wouldn't really bother much
about which device they are going through, so I expect that per-device
level control would be nice for very specific applications, but I don't
anticipate that in the first place. If others have a different view, I
would be happy to hear it.
Even if we extend per device control, I would expect per cgroup control at top level without which its uncontrolled access. > >> > I'm kinda doubtful we're gonna have too many of these. Hardware >> > details being exposed to userland this directly isn't common. >> >> Its common in RDMA applications. Again they may not be real hardware >> resource, its just API layer which defines those RDMA constructs. > > It's still a very low level of abstraction which pretty much gets > decided by what the hardware and driver decide to do. > >> > I'd say keep it simple and do the minimum. :) >> >> o.k. In that case new rdma cgroup controller which does rdma resource >> accounting is possibly the most simplest form? >> Make sense? > > So, this fits cgroup's purpose to certain level but it feels like > we're trying to build too much on top of something which hasn't > developed sufficiently. I suppose it could be that this is the level > of development that rdma is gonna reach and dumb cgroup controller can > be useful for some use cases. I don't know, so, yeah, let's keep it > simple and avoid doing crazy stuff. > o.k. thanks. I would wait for some more time to collect more feedback. In absence of that, I will send updated patch V1 which will include, (a) functionality of this patch in new rdma cgroup as you recommended, (b) fixes for comments from Haggai for this patch (c) more fixes which I have done in mean time > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-11 4:43 ` Parav Pandit @ 2015-09-11 15:03 ` Tejun Heo 0 siblings, 0 replies; 60+ messages in thread From: Tejun Heo @ 2015-09-11 15:03 UTC (permalink / raw) To: Parav Pandit Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module Hello, Parav. On Fri, Sep 11, 2015 at 10:13:59AM +0530, Parav Pandit wrote: > > My uneducated suspicion is that the abstraction is just not developed > > enough. It should be possible to virtualize these resources through, > > most likely, time-sharing to the level where userland simply says "I > > want this chunk transferred there" and OS schedules the transfer > > prioritizing competing requests. > > Tejun, > That is such a perfect abstraction to have at OS level, but not sure > how much close it can be to bare metal RDMA it can be. > I have started discussion on that front as well as part of other > thread, but its certainly long way to go. > Most want to enjoy the performance benefit of the bare metal > interfaces it provides. Yeah, sure, I'm not trying to say that rdma needs or should do that. > Such abstraction that you mentioned, exists, the only difference is > instead of its OS as central entity, its the higher level libraries, > drivers and hw together does it today for the applications. But more that having resource control in the OS and actual arbitration higher up in the stack isn't likely to lead to an effective resource distribution scheme. > > You kinda have to decide that upfront cuz it gets baked into the > > interface. > > Well, all the interfaces are not yet defined. Except the test and I meant the cgroup interface. > benchmark utilities, real world applications wouldn't really bother > much about which device are they are going through. Weights can work fine across multiple devices. Hard limits don't. It just doesn't make any sense. Unless you can exclude multiple device scenarios, you'll have to implement per-device limits. Thanks. -- tejun ^ permalink raw reply [flat|nested] 60+ messages in thread
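To make the multi-device concern concrete: with a single global file, a hard
limit cannot say anything about which device the resources come from. A
per-device variant would have to name the device in the interface itself, for
example (purely hypothetical syntax, not part of this patch set):

    # Hypothetical per-device form of the same knob: cap QPs on mlx4_0
    # while leaving a second HCA in the same cgroup unrestricted.
    echo "mlx4_0 100" > devices.rdma.resource.qp.max
    echo "mlx4_1 max" > devices.rdma.resource.qp.max

Whether such a per-device form is needed is exactly the kind of decision that
gets baked into the interface up front.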
* RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource 2015-09-10 16:49 ` Tejun Heo [not found] ` <20150910164946.GH8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org> @ 2015-09-10 17:48 ` Hefty, Sean 1 sibling, 0 replies; 60+ messages in thread From: Hefty, Sean @ 2015-09-10 17:48 UTC (permalink / raw) To: Tejun Heo, Parav Pandit Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com, Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com, akpm@linux-foundation.org, linux-security-module@vger.kernel.org > > In past there has been similar comment to have dedicated cgroup > > controller for RDMA instead of merging with device cgroup. > > I am ok with both the approach, however I prefer to utilize device > > controller instead of spinning of new controller for new devices > > category. > > I anticipate more such need would arise and for new device category, > > it might not be worth to have new cgroup controller. > > RapidIO though very less popular and upcoming PCIe are on horizon to > > offer similar benefits as that of RDMA and in future having one > > controller for each of them again would not be right approach. > > > > I certainly seek your and others inputs in this email thread here > whether > > (a) to continue to extend device cgroup (which support character, > > block devices white list) and now RDMA devices > > or > > (b) to spin of new controller, if so what are the compelling reasons > > that it can provide compare to extension. > > I'm doubtful that these things are gonna be mainstream w/o building up > higher level abstractions on top and if we ever get there we won't be > talking about MR or CQ or whatever. Also, whatever next-gen is > unlikely to have enough commonalities when the proposed resource knobs > are this low level, so let's please keep it separate, so that if/when > this goes out of fashion for one reason or another, the controller can > silently wither away too. As an attempt to abstract the hardware resources only, what these devices are exposing to apps can be viewed as command queues (RDMA QPs and SRQs), notification queues (RDMA CQs and EQs), and space in the device cache and allocated memory (RDMA MRs and AHs, maybe PDs). If one wanted a higher level of abstraction, associations exist between these resources. For example, command queues feed into notification queues. Address handles are required resources to use an unconnected queue pair. - Sean ^ permalink raw reply [flat|nested] 60+ messages in thread
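Those associations show up directly in the order a verbs application
allocates its objects. A minimal user-space sketch with standard libibverbs
calls (error handling omitted; ctx is an opened device context and ah_attr is
assumed to be filled in elsewhere):

    struct ibv_pd *pd = ibv_alloc_pd(ctx);                       /* protection domain  */
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);  /* notification queue */

    struct ibv_qp_init_attr attr = {
            .send_cq = cq,          /* the command queue feeds this CQ */
            .recv_cq = cq,
            .qp_type = IBV_QPT_UD,
            .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);                /* command queue      */

    /* An unconnected (UD) QP needs one address handle per destination. */
    struct ibv_ah *ah = ibv_create_ah(pd, &ah_attr);

Each object created here is exactly one of the resource types the proposed
controller charges.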
* [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup. 2015-09-07 20:38 [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit [not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-09-07 20:38 ` Parav Pandit [not found] ` <1441658303-18081-4-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-09-07 20:38 ` [PATCH 4/7] devcg: Added rdma resource tracker object per task Parav Pandit ` (2 subsequent siblings) 4 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw) To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes, dledford Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel, akpm, linux-security-module, pandit.parav 1. Moved necessary functions and data structures to header file to reuse them at device cgroup white list functionality and for rdma functionality. 2. Added infrastructure to invoke RDMA specific routines for resource configuration, query and during fork handling. 3. Added sysfs interface files for configuring max limit of each rdma resource and one file for querying controllers current resource usage. Signed-off-by: Parav Pandit <pandit.parav@gmail.com> --- include/linux/device_cgroup.h | 53 +++++++++++++++++++ security/device_cgroup.c | 119 +++++++++++++++++++++++++++++------------- 2 files changed, 136 insertions(+), 36 deletions(-) diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h index 8b64221..cdbdd60 100644 --- a/include/linux/device_cgroup.h +++ b/include/linux/device_cgroup.h @@ -1,6 +1,57 @@ +#ifndef _DEVICE_CGROUP +#define _DEVICE_CGROUP + #include <linux/fs.h> +#include <linux/cgroup.h> +#include <linux/device_rdma_cgroup.h> #ifdef CONFIG_CGROUP_DEVICE + +enum devcg_behavior { + DEVCG_DEFAULT_NONE, + DEVCG_DEFAULT_ALLOW, + DEVCG_DEFAULT_DENY, +}; + +/* + * exception list locking rules: + * hold devcgroup_mutex for update/read. + * hold rcu_read_lock() for read. + */ + +struct dev_exception_item { + u32 major, minor; + short type; + short access; + struct list_head list; + struct rcu_head rcu; +}; + +struct dev_cgroup { + struct cgroup_subsys_state css; + struct list_head exceptions; + enum devcg_behavior behavior; + +#ifdef CONFIG_CGROUP_RDMA_RESOURCE + struct devcgroup_rdma rdma; +#endif +}; + +static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s) +{ + return s ? 
container_of(s, struct dev_cgroup, css) : NULL; +} + +static inline struct dev_cgroup *parent_devcgroup(struct dev_cgroup *dev_cg) +{ + return css_to_devcgroup(dev_cg->css.parent); +} + +static inline struct dev_cgroup *task_devcgroup(struct task_struct *task) +{ + return css_to_devcgroup(task_css(task, devices_cgrp_id)); +} + extern int __devcgroup_inode_permission(struct inode *inode, int mask); extern int devcgroup_inode_mknod(int mode, dev_t dev); static inline int devcgroup_inode_permission(struct inode *inode, int mask) @@ -17,3 +68,5 @@ static inline int devcgroup_inode_permission(struct inode *inode, int mask) static inline int devcgroup_inode_mknod(int mode, dev_t dev) { return 0; } #endif + +#endif diff --git a/security/device_cgroup.c b/security/device_cgroup.c index 188c1d2..a0b3239 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -25,42 +25,6 @@ static DEFINE_MUTEX(devcgroup_mutex); -enum devcg_behavior { - DEVCG_DEFAULT_NONE, - DEVCG_DEFAULT_ALLOW, - DEVCG_DEFAULT_DENY, -}; - -/* - * exception list locking rules: - * hold devcgroup_mutex for update/read. - * hold rcu_read_lock() for read. - */ - -struct dev_exception_item { - u32 major, minor; - short type; - short access; - struct list_head list; - struct rcu_head rcu; -}; - -struct dev_cgroup { - struct cgroup_subsys_state css; - struct list_head exceptions; - enum devcg_behavior behavior; -}; - -static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s) -{ - return s ? container_of(s, struct dev_cgroup, css) : NULL; -} - -static inline struct dev_cgroup *task_devcgroup(struct task_struct *task) -{ - return css_to_devcgroup(task_css(task, devices_cgrp_id)); -} - /* * called under devcgroup_mutex */ @@ -223,6 +187,9 @@ devcgroup_css_alloc(struct cgroup_subsys_state *parent_css) INIT_LIST_HEAD(&dev_cgroup->exceptions); dev_cgroup->behavior = DEVCG_DEFAULT_NONE; +#ifdef CONFIG_CGROUP_RDMA_RESOURCE + init_devcgroup_rdma_tracker(dev_cgroup); +#endif return &dev_cgroup->css; } @@ -234,6 +201,25 @@ static void devcgroup_css_free(struct cgroup_subsys_state *css) kfree(dev_cgroup); } +#ifdef CONFIG_CGROUP_RDMA_RESOURCE +static int devcgroup_can_attach(struct cgroup_subsys_state *dst_css, + struct cgroup_taskset *tset) +{ + return devcgroup_rdma_can_attach(dst_css, tset); +} + +static void devcgroup_cancel_attach(struct cgroup_subsys_state *dst_css, + struct cgroup_taskset *tset) +{ + devcgroup_cancel_attach(dst_css, tset); +} + +static void devcgroup_fork(struct task_struct *task, void *priv) +{ + devcgroup_rdma_fork(task, priv); +} +#endif + #define DEVCG_ALLOW 1 #define DEVCG_DENY 2 #define DEVCG_LIST 3 @@ -788,6 +774,62 @@ static struct cftype dev_cgroup_files[] = { .seq_show = devcgroup_seq_show, .private = DEVCG_LIST, }, + +#ifdef CONFIG_CGROUP_RDMA_RESOURCE + { + .name = "rdma.resource.uctx.max", + .write = devcgroup_rdma_set_max_resource, + .seq_show = devcgroup_rdma_get_max_resource, + .private = DEVCG_RDMA_RES_TYPE_UCTX, + }, + { + .name = "rdma.resource.cq.max", + .write = devcgroup_rdma_set_max_resource, + .seq_show = devcgroup_rdma_get_max_resource, + .private = DEVCG_RDMA_RES_TYPE_CQ, + }, + { + .name = "rdma.resource.ah.max", + .write = devcgroup_rdma_set_max_resource, + .seq_show = devcgroup_rdma_get_max_resource, + .private = DEVCG_RDMA_RES_TYPE_AH, + }, + { + .name = "rdma.resource.pd.max", + .write = devcgroup_rdma_set_max_resource, + .seq_show = devcgroup_rdma_get_max_resource, + .private = DEVCG_RDMA_RES_TYPE_PD, + }, + { + .name = "rdma.resource.flow.max", + .write = 
devcgroup_rdma_set_max_resource, + .seq_show = devcgroup_rdma_get_max_resource, + .private = DEVCG_RDMA_RES_TYPE_FLOW, + }, + { + .name = "rdma.resource.srq.max", + .write = devcgroup_rdma_set_max_resource, + .seq_show = devcgroup_rdma_get_max_resource, + .private = DEVCG_RDMA_RES_TYPE_SRQ, + }, + { + .name = "rdma.resource.qp.max", + .write = devcgroup_rdma_set_max_resource, + .seq_show = devcgroup_rdma_get_max_resource, + .private = DEVCG_RDMA_RES_TYPE_QP, + }, + { + .name = "rdma.resource.mr.max", + .write = devcgroup_rdma_set_max_resource, + .seq_show = devcgroup_rdma_get_max_resource, + .private = DEVCG_RDMA_RES_TYPE_MR, + }, + { + .name = "rdma.resource.usage", + .seq_show = devcgroup_rdma_show_usage, + .private = DEVCG_RDMA_LIST_USAGE, + }, +#endif { } /* terminate */ }; @@ -796,6 +838,11 @@ struct cgroup_subsys devices_cgrp_subsys = { .css_free = devcgroup_css_free, .css_online = devcgroup_online, .css_offline = devcgroup_offline, +#ifdef CONFIG_CGROUP_RDMA_RESOURCE + .fork = devcgroup_fork, + .can_attach = devcgroup_can_attach, + .cancel_attach = devcgroup_cancel_attach, +#endif .legacy_cftypes = dev_cgroup_files, }; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 60+ messages in thread
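For reference, once this patch is applied the new files sit next to the
existing devices.allow/deny/list files. A rough usage sketch follows; paths
and limit values are illustrative, the devices controller is assumed to be
mounted at /sys/fs/cgroup/devices, and the file names assume the usual
controller-name prefix in the legacy hierarchy:

    mkdir /sys/fs/cgroup/devices/rdma_app
    echo 10  > /sys/fs/cgroup/devices/rdma_app/devices.rdma.resource.qp.max
    echo 10  > /sys/fs/cgroup/devices/rdma_app/devices.rdma.resource.cq.max
    echo 100 > /sys/fs/cgroup/devices/rdma_app/devices.rdma.resource.mr.max

    echo $$ > /sys/fs/cgroup/devices/rdma_app/tasks    # move the shell in
    cat /sys/fs/cgroup/devices/rdma_app/devices.rdma.resource.usage

Subsequent ibv_create_qp()/ibv_create_cq()/ibv_reg_mr() calls made from this
group would then be charged against these limits by patch 6 of the series.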
[parent not found: <1441658303-18081-4-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup. [not found] ` <1441658303-18081-4-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-09-08 5:31 ` Haggai Eran [not found] ` <55EE72B7.1060304-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 60+ messages in thread From: Haggai Eran @ 2015-09-08 5:31 UTC (permalink / raw) To: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA On 07/09/2015 23:38, Parav Pandit wrote: > diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h > index 8b64221..cdbdd60 100644 > --- a/include/linux/device_cgroup.h > +++ b/include/linux/device_cgroup.h > @@ -1,6 +1,57 @@ > +#ifndef _DEVICE_CGROUP > +#define _DEVICE_CGROUP > + > #include <linux/fs.h> > +#include <linux/cgroup.h> > +#include <linux/device_rdma_cgroup.h> You cannot add this include line before adding the device_rdma_cgroup.h (added in patch 5). You should reorder the patches so that after each patch the kernel builds correctly. I also noticed in patch 2 you add device_rdma_cgroup.o to the Makefile before it was added to the kernel. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <55EE72B7.1060304-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup. [not found] ` <55EE72B7.1060304-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2015-09-08 7:02 ` Parav Pandit 0 siblings, 0 replies; 60+ messages in thread From: Parav Pandit @ 2015-09-08 7:02 UTC (permalink / raw) To: Haggai Eran Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA On Tue, Sep 8, 2015 at 11:01 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote: > On 07/09/2015 23:38, Parav Pandit wrote: >> diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h >> index 8b64221..cdbdd60 100644 >> --- a/include/linux/device_cgroup.h >> +++ b/include/linux/device_cgroup.h >> @@ -1,6 +1,57 @@ >> +#ifndef _DEVICE_CGROUP >> +#define _DEVICE_CGROUP >> + >> #include <linux/fs.h> >> +#include <linux/cgroup.h> >> +#include <linux/device_rdma_cgroup.h> > > You cannot add this include line before adding the device_rdma_cgroup.h > (added in patch 5). You should reorder the patches so that after each > patch the kernel builds correctly. > o.k. got it. I will send V1 with this suggested changes. > I also noticed in patch 2 you add device_rdma_cgroup.o to the Makefile > before it was added to the kernel. > o.k. > Regards, > Haggai ^ permalink raw reply [flat|nested] 60+ messages in thread
* [PATCH 4/7] devcg: Added rdma resource tracker object per task 2015-09-07 20:38 [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit [not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-09-07 20:38 ` [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup Parav Pandit @ 2015-09-07 20:38 ` Parav Pandit [not found] ` <1441658303-18081-5-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-09-07 20:38 ` [PATCH 6/7] devcg: Added support to use RDMA device cgroup Parav Pandit 2015-09-07 20:55 ` [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit 4 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw) To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes, dledford Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel, akpm, linux-security-module, pandit.parav Added RDMA device resource tracking object per task. Added comments to capture usage of task lock by device cgroup for rdma. Signed-off-by: Parav Pandit <pandit.parav@gmail.com> --- include/linux/sched.h | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index ae21f15..a5f79b6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1334,6 +1334,8 @@ union rcu_special { }; struct rcu_node; +struct task_rdma_res_counter; + enum perf_event_task_context { perf_invalid_context = -1, perf_hw_context = 0, @@ -1637,6 +1639,14 @@ struct task_struct { struct css_set __rcu *cgroups; /* cg_list protected by css_set_lock and tsk->alloc_lock */ struct list_head cg_list; + +#ifdef CONFIG_CGROUP_RDMA_RESOURCE + /* RDMA resource accounting counters, allocated only + * when RDMA resources are created by a task. + */ + struct task_rdma_res_counter *rdma_res_counter; +#endif + #endif #ifdef CONFIG_FUTEX struct robust_list_head __user *robust_list; @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p) * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring * subscriptions and synchronises with wait4(). Also used in procfs. Also * pins the final release of task.io_context. Also protects ->cpuset and - * ->cgroup.subsys[]. And ->vfork_done. + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter. * * Nests both inside and outside of read_lock(&tasklist_lock). * It must not be nested with write_lock_irq(&tasklist_lock), -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 60+ messages in thread
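The task_rdma_res_counter structure referenced here is introduced by patch 5
of the series (not shown in this excerpt). As a hedged sketch only, with field
names that are assumptions for illustration rather than the actual definition,
it is essentially a per-task usage table indexed by the resource types added
in patch 3:

    /* Illustration only; the real definition lives in
     * include/linux/device_rdma_cgroup.h (patch 5 of this series). */
    struct task_rdma_res_counter {
            /* one usage count per DEVCG_RDMA_RES_TYPE_* resource */
            u32 usage[DEVCG_RDMA_RES_TYPE_MAX];
    };

Keeping it behind a pointer means tasks that never touch RDMA pay only the
cost of one extra pointer in task_struct, since the counters are allocated
only when a task first creates an RDMA resource.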
[parent not found: <1441658303-18081-5-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task [not found] ` <1441658303-18081-5-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-09-08 5:48 ` Haggai Eran 2015-09-08 7:04 ` Parav Pandit 0 siblings, 1 reply; 60+ messages in thread From: Haggai Eran @ 2015-09-08 5:48 UTC (permalink / raw) To: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA On 07/09/2015 23:38, Parav Pandit wrote: > @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p) > * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring > * subscriptions and synchronises with wait4(). Also used in procfs. Also > * pins the final release of task.io_context. Also protects ->cpuset and > - * ->cgroup.subsys[]. And ->vfork_done. > + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter. s/projtects/protects/ > * > * Nests both inside and outside of read_lock(&tasklist_lock). > * It must not be nested with write_lock_irq(&tasklist_lock), -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task 2015-09-08 5:48 ` Haggai Eran @ 2015-09-08 7:04 ` Parav Pandit [not found] ` <CAG53R5VwLnDUjpOwaD_gZMkRBjyT1Wg_sSPw2gAg9oJkqdn3dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-08 7:04 UTC (permalink / raw) To: Haggai Eran Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran <haggaie@mellanox.com> wrote: > On 07/09/2015 23:38, Parav Pandit wrote: >> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p) >> * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring >> * subscriptions and synchronises with wait4(). Also used in procfs. Also >> * pins the final release of task.io_context. Also protects ->cpuset and >> - * ->cgroup.subsys[]. And ->vfork_done. >> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter. > s/projtects/protects/ >> * >> * Nests both inside and outside of read_lock(&tasklist_lock). >> * It must not be nested with write_lock_irq(&tasklist_lock), > Hi Haggai Eran, Did you miss to put comments or I missed something? Parav ^ permalink raw reply [flat|nested] 60+ messages in thread
[parent not found: <CAG53R5VwLnDUjpOwaD_gZMkRBjyT1Wg_sSPw2gAg9oJkqdn3dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task [not found] ` <CAG53R5VwLnDUjpOwaD_gZMkRBjyT1Wg_sSPw2gAg9oJkqdn3dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-09-08 8:24 ` Haggai Eran 2015-09-08 8:26 ` Parav Pandit 0 siblings, 1 reply; 60+ messages in thread From: Haggai Eran @ 2015-09-08 8:24 UTC (permalink / raw) To: Parav Pandit Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA On 08/09/2015 10:04, Parav Pandit wrote: > On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote: >> On 07/09/2015 23:38, Parav Pandit wrote: >>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p) >>> * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring >>> * subscriptions and synchronises with wait4(). Also used in procfs. Also >>> * pins the final release of task.io_context. Also protects ->cpuset and >>> - * ->cgroup.subsys[]. And ->vfork_done. >>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter. >> s/projtects/protects/ >>> * >>> * Nests both inside and outside of read_lock(&tasklist_lock). >>> * It must not be nested with write_lock_irq(&tasklist_lock), >> > > Hi Haggai Eran, > Did you miss to put comments or I missed something? Yes, I wrote "s/projtects/protects/" to tell you that you have a typo in your comment. You should change the word "projtects" to "protects". Haggai -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task 2015-09-08 8:24 ` Haggai Eran @ 2015-09-08 8:26 ` Parav Pandit 0 siblings, 0 replies; 60+ messages in thread From: Parav Pandit @ 2015-09-08 8:26 UTC (permalink / raw) To: Haggai Eran Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On Tue, Sep 8, 2015 at 1:54 PM, Haggai Eran <haggaie@mellanox.com> wrote: > On 08/09/2015 10:04, Parav Pandit wrote: >> On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran <haggaie@mellanox.com> wrote: >>> On 07/09/2015 23:38, Parav Pandit wrote: >>>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p) >>>> * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring >>>> * subscriptions and synchronises with wait4(). Also used in procfs. Also >>>> * pins the final release of task.io_context. Also protects ->cpuset and >>>> - * ->cgroup.subsys[]. And ->vfork_done. >>>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter. >>> s/projtects/protects/ >>>> * >>>> * Nests both inside and outside of read_lock(&tasklist_lock). >>>> * It must not be nested with write_lock_irq(&tasklist_lock), >>> >> >> Hi Haggai Eran, >> Did you miss to put comments or I missed something? > > Yes, I wrote "s/projtects/protects/" to tell you that you have a typo in > your comment. You should change the word "projtects" to "protects". > > Haggai > ah. ok. Right. Will correct it. ^ permalink raw reply [flat|nested] 60+ messages in thread
* [PATCH 6/7] devcg: Added support to use RDMA device cgroup. 2015-09-07 20:38 [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit ` (2 preceding siblings ...) 2015-09-07 20:38 ` [PATCH 4/7] devcg: Added rdma resource tracker object per task Parav Pandit @ 2015-09-07 20:38 ` Parav Pandit [not found] ` <1441658303-18081-7-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2015-09-07 20:55 ` [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit 4 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw) To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes, dledford Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel, akpm, linux-security-module, pandit.parav RDMA uverbs modules now queries associated device cgroup rdma controller before allocating device resources and uncharge them while freeing rdma device resources. Since fput() sequence can free the resources from the workqueue context (instead of task context which allocated the resource), it passes associated ucontext pointer during uncharge, so that rdma cgroup controller can correctly free the resource of right task and right cgroup. Signed-off-by: Parav Pandit <pandit.parav@gmail.com> --- drivers/infiniband/core/uverbs_cmd.c | 139 +++++++++++++++++++++++++++++----- drivers/infiniband/core/uverbs_main.c | 39 +++++++++- 2 files changed, 156 insertions(+), 22 deletions(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index bbb02ff..c080374 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -37,6 +37,7 @@ #include <linux/fs.h> #include <linux/slab.h> #include <linux/sched.h> +#include <linux/device_rdma_cgroup.h> #include <asm/uaccess.h> @@ -281,6 +282,19 @@ static void put_xrcd_read(struct ib_uobject *uobj) put_uobj_read(uobj); } +static void init_ucontext_lists(struct ib_ucontext *ucontext) +{ + INIT_LIST_HEAD(&ucontext->pd_list); + INIT_LIST_HEAD(&ucontext->mr_list); + INIT_LIST_HEAD(&ucontext->mw_list); + INIT_LIST_HEAD(&ucontext->cq_list); + INIT_LIST_HEAD(&ucontext->qp_list); + INIT_LIST_HEAD(&ucontext->srq_list); + INIT_LIST_HEAD(&ucontext->ah_list); + INIT_LIST_HEAD(&ucontext->xrcd_list); + INIT_LIST_HEAD(&ucontext->rule_list); +} + ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -313,22 +327,18 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file, (unsigned long) cmd.response + sizeof resp, in_len - sizeof cmd, out_len - sizeof resp); + ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_UCTX, 1); + if (ret) + goto err; + ucontext = ibdev->alloc_ucontext(ibdev, &udata); if (IS_ERR(ucontext)) { ret = PTR_ERR(ucontext); - goto err; + goto err_alloc; } ucontext->device = ibdev; - INIT_LIST_HEAD(&ucontext->pd_list); - INIT_LIST_HEAD(&ucontext->mr_list); - INIT_LIST_HEAD(&ucontext->mw_list); - INIT_LIST_HEAD(&ucontext->cq_list); - INIT_LIST_HEAD(&ucontext->qp_list); - INIT_LIST_HEAD(&ucontext->srq_list); - INIT_LIST_HEAD(&ucontext->ah_list); - INIT_LIST_HEAD(&ucontext->xrcd_list); - INIT_LIST_HEAD(&ucontext->rule_list); + init_ucontext_lists(ucontext); rcu_read_lock(); ucontext->tgid = get_task_pid(current->group_leader, PIDTYPE_PID); rcu_read_unlock(); @@ -395,6 +405,8 @@ err_free: put_pid(ucontext->tgid); ibdev->dealloc_ucontext(ucontext); +err_alloc: + devcgroup_rdma_uncharge_resource(NULL, DEVCG_RDMA_RES_TYPE_UCTX, 1); err: 
mutex_unlock(&file->mutex); return ret; @@ -412,15 +424,23 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file, resp->vendor_id = attr->vendor_id; resp->vendor_part_id = attr->vendor_part_id; resp->hw_ver = attr->hw_ver; - resp->max_qp = attr->max_qp; + resp->max_qp = min_t(int, attr->max_qp, + devcgroup_rdma_query_resource_limit( + DEVCG_RDMA_RES_TYPE_QP)); resp->max_qp_wr = attr->max_qp_wr; resp->device_cap_flags = attr->device_cap_flags; resp->max_sge = attr->max_sge; resp->max_sge_rd = attr->max_sge_rd; - resp->max_cq = attr->max_cq; + resp->max_cq = min_t(int, attr->max_cq, + devcgroup_rdma_query_resource_limit( + DEVCG_RDMA_RES_TYPE_CQ)); resp->max_cqe = attr->max_cqe; - resp->max_mr = attr->max_mr; - resp->max_pd = attr->max_pd; + resp->max_mr = min_t(int, attr->max_mr, + devcgroup_rdma_query_resource_limit( + DEVCG_RDMA_RES_TYPE_MR)); + resp->max_pd = min_t(int, attr->max_pd, + devcgroup_rdma_query_resource_limit( + DEVCG_RDMA_RES_TYPE_PD)); resp->max_qp_rd_atom = attr->max_qp_rd_atom; resp->max_ee_rd_atom = attr->max_ee_rd_atom; resp->max_res_rd_atom = attr->max_res_rd_atom; @@ -429,16 +449,22 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file, resp->atomic_cap = attr->atomic_cap; resp->max_ee = attr->max_ee; resp->max_rdd = attr->max_rdd; - resp->max_mw = attr->max_mw; + resp->max_mw = min_t(int, attr->max_mw, + devcgroup_rdma_query_resource_limit( + DEVCG_RDMA_RES_TYPE_MW)); resp->max_raw_ipv6_qp = attr->max_raw_ipv6_qp; resp->max_raw_ethy_qp = attr->max_raw_ethy_qp; resp->max_mcast_grp = attr->max_mcast_grp; resp->max_mcast_qp_attach = attr->max_mcast_qp_attach; resp->max_total_mcast_qp_attach = attr->max_total_mcast_qp_attach; - resp->max_ah = attr->max_ah; + resp->max_ah = min_t(int, attr->max_ah, + devcgroup_rdma_query_resource_limit( + DEVCG_RDMA_RES_TYPE_AH)); resp->max_fmr = attr->max_fmr; resp->max_map_per_fmr = attr->max_map_per_fmr; - resp->max_srq = attr->max_srq; + resp->max_srq = min_t(int, attr->max_srq, + devcgroup_rdma_query_resource_limit( + DEVCG_RDMA_RES_TYPE_SRQ)); resp->max_srq_wr = attr->max_srq_wr; resp->max_srq_sge = attr->max_srq_sge; resp->max_pkeys = attr->max_pkeys; @@ -550,6 +576,12 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file, if (!uobj) return -ENOMEM; + ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_PD, 1); + if (ret) { + kfree(uobj); + return -EPERM; + } + init_uobj(uobj, 0, file->ucontext, &pd_lock_class); down_write(&uobj->mutex); @@ -595,6 +627,9 @@ err_idr: ib_dealloc_pd(pd); err: + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_PD, 1); + put_uobj_write(uobj); return ret; } @@ -623,6 +658,9 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file, if (ret) return ret; + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_PD, 1); + idr_remove_uobj(&ib_uverbs_pd_idr, uobj); mutex_lock(&file->mutex); @@ -987,6 +1025,10 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, } } + ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_MR, 1); + if (ret) + goto err_charge; + mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va, cmd.access_flags, &udata); if (IS_ERR(mr)) { @@ -1033,8 +1075,10 @@ err_copy: err_unreg: ib_dereg_mr(mr); - err_put: + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_MR, 1); +err_charge: put_pd_read(pd); err_free: @@ -1162,6 +1206,9 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file, if (ret) return ret; + devcgroup_rdma_uncharge_resource(file->ucontext, + 
DEVCG_RDMA_RES_TYPE_MR, 1); + idr_remove_uobj(&ib_uverbs_mr_idr, uobj); mutex_lock(&file->mutex); @@ -1379,6 +1426,10 @@ static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file, if (cmd_sz > offsetof(typeof(*cmd), flags) + sizeof(cmd->flags)) attr.flags = cmd->flags; + ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_CQ, 1); + if (ret) + goto err_charge; + cq = file->device->ib_dev->create_cq(file->device->ib_dev, &attr, file->ucontext, uhw); if (IS_ERR(cq)) { @@ -1426,6 +1477,9 @@ err_free: ib_destroy_cq(cq); err_file: + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_CQ, 1); +err_charge: if (ev_file) ib_uverbs_release_ucq(file, ev_file, obj); @@ -1700,6 +1754,9 @@ ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file, if (ret) return ret; + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_CQ, 1); + idr_remove_uobj(&ib_uverbs_cq_idr, uobj); mutex_lock(&file->mutex); @@ -1818,6 +1875,10 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file, INIT_LIST_HEAD(&obj->uevent.event_list); INIT_LIST_HEAD(&obj->mcast_list); + ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_QP, 1); + if (ret) + goto err_put; + if (cmd.qp_type == IB_QPT_XRC_TGT) qp = ib_create_qp(pd, &attr); else @@ -1825,7 +1886,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file, if (IS_ERR(qp)) { ret = PTR_ERR(qp); - goto err_put; + goto err_create; } if (cmd.qp_type != IB_QPT_XRC_TGT) { @@ -1900,6 +1961,9 @@ err_copy: err_destroy: ib_destroy_qp(qp); +err_create: + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_QP, 1); err_put: if (xrcd) put_xrcd_read(xrcd_uobj); @@ -2256,6 +2320,9 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file, if (ret) return ret; + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_QP, 1); + if (obj->uxrcd) atomic_dec(&obj->uxrcd->refcnt); @@ -2665,10 +2732,14 @@ ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file, memset(&attr.dmac, 0, sizeof(attr.dmac)); memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16); + ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_AH, 1); + if (ret) + goto err_put; + ah = ib_create_ah(pd, &attr); if (IS_ERR(ah)) { ret = PTR_ERR(ah); - goto err_put; + goto err_create; } ah->uobject = uobj; @@ -2704,6 +2775,9 @@ err_copy: err_destroy: ib_destroy_ah(ah); +err_create: + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_AH, 1); err_put: put_pd_read(pd); @@ -2737,6 +2811,9 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file, if (ret) return ret; + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_AH, 1); + idr_remove_uobj(&ib_uverbs_ah_idr, uobj); mutex_lock(&file->mutex); @@ -2986,10 +3063,15 @@ int ib_uverbs_ex_create_flow(struct ib_uverbs_file *file, err = -EINVAL; goto err_free; } + + err = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_FLOW, 1); + if (err) + goto err_free; + flow_id = ib_create_flow(qp, flow_attr, IB_FLOW_DOMAIN_USER); if (IS_ERR(flow_id)) { err = PTR_ERR(flow_id); - goto err_free; + goto err_create; } flow_id->qp = qp; flow_id->uobject = uobj; @@ -3023,6 +3105,9 @@ err_copy: idr_remove_uobj(&ib_uverbs_rule_idr, uobj); destroy_flow: ib_destroy_flow(flow_id); +err_create: + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_FLOW, 1); err_free: kfree(flow_attr); err_put: @@ -3064,6 +3149,9 @@ int ib_uverbs_ex_destroy_flow(struct ib_uverbs_file *file, if (!ret) uobj->live = 0; + devcgroup_rdma_uncharge_resource(file->ucontext, + 
DEVCG_RDMA_RES_TYPE_FLOW, 1); + put_uobj_write(uobj); idr_remove_uobj(&ib_uverbs_rule_idr, uobj); @@ -3129,6 +3217,10 @@ static int __uverbs_create_xsrq(struct ib_uverbs_file *file, obj->uevent.events_reported = 0; INIT_LIST_HEAD(&obj->uevent.event_list); + ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_SRQ, 1); + if (ret) + goto err_put_cq; + srq = pd->device->create_srq(pd, &attr, udata); if (IS_ERR(srq)) { ret = PTR_ERR(srq); @@ -3193,6 +3285,8 @@ err_destroy: ib_destroy_srq(srq); err_put: + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_SRQ, 1); put_pd_read(pd); err_put_cq: @@ -3372,6 +3466,9 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file, if (ret) return ret; + devcgroup_rdma_uncharge_resource(file->ucontext, + DEVCG_RDMA_RES_TYPE_SRQ, 1); + if (srq_type == IB_SRQT_XRC) { us = container_of(obj, struct ib_usrq_object, uevent); atomic_dec(&us->uxrcd->refcnt); diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c index f6eef2d..31544d4 100644 --- a/drivers/infiniband/core/uverbs_main.c +++ b/drivers/infiniband/core/uverbs_main.c @@ -45,6 +45,7 @@ #include <linux/cdev.h> #include <linux/anon_inodes.h> #include <linux/slab.h> +#include <linux/device_rdma_cgroup.h> #include <asm/uaccess.h> @@ -200,6 +201,7 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, struct ib_ucontext *context) { struct ib_uobject *uobj, *tmp; + int uobj_cnt = 0, ret; if (!context) return 0; @@ -212,8 +214,12 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, idr_remove_uobj(&ib_uverbs_ah_idr, uobj); ib_destroy_ah(ah); kfree(uobj); + uobj_cnt++; } + devcgroup_rdma_uncharge_resource(context, + DEVCG_RDMA_RES_TYPE_AH, uobj_cnt); + uobj_cnt = 0; /* Remove MWs before QPs, in order to support type 2A MWs. 
*/ list_for_each_entry_safe(uobj, tmp, &context->mw_list, list) { struct ib_mw *mw = uobj->object; @@ -221,16 +227,24 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, idr_remove_uobj(&ib_uverbs_mw_idr, uobj); ib_dealloc_mw(mw); kfree(uobj); + uobj_cnt++; } + devcgroup_rdma_uncharge_resource(context, + DEVCG_RDMA_RES_TYPE_MW, uobj_cnt); + uobj_cnt = 0; list_for_each_entry_safe(uobj, tmp, &context->rule_list, list) { struct ib_flow *flow_id = uobj->object; idr_remove_uobj(&ib_uverbs_rule_idr, uobj); ib_destroy_flow(flow_id); kfree(uobj); + uobj_cnt++; } + devcgroup_rdma_uncharge_resource(context, + DEVCG_RDMA_RES_TYPE_FLOW, uobj_cnt); + uobj_cnt = 0; list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) { struct ib_qp *qp = uobj->object; struct ib_uqp_object *uqp = @@ -245,8 +259,12 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, } ib_uverbs_release_uevent(file, &uqp->uevent); kfree(uqp); + uobj_cnt++; } + devcgroup_rdma_uncharge_resource(context, + DEVCG_RDMA_RES_TYPE_QP, uobj_cnt); + uobj_cnt = 0; list_for_each_entry_safe(uobj, tmp, &context->srq_list, list) { struct ib_srq *srq = uobj->object; struct ib_uevent_object *uevent = @@ -256,8 +274,12 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, ib_destroy_srq(srq); ib_uverbs_release_uevent(file, uevent); kfree(uevent); + uobj_cnt++; } + devcgroup_rdma_uncharge_resource(context, + DEVCG_RDMA_RES_TYPE_SRQ, uobj_cnt); + uobj_cnt = 0; list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) { struct ib_cq *cq = uobj->object; struct ib_uverbs_event_file *ev_file = cq->cq_context; @@ -268,15 +290,22 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, ib_destroy_cq(cq); ib_uverbs_release_ucq(file, ev_file, ucq); kfree(ucq); + uobj_cnt++; } + devcgroup_rdma_uncharge_resource(context, + DEVCG_RDMA_RES_TYPE_CQ, uobj_cnt); + uobj_cnt = 0; list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) { struct ib_mr *mr = uobj->object; idr_remove_uobj(&ib_uverbs_mr_idr, uobj); ib_dereg_mr(mr); kfree(uobj); + uobj_cnt++; } + devcgroup_rdma_uncharge_resource(context, + DEVCG_RDMA_RES_TYPE_MR, uobj_cnt); mutex_lock(&file->device->xrcd_tree_mutex); list_for_each_entry_safe(uobj, tmp, &context->xrcd_list, list) { @@ -290,17 +319,25 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file, } mutex_unlock(&file->device->xrcd_tree_mutex); + uobj_cnt = 0; list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) { struct ib_pd *pd = uobj->object; idr_remove_uobj(&ib_uverbs_pd_idr, uobj); ib_dealloc_pd(pd); kfree(uobj); + uobj_cnt++; } + devcgroup_rdma_uncharge_resource(context, + DEVCG_RDMA_RES_TYPE_PD, uobj_cnt); put_pid(context->tgid); - return context->device->dealloc_ucontext(context); + ret = context->device->dealloc_ucontext(context); + + devcgroup_rdma_uncharge_resource(context, + DEVCG_RDMA_RES_TYPE_UCTX, 1); + return ret; } static void ib_uverbs_release_file(struct kref *ref) -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 60+ messages in thread
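Every verb touched by this patch follows the same discipline; distilled from
the hunks above (simplified, not a literal excerpt), the pattern is charge
first, then allocate, and give the charge back on any failure or on destroy:

    ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_QP, 1);
    if (ret)
            return ret;                     /* over the cgroup limit */

    qp = ib_create_qp(pd, &attr);
    if (IS_ERR(qp)) {
            /* allocation failed after charging: undo the charge */
            devcgroup_rdma_uncharge_resource(file->ucontext,
                                             DEVCG_RDMA_RES_TYPE_QP, 1);
            return PTR_ERR(qp);
    }

    /* ... later, on destroy or ucontext teardown (possibly from a
     * workqueue rather than the allocating task, hence the ucontext
     * argument): */
    devcgroup_rdma_uncharge_resource(file->ucontext, DEVCG_RDMA_RES_TYPE_QP, 1);

The cleanup path in uverbs_main.c batches the uncharge per object type
instead of uncharging one object at a time.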
[parent not found: <1441658303-18081-7-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup. [not found] ` <1441658303-18081-7-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> @ 2015-09-08 8:40 ` Haggai Eran 2015-09-08 10:22 ` Parav Pandit 0 siblings, 1 reply; 60+ messages in thread From: Haggai Eran @ 2015-09-08 8:40 UTC (permalink / raw) To: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA, serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, linux-security-module-u79uwXL29TY76Z2rM5mHXA On 07/09/2015 23:38, Parav Pandit wrote: > +static void init_ucontext_lists(struct ib_ucontext *ucontext) > +{ > + INIT_LIST_HEAD(&ucontext->pd_list); > + INIT_LIST_HEAD(&ucontext->mr_list); > + INIT_LIST_HEAD(&ucontext->mw_list); > + INIT_LIST_HEAD(&ucontext->cq_list); > + INIT_LIST_HEAD(&ucontext->qp_list); > + INIT_LIST_HEAD(&ucontext->srq_list); > + INIT_LIST_HEAD(&ucontext->ah_list); > + INIT_LIST_HEAD(&ucontext->xrcd_list); > + INIT_LIST_HEAD(&ucontext->rule_list); > +} I don't see how this change is related to the patch. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup. 2015-09-08 8:40 ` Haggai Eran @ 2015-09-08 10:22 ` Parav Pandit 2015-09-08 13:40 ` Haggai Eran 0 siblings, 1 reply; 60+ messages in thread From: Parav Pandit @ 2015-09-08 10:22 UTC (permalink / raw) To: Haggai Eran Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On Tue, Sep 8, 2015 at 2:10 PM, Haggai Eran <haggaie@mellanox.com> wrote: > On 07/09/2015 23:38, Parav Pandit wrote: >> +static void init_ucontext_lists(struct ib_ucontext *ucontext) >> +{ >> + INIT_LIST_HEAD(&ucontext->pd_list); >> + INIT_LIST_HEAD(&ucontext->mr_list); >> + INIT_LIST_HEAD(&ucontext->mw_list); >> + INIT_LIST_HEAD(&ucontext->cq_list); >> + INIT_LIST_HEAD(&ucontext->qp_list); >> + INIT_LIST_HEAD(&ucontext->srq_list); >> + INIT_LIST_HEAD(&ucontext->ah_list); >> + INIT_LIST_HEAD(&ucontext->xrcd_list); >> + INIT_LIST_HEAD(&ucontext->rule_list); >> +} > > I don't see how this change is related to the patch. Its not but code which I added makes this function to grow longer, so to keep it to same readability level, I did the cleanup. May be I can send separate patch for cleanup? ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup. 2015-09-08 10:22 ` Parav Pandit @ 2015-09-08 13:40 ` Haggai Eran 0 siblings, 0 replies; 60+ messages in thread From: Haggai Eran @ 2015-09-08 13:40 UTC (permalink / raw) To: Parav Pandit Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris, serge, Or Gerlitz, Matan Barak, raindel, akpm, linux-security-module On 08/09/2015 13:22, Parav Pandit wrote: > On Tue, Sep 8, 2015 at 2:10 PM, Haggai Eran <haggaie@mellanox.com> wrote: >> On 07/09/2015 23:38, Parav Pandit wrote: >>> +static void init_ucontext_lists(struct ib_ucontext *ucontext) >>> +{ >>> + INIT_LIST_HEAD(&ucontext->pd_list); >>> + INIT_LIST_HEAD(&ucontext->mr_list); >>> + INIT_LIST_HEAD(&ucontext->mw_list); >>> + INIT_LIST_HEAD(&ucontext->cq_list); >>> + INIT_LIST_HEAD(&ucontext->qp_list); >>> + INIT_LIST_HEAD(&ucontext->srq_list); >>> + INIT_LIST_HEAD(&ucontext->ah_list); >>> + INIT_LIST_HEAD(&ucontext->xrcd_list); >>> + INIT_LIST_HEAD(&ucontext->rule_list); >>> +} >> >> I don't see how this change is related to the patch. > > Its not but code which I added makes this function to grow longer, so > to keep it to same readability level, I did the cleanup. > May be I can send separate patch for cleanup? Sounds good to me. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-07 20:38 [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit
                   ` (3 preceding siblings ...)
  2015-09-07 20:38 ` [PATCH 6/7] devcg: Added support to use RDMA device cgroup Parav Pandit
@ 2015-09-07 20:55 ` Parav Pandit
  4 siblings, 0 replies; 60+ messages in thread
From: Parav Pandit @ 2015-09-07 20:55 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	Doug Ledford
  Cc: corbet, james.l.morris, serge, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel, akpm, linux-security-module, Parav Pandit

Hi Doug, Tejun,

This patch series is based on the cgroup for-4.3 branch. The linux-rdma
tree will hit a compilation error with it, because that tree is behind
Tejun's for-4.3 branch: the series depends on some of the cgroup
subsystem's fork() functionality, so those cgroup changes need to be
merged into the linux-rdma tree first.

Parav

On Tue, Sep 8, 2015 at 2:08 AM, Parav Pandit <pandit.parav@gmail.com> wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources.
>
> This patch-set allows limiting rdma resources to set of processes.
> It extend device cgroup controller for limiting rdma device limits.
>
> With this patch, user verbs module queries rdma device cgroup controller
> to query process's limit to consume such resource. It uncharge resource
> counter after resource is being freed.
>
> It extends the task structure to hold the statistic information about process's
> rdma resource usage so that when process migrates from one to other controller,
> right amount of resources can be migrated from one to other cgroup.
>
> Future patches will support RDMA flows resource and will be enhanced further
> to enforce limit of other resources and capabilities.
>
> Parav Pandit (7):
>   devcg: Added user option to rdma resource tracking.
>   devcg: Added rdma resource tracking module.
>   devcg: Added infrastructure for rdma device cgroup.
>   devcg: Added rdma resource tracker object per task
>   devcg: device cgroup's extension for RDMA resource.
>   devcg: Added support to use RDMA device cgroup.
>   devcg: Added Documentation of RDMA device cgroup.
>
>  Documentation/cgroups/devices.txt     |  32 ++-
>  drivers/infiniband/core/uverbs_cmd.c  | 139 +++++++++--
>  drivers/infiniband/core/uverbs_main.c |  39 +++-
>  include/linux/device_cgroup.h         |  53 +++++
>  include/linux/device_rdma_cgroup.h    |  83 +++++++
>  include/linux/sched.h                 |  12 +-
>  init/Kconfig                          |  12 +
>  security/Makefile                     |   1 +
>  security/device_cgroup.c              | 119 +++++++---
>  security/device_rdma_cgroup.c         | 422 ++++++++++++++++++++++++++++++++++
>  10 files changed, 850 insertions(+), 62 deletions(-)
>  create mode 100644 include/linux/device_rdma_cgroup.h
>  create mode 100644 security/device_rdma_cgroup.c
>
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 60+ messages in thread
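The charge/uncharge flow described in the quoted cover letter can be
pictured with a short sketch: charge the process's device cgroup before an
RDMA resource is allocated, and uncharge the counter once the resource is
freed or the allocation fails. All cgroup-facing names below
(devcg_rdma_try_charge(), devcg_rdma_uncharge(), DEVCG_RDMA_RES_QP) are
hypothetical placeholders for illustration; they are not the identifiers
used in the patches.

	#include <rdma/ib_verbs.h>

	/* Hypothetical resource type and charging API, for illustration only. */
	#define DEVCG_RDMA_RES_QP	0
	int devcg_rdma_try_charge(struct task_struct *task, int resource, int num);
	void devcg_rdma_uncharge(struct task_struct *task, int resource, int num);

	static struct ib_qp *create_qp_charged(struct ib_pd *pd,
					       struct ib_qp_init_attr *attr)
	{
		struct ib_qp *qp;
		int ret;

		/* Fail up front if the caller's cgroup is at its QP limit. */
		ret = devcg_rdma_try_charge(current, DEVCG_RDMA_RES_QP, 1);
		if (ret)
			return ERR_PTR(ret);

		qp = ib_create_qp(pd, attr);
		if (IS_ERR(qp)) {
			/* Creation failed; give the charge back. */
			devcg_rdma_uncharge(current, DEVCG_RDMA_RES_QP, 1);
		}

		return qp;
	}

The destroy path would call devcg_rdma_uncharge() in the same way after
the QP is destroyed, which is the "uncharge the resource counter after the
resource is freed" step the cover letter refers to.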
Thread overview: 60+ messages
2015-09-07 20:38 [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit
[not found] ` <1441658303-18081-1-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-09-07 20:38 ` [PATCH 1/7] devcg: Added user option to rdma resource tracking Parav Pandit
2015-09-07 20:38 ` [PATCH 2/7] devcg: Added rdma resource tracking module Parav Pandit
2015-09-07 20:38 ` [PATCH 5/7] devcg: device cgroup's extension for RDMA resource Parav Pandit
2015-09-08 8:22 ` Haggai Eran
2015-09-08 10:18 ` Parav Pandit
2015-09-08 13:50 ` Haggai Eran
2015-09-08 14:13 ` Parav Pandit
2015-09-08 8:36 ` Haggai Eran
[not found] ` <55EE9DF5.7030401-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-09-08 10:50 ` Parav Pandit
2015-09-08 14:10 ` Haggai Eran
2015-09-07 20:38 ` [PATCH 7/7] devcg: Added Documentation of RDMA device cgroup Parav Pandit
2015-09-08 12:45 ` [PATCH 0/7] devcg: device cgroup extension for rdma resource Haggai Eran
2015-09-08 15:23 ` Tejun Heo
2015-09-09 3:57 ` Parav Pandit
2015-09-10 16:49 ` Tejun Heo
[not found] ` <20150910164946.GH8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-09-10 17:46 ` Parav Pandit
2015-09-10 20:22 ` Tejun Heo
2015-09-11 3:39 ` Parav Pandit
[not found] ` <CAG53R5WtuPA=J_GYPzNTAKbjB1r0K90qhXEDxLNf7vxYyxgrKA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-11 4:04 ` Tejun Heo
[not found] ` <20150911040413.GA18850-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2015-09-11 4:24 ` Doug Ledford
[not found] ` <55F25781.20308-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-09-11 14:52 ` Tejun Heo
2015-09-11 16:26 ` Parav Pandit
[not found] ` <CAG53R5X5z-H15f1FzCFFqao=taYeHyJnXAZT2mPzAHYOkyq-_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-11 16:34 ` Tejun Heo
[not found] ` <20150911163449.GS8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-09-11 16:39 ` Parav Pandit
2015-09-11 19:25 ` Tejun Heo
[not found] ` <20150911192517.GU8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-09-14 10:18 ` Parav Pandit
[not found] ` <20150911145213.GQ8114-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-09-11 16:47 ` Parav Pandit
[not found] ` <CAG53R5X5o8hJX1VJ00j5Bxuaps3FGCPNss4ey-07Dq+XP8xoBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-11 19:05 ` Tejun Heo
2015-09-11 19:22 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373A903A586-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-09-11 19:43 ` Jason Gunthorpe
2015-09-11 20:06 ` Hefty, Sean
2015-09-14 11:09 ` Parav Pandit
2015-09-14 14:04 ` Parav Pandit
[not found] ` <CAG53R5U7sYnR2w+Wrhh58Ud1HOrKLDCYxZZgK58FyAkJ8exshw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-14 15:21 ` Tejun Heo
[not found] ` <CAG53R5XsMwnLK7L4q1mQx3_wEJNv1qthOr5TsX0o43kRWaiWrg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-14 17:28 ` Jason Gunthorpe
[not found] ` <20150914172832.GA21652-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-09-14 18:54 ` Parav Pandit
2015-09-14 20:18 ` Jason Gunthorpe
2015-09-15 3:08 ` Parav Pandit
[not found] ` <CAG53R5XY1q+AqJvgtK_Qd4Sai2kZX9vhDKD_2dNXpw4Gf=nz0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-15 3:45 ` Jason Gunthorpe
[not found] ` <20150915034549.GA27847-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-09-16 4:41 ` Parav Pandit
2015-09-20 10:35 ` Haggai Eran
[not found] ` <55FE8C06.8010504-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-10-28 8:14 ` Parav Pandit
2015-09-14 10:15 ` Parav Pandit
2015-09-11 4:43 ` Parav Pandit
2015-09-11 15:03 ` Tejun Heo
2015-09-10 17:48 ` Hefty, Sean
2015-09-07 20:38 ` [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup Parav Pandit
[not found] ` <1441658303-18081-4-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-09-08 5:31 ` Haggai Eran
[not found] ` <55EE72B7.1060304-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-09-08 7:02 ` Parav Pandit
2015-09-07 20:38 ` [PATCH 4/7] devcg: Added rdma resource tracker object per task Parav Pandit
[not found] ` <1441658303-18081-5-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-09-08 5:48 ` Haggai Eran
2015-09-08 7:04 ` Parav Pandit
[not found] ` <CAG53R5VwLnDUjpOwaD_gZMkRBjyT1Wg_sSPw2gAg9oJkqdn3dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-09-08 8:24 ` Haggai Eran
2015-09-08 8:26 ` Parav Pandit
2015-09-07 20:38 ` [PATCH 6/7] devcg: Added support to use RDMA device cgroup Parav Pandit
[not found] ` <1441658303-18081-7-git-send-email-pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2015-09-08 8:40 ` Haggai Eran
2015-09-08 10:22 ` Parav Pandit
2015-09-08 13:40 ` Haggai Eran
2015-09-07 20:55 ` [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit