public inbox for cgroups@vger.kernel.org
* Re: [PATCH] nvmet: allow associating port to a cgroup via configfs
       [not found]     ` <90150ffd-ba2d-6528-21b7-7ea493cd2b9a-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
@ 2023-06-29 15:21       ` Ofir Gal
       [not found]         ` <79894bd9-03c7-8d27-eb6a-5e1336550449-83CFe9096HFBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Ofir Gal @ 2023-06-29 15:21 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	Sagi Grimberg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	tj-DgEjT+Ai2ygdnm+yROfE0A

Hey Sagi and Chaitanya,

On 28/06/2023 5:33, Chaitanya Kulkarni wrote:
> On 6/27/23 14:13, Sagi Grimberg wrote:
>> Hey Ofir,
>>
>>> Currently there is no way to throttle nvme targets with cgroup v2.
>>
>> How do you do it with v1?
With v1 I would add a blkio rule at the cgroup root level. The bios
that the nvme target submits aren't associated with a specific cgroup,
which makes them follow the rules of the root cgroup.

V2 doesn't allow setting rules at the root level by design.
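As a sketch, the v1 workaround amounts to something like this (the device numbers, paths, and limits are illustrative, not taken from an actual setup):

```shell
# cgroup v1: a throttle rule written at the blkio root applies to every
# bio not associated with a child cgroup -- including the bios submitted
# by the nvme target.
echo "259:0 10485760" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

# cgroup v2: the root cgroup intentionally has no io.max interface file,
# so there is no equivalent root-level rule to fall back on.
ls /sys/fs/cgroup/io.max   # fails: the file does not exist at the root
```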

>>> The IOs that the nvme target submits aren't associated with a cgroup,
>>> which makes them act as the root cgroup. The root cgroup can't be throttled
>>> with the cgroup v2 mechanism.
>>
>> What happens to file or passthru backends? You paid attention just to
>> bdev. I don't see how this is sanely supported with files. It's possible
>> if you convert nvmet to use its own dedicated kthreads and infer the
>> cg from the kthread. That presents a whole other set of issues.
>>
> 
> if we are doing it for one back-end we cannot leave other back-ends out ...
> 
>> Maybe the cleanest way to implement something like this is to implement
>> a full blown nvmet cgroup controller that you can apply a whole set of
>> resources to, in addition to I/O.

Throttling files and passthru isn't possible with cgroup v1 either;
cgroup v2 broke the ability to throttle bdevs. The purpose of the patch
is to re-enable the broken functionality.

There was an attempt to re-enable the functionality by allowing io
throttle on the root cgroup but it's against the cgroup v2 design.
Reference:
https://lore.kernel.org/r/20220114093000.3323470-1-yukuai3-hv44wF8Li93QT0dZR+AlfA@public.gmane.org/

A full blown nvmet cgroup controller may be a complete solution, but it
may take some time to achieve, while the feature is still broken.

>>
>>> Signed-off-by: Ofir Gal <ofir.gal-83CFe9096HFBDgjK7y7TUQ@public.gmane.org>
>>> ---
>>>   drivers/nvme/target/configfs.c    | 77 +++++++++++++++++++++++++++++++
>>>   drivers/nvme/target/core.c        |  3 ++
>>>   drivers/nvme/target/io-cmd-bdev.c | 13 ++++++
>>>   drivers/nvme/target/nvmet.h       |  3 ++
>>>   include/linux/cgroup.h            |  5 ++
>>>   kernel/cgroup/cgroup-internal.h   |  5 --
>>
>> Don't mix cgroup and nvmet changes in the same patch.

Thanks for clarifying, I wasn't sure if it's necessary. I will split the
patch for v2.

>>
>>>   6 files changed, 101 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/nvme/target/configfs.c 
>>> b/drivers/nvme/target/configfs.c
>>> index 907143870da5..2e8f93a07498 100644
>>> --- a/drivers/nvme/target/configfs.c
>>> +++ b/drivers/nvme/target/configfs.c
>>> @@ -12,6 +12,7 @@
>>>   #include <linux/ctype.h>
>>>   #include <linux/pci.h>
>>>   #include <linux/pci-p2pdma.h>
>>> +#include <linux/cgroup.h>
>>>   #ifdef CONFIG_NVME_TARGET_AUTH
>>>   #include <linux/nvme-auth.h>
>>>   #endif
>>> @@ -281,6 +282,81 @@ static ssize_t 
>>> nvmet_param_pi_enable_store(struct config_item *item,
>>>   CONFIGFS_ATTR(nvmet_, param_pi_enable);
>>>   #endif
>>>   +static ssize_t nvmet_param_associated_cgroup_show(struct 
>>> config_item *item,
>>> +        char *page)
>>> +{
>>> +    struct nvmet_port *port = to_nvmet_port(item);
>>> +    ssize_t len = 0;
>>> +    ssize_t retval;
>>> +    char *suffix;
>>> +
>>> +    /* No cgroup has been set means the IOs are associated to the 
>>> root cgroup */
>>> +    if (!port->cgrp)
>>> +        goto root_cgroup;
>>> +
>>> +    retval = cgroup_path_ns(port->cgrp, page, PAGE_SIZE,
>>> +                         current->nsproxy->cgroup_ns);
>>> +    if (retval >= PATH_MAX || retval >= PAGE_SIZE)
>>> +        return -ENAMETOOLONG;
>>> +
>>> +    /* No cgroup found means the IOs are associated to the root 
>>> cgroup */
>>> +    if (retval < 0)
>>> +        goto root_cgroup;
>>> +
>>> +    len += retval;
>>> +
>>> +    suffix = cgroup_is_dead(port->cgrp) ? " (deleted)\n" : "\n";
>>> +    len += snprintf(page + len, PAGE_SIZE - len, suffix);
>>> +
>>> +    return len;
>>> +
>>> +root_cgroup:
>>> +    return snprintf(page, PAGE_SIZE, "/\n");
>>> +}
>>> +
>>> +static ssize_t nvmet_param_associated_cgroup_store(struct 
>>> config_item *item,
>>> +        const char *page, size_t count)
>>> +{
>>> +    struct nvmet_port *port = to_nvmet_port(item);
>>> +    struct cgroup_subsys_state *blkcg;
>>> +    ssize_t retval = -EINVAL;
>>> +    struct cgroup *cgrp;
>>> +    char *path;
>>> +    int len;
>>> +
>>> +    len = strcspn(page, "\n");
>>> +    if (!len)
>>> +        return -EINVAL;
>>> +
>>> +    path = kmemdup_nul(page, len, GFP_KERNEL);
>>> +    if (!path)
>>> +        return -ENOMEM;
>>> +
>>> +    cgrp = cgroup_get_from_path(path);
>>> +    kfree(path);
>>> +    if (IS_ERR(cgrp))
>>> +        return -ENOENT;
>>> +
>>> +    blkcg = cgroup_get_e_css(cgrp, &io_cgrp_subsys);
>>> +    if (!blkcg)
>>> +        goto out_put_cgroup;
>>> +
>>> +    /* Put old cgroup */
>>> +    if (port->cgrp)
>>> +        cgroup_put(port->cgrp);
>>> +
>>> +    port->cgrp = cgrp;
>>> +    port->blkcg = blkcg;
>>> +
>>> +    return count;
>>> +
>>> +out_put_cgroup:
>>> +    cgroup_put(cgrp);
>>> +    return retval;
>>> +}
>>
>> I'm not at all convinced that nvmet ratelimiting does not
>> require a dedicated cgroup controller... Regardless, this doesn't
>> look like a port attribute, it's a subsystem attribute.
> 
> +1 here, can you please explain the choice of port ?

In cgroup, threads/processes are associated with a specific control group.
Each control group may have different rules to throttle various devices.
For example we may have 2 applications both using the same bdev.
By associating the apps to different cgroups, we can create a different
throttling rule for each app.
Throttling is done by echoing "MAJOR:MINOR rbps=X wiops=Y" to "io.max"
of the cgroup.
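As a sketch, creating a child cgroup and throttling a device for its members looks like this (the 259:0 MAJOR:MINOR pair and the limits are illustrative):

```shell
# Enable the io controller for child cgroups, create a group, and cap
# device 259:0 at 10 MB/s reads and 1000 write IOPS for that group.
echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/grp1
echo "259:0 rbps=10485760 wiops=1000" > /sys/fs/cgroup/grp1/io.max
```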

Associating a subsystem to a cgroup will only allow us to create a
single rule for each namespace (bdev) in this subsystem.
When associating the nvme port to the cgroup it acts as the "thread"
that handles the IO for the target, which aligns with the cgroup design.

Regardless of whether the attribute is part of the port or the subsystem,
the user needs to specify constraints per namespace. I see no clear value
in setting the cgroup attribute on the subsystem.

On the other hand, by associating a port to a cgroup we could have
multiple constraints per namespace. It will allow the user to have more
control over the behavior of their system.
For example, a system with an RDMA port and a TCP port that are connected
to the same subsystem can apply different limits for each port.
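With the proposed param_associated_cgroup attribute, that per-port setup could look roughly like the following sketch (cgroup names, port numbers, device numbers, and limits are all hypothetical):

```shell
# Associate the RDMA port (port 1) and the TCP port (port 2) with two
# different cgroups via the proposed configfs attribute...
echo "/nvmet_rdma" > /sys/kernel/config/nvmet/ports/1/param_associated_cgroup
echo "/nvmet_tcp"  > /sys/kernel/config/nvmet/ports/2/param_associated_cgroup

# ...then give each cgroup its own io.max rule for the same backing bdev.
echo "259:0 riops=20000" > /sys/fs/cgroup/nvmet_rdma/io.max
echo "259:0 riops=5000"  > /sys/fs/cgroup/nvmet_tcp/io.max
```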

This could be complementary to NVMe ANA, for example, where the target
could apply different constraints for optimized and non-optimized paths.

To elaborate, such an approach fits nicely into the NVMe controller model.
Controllers report a namespace's ANA group, and different controllers may
report different ANA groups for the same namespace.
So a namespace's ANA group is basically a property of the controller, not
a property of the subsystem.
Further, the controller is associated with a port and thus might inherit
some of its properties.

> also for cgroup related changes I think you need to CC Tejun who's
> the author of the cgroup_is_dead() and whatever the right mailing list ..
> 
> -ck

Thanks for clarifying, I wasn't sure if it was necessary or not.

>> +
>> +CONFIGFS_ATTR(nvmet_, param_associated_cgroup);
>> +
>>   static ssize_t nvmet_addr_trtype_show(struct config_item *item,
>>   		char *page)
>>   {
>> @@ -1742,6 +1818,7 @@ static struct configfs_attribute *nvmet_port_attrs[] = {
>>   	&nvmet_attr_addr_trsvcid,
>>   	&nvmet_attr_addr_trtype,
>>   	&nvmet_attr_param_inline_data_size,
>> +	&nvmet_attr_param_associated_cgroup,
>>   #ifdef CONFIG_BLK_DEV_INTEGRITY
>>   	&nvmet_attr_param_pi_enable,
>>   #endif
>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>> index 3935165048e7..b63996b61c6d 100644
>> --- a/drivers/nvme/target/core.c
>> +++ b/drivers/nvme/target/core.c
>> @@ -376,6 +376,9 @@ void nvmet_disable_port(struct nvmet_port *port)
>>   	port->enabled = false;
>>   	port->tr_ops = NULL;
>>   
>> +	if (port->cgrp)
>> +		cgroup_put(port->cgrp);
>> +
>>   	ops = nvmet_transports[port->disc_addr.trtype];
>>   	ops->remove_port(port);
>>   	module_put(ops->owner);
>> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
>> index c2d6cea0236b..eb63a071131d 100644
>> --- a/drivers/nvme/target/io-cmd-bdev.c
>> +++ b/drivers/nvme/target/io-cmd-bdev.c
>> @@ -8,6 +8,8 @@
>>   #include <linux/blk-integrity.h>
>>   #include <linux/memremap.h>
>>   #include <linux/module.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/blk-cgroup.h>
>>   #include "nvmet.h"
>>   
>>   void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id)
>> @@ -285,6 +287,8 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>   	bio->bi_iter.bi_sector = sector;
>>   	bio->bi_private = req;
>>   	bio->bi_end_io = nvmet_bio_done;
>> +	if (req->port->blkcg)
>> +		bio_associate_blkg_from_css(bio, req->port->blkcg);
>>   
>>   	blk_start_plug(&plug);
>>   	if (req->metadata_len)
>> @@ -308,6 +312,7 @@ static void nvmet_bdev_execute_rw(struct nvmet_req *req)
>>   			bio = bio_alloc(req->ns->bdev, bio_max_segs(sg_cnt),
>>   					opf, GFP_KERNEL);
>>   			bio->bi_iter.bi_sector = sector;
>> +			bio_clone_blkg_association(bio, prev);
>>   
>>   			bio_chain(bio, prev);
>>   			submit_bio(prev);
>> @@ -345,6 +350,8 @@ static void nvmet_bdev_execute_flush(struct nvmet_req *req)
>>   		 ARRAY_SIZE(req->inline_bvec), REQ_OP_WRITE | REQ_PREFLUSH);
>>   	bio->bi_private = req;
>>   	bio->bi_end_io = nvmet_bio_done;
>> +	if (req->port->blkcg)
>> +		bio_associate_blkg_from_css(bio, req->port->blkcg);
>>   
>>   	submit_bio(bio);
>>   }
>> @@ -397,6 +404,9 @@ static void nvmet_bdev_execute_discard(struct nvmet_req *req)
>>   	if (bio) {
>>   		bio->bi_private = req;
>>   		bio->bi_end_io = nvmet_bio_done;
>> +		if (req->port->blkcg)
>> +			bio_associate_blkg_from_css(bio, req->port->blkcg);
>> +
>>   		if (status)
>>   			bio_io_error(bio);
>>   		else
>> @@ -444,6 +454,9 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req)
>>   	if (bio) {
>>   		bio->bi_private = req;
>>   		bio->bi_end_io = nvmet_bio_done;
>> +		if (req->port->blkcg)
>> +			bio_associate_blkg_from_css(bio, req->port->blkcg);
>> +
>>   		submit_bio(bio);
>>   	} else {
>>   		nvmet_req_complete(req, errno_to_nvme_status(req, ret));
>> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
>> index dc60a22646f7..3e5c9737d07e 100644
>> --- a/drivers/nvme/target/nvmet.h
>> +++ b/drivers/nvme/target/nvmet.h
>> @@ -20,6 +20,7 @@
>>   #include <linux/blkdev.h>
>>   #include <linux/radix-tree.h>
>>   #include <linux/t10-pi.h>
>> +#include <linux/cgroup.h>
>>   
>>   #define NVMET_DEFAULT_VS		NVME_VS(1, 3, 0)
>>   
>> @@ -163,6 +164,8 @@ struct nvmet_port {
>>   	int				inline_data_size;
>>   	const struct nvmet_fabrics_ops	*tr_ops;
>>   	bool				pi_enable;
>> +	struct cgroup				*cgrp;
>> +	struct cgroup_subsys_state	*blkcg;
>>   };
>>   
>>   static inline struct nvmet_port *to_nvmet_port(struct config_item *item)
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 885f5395fcd0..47e2a7cdc31e 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -562,6 +562,11 @@ static inline bool cgroup_is_populated(struct cgroup *cgrp)
>>   		cgrp->nr_populated_threaded_children;
>>   }
>>   
>> +static inline bool cgroup_is_dead(const struct cgroup *cgrp)
>> +{
>> +	return !(cgrp->self.flags & CSS_ONLINE);
>> +}
>> +
>>   /* returns ino associated with a cgroup */
>>   static inline ino_t cgroup_ino(struct cgroup *cgrp)
>>   {
>> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
>> index 367b0a42ada9..8c5c83e9edd7 100644
>> --- a/kernel/cgroup/cgroup-internal.h
>> +++ b/kernel/cgroup/cgroup-internal.h
>> @@ -181,11 +181,6 @@ extern struct list_head cgroup_roots;
>>   	for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT &&		\
>>   	     (((ss) = cgroup_subsys[ssid]) || true); (ssid)++)
>>   
>> -static inline bool cgroup_is_dead(const struct cgroup *cgrp)
>> -{
>> -	return !(cgrp->self.flags & CSS_ONLINE);
>> -}
>> -
>>   static inline bool notify_on_release(const struct cgroup *cgrp)
>>   {
>>   	return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] nvmet: allow associating port to a cgroup via configfs
       [not found]         ` <79894bd9-03c7-8d27-eb6a-5e1336550449-83CFe9096HFBDgjK7y7TUQ@public.gmane.org>
@ 2023-06-30 18:26           ` Michal Koutný
  2023-07-03 13:21           ` Sagi Grimberg
  1 sibling, 0 replies; 6+ messages in thread
From: Michal Koutný @ 2023-06-30 18:26 UTC (permalink / raw)
  To: Ofir Gal
  Cc: Chaitanya Kulkarni, hch-jcswGhMUV9g@public.gmane.org,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	Sagi Grimberg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	tj-DgEjT+Ai2ygdnm+yROfE0A


Hello.

On Thu, Jun 29, 2023 at 06:21:34PM +0300, Ofir Gal <ofir.gal-83CFe9096HFBDgjK7y7TUQ@public.gmane.org> wrote:
> >>> Currently there is no way to throttle nvme targets with cgroup v2.
> >>
> >> How do you do it with v1?
> With v1 I would add a blkio rule at the cgroup root level. The bios
> that the nvme target submits aren't associated with a specific cgroup,
> which makes them follow the rules of the root cgroup.
> 
> V2 doesn't allow setting rules at the root level by design.

I think you want to add a per-nvme port (device?) throttled-bandwidth
attribute (instead of param_associated_cgroup) to control this on the
client (whole host) side -- even without cgroups.
Associating ports with cgroups and then bios with cgroups looks
superfluous.
(But it is understandably tempting, because then the IO controller's
throttling implementation may be reused.)


How are the nvmet_bdev_execute_rw() et al functions called? (Only via
kthreads without (uspace) process context? Or can the reqs be traced to
a user space process?)

Thanks,
Michal



* Re: [PATCH] nvmet: allow associating port to a cgroup via configfs
       [not found]         ` <79894bd9-03c7-8d27-eb6a-5e1336550449-83CFe9096HFBDgjK7y7TUQ@public.gmane.org>
  2023-06-30 18:26           ` Michal Koutný
@ 2023-07-03 13:21           ` Sagi Grimberg
       [not found]             ` <ab204a2d-9a30-7c90-8afa-fc84c935730f-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2023-07-03 13:21 UTC (permalink / raw)
  To: Ofir Gal, Chaitanya Kulkarni
  Cc: hch-jcswGhMUV9g@public.gmane.org,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A


> Hey Sagi and Chaitanya,
> 
> On 28/06/2023 5:33, Chaitanya Kulkarni wrote:
>> On 6/27/23 14:13, Sagi Grimberg wrote:
>>> Hey Ofir,
>>>
>>>> Currently there is no way to throttle nvme targets with cgroup v2.
>>>
>>> How do you do it with v1?
> With v1 I would add a blkio rule at the cgroup root level. The bios
> that the nvme target submits aren't associated with a specific cgroup,
> which makes them follow the rules of the root cgroup.
> 
> V2 doesn't allow setting rules at the root level by design.
> 
>>>> The IOs that the nvme target submits aren't associated with a cgroup,
>>>> which makes them act as the root cgroup. The root cgroup can't be throttled
>>>> with the cgroup v2 mechanism.
>>>
>>> What happens to file or passthru backends? You paid attention just to
>>> bdev. I don't see how this is sanely supported with files. It's possible
>>> if you convert nvmet to use its own dedicated kthreads and infer the
>>> cg from the kthread. That presents a whole other set of issues.
>>>
>>
>> if we are doing it for one back-end we cannot leave other back-ends out ...
>>
>>> Maybe the cleanest way to implement something like this is to implement
>>> a full blown nvmet cgroup controller that you can apply a whole set of
>>> resources to, in addition to I/O.
> 
> Throttling files and passthru isn't possible with cgroup v1 either;
> cgroup v2 broke the ability to throttle bdevs. The purpose of the patch
> is to re-enable the broken functionality.

cgroupv2 didn't break anything, this was never an intended feature of
the linux nvme target, so it couldn't have been broken. Did anyone
know that people are doing this with nvmet?

I'm pretty sure others on the list are treating this as a suggested
new feature for nvmet, and designing this feature as something that
is only supported for blkdevs is undesirable.


> There was an attempt to re-enable the functionality by allowing io
> throttle on the root cgroup but it's against the cgroup v2 design.
> Reference:
> https://lore.kernel.org/r/20220114093000.3323470-1-yukuai3-hv44wF8Li93QT0dZR+AlfA@public.gmane.org/
> 
> A full blown nvmet cgroup controller may be a complete solution, but it
> may take some time to achieve,

I don't see any other sane solution here.

Maybe Tejun/others think differently here?

> while the feature is still broken.

Again, this is not a breakage.

> 
>>>
>>>> Signed-off-by: Ofir Gal <ofir.gal-83CFe9096HFBDgjK7y7TUQ@public.gmane.org>
>>>> ---
>>>>    drivers/nvme/target/configfs.c    | 77 +++++++++++++++++++++++++++++++
>>>>    drivers/nvme/target/core.c        |  3 ++
>>>>    drivers/nvme/target/io-cmd-bdev.c | 13 ++++++
>>>>    drivers/nvme/target/nvmet.h       |  3 ++
>>>>    include/linux/cgroup.h            |  5 ++
>>>>    kernel/cgroup/cgroup-internal.h   |  5 --
>>>
>>> Don't mix cgroup and nvmet changes in the same patch.
> 
> Thanks for clarifying, I wasn't sure if it's necessary. I will split the
> patch for v2.
> 
>>>
>>>>    6 files changed, 101 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/target/configfs.c
>>>> b/drivers/nvme/target/configfs.c
>>>> index 907143870da5..2e8f93a07498 100644
>>>> --- a/drivers/nvme/target/configfs.c
>>>> +++ b/drivers/nvme/target/configfs.c
>>>> @@ -12,6 +12,7 @@
>>>>    #include <linux/ctype.h>
>>>>    #include <linux/pci.h>
>>>>    #include <linux/pci-p2pdma.h>
>>>> +#include <linux/cgroup.h>
>>>>    #ifdef CONFIG_NVME_TARGET_AUTH
>>>>    #include <linux/nvme-auth.h>
>>>>    #endif
>>>> @@ -281,6 +282,81 @@ static ssize_t
>>>> nvmet_param_pi_enable_store(struct config_item *item,
>>>>    CONFIGFS_ATTR(nvmet_, param_pi_enable);
>>>>    #endif
>>>>    +static ssize_t nvmet_param_associated_cgroup_show(struct
>>>> config_item *item,
>>>> +        char *page)
>>>> +{
>>>> +    struct nvmet_port *port = to_nvmet_port(item);
>>>> +    ssize_t len = 0;
>>>> +    ssize_t retval;
>>>> +    char *suffix;
>>>> +
>>>> +    /* No cgroup has been set means the IOs are associated to the
>>>> root cgroup */
>>>> +    if (!port->cgrp)
>>>> +        goto root_cgroup;
>>>> +
>>>> +    retval = cgroup_path_ns(port->cgrp, page, PAGE_SIZE,
>>>> +                         current->nsproxy->cgroup_ns);
>>>> +    if (retval >= PATH_MAX || retval >= PAGE_SIZE)
>>>> +        return -ENAMETOOLONG;
>>>> +
>>>> +    /* No cgroup found means the IOs are associated to the root
>>>> cgroup */
>>>> +    if (retval < 0)
>>>> +        goto root_cgroup;
>>>> +
>>>> +    len += retval;
>>>> +
>>>> +    suffix = cgroup_is_dead(port->cgrp) ? " (deleted)\n" : "\n";
>>>> +    len += snprintf(page + len, PAGE_SIZE - len, suffix);
>>>> +
>>>> +    return len;
>>>> +
>>>> +root_cgroup:
>>>> +    return snprintf(page, PAGE_SIZE, "/\n");
>>>> +}
>>>> +
>>>> +static ssize_t nvmet_param_associated_cgroup_store(struct
>>>> config_item *item,
>>>> +        const char *page, size_t count)
>>>> +{
>>>> +    struct nvmet_port *port = to_nvmet_port(item);
>>>> +    struct cgroup_subsys_state *blkcg;
>>>> +    ssize_t retval = -EINVAL;
>>>> +    struct cgroup *cgrp;
>>>> +    char *path;
>>>> +    int len;
>>>> +
>>>> +    len = strcspn(page, "\n");
>>>> +    if (!len)
>>>> +        return -EINVAL;
>>>> +
>>>> +    path = kmemdup_nul(page, len, GFP_KERNEL);
>>>> +    if (!path)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    cgrp = cgroup_get_from_path(path);
>>>> +    kfree(path);
>>>> +    if (IS_ERR(cgrp))
>>>> +        return -ENOENT;
>>>> +
>>>> +    blkcg = cgroup_get_e_css(cgrp, &io_cgrp_subsys);
>>>> +    if (!blkcg)
>>>> +        goto out_put_cgroup;
>>>> +
>>>> +    /* Put old cgroup */
>>>> +    if (port->cgrp)
>>>> +        cgroup_put(port->cgrp);
>>>> +
>>>> +    port->cgrp = cgrp;
>>>> +    port->blkcg = blkcg;
>>>> +
>>>> +    return count;
>>>> +
>>>> +out_put_cgroup:
>>>> +    cgroup_put(cgrp);
>>>> +    return retval;
>>>> +}
>>>
>>> I'm not at all convinced that nvmet ratelimiting does not
>>> require a dedicated cgroup controller... Regardless, this doesn't
>>> look like a port attribute, it's a subsystem attribute.
>>
>> +1 here, can you please explain the choice of port ?
> 
> In cgroup, threads/processes are associated with a specific control group.
> Each control group may have different rules to throttle various devices.
> For example we may have 2 applications both using the same bdev.
> By associating the apps to different cgroups, we can create a different
> throttling rule for each app.
> Throttling is done by echoing "MAJOR:MINOR rbps=X wiops=Y" to "io.max"
> of the cgroup.
> 
> Associating a subsystem to a cgroup will only allow us to create a
> single rule for each namespace (bdev) in this subsystem.
> When associating the nvme port to the cgroup it acts as the "thread"
> that handles the IO for the target, which aligns with the cgroup design.

The port does not own the thread though; rdma/loop inherit the
"thread" from something entirely different, and tcp uses the same worker
threads for all ports (a global workqueue).

IIUC, a cgroup would be associated with a specific host, and the entity
that is host-aware is a subsystem, not a port.

> Regardless of whether the attribute is part of the port or the subsystem,
> the user needs to specify constraints per namespace. I see no clear value
> in setting the cgroup attribute on the subsystem.

Maybe it would be worth explaining what you are trying to achieve here,
because my understanding is that you want to allow different hosts to
get different I/O service levels. The right place to do this is the
subsystem AFAICT.

> On the other hand, by associating a port to a cgroup we could have
> multiple constraints per namespace. It will allow the user to have more
> control over the behavior of their system.

Can you please clarify what you mean by "multiple constraints per
namespace"?

> For example, a system with an RDMA port and a TCP port that are connected
> to the same subsystem can apply different limits for each port.
> 
> This could be complementary to NVMe ANA, for example, where the target
> could apply different constraints for optimized and non-optimized paths.

I don't understand what you are suggesting here. nvme already has
asymmetry semantics between controllers.

I think you are mixing two different things. There is asymmetric access
in nvmet (which could translate to tcp vs. rdma or to something else)
and there is providing different I/O service levels to different hosts.
I don't see how these two are connected.


* Re: [PATCH] nvmet: allow associating port to a cgroup via configfs
       [not found]             ` <ab204a2d-9a30-7c90-8afa-fc84c935730f-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2023-07-03 19:16               ` Tejun Heo
       [not found]                 ` <ZKMehaAF0v-nV1qt-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2023-07-03 19:16 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Ofir Gal, Chaitanya Kulkarni, hch-jcswGhMUV9g@public.gmane.org,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hello,

On Mon, Jul 03, 2023 at 04:21:40PM +0300, Sagi Grimberg wrote:
> > Throttling files and passthru isn't possible with cgroup v1 either;
> > cgroup v2 broke the ability to throttle bdevs. The purpose of the patch
> > is to re-enable the broken functionality.
> 
> cgroupv2 didn't break anything, this was never an intended feature of
> the linux nvme target, so it couldn't have been broken. Did anyone
> know that people are doing this with nvmet?

Maybe he's referring to the fact that cgroup1 allowed throttling root
cgroups? Maybe they were throttling from the root cgroup on the client side?

> I'm pretty sure others on the list are treating this as a suggested
> new feature for nvmet, and designing this feature as something that
> is only supported for blkdevs is undesirable.
> 
> > There was an attempt to re-enable the functionality by allowing io
> > throttle on the root cgroup but it's against the cgroup v2 design.
> > Reference:
> > https://lore.kernel.org/r/20220114093000.3323470-1-yukuai3-hv44wF8Li93QT0dZR+AlfA@public.gmane.org/
> > 
> > A full blown nvmet cgroup controller may be a complete solution, but it
> > may take some time to achieve,
> 
> I don't see any other sane solution here.
> 
> Maybe Tejun/others think differently here?

I'm not necessarily against the idea of enabling subsystems to assign cgroup
membership to entities which aren't processes or threads. It does make sense
for cases where a kernel subsystem is serving multiple classes of users
which aren't processes as here and it's likely that we'd need something
similar for certain memory regions in a limited way (e.g. tmpfs chunk shared
across multiple cgroups).

That said, because we haven't done this before, we haven't figured out what
the API should look like, and we definitely want something which can be used
in a similar fashion across the board. Also, cgroup does assume that resources
are always associated with processes or threads, and making this work with a
non-task entity would require some generalization there. Maybe the solution
is to always have a tying kthread which serves as a proxy for the resource,
but that seems a bit nasty, at least on first thought.

In principle, at least from cgroup POV, I think the idea of being able to
assign cgroup membership to subsystem-specific entities is okay. In
practice, there are quite a few challenges to address.

Thanks.

-- 
tejun


* Re: [PATCH] nvmet: allow associating port to a cgroup via configfs
       [not found]                 ` <ZKMehaAF0v-nV1qt-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
@ 2023-07-04  7:54                   ` Sagi Grimberg
       [not found]                     ` <b181b848-b2c7-4a7e-7173-ff6c771d6731-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2023-07-04  7:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ofir Gal, Chaitanya Kulkarni, hch-jcswGhMUV9g@public.gmane.org,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA


>>> A full blown nvmet cgroup controller may be a complete solution, but it
>>> may take some time to achieve,
>>
>> I don't see any other sane solution here.
>>
>> Maybe Tejun/others think differently here?
> 
> I'm not necessarily against the idea of enabling subsystems to assign cgroup
> membership to entities which aren't processes or threads. It does make sense
> for cases where a kernel subsystem is serving multiple classes of users
> which aren't processes as here and it's likely that we'd need something
> similar for certain memory regions in a limited way (e.g. tmpfs chunk shared
> across multiple cgroups).

That makes sense.

 From the nvme target side, the prime use-case is I/O, which can run
against bdev backends, file backends or passthru nvme devices.

What we'd want is for something that is agnostic to the backend type
hence my comment that the only sane solution would be to introduce a
nvmet cgroup controller.

I also asked the question of what is the use-case here? because the
"users" are remote nvme hosts accessing nvmet, there is no direct
mapping between a nvme namespace (backed by say a bdev) to a host, only
indirect mapping via a subsystem over a port (which is kinda-sorta
similar to a SCSI I_T Nexus). Implementing I/O service-levels
enforcement with blkcg seems like the wrong place to me.

> That said, because we haven't done this before, we haven't figured out what
> the API should look like, and we definitely want something which can be used
> in a similar fashion across the board. Also, cgroup does assume that resources
> are always associated with processes or threads, and making this work with a
> non-task entity would require some generalization there. Maybe the solution
> is to always have a tying kthread which serves as a proxy for the resource,
> but that seems a bit nasty, at least on first thought.

That was also a thought earlier in the thread, as that is pretty much
what the loop driver does. However, that requires quite a bit of
infrastructure, because nvmet threads are primarily workqueues/kworkers;
there is no notion of kthreads per entity.

> In principle, at least from cgroup POV, I think the idea of being able to
> assign cgroup membership to subsystem-specific entities is okay. In
> practice, there are quite a few challenges to address.

Understood.


* Re: [PATCH] nvmet: allow associating port to a cgroup via configfs
       [not found]                     ` <b181b848-b2c7-4a7e-7173-ff6c771d6731-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2023-07-06 13:28                       ` Ofir Gal
  0 siblings, 0 replies; 6+ messages in thread
From: Ofir Gal @ 2023-07-06 13:28 UTC (permalink / raw)
  To: Sagi Grimberg, Tejun Heo
  Cc: Chaitanya Kulkarni, hch-jcswGhMUV9g@public.gmane.org,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On 04/07/2023 10:54, Sagi Grimberg wrote:
>>>> A full blown nvmet cgroup controller may be a complete solution, but it
>>>> may take some time to achieve,
>>>
>>> I don't see any other sane solution here.
>>>
>>> Maybe Tejun/others think differently here?
>>
>> I'm not necessarily against the idea of enabling subsystems to assign
>> cgroup membership to entities which aren't processes or threads. It
>> does make sense for cases where a kernel subsystem is serving multiple
>> classes of users which aren't processes as here and it's likely that
>> we'd need something similar for certain memory regions in a limited
>> way (e.g. tmpfs chunk shared across multiple cgroups).
> 
> That makes sense.
> 
> From the nvme target side, the prime use-case is I/O, which can be on
> against bdev backends, file backends or passthru nvme devices.
> 
> What we'd want is for something that is agnostic to the backend type
> hence my comment that the only sane solution would be to introduce a
> nvmet cgroup controller.
> 
> I also asked the question of what is the use-case here? because the
> "users" are remote nvme hosts accessing nvmet, there is no direct
> mapping between a nvme namespace (backed by say a bdev) to a host, only
> indirect mapping via a subsystem over a port (which is kinda-sorta
> similar to a SCSI I_T Nexus). Implementing I/O service-levels
> enforcement with blkcg seems like the wrong place to me.

We have a server with a bdev that reaches 15K IOPS, but when more than
10K IOPS are served the latency of the bdev doubles. We want to limit
this bdev to 10K IOPS to serve the clients with low latency.

For the sake of simplicity, the server exposes the bdev with the nvme
target under a single subsystem, with a single namespace, through a
single port.

There are 2 clients connected to the server; the clients aren't aware of
each other and submit an unknown amount of IO.
Without the limitation on the server side we have to limit each client
to half of the bdev's capability, in our example 5K IOPS for each client.

This would be a non-optimal limitation; sometimes client #1 is idle while
client #2 needs to submit 10K IOPS.
With a server-side limitation, both clients can submit up to 10K IOPS
combined, but are exposed to the "noisy neighbor" problem.
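With the proposed attribute, the server-side cap in this example could be expressed roughly as follows (the cgroup name, port number, and 259:0 device numbers are illustrative):

```shell
# One cgroup for the nvmet port; both clients' IO is accounted to it,
# so together they are limited to 10K IOPS on the shared bdev.
mkdir -p /sys/fs/cgroup/nvmet
echo "259:0 riops=10000 wiops=10000" > /sys/fs/cgroup/nvmet/io.max
echo "/nvmet" > /sys/kernel/config/nvmet/ports/1/param_associated_cgroup
```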

>> That said, because we haven't done this before, we haven't figured out
>> what the API should look like, and we definitely want something which
>> can be used in a similar fashion across the board. Also, cgroup does
>> assume that resources are always associated with processes or threads,
>> and making this work with a non-task entity would require some
>> generalization there. Maybe the solution is to always have a tying
>> kthread which serves as a proxy for the resource, but that seems a bit
>> nasty, at least on first thought.

A general API to associate non-task entities sounds great!


On 03/07/2023 13:21, Sagi Grimberg wrote:
> cgroupv2 didn't break anything, this was never an intended feature of
> the linux nvme target, so it couldn't have been broken. Did anyone
> know that people are doing this with nvmet?
> 
> I'm pretty sure others on the list are treating this as a suggested
> new feature for nvmet, and designing this feature as something that
> is only supported for blkdevs is undesirable.

I understand that throttling of an nvme target was not an intended feature.
However, it was possible to create a global limit for any bdev, which
allowed throttling an nvme target with bdev backends.
Today, with the new cgroup architecture, it is impossible.

I don't see how this does not count as a user-space breakage.
I think this patch can be a temporary solution until a general API is
designed and implemented.




end of thread, other threads:[~2023-07-06 13:28 UTC | newest]

Thread overview: 6+ messages
     [not found] <20230627100215.1206008-1-ofir.gal@volumez.com>
     [not found] ` <30d03bba-c2e1-7847-f17e-403d4e648228@grimberg.me>
     [not found]   ` <90150ffd-ba2d-6528-21b7-7ea493cd2b9a@nvidia.com>
     [not found]     ` <90150ffd-ba2d-6528-21b7-7ea493cd2b9a-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2023-06-29 15:21       ` [PATCH] nvmet: allow associating port to a cgroup via configfs Ofir Gal
     [not found]         ` <79894bd9-03c7-8d27-eb6a-5e1336550449-83CFe9096HFBDgjK7y7TUQ@public.gmane.org>
2023-06-30 18:26           ` Michal Koutný
2023-07-03 13:21           ` Sagi Grimberg
     [not found]             ` <ab204a2d-9a30-7c90-8afa-fc84c935730f-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2023-07-03 19:16               ` Tejun Heo
     [not found]                 ` <ZKMehaAF0v-nV1qt-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
2023-07-04  7:54                   ` Sagi Grimberg
     [not found]                     ` <b181b848-b2c7-4a7e-7173-ff6c771d6731-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2023-07-06 13:28                       ` Ofir Gal
