Re: [PATCH v3] memcg: Add memory.pressure_level events

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Anton Vorontsov <anton@enomsg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Pekka Enberg <penberg@kernel.org>, Mel Gorman <mgorman@suse.de>,
	Glauber Costa <glommer@parallels.com>,
	Michal Hocko <mhocko@suse.cz>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Luiz Capitulino <lcapitulino@redhat.com>,
	Greg Thelen <gthelen@google.com>,
	Leonid Moiseichuk <leonid.moiseichuk@nokia.com>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Minchan Kim <minchan@kernel.org>,
	Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>,
	John Stultz <john.stultz@linaro.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linaro-kernel@lists.linaro.org, patches@linaro.org,
	kernel-team@android.com
Subject: Re: [PATCH v3] memcg: Add memory.pressure_level events
Date: Tue, 2 Apr 2013 21:02:46 -0700	[thread overview]
Message-ID: <20130403040246.GA32229@lizard.gateway.2wire.net> (raw)
In-Reply-To: <20130326134656.4e0e0aefcf881bffae769b1e@linux-foundation.org>

On Tue, Mar 26, 2013 at 01:46:56PM -0700, Andrew Morton wrote:
[...]
> > +The file memory.pressure_level is only used to setup an eventfd,
> > +read/write operations are no implemented.
[...]
> Did we tell people how to use the eventfd interface anywhere?

Good point. In v4 I added a detailed instructions on how to setup the file
descriptors.

> >  1. Add support for accounting huge pages (as a separate controller)
> >  2. Make per-cgroup scanner reclaim not-shared pages first
> > diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> > --- /dev/null
> > +++ b/include/linux/vmpressure.h
> > @@ -0,0 +1,47 @@
> > +#ifndef __LINUX_VMPRESSURE_H
> > +#define __LINUX_VMPRESSURE_H
> > +
> > +#include <linux/mutex.h>
> > +#include <linux/list.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/gfp.h>
> > +#include <linux/types.h>
> > +#include <linux/cgroup.h>
> > +
> > +struct vmpressure {
> > +	unsigned int scanned;
> > +	unsigned int reclaimed;
> > +	/* The lock is used to keep the scanned/reclaimed above in sync. */
> > +	struct mutex sr_lock;
> > +
> > +	struct list_head events;
> 
> A comment describing what goes at `events' would be nice.  Reference
> "struct vmpressure_event".

Done.

> > +	/* Have to grab the lock on events traversal or modifications. */
> > +	struct mutex events_lock;
> > +
> > +	struct work_struct work;
> > +};
> >
> > ...
> >
> > +/*
> > + * The window size is the number of scanned pages before we try to analyze
> > + * the scanned/reclaimed ratio (or difference).
> > + *
> > + * It is used as a rate-limit tunable for the "low" level notification,
> > + * and for averaging medium/critical levels. Using small window sizes can
> > + * cause lot of false positives, but too big window size will delay the
> > + * notifications.
> > + *
> > + * TODO: Make the window size depend on machine size, as we do for vmstat
> > + * thresholds.
> 
> Here "the window size" refers to vmpressure_win, yes?

Yup.

(To make it clear, in the new version I added a direct reference to the
vmpressure_win.)

> > + */
> > +static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
> > +static const unsigned int vmpressure_level_med = 60;
> > +static const unsigned int vmpressure_level_critical = 95;
> > +static const unsigned int vmpressure_level_critical_prio = 3;
> 
> vmpressure_level_critical_prio is a bit mysterious and undocumented. 
> Please document it here and/or at vmpressure_prio().

I added documentation in v4.

> > +void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> > +		unsigned long scanned, unsigned long reclaimed)
> 
> Exported function and a primary inteface.  Needs nice documentation, please ;)

Sure thing, all exported function now come with kernel-doc comments.

[...]
> > +	 */
> > +	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
> 
> I'm surprised at __GFP_HIGHMEM's inclusion.  On some machines the great
> majority of user memory is in highmem.  What's up?

In the new revision I included this comment:

        /*
         * Here we only want to account pressure that userland is able to
         * help us with. For example, suppose that DMA zone is under
         * pressure; if we notify userland about that kind of pressure,
         * then it will be mostly a waste as it will trigger unnecessary
         * freeing of memory by userland (since userland is more likely to
         * have HIGHMEM/MOVABLE pages instead of the DMA fallback). That
         * is why we include only movable, highmem and FS/IO pages.
         * Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
         * we account it too.
         */
        if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
                return;

> > +	if (!scanned)
> > +		return;
> > +
> > +	mutex_lock(&vmpr->sr_lock);
> > +	vmpr->scanned += scanned;
> > +	vmpr->reclaimed += reclaimed;
> 
> See, here we're accumulating into a 32-bit variable quantities which used
> to be held in 64-bit variables.    The overflow risk gets higher...

I see. I fixed this.

> > +	mutex_unlock(&vmpr->sr_lock);
> > +
> > +	if (scanned < vmpressure_win || work_pending(&vmpr->work))
> > +		return;
> > +	schedule_work(&vmpr->work);
> > +}
> > +
> > +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> 
> Documentation please.

Yup, done.

> > +{
> > +	if (prio > vmpressure_level_critical_prio)
> > +		return;
> > +
> > +	/*
> > +	 * OK, the prio is below the threshold, updating vmpressure
> 
> But you never told me what that threshold is for!  And I have no means
> of working out why you chose "3", nor the effects of altering it, etc.

True. This is explained it in a comment now.

[...]
> > +int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
> > +			      struct eventfd_ctx *eventfd, const char *args)
> 
> Document the interface, please.

Done.

> > +{
> > +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> > +	struct vmpressure_event *ev;
> > +	int lvl;
> 
> These abbreviations are rather unlinuxy.  wk->work, vmpr->vmpressure,
> lvl->level, etc.

Yeah, I agree. Although, 'vmpressure' as a function-scope variable is
kinda too long, the code becomes really hard to read. But in memcg struct
and global namespace I now use the full 'vmpressure' name.

> > +	for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> > +		if (!strcmp(vmpressure_str_levels[lvl], args))
> > +			break;
> > +	}
> > +
> > +	if (lvl >= VMPRESSURE_NUM_LEVELS)
> > +		return -EINVAL;
> > +
> > +	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> > +	if (!ev)
> > +		return -ENOMEM;
> > +
> > +	ev->efd = eventfd;
> > +	ev->level = lvl;
> > +
> > +	mutex_lock(&vmpr->events_lock);
> > +	list_add(&ev->node, &vmpr->events);
> 
> What's the upper bound on the length of this list?

As of now, it is controlled by the cgroup core, so I would say the number
of opened FDs, and if that is a problem, it should be fixed for everyone.
The good thing is that the list is per-cgroup, it is not global.


Thanks for the review, Andrew!

Anton

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Anton Vorontsov <anton@enomsg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: cgroups@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Pekka Enberg <penberg@kernel.org>, Mel Gorman <mgorman@suse.de>,
	Glauber Costa <glommer@parallels.com>,
	Michal Hocko <mhocko@suse.cz>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Luiz Capitulino <lcapitulino@redhat.com>,
	Greg Thelen <gthelen@google.com>,
	Leonid Moiseichuk <leonid.moiseichuk@nokia.com>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Minchan Kim <minchan@kernel.org>,
	Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>,
	John Stultz <john.stultz@linaro.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linaro-kernel@lists.linaro.org, patches@linaro.org,
	kernel-team@android.com
Subject: Re: [PATCH v3] memcg: Add memory.pressure_level events
Date: Tue, 2 Apr 2013 21:02:46 -0700	[thread overview]
Message-ID: <20130403040246.GA32229@lizard.gateway.2wire.net> (raw)
In-Reply-To: <20130326134656.4e0e0aefcf881bffae769b1e@linux-foundation.org>

On Tue, Mar 26, 2013 at 01:46:56PM -0700, Andrew Morton wrote:
[...]
> > +The file memory.pressure_level is only used to setup an eventfd,
> > +read/write operations are no implemented.
[...]
> Did we tell people how to use the eventfd interface anywhere?

Good point. In v4 I added a detailed instructions on how to setup the file
descriptors.

> >  1. Add support for accounting huge pages (as a separate controller)
> >  2. Make per-cgroup scanner reclaim not-shared pages first
> > diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> > --- /dev/null
> > +++ b/include/linux/vmpressure.h
> > @@ -0,0 +1,47 @@
> > +#ifndef __LINUX_VMPRESSURE_H
> > +#define __LINUX_VMPRESSURE_H
> > +
> > +#include <linux/mutex.h>
> > +#include <linux/list.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/gfp.h>
> > +#include <linux/types.h>
> > +#include <linux/cgroup.h>
> > +
> > +struct vmpressure {
> > +	unsigned int scanned;
> > +	unsigned int reclaimed;
> > +	/* The lock is used to keep the scanned/reclaimed above in sync. */
> > +	struct mutex sr_lock;
> > +
> > +	struct list_head events;
> 
> A comment describing what goes at `events' would be nice.  Reference
> "struct vmpressure_event".

Done.

> > +	/* Have to grab the lock on events traversal or modifications. */
> > +	struct mutex events_lock;
> > +
> > +	struct work_struct work;
> > +};
> >
> > ...
> >
> > +/*
> > + * The window size is the number of scanned pages before we try to analyze
> > + * the scanned/reclaimed ratio (or difference).
> > + *
> > + * It is used as a rate-limit tunable for the "low" level notification,
> > + * and for averaging medium/critical levels. Using small window sizes can
> > + * cause lot of false positives, but too big window size will delay the
> > + * notifications.
> > + *
> > + * TODO: Make the window size depend on machine size, as we do for vmstat
> > + * thresholds.
> 
> Here "the window size" refers to vmpressure_win, yes?

Yup.

(To make it clear, in the new version I added a direct reference to the
vmpressure_win.)

> > + */
> > +static const unsigned int vmpressure_win = SWAP_CLUSTER_MAX * 16;
> > +static const unsigned int vmpressure_level_med = 60;
> > +static const unsigned int vmpressure_level_critical = 95;
> > +static const unsigned int vmpressure_level_critical_prio = 3;
> 
> vmpressure_level_critical_prio is a bit mysterious and undocumented. 
> Please document it here and/or at vmpressure_prio().

I added documentation in v4.

> > +void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> > +		unsigned long scanned, unsigned long reclaimed)
> 
> Exported function and a primary inteface.  Needs nice documentation, please ;)

Sure thing, all exported function now come with kernel-doc comments.

[...]
> > +	 */
> > +	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
> 
> I'm surprised at __GFP_HIGHMEM's inclusion.  On some machines the great
> majority of user memory is in highmem.  What's up?

In the new revision I included this comment:

        /*
         * Here we only want to account pressure that userland is able to
         * help us with. For example, suppose that DMA zone is under
         * pressure; if we notify userland about that kind of pressure,
         * then it will be mostly a waste as it will trigger unnecessary
         * freeing of memory by userland (since userland is more likely to
         * have HIGHMEM/MOVABLE pages instead of the DMA fallback). That
         * is why we include only movable, highmem and FS/IO pages.
         * Indirect reclaim (kswapd) sets sc->gfp_mask to GFP_KERNEL, so
         * we account it too.
         */
        if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
                return;

> > +	if (!scanned)
> > +		return;
> > +
> > +	mutex_lock(&vmpr->sr_lock);
> > +	vmpr->scanned += scanned;
> > +	vmpr->reclaimed += reclaimed;
> 
> See, here we're accumulating into a 32-bit variable quantities which used
> to be held in 64-bit variables.    The overflow risk gets higher...

I see. I fixed this.

> > +	mutex_unlock(&vmpr->sr_lock);
> > +
> > +	if (scanned < vmpressure_win || work_pending(&vmpr->work))
> > +		return;
> > +	schedule_work(&vmpr->work);
> > +}
> > +
> > +void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> 
> Documentation please.

Yup, done.

> > +{
> > +	if (prio > vmpressure_level_critical_prio)
> > +		return;
> > +
> > +	/*
> > +	 * OK, the prio is below the threshold, updating vmpressure
> 
> But you never told me what that threshold is for!  And I have no means
> of working out why you chose "3", nor the effects of altering it, etc.

True. This is explained it in a comment now.

[...]
> > +int vmpressure_register_event(struct cgroup *cg, struct cftype *cft,
> > +			      struct eventfd_ctx *eventfd, const char *args)
> 
> Document the interface, please.

Done.

> > +{
> > +	struct vmpressure *vmpr = cg_to_vmpr(cg);
> > +	struct vmpressure_event *ev;
> > +	int lvl;
> 
> These abbreviations are rather unlinuxy.  wk->work, vmpr->vmpressure,
> lvl->level, etc.

Yeah, I agree. Although, 'vmpressure' as a function-scope variable is
kinda too long, the code becomes really hard to read. But in memcg struct
and global namespace I now use the full 'vmpressure' name.

> > +	for (lvl = 0; lvl < VMPRESSURE_NUM_LEVELS; lvl++) {
> > +		if (!strcmp(vmpressure_str_levels[lvl], args))
> > +			break;
> > +	}
> > +
> > +	if (lvl >= VMPRESSURE_NUM_LEVELS)
> > +		return -EINVAL;
> > +
> > +	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> > +	if (!ev)
> > +		return -ENOMEM;
> > +
> > +	ev->efd = eventfd;
> > +	ev->level = lvl;
> > +
> > +	mutex_lock(&vmpr->events_lock);
> > +	list_add(&ev->node, &vmpr->events);
> 
> What's the upper bound on the length of this list?

As of now, it is controlled by the cgroup core, so I would say the number
of opened FDs, and if that is a problem, it should be fixed for everyone.
The good thing is that the list is per-cgroup, it is not global.


Thanks for the review, Andrew!

Anton

next prev parent reply	other threads:[~2013-04-03  4:02 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-03-22  7:13 [PATCH v3] memcg: Add memory.pressure_level events Anton Vorontsov
2013-03-22  7:13 ` Anton Vorontsov
2013-03-22  7:13 ` Anton Vorontsov
     [not found] ` <20130322071351.GA3971-1CZZkhFgUWNRmOO2HpYVOJIbA5emDH3N@public.gmane.org>
2013-03-26 20:46   ` Andrew Morton
2013-03-26 20:46     ` Andrew Morton
2013-03-26 20:46     ` Andrew Morton
2013-04-03  4:02     ` Anton Vorontsov [this message]
2013-04-03  4:02       ` Anton Vorontsov
2013-03-27  1:53 ` Kamezawa Hiroyuki
2013-03-27  1:53   ` Kamezawa Hiroyuki
     [not found]   ` <5152511A.1010707-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2013-04-03  4:00     ` Anton Vorontsov
2013-04-03  4:00       ` Anton Vorontsov
2013-04-03  4:00       ` Anton Vorontsov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130403040246.GA32229@lizard.gateway.2wire.net \
    --to=anton@enomsg.org \
    --cc=akpm@linux-foundation.org \
    --cc=b.zolnierkie@samsung.com \
    --cc=cgroups@vger.kernel.org \
    --cc=glommer@parallels.com \
    --cc=gthelen@google.com \
    --cc=john.stultz@linaro.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kernel-team@android.com \
    --cc=kirill@shutemov.name \
    --cc=kosaki.motohiro@gmail.com \
    --cc=lcapitulino@redhat.com \
    --cc=leonid.moiseichuk@nokia.com \
    --cc=linaro-kernel@lists.linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=minchan@kernel.org \
    --cc=patches@linaro.org \
    --cc=penberg@kernel.org \
    --cc=rientjes@google.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.