Linux Power Management development

Linux Power Management development
 help / color / mirror / Atom feed

* Re: [patch] acpi, pm: fix build breakage
From: Rafael J. Wysocki @ 2012-11-08 21:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Aaron Lu, Huang Ying, Len Brown, Lv Zheng, Adrian Hunter,
	linux-kernel, linux-pm, linux-acpi
In-Reply-To: <alpine.DEB.2.00.1211081119350.8749@chino.kir.corp.google.com>

On Thursday, November 08, 2012 11:20:01 AM David Rientjes wrote:
> Commit b87b49cd0efd ("ACPI / PM: Move device PM functions related to sleep 
> states") declared acpi_target_system_state() for CONFIG_PM_SLEEP whereas 
> it is only defined for CONFIG_ACPI_SLEEP, resulting in the following link 
> error:
> 
> drivers/built-in.o: In function `acpi_pm_device_sleep_wake':
> drivers/acpi/device_pm.c:342: undefined reference to `acpi_target_system_state'
> drivers/built-in.o: In function `acpi_dev_suspend_late':
> drivers/acpi/device_pm.c:501: undefined reference to `acpi_target_system_state'
> drivers/built-in.o: In function `acpi_pm_device_sleep_state':
> drivers/acpi/device_pm.c:221: undefined reference to `acpi_target_system_state'
> 
> Define it only for CONFIG_ACPI_SLEEP and fallback to a dummy definition 
> for other configs.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

I've received the patch and will apply it when I get back home from Spain.

Thanks,
Rafael


> ---
>  include/acpi/acpi_bus.h |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> --- a/include/acpi/acpi_bus.h
> +++ b/include/acpi/acpi_bus.h
> @@ -469,11 +469,9 @@ static inline int acpi_pm_device_run_wake(struct device *dev, bool enable)
>  #endif
>  
>  #ifdef CONFIG_PM_SLEEP
> -u32 acpi_target_system_state(void);
>  int __acpi_device_sleep_wake(struct acpi_device *, u32, bool);
>  int acpi_pm_device_sleep_wake(struct device *, bool);
>  #else
> -static inline u32 acpi_target_system_state(void) { return ACPI_STATE_S0; }
>  static inline int __acpi_device_sleep_wake(struct acpi_device *adev,
>  					   u32 target_state, bool enable)
>  {
> @@ -485,6 +483,12 @@ static inline int acpi_pm_device_sleep_wake(struct device *dev, bool enable)
>  }
>  #endif
>  
> +#ifdef CONFIG_ACPI_SLEEP
> +u32 acpi_target_system_state(void);
> +#else
> +static inline u32 acpi_target_system_state(void) { return ACPI_STATE_S0; }
> +#endif
> +
>  static inline bool acpi_device_power_manageable(struct acpi_device *adev)
>  {
>  	return adev->flags.power_manageable;
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Srivatsa S. Bhat @ 2012-11-08 19:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan,
	kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap,
	svaidy, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm,
	linux-kernel
In-Reply-To: <20121108180257.GC8218@suse.de>

On 11/08/2012 11:32 PM, Mel Gorman wrote:
> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
>> ------------------------------------------------------------
>>
>> Today memory subsystems are offer a wide range of capabilities for managing
>> memory power consumption. As a quick example, if a block of memory is not
>> referenced for a threshold amount of time, the memory controller can decide to
>> put that chunk into a low-power content-preserving state. And the next
>> reference to that memory chunk would bring it back to full power for read/write.
>> With this capability in place, it becomes important for the OS to understand
>> the boundaries of such power-manageable chunks of memory and to ensure that
>> references are consolidated to a minimum number of such memory power management
>> domains.
>>
> 
> How much power is saved?

Last year, Amit had evaluated the "Hierarchy" patchset on a Samsung Exynos (ARM)
board and reported that it could save up to 6.3% relative to total system power.
(This was when he allowed only 1 GB out of the total 2 GB RAM to enter low
power states).

Below is the link to his post, as mentioned in the references section in the
cover letter.
http://article.gmane.org/gmane.linux.kernel.mm/65935

Of course, the power savings depends on the characteristics of the particular
hardware memory subsystem used, and the amount of memory present in the system.

> 
>> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
>> the firmware can expose information regarding the boundaries of such memory
>> power management domains to the OS in a standard way.
>>
> 
> I'm not familiar with the ACPI spec but is there support for parsing of
> MPST and interpreting the associated ACPI events?

Sorry I should have been clearer when I mentioned ACPI 5.0. I mentioned ACPI 5.0
just to make a point that support for getting the memory power management
boundaries from the firmware is not far away. I didn't mean to say that that's
the only target for memory power management. Like I mentioned above, last year
the power-savings benefit was measured on ARM boards. The aim of this patchset
is to propose and evaluate some of the core VM algorithms that we will need
to efficiently exploit the power management features offered by the memory
subsystems.

IOW, info regarding memory power domain boundaries made available by ACPI 5.0
or even just with some help from the bootloader on some platforms is only the
input to the VM subsystem to understand at what granularity it should manage
things. *How* it manages is the choice of the algorithm/design at the VM level,
which is what this patchset is trying to propose, by exploring several different
designs of doing it and its costs/benefits.

That's the reason I just hard-coded mem region size to 512 MB in this patchset
and focussed on the VM algorithm to explore what we can do, once we have that
size/boundary info.

> For example, if ACPI
> fires an event indicating that a memory power node is to enter a low
> state then presumably the OS should actively migrate pages away -- even
> if it's going into a state where the contents are still refreshed
> as exiting that state could take a long time.
> 

We are not really looking at ACPI event notifications here. All we expect from
the firmware (at a first level) is info regarding the boundaries, so that the
VM can be intelligent about how it consolidates references. Many of the memory
subsystems can do power-management automatically - like for example, if a
particular chunk of memory is not referenced for a given threshold time, it can
put it into low-power (content preserving) state without the OS telling it to
do it.

> I did not look closely at the patchset at all because it looked like the
> actual support to use it and measure the benefit is missing.
> 

Right, we are focussing on the core VM algorithms for now. The input (ACPI or
other methods) can come later and then we can measure the numbers.

>> How can Linux VM help memory power savings?
>>
>> o Consolidate memory allocations and/or references such that they are
>> not spread across the entire memory address space.  Basically area of memory
>> that is not being referenced, can reside in low power state.
>>
> 
> Which the series does not appear to do.
> 

Well, it influences page-allocation to be memory-region aware. So it does an
attempt to consolidate allocations (and thereby references). As I mentioned,
hardware transition to low-power state can be automatic. The VM must be
intelligent enough to help with that (or atleast smart enough not to disrupt
that!), by avoiding spreading across allocations everywhere.

>> o Support targeted memory reclaim, where certain areas of memory that can be
>> easily freed can be offlined, allowing those areas of memory to be put into
>> lower power states.
>>
> 
> Which the series does not appear to do judging from this;
> 

Yes, that is one of the items in the TODO list.

>   include/linux/mm.h     |   38 +++++++
>   include/linux/mmzone.h |   52 +++++++++
>   mm/compaction.c        |    8 +
>   mm/page_alloc.c        |  263 ++++++++++++++++++++++++++++++++++++++++++++----
>   mm/vmstat.c            |   59 ++++++++++-
> 
> This does not appear to be doing anything with reclaim and not enough with
> compaction to indicate that the series actively manages memory placement
> in response to ACPI events.
> 
> Further in section 5.2.21.4 the spec says that power node regions can
> overlap (but are not hierarchal for some reason) but have no gaps yet the
> structure you use to represent is assumes there can be gaps and there are
> no overlaps. Again, this is just glancing at the spec and a quick skim of
> the patches so maybe I missed something that explains why this structure
> is suitable.
>

Right, we might need a better way to handle the various possibilities of
the layout of memory regions in the hardware. But this initial RFC tried to
focus on what do we do with that info, inside the VM to aid with power
management.

> It seems to me that superficially the VM implementation for the support
> would have
> 
> a) Involved a tree that managed the overlapping regions (even if it's
>    not hierarchal it feels more sensible) and picked the highest-power-state
>    common denominator in the tree. This would only be allocated if support
>    for MPST is available.
> b) Leave memory allocations and reclaim as they are in the active state.
> c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower
>    power but still usable with a latency penalty. This might be a single
>    migrate type but could also be a parallel set of free_area called
>    free_area_lowpower that is only used when free_area is depleted and in
>    the very slow path of the allocator.
> d) Use memory hot-remove for power states where the refresh rates were
>    not constant
> 
> and only did anything expensive in response to an ACPI event -- none of
> the fast paths should be touched.
>
> When transitioning to the low power state, memory should be migrated in
> a vaguely similar fashion to what CMA does. For low-power, migration
> failure is acceptable. If contents are not preserved, ACPI needs to know
> if the migration failed because it cannot enter that power state.
>

As I mentioned, we are not really talking about reacting to ACPI events here.
The idea behind this patchset is to have efficient VM algorithms that can
shape memory references depending on power-management boundaries exposed by
the firmware. With that as the goal, I feel we should not even consider
migration as a first step - we should rather consider how to shape allocations
such that we can remain power-efficient right from the beginning and
throughout the runtime, without needing to migrate if possible. And this
patchset implements one of the designs that achieves that.
 
> For any of this to be worthwhile, low power states would need to be achieved
> for long periods of time because that migration is not free.
> 

Best to avoid migration as far as possible in the first place :-)

>> Memory Regions:
>> ---------------
>>
>> "Memory Regions" is a way of capturing the boundaries of power-managable
>> chunks of memory, within the MM subsystem.
>>
>> Short description of the "Sorted-buddy" design:
>> -----------------------------------------------
>>
>> In this design, the memory region boundaries are captured in a parallel
>> data-structure instead of fitting regions between nodes and zones in the
>> hierarchy. Further, the buddy allocator is altered, such that we maintain the
>> zones' freelists in region-sorted-order and thus do page allocation in the
>> order of increasing memory regions.
> 
> Implying that this sorting has to happen in the either the alloc or free
> fast path.
> 

Yep, I have moved it to the free path. The alloc path remains fast.

>> (The freelists need not be fully
>> address-sorted, they just need to be region-sorted. Patch 6 explains this
>> in more detail).
>>
>> The idea is to do page allocation in increasing order of memory regions
>> (within a zone) and perform page reclaim in the reverse order, as illustrated
>> below.
>>
>> ---------------------------- Increasing region number---------------------->
>>
>> Direction of allocation--->                         <---Direction of reclaim
>>
> 
> Compaction will work against this because it uses a PFN walker to isolate
> free pages and will ignore memory regions. If pageblocks were used, it
> could take that into account at least.
> 
>> The sorting logic (to maintain freelist pageblocks in region-sorted-order)
>> lies in the page-free path and not the page-allocation path and hence the
>> critical page allocation paths remain fast.
> 
> Page free can be a critical path for application performance as well.
> Think network buffer heavy alloc and freeing of buffers.
> 
> However, migratetype information is already looked up for THP so ideally
> power awareness would piggyback on it.
> 
>> Moreover, the heart of the page
>> allocation algorithm itself remains largely unchanged, and the region-related
>> data-structures are optimized to avoid unnecessary updates during the
>> page-allocator's runtime.
>>
>> Advantages of this design:
>> --------------------------
>> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
>>    hence we avoid its associated problems (like too many zones, extra page
>>    reclaim threads, question of choosing watermarks etc).
>>    [This is an advantage over the "Hierarchy" design]
>>
>> 2. Performance overhead is expected to be low: Since we retain the simplicity
>>    of the algorithm in the page allocation path, page allocation can
>>    potentially remain as fast as it would be without memory regions. The
>>    overhead is pushed to the page-freeing paths which are not that critical.
>>
>>
>> Results:
>> =======
>>
>> Test setup:
>> -----------
>> This patchset applies cleanly on top of 3.7-rc3.
>>
>> x86 dual-socket quad core HT-enabled machine booted with mem=8G
>> Memory region size = 512 MB
>>
>> Functional testing:
>> -------------------
>>
>> Ran pagetest, a simple C program that allocates and touches a required number
>> of pages.
>>
>> Below is the statistics from the regions within ZONE_NORMAL, at various sizes
>> of allocations from pagetest.
>>
>> 	     Present pages   |	Free pages at various allocations        |
>> 			     |  start	|  512 MB  |  1024 MB | 2048 MB  |
>>   Region 0      16	     |   0      |    0     |     0    |    0     |
>>   Region 1      131072       |  87219   |  8066    |   7892   |  7387    |
>>   Region 2      131072       | 131072   |  79036   |     0    |    0     |
>>   Region 3      131072       | 131072   | 131072   |   79061  |    0     |
>>   Region 4      131072       | 131072   | 131072   |  131072  |    0     |
>>   Region 5      131072       | 131072   | 131072   |  131072  |  79051   |
>>   Region 6      131072       | 131072   | 131072   |  131072  |  131072  |
>>   Region 7      131072       | 131072   | 131072   |  131072  |  131072  |
>>   Region 8      131056       | 105475   | 105472   |  105472  |  105472  |
>>
>> This shows that page allocation occurs in the order of increasing region
>> numbers, as intended in this design.
>>
>> Performance impact:
>> -------------------
>>
>> Kernbench results didn't show much of a difference between the performance
>> of vanilla 3.7-rc3 and this patchset.
>>
>>
>> Todos:
>> =====
>>
>> 1. Memory-region aware page-reclamation:
>> ----------------------------------------
>>
>> We would like to do page reclaim in the reverse order of page allocation
>> within a zone, ie., in the order of decreasing region numbers.
>> To achieve that, while scanning lru pages to reclaim, we could potentially
>> look for pages belonging to higher regions (considering region boundaries)
>> or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
>> reclaim candidates.
>>
> 
> This would disrupting LRU ordering and if those pages were recently
> allocated and you force a situation where swap has to be used then any
> saving in low memory will be lost by having to access the disk instead.
>

Right, we need to do it in a way that doesn't hurt performance or power-savings.
I definitely need to think more on this.. Any suggestions? 
 
>> 2. Compile-time exclusion of Memory Power Management, and extending the
>> support to also work with other features such as Mem cgroups, kexec etc.
>>  
> 
> Compile-time exclusion is pointless because it'll be always activated by
> distribution configs. Support for MPST should be detected at runtime and
> 
> 3. ACPI support to actually use this thing and validate the design is
>    compatible with the spec and actually works in hardware
> 

ACPI is not the only way to exploit this; other platforms (like ARM for example)
can expose info today with some help with the bootloader, and as mentioned
Amit already did a quick evaluation last year. So its not like we are totally
blocked on ACPI support in order to design the VM algorithms to manage memory
power-efficiently.

Thanks a lot for taking a look and for your invaluable feedback!

Regards,
Srivatsa S. Bhat


^ permalink raw reply

* [patch] acpi, pm: fix build breakage
From: David Rientjes @ 2012-11-08 19:20 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Aaron Lu, Huang Ying, Len Brown, Lv Zheng, Adrian Hunter,
	linux-kernel, linux-pm, linux-acpi
In-Reply-To: <alpine.DEB.2.00.1211061150190.6467@chino.kir.corp.google.com>

Commit b87b49cd0efd ("ACPI / PM: Move device PM functions related to sleep 
states") declared acpi_target_system_state() for CONFIG_PM_SLEEP whereas 
it is only defined for CONFIG_ACPI_SLEEP, resulting in the following link 
error:

drivers/built-in.o: In function `acpi_pm_device_sleep_wake':
drivers/acpi/device_pm.c:342: undefined reference to `acpi_target_system_state'
drivers/built-in.o: In function `acpi_dev_suspend_late':
drivers/acpi/device_pm.c:501: undefined reference to `acpi_target_system_state'
drivers/built-in.o: In function `acpi_pm_device_sleep_state':
drivers/acpi/device_pm.c:221: undefined reference to `acpi_target_system_state'

Define it only for CONFIG_ACPI_SLEEP and fallback to a dummy definition 
for other configs.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/acpi/acpi_bus.h |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
--- a/include/acpi/acpi_bus.h
+++ b/include/acpi/acpi_bus.h
@@ -469,11 +469,9 @@ static inline int acpi_pm_device_run_wake(struct device *dev, bool enable)
 #endif
 
 #ifdef CONFIG_PM_SLEEP
-u32 acpi_target_system_state(void);
 int __acpi_device_sleep_wake(struct acpi_device *, u32, bool);
 int acpi_pm_device_sleep_wake(struct device *, bool);
 #else
-static inline u32 acpi_target_system_state(void) { return ACPI_STATE_S0; }
 static inline int __acpi_device_sleep_wake(struct acpi_device *adev,
 					   u32 target_state, bool enable)
 {
@@ -485,6 +483,12 @@ static inline int acpi_pm_device_sleep_wake(struct device *dev, bool enable)
 }
 #endif
 
+#ifdef CONFIG_ACPI_SLEEP
+u32 acpi_target_system_state(void);
+#else
+static inline u32 acpi_target_system_state(void) { return ACPI_STATE_S0; }
+#endif
+
 static inline bool acpi_device_power_manageable(struct acpi_device *adev)
 {
 	return adev->flags.power_manageable;

^ permalink raw reply

* [PATCH 1/9 v3] cgroup: add cgroup_subsys->post_create()
From: Tejun Heo @ 2012-11-08 19:07 UTC (permalink / raw)
  To: lizefan, mhocko, rjw
  Cc: containers, cgroups, linux-kernel, linux-pm, fweisbec,
	Glauber Costa
In-Reply-To: <1351931915-1701-2-git-send-email-tj@kernel.org>

Subject: cgroup: add cgroup_subsys->post_create()

Currently, there's no way for a controller to find out whether a new
cgroup finished all ->create() allocatinos successfully and is
considered "live" by cgroup.

This becomes a problem later when we add generic descendants walking
to cgroup which can be used by controllers as controllers don't have a
synchronization point where it can synchronize against new cgroups
appearing in such walks.

This patch adds ->post_create().  It's called after all ->create()
succeeded and the cgroup is linked into the generic cgroup hierarchy.
This plays the counterpart of ->pre_destroy().

When used in combination with the to-be-added generic descendant
iterators, ->post_create() can be used to implement reliable state
inheritance.  It will be explained with the descendant iterators.

v2: Added a paragraph about its future use w/ descendant iterators per
    Michal.

v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
    Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
---
Oops, forgot updating cgroup_load_subsys().  Hate that it's a
different path from cgroup_init_subsys(). :(

Thanks.

 include/linux/cgroup.h |    1 +
 kernel/cgroup.c        |   15 +++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -438,6 +438,7 @@ int cgroup_taskset_size(struct cgroup_ta
 
 struct cgroup_subsys {
 	struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
+	void (*post_create)(struct cgroup *cgrp);
 	void (*pre_destroy)(struct cgroup *cgrp);
 	void (*destroy)(struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4059,10 +4059,15 @@ static long cgroup_create(struct cgroup
 	if (err < 0)
 		goto err_remove;
 
-	/* each css holds a ref to the cgroup's dentry */
-	for_each_subsys(root, ss)
+	for_each_subsys(root, ss) {
+		/* each css holds a ref to the cgroup's dentry */
 		dget(dentry);
 
+		/* creation succeeded, notify subsystems */
+		if (ss->post_create)
+			ss->post_create(cgrp);
+	}
+
 	/* The cgroup directory was pre-locked for us */
 	BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
 
@@ -4280,6 +4285,9 @@ static void __init cgroup_init_subsys(st
 
 	ss->active = 1;
 
+	if (ss->post_create)
+		ss->post_create(&ss->root->top_cgroup);
+
 	/* this function shouldn't be used with modular subsystems, since they
 	 * need to register a subsys_id, among other things */
 	BUG_ON(ss->module);
@@ -4389,6 +4397,9 @@ int __init_or_module cgroup_load_subsys(
 
 	ss->active = 1;
 
+	if (ss->post_create)
+		ss->post_create(&ss->root->top_cgroup);
+
 	/* success! */
 	mutex_unlock(&cgroup_mutex);
 	return 0;

^ permalink raw reply

* Re: [PATCH 9/9 v3] cgroup_freezer: implement proper hierarchy support
From: Michal Hocko @ 2012-11-08 18:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, rjw-KKrjLPT3xs0,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20121108180417.GC9672-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

On Thu 08-11-12 10:04:17, Tejun Heo wrote:
> On Thu, Nov 08, 2012 at 07:02:46PM +0100, Michal Hocko wrote:
> > On Thu 08-11-12 09:57:50, Tejun Heo wrote:
> > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > > Reviewed-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > 
> > You probably meant Reviewed-by: Michal Hocko .... ;)
> 
> Hehehheheh... man, I'm too self-absolved.  Thanks for noticing
> that. :)  Updated.

Heh, btw. the doc update looks good as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 9/9 v3] cgroup_freezer: implement proper hierarchy support
From: Tejun Heo @ 2012-11-08 18:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, rjw-KKrjLPT3xs0,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20121108180246.GA17415-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

On Thu, Nov 08, 2012 at 07:02:46PM +0100, Michal Hocko wrote:
> On Thu 08-11-12 09:57:50, Tejun Heo wrote:
> > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Reviewed-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> 
> You probably meant Reviewed-by: Michal Hocko .... ;)

Hehehheheh... man, I'm too self-absolved.  Thanks for noticing
that. :)  Updated.

-- 
tejun

^ permalink raw reply

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Mel Gorman @ 2012-11-08 18:02 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mjg59, paulmck, dave, maxime.coquelin, loic.pallardy, arjan,
	kmpark, kamezawa.hiroyu, lenb, rjw, gargankita, amit.kachhap,
	svaidy, thomas.abraham, santosh.shilimkar, linux-pm, linux-mm,
	linux-kernel
In-Reply-To: <20121106195026.6941.24662.stgit@srivatsabhat.in.ibm.com>

On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> ------------------------------------------------------------
> 
> Today memory subsystems are offer a wide range of capabilities for managing
> memory power consumption. As a quick example, if a block of memory is not
> referenced for a threshold amount of time, the memory controller can decide to
> put that chunk into a low-power content-preserving state. And the next
> reference to that memory chunk would bring it back to full power for read/write.
> With this capability in place, it becomes important for the OS to understand
> the boundaries of such power-manageable chunks of memory and to ensure that
> references are consolidated to a minimum number of such memory power management
> domains.
> 

How much power is saved?

> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> the firmware can expose information regarding the boundaries of such memory
> power management domains to the OS in a standard way.
> 

I'm not familiar with the ACPI spec but is there support for parsing of
MPST and interpreting the associated ACPI events? For example, if ACPI
fires an event indicating that a memory power node is to enter a low
state then presumably the OS should actively migrate pages away -- even
if it's going into a state where the contents are still refreshed
as exiting that state could take a long time.

I did not look closely at the patchset at all because it looked like the
actual support to use it and measure the benefit is missing.

> How can Linux VM help memory power savings?
> 
> o Consolidate memory allocations and/or references such that they are
> not spread across the entire memory address space.  Basically area of memory
> that is not being referenced, can reside in low power state.
> 

Which the series does not appear to do.

> o Support targeted memory reclaim, where certain areas of memory that can be
> easily freed can be offlined, allowing those areas of memory to be put into
> lower power states.
> 

Which the series does not appear to do judging from this;

  include/linux/mm.h     |   38 +++++++
  include/linux/mmzone.h |   52 +++++++++
  mm/compaction.c        |    8 +
  mm/page_alloc.c        |  263 ++++++++++++++++++++++++++++++++++++++++++++----
  mm/vmstat.c            |   59 ++++++++++-

This does not appear to be doing anything with reclaim and not enough with
compaction to indicate that the series actively manages memory placement
in response to ACPI events.

Further in section 5.2.21.4 the spec says that power node regions can
overlap (but are not hierarchal for some reason) but have no gaps yet the
structure you use to represent is assumes there can be gaps and there are
no overlaps. Again, this is just glancing at the spec and a quick skim of
the patches so maybe I missed something that explains why this structure
is suitable.

It seems to me that superficially the VM implementation for the support
would have

a) Involved a tree that managed the overlapping regions (even if it's
   not hierarchal it feels more sensible) and picked the highest-power-state
   common denominator in the tree. This would only be allocated if support
   for MPST is available.
b) Leave memory allocations and reclaim as they are in the active state.
c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower
   power but still usable with a latency penalty. This might be a single
   migrate type but could also be a parallel set of free_area called
   free_area_lowpower that is only used when free_area is depleted and in
   the very slow path of the allocator.
d) Use memory hot-remove for power states where the refresh rates were
   not constant

and only did anything expensive in response to an ACPI event -- none of
the fast paths should be touched.

When transitioning to the low power state, memory should be migrated in
a vaguely similar fashion to what CMA does. For low-power, migration
failure is acceptable. If contents are not preserved, ACPI needs to know
if the migration failed because it cannot enter that power state.

For any of this to be worthwhile, low power states would need to be achieved
for long periods of time because that migration is not free.

> Memory Regions:
> ---------------
> 
> "Memory Regions" is a way of capturing the boundaries of power-managable
> chunks of memory, within the MM subsystem.
> 
> Short description of the "Sorted-buddy" design:
> -----------------------------------------------
> 
> In this design, the memory region boundaries are captured in a parallel
> data-structure instead of fitting regions between nodes and zones in the
> hierarchy. Further, the buddy allocator is altered, such that we maintain the
> zones' freelists in region-sorted-order and thus do page allocation in the
> order of increasing memory regions.

Implying that this sorting has to happen in the either the alloc or free
fast path.

> (The freelists need not be fully
> address-sorted, they just need to be region-sorted. Patch 6 explains this
> in more detail).
> 
> The idea is to do page allocation in increasing order of memory regions
> (within a zone) and perform page reclaim in the reverse order, as illustrated
> below.
> 
> ---------------------------- Increasing region number---------------------->
> 
> Direction of allocation--->                         <---Direction of reclaim
> 

Compaction will work against this because it uses a PFN walker to isolate
free pages and will ignore memory regions. If pageblocks were used, it
could take that into account at least.

> The sorting logic (to maintain freelist pageblocks in region-sorted-order)
> lies in the page-free path and not the page-allocation path and hence the
> critical page allocation paths remain fast.

Page free can be a critical path for application performance as well.
Think network buffer heavy alloc and freeing of buffers.

However, migratetype information is already looked up for THP so ideally
power awareness would piggyback on it.

> Moreover, the heart of the page
> allocation algorithm itself remains largely unchanged, and the region-related
> data-structures are optimized to avoid unnecessary updates during the
> page-allocator's runtime.
> 
> Advantages of this design:
> --------------------------
> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
>    hence we avoid its associated problems (like too many zones, extra page
>    reclaim threads, question of choosing watermarks etc).
>    [This is an advantage over the "Hierarchy" design]
> 
> 2. Performance overhead is expected to be low: Since we retain the simplicity
>    of the algorithm in the page allocation path, page allocation can
>    potentially remain as fast as it would be without memory regions. The
>    overhead is pushed to the page-freeing paths which are not that critical.
> 
> 
> Results:
> =======
> 
> Test setup:
> -----------
> This patchset applies cleanly on top of 3.7-rc3.
> 
> x86 dual-socket quad core HT-enabled machine booted with mem=8G
> Memory region size = 512 MB
> 
> Functional testing:
> -------------------
> 
> Ran pagetest, a simple C program that allocates and touches a required number
> of pages.
> 
> Below is the statistics from the regions within ZONE_NORMAL, at various sizes
> of allocations from pagetest.
> 
> 	     Present pages   |	Free pages at various allocations        |
> 			     |  start	|  512 MB  |  1024 MB | 2048 MB  |
>   Region 0      16	     |   0      |    0     |     0    |    0     |
>   Region 1      131072       |  87219   |  8066    |   7892   |  7387    |
>   Region 2      131072       | 131072   |  79036   |     0    |    0     |
>   Region 3      131072       | 131072   | 131072   |   79061  |    0     |
>   Region 4      131072       | 131072   | 131072   |  131072  |    0     |
>   Region 5      131072       | 131072   | 131072   |  131072  |  79051   |
>   Region 6      131072       | 131072   | 131072   |  131072  |  131072  |
>   Region 7      131072       | 131072   | 131072   |  131072  |  131072  |
>   Region 8      131056       | 105475   | 105472   |  105472  |  105472  |
> 
> This shows that page allocation occurs in the order of increasing region
> numbers, as intended in this design.
> 
> Performance impact:
> -------------------
> 
> Kernbench results didn't show much of a difference between the performance
> of vanilla 3.7-rc3 and this patchset.
> 
> 
> Todos:
> =====
> 
> 1. Memory-region aware page-reclamation:
> ----------------------------------------
> 
> We would like to do page reclaim in the reverse order of page allocation
> within a zone, ie., in the order of decreasing region numbers.
> To achieve that, while scanning lru pages to reclaim, we could potentially
> look for pages belonging to higher regions (considering region boundaries)
> or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
> reclaim candidates.
> 

This would disrupting LRU ordering and if those pages were recently
allocated and you force a situation where swap has to be used then any
saving in low memory will be lost by having to access the disk instead.

> 2. Compile-time exclusion of Memory Power Management, and extending the
> support to also work with other features such as Mem cgroups, kexec etc.
>  

Compile-time exclusion is pointless because it'll be always activated by
distribution configs. Support for MPST should be detected at runtime and

3. ACPI support to actually use this thing and validate the design is
   compatible with the spec and actually works in hardware

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [PATCH 9/9 v3] cgroup_freezer: implement proper hierarchy support
From: Michal Hocko @ 2012-11-08 18:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20121108175750.GK12973-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

On Thu 08-11-12 09:57:50, Tejun Heo wrote:
> Up until now, cgroup_freezer didn't implement hierarchy properly.
> cgroups could be arranged in hierarchy but it didn't make any
> difference in how each cgroup_freezer behaved.  They all operated
> separately.
> 
> This patch implements proper hierarchy support.  If a cgroup is
> frozen, all its descendants are frozen.  A cgroup is thawed iff it and
> all its ancestors are THAWED.  freezer.self_freezing shows the current
> freezing state for the cgroup itself.  freezer.parent_freezing shows
> whether the cgroup is freezing because any of its ancestors is
> freezing.
> 
> freezer_post_create() locks the parent and new cgroup and inherits the
> parent's state and freezer_change_state() applies new state top-down
> using cgroup_for_each_descendant_pre() which guarantees that no child
> can escape its parent's state.  update_if_frozen() uses
> cgroup_for_each_descendant_post() to propagate frozen states
> bottom-up.
> 
> Synchronization could be coarser and easier by using a single mutex to
> protect all hierarchy operations.  Finer grained approach was used
> because it wasn't too difficult for cgroup_freezer and I think it's
> beneficial to have an example implementation and cgroup_freezer is
> rather simple and can serve a good one.
> 
> As this makes cgroup_freezer properly hierarchical,
> freezer_subsys.broken_hierarchy marking is removed.
> 
> Note that this patch changes userland visible behavior - freezing a
> cgroup now freezes all its descendants too.  This behavior change is
> intended and has been warned via .broken_hierarchy.
> 
> v2: Michal spotted a bug in freezer_change_state() - descendants were
>     inheriting from the wrong ancestor.  Fixed.
> 
> v3: Documentation/cgroups/freezer-subsystem.txt updated.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Reviewed-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

You probably meant Reviewed-by: Michal Hocko .... ;)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCHSET cgroup/for-3.8] cgroup_freezer: implement proper hierarchy support
From: Tejun Heo @ 2012-11-08 18:01 UTC (permalink / raw)
  To: lizefan, mhocko, rjw
  Cc: containers, cgroups, linux-kernel, linux-pm, fweisbec
In-Reply-To: <1351931915-1701-1-git-send-email-tj@kernel.org>

On Sat, Nov 03, 2012 at 01:38:26AM -0700, Tejun Heo wrote:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup_freezer-hierarchy

Updated patches posted as replies to the original patches and the
above git branch updated with the updated patches.  As all the updates
are minor, I won't re-post the whole series.

Thanks.

-- 
tejun

^ permalink raw reply

* [PATCH 3/9 v2] cgroup: implement generic child / descendant walk macros
From: Tejun Heo @ 2012-11-08 17:59 UTC (permalink / raw)
  To: lizefan, mhocko, rjw
  Cc: containers, cgroups, linux-kernel, linux-pm, fweisbec
In-Reply-To: <1351931915-1701-4-git-send-email-tj@kernel.org>

Currently, cgroup doesn't provide any generic helper for walking a
given cgroup's children or descendants.  This patch adds the following
three macros.

* cgroup_for_each_child() - walk immediate children of a cgroup.

* cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
  in pre-order tree traversal.

* cgroup_for_each_descendant_post() - visit all descendants of a
  cgroup in post-order tree traversal.

All three only require the user to hold RCU read lock during
traversal.  Verifying that each iterated cgroup is online is the
responsibility of the user.  When used with proper synchronization,
cgroup_for_each_descendant_pre() can be used to propagate state
updates to descendants in reliable way.  See comments for details.

v2: s/config/state/ in commit message and comments per Michal.  More
    documentation on synchronization rules.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/cgroup.h |   94 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup.c        |   86 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 180 insertions(+)

--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -534,6 +534,100 @@ static inline struct cgroup* task_cgroup
 	return task_subsys_state(task, subsys_id)->cgroup;
 }
 
+/**
+ * cgroup_for_each_child - iterate through children of a cgroup
+ * @pos: the cgroup * to use as the loop cursor
+ * @cgroup: cgroup whose children to walk
+ *
+ * Walk @cgroup's children.  Must be called under rcu_read_lock().  A child
+ * cgroup which hasn't finished ->post_create() or already has finished
+ * ->pre_destroy() may show up during traversal and it's each subsystem's
+ * responsibility to verify that each @pos is alive.
+ *
+ * If a subsystem synchronizes against the parent in its ->post_create()
+ * and before starting iterating, a cgroup which finished ->post_create()
+ * is guaranteed to be visible in the future iterations.
+ */
+#define cgroup_for_each_child(pos, cgroup)				\
+	list_for_each_entry_rcu(pos, &(cgroup)->children, sibling)
+
+struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
+					  struct cgroup *cgroup);
+
+/**
+ * cgroup_for_each_descendant_pre - pre-order walk of a cgroup's descendants
+ * @pos: the cgroup * to use as the loop cursor
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * Walk @cgroup's descendants.  Must be called under rcu_read_lock().  A
+ * descendant cgroup which hasn't finished ->post_create() or already has
+ * finished ->pre_destroy() may show up during traversal and it's each
+ * subsystem's responsibility to verify that each @pos is alive.
+ *
+ * If a subsystem synchronizes against the parent in its ->post_create()
+ * and before starting iterating, and synchronizes against @pos on each
+ * iteration, any descendant cgroup which finished ->post_create() is
+ * guaranteed to be visible in the future iterations.
+ *
+ * In other words, the following guarantees that a descendant can't escape
+ * state updates of its ancestors.
+ *
+ * my_post_create(@cgrp)
+ * {
+ *	Lock @cgrp->parent and @cgrp;
+ *	Inherit state from @cgrp->parent;
+ *	Unlock both.
+ * }
+ *
+ * my_update_state(@cgrp)
+ * {
+ *	Lock @cgrp;
+ *	Update @cgrp's state;
+ *	Unlock @cgrp;
+ *
+ *	cgroup_for_each_descendant_pre(@pos, @cgrp) {
+ *		Lock @pos;
+ *		Verify @pos is alive and inherit state from @pos->parent;
+ *		Unlock @pos;
+ *	}
+ * }
+ *
+ * As long as the inheriting step, including checking the parent state, is
+ * enclosed inside @pos locking, double-locking the parent isn't necessary
+ * while inheriting.  The state update to the parent is guaranteed to be
+ * visible by walking order and, as long as inheriting operations to the
+ * same @pos are atomic to each other, multiple updates racing each other
+ * still result in the correct state.  It's guaranateed that at least one
+ * inheritance happens for any cgroup after the latest update to its
+ * parent.
+ *
+ * If checking parent's state requires locking the parent, each inheriting
+ * iteration should lock and unlock both @pos->parent and @pos.
+ *
+ * Alternatively, a subsystem may choose to use a single global lock to
+ * synchronize ->post_create() and ->pre_destroy() against tree-walking
+ * operations.
+ */
+#define cgroup_for_each_descendant_pre(pos, cgroup)			\
+	for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos);	\
+	     pos = cgroup_next_descendant_pre((pos), (cgroup)))
+
+struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
+					   struct cgroup *cgroup);
+
+/**
+ * cgroup_for_each_descendant_post - post-order walk of a cgroup's descendants
+ * @pos: the cgroup * to use as the loop cursor
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * Similar to cgroup_for_each_descendant_pre() but performs post-order
+ * traversal instead.  Note that the walk visibility guarantee described in
+ * pre-order walk doesn't apply the same to post-order walks.
+ */
+#define cgroup_for_each_descendant_post(pos, cgroup)			\
+	for (pos = cgroup_next_descendant_post(NULL, (cgroup)); (pos);	\
+	     pos = cgroup_next_descendant_post((pos), (cgroup)))
+
 /* A cgroup_iter should be treated as an opaque object */
 struct cgroup_iter {
 	struct list_head *cg_link;
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2984,6 +2984,92 @@ static void cgroup_enable_task_cg_lists(
 	write_unlock(&css_set_lock);
 }
 
+/**
+ * cgroup_next_descendant_pre - find the next descendant for pre-order walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * To be used by cgroup_for_each_descendant_pre().  Find the next
+ * descendant to visit for pre-order traversal of @cgroup's descendants.
+ */
+struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
+					  struct cgroup *cgroup)
+{
+	struct cgroup *next;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	/* if first iteration, pretend we just visited @cgroup */
+	if (!pos) {
+		if (list_empty(&cgroup->children))
+			return NULL;
+		pos = cgroup;
+	}
+
+	/* visit the first child if exists */
+	next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling);
+	if (next)
+		return next;
+
+	/* no child, visit my or the closest ancestor's next sibling */
+	do {
+		next = list_entry_rcu(pos->sibling.next, struct cgroup,
+				      sibling);
+		if (&next->sibling != &pos->parent->children)
+			return next;
+
+		pos = pos->parent;
+	} while (pos != cgroup);
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
+
+static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos)
+{
+	struct cgroup *last;
+
+	do {
+		last = pos;
+		pos = list_first_or_null_rcu(&pos->children, struct cgroup,
+					     sibling);
+	} while (pos);
+
+	return last;
+}
+
+/**
+ * cgroup_next_descendant_post - find the next descendant for post-order walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * To be used by cgroup_for_each_descendant_post().  Find the next
+ * descendant to visit for post-order traversal of @cgroup's descendants.
+ */
+struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
+					   struct cgroup *cgroup)
+{
+	struct cgroup *next;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	/* if first iteration, visit the leftmost descendant */
+	if (!pos) {
+		next = cgroup_leftmost_descendant(cgroup);
+		return next != cgroup ? next : NULL;
+	}
+
+	/* if there's an unvisited sibling, visit its leftmost descendant */
+	next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
+	if (&next->sibling != &pos->parent->children)
+		return cgroup_leftmost_descendant(next);
+
+	/* no sibling left, visit parent */
+	next = pos->parent;
+	return next != cgroup ? next : NULL;
+}
+EXPORT_SYMBOL_GPL(cgroup_next_descendant_post);
+
 void cgroup_iter_start(struct cgroup *cgrp, struct cgroup_iter *it)
 	__acquires(css_set_lock)
 {

^ permalink raw reply

* [PATCH 9/9 v3] cgroup_freezer: implement proper hierarchy support
From: Tejun Heo @ 2012-11-08 17:57 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	rjw-KKrjLPT3xs0
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <1351931915-1701-10-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Up until now, cgroup_freezer didn't implement hierarchy properly.
cgroups could be arranged in hierarchy but it didn't make any
difference in how each cgroup_freezer behaved.  They all operated
separately.

This patch implements proper hierarchy support.  If a cgroup is
frozen, all its descendants are frozen.  A cgroup is thawed iff it and
all its ancestors are THAWED.  freezer.self_freezing shows the current
freezing state for the cgroup itself.  freezer.parent_freezing shows
whether the cgroup is freezing because any of its ancestors is
freezing.

freezer_post_create() locks the parent and new cgroup and inherits the
parent's state and freezer_change_state() applies new state top-down
using cgroup_for_each_descendant_pre() which guarantees that no child
can escape its parent's state.  update_if_frozen() uses
cgroup_for_each_descendant_post() to propagate frozen states
bottom-up.

Synchronization could be coarser and easier by using a single mutex to
protect all hierarchy operations.  Finer grained approach was used
because it wasn't too difficult for cgroup_freezer and I think it's
beneficial to have an example implementation and cgroup_freezer is
rather simple and can serve a good one.

As this makes cgroup_freezer properly hierarchical,
freezer_subsys.broken_hierarchy marking is removed.

Note that this patch changes userland visible behavior - freezing a
cgroup now freezes all its descendants too.  This behavior change is
intended and has been warned via .broken_hierarchy.

v2: Michal spotted a bug in freezer_change_state() - descendants were
    inheriting from the wrong ancestor.  Fixed.

v3: Documentation/cgroups/freezer-subsystem.txt updated.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Reviewed-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 Documentation/cgroups/freezer-subsystem.txt |   63 +++++++---
 kernel/cgroup_freezer.c                     |  161 +++++++++++++++++++++-------
 2 files changed, 165 insertions(+), 59 deletions(-)

--- a/Documentation/cgroups/freezer-subsystem.txt
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -49,13 +49,49 @@ prevent the freeze/unfreeze cycle from b
 being frozen. This allows the bash example above and gdb to run as
 expected.
 
-The freezer subsystem in the container filesystem defines a file named
-freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the
-cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup.
-Reading will return the current state.
+The cgroup freezer is hierarchical. Freezing a cgroup freezes all
+tasks beloning to the cgroup and all its descendant cgroups. Each
+cgroup has its own state (self-state) and the state inherited from the
+parent (parent-state). Iff both states are THAWED, the cgroup is
+THAWED.
 
-Note freezer.state doesn't exist in root cgroup, which means root cgroup
-is non-freezable.
+The following cgroupfs files are created by cgroup freezer.
+
+* freezer.state: Read-write.
+
+  When read, returns the effective state of the cgroup - "THAWED",
+  "FREEZING" or "FROZEN". This is the combined self and parent-states.
+  If any is freezing, the cgroup is freezing (FREEZING or FROZEN).
+
+  FREEZING cgroup transitions into FROZEN state when all tasks
+  belonging to the cgroup and its descendants become frozen. Note that
+  a cgroup reverts to FREEZING from FROZEN after a new task is added
+  to the cgroup or one of its descendant cgroups until the new task is
+  frozen.
+
+  When written, sets the self-state of the cgroup. Two values are
+  allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
+  if not already freezing, enters FREEZING state along with all its
+  descendant cgroups.
+
+  If THAWED is written, the self-state of the cgroup is changed to
+  THAWED.  Note that the effective state may not change to THAWED if
+  the parent-state is still freezing. If a cgroup's effective state
+  becomes THAWED, all its descendants which are freezing because of
+  the cgroup also leave the freezing state.
+
+* freezer.self_freezing: Read only.
+
+  Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
+  This value is 1 iff the last write to freezer.state was "FROZEN".
+
+* freezer.parent_freezing: Read only.
+
+  Shows the parent-state.  0 if none of the cgroup's ancestors is
+  frozen; otherwise, 1.
+
+The root cgroup is non-freezable and the above interface files don't
+exist.
 
 * Examples of usage :
 
@@ -85,18 +121,3 @@ to unfreeze all tasks in the container :
 
 This is the basic mechanism which should do the right thing for user space task
 in a simple scenario.
-
-It's important to note that freezing can be incomplete. In that case we return
-EBUSY. This means that some tasks in the cgroup are busy doing something that
-prevents us from completely freezing the cgroup at this time. After EBUSY,
-the cgroup will remain partially frozen -- reflected by freezer.state reporting
-"FREEZING" when read. The state will remain "FREEZING" until one of these
-things happens:
-
-	1) Userspace cancels the freezing operation by writing "THAWED" to
-		the freezer.state file
-	2) Userspace retries the freezing operation by writing "FROZEN" to
-		the freezer.state file (writing "FREEZING" is not legal
-		and returns EINVAL)
-	3) The tasks that blocked the cgroup from entering the "FROZEN"
-		state disappear from the cgroup's set of tasks.
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -22,6 +22,13 @@
 #include <linux/freezer.h>
 #include <linux/seq_file.h>
 
+/*
+ * A cgroup is freezing if any FREEZING flags are set.  FREEZING_SELF is
+ * set if "FROZEN" is written to freezer.state cgroupfs file, and cleared
+ * for "THAWED".  FREEZING_PARENT is set if the parent freezer is FREEZING
+ * for whatever reason.  IOW, a cgroup has FREEZING_PARENT set if one of
+ * its ancestors has FREEZING_SELF set.
+ */
 enum freezer_state_flags {
 	CGROUP_FREEZER_ONLINE	= (1 << 0), /* freezer is fully online */
 	CGROUP_FREEZING_SELF	= (1 << 1), /* this freezer is freezing */
@@ -50,6 +57,15 @@ static inline struct freezer *task_freez
 			    struct freezer, css);
 }
 
+static struct freezer *parent_freezer(struct freezer *freezer)
+{
+	struct cgroup *pcg = freezer->css.cgroup->parent;
+
+	if (pcg)
+		return cgroup_freezer(pcg);
+	return NULL;
+}
+
 bool cgroup_freezing(struct task_struct *task)
 {
 	bool ret;
@@ -74,17 +90,6 @@ static const char *freezer_state_strs(un
 	return "THAWED";
 };
 
-/*
- * State diagram
- * Transitions are caused by userspace writes to the freezer.state file.
- * The values in parenthesis are state labels. The rest are edge labels.
- *
- * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN)
- *    ^ ^                    |                     |
- *    | \_______THAWED_______/                     |
- *    \__________________________THAWED____________/
- */
-
 struct cgroup_subsys freezer_subsys;
 
 static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
@@ -103,15 +108,34 @@ static struct cgroup_subsys_state *freez
  * freezer_post_create - commit creation of a freezer cgroup
  * @cgroup: cgroup being created
  *
- * We're committing to creation of @cgroup.  Mark it online.
+ * We're committing to creation of @cgroup.  Mark it online and inherit
+ * parent's freezing state while holding both parent's and our
+ * freezer->lock.
  */
 static void freezer_post_create(struct cgroup *cgroup)
 {
 	struct freezer *freezer = cgroup_freezer(cgroup);
+	struct freezer *parent = parent_freezer(freezer);
+
+	/*
+	 * The following double locking and freezing state inheritance
+	 * guarantee that @cgroup can never escape ancestors' freezing
+	 * states.  See cgroup_for_each_descendant_pre() for details.
+	 */
+	if (parent)
+		spin_lock_irq(&parent->lock);
+	spin_lock_nested(&freezer->lock, SINGLE_DEPTH_NESTING);
 
-	spin_lock_irq(&freezer->lock);
 	freezer->state |= CGROUP_FREEZER_ONLINE;
-	spin_unlock_irq(&freezer->lock);
+
+	if (parent && (parent->state & CGROUP_FREEZING)) {
+		freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN;
+		atomic_inc(&system_freezing_cnt);
+	}
+
+	spin_unlock(&freezer->lock);
+	if (parent)
+		spin_unlock_irq(&parent->lock);
 }
 
 /**
@@ -153,6 +177,7 @@ static void freezer_attach(struct cgroup
 {
 	struct freezer *freezer = cgroup_freezer(new_cgrp);
 	struct task_struct *task;
+	bool clear_frozen = false;
 
 	spin_lock_irq(&freezer->lock);
 
@@ -172,10 +197,25 @@ static void freezer_attach(struct cgroup
 		} else {
 			freeze_task(task);
 			freezer->state &= ~CGROUP_FROZEN;
+			clear_frozen = true;
 		}
 	}
 
 	spin_unlock_irq(&freezer->lock);
+
+	/*
+	 * Propagate FROZEN clearing upwards.  We may race with
+	 * update_if_frozen(), but as long as both work bottom-up, either
+	 * update_if_frozen() sees child's FROZEN cleared or we clear the
+	 * parent's FROZEN later.  No parent w/ !FROZEN children can be
+	 * left FROZEN.
+	 */
+	while (clear_frozen && (freezer = parent_freezer(freezer))) {
+		spin_lock_irq(&freezer->lock);
+		freezer->state &= ~CGROUP_FROZEN;
+		clear_frozen = freezer->state & CGROUP_FREEZING;
+		spin_unlock_irq(&freezer->lock);
+	}
 }
 
 static void freezer_fork(struct task_struct *task)
@@ -200,24 +240,47 @@ out:
 	rcu_read_unlock();
 }
 
-/*
- * We change from FREEZING to FROZEN lazily if the cgroup was only
- * partially frozen when we exitted write.  Caller must hold freezer->lock.
+/**
+ * update_if_frozen - update whether a cgroup finished freezing
+ * @cgroup: cgroup of interest
+ *
+ * Once FREEZING is initiated, transition to FROZEN is lazily updated by
+ * calling this function.  If the current state is FREEZING but not FROZEN,
+ * this function checks whether all tasks of this cgroup and the descendant
+ * cgroups finished freezing and, if so, sets FROZEN.
+ *
+ * The caller is responsible for grabbing RCU read lock and calling
+ * update_if_frozen() on all descendants prior to invoking this function.
  *
  * Task states and freezer state might disagree while tasks are being
  * migrated into or out of @cgroup, so we can't verify task states against
  * @freezer state here.  See freezer_attach() for details.
  */
-static void update_if_frozen(struct freezer *freezer)
+static void update_if_frozen(struct cgroup *cgroup)
 {
-	struct cgroup *cgroup = freezer->css.cgroup;
+	struct freezer *freezer = cgroup_freezer(cgroup);
+	struct cgroup *pos;
 	struct cgroup_iter it;
 	struct task_struct *task;
 
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	spin_lock_irq(&freezer->lock);
+
 	if (!(freezer->state & CGROUP_FREEZING) ||
 	    (freezer->state & CGROUP_FROZEN))
-		return;
+		goto out_unlock;
 
+	/* are all (live) children frozen? */
+	cgroup_for_each_child(pos, cgroup) {
+		struct freezer *child = cgroup_freezer(pos);
+
+		if ((child->state & CGROUP_FREEZER_ONLINE) &&
+		    !(child->state & CGROUP_FROZEN))
+			goto out_unlock;
+	}
+
+	/* are all tasks frozen? */
 	cgroup_iter_start(cgroup, &it);
 
 	while ((task = cgroup_iter_next(cgroup, &it))) {
@@ -229,27 +292,32 @@ static void update_if_frozen(struct free
 			 * the usual frozen condition.
 			 */
 			if (!frozen(task) && !freezer_should_skip(task))
-				goto notyet;
+				goto out_iter_end;
 		}
 	}
 
 	freezer->state |= CGROUP_FROZEN;
-notyet:
+out_iter_end:
 	cgroup_iter_end(cgroup, &it);
+out_unlock:
+	spin_unlock_irq(&freezer->lock);
 }
 
 static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
 			struct seq_file *m)
 {
-	struct freezer *freezer = cgroup_freezer(cgroup);
-	unsigned int state;
+	struct cgroup *pos;
 
-	spin_lock_irq(&freezer->lock);
-	update_if_frozen(freezer);
-	state = freezer->state;
-	spin_unlock_irq(&freezer->lock);
+	rcu_read_lock();
 
-	seq_puts(m, freezer_state_strs(state));
+	/* update states bottom-up */
+	cgroup_for_each_descendant_post(pos, cgroup)
+		update_if_frozen(pos);
+	update_if_frozen(cgroup);
+
+	rcu_read_unlock();
+
+	seq_puts(m, freezer_state_strs(cgroup_freezer(cgroup)->state));
 	seq_putc(m, '\n');
 	return 0;
 }
@@ -320,14 +388,39 @@ static void freezer_apply_state(struct f
  * @freezer: freezer of interest
  * @freeze: whether to freeze or thaw
  *
- * Freeze or thaw @cgroup according to @freeze.
+ * Freeze or thaw @freezer according to @freeze.  The operations are
+ * recursive - all descendants of @freezer will be affected.
  */
 static void freezer_change_state(struct freezer *freezer, bool freeze)
 {
+	struct cgroup *pos;
+
 	/* update @freezer */
 	spin_lock_irq(&freezer->lock);
 	freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
 	spin_unlock_irq(&freezer->lock);
+
+	/*
+	 * Update all its descendants in pre-order traversal.  Each
+	 * descendant will try to inherit its parent's FREEZING state as
+	 * CGROUP_FREEZING_PARENT.
+	 */
+	rcu_read_lock();
+	cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
+		struct freezer *pos_f = cgroup_freezer(pos);
+		struct freezer *parent = parent_freezer(pos_f);
+
+		/*
+		 * Our update to @parent->state is already visible which is
+		 * all we need.  No need to lock @parent.  For more info on
+		 * synchronization, see freezer_post_create().
+		 */
+		spin_lock_irq(&pos_f->lock);
+		freezer_apply_state(pos_f, parent->state & CGROUP_FREEZING,
+				    CGROUP_FREEZING_PARENT);
+		spin_unlock_irq(&pos_f->lock);
+	}
+	rcu_read_unlock();
 }
 
 static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
@@ -390,12 +483,4 @@ struct cgroup_subsys freezer_subsys = {
 	.attach		= freezer_attach,
 	.fork		= freezer_fork,
 	.base_cftypes	= files,
-
-	/*
-	 * freezer subsys doesn't handle hierarchy at all.  Frozen state
-	 * should be inherited through the hierarchy - if a parent is
-	 * frozen, all its children should be frozen.  Fix it and remove
-	 * the following.
-	 */
-	.broken_hierarchy = true,
 };

^ permalink raw reply

* Re: [PATCH 8/9] cgroup_freezer: add ->post_create() and ->pre_destroy() and track online state
From: Tejun Heo @ 2012-11-08 17:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20121108132306.GH31821-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

Hello,

On Thu, Nov 08, 2012 at 02:23:06PM +0100, Michal Hocko wrote:
> On Sat 03-11-12 01:38:34, Tejun Heo wrote:
> > A cgroup is online and visible to iteration between ->post_create()
> > and ->pre_destroy().  This patch introduces CGROUP_FREEZER_ONLINE and
> > toggles it from the newly added freezer_post_create() and
> > freezer_pre_destroy() while holding freezer->lock such that a
> > cgroup_freezer can be reilably distinguished to be online.  This will
> > be used by full hierarchy support.
> 
> I am thinking whether freezer_pre_destroy is really needed. Once we
> reach pre_destroy then there are no tasks nor any children in the group
> so there is nobody to wake up if the group was frozen and the destroy
> callback is called after synchronize_rcu so the traversing should be
> safe.

Yeah, it might be true, but I'd still like to keep the offlining in
->pre_destroy() so that it's symmetrical w/ ->post_create().  I'll
rename and document the ops so that the roles of each are clearer.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 3/9] cgroup: implement generic child / descendant walk macros
From: Tejun Heo @ 2012-11-08 17:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan, rjw, containers, cgroups, linux-kernel, linux-pm,
	fweisbec
In-Reply-To: <20121108095013.GA31821@dhcp22.suse.cz>

Hey, Michal.

On Thu, Nov 08, 2012 at 10:50:13AM +0100, Michal Hocko wrote:
> On Sat 03-11-12 01:38:29, Tejun Heo wrote:
> > Currently, cgroup doesn't provide any generic helper for walking a
> > given cgroup's children or descendants.  This patch adds the following
> > three macros.
> > 
> > * cgroup_for_each_child() - walk immediate children of a cgroup.
> > 
> > * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
> >   in pre-order tree traversal.
> > 
> > * cgroup_for_each_descendant_post() - visit all descendants of a
> >   cgroup in post-order tree traversal.
> > 
> > All three only require the user to hold RCU read lock during
> > traversal.  Verifying that each iterated cgroup is online is the
> > responsibility of the user.  When used with proper synchronization,
> > cgroup_for_each_descendant_pre() can be used to propagate config
> > updates to descendants in reliable way.  See comments for details.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> 
> I will convert mem_cgroup_iter to use this rather than css_get_next
> after this gets into the next tree so that it can fly via Andrew.
> 
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> 
> Just a minor knit. You are talking about a config propagation while I
> would consider state propagation more clear and less confusing. Config
> is usually stable enough so that post_create is not necessary for
> syncing (e.g. memcg.swappiness). It is a state which must be consistent
> throughout the hierarchy which matters here.

Did s/config/state/g

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Alan Stern @ 2012-11-08 17:07 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <1434035.IipKXWk2Vr@vostro.rjw.lan>

On Thu, 8 Nov 2012, Rafael J. Wysocki wrote:

> > > > is it a good idea to allow to set device state to SUSPENDED if the device
> > > > is disabled?
> > > 
> > > No, it is not.  The status should always be ACTIVE as long as usage_count > 0.

That isn't strictly true, because pm_runtime_get_noresume violates this
rule.  What the PM core actually does is prevent a transition from the
ACTIVE state to the SUSPENDING/SUSPENDED state if usage_count > 0,
_provided_ runtime PM is enabled.  There's no such restriction when it
is disabled.

BTW, do we need to think about what happens in the case where the
device _does_ have a driver and for some reason the driver has disabled
the device for runtime PM?  I would just as soon ignore the issue.

> > > However, in some cases we actually would like to change the status to
> > > SUSPENDED when usage_count becomes equal to 0, because that means we can
> > > suspend (I mean really suspend) the parents of the devices in question
> > > (and we want to notify the parents in those cases).
> > 
> > So do you think Alan Stern's suggestion about forbidden and disabled is
> > the right way to go?
> 
> I'm not really sure about that.
> 
> My original idea was that the runtime PM status and usage counter would
> only matter when runtime PM of a device was enabled.  That leads to
> problems, though, when we enable runtime PM of a device whose usage
> counter is greater from zero and status is SUSPENDED.

That doesn't seem to be a problem.  It can arise without disabling
runtime PM at all -- just call pm_runtime_get_noresume.

>  Also when the
> device's status is ACTIVE, but its parent's child count is 0.

__pm_runtime_set_status prevents this situation from arising.  When the 
device's status is set to ACTIVE, the parent's child count is 
incremented.  So this isn't a problem either.

> It's not very easy to fix this at the core level, though, because we
> depend on the current behavior in some places.  I'm thinking that
> perhaps pm_runtime_enable() should just WARN() if things are obviously
> inconsistent (although there still may be problems, for example, if the
> parent's child count is 2 when we enable runtime PM for its child, but that
> child is the only one it actually has).

I think we should continue the original strategy of ignoring the status
and usage counter when runtime PM is disabled.  This is definitely the
easiest and most straightforward approach.  Fixing the problem at hand
(VGA controllers) by changing the PCI subsystem seems like the simplest
solution.

Your revised patch does do the job, except for a few problems.  
Namely, while local_pci_probe() and pci_device_remove() are running,
the device _does_ have a driver.  This means that local_pci_probe()
should not call pm_runtime_get_sync(), for example.  Doing so would
invoke the driver's runtime_resume routine before calling the driver's
probe routine!

The USB subsystem solves this problem by carefully keeping track of the 
state of the device-driver binding:

	Originally the device is UNBOUND.

	At the start of the subsystem's probe routine, the state
	changes to BINDING.

	If the probe succeeds then it changes to BOUND; otherwise
	it goes back to UNBOUND.

	At the start of the subsystem's remove routine, the state
	changes to UNBINDING.  At the end it goes to UNBOUND.

When the state is anything other than BOUND, the subsystem's runtime PM 
routines act as though there is no driver.  This works because the 
subsystem makes sure that the device is ACTIVE with a nonzero usage 
count before calling the driver's probe or remove routine, so no 
runtime PM callbacks can occur at these awkward times.

If PCI adopted this strategy then your new patch would work okay.  I 
think -- I haven't checked it thoroughly.

Alan Stern

^ permalink raw reply

* Re: [PATCH 9/9 v2] cgroup_freezer: implement proper hierarchy support
From: Michal Hocko @ 2012-11-08 15:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, rjw, containers, cgroups, linux-kernel, linux-pm,
	fweisbec
In-Reply-To: <20121108152923.GG12973@htj.dyndns.org>

On Thu 08-11-12 07:29:23, Tejun Heo wrote:
> Hey, Michal.
> 
> On Thu, Nov 08, 2012 at 04:20:39PM +0100, Michal Hocko wrote:
> > > So, in the above example in CPU2, (B->state & FREEZING) test and
> > > freezer_apply_state(C, false) can't be interleaved with the same
> > > inheritance operation from CPU1.  They either happen before or after.
> > 
> > I am not sure I understand what you mean by inheritance operation but
> 
> The operation of looking at one's parent and inherting the state.
> Here, checking parent->state and calling apply_state accordingly.
> 
> > you are right that the race is not possible because spin_lock makes
> > sure that Foo->state is done after the lock(Foo->child) and spin_unlock
> > then serves as a leave barrier so the other CPUs will see the correctly
> > updated value. The rest is handled by the fixed ordered tree walk. So
> > there is really no interleaving going on here.
> 
> I'm worried that this is a bit too subtle.

Dunno, it looks obvious now, I just missed the entry&leave implicit
barriers by spinlocks and again sorry about the confusion.

> This will work fine with a single hierarchy mutex protecting hierarchy
> updates and state propagations through it and that should work for
> most controllers too.

But single mutex is just ugly.

> I want to have at least one example of finer grained locking for
> future reference and cgroup_freezer happened to be the one I started
> working on.

I think this is a good example because it shows how to share the state
without too many implementation details.

> So, it is more complicated (not necessarily in written code but the
> sync rules) than necessary.  I'm still not sure whether to keep it or
> not.

I think the locking is fine and I would keep it this way rather than a
big lock.

> I'll add more documentation about synchronization in
> cgroup_for_each_descendant_pre() either way.

more doc cannot hurt ;)

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 8/9] cgroup_freezer: add ->post_create() and ->pre_destroy() and track online state
From: Tejun Heo @ 2012-11-08 15:41 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <509B3999.6060505-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

Hello, Kamezawa.

On Thu, Nov 08, 2012 at 01:48:25PM +0900, Kamezawa Hiroyuki wrote:
> Could you explain what 'online' means here again, rather than changelog ?
> BTW, 'online' is a shared concept, between post_create() and pre_destroy(), among
> developpers ? Is it new ?

I'm prepping a patch to rename create, post_create, pre_destroy,
destroy operations to allocate, online, offline, free, so yeah the
terms and concepts will be used for all of cgroup.  I'll update the
docs in that patch.

Thanks!

-- 
tejun

^ permalink raw reply

* Re: [PATCH 9/9 v2] cgroup_freezer: implement proper hierarchy support
From: Tejun Heo @ 2012-11-08 15:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20121108152039.GL31821-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

Hey, Michal.

On Thu, Nov 08, 2012 at 04:20:39PM +0100, Michal Hocko wrote:
> > So, in the above example in CPU2, (B->state & FREEZING) test and
> > freezer_apply_state(C, false) can't be interleaved with the same
> > inheritance operation from CPU1.  They either happen before or after.
> 
> I am not sure I understand what you mean by inheritance operation but

The operation of looking at one's parent and inherting the state.
Here, checking parent->state and calling apply_state accordingly.

> you are right that the race is not possible because spin_lock makes
> sure that Foo->state is done after the lock(Foo->child) and spin_unlock
> then serves as a leave barrier so the other CPUs will see the correctly
> updated value. The rest is handled by the fixed ordered tree walk. So
> there is really no interleaving going on here.

I'm worried that this is a bit too subtle.  This will work fine with a
single hierarchy mutex protecting hierarchy updates and state
propagations through it and that should work for most controllers too.
I want to have at least one example of finer grained locking for
future reference and cgroup_freezer happened to be the one I started
working on.  So, it is more complicated (not necessarily in written
code but the sync rules) than necessary.  I'm still not sure whether
to keep it or not.  I'll add more documentation about synchronization
in cgroup_for_each_descendant_pre() either way.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 9/9 v2] cgroup_freezer: implement proper hierarchy support
From: Michal Hocko @ 2012-11-08 15:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, rjw, containers, cgroups, linux-kernel, linux-pm,
	fweisbec
In-Reply-To: <20121108141848.GA12973@htj.dyndns.org>

On Thu 08-11-12 06:18:48, Tejun Heo wrote:
> Hello, Michal.
> 
> On Thu, Nov 08, 2012 at 03:08:52PM +0100, Michal Hocko wrote:
> > This seems to be racy because parent->state access is not linearized.
> > Say we have parallel freeze and thawing on a tree like the following:
> > 	A
> > 	|
> > 	B
> >         |
> >         C 
> > 
> > pre_order will visit them in B, C order.
> > 	CPU1						CPU2
> > freezer_apply_state(A, true)
> > A->state & FREEZING == true		freezer_apply_state(A, false)
> > 					A->state & FREEZING == false
> > 					freezer_apply_state(B, false)
> > 					B->state & FREEZING == false
> > freezer_apply_state(B, true)
> > 
> > B->state & FREEZING == true
> > freezer_apply_state(C, true)
> > 					freezer_apply_state(C, false)
> > 
> > So A, C are thawed while B is frozen. Or am I missing something which
> > would prevent from this kind of race?
> 
> The rule is that there will be at least one inheritance operation
> after a parent is updated.  The exact order of propagation doesn't
> matter as long as there's at least one inheritance event after the
> latest update to a parent.  This works because inheritance operations
> are atomic to each other.  If one inheritance operation "saw" an
> update to its parent, the next inheritance operation is guaranteed to
> see at least upto that update.
> 
> So, in the above example in CPU2, (B->state & FREEZING) test and
> freezer_apply_state(C, false) can't be interleaved with the same
> inheritance operation from CPU1.  They either happen before or after.

I am not sure I understand what you mean by inheritance operation but
you are right that the race is not possible because spin_lock makes
sure that Foo->state is done after the lock(Foo->child) and spin_unlock
then serves as a leave barrier so the other CPUs will see the correctly
updated value. The rest is handled by the fixed ordered tree walk. So
there is really no interleaving going on here.

Sorry about the confusion!

Feel free to add my Reviewed-by.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 6/9] cgroup_freezer: make freezer->state mask of flags
From: Michal Hocko @ 2012-11-08 14:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, rjw-KKrjLPT3xs0,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20121108143952.GD12973-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

On Thu 08-11-12 06:39:52, Tejun Heo wrote:
> Hello, Michal.
> 
> On Thu, Nov 08, 2012 at 11:39:28AM +0100, Michal Hocko wrote:
> > On Sat 03-11-12 01:38:32, Tejun Heo wrote:
> > > freezer->state was an enum value - one of THAWED, FREEZING and FROZEN.
> > > As the scheduled full hierarchy support requires more than one
> > > freezing condition, switch it to mask of flags.  If FREEZING is not
> > > set, it's thawed.  FREEZING is set if freezing or frozen.  If frozen,
> > > both FREEZING and FROZEN are set.  Now that tasks can be attached to
> > > an already frozen cgroup, this also makes freezing condition checks
> > > more natural.
> > > 
> > > This patch doesn't introduce any behavior change.
> > > 
> > > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > 
> > I think it would be nicer to use freezer_state_flags enum rather than
> > unsigned int for the state. I would even expect gcc to complain about
> > that but it looks like -fstrict-enums is c++ specific (so long enum
> > safety).
> 
> But if you use it as flag bits, the resulting value no longer is
> inside the defined enum values.  Isn't that weird?

Ahh, there is always CGROUP_FREEZER_ONLINE which would overcomplicate
the possible and valid masks.
Please ignore this...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH 7/9] cgroup_freezer: introduce CGROUP_FREEZING_[SELF|PARENT]
From: Tejun Heo @ 2012-11-08 14:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan, rjw, containers, cgroups, linux-kernel, linux-pm,
	fweisbec
In-Reply-To: <20121108124755.GG31821@dhcp22.suse.cz>

Hello,

On Thu, Nov 08, 2012 at 01:47:55PM +0100, Michal Hocko wrote:
> On Sat 03-11-12 01:38:33, Tejun Heo wrote:
> > Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING OR of
> > the two flags.  This is to prepare for full hierarchy support.
> > 
> > freezer_apply_date() is updated such that it can handle setting and
> > clearing of both flags.  The two flags are also exposed to userland
> > via read-only files self_freezing and parent_freezing.
> > 
> > Other than the added cgroupfs files, this patch doesn't introduce any
> > behavior change.
> 
> OK, I can imagine that this might be useful to find the first parent
> group that needs to be thawed to unfreeze the group I am interested
> in. Could you also update Documentation/cgroups/freezer-subsystem.txt to
> clarify the intended usage?

Ooh, right.  Will update the doc in the last patch.

> Minor nit. Same as mentioned in the previous patch freezer_apply_state
> should get enum freezer_state_flags state parameter.

But it isn't an enum value.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 7/9] cgroup_freezer: introduce CGROUP_FREEZING_[SELF|PARENT]
From: Tejun Heo @ 2012-11-08 14:41 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: rjw-KKrjLPT3xs0, linux-pm-u79uwXL29TY76Z2rM5mHXA,
	fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <509B3B94.1070407-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

Hello, Kamezawa.

On Thu, Nov 08, 2012 at 01:56:52PM +0900, Kamezawa Hiroyuki wrote:
> Got it, thank you. I'm glad if you add some more explanation in comments
> and considered use-case in changelog.

I explained it in the last patch because it was a bit weird to explain
stuff which isn't implemented yet.  If it doesn't feel sufficient
after the last patch, please let me know.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 6/9] cgroup_freezer: make freezer->state mask of flags
From: Tejun Heo @ 2012-11-08 14:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, rjw-KKrjLPT3xs0,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20121108103928.GD31821-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

Hello, Michal.

On Thu, Nov 08, 2012 at 11:39:28AM +0100, Michal Hocko wrote:
> On Sat 03-11-12 01:38:32, Tejun Heo wrote:
> > freezer->state was an enum value - one of THAWED, FREEZING and FROZEN.
> > As the scheduled full hierarchy support requires more than one
> > freezing condition, switch it to mask of flags.  If FREEZING is not
> > set, it's thawed.  FREEZING is set if freezing or frozen.  If frozen,
> > both FREEZING and FROZEN are set.  Now that tasks can be attached to
> > an already frozen cgroup, this also makes freezing condition checks
> > more natural.
> > 
> > This patch doesn't introduce any behavior change.
> > 
> > Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> 
> I think it would be nicer to use freezer_state_flags enum rather than
> unsigned int for the state. I would even expect gcc to complain about
> that but it looks like -fstrict-enums is c++ specific (so long enum
> safety).

But if you use it as flag bits, the resulting value no longer is
inside the defined enum values.  Isn't that weird?

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 6/9] cgroup_freezer: make freezer->state mask of flags
From: Tejun Heo @ 2012-11-08 14:38 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: lizefan, mhocko, rjw, linux-pm, fweisbec, containers,
	linux-kernel, cgroups
In-Reply-To: <509B3C72.3050904@jp.fujitsu.com>

Hello, Kamezawa.

On Thu, Nov 08, 2012 at 02:00:34PM +0900, Kamezawa Hiroyuki wrote:
> (2012/11/08 13:42), Tejun Heo wrote:
> >Hello, Kame.
> >
> >On Thu, Nov 08, 2012 at 01:37:50PM +0900, Kamezawa Hiroyuki wrote:
> >>How about
> >>enum {
> >>    __CGROUP_FREEZING,
> >>    __CGROUP_FROZEN,
> >>};
> >>
> >>#define CGROUP_FREEZER_STATE_MASK  0x3
> >>#define CGROUP_FREEZER_STATE(state)	((state) & CGROUP_FREEZER_STATE_MASK)
> >>#define CGROUP_THAW(state)	(CGROUP_FREEZER_STATE(state) == 0)
> >>#define CGROUP_FREEZING(state)	(CGROUP_FREEZER_STATE(state) == __CGROUP_FREEZING)
> >>#define CGROUP_FROZEN(state)\
> >>	(CGROUP_FREEZER_STATE(state) == (__CGROUP_FREEZING | __CGROUP_FROZEN))
> >
> >I think it's a bit overdone and we have cases where we test for
> >FREEZING regardless of FROZEN and cases where test for FREEZING &&
> >!FROZEN.  We can have, say, CGROUP_FREZING() and then
> >CGROUP_FREEZING_BUT_NOT_FROZEN(), but it feels more like obfuscation
> >than anything else.
> >
> 
> Hm, then, I'm glad if I can see what combinations of flags are valid and
> meanings of them in source code comments.

Ooh, the following comment is added by a later patch when the whole
thing is complete.

/*
 * A cgroup is freezing if any FREEZING flags are set.  FREEZING_SELF is
 * set if "FROZEN" is written to freezer.state cgroupfs file, and cleared
 * for "THAWED".  FREEZING_PARENT is set if the parent freezer is FREEZING
 * for whatever reason.  IOW, a cgroup has FREEZING_PARENT set if one of
 * its ancestors has FREEZING_SELF set.
 */

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 9/9 v2] cgroup_freezer: implement proper hierarchy support
From: Tejun Heo @ 2012-11-08 14:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20121108140852.GI31821-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>

Hello, Michal.

On Thu, Nov 08, 2012 at 03:08:52PM +0100, Michal Hocko wrote:
> This seems to be racy because parent->state access is not linearized.
> Say we have parallel freeze and thawing on a tree like the following:
> 	A
> 	|
> 	B
>         |
>         C 
> 
> pre_order will visit them in B, C order.
> 	CPU1						CPU2
> freezer_apply_state(A, true)
> A->state & FREEZING == true		freezer_apply_state(A, false)
> 					A->state & FREEZING == false
> 					freezer_apply_state(B, false)
> 					B->state & FREEZING == false
> freezer_apply_state(B, true)
> 
> B->state & FREEZING == true
> freezer_apply_state(C, true)
> 					freezer_apply_state(C, false)
> 
> So A, C are thawed while B is frozen. Or am I missing something which
> would prevent from this kind of race?

The rule is that there will be at least one inheritance operation
after a parent is updated.  The exact order of propagation doesn't
matter as long as there's at least one inheritance event after the
latest update to a parent.  This works because inheritance operations
are atomic to each other.  If one inheritance operation "saw" an
update to its parent, the next inheritance operation is guaranteed to
see at least upto that update.

So, in the above example in CPU2, (B->state & FREEZING) test and
freezer_apply_state(C, false) can't be interleaved with the same
inheritance operation from CPU1.  They either happen before or after.

Maybe it's too subtle.  The only reason I didn't use a giant
freezer_mutex was that I wanted to demonstrate how to do state
propagation without depending on single giant lock.  Maybe nobody
wants that and this should use one mutex to protect the hierarchy.  I
don't know.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 9/9 v2] cgroup_freezer: implement proper hierarchy support
From: Michal Hocko @ 2012-11-08 14:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, rjw-KKrjLPT3xs0,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20121107163919.GC2660-9pTldWuhBndy/B6EtB590w@public.gmane.org>

On Wed 07-11-12 08:39:19, Tejun Heo wrote:
[...]
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
[...]
> @@ -320,14 +388,39 @@ static void freezer_apply_state(struct f
>   * @freezer: freezer of interest
>   * @freeze: whether to freeze or thaw
>   *
> - * Freeze or thaw @cgroup according to @freeze.
> + * Freeze or thaw @freezer according to @freeze.  The operations are
> + * recursive - all descendants of @freezer will be affected.
>   */
>  static void freezer_change_state(struct freezer *freezer, bool freeze)
>  {
> +	struct cgroup *pos;
> +
>  	/* update @freezer */
>  	spin_lock_irq(&freezer->lock);
>  	freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
>  	spin_unlock_irq(&freezer->lock);
> +
> +	/*
> +	 * Update all its descendants in pre-order traversal.  Each
> +	 * descendant will try to inherit its parent's FREEZING state as
> +	 * CGROUP_FREEZING_PARENT.
> +	 */
> +	rcu_read_lock();
> +	cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
> +		struct freezer *pos_f = cgroup_freezer(pos);
> +		struct freezer *parent = parent_freezer(pos_f);
> +
> +		/*
> +		 * Our update to @parent->state is already visible which is
> +		 * all we need.  No need to lock @parent.  For more info on
> +		 * synchronization, see freezer_post_create().
> +		 */
> +		spin_lock_irq(&pos_f->lock);
> +		freezer_apply_state(pos_f, parent->state & CGROUP_FREEZING,
> +				    CGROUP_FREEZING_PARENT);
> +		spin_unlock_irq(&pos_f->lock);
> +	}
> +	rcu_read_unlock();
>  }

This seems to be racy because parent->state access is not linearized.
Say we have parallel freeze and thawing on a tree like the following:
	A
	|
	B
        |
        C 

pre_order will visit them in B, C order.
	CPU1						CPU2
freezer_apply_state(A, true)
A->state & FREEZING == true		freezer_apply_state(A, false)
					A->state & FREEZING == false
					freezer_apply_state(B, false)
					B->state & FREEZING == false
freezer_apply_state(B, true)

B->state & FREEZING == true
freezer_apply_state(C, true)
					freezer_apply_state(C, false)

So A, C are thawed while B is frozen. Or am I missing something which
would prevent from this kind of race?

The easy fix is to move spin_unlock_irq(&freezer->lock); after
rcu_read_unlock.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox