* Re: [PATCHSET] cgroup: introduce cgroup_taskset and consolidate subsys methods
From: Tejun Heo @ 2011-08-24 7:49 UTC (permalink / raw)
To: Frederic Weisbecker; +Cc: containers, lizf, linux-kernel, linux-pm, paul
In-Reply-To: <20110824011428.GC23979@somewhere>
Hello, Frederic.
On Wed, Aug 24, 2011 at 03:14:30AM +0200, Frederic Weisbecker wrote:
> > 0001-cgroup-subsys-attach_task-should-be-called-after-mig.patch
> > 0002-cgroup-improve-old-cgroup-handling-in-cgroup_attach_.patch
> > 0003-cgroup-introduce-cgroup_taskset-and-use-it-in-subsys.patch
> > 0004-cgroup-don-t-use-subsys-can_attach_task-or-attach_ta.patch
> > 0005-cgroup-cpuset-don-t-use-ss-pre_attach.patch
> > 0006-cgroup-kill-subsys-can_attach_task-pre_attach-and-at.patch
>
> I don't understand the point on patches 3,4,5,6
>
> Why pushing the task iterations down to the subsystems?
I'll try again.
It seems like methods were added to serve the immediate need of the
particular user at the time and that in turn led to addition of
callbacks which were both superflous and incomplete (the bullet points
in the original message list them). This seems to have happened
because extra interface was added without trying to make the existing
interface complete.
The interface is complicated and cumbersome to use - are
[can_]attach() called first or [can_]attach_task()? What about
cancelation? What if a subsys wants to perform operations across
multiple tasks atomically?
In general, iteration-by-callback is painful to use. Establishing
common context (be it synchronization domain or shared variables)
becomes very cumbersome and implementation becomes fragmented and
difficult to follow. For example, imagine how it would be like to use
list if we had call_for_each_list_entry(func, list_head) instead of
the control-loop style iterators we have know.
So, using iterators enables making all relevant information to each
stage of attach so that only one callback is required for each step -
the way it should be. In addition, it makes it far easier for
subsystems to implement more involved logic in their methods.
I tried to make cgroup_freezer behave better which requires better
synchronization against the freezer and, with the current interface,
it's extremely ugly and painful. The new interface is complete, easy
to understand and use with far less subtleties.
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCH v7 3/4] PM / DEVFREQ: add basic governors
From: MyungJoo Ham @ 2011-08-24 7:46 UTC (permalink / raw)
To: Turquette, Mike
Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, linux-pm,
Thomas Gleixner
In-Reply-To: <CAJOA=zM__=_TcuSCF7BOfvHQXJ0u=m_5Y2XHi1uEBBy_z6BJYQ@mail.gmail.com>
On Wed, Aug 24, 2011 at 2:29 AM, Turquette, Mike <mturquette@ti.com> wrote:
> On Tue, Aug 23, 2011 at 12:54 AM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
>
> I think that user_set_freq == 0 is a valid request from a user. A
> user might not know what the valid frequencies are for a device, but
> knows that he/she wants the lowest one. Writing 0 to the sysfs value
> should give the floor of the available frequencies.
>
Users can know what is the minimum valid frequency with
devfreq_min_freq. However, writing 0 to the sysfs is a comfortable
method and I'll allow that input in the next revision along with the
patch of allowing governors to have their own sysfs entries with
example of userspace.
>> + else
>> + *freq = df->user_set_freq;
>> +
>> + return 0;
>> +}
>> +
>> +struct devfreq_governor devfreq_userspace = {
>> + .name = "userspace",
>> + .get_target_freq = devfreq_userspace_func,
>> +};
>
> The set_user_frequency attribute should be moved out of devfreq.c
> (patch 4) and live here. There is no purpose to that entry except for
> the userspace governor and it should not exist in sysfs unless this
> governor is in use. To be clear, there should still be a read-only
> show_frequency or something in devfreq.c.
>
> This again touches on my long-standing complaints with this series'
> design. set_user_frequency is an attribute that belongs to the
> governor, not to the core. Such misplacement of attribute/entity
> ownership is common in this series, but I won't go down that rabbit
> hole again. In the mean time at least this one instance of the
> problem for the userspace gov should be fixed up.
>
> Regards,
> Mike
>
I'll include a patch that shows an example of having a governor
specific sysfs entry at the next revision. It is enabled by
introducing a function to governors "struct devfreq
*get_devfreq(struct device *dev);" at /device/devfreq/governor.h
(local to governors) and adding .init and .exit callbacks at struct
devfreq_governor.
Cheers!
MyungJoo
--
MyungJoo Ham (함명주), Ph.D.
Mobile Software Platform Lab,
Digital Media and Communications (DMC) Business
Samsung Electronics
cell: 82-10-6714-2858
_______________________________________________
linux-pm mailing list
linux-pm@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/linux-pm
^ permalink raw reply
* Re: [PATCH v7 4/4] PM / DEVFREQ: add sysfs interface
From: MyungJoo Ham @ 2011-08-24 7:40 UTC (permalink / raw)
To: Turquette, Mike
Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, linux-pm,
Thomas Gleixner
In-Reply-To: <CAJOA=zNZqz=tHah+z5RHq7SqeG1eGPdO8=PEuGUztmA45cRCxg@mail.gmail.com>
On Wed, Aug 24, 2011 at 2:34 AM, Turquette, Mike <mturquette@ti.com> wrote:
> On Tue, Aug 23, 2011 at 12:54 AM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
>> + The /sys/devices/.../power/devfreq_cur_freq shows the
>
> Copy/paste error? Description should reference devfreq_max_freq not
> devfreq_cur_freq.
>
>> + minimum operable frequency of the corresponding device.
>
> Similar to the above.
>
Thank you. I'll resubmit soon with corrections (and w/ some further
changes to allow governors to have their own sysfs entries and access
more on the DEVFREQ core).
>> +
>> +What: /sys/devices/.../power/devfreq_set_freq
>> +Date: August 2011
>> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
>> +Description:
>> + The /sys/devices/.../power/devfreq_set_freq sets and shows
>> + the user specified desired frequency of the device. The
>> + governor may and may not use the value. With the basic
>> + governors given with devfreq.c, userspace governor is
>> + using the value.
>
> As I stated in patch 3, this should conditionally exist only if the
> userspace governor is being used for this device.
Ok, in the next revision, this entry will be removed and replaced with
an entry created by userspace governor.
>
> The existing devfreq_cur_freq covers the read-only case well.
It could be a bit different especially with another device driver
enabling and disabling OPPs and/or PM QoS.
For example, when user states "100MHz" on a device with
"50/100/200MHz", it will use 100. However, if someone has disabled
100MHz with opp_disable, devfreq_cur_freq will automatically changed
to 200 while the user value should remain at 100 in case where 100 is
later re-enabled.
>
> Regards,
> Mike
>
Thank you.
Cheers!
MyungJoo
--
MyungJoo Ham (함명주), Ph.D.
Mobile Software Platform Lab,
Digital Media and Communications (DMC) Business
Samsung Electronics
cell: 82-10-6714-2858
_______________________________________________
linux-pm mailing list
linux-pm@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/linux-pm
^ permalink raw reply
* Re: [POWER DOMAIN suspend callbacks] Observation.
From: Santosh @ 2011-08-24 6:17 UTC (permalink / raw)
To: Kevin Hilman; +Cc: Govindraj R, Linux PM mailing list, linux-omap
In-Reply-To: <871uwcjbwi.fsf@ti.com>
On Tuesday 23 August 2011 10:36 PM, Kevin Hilman wrote:
> Hi Santosh,
>
> Santosh<santosh.shilimkar@ti.com> writes:
>
>> Rafael, Kevin,
>>
>> On latest kernel( V3.1-rc1+), the subsystem(driver) suspend
>> callbacks are not getting called because power domain callbcaks
>> are populated.
>>
>> And as per commit 4d27e9dc{PM: Make power domain callbacks take
>> precedence over subsystem ones}, it's expected bahavior.
>
> Correct.
>
>> Who is suppose to call the driver suspend callback?
>
> If populated, the PM domain callbacks should call the driver callbacks.
> If there are no PM domain callbacks, then the subsystem (in this case,
> the platform_bus) should be calling the driver callbacks.
>
>> Some drivers/subsystem would have state machine which needs to
>> be suspended.
>>
>> Is the power domain suspend callback, suppose to take care of
>> it ? If yes, then that seems to be missing for OMAP.
>
> Yup, there's a bug. They're not missing, just misplaced. ;)
>
> When adding the noirq callbacks to ensure devices are idled late in
> suspend by omap_device, I the patch commited mistakenly uses
> SET_SYSTEM_SLEEP_PM_OPS(), which sets the "normal" suspend/resume
> handlers and not the noirq handlers.
>
> Can you try the patch below? I only briefly tested it on omap3/n900 so
> far.
>
The patch works like charm.
> This populates most of the PM domain methods with the same ones used by
> the subystem (platform_bus) and only overrides the noirq methods with
> custom versions. This patch should make all the driver's suspend/resume
> methods be called as expected.
>
> After a bit more sanitiy testing, I'll post a real patch for the -rc
> series.
>
Great.
Regards
Santosh
^ permalink raw reply
* Re: [PATCH 2/2 v2] sh-sci / PM: Use power.irq_safe
From: Paul Mundt @ 2011-08-24 5:33 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM mailing list, LKML, linux-sh
In-Reply-To: <201108212111.45012.rjw@sisk.pl>
On Sun, Aug 21, 2011 at 09:11:44PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rjw@sisk.pl>
>
> Since sci_port_enable() and sci_port_disable() may be run with
> interrupts off and they execute pm_runtime_get_sync() and
> pm_runtime_put_sync(), respectively, the SCI device's
> power.irq_safe flags has to be used to indicate that it is safe
> to execute runtime PM callbacks for this device with interrupts off.
>
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Not sure how you want this one handled. Did you simply want to roll this
in with your other patch with my Acked-by, or should I be taking this
through my tree already regardless of the 1/2 patch?
^ permalink raw reply
* Re: [PATCH 4/6] cgroup: don't use subsys->can_attach_task() or ->attach_task()
From: Matt Helsley @ 2011-08-24 1:57 UTC (permalink / raw)
To: Tejun Heo
Cc: lizf, paul, Daisuke Nishimura, linux-kernel, linux-pm,
Ingo Molnar, containers
In-Reply-To: <1314138000-2049-5-git-send-email-tj@kernel.org>
On Wed, Aug 24, 2011 at 12:19:58AM +0200, Tejun Heo wrote:
<snip>
> * In cgroup_freezer, remove unnecessary NULL assignments to unused
> methods. It's useless and very prone to get out of sync, which
> already happened.
You could post this part independently -- that might be best since
I guess the taskset bits will need more discussion.
Cheers,
-Matt Helsley
^ permalink raw reply
* Re: [PATCH 1/6] cgroup: subsys->attach_task() should be called after migration
From: Li Zefan @ 2011-08-24 1:31 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: paul, linux-kernel, Tejun Heo, containers, Andrew Morton,
linux-pm
In-Reply-To: <20110824003234.GB23979@somewhere>
> There have been a patch posted recently:
>
> "[PATCH][BUGFIX] cgroups: fix ordering of calls in cgroup_attach_proc"
>
> that not only fixes that ordering but also only attach the task if the
> migration happened correctly (task not exited).
>
> Can somebody queue it for 3.2 ?
>
cgroup patches normally go through -mm tree.
^ permalink raw reply
* Re: [PATCHSET] cgroup: introduce cgroup_taskset and consolidate subsys methods
From: Frederic Weisbecker @ 2011-08-24 1:14 UTC (permalink / raw)
To: Tejun Heo; +Cc: containers, lizf, linux-kernel, linux-pm, paul
In-Reply-To: <1314138000-2049-1-git-send-email-tj@kernel.org>
On Wed, Aug 24, 2011 at 12:19:54AM +0200, Tejun Heo wrote:
> Hello,
>
> cgroup has grown quite some number of subsys methods. Some of them
> are overlapping, inconsistent with each other and called under
> different conditions depending on whether they're called for a single
> task or whole process. Unfortunately, these callbacks are complicated
> and incomplete at the same time.
>
> * ->attach_task() is called after migration for task attach but before
> for process.
>
> * Ditto for ->pre_attach().
>
> * ->can_attach_task() is called for every task in the thread group but
> ->attach_task() skips the ones which don't actually change cgroups.
>
> * Task attach becomes noop if the task isn't actually moving. Process
> attach is always performed.
>
> * ->attach_task() doesn't (or at least aren't supposed to) have access
> to the old cgroup.
>
> * During cancel, there's no way to access the affected tasks.
>
> This patchset introduces cgroup_taskset along with some accessors and
> iterator, updates methods to use it, consolidates usages and drops
> superflous methods.
>
> It contains the following six patches.
>
> 0001-cgroup-subsys-attach_task-should-be-called-after-mig.patch
> 0002-cgroup-improve-old-cgroup-handling-in-cgroup_attach_.patch
> 0003-cgroup-introduce-cgroup_taskset-and-use-it-in-subsys.patch
> 0004-cgroup-don-t-use-subsys-can_attach_task-or-attach_ta.patch
> 0005-cgroup-cpuset-don-t-use-ss-pre_attach.patch
> 0006-cgroup-kill-subsys-can_attach_task-pre_attach-and-at.patch
I don't understand the point on patches 3,4,5,6
Why pushing the task iterations down to the subsystems?
^ permalink raw reply
* Re: [PATCH 1/6] cgroup: subsys->attach_task() should be called after migration
From: Frederic Weisbecker @ 2011-08-24 0:32 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton; +Cc: containers, lizf, linux-kernel, linux-pm, paul
In-Reply-To: <1314138000-2049-2-git-send-email-tj@kernel.org>
On Wed, Aug 24, 2011 at 12:19:55AM +0200, Tejun Heo wrote:
> cgroup_attach_task() calls subsys->attach_task() after
> cgroup_task_migrate(); however, cgroup_attach_proc() calls it before
> migration. This actually affects some of the users. Update
> cgroup_attach_proc() such that ->attach_task() is called after
> migration.
There have been a patch posted recently:
"[PATCH][BUGFIX] cgroups: fix ordering of calls in cgroup_attach_proc"
that not only fixes that ordering but also only attach the task if the
migration happened correctly (task not exited).
Can somebody queue it for 3.2 ?
^ permalink raw reply
* Re: [PATCH 1/2] PM: add runtime PM support to core Primecell driver
From: Rafael J. Wysocki @ 2011-08-23 22:24 UTC (permalink / raw)
To: Russell King - ARM Linux; +Cc: linux-pm, linux-arm-kernel
In-Reply-To: <E1Qvo1q-0000qv-TQ@rmk-PC.arm.linux.org.uk>
On Tuesday, August 23, 2011, Russell King - ARM Linux wrote:
> Add runtime PM support to the core Primecell driver, following the PCI
> model of how this is done.
>
> Rather than having every driver fiddle about with enabling runtime PM,
> that's dealt with in the core and instead, drivers just do a put() in
> their probe and a balancing get() in their remove function to activate
> runtime PM for the device.
>
> As we're dealing with enabling runtime PM in the core, fix up spi-pl022
> as it must not enable and disable runtime PM itself anymore.
>
> Tested-by: Linus Walleij <linus.walleij@linaro.org>
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
> ---
> drivers/amba/bus.c | 57 +++++++++++++++++++++++++++++++--
> drivers/spi/spi-pl022.c | 80 ++++++++++++++++++++++++++++-------------------
> 2 files changed, 102 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> index d74926e..84bdaac 100644
> --- a/drivers/amba/bus.c
> +++ b/drivers/amba/bus.c
> @@ -365,6 +365,40 @@ static int amba_pm_restore_noirq(struct device *dev)
>
> #endif /* !CONFIG_HIBERNATE_CALLBACKS */
>
> +#ifdef CONFIG_PM_RUNTIME
> +/*
> + * Hooks to provide runtime PM of the pclk (bus clock). It is safe to
> + * enable/disable the bus clock at runtime PM suspend/resume as this
> + * does not result in loss of context. However, disabling vcore power
> + * would do, so we leave that to the driver.
> + */
> +static int amba_pm_runtime_suspend(struct device *dev)
> +{
> + struct amba_device *pcdev = to_amba_device(dev);
> + int ret = pm_generic_runtime_suspend(dev);
> +
> + if (ret == 0 && dev->driver)
> + clk_disable(pcdev->pclk);
> +
> + return ret;
> +}
> +
> +static int amba_pm_runtime_resume(struct device *dev)
> +{
> + struct amba_device *pcdev = to_amba_device(dev);
> + int ret;
> +
> + if (dev->driver) {
> + ret = clk_enable(pcdev->pclk);
> + /* Failure is probably fatal to the system, but... */
> + if (ret)
> + return ret;
> + }
> +
> + return pm_generic_runtime_resume(dev);
> +}
> +#endif
> +
> #ifdef CONFIG_PM
>
> static const struct dev_pm_ops amba_pm = {
> @@ -383,8 +417,8 @@ static const struct dev_pm_ops amba_pm = {
> .poweroff_noirq = amba_pm_poweroff_noirq,
> .restore_noirq = amba_pm_restore_noirq,
> SET_RUNTIME_PM_OPS(
> - pm_generic_runtime_suspend,
> - pm_generic_runtime_resume,
> + amba_pm_runtime_suspend,
> + amba_pm_runtime_resume,
> pm_generic_runtime_idle
> )
> };
> @@ -494,10 +528,18 @@ static int amba_probe(struct device *dev)
> if (ret)
> break;
>
> + pm_runtime_get_noresume(dev);
> + pm_runtime_set_active(dev);
> + pm_runtime_enable(dev);
> +
> ret = pcdrv->probe(pcdev, id);
> if (ret == 0)
> break;
>
> + pm_runtime_disable(dev);
> + pm_runtime_set_suspended(dev);
> + pm_runtime_put_noidle(dev);
> +
> amba_put_disable_pclk(pcdev);
> amba_put_disable_vcore(pcdev);
> } while (0);
> @@ -509,7 +551,16 @@ static int amba_remove(struct device *dev)
> {
> struct amba_device *pcdev = to_amba_device(dev);
> struct amba_driver *drv = to_amba_driver(dev->driver);
> - int ret = drv->remove(pcdev);
> + int ret;
> +
> + pm_runtime_get_sync(dev);
> + ret = drv->remove(pcdev);
> + pm_runtime_put_noidle(dev);
> +
> + /* Undo the runtime PM settings in amba_probe() */
> + pm_runtime_disable(dev);
> + pm_runtime_set_suspended(dev);
> + pm_runtime_put_noidle(dev);
>
> amba_put_disable_pclk(pcdev);
> amba_put_disable_vcore(pcdev);
> diff --git a/drivers/spi/spi-pl022.c b/drivers/spi/spi-pl022.c
> index 730b4a3..078338f 100644
> --- a/drivers/spi/spi-pl022.c
> +++ b/drivers/spi/spi-pl022.c
> @@ -515,9 +515,6 @@ static void giveback(struct pl022 *pl022)
> if (msg->complete)
> msg->complete(msg->context);
> /* This message is completed, so let's turn off the clocks & power */
> - clk_disable(pl022->clk);
> - amba_pclk_disable(pl022->adev);
> - amba_vcore_disable(pl022->adev);
> pm_runtime_put(&pl022->adev->dev);
> }
>
> @@ -1545,9 +1542,6 @@ static void pump_messages(struct work_struct *work)
> * (poll/interrupt/DMA)
> */
> pm_runtime_get_sync(&pl022->adev->dev);
> - amba_vcore_enable(pl022->adev);
> - amba_pclk_enable(pl022->adev);
> - clk_enable(pl022->clk);
> restore_state(pl022);
> flush(pl022);
>
> @@ -2186,8 +2180,6 @@ pl022_probe(struct amba_device *adev, const struct amba_id *id)
> }
> printk(KERN_INFO "pl022: mapped registers from 0x%08x to %p\n",
> adev->res.start, pl022->virtbase);
> - pm_runtime_enable(dev);
> - pm_runtime_resume(dev);
>
> pl022->clk = clk_get(&adev->dev, NULL);
> if (IS_ERR(pl022->clk)) {
> @@ -2195,7 +2187,6 @@ pl022_probe(struct amba_device *adev, const struct amba_id *id)
> dev_err(&adev->dev, "could not retrieve SSP/SPI bus clock\n");
> goto err_no_clk;
> }
> -
> /* Disable SSP */
> writew((readw(SSP_CR1(pl022->virtbase)) & (~SSP_CR1_MASK_SSE)),
> SSP_CR1(pl022->virtbase));
> @@ -2235,12 +2226,9 @@ pl022_probe(struct amba_device *adev, const struct amba_id *id)
> goto err_spi_register;
> }
> dev_dbg(dev, "probe succeeded\n");
> - /*
> - * Disable the silicon block pclk and any voltage domain and just
> - * power it up and clock it when it's needed
> - */
> - amba_pclk_disable(adev);
> - amba_vcore_disable(adev);
> +
> + /* let runtime pm put suspend */
> + pm_runtime_put(dev);
> return 0;
>
> err_spi_register:
> @@ -2249,7 +2237,6 @@ pl022_probe(struct amba_device *adev, const struct amba_id *id)
> destroy_queue(pl022);
> pl022_dma_remove(pl022);
> free_irq(adev->irq[0], pl022);
> - pm_runtime_disable(&adev->dev);
> err_no_irq:
> clk_put(pl022->clk);
> err_no_clk:
> @@ -2271,6 +2258,12 @@ pl022_remove(struct amba_device *adev)
> if (!pl022)
> return 0;
>
> + /*
> + * undo pm_runtime_put() in probe. I assume that we're not
> + * accessing the primecell here.
> + */
> + pm_runtime_get_noresume(&adev->dev);
> +
> /* Remove the queue */
> if (destroy_queue(pl022) != 0)
> dev_err(&adev->dev, "queue remove failed\n");
> @@ -2288,10 +2281,10 @@ pl022_remove(struct amba_device *adev)
> return 0;
> }
>
> -#ifdef CONFIG_PM
> -static int pl022_suspend(struct amba_device *adev, pm_message_t state)
> +#ifdef CONFIG_SUSPEND
> +static int pl011_suspend(struct device *dev)
> {
> - struct pl022 *pl022 = amba_get_drvdata(adev);
> + struct pl022 *pl022 = dev_get_drvdata(dev);
> int status = 0;
>
> status = stop_queue(pl022);
> @@ -2300,34 +2293,58 @@ static int pl022_suspend(struct amba_device *adev, pm_message_t state)
> return status;
> }
>
> - amba_vcore_enable(adev);
> - amba_pclk_enable(adev);
> + amba_vcore_enable(pl022->adev);
> + amba_pclk_enable(pl022->adev);
> load_ssp_default_config(pl022);
> - amba_pclk_disable(adev);
> - amba_vcore_disable(adev);
> + amba_pclk_disable(pl022->adev);
> + amba_vcore_disable(pl022->adev);
> dev_dbg(&adev->dev, "suspended\n");
> return 0;
> }
>
> -static int pl022_resume(struct amba_device *adev)
> +static int pl022_resume(struct device *dev)
> {
> - struct pl022 *pl022 = amba_get_drvdata(adev);
> + struct pl022 *pl022 = dev_get_drvdata(dev);
> int status = 0;
>
> /* Start the queue running */
> status = start_queue(pl022);
> if (status)
> - dev_err(&adev->dev, "problem starting queue (%d)\n", status);
> + dev_err(dev, "problem starting queue (%d)\n", status);
> else
> - dev_dbg(&adev->dev, "resumed\n");
> + dev_dbg(dev, "resumed\n");
>
> return status;
> }
> -#else
> -#define pl022_suspend NULL
> -#define pl022_resume NULL
> #endif /* CONFIG_PM */
>
> +#ifdef CONFIG_PM_RUNTIME
> +static int pl022_runtime_suspend(struct device *dev)
> +{
> + struct pl022 *pl022 = dev_get_drvdata(dev);
> +
> + clk_disable(pl022->clk);
> + amba_vcore_disable(pl022->adev);
> +
> + return 0;
> +}
> +
> +static int pl022_runtime_resume(struct device *dev)
> +{
> + struct pl022 *pl022 = dev_get_drvdata(dev);
> +
> + amba_vcore_enable(pl022->adev);
> + clk_enable(pl022->clk);
> +
> + return 0;
> +}
> +#endif
> +
> +static const struct dev_pm_ops pl022_dev_pm_ops = {
> + SET_SYSTEM_SLEEP_PM_OPS(pl022_suspend, pl022_resume)
> + SET_RUNTIME_PM_OPS(pl022_runtime_suspend, pl022_runtime_resume, NULL)
> +};
> +
> static struct vendor_data vendor_arm = {
> .fifodepth = 8,
> .max_bpw = 16,
> @@ -2407,12 +2424,11 @@ static struct amba_id pl022_ids[] = {
> static struct amba_driver pl022_driver = {
> .drv = {
> .name = "ssp-pl022",
> + .pm = &pl022_dev_pm_ops,
> },
> .id_table = pl022_ids,
> .probe = pl022_probe,
> .remove = __devexit_p(pl022_remove),
> - .suspend = pl022_suspend,
> - .resume = pl022_resume,
> };
>
>
>
^ permalink raw reply
* [PATCH 6/6] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
From: Tejun Heo @ 2011-08-23 22:20 UTC (permalink / raw)
To: rjw, paul, lizf; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314138000-2049-1-git-send-email-tj@kernel.org>
These three methods are no longer used. Kill them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
include/linux/cgroup.h | 3 --
kernel/cgroup.c | 53 ++++-------------------------------------------
2 files changed, 5 insertions(+), 51 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2470c8e..5659d37 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -490,11 +490,8 @@ struct cgroup_subsys {
void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup_taskset *tset);
- int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup_taskset *tset);
- void (*pre_attach)(struct cgroup *cgrp);
- void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup_taskset *tset);
void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 474674b..374a4cb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1926,13 +1926,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
goto out;
}
}
- if (ss->can_attach_task) {
- retval = ss->can_attach_task(cgrp, tsk);
- if (retval) {
- failed_ss = ss;
- goto out;
- }
- }
}
retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
@@ -1940,10 +1933,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
goto out;
for_each_subsys(root, ss) {
- if (ss->pre_attach)
- ss->pre_attach(cgrp);
- if (ss->attach_task)
- ss->attach_task(cgrp, tsk);
if (ss->attach)
ss->attach(ss, cgrp, &tset);
}
@@ -2075,7 +2064,6 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
{
int retval, i, group_size, nr_todo;
struct cgroup_subsys *ss, *failed_ss = NULL;
- bool cancel_failed_ss = false;
/* guaranteed to be initialized later, but the compiler needs this */
struct css_set *oldcg;
struct cgroupfs_root *root = cgrp->root;
@@ -2166,21 +2154,6 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
goto out_cancel_attach;
}
}
- /* a callback to be run on every thread in the threadgroup. */
- if (ss->can_attach_task) {
- /* run on each task in the threadgroup. */
- for (i = 0; i < group_size; i++) {
- tc = flex_array_get(group, i);
- if (tc->cgrp == cgrp)
- continue;
- retval = ss->can_attach_task(cgrp, tc->task);
- if (retval) {
- failed_ss = ss;
- cancel_failed_ss = true;
- goto out_cancel_attach;
- }
- }
- }
}
/*
@@ -2217,15 +2190,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
}
/*
- * step 3: now that we're guaranteed success wrt the css_sets, proceed
- * to move all tasks to the new cgroup, calling ss->attach_task for each
- * one along the way. there are no failure cases after here, so this is
- * the commit point.
+ * step 3: now that we're guaranteed success wrt the css_sets,
+ * proceed to move all tasks to the new cgroup. There are no
+ * failure cases after here, so this is the commit point.
*/
- for_each_subsys(root, ss) {
- if (ss->pre_attach)
- ss->pre_attach(cgrp);
- }
for (i = 0; i < group_size; i++) {
tc = flex_array_get(group, i);
/* leave current thread as it is if it's already there */
@@ -2235,19 +2203,11 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
/* if the thread is PF_EXITING, it can just get skipped. */
retval = cgroup_task_migrate(cgrp, tc->cgrp, tc->task, true);
BUG_ON(retval != 0 && retval != -ESRCH);
-
- /* attach each task to each subsystem */
- for_each_subsys(root, ss) {
- if (ss->attach_task)
- ss->attach_task(cgrp, tc->task);
- }
}
/* nothing is sensitive to fork() after this point. */
/*
- * step 4: do expensive, non-thread-specific subsystem callbacks.
- * TODO: if ever a subsystem needs to know the oldcgrp for each task
- * being moved, this call will need to be reworked to communicate that.
+ * step 4: do subsystem attach callbacks.
*/
for_each_subsys(root, ss) {
if (ss->attach)
@@ -2271,11 +2231,8 @@ out_cancel_attach:
/* same deal as in cgroup_attach_task */
if (retval) {
for_each_subsys(root, ss) {
- if (ss == failed_ss) {
- if (cancel_failed_ss && ss->cancel_attach)
- ss->cancel_attach(ss, cgrp, &tset);
+ if (ss == failed_ss)
break;
- }
if (ss->cancel_attach)
ss->cancel_attach(ss, cgrp, &tset);
}
--
1.7.6
^ permalink raw reply related
* [PATCH 5/6] cgroup, cpuset: don't use ss->pre_attach()
From: Tejun Heo @ 2011-08-23 22:19 UTC (permalink / raw)
To: rjw, paul, lizf; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314138000-2049-1-git-send-email-tj@kernel.org>
->pre_attach() is supposed to be called before migration, which is
observed during process migration but task migration does it the other
way around. The only ->pre_attach() user is cpuset which can do the
same operaitons in ->can_attach(). Collapse cpuset_pre_attach() into
cpuset_can_attach().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
Documentation/cgroups/cgroups.txt | 20 --------------------
kernel/cpuset.c | 29 ++++++++++++-----------------
2 files changed, 12 insertions(+), 37 deletions(-)
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 2eee7cf..afb7cde 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -610,13 +610,6 @@ called on a fork. If this method returns 0 (success) then this should
remain valid while the caller holds cgroup_mutex and it is ensured
that either attach() or cancel_attach() will be called in future.
-int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
-(cgroup_mutex held by caller)
-
-As can_attach, but for operations that must be run once per task to be
-attached (possibly many when using cgroup_attach_proc). Called after
-can_attach.
-
void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
@@ -627,12 +620,6 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
This will be called only about subsystems whose can_attach() operation have
succeeded. The parameters are identical to can_attach().
-void pre_attach(struct cgroup *cgrp);
-(cgroup_mutex held by caller)
-
-For any non-per-thread attachment work that needs to happen before
-attach_task. Needed by cpuset.
-
void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
@@ -641,13 +628,6 @@ Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking.
The parameters are identical to can_attach().
-void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
-(cgroup_mutex held by caller)
-
-As attach, but for operations that must be run once per task to be attached,
-like can_attach_task. Called before attach. Currently does not support any
-subsystem that might need the old_cgrp for every thread in the group.
-
void fork(struct cgroup_subsy *ss, struct task_struct *task)
Called when a task is forked into a cgroup.
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 472ddd6..f0b8df3 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1367,6 +1367,15 @@ static int fmeter_getrate(struct fmeter *fmp)
return val;
}
+/*
+ * Protected by cgroup_lock. The nodemasks must be stored globally because
+ * dynamically allocating them is not allowed in can_attach, and they must
+ * persist until attach.
+ */
+static cpumask_var_t cpus_attach;
+static nodemask_t cpuset_attach_nodemask_from;
+static nodemask_t cpuset_attach_nodemask_to;
+
/* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup_taskset *tset)
@@ -1393,29 +1402,16 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
if ((ret = security_task_setscheduler(task)))
return ret;
}
- return 0;
-}
-
-/*
- * Protected by cgroup_lock. The nodemasks must be stored globally because
- * dynamically allocating them is not allowed in pre_attach, and they must
- * persist among pre_attach, and attach.
- */
-static cpumask_var_t cpus_attach;
-static nodemask_t cpuset_attach_nodemask_from;
-static nodemask_t cpuset_attach_nodemask_to;
-
-/* Set-up work for before attaching each task. */
-static void cpuset_pre_attach(struct cgroup *cont)
-{
- struct cpuset *cs = cgroup_cs(cont);
+ /* prepare for attach */
if (cs == &top_cpuset)
cpumask_copy(cpus_attach, cpu_possible_mask);
else
guarantee_online_cpus(cs, cpus_attach);
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+
+ return 0;
}
static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
@@ -1901,7 +1897,6 @@ struct cgroup_subsys cpuset_subsys = {
.create = cpuset_create,
.destroy = cpuset_destroy,
.can_attach = cpuset_can_attach,
- .pre_attach = cpuset_pre_attach,
.attach = cpuset_attach,
.populate = cpuset_populate,
.post_clone = cpuset_post_clone,
--
1.7.6
^ permalink raw reply related
* [PATCH 4/6] cgroup: don't use subsys->can_attach_task() or ->attach_task()
From: Tejun Heo @ 2011-08-23 22:19 UTC (permalink / raw)
To: rjw, paul, lizf
Cc: Peter Zijlstra, containers, Ingo Molnar, Daisuke Nishimura,
linux-kernel, James Morris, Tejun Heo, linux-pm,
KAMEZAWA Hiroyuki
In-Reply-To: <1314138000-2049-1-git-send-email-tj@kernel.org>
Now that subsys->can_attach() and attach() take @tset instead of
@task, they can handle per-task operations. Convert
->can_attach_task() and ->attach_task() users to use ->can_attach()
and attach() instead. Most converions are straight-forward.
Noteworthy changes are,
* In cgroup_freezer, remove unnecessary NULL assignments to unused
methods. It's useless and very prone to get out of sync, which
already happened.
* In cpuset, PF_THREAD_BOUND test is checked for each task. This
doesn't make any practical difference but is conceptually cleaner.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
---
block/blk-cgroup.c | 45 +++++++++++++++++++-----------
kernel/cgroup_freezer.c | 14 +++-------
kernel/cpuset.c | 70 +++++++++++++++++++++-------------------------
kernel/events/core.c | 13 +++++---
kernel/sched.c | 31 +++++++++++++--------
5 files changed, 91 insertions(+), 82 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bcaf16e..99e0bd4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,8 +30,10 @@ EXPORT_SYMBOL_GPL(blkio_root_cgroup);
static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
struct cgroup *);
-static int blkiocg_can_attach_task(struct cgroup *, struct task_struct *);
-static void blkiocg_attach_task(struct cgroup *, struct task_struct *);
+static int blkiocg_can_attach(struct cgroup_subsys *, struct cgroup *,
+ struct cgroup_taskset *);
+static void blkiocg_attach(struct cgroup_subsys *, struct cgroup *,
+ struct cgroup_taskset *);
static void blkiocg_destroy(struct cgroup_subsys *, struct cgroup *);
static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
@@ -44,8 +46,8 @@ static int blkiocg_populate(struct cgroup_subsys *, struct cgroup *);
struct cgroup_subsys blkio_subsys = {
.name = "blkio",
.create = blkiocg_create,
- .can_attach_task = blkiocg_can_attach_task,
- .attach_task = blkiocg_attach_task,
+ .can_attach = blkiocg_can_attach,
+ .attach = blkiocg_attach,
.destroy = blkiocg_destroy,
.populate = blkiocg_populate,
#ifdef CONFIG_BLK_CGROUP
@@ -1614,30 +1616,39 @@ done:
* of the main cic data structures. For now we allow a task to change
* its cgroup only if it's the only owner of its ioc.
*/
-static int blkiocg_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int blkiocg_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup_taskset *tset)
{
+ struct task_struct *task;
struct io_context *ioc;
int ret = 0;
/* task_lock() is needed to avoid races with exit_io_context() */
- task_lock(tsk);
- ioc = tsk->io_context;
- if (ioc && atomic_read(&ioc->nr_tasks) > 1)
- ret = -EINVAL;
- task_unlock(tsk);
-
+ cgroup_taskset_for_each(task, cgrp, tset) {
+ task_lock(task);
+ ioc = task->io_context;
+ if (ioc && atomic_read(&ioc->nr_tasks) > 1)
+ ret = -EINVAL;
+ task_unlock(task);
+ if (ret)
+ break;
+ }
return ret;
}
-static void blkiocg_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void blkiocg_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup_taskset *tset)
{
+ struct task_struct *task;
struct io_context *ioc;
- task_lock(tsk);
- ioc = tsk->io_context;
- if (ioc)
- ioc->cgroup_changed = 1;
- task_unlock(tsk);
+ cgroup_taskset_for_each(task, cgrp, tset) {
+ task_lock(task);
+ ioc = task->io_context;
+ if (ioc)
+ ioc->cgroup_changed = 1;
+ task_unlock(task);
+ }
}
void blkio_policy_register(struct blkio_policy_type *blkiop)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index a2b0082..2cb5e72 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -162,10 +162,14 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
struct cgroup_taskset *tset)
{
struct freezer *freezer;
+ struct task_struct *task;
/*
* Anything frozen can't move or be moved to/from.
*/
+ cgroup_taskset_for_each(task, new_cgroup, tset)
+ if (cgroup_freezing(task))
+ return -EBUSY;
freezer = cgroup_freezer(new_cgroup);
if (freezer->state != CGROUP_THAWED)
@@ -174,11 +178,6 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
return 0;
}
-static int freezer_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
-{
- return cgroup_freezing(tsk) ? -EBUSY : 0;
-}
-
static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
{
struct freezer *freezer;
@@ -374,10 +373,5 @@ struct cgroup_subsys freezer_subsys = {
.populate = freezer_populate,
.subsys_id = freezer_subsys_id,
.can_attach = freezer_can_attach,
- .can_attach_task = freezer_can_attach_task,
- .pre_attach = NULL,
- .attach_task = NULL,
- .attach = NULL,
.fork = freezer_fork,
- .exit = NULL,
};
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 2e5825b..472ddd6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1372,33 +1372,34 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup_taskset *tset)
{
struct cpuset *cs = cgroup_cs(cgrp);
+ struct task_struct *task;
+ int ret;
if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
return -ENOSPC;
- /*
- * Kthreads bound to specific cpus cannot be moved to a new cpuset; we
- * cannot change their cpu affinity and isolating such threads by their
- * set of allowed nodes is unnecessary. Thus, cpusets are not
- * applicable for such threads. This prevents checking for success of
- * set_cpus_allowed_ptr() on all attached tasks before cpus_allowed may
- * be changed.
- */
- if (cgroup_taskset_first(tset)->flags & PF_THREAD_BOUND)
- return -EINVAL;
-
+ cgroup_taskset_for_each(task, cgrp, tset) {
+ /*
+ * Kthreads bound to specific cpus cannot be moved to a new
+ * cpuset; we cannot change their cpu affinity and
+ * isolating such threads by their set of allowed nodes is
+ * unnecessary. Thus, cpusets are not applicable for such
+ * threads. This prevents checking for success of
+ * set_cpus_allowed_ptr() on all attached tasks before
+ * cpus_allowed may be changed.
+ */
+ if (task->flags & PF_THREAD_BOUND)
+ return -EINVAL;
+ if ((ret = security_task_setscheduler(task)))
+ return ret;
+ }
return 0;
}
-static int cpuset_can_attach_task(struct cgroup *cgrp, struct task_struct *task)
-{
- return security_task_setscheduler(task);
-}
-
/*
* Protected by cgroup_lock. The nodemasks must be stored globally because
* dynamically allocating them is not allowed in pre_attach, and they must
- * persist among pre_attach, attach_task, and attach.
+ * persist among pre_attach, and attach.
*/
static cpumask_var_t cpus_attach;
static nodemask_t cpuset_attach_nodemask_from;
@@ -1417,39 +1418,34 @@ static void cpuset_pre_attach(struct cgroup *cont)
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
}
-/* Per-thread attachment work. */
-static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
-{
- int err;
- struct cpuset *cs = cgroup_cs(cont);
-
- /*
- * can_attach beforehand should guarantee that this doesn't fail.
- * TODO: have a better way to handle failure here
- */
- err = set_cpus_allowed_ptr(tsk, cpus_attach);
- WARN_ON_ONCE(err);
-
- cpuset_change_task_nodemask(tsk, &cpuset_attach_nodemask_to);
- cpuset_update_task_spread_flag(cs, tsk);
-}
-
static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup_taskset *tset)
{
struct mm_struct *mm;
- struct task_struct *tsk = cgroup_taskset_first(tset);
+ struct task_struct *task;
+ struct task_struct *leader = cgroup_taskset_first(tset);
struct cgroup *oldcgrp = cgroup_taskset_cur_cgroup(tset);
struct cpuset *cs = cgroup_cs(cgrp);
struct cpuset *oldcs = cgroup_cs(oldcgrp);
+ cgroup_taskset_for_each(task, cgrp, tset) {
+ /*
+ * can_attach beforehand should guarantee that this doesn't
+ * fail. TODO: have a better way to handle failure here
+ */
+ WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+
+ cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
+ cpuset_update_task_spread_flag(cs, task);
+ }
+
/*
* Change mm, possibly for multiple threads in a threadgroup. This is
* expensive and may sleep.
*/
cpuset_attach_nodemask_from = oldcs->mems_allowed;
cpuset_attach_nodemask_to = cs->mems_allowed;
- mm = get_task_mm(tsk);
+ mm = get_task_mm(leader);
if (mm) {
mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
if (is_memory_migrate(cs))
@@ -1905,9 +1901,7 @@ struct cgroup_subsys cpuset_subsys = {
.create = cpuset_create,
.destroy = cpuset_destroy,
.can_attach = cpuset_can_attach,
- .can_attach_task = cpuset_can_attach_task,
.pre_attach = cpuset_pre_attach,
- .attach_task = cpuset_attach_task,
.attach = cpuset_attach,
.populate = cpuset_populate,
.post_clone = cpuset_post_clone,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b8785e2..95e189d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7000,10 +7000,13 @@ static int __perf_cgroup_move(void *info)
return 0;
}
-static void
-perf_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *task)
+static void perf_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup_taskset *tset)
{
- task_function_call(task, __perf_cgroup_move, task);
+ struct task_struct *task;
+
+ cgroup_taskset_for_each(task, cgrp, tset)
+ task_function_call(task, __perf_cgroup_move, task);
}
static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
@@ -7017,7 +7020,7 @@ static void perf_cgroup_exit(struct cgroup_subsys *ss, struct cgroup *cgrp,
if (!(task->flags & PF_EXITING))
return;
- perf_cgroup_attach_task(cgrp, task);
+ task_function_call(task, __perf_cgroup_move, task);
}
struct cgroup_subsys perf_subsys = {
@@ -7026,6 +7029,6 @@ struct cgroup_subsys perf_subsys = {
.create = perf_cgroup_create,
.destroy = perf_cgroup_destroy,
.exit = perf_cgroup_exit,
- .attach_task = perf_cgroup_attach_task,
+ .attach = perf_cgroup_attach,
};
#endif /* CONFIG_CGROUP_PERF */
diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..dd7e460 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8966,24 +8966,31 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
sched_destroy_group(tg);
}
-static int
-cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static int cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup_taskset *tset)
{
+ struct task_struct *task;
+
+ cgroup_taskset_for_each(task, cgrp, tset) {
#ifdef CONFIG_RT_GROUP_SCHED
- if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
- return -EINVAL;
+ if (!sched_rt_can_attach(cgroup_tg(cgrp), task))
+ return -EINVAL;
#else
- /* We don't support RT-tasks being in separate groups */
- if (tsk->sched_class != &fair_sched_class)
- return -EINVAL;
+ /* We don't support RT-tasks being in separate groups */
+ if (task->sched_class != &fair_sched_class)
+ return -EINVAL;
#endif
+ }
return 0;
}
-static void
-cpu_cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
+static void cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup_taskset *tset)
{
- sched_move_task(tsk);
+ struct task_struct *task;
+
+ cgroup_taskset_for_each(task, cgrp, tset)
+ sched_move_task(task);
}
static void
@@ -9071,8 +9078,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
.name = "cpu",
.create = cpu_cgroup_create,
.destroy = cpu_cgroup_destroy,
- .can_attach_task = cpu_cgroup_can_attach_task,
- .attach_task = cpu_cgroup_attach_task,
+ .can_attach = cpu_cgroup_can_attach,
+ .attach = cpu_cgroup_attach,
.exit = cpu_cgroup_exit,
.populate = cpu_cgroup_populate,
.subsys_id = cpu_cgroup_subsys_id,
--
1.7.6
^ permalink raw reply related
* [PATCH 3/6] cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
From: Tejun Heo @ 2011-08-23 22:19 UTC (permalink / raw)
To: rjw, paul, lizf
Cc: containers, Daisuke Nishimura, linux-kernel, James Morris,
Tejun Heo, linux-pm, KAMEZAWA Hiroyuki
In-Reply-To: <1314138000-2049-1-git-send-email-tj@kernel.org>
Currently, there's no way to pass multiple tasks to cgroup_subsys
methods necessitating the need for separate per-process and per-task
methods. This patch introduces cgroup_taskset which can be used to
pass multiple tasks and their associated cgroups to cgroup_subsys
methods.
Three methods - can_attach(), cancel_attach() and attach() - are
converted to use cgroup_taskset. This unifies passed parameters so
that all methods have access to all information. Conversions in this
patchset are identical and don't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: James Morris <jmorris@namei.org>
---
Documentation/cgroups/cgroups.txt | 26 ++++++----
include/linux/cgroup.h | 28 +++++++++-
kernel/cgroup.c | 99 +++++++++++++++++++++++++++++++++----
kernel/cgroup_freezer.c | 2 +-
kernel/cpuset.c | 18 ++++---
mm/memcontrol.c | 16 +++---
security/device_cgroup.c | 7 ++-
7 files changed, 153 insertions(+), 43 deletions(-)
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index cd67e90..2eee7cf 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -594,16 +594,21 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
called multiple times against a cgroup.
int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct task_struct *task)
+ struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
-Called prior to moving a task into a cgroup; if the subsystem
-returns an error, this will abort the attach operation. If a NULL
-task is passed, then a successful result indicates that *any*
-unspecified task can be moved into the cgroup. Note that this isn't
+Called prior to moving one or more tasks into a cgroup; if the
+subsystem returns an error, this will abort the attach operation.
+@tset contains the tasks to be attached and is guaranteed to have at
+least one task in it. If there are multiple, it's guaranteed that all
+are from the same thread group, @tset contains all tasks from the
+group whether they're actually switching cgroup or not, and the first
+task is the leader. Each @tset entry also contains the task's old
+cgroup and tasks which aren't switching cgroup can be skipped easily
+using the cgroup_taskset_for_each() iterator. Note that this isn't
called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex and it is ensured that either
-attach() or cancel_attach() will be called in future.
+remain valid while the caller holds cgroup_mutex and it is ensured
+that either attach() or cancel_attach() will be called in future.
int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
(cgroup_mutex held by caller)
@@ -613,14 +618,14 @@ attached (possibly many when using cgroup_attach_proc). Called after
can_attach.
void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct task_struct *task, bool threadgroup)
+ struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
Called when a task attach operation has failed after can_attach() has succeeded.
A subsystem whose can_attach() has some side-effects should provide this
function, so that the subsystem can implement a rollback. If not, not necessary.
This will be called only about subsystems whose can_attach() operation have
-succeeded.
+succeeded. The parameters are identical to can_attach().
void pre_attach(struct cgroup *cgrp);
(cgroup_mutex held by caller)
@@ -629,11 +634,12 @@ For any non-per-thread attachment work that needs to happen before
attach_task. Needed by cpuset.
void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct cgroup *old_cgrp, struct task_struct *task)
+ struct cgroup_taskset *tset)
(cgroup_mutex held by caller)
Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking.
+The parameters are identical to can_attach().
void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
(cgroup_mutex held by caller)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index da7e4bc..2470c8e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -457,6 +457,28 @@ void cgroup_exclude_rmdir(struct cgroup_subsys_state *css);
void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css);
/*
+ * Control Group taskset, used to pass around set of tasks to cgroup_subsys
+ * methods.
+ */
+struct cgroup_taskset;
+struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset);
+struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset);
+struct cgroup *cgroup_taskset_cur_cgroup(struct cgroup_taskset *tset);
+int cgroup_taskset_size(struct cgroup_taskset *tset);
+
+/**
+ * cgroup_taskset_for_each - iterate cgroup_taskset
+ * @task: the loop cursor
+ * @skip_cgrp: skip if task's cgroup matches this, %NULL to iterate through all
+ * @tset: taskset to iterate
+ */
+#define cgroup_taskset_for_each(task, skip_cgrp, tset) \
+ for ((task) = cgroup_taskset_first((tset)); (task); \
+ (task) = cgroup_taskset_next((tset))) \
+ if (!(skip_cgrp) || \
+ cgroup_taskset_cur_cgroup((tset)) != (skip_cgrp))
+
+/*
* Control Group subsystem type.
* See Documentation/cgroups/cgroups.txt for details
*/
@@ -467,14 +489,14 @@ struct cgroup_subsys {
int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct task_struct *tsk);
+ struct cgroup_taskset *tset);
int (*can_attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
void (*cancel_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct task_struct *tsk);
+ struct cgroup_taskset *tset);
void (*pre_attach)(struct cgroup *cgrp);
void (*attach_task)(struct cgroup *cgrp, struct task_struct *tsk);
void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct cgroup *old_cgrp, struct task_struct *tsk);
+ struct cgroup_taskset *tset);
void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
void (*exit)(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct cgroup *old_cgrp, struct task_struct *task);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cf5f3e3..474674b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1739,11 +1739,85 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
}
EXPORT_SYMBOL_GPL(cgroup_path);
+/*
+ * Control Group taskset
+ */
struct task_and_cgroup {
struct task_struct *task;
struct cgroup *cgrp;
};
+struct cgroup_taskset {
+ struct task_and_cgroup single;
+ struct flex_array *tc_array;
+ int tc_array_len;
+ int idx;
+ struct cgroup *cur_cgrp;
+};
+
+/**
+ * cgroup_taskset_first - reset taskset and return the first task
+ * @tset: taskset of interest
+ *
+ * @tset iteration is initialized and the first task is returned.
+ */
+struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset)
+{
+ if (tset->tc_array) {
+ tset->idx = 0;
+ return cgroup_taskset_next(tset);
+ } else {
+ tset->cur_cgrp = tset->single.cgrp;
+ return tset->single.task;
+ }
+}
+EXPORT_SYMBOL_GPL(cgroup_taskset_first);
+
+/**
+ * cgroup_taskset_next - iterate to the next task in taskset
+ * @tset: taskset of interest
+ *
+ * Return the next task in @tset. Iteration must have been initialized
+ * with cgroup_taskset_first().
+ */
+struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset)
+{
+ struct task_and_cgroup *tc;
+
+ if (!tset->tc_array || tset->idx >= tset->tc_array_len)
+ return NULL;
+
+ tc = flex_array_get(tset->tc_array, tset->idx++);
+ tset->cur_cgrp = tc->cgrp;
+ return tc->task;
+}
+EXPORT_SYMBOL_GPL(cgroup_taskset_next);
+
+/**
+ * cgroup_taskset_cur_cgroup - return the matching cgroup for the current task
+ * @tset: taskset of interest
+ *
+ * Return the cgroup for the current (last returned) task of @tset. This
+ * function must be preceded by either cgroup_taskset_first() or
+ * cgroup_taskset_next().
+ */
+struct cgroup *cgroup_taskset_cur_cgroup(struct cgroup_taskset *tset)
+{
+ return tset->cur_cgrp;
+}
+EXPORT_SYMBOL_GPL(cgroup_taskset_cur_cgroup);
+
+/**
+ * cgroup_taskset_size - return the number of tasks in taskset
+ * @tset: taskset of interest
+ */
+int cgroup_taskset_size(struct cgroup_taskset *tset)
+{
+ return tset->tc_array ? tset->tc_array_len : 1;
+}
+EXPORT_SYMBOL_GPL(cgroup_taskset_size);
+
+
/*
* cgroup_task_migrate - move a task from one cgroup to another.
*
@@ -1828,15 +1902,19 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
struct cgroup_subsys *ss, *failed_ss = NULL;
struct cgroup *oldcgrp;
struct cgroupfs_root *root = cgrp->root;
+ struct cgroup_taskset tset = { };
/* Nothing to do if the task is already in that cgroup */
oldcgrp = task_cgroup_from_root(tsk, root);
if (cgrp == oldcgrp)
return 0;
+ tset.single.task = tsk;
+ tset.single.cgrp = oldcgrp;
+
for_each_subsys(root, ss) {
if (ss->can_attach) {
- retval = ss->can_attach(ss, cgrp, tsk);
+ retval = ss->can_attach(ss, cgrp, &tset);
if (retval) {
/*
* Remember on which subsystem the can_attach()
@@ -1867,7 +1945,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
if (ss->attach_task)
ss->attach_task(cgrp, tsk);
if (ss->attach)
- ss->attach(ss, cgrp, oldcgrp, tsk);
+ ss->attach(ss, cgrp, &tset);
}
synchronize_rcu();
@@ -1889,7 +1967,7 @@ out:
*/
break;
if (ss->cancel_attach)
- ss->cancel_attach(ss, cgrp, tsk);
+ ss->cancel_attach(ss, cgrp, &tset);
}
}
return retval;
@@ -2005,6 +2083,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
struct task_struct *tsk;
struct task_and_cgroup *tc;
struct flex_array *group;
+ struct cgroup_taskset tset = { };
/*
* we need to make sure we have css_sets for all the tasks we're
* going to move -before- we actually start moving them, so that in
@@ -2067,6 +2146,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
} while_each_thread(leader, tsk);
/* remember the number of threads in the array for later. */
group_size = i;
+ tset.tc_array = group;
+ tset.tc_array_len = group_size;
rcu_read_unlock();
/* methods shouldn't be called if no task is actually migrating */
@@ -2079,7 +2160,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
*/
for_each_subsys(root, ss) {
if (ss->can_attach) {
- retval = ss->can_attach(ss, cgrp, leader);
+ retval = ss->can_attach(ss, cgrp, &tset);
if (retval) {
failed_ss = ss;
goto out_cancel_attach;
@@ -2169,10 +2250,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
* being moved, this call will need to be reworked to communicate that.
*/
for_each_subsys(root, ss) {
- if (ss->attach) {
- tc = flex_array_get(group, 0);
- ss->attach(ss, cgrp, tc->cgrp, tc->task);
- }
+ if (ss->attach)
+ ss->attach(ss, cgrp, &tset);
}
/*
@@ -2194,11 +2273,11 @@ out_cancel_attach:
for_each_subsys(root, ss) {
if (ss == failed_ss) {
if (cancel_failed_ss && ss->cancel_attach)
- ss->cancel_attach(ss, cgrp, leader);
+ ss->cancel_attach(ss, cgrp, &tset);
break;
}
if (ss->cancel_attach)
- ss->cancel_attach(ss, cgrp, leader);
+ ss->cancel_attach(ss, cgrp, &tset);
}
}
out_put_tasks:
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 4e82525..a2b0082 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -159,7 +159,7 @@ static void freezer_destroy(struct cgroup_subsys *ss,
*/
static int freezer_can_attach(struct cgroup_subsys *ss,
struct cgroup *new_cgroup,
- struct task_struct *task)
+ struct cgroup_taskset *tset)
{
struct freezer *freezer;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 10131fd..2e5825b 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1368,10 +1368,10 @@ static int fmeter_getrate(struct fmeter *fmp)
}
/* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
-static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
- struct task_struct *tsk)
+static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup_taskset *tset)
{
- struct cpuset *cs = cgroup_cs(cont);
+ struct cpuset *cs = cgroup_cs(cgrp);
if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
return -ENOSPC;
@@ -1384,7 +1384,7 @@ static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
* set_cpus_allowed_ptr() on all attached tasks before cpus_allowed may
* be changed.
*/
- if (tsk->flags & PF_THREAD_BOUND)
+ if (cgroup_taskset_first(tset)->flags & PF_THREAD_BOUND)
return -EINVAL;
return 0;
@@ -1434,12 +1434,14 @@ static void cpuset_attach_task(struct cgroup *cont, struct task_struct *tsk)
cpuset_update_task_spread_flag(cs, tsk);
}
-static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
- struct cgroup *oldcont, struct task_struct *tsk)
+static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct cgroup_taskset *tset)
{
struct mm_struct *mm;
- struct cpuset *cs = cgroup_cs(cont);
- struct cpuset *oldcs = cgroup_cs(oldcont);
+ struct task_struct *tsk = cgroup_taskset_first(tset);
+ struct cgroup *oldcgrp = cgroup_taskset_cur_cgroup(tset);
+ struct cpuset *cs = cgroup_cs(cgrp);
+ struct cpuset *oldcs = cgroup_cs(oldcgrp);
/*
* Change mm, possibly for multiple threads in a threadgroup. This is
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 930de94..b2802cc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5460,8 +5460,9 @@ static void mem_cgroup_clear_mc(void)
static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
struct cgroup *cgroup,
- struct task_struct *p)
+ struct cgroup_taskset *tset)
{
+ struct task_struct *p = cgroup_taskset_first(tset);
int ret = 0;
struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
@@ -5499,7 +5500,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
struct cgroup *cgroup,
- struct task_struct *p)
+ struct cgroup_taskset *tset)
{
mem_cgroup_clear_mc();
}
@@ -5616,9 +5617,9 @@ retry:
static void mem_cgroup_move_task(struct cgroup_subsys *ss,
struct cgroup *cont,
- struct cgroup *old_cont,
- struct task_struct *p)
+ struct cgroup_taskset *tset)
{
+ struct task_struct *p = cgroup_taskset_first(tset);
struct mm_struct *mm = get_task_mm(p);
if (mm) {
@@ -5633,19 +5634,18 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
#else /* !CONFIG_MMU */
static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
struct cgroup *cgroup,
- struct task_struct *p)
+ struct cgroup_taskset *tset)
{
return 0;
}
static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
struct cgroup *cgroup,
- struct task_struct *p)
+ struct cgroup_taskset *tset)
{
}
static void mem_cgroup_move_task(struct cgroup_subsys *ss,
struct cgroup *cont,
- struct cgroup *old_cont,
- struct task_struct *p)
+ struct cgroup_taskset *tset)
{
}
#endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 4450fbe..8b5b5d8 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -62,11 +62,12 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
struct cgroup_subsys devices_subsys;
static int devcgroup_can_attach(struct cgroup_subsys *ss,
- struct cgroup *new_cgroup, struct task_struct *task)
+ struct cgroup *new_cgrp, struct cgroup_taskset *set)
{
- if (current != task && !capable(CAP_SYS_ADMIN))
- return -EPERM;
+ struct task_struct *task = cgroup_taskset_first(set);
+ if (current != task && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
return 0;
}
--
1.7.6
^ permalink raw reply related
* [PATCH 2/6] cgroup: improve old cgroup handling in cgroup_attach_proc()
From: Tejun Heo @ 2011-08-23 22:19 UTC (permalink / raw)
To: rjw, paul, lizf; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314138000-2049-1-git-send-email-tj@kernel.org>
cgroup_attach_proc() behaves differently from cgroup_attach_task() in
the following aspects.
* All hooks are invoked even if no task is actually being moved.
* ->can_attach_task() is called for all tasks in the group whether the
new cgrp is different from the current cgrp or not; however,
->attach_task() is skipped if new equals new. This makes the calls
asymmetric.
This patch improves old cgroup handling in cgroup_attach_proc() by
looking up the current cgroup at the head, recording it in the flex
array along with the task itself, and using it to remove the above two
differences. This will also ease further changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
kernel/cgroup.c | 70 ++++++++++++++++++++++++++++++++++--------------------
1 files changed, 44 insertions(+), 26 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a606fa2..cf5f3e3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1739,6 +1739,11 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
}
EXPORT_SYMBOL_GPL(cgroup_path);
+struct task_and_cgroup {
+ struct task_struct *task;
+ struct cgroup *cgrp;
+};
+
/*
* cgroup_task_migrate - move a task from one cgroup to another.
*
@@ -1990,15 +1995,15 @@ static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
*/
int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
{
- int retval, i, group_size;
+ int retval, i, group_size, nr_todo;
struct cgroup_subsys *ss, *failed_ss = NULL;
bool cancel_failed_ss = false;
/* guaranteed to be initialized later, but the compiler needs this */
- struct cgroup *oldcgrp = NULL;
struct css_set *oldcg;
struct cgroupfs_root *root = cgrp->root;
/* threadgroup list cursor and array */
struct task_struct *tsk;
+ struct task_and_cgroup *tc;
struct flex_array *group;
/*
* we need to make sure we have css_sets for all the tasks we're
@@ -2017,8 +2022,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
*/
group_size = get_nr_threads(leader);
/* flex_array supports very large thread-groups better than kmalloc. */
- group = flex_array_alloc(sizeof(struct task_struct *), group_size,
- GFP_KERNEL);
+ group = flex_array_alloc(sizeof(*tc), group_size, GFP_KERNEL);
if (!group)
return -ENOMEM;
/* pre-allocate to guarantee space while iterating in rcu read-side. */
@@ -2042,8 +2046,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
}
/* take a reference on each task in the group to go in the array. */
tsk = leader;
- i = 0;
+ i = nr_todo = 0;
do {
+ struct task_and_cgroup ent;
+
/* as per above, nr_threads may decrease, but not increase. */
BUG_ON(i >= group_size);
get_task_struct(tsk);
@@ -2051,14 +2057,23 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
* saying GFP_ATOMIC has no effect here because we did prealloc
* earlier, but it's good form to communicate our expectations.
*/
- retval = flex_array_put_ptr(group, i, tsk, GFP_ATOMIC);
+ ent.task = tsk;
+ ent.cgrp = task_cgroup_from_root(tsk, root);
+ retval = flex_array_put(group, i, &ent, GFP_ATOMIC);
BUG_ON(retval != 0);
i++;
+ if (ent.cgrp != cgrp)
+ nr_todo++;
} while_each_thread(leader, tsk);
/* remember the number of threads in the array for later. */
group_size = i;
rcu_read_unlock();
+ /* methods shouldn't be called if no task is actually migrating */
+ retval = 0;
+ if (!nr_todo)
+ goto out_put_tasks;
+
/*
* step 1: check that we can legitimately attach to the cgroup.
*/
@@ -2074,8 +2089,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
if (ss->can_attach_task) {
/* run on each task in the threadgroup. */
for (i = 0; i < group_size; i++) {
- tsk = flex_array_get_ptr(group, i);
- retval = ss->can_attach_task(cgrp, tsk);
+ tc = flex_array_get(group, i);
+ if (tc->cgrp == cgrp)
+ continue;
+ retval = ss->can_attach_task(cgrp, tc->task);
if (retval) {
failed_ss = ss;
cancel_failed_ss = true;
@@ -2091,23 +2108,22 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
*/
INIT_LIST_HEAD(&newcg_list);
for (i = 0; i < group_size; i++) {
- tsk = flex_array_get_ptr(group, i);
+ tc = flex_array_get(group, i);
/* nothing to do if this task is already in the cgroup */
- oldcgrp = task_cgroup_from_root(tsk, root);
- if (cgrp == oldcgrp)
+ if (tc->cgrp == cgrp)
continue;
/* get old css_set pointer */
- task_lock(tsk);
- if (tsk->flags & PF_EXITING) {
+ task_lock(tc->task);
+ if (tc->task->flags & PF_EXITING) {
/* ignore this task if it's going away */
- task_unlock(tsk);
+ task_unlock(tc->task);
continue;
}
- oldcg = tsk->cgroups;
+ oldcg = tc->task->cgroups;
get_css_set(oldcg);
- task_unlock(tsk);
+ task_unlock(tc->task);
/* see if the new one for us is already in the list? */
- if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+ if (css_set_check_fetched(cgrp, tc->task, oldcg, &newcg_list)) {
/* was already there, nothing to do. */
put_css_set(oldcg);
} else {
@@ -2130,20 +2146,19 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
ss->pre_attach(cgrp);
}
for (i = 0; i < group_size; i++) {
- tsk = flex_array_get_ptr(group, i);
+ tc = flex_array_get(group, i);
/* leave current thread as it is if it's already there */
- oldcgrp = task_cgroup_from_root(tsk, root);
- if (cgrp == oldcgrp)
+ if (tc->cgrp == cgrp)
continue;
/* if the thread is PF_EXITING, it can just get skipped. */
- retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+ retval = cgroup_task_migrate(cgrp, tc->cgrp, tc->task, true);
BUG_ON(retval != 0 && retval != -ESRCH);
/* attach each task to each subsystem */
for_each_subsys(root, ss) {
if (ss->attach_task)
- ss->attach_task(cgrp, tsk);
+ ss->attach_task(cgrp, tc->task);
}
}
/* nothing is sensitive to fork() after this point. */
@@ -2154,8 +2169,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
* being moved, this call will need to be reworked to communicate that.
*/
for_each_subsys(root, ss) {
- if (ss->attach)
- ss->attach(ss, cgrp, oldcgrp, leader);
+ if (ss->attach) {
+ tc = flex_array_get(group, 0);
+ ss->attach(ss, cgrp, tc->cgrp, tc->task);
+ }
}
/*
@@ -2184,10 +2201,11 @@ out_cancel_attach:
ss->cancel_attach(ss, cgrp, leader);
}
}
+out_put_tasks:
/* clean up the array of referenced threads in the group. */
for (i = 0; i < group_size; i++) {
- tsk = flex_array_get_ptr(group, i);
- put_task_struct(tsk);
+ tc = flex_array_get(group, i);
+ put_task_struct(tc->task);
}
out_free_group_list:
flex_array_free(group);
--
1.7.6
^ permalink raw reply related
* [PATCH 1/6] cgroup: subsys->attach_task() should be called after migration
From: Tejun Heo @ 2011-08-23 22:19 UTC (permalink / raw)
To: rjw, paul, lizf; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314138000-2049-1-git-send-email-tj@kernel.org>
cgroup_attach_task() calls subsys->attach_task() after
cgroup_task_migrate(); however, cgroup_attach_proc() calls it before
migration. This actually affects some of the users. Update
cgroup_attach_proc() such that ->attach_task() is called after
migration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
kernel/cgroup.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1d2b6ce..a606fa2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2135,14 +2135,16 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
oldcgrp = task_cgroup_from_root(tsk, root);
if (cgrp == oldcgrp)
continue;
+
+ /* if the thread is PF_EXITING, it can just get skipped. */
+ retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+ BUG_ON(retval != 0 && retval != -ESRCH);
+
/* attach each task to each subsystem */
for_each_subsys(root, ss) {
if (ss->attach_task)
ss->attach_task(cgrp, tsk);
}
- /* if the thread is PF_EXITING, it can just get skipped. */
- retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
- BUG_ON(retval != 0 && retval != -ESRCH);
}
/* nothing is sensitive to fork() after this point. */
--
1.7.6
^ permalink raw reply related
* [PATCHSET] cgroup: introduce cgroup_taskset and consolidate subsys methods
From: Tejun Heo @ 2011-08-23 22:19 UTC (permalink / raw)
To: rjw, paul, lizf; +Cc: containers, linux-pm, linux-kernel
Hello,
cgroup has grown quite some number of subsys methods. Some of them
are overlapping, inconsistent with each other and called under
different conditions depending on whether they're called for a single
task or whole process. Unfortunately, these callbacks are complicated
and incomplete at the same time.
* ->attach_task() is called after migration for task attach but before
for process.
* Ditto for ->pre_attach().
* ->can_attach_task() is called for every task in the thread group but
->attach_task() skips the ones which don't actually change cgroups.
* Task attach becomes noop if the task isn't actually moving. Process
attach is always performed.
* ->attach_task() doesn't (or at least aren't supposed to) have access
to the old cgroup.
* During cancel, there's no way to access the affected tasks.
This patchset introduces cgroup_taskset along with some accessors and
iterator, updates methods to use it, consolidates usages and drops
superflous methods.
It contains the following six patches.
0001-cgroup-subsys-attach_task-should-be-called-after-mig.patch
0002-cgroup-improve-old-cgroup-handling-in-cgroup_attach_.patch
0003-cgroup-introduce-cgroup_taskset-and-use-it-in-subsys.patch
0004-cgroup-don-t-use-subsys-can_attach_task-or-attach_ta.patch
0005-cgroup-cpuset-don-t-use-ss-pre_attach.patch
0006-cgroup-kill-subsys-can_attach_task-pre_attach-and-at.patch
and is based on the current linux-pm/pm-freezer (7b5b95b3f5 "freezer:
remove should_send_signal() and update frozen()"), and available in
the following git tree.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git freezer
I based this on top of pm-freezer because cgroup_freezer changes
conflict (easy to resolve but still) and I'm planning on making
further changes to cgroup_freezer which will depend on both freezer
and cgroup changes. How should we route these changes?
1. As this patchset would affect other cgroup changes, it makes sense
to route these through the cgroup branch (BTW, where is it?) and
propagate things there. In that case, I'll re-spin the patches on
top of that tree and send a pull request for the merged branch to
Rafael.
2. Alternatively, if cgroup isn't expected to have too extensive
changes in this cycle, we can just funnel all these through
Rafael's tree.
3. Yet another choice would be applying these on Rafael's tree and
then pull that into cgroup tree as further changes aren't gonna
affect cgroup all that much.
What do you guys think?
Thank you.
Documentation/cgroups/cgroups.txt | 46 +++-----
block/blk-cgroup.c | 45 +++++---
include/linux/cgroup.h | 31 ++++-
kernel/cgroup.c | 200 ++++++++++++++++++++++++--------------
kernel/cgroup_freezer.c | 16 ---
kernel/cpuset.c | 105 +++++++++----------
kernel/events/core.c | 13 +-
kernel/sched.c | 31 +++--
mm/memcontrol.c | 16 +--
security/device_cgroup.c | 7 -
10 files changed, 289 insertions(+), 221 deletions(-)
--
tejun
^ permalink raw reply
* Re: [PATCH 02/11] PM: extend PM QoS with per-device wake-up constraints
From: Rafael J. Wysocki @ 2011-08-23 21:31 UTC (permalink / raw)
To: Mark Brown; +Cc: Linux PM mailing list, linux-omap, Jean Pihet, markgross
In-Reply-To: <20110823092155.GJ9232@opensource.wolfsonmicro.com>
On Tuesday, August 23, 2011, Mark Brown wrote:
> On Sun, Aug 21, 2011 at 08:05:53PM +0200, Rafael J. Wysocki wrote:
> > On Sunday, August 21, 2011, Mark Brown wrote:
>
> > > I don't understand why the driver would need to know what situation it's
> > > in. I'd been working on the basis that the idea was that the driver
> > > said what the constraints it has are and then some code with a more
> > > system wide view would make the actual decisions for things outside the
> > > driver domian.
>
> > That's correct, but in order to figure out a "sensible default"
> > the driver generally would need to know what the expectations with
> > respect to it are. Otherwise it can very well generate a random
> > number and use that.
>
> Right, but I'd expect that should be something that the driver can
> generally do based on knowledge of what it needs to deliver to its
> users. It doesn't need to be tunable to that - for example, input
> devices will have a reasonable idea of the response time needed to
> deliver interactive performance. If it's really got no idea then
> hopefully not providng any constraints will do something sensible.
Perhaps. Still, that requires some policy to be put into drivers,
which I don't think is entirely correct.
Thanks,
Rafael
^ permalink raw reply
* Re: [PATCH] Add regulator driver for the bq2407x family of charger ICs
From: Heiko Stübner @ 2011-08-23 20:15 UTC (permalink / raw)
To: Mark Brown; +Cc: linux-pm, Liam Girdwood
In-Reply-To: <20110823115024.GL9232@opensource.wolfsonmicro.com>
Am Dienstag 23 August 2011, 13:50:24 schrieb Mark Brown:
> On Sat, Aug 20, 2011 at 10:24:51PM +0200, Heiko Stübner wrote:
>
> Mostly looks good, just a few fairly small comments below.
>
> > + if (pdata->max_uA && pdata->max_uA > 500000
> > + && max_uA >= pdata->max_uA) {
> > + dev_dbg(rdev_get_dev(rdev),
> > + "setting current limit to %d mA\n",
> > + pdata->max_uA / 1000);
> > + gpio_set_value(pdata->gpio_en2, 1);
> > + gpio_set_value(pdata->gpio_en1, 0);
> > + } else if (max_uA >= 500000) {
> > + dev_dbg(rdev_get_dev(rdev),
> > + "setting current limit to 500 mA\n");
> > + gpio_set_value(pdata->gpio_en2, 0);
> > + gpio_set_value(pdata->gpio_en1, 1);
> > + } else if (max_uA >= 100000) {
> > + dev_dbg(rdev_get_dev(rdev),
> > + "setting current limit to 100 mA\n");
> > + gpio_set_value(pdata->gpio_en2, 0);
> > + gpio_set_value(pdata->gpio_en1, 0);
> > + } else {
> > + dev_dbg(rdev_get_dev(rdev),
> > + "setting current limit to 0 mA\n");
> > + gpio_set_value(pdata->gpio_en2, 1);
> > + gpio_set_value(pdata->gpio_en1, 1);
> > + }
>
> I'd rather expect this to return an error sometimes.
gpio_set_value is a void function, so I'm not sure what could cause an error
here.
Heiko
^ permalink raw reply
* Re: [PATCH 2/4] PM: Reference counting of power.subsys_data
From: Pavel Machek @ 2011-08-23 20:09 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM mailing list, Greg KH, LKML, linux-sh
In-Reply-To: <201108090031.42691.rjw@sisk.pl>
Hi!
> Since the power.subsys_data device field will be used by multiple
> filesystems, introduce a reference counting mechanism for it to avoid
> freeing it prematurely or changing its value at a wrong time.
>
> Make the PM clocks management code that currently is the only user of
> power.subsys_data use the new reference counting.
>
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> ---
> drivers/base/power/Makefile | 2
> drivers/base/power/clock_ops.c | 24 ++---------
> drivers/base/power/common.c | 86 +++++++++++++++++++++++++++++++++++++++++
> include/linux/pm.h | 3 +
> 4 files changed, 95 insertions(+), 20 deletions(-)
>
> Index: linux-2.6/drivers/base/power/common.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/drivers/base/power/common.c
> @@ -0,0 +1,86 @@
> +/*
> + * drivers/base/power/common.c - Common device power management code.
> + *
> + * Copyright (C) 2011 Rafael J. Wysocki <rjw@sisk.pl>, Renesas Electronics Corp.
This looks like you now work for Renesas....?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply
* Re: [PATCH v7 2/4] PM: Introduce DEVFREQ: generic DVFS framework with device-specific OPPs
From: Turquette, Mike @ 2011-08-23 17:34 UTC (permalink / raw)
To: MyungJoo Ham
Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, linux-pm,
Thomas Gleixner
In-Reply-To: <1314086044-24659-3-git-send-email-myungjoo.ham@samsung.com>
On Tue, Aug 23, 2011 at 12:54 AM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
> With OPPs, a device may have multiple operable frequency and voltage
> sets. However, there can be multiple possible operable sets and a system
> will need to choose one from them. In order to reduce the power
> consumption (by reducing frequency and voltage) without affecting the
> performance too much, a Dynamic Voltage and Frequency Scaling (DVFS)
> scheme may be used.
>
> This patch introduces the DVFS capability to non-CPU devices with OPPs.
> DVFS is a techique whereby the frequency and supplied voltage of a
> device is adjusted on-the-fly. DVFS usually sets the frequency as low
> as possible with given conditions (such as QoS assurance) and adjusts
> voltage according to the chosen frequency in order to reduce power
> consumption and heat dissipation.
>
> The generic DVFS for devices, DEVFREQ, may appear quite similar with
> /drivers/cpufreq. However, CPUFREQ does not allow to have multiple
> devices registered and is not suitable to have multiple heterogenous
> devices with different (but simple) governors.
>
> Normally, DVFS mechanism controls frequency based on the demand for
> the device, and then, chooses voltage based on the chosen frequency.
> DEVFREQ also controls the frequency based on the governor's frequency
> recommendation and let OPP pick up the pair of frequency and voltage
> based on the recommended frequency. Then, the chosen OPP is passed to
> device driver's "target" callback.
>
> When PM QoS is going to be used with the DEVFREQ device, the device
> driver should enable OPPs that are appropriate with the current PM QoS
> requests. In order to do so, the device driver may call opp_enable and
> opp_disable at the notifier callback of PM QoS so that PM QoS's
> update_target() call enables the appropriate OPPs. Note that at least
> one of OPPs should be enabled at any time; be careful when there is a
> transition.
>
> Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Looks OK to me.
Reviewed-by: Mike Turquette <mturquette@ti.com>
Regards,
Mike
>
> ---
> The test code with board support for Exynos4-NURI is at
> http://git.infradead.org/users/kmpark/linux-2.6-samsung/shortlog/refs/heads/devfreq
>
> ---
> Thank you for your valuable comments, Rafael, Greg, Pavel, Colin, Mike,
> and Kevin.
> Changes from v6
> - Type revised for timing variables
> - Removed unnecessary code and variable
>
> Changes at v6-resubmit from v6
> - Use jiffy directly instead of ktime
> - Be prepared for profile->polling_ms changes (not supported fully at
> this stage)
>
> Changes from v5
> - Uses OPP availability change notifier
> - Removed devfreq_interval. Uses one jiffy instead. DEVFREQ adjusts
> polling interval based on the interval requirement of DEVFREQ
> devices.
> - Moved devfreq to /drivers/devfreq to accomodate devfreq-related files
> including governors and devfreq drivers.
> - Coding style revised.
> - Updated devfreq_add_device interface to get tunable values.
>
> Changed from v4
> - Removed tickle, which is a duplicated feature; PM QoS can do the same.
> - Allow to extend polling interval if devices have longer polling intervals.
> - Relocated private data of governors.
> - Removed system-wide sysfs
>
> Changed from v3
> - In kerneldoc comments, DEVFREQ has ben replaced by devfreq
> - Revised removing devfreq entries with error mechanism
> - Added and revised comments
> - Removed unnecessary codes
> - Allow to give a name to a governor
> - Bugfix: a tickle call may cancel an older tickle call that is still in
> effect.
>
> Changed from v2
> - Code style revised and cleaned up.
> - Remove DEVFREQ entries that incur errors except for EAGAIN
> - Bug fixed: tickle for devices without polling governors
>
> Changes from v1(RFC)
> - Rename: DVFS --> DEVFREQ
> - Revised governor design
> . Governor receives the whole struct devfreq
> . Governor should gather usage information (thru get_dev_status) itself
> - Periodic monitoring runs only when needed.
> - DEVFREQ no more deals with voltage information directly
> - Removed some printks.
> - Some cosmetics update
> - Use freezable_wq.
> ---
> drivers/Kconfig | 2 +
> drivers/Makefile | 2 +
> drivers/devfreq/Kconfig | 39 ++++++
> drivers/devfreq/Makefile | 1 +
> drivers/devfreq/devfreq.c | 297 +++++++++++++++++++++++++++++++++++++++++++++
> include/linux/devfreq.h | 107 ++++++++++++++++
> 6 files changed, 448 insertions(+), 0 deletions(-)
> create mode 100644 drivers/devfreq/Kconfig
> create mode 100644 drivers/devfreq/Makefile
> create mode 100644 drivers/devfreq/devfreq.c
> create mode 100644 include/linux/devfreq.h
>
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index 95b9e7e..a1efd75 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -130,4 +130,6 @@ source "drivers/iommu/Kconfig"
>
> source "drivers/virt/Kconfig"
>
> +source "drivers/devfreq/Kconfig"
> +
> endmenu
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 7fa433a..97c957b 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -127,3 +127,5 @@ obj-$(CONFIG_IOMMU_SUPPORT) += iommu/
>
> # Virtualization drivers
> obj-$(CONFIG_VIRT_DRIVERS) += virt/
> +
> +obj-$(CONFIG_PM_DEVFREQ) += devfreq/
> diff --git a/drivers/devfreq/Kconfig b/drivers/devfreq/Kconfig
> new file mode 100644
> index 0000000..1fb42de
> --- /dev/null
> +++ b/drivers/devfreq/Kconfig
> @@ -0,0 +1,39 @@
> +config ARCH_HAS_DEVFREQ
> + bool
> + depends on ARCH_HAS_OPP
> + help
> + Denotes that the architecture supports DEVFREQ. If the architecture
> + supports multiple OPP entries per device and the frequency of the
> + devices with OPPs may be altered dynamically, the architecture
> + supports DEVFREQ.
> +
> +menuconfig PM_DEVFREQ
> + bool "Generic Dynamic Voltage and Frequency Scaling (DVFS) support"
> + depends on PM_OPP && ARCH_HAS_DEVFREQ
> + help
> + With OPP support, a device may have a list of frequencies and
> + voltages available. DEVFREQ, a generic DVFS framework can be
> + registered for a device with OPP support in order to let the
> + governor provided to DEVFREQ choose an operating frequency
> + based on the OPP's list and the policy given with DEVFREQ.
> +
> + Each device may have its own governor and policy. DEVFREQ can
> + reevaluate the device state periodically and/or based on the
> + OPP list changes (each frequency/voltage pair in OPP may be
> + disabled or enabled).
> +
> + Like some CPUs with CPUFREQ, a device may have multiple clocks.
> + However, because the clock frequencies of a single device are
> + determined by the single device's state, an instance of DEVFREQ
> + is attached to a single device and returns a "representative"
> + clock frequency from the OPP of the device, which is also attached
> + to a device by 1-to-1. The device registering DEVFREQ takes the
> + responsiblity to "interpret" the frequency listed in OPP and
> + to set its every clock accordingly with the "target" callback
> + given to DEVFREQ.
> +
> +if PM_DEVFREQ
> +
> +comment "DEVFREQ Drivers"
> +
> +endif # PM_DEVFREQ
> diff --git a/drivers/devfreq/Makefile b/drivers/devfreq/Makefile
> new file mode 100644
> index 0000000..168934a
> --- /dev/null
> +++ b/drivers/devfreq/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_PM_DEVFREQ) += devfreq.o
> diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
> new file mode 100644
> index 0000000..df63bdc
> --- /dev/null
> +++ b/drivers/devfreq/devfreq.c
> @@ -0,0 +1,297 @@
> +/*
> + * devfreq: Generic Dynamic Voltage and Frequency Scaling (DVFS) Framework
> + * for Non-CPU Devices Based on OPP.
> + *
> + * Copyright (C) 2011 Samsung Electronics
> + * MyungJoo Ham <myungjoo.ham@samsung.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/err.h>
> +#include <linux/init.h>
> +#include <linux/slab.h>
> +#include <linux/opp.h>
> +#include <linux/devfreq.h>
> +#include <linux/workqueue.h>
> +#include <linux/platform_device.h>
> +#include <linux/list.h>
> +#include <linux/printk.h>
> +#include <linux/hrtimer.h>
> +
> +/*
> + * devfreq_work periodically monitors every registered device.
> + * The minimum polling interval is one jiffy. The polling interval is
> + * determined by the minimum polling period among all polling devfreq
> + * devices. The resolution of polling interval is one jiffy.
> + */
> +static bool polling;
> +static struct workqueue_struct *devfreq_wq;
> +static struct delayed_work devfreq_work;
> +
> +/* The list of all device-devfreq */
> +static LIST_HEAD(devfreq_list);
> +static DEFINE_MUTEX(devfreq_list_lock);
> +
> +/**
> + * find_device_devfreq() - find devfreq struct using device pointer
> + * @dev: device pointer used to lookup device devfreq.
> + *
> + * Search the list of device devfreqs and return the matched device's
> + * devfreq info. devfreq_list_lock should be held by the caller.
> + */
> +static struct devfreq *find_device_devfreq(struct device *dev)
> +{
> + struct devfreq *tmp_devfreq;
> +
> + if (unlikely(IS_ERR_OR_NULL(dev))) {
> + pr_err("DEVFREQ: %s: Invalid parameters\n", __func__);
> + return ERR_PTR(-EINVAL);
> + }
> +
> + list_for_each_entry(tmp_devfreq, &devfreq_list, node) {
> + if (tmp_devfreq->dev == dev)
> + return tmp_devfreq;
> + }
> +
> + return ERR_PTR(-ENODEV);
> +}
> +
> +/**
> + * devfreq_do() - Check the usage profile of a given device and configure
> + * frequency and voltage accordingly
> + * @devfreq: devfreq info of the given device
> + */
> +static int devfreq_do(struct devfreq *devfreq)
> +{
> + struct opp *opp;
> + unsigned long freq;
> + int err;
> +
> + err = devfreq->governor->get_target_freq(devfreq, &freq);
> + if (err)
> + return err;
> +
> + opp = opp_find_freq_ceil(devfreq->dev, &freq);
> + if (opp == ERR_PTR(-ENODEV))
> + opp = opp_find_freq_floor(devfreq->dev, &freq);
> +
> + if (IS_ERR(opp))
> + return PTR_ERR(opp);
> +
> + if (devfreq->previous_freq == freq)
> + return 0;
> +
> + err = devfreq->profile->target(devfreq->dev, opp);
> + if (err)
> + return err;
> +
> + devfreq->previous_freq = freq;
> + return 0;
> +}
> +
> +/**
> + * devfreq_update() - Notify that the device OPP has been changed.
> + * @dev: the device whose OPP has been changed.
> + */
> +static int devfreq_update(struct notifier_block *nb, unsigned long type,
> + void *devp)
> +{
> + struct devfreq *devfreq;
> + int err = 0;
> +
> + mutex_lock(&devfreq_list_lock);
> + devfreq = container_of(nb, struct devfreq, nb);
> + /* Reevaluate the proper frequency */
> + err = devfreq_do(devfreq);
> + mutex_unlock(&devfreq_list_lock);
> + return err;
> +}
> +
> +/**
> + * devfreq_monitor() - Periodically run devfreq_do()
> + * @work: the work struct used to run devfreq_monitor periodically.
> + *
> + */
> +static void devfreq_monitor(struct work_struct *work)
> +{
> + static unsigned long last_polled_at;
> + struct devfreq *devfreq, *tmp;
> + int error;
> + unsigned long jiffies_passed;
> + unsigned long next_jiffies = ULONG_MAX, now = jiffies;
> +
> + /* Initially last_polled_at = 0, polling every device at bootup */
> + jiffies_passed = now - last_polled_at;
> + last_polled_at = now;
> + if (jiffies_passed == 0)
> + jiffies_passed = 1;
> +
> + mutex_lock(&devfreq_list_lock);
> +
> + list_for_each_entry_safe(devfreq, tmp, &devfreq_list, node) {
> + if (devfreq->next_polling == 0)
> + continue;
> +
> + /*
> + * Reduce more next_polling if devfreq_wq took an extra
> + * delay. (i.e., CPU has been idled.)
> + */
> + if (devfreq->next_polling <= jiffies_passed) {
> + error = devfreq_do(devfreq);
> +
> + /* Remove a devfreq with an error. */
> + if (error && error != -EAGAIN) {
> + dev_err(devfreq->dev, "Due to devfreq_do error(%d), devfreq(%s) is removed from the device\n",
> + error, devfreq->governor->name);
> +
> + list_del(&devfreq->node);
> + kfree(devfreq);
> +
> + continue;
> + }
> + devfreq->next_polling = devfreq->polling_jiffies;
> +
> + /* No more polling required (polling_ms changed) */
> + if (devfreq->next_polling == 0)
> + continue;
> + } else {
> + devfreq->next_polling -= jiffies_passed;
> + }
> +
> + next_jiffies = (next_jiffies > devfreq->next_polling) ?
> + devfreq->next_polling : next_jiffies;
> + }
> +
> + if (next_jiffies > 0 && next_jiffies < ULONG_MAX) {
> + polling = true;
> + queue_delayed_work(devfreq_wq, &devfreq_work, next_jiffies);
> + } else {
> + polling = false;
> + }
> +
> + mutex_unlock(&devfreq_list_lock);
> +}
> +
> +/**
> + * devfreq_add_device() - Add devfreq feature to the device
> + * @dev: the device to add devfreq feature.
> + * @profile: device-specific profile to run devfreq.
> + * @governor: the policy to choose frequency.
> + * @data: private data for the governor. The devfreq framework does not
> + * touch this value.
> + */
> +int devfreq_add_device(struct device *dev, struct devfreq_dev_profile *profile,
> + struct devfreq_governor *governor, void *data)
> +{
> + struct devfreq *devfreq;
> + struct srcu_notifier_head *nh;
> + int err = 0;
> +
> + if (!dev || !profile || !governor) {
> + dev_err(dev, "%s: Invalid parameters.\n", __func__);
> + return -EINVAL;
> + }
> +
> + mutex_lock(&devfreq_list_lock);
> +
> + devfreq = find_device_devfreq(dev);
> + if (!IS_ERR(devfreq)) {
> + dev_err(dev, "%s: Unable to create devfreq for the device. It already has one.\n", __func__);
> + err = -EINVAL;
> + goto out;
> + }
> +
> + devfreq = kzalloc(sizeof(struct devfreq), GFP_KERNEL);
> + if (!devfreq) {
> + dev_err(dev, "%s: Unable to create devfreq for the device\n",
> + __func__);
> + err = -ENOMEM;
> + goto out;
> + }
> +
> + devfreq->dev = dev;
> + devfreq->profile = profile;
> + devfreq->governor = governor;
> + devfreq->next_polling = devfreq->polling_jiffies
> + = msecs_to_jiffies(devfreq->profile->polling_ms);
> + devfreq->previous_freq = profile->initial_freq;
> + devfreq->data = data;
> +
> + devfreq->nb.notifier_call = devfreq_update;
> + nh = opp_get_notifier(dev);
> + if (IS_ERR(nh)) {
> + err = PTR_ERR(nh);
> + goto out;
> + }
> + err = srcu_notifier_chain_register(nh, &devfreq->nb);
> + if (err)
> + goto out;
> +
> + list_add(&devfreq->node, &devfreq_list);
> +
> + if (devfreq_wq && devfreq->next_polling && !polling) {
> + polling = true;
> + queue_delayed_work(devfreq_wq, &devfreq_work,
> + devfreq->next_polling);
> + }
> +out:
> + mutex_unlock(&devfreq_list_lock);
> +
> + return err;
> +}
> +
> +/**
> + * devfreq_remove_device() - Remove devfreq feature from a device.
> + * @device: the device to remove devfreq feature.
> + */
> +int devfreq_remove_device(struct device *dev)
> +{
> + struct devfreq *devfreq;
> + struct srcu_notifier_head *nh;
> + int err = 0;
> +
> + if (!dev)
> + return -EINVAL;
> +
> + mutex_lock(&devfreq_list_lock);
> + devfreq = find_device_devfreq(dev);
> + if (IS_ERR(devfreq)) {
> + err = PTR_ERR(devfreq);
> + goto out;
> + }
> +
> + nh = opp_get_notifier(dev);
> + if (IS_ERR(nh)) {
> + err = PTR_ERR(nh);
> + goto out;
> + }
> +
> + list_del(&devfreq->node);
> + srcu_notifier_chain_unregister(nh, &devfreq->nb);
> + kfree(devfreq);
> +out:
> + mutex_unlock(&devfreq_list_lock);
> + return 0;
> +}
> +
> +/**
> + * devfreq_init() - Initialize data structure for devfreq framework and
> + * start polling registered devfreq devices.
> + */
> +static int __init devfreq_init(void)
> +{
> + mutex_lock(&devfreq_list_lock);
> + polling = false;
> + devfreq_wq = create_freezable_workqueue("devfreq_wq");
> + INIT_DELAYED_WORK_DEFERRABLE(&devfreq_work, devfreq_monitor);
> + mutex_unlock(&devfreq_list_lock);
> +
> + devfreq_monitor(&devfreq_work.work);
> + return 0;
> +}
> +late_initcall(devfreq_init);
> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
> new file mode 100644
> index 0000000..8252238
> --- /dev/null
> +++ b/include/linux/devfreq.h
> @@ -0,0 +1,107 @@
> +/*
> + * devfreq: Generic Dynamic Voltage and Frequency Scaling (DVFS) Framework
> + * for Non-CPU Devices Based on OPP.
> + *
> + * Copyright (C) 2011 Samsung Electronics
> + * MyungJoo Ham <myungjoo.ham@samsung.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef __LINUX_DEVFREQ_H__
> +#define __LINUX_DEVFREQ_H__
> +
> +#include <linux/notifier.h>
> +
> +#define DEVFREQ_NAME_LEN 16
> +
> +struct devfreq;
> +struct devfreq_dev_status {
> + /* both since the last measure */
> + unsigned long total_time;
> + unsigned long busy_time;
> + unsigned long current_frequency;
> +};
> +
> +struct devfreq_dev_profile {
> + unsigned long max_freq; /* may be larger than the actual value */
> + unsigned long initial_freq;
> + unsigned int polling_ms; /* 0 for at opp change only */
> +
> + int (*target)(struct device *dev, struct opp *opp);
> + int (*get_dev_status)(struct device *dev,
> + struct devfreq_dev_status *stat);
> +};
> +
> +/**
> + * struct devfreq_governor - Devfreq policy governor
> + * @name Governor's name
> + * @get_target_freq Returns desired operating frequency for the device.
> + * Basically, get_target_freq will run
> + * devfreq_dev_profile.get_dev_status() to get the
> + * status of the device (load = busy_time / total_time).
> + */
> +struct devfreq_governor {
> + char name[DEVFREQ_NAME_LEN];
> + int (*get_target_freq)(struct devfreq *this, unsigned long *freq);
> +};
> +
> +/**
> + * struct devfreq - Device devfreq structure
> + * @node list node - contains the devices with devfreq that have been
> + * registered.
> + * @dev device pointer
> + * @profile device-specific devfreq profile
> + * @governor method how to choose frequency based on the usage.
> + * @nb notifier block registered to the corresponding OPP to get
> + * notified for frequency availability updates.
> + * @polling_jiffies interval in jiffies.
> + * @previous_freq previously configured frequency value.
> + * @next_polling the number of remaining jiffies to poll with
> + * "devfreq_monitor" executions to reevaluate
> + * frequency/voltage of the device. Set by
> + * profile's polling_ms interval.
> + * @data Private data of the governor. The devfreq framework does not
> + * touch this.
> + *
> + * This structure stores the devfreq information for a give device.
> + */
> +struct devfreq {
> + struct list_head node;
> +
> + struct device *dev;
> + struct devfreq_dev_profile *profile;
> + struct devfreq_governor *governor;
> + struct notifier_block nb;
> +
> + unsigned long polling_jiffies;
> + unsigned long previous_freq;
> + unsigned int next_polling;
> +
> + void *data; /* private data for governors */
> +};
> +
> +#if defined(CONFIG_PM_DEVFREQ)
> +extern int devfreq_add_device(struct device *dev,
> + struct devfreq_dev_profile *profile,
> + struct devfreq_governor *governor,
> + void *data);
> +extern int devfreq_remove_device(struct device *dev);
> +#else /* !CONFIG_PM_DEVFREQ */
> +static int devfreq_add_device(struct device *dev,
> + struct devfreq_dev_profile *profile,
> + struct devfreq_governor *governor,
> + void *data)
> +{
> + return 0;
> +}
> +
> +static int devfreq_remove_device(struct device *dev)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_PM_DEVFREQ */
> +
> +#endif /* __LINUX_DEVFREQ_H__ */
> --
> 1.7.4.1
>
>
^ permalink raw reply
* Re: [PATCH v7 4/4] PM / DEVFREQ: add sysfs interface
From: Turquette, Mike @ 2011-08-23 17:34 UTC (permalink / raw)
To: MyungJoo Ham
Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, linux-pm,
Thomas Gleixner
In-Reply-To: <1314086044-24659-5-git-send-email-myungjoo.ham@samsung.com>
On Tue, Aug 23, 2011 at 12:54 AM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
> Device specific sysfs interface /sys/devices/.../power/devfreq_*
> - governor R: name of governor
> - cur_freq R: current frequency
> - max_freq R: maximum operable frequency
> - min_freq R: minimum operable frequency
> - set_freq R: read user specified frequency (0 if not specified yet)
> W: set user specified frequency
> - polling_interval R: polling interval in ms given with devfreq profile
> W: update polling interval.
>
> Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>
> --
> Changed from v6
> - poling_interval is writable.
>
> Changed from v5
> - updated devferq_update usage.
>
> Changed from v4
> - removed system-wide sysfs interface
> - removed tickling sysfs interface
> - added set_freq for userspace governor (and any other governors that
> require user input)
>
> Changed from v3
> - corrected sysfs API usage
> - corrected error messages
> - moved sysfs entry location
> - added sysfs entries
>
> Changed from v2
> - add ABI entries for devfreq sysfs interface
> ---
> Documentation/ABI/testing/sysfs-devices-power | 47 +++++++
> drivers/devfreq/devfreq.c | 185 +++++++++++++++++++++++++
> 2 files changed, 232 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power
> index 8ffbc25..ba8bc35 100644
> --- a/Documentation/ABI/testing/sysfs-devices-power
> +++ b/Documentation/ABI/testing/sysfs-devices-power
> @@ -165,3 +165,50 @@ Description:
>
> Not all drivers support this attribute. If it isn't supported,
> attempts to read or write it will yield I/O errors.
> +
> +What: /sys/devices/.../power/devfreq_governor
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_governor shows the name
> + of the governor used by the corresponding device.
> +
> +What: /sys/devices/.../power/devfreq_cur_freq
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_cur_freq shows the current
> + frequency of the corresponding device.
> +
> +What: /sys/devices/.../power/devfreq_max_freq
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_cur_freq shows the
Copy/paste error? Description should reference devfreq_max_freq not
devfreq_cur_freq.
> + maximum operable frequency of the corresponding device.
> +
> +What: /sys/devices/.../power/devfreq_min_freq
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_cur_freq shows the
> + minimum operable frequency of the corresponding device.
Similar to the above.
> +
> +What: /sys/devices/.../power/devfreq_set_freq
> +Date: August 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_set_freq sets and shows
> + the user specified desired frequency of the device. The
> + governor may and may not use the value. With the basic
> + governors given with devfreq.c, userspace governor is
> + using the value.
As I stated in patch 3, this should conditionally exist only if the
userspace governor is being used for this device.
The existing devfreq_cur_freq covers the read-only case well.
Regards,
Mike
> +
> +What: /sys/devices/.../power/devfreq_polling_interval
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_polling_interval sets and
> + shows the requested polling interval of the corresponding
> + device. The values are represented in ms. If the value is less
> + than 1 jiffy, it is considered to be 0, which means no polling.
> diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
> index df63bdc..3070250 100644
> --- a/drivers/devfreq/devfreq.c
> +++ b/drivers/devfreq/devfreq.c
> @@ -37,6 +37,8 @@ static struct delayed_work devfreq_work;
> static LIST_HEAD(devfreq_list);
> static DEFINE_MUTEX(devfreq_list_lock);
>
> +static struct attribute_group dev_attr_group;
> +
> /**
> * find_device_devfreq() - find devfreq struct using device pointer
> * @dev: device pointer used to lookup device devfreq.
> @@ -149,6 +151,8 @@ static void devfreq_monitor(struct work_struct *work)
> dev_err(devfreq->dev, "Due to devfreq_do error(%d), devfreq(%s) is removed from the device\n",
> error, devfreq->governor->name);
>
> + sysfs_unmerge_group(&devfreq->dev->kobj,
> + &dev_attr_group);
> list_del(&devfreq->node);
> kfree(devfreq);
>
> @@ -239,6 +243,8 @@ int devfreq_add_device(struct device *dev, struct devfreq_dev_profile *profile,
> queue_delayed_work(devfreq_wq, &devfreq_work,
> devfreq->next_polling);
> }
> +
> + sysfs_merge_group(&dev->kobj, &dev_attr_group);
> out:
> mutex_unlock(&devfreq_list_lock);
>
> @@ -271,6 +277,8 @@ int devfreq_remove_device(struct device *dev)
> goto out;
> }
>
> + sysfs_unmerge_group(&dev->kobj, &dev_attr_group);
> +
> list_del(&devfreq->node);
> srcu_notifier_chain_unregister(nh, &devfreq->nb);
> kfree(devfreq);
> @@ -279,6 +287,183 @@ out:
> return 0;
> }
>
> +static ssize_t show_governor(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df = find_device_devfreq(dev);
> +
> + if (IS_ERR(df))
> + return PTR_ERR(df);
> + if (!df->governor)
> + return -EINVAL;
> +
> + return sprintf(buf, "%s\n", df->governor->name);
> +}
> +
> +static ssize_t show_freq(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df = find_device_devfreq(dev);
> +
> + if (IS_ERR(df))
> + return PTR_ERR(df);
> +
> + return sprintf(buf, "%lu\n", df->previous_freq);
> +}
> +
> +static ssize_t show_max_freq(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df = find_device_devfreq(dev);
> + unsigned long freq = ULONG_MAX;
> + struct opp *opp;
> +
> + if (IS_ERR(df))
> + return PTR_ERR(df);
> + if (!df->dev)
> + return -EINVAL;
> +
> + opp = opp_find_freq_floor(df->dev, &freq);
> + if (IS_ERR(opp))
> + return PTR_ERR(opp);
> +
> + return sprintf(buf, "%lu\n", freq);
> +}
> +
> +static ssize_t show_min_freq(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df = find_device_devfreq(dev);
> + unsigned long freq = 0;
> + struct opp *opp;
> +
> + if (IS_ERR(df))
> + return PTR_ERR(df);
> + if (!df->dev)
> + return -EINVAL;
> +
> + opp = opp_find_freq_ceil(df->dev, &freq);
> + if (IS_ERR(opp))
> + return PTR_ERR(opp);
> +
> + return sprintf(buf, "%lu\n", freq);
> +}
> +
> +static ssize_t show_polling_interval(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df = find_device_devfreq(dev);
> +
> + if (IS_ERR(df))
> + return PTR_ERR(df);
> + if (!df->profile)
> + return -EINVAL;
> +
> + return sprintf(buf, "%d\n", df->profile->polling_ms);
> +}
> +
> +static ssize_t store_polling_interval(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct devfreq *df = find_device_devfreq(dev);
> + unsigned int value;
> + int ret;
> +
> + if (IS_ERR(df))
> + return PTR_ERR(df);
> + if (!df->profile)
> + return -EINVAL;
> +
> + ret = sscanf(buf, "%u", &value);
> + if (ret != 1)
> + return -EINVAL;
> +
> + df->profile->polling_ms = value;
> + df->next_polling = df->polling_jiffies
> + = msecs_to_jiffies(value);
> +
> + mutex_lock(&devfreq_list_lock);
> + if (df->next_polling > 0 && !polling) {
> + polling = true;
> + queue_delayed_work(devfreq_wq, &devfreq_work,
> + df->next_polling);
> + }
> + mutex_unlock(&devfreq_list_lock);
> +
> + return count;
> +}
> +
> +static ssize_t set_user_frequency(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct devfreq *devfreq;
> + unsigned long wanted;
> + int err = 0;
> +
> + sscanf(buf, "%lu", &wanted);
> +
> + mutex_lock(&devfreq_list_lock);
> + devfreq = find_device_devfreq(dev);
> +
> + if (IS_ERR(devfreq)) {
> + err = PTR_ERR(devfreq);
> + goto out;
> + }
> +
> + devfreq->user_set_freq = wanted;
> + err = count;
> +out:
> + mutex_unlock(&devfreq_list_lock);
> + if (err >= 0)
> + devfreq_update(&devfreq->nb, 0, NULL);
> + return err;
> +}
> +
> +static ssize_t show_user_frequency(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *devfreq;
> + int err = 0;
> +
> + mutex_lock(&devfreq_list_lock);
> + devfreq = find_device_devfreq(dev);
> +
> + if (IS_ERR(devfreq)) {
> + err = PTR_ERR(devfreq);
> + goto out;
> + }
> +
> + err = sprintf(buf, "%lu\n", devfreq->user_set_freq);
> +out:
> + mutex_unlock(&devfreq_list_lock);
> + return err;
> +
> +}
> +
> +static DEVICE_ATTR(devfreq_governor, 0444, show_governor, NULL);
> +static DEVICE_ATTR(devfreq_cur_freq, 0444, show_freq, NULL);
> +static DEVICE_ATTR(devfreq_max_freq, 0444, show_max_freq, NULL);
> +static DEVICE_ATTR(devfreq_min_freq, 0444, show_min_freq, NULL);
> +static DEVICE_ATTR(devfreq_polling_interval, 0644, show_polling_interval,
> + store_polling_interval);
> +static DEVICE_ATTR(devfreq_set_freq, 0644, show_user_frequency,
> + set_user_frequency);
> +static struct attribute *dev_entries[] = {
> + &dev_attr_devfreq_governor.attr,
> + &dev_attr_devfreq_cur_freq.attr,
> + &dev_attr_devfreq_max_freq.attr,
> + &dev_attr_devfreq_min_freq.attr,
> + &dev_attr_devfreq_polling_interval.attr,
> + &dev_attr_devfreq_set_freq.attr,
> + NULL,
> +};
> +static struct attribute_group dev_attr_group = {
> + .name = power_group_name,
> + .attrs = dev_entries,
> +};
> +
> /**
> * devfreq_init() - Initialize data structure for devfreq framework and
> * start polling registered devfreq devices.
> --
> 1.7.4.1
>
>
^ permalink raw reply
* Re: [PATCH v7 3/4] PM / DEVFREQ: add basic governors
From: Turquette, Mike @ 2011-08-23 17:29 UTC (permalink / raw)
To: MyungJoo Ham
Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, linux-pm,
Thomas Gleixner
In-Reply-To: <1314086044-24659-4-git-send-email-myungjoo.ham@samsung.com>
On Tue, Aug 23, 2011 at 12:54 AM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
> Four CPUFREQ-like governors are provided as examples.
>
> powersave: use the lowest frequency possible. The user (device) should
> set the polling_ms as 0 because polling is useless for this governor.
>
> performance: use the highest freqeuncy possible. The user (device)
> should set the polling_ms as 0 because polling is useless for this
> governor.
>
> userspace: use the user specified frequency stored at
> devfreq.user_set_freq. With sysfs support in the following patch, a user
> may set the value with the sysfs interface.
>
> simple_ondemand: simplified version of CPUFREQ's ONDEMAND governor.
>
> When a user updates OPP entries (enable/disable/add), OPP framework
> automatically notifies DEVFREQ to update operating frequency
> accordingly. Thus, DEVFREQ users (device drivers) do not need to update
> DEVFREQ manually with OPP entry updates or set polling_ms for powersave
> , performance, userspace, or any other "static" governors.
>
> Note that these are given only as basic examples for governors and any
> devices with DEVFREQ may implement their own governors with the drivers
> and use them.
>
> Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>
> ---
> Changed from v5
> - Seperated governor files from devfreq.c
> - Allow simple ondemand to be tuned for each device
> ---
> drivers/devfreq/Kconfig | 36 ++++++++++++
> drivers/devfreq/Makefile | 4 +
> drivers/devfreq/governor_performance.c | 24 ++++++++
> drivers/devfreq/governor_powersave.c | 24 ++++++++
> drivers/devfreq/governor_simpleondemand.c | 88 +++++++++++++++++++++++++++++
> drivers/devfreq/governor_userspace.c | 27 +++++++++
> include/linux/devfreq.h | 41 +++++++++++++
> 7 files changed, 244 insertions(+), 0 deletions(-)
> create mode 100644 drivers/devfreq/governor_performance.c
> create mode 100644 drivers/devfreq/governor_powersave.c
> create mode 100644 drivers/devfreq/governor_simpleondemand.c
> create mode 100644 drivers/devfreq/governor_userspace.c
>
> diff --git a/drivers/devfreq/Kconfig b/drivers/devfreq/Kconfig
> index 1fb42de..643b055 100644
> --- a/drivers/devfreq/Kconfig
> +++ b/drivers/devfreq/Kconfig
> @@ -34,6 +34,42 @@ menuconfig PM_DEVFREQ
>
> if PM_DEVFREQ
>
> +comment "DEVFREQ Governors"
> +
> +config DEVFREQ_GOV_SIMPLE_ONDEMAND
> + bool "Simple Ondemand"
> + help
> + Chooses frequency based on the recent load on the device. Works
> + similar as ONDEMAND governor of CPUFREQ does. A device with
> + Simple-Ondemand should be able to provide busy/total counter
> + values that imply the usage rate. A device may provide tuned
> + values to the governor with data field at devfreq_add_device().
> +
> +config DEVFREQ_GOV_PERFORMANCE
> + bool "Performance"
> + help
> + Sets the frequency at the maximum available frequency.
> + This governor always returns UINT_MAX as frequency so that
> + the DEVFREQ framework returns the highest frequency available
> + at any time.
> +
> +config DEVFREQ_GOV_POWERSAVE
> + bool "Powersave"
> + help
> + Sets the frequency at the minimum available frequency.
> + This governor always returns 0 as frequency so that
> + the DEVFREQ framework returns the lowest frequency available
> + at any time.
> +
> +config DEVFREQ_GOV_USERSPACE
> + bool "Userspace"
> + help
> + Sets the frequency at the user specified one.
> + This governor returns the user configured frequency if there
> + has been an input to /sys/devices/.../power/devfreq_set_freq.
> + Otherwise, the governor does not change the frequnecy
> + given at the initialization.
> +
> comment "DEVFREQ Drivers"
>
> endif # PM_DEVFREQ
> diff --git a/drivers/devfreq/Makefile b/drivers/devfreq/Makefile
> index 168934a..4564a89 100644
> --- a/drivers/devfreq/Makefile
> +++ b/drivers/devfreq/Makefile
> @@ -1 +1,5 @@
> obj-$(CONFIG_PM_DEVFREQ) += devfreq.o
> +obj-$(CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND) += governor_simpleondemand.o
> +obj-$(CONFIG_DEVFREQ_GOV_PERFORMANCE) += governor_performance.o
> +obj-$(CONFIG_DEVFREQ_GOV_POWERSAVE) += governor_powersave.o
> +obj-$(CONFIG_DEVFREQ_GOV_USERSPACE) += governor_userspace.o
> diff --git a/drivers/devfreq/governor_performance.c b/drivers/devfreq/governor_performance.c
> new file mode 100644
> index 0000000..c47eff8
> --- /dev/null
> +++ b/drivers/devfreq/governor_performance.c
> @@ -0,0 +1,24 @@
> +/*
> + * linux/drivers/devfreq/governor_performance.c
> + *
> + * Copyright (C) 2011 Samsung Electronics
> + * MyungJoo Ham <myungjoo.ham@samsung.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/devfreq.h>
> +
> +static int devfreq_performance_func(struct devfreq *df,
> + unsigned long *freq)
> +{
> + *freq = UINT_MAX; /* devfreq_do will run "floor" */
> + return 0;
> +}
> +
> +struct devfreq_governor devfreq_performance = {
> + .name = "performance",
> + .get_target_freq = devfreq_performance_func,
> +};
> diff --git a/drivers/devfreq/governor_powersave.c b/drivers/devfreq/governor_powersave.c
> new file mode 100644
> index 0000000..4f128d8
> --- /dev/null
> +++ b/drivers/devfreq/governor_powersave.c
> @@ -0,0 +1,24 @@
> +/*
> + * linux/drivers/devfreq/governor_powersave.c
> + *
> + * Copyright (C) 2011 Samsung Electronics
> + * MyungJoo Ham <myungjoo.ham@samsung.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/devfreq.h>
> +
> +static int devfreq_powersave_func(struct devfreq *df,
> + unsigned long *freq)
> +{
> + *freq = 0; /* devfreq_do will run "ceiling" to 0 */
> + return 0;
> +}
> +
> +struct devfreq_governor devfreq_powersave = {
> + .name = "powersave",
> + .get_target_freq = devfreq_powersave_func,
> +};
> diff --git a/drivers/devfreq/governor_simpleondemand.c b/drivers/devfreq/governor_simpleondemand.c
> new file mode 100644
> index 0000000..18fe8be
> --- /dev/null
> +++ b/drivers/devfreq/governor_simpleondemand.c
> @@ -0,0 +1,88 @@
> +/*
> + * linux/drivers/devfreq/governor_simpleondemand.c
> + *
> + * Copyright (C) 2011 Samsung Electronics
> + * MyungJoo Ham <myungjoo.ham@samsung.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/errno.h>
> +#include <linux/devfreq.h>
> +#include <linux/math64.h>
> +
> +/* Default constants for DevFreq-Simple-Ondemand (DFSO) */
> +#define DFSO_UPTHRESHOLD (90)
> +#define DFSO_DOWNDIFFERENCTIAL (5)
> +static int devfreq_simple_ondemand_func(struct devfreq *df,
> + unsigned long *freq)
> +{
> + struct devfreq_dev_status stat;
> + int err = df->profile->get_dev_status(df->dev, &stat);
> + unsigned long long a, b;
> + unsigned int dfso_upthreshold = DFSO_UPTHRESHOLD;
> + unsigned int dfso_downdifferential = DFSO_DOWNDIFFERENCTIAL;
> + struct devfreq_simple_ondemand_data *data = df->data;
> +
> + if (err)
> + return err;
> +
> + if (data) {
> + if (data->upthreshold)
> + dfso_upthreshold = data->upthreshold;
> + if (data->downdifferential)
> + dfso_downdifferential = data->downdifferential;
> + }
> + if (dfso_upthreshold > 100 ||
> + dfso_upthreshold < dfso_downdifferential)
> + return -EINVAL;
> +
> + /* Assume MAX if it is going to be divided by zero */
> + if (stat.total_time == 0) {
> + *freq = UINT_MAX;
> + return 0;
> + }
> +
> + /* Prevent overflow */
> + if (stat.busy_time >= (1 << 24) || stat.total_time >= (1 << 24)) {
> + stat.busy_time >>= 7;
> + stat.total_time >>= 7;
> + }
> +
> + /* Set MAX if it's busy enough */
> + if (stat.busy_time * 100 >
> + stat.total_time * dfso_upthreshold) {
> + *freq = UINT_MAX;
> + return 0;
> + }
> +
> + /* Set MAX if we do not know the initial frequency */
> + if (stat.current_frequency == 0) {
> + *freq = UINT_MAX;
> + return 0;
> + }
> +
> + /* Keep the current frequency */
> + if (stat.busy_time * 100 >
> + stat.total_time * (dfso_upthreshold - dfso_downdifferential)) {
> + *freq = stat.current_frequency;
> + return 0;
> + }
> +
> + /* Set the desired frequency based on the load */
> + a = stat.busy_time;
> + a *= stat.current_frequency;
> + b = div_u64(a, stat.total_time);
> + b *= 100;
> + b = div_u64(b, (dfso_upthreshold - dfso_downdifferential / 2));
> + *freq = (unsigned long) b;
> +
> + return 0;
> +}
> +
> +struct devfreq_governor devfreq_simple_ondemand = {
> + .name = "simple_ondemand",
> + .get_target_freq = devfreq_simple_ondemand_func,
> +};
> diff --git a/drivers/devfreq/governor_userspace.c b/drivers/devfreq/governor_userspace.c
> new file mode 100644
> index 0000000..6af8b6d
> --- /dev/null
> +++ b/drivers/devfreq/governor_userspace.c
> @@ -0,0 +1,27 @@
> +/*
> + * linux/drivers/devfreq/governor_simpleondemand.c
> + *
> + * Copyright (C) 2011 Samsung Electronics
> + * MyungJoo Ham <myungjoo.ham@samsung.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/devfreq.h>
> +
> +static int devfreq_userspace_func(struct devfreq *df, unsigned long *freq)
> +{
> + if (df->user_set_freq == 0)
> + *freq = df->previous_freq; /* No user freq specified yet */
I think that user_set_freq == 0 is a valid request from a user. A
user might not know what the valid frequencies are for a device, but
knows that he/she wants the lowest one. Writing 0 to the sysfs value
should give the floor of the available frequencies.
> + else
> + *freq = df->user_set_freq;
> +
> + return 0;
> +}
> +
> +struct devfreq_governor devfreq_userspace = {
> + .name = "userspace",
> + .get_target_freq = devfreq_userspace_func,
> +};
The set_user_frequency attribute should be moved out of devfreq.c
(patch 4) and live here. There is no purpose to that entry except for
the userspace governor and it should not exist in sysfs unless this
governor is in use. To be clear, there should still be a read-only
show_frequency or something in devfreq.c.
This again touches on my long-standing complaints with this series'
design. set_user_frequency is an attribute that belongs to the
governor, not to the core. Such misplacement of attribute/entity
ownership is common in this series, but I won't go down that rabbit
hole again. In the mean time at least this one instance of the
problem for the userspace gov should be fixed up.
Regards,
Mike
> diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
> index 8252238..8f614fa 100644
> --- a/include/linux/devfreq.h
> +++ b/include/linux/devfreq.h
> @@ -13,6 +13,7 @@
> #ifndef __LINUX_DEVFREQ_H__
> #define __LINUX_DEVFREQ_H__
>
> +#include <linux/opp.h>
> #include <linux/notifier.h>
>
> #define DEVFREQ_NAME_LEN 16
> @@ -63,6 +64,8 @@ struct devfreq_governor {
> * "devfreq_monitor" executions to reevaluate
> * frequency/voltage of the device. Set by
> * profile's polling_ms interval.
> + * @user_set_freq User specified adequete frequency value (thru sysfs
> + * interface). Governors may and may not use this value.
> * @data Private data of the governor. The devfreq framework does not
> * touch this.
> *
> @@ -80,6 +83,7 @@ struct devfreq {
> unsigned long previous_freq;
> unsigned int next_polling;
>
> + unsigned long user_set_freq; /* governors may ignore this. */
> void *data; /* private data for governors */
> };
>
> @@ -89,6 +93,37 @@ extern int devfreq_add_device(struct device *dev,
> struct devfreq_governor *governor,
> void *data);
> extern int devfreq_remove_device(struct device *dev);
> +
> +#ifdef CONFIG_DEVFREQ_GOV_POWERSAVE
> +extern struct devfreq_governor devfreq_powersave;
> +#endif
> +#ifdef CONFIG_DEVFREQ_GOV_PERFORMANCE
> +extern struct devfreq_governor devfreq_performance;
> +#endif
> +#ifdef CONFIG_DEVFREQ_GOV_USERSPACE
> +extern struct devfreq_governor devfreq_userspace;
> +#endif
> +#ifdef CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND
> +extern struct devfreq_governor devfreq_simple_ondemand;
> +/**
> + * struct devfreq_simple_ondemand_data - void *data fed to struct devfreq
> + * and devfreq_add_device
> + * @ upthreshold If the load is over this value, the frequency jumps.
> + * Specify 0 to use the default. Valid value = 0 to 100.
> + * @ downdifferential If the load is under upthreshold - downdifferential,
> + * the governor may consider slowing the frequency down.
> + * Specify 0 to use the default. Valid value = 0 to 100.
> + * downdifferential < upthreshold must hold.
> + *
> + * If the fed devfreq_simple_ondemand_data pointer is NULL to the governor,
> + * the governor uses the default values.
> + */
> +struct devfreq_simple_ondemand_data {
> + unsigned int upthreshold;
> + unsigned int downdifferential;
> +};
> +#endif
> +
> #else /* !CONFIG_PM_DEVFREQ */
> static int devfreq_add_device(struct device *dev,
> struct devfreq_dev_profile *profile,
> @@ -102,6 +137,12 @@ static int devfreq_remove_device(struct device *dev)
> {
> return 0;
> }
> +
> +#define devfreq_powersave NULL
> +#define devfreq_performance NULL
> +#define devfreq_userspace NULL
> +#define devfreq_simple_ondemand NULL
> +
> #endif /* CONFIG_PM_DEVFREQ */
>
> #endif /* __LINUX_DEVFREQ_H__ */
> --
> 1.7.4.1
>
>
^ permalink raw reply
* Re: [POWER DOMAIN suspend callbacks] Observation.
From: Kevin Hilman @ 2011-08-23 17:06 UTC (permalink / raw)
To: Santosh; +Cc: Govindraj R, Linux PM mailing list, linux-omap
In-Reply-To: <4E53B6FC.8030507@ti.com>
Hi Santosh,
Santosh <santosh.shilimkar@ti.com> writes:
> Rafael, Kevin,
>
> On latest kernel( V3.1-rc1+), the subsystem(driver) suspend
> callbacks are not getting called because power domain callbcaks
> are populated.
>
> And as per commit 4d27e9dc{PM: Make power domain callbacks take
> precedence over subsystem ones}, it's expected bahavior.
Correct.
> Who is suppose to call the driver suspend callback?
If populated, the PM domain callbacks should call the driver callbacks.
If there are no PM domain callbacks, then the subsystem (in this case,
the platform_bus) should be calling the driver callbacks.
> Some drivers/subsystem would have state machine which needs to
> be suspended.
>
> Is the power domain suspend callback, suppose to take care of
> it ? If yes, then that seems to be missing for OMAP.
Yup, there's a bug. They're not missing, just misplaced. ;)
When adding the noirq callbacks to ensure devices are idled late in
suspend by omap_device, I the patch commited mistakenly uses
SET_SYSTEM_SLEEP_PM_OPS(), which sets the "normal" suspend/resume
handlers and not the noirq handlers.
Can you try the patch below? I only briefly tested it on omap3/n900 so
far.
This populates most of the PM domain methods with the same ones used by
the subystem (platform_bus) and only overrides the noirq methods with
custom versions. This patch should make all the driver's suspend/resume
methods be called as expected.
After a bit more sanitiy testing, I'll post a real patch for the -rc
series.
Kevin
diff --git a/arch/arm/plat-omap/omap_device.c b/arch/arm/plat-omap/omap_device.c
index d8f2299..7a0d248 100644
--- a/arch/arm/plat-omap/omap_device.c
+++ b/arch/arm/plat-omap/omap_device.c
@@ -626,7 +626,8 @@ static struct dev_pm_domain omap_device_pm_domain = {
SET_RUNTIME_PM_OPS(_od_runtime_suspend, _od_runtime_resume,
_od_runtime_idle)
USE_PLATFORM_PM_SLEEP_OPS
- SET_SYSTEM_SLEEP_PM_OPS(_od_suspend_noirq, _od_resume_noirq)
+ .suspend_noirq = _od_suspend_noirq,
+ .resume_noirq = _od_resume_noirq,
}
};
^ permalink raw reply related
* Re: [POWER DOMAIN suspend callbacks] Observation.
From: Govindraj @ 2011-08-23 15:01 UTC (permalink / raw)
To: Santosh; +Cc: Basak, Partha, Govindraj R, Linux PM mailing list, linux-omap
In-Reply-To: <4E53B6FC.8030507@ti.com>
On Tue, Aug 23, 2011 at 7:49 PM, Santosh <santosh.shilimkar@ti.com> wrote:
> Rafael, Kevin,
>
> On latest kernel( V3.1-rc1+), the subsystem(driver) suspend
> callbacks are not getting called because power domain callbcaks
> are populated.
>
> And as per commit 4d27e9dc{PM: Make power domain callbacks take
> precedence over subsystem ones}, it's expected bahavior.
>
> Who is suppose to call the driver suspend callback?
> Some drivers/subsystem would have state machine which needs to
> be suspended.
>
> Is the power domain suspend callback, suppose to take care of
> it ? If yes, then that seems to be missing for OMAP.
>
Looks like from
_od_suspend_noirq
-> pm_generic_suspend_noirq
-> __pm_generic_call(dev, PM_EVENT_SUSPEND, true);
--> [..]
case PM_EVENT_SUSPEND:
callback = noirq ? pm->suspend_noirq : pm->suspend;
shouldn't we call pm_generic_suspend from _od_suspend_noirq
rather than calling pm_generic_suspend_noirq?
Since _noirq seems to call .suspend_noirq hook which
most of the them have not populated.
pm_generic_suspend* calls may fail to call legacy platform .suspend hooks?
Looks like we need below change for driver .suspend hook to be called.
diff --git a/arch/arm/plat-omap/omap_device.c b/arch/arm/plat-omap/omap_device.c
index b6b4097..c2a18ae 100644
--- a/arch/arm/plat-omap/omap_device.c
+++ b/arch/arm/plat-omap/omap_device.c
@@ -584,9 +584,9 @@ static int _od_suspend_noirq(struct device *dev)
int ret;
if (od->flags & OMAP_DEVICE_NO_IDLE_ON_SUSPEND)
- return pm_generic_suspend_noirq(dev);
+ return pm_generic_suspend(dev);
- ret = pm_generic_suspend_noirq(dev);
+ ret = pm_generic_suspend(dev);
if (!ret && !pm_runtime_status_suspended(dev)) {
if (pm_generic_runtime_suspend(dev) == 0) {
@@ -604,7 +604,7 @@ static int _od_resume_noirq(struct device *dev)
struct omap_device *od = to_omap_device(pdev);
if (od->flags & OMAP_DEVICE_NO_IDLE_ON_SUSPEND)
- return pm_generic_resume_noirq(dev);
+ return pm_generic_resume(dev);
if ((od->flags & OMAP_DEVICE_SUSPENDED) &&
!pm_runtime_status_suspended(dev)) {
@@ -613,7 +613,7 @@ static int _od_resume_noirq(struct device *dev)
pm_generic_runtime_resume(dev);
}
- return pm_generic_resume_noirq(dev);
+ return pm_generic_resume(dev);
}
#endif
--
Thanks,
Govindraj.R
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox